## 9. Additional Exercises

##### Exercise 1: Integrating Multiple Data Sources
- **Objective**: Combine data from an Excel file, a SQL database, and a JSON API into a single DataFrame.
- **Tasks**:
  - Use the Excel Connector to load data from an Excel file.
  - Fetch data from a SQL database using the SQL Connector.
  - Retrieve data from a JSON API using the API Connector.
  - Integrate all these datasets using the Data Integrator.
- **Challenge**: Ensure that the integrated dataset is properly aligned and handle any inconsistencies in data formats or missing values.

#### Step 1: Use the Excel Connector to load data from an Excel file.

In [None]:
from dataanalysistoolkit.data_sources import ExcelConnector, SQLConnector, APIConnector
from dataanalysistoolkit.integrators import DataIntegrator

# Load data from the provided Excel file
excel_connector = ExcelConnector('/mnt/data/example_data.xlsx')
df_excel = excel_connector.load_data(sheet_name='Sheet1')
print("Data from Excel:")
print(df_excel.head())


#### Step 2: Fetch data from a SQL database using the SQL Connector.

This step assumes a PostgreSQL database with a table named `customer_data`. Adjust the connection URI and the table name in the code below to match your setup.

In [None]:
# Replace with your actual database URI
sql_connector = SQLConnector('postgresql://username:password@localhost:5432/mydatabase')

# Executing a SQL query to fetch data
query = "SELECT * FROM customer_data LIMIT 5"
df_sql = sql_connector.query_data(query)
print("Data from SQL Database:")
print(df_sql.head())
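
If no PostgreSQL instance is at hand, the same query-to-DataFrame pattern can be tried locally. The sketch below does not use `SQLConnector`; it relies on the standard-library `sqlite3` module and `pandas.read_sql_query`, with an in-memory database and made-up rows standing in for `customer_data`. The column names are assumptions chosen for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL instance (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_data (customer_id INTEGER, name TEXT, signup_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_data VALUES (?, ?, ?)",
    [(1, "Ada", "2023-01-15"), (2, "Grace", "2023-02-20"), (3, "Linus", "2023-03-05")],
)
conn.commit()

# Parameterized query: the limit is passed separately instead of being
# formatted into the SQL string
df_sql_demo = pd.read_sql_query(
    "SELECT * FROM customer_data LIMIT ?", conn, params=(2,)
)
conn.close()

print(df_sql_demo)
```

Passing values through `params` rather than string formatting is a habit worth keeping once the query depends on user input.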


#### Step 3: Retrieve data from a JSON API using the API Connector.

This step assumes you have access to a JSON API endpoint. Replace the base URL, endpoint name, and authentication credentials with your own.


In [None]:
import pandas as pd  # needed for pd.DataFrame below

# Replace with the actual API base URL and authentication credentials
api_connector = APIConnector('https://api.example.com', auth=('username', 'password'))

# Fetching data from a specific endpoint
endpoint = 'data_endpoint'
response = api_connector.get(endpoint)

# Assuming the response body is a JSON list of records, convert it to a DataFrame
df_api = pd.DataFrame(response.json())
print("Data from API:")
print(df_api.head())
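
Real API responses are often nested, in which case passing the parsed JSON straight to `pd.DataFrame` leaves dictionaries inside cells. A possible way to flatten them is `pandas.json_normalize`, sketched here on a hypothetical payload (the `results` key and the field names are invented for illustration and do not come from any particular API).

```python
import pandas as pd

# Hypothetical payload shaped like a typical nested JSON API response
payload = {
    "results": [
        {"id": 1, "customer": {"name": "Ada", "country": "UK"}, "total": 120.5},
        {"id": 2, "customer": {"name": "Grace", "country": "US"}, "total": 80.0},
    ]
}

# json_normalize flattens nested objects into dotted column names,
# e.g. customer.name and customer.country
df_api_demo = pd.json_normalize(payload["results"])

print(df_api_demo.columns.tolist())
print(df_api_demo)
```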


#### Step 4: Integrate all these datasets using the Data Integrator.

In [None]:
# Initialize the Data Integrator
integrator = DataIntegrator()

# Adding dataframes to the integrator
integrator.add_data(df_excel)
integrator.add_data(df_sql)
integrator.add_data(df_api)

# Assuming we want to concatenate the dataframes
combined_df = integrator.concatenate_data()
print("Combined Data:")
print(combined_df.head())

# Alternatively, if you need to merge the dataframes on a common key:
# combined_df = integrator.merge_data(on='common_key')


#### Challenge

Ensure that the integrated dataset is properly aligned and handle any inconsistencies in data formats or missing values.


In [None]:
from dataanalysistoolkit.preprocessor import DataFormatter

# Initialize the Data Formatter
formatter = DataFormatter(combined_df)

# Standardize date formats
formatter.standardize_dates('date_column', date_format='%Y-%m-%d')

# Normalize numeric columns
numeric_columns = ['numeric_column1', 'numeric_column2']
formatter.normalize_numeric(numeric_columns)

# Categorize columns
category_columns = ['category_column1', 'category_column2']
formatter.categorize_columns(category_columns)

# Fill missing values
formatter.fill_missing_values('column_with_missing_data', fill_value=0)

print("Cleaned and Formatted Combined Data:")
print(combined_df.head())
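
Before and after cleaning, a quick per-column summary makes misaligned types and gaps visible. The helper below is a minimal sketch in plain pandas; `summarize_frame` is a hypothetical name and is not part of the toolkit, and the demo frame reuses the placeholder column names from the cell above.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value statistics."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
        }
    )


# Made-up data using the placeholder column names from this exercise
demo = pd.DataFrame(
    {
        "date_column": ["2023-01-15", None, "2023-03-05"],
        "numeric_column1": [1.0, 2.5, None],
    }
)

report = summarize_frame(demo)
print(report)
```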


This completes Exercise 1. You have integrated data from an Excel file, a SQL database, and a JSON API into a single DataFrame, then cleaned and formatted it for further analysis.

##### Exercise 2: Data Cleaning and Transformation
- **Objective**: Clean and transform the integrated dataset from Exercise 1.
- **Tasks**:
  - Identify and fill missing values in the dataset.
  - Standardize the format of any date columns.
  - Normalize numeric columns and convert categorical columns to a standard format.
  - Create a new column based on a custom transformation logic.
- **Challenge**: Try to automate as much of the data cleaning process as possible, considering future data imports.
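
One way to approach the automation challenge is to wrap the cleaning steps in a single function that can be re-run on every new import. The sketch below uses plain pandas rather than `DataFormatter`; the function name `clean_frame` and the columns `order_date`, `amount`, and `region` are assumptions made for illustration.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Exercise 2 steps to a copy of df (column names are hypothetical)."""
    out = df.copy()
    # Standardize dates: parse strings, turning unparseable entries into NaT
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    # Fill missing numeric values with the column median
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # Min-max normalize to the [0, 1] range (0.0 if the column is constant)
    span = out["amount"].max() - out["amount"].min()
    out["amount_scaled"] = (out["amount"] - out["amount"].min()) / span if span else 0.0
    # Standardize categorical text, then convert to the category dtype
    out["region"] = out["region"].str.strip().str.lower().astype("category")
    # Custom transformation: derive a year-month label from the parsed date
    out["order_month"] = out["order_date"].dt.to_period("M").astype(str)
    return out


raw = pd.DataFrame(
    {
        "order_date": ["2023-01-15", "2023-02-20", "not a date"],
        "amount": [10.0, None, 30.0],
        "region": [" North", "south ", "NORTH"],
    }
)

cleaned = clean_frame(raw)
print(cleaned)
```

Working on a copy keeps the raw import untouched, so the function can be applied again after its rules change.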

##### Exercise 3: Handling Large and Complex Datasets
- **Objective**: Work with a larger and more complex dataset of your choice (e.g., a dataset from Kaggle or a public API).
- **Tasks**:
  - Import the dataset using the appropriate connector(s).
  - Explore different integration techniques to handle large datasets efficiently.
  - Perform advanced data formatting and transformation tasks tailored to the dataset's specifics.
- **Challenge**: Optimize the data import process for speed and memory efficiency, especially if dealing with very large datasets.
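
For files too large to load at once, one common technique is to read in chunks and aggregate as you go, while declaring compact dtypes up front. The sketch below shows this with `pandas.read_csv` and its `chunksize` parameter; the in-memory CSV and its `region` and `amount` columns are stand-ins for a real large file.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file on disk (assumption)
csv_text = "region,amount\n" + "\n".join(
    f"{'north' if i % 2 == 0 else 'south'},{i}" for i in range(10)
)

totals = {}
# chunksize makes read_csv yield DataFrames of at most 4 rows each;
# explicit dtypes avoid the wider defaults pandas would otherwise choose
for chunk in pd.read_csv(
    io.StringIO(csv_text),
    chunksize=4,
    dtype={"region": "category", "amount": "int32"},
):
    # Aggregate each chunk, then fold the partial sums into the running totals
    partial = chunk.groupby("region", observed=True)["amount"].sum()
    for region, subtotal in partial.items():
        totals[region] = totals.get(region, 0) + int(subtotal)

print(totals)
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file.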

##### Exercise 4: Customizing the Data Import Process
- **Objective**: Extend or customize the DataAnalysisToolkit to suit a unique data import requirement.
- **Tasks**:
  - Identify a specific need or limitation in the current data import process.
  - Modify an existing connector or create a new one to address this need.
  - Test your custom solution with relevant data sources.
- **Challenge**: Ensure that your custom solution is robust, handles errors gracefully, and integrates well with the rest of the toolkit.
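
As a starting point for a custom connector, the sketch below defines a small standalone class for CSV sources. It is hypothetical: `CSVTextConnector` is not part of the toolkit and does not inherit from any toolkit base class, and only the `load_data` method name is borrowed from the `ExcelConnector` usage in Exercise 1. The error handling illustrates one way to fail with clear messages.

```python
import io

import pandas as pd


class CSVTextConnector:
    """Hypothetical connector for CSV sources; not part of the toolkit."""

    def __init__(self, source, required_columns=()):
        self.source = source  # file path or file-like object
        self.required_columns = tuple(required_columns)

    def load_data(self, **read_csv_kwargs) -> pd.DataFrame:
        """Read the source and verify that the expected columns are present."""
        try:
            df = pd.read_csv(self.source, **read_csv_kwargs)
        except FileNotFoundError as exc:
            raise ValueError(f"Source not found: {self.source!r}") from exc
        except pd.errors.EmptyDataError as exc:
            raise ValueError("Source contains no data") from exc
        missing = [c for c in self.required_columns if c not in df.columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        return df


connector = CSVTextConnector(
    io.StringIO("id,value\n1,10\n2,20\n"), required_columns=["id", "value"]
)
df_custom = connector.load_data()
print(df_custom)
```

Validating required columns at load time surfaces schema problems at the connector, before they reach the integration step.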

##### Exercise 5: Real-world Application
- **Objective**: Apply the DataAnalysisToolkit to a real-world data analysis project.
- **Tasks**:
  - Identify a real-world problem that can be addressed through data analysis.
  - Collect and import data from relevant sources using the toolkit.
  - Clean, transform, and integrate the data in preparation for analysis.
- **Challenge**: Provide insights, visualizations, or a predictive model based on the integrated dataset.


