### Exercise Case Study Notebook: Data Cleaning and Preparation

1. Problem and Objective:
   - Introduce a dataset with various data quality issues
   - Goal: Clean and prepare the data for analysis and modeling

2. Data Loading:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df = pd.read_csv("exercise_dataset.csv")

# Display the first few rows and basic information
print(df.head())
df.info()  # info() prints its summary directly and returns None, so no print() is needed

3. Data Cleaning and Preparation Tasks:

a. Handle Inconsistencies:
   - Question: Identify inconsistencies in the 'category' column. How would you standardize these values?
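
One possible starting point for task (a) is sketched below. It uses a small hypothetical frame named `toy` rather than the real data, and the variant spellings in `variant_map` are assumptions made for illustration; the actual values in 'category' may look different.

```python
import pandas as pd

# Hypothetical toy data; the real 'category' values may differ.
toy = pd.DataFrame({"category": ["A", "a ", " b", "B", "Cat-C", "cat c"]})

# Inspect the raw distinct values to spot inconsistent spellings.
print(toy["category"].value_counts(dropna=False))

# Normalize whitespace and case first.
cleaned = toy["category"].str.strip().str.lower()

# Then map the remaining known variants to a single label (assumed variants).
variant_map = {"cat-c": "c", "cat c": "c"}
toy["category_clean"] = cleaned.replace(variant_map)

print(toy["category_clean"].value_counts())
```

Keeping the cleaned values in a new column makes it easy to compare before and after while you are still deciding on the mapping.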

b. Handle Missing Values:
   - Task: Visualize missing values in the dataset
   - Question: Which imputation method would be most appropriate for the 'age' column? Implement your chosen method.
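
For task (b), a minimal sketch under stated assumptions: the frame `toy` and its values are hypothetical, and median imputation is only one candidate (it is less sensitive to extreme ages than the mean). Whether it suits the real 'age' column depends on how the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in 'age'.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan, 90],
    "income": [30000, 42000, np.nan, 38000, 51000, 47000],
})

# Share of missing values per column; a quick numeric view before plotting.
missing_share = toy.isna().mean()
print(missing_share)

# Optional visual check (needs matplotlib):
# missing_share.plot(kind="bar", title="Share of missing values")

# Keep a flag so the information "this value was imputed" is not lost.
toy["age_was_missing"] = toy["age"].isna()

# Median imputation: fill gaps with the median of the observed ages.
age_median = toy["age"].median()
toy["age_imputed"] = toy["age"].fillna(age_median)
```

If the data is later split into training and test sets, the median would normally be computed on the training portion only and reused for the test portion.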

c. Handle Categorical Values:
   - Task: Identify categorical columns in the dataset
   - Question: Choose an appropriate encoding method for the 'education' column. Justify your choice and implement it.
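
For task (c), the sketch below shows ordinal encoding, which may be a reasonable choice if 'education' has a natural order. The level names in `levels` and the frame `toy` are assumptions for illustration; check the real column's values before reusing them. If the levels had no meaningful order, one-hot encoding (for example with `pd.get_dummies`) would be the usual alternative.

```python
import pandas as pd

# Hypothetical toy data; the real 'education' levels may differ.
toy = pd.DataFrame({
    "education": ["Bachelor", "PhD", "High School", "Master", "Bachelor"],
    "age": [30, 45, 19, 38, 27],
})

# Treat every non-numeric column as a candidate categorical column.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(cat_cols)

# Assumed ordering from lowest to highest level.
levels = ["High School", "Bachelor", "Master", "PhD"]
edu = pd.Categorical(toy["education"], categories=levels, ordered=True)

# Integer codes follow the order in `levels`; values not listed there get -1.
toy["education_code"] = edu.codes
```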

d. Handle Duplicates:
   - Task: Check for duplicate entries in the dataset
   - Question: How would you handle potential duplicates? Implement your approach.
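
For task (d), a minimal sketch on a hypothetical frame `toy`. It treats only fully identical rows as duplicates and keeps the first occurrence; whether that is the right rule for the real data (for example, duplicates on an identifier column only) is a judgment call.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [30, 41, 41, 25],
})

# Count rows that repeat an earlier row exactly.
n_dupes = toy.duplicated().sum()
print(f"Duplicate rows: {n_dupes}")

# Drop the repeats, keeping the first occurrence, and rebuild the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
```

To compare rows on selected columns only, both `duplicated` and `drop_duplicates` accept a `subset` argument.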

e. Handle Outliers:
   - Task: Identify potential outliers in the 'income' column
   - Question: Propose and implement a method to deal with these outliers.
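
For task (e), one common approach is the 1.5 × IQR rule, sketched here on hypothetical values. The frame `toy`, the factor 1.5, and the decision to clip rather than remove outliers are all assumptions; other choices (dropping rows, a log transform) may fit the real 'income' column better.

```python
import pandas as pd

# Hypothetical toy data with one extreme income.
toy = pd.DataFrame({
    "income": [30000, 32000, 35000, 38000, 40000, 42000, 45000, 500000],
})

# Interquartile range and the conventional 1.5 * IQR fences.
q1 = toy["income"].quantile(0.25)
q3 = toy["income"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences.
is_outlier = (toy["income"] < lower) | (toy["income"] > upper)
print(toy[is_outlier])

# Clip (winsorize) instead of dropping, so no rows are lost.
toy["income_clipped"] = toy["income"].clip(lower=lower, upper=upper)
```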

f. Feature Engineering:
   - Task: Create a new feature 'age_group' by binning the 'age' column
   - Question: Suggest and implement another relevant feature based on the existing data.
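
For task (f), a sketch of binning with `pd.cut`. The bin edges, labels, and the frame `toy` are hypothetical choices for illustration. The second feature, a log-transformed income, is only one possible suggestion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "age": [15, 22, 37, 58, 71],
    "income": [0, 28000, 52000, 61000, 33000],
})

# Assumed bin edges; right=False makes each bin include its left edge,
# so the intervals are [0, 18), [18, 35), [35, 60), [60, inf).
bins = [0, 18, 35, 60, np.inf]
labels = ["<18", "18-34", "35-59", "60+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# Another possible feature: log(1 + income) to compress a skewed scale.
toy["income_log"] = np.log1p(toy["income"])
```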

g. Scaling and Normalization:
   - Question: Which numerical features would benefit from scaling? Apply an appropriate scaling method to these features.
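
For task (g), the sketch below standardizes two columns to zero mean and unit variance using pandas only. The frame `toy` and the choice of columns are assumptions. Min-max scaling or a robust scaler could be more appropriate depending on the distributions and on the model used afterwards.

```python
import pandas as pd

# Hypothetical toy data on very different scales.
toy = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "income": [28000, 52000, 61000, 45000, 39000],
})

num_cols = ["age", "income"]  # assumed columns to scale

# Z-score standardization: subtract the mean, divide by the population std.
means = toy[num_cols].mean()
stds = toy[num_cols].std(ddof=0)
scaled = (toy[num_cols] - means) / stds

print(scaled.round(3))
```

When a separate test set exists, `means` and `stds` would normally be computed on the training data and then reused to transform the test data.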

h. Feature Selection:
   - Task: Implement a simple correlation-based feature selection method
   - Question: Based on the results, which features would you select for further analysis?
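
For task (h), a simple filter based on absolute Pearson correlation with the target can be sketched as follows. The frame `toy`, the column names, and the 0.5 threshold are all hypothetical. Note that Pearson correlation only captures linear relationships.

```python
import pandas as pd

# Hypothetical toy data: 'age' tracks the target closely, 'noise' does not.
toy = pd.DataFrame({
    "age": [20, 25, 30, 35, 40, 45],
    "noise": [3, 1, 4, 1, 5, 2],
    "target": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],
})

# Absolute correlation of each feature with the target, strongest first.
corr_with_target = (
    toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(corr_with_target)

# Keep features whose absolute correlation meets an assumed threshold.
threshold = 0.5
selected = corr_with_target[corr_with_target >= threshold].index.tolist()
print(selected)
```

A fuller version might also inspect correlations between the features themselves and drop one of each highly correlated pair.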

4. Submission:

In [None]:

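# Load the test set, using the 'id' column as the index
X_test = pd.read_csv("module5_exercise_test.csv", sep=",", index_col='id')

# Build the submission frame with one row per test id
submission = pd.DataFrame({
    'id': X_test.index,
    'target': 0  # Placeholder: replace with your predictions
})

# Write the submission file without the DataFrame's own index
submission.to_csv('submission.csv', index=False, sep=',')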

5. Final Questions:
   - Summarize the key data cleaning and preparation steps you performed.
   - How might these preprocessing steps impact subsequent analysis or modeling?
   - What additional data quality checks or preprocessing steps would you recommend?