## Handling Missing Data with Imputation
Imputation is the process of replacing missing data with substituted values. It is normally done in the early stages of data preprocessing.
### There are four options for handling missing data:
1. Remove rows with missing data. This is the simplest approach but may lead to loss of valuable information if many rows are removed.
2. Remove columns with missing data. This is more appropriate when a column is missing most of its values or is not essential to the analysis. However, it discards any useful information the column holds.
3. Fill missing data with a specific value (e.g., mean, median, mode). This is the most common approach. Mean and median apply to numerical data, while mode also works for categorical data. The filled values are artificial, so they can distort the column's distribution.
4. Use advanced imputation techniques (e.g., KNN, MICE).
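
The first three options can be sketched with plain pandas. This is a minimal illustration on a small made-up DataFrame (the column names and values are assumptions chosen for the example, not a recommendation for real data):

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in every column (values are made up)
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 22, np.nan, 28],
    "Salary": [50000, 60000, np.nan, 52000, 58000, np.nan],
    "City": ["New York", "Los Angeles", np.nan, "Chicago", "Houston", "Phoenix"],
})

# Option 1: drop every row that contains at least one missing value
rows_dropped = df.dropna(axis=0)

# Option 2: drop every column that contains at least one missing value
cols_dropped = df.dropna(axis=1)

# Option 3: fill each column with a specific value
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean())
filled["City"] = filled["City"].fillna("Unknown")

print("Rows left after dropping rows:", rows_dropped.shape[0])
print("Columns left after dropping columns:", cols_dropped.shape[1])
print("Missing values left after filling:", int(filled.isna().sum().sum()))
```

On this data, dropping rows keeps only two of six rows, and dropping columns removes everything because each column has at least one gap. That is the information loss described in options 1 and 2.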

## Common Imputation Techniques:
1. Mean Imputation
2. Median Imputation
3. Mode Imputation
4. K-Nearest Neighbors (KNN) Imputation
5. Multivariate Imputation by Chained Equations (MICE)
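6. Iterative Imputation (the general model-based approach that MICE belongs to; scikit-learn's `IterativeImputer` is inspired by MICE)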
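
The two model-based techniques in the list can be tried with scikit-learn's `KNNImputer` and `IterativeImputer`. The sketch below is an assumption-laden illustration: the numeric array is made up, `n_neighbors=2`, `max_iter=10`, and `random_state=0` are arbitrary choices, and `IterativeImputer` is still marked experimental in scikit-learn, so it needs the explicit `enable_iterative_imputer` import.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Illustrative numeric data: columns are [Age, Salary] (values are made up)
X = np.array([
    [25.0, 50000.0],
    [np.nan, 60000.0],
    [30.0, np.nan],
    [22.0, 52000.0],
    [np.nan, 58000.0],
    [28.0, np.nan],
])

# KNN imputation: fill each gap with the average of that feature
# over the nearest rows that have a value for it
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

# Iterative (MICE-style) imputation: model each feature with missing
# values as a function of the other features, in round-robin fashion
iterative_imputer = IterativeImputer(max_iter=10, random_state=0)
X_iter = iterative_imputer.fit_transform(X)

print("KNN imputation:\n", X_knn)
print("Iterative imputation:\n", X_iter)
```

Both imputers expect numeric input, so categorical columns would need to be encoded first. With real data, scaling the features before `KNNImputer` is usually worth considering, because distances are otherwise dominated by large-valued columns such as salary.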

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Create a sample DataFrame with missing values
data = {
    'Age': [25, np.nan, 30, 22, np.nan, 28],
    'Salary': [50000, 60000, np.nan, 52000, 58000, np.nan],
    'City': ['New York', 'Los Angeles', np.nan, 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
print("Original DataFrame with Missing Values:")
print(df)

# Define the imputation strategies for numerical and categorical columns
numerical_imputer = SimpleImputer(strategy='mean')

# 'most_frequent' is fine for such a small dataset. Here every city appears
# once, so the tie is broken by taking the smallest value ('Chicago').
# Alternatives: SimpleImputer(strategy='constant', fill_value='unknown'),
# or KNNImputer after encoding the categories as numbers.
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Create a ColumnTransformer to apply different imputers to different columns
preprocessor = ColumnTransformer([
    ('numerical_imputer', numerical_imputer, ['Age', 'Salary']),
    ('categorical_imputer', categorical_imputer, ['City'])
])

# Create a pipeline whose only step imputes missing values
imputation_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit and transform the DataFrame (the result is a NumPy array)
df_imputed = imputation_pipeline.fit_transform(df)

# Optional: convert the result back to a DataFrame.
# Columns come out in the order listed in the ColumnTransformer.
df_imputed = pd.DataFrame(df_imputed, columns=['Age', 'Salary', 'City'])
print("\nDataFrame after Imputation:")
print(df_imputed)
```

```
Original DataFrame with Missing Values:
    Age   Salary         City
0  25.0  50000.0     New York
1   NaN  60000.0  Los Angeles
2  30.0      NaN          NaN
3  22.0  52000.0      Chicago
4   NaN  58000.0      Houston
5  28.0      NaN      Phoenix

DataFrame after Imputation:
     Age   Salary         City
0   25.0  50000.0     New York
1  26.25  60000.0  Los Angeles
2   30.0  55000.0      Chicago
3   22.0  52000.0      Chicago
4  26.25  58000.0      Houston
5   28.0  55000.0      Phoenix
```
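
In the output above, the missing city was filled with 'Chicago' even though no city occurs more than once. A hedged alternative is to mark the gap explicitly with a constant placeholder. The sketch below assumes the label "Unknown" is acceptable for the downstream analysis and uses a small object array in place of the `City` column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative single categorical column with one missing entry
cities = np.array(
    ["New York", "Los Angeles", np.nan, "Chicago", "Houston", "Phoenix"],
    dtype=object,
).reshape(-1, 1)

# strategy='constant' replaces every missing entry with fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
cities_filled = constant_imputer.fit_transform(cities)

print(cities_filled[:, 0].tolist())
```

This keeps the fact that the value was missing visible in the data instead of attributing the row to an arbitrary existing category.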
