# Handling Missing Values in Machine Learning

## Introduction

Missing values are one of the most common data quality issues in real-world datasets. They occur when no data value is stored for a particular variable in an observation. Handling them is a critical step in the data preprocessing pipeline because many machine learning implementations, including most scikit-learn estimators, raise an error when the input contains missing values.

### Why Do Missing Values Occur?

1. **Data Entry Errors**: Human mistakes during manual data entry
2. **System Failures**: Technical glitches during data collection
3. **Non-Response**: Survey participants skipping questions
4. **Data Merging Issues**: Information lost when combining datasets
5. **Sensor Malfunctions**: Equipment failures in IoT or automated systems
6. **Privacy Concerns**: Sensitive information intentionally removed
7. **Data Not Applicable**: Some fields may not apply to all records

### Impact of Missing Values

- **Biased Results**: Can lead to inaccurate conclusions
- **Reduced Statistical Power**: Smaller effective sample size
- **Algorithm Failure**: Many ML models don't accept missing values
- **Invalid Insights**: Patterns may be hidden or distorted

### Types of Missing Data

1. **MCAR (Missing Completely At Random)**: The probability that a value is missing is unrelated to any data, observed or unobserved
2. **MAR (Missing At Random)**: Missingness depends only on other observed variables, not on the missing value itself
3. **MNAR (Missing Not At Random)**: Missingness depends on the missing value itself (for example, high earners declining to report their income)

Understanding the type of missingness helps choose the appropriate handling strategy.
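
To make the three mechanisms concrete, the sketch below simulates each of them on synthetic data. The `age` and `income` variables, the drop probabilities, and the sample size are illustrative assumptions and are unrelated to the sample dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic, illustrative data: income rises with age plus noise
age = rng.normal(40, 10, n)
income = 1_000 * age + rng.normal(0, 5_000, n)

# MCAR: every income value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3
# MAR: missingness depends on the observed age, not on income itself
mar_missing = rng.random(n) < np.where(age > 40, 0.5, 0.1)
# MNAR: missingness depends on the (unobserved) income value itself
mnar_missing = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

# Mean income computed only from the values that remain observed
observed_means = pd.Series({
    "Full data": income.mean(),
    "MCAR": income[~mcar_missing].mean(),
    "MAR": income[~mar_missing].mean(),
    "MNAR": income[~mnar_missing].mean(),
})
print(observed_means.round(0))
```

In this simulation the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward because high incomes are under-represented among the observed values. Under MAR the bias can in principle be corrected using the observed `age`; under MNAR the observed data alone is not enough.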

---

Let's explore various techniques to handle missing values with practical examples!

## Step 1: Import Libraries and Create Sample Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer, KNNImputer
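# Required side-effect import: IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401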
from sklearn.impute import IterativeImputer
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 100)

# Create a sample dataset with missing values
np.random.seed(42)
data = {
    'Age': [25, 30, np.nan, 35, 40, np.nan, 28, 33, 45, 50, np.nan, 29, 38, np.nan, 42],
    'Salary': [50000, 60000, 55000, np.nan, 80000, 70000, np.nan, 65000, 90000, np.nan, 58000, 62000, np.nan, 75000, 85000],
    'Experience': [2, 5, 3, np.nan, 10, 8, 4, 6, np.nan, 15, 3, 5, 9, np.nan, 12],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'IT', 'HR', 'IT', np.nan, 'Finance', 'IT', 'HR', np.nan, 'Finance', 'IT', 'HR'],
    'Performance_Score': [3.5, 4.2, 3.8, np.nan, 4.5, 4.0, np.nan, 4.1, 4.7, 4.8, 3.9, 4.3, np.nan, 4.6, 4.4],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', np.nan, 'Seattle', 'Austin', np.nan, 'Denver', 'Portland', 'Miami', 'Atlanta', np.nan, 'Dallas', 'Phoenix']
}

df = pd.DataFrame(data)

print("Original Dataset:")
print("=" * 100)
print(df)
print("\n" + "=" * 100)
print("Dataset Shape:", df.shape)
print("=" * 100)

## Step 2: Detect and Analyze Missing Values

Before handling missing values, we need to understand their extent and pattern.
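
Beyond per-column counts, it can help to look at missingness row by row. The sketch below uses a small hypothetical frame (`toy`, not the notebook's `df`) to count missing fields per row and to tabulate which combinations of columns tend to be missing together.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, used only for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": ["x", "y", None, "z"],
})

is_missing = toy.isna()

# Number of missing fields in each row
missing_per_row = is_missing.sum(axis=1)
print(missing_per_row.tolist())

# How often each combination of missing columns occurs
pattern_counts = is_missing.value_counts()
print(pattern_counts)
```

Rows with many missing fields, or patterns that recur often, are usually worth investigating before choosing a handling strategy.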

In [None]:
# Check for missing values
print("Missing Values Analysis:")
print("=" * 100)

# Method 1: Count of missing values
missing_count = df.isnull().sum()
print("\n1. Missing Values Count:")
print(missing_count)

# Method 2: Percentage of missing values
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': missing_count.values,
    'Missing_Percentage': missing_percentage.values
}).sort_values('Missing_Percentage', ascending=False)

print("\n2. Missing Values Summary:")
print(missing_df)

# Method 3: Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Bar plot
axes[0].bar(missing_df['Column'], missing_df['Missing_Count'], color='coral', alpha=0.7)
axes[0].set_xlabel('Columns', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Missing Count', fontsize=12, fontweight='bold')
axes[0].set_title('Missing Values Count by Column', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# Heatmap
sns.heatmap(df.isnull(), cbar=True, cmap='viridis', yticklabels=False, ax=axes[1])
axes[1].set_title('Missing Values Heatmap (Yellow = Missing)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Columns', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print(f"Total Missing Values: {df.isnull().sum().sum()}")
print(f"Total Cells: {df.shape[0] * df.shape[1]}")
print(f"Overall Missing Percentage: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%")
print("=" * 100)

## Method 1: Deletion Techniques

### 1.1 Listwise Deletion (Complete Case Analysis)

Remove entire rows that contain any missing values. This is the simplest approach but can lead to significant data loss.

**When to Use:**
- Missing data is MCAR (Missing Completely At Random)
- Very small percentage of missing values (<5%)
- Large dataset where losing some rows won't impact analysis

**Pros:** Simple, no bias if MCAR
**Cons:** Loss of information, reduced sample size, potential bias if not MCAR
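
Plain `dropna()` removes a row if any field is missing, but pandas also offers less aggressive variants. The sketch below, on a hypothetical frame with assumed column names, shows `subset` (only require specific columns) and `thresh` (require a minimum number of non-missing values).

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names are assumptions for this sketch
toy = pd.DataFrame({
    "target": [1.0, np.nan, 0.0, 1.0],
    "x1": [np.nan, 2.0, 3.0, 4.0],
    "x2": [np.nan, 5.0, 7.0, 8.0],
})

# Strict listwise deletion: any missing field removes the row
complete_cases = toy.dropna()

# Drop rows only when a column that must be present is missing
has_target = toy.dropna(subset=["target"])

# Keep rows that have at least 2 non-missing values
mostly_complete = toy.dropna(thresh=2)

print(len(toy), len(complete_cases), len(has_target), len(mostly_complete))
```

These variants keep more rows than strict listwise deletion, at the cost of leaving some missing values to be handled by another method.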

In [None]:
# Listwise Deletion
df_listwise = df.dropna()

print("LISTWISE DELETION (Complete Case Analysis)")
print("=" * 100)
print("\nOriginal Shape:", df.shape)
print("After Deletion:", df_listwise.shape)
print(f"Rows Removed: {df.shape[0] - df_listwise.shape[0]} ({((df.shape[0] - df_listwise.shape[0]) / df.shape[0]) * 100:.1f}%)")

print("\nResulting Dataset:")
print(df_listwise)
print("\n" + "=" * 100)

# Visualization
fig, ax = plt.subplots(figsize=(10, 5))
categories = ['Original', 'After Listwise Deletion']
counts = [df.shape[0], df_listwise.shape[0]]
colors = ['steelblue', 'coral']
bars = ax.bar(categories, counts, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)} rows',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

ax.set_ylabel('Number of Rows', fontsize=12, fontweight='bold')
ax.set_title('Impact of Listwise Deletion', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"⚠️ WARNING: Listwise deletion removed {(df.shape[0] - df_listwise.shape[0]) / df.shape[0] * 100:.1f}% of the data!")
print("This is not recommended when you have this much missing data.")
print("=" * 100)

### 1.2 Pairwise Deletion (Available Case Analysis)

Use all available data for each analysis, only excluding specific missing values.

**When to Use:**
- Multiple analyses on different variable combinations
- Want to maximize available data for each analysis

**Pros:** Uses more data than listwise deletion
**Cons:** Can lead to inconsistent sample sizes across analyses
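
The inconsistent sample sizes can be made visible, and guarded against, directly in pandas. The sketch below uses a hypothetical frame to count the rows shared by each pair of columns and passes `min_periods` to `DataFrame.corr` so that pairs with too few shared rows are reported as `NaN`; the cut-off of 3 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "z": [np.nan, np.nan, np.nan, 1.0, 3.0, 2.0],
})

# Number of rows in which both columns of each pair are observed
observed = toy.notna().astype(int)
pair_counts = observed.T @ observed
print(pair_counts)

# Pairwise correlations; pairs with fewer than 3 shared rows become NaN
corr = toy.corr(min_periods=3)
print(corr.round(2))
```

Here `x` and `z` share only two rows, so their correlation is suppressed instead of being estimated from almost no data.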

In [None]:
# Pairwise Deletion Example
print("PAIRWISE DELETION (Available Case Analysis)")
print("=" * 100)

# Calculate correlations using pairwise deletion (default in pandas)
numeric_cols = ['Age', 'Salary', 'Experience', 'Performance_Score']
correlation_pairwise = df[numeric_cols].corr()

print("\nCorrelation Matrix (using pairwise deletion):")
print(correlation_pairwise.round(3))

# Show how many observations were used for each correlation
print("\n\nSample sizes used for each correlation:")
for col1 in numeric_cols:
    for col2 in numeric_cols:
        valid_pairs = df[[col1, col2]].dropna().shape[0]
        print(f"{col1} vs {col2}: {valid_pairs} observations")

# Visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_pairwise, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
plt.title('Correlation Matrix (Pairwise Deletion)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("💡 NOTE: Different correlations used different numbers of observations")
print("This can lead to inconsistent results across analyses.")
print("=" * 100)

### 1.3 Column Deletion

Remove entire columns that have too many missing values (typically >60-70% missing).

**When to Use:**
- Column has excessive missing values
- Column is not critical for analysis
- Cost of imputation outweighs benefit

**Pros:** Simple, removes problematic features
**Cons:** Loss of potentially useful information
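
A compact way to apply such a rule is to compute the missing fraction per column and select the columns below the cut-off. The frame, the column names, and the 70% cut-off below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one mostly empty column
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

max_missing_fraction = 0.7  # assumed cut-off for this sketch

# isna().mean() gives the fraction of missing values per column
missing_fraction = toy.isna().mean()
kept = toy.loc[:, missing_fraction <= max_missing_fraction]

print(missing_fraction.to_dict())
print(list(kept.columns))
```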

In [None]:
# Column Deletion
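# Demonstration threshold: drop columns with more than 30% missing values.
# No column in this sample exceeds 30% (the maximum is 4/15, about 26.7%), so nothing is dropped here.
threshold = 0.3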

print("COLUMN DELETION")
print("=" * 100)
print(f"\nThreshold: Drop columns with > {threshold*100}% missing values")

# Calculate missing percentage for each column
missing_pct = df.isnull().sum() / len(df)
columns_to_drop = missing_pct[missing_pct > threshold].index.tolist()

print(f"\nColumns to drop: {columns_to_drop}")
print(f"Columns with missing > {threshold*100}%:")
for col in columns_to_drop:
    print(f"  - {col}: {missing_pct[col]*100:.1f}% missing")

# Drop columns
df_column_dropped = df.drop(columns=columns_to_drop)

print(f"\nOriginal Columns: {df.shape[1]}")
print(f"After Column Deletion: {df_column_dropped.shape[1]}")
print(f"Columns Removed: {len(columns_to_drop)}")

print("\nRemaining Dataset:")
print(df_column_dropped.head(10))

# Visualization
fig, ax = plt.subplots(figsize=(12, 5))
all_cols = list(df.columns)
colors = ['red' if col in columns_to_drop else 'green' for col in all_cols]
missing_percentages = [missing_pct[col] * 100 for col in all_cols]

bars = ax.bar(all_cols, missing_percentages, color=colors, alpha=0.7, edgecolor='black')
ax.axhline(y=threshold*100, color='black', linestyle='--', linewidth=2, label=f'Threshold ({threshold*100}%)')
ax.set_xlabel('Columns', fontsize=12, fontweight='bold')
ax.set_ylabel('Missing Percentage (%)', fontsize=12, fontweight='bold')
ax.set_title('Column Deletion: Missing Value Threshold', fontsize=14, fontweight='bold')
ax.tick_params(axis='x', rotation=45)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("✅ Columns with acceptable missing values are retained.")
print("=" * 100)

## Method 2: Imputation Techniques

Imputation fills missing values with estimated values based on available data. This preserves sample size and can provide better results than deletion.

### 2.1 Mean/Median/Mode Imputation

Replace missing values with statistical measures of the available data.

**Mean Imputation:** Use average value (for normally distributed numerical data)
**Median Imputation:** Use middle value (for skewed numerical data, robust to outliers)
**Mode Imputation:** Use most frequent value (for categorical data)

**When to Use:**
- MCAR (Missing Completely At Random) data
- Small to moderate amount of missing values
- Quick baseline solution

**Pros:** Simple, fast, preserves sample size
**Cons:** Reduces variance, ignores relationships between variables, can distort distributions
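
The same strategies are available through scikit-learn's `SimpleImputer`, which stores the learned fill values so they can be reused on new data. The sketch below is a minimal illustration on a hypothetical frame; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one skewed numerical column and one categorical column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 90.0],
    "dept": pd.Series(["HR", "IT", np.nan, "IT"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

filled = toy.copy()
# fit_transform expects 2-D input, so select the column as a one-column frame
filled["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()
filled["dept"] = cat_imputer.fit_transform(toy[["dept"]]).ravel()

# statistics_ holds the value learned for each column during fit
print(num_imputer.statistics_, cat_imputer.statistics_)
print(filled)
```

Because the fill values live on the fitted imputer, calling `transform` on later data applies exactly the same values instead of recomputing them.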

In [None]:
# Mean/Median/Mode Imputation
print("MEAN/MEDIAN/MODE IMPUTATION")
print("=" * 100)

# Create copies for different imputation strategies
df_mean = df.copy()
df_median = df.copy()
df_mode = df.copy()

# Numerical columns for mean/median
numerical_cols = ['Age', 'Salary', 'Experience', 'Performance_Score']

# Mean Imputation
for col in numerical_cols:
    mean_value = df[col].mean()
    # Assign the result back: chained inplace fillna is unreliable under pandas Copy-on-Write
    df_mean[col] = df_mean[col].fillna(mean_value)
    print(f"Mean Imputation - {col}: filled with {mean_value:.2f}")

print("\n" + "-" * 100)

# Median Imputation
for col in numerical_cols:
    median_value = df[col].median()
    df_median[col] = df_median[col].fillna(median_value)
    print(f"Median Imputation - {col}: filled with {median_value:.2f}")

print("\n" + "-" * 100)

# Mode Imputation for categorical
categorical_cols = ['Department', 'City']
for col in categorical_cols:
    mode_value = df[col].mode()[0]
    df_mode[col] = df_mode[col].fillna(mode_value)
    print(f"Mode Imputation - {col}: filled with '{mode_value}'")

# Combine median for numerical and mode for categorical (best practice)
df_imputed = df.copy()
for col in numerical_cols:
    df_imputed[col] = df_imputed[col].fillna(df[col].median())
for col in categorical_cols:
    df_imputed[col] = df_imputed[col].fillna(df[col].mode()[0])

print("\n" + "=" * 100)
print("Dataset after Imputation (Median + Mode):")
print(df_imputed)

# Visualization: Compare distributions before and after imputation
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Distribution Comparison: Before vs After Imputation', fontsize=16, fontweight='bold')

for idx, col in enumerate(['Age', 'Salary']):
    row = idx // 2
    col_idx = idx % 2
    
    axes[row, col_idx].hist(df[col].dropna(), bins=10, alpha=0.5, label='Original', color='blue', edgecolor='black')
    axes[row, col_idx].hist(df_imputed[col], bins=10, alpha=0.5, label='After Imputation', color='red', edgecolor='black')
    axes[row, col_idx].set_xlabel(col, fontsize=11, fontweight='bold')
    axes[row, col_idx].set_ylabel('Frequency', fontsize=11, fontweight='bold')
    axes[row, col_idx].set_title(f'{col} Distribution', fontsize=12, fontweight='bold')
    axes[row, col_idx].legend()
    axes[row, col_idx].grid(alpha=0.3)

# Department distribution
dept_original = df['Department'].value_counts()
dept_imputed = df_imputed['Department'].value_counts()
x_pos = np.arange(len(dept_imputed))
axes[1, 0].bar(x_pos - 0.2, [dept_original.get(d, 0) for d in dept_imputed.index], 
               0.4, label='Original', alpha=0.7, color='blue')
axes[1, 0].bar(x_pos + 0.2, dept_imputed.values, 0.4, label='After Imputation', 
               alpha=0.7, color='red')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(dept_imputed.index, rotation=45)
axes[1, 0].set_ylabel('Count', fontsize=11, fontweight='bold')
axes[1, 0].set_title('Department Distribution', fontsize=12, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# Statistics comparison
stats_comparison = pd.DataFrame({
    'Original_Mean': [df[col].mean() for col in numerical_cols[:2]],
    'Imputed_Mean': [df_imputed[col].mean() for col in numerical_cols[:2]],
    'Original_Std': [df[col].std() for col in numerical_cols[:2]],
    'Imputed_Std': [df_imputed[col].std() for col in numerical_cols[:2]]
}, index=['Age', 'Salary'])
axes[1, 1].axis('off')
table = axes[1, 1].table(cellText=stats_comparison.round(2).values,
                          rowLabels=stats_comparison.index,
                          colLabels=stats_comparison.columns,
                          cellLoc='center',
                          loc='center')
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
axes[1, 1].set_title('Statistics Comparison', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("✅ Imputation complete! Missing values: ", df_imputed.isnull().sum().sum())
print("⚠️ NOTE: Imputation reduced variance in the data (see std deviation changes)")
print("=" * 100)

### 2.2 Forward Fill and Backward Fill

Replace missing values with the previous (forward fill) or next (backward fill) valid value. Useful for time-series data.

**Forward Fill (ffill):** Use the last known value
**Backward Fill (bfill):** Use the next known value

**When to Use:**
- Time-series data
- Data with natural ordering
- When recent/upcoming values are good predictors

**Pros:** Preserves temporal patterns, no assumptions about distribution
**Cons:** Only works with ordered data, may propagate errors
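
Two practical details are worth seeing in isolation: forward fill cannot fill a gap at the very start of a series, and the `limit` argument caps how far a single observation is propagated. The series below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a leading gap and a three-day gap
readings = pd.Series(
    [np.nan, 20.0, np.nan, np.nan, np.nan, 24.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Forward fill leaves the leading gap: there is no earlier value to copy
ffilled = readings.ffill()

# limit caps how many consecutive gaps are filled from one observation
ffilled_limited = readings.ffill(limit=1)

# Chaining bfill after ffill fills the remaining leading gap
both = readings.ffill().bfill()

print(pd.DataFrame({
    "raw": readings,
    "ffill": ffilled,
    "ffill(limit=1)": ffilled_limited,
    "ffill+bfill": both,
}))
```

Using `limit` is one way to reduce the error propagation mentioned above, since long gaps stay missing and can be handled by a different method.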

In [None]:
# Forward Fill and Backward Fill
print("FORWARD FILL AND BACKWARD FILL")
print("=" * 100)

# Create a time-series style dataset
time_data = {
    'Date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'Temperature': [20, 22, np.nan, 24, np.nan, 26, 27, np.nan, 29, 30],
    'Humidity': [65, np.nan, 68, 70, np.nan, 72, np.nan, 75, 76, np.nan]
}
df_time = pd.DataFrame(time_data)

print("\nOriginal Time-Series Data:")
print(df_time)

# Forward Fill
df_ffill = df_time.copy()
df_ffill[['Temperature', 'Humidity']] = df_ffill[['Temperature', 'Humidity']].ffill()

print("\n" + "-" * 100)
print("After Forward Fill (ffill):")
print(df_ffill)

# Backward Fill
df_bfill = df_time.copy()
df_bfill[['Temperature', 'Humidity']] = df_bfill[['Temperature', 'Humidity']].bfill()

print("\n" + "-" * 100)
print("After Backward Fill (bfill):")
print(df_bfill)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Temperature comparison
axes[0].plot(df_time['Date'], df_time['Temperature'], 'o--', label='Original (with gaps)', 
             markersize=8, color='blue', alpha=0.6)
axes[0].plot(df_ffill['Date'], df_ffill['Temperature'], 's-', label='Forward Fill', 
             markersize=6, color='green', alpha=0.7)
axes[0].plot(df_bfill['Date'], df_bfill['Temperature'], '^-', label='Backward Fill', 
             markersize=6, color='red', alpha=0.7)
axes[0].set_xlabel('Date', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Temperature', fontsize=11, fontweight='bold')
axes[0].set_title('Temperature: Forward vs Backward Fill', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].tick_params(axis='x', rotation=45)

# Humidity comparison
axes[1].plot(df_time['Date'], df_time['Humidity'], 'o--', label='Original (with gaps)', 
             markersize=8, color='blue', alpha=0.6)
axes[1].plot(df_ffill['Date'], df_ffill['Humidity'], 's-', label='Forward Fill', 
             markersize=6, color='green', alpha=0.7)
axes[1].plot(df_bfill['Date'], df_bfill['Humidity'], '^-', label='Backward Fill', 
             markersize=6, color='red', alpha=0.7)
axes[1].set_xlabel('Date', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Humidity', fontsize=11, fontweight='bold')
axes[1].set_title('Humidity: Forward vs Backward Fill', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("💡 KEY OBSERVATIONS:")
print("  • Forward Fill: Missing values take the LAST known value")
print("  • Backward Fill: Missing values take the NEXT known value")
print("  • Best for time-series where recent values are relevant")
print("=" * 100)

### 2.3 K-Nearest Neighbors (KNN) Imputation

Use K-nearest neighbors algorithm to predict missing values based on similar observations.

**How it Works:**
1. Find K most similar observations (based on other features)
2. Use their values to estimate the missing value
3. Typically uses mean of K neighbors for numerical data

**When to Use:**
- Complex patterns in data
- Relationships between features are important
- Moderate amount of missing data

**Pros:** Considers relationships between features, more accurate than simple imputation
**Cons:** Computationally expensive, sensitive to outliers, requires feature scaling
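
Because KNN relies on distances, features on large scales (such as salary) can dominate features on small scales (such as age). One possible way to handle this, sketched below on hypothetical data, is to standardize first, impute in the scaled space, and then map the result back to the original units. The column names and `n_neighbors=2` are assumptions for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with features on very different scales
toy = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 42.0, 45.0, 50.0],
    "salary": [50_000.0, 60_000.0, 65_000.0, np.nan, 90_000.0, 100_000.0],
})

# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space so that no single feature dominates the distance
imputer = KNNImputer(n_neighbors=2)
scaled_imputed = imputer.fit_transform(scaled)

# Map the result back to the original units
imputed = pd.DataFrame(scaler.inverse_transform(scaled_imputed), columns=toy.columns)
print(imputed.round(1))
```

With only two columns the scaling does not change which neighbors are chosen, but with several features on different scales it usually does.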

In [None]:
# KNN Imputation
print("K-NEAREST NEIGHBORS (KNN) IMPUTATION")
print("=" * 100)

# Prepare data for KNN (only numerical columns)
df_knn = df[numerical_cols].copy()

print("\nOriginal Data (Numerical columns only):")
print(df_knn)
print(f"\nMissing values: {df_knn.isnull().sum().sum()}")

# Apply KNN Imputation
knn_imputer = KNNImputer(n_neighbors=3, weights='distance')
df_knn_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df_knn),
    columns=df_knn.columns
)

print("\n" + "-" * 100)
print("After KNN Imputation (K=3):")
print(df_knn_imputed)
print(f"\nMissing values: {df_knn_imputed.isnull().sum().sum()}")

# Compare with median imputation
df_median_comp = df_knn.copy()
for col in df_median_comp.columns:
    df_median_comp[col] = df_median_comp[col].fillna(df_knn[col].median())

# Visualization: Compare KNN vs Median Imputation
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('KNN Imputation vs Median Imputation', fontsize=16, fontweight='bold')

for idx, col in enumerate(['Age', 'Salary']):
    row = idx
    
    # Distribution comparison
    axes[row, 0].hist(df_knn[col].dropna(), bins=8, alpha=0.5, label='Original', 
                      color='blue', edgecolor='black')
    axes[row, 0].hist(df_knn_imputed[col], bins=8, alpha=0.5, label='KNN Imputed', 
                      color='green', edgecolor='black')
    axes[row, 0].hist(df_median_comp[col], bins=8, alpha=0.5, label='Median Imputed', 
                      color='red', edgecolor='black')
    axes[row, 0].set_xlabel(col, fontsize=11, fontweight='bold')
    axes[row, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
    axes[row, 0].set_title(f'{col} Distribution Comparison', fontsize=12, fontweight='bold')
    axes[row, 0].legend()
    axes[row, 0].grid(alpha=0.3)
    
    # Box plot comparison
    data_to_plot = [df_knn[col].dropna(), df_knn_imputed[col], df_median_comp[col]]
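    # Set tick labels separately: boxplot's `labels` argument was renamed to `tick_labels` in Matplotlib 3.9
    axes[row, 1].boxplot(data_to_plot)
    axes[row, 1].set_xticklabels(['Original', 'KNN', 'Median'])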
    axes[row, 1].set_ylabel(col, fontsize=11, fontweight='bold')
    axes[row, 1].set_title(f'{col} Box Plot Comparison', fontsize=12, fontweight='bold')
    axes[row, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics comparison
print("\n" + "=" * 100)
print("STATISTICS COMPARISON: KNN vs Median Imputation")
print("=" * 100)

for col in ['Age', 'Salary']:
    print(f"\n{col}:")
    print(f"  Original - Mean: {df_knn[col].mean():.2f}, Std: {df_knn[col].std():.2f}")
    print(f"  KNN      - Mean: {df_knn_imputed[col].mean():.2f}, Std: {df_knn_imputed[col].std():.2f}")
    print(f"  Median   - Mean: {df_median_comp[col].mean():.2f}, Std: {df_median_comp[col].std():.2f}")

print("\n" + "=" * 100)
print("💡 KEY INSIGHTS:")
print("  • KNN imputation typically preserves more variance than median imputation (compare the Std values above)")
print("  • KNN considers relationships between features")
print("  • KNN often produces more realistic imputed values")
print("=" * 100)

### 2.4 Iterative Imputation (MICE - Multivariate Imputation by Chained Equations)

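Uses machine learning models to predict missing values iteratively, modeling each feature with missing values as a function of the other features. scikit-learn's `IterativeImputer` is inspired by MICE but, by default, returns a single imputed dataset rather than multiple imputations.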

**How it Works:**
1. Start with simple imputation (mean/median)
2. For each feature with missing values:
   - Use it as target variable
   - Use other features as predictors
   - Train a model and predict missing values
3. Repeat multiple times until convergence

**When to Use:**
- Complex multivariate relationships
- MAR (Missing At Random) data
- Need high accuracy

**Pros:** Most sophisticated, handles complex relationships, often most accurate
**Cons:** Computationally intensive, can be slow, requires careful parameter tuning
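
Classical MICE produces several completed datasets and pools the results. A rough way to approximate that with scikit-learn, sketched below under the assumption of synthetic two-column data, is to run `IterativeImputer` with `sample_posterior=True` under different seeds. The number of imputations (5) and the pooled statistic (a column mean) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic, correlated data (illustrative only)
x = rng.normal(size=200)
X_full = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

# Hide about 20% of the second column completely at random
X_missing = X_full.copy()
hidden = rng.random(200) < 0.2
X_missing[hidden, 1] = np.nan

# sample_posterior=True draws imputations from the predictive distribution,
# so different seeds give different plausible completions
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

# Pool an estimate (here, the column mean) across the completed datasets
estimates = np.array([c[:, 1].mean() for c in completed])
print(estimates.round(3), round(float(estimates.mean()), 3))
```

The spread of `estimates` gives a sense of how much uncertainty the missing values add, which a single imputation hides.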

In [None]:
# Iterative Imputation (MICE)
print("ITERATIVE IMPUTATION (MICE)")
print("=" * 100)

# Prepare data
df_mice = df[numerical_cols].copy()

print("\nOriginal Data:")
print(df_mice)
print(f"\nMissing values: {df_mice.isnull().sum().sum()}")

# Apply Iterative Imputation
mice_imputer = IterativeImputer(max_iter=10, random_state=42, verbose=0)
df_mice_imputed = pd.DataFrame(
    mice_imputer.fit_transform(df_mice),
    columns=df_mice.columns
)

print("\n" + "-" * 100)
print("After MICE Imputation:")
print(df_mice_imputed.round(2))
print(f"\nMissing values: {df_mice_imputed.isnull().sum().sum()}")

# Compare all imputation methods
comparison_data = {
    'Original': df_mice['Age'].dropna().values,
    'Median': df_median_comp['Age'].values,
    'KNN': df_knn_imputed['Age'].values,
    'MICE': df_mice_imputed['Age'].values
}

# Visualization: Compare all methods
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Comparison of All Imputation Methods', fontsize=16, fontweight='bold')

# Age distribution
axes[0, 0].hist([df_mice['Age'].dropna(), df_median_comp['Age'], 
                 df_knn_imputed['Age'], df_mice_imputed['Age']], 
                bins=8, alpha=0.6, label=['Original', 'Median', 'KNN', 'MICE'],
                color=['blue', 'red', 'green', 'purple'], edgecolor='black')
axes[0, 0].set_xlabel('Age', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0, 0].set_title('Age Distribution: All Methods', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Box plots
data_to_plot = [df_mice['Age'].dropna(), df_median_comp['Age'], 
                df_knn_imputed['Age'], df_mice_imputed['Age']]
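# Set tick labels separately: boxplot's `labels` argument was renamed to `tick_labels` in Matplotlib 3.9
bp = axes[0, 1].boxplot(data_to_plot, patch_artist=True)
axes[0, 1].set_xticklabels(['Original', 'Median', 'KNN', 'MICE'])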
colors = ['lightblue', 'lightcoral', 'lightgreen', 'plum']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
axes[0, 1].set_ylabel('Age', fontsize=11, fontweight='bold')
axes[0, 1].set_title('Age Box Plot: All Methods', fontsize=12, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Statistics comparison table
stats_data = {
    'Method': ['Original', 'Median', 'KNN', 'MICE'],
    'Mean': [df_mice['Age'].mean(), df_median_comp['Age'].mean(), 
             df_knn_imputed['Age'].mean(), df_mice_imputed['Age'].mean()],
    'Std': [df_mice['Age'].std(), df_median_comp['Age'].std(), 
            df_knn_imputed['Age'].std(), df_mice_imputed['Age'].std()],
    'Min': [df_mice['Age'].min(), df_median_comp['Age'].min(), 
            df_knn_imputed['Age'].min(), df_mice_imputed['Age'].min()],
    'Max': [df_mice['Age'].max(), df_median_comp['Age'].max(), 
            df_knn_imputed['Age'].max(), df_mice_imputed['Age'].max()]
}
stats_df = pd.DataFrame(stats_data)

axes[1, 0].axis('off')
table = axes[1, 0].table(cellText=stats_df.round(2).values,
                          colLabels=stats_df.columns,
                          cellLoc='center',
                          loc='center')
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
axes[1, 0].set_title('Statistics Comparison: Age', fontsize=12, fontweight='bold', pad=20)

# Variance comparison
variances = [df_mice['Age'].var(), df_median_comp['Age'].var(), 
             df_knn_imputed['Age'].var(), df_mice_imputed['Age'].var()]
methods = ['Original', 'Median', 'KNN', 'MICE']
colors_var = ['blue', 'red', 'green', 'purple']
axes[1, 1].bar(methods, variances, color=colors_var, alpha=0.7, edgecolor='black')
axes[1, 1].set_ylabel('Variance', fontsize=11, fontweight='bold')
axes[1, 1].set_title('Variance Preservation Comparison', fontsize=12, fontweight='bold')
axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("SUMMARY: Method Comparison")
print("=" * 100)
print(f"{'Method':<15} {'Mean':<10} {'Std Dev':<10} {'Variance':<10}")
print("-" * 100)
for method, mean_val, std_val, var_val in zip(
    ['Original', 'Median', 'KNN', 'MICE'],
    [df_mice['Age'].mean(), df_median_comp['Age'].mean(), 
     df_knn_imputed['Age'].mean(), df_mice_imputed['Age'].mean()],
    [df_mice['Age'].std(), df_median_comp['Age'].std(), 
     df_knn_imputed['Age'].std(), df_mice_imputed['Age'].std()],
    variances
):
    print(f"{method:<15} {mean_val:<10.2f} {std_val:<10.2f} {var_val:<10.2f}")

print("\n" + "=" * 100)
print("💡 KEY FINDINGS:")
print("  • Median: Fastest but reduces variance significantly")
print("  • KNN: Good balance between accuracy and computation")
print("  • MICE: Most sophisticated, best preserves relationships")
print("  • Choose based on: data size, missing %, computational resources, accuracy needs")
print("=" * 100)

## Decision Framework: Choosing the Right Method

### Quick Reference Guide

| Scenario | Recommended Method | Why? |
|----------|-------------------|------|
| < 5% missing, MCAR | Listwise Deletion | Simple, minimal data loss |
| Time-series data | Forward/Backward Fill | Preserves temporal patterns |
| Categorical variables | Mode Imputation | Most frequent value is reasonable |
| Numerical, quick baseline | Median Imputation | Robust to outliers, fast |
| Moderate missing, relationships matter | KNN Imputation | Balances accuracy and speed |
| Complex patterns, high accuracy needed | MICE/Iterative | Most sophisticated |
| > 70% missing in column | Column Deletion | Too much missing to impute reliably |

### Step-by-Step Decision Process

1. **Analyze Missing Data Pattern**
   - Check percentage of missing values
   - Identify type: MCAR, MAR, or MNAR
   - Visualize missing data patterns

2. **Consider Data Characteristics**
   - Data type (numerical vs categorical)
   - Dataset size
   - Presence of relationships between features
   - Time-series vs cross-sectional data

3. **Evaluate Trade-offs**
   - Computational resources available
   - Accuracy requirements
   - Data loss tolerance
   - Model complexity

4. **Implement and Validate**
   - Apply chosen method
   - Compare distributions before/after
   - Validate with cross-validation
   - Document assumptions made
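
One simple way to carry out the validation step is to hide values whose true content is known, impute them, and measure the error. The sketch below does this on synthetic data; the data-generating process, the 20% masking rate, and the helper name `rmse_on_hidden` are assumptions made for illustration. This is a masking check, not a substitute for cross-validating the downstream model.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(42)

# Synthetic data with a strong relationship between the two columns
x = rng.normal(size=300)
X_true = np.column_stack([x, 3 * x + rng.normal(scale=0.3, size=300)])

# Hide about 20% of the second column so the true values are known for scoring
mask = rng.random(300) < 0.2
X_missing = X_true.copy()
X_missing[mask, 1] = np.nan

def rmse_on_hidden(imputer):
    """RMSE between imputed and true values on the artificially hidden cells."""
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask, 1] - X_true[mask, 1]) ** 2)))

scores = {
    "median": rmse_on_hidden(SimpleImputer(strategy="median")),
    "knn": rmse_on_hidden(KNNImputer(n_neighbors=5)),
}
print(scores)
```

On data like this, where one column largely determines the other, the neighbor-based imputer should score a much lower error than the median; on data without such relationships the gap can vanish.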

In [None]:
# Decision Framework Visualization
print("DECISION FRAMEWORK VISUALIZATION")
print("=" * 100)

# Create a comparison summary
methods_summary = {
    'Method': ['Listwise Deletion', 'Column Deletion', 'Mean/Median', 
               'Forward/Backward Fill', 'KNN', 'MICE'],
    'Speed': ['Fast', 'Fast', 'Fast', 'Fast', 'Medium', 'Slow'],
    'Accuracy': ['Low', 'Low', 'Medium', 'Medium', 'High', 'Very High'],
    'Preserves Variance': ['Yes', 'N/A', 'No', 'Partial', 'Yes', 'Yes'],
    'Best For': ['MCAR, <5%', '>70% missing', 'Quick baseline', 
                 'Time-series', 'Moderate missing', 'Complex patterns'],
    'Complexity': ['Low', 'Low', 'Low', 'Low', 'Medium', 'High']
}

summary_df = pd.DataFrame(methods_summary)

print("\nMethod Comparison Summary:")
print(summary_df.to_string(index=False))

# Visualization: Method characteristics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Speed comparison
speed_map = {'Fast': 3, 'Medium': 2, 'Slow': 1}
speeds = [speed_map[s] for s in methods_summary['Speed']]
colors_speed = ['green' if s==3 else 'orange' if s==2 else 'red' for s in speeds]
axes[0].barh(methods_summary['Method'], speeds, color=colors_speed, alpha=0.7, edgecolor='black')
axes[0].set_xlabel('Speed', fontsize=11, fontweight='bold')
axes[0].set_title('Computational Speed', fontsize=12, fontweight='bold')
axes[0].set_xticks([1, 2, 3])
axes[0].set_xticklabels(['Slow', 'Medium', 'Fast'])
axes[0].grid(axis='x', alpha=0.3)

# Accuracy comparison
accuracy_map = {'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
accuracies = [accuracy_map[a] for a in methods_summary['Accuracy']]
colors_acc = ['red' if a==1 else 'orange' if a==2 else 'lightgreen' if a==3 else 'darkgreen' 
              for a in accuracies]
axes[1].barh(methods_summary['Method'], accuracies, color=colors_acc, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Accuracy', fontsize=11, fontweight='bold')
axes[1].set_title('Imputation Accuracy', fontsize=12, fontweight='bold')
axes[1].set_xticks([1, 2, 3, 4])
axes[1].set_xticklabels(['Low', 'Medium', 'High', 'Very High'], rotation=45)
axes[1].grid(axis='x', alpha=0.3)

# Complexity comparison
complexity_map = {'Low': 1, 'Medium': 2, 'High': 3}
complexities = [complexity_map[c] for c in methods_summary['Complexity']]
colors_comp = ['green' if c==1 else 'orange' if c==2 else 'red' for c in complexities]
axes[2].barh(methods_summary['Method'], complexities, color=colors_comp, alpha=0.7, edgecolor='black')
axes[2].set_xlabel('Complexity', fontsize=11, fontweight='bold')
axes[2].set_title('Implementation Complexity', fontsize=12, fontweight='bold')
axes[2].set_xticks([1, 2, 3])
axes[2].set_xticklabels(['Low', 'Medium', 'High'])
axes[2].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 100)
print("📊 INTERPRETATION:")
print("  • Green = Better/Faster/Simpler")
print("  • Orange = Moderate")
print("  • Red = Slower/Complex (but often more accurate)")
print("=" * 100)

## Summary and Best Practices

### Key Takeaways

1. **Missing data is common** - Nearly all real-world datasets have missing values
2. **No one-size-fits-all solution** - Choose method based on data characteristics
3. **Deletion is simple but risky** - Can lead to significant information loss
4. **Simple imputation is fast** - Good for baseline, but reduces variance
5. **Advanced methods are powerful** - KNN and MICE preserve relationships better
6. **Validate your approach** - Always check impact on distributions and model performance

### Best Practices

✅ **DO:**
- Analyze the pattern and extent of missing data first
- Document why data is missing (if known)
- Consider the type of missingness (MCAR, MAR, MNAR)
- Compare distributions before and after imputation
- Use domain knowledge to guide method selection
- Test multiple methods and compare results
- Consider computational constraints
- Document your imputation strategy

❌ **DON'T:**
- Ignore missing values and hope the model handles them
- Apply one method blindly to all features
- Delete rows/columns without checking impact
- Use mean imputation for skewed distributions
- Impute categorical variables with numerical methods
- Forget to validate imputation results
- Assume imputed values are real observations

### Real-World Recommendations

**For Production ML Pipelines:**
1. Start with exploratory analysis of missing patterns
2. Use median/mode for quick baseline
3. Implement KNN or MICE for final models
4. Save imputation parameters from training data
5. Apply same imputation to test/production data
6. Monitor for new missing patterns in production

**Common Mistakes to Avoid:**
- Fitting the imputer before the train-test split (data leakage!)
- Using test data statistics for imputation
- Not handling new missing patterns in production
- Over-complicating when simple methods work well
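
A common way to avoid the leakage mistakes listed above is to split first and put the imputer inside a scikit-learn pipeline, so that its statistics are learned from the training rows only. The sketch below uses synthetic data; the model choice (`LogisticRegression`), the 10% missing rate, and the split size are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic classification data with about 10% of cells missing (illustrative only)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Split first, so the test rows never influence the imputation statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X_train, y_train)  # medians are learned from X_train only

# The stored training medians are reused when the test data is transformed
train_medians = model.named_steps["simpleimputer"].statistics_
test_accuracy = model.score(X_test, y_test)
print(train_medians.round(3))
print(f"Test accuracy: {test_accuracy:.2f}")
```

The fitted pipeline can be saved and applied unchanged to production data, which covers the "save imputation parameters" and "apply same imputation" steps in one object.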

### When Missing Values Might Be Informative

Sometimes the fact that a value is missing carries information:
- "Income" missing might indicate unemployment
- "Age" missing might indicate privacy concerns
- "Medical test" missing might mean test wasn't needed

**Solution:** Create a binary indicator feature:
```python
df['Age_was_missing'] = df['Age'].isnull().astype(int)
```
Create the indicator before imputing, then impute the original feature. This keeps the information that the value was missing while still letting the model use the imputed feature.
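
scikit-learn can produce the indicator and the imputed value in one step through the `add_indicator` option of `SimpleImputer`. The sketch below is a minimal illustration on a hypothetical single-column array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single feature with two missing entries
X = np.array([[25.0], [np.nan], [35.0], [np.nan]])

# add_indicator=True appends one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# First column: imputed feature; second column: 1 where the value was missing
print(X_out)
```

Doing both in one transformer keeps the indicator and the imputation consistent between training and later data.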

---

**Remember:** Handling missing values is as much an art as it is science. Understanding your data and the context is crucial for making the right choice!