## Handling Missing Values  
### MCAR — Missing Completely At Random  

In MCAR, missingness is independent of both observed and unobserved data, so simple, unconditional methods are statistically valid.
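
The MCAR condition can be written compactly. The notation below is an assumption (the text above defines no symbols): $R$ is the missingness indicator, and $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and unobserved parts of the data.

$$
P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R)
$$

In words, knowing any data value, seen or unseen, tells you nothing about which entries are missing.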

This short project applies the main simple, unconditional imputation methods to values that are missing under MCAR.
The mean and median of each imputed variable are compared against the mean and median of the original (Raw) variable to identify the most robust imputation method.


### Dataset Generation

In [1]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Create a simple dataset
n = 100
df = pd.DataFrame({"Age": np.random.randint(18, 60, size=n),
    "Income": np.random.randint(20000, 100000, size=n),
    "Education": np.random.choice(["Basic", "Graduation", "Master", "PhD"], size=n)})

# Induce MCAR missingness randomly

# 10% of Age values missing completely at random
df.loc[df.sample(frac=0.1, random_state=1).index, "Age"] = np.nan

# 15% of Income values missing completely at random
df.loc[df.sample(frac=0.15, random_state=2).index, "Income"] = np.nan

# 5% of Education values missing completely at random
df.loc[df.sample(frac=0.05, random_state=3).index, "Education"] = np.nan

print(df.head(15))


     Age   Income   Education
0   56.0      NaN      Master
1   46.0  50535.0      Master
2   32.0      NaN         PhD
3   25.0  72256.0         PhD
4   38.0  55222.0      Master
5   56.0  97373.0         PhD
6   36.0  99575.0         NaN
7   40.0  83335.0       Basic
8   28.0  30965.0         PhD
9   28.0  44538.0         PhD
10  41.0  90592.0      Master
11  53.0  28110.0  Graduation
12  57.0  99309.0         PhD
13  41.0      NaN       Basic
14  20.0      NaN      Master
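
A quick sanity check on induced missingness can be sketched as follows. This is a minimal, self-contained example that builds its own small frame (`demo`, with a seed chosen only for this sketch) rather than reusing `df`, so its numbers will not match the output above. It reports the share of missing values per column and compares mean `Age` between rows with and without `Income`; under MCAR the two groups should look broadly similar.

```python
import numpy as np
import pandas as pd

# Assumption: a fresh toy frame and seed, used only for this illustration.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n).astype(float),
    "Income": rng.integers(20000, 100000, size=n).astype(float),
})

# Blank out 15% of Income on rows chosen without looking at any values (MCAR).
missing_idx = demo.sample(frac=0.15, random_state=2).index
demo.loc[missing_idx, "Income"] = np.nan

# Share of missing values per column.
missing_share = demo.isna().mean()
print(missing_share)

# Mean Age for rows where Income is present (False) vs missing (True).
age_by_missing = demo.groupby(demo["Income"].isna())["Age"].mean()
print(age_by_missing)
```

With only 15 missing rows, the two group means will differ somewhat by chance; a large, systematic gap would instead hint that missingness depends on `Age`.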


## Focus of Analysis  
### Single Feature — Income  

For this dataset, the current analysis is centered on the **Income** variable.  
This feature is used to demonstrate different imputation techniques under the MCAR (Missing Completely At Random) assumption,  
and to compare how mean and median values change across various methods.

##### Dropna (not the focus of this project)
##### Fillna
##### SimpleImputer
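
Dropna is not explored further in this project, but for reference, a minimal sketch of complete-case deletion on a single column might look like this. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the dataset above.
toy = pd.DataFrame({
    "Age": [25.0, 32.0, 41.0, 56.0],
    "Income": [50000.0, np.nan, 80000.0, np.nan],
})

# Keep only the rows where Income is observed; other columns are untouched.
complete_cases = toy.dropna(subset=["Income"])

print(len(toy), "->", len(complete_cases))
```

Under MCAR the remaining rows are still a random sample, so the cost of deletion is mainly a smaller sample size.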

### Fillna method
#### Fill with the mean, the median, or a randomly selected observed value

In [16]:
### fillna - mean
df["Income_mean"] = df["Income"].fillna(df["Income"].mean())

### fillna - median
df["Income_median"] = df["Income"].fillna(df["Income"].median())

### fillna - random choice
### Note: np.random.choice without `size` draws ONE observed value,
### so every missing entry is filled with that same number.
df["Income_ran"] = df["Income"].fillna(np.random.choice(df["Income"].dropna()))

df

| | Age | Income | Education | Income_mean | Income_median | Income_ran |
|---|---|---|---|---|---|---|
| 0 | 56.0 | NaN | Master | 62006.552941 | 66576.0 | 85318.0 |
| 1 | 46.0 | 50535.0 | Master | 50535.000000 | 50535.0 | 50535.0 |
| 2 | 32.0 | NaN | PhD | 62006.552941 | 66576.0 | 85318.0 |
| 3 | 25.0 | 72256.0 | PhD | 72256.000000 | 72256.0 | 72256.0 |
| 4 | 38.0 | 55222.0 | Master | 55222.000000 | 55222.0 | 55222.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 59.0 | 69811.0 | Basic | 69811.000000 | 69811.0 | 69811.0 |
| 96 | 56.0 | 22811.0 | NaN | 22811.000000 | 22811.0 | 22811.0 |
| 97 | 58.0 | 76250.0 | Basic | 76250.000000 | 76250.0 | 76250.0 |
| 98 | 45.0 | 92082.0 | PhD | 92082.000000 | 92082.0 | 92082.0 |
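
Note that `Income_ran` shows the same fill value (85318.0) in every missing row, because `np.random.choice` is called without a `size` and therefore returns a single draw. If the intent is an independent random draw per missing entry, one possible variant is sketched below. It is not the method used in this notebook; the series `income`, the generator `rng`, and the seed are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a small stand-alone series and seed, used only for this sketch.
rng = np.random.default_rng(42)
income = pd.Series(
    [50535.0, np.nan, 72256.0, np.nan, 55222.0, 97373.0, np.nan],
    name="Income",
)

observed = income.dropna().to_numpy()   # pool of values to sample from
is_missing = income.isna()              # boolean mask of the gaps

# Draw one observed value per missing entry (sampling with replacement).
income_random = income.copy()
income_random[is_missing] = rng.choice(observed, size=int(is_missing.sum()))

print(income_random)
```

Drawing per row keeps the spread of the filled values closer to that of the observed data, whereas a single repeated value piles all the gaps onto one point.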


### Simple Imputer 
#### Both mean and median

In [29]:
from sklearn.impute import SimpleImputer
### Mean imputer
### SimpleImputer expects 2-D input, hence df[["Income"]]; ravel() flattens the result to one column.
imputer = SimpleImputer(strategy="mean")
df["Income_simple_imputer_mean"] = imputer.fit_transform(df[["Income"]]).ravel()

### Median imputer
imputer2 = SimpleImputer(strategy="median")
df["Income_simple_imputer_median"] = imputer2.fit_transform(df[["Income"]]).ravel()
df

| | Age | Income | Education | Income_mean | Income_median | Income_ran | Income_simple_imputer_mean | Income_simple_imputer_median |
|---|---|---|---|---|---|---|---|---|
| 0 | 56.0 | NaN | Master | 62006.552941 | 66576.0 | 85318.0 | 62006.552941 | 66576.0 |
| 1 | 46.0 | 50535.0 | Master | 50535.000000 | 50535.0 | 50535.0 | 50535.000000 | 50535.0 |
| 2 | 32.0 | NaN | PhD | 62006.552941 | 66576.0 | 85318.0 | 62006.552941 | 66576.0 |
| 3 | 25.0 | 72256.0 | PhD | 72256.000000 | 72256.0 | 72256.0 | 72256.000000 | 72256.0 |
| 4 | 38.0 | 55222.0 | Master | 55222.000000 | 55222.0 | 55222.0 | 55222.000000 | 55222.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 59.0 | 69811.0 | Basic | 69811.000000 | 69811.0 | 69811.0 | 69811.000000 | 69811.0 |
| 96 | 56.0 | 22811.0 | NaN | 22811.000000 | 22811.0 | 22811.0 | 22811.000000 | 22811.0 |
| 97 | 58.0 | 76250.0 | Basic | 76250.000000 | 76250.0 | 76250.0 | 76250.000000 | 76250.0 |
| 98 | 45.0 | 92082.0 | PhD | 92082.000000 | 92082.0 | 92082.0 | 92082.000000 | 92082.0 |
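
The `SimpleImputer` columns match the corresponding `fillna` columns row for row. A small sketch of how that equivalence could be checked programmatically is given below, using a made-up frame `toy` so the example runs on its own. After fitting, the learned fill value is exposed in the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from the dataset above.
toy = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 90000.0, np.nan]})

# Manual median fill with pandas.
manual = toy["Income"].fillna(toy["Income"].median())

# The same fill through scikit-learn (input must be 2-D).
imputer = SimpleImputer(strategy="median")
via_sklearn = imputer.fit_transform(toy[["Income"]]).ravel()

# statistics_ holds the fill value learned during fit, one entry per column.
print(imputer.statistics_)

# True when both approaches produce the same filled column.
same_result = np.allclose(manual.to_numpy(), via_sklearn)
print(same_result)
```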


## Comparison of Mean and Median Values Across Different Imputation Methods for Hypothetical MCAR Missing Data


In [39]:
v= {
    "Income(Raw)": [df["Income"].mean(), df["Income"].median()],
    "Income_mean": [df["Income_mean"].mean(),df["Income_mean"].median()],
    "Income_median": [df["Income_median"].mean(), df["Income_median"].median()],
    "Income_ran": [df["Income_ran"].mean(), df["Income_ran"].median()],
    "Income_simple_imputer_mean": [df["Income_simple_imputer_mean"].mean(), df["Income_simple_imputer_mean"].median()],
    "Income_simple_imputer_median": [df["Income_simple_imputer_median"].mean(),df["Income_simple_imputer_median"].median()]
}
df_mcar_summary = pd.DataFrame(v, index=["mean", "median"])
df_mcar_summary

| | Income(Raw) | Income_mean | Income_median | Income_ran | Income_simple_imputer_mean | Income_simple_imputer_median |
|---|---|---|---|---|---|---|
| mean | 62006.552941 | 62006.552941 | 62691.97 | 65503.27 | 62006.552941 | 62691.97 |
| median | 66576.0 | 62006.552941 | 66576.0 | 70932.0 | 62006.552941 | 66576.0 |


## Transposed View for Easier Reading

In [40]:
df_mcar_summary_Transpose_view=df_mcar_summary.T
df_mcar_summary_Transpose_view

| | mean | median |
|---|---|---|
| Income(Raw) | 62006.552941 | 66576.0 |
| Income_mean | 62006.552941 | 62006.552941 |
| Income_median | 62691.97 | 66576.0 |
| Income_ran | 65503.27 | 70932.0 |
| Income_simple_imputer_mean | 62006.552941 | 62006.552941 |
| Income_simple_imputer_median | 62691.97 | 66576.0 |
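
To make the comparison easier to read, the shift of each method relative to the raw column can be computed directly. The sketch below re-enters the figures from the transposed table by hand (so it runs without the notebook state) and subtracts the `Income(Raw)` row from every row; the names `summary` and `shift` are introduced only for this illustration.

```python
import pandas as pd

# Values copied from the transposed summary table above.
summary = pd.DataFrame(
    {
        "mean": [62006.552941, 62006.552941, 62691.97, 65503.27,
                 62006.552941, 62691.97],
        "median": [66576.0, 62006.552941, 66576.0, 70932.0,
                   62006.552941, 66576.0],
    },
    index=[
        "Income(Raw)", "Income_mean", "Income_median", "Income_ran",
        "Income_simple_imputer_mean", "Income_simple_imputer_median",
    ],
)

# Subtract the raw row from every row: 0 means the statistic is preserved.
shift = summary - summary.loc["Income(Raw)"]

print(shift.round(2))
```

Read this way, mean imputation leaves the mean untouched but moves the median, median imputation does the reverse, and the random fill moves both.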


## Final Verdict
### Median Imputation (Income_median or SimpleImputer(strategy='median'))
### Why

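- Preserves the median (critical for income): it stays at 66576.0, the raw value, whereas mean imputation pulls it down to 62006.55

- Small mean distortion: the mean moves from 62006.55 to 62691.97 (about 1.1%), compared with 65503.27 for the random fill; only mean imputation leaves the mean unchanged

- Robust to outliers: the fill value is the median, which a few extreme incomes barely move

- Works under MCAR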

### Manual Median vs SimpleImputer Median

Manual median imputation and SimpleImputer with strategy="median" give the same numerical results. The preference for SimpleImputer comes from workflow, scalability, and integration.
### SimpleImputer is preferred
- Automation & Reusability -- SimpleImputer can be fit once and then applied to multiple datasets (train/test splits, new incoming data).
- Consistency in ML Pipelines -- Fitting the imputer on the training data and reusing it at prediction time applies the same fill value in both stages and keeps test-set statistics out of the fit, avoiding data leakage.
- Flexibility -- You can switch strategies (mean, median, most_frequent, constant) without rewriting code.
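
The fit-once, apply-many pattern described in the list above can be sketched as follows. The `train` and `test` frames are made-up toy data, introduced only to show that the fill value is learned from the training split and then reused unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test splits, not taken from the dataset above.
train = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 90000.0]})
test = pd.DataFrame({"Income": [np.nan, 120000.0]})

# Learn the median from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train[["Income"]])

# Apply the same learned value to both splits.
train_filled = imputer.transform(train[["Income"]])
test_filled = imputer.transform(test[["Income"]])

print(imputer.statistics_)   # median of the observed training incomes
print(test_filled.ravel())   # the test gap is filled with the training median
```

Switching to another strategy only requires changing the `strategy` argument (for example `"mean"` or `"most_frequent"`); the fit and transform calls stay the same.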