# Handling Missing Numerical Data - GOLD VERSION 🔥

**Author:** Sachin Laxman Masti  
**Goal:**  
- Understand which imputation method to use in which situation  
- Understand every line of the code clearly  
- Compare model performance  

---


# 📌 When To Use Which Imputation?

## 1️⃣ Mean Imputation
Use when:
- The data is approximately normally distributed
- There are few outliers
- You are using linear models

Avoid when:
- The data is skewed
- There are strong outliers

---

## 2️⃣ Median Imputation
Use when:
- The data is skewed
- Outliers are present
- You need a robust solution
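
To make the mean-versus-median guidance concrete, here is a small sketch on made-up ages (the values, including the outlier of 500, are assumptions chosen for illustration). It compares both statistics with and without a single extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the second series contains one extreme outlier (500).
ages_clean = pd.Series([25, 30, 35, 45, 50, np.nan])
ages_outlier = pd.Series([25, 30, 35, 45, 500, np.nan])

# pandas skips NaN by default when computing mean and median.
mean_clean, median_clean = ages_clean.mean(), ages_clean.median()
mean_outlier, median_outlier = ages_outlier.mean(), ages_outlier.median()

print("clean   -> mean:", mean_clean, "median:", median_clean)
print("outlier -> mean:", mean_outlier, "median:", median_outlier)
```

Here the mean moves from 37 to 127 once the outlier is present, while the median stays at 35, which is why the median is the safer fill value for skewed or outlier-heavy columns.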

---

## 3️⃣ Arbitrary Value (-999 etc.)
Use when:
- You are using tree models (Random Forest, XGBoost)
- You want missingness itself to act as a signal

Avoid in:
- Linear regression (the sentinel value distorts the fit)
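
The following sketch shows why a sentinel such as -999 can work with tree models. It uses scikit-learn's `DecisionTreeRegressor` on a small made-up table (both the data and the choice of a single decision tree instead of Random Forest or XGBoost are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data: two rows have a missing age.
train = pd.DataFrame({
    "age": [25, 60, 35, 70, 45, np.nan, np.nan],
    "salary": [30000, 100000, 50000, 120000, 70000, 60000, 90000],
})

# Replace missing ages with a sentinel far outside the real age range.
X = train[["age"]].fillna(-999)
y = train["salary"]

# A fully grown tree can isolate the sentinel rows in a leaf of their own.
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, y)

# The prediction for a "missing" row is the average salary of the missing rows.
missing_row = pd.DataFrame({"age": [-999]})
pred_missing = tree.predict(missing_row)[0]
print(pred_missing)
```

The tree treats -999 as its own group and predicts 75000 for it (the mean of 60000 and 90000), without disturbing the splits among real ages. A linear model has no such option: it must draw one straight line through the sentinel and the real values together.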

---

## 4️⃣ End of Distribution (Mean + 3*Std)
Use when:
- Missingness is informative
- You want missing values treated as extreme values

---

## 5️⃣ Random Sample Imputation
Use when:
- You want to preserve the distribution
- You want to maintain the statistical properties of the variable

Avoid when:
- The dataset is very small
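
As a rough illustration of "preserving the distribution", the sketch below compares the spread of a column after mean imputation and after random sample imputation. The ages and the seed are assumptions picked for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
ages = pd.Series([25, 60, 35, 70, 45, np.nan, np.nan])
observed = ages.dropna()

# Mean imputation: every gap gets the same central value.
mean_filled = ages.fillna(observed.mean())

# Random sample imputation: each gap gets a value drawn from the observed ones.
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this sketch
n_missing = int(ages.isna().sum())
draws = rng.choice(observed.to_numpy(), size=n_missing)
random_filled = ages.copy()
random_filled[ages.isna()] = draws

print("std observed     :", observed.std())
print("std mean-filled  :", mean_filled.std())
print("std random-filled:", random_filled.std())
```

Mean imputation always shrinks the standard deviation, because it adds points that sit exactly at the centre. Sampled values stay within the set of observed values, but with only a handful of rows the draws may not be representative, which is the reason for the small-dataset warning.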


## 1️⃣ Import Libraries

In [1]:
# For numerical operations
import numpy as np

# For data handling
import pandas as pd

# ML utilities
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fix the seed so every run gives the same random results
np.random.seed(42)

## 2️⃣ Create Sample Dataset

In [2]:
# Build an artificial dataset
data = {
    "age": [25, 30, 35, np.nan, 45, 50, np.nan, 60, 65, 70],  # contains missing values
    "salary": [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000]
}

# Create the DataFrame
df = pd.DataFrame(data)

df

| | age | salary |
|---|---|---|
| 0 | 25.0 | 30000 |
| 1 | 30.0 | 40000 |
| 2 | 35.0 | 50000 |
| 3 | NaN | 60000 |
| 4 | 45.0 | 70000 |
| 5 | 50.0 | 80000 |
| 6 | NaN | 90000 |
| 7 | 60.0 | 100000 |
| 8 | 65.0 | 110000 |
| 9 | 70.0 | 120000 |


## 3️⃣ Check Missing Values

In [3]:
# Count the missing values in each column
df.isnull().sum()

age       2
salary    0
dtype: int64
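
Counts are easier to judge as proportions. A minimal sketch, rebuilding the same dataset so the block runs on its own: taking the mean of the boolean mask gives the share of missing rows per column.

```python
import numpy as np
import pandas as pd

# Same artificial dataset as in this notebook.
df = pd.DataFrame({
    "age": [25, 30, 35, np.nan, 45, 50, np.nan, 60, 65, 70],
    "salary": [30000, 40000, 50000, 60000, 70000,
               80000, 90000, 100000, 110000, 120000],
})

# Mean of a boolean mask = fraction of True values, i.e. share of missing rows.
missing_share = df.isnull().mean()
print(missing_share)
```

For this dataset, 20% of `age` is missing and `salary` is complete.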

## 4️⃣ Train Test Split (Data Leakage Avoidance)

In [4]:
# Define the features (X) and the target (y)
X = df[["age"]]
y = df["salary"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

X_train

| | age |
|---|---|
| 0 | 25.0 |
| 7 | 60.0 |
| 2 | 35.0 |
| 9 | 70.0 |
| 4 | 45.0 |
| 3 | NaN |
| 6 | NaN |
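
Splitting before imputing matters because any statistic computed on the full dataset also "sees" the test rows. The sketch below, which rebuilds the same data and split so it is self-contained, compares the mean of `age` on all rows with the mean on the training rows only (the variable names `leaky_mean` and `train_mean` are introduced here for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [25, 30, 35, np.nan, 45, 50, np.nan, 60, 65, 70],
    "salary": [30000, 40000, 50000, 60000, 70000,
               80000, 90000, 100000, 110000, 120000],
})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["salary"], test_size=0.3, random_state=42
)

leaky_mean = df["age"].mean()       # uses every row, including the test rows
train_mean = X_train["age"].mean()  # uses the training rows only

print("mean on all rows  :", leaky_mean)
print("mean on train rows:", train_mean)
```

The two values differ (47.5 versus 47.0). Filling with the first one would let information from the test set leak into training, so the cells that follow compute every fill value from `X_train` alone.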


## 🔹 Mean Imputation

In [5]:
# Compute the mean on the training data only
mean_value = X_train["age"].mean()

# Replace missing values with the training mean
X_train_mean = X_train.fillna(mean_value)
X_test_mean = X_test.fillna(mean_value)

# Train the model
model = LinearRegression()
model.fit(X_train_mean, y_train)

# Predict on the test set
pred_mean = model.predict(X_test_mean)

# Check performance
print("R2 Score (Mean):", r2_score(y_test, pred_mean))

R2 Score (Mean): 0.9999007170435742
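
The same fit-on-train, apply-to-test pattern can also be written with scikit-learn's `SimpleImputer`. This is an alternative sketch, not the approach used in the cell above. The training ages match the split in this notebook, while the test frame with one gap is a hypothetical example added to show `transform` filling a value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Training ages as produced by the split in this notebook.
X_train = pd.DataFrame({"age": [25, 60, 35, 70, 45, np.nan, np.nan]})
# Hypothetical test set with one missing age, for illustration only.
X_test = pd.DataFrame({"age": [65, np.nan, 50]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training data and fills it in.
X_train_filled = imputer.fit_transform(X_train)
# transform reuses the stored training mean; nothing is learned from the test set.
X_test_filled = imputer.transform(X_test)

print("learned mean:", imputer.statistics_[0])
print(X_test_filled.ravel())
```

Switching `strategy` to `"median"` gives the median variant with the same two calls.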


## 🔹 Median Imputation

In [6]:
# Compute the median on the training data only
median_value = X_train["age"].median()

# Fill missing values with the training median
X_train_median = X_train.fillna(median_value)
X_test_median = X_test.fillna(median_value)

model = LinearRegression()
model.fit(X_train_median, y_train)

pred_median = model.predict(X_test_median)

print("R2 Score (Median):", r2_score(y_test, pred_median))

R2 Score (Median): 0.9975596145775447


## 🔹 Arbitrary Value Imputation

In [7]:
# Replace missing values with the sentinel -999
X_train_arb = X_train.fillna(-999)
X_test_arb = X_test.fillna(-999)

model = LinearRegression()
model.fit(X_train_arb, y_train)

pred_arb = model.predict(X_test_arb)

print("R2 Score (Arbitrary):", r2_score(y_test, pred_arb))

R2 Score (Arbitrary): -0.004914948303515576


## 🔹 End of Distribution Imputation

In [8]:
# Compute the mean and standard deviation on the training data
mean = X_train["age"].mean()
std = X_train["age"].std()

# Build the extreme value at the upper end of the distribution
end_value = mean + 3 * std

# Replace missing values with the extreme value
X_train_end = X_train.fillna(end_value)
X_test_end = X_test.fillna(end_value)

model = LinearRegression()
model.fit(X_train_end, y_train)

pred_end = model.predict(X_test_end)

print("R2 Score (End of Distribution):", r2_score(y_test, pred_end))

R2 Score (End of Distribution): 0.32219081545914297
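
To see why the score drops with a linear model, it helps to look at the value that actually gets imputed. A small sketch using the training ages from this notebook's split (rebuilt here so the block runs on its own):

```python
import numpy as np
import pandas as pd

# Training ages from the split above; two entries are missing.
train_age = pd.Series([25, 60, 35, 70, 45, np.nan, np.nan])

mean = train_age.mean()
std = train_age.std()  # pandas uses the sample standard deviation (ddof=1)
end_value = mean + 3 * std

print("mean:", mean)
print("std :", round(std, 2))
print("end of distribution value:", round(end_value, 2))
print("largest observed age     :", train_age.max())
```

The fill value is about 101.7, well above the largest observed age of 70. The two imputed rows therefore sit far to the right of the real data and pull the regression line towards them.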


## 🔹 Random Sample Imputation

In [9]:
# Helper for random sample imputation
def random_impute(series):
    # If a value is missing, pick a random value from the existing ones
    return series.apply(
        lambda x: np.random.choice(series.dropna()) if pd.isnull(x) else x
    )

X_train_random = X_train.copy()
X_test_random = X_test.copy()

X_train_random["age"] = random_impute(X_train_random["age"])
X_test_random["age"] = random_impute(X_test_random["age"])

model = LinearRegression()
model.fit(X_train_random, y_train)

pred_random = model.predict(X_test_random)

print("R2 Score (Random):", r2_score(y_test, pred_random))

R2 Score (Random): 0.8448544207472782


<span style='color:black'> important </span>

Each time the previous cell is run, it can draw a different random value for the same missing entry. The version below uses a fixed random generator so that the same value is drawn every time. As a result:

**The same random values are produced on every run**

**The result is reproducible**

**This is the appropriate approach for research and production work**

<span style='color:red'> In Random Sample Imputation, using the same random_state for both train and test is logically somewhat risky, because ideally:

The values to sample from should be learned from the training data only

The same distribution should then be applied to the test data

In professional practice:

The random sampling mapping is generated from the training data

It is then applied to the test data. </span>

In [10]:
# Helper for random sample imputation (deterministic version)
def random_impute(series, random_state=42):

    # Create a random generator with a fixed seed
    rng = np.random.default_rng(random_state)

    # Store the non-missing values
    non_missing = series.dropna().values

    # Fill missing values randomly but reproducibly
    return series.apply(
        lambda x: rng.choice(non_missing) if pd.isnull(x) else x
    )

X_train_random = X_train.copy()
X_test_random = X_test.copy()

X_train_random["age"] = random_impute(X_train_random["age"], random_state=42)
X_test_random["age"] = random_impute(X_test_random["age"], random_state=42)

model = LinearRegression()
model.fit(X_train_random, y_train)

pred_random = model.predict(X_test_random)

print("R2 Score (Random):", r2_score(y_test, pred_random))

R2 Score (Random): 0.9164204626529413
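
The caution above recommends drawing the fill values from the training data only and then applying them to the test data. One possible way to do that is sketched below. The helper name `random_impute_from_train`, the test frame with a gap, and the separate seeds for train and test are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def random_impute_from_train(series, train_pool, random_state=42):
    """Fill NaNs in `series` with values drawn from `train_pool` (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    filled = series.copy()
    mask = filled.isna()
    if mask.any():
        # Draw one training value per missing entry.
        filled[mask] = rng.choice(train_pool, size=int(mask.sum()))
    return filled

# Training ages from this notebook's split.
X_train = pd.DataFrame({"age": [25, 60, 35, 70, 45, np.nan, np.nan]})
# Hypothetical test set with one missing age, for illustration only.
X_test = pd.DataFrame({"age": [65, np.nan, 50]})

# The sampling pool is built from the training data only.
train_pool = X_train["age"].dropna().to_numpy()

X_train_random = X_train.assign(
    age=random_impute_from_train(X_train["age"], train_pool, random_state=42)
)
# A different seed for the test set avoids reusing the same draw sequence.
X_test_random = X_test.assign(
    age=random_impute_from_train(X_test["age"], train_pool, random_state=0)
)

print(X_train_random)
print(X_test_random)
```

With this layout the test set never contributes values to its own imputation, and both splits remain reproducible because each call uses a fixed seed.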
