
# 📌 KNN Imputer & Iterative Imputer – Complete Practical Guide

Author: Sachin Laxman Masti  
Goal: Future reference + GitHub documentation  

---

## 🚀 Why this notebook?

In this notebook we will:
- Understand KNN Imputer in depth
- Understand Iterative Imputer from a machine-learning perspective
- Learn when to use each one and when to avoid it
- Look at industry-level best practices

This notebook is meant as a ready reference for future ML projects.



# 1️⃣ KNN Imputer

## 🧠 Concept

KNN Imputer fills a missing value **based on similar rows**.

In other words:
rows that look similar to the incomplete row are used to estimate its missing value.

It computes the distance between rows (by default a NaN-aware Euclidean distance that uses only the features both rows have).
It then selects the K nearest neighbors that have a value for the missing feature.
Their average (unweighted by default) fills the gap.
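
The three steps above (distance, neighbor selection, averaging) can be sketched by hand and checked against scikit-learn. This is a simplified illustration, not the library's implementation: the helper `knn_fill_one` is a hypothetical name introduced here, and it assumes every row shares at least one observed feature with the target row.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same toy values as the notebook's example: columns are age, salary
X = np.array([
    [25.0, 20000.0],
    [30.0, 25000.0],
    [np.nan, 24000.0],
    [28.0, 23000.0],
])

def knn_fill_one(X, row, col, k):
    """Hypothetical helper: estimate X[row, col] from the k nearest donor rows."""
    # Step 1: NaN-aware Euclidean distance from the target row to every row
    dists = nan_euclidean_distances(X[[row]], X)[0]
    # Step 2: only rows that have a value in `col` can act as donors
    donors = [i for i in range(len(X)) if i != row and not np.isnan(X[i, col])]
    nearest = sorted(donors, key=lambda i: dists[i])[:k]
    # Step 3: unweighted average of the donors' values
    return X[nearest, col].mean()

manual_value = knn_fill_one(X, row=2, col=0, k=2)
sklearn_value = KNNImputer(n_neighbors=2).fit_transform(X)[2, 0]
print(manual_value, sklearn_value)
```

Both values come from the rows with salaries 25000 and 23000, which are the closest to the incomplete row's salary of 24000.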

---

## ✅ When to use it?

- The data is numeric
- Features are correlated
- The percentage of missing values is low
- You want pattern-based filling rather than a single constant

## ❌ When to avoid it?

- The dataset is very large (it gets slow)
- Features have not been scaled
- The percentage of missing values is high


In [None]:

# Import the required libraries
import numpy as np
import pandas as pd

# Import KNN Imputer
from sklearn.impute import KNNImputer

# Build an example dataset with one missing age
data = {
    'age': [25, 30, np.nan, 28],
    'salary': [20000, 25000, 24000, 23000]
}

df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Create the KNN Imputer object
# n_neighbors=2 means the 2 nearest rows are used for each missing value
imputer = KNNImputer(n_neighbors=2)

# fit_transform:
# 1) stores the data so neighbors can be looked up (fit)
# 2) fills the missing values from those neighbors (transform)
df_imputed = imputer.fit_transform(df)

print("\nAfter KNN Imputation:")
print(df_imputed)



## 🔍 Code Explanation

- `KNNImputer(n_neighbors=2)`: uses the 2 nearest rows
- `fit_transform()`: stores the data (fit), then fills the missing values (transform)
- The output is a NumPy array, not a DataFrame
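
Because the result is a plain array, the column names are lost. One common way to get them back, sketched here with the same toy data, is to wrap the array in a new DataFrame that reuses the original columns and index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, 30, np.nan, 28],
    "salary": [20000, 25000, 24000, 23000],
})

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array; rebuild a DataFrame around it
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)
print(df_imputed)
```

In recent scikit-learn versions (1.2 and later), calling `imputer.set_output(transform="pandas")` before fitting should give a DataFrame directly.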

⚠ Important:
On a real dataset, scale the features first;
otherwise the feature with the largest range dominates the distance calculation.
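
A small sketch of why scaling matters. The data below is hypothetical (columns: age, salary, years of experience) and chosen so that the unscaled distance is driven almost entirely by salary. It assumes `StandardScaler`, which ignores NaNs when fitting and leaves them in place when transforming.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: age, salary, years of experience
X = np.array([
    [25.0, 20000.0, 2.0],
    [26.0, 24500.0, 3.0],
    [45.0, 28000.0, 20.0],
    [44.0, 24000.0, np.nan],  # experience is missing
])

# Without scaling: salary differences swamp age differences,
# so the 26-year-old with a similar salary is picked as the neighbor
unscaled = KNNImputer(n_neighbors=1).fit_transform(X)

# With scaling: age and salary contribute on comparable terms,
# so the 45-year-old becomes the nearest neighbor
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NaNs are skipped in fit and preserved in output
filled_scaled = KNNImputer(n_neighbors=1).fit_transform(X_scaled)
scaled = scaler.inverse_transform(filled_scaled)  # back to original units

print("unscaled fill:", unscaled[3, 2])
print("scaled fill:  ", scaled[3, 2])
```

The two runs pick different donors for the same row, which is the distortion the warning above refers to.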



# 2️⃣ Iterative Imputer

## 🧠 Concept

Iterative Imputer **predicts missing values with an ML model**.

Process:
- Treat one column with missing values as the target
- Predict it from the remaining columns
- Replace the missing values with the predictions
- Repeat this for each column, over multiple iterations

It is based on the MICE (Multiple Imputation by Chained Equations) idea, although by default it returns a single imputed dataset.
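
The process above can be sketched as a plain loop. This is a simplified illustration under stated assumptions, not scikit-learn's actual implementation: it starts from column means, always runs a fixed 10 rounds with no convergence check, and uses `BayesianRidge` as the per-column model. The data values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical toy data: age, salary, years of experience
X = np.array([
    [25.0, 20000.0, 2.0],
    [30.0, np.nan, 7.0],
    [np.nan, 24000.0, 5.0],
    [28.0, 23000.0, np.nan],
    [35.0, 30000.0, 12.0],
])
missing = np.isnan(X)

# Initial guess: replace every missing entry with its column mean
filled = np.where(missing, np.nanmean(X, axis=0), X)

for _ in range(10):  # fixed number of rounds (no early stopping in this sketch)
    for col in range(X.shape[1]):
        if not missing[:, col].any():
            continue  # nothing to impute in this column
        # Use all other columns (with their current fills) as predictors
        others = np.delete(filled, col, axis=1)
        observed = ~missing[:, col]
        # Train only on rows where the target column was actually observed
        model = BayesianRidge().fit(others[observed], filled[observed, col])
        # Overwrite the previously guessed entries with fresh predictions
        filled[missing[:, col], col] = model.predict(others[missing[:, col]])

print(filled)
```

Observed values are never modified; only the entries that were originally missing get refined from round to round.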

---

## ✅ When to use it?

- Features are strongly correlated
- The missingness is MAR (missing at random)
- You want model-based filling rather than a simple statistic
- You need to capture complex relationships between features

## ❌ When to avoid it?

- The dataset is small
- Compute is constrained
- The data is MCAR (missing completely at random) and a simple imputer is good enough


In [None]:

# IterativeImputer is experimental, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

print("Original Data:")
print(df)

# Create the Iterative Imputer object
imputer_iter = IterativeImputer(max_iter=10, random_state=42)

# fit_transform trains the per-column models and fills the missing values
df_iter_imputed = imputer_iter.fit_transform(df)

print("\nAfter Iterative Imputation:")
print(df_iter_imputed)



## 🔍 Code Explanation

- `max_iter=10` → at most 10 rounds of imputation (it can stop earlier once the values converge)
- `random_state=42` → for reproducibility
- Default model: `BayesianRidge`

This is not simple filling.
It trains an ML model and predicts the missing value.

⚠ Always use it inside a Pipeline, after the train-test split;
otherwise data leakage can occur.
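
A minimal sketch of the leakage-safe pattern. The data is synthetic and the downstream `LinearRegression` is only a placeholder model; both are assumptions made for illustration. The key point is that the imputer is fitted on the training rows only, and the test rows are filled using what it learned there.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data: the third feature is correlated with the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out about 10% of the values, keeping the first column complete
mask = rng.random(X.shape) < 0.1
mask[:, 0] = False
X[mask] = np.nan

# 1) Split first, so the test rows never influence the imputer
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2) Imputer and model live in one Pipeline
model = make_pipeline(
    IterativeImputer(max_iter=10, random_state=42),
    LinearRegression(),
)

# 3) fit() learns the imputation models from X_train only
model.fit(X_train, y_train)

# 4) predict()/score() apply the already-fitted imputer to X_test
predictions = model.predict(X_test)
print("Test R^2:", model.score(X_test, y_test))
```

Calling `fit_transform` on the full dataset before splitting would let test-set values shape the imputation, which is the leakage the warning describes.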



# ⚔ KNN vs Iterative Quick Comparison

| Feature | KNN | Iterative |
|----------|------|------------|
| Logic | Distance based | Model based |
| Typical accuracy | Medium | Often higher when features are strongly correlated |
| Speed | Slow on large datasets | Medium |
| Relationship capture | Weak | Strong |
| Complexity | Medium | High |

---

# 🎯 Final Advice

Real-world best practice:

1. Do the train-test split first
2. Put the imputer inside a Pipeline
3. Compare imputers with cross-validation
4. Check the model's performance

Do not preprocess blindly.
Always validate with metrics.
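
The checklist above can be sketched as a small comparison loop. Everything here is illustrative: the data is synthetic, `Ridge` is a placeholder model, and a mean imputer is included only as a baseline. On a real project, substitute your own data, model, and metric.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=300)
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 3] + rng.normal(scale=0.2, size=300)

# Remove about 10% of the values, keeping the first column complete
mask = rng.random(X.shape) < 0.1
mask[:, 0] = False
X[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=42),
}

scores = {}
for name, imputer in imputers.items():
    # Scaling, imputation, and the model are refitted inside every CV fold,
    # so no fold's validation rows leak into preprocessing
    pipeline = make_pipeline(StandardScaler(), imputer, Ridge())
    scores[name] = cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name:10s} mean R^2 = {score:.3f}")
```

Which imputer wins depends on the data, so the printed scores are something to inspect rather than assume.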
