In [None]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
# load dataset
df = sns.load_dataset('titanic')

In [None]:
# check the missing values
df.isnull().sum().sort_values(ascending=False)



## 🧹 **1. Impute Missing Values using Mean, Median, and Mode**

Missing values can reduce model accuracy and cause bias.
**Imputation** replaces missing values with statistical estimates.

---

### 🔹 **Mean Imputation**

* Used for **numerical data**
* Best when data has **no extreme outliers**

```python
df['age'] = df['age'].fillna(df['age'].mean())
```

#### ✅ Positive Points

* Simple and fast ⚡
* Keeps dataset size unchanged
* Works well for normally distributed data

#### ❌ Negative Points

* Affected by outliers ❗
* Reduces data variability
* Can introduce bias if data is skewed
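
The variability point can be checked on a small example. This is a minimal sketch on a made-up `ages` series (hypothetical values, not the Titanic column): the mean stays the same after imputation, but the standard deviation shrinks because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not taken from the Titanic data
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

filled = ages.fillna(ages.mean())

std_before = ages.std()   # computed over the 5 observed values only
std_after = filled.std()  # computed over all 8 values, 3 of them equal to the mean

print(f"mean: {ages.mean():.2f} -> {filled.mean():.2f}")
print(f"std:  {std_before:.2f} -> {std_after:.2f}")
```

The more values are missing, the stronger this shrinkage becomes.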

---

### 🔹 **Median Imputation**

* Used for **numerical data**
* Best when data contains **outliers**

```python
df['age'] = df['age'].fillna(df['age'].median())
```

#### ✅ Positive Points

* Robust to outliers ⚖️
* Better for skewed distributions
* Often more reliable than the mean on skewed real-world data

#### ❌ Negative Points

* Ignores relationships between features
* Still reduces variability
* Slightly less statistically efficient than the mean when the data is close to normal
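
To see why the median is preferred when outliers are present, here is a small sketch on a made-up `fares` series (hypothetical values with one extreme entry). The single outlier pulls the mean far away from the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value
fares = pd.Series([7.25, 8.05, 7.90, np.nan, 8.50, 512.33, np.nan])

mean_fill = fares.mean()      # dragged upward by the outlier
median_fill = fares.median()  # middle of the observed values

print(f"mean fill value:   {mean_fill:.2f}")
print(f"median fill value: {median_fill:.2f}")

fares_filled = fares.fillna(median_fill)
```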

---

### 🔹 **Mode Imputation**

* Used for **categorical data**
* Replaces missing values with the **most frequent value**

```python
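# the Titanic dataset has no 'gender' column; 'embarked' is a categorical column with missing values
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])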
```

#### ✅ Positive Points

* Best choice for categorical features 🏷️
* Easy to implement
* Maintains valid category values

#### ❌ Negative Points

* Can over-represent one category
* Increases class imbalance
* May hide important patterns
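
The over-representation effect is easy to measure. A minimal sketch on a made-up `port` series (hypothetical values): every missing entry is assigned to the most frequent category, so its share grows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy categorical column with 4 missing entries out of 10
port = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan, "C", np.nan, np.nan])

share_before = port.value_counts(normalize=True)  # NaN is excluded by default
filled = port.fillna(port.mode()[0])
share_after = filled.value_counts(normalize=True)

print(f"share of 'S' before: {share_before['S']:.2f}")
print(f"share of 'S' after:  {share_after['S']:.2f}")
```

When many values are missing, adding an explicit "Unknown" category is a common alternative to consider.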

---

## 📌 **Summary Table**

| Method | Best For                | Pros                   | Cons                       |
| ------ | ----------------------- | ---------------------- | -------------------------- |
| Mean   | Numeric (no outliers)   | Fast, simple           | Sensitive to outliers      |
| Median | Numeric (with outliers) | Robust                 | Less efficient             |
| Mode   | Categorical             | Easy, valid categories | Bias toward frequent class |

---



In [None]:
df['age'] = df['age'].fillna(df['age'].mean())

In [None]:
# check the missing values
df.isnull().sum().sort_values(ascending=False)

In [None]:
# fill categorical columns with the mode (most frequent value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
# note: most 'deck' values are missing, so the mode will dominate this column after filling
df['deck'] = df['deck'].fillna(df['deck'].mode()[0])

In [None]:
# check the missing values
df.isnull().sum().sort_values(ascending=False)




## 🔍 **2. K-Nearest Neighbors (KNN) Imputation**

**KNN Imputer** fills missing values by looking at the **K nearest samples** (neighbors) based on feature similarity.
The missing value is replaced with the **average** of that feature across its neighbors. scikit-learn's `KNNImputer` averages (optionally weighted by distance); it does not take a most common value.

📌 Works mainly with **numerical data**.

---

### 🔹 **How KNN Imputation Works**

1. Find the **K nearest rows** using a distance metric (scikit-learn's `KNNImputer` defaults to a Euclidean distance that skips missing coordinates).
2. Use the neighbors' values to estimate the missing value.
3. Replace the missing value with the computed result.
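
The steps above can be traced on a tiny array. This is a sketch with made-up numbers: the row `[3.0, nan]` is closest to `[2.0, 20.0]` and `[1.0, 10.0]` on the first feature, so with `n_neighbors=2` the missing entry becomes the average of 20 and 10.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: two features, one missing entry in the second column
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
    [9.0, 90.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# the two nearest rows (by the first feature) are [2, 20] and [1, 10]
print(X_filled[2])  # [ 3. 15.]
```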

---

### 🔹 **Example Code**

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
# the Titanic dataset has no 'salary' column; 'fare' is used as the second numeric feature
df[['age', 'fare']] = imputer.fit_transform(df[['age', 'fare']])
```

---

## ✅ **Positive Points**

* Uses relationships between features 🧠
* More accurate than mean/median in many cases
* Preserves data patterns
* Good for complex datasets

---

## ❌ **Negative Points**

* Works only with numerical data ❗
* Computationally expensive for large datasets 🐢
* Sensitive to feature scaling
* Unreliable when a large share of the values is missing

---

## ⚠️ **Important Notes**

* **Scale the features** before using KNN Imputer; otherwise features with large ranges dominate the distance

```python
from sklearn.preprocessing import StandardScaler
```

* Categorical data must be encoded first (but results may be misleading)
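
A minimal sketch of scaling before KNN imputation, using a made-up `toy` frame (hypothetical values; the column names only mirror the Titanic ones). It assumes `StandardScaler` leaves NaNs in place, which current scikit-learn versions do, and maps the result back to the original units afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing age
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "sibsp": [1, 1, 0, 1, 0, 3],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)  # NaNs are ignored when fitting and kept in the output

imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# undo the scaling so the filled values are back in the original units
toy_filled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
    index=toy.index,
)
print(toy_filled)
```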

---

## 📌 **When to Use KNN Imputer**

| Situation                    | Recommendation |
| ---------------------------- | -------------- |
| Small to medium dataset      | ✅ Use KNN      |
| Strong feature relationships | ✅ Use KNN      |
| Large dataset                | ❌ Avoid        |
| Categorical features         | ❌ Avoid        |

---

## 🔁 **Comparison with Mean/Median**

| Method        | Uses Feature Relationships | Speed   | Accuracy |
| ------------- | -------------------------- | ------- | -------- |
| Mean / Median | ❌ No                       | ⚡ Fast  | Medium   |
| KNN Imputer   | ✅ Yes                      | 🐢 Slow | High     |

---



In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
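# load dataset
df = sns.load_dataset('titanic')

from sklearn.impute import KNNImputer

# KNN needs other observed features to measure similarity; with 'age' alone,
# rows missing 'age' have nothing to compare on and fall back to the column mean
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']

imputer = KNNImputer(n_neighbors=4)

# unscaled here for brevity; see the scaling note above
df[num_cols] = imputer.fit_transform(df[num_cols])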




## 📈 **3. Regression Imputation**

**Regression Imputer** estimates missing values by **predicting them using other features** through a regression model.

📌 Best suited for **numerical features** with strong relationships to other variables.

---

### 🔹 **How Regression Imputation Works**

1. Select the feature with missing values (target).
2. Use remaining features as predictors.
3. Train a regression model on non-missing data.
4. Predict and replace missing values.

---

### 🔹 **Example Code**

```python
from sklearn.linear_model import LinearRegression

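# use numeric predictors that have no missing values in the Titanic data;
# LinearRegression cannot handle text columns or NaNs in the predictors
features = ['pclass', 'sibsp', 'parch', 'fare']

# separate rows where age is known from rows where it is missing
complete = df[df['age'].notna()]
missing = df[df['age'].isna()]

X_train = complete[features]
y_train = complete['age']

X_test = missing[features]

model = LinearRegression()
model.fit(X_train, y_train)

# fill only the missing ages with the model's predictions
df.loc[df['age'].isna(), 'age'] = model.predict(X_test)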
```

---

## ✅ **Positive Points**

* Uses relationships between features 🧠
* Often more accurate than mean/median when the predictors are informative
* Preserves data trends
* Works well when variables are correlated

---

## ❌ **Negative Points**

* Assumes a linear relationship (when linear regression is used) ❗
* Overfits if data is noisy
* Ignores uncertainty in predictions
* More costly to set up and run than mean/median

---

## ⚠️ **Important Notes**

* Only for **numerical features**
* Handle outliers before applying
* Scale features if needed
* Risk of data leakage if applied incorrectly
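
One way to limit the leakage risk is to fit the imputer on the training rows only and reuse it on the test rows. The sketch below uses synthetic data and `SimpleImputer` for brevity (both are illustrative assumptions); the same fit/transform pattern applies to scikit-learn's regression-based imputers.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 2 numeric features, about 20% missing
rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=10.0, size=(100, 2))
X[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # statistics learned from training rows only
X_test_filled = imputer.transform(X_test)        # reuse them; do not refit on the test rows

print(imputer.statistics_)
```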

---

## 📌 **When to Use Regression Imputer**

| Situation                  | Recommendation |
| -------------------------- | -------------- |
| Strong feature correlation | ✅ Use          |
| Small missing percentage   | ✅ Use          |
| Weak relationship          | ❌ Avoid        |
| Categorical features       | ❌ Avoid        |

---

## 🔁 **Comparison with Other Imputers**

| Method             | Uses Relationships | Complexity | Accuracy |
| ------------------ | ------------------ | ---------- | -------- |
| Mean / Median      | ❌ No               | Low        | Medium   |
| KNN Imputer        | ✅ Yes              | Medium     | High     |
| Regression Imputer | ✅ Yes              | High       | High     |

---


In [None]:
df = sns.load_dataset('titanic')

In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
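from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer

# IterativeImputer regresses each column on the others, so pass several numeric columns;
# with 'age' alone it has no predictors and only fills in the column mean
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']

imputer = IterativeImputer(max_iter=10)

df[num_cols] = imputer.fit_transform(df[num_cols])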

In [None]:
df.isnull().sum().sort_values(ascending=False)