# Handling Missing Data

Strategies for dealing with nulls, NaNs, and empty values.

## Key Concepts
- **Identification:** Finding where data is missing.
- **Deletion:** Removing rows or columns. This discards information, so use it sparingly.
- **Imputation:** Filling with Mean, Median, Mode, or Constant.
- **Advanced Imputation:** Using algorithms such as k-nearest neighbors (KNN).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create messy dataset
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

print("Original Messy Data:")
print(df)

## 1. Identifying Missing Values

In [None]:
# Check for nulls (True/False)
print(df.isnull())

In [None]:
# Count nulls per column
print("Missing Count:")
print(df.isnull().sum())
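
Raw counts are hard to compare across columns, so a percentage view can help. The cell below is a minimal sketch that rebuilds the same toy frame so it runs on its own; the names `missing_pct` and `rows_with_gaps` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, repeated so this cell runs on its own
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing entries per column
missing_pct = df.isnull().mean() * 100
print("Missing %:")
print(missing_pct.sort_values(ascending=False))

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print("Rows with gaps:")
print(rows_with_gaps)
```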

In [None]:
# Visualize missing data (Heatmap)
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

## 2. Deletion (Dropping)
Use when only a small share of rows or columns is affected, or when the affected rows carry too little information to be useful.

In [None]:
# Drop rows with ANY missing value
dropped_rows = df.dropna()
print("Drop Rows:")
print(dropped_rows)

In [None]:
# Drop columns with ANY missing value
dropped_cols = df.dropna(axis=1)
print("Drop Cols:")
print(dropped_cols)
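
Dropping on ANY missing value is often too aggressive. As a sketch, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (minimum number of non-null values to keep a row or column). The cutoffs used here are arbitrary choices for the toy frame, not recommendations.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, repeated so this cell runs on its own
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

# Drop a row only if column A is missing
drop_if_a_missing = df.dropna(subset=['A'])
print(drop_if_a_missing)

# Keep rows that have at least 3 non-null values
rows_min3 = df.dropna(thresh=3)
print(rows_min3)

# Keep columns that have at least 4 non-null values
cols_min4 = df.dropna(axis=1, thresh=4)
print(cols_min4)
```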

## 3. Imputation (Filling)

In [None]:
# Fill with a constant (e.g., 0 or 'Unknown')
# Note: fillna(0) also writes 0 into the text column C
filled_const = df.fillna(0)
print("Filled with 0:")
print(filled_const)
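
A single constant rarely suits every column. One possible approach is to pass a dict to `fillna` so each column gets its own placeholder; the values `0` and `'Unknown'` below are just the examples named in the comment above.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, repeated so this cell runs on its own
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

# Map each column to its own fill value; columns not listed are left alone
filled_per_col = df.fillna({'A': 0, 'B': 0, 'C': 'Unknown'})
print("Filled per column:")
print(filled_per_col)
```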

In [None]:
# Fill with Mean (Numerical)
df_mean = df.copy()
df_mean['A'] = df_mean['A'].fillna(df_mean['A'].mean())
print("Filled A with Mean:")
print(df_mean)
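
The median works the same way and is less sensitive to outliers. This sketch fills every numeric column with its own median in one step; `df_median` and `num_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, repeated so this cell runs on its own
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

df_median = df.copy()

# Pick out the numeric columns only (A, B, D)
num_cols = df_median.select_dtypes(include='number').columns

# median() returns one value per column; fillna aligns them by column name
df_median[num_cols] = df_median[num_cols].fillna(df_median[num_cols].median())
print("Filled numeric columns with Median:")
print(df_median)
```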

In [None]:
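# Fill with Mode (Categorical)
# Note: every value in C is unique here, so mode() returns all of them
# in sorted order and [0] simply picks the first one ('bar')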
df_mode = df.copy()
mode_val = df_mode['C'].mode()[0]
df_mode['C'] = df_mode['C'].fillna(mode_val)
print(f"Filled C with Mode ({mode_val}):")
print(df_mode)

## 4. Forward/Backward Fill
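Suited to ordered data such as time series, where the previous (or next) observation is a reasonable stand-in for the missing one.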

In [None]:
# Forward fill (propagate last valid observation)
print("Forward Fill:")
print(df.ffill())
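
The backward direction works the same way. The sketch below shows `bfill()` and the `limit` parameter, which caps how many consecutive gaps get filled; the limit of 1 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Same toy frame as above, repeated so this cell runs on its own
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'C': ['foo', 'bar', 'baz', np.nan, 'qux'],
    'D': [10, 20, 30, 40, 50]
})

# Backward fill (propagate the next valid observation backwards)
bfilled = df.bfill()
print("Backward Fill:")
print(bfilled)

# Forward fill, but fill at most 1 consecutive gap per run
limited = df.ffill(limit=1)
print("Forward Fill (limit=1):")
print(limited)
```

Note that a forward fill cannot fill a gap in the first row, and a backward fill cannot fill one in the last row, because there is no value to propagate from.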

## 5. Advanced: KNN Imputation
Estimate each missing value from the rows most similar to it (its nearest neighbors).

In [None]:
from sklearn.impute import KNNImputer

# Select numerical columns
df_num = df[['A', 'B', 'D']]

imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(df_num)

df_imputed = pd.DataFrame(imputed_data, columns=['A', 'B', 'D'])
print("KNN Imputed:")
print(df_imputed)
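
KNN is distance based, so a column with a large range (like D) can dominate the neighbor search. One common variation, sketched here as an assumption rather than as part of the original notebook, is to standardize first, impute, then convert back to the original units. It relies on `StandardScaler` ignoring NaNs when fitting and passing them through when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Numeric part of the toy frame, repeated so this cell runs on its own
df_num = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, 8, 10],
    'D': [10, 20, 30, 40, 50]
})

# Standardize each column; NaNs are skipped in fit and kept in transform
scaler = StandardScaler()
scaled = scaler.fit_transform(df_num)

# Impute in the scaled space so every column contributes comparably
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Convert back to the original units
df_knn_scaled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=df_num.columns,
    index=df_num.index,
)
print("KNN Imputed (scaled):")
print(df_knn_scaled)
```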

## Practice Exercise
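Clean the age and cabin information in the Titanic dataset. In the seaborn copy loaded below, the column names are lowercase and there is no `Cabin` column; use `age` and `deck` (the deck letter derived from the cabin) instead.

In [None]:
# Load titanic
titanic = sns.load_dataset('titanic')

# Check missing

# Fill 'age' with Median

# Drop 'deck' (too many missing)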

# Your code here

## Key Takeaways

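✅ **`isnull().sum()`** - Always your first step.
✅ **Drop** - Safest when values are MCAR (Missing Completely At Random) and few rows are affected (a common rule of thumb is <5%).
✅ **Mean/Median** - Simple; use the mean for roughly symmetric data and the median for skewed data.
✅ **KNN** - Useful when features are correlated; works on numeric columns only.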

**Next:** [Outlier Detection](03_outlier_detection.ipynb) →