## Handle Missing Values

Handling missing values is a crucial step in data cleaning and preparation. The right method depends on the data type, how much is missing, and the problem context. Here's an overview of how to handle missing values effectively:

## 🧭 Step 1: Identify Missing Values

In [None]:
import pandas as pd

# Count missing values per column (df is assumed to be an already-loaded DataFrame)
df.isnull().sum()

# Percentage of missing values per column
df.isnull().mean() * 100


## 🧹 Step 2: Understand the Cause

Before deciding how to handle them, determine why the values are missing:

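* MCAR (Missing Completely at Random): Missingness is unrelated to any variable, observed or not. Generally safe to impute or drop; dropping rows costs sample size but does not introduce bias.

* MAR (Missing at Random): Missingness depends only on observed data. Imputation that conditions on those observed variables is usually appropriate.

* MNAR (Missing Not at Random): Missingness depends on unobserved data, including the missing value itself. This may need explicit modeling or domain knowledge.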
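
One rough way to probe whether missingness relates to observed data is to compare another column between rows where a value is missing and rows where it is present. The sketch below uses a small made-up DataFrame; the column names `Age` and `Income` and all values are assumptions for illustration. A clear gap between the two groups hints at MAR rather than MCAR, although no check on observed data can rule out MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age happens to be missing for the low-income rows
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, np.nan, 31, 52],
    "Income": [48000, 21000, 67000, 19000, 52000, 80000],
})

# Boolean mask marking rows where Age is missing
age_missing = toy["Age"].isnull()

# Compare an observed column between the two groups
mean_income_when_missing = toy.loc[age_missing, "Income"].mean()
mean_income_when_present = toy.loc[~age_missing, "Income"].mean()

print("Mean income, Age missing:", mean_income_when_missing)
print("Mean income, Age present:", mean_income_when_present)
```

On real data, a formal test or a model that predicts the missingness mask from the other columns would give a more reliable answer than eyeballing two means.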

## Step 3: Choose a Handling Strategy
🔹 1. Remove Missing Values

Use this if only a small portion of the data is missing and removing it won't bias the dataset.

In [None]:
# Option A: remove rows with any missing values
df_rows = df.dropna()

# Option B: remove columns with too many missing values
# (thresh must be an integer count; this keeps columns that are at least ~60% non-missing)
df_cols = df.dropna(axis=1, thresh=int(len(df) * 0.6))


🔹 2. Imputation (Filling Missing Values)
🔸 Numerical Data:

Mean / Median / Mode Imputation

In [None]:
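# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not update df
df['Age'] = df['Age'].fillna(df['Age'].mean())
# or, when the column is skewed or has outliers
df['Age'] = df['Age'].fillna(df['Age'].median())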


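In [None]:
# KNN Imputer (more advanced): fills each gap from the k most similar rows.
# It is distance-based, so the columns should be on comparable scales.
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
df[['col1', 'col2']] = imputer.fit_transform(df[['col1', 'col2']])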


Regression Imputation

Predict missing values using a regression model trained on complete cases.
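
A minimal sketch of regression imputation, assuming a single fully observed predictor. The DataFrame, the column names `Income` and `Age`, and the choice of `LinearRegression` are illustrative assumptions, not something the steps above prescribe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: Income is fully observed, Age has gaps
toy = pd.DataFrame({
    "Income": [30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "Age": [25.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Rows where the target column is observed (the complete cases)
known = toy["Age"].notnull()

# Fit the regression on complete cases only
model = LinearRegression()
model.fit(toy.loc[known, ["Income"]], toy.loc[known, "Age"])

# Predict Age for the incomplete rows and write the predictions back
toy.loc[~known, "Age"] = model.predict(toy.loc[~known, ["Income"]])

print(toy)
```

Because every imputed value sits exactly on the fitted line, this approach tends to understate the column's variance, which is worth keeping in mind when validating the result.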

🔸 Categorical Data:

Mode Imputation

In [None]:
# mode() returns a Series (ties are possible), so take the first entry
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])


'Unknown' Category

In [None]:
df['City'] = df['City'].fillna('Unknown')


🔹 3. Model-Based Imputation

Use models (e.g., decision trees, random forests) to predict each feature's missing values from the other features.
Tools like IterativeImputer in sklearn can automate this. It is still marked experimental, so it must be enabled explicitly before it is imported:

In [None]:
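from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the experimental API
from sklearn.impute import IterativeImputer

# IterativeImputer needs numeric input and returns a NumPy array,
# so select the numeric columns and rebuild a DataFrame afterwards
numeric = df.select_dtypes(include='number')
imputer = IterativeImputer(random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns, index=numeric.index)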


🔹 4. Flag Missingness

Add a binary flag column to indicate missingness. Sometimes the fact that data is missing is itself informative. Create the flag before imputing the column, otherwise it will be all zeros:

In [None]:
df['Age_missing'] = df['Age'].isnull().astype(int)


🔹 5. Domain-Specific Handling

Sometimes, missing values have a specific meaning (e.g., "No transaction", "Not applicable").
In such cases, replace them with an appropriate value such as 0 or a dedicated category like NA_category.
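
As a small illustration, assume a missing purchase amount means no purchase happened and a missing coupon code means no coupon applied. The table, its column names, and the fill values below are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical orders table where each gap has a known business meaning
orders = pd.DataFrame({
    "customer": ["a", "b", "c"],
    "purchase_amount": [19.99, np.nan, 5.00],
    "coupon_code": ["SPRING", None, None],
})

# A missing amount means "no transaction", so 0 is the true value
orders["purchase_amount"] = orders["purchase_amount"].fillna(0)

# A missing coupon means none applied, so use an explicit category
orders["coupon_code"] = orders["coupon_code"].fillna("Not applicable")

print(orders)
```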

## 🧠 Step 4: Validate the Result

After handling missing values:

In [None]:
df.isnull().sum()


Check distributions before and after imputation to ensure the data isn't distorted.
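
One simple way to run that comparison is to put summary statistics side by side. The sketch below uses a made-up `Age` series and median imputation as assumptions; on real data, a histogram of each version would complement the table.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps and one large outlier
before = pd.Series([22.0, 25.0, np.nan, 31.0, 90.0, np.nan], name="Age")

# Median imputation, as one example of a strategy to validate
after = before.fillna(before.median())

# describe() skips NaN, so "before" summarizes only the observed values
comparison = pd.DataFrame({
    "before": before.describe(),
    "after": after.describe(),
})

print(comparison)
```

In this toy case the median is unchanged but the standard deviation shrinks, which is the kind of distortion that constant-value imputation typically introduces.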