# 📘 Categorical Missing Value Handling in Machine Learning
This notebook explains different strategies to handle missing values in categorical features.
Each method includes:
- Concept explanation
- When to use
- Risks
- Fully commented code


## 1️⃣ Create Sample Dataset

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

# Create a sample dataset with categorical missing values
data = {
    'City': ['Pune', 'Mumbai', np.nan, 'Pune', 'Delhi', np.nan, 'Mumbai'],
    'Loan_Status': ['Approved', 'Rejected', np.nan, 'Approved', 'Rejected', np.nan, 'Approved']
}

# Convert dictionary to DataFrame
df = pd.DataFrame(data)

# Display dataset
df
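
Before choosing a strategy, it helps to measure how much is actually missing. The snippet below is a minimal sketch that rebuilds the same toy data under the hypothetical name `df_demo` (so it runs on its own) and reports the count and share of missing values per column.

```python
import numpy as np
import pandas as pd

# Same toy data as the cell above, rebuilt so this snippet is self-contained
df_demo = pd.DataFrame({
    'City': ['Pune', 'Mumbai', np.nan, 'Pune', 'Delhi', np.nan, 'Mumbai'],
    'Loan_Status': ['Approved', 'Rejected', np.nan, 'Approved', 'Rejected', np.nan, 'Approved'],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    'n_missing': df_demo.isna().sum(),
    'pct_missing': (df_demo.isna().mean() * 100).round(1),
})
print(missing_summary)

# Category frequencies, with NaN listed as its own row
print(df_demo['City'].value_counts(dropna=False))
```

The percentage column is what the "small" versus "high" missing share in the sections below refers to.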

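## 2️⃣ Most Frequent Imputation
### 🧠 Concept
Replace missing values with the most common category (the mode) of the column.

### 📌 When to Use
- Only a small share of the values is missing
- Missingness carries no business meaning (values are missing at random)

### ⚠ Risk
- Inflates the most frequent category and can bias the model if missingness is informative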


In [None]:
# Import SimpleImputer
from sklearn.impute import SimpleImputer

# Create imputer with most_frequent strategy
most_freq_imputer = SimpleImputer(strategy='most_frequent')

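# Fit on the entire column (for demo only; in real ML, fit on the training data only)
# Note: if several categories tie for most frequent, SimpleImputer uses the smallest one
df_most = df.copy()
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it to 1-D
df_most['City'] = most_freq_imputer.fit_transform(df[['City']]).ravel()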

# Display result
df_most
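
The demo above fits on the whole column. A minimal sketch of the train-only workflow follows, assuming a hypothetical ten-row `demo` frame split by position (first seven rows for training, last three for testing). The imputer learns the mode from the training rows and reuses it on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'City': ['Pune', 'Pune', 'Mumbai', np.nan, 'Pune',
             'Delhi', np.nan, 'Mumbai', 'Delhi', np.nan]
})

# Simple positional split: first 7 rows for training, last 3 for testing
train_df = demo.iloc[:7].copy()
test_df = demo.iloc[7:].copy()

imputer = SimpleImputer(strategy='most_frequent')

# Learn the most frequent category from the training rows only
train_df['City'] = imputer.fit_transform(train_df[['City']]).ravel()

# Reuse the learned category on the test rows (no refitting)
test_df['City'] = imputer.transform(test_df[['City']]).ravel()

# The category learned during fit is stored in statistics_
print(imputer.statistics_)
print(test_df)
```

Calling `transform` rather than `fit_transform` on the test rows is what keeps test information from leaking into the imputation.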

## 3️⃣ Constant Imputation (Create 'Missing' Category)
### 🧠 Concept
Fill missing values with a new, explicit category such as 'Missing'.

### 📌 When to Use
- Missingness has real meaning
- A large share of the values is missing

### 🔥 Advantage
- The model can learn missingness as a signal instead of having it hidden inside an existing category


In [None]:
# Create imputer with constant strategy
constant_imputer = SimpleImputer(strategy='constant', fill_value='Missing')

# Apply constant imputation (ravel() flattens the 2-D result to 1-D)
df_constant = df.copy()
df_constant['City'] = constant_imputer.fit_transform(df[['City']]).ravel()

# Display result
df_constant
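
To see how the 'Missing' category becomes a signal, the sketch below fills the gaps with pandas `fillna` and then one-hot encodes the column with `pd.get_dummies`. The variable names are illustrative. After encoding, 'Missing' has its own column that a model can weight like any other category.

```python
import numpy as np
import pandas as pd

# Same City values as the sample dataset
city = pd.Series(['Pune', 'Mumbai', np.nan, 'Pune', 'Delhi', np.nan, 'Mumbai'], name='City')

# Pandas equivalent of constant imputation
city_filled = city.fillna('Missing')

# One-hot encode: each category, including 'Missing', gets its own column
encoded = pd.get_dummies(city_filled, prefix='City')
print(encoded)
```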

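## 4️⃣ Random Imputation
### 🧠 Concept
Replace each missing value with a category drawn at random from the observed (non-missing) values, which roughly preserves the original category distribution.

### 📌 When to Use
- Rarely used in production
- Mostly for experimentation

### ⚠ Risk
- Results change on every run unless the random seed is fixed
- Hard to reproduce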


In [None]:
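# Function for random categorical imputation
def random_impute(series):
    # Collect the observed (non-missing) values once instead of once per row
    observed = series.dropna().to_numpy()
    # Replace each NaN with a random choice from the observed values
    return series.apply(lambda x: np.random.choice(observed) if pd.isnull(x) else x)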

# Apply random imputation
df_random = df.copy()
df_random['City'] = random_impute(df_random['City'])

# Display result
df_random
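
The reproducibility risk can be reduced by fixing the random seed. Below is a minimal sketch of a seeded variant. The name `random_impute_seeded` and its `seed` parameter are hypothetical, not part of any library. It draws all replacements in one call to NumPy's `Generator.choice`.

```python
import numpy as np
import pandas as pd

def random_impute_seeded(series, seed=42):
    """Fill NaN with random draws from the observed values, reproducibly."""
    rng = np.random.default_rng(seed)       # seeded generator: same seed, same draws
    observed = series.dropna().to_numpy()   # pool of non-missing values
    mask = series.isna()                    # True where a value is missing
    filled = series.copy()
    # Draw one replacement per missing entry in a single vectorized call
    filled.loc[mask] = rng.choice(observed, size=int(mask.sum()))
    return filled

city = pd.Series(['Pune', 'Mumbai', np.nan, 'Pune', 'Delhi', np.nan, 'Mumbai'], name='City')

first_run = random_impute_seeded(city, seed=42)
second_run = random_impute_seeded(city, seed=42)

# Same seed gives the same imputed values on every run
print(first_run.equals(second_run))
```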

## 5️⃣ Missing Indicator
### 🧠 Concept
Add a new binary column that records whether the original value was missing.

### 📌 When to Use
- Missingness itself carries useful information

### 🔥 Professional Approach
Use the indicator together with imputation, so the model sees both a filled-in value and the fact that it was filled in.


In [None]:
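# Imputer with missing indicator
indicator_imputer = SimpleImputer(strategy='most_frequent', add_indicator=True)

# Apply transformation: column 0 holds the imputed values, column 1 the missing flag
transformed = indicator_imputer.fit_transform(df[['City']])

# Convert to DataFrame
df_indicator = pd.DataFrame(transformed, columns=['City_imputed', 'City_missing_flag'])

# The flag column may come back as object dtype, so cast it to boolean
df_indicator['City_missing_flag'] = df_indicator['City_missing_flag'].astype(bool)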

# Display result
df_indicator
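
The same idea can be sketched in plain pandas, which makes the order of operations explicit: the flag has to be created before the gaps are filled, otherwise the information is lost. The frame name `df_flag` is illustrative.

```python
import numpy as np
import pandas as pd

df_flag = pd.DataFrame({
    'City': ['Pune', 'Mumbai', np.nan, 'Pune', 'Delhi', np.nan, 'Mumbai']
})

# Step 1: record missingness BEFORE imputing (1 = was missing, 0 = was present)
df_flag['City_missing_flag'] = df_flag['City'].isna().astype(int)

# Step 2: impute with the most frequent category
# (Pune and Mumbai tie in this toy data; mode()[0] takes the first one returned)
most_common = df_flag['City'].mode()[0]
df_flag['City'] = df_flag['City'].fillna(most_common)

print(df_flag)
```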

## 6️⃣ What If 40% of the Values Are Missing?
### Options:
1. Drop the column (if it is not important)
2. Use a 'Missing' category
3. Try model-based imputation (predict the missing category from the other features)
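
Option 3 can be sketched as follows. This is only an illustration on synthetic data: the `Income` feature, its link to `City`, and the choice of `RandomForestClassifier` are all assumptions made for the example, not a recommendation from this notebook. A classifier is trained on the rows where `City` is known and then predicts it for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200

# Synthetic data: a hypothetical numeric feature that depends on the city
city_true = rng.choice(['Pune', 'Mumbai', 'Delhi'], size=n)
typical_income = {'Pune': 40.0, 'Mumbai': 70.0, 'Delhi': 55.0}
income = np.array([typical_income[c] for c in city_true]) + rng.normal(0, 5, size=n)

# Hide roughly 40% of the City values
city = city_true.astype(object)
city[rng.random(n) < 0.4] = np.nan
df_model = pd.DataFrame({'Income': income, 'City': city})

# Train on rows where City is known (in real ML, use training rows only)
known = df_model['City'].notna()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(df_model.loc[known, ['Income']], df_model.loc[known, 'City'])

# Predict City for the rows where it was missing
df_model.loc[~known, 'City'] = clf.predict(df_model.loc[~known, ['Income']])

print(df_model['City'].value_counts())
```

This only helps when other features genuinely predict the missing column. Otherwise the simpler options are usually the safer choice.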

### Decision Rule:
- Check the business importance of the feature
- Check whether the feature, and its missingness, is related to the target
- Compare the candidate strategies with cross-validation
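
The last two checks can be sketched on synthetic data. Everything below is an assumption made for illustration: the data is generated so that about 40% of `City` is missing and missingness is related to the target, and the model (one-hot encoding plus logistic regression) is just one reasonable choice. The imputer sits inside the pipeline, so it is refit on the training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 500

# Synthetic feature with ~40% missing; the target depends on missingness
city = rng.choice(['Pune', 'Mumbai', 'Delhi'], size=n).astype(object)
missing = rng.random(n) < 0.4
y = (rng.random(n) < np.where(missing, 0.8, 0.3)).astype(int)
city[missing] = np.nan
X = pd.DataFrame({'City': city})

# Check 1: does the target rate differ between missing and present rows?
target_rate_by_missing = pd.Series(y).groupby(X['City'].isna()).mean()
print(target_rate_by_missing)

# Check 2: compare imputation strategies with 5-fold cross-validation
candidates = {
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'constant_missing': SimpleImputer(strategy='constant', fill_value='Missing'),
}
results = {}
for name, imputer in candidates.items():
    model = make_pipeline(
        imputer,
        OneHotEncoder(handle_unknown='ignore'),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = scores.mean()

print(results)
```

Whichever strategy scores better on held-out folds for the real data is the one to prefer. The synthetic numbers here carry no meaning beyond showing the workflow.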


## ✅ Conclusion
- Use Most Frequent when a small share of values is missing at random
- Use Constant ('Missing') if missingness has meaning
- Use a Missing Indicator when missingness is a signal
- Always fit the imputer on the training data only

🚀 This notebook can now serve as a quick reference for handling missing categorical values.
