# 05 - Scaling, Encoding, and Imputation

Before feeding data to a model, we must ensure numeric features are on comparable scales, categorical features are numerically represented, and missing values are handled. This notebook covers the essential preprocessing transformers in scikit-learn.

## Learning Objectives

By the end of this notebook, you will be able to:

- Explain why feature scaling matters for distance-based and gradient-based models
- Apply StandardScaler, MinMaxScaler, and RobustScaler and know when to use each
- Encode categorical variables with OneHotEncoder and OrdinalEncoder
- Impute missing values with SimpleImputer and understand KNNImputer
- Correctly use `fit_transform` on training data and `transform` on test data

## Prerequisites

- Python fundamentals (NumPy, Pandas)
- Train/test splitting concepts (Notebooks 01-03)
- Basic feature engineering concepts (Notebook 04)
- Familiarity with Matplotlib for plotting

## Table of Contents

1. [Why Scaling Matters](#1-why-scaling-matters)
2. [StandardScaler](#2-standardscaler)
3. [MinMaxScaler](#3-minmaxscaler)
4. [RobustScaler](#4-robustscaler)
5. [Comparing All Three Scalers](#5-comparing-all-three-scalers)
6. [OneHotEncoder](#6-onehotencoder)
7. [OrdinalEncoder](#7-ordinalencoder)
8. [Imputation: Handling Missing Values](#8-imputation-handling-missing-values)
9. [The Golden Rule: fit on Train, transform on Test](#9-the-golden-rule-fit-on-train-transform-on-test)
10. [Common Mistakes](#10-common-mistakes)
11. [Exercise](#11-exercise)

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete.")

---

## 1. Why Scaling Matters

**Features on different scales cause problems for many algorithms:**

- **Distance-based models** (KNN, SVM, K-Means): features with larger ranges dominate distance calculations
- **Gradient descent** (Linear/Logistic Regression, Neural Networks): unscaled features create elongated loss surfaces, slowing convergence
- **Regularization** (L1/L2 penalties): penalizes coefficients unequally if features have different scales

**Models that do NOT need scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting) - they split on individual feature thresholds, so rescaling a feature moves the threshold but not the resulting partition

In [None]:
# Demonstrate the scale problem
df_scale = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],           # range: 20 years
    'income': [30000, 50000, 70000, 90000, 120000],  # range: 90,000
    'rooms': [2, 3, 3, 4, 5]               # range: 3
})

print("Feature ranges before scaling:")
print(f"  age:    {df_scale['age'].min()} - {df_scale['age'].max()}    (range: {df_scale['age'].max() - df_scale['age'].min()})")
print(f"  income: {df_scale['income'].min()} - {df_scale['income'].max()} (range: {df_scale['income'].max() - df_scale['income'].min()})")
print(f"  rooms:  {df_scale['rooms'].min()} - {df_scale['rooms'].max()}      (range: {df_scale['rooms'].max() - df_scale['rooms'].min()})")
print()
print("Income's range is ~30,000x that of rooms and 4,500x that of age, so it dominates any distance calculation.")
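
The dominance can be made concrete by checking how much each feature contributes to the squared Euclidean distance between two rows. The sketch below reuses the toy values from `df_scale`; the helper `squared_diff_share` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy rows as df_scale above: columns are age, income, rooms
X_demo = np.array([
    [25, 30000, 2],
    [30, 50000, 3],
    [35, 70000, 3],
    [40, 90000, 4],
    [45, 120000, 5],
], dtype=float)

def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Compare the first and last rows, before and after standardization
raw_share = squared_diff_share(X_demo[0], X_demo[4])
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_share = squared_diff_share(X_demo_scaled[0], X_demo_scaled[4])

print("Share of squared distance (age, income, rooms)")
print("  raw:   ", raw_share.round(6))
print("  scaled:", scaled_share.round(3))
```

On the raw values, income accounts for essentially all of the distance; after standardization the three features contribute in roughly equal parts.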

---

## 2. StandardScaler

**StandardScaler** (Z-score normalization) transforms each feature to have **mean = 0** and **standard deviation = 1**.

$$z = \frac{x - \mu}{\sigma}$$

Where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation of the feature

**When to use:**
- Features are approximately normally distributed
- Default choice for most algorithms
- Strongly recommended for PCA, SVM, and regularized Logistic Regression

In [None]:
# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'age': np.random.normal(40, 12, 200),
    'income': np.random.normal(60000, 15000, 200),
    'score': np.random.normal(75, 10, 200)
})

print("Before StandardScaler:")
print(data.describe().round(2))
print()

In [None]:
# Apply StandardScaler
scaler_std = StandardScaler()
data_std = pd.DataFrame(
    scaler_std.fit_transform(data),
    columns=data.columns
)

print("After StandardScaler:")
print(data_std.describe().round(2))
print()
print("Mean ~ 0, Std ~ 1 for all features.")
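
As a sanity check, the z-score formula can be applied by hand and compared with `StandardScaler`. This is a minimal sketch on synthetic data (the array `x` is made up for the illustration). It also shows why `describe()`, which reports the sample standard deviation (`ddof=1`), can print a value slightly above 1 before rounding.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(40, 12, size=(200, 1))   # synthetic single feature

mu = x.mean(axis=0)
sigma = x.std(axis=0)                   # ddof=0 (population std), as StandardScaler uses
z_manual = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)
print("Manual and StandardScaler agree:", np.allclose(z_manual, z_sklearn))

# The sample std (ddof=1) of the scaled data is sqrt(n / (n - 1)), not exactly 1
print("Sample std of scaled data:", z_sklearn.std(ddof=1))
```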

---

## 3. MinMaxScaler

**MinMaxScaler** scales features to a given range, typically $[0, 1]$.

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

**When to use:**
- When you need bounded values (e.g., image pixel values)
- Neural networks often prefer inputs in $[0, 1]$
- Features do not have extreme outliers

In [None]:
# Apply MinMaxScaler
scaler_mm = MinMaxScaler()
data_mm = pd.DataFrame(
    scaler_mm.fit_transform(data),
    columns=data.columns
)

print("After MinMaxScaler:")
print(data_mm.describe().round(2))
print()
print("All features now in [0, 1] range.")
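
The min-max formula can likewise be checked by hand. The sketch below uses made-up arrays (`x_train`, `x_new`) and illustrates two details worth knowing: values outside the training range land outside $[0, 1]$ after `transform`, and the `feature_range` parameter maps the same formula onto another interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[10.0], [20.0], [30.0], [50.0]])
x_new = np.array([[5.0], [40.0], [60.0]])   # 5 and 60 lie outside the training range

mm = MinMaxScaler().fit(x_train)
x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)
manual = (x_train - x_min) / (x_max - x_min)
print("Manual == MinMaxScaler:", np.allclose(manual, mm.transform(x_train)))

# New values are scaled with the TRAINING min/max, so they can leave [0, 1]
x_new_scaled = mm.transform(x_new)
print("New data scaled:", x_new_scaled.ravel())   # -0.125, 0.75, 1.25

# feature_range stretches the result onto another interval
mm_sym = MinMaxScaler(feature_range=(-1, 1)).fit(x_train)
x_sym = mm_sym.transform(x_train)
print("Range (-1, 1):", x_sym.ravel())            # -1, -0.5, 0, 1
```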

---

## 4. RobustScaler

**RobustScaler** uses the **median** and **interquartile range (IQR)** instead of mean and standard deviation, making it robust to outliers.

$$x' = \frac{x - \text{median}}{IQR}$$

Where $IQR = Q3 - Q1$ (75th percentile minus 25th percentile).

**When to use:**
- Data contains significant outliers
- You want to preserve the relative spacing of the majority of data

In [None]:
# Apply RobustScaler
scaler_rob = RobustScaler()
data_rob = pd.DataFrame(
    scaler_rob.fit_transform(data),
    columns=data.columns
)

print("After RobustScaler:")
print(data_rob.describe().round(2))
print()
print("Median ~ 0, IQR ~ 1 for all features.")
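
To connect the formula with the fitted object, the median and IQR can be computed with NumPy and compared against the scaler's `center_` and `scale_` attributes. A minimal sketch on a made-up column with one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])  # 100 is an outlier

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

rob = RobustScaler().fit(x)
print("center_ (median):", rob.center_, " scale_ (IQR):", rob.scale_)
print("Manual == RobustScaler:", np.allclose(manual, rob.transform(x)))
```

The outlier still maps to a large value after scaling, but it does not change the center or the scale applied to the other points.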

---

## 5. Comparing All Three Scalers

Let us compare how each scaler handles a dataset with **outliers**.

In [None]:
# Create data WITH outliers
np.random.seed(42)
normal_data = np.random.normal(50, 10, 190)
outliers = np.array([200, 220, 250, 300, 180, -50, -80, -100, 280, 350])
data_with_outliers = np.concatenate([normal_data, outliers]).reshape(-1, 1)

print(f"Data range: [{data_with_outliers.min():.0f}, {data_with_outliers.max():.0f}]")
print(f"Mean: {data_with_outliers.mean():.1f}, Median: {np.median(data_with_outliers):.1f}")
print("Note: Outliers pull the mean away from the median.")

In [None]:
# Scale with all three
std_scaled = StandardScaler().fit_transform(data_with_outliers)
mm_scaled = MinMaxScaler().fit_transform(data_with_outliers)
rob_scaled = RobustScaler().fit_transform(data_with_outliers)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Original
axes[0, 0].hist(data_with_outliers, bins=30, color='gray', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Original Data (with outliers)', fontsize=12)
axes[0, 0].axvline(data_with_outliers.mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].axvline(np.median(data_with_outliers), color='blue', linestyle='--', label='Median')
axes[0, 0].legend()

# StandardScaler
axes[0, 1].hist(std_scaled, bins=30, color='salmon', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('StandardScaler', fontsize=12)
axes[0, 1].axvline(0, color='black', linestyle='-', alpha=0.5)

# MinMaxScaler
axes[1, 0].hist(mm_scaled, bins=30, color='steelblue', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('MinMaxScaler', fontsize=12)
axes[1, 0].axvline(0, color='black', linestyle='-', alpha=0.5)

# RobustScaler
axes[1, 1].hist(rob_scaled, bins=30, color='mediumseagreen', edgecolor='black', alpha=0.7)
axes[1, 1].set_title('RobustScaler', fontsize=12)
axes[1, 1].axvline(0, color='black', linestyle='-', alpha=0.5)

plt.suptitle('Scaler Comparison on Data with Outliers', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Summary statistics comparison
comparison = pd.DataFrame({
    'Scaler': ['StandardScaler', 'MinMaxScaler', 'RobustScaler'],
    'Min': [std_scaled.min(), mm_scaled.min(), rob_scaled.min()],
    'Max': [std_scaled.max(), mm_scaled.max(), rob_scaled.max()],
    'Mean': [std_scaled.mean(), mm_scaled.mean(), rob_scaled.mean()],
    'Std': [std_scaled.std(), mm_scaled.std(), rob_scaled.std()]
}).round(3)

print("Scaler comparison:")
print(comparison.to_string(index=False))
print()
print("Key takeaway: RobustScaler is least affected by outliers.")
print("MinMaxScaler squishes most data into a small range due to outliers.")

### Scaler Selection Guide

| Scaler | Formula | Best For | Sensitive to Outliers? |
|--------|---------|----------|------------------------|
| StandardScaler | $z = \frac{x - \mu}{\sigma}$ | General purpose, normally distributed data | Yes |
| MinMaxScaler | $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ | Bounded ranges, neural networks | Very sensitive |
| RobustScaler | $x' = \frac{x - \text{median}}{IQR}$ | Data with outliers | Largely no |

---

## 6. OneHotEncoder

**OneHotEncoder** converts each categorical value into a binary column (0 or 1). This is necessary because most ML models cannot handle string categories.

For a feature `color` with values `[red, blue, green]`:

| color | color_blue | color_green | color_red |
|-------|-----------|------------|----------|
| red   | 0 | 0 | 1 |
| blue  | 1 | 0 | 0 |
| green | 0 | 1 | 0 |

In [None]:
# Sample categorical data
df_cat = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green'],
    'size': ['S', 'M', 'L', 'XL', 'M', 'S']
})

print("Original categorical data:")
print(df_cat)
print()

In [None]:
# OneHotEncoder: default (keep all categories)
ohe = OneHotEncoder(sparse_output=False)  # dense output for readability (scikit-learn >= 1.2; older versions use sparse=False)
encoded = ohe.fit_transform(df_cat[['color']])

encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())
print("OneHotEncoded (all categories):")
print(encoded_df)
print()

In [None]:
# OneHotEncoder: drop='first' to avoid multicollinearity
# With 3 colors, we only need 2 columns (the third is implied)
ohe_drop = OneHotEncoder(sparse_output=False, drop='first')
encoded_drop = ohe_drop.fit_transform(df_cat[['color']])

encoded_drop_df = pd.DataFrame(encoded_drop, columns=ohe_drop.get_feature_names_out())
print("OneHotEncoded (drop='first'):")
print(encoded_drop_df)
print()
print("Note: 'blue' is the reference category (all zeros = blue).")

In [None]:
# Handle unknown categories at test time
ohe_safe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_safe.fit(df_cat[['color']])

# Test data has a new category 'yellow' not seen during training
test_colors = pd.DataFrame({'color': ['red', 'yellow', 'blue']})
encoded_test = ohe_safe.transform(test_colors)

print("Test data with unknown category 'yellow':")
print(pd.DataFrame(encoded_test, columns=ohe_safe.get_feature_names_out()))
print()
print("'yellow' gets all zeros (handle_unknown='ignore').")

### Sparse vs Dense

- `sparse_output=True` (default): memory efficient for high-cardinality features
- `sparse_output=False`: returns a regular NumPy array, easier to inspect

Use sparse when you have many categories (e.g., zip codes, product IDs).

---

## 7. OrdinalEncoder

**OrdinalEncoder** maps categories to integers. Use this when categories have a **natural order** (ordinal data).

Examples of ordinal features:
- Education level: high_school < bachelors < masters < phd
- T-shirt size: S < M < L < XL
- Satisfaction: low < medium < high

In [None]:
# Ordinal encoding with explicit order
df_edu = pd.DataFrame({
    'education': ['bachelors', 'high_school', 'phd', 'masters', 'bachelors', 'high_school']
})

# Define the order explicitly
ord_enc = OrdinalEncoder(categories=[['high_school', 'bachelors', 'masters', 'phd']])
df_edu['education_encoded'] = ord_enc.fit_transform(df_edu[['education']])

print("OrdinalEncoder result:")
print(df_edu)
print()
print("Mapping: high_school=0, bachelors=1, masters=2, phd=3")
print("The numeric order reflects the educational hierarchy.")

In [None]:
# T-shirt sizes: ordinal encoding
df_size = pd.DataFrame({'size': ['M', 'S', 'XL', 'L', 'S', 'M', 'XL']})

size_enc = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
df_size['size_encoded'] = size_enc.fit_transform(df_size[['size']])

print("Size encoding:")
print(df_size)
print()
print("WARNING: Do NOT use OrdinalEncoder for nominal categories (e.g., color).")
print("Encoding red=0, blue=1, green=2 falsely implies blue > red and green > blue.")
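
Like `OneHotEncoder`, `OrdinalEncoder` raises an error by default when `transform` meets a category it did not see in `fit`. One option is `handle_unknown='use_encoded_value'` together with `unknown_value`. The sketch below assumes `-1` as the sentinel, which is an arbitrary choice for this illustration; the frames `train_sizes` and `test_sizes` are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})
test_sizes = pd.DataFrame({'size': ['M', 'XXL', 'S']})   # 'XXL' was never seen in fit

enc = OrdinalEncoder(
    categories=[['S', 'M', 'L', 'XL']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,          # sentinel chosen for this sketch
)
enc.fit(train_sizes)

encoded = enc.transform(test_sizes).ravel()
print(encoded)   # M -> 1, XXL -> -1 (unknown), S -> 0
```

Whether a sentinel is appropriate depends on the downstream model, since `-1` sits below `S` on the numeric scale.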

---

## 8. Imputation: Handling Missing Values

Real-world data almost always has missing values. **Imputation** fills them in with reasonable estimates.

**Strategies:**
- **Mean**: replace NaN with column mean (numeric, sensitive to outliers)
- **Median**: replace NaN with column median (numeric, robust to outliers)
- **Most frequent (mode)**: replace NaN with most common value (works for categorical)
- **KNNImputer**: use K-nearest neighbors to impute based on similar rows

In [None]:
# Create data with missing values
np.random.seed(42)
df_missing = pd.DataFrame({
    'age': [25, np.nan, 35, 40, np.nan, 55, 30, np.nan, 45, 50],
    'income': [30000, 45000, np.nan, 80000, 55000, np.nan, 42000, 65000, np.nan, 90000],
    'city': ['NYC', 'LA', 'NYC', np.nan, 'Chicago', 'LA', np.nan, 'NYC', 'Chicago', 'LA']
})

print("Data with missing values:")
print(df_missing)
print()
print("Missing value counts:")
print(df_missing.isnull().sum())

In [None]:
# SimpleImputer: mean strategy (for numeric columns)
imputer_mean = SimpleImputer(strategy='mean')
numeric_cols = ['age', 'income']

df_imputed = df_missing.copy()
df_imputed[numeric_cols] = imputer_mean.fit_transform(df_missing[numeric_cols])

print("After mean imputation:")
print(df_imputed[numeric_cols])
print(f"\nage mean used: {df_missing['age'].mean():.1f}")
print(f"income mean used: {df_missing['income'].mean():.0f}")

In [None]:
# SimpleImputer: median strategy (more robust to outliers)
imputer_median = SimpleImputer(strategy='median')
df_imputed_median = df_missing.copy()
df_imputed_median[numeric_cols] = imputer_median.fit_transform(df_missing[numeric_cols])

print("After median imputation:")
print(df_imputed_median[numeric_cols])
print(f"\nage median used: {df_missing['age'].median():.1f}")
print(f"income median used: {df_missing['income'].median():.0f}")

In [None]:
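# SimpleImputer: most_frequent strategy (for categorical columns)
# NYC and LA tie at 3 occurrences each; SimpleImputer breaks ties by taking the smallest value ('LA')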
imputer_mode = SimpleImputer(strategy='most_frequent')
df_imputed_cat = df_missing.copy()
df_imputed_cat[['city']] = imputer_mode.fit_transform(df_missing[['city']])

print("After most_frequent imputation for 'city':")
print(df_imputed_cat[['city']])
print(f"\nMost frequent city: {df_missing['city'].mode()[0]}")

In [None]:
# KNNImputer: fills each gap using the k most similar rows
knn_imputer = KNNImputer(n_neighbors=3)
df_knn = df_missing.copy()
df_knn[numeric_cols] = knn_imputer.fit_transform(df_missing[numeric_cols])

print("After KNNImputer (n_neighbors=3):")
print(df_knn[numeric_cols])
print()
print("Note: KNNImputer considers the values of neighboring rows,")
print("so imputed values may differ from simple mean/median.")

### Imputation Strategy Summary

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| Mean | Numeric, no outliers | Simple, fast | Distorted by outliers |
| Median | Numeric, with outliers | Robust to outliers | Ignores feature relationships |
| Most Frequent | Categorical | Works for any dtype | Overrepresents mode |
| KNN | Numeric | Uses feature relationships | Slower, needs scaling |
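
The "needs scaling" entry for KNN can be handled by scaling before imputing. The sketch below is one possible arrangement, not the only one: it relies on `StandardScaler` ignoring NaN during `fit` and leaving it in place during `transform`. The array `X_missing` and its values are invented for the illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy rows: columns are age, income, rooms; np.nan marks a missing entry
X_missing = np.array([
    [25.0, 30000.0, 2.0],
    [np.nan, 45000.0, 3.0],
    [35.0, np.nan, 3.0],
    [40.0, 80000.0, 4.0],
    [55.0, 90000.0, 5.0],
    [30.0, 42000.0, 2.0],
])

# Scale first so no single feature dominates the neighbor distances
scale_then_impute = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_filled_scaled = scale_then_impute.fit_transform(X_missing)

# Map the result back to the original units for inspection
fitted_scaler = scale_then_impute.named_steps['standardscaler']
X_filled = fitted_scaler.inverse_transform(X_filled_scaled)
print(X_filled.round(1))
```

Observed values pass through unchanged; only the NaN cells are replaced.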

---

## 9. The Golden Rule: fit on Train, transform on Test

**This is the single most important rule in preprocessing:**

1. `fit_transform(X_train)` - learn parameters (mean, std, categories) from training data AND transform it
2. `transform(X_test)` - apply the SAME learned parameters to test data

**NEVER call `fit` or `fit_transform` on test data!** Doing so causes **data leakage** - the model indirectly sees test set statistics.

In [None]:
# Demonstrate the correct approach
np.random.seed(42)
X = np.random.normal(50, 15, (100, 2))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")
print(f"\nTrain stats: mean={X_train.mean(axis=0).round(2)}, std={X_train.std(axis=0).round(2)}")
print(f"Test stats:  mean={X_test.mean(axis=0).round(2)}, std={X_test.std(axis=0).round(2)}")

In [None]:
# CORRECT: fit on train, transform on test
scaler = StandardScaler()

# Step 1: fit_transform on training data
X_train_scaled = scaler.fit_transform(X_train)

# Step 2: transform (NOT fit_transform!) on test data
X_test_scaled = scaler.transform(X_test)

print("CORRECT approach:")
print(f"Scaler learned from train: mean={scaler.mean_.round(2)}, std={scaler.scale_.round(2)}")
print(f"\nTrain scaled: mean={X_train_scaled.mean(axis=0).round(4)}, std={X_train_scaled.std(axis=0).round(4)}")
print(f"Test scaled:  mean={X_test_scaled.mean(axis=0).round(4)}, std={X_test_scaled.std(axis=0).round(4)}")
print()
print("Note: Test mean/std are NOT exactly 0/1 because the scaler used TRAIN statistics.")
print("This is correct behavior!")

In [None]:
# WRONG: fitting scaler on test data separately
scaler_wrong = StandardScaler()
X_train_wrong = scaler_wrong.fit_transform(X_train)

scaler_wrong2 = StandardScaler()
X_test_wrong = scaler_wrong2.fit_transform(X_test)  # BAD! Fitting on test data

print("WRONG approach (fit on test):")
print(f"Train scaler mean: {scaler_wrong.mean_.round(2)}")
print(f"Test scaler mean:  {scaler_wrong2.mean_.round(2)}  <-- different!")
print()
print("PROBLEM: Train and test use DIFFERENT scaling parameters.")
print("This uses test-set statistics (leakage) and maps the same raw value to different scaled values, so predictions are inconsistent.")

In [None]:
# Same rule applies to imputers and encoders
print("The fit/transform rule applies to ALL preprocessors:")
print()
print("  scaler.fit_transform(X_train)   -> scaler.transform(X_test)")
print("  imputer.fit_transform(X_train)  -> imputer.transform(X_test)")
print("  encoder.fit_transform(X_train)  -> encoder.transform(X_test)")
print()
print("NEVER call fit() or fit_transform() on test data!")

---

## 10. Common Mistakes

### Mistake 1: Fitting the Scaler on the Entire Dataset

If you scale before splitting, the scaler learns statistics from **all** data, including the test set. This is subtle but real data leakage.

In [None]:
# WRONG: Scale first, then split
np.random.seed(42)
X_all = np.random.normal(50, 15, (100, 2))

# BAD: scaler sees all data including future test data
scaler_leak = StandardScaler()
X_all_scaled = scaler_leak.fit_transform(X_all)  # leakage!
X_tr, X_te = train_test_split(X_all_scaled, test_size=0.2, random_state=42)

print("WRONG (scale then split):")
print(f"  Scaler saw all {len(X_all)} samples (including test data).")
print()

# CORRECT: Split first, then scale
X_tr2, X_te2 = train_test_split(X_all, test_size=0.2, random_state=42)
scaler_correct = StandardScaler()
X_tr2_scaled = scaler_correct.fit_transform(X_tr2)
X_te2_scaled = scaler_correct.transform(X_te2)

print("CORRECT (split then scale):")
print(f"  Scaler saw only {len(X_tr2)} training samples.")

### Mistake 2: Using pd.get_dummies in Production

`pd.get_dummies` is convenient for exploration but dangerous in production:
- It creates columns based on the categories **present in that specific data**
- If a category is missing from test data, the column is missing
- If a new category appears, an unexpected column is added

**Prefer `OneHotEncoder`** - it remembers the categories from `fit`, and with `handle_unknown='ignore'` it encodes unseen categories as all zeros instead of raising an error.

In [None]:
# Demonstrate the get_dummies problem
train_cat = pd.DataFrame({'color': ['red', 'blue', 'green', 'red']})
test_cat = pd.DataFrame({'color': ['blue', 'red']})  # no 'green' in test

# pd.get_dummies: different columns!
train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

print("pd.get_dummies (PROBLEMATIC):")
print(f"  Train columns: {list(train_dummies.columns)}")
print(f"  Test columns:  {list(test_dummies.columns)}")
print("  Column mismatch: 'color_green' missing from test!")
print()

# OneHotEncoder: consistent columns
ohe_prod = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
train_ohe = ohe_prod.fit_transform(train_cat)
test_ohe = ohe_prod.transform(test_cat)

cols = ohe_prod.get_feature_names_out()
print("OneHotEncoder (CORRECT):")
print(f"  Train columns: {list(cols)} -> shape {train_ohe.shape}")
print(f"  Test columns:  {list(cols)} -> shape {test_ohe.shape}")
print("  Columns always match!")

### Summary of Common Mistakes

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| `fit_transform` on test data | Data leakage, optimistic metrics | Use `transform` only on test |
| Scale before splitting | Leakage through test statistics | Split first, then scale |
| `pd.get_dummies` in production | Column mismatch at inference | Use `OneHotEncoder` |
| Ignoring missing values | Model crashes or silent errors | Use `SimpleImputer`/`KNNImputer` |
| Ordinal encoding nominal data | False ordering imposed | Use `OneHotEncoder` for nominal |

---

## 11. Exercise

**Task:** You are given a dataset with numeric features, a categorical feature, and missing values. Perform all preprocessing steps correctly.

Steps:
1. Split the data into train (80%) and test (20%) with `random_state=42`
2. Impute missing numeric values using median strategy (fit on train, transform on test)
3. Scale numeric features with `StandardScaler` (fit on train, transform on test)
4. Encode the categorical feature with `OneHotEncoder` (fit on train, transform on test)
5. Verify that train and test have the same number of columns after preprocessing

In [None]:
# Exercise starter code
np.random.seed(42)
n = 200

exercise_data = pd.DataFrame({
    'age': np.where(np.random.random(n) < 0.1, np.nan, np.random.normal(40, 12, n)),
    'income': np.where(np.random.random(n) < 0.15, np.nan, np.random.normal(55000, 15000, n)),
    'score': np.random.normal(70, 10, n),
    'department': np.random.choice(['engineering', 'marketing', 'sales', 'hr'], n)
})
exercise_y = (exercise_data['score'] > 70).astype(int)

print("Exercise dataset:")
print(exercise_data.head(10))
print(f"\nShape: {exercise_data.shape}")
print(f"Missing values:\n{exercise_data.isnull().sum()}")

# YOUR CODE HERE
# Step 1: Split
# X_train, X_test, y_train, y_test = train_test_split(...)

# Step 2: Impute numeric columns (fit on train, transform on test)
# numeric_cols = ['age', 'income', 'score']  # redefine: the earlier numeric_cols lacks 'score'
# imputer = SimpleImputer(strategy='median')
# X_train[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
# X_test[numeric_cols] = imputer.transform(X_test[numeric_cols])

# Step 3: Scale numeric columns (fit on train, transform on test)
# scaler = StandardScaler()
# X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

# Step 4: Encode categorical column (fit on train, transform on test)
# ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# ...

# Step 5: Verify shapes match
# print(f"Train shape: {X_train_final.shape}")
# print(f"Test shape:  {X_test_final.shape}")