# Intro to ML: Preprocessing Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn import datasets

np.random.seed(42)

## When and Why Preprocess?
Preprocessing means **cleaning and transforming raw data** so models can use it.

- **Cleaning/standardization**: the raw data has mixed units or missing values that must be fixed first.
- **Better model performance** (sometimes): features become comparable, so no single one dominates.
- **Faster training** (sometimes): gradient-based optimizers converge more quickly on well-scaled inputs.

**Analogy:** Convert everyone’s height to the **same unit** before comparing.
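
The height analogy can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions: the table, the column names `height` and `unit`, and the choice of median imputation are all made up for this example and are not part of the lecture's datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: heights recorded in mixed units, one value missing
people = pd.DataFrame({
    "height": [170.0, 1.82, np.nan, 65.0],
    "unit": ["cm", "m", "cm", "in"],
})

# Conversion factors to centimeters
to_cm = {"cm": 1.0, "m": 100.0, "in": 2.54}

# Fix units: convert every height to centimeters
people["height_cm"] = people["height"] * people["unit"].map(to_cm)

# Handle the missing value: fill it with the median of the observed heights
median_cm = people["height_cm"].median()
people["height_cm"] = people["height_cm"].fillna(median_cm)

print(people)
```

In a real train/test workflow the fill value would be computed from the training rows only; here there is no split, so the whole toy table is used.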

### Quick Think–Pair–Share
1. Your dataset has **age (0–100)** and **income (0–200,000)**. Which feature might dominate a distance-based model, and why?
2. Could we feed the strings `"cat"`, `"dog"`, `"fish"` directly into a regression model? What might go wrong?

### A:
YOUR ANSWER HERE

## Scaling
Some algorithms rely on **distances** or **gradients**, so they are sensitive to feature magnitudes.
- When features have very **different scales** (e.g., one column in centimeters and another in meters), scale them.
- Distance-based models (KNN) and gradient-trained models (logistic regression) often **perform better or converge faster** on scaled data.

| Method | What it does | Typical output | Used in this lecture with |
|--|--|--|--|
| **StandardScaler** | $z = (x - \mu)/\sigma$ | mean≈0, std≈1 | KNN, Logistic Regression |
| **MinMaxScaler** | $x' = (x - x_{\min})/(x_{\max} - x_{\min})$ | [0,1] (default) | KNN, Logistic Regression |

*We’ll demonstrate scaling with KNN and Logistic Regression only (the models used in this notebook).*
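
To make the two formulas in the table concrete, here is a small worked sketch on a single toy column (the values are chosen only for illustration). It computes each transform by hand with NumPy and checks that scikit-learn's scalers give the same result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# One toy feature (a single column), values chosen for illustration
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# StandardScaler: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# MinMaxScaler: (x - min) / (max - min), which maps to [0, 1] by default
m_manual = (x - x.min()) / (x.max() - x.min())
m_sklearn = MinMaxScaler().fit_transform(x)

print("StandardScaler:", z_sklearn.ravel())
print("MinMaxScaler:  ", m_sklearn.ravel())
print("Match by hand: ", np.allclose(z_manual, z_sklearn), np.allclose(m_manual, m_sklearn))
```

Both scalers keep the ordering and relative spacing of the values; they only change where the column is centered and how wide it is.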

### Q: Why might KNN benefit from scaling more than a model that doesn’t use distances directly?

### A:
YOUR ANSWER HERE

In [None]:
# Demo: KNN accuracy WITHOUT vs WITH scaling
X, y = datasets.make_classification(
    n_samples=600, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
# Make one feature huge so it dominates Euclidean distance
X = X.copy()
X[:, 0] = X[:, 0] * 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# KNN without scaling
knn_plain = KNeighborsClassifier(n_neighbors=5)
knn_plain.fit(X_train, y_train)
acc_plain = knn_plain.score(X_test, y_test)
print(f"Accuracy WITHOUT scaling: {acc_plain:.3f}")

# KNN with StandardScaler
knn_scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
knn_scaled.fit(X_train, y_train)
acc_scaled = knn_scaled.score(X_test, y_test)
print(f"Accuracy WITH    scaling: {acc_scaled:.3f}")

In [None]:
# Visualize big-scale feature BEFORE vs AFTER StandardScaler
plt.figure()
plt.hist(X[:,0], bins=50)
plt.title("Feature 0 BEFORE scaling")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
plt.figure()
plt.hist(X_scaled[:,0], bins=50)
plt.title("Feature 0 AFTER StandardScaler")
plt.xlabel("Value (z-score)")
plt.ylabel("Count")
plt.show()

### Q: What changed after scaling, and how might that affect KNN’s distance calculations?

### A:
YOUR ANSWER HERE

## Encoding
Convert categorical variables into numeric values so models can use them.

- **OneHotEncoder**: for categories with **no** natural order (nominal). Example: colors (red/green/blue).
- **OrdinalEncoder**: for categories **with** natural order (ordinal). Example: spiciness (low < medium < high).

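⚠️ If you encode a nominal feature as `cat=1, dog=2, fish=3`, a model may read the codes as a real ordering (`fish > dog > cat`) and as real distances (fish is twice as far from cat as dog is), and learn from a pattern that does not exist.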

### Classroom Check
Is `shirt_size` (S, M, L) nominal or ordinal? Should a model treat `zip_code` as a number or as a category?

### A:
YOUR ANSWER HERE

In [None]:
# Toy dataset: snacks with a label 'yummy_label'
raw = pd.DataFrame({
    "snack": ["apple", "banana", "chips", "carrot", "chips", "apple"],
    "spiciness": ["low", "medium", "high", "low", "medium", "high"],
    "price_dollars": [1.0, 1.2, 2.5, 0.9, 2.7, 1.1],
    "yummy_label": [1, 1, 0, 1, 0, 1]
})
raw

In [None]:
# One-hot for 'snack' (nominal), Ordinal for 'spiciness' (ordered)
X = raw.drop(columns=["yummy_label"]) 
y = raw["yummy_label"]

nominal_cols = ["snack"]
ordinal_cols = ["spiciness"]
ordinal_order = [["low", "medium", "high"]]
num_cols = ["price_dollars"]

preprocess = ColumnTransformer([
    ("ohe", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ("ord", OrdinalEncoder(categories=ordinal_order), ordinal_cols),
    ("num", "passthrough", num_cols)
])

clf = Pipeline([
    ("prep", preprocess),
    ("logit", LogisticRegression(max_iter=500))
])

clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))

ohe = clf.named_steps["prep"].named_transformers_["ohe"]
encoded_names = list(ohe.get_feature_names_out(nominal_cols)) + ordinal_cols + num_cols
print("Encoded feature names:", encoded_names)

### Try It (2–3 min)
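1. After fitting, call `clf.predict` on a new row whose snack (e.g., `"popcorn"`) never appeared in the table. What does `handle_unknown="ignore"` do? (Simply adding the row to the table and refitting makes `"popcorn"` a known category.)
2. Reverse the `ordinal_order` to `[["high", "medium", "low"]]`. How might that change results?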
3. Swap `LogisticRegression` for `KNeighborsClassifier`. Does performance change?

## Principal Component Analysis
PCA **distills high-dimensional data** into fewer dimensions (principal components) that capture the most variance.

- Useful for **visualization** (2D/3D)
- Can speed up models or reduce noise
- Components are combinations of original features and may be **harder to interpret**

**Scale** features before PCA when they are measured in different units; otherwise the features with the largest numeric variance dominate the components.
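
The effect of skipping the scaling step can be sketched with two synthetic, independent features. The feature names and scales below are assumptions made up for this illustration, not data used elsewhere in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two independent synthetic features in very different units (illustrative only)
height_m = rng.normal(1.7, 0.1, size=n)        # meters, std about 0.1
income = rng.normal(50_000, 10_000, size=n)    # dollars, std about 10,000
X_demo = np.column_stack([height_m, income])

# PCA on raw data: the large-unit feature absorbs almost all the variance
pca_raw = PCA(n_components=2).fit(X_demo)
ratio_raw = pca_raw.explained_variance_ratio_

# PCA after standardizing: both features contribute on equal footing
X_demo_std = StandardScaler().fit_transform(X_demo)
pca_std = PCA(n_components=2).fit(X_demo_std)
ratio_std = pca_std.explained_variance_ratio_

print("Explained variance ratio (raw):   ", np.round(ratio_raw, 4))
print("Explained variance ratio (scaled):", np.round(ratio_std, 4))
print("PC1 weights on [height, income] (raw):", np.round(pca_raw.components_[0], 4))
```

On the raw data, PC1 is essentially the income column, which says more about the choice of units than about the structure of the data. After scaling, the two components split the variance roughly evenly, as expected for unrelated features.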

### Q: When might losing detail via PCA be okay? When might it be risky?

### A:
YOUR ANSWER HERE

In [None]:
# PCA on the Iris dataset (3 classes, 4 features; bundled with scikit-learn)
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
names = iris.target_names

pipe_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2))
])
X_2d = pipe_pca.fit_transform(X_iris)
print("Explained variance ratio:", pipe_pca.named_steps["pca"].explained_variance_ratio_)

plt.figure(figsize=(6,4))
for label in np.unique(y_iris):
    mask = y_iris == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=names[label])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Iris in 2D via PCA")
plt.legend()
plt.show()

In [None]:
# Compare logistic regression accuracy with and without a PCA step
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.25, random_state=42)

base = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(max_iter=1000))
])
base.fit(X_train, y_train)
acc_base = base.score(X_test, y_test)

with_pca = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("logit", LogisticRegression(max_iter=1000))
])
with_pca.fit(X_train, y_train)
acc_pca = with_pca.score(X_test, y_test)

print(f"Accuracy without PCA: {acc_base:.3f}")
print(f"Accuracy with    PCA: {acc_pca:.3f}")
print("Note: PCA helps visualization; predictive accuracy may or may not improve.")

## Common Gotchas
- **Data leakage**: Fit scalers/encoders **only on training data**, then apply to validation/test.
- **Wrong encoder**: Don’t use ordinal encoding for unordered categories.
- **Unscaled PCA**: Scale before PCA.
- **High-cardinality one-hot**: Many unique categories can create lots of columns.
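
The data-leakage gotcha can be sketched as follows, assuming a small synthetic feature matrix generated only for this illustration. The scaler's statistics are learned from the training rows and then reused, unchanged, on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=10.0, scale=3.0, size=(200, 2))  # synthetic features
X_tr, X_te = train_test_split(X_all, test_size=0.25, random_state=42)

# Leaky: statistics are computed from ALL rows, including the test rows
leaky = StandardScaler().fit(X_all)

# Correct: statistics come from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training mean/std

print("Mean learned from all rows:  ", leaky.mean_)
print("Mean learned from train only:", scaler.mean_)
print("Test mean after scaling (close to, but not exactly, 0):", X_te_scaled.mean(axis=0))
```

Wrapping the scaler and the model in a `Pipeline`, as the earlier cells do, gives the same behavior automatically: `fit` only sees the training data, and `score` or `predict` reuses the fitted statistics.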

## Mini-Project (10–12 min)
Work in pairs **using only the data and tools in this notebook**:
1. Extend the **snack** table with 4–6 new rows.
2. Train a **KNN** classifier to predict `yummy_label`.
3. Compare accuracy **with** and **without** scaling numeric columns.
4. Try both **OneHot** and **Ordinal** encoding for `spiciness`. Which makes more sense? Why?
5. Share one insight with the class.

## Exit Ticket
1. Why do KNN and logistic regression often benefit from scaling?
2. When is `OrdinalEncoder` preferable to `OneHotEncoder`?
3. Name one downside of PCA.
4. What is **data leakage**, and how do we avoid it when scaling/encoding?

### A:
YOUR ANSWER HERE