
# scikit-learn Pipelines (Quick Tour)

A **pipeline** chains together **preprocessing steps** (transformers) and a final **estimator** (model).

Why use pipelines?
- **Prevents data leakage**: preprocessing is fit only on training folds during CV.
- **Reproducible & concise**: one object handles `fit`, `predict`, and `score` end-to-end.
- **Works seamlessly with CV / GridSearchCV**: the whole pipeline can be passed to `cross_val_score` or `GridSearchCV` like a single estimator.

In scikit-learn, a pipeline looks like:

- Step 1..k: **Transformers** that implement `fit` and `transform` (e.g., imputation, scaling, one-hot encoding)
- Final step: an **Estimator** that implements `fit` (and typically `predict`, `score`)

We'll walk through a few working examples using **KNN** and **Linear Regression**, with **Ridge** added in the final model comparison.
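
To make the transformer/estimator split concrete, here is a minimal sketch of what a two-step pipeline does under the hood, written out by hand next to the pipeline version. The dataset sizes and the `_demo` variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (sizes chosen only for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_fit, X_new = X_demo[:150], X_demo[150:]
y_fit = y_demo[:150]

# Pipeline version: one object, one call to fit
pipe_demo = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])
pipe_demo.fit(X_fit, y_fit)
pred_pipe = pipe_demo.predict(X_new)

# Manual version of the same two steps
scaler = StandardScaler()
X_fit_scaled = scaler.fit_transform(X_fit)             # transformer: fit + transform on training data
model = LinearRegression().fit(X_fit_scaled, y_fit)    # final estimator: fit only
pred_manual = model.predict(scaler.transform(X_new))   # new data: transform only, no refitting

print("Same predictions:", np.allclose(pred_pipe, pred_manual))
```

The manual version is easy to get wrong (for example, by calling `fit_transform` on new data), which is the main reason to let the pipeline handle the bookkeeping.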


In [None]:
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Transformers
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

# Datasets / synthetic data
from sklearn.datasets import load_diabetes, make_regression

np.random.seed(42)


## Example 1 — StandardScaler → KNN Regression

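KNN is distance-based: features with large numeric ranges dominate the neighbor search, so **feature scaling matters a lot**.
We'll use a synthetic regression dataset and compare:
- KNN without scaling
- KNN with scaling (in a pipeline)

Note that `make_regression` draws all features on a similar scale by default, so the two scores below may come out close; the gap widens when columns are measured in very different units.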


In [None]:
X, y = make_regression(
    n_samples=1200, n_features=20, n_informative=10, noise=25.0, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(X.shape)

In [None]:
knn = KNeighborsRegressor(n_neighbors=10)

pipe_knn_scaled = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])

knn.fit(X_train, y_train)
pipe_knn_scaled.fit(X_train, y_train)

pred = knn.predict(X_test)
pred_scaled = pipe_knn_scaled.predict(X_test)

print("KNN (no scaling)  R^2:", r2_score(y_test, pred))
print("KNN (with scaling) R^2:", r2_score(y_test, pred_scaled))
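
Because `make_regression` generates features on a common scale, the comparison above may understate the effect of scaling. The sketch below is one way to make the effect visible: it appends a hypothetical uninformative column with a much larger spread (standard deviation 1000, an arbitrary choice) and repeats the comparison. All names ending in `_s` are illustrative, not part of the example above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng_s = np.random.default_rng(0)
X_s, y_s = make_regression(
    n_samples=600, n_features=5, n_informative=5, noise=10.0, random_state=0
)

# Hypothetical extra column: pure noise, but on a much larger scale than the real features
noise_col = rng_s.normal(0, 1000, size=(X_s.shape[0], 1))
X_s = np.hstack([X_s, noise_col])

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.25, random_state=0
)

# Unscaled KNN: distances are dominated by the large-scale noise column
knn_raw = KNeighborsRegressor(n_neighbors=10).fit(Xs_train, ys_train)

# Scaled KNN: every column contributes on an equal footing
knn_scaled = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
]).fit(Xs_train, ys_train)

r2_raw = r2_score(ys_test, knn_raw.predict(Xs_test))
r2_scaled = r2_score(ys_test, knn_scaled.predict(Xs_test))

print("KNN (no scaling)   R^2:", round(r2_raw, 3))
print("KNN (with scaling) R^2:", round(r2_scaled, 3))
```

Scaling does not remove the noise column, but it stops that column from deciding on its own which points count as neighbors.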


### Cross-validation (why pipelines help)

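When you cross-validate, you want each fold to learn preprocessing **only from its training split**.
Pipelines make that automatic: `cross_val_score` refits a fresh copy of the whole pipeline, scaler included, on each training split before scoring the held-out split.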


In [None]:
scores = cross_val_score(pipe_knn_scaled, X, y, cv=5, scoring="r2")
print("CV R^2 scores:", np.round(scores, 3))
print("Mean CV R^2:", scores.mean())
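
To show what happens inside each fold, the following sketch spells out the loop by hand with `KFold` and `clone`, then checks it against `cross_val_score`. It is an illustration of the idea on a small synthetic dataset (the sizes and `_cv` names are assumptions), not a replacement for the built-in helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

pipe_cv = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=10)),
])

kf = KFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in kf.split(X_cv):
    fold_pipe = clone(pipe_cv)                       # fresh, unfitted copy for this fold
    fold_pipe.fit(X_cv[train_idx], y_cv[train_idx])  # scaler statistics come from the training split only
    manual_scores.append(fold_pipe.score(X_cv[val_idx], y_cv[val_idx]))  # R^2 on the held-out split

auto_scores = cross_val_score(pipe_cv, X_cv, y_cv, cv=kf, scoring="r2")

print("Manual:", np.round(manual_scores, 3))
print("Auto:  ", np.round(auto_scores, 3))
```

The leaky alternative would be to scale all of `X_cv` once before splitting, which lets the held-out rows influence the scaler's mean and variance.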


## Example 2 — Imputation → Scaling → Linear Regression

We'll use the **diabetes** regression dataset (built into scikit-learn) and *artificially introduce missing values*.
Then we build a pipeline that:
1. imputes missing values (mean)
2. scales features
3. fits linear regression


In [None]:
diabetes = load_diabetes()
X = diabetes.data.copy()
y = diabetes.target.copy()

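# Introduce ~7% missing values at random, using a local generator so the
# mask is the same no matter which cells were run before this one
rng = np.random.default_rng(42)
missing_mask = rng.random(X.shape) < 0.07
X[missing_mask] = np.nan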

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

pipe_linreg = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])

pipe_linreg.fit(X_train, y_train)
pred = pipe_linreg.predict(X_test)

print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("Test R^2:", r2_score(y_test, pred))
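
A fitted pipeline keeps each fitted step, so you can inspect what was learned. The sketch below rebuilds a pipeline like the one above on its own copy of the diabetes data (the `_d` names and the seed are illustrative assumptions) and reads the imputer's learned column means through `named_steps`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data_d = load_diabetes()
X_d = data_d.data.copy()
y_d = data_d.target

# Knock out roughly 7% of the entries, as in the example above
rng_d = np.random.default_rng(0)
X_d[rng_d.random(X_d.shape) < 0.07] = np.nan

pipe_d = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
])
pipe_d.fit(X_d, y_d)

# Each step is reachable by the name given in `steps`
imputer = pipe_d.named_steps["imputer"]
linreg = pipe_d.named_steps["linreg"]

# The imputer's fill values are the per-column means of the observed entries
print("Fill values match nanmean:", np.allclose(imputer.statistics_, np.nanmean(X_d, axis=0)))
print("Number of coefficients:", linreg.coef_.shape[0])
```

Note that the coefficients refer to the imputed and standardized features, not to the raw columns.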


## Example 3 — Mixed data with ColumnTransformer

Real datasets often mix **numeric** and **categorical** columns. `ColumnTransformer` lets you apply
different preprocessing pipelines to different column subsets, then combine the results.

We'll create a small synthetic dataset with:
- numeric columns (with missing values)
- categorical columns (with missing values)

Then we fit a pipeline ending in **Linear Regression**.


In [None]:
n = 1000
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "age": rng.normal(40, 12, size=n),
    "income": rng.normal(70000, 20000, size=n),
    "city": rng.choice(["NYC", "Boston", "Chicago"], size=n, p=[0.5, 0.25, 0.25]),
    "market_segment": rng.choice(["A", "B", "C"], size=n),
})

# Target depends on both numeric + categorical effects
city_effect = df["city"].map({"NYC": 5000, "Boston": -2000, "Chicago": 1000}).to_numpy()
market_segment_effect = df["market_segment"].map({"A": 3000, "B": 0, "C": -1500}).to_numpy()

y = (
    0.6 * df["income"].to_numpy()
    - 200 * df["age"].to_numpy()
    + city_effect
    + market_segment_effect
    + rng.normal(0, 4000, size=n)  # noise
)

# Introduce missingness (np.nan matches SimpleImputer's default missing_values marker)
for col in ["age", "income", "city", "market_segment"]:
    df.loc[rng.random(n) < 0.05, col] = np.nan

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=42)

df.head()

In [None]:
numeric_features = ["age", "income"]
categorical_features = ["city", "market_segment"]

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ]
)

pipe_mixed = Pipeline(steps=[
    ("preprocess", preprocess),
    ("linreg", LinearRegression()),
])

pipe_mixed.fit(X_train, y_train)
pred = pipe_mixed.predict(X_test)

print("Test RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("Test R^2:", r2_score(y_test, pred))
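
It can help to look at what the `ColumnTransformer` actually hands to the model. The sketch below applies the same kind of preprocessing to a tiny hand-made frame (the six rows are made-up values for illustration) and checks the shape of the result: two scaled numeric columns plus one indicator column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; the values are made up
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0, 51.0, 38.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0, 83000.0, 45000.0],
    "city": pd.Series(["NYC", "Boston", np.nan, "Chicago", "NYC", "NYC"], dtype=object),
    "market_segment": pd.Series(["A", "B", "C", np.nan, "A", "A"], dtype=object),
})

preprocess_demo = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city", "market_segment"]),
])

Xt_demo = preprocess_demo.fit_transform(df_demo)

# Fitted sub-pipelines are available by name after fitting
onehot = preprocess_demo.named_transformers_["cat"].named_steps["onehot"]
city_categories = list(onehot.categories_[0])

print("Transformed shape:", Xt_demo.shape)    # 2 numeric + 3 city + 3 segment columns
print("City categories:", city_categories)
```

The column order in the output follows the order of the `transformers` list: numeric columns first, then the one-hot columns.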


## Example 4 — Modular Design: Swap Models Easily

A big win of pipelines: you can **define preprocessing once** and reuse it with different models.
This makes model comparison clean and prevents code duplication.

We'll compare three models on the same preprocessed data:
- Linear Regression
- KNN Regression
- Ridge Regression (regularized linear model)


In [None]:
from sklearn.base import clone
from sklearn.linear_model import Ridge

# Reuse the `preprocess` ColumnTransformer defined in Example 3.
# clone() gives each pipeline its own unfitted copy, so the pipelines
# don't share (and overwrite) one fitted preprocessing object.

models = {
    "Linear Regression": LinearRegression(),
    "KNN (k=15)": KNeighborsRegressor(n_neighbors=15),
    "Ridge (alpha=1)": Ridge(alpha=1),
}

results = {}

for model_name, model in models.items():
    # Same preprocessing, different final estimator
    pipe = Pipeline(steps=[
        ("preprocess", clone(preprocess)),
        ("model", model),
    ])

    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)

    results[model_name] = {"RMSE": rmse, "R^2": r2}
    print(f"{model_name:20s} | RMSE: {rmse:8.2f} | R^2: {r2:.4f}")


### Key Benefit: Clean Model Comparison

Notice how we:
1. **Defined `preprocess` once** (the `ColumnTransformer` from Example 3)
2. **Looped through models** and built identical pipelines with just the final estimator swapped
3. **Avoided code duplication** for the preprocessing logic

This pattern scales well when comparing many models or doing hyperparameter tuning with `GridSearchCV`.
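
An alternative to rebuilding the pipeline in a loop is to keep one pipeline and replace its final step with `set_params`. The sketch below shows that variant on a small synthetic dataset, scoring each candidate with cross-validation instead of a single test split; the step name `model`, the candidate list, and the `_m` names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_m, y_m = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=0)

pipe_swap = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

candidates = {
    "Linear Regression": LinearRegression(),
    "KNN (k=15)": KNeighborsRegressor(n_neighbors=15),
    "Ridge (alpha=1)": Ridge(alpha=1),
}

cv_means = {}
for name, estimator in candidates.items():
    # Replace the whole final step by passing an estimator under the step's name
    pipe_swap.set_params(model=estimator)
    scores = cross_val_score(pipe_swap, X_m, y_m, cv=5, scoring="r2")
    cv_means[name] = scores.mean()

for name, mean_r2 in cv_means.items():
    print(f"{name:20s} | mean CV R^2: {mean_r2:.4f}")
```

Because a step can be set like any other parameter, the same idea carries over to `GridSearchCV`, where the estimator itself can be one of the values searched over.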


## Notes & Common Patterns

- Use `Pipeline` for **ordered** steps: `imputer → scaler → model`.
- Use `ColumnTransformer` when different columns need different preprocessing.
- Hyperparameter tuning works naturally with pipelines (e.g., `GridSearchCV`), using parameter names like:
  - `knn__n_neighbors`
  - `preprocess__num__imputer__strategy`
- Pipelines help ensure **clean evaluation** and simpler production code: the fitted pipeline is a single object that applies the same preprocessing at prediction time.
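
As a minimal sketch of the `step__parameter` naming in practice, the example below tunes a scaled KNN pipeline with `GridSearchCV`. The grid values and the `_gs` names are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_gs, y_gs = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

pipe_gs = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# get_params() lists every tunable name; nested ones use the double underscore
tunable_names = [name for name in pipe_gs.get_params() if "__" in name]
print(tunable_names)

param_grid = {
    "knn__n_neighbors": [3, 5, 10, 20],
    "knn__weights": ["uniform", "distance"],
}

grid = GridSearchCV(pipe_gs, param_grid=param_grid, cv=5, scoring="r2")
grid.fit(X_gs, y_gs)

print("Best params:", grid.best_params_)
print("Best mean CV R^2:", round(grid.best_score_, 4))
```

Since the scaler lives inside the pipeline, it is refit on the training split of every fold for every parameter combination, so the search stays free of leakage.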
