[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/02_01_exercise_guided.ipynb)

# Module 2 — Regression (Guided Exercise)

**Notebook:** `02_01_exercise_guided.ipynb`  
**Goal:** *Adapt* (structured adaptation)

You will adapt the **regression workflow** from `02_00_main.ipynb` to model:

- **Target:** `transaction_loss_amount` (continuous; zero for most transactions; heavy right tail when fraud occurs)

**What you are practicing**
- Implementing a **reproducible** supervised learning pipeline
- Avoiding **data leakage** using scikit-learn Pipelines
- Making a small number of *specific* adaptations (baselines, target transform, regularization)

**Rules**
- Use the canonical dataset generator exactly as provided.
- Don't edit the generator code.
- Implement the TODOs correctly (single intended modeling path).


## 0) Colab setup (required)

Run this cell first. It clones the course repo and makes the `core` package importable.


In [None]:
# Colab-first setup
!git clone https://github.com/tunnel-ai/way.git

import sys
sys.path.insert(0, "/content/way/src")

# Quick sanity check
import core
print("Imported core from:", core.__file__)


## 1) Generate the dataset

We will generate the dataset using:

`generate_transaction_risk_dataset(seed=1955)`

This is the *same dataset* used throughout Modules 1–4.


In [None]:
import numpy as np
import pandas as pd

from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=1955)

print("Shape:", df.shape)
display(df.head())


## 2) Define target and leakage exclusions

We are predicting **transaction_loss_amount**.

**Important:** the dataset also contains other targets/labels (e.g., `is_fraud`).  
For regression in this module, treat those as *leakage risks* and exclude them from features.

You will:
- explicitly define **X** and **y**
- explicitly define **leakage exclusions**


In [None]:
TARGET = "transaction_loss_amount"

# Explicit leakage exclusions (labels / targets / obvious post-outcome signals)
LEAKAGE_COLS = [
    "is_fraud",                # classification target
    "transaction_loss_amount", # regression target
]

# If other label-ish columns exist, add them here (keep this list explicit).
# Example (if present): "fraud_probability", "loss_bucket", etc.

# Build X, y
y = df[TARGET].copy()

X = df.drop(columns=[c for c in LEAKAGE_COLS if c in df.columns]).copy()

print("X shape:", X.shape)
print("y shape:", y.shape)
print("Target summary:")
display(y.describe(percentiles=[0.5, 0.9, 0.95, 0.99]))


## 3) Quick target check (zero inflation + heavy tail)

This is not a full EDA—just enough to understand the target you're modeling.


In [None]:
import matplotlib.pyplot as plt

zero_rate = (y == 0).mean()
print(f"Zero rate (fraction of transactions with $0 loss): {zero_rate:.3f}")

# Plot non-zero loss distribution (log scale) to visualize heavy tail
nonzero = y[y > 0]
print("Non-zero count:", len(nonzero))

plt.figure(figsize=(7,4))
plt.hist(np.log1p(nonzero), bins=50)
plt.title("log1p(transaction_loss_amount) for fraud transactions (y>0)")
plt.xlabel("log1p(loss)")
plt.ylabel("count")
plt.show()


## 4) Train/validation split

We use a single held-out validation split for this guided exercise.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size=0.25,
    random_state=1955,
)

print("Train:", X_train.shape, y_train.shape)
print("Valid:", X_valid.shape, y_valid.shape)


## 5) Metrics helper (MAE, RMSE, R²)

Evaluate models on the validation set using:
- **MAE** (average absolute error in dollars; less sensitive to a few large errors than RMSE)
- **RMSE** (penalizes large errors; tail-sensitive)
- **R²** (variance explained; can be negative)

**TODO:** Implement the `evaluate_regression()` function.
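
To make the difference between the three metrics concrete, here is a small hand-computed sketch on made-up numbers. The arrays and the `toy_metrics` helper are hypothetical and exist only for this illustration; they are not part of the course code, and the TODO below should still use the scikit-learn functions imported in the next cell.

```python
import numpy as np

# Hypothetical toy data (not from the course dataset).
y_true = np.array([0.0, 0.0, 0.0, 100.0])
pred_spread = np.array([10.0, 10.0, 10.0, 90.0])     # four errors of size 10
pred_one_big = np.array([0.0, 0.0, 0.0, 60.0])       # one error of size 40
pred_constant = np.array([50.0, 50.0, 50.0, 50.0])   # worse than predicting the mean


def toy_metrics(y, y_hat):
    """Hand-computed MAE, RMSE, and R^2 (illustration only)."""
    err = y - y_hat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}


spread = toy_metrics(y_true, pred_spread)
one_big = toy_metrics(y_true, pred_one_big)
constant = toy_metrics(y_true, pred_constant)

print("spread  :", spread)
print("one big :", one_big)
print("constant:", constant)
```

The first two prediction vectors have the same MAE, but the one with a single large miss has double the RMSE. The constant prediction does worse than simply predicting the mean of `y_true`, so its R² is negative.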


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_regression(y_true, y_pred):
    """Return MAE, RMSE, R2 in a dict."""
    # TODO: compute MAE, RMSE, and R^2 and return them in a dict.
    # Hint: RMSE = sqrt(MSE)
    raise NotImplementedError

# Quick test (should run after TODO is implemented)
# print(evaluate_regression(np.array([0, 1, 2]), np.array([0, 2, 2])))


## 6) Baseline 1: Predict **zero** for every transaction

Because most transactions have **$0 loss**, a "predict zero" baseline is meaningful.

**TODO:** Compute baseline metrics on the validation set.
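
As a rough illustration of how a constant baseline behaves on a zero-inflated target, the sketch below compares "predict zero" with "predict the mean" on a made-up ten-row target. The values and helper names are assumptions for this example only, and the results on the real validation set may look different.

```python
import numpy as np

# Hypothetical zero-inflated target: nine $0 losses and one $100 loss.
y_toy = np.array([0.0] * 9 + [100.0])

pred_zero = np.zeros_like(y_toy)
pred_mean = np.full_like(y_toy, y_toy.mean())


def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


mae_zero, mae_mean = mae(y_toy, pred_zero), mae(y_toy, pred_mean)
rmse_zero, rmse_mean = rmse(y_toy, pred_zero), rmse(y_toy, pred_mean)

print(f"predict zero: MAE={mae_zero:.2f}  RMSE={rmse_zero:.2f}")
print(f"predict mean: MAE={mae_mean:.2f}  RMSE={rmse_mean:.2f}")
```

On this toy target, zero is the median, so it wins on MAE, while the mean wins on RMSE. That is one reason to report more than one metric when comparing a model against a baseline.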


In [None]:
# TODO: create y_pred_zero for the validation set (all zeros) and evaluate it
# y_pred_zero = ...
# baseline_zero_metrics = evaluate_regression(y_valid, y_pred_zero)
# print(baseline_zero_metrics)


## 7) Preprocessing + OLS baseline model (Pipeline)

We will use:
- numeric features → impute + scale
- categorical features → impute + one-hot encode
- model → LinearRegression (OLS)

**Design choice for this guided notebook:** we will **exclude `merchant_id`** (high cardinality) to keep the workflow stable.  
You will deal with high-cardinality encoding decisions in `02_02_exercise_open.ipynb`.
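
One detail of the categorical branch worth seeing in isolation is `handle_unknown="ignore"`. The sketch below uses made-up category values (not columns from the course dataset) to show what the encoder does with a category that appears at validation time but never appeared in training.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values, chosen only for this illustration.
train_cats = np.array([["web"], ["store"], ["web"]], dtype=object)
valid_cats = np.array([["store"], ["phone"]], dtype=object)  # "phone" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# The encoder returns a sparse matrix by default; convert it for display.
encoded_valid = encoder.transform(valid_cats).toarray()

print("Learned categories:", encoder.categories_[0].tolist())
print(encoded_valid)
```

The unseen category is encoded as a row of zeros instead of raising an error, so the pipeline can still predict on validation rows with new category values.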


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Split columns by dtype (simple heuristic)
categorical_cols = [c for c in X_train.columns if X_train[c].dtype == "object"]
numeric_cols = [c for c in X_train.columns if c not in categorical_cols]

# Guided constraint: drop merchant_id if present (high-cardinality)
if "merchant_id" in categorical_cols:
    categorical_cols.remove("merchant_id")

print("Numeric cols:", len(numeric_cols))
print("Categorical cols:", len(categorical_cols))
print("Example categorical:", categorical_cols[:10])

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    remainder="drop",
)

ols_model = LinearRegression()

ols_pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", ols_model),
])

ols_pipeline


## 8) Fit OLS pipeline and evaluate

**TODO:** Fit the pipeline and compute validation metrics.
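
The sketch below shows, on made-up numbers, why fitting preprocessing inside a Pipeline helps avoid leakage: the scaler's statistics come only from the rows passed to `fit`. The toy arrays and the `toy_pipeline` name are assumptions for this example and are unrelated to the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
y_train_toy = np.array([2.0, 4.0, 6.0])
X_valid_toy = np.array([[100.0]])  # an extreme validation row

toy_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
toy_pipeline.fit(X_train_toy, y_train_toy)

# The scaler only ever saw the training rows.
scaler_mean = toy_pipeline.named_steps["scaler"].mean_[0]
leaked_mean = np.vstack([X_train_toy, X_valid_toy]).mean()
toy_pred = toy_pipeline.predict(X_valid_toy)

print("Scaler mean learned during fit:", scaler_mean)
print("Mean if validation rows had been included:", leaked_mean)
print("Prediction for the validation row:", toy_pred[0])
```

Scaling the full dataset before splitting would have used the second mean, letting validation data influence the fitted model.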


In [None]:
# TODO: fit the pipeline
# ols_pipeline.fit(X_train, y_train)

# TODO: predict on validation
# y_pred_ols = ols_pipeline.predict(X_valid)

# TODO: evaluate
# ols_metrics = evaluate_regression(y_valid, y_pred_ols)
# print(ols_metrics)


## 9) Residual diagnostic (quick)

Residuals can reveal:
- nonlinearity
- heteroskedasticity
- systematic under/over-prediction in certain ranges

This is *not* a full statistical diagnostic.
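
For intuition about what heteroskedasticity looks like in numbers, here is a minimal sketch on synthetic data where the noise grows with the feature. The data-generating choices and variable names are assumptions for illustration; the course dataset may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noise grows with x.
x = rng.uniform(0.0, 10.0, size=2000)
y_synth = 2.0 * x + rng.normal(0.0, 1.0, size=x.size) * (0.2 + x)

# Fit a straight line by least squares and compute residuals.
slope, intercept = np.polyfit(x, y_synth, deg=1)
pred = slope * x + intercept
resid = y_synth - pred

# Compare residual spread in the lower and upper halves of the predictions.
cut = np.median(pred)
spread_low = resid[pred <= cut].std()
spread_high = resid[pred > cut].std()

print(f"Residual std, low predictions:  {spread_low:.2f}")
print(f"Residual std, high predictions: {spread_high:.2f}")
```

In a residuals-vs-predicted plot this pattern shows up as a funnel: points hug the zero line at low predictions and fan out at high ones.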


In [None]:
# This cell assumes you have y_pred_ols from the previous section.
# If you named it differently, update accordingly.

# TODO: compute residuals and plot residuals vs predictions
# residuals = y_valid - y_pred_ols

# plt.figure(figsize=(7,4))
# plt.scatter(y_pred_ols, residuals, alpha=0.3)
# plt.axhline(0, linestyle="--")
# plt.title("Residuals vs Predicted (OLS)")
# plt.xlabel("Predicted loss ($)")
# plt.ylabel("Residual ($)")
# plt.show()


## 10) Adaptation 1: Model `log1p(loss)` and evaluate in dollars

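Because the loss distribution is heavy-tailed, fitting to `log1p(y)` compresses the largest losses so that a handful of extreme values does not dominate the squared-error fit. `log1p` is also defined at zero, which matters because most losses are exactly zero.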

**TODO:**  
1) Create `y_train_log = log1p(y_train)`  
2) Fit the same pipeline but on `y_train_log`  
3) Predict on validation, then convert predictions back to dollars using `expm1()`  
4) Evaluate in dollars.
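
The round trip between dollars and the log scale can be checked on a few made-up values first. This sketch is illustrative only: the arrays are hypothetical, and the clipping step at the end is an optional choice that the notebook's intended path does not require.

```python
import numpy as np

# Hypothetical losses in dollars, including an exact zero and a large value.
y_toy = np.array([0.0, 5.0, 50.0, 5000.0])

y_log = np.log1p(y_toy)    # log(1 + y); defined at y = 0
y_back = np.expm1(y_log)   # exp(z) - 1; inverts log1p

print("log scale:", np.round(y_log, 3))
print("back to dollars:", np.round(y_back, 3))

# A model fit on the log scale can predict values below zero.
# expm1 then maps them to negative dollar amounts.
pred_log = np.array([-0.2, 0.0, 3.0])
pred_dollars = np.expm1(pred_log)

# Optional (an assumption, not a course requirement): floor predictions at $0.
pred_clipped = np.clip(pred_dollars, 0.0, None)

print("dollar predictions:", np.round(pred_dollars, 3))
print("clipped at zero:   ", np.round(pred_clipped, 3))
```

Metrics are computed after `expm1` because MAE and RMSE on the log scale are not in dollars and cannot be compared with the earlier models.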


In [None]:
from sklearn.base import clone

# We'll reuse the *same* pipeline structure (same preprocessing).
# Only the target changes.

# TODO 1: log-transform training target
# y_train_log = ...

# TODO 2: clone the pipeline and fit on log target
# ols_log_pipeline = clone(ols_pipeline)
# ols_log_pipeline.fit(X_train, y_train_log)

# TODO 3: predict log-loss on validation and invert transform
# y_pred_log = ols_log_pipeline.predict(X_valid)
# y_pred_log_dollars = np.expm1(y_pred_log)

# TODO 4: evaluate in dollars
# ols_log_metrics = evaluate_regression(y_valid, y_pred_log_dollars)
# print(ols_log_metrics)


## 11) Adaptation 2: Ridge Regression with cross-validated alpha

Ridge adds an L2 penalty that shrinks coefficients and often improves generalization,
especially with many correlated predictors after one-hot encoding.

**TODO:** Replace the OLS model with `RidgeCV` and evaluate.

Constraints:
- Keep the preprocessing identical.
- Use a small, sensible alpha grid.
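
To see the shrinkage effect without any library machinery, the sketch below solves the ridge problem in closed form on two nearly identical synthetic features. It is a simplified illustration (no intercept, hypothetical data, a hand-written `ridge_weights` helper) and is not a description of how `RidgeCV` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, strongly correlated features (an assumption for illustration).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
X_toy = np.column_stack([x1, x2])
y_toy = 3.0 * x1 + rng.normal(scale=0.5, size=n)


def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


norms = []
for alpha in [1e-3, 1.0, 100.0]:
    w = ridge_weights(X_toy, y_toy, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:g}  weights={np.round(w, 3)}  norm={norms[-1]:.3f}")
```

With almost no penalty, the two correlated features can take large offsetting weights. As alpha grows, the weights move toward a smaller, more even split, which is the behavior the L2 penalty is meant to encourage.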


In [None]:
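from sklearn.base import clone
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-3, 3, 25)

ridge_model = RidgeCV(alphas=alphas)

# Clone the preprocessor so this pipeline gets its own unfitted copy with
# identical settings, instead of sharing (and refitting) the OLS pipeline's object.
ridge_pipeline = Pipeline(steps=[
    ("preprocess", clone(preprocess)),
    ("model", ridge_model),
])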

# TODO: fit ridge_pipeline on training data and evaluate on validation
# ridge_pipeline.fit(X_train, y_train)
# y_pred_ridge = ridge_pipeline.predict(X_valid)
# ridge_metrics = evaluate_regression(y_valid, y_pred_ridge)
# print(ridge_metrics)

# Optional: inspect chosen alpha
# print("Chosen alpha:", ridge_pipeline.named_steps["model"].alpha_)


## 12) Compare models (baseline vs OLS vs log-OLS vs Ridge)

**TODO:** Create a small comparison table of the metrics you computed.


In [None]:
# TODO: Build a DataFrame comparing:
# - baseline_zero_metrics
# - ols_metrics
# - ols_log_metrics
# - ridge_metrics
#
# Example:
# results = pd.DataFrame([
#     {"model": "baseline_zero", **baseline_zero_metrics},
#     {"model": "ols", **ols_metrics},
#     {"model": "ols_log1p", **ols_log_metrics},
#     {"model": "ridge", **ridge_metrics},
# ]).set_index("model")
# display(results)


## 13) Check in

1) Why is the “predict zero” baseline meaningful for this target?  
2) Did modeling `log1p(loss)` help? If yes, *why* might it help for a heavy-tailed target?  
3) If Ridge helped, what problem is it addressing in this feature space (especially after one-hot encoding)?  
4) Identify **one leakage risk** you avoided in this workflow, and how you avoided it.

