[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/02_03_extension.ipynb)

# Module 2 — Supervised Learning (Regression)
## 02_03_extension — Optional Exploration / Challenge (Regression)

**Rough goal:** Experiment beyond the baseline

### To do:
**Pick ONE exploration track** and run a small, controlled experiment on the **canonical dataset**.

### Keep fixed
- Use the canonical generator: `generate_transaction_risk_dataset(seed=1955)`
- Use the same train/validation split (fixed random seed)
- Report the same core metrics (MAE, RMSE, R²)
- Do your best to keep your work **reproducible** (no manual tweaks that are hard to rerun)

### Deliverable
At the end, think about (and discuss if time allows):
- What did you try?
- What happened (metrics + one plot)?
- What do you think explains the result?

In [None]:
# Colab-first setup (do not modify)
!git clone https://github.com/tunnel-ai/way.git
import sys
sys.path.insert(0, "/content/way/src")

In [None]:
# Imports
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression

In [None]:
# Canonical dataset (ground truth) — do not change the generator or seed
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=1955)
df.head()

## 1) Target and guardrails

We are predicting:

- **Target:** `transaction_loss_amount` (continuous; zero for non-fraud, heavy right tail for fraud)

A key modeling reality is **zero-inflation + heavy tails**:
- Many transactions have *exactly zero* loss
- Fraud transactions can have very large loss amounts

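That is why **baseline comparisons** matter: on a mostly-zero target, a model that always predicts 0 can already post a low MAE, so any real model has to be judged against that trivial predictor.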
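
To make this concrete, here is a minimal sketch on a synthetic zero-inflated sample. It does not use the canonical dataset: the 5% event rate and the lognormal tail are illustrative assumptions, and `toy_loss`, `mae`, and `rmse` are hypothetical names. It compares two constant predictors (always 0, always the mean) under MAE and RMSE.

```python
import numpy as np

# Synthetic stand-in for a zero-inflated, heavy-tailed target.
# The 5% event rate and lognormal parameters are illustrative assumptions,
# not properties of the canonical dataset.
rng = np.random.default_rng(0)
n = 10_000
is_event = rng.random(n) < 0.05
toy_loss = np.where(is_event, rng.lognormal(mean=5.0, sigma=1.0, size=n), 0.0)

def mae(y, pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - pred)))

def rmse(y, pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - pred) ** 2)))

zero_pred = np.zeros(n)                   # "always predict 0"
mean_pred = np.full(n, toy_loss.mean())   # "always predict the mean"

print("predict 0    -> MAE:", mae(toy_loss, zero_pred), " RMSE:", rmse(toy_loss, zero_pred))
print("predict mean -> MAE:", mae(toy_loss, mean_pred), " RMSE:", rmse(toy_loss, mean_pred))
```

Because more than half of the values are zero, 0 is the sample median and therefore the best constant under MAE, while the mean is the best constant under RMSE. A model is only interesting if it beats both.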

In [None]:
TARGET = "transaction_loss_amount"

# Conservative leakage exclusions (keep consistent with the main notebook; used below when building X)
LEAKAGE_EXCLUSIONS = [
    "is_fraud",                 # directly determines whether loss can be > 0
]

# Basic checks
if TARGET not in df.columns:
    raise ValueError(f"Target column '{TARGET}' not found in dataset.")

y = df[TARGET].astype(float)
X = df.drop(columns=[TARGET] + [c for c in LEAKAGE_EXCLUSIONS if c in df.columns])

print("Rows:", len(df))
print("X columns:", X.shape[1])
print("Target summary:")
print(y.describe())
print("\n% zero loss:", (y == 0).mean())

In [None]:
# Train/validation split (fixed)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1955
)

print("Train:", X_train.shape, " Val:", X_val.shape)

In [None]:
# Helper: regression metrics in dollars
def regression_report(y_true, y_pred, label="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # the `squared=False` argument was removed in recent scikit-learn
    r2 = r2_score(y_true, y_pred)
    return pd.Series({"label": label, "MAE": mae, "RMSE": rmse, "R2": r2})

# Baseline: always predict 0 (important for zero-inflated targets)
baseline0 = np.zeros_like(y_val)
baseline_row = regression_report(y_val, baseline0, label="Baseline: predict 0")
baseline_row

## 2) A compact baseline pipeline (reference)

This is a lightweight reference pipeline you can reuse across tracks.

- Numeric: impute median + scale
- Categorical: impute most_frequent + one-hot encode
- Model: Linear Regression (baseline)

You may swap the **model** in your chosen track, but try to keep the **preprocessing logic** stable.

In [None]:
# Identify columns by dtype (simple heuristic)
numeric_cols = [c for c in X_train.columns if pd.api.types.is_numeric_dtype(X_train[c])]
categorical_cols = [c for c in X_train.columns if c not in numeric_cols]

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols)
    ],
    remainder="drop"
)

baseline_model = Pipeline(steps=[
    ("prep", preprocess),
    ("model", LinearRegression())
])

baseline_model.fit(X_train, y_train)
pred_val = baseline_model.predict(X_val)

baseline_lr_row = regression_report(y_val, pred_val, label="Baseline: LinearRegression + OHE")
pd.DataFrame([baseline_row, baseline_lr_row])

---

## 3) Choose ONE exploration track

Pick **one** of the tracks below.

### Track A — Robustness to heavy tails (Huber-like thinking)
**Question:** Do we get better dollar-loss performance if we reduce sensitivity to extreme outliers?

Approach options:
- Use **SGDRegressor** with **Huber** loss (robust to outliers)
- Compare to the linear baseline

### Track B — Two-stage modeling (fraud → loss|fraud)
**Question:** Does explicitly modeling “zero vs positive” help?

Approach:
1) Train a classifier to predict whether loss > 0 (proxy for fraud event)
2) Train a regression model only on the positive-loss subset
3) Combine predictions:
   - If classifier predicts “no-loss”, predict 0
   - If classifier predicts “loss”, use regression estimate

### Track C — High-cardinality `merchant_id` (encoding stress test)
**Question:** How much does `merchant_id` help, and what does it cost?

Approach:
- Compare:
  1) Drop `merchant_id`
  2) Bucket top-K `merchant_id` values and one-hot encode
- Keep everything else the same

> Avoid modeling the whole zoo. One track, one comparison, one conclusion.

### Track A starter (robust regression)

Uncomment and run this track if you choose A.
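
To see why a Huber-style loss is less sensitive to extreme outliers, the sketch below compares it with squared loss on a few residuals. It uses the textbook Huber definition (quadratic up to `epsilon`, linear beyond it). The function names are hypothetical, and the exact scaling inside `SGDRegressor` should be checked against the scikit-learn documentation rather than inferred from this sketch.

```python
import numpy as np

def squared_loss(residual):
    """Half squared error: grows quadratically with the residual."""
    return 0.5 * residual ** 2

def huber_loss(residual, epsilon=1.35):
    """Textbook Huber loss: quadratic for |r| <= epsilon, linear beyond it."""
    abs_r = np.abs(residual)
    return np.where(
        abs_r <= epsilon,
        0.5 * abs_r ** 2,
        epsilon * (abs_r - 0.5 * epsilon),
    )

residuals = np.array([0.5, 1.35, 10.0, 1000.0])
for r, sq, hu in zip(residuals, squared_loss(residuals), huber_loss(residuals)):
    print(f"residual={r:>8.2f}  squared={sq:>10.2f}  huber={hu:>8.2f}")
```

The two losses agree for small residuals, but at a residual of 1000 the squared loss is several hundred times larger than the Huber loss, so a single extreme fraud loss pulls the fit far less. Note that `epsilon` is measured in the units of the residual: with an unscaled dollar target, a value of 1.35 likely puts almost every fraud residual in the linear regime.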

In [None]:
# ===== TRACK A (Robust regression) =====
# Uncomment the block below to run Track A

# from sklearn.linear_model import SGDRegressor

# robust_model = Pipeline(steps=[
#     ("prep", preprocess),
#     ("model", SGDRegressor(
#         loss="huber",
#         epsilon=1.35,          # Huber transition point, in target units (dollars here, since y is not scaled)
#         alpha=1e-4,            # regularization strength
#         max_iter=2000,
#         random_state=1955
#     ))
# ])

# robust_model.fit(X_train, y_train)
# pred_val_robust = robust_model.predict(X_val)

# results = pd.DataFrame([
#     baseline_row,
#     baseline_lr_row,
#     regression_report(y_val, pred_val_robust, label="Track A: SGDRegressor (Huber) + OHE")
# ])
# results

### Track B starter (two-stage modeling)

Uncomment and run this track if you choose B.

Notes:
- This is still “supervised learning,” but it mixes **classification + regression**.
- It directly targets the **zero-inflation** structure of `transaction_loss_amount`.
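
The combination step can be sketched in isolation. The starter below uses the hard-threshold rule. The expected-value rule shown next to it is an alternative that is not part of the starter, included only for comparison. The function names and the four example transactions are hypothetical.

```python
import numpy as np

def combine_hard(p_event, loss_if_event, threshold=0.5):
    """Predict 0 below the threshold, the conditional loss estimate at or above it."""
    return np.where(p_event >= threshold, loss_if_event, 0.0)

def combine_expected(p_event, loss_if_event):
    """Expected loss: P(loss > 0) * E[loss | loss > 0]."""
    return p_event * loss_if_event

# Hypothetical stage outputs for four transactions.
p_event = np.array([0.02, 0.40, 0.60, 0.95])             # stage 1: P(loss > 0)
loss_if_event = np.array([300.0, 300.0, 300.0, 300.0])   # stage 2: estimated loss given an event

hard = combine_hard(p_event, loss_if_event)
soft = combine_expected(p_event, loss_if_event)
print("hard:", hard)
print("soft:", soft)
```

The hard rule is all-or-nothing around the threshold, so its errors are either zero or the full loss estimate. The expected-value rule spreads small predictions across every transaction, which tends to trade a higher MAE on true zeros for smaller errors on missed events. If you try it, treat it as your single change and document it.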

In [None]:
# ===== TRACK B (Two-stage: loss event -> loss magnitude) =====
# Uncomment the block below to run Track B

# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import roc_auc_score

# # Stage 1: predict whether loss > 0
# y_train_event = (y_train > 0).astype(int)
# y_val_event = (y_val > 0).astype(int)

# clf = Pipeline(steps=[
#     ("prep", preprocess),
#     ("model", LogisticRegression(max_iter=1000))
# ])

# clf.fit(X_train, y_train_event)
# proba_val = clf.predict_proba(X_val)[:, 1]
# auc = roc_auc_score(y_val_event, proba_val)

# # Choose a threshold (simple default). You can tune this, but keep it reasonable and documented.
# threshold = 0.5
# pred_event = (proba_val >= threshold).astype(int)

# # Stage 2: regress loss magnitude on positive-loss training examples only
# pos_mask_train = y_train > 0
# X_train_pos = X_train.loc[pos_mask_train]
# y_train_pos = y_train.loc[pos_mask_train]

# reg_pos = Pipeline(steps=[
#     ("prep", preprocess),
#     ("model", LinearRegression())
# ])
# reg_pos.fit(X_train_pos, y_train_pos)

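# # Combine predictions: predict 0 unless the classifier flags a loss event
# pred_loss_val = np.zeros(len(y_val), dtype=float)
# flagged = pred_event == 1
# if flagged.any():  # predict() raises on an empty frame, so skip it when nothing is flagged
#     pred_loss_val[flagged] = reg_pos.predict(X_val.loc[flagged])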

# results = pd.DataFrame([
#     baseline_row,
#     baseline_lr_row,
#     regression_report(y_val, pred_loss_val, label=f"Track B: two-stage (thr={threshold:.2f})")
# ])
# print("Stage-1 AUC (loss>0):", auc)
# results

### Track C starter (merchant_id stress test)

Uncomment and run this track if you choose C.

This track is about **feature engineering constraints**:
- `merchant_id` is high-cardinality (can explode one-hot)
- Top-K bucketing is a practical compromise: keep frequent merchants, bucket the rest as “OTHER”
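
A minimal sketch of the idea on toy data, assuming string merchant IDs. The values and the helper names `fit_top_k` and `apply_top_k` are hypothetical. The key point is that the top-K set is learned from the training column only and then applied unchanged to validation data.

```python
import pandas as pd

def fit_top_k(train_col: pd.Series, k: int) -> pd.Index:
    """Return the k most frequent values, learned from training data only."""
    return train_col.value_counts().nlargest(k).index

def apply_top_k(col: pd.Series, top_values: pd.Index, other_label: str = "__OTHER__") -> pd.Series:
    """Keep values in top_values; replace everything else with other_label."""
    return col.where(col.isin(top_values), other_label)

# Toy merchant IDs (hypothetical values, not from the canonical dataset).
train_ids = pd.Series(["m1", "m1", "m1", "m2", "m2", "m3", "m4"])
val_ids = pd.Series(["m1", "m2", "m3", "m9"])

top_values = fit_top_k(train_ids, k=2)
train_bucketed = apply_top_k(train_ids, top_values)
val_bucketed = apply_top_k(val_ids, top_values)

print("kept:", list(top_values))
print("val :", val_bucketed.tolist())
print("one-hot columns:", train_ids.nunique(), "->", train_bucketed.nunique())
```

Here `m3` is rare in training and `m9` never appears there, so both fall into the `__OTHER__` bucket in validation. The number of one-hot columns is capped at K + 1 regardless of how many distinct merchants exist.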

In [None]:
# ===== TRACK C (merchant_id top-K bucketing) =====
# Uncomment the block below to run Track C

# def bucket_top_k(series: pd.Series, k: int = 50, other_label: str = "__OTHER__") -> pd.Series:
#     top = series.value_counts().nlargest(k).index
#     return series.where(series.isin(top), other_label)

# # Variant 1: DROP merchant_id
# X_train_drop = X_train.drop(columns=["merchant_id"], errors="ignore")
# X_val_drop = X_val.drop(columns=["merchant_id"], errors="ignore")

# num_cols_drop = [c for c in X_train_drop.columns if pd.api.types.is_numeric_dtype(X_train_drop[c])]
# cat_cols_drop = [c for c in X_train_drop.columns if c not in num_cols_drop]

# preprocess_drop = ColumnTransformer(
#     transformers=[
#         ("num", numeric_pipe, num_cols_drop),
#         ("cat", categorical_pipe, cat_cols_drop)
#     ],
#     remainder="drop"
# )

# model_drop = Pipeline(steps=[
#     ("prep", preprocess_drop),
#     ("model", LinearRegression())
# ])
# model_drop.fit(X_train_drop, y_train)
# pred_drop = model_drop.predict(X_val_drop)

# # Variant 2: BUCKET merchant_id to top-K and one-hot
# X_train_bucket = X_train.copy()
# X_val_bucket = X_val.copy()

# if "merchant_id" in X_train_bucket.columns:
#     # Learn the top-K merchants from the RAW training column, then apply the same set to train and val.
#     # (Recomputing value_counts() after bucketing would let "__OTHER__" take one of the K slots.)
#     top_train = X_train["merchant_id"].value_counts().nlargest(50).index
#     X_train_bucket["merchant_id"] = bucket_top_k(X_train["merchant_id"], k=50)
#     X_val_bucket["merchant_id"] = X_val["merchant_id"].where(
#         X_val["merchant_id"].isin(top_train), "__OTHER__"
#     )

# num_cols_b = [c for c in X_train_bucket.columns if pd.api.types.is_numeric_dtype(X_train_bucket[c])]
# cat_cols_b = [c for c in X_train_bucket.columns if c not in num_cols_b]

# preprocess_bucket = ColumnTransformer(
#     transformers=[
#         ("num", numeric_pipe, num_cols_b),
#         ("cat", categorical_pipe, cat_cols_b)
#     ],
#     remainder="drop"
# )

# model_bucket = Pipeline(steps=[
#     ("prep", preprocess_bucket),
#     ("model", LinearRegression())
# ])
# model_bucket.fit(X_train_bucket, y_train)
# pred_bucket = model_bucket.predict(X_val_bucket)

# results = pd.DataFrame([
#     baseline_row,
#     baseline_lr_row,
#     regression_report(y_val, pred_drop, label="Track C: drop merchant_id"),
#     regression_report(y_val, pred_bucket, label="Track C: top-50 merchant_id bucket + OHE"),
# ])
# results

---

## 4) Diagnostic plot 

After you run your chosen track, create **one diagnostic plot** that supports your explanation.

Suggested options:
- Actual vs Predicted (scatter, maybe log scale)
- Residuals vs Predicted
- Residuals vs transaction_amount (to show heteroskedasticity)
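
As a numeric companion to the last option, residual spread can be summarized within quantile bins of a feature. The sketch below is illustrative: `residual_spread_by_bin` is a hypothetical helper, and the demonstration data is synthetic, with noise that grows with the feature by construction.

```python
import numpy as np
import pandas as pd

def residual_spread_by_bin(feature, residuals, n_bins=5):
    """Residual standard deviation and mean absolute residual per quantile bin of a feature."""
    frame = pd.DataFrame({
        "feature": np.asarray(feature, dtype=float),
        "resid": np.asarray(residuals, dtype=float),
    })
    frame["abs_resid"] = frame["resid"].abs()
    frame["bin"] = pd.qcut(frame["feature"], q=n_bins, duplicates="drop")
    return frame.groupby("bin", observed=True).agg(
        n=("resid", "count"),
        resid_std=("resid", "std"),
        mean_abs_resid=("abs_resid", "mean"),
    )

# Synthetic example: noise scale grows with the feature (illustrative only).
rng = np.random.default_rng(0)
amount = rng.uniform(1.0, 1000.0, size=5_000)
resid_demo = rng.normal(loc=0.0, scale=0.1 * amount)

spread = residual_spread_by_bin(amount, resid_demo)
print(spread)
```

In the notebook, a call such as `residual_spread_by_bin(X_val["transaction_amount"], resid)` would apply the same summary to your own residuals, assuming that column is present in `X_val`. A spread that rises steadily across bins is a sign of heteroskedasticity.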

In [None]:
import matplotlib.pyplot as plt

# Replace `y_hat` with the predictions from your chosen track.
# Example: y_hat = pred_val_robust   (Track A)
# Example: y_hat = pred_loss_val     (Track B)
# Example: y_hat = pred_bucket       (Track C)

y_hat = pred_val  # default: baseline linear model

resid = y_val.values - y_hat

plt.figure(figsize=(6, 4))
plt.scatter(y_hat, resid, alpha=0.25)
plt.axhline(0, linewidth=1)
plt.xlabel("Predicted loss (dollars)")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals vs Predicted (diagnostic)")
plt.show()

---

## 5) Think about (and discuss if time allows)

- Track chosen
- What you changed relative to the baseline
- Metrics (MAE, RMSE, R²) summary
- One diagnostic observation from your plot
- Your best explanation (connect to zero-inflation, heavy tails, high-cardinality encoding, or model misspecification)
- If you had 30 more minutes, what would you try next? An hour?