
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/02_00_main.ipynb)

In [None]:
# --- Course setup (uncomment and run if using Colab) ---
#!git clone https://github.com/tunnel-ai/way.git
#import sys; sys.path.insert(0, "/content/way/src")


# Module 2 — Supervised Learning: Regression 

**Target:** `transaction_loss_amount` (continuous, zero-inflated; heavy right tail for fraud cases)  
**Dataset:** Same as the previous module: `core.generators.transaction_risk_dgp.generate_transaction_risk_dataset(seed=1955)`  

**How to run (Colab):**
1. Open this notebook from GitHub with the **Open in Colab** button. 
2. Run the first cell (it clones the repo and sets `sys.path`). If its lines are commented out, uncomment them first.


In [None]:
# --- Imports ------------------------------------------------------------------
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.compose import TransformedTargetRegressor

RANDOM_STATE = 1955
np.random.seed(RANDOM_STATE)



In [None]:
# --- Load canonical dataset (don't modify generator) --------------------------
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=RANDOM_STATE)

print(df.shape)
df.head()


## 1) Problem framing: why regression is tricky here

`transaction_loss_amount` is **zero** for non-fraud transactions and **positive / heavy-tailed** for fraud transactions. So we have a **zero-inflated, right-skewed** target distribution.

This shape is common in practice (e.g., claims, chargebacks, returns): many zeros, a few large outcomes.
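
To make the shape concrete, here is a minimal sketch that simulates a zero-inflated, right-skewed target. The 2% fraud rate and the lognormal parameters are made-up illustration values, not the settings of the course generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical mixture: about 2% of rows are "fraud" and carry a lognormal loss;
# every other row has a loss of exactly zero.
is_fraud = rng.random(n) < 0.02
loss = np.where(is_fraud, rng.lognormal(mean=5.0, sigma=1.2, size=n), 0.0)

zero_rate = (loss == 0).mean()
print(f"zero rate: {zero_rate:.1%}")
print(f"median: {np.median(loss):.2f}  mean: {loss.mean():.2f}  max: {loss.max():.2f}")
```

The median sits at zero while the mean is pulled up by a handful of large values, which is the pattern to look for in the real target.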


In [None]:
# --- Target distribution: zeros + heavy tail ----------------------------------
TARGET = "transaction_loss_amount"

y = df[TARGET]
zero_rate = (y == 0).mean()

print(f"Rows: {len(df):,}")
print(f"Zero rate in {TARGET}: {zero_rate:.1%}")
print("Target quantiles (including zeros):")
display(y.quantile([0, .5, .9, .95, .99, .995, .999]))

fig, ax = plt.subplots()
ax.hist(y, bins=100)
ax.set_title(f"Histogram of {TARGET} (raw scale)")
ax.set_xlabel(TARGET)
ax.set_ylabel("count")
plt.show()

fig, ax = plt.subplots()
ax.hist(np.log1p(y), bins=100)
ax.set_title(f"Histogram of log1p({TARGET})")
ax.set_xlabel(f"log1p({TARGET})")
ax.set_ylabel("count")
plt.show()


## 2) Train/test split + feature audit

To save time, we'll build a fairly minimal **baseline pipeline** that handles:
- numeric features with missing values (impute + optional scaling)
- categorical features with missing values (impute + one-hot encoding)
- **high-cardinality** `merchant_id` (...carefully)

We will not “clean” the dataset outside pipelines. All preprocessing stays inside the model pipeline, so statistics such as imputation medians are learned from the training rows only. That is what prevents leakage.
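
As a rough illustration of the leakage point, the sketch below compares an imputation value computed on all rows with one computed on training rows only. The data and the 750/250 split are synthetic assumptions; inside a scikit-learn `Pipeline`, `SimpleImputer` follows the second pattern because it is fit on whatever rows are passed to `fit`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)
x[rng.random(x.size) < 0.1] = np.nan  # roughly 10% missing values

train, test = x[:750].copy(), x[750:].copy()

# Leaky: the fill value is computed from every row, test rows included.
leaky_fill = np.nanmedian(x)

# Safe: the fill value is learned from training rows only, then reused on test.
safe_fill = np.nanmedian(train)
train_filled = np.where(np.isnan(train), safe_fill, train)
test_filled = np.where(np.isnan(test), safe_fill, test)

print(f"fill from all rows: {leaky_fill:.3f}   fill from train only: {safe_fill:.3f}")
```

The two values are usually close for a simple median, but the same mistake with scalers, encoders, or target-based features can inflate test scores noticeably.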


In [None]:
# --- Define features / targets ------------------------------------------------
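DROP = [TARGET]  # regression target
# If the dataset also carries a fraud label column from the prior module, add it to DROP:
# loss is nonzero only for fraud, so that label would leak the target.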

X = df.drop(columns=DROP)
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE
)

# Identify column types
categorical_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
numeric_cols = [c for c in X_train.columns if c not in categorical_cols]

print("n_train:", X_train.shape, "n_test:", X_test.shape)
print("Categorical:", categorical_cols)
print("Numeric (first 15):", numeric_cols[:15])

# Quick missingness + cardinality audit
miss = X_train.isna().mean().sort_values(ascending=False)
card = {c: X_train[c].nunique(dropna=True) for c in categorical_cols}
card_s = pd.Series(card).sort_values(ascending=False)

display(pd.DataFrame({"missing_rate": miss}).head(10))
display(pd.DataFrame({"n_unique": card_s}))


## 3) Baselines: “predict the mean” and “predict zero”

Two baselines:

1) **Mean baseline**: always predict the mean of the training target. Often surprisingly useful. Maybe not here. 
2) **Zero baseline**: always predict 0 (often really strong when there are many zeros)

If a model *can’t* beat these, it’s not learning a useful signal.
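
The sketch below shows why both baselines are worth reporting. On a synthetic zero-inflated target (made-up parameters, evaluated on the same sample for simplicity), the zero baseline wins on MAE because the median is 0, while the mean baseline wins on RMSE because the mean minimizes squared error among constant predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical target: about 5% nonzero, lognormal when nonzero.
y = np.where(rng.random(n) < 0.05, rng.lognormal(5.0, 1.0, size=n), 0.0)

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

mean_pred = np.full(n, y.mean())
zero_pred = np.zeros(n)

for name, pred in [("mean", mean_pred), ("zero", zero_pred)]:
    print(f"{name:>4} baseline  MAE={mae(y, pred):8.2f}  RMSE={rmse(y, pred):8.2f}")
```

A model that improves MAE but not RMSE (or the reverse) may only be matching one of these trivial strategies.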


In [None]:
def regression_report(y_true, y_pred, label="model"):
    mae = mean_absolute_error(y_true, y_pred)
    # `squared=False` was deprecated in scikit-learn 1.4 and removed in 1.6; take the root explicitly.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return pd.Series({"MAE": mae, "RMSE": rmse, "R2": r2}, name=label)

mean_pred = np.full_like(y_test, fill_value=float(y_train.mean()), dtype=float)
zero_pred = np.zeros_like(y_test, dtype=float)

results = pd.concat(
    [
        regression_report(y_test, mean_pred, "Baseline: mean(y_train)"),
        regression_report(y_test, zero_pred, "Baseline: always 0"),
    ],
    axis=1,
).T

results


## 4) Linear regression with a real preprocessing pipeline

Fit an **OLS linear regression** on a feature matrix that mixes:
- numeric variables (e.g., `transaction_amount`, `account_age_days`, risk counts)
- categorical variables (e.g., `channel`, `merchant_category`, `merchant_id`)

Key design choice/warning: how to encode high-cardinality categories like `merchant_id`?

We’ll start with `OneHotEncoder` using **frequency-based grouping** via `min_frequency` (added in scikit-learn 1.1, so not available in all environments... looking at you, Colab...).  

If not available, we’ll fall back to plain one-hot with `handle_unknown='ignore'`.
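
To show what frequency-based grouping does conceptually, here is a standard-library sketch that collapses rare categories into a single level before encoding. It imitates the idea behind `min_frequency`; it is not scikit-learn's implementation, and `group_infrequent` and the toy ids are hypothetical names for illustration.

```python
from collections import Counter

def group_infrequent(values, min_frequency, other_label="infrequent"):
    """Replace categories seen fewer than `min_frequency` times with one shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_frequency else other_label for v in values]

# Toy merchant ids: two frequent merchants and three that appear once each.
merchant_ids = ["m1"] * 5 + ["m2"] * 4 + ["m3", "m4", "m5"]
grouped = group_infrequent(merchant_ids, min_frequency=3)

print(len(set(merchant_ids)), "distinct ids before grouping")
print(len(set(grouped)), "distinct levels after grouping")
```

One-hot encoding the grouped column produces one column per surviving level, so the width of the design matrix is bounded by the number of frequent merchants plus one.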


In [None]:
# --- Preprocessing ------------------------------------------------------------
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        # Scaling isn't strictly required for OLS, but helps when we move to regularization.
        # with_mean=False rescales by the standard deviation without centering.
        ("scaler", StandardScaler(with_mean=False)),
    ]
)

# OneHotEncoder behavior can vary slightly by sklearn version! (root of all Python ills...)
# We'll try to enable frequency-based grouping to reduce merchant_id explosion.
def make_ohe():
    try:
        return OneHotEncoder(handle_unknown="ignore", min_frequency=50)
    except TypeError:
        return OneHotEncoder(handle_unknown="ignore")

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", make_ohe()),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="drop",
    sparse_threshold=0.3,
)

ols_model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", LinearRegression()),
    ]
)

ols_model


In [None]:
# --- Fit + evaluate -----------------------------------------------------------
ols_model.fit(X_train, y_train)

y_pred = ols_model.predict(X_test)

results_ols = pd.concat(
    [results, regression_report(y_test, y_pred, "OLS + one-hot")],
    axis=0
)
results_ols


### Diagnostics: prediction scatter + residuals

A linear model can be “good enough” on average while failing badly in the tail.  
Because loss is heavy-tailed, look at:
- predicted vs actual
- residuals vs predicted
- residual histogram (often skewed / heteroskedastic)
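
A numeric companion to these plots is to compare the overall error with the error on the largest actual losses. The sketch below does that on synthetic data with a one-feature least-squares line standing in for the model; the data-generating values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical data: loss is a fraction of the amount for ~5% of rows, else zero.
amount = rng.lognormal(mean=4.0, sigma=1.0, size=n)
is_fraud = rng.random(n) < 0.05
y_true = np.where(is_fraud, amount * rng.uniform(0.5, 1.0, size=n), 0.0)

# Stand-in model: ordinary least-squares line on the single feature.
slope, intercept = np.polyfit(amount, y_true, deg=1)
y_hat = slope * amount + intercept

abs_err = np.abs(y_true - y_hat)
tail = y_true >= np.quantile(y_true, 0.99)  # top 1% of actual losses

overall_mae = abs_err.mean()
tail_mae = abs_err[tail].mean()
print(f"overall MAE: {overall_mae:.2f}   MAE on top 1% of losses: {tail_mae:.2f}")
```

A large gap between the two numbers is the tabular version of a fan-shaped residual plot.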


In [None]:
resid = y_test - y_pred

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.4)
ax.set_title("Predicted vs actual (OLS)")
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
plt.show()

fig, ax = plt.subplots()
ax.scatter(y_pred, resid, alpha=0.4)
ax.axhline(0, linewidth=1)
ax.set_title("Residuals vs predicted (OLS)")
ax.set_xlabel("Predicted")
ax.set_ylabel("Residual")
plt.show()

fig, ax = plt.subplots()
ax.hist(resid, bins=100)
ax.set_title("Residual histogram (OLS)")
ax.set_xlabel("Residual")
ax.set_ylabel("count")
plt.show()


## 5) A pragmatic fix: model the log of the loss amount

When targets are right-skewed, a common baseline improvement is to model:

\[
\log(1 + \text{loss})
\]

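Then transform predictions back with `expm1`. This often makes residuals behave better and can improve RMSE, but:
- it changes the loss function implicitly: the model now minimizes squared error on the log scale, not on the raw scale
- it can under-predict the extreme tail, because back-transforming a log-scale average lands below the raw-scale average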

We’ll use `TransformedTargetRegressor` so the transform is handled safely.
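
The under-prediction effect can be seen without any model. In this sketch (synthetic lognormal values, chosen only for illustration), averaging on the `log1p` scale and mapping back with `expm1` gives a value below the arithmetic mean, which follows from the concavity of the logarithm.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.lognormal(mean=4.0, sigma=1.5, size=50_000)

mean_raw = y.mean()                              # average on the raw scale
back_transformed = np.expm1(np.log1p(y).mean())  # average on the log scale, mapped back

print(f"raw-scale mean:          {mean_raw:10.2f}")
print(f"back-transformed average: {back_transformed:10.2f}")
```

A regression fit on `log1p(y)` behaves similarly: its back-transformed predictions tend to sit below the conditional mean, which matters most in the heavy tail.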


In [None]:
log_ols = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", TransformedTargetRegressor(
            regressor=LinearRegression(),
            func=np.log1p,
            inverse_func=np.expm1
        )),
    ]
)

log_ols.fit(X_train, y_train)
y_pred_log = log_ols.predict(X_test)

results_log = pd.concat(
    [results_ols, regression_report(y_test, y_pred_log, "OLS on log1p(loss)")],
    axis=0
)
results_log


## 6) Polynomial regression (carefully)

Polynomial terms can capture **nonlinearities** and **interactions** in a way that stays “linear in parameters.”  
But polynomial expansion with many features can explode the feature space.

So here we demonstrate polynomial regression on a **single predictor** (`transaction_amount`) as a controlled example.
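
To quantify the explosion, the number of monomials of total degree at most d in n inputs is C(n + d, d), including the constant term. The helper below (`n_poly_features` is a name introduced here for illustration) counts the columns a full polynomial expansion would produce.

```python
from math import comb

def n_poly_features(n_inputs, degree, include_bias=False):
    """Count the columns of a full polynomial expansion up to `degree`."""
    total = comb(n_inputs + degree, degree)  # includes the constant (bias) column
    return total if include_bias else total - 1

for n_inputs in (1, 5, 30):
    print(f"{n_inputs:>2} inputs, degree 3 -> {n_poly_features(n_inputs, 3)} columns")
```

With one input a cubic expansion adds only three columns, while thirty inputs already yield several thousand, before any one-hot columns are counted.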


In [None]:
# --- Polynomial regression demo: transaction_amount only ----------------------
# This is intentionally small + interpretable.
poly_X_train = X_train[["transaction_amount"]].copy()
poly_X_test  = X_test[["transaction_amount"]].copy()

poly_model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        # A light ridge penalty keeps the fit stable: raw powers of the input are strongly correlated.
        ("ridge", Ridge(alpha=1.0, random_state=RANDOM_STATE)),
    ]
)

poly_model.fit(poly_X_train, y_train)
poly_pred = poly_model.predict(poly_X_test)

poly_results = pd.concat(
    [results_log, regression_report(y_test, poly_pred, "Poly(deg=3) on transaction_amount")],
    axis=0
)
poly_results


## 7) Regularization: Ridge / Lasso / Elastic Net

With many correlated predictors (and one-hot categories), OLS can be unstable.  
Regularization adds a penalty term that controls coefficient magnitude:

- **Ridge (L2)** shrinks coefficients smoothly (good with correlated features)
- **Lasso (L1)** can drive some coefficients to 0 (feature selection)
- **Elastic Net** mixes both

We’ll tune hyperparameters with cross-validation.
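
As a small sketch of what the penalties do, the code below uses the closed-form ridge solution (no intercept) on synthetic data with two nearly collinear columns, and a soft-thresholding function, which is the lasso solution in the special case of orthonormal features. The data, `ridge_coefs`, and `soft_threshold` are illustrative assumptions, not part of the course code.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # two nearly collinear columns
beta_true = np.array([1.0, 1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def ridge_coefs(X, y, alpha):
    """Closed-form ridge without intercept: solve (X'X + alpha * I) b = X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def soft_threshold(z, t):
    """Shrink each value toward zero by t and clip at zero (L1 shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_coefs(X, y, a))) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha={a:6.1f}  ||beta||={nrm:.3f}")

print(soft_threshold(np.array([-3.0, 0.5, 2.0]), 1.0))
```

The coefficient norm never increases as `alpha` grows, which is the smooth shrinkage of ridge; soft-thresholding sets small coefficients exactly to zero, which is why lasso can act as feature selection.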


In [None]:
# --- Shared pipeline pieces ---------------------------------------------------
def make_reg_pipeline(regressor):
    return Pipeline(steps=[("preprocess", preprocess), ("model", regressor)])

param_grid = {
    "model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]
}

cv = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

# Ridge
ridge_pipe = make_reg_pipeline(Ridge(random_state=RANDOM_STATE))
ridge_search = GridSearchCV(ridge_pipe, param_grid=param_grid, cv=cv, scoring="neg_root_mean_squared_error", n_jobs=-1)
ridge_search.fit(X_train, y_train)

# Lasso (may need more iterations)
lasso_pipe = make_reg_pipeline(Lasso(max_iter=20000, random_state=RANDOM_STATE))
lasso_search = GridSearchCV(lasso_pipe, param_grid=param_grid, cv=cv, scoring="neg_root_mean_squared_error", n_jobs=-1)
lasso_search.fit(X_train, y_train)

# Elastic Net
enet_pipe = make_reg_pipeline(ElasticNet(max_iter=20000, random_state=RANDOM_STATE))
enet_grid = {
    "model__alpha": [0.01, 0.1, 1.0, 10.0],
    "model__l1_ratio": [0.1, 0.5, 0.9],
}
enet_search = GridSearchCV(enet_pipe, param_grid=enet_grid, cv=cv, scoring="neg_root_mean_squared_error", n_jobs=-1)
enet_search.fit(X_train, y_train)

print("Best Ridge:", ridge_search.best_params_)
print("Best Lasso:", lasso_search.best_params_)
print("Best ElasticNet:", enet_search.best_params_)


In [None]:
# --- Evaluate best regularized models on the test set -------------------------
ridge_best = ridge_search.best_estimator_
lasso_best = lasso_search.best_estimator_
enet_best  = enet_search.best_estimator_

ridge_pred = ridge_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)
enet_pred  = enet_best.predict(X_test)

final_results = pd.concat(
    [
        results_log,
        regression_report(y_test, ridge_pred, "Ridge (CV-tuned)"),
        regression_report(y_test, lasso_pred, "Lasso (CV-tuned)"),
        regression_report(y_test, enet_pred, "ElasticNet (CV-tuned)"),
    ],
    axis=0
)

final_results


## 8) High-cardinality `merchant_id`: a concrete experiment

`merchant_id` is high-cardinality. One-hot encoding can:
- create a **HUGE** sparse matrix
- memorize merchant effects (sometimes helpful, sometimes leakage-ish depending on context)
- increase variance if many merchants appear rarely

A quick sanity check is to compare performance **with** and **without** `merchant_id`.


In [None]:
# --- Compare a model with and without merchant_id -----------------------------
X_train_no_mid = X_train.drop(columns=["merchant_id"], errors="ignore")
X_test_no_mid  = X_test.drop(columns=["merchant_id"], errors="ignore")

cat_no_mid = X_train_no_mid.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_no_mid = [c for c in X_train_no_mid.columns if c not in cat_no_mid]

preprocess_no_mid = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_no_mid),
        ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                               ("onehot", make_ohe())]), cat_no_mid),
    ],
    remainder="drop",
    sparse_threshold=0.3,
)

ridge_no_mid = Pipeline(
    steps=[
        ("preprocess", preprocess_no_mid),
        ("model", Ridge(alpha=ridge_search.best_params_["model__alpha"], random_state=RANDOM_STATE)),
    ]
)
ridge_no_mid.fit(X_train_no_mid, y_train)
pred_no_mid = ridge_no_mid.predict(X_test_no_mid)

compare_mid = pd.concat(
    [
        regression_report(y_test, ridge_pred, "Ridge (with merchant_id)"),
        regression_report(y_test, pred_no_mid, "Ridge (without merchant_id)"),
    ],
    axis=1
).T

compare_mid


## Some notes

- **Zero inflation matters**: always benchmark against “predict 0” as well as “predict the mean.”
- **Pipelines are useful... maybe even required?**: preprocessing belongs inside the model pipeline to avoid leakage.
- **The metric is a choice**: MAE vs RMSE tells you whether you care more about typical errors or tail errors.
- **Linear models are baselines**: they are fast, interpretable, and useful even when they are misspecified.
- **High-cardinality categories... have to make decisions**: one-hot (with grouping), hashing, target encoding, or dropping the feature.
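
Of the options in the last bullet, hashing is the quickest to sketch. The example below maps each id to one of a fixed number of buckets with a stable hash; `hash_bucket`, the bucket count, and the generated ids are hypothetical, and this is a simplified stand-in rather than scikit-learn's `FeatureHasher`.

```python
import hashlib

def hash_bucket(value, n_buckets=32):
    """Map any id to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical ids: 10,000 distinct merchants collapse into at most 32 columns.
merchant_ids = [f"M{idx:05d}" for idx in range(10_000)]
buckets = [hash_bucket(m) for m in merchant_ids]

print("distinct ids:", len(set(merchant_ids)))
print("distinct buckets used:", len(set(buckets)))
```

The trade-off is collisions: unrelated merchants share a bucket, so the model learns a blended effect, but the feature width stays fixed and unseen merchants need no special handling.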

## Check-in: how are we doing here?
At this point you should be able to:
1. Load the canonical dataset and articulate why the regression target is structurally challenging.
2. Build a preprocessing pipeline for mixed numeric/categorical data.
3. Evaluate regression models using MAE, RMSE, and R².
4. Use residual plots to diagnose common issues.
5. Explain how Ridge/Lasso/ElasticNet change model behavior relative to OLS.
