# Module 3 — Supervised Learning: Classification (Open Exercise)

**File:** `03_02_exercise_open.ipynb`  
**Target:** `is_fraud` (binary)  
**Dataset:** `generate_transaction_risk_dataset(seed=1955)` (canonical)

---

## Open in Colab

> If you are viewing this notebook on GitHub, use the badge below to open it in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](REPLACE_WITH_GITHUB_COLAB_LINK)

---

## Deliverable (what you will submit)

By the end, your notebook must include:

1. **A single final model pipeline** (preprocessing + model) trained on the canonical dataset.  
2. A **Decision Log** documenting **two** modeling decisions you made (see menu below), each supported by evidence (metrics + at least one plot).  
3. Evaluation on a held-out validation split using **all** of the following:
   - Confusion matrix at a chosen threshold
   - Precision, Recall, F1
   - ROC-AUC
   - PR-AUC (Average Precision)

> Important: Fraud is **imbalanced**. Accuracy alone is not sufficient.
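
To see why accuracy misleads here, consider a minimal sketch on made-up labels (the 2% fraud rate and the sample size are assumptions for illustration, not properties of the canonical dataset). A model that never flags fraud scores high accuracy while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1,000 transactions with a 2% fraud rate (made-up numbers)
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1

# A "model" that never flags fraud
y_never = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_never)
rec = recall_score(y_true, y_never, zero_division=0)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.98, recall=0.00
```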


In [None]:
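# Colab-first setup (run this cell first)
# Skip the clone if the repo is already present, so re-running the cell does not fail.
!test -d way || git clone https://github.com/tunnel-ai/way.git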

import sys
sys.path.insert(0, "/content/way/src")

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    RocCurveDisplay, PrecisionRecallDisplay
)

import matplotlib.pyplot as plt

from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

In [None]:
# Generate canonical dataset (do not modify the generator)
df = generate_transaction_risk_dataset(seed=1955)

print(df.shape)
df.head()

## 1) Define target, features, and a fixed split

- Target: `is_fraud`
- Use a **stratified** split because fraud is rare.
- Keep the split fixed so your results are reproducible.


In [None]:
TARGET = "is_fraud"

# Neither target column may be used as a feature.
# - is_fraud is the label for this exercise.
# - transaction_loss_amount MUST be excluded: it is only known after the event (leakage).
DROP_ALWAYS = [TARGET, "transaction_loss_amount"]

X = df.drop(columns=DROP_ALWAYS)
y = df[TARGET].astype(int)

# Stratified split (fixed random_state for reproducibility)
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Train fraud rate:", y_train.mean())
print("Val fraud rate:  ", y_val.mean())

## 2) Baseline(s)

Compute at least one baseline that respects class imbalance.

Suggested baselines:
- **Always predict non-fraud** (`ŷ = 0`)
- (Optional) Predict fraud with the **base rate** probability

Report: confusion matrix + precision/recall/F1 + ROC-AUC + PR-AUC where applicable.
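
One possible reading of the optional base-rate baseline is a constant score equal to the training fraud rate. The sketch below uses scikit-learn's `DummyClassifier(strategy="prior")` on hypothetical stand-in labels; the `*_demo` names and the 3% rate are assumptions, and in the exercise you would pass `X_train, y_train` and `X_val, y_val` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical stand-in labels (3% positives); replace with y_train / y_val
y_train_demo = np.array([1] * 120 + [0] * 3880)
y_val_demo = np.array([1] * 30 + [0] * 970)
# Dummy models ignore the features, so a placeholder column is enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_val_demo = np.zeros((len(y_val_demo), 1))

# strategy="prior" predicts the training base rate as P(fraud) for every row
base = DummyClassifier(strategy="prior").fit(X_train_demo, y_train_demo)
p_base = base.predict_proba(X_val_demo)[:, 1]

roc_base = roc_auc_score(y_val_demo, p_base)
ap_base = average_precision_score(y_val_demo, p_base)
print({"p_fraud": p_base[0], "roc_auc": roc_base, "pr_auc": ap_base})
```

Because the score is constant, it cannot rank transactions: ROC-AUC is 0.5 and PR-AUC equals the validation fraud rate. Any real model should beat both numbers.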


In [None]:
# Baseline: always predict non-fraud
y_pred0 = np.zeros_like(y_val)
cm0 = confusion_matrix(y_val, y_pred0)

precision0 = precision_score(y_val, y_pred0, zero_division=0)
recall0 = recall_score(y_val, y_pred0, zero_division=0)
f10 = f1_score(y_val, y_pred0, zero_division=0)

# ROC-AUC / PR-AUC need scores; this baseline uses a constant score of 0.
# A constant score cannot rank anything: expect ROC-AUC = 0.5 and PR-AUC = the validation fraud rate.
y_prob0 = np.zeros_like(y_val, dtype=float)
roc0 = roc_auc_score(y_val, y_prob0)
pr0 = average_precision_score(y_val, y_prob0)

print("Confusion matrix (always non-fraud):\n", cm0)
print({"precision": precision0, "recall": recall0, "f1": f10, "roc_auc": roc0, "pr_auc": pr0})

## 3) Decision menu (choose **two**)

You must choose **two** decisions from this menu, implement them, and justify them with evidence.

**Decision A — High-cardinality `merchant_id` handling**
- A1) Drop `merchant_id`
- A2) One-hot encode top-K most frequent merchants + "other"
- A3) Frequency encoding (count-based, target-agnostic)

**Decision B — Model family**
- B1) Logistic Regression (interpretable baseline)
- B2) Decision Tree (nonlinear rules)
- B3) Random Forest / Gradient Boosting (stronger performance, less interpretable)

**Decision C — Threshold rule**
- C1) Use 0.50
- C2) Choose threshold to maximize F1
- C3) Choose threshold using a simple **cost tradeoff** (FP cost vs FN cost)

**Decision D — Feature set**
- D1) Transaction context only
- D2) Context + customer/device signals
- D3) Exclude suspicious / noisy features (state your rationale)

---

### Decision Log (fill this in as you work)

- Decision 1: ___ (A/B/C/D + option)  
  Rationale: ___  
  Evidence: metrics + plot(s) ___

- Decision 2: ___ (A/B/C/D + option)  
  Rationale: ___  
  Evidence: metrics + plot(s) ___


## 4) Build a preprocessing pipeline (avoid leakage)

Rules:
- Do **not** manually fit encoders/scalers on the full dataset.
- Use a `Pipeline` + `ColumnTransformer` so preprocessing is fit on **train only**.


In [None]:
# Identify column types
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
num_cols = [c for c in X_train.columns if c not in cat_cols]

# TODO (Decision A): decide how to handle merchant_id
# Option A1: drop merchant_id
# Option A2: one-hot encode the top-K most frequent merchants + "other"
# Option A3: frequency encoding (target-agnostic)
#
# NOTE: One-hot encoding every merchant_id value can produce a very wide matrix;
# restricting it to the top-K merchants keeps the width bounded.

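# Example: drop merchant_id (A1)
# Check the frame itself, so the drop also works if merchant_id is stored as a number.
if "merchant_id" in X_train.columns:
    # Comment this block out if you *do* decide to encode merchant_id
    X_train = X_train.drop(columns=["merchant_id"])
    X_val = X_val.drop(columns=["merchant_id"])
    cat_cols = [c for c in cat_cols if c != "merchant_id"]
    num_cols = [c for c in num_cols if c != "merchant_id"]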

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols),
    ],
    remainder="drop"
)

preprocess

## 5) Choose a model (Decision B)

Pick **one** as your final model, but you may compare 2–3 quickly.

Recommended starting point:
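- LogisticRegression with `class_weight="balanced"` (weights each class inversely to its frequency, so the rare fraud class is not ignored; this also inflates predicted fraud probabilities, which matters when you pick a threshold)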

Then compare to:
- DecisionTreeClassifier
- RandomForestClassifier
- HistGradientBoostingClassifier


In [None]:
# TODO (Decision B): choose your model
# Start with logistic regression as a baseline.
model = LogisticRegression(
    max_iter=2000,
    class_weight="balanced"
)

# Example alternatives (uncomment to try):
# model = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=42)
# model = RandomForestClassifier(n_estimators=300, random_state=42, class_weight="balanced", n_jobs=-1)
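# model = HistGradientBoostingClassifier(max_depth=6, learning_rate=0.1, random_state=42)
#   (needs dense input: if preprocess returns a sparse matrix, pass sparse_threshold=0 to ColumnTransformer)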

clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", model)
])

clf

In [None]:
# Fit
clf.fit(X_train, y_train)

# Predicted probabilities for the positive class
y_prob = clf.predict_proba(X_val)[:, 1]

# Default threshold (you may change this in Decision C)
threshold = 0.50
y_pred = (y_prob >= threshold).astype(int)

# Core metrics
cm = confusion_matrix(y_val, y_pred)
precision = precision_score(y_val, y_pred, zero_division=0)
recall = recall_score(y_val, y_pred, zero_division=0)
f1 = f1_score(y_val, y_pred, zero_division=0)
roc = roc_auc_score(y_val, y_prob)
pr = average_precision_score(y_val, y_prob)

print("Confusion matrix @ threshold =", threshold, "\n", cm)
print({"precision": precision, "recall": recall, "f1": f1, "roc_auc": roc, "pr_auc": pr})

## 6) Diagnostics: ROC and Precision–Recall curves

You must include at least one diagnostic plot in your evidence.
For imbalanced classification, the **PR curve** is often more informative than ROC.


In [None]:
# Pass ax= explicitly; otherwise each display opens its own figure and figsize is ignored.
fig, ax = plt.subplots(figsize=(6, 4))
RocCurveDisplay.from_predictions(y_val, y_prob, ax=ax)
ax.set_title("ROC Curve")
plt.show()

fig, ax = plt.subplots(figsize=(6, 4))
PrecisionRecallDisplay.from_predictions(y_val, y_prob, ax=ax)
ax.set_title("Precision–Recall Curve")
plt.show()

## 7) Thresholding (Decision C)

Pick one approach:
- **C1:** Use 0.50
- **C2:** Choose threshold that maximizes F1 on validation
- **C3:** Choose threshold using a simple cost tradeoff

For C3, define:
- cost_fp: cost of investigating a legitimate transaction
- cost_fn: cost of missing a fraud event

Then choose the threshold that minimizes expected cost.
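
As a sanity check on the sweep, there is a closed-form threshold for the idealized case where `y_prob` is a calibrated probability. That is an assumption, and `class_weight="balanced"` generally breaks it. Write $c_{\mathrm{FP}}$ for `cost_fp` and $c_{\mathrm{FN}}$ for `cost_fn`. Flagging a transaction with fraud probability $p$ costs $(1-p)\,c_{\mathrm{FP}}$ in expectation, and not flagging it costs $p\,c_{\mathrm{FN}}$, so flagging is the cheaper action when:

$$
(1-p)\,c_{\mathrm{FP}} < p\,c_{\mathrm{FN}}
\quad\Longleftrightarrow\quad
p > \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}}
$$

For example, with $c_{\mathrm{FP}} = 1$ and $c_{\mathrm{FN}} = 25$ the threshold would be $1/26 \approx 0.038$. If the empirical sweep lands far from this value, the gap is a hint that the scores are not calibrated.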


In [None]:
# Threshold sweep
thresholds = np.linspace(0.01, 0.99, 99)

rows = []
for t in thresholds:
    yp = (y_prob >= t).astype(int)
    rows.append({
        "threshold": t,
        "precision": precision_score(y_val, yp, zero_division=0),
        "recall": recall_score(y_val, yp, zero_division=0),
        "f1": f1_score(y_val, yp, zero_division=0),
    })

thr_df = pd.DataFrame(rows)

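# TODO (Decision C2): pick the threshold that maximizes F1
# Note: the threshold is tuned on the same validation split it is scored on, so this F1 is optimistic.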
best_row = thr_df.loc[thr_df["f1"].idxmax()]
best_t = float(best_row["threshold"])
best_row, best_t

In [None]:
# OPTIONAL (Decision C3): cost-based thresholding
# Set these numbers to reflect a plausible tradeoff (you choose).
cost_fp = 1.0    # e.g., cost of reviewing a false alert
cost_fn = 25.0   # e.g., cost of missing a fraud event

cost_rows = []
for t in thresholds:
    yp = (y_prob >= t).astype(int)
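    # labels=[0, 1] guarantees a 2x2 matrix even if yp contains a single class
    tn, fp, fn, tp = confusion_matrix(y_val, yp, labels=[0, 1]).ravel()
    # Total cost over the validation split at this threshold
    expected_cost = cost_fp * fp + cost_fn * fn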
    cost_rows.append({"threshold": t, "fp": fp, "fn": fn, "expected_cost": expected_cost})

cost_df = pd.DataFrame(cost_rows)
best_cost_row = cost_df.loc[cost_df["expected_cost"].idxmin()]
best_cost_row

## 8) Final Decision Log (required)

In 6–12 sentences total, record your two decisions and the evidence.

- Decision 1:  
- Decision 2:  

Include:
- The final threshold you used and why
- The final model family and why
- Whether you included `merchant_id` and why


## 9) Extend (optional)

If you finish early, do **one** of the following:

- Compare two feature sets (context only vs context + customer/device)
- Compare logistic regression vs one tree-based model
- Add `merchant_id` in a careful way and discuss overfitting risk
- Report and interpret the *precision at top-K* flagged transactions (operational metric)
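
For the last option, precision at top-K asks: of the K transactions the model scores highest, what fraction are actually fraud? A minimal sketch follows; `precision_at_k` is a hypothetical helper (not part of scikit-learn) and the scores and labels are made up.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Fraction of true positives among the k highest-scored rows."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Indices of the k largest scores (stable sort keeps tie order deterministic)
    top_idx = np.argsort(-y_score, kind="stable")[:k]
    return float(y_true[top_idx].mean())

# Hypothetical scores and labels (made-up numbers)
y_true_demo = np.array([0, 1, 0, 1, 0, 0, 1, 0])
y_score_demo = np.array([0.10, 0.90, 0.20, 0.40, 0.80, 0.05, 0.70, 0.30])

p_at_3 = precision_at_k(y_true_demo, y_score_demo, k=3)
print(round(p_at_3, 3))  # 0.667: two of the top three scores are fraud
```

In the exercise you would call `precision_at_k(y_val, y_prob, k)` with K set to the number of alerts a review team could plausibly handle.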
