[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/03_01_exercise_guided.ipynb)


In [None]:
# --- Course setup (uncomment and run if using Colab) --------------------------
#!git clone https://github.com/tunnel-ai/way.git
#import sys; sys.path.insert(0, "/content/way/src")


# Module 3 — Exercise 01 (Guided): Fraud Classification

**Role:** Guided practitioner (structured adaptation)

**Target:** `is_fraud`  
**Canonical dataset:** `generate_transaction_risk_dataset(seed=1955)`

### What you will do
You will adapt the Module 3 workflow by completing **targeted TODOs**.

By the end, you should be able to:
- Build a leakage-safe feature matrix for a binary classification target
- Train a baseline and a logistic regression pipeline
- Evaluate using **Precision–Recall** and **ROC** metrics (important for imbalanced data)
- Choose a decision threshold (not just a probability model)

> **Rules:** Use the canonical generator exactly as provided. Do not modify generator code.


In [None]:
# --- Imports ------------------------------------------------------------------
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_auc_score,
    average_precision_score,
    RocCurveDisplay,
    PrecisionRecallDisplay,
    precision_recall_curve,
    f1_score,
    precision_score,
    recall_score,
    accuracy_score,
)

RANDOM_STATE = 1955
np.random.seed(RANDOM_STATE)


In [None]:
# --- Load the canonical dataset ------------------------------------------------
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=1955)
print(df.shape)
df.head()

## 1) Define target and leakage-safe features

For **classification**, the target is `is_fraud`.

**Leakage guardrail:** do *not* include variables that directly encode the outcome.
- `transaction_loss_amount` is mechanically tied to fraud (it is >0 only when fraud occurs), so including it would leak the label.

### TODO 1
Create `y` as the `is_fraud` column and `X` as the remaining features **excluding**:
- `is_fraud`
- `transaction_loss_amount`

Then print:
- `y.mean()` (fraud rate)
- the number of columns in `X`
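
As a shape reference only, here is a minimal sketch of the same split on a tiny hand-made frame. The columns `amount` and `channel` and all values are invented for illustration; only `is_fraud` and `transaction_loss_amount` are names from this exercise.

```python
import pandas as pd

# Toy frame: `amount` and `channel` are hypothetical feature columns.
toy = pd.DataFrame({
    "amount": [12.0, 250.0, 40.0, 980.0],
    "channel": ["web", "pos", "web", "app"],
    "transaction_loss_amount": [0.0, 0.0, 0.0, 980.0],
    "is_fraud": [0, 0, 0, 1],
})

target = "is_fraud"
leakage_exclude = ["transaction_loss_amount"]

# The target becomes y; the target and the leaking column are dropped from X.
y_toy = toy[target]
X_toy = toy.drop(columns=[target] + leakage_exclude)

print("Fraud rate:", y_toy.mean())         # 0.25
print("X columns:", list(X_toy.columns))   # ['amount', 'channel']
```

Keeping the excluded names in a list makes it easy to add further leaking columns later without touching the drop logic.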


In [None]:
# TODO 1: Define X and y (leakage-safe)
TARGET = "is_fraud"
LEAKAGE_EXCLUDE = ["transaction_loss_amount"]

# --- your code here ------------------------------------------------------------
# y = ...
# X = ...

print("Fraud rate:", y.mean())
print("X columns:", X.shape[1])
X.head()

## 2) Train/validation split (stratified)

Fraud is rare, so we use a **stratified split** to keep the fraud rate the same in the training and validation sets.

### TODO 2
Create `X_train, X_val, y_train, y_val` using `train_test_split` with:
- `test_size=0.25`
- `random_state=RANDOM_STATE`
- `stratify=y`

Then print the fraud rate in train and validation.
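
To see what `stratify` does, the sketch below splits synthetic labels (40 positives in 1,000 rows, an invented 4% rate) with a placeholder feature. It is an illustration on made-up data, not the exercise dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 40 positives out of 1,000 rows.
y_demo = np.array([1] * 40 + [0] * 960)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# With stratification, both splits keep roughly the original positive rate.
print("Train rate:", y_tr.mean())
print("Val rate:  ", y_va.mean())
```

Without `stratify`, a small validation set can end up with noticeably more or fewer positives than the training set purely by chance.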


In [None]:
# TODO 2: Stratified split
# X_train, X_val, y_train, y_val = ...

print("Train fraud rate:", y_train.mean())
print("Val fraud rate:  ", y_val.mean())

## 3) Baseline model

A good baseline for rare-event classification is the **majority-class** classifier.

### TODO 3
Fit a `DummyClassifier(strategy="most_frequent")` and report:
- Accuracy
- Precision
- Recall
- F1

*Note:* The baseline may have high accuracy but terrible recall — that’s the point.
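
The arithmetic behind that note can be checked on synthetic labels. The sketch below assumes an invented 2% positive rate and a constant placeholder feature.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 20 positives out of 1,000 rows.
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = dummy.predict(X_demo)

acc = accuracy_score(y_demo, pred)                  # 980 / 1000 = 0.98
rec = recall_score(y_demo, pred, zero_division=0)   # no positives predicted -> 0.0
print("Accuracy:", acc, "Recall:", rec)
```

Any model worth keeping has to beat this accuracy while also catching positives, which is why the later sections lean on precision and recall.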


In [None]:
# TODO 3: Baseline classifier
baseline = DummyClassifier(strategy="most_frequent")

# Fit
# baseline.fit(...)

# Predict class labels
# y_pred_base = ...

print("Accuracy:", accuracy_score(y_val, y_pred_base))
print("Precision:", precision_score(y_val, y_pred_base, zero_division=0))
print("Recall:", recall_score(y_val, y_pred_base, zero_division=0))
print("F1:", f1_score(y_val, y_pred_base, zero_division=0))

print("\nConfusion matrix:\n", confusion_matrix(y_val, y_pred_base))

## 4) Preprocessing pipeline (impute + encode + scale)

We will build a single sklearn **Pipeline** so that:
- Imputation, scaling, and encoding parameters are learned **only on training data**
- No validation-set statistics leak into preprocessing during evaluation

### TODO 4
1) Identify:
- `numeric_features`
- `categorical_features`

2) Create a `ColumnTransformer` named `preprocess` with:
- Numeric: `SimpleImputer(strategy="median")` then `StandardScaler()`
- Categorical: `SimpleImputer(strategy="most_frequent")` then `OneHotEncoder(handle_unknown="ignore")`

Hint: `X_train.dtypes` is useful.
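
One possible way to derive the column lists and wire the transformer is sketched below on a toy frame. The columns `amount`, `hour`, and `channel` are hypothetical, and `select_dtypes` is just one option for splitting by dtype.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing numeric value and one missing category.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 50.0],
    "hour": [1, 13, 22, 8],
    "channel": ["web", "pos", np.nan, "web"],
})

# Split columns by dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer(transformers=[
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

out = pre.fit_transform(toy)
print(num_cols, cat_cols)
print(out.shape)  # 2 numeric columns + 2 one-hot columns
```

Check the resulting lists by eye on the real data: an integer-coded category would land in the numeric list under this dtype rule.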


In [None]:
# TODO 4: Column lists and preprocess transformer
# numeric_features = ...
# categorical_features = ...

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# preprocess = ColumnTransformer(...)

preprocess

## 5) Logistic regression model (interpretable baseline)

### TODO 5
Create a pipeline:
- `("preprocess", preprocess)`
- `("model", LogisticRegression(max_iter=2000, class_weight="balanced"))`

Then:
- Fit on training data
- Predict probabilities on validation (`predict_proba`) for ROC/PR metrics
- Compute **ROC-AUC** and **PR-AUC** (Average Precision)
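
The fit, score, and evaluate pattern is sketched below on synthetic data from `make_classification`. A plain `StandardScaler` stands in for the full preprocessing step, and the dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem (about 5% positives); not the exercise data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, n_informative=4, n_clusters_per_class=1,
    class_sep=1.5, weights=[0.95, 0.05], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
])
clf.fit(X_tr, y_tr)

# predict_proba returns one column per class; pick the column for label 1.
pos_col = list(clf.named_steps["model"].classes_).index(1)
prob = clf.predict_proba(X_va)[:, pos_col]

roc = roc_auc_score(y_va, prob)
ap = average_precision_score(y_va, prob)
print("ROC-AUC:", roc)
print("PR-AUC (Avg Precision):", ap)
```

Both metrics take scores rather than class labels, so passing the output of `predict` instead of `predict_proba` would silently give a coarser result.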


In [None]:
# TODO 5: Logistic regression pipeline
log_reg = Pipeline(steps=[
    # ("preprocess", ...),
    # ("model", ...),
])

# Fit
# log_reg.fit(...)

# Probabilities for the positive class
# y_prob_lr = ...

print("ROC-AUC:", roc_auc_score(y_val, y_prob_lr))
print("PR-AUC (Avg Precision):", average_precision_score(y_val, y_prob_lr))

### Visualize ROC and Precision–Recall curves

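For imbalanced classification, the **Precision–Recall** curve is often more informative than ROC, because the false positive rate on the ROC axis is diluted by the large number of legitimate transactions.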

### TODO 6
Plot both curves using `RocCurveDisplay.from_predictions` and `PrecisionRecallDisplay.from_predictions`.
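
When reading the two curves, it helps to know what an uninformative model looks like on each. The sketch below scores synthetic labels (an invented 2% positive rate) with pure noise: ROC-AUC lands near 0.5, while average precision lands near the positive rate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 20_000

# Synthetic labels with roughly 2% positives, scored by random noise.
y_demo = (rng.random(n) < 0.02).astype(int)
noise_scores = rng.random(n)

prevalence = y_demo.mean()
roc = roc_auc_score(y_demo, noise_scores)
ap = average_precision_score(y_demo, noise_scores)

print("Positive rate:        ", prevalence)
print("ROC-AUC (noise):      ", roc)   # close to 0.5
print("Avg precision (noise):", ap)    # close to the positive rate
```

So a PR-AUC should be judged against the fraud rate, not against 0.5.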


In [None]:
# TODO 6: ROC and PR curves
# Create the Axes explicitly and pass it in; otherwise each display opens its own figure.
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_val, y_prob_lr, ax=ax)
ax.set_title("Logistic Regression: ROC Curve")
plt.show()

fig, ax = plt.subplots()
PrecisionRecallDisplay.from_predictions(y_val, y_prob_lr, ax=ax)
ax.set_title("Logistic Regression: Precision–Recall Curve")
plt.show()

## 6) Turn probabilities into decisions (choose a threshold)

A classifier outputs probabilities, but operations often require **binary decisions** (flag / don’t flag).

### TODO 7
Use `precision_recall_curve` to compute precision, recall, thresholds.

Then choose a threshold using **one** of these rules:
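- **Rule A:** largest threshold that still achieves **recall ≥ 0.80** (recall can only fall as the threshold rises, so this flags the fewest transactions while meeting the recall floor)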
- **Rule B:** threshold that **maximizes F1**

After choosing a threshold, compute and print:
- confusion matrix
- precision / recall / F1

*This is the key skill:* moving from model score to decision policy.
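
The mechanics of picking a threshold from the curve arrays are sketched below on synthetic scores (all values invented). Because recall can only fall as the threshold rises, a recall floor is met by every threshold up to some cutoff; the sketch takes the highest such threshold, and separately the threshold that maximizes F1.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve, recall_score

rng = np.random.default_rng(0)

# Synthetic labels (about 5% positives) and scores that are higher for positives.
y_demo = (rng.random(5000) < 0.05).astype(int)
raw = rng.normal(loc=1.5 * y_demo, scale=1.0)
scores = 1.0 / (1.0 + np.exp(-raw))  # squash into (0, 1)

prec, rec, thresh = precision_recall_curve(y_demo, scores)
p, r = prec[:-1], rec[:-1]  # drop the final point so p, r align with thresh

# Recall floor: thresholds are ascending, so the last qualifying index is the largest.
ok = np.where(r >= 0.80)[0]
thr_recall = thresh[ok[-1]]

# Max F1: compute F1 at every threshold, guarding against division by zero.
f1_all = 2 * p * r / np.maximum(p + r, 1e-12)
thr_f1 = thresh[np.argmax(f1_all)]

for name, thr in [("recall floor", thr_recall), ("max F1", thr_f1)]:
    pred = (scores >= thr).astype(int)
    print(name, "threshold:", thr,
          "recall:", recall_score(y_demo, pred),
          "F1:", f1_score(y_demo, pred))
```

The two rules usually give different thresholds, which is a useful prompt for asking what a missed fraud costs relative to a false alarm.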


In [None]:
# TODO 7: Threshold selection
prec, rec, thresh = precision_recall_curve(y_val, y_prob_lr)

# Note: thresh has one fewer element than prec and rec
# (the final point, precision=1 and recall=0, has no threshold).
# Align the arrays before indexing: prec[:-1], rec[:-1], thresh

# --- your code here ------------------------------------------------------------
# chosen_threshold = ...

y_pred_thresh = (y_prob_lr >= chosen_threshold).astype(int)

print("Chosen threshold:", chosen_threshold)
print("Precision:", precision_score(y_val, y_pred_thresh, zero_division=0))
print("Recall:", recall_score(y_val, y_pred_thresh, zero_division=0))
print("F1:", f1_score(y_val, y_pred_thresh, zero_division=0))
print("\nConfusion matrix:\n", confusion_matrix(y_val, y_pred_thresh))

## 7) A nonlinear alternative: Decision Tree

A single decision tree can capture **interactions** (e.g., risk score × channel × time), which logistic regression cannot represent unless interaction terms are added by hand.

### TODO 8
Fit a `DecisionTreeClassifier` inside the same preprocessing pipeline.
Use a small tree to avoid extreme overfitting:
- `max_depth=4`
- `min_samples_leaf=200`

Report ROC-AUC and PR-AUC on the validation set.
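
One thing to expect from a small tree is sketched below on synthetic data, with a median imputer standing in for the full preprocessing step. A tree of depth 4 has at most 16 leaves, so its predicted probabilities take at most 16 distinct values and the ROC and PR curves come out as a few straight segments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem (about 10% positives); not the exercise data.
X_demo, y_demo = make_classification(
    n_samples=8000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

tree_demo = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(max_depth=4, min_samples_leaf=200, random_state=0)),
])
tree_demo.fit(X_tr, y_tr)

prob = tree_demo.predict_proba(X_va)[:, 1]  # classes_ is [0, 1] here
n_leaves = tree_demo.named_steps["model"].get_n_leaves()
n_distinct = np.unique(prob).size

roc = roc_auc_score(y_va, prob)
ap = average_precision_score(y_va, prob)
print("Leaves:", n_leaves, "Distinct probabilities:", n_distinct)
print("Tree ROC-AUC:", roc)
print("Tree PR-AUC:", ap)
```

Many tied scores also mean that threshold choices are limited to a handful of cut points, unlike the near-continuous scores from logistic regression.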


In [None]:
# TODO 8: Decision tree pipeline
tree_clf = Pipeline(steps=[
    # ("preprocess", preprocess),
    # ("model", DecisionTreeClassifier(...)),
])

# tree_clf.fit(...)
# y_prob_tree = ...

print("Tree ROC-AUC:", roc_auc_score(y_val, y_prob_tree))
print("Tree PR-AUC:", average_precision_score(y_val, y_prob_tree))

## 8) Short reflection (answer in 3–6 sentences each)

1) Why can accuracy be misleading for rare-event classification?
2) Which curve (ROC or Precision–Recall) changed your interpretation more, and why?
3) What did threshold selection force you to think about that ROC-AUC alone does not?

*(Write responses below as Markdown.)*


### Reflection

1) 

2) 

3) 
