[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/01_00_main.ipynb)


In [None]:
# --- Course setup (uncomment and run if using Colab) ---
#!git clone https://github.com/tunnel-ai/way.git
#import sys; sys.path.insert(0, "/content/way/src")


# Module 1 — AI, ML, and Data Science (Demo Notebook)

This is the **instructor demo notebook** for Module 1. It is designed to run **top-to-bottom** with no external downloads (everything it needs is imported from the course package in the repo).

**Canonical dataset:** synthetic transactional risk data generated deterministically via a local generator. We will use portions of this dataset in several modules. 

- Classification target (Module 1 focus): `is_fraud`
- Regression target (Module 2): `transaction_loss_amount` (introduced briefly, used later)
- Unsupervised (Module 4): behavioral structure + anomaly detection (used later)

> A rough rule: we *mostly* rely on generated data. This is not a course on data types, and time spent discussing data dictionaries is time not spent modeling. So we try to balance new, interesting data against getting to modeling quickly.


## 0) Setup and data generation

**Timebox:** ~10 minutes

We import the course generator and materialize the dataset with a fixed seed.


In [None]:
# Core libraries
import numpy as np
import pandas as pd

# Reproducibility
SEED = 1955
np.random.seed(SEED)

# Course dataset generator (local package)
from core.generators.transaction_risk_dgp import (
    generate_transaction_risk_dataset,
    TransactionRiskConfig,
    dataset_summary,
)

# Generate the canonical dataset (deterministic)
df = generate_transaction_risk_dataset(seed=SEED)

# Quick sanity check
dataset_summary(df)


## 1) Data dictionary (what each field means)

**Timebox:** ~10 minutes

A *data dictionary* is a practical artifact: it helps teams reason about what the data represents, what is measured, and what might be missing.

We will use the same dataset across multiple modules, but we will treat it *differently* depending on the learning goal.


In [None]:
df.dtypes

In [None]:
data_dictionary = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "example_value": [df[c].dropna().iloc[0] if df[c].notna().any() else None for c in df.columns],
})
data_dictionary


### Notes about realism and some cheating, I mean, "teaching affordances"

This dataset is synthetic, but it intentionally includes realistic properties that matter for modeling practice:

- **Class imbalance**: fraud is rare (≈ 4% by default)
- **High-cardinality merchant IDs**: thousands of unique merchants with a power-law distribution
- **MNAR missingness**: missing `device_type` and sometimes `merchant_category` is *not random*
- **Multiple targets**: a classification target (`is_fraud`) and a regression target (`transaction_loss_amount`)
- **Optional post-event fields**: `chargeback_flag` and `manual_review_score` are present to teach **data leakage** (we will *not* use them for modeling in Module 1)
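
One quick way to check whether missingness carries signal is to compare the target rate between rows where a field is missing and rows where it is present. The sketch below uses a small hand-made frame (the values are illustrative assumptions, not drawn from the course dataset); the same pattern should apply to `df` and `device_type`.

```python
import pandas as pd

# Toy frame with made-up values (NOT the course dataset)
toy = pd.DataFrame({
    "device_type": ["mobile", None, "desktop", None, "mobile", "desktop", None, "mobile"],
    "is_fraud":    [0, 1, 0, 1, 0, 0, 0, 0],
})

# Fraud rate grouped by whether device_type is missing (False sorts before True)
rate_by_missing = toy.groupby(toy["device_type"].isna())["is_fraud"].mean()
rate_by_missing.index = ["present", "missing"]

print(rate_by_missing)
```

If the two rates differ noticeably, dropping rows with missing values would discard information, which is one reason to keep an explicit "Missing" category instead.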


## 2) Quick profiling: distributions, missingness, and class balance

**Timebox:** ~10 minutes


In [None]:
df.shape, df.head()

In [None]:
missing = df.isna().mean().sort_values(ascending=False)
missing.to_frame("pct_missing").head(15)


In [None]:
fraud_rate = df["is_fraud"].mean()
fraud_rate


In [None]:
num_cols = [
    "transaction_amount",
    "transactions_last_24h",
    "transactions_last_7d",
    "time_since_last_transaction_minutes",
    "avg_transaction_amount_30d",
    "std_transaction_amount_30d",
]
df[num_cols].describe().T


## 3) Define the modeling problem for Module 1

**Timebox:** ~5 minutes

In Module 1, we treat this as a **static, tabular classification** problem:

- **Goal:** predict `is_fraud` from available transaction context
- **Non-goals (for now):** time series forecasting, streaming detection, delayed labels

We will also explicitly avoid leakage features.


In [None]:
TARGET = "is_fraud"

# Columns that would leak the target: post-event fields and the latent fraud probability
LEAKAGE_COLS = ["chargeback_flag", "manual_review_score", "fraud_probability_latent"]

# Identifiers often excluded; revisit later in the course if desired
ID_COLS = ["transaction_id"]

y = df[TARGET].astype(int)

# Drop the target, leakage columns, and identifiers (only those actually present)
drop_cols = [TARGET] + [c for c in LEAKAGE_COLS + ID_COLS if c in df.columns]
X = df.drop(columns=drop_cols)

X.shape, y.shape


## 4) Train/test split (and why it matters)

**Timebox:** ~5 minutes

We split once, then treat the test set as **locked**. In real practice, you tune on validation data and only evaluate on the test set at the end.
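
The validation step mentioned above can be sketched as a second split of the non-test data. This is a minimal illustration on random toy arrays; the 60/20/20 proportions and the `*_toy` names are assumptions for the example, not the course's prescribed setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, roughly 10% positives (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.1).astype(int)

# Step 1: carve out the locked test set (20% of all rows)
X_dev_toy, X_test_toy, y_dev_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Step 2: split the remainder into train and validation
# (25% of the remaining 80% is 20% of the original data)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_dev_toy, y_dev_toy, test_size=0.25, random_state=0, stratify=y_dev_toy
)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

Model and threshold choices would then be made on the validation rows, with the test rows touched once at the end.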


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=SEED,
    stratify=y
)

X_train.shape, X_test.shape


## 5) Preprocessing with pipelines (important engineering pattern)

**Timebox:** ~20 minutes

Key idea: **fit preprocessing only on training data**, and package preprocessing + model into a single pipeline.

We will:
- impute missing values
- one-hot encode low-cardinality categoricals
- handle high-cardinality categoricals carefully (Merchant IDs)
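
To make "fit on training data only" concrete, here is a small sketch with toy arrays (the values are made up for illustration). The imputer learns its median from the training rows and reuses it on the test rows, so the extreme test value never influences the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column arrays (illustrative values only)
toy_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
toy_test = np.array([[np.nan], [100.0]])

# Fit on training rows only: the learned median ignores NaN and the test data
toy_imputer = SimpleImputer(strategy="median").fit(toy_train)

# Transform test rows using the training median
toy_test_imputed = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_, toy_test_imputed.ravel())
```

Wrapping the imputer in a `Pipeline` gives the same behavior automatically: `fit` only sees the training rows, and `predict` reuses the stored statistics.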


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction import FeatureHasher
from sklearn.base import BaseEstimator, TransformerMixin

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()

numeric_features, categorical_features[:10], len(categorical_features)


In [None]:
high_cardinality = [c for c in categorical_features if c in ["merchant_id", "merchant_name"]]
low_cardinality = [c for c in categorical_features if c not in high_cardinality]

high_cardinality, low_cardinality[:10]


### Preprocessing strategy (Module 1 baseline)

- Numeric: median imputation
- Low-cardinality categoricals: impute missing as "Missing", then one-hot encode
- High-cardinality categoricals: hash encoding (feature hashing)

Feature hashing is a pragmatic way to deal with very large categorical vocabularies without exploding dimensionality.
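
As a small illustration of the idea, the sketch below hashes a few made-up merchant IDs into 8 columns (a deliberately tiny `n_features`, chosen only so the output is easy to inspect). The output width is fixed no matter how many distinct IDs appear, and the same ID always maps to the same column; distinct IDs may collide.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical merchant IDs, one dict per row
toy_rows = [
    {"merchant_id": "M001"},
    {"merchant_id": "M002"},
    {"merchant_id": "M001"},
]

# String values are hashed as "merchant_id=<value>" features with value 1
toy_hasher = FeatureHasher(n_features=8, input_type="dict")
toy_hashed = toy_hasher.transform(toy_rows)  # scipy sparse matrix

toy_dense = toy_hashed.toarray()
print(toy_dense.shape)
print(toy_dense)
```

Entries are +1 or -1 because `FeatureHasher` alternates signs by default, which helps collisions cancel out on average.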


In [None]:
class ColumnToDict(BaseEstimator, TransformerMixin):
    """Convert selected columns of a DataFrame into a list of dicts for FeatureHasher."""
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        records = X[self.cols].astype(str).to_dict(orient="records")
        return records

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

low_cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

high_cat_pipe = Pipeline(steps=[
    ("to_dict", ColumnToDict(high_cardinality)),
    ("hasher", FeatureHasher(n_features=2**12, input_type="dict")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("low_cat", low_cat_pipe, low_cardinality),
        ("high_cat", high_cat_pipe, high_cardinality),
    ],
    remainder="drop",
)

preprocess


## 6) Baseline classification model + evaluation

**Timebox:** ~20 minutes

We start with a (hopefully?) strong baseline:
- Logistic Regression in a pipeline
- Evaluate with confusion matrix + precision/recall/F1 (accuracy is not enough under imbalance)
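
To see why accuracy alone misleads under imbalance, consider a toy label vector with 4% positives (mirroring the default fraud rate above) and a "model" that always predicts the negative class. The arrays are constructed by hand for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 4 fraud cases out of 100 (toy labels)
toy_true = np.array([1] * 4 + [0] * 96)

# A degenerate classifier that never flags fraud
toy_pred = np.zeros_like(toy_true)

toy_accuracy = accuracy_score(toy_true, toy_pred)  # share of correct predictions
toy_recall = recall_score(toy_true, toy_pred)      # share of fraud cases caught

print(toy_accuracy, toy_recall)
```

The accuracy looks excellent while the recall shows that no fraud is caught at all.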


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve, average_precision_score

clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500)),
])

clf


In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
cm


In [None]:
print(classification_report(y_test, y_pred, digits=4))

### Precision–Recall and thresholds 

In fraud-style problems, you often choose a threshold based on operational costs.
We’ll compute a PR curve and average precision (AP).
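
One way to turn operational costs into a threshold is to assign a cost to each false positive and false negative and pick the threshold that minimizes the total. The sketch below is an illustration only: `pick_threshold` is a hypothetical helper, and the cost values and toy scores are assumptions, not course defaults.

```python
import numpy as np

def pick_threshold(y_true, y_score, cost_fp=1.0, cost_fn=20.0):
    """Return (threshold, total_cost) minimizing cost over the observed scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    candidates = np.unique(y_score)  # sorted unique scores as candidate thresholds
    costs = []
    for t in candidates:
        flagged = y_score >= t
        false_pos = np.sum(flagged & (y_true == 0))   # legitimate but flagged
        false_neg = np.sum(~flagged & (y_true == 1))  # fraud but not flagged
        costs.append(cost_fp * false_pos + cost_fn * false_neg)
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Toy labels and scores (made up for illustration)
y_true_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score_toy = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

best_thr, best_cost = pick_threshold(y_true_toy, y_score_toy)
print(best_thr, best_cost)
```

With a missed fraud costed at 20 times a false alarm, the chosen threshold sits low enough to catch every fraud case in the toy data, accepting one false alarm in exchange.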


In [None]:
y_score = clf.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, y_score)
ap


In [None]:
prec, rec, thr = precision_recall_curve(y_test, y_score)

pr_table = pd.DataFrame({
    "threshold": np.r_[thr, np.nan],
    "precision": prec,
    "recall": rec,
}).head(15)

pr_table


## 7) A quick feature engineering example (we won't go too far in this module)

**Timebox:** ~10 minutes

We add one engineered feature that often matters: **log amount**.
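
A short sketch of what the transform does, using made-up amounts: `np.log1p(x)` computes `log(1 + x)`, so it is defined at zero and compresses the long right tail typical of transaction amounts.

```python
import numpy as np

# Toy amounts spanning several orders of magnitude (illustrative values)
toy_amounts = np.array([0.0, 9.0, 99.0, 9999.0])

# log(1 + x): zero stays zero, large values are pulled in
toy_log_amounts = np.log1p(toy_amounts)

print(toy_log_amounts)
```

Each tenfold jump in `1 + amount` adds the same constant on the log scale, which tends to suit linear models such as logistic regression better than the raw, heavily skewed values.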



In [None]:
X_fe = X.copy()
X_fe["log_transaction_amount"] = np.log1p(X_fe["transaction_amount"])

X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(
    X_fe, y, test_size=0.25, random_state=SEED, stratify=y
)

numeric_features_fe = X_train_fe.select_dtypes(include=[np.number]).columns.tolist()
categorical_features_fe = X_train_fe.select_dtypes(exclude=[np.number]).columns.tolist()

high_cardinality_fe = [c for c in categorical_features_fe if c in ["merchant_id", "merchant_name"]]
low_cardinality_fe = [c for c in categorical_features_fe if c not in high_cardinality_fe]

preprocess_fe = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features_fe),
        ("low_cat", low_cat_pipe, low_cardinality_fe),
        ("high_cat", Pipeline([("to_dict", ColumnToDict(high_cardinality_fe)),
                               ("hasher", FeatureHasher(n_features=2**12, input_type="dict"))]), high_cardinality_fe),
    ],
    remainder="drop",
)

clf_fe = Pipeline(steps=[
    ("preprocess", preprocess_fe),
    ("model", LogisticRegression(max_iter=500)),
])

clf_fe.fit(X_train_fe, y_train_fe)
y_score_fe = clf_fe.predict_proba(X_test_fe)[:, 1]

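# Baseline AP vs. AP with the engineered feature
# (same seed and stratification, so both test sets contain the same rows)
average_precision_score(y_test, y_score), average_precision_score(y_test_fe, y_score_fe)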


## 8) Checkpoint

By this point, you should be able to:

- Define **X vs y** correctly and avoid leakage features
- Split data into train/test correctly (with stratification under imbalance)
- Use **pipelines** so preprocessing is fit on training data only
- Evaluate classification with **precision/recall/F1**, not just accuracy
- Explain why threshold choice is a business decision

### Exercises (suggested)
- **Exercise A (guided):** reproduce the baseline pipeline + report metrics
- **Exercise B (transfer):** try a different missingness handling strategy (e.g., “Missing” vs drop)
- **Exercise C (diagnosis):** intentionally include a leakage column and explain what happens
