[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/04_00_main.ipynb)

# Module 4 — Unsupervised Learning (Guided Discovery)


This notebook is a **guided discovery walk-through** of unsupervised learning on our overused canonical synthetic transaction dataset.

- We are **not** trying to predict a label.
- We are trying to **see structure**, **compress complexity**, and **spot what’s unusual**.
- Labels (e.g., `is_fraud`) exist in the dataset, but we will treat them as a **diagnostic lens** used *after* exploration.


In [None]:
# (Colab) First-time setup: clone repo + add src/ to Python path
# If you're running locally, you likely don't need this cell.

# !git clone https://github.com/tunnel-ai/way.git
# import sys
# sys.path.insert(0, "/content/way/src")


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score


In [None]:
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

# Canonical dataset (you can modify the generator, but do so carefully)
df = generate_transaction_risk_dataset(seed=1955)

df.head()


## 1) First contact: what does “a transaction” look like?

Unsupervised learning is unusually sensitive to **feature choice**: with no label to guide it, the features you include define what counts as "similar."

We’ll start with a clean split:
- **Features for unsupervised exploration**: a numeric “behavior” subset.
- **Outcomes for later validation**: `is_fraud`, `transaction_loss_amount` (kept aside until later).


In [None]:
OUTCOME_COLS = ["is_fraud", "transaction_loss_amount"]

df.dtypes.astype(str).value_counts()


In [None]:
# Share of missing values per column. Some fields are expected to be
# missing not at random (MNAR), so the pattern itself is informative.
df.isna().mean().sort_values(ascending=False).head(15)


### Choose a behavior feature set (numeric-first)

Start with **numeric behavioral signals** and avoid high-cardinality IDs at first.
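
As a rough illustration of "numeric-first, no IDs," the sketch below filters a toy frame down to behavioral columns. The frame `df_demo`, its column names, and the rule that identifiers end in `_id` are all assumptions made for the example, not properties of the course dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
df_demo = pd.DataFrame({
    "transaction_amount": [12.5, 80.0, 7.25, 310.0],
    "customer_risk_score": [0.1, 0.7, 0.2, 0.9],
    "customer_id": [1001, 1002, 1003, 1004],  # numeric dtype, but an identifier
    "merchant_category": ["grocery", "travel", "grocery", "electronics"],
})

# Keep numeric columns only.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()

# Assumed naming convention: identifier columns end in "_id".
id_like = [c for c in numeric_cols if c.endswith("_id")]
behavior_cols = [c for c in numeric_cols if c not in id_like]

print(behavior_cols)
```

A numeric dtype does not make a column behavioral: an ID is a label stored as a number, and distances between IDs carry no meaning.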


In [None]:
candidate_numeric = [
    "transaction_amount",
    "transaction_hour",
    "transaction_day",
    "account_age_days",
    "customer_risk_score",
    "prior_transaction_count",
    "prior_fraud_count",
]

NUM_FEATURES = [c for c in candidate_numeric if c in df.columns]
X_num = df[NUM_FEATURES].copy()

X_num.head()


In [None]:
X_num.describe().T


## 2) Scaling: making distances comparable

K-Means, DBSCAN, and PCA all depend on geometry. Standardization makes each feature contribute on a comparable footing.
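
To see why this matters, here is a minimal sketch on three made-up transactions (dollar amounts next to a 0 to 1 risk score). The values and the helper `euclid` are hypothetical. In raw units the amount column dominates every distance, and after z-scoring the two kinds of difference become comparable.

```python
import numpy as np

# Hypothetical toy data: [amount in dollars, risk score in 0..1].
X = np.array([
    [10.0, 0.10],
    [12.0, 0.90],   # similar amount to row 0, very different risk
    [500.0, 0.10],  # same risk as row 0, very different amount
])

def euclid(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw space: the amount column swamps the risk column.
raw_risk_gap = euclid(X[0], X[1])
raw_amount_gap = euclid(X[0], X[2])

# Z-score each column (population std, ddof=0, as StandardScaler does).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_risk_gap = euclid(Z[0], Z[1])
z_amount_gap = euclid(Z[0], Z[2])

print(f"raw:    risk gap={raw_risk_gap:.2f}  amount gap={raw_amount_gap:.2f}")
print(f"scaled: risk gap={z_risk_gap:.2f}  amount gap={z_amount_gap:.2f}")
```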


In [None]:
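# StandardScaler tolerates NaNs, but KMeans, DBSCAN, and PCA do not:
# impute or drop missing values first if X_num contains any.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

# Sanity check: means ~0 and stds ~1 (pandas uses ddof=1, so std lands slightly above 1)
pd.DataFrame(X_scaled, columns=NUM_FEATURES).agg(["mean", "std"]).T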


## 3) Clustering with K-Means: a first partition of behavior

K-Means forces every point into one of *k* groups. We'll start with a small k for interpretability. How should we choose k?
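
One common way to approach that question is to sweep a few values of k and compare inertia (the within-cluster sum of squares) with the silhouette score. The sketch below does this on synthetic blobs from `make_blobs`, an assumed stand-in rather than the transaction data. `X_demo`, `k_try`, and `best_k` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with three well-separated groups (not the course dataset).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-6, -6], [0, 6], [6, -6]],
    cluster_std=0.8,
    random_state=0,
)

rows = []
for k_try in range(2, 7):
    km = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X_demo)
    rows.append((k_try, km.inertia_, silhouette_score(X_demo, km.labels_)))

for k_try, inertia, sil in rows:
    print(f"k={k_try}  inertia={inertia:9.1f}  silhouette={sil:.3f}")

# Pick the k with the highest silhouette score.
best_k = max(rows, key=lambda r: r[2])[0]
print("best k by silhouette:", best_k)
```

On real data the curves are rarely this clean. Inertia tends to keep falling as k grows, so look for where the improvement flattens and weigh that against how interpretable the clusters are.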


In [None]:
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=1955)
cluster_km = kmeans.fit_predict(X_scaled)

df_km = df.copy()
df_km["cluster_kmeans"] = cluster_km

df_km["cluster_kmeans"].value_counts().sort_index()


In [None]:
df_km.groupby("cluster_kmeans")[NUM_FEATURES].mean()


### Visualize clusters on two intuitive axes

We’ll pick two human-readable features for a simple “story view.”


In [None]:
x_feat = "transaction_amount" if "transaction_amount" in NUM_FEATURES else NUM_FEATURES[0]
y_feat = "customer_risk_score" if "customer_risk_score" in NUM_FEATURES else NUM_FEATURES[1]

plt.figure(figsize=(7, 5))
plt.scatter(df_km[x_feat], df_km[y_feat], c=df_km["cluster_kmeans"], s=10, alpha=0.5)
plt.xlabel(x_feat)
plt.ylabel(y_feat)
plt.title("K-Means clusters (two-feature view)")
plt.show()


### Internal coherence check: silhouette score (diagnostic)

Silhouette score is useful, but it is not “truth.” Treat it as a consistency check, not an objective.
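
For intuition about what the number measures: each point gets s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes this by hand on four made-up 1-D points and compares it with scikit-learn. `silhouette_by_hand` is a hypothetical helper and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight pairs on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Mean silhouette; assumes every cluster has at least two points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = D[i, same].mean()  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            D[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual = silhouette_by_hand(X_toy, labels_toy)
reference = silhouette_score(X_toy, labels_toy)
print(round(manual, 4), round(reference, 4))
```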


In [None]:
silhouette_score(X_scaled, cluster_km)


## 4) Density-based clustering (DBSCAN): allowing “noise”

DBSCAN can label points as **noise** if they don’t belong to any dense region. Choosing `eps` is the main judgment call.
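
A tiny, hand-built example makes the noise label concrete. The points below are invented for illustration: two tight groups of five plus one isolated point, which DBSCAN leaves unassigned with the label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one loner at the end.
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
group_b = group_a + 5.0
loner = np.array([[2.5, 10.0]])
X_toy = np.vstack([group_a, group_b, loner])

# Every point in a group has at least 4 neighbors (itself included) within eps=0.3.
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(X_toy)
print(labels)
```

Unlike K-Means, nothing forces the loner into the nearest group, which is why the noise rate is itself a useful diagnostic.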


In [None]:
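k_nn = 10  # keep equal to DBSCAN's min_samples below
nn = NearestNeighbors(n_neighbors=k_nn)
nn.fit(X_scaled)
# Querying the training data counts each point as one of its own neighbors
# (distance 0), which matches DBSCAN's min_samples: that count includes the point itself.
distances, _ = nn.kneighbors(X_scaled)

# Distance to the farthest of the k_nn neighbors, sorted ascending for the elbow plot
k_dist = np.sort(distances[:, -1])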

plt.figure(figsize=(7, 4))
plt.plot(k_dist)
plt.title(f"k-distance plot (k={k_nn})")
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k_nn}th nearest neighbor")
plt.show()


### Run DBSCAN

Start with a heuristic `eps`, then adjust live to see how the “noise” rate changes.
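
The kind of adjustment meant here can be sketched as a small sweep: hold `min_samples` fixed, vary `eps`, and watch the cluster count and noise share. The data comes from `make_blobs` as an assumed stand-in, and the `eps` grid is arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in (not the course dataset).
X_demo, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.0, random_state=0)

noise_rates = []
for eps_try in [0.2, 0.4, 0.8, 1.6]:
    labels = DBSCAN(eps=eps_try, min_samples=10).fit_predict(X_demo)
    n_clusters = len(set(labels.tolist()) - {-1})  # noise (-1) is not a cluster
    noise_rate = float(np.mean(labels == -1))
    noise_rates.append(noise_rate)
    print(f"eps={eps_try:<4}  clusters={n_clusters}  noise={noise_rate:.1%}")
```

With `min_samples` held fixed, the noise share can only stay level or fall as `eps` grows, because a larger neighborhood never turns a core or border point back into noise. The cluster count has no such guarantee: it can rise as dense pockets appear and then fall as they merge.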


In [None]:
eps = float(np.percentile(k_dist, 95))  # heuristic start; adjust as needed
min_samples = 10

dbscan = DBSCAN(eps=eps, min_samples=min_samples)
cluster_db = dbscan.fit_predict(X_scaled)

df_db = df.copy()
df_db["cluster_dbscan"] = cluster_db

df_db["cluster_dbscan"].value_counts().head(10)


In [None]:
(df_db["cluster_dbscan"] == -1).mean()


In [None]:
plt.figure(figsize=(7, 5))
plt.scatter(df_db[x_feat], df_db[y_feat], c=df_db["cluster_dbscan"], s=10, alpha=0.5)
plt.xlabel(x_feat)
plt.ylabel(y_feat)
plt.title("DBSCAN clusters (+ noise = -1) (two-feature view)")
plt.show()


## 5) Dimensionality reduction (PCA): compressing complexity

PCA gives us a reduced space that helps visualization and sense-making.
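
Under the hood, each explained-variance ratio is an eigenvalue of the data's covariance matrix divided by the sum of all eigenvalues. The sketch below checks this against scikit-learn on made-up correlated data. `X_demo` and the way it is constructed are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: the third column is nearly the sum of the first two.
base = rng.normal(size=(500, 2))
third = base[:, 0] + base[:, 1] + 0.1 * rng.normal(size=500)
X_demo = np.column_stack([base[:, 0], base[:, 1], third])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardize

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]
ratio_by_hand = eigvals / eigvals.sum()

ratio_sklearn = PCA(n_components=3).fit(X_demo).explained_variance_ratio_

print(np.round(ratio_by_hand, 4))
print(np.round(ratio_sklearn, 4))
```

Because the third column is almost redundant, two components capture nearly all the variance. That is the sense in which PCA "compresses" correlated features.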


In [None]:
pca = PCA(n_components=min(10, len(NUM_FEATURES)), random_state=1955)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

plt.figure(figsize=(7, 4))
plt.plot(np.cumsum(explained), marker="o")
plt.ylim(0, 1.01)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA explained variance (cumulative)")
plt.grid(True, alpha=0.3)
plt.show()

explained[:5], explained.sum()


In [None]:
pc1, pc2 = X_pca[:, 0], X_pca[:, 1]

plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection (unlabeled)")
plt.show()


### Overlay cluster assignments in PCA space

Does structure become easier to see?


In [None]:
plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, c=cluster_km, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA space colored by K-Means clusters")
plt.show()


In [None]:
plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, c=cluster_db, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA space colored by DBSCAN (+ noise)")
plt.show()


## 6) Anomaly detection: modeling normality

Isolation Forest assigns an anomaly score based on how *easily* points are isolated by random splits.
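
A point far from the bulk of the data tends to be separated after only a few random splits, and a short average path means a high anomaly score. The sketch below plants one obvious outlier in made-up Gaussian data and checks that it ranks first. `X_demo` and `unusualness` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 200 ordinary points plus one planted outlier (last row).
X_demo = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])

iso_demo = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# decision_function is lower for abnormal points, so negate it: higher = more unusual.
unusualness = -iso_demo.decision_function(X_demo)

most_unusual = int(np.argmax(unusualness))
print("most unusual row:", most_unusual)
```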


In [None]:
iso = IsolationForest(
    n_estimators=300,
    contamination=0.02,  # expected share of outliers; shifts the decision threshold, not the ranking
    random_state=1955,
)
iso.fit(X_scaled)

score_normal = iso.decision_function(X_scaled)
anomaly_score = -score_normal  # higher = more unusual

df_anom = df.copy()
df_anom["anomaly_score"] = anomaly_score

df_anom["anomaly_score"].describe()


In [None]:
plt.figure(figsize=(7, 4))
plt.hist(df_anom["anomaly_score"], bins=50, alpha=0.8)
plt.xlabel("Anomaly score (higher = more unusual)")
plt.ylabel("Count")
plt.title("Isolation Forest anomaly score distribution")
plt.show()


In [None]:
top_n = 300
idx_top = np.argsort(anomaly_score)[-top_n:]

plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, s=8, alpha=0.25)
plt.scatter(pc1[idx_top], pc2[idx_top], s=15, alpha=0.9)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"Top {top_n} anomalies highlighted in PCA space")
plt.show()


## 7) Reveal labels as validation (not as targets)

Now we can cheat. Check whether the discovered structure lines up with outcomes like fraud prevalence or loss.
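
One way to summarize "lines up with outcomes" in a single number is lift: the fraud rate among the most anomalous transactions divided by the overall fraud rate. The helper `lift_at_quantile` and the toy arrays below are hypothetical, written only to show the arithmetic.

```python
import numpy as np

def lift_at_quantile(scores, labels, q):
    """Positive rate among scores at or above the q-th quantile, over the overall rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mask = scores >= np.quantile(scores, q)
    return float(labels[mask].mean() / labels.mean())

# Hypothetical toy data: 10 transactions, 2 frauds, both among the highest scores.
scores_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
fraud_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

lift = lift_at_quantile(scores_toy, fraud_toy, q=0.8)
print(f"lift in the top 20% of scores: {lift:.1f}x")
```

A lift near 1 would mean the anomaly score carries little information about fraud. A lift well above 1 suggests the "unusual" region is also a risky one, though unusual and fraudulent are not the same thing.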


In [None]:
y_fraud = df["is_fraud"].astype(int)
y_loss = df["transaction_loss_amount"]

pd.DataFrame({"cluster": cluster_km, "is_fraud": y_fraud}).groupby("cluster")["is_fraud"].mean()


In [None]:
q = 0.98
threshold = np.quantile(df_anom["anomaly_score"], q)
mask_top = df_anom["anomaly_score"] >= threshold

fraud_rate_overall = y_fraud.mean()
fraud_rate_top = y_fraud[mask_top].mean()

fraud_rate_overall, fraud_rate_top


In [None]:
loss_top = y_loss[mask_top]

pd.DataFrame({
    "group": ["overall", f"top_anomaly_q{q}"],
    "mean_loss": [y_loss.mean(), loss_top.mean()],
    "median_loss": [y_loss.median(), loss_top.median()],
    "p95_loss": [y_loss.quantile(0.95), loss_top.quantile(0.95)],
})


## 8) Check in...

Unsupervised evaluation is multi-evidence:
- **Coherence** (internal consistency)
- **Visualization** (structure that persists in reduced space)
- **Domain validation** (clusters/anomalies that match meaningful behavior)

Something to think about:
> With a limited budget (money, time, etc.), where would you look first, and why?
