# Module 4 — 04_01 Guided Exercise (Unsupervised Learning)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/04_01_exercise_guided.ipynb)

This guided exercise is designed to be completed **top-to-bottom**.

**Rules of the exercise**
- Use the **canonical dataset** generated by `generate_transaction_risk_dataset(seed=1955)`.
- Do **not** use `is_fraud` or `transaction_loss_amount` as model inputs.
- You may *inspect* those outcomes only in the **final validation section**.

**What you will produce**
1) A K-Means clustering with short cluster profiles  
2) A PCA visualization colored by your cluster assignments  
3) A DBSCAN run with a defensible `eps` (via k-distance plot)  
4) An Isolation Forest anomaly score + a “top anomalies” table  
5) A short validation: Are anomalies enriched for fraud or high loss?


In [None]:
# (Colab) First-time setup: clone repo + add src/ to Python path
# If you're running locally, you likely don't need this cell.

# !git clone https://github.com/tunnel-ai/way.git
# import sys
# sys.path.insert(0, "/content/way/src")


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest

from sklearn.metrics import silhouette_score


In [None]:
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=1955)
df.head()


## 0) Feature set selection (numeric behavioral features)

We will start with a numeric behavioral feature set. This keeps the geometry interpretable.

**TODO 0A**: Confirm the selected columns exist (defensive check).  
**TODO 0B**: Create `X_num` as a DataFrame with only these columns.

> Keep outcomes (`is_fraud`, `transaction_loss_amount`) out of `X_num`.


In [None]:
# TODO 0A: choose the feature set
candidate_numeric = [
    "transaction_amount",
    "transaction_hour",
    "transaction_day",
    "account_age_days",
    "customer_risk_score",
    "prior_transaction_count",
    "prior_fraud_count",
]

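# TODO 0A: keep only columns that exist, and report any that were skipped
NUM_FEATURES = [c for c in candidate_numeric if c in df.columns]
missing = [c for c in candidate_numeric if c not in df.columns]
if missing:
    print("Missing from df (skipped):", missing)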

# TODO 0B: build X_num
X_num = df[NUM_FEATURES].copy()

NUM_FEATURES, X_num.shape


## 1) Scaling (required)

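K-Means, DBSCAN, and the k-distance plot all rely on Euclidean distance, so a feature measured in large units (such as `transaction_amount`) will dominate the geometry unless every feature is put on a comparable scale.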

**TODO 1**: Standardize `X_num` into `X_scaled`.

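To make the scaling argument concrete, here is a minimal sketch on three made-up transactions (the values and the two-column layout are assumptions for illustration, not rows from the canonical dataset). It compares Euclidean distances before and after z-scoring each column.

```python
import numpy as np

# Hypothetical toy rows: [amount in dollars, risk score in 0..1]
X_toy = np.array([
    [100.0, 0.10],
    [105.0, 0.90],  # similar amount to row 0, very different risk
    [400.0, 0.12],  # similar risk to row 0, very different amount
])


def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))


# Raw units: the dollar column decides almost everything
raw_d01 = dist(X_toy[0], X_toy[1])
raw_d02 = dist(X_toy[0], X_toy[2])

# Z-score each column using the population std (ddof=0),
# which is the same convention StandardScaler uses
Z_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
std_d01 = dist(Z_toy[0], Z_toy[1])
std_d02 = dist(Z_toy[0], Z_toy[2])

print(f"raw:          d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"standardized: d(0,1)={std_d01:.2f}  d(0,2)={std_d02:.2f}")
```

In raw units the risk-score difference is nearly invisible next to the dollar difference; after standardization both differences contribute on a similar footing.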

In [None]:
# TODO 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

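# Sanity check: means should be ~0. pandas `.std()` uses the sample formula (ddof=1),
# while StandardScaler divides by the population std (ddof=0), so std lands slightly above 1.
pd.DataFrame(X_scaled, columns=NUM_FEATURES).agg(["mean", "std"]).T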


## 2) K-Means clustering (structured partition)

**TODO 2A**: Fit K-Means with `k=4` (use `random_state=42`, `n_init=10`).  
**TODO 2B**: Add cluster labels to `df_km`.  
**TODO 2C**: Compute and report:
- cluster counts
- cluster mean profiles for the selected features
- silhouette score

Interpretation prompt:
- Which cluster looks most “high risk” based on behavior alone?

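The exercise fixes `k=4`, but the silhouette score from TODO 2C is easier to interpret next to the scores for neighboring values of `k`. The sketch below runs that comparison on synthetic blobs (`X_demo` is a stand-in with three well-separated groups, an assumption made so the expected answer is known); on the exercise data you would pass `X_scaled` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three well-separated groups (assumption)
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[(-10, -10), (0, 10), (10, -10)],
    cluster_std=1.0,
    random_state=42,
)

# Fit K-Means for several k and record the silhouette score of each partition
sil_by_k = {}
for k_try in range(2, 7):
    labels = KMeans(n_clusters=k_try, n_init=10, random_state=42).fit_predict(X_demo)
    sil_by_k[k_try] = float(silhouette_score(X_demo, labels))

best_k = max(sil_by_k, key=sil_by_k.get)

for k_try, s in sil_by_k.items():
    print(f"k={k_try}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

A flat or noisy curve on real data is itself informative: it suggests the features do not form crisp groups, and the cluster profiles should be read as a convenient summary rather than as natural segments.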

In [None]:
# TODO 2A
k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
cluster_km = kmeans.fit_predict(X_scaled)

# TODO 2B
df_km = df.copy()
df_km["cluster_kmeans"] = cluster_km

# TODO 2C
counts = df_km["cluster_kmeans"].value_counts().sort_index()
profiles = df_km.groupby("cluster_kmeans")[NUM_FEATURES].mean()
sil = silhouette_score(X_scaled, cluster_km)

counts, profiles, sil


### Quick visualization (two-feature view)

**TODO 2D**: Make a scatter plot colored by K-Means cluster, using:
- x-axis: `transaction_amount`
- y-axis: `customer_risk_score`

If one of those features is missing, pick a reasonable substitute from `NUM_FEATURES`.


In [None]:
# TODO 2D
x_feat = "transaction_amount" if "transaction_amount" in NUM_FEATURES else NUM_FEATURES[0]
y_feat = "customer_risk_score" if "customer_risk_score" in NUM_FEATURES else NUM_FEATURES[-1]

plt.figure(figsize=(7, 5))
plt.scatter(df_km[x_feat], df_km[y_feat], c=df_km["cluster_kmeans"], s=10, alpha=0.5)
plt.xlabel(x_feat)
plt.ylabel(y_feat)
plt.title("K-Means clusters (two-feature view)")
plt.show()


## 3) PCA for visualization and sense-making

**TODO 3A**: Fit PCA on `X_scaled` with up to 10 components (or fewer if needed).  
**TODO 3B**: Plot cumulative explained variance.  
**TODO 3C**: Create a 2D PCA scatterplot colored by K-Means cluster.

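For the sense-making part, it can help to look at the component loadings, which show how much each original feature contributes to each principal component. The following is a minimal sketch on a made-up three-feature frame (the column names `amount`, `avg_amount_30d`, and `hour` are hypothetical); with the exercise data you would index the table by `NUM_FEATURES` and use the fitted `pca.components_`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two strongly correlated columns and one independent column
rng = np.random.default_rng(0)
n = 500
amount = rng.normal(size=n)
toy = pd.DataFrame({
    "amount": amount,
    "avg_amount_30d": amount + 0.1 * rng.normal(size=n),
    "hour": rng.normal(size=n),
})

Z_demo = StandardScaler().fit_transform(toy)
pca_demo = PCA(n_components=2).fit(Z_demo)

# Rows = original features, columns = components; entries are the loadings
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=toy.columns,
    columns=["PC1", "PC2"],
)

print(loadings.round(2))
print("explained variance ratio:", pca_demo.explained_variance_ratio_.round(3))
```

Here the two correlated columns load together on PC1 while the independent column dominates PC2. The sign of a component is arbitrary, so compare magnitudes and relative signs within a column rather than the absolute sign.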

In [None]:
# TODO 3A
pca = PCA(n_components=min(10, len(NUM_FEATURES)), random_state=42)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

# TODO 3B: explained variance plot
plt.figure(figsize=(7, 4))
plt.plot(np.arange(1, len(explained) + 1), np.cumsum(explained), marker="o")
plt.ylim(0, 1.01)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA explained variance (cumulative)")
plt.grid(True, alpha=0.3)
plt.show()

explained[:5], explained.sum()


In [None]:
# TODO 3C: PCA 2D scatter colored by K-Means cluster
pc1, pc2 = X_pca[:, 0], X_pca[:, 1]

plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, c=cluster_km, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA space colored by K-Means clusters")
plt.show()


## 4) DBSCAN (density-based clustering + noise)

DBSCAN requires choosing `eps`. A common practical approach is the **k-distance plot**.

**TODO 4A**: Build a k-distance plot using `k_nn=10`.  
**TODO 4B**: Choose an `eps` near the “elbow” of that plot (start with the 95th percentile of the k-distances).  
**TODO 4C**: Fit DBSCAN with `min_samples=10`.  
**TODO 4D**: Report:
- cluster label counts (including `-1` noise)
- noise rate

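Because the elbow is a judgment call, one way to defend a choice of `eps` is to show how the result moves as `eps` moves. The sketch below sweeps a few percentiles of the k-distances on synthetic blobs (`X_demo` is an assumed stand-in for `X_scaled`) and records the number of clusters and the noise rate for each.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for X_scaled (assumption)
X_demo, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)

min_samples = 10

# Same k-distance construction as the exercise: query the fitted points themselves,
# so each point counts as its own first neighbor (distance 0)
nn_demo = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn_demo.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

results = []
for pct in (80, 90, 95, 99):
    eps_try = float(np.percentile(k_dist, pct))
    labels = DBSCAN(eps=eps_try, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_rate = float((labels == -1).mean())
    results.append((pct, eps_try, n_clusters, noise_rate))
    print(f"pct={pct:>2}  eps={eps_try:.3f}  clusters={n_clusters}  noise={noise_rate:.3f}")
```

With this construction, any point whose k-distance is at most `eps` is a core point, so choosing the p-th percentile caps the noise rate at roughly `1 - p/100`. If the cluster count swings sharply between adjacent percentiles, the data may not have a clean density structure at this `min_samples`.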

In [None]:
# TODO 4A: k-distance plot
k_nn = 10
nn = NearestNeighbors(n_neighbors=k_nn)
nn.fit(X_scaled)
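# Querying the fitted points themselves returns each point as its own first
# neighbor (distance 0), so the last column is the distance to the k_nn-th point
# counting the point itself. DBSCAN's min_samples counts the point itself as well.
distances, _ = nn.kneighbors(X_scaled)

k_dist = np.sort(distances[:, -1])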

plt.figure(figsize=(7, 4))
plt.plot(k_dist)
plt.title(f"k-distance plot (k={k_nn})")
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k_nn}th nearest neighbor")
plt.show()


In [None]:
# TODO 4B / 4C: DBSCAN run
eps = float(np.percentile(k_dist, 95))  # adjust once if needed
min_samples = 10

dbscan = DBSCAN(eps=eps, min_samples=min_samples)
cluster_db = dbscan.fit_predict(X_scaled)

df_db = df.copy()
df_db["cluster_dbscan"] = cluster_db

counts_db = df_db["cluster_dbscan"].value_counts().head(10)  # 10 largest labels; -1 marks noise
noise_rate = (df_db["cluster_dbscan"] == -1).mean()

counts_db, noise_rate


### DBSCAN visualization (PCA space)

**TODO 4E**: Plot PCA (PC1 vs PC2) colored by DBSCAN labels.


In [None]:
# TODO 4E
plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, c=cluster_db, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA space colored by DBSCAN (+ noise = -1)")
plt.show()


## 5) Isolation Forest anomaly detection

**TODO 5A**: Fit Isolation Forest (use `random_state=42`).  
**TODO 5B**: Create `anomaly_score` where higher = more unusual (scikit-learn's `decision_function` is higher for *normal* points, so negate it).  
**TODO 5C**: Plot the anomaly score distribution.  
**TODO 5D**: Create a table of the top 25 most anomalous points with:
- anomaly_score
- transaction_amount
- customer_risk_score
- transaction_hour
- merchant_category (if it exists)
- channel (if it exists)

Note: we still have not used labels.

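The sign conventions in scikit-learn's Isolation Forest are easy to mix up, so here is a small sketch on synthetic points (500 Gaussian inliers plus 10 injected far-away points, both assumptions for illustration). It checks how `score_samples`, `decision_function`, `predict`, and `contamination` relate to one another.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): a Gaussian bulk plus a few distant points
rng = np.random.default_rng(0)
inliers = rng.normal(size=(500, 2))
outliers = rng.uniform(6, 8, size=(10, 2))
X_demo = np.vstack([inliers, outliers])

iso_demo = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
iso_demo.fit(X_demo)

raw = iso_demo.score_samples(X_demo)          # higher = more normal
decision = iso_demo.decision_function(X_demo)  # raw score shifted by offset_
pred = iso_demo.predict(X_demo)                # -1 = flagged, +1 = not flagged

# Flip the sign so that higher = more unusual, as in TODO 5B
anomaly = -decision

flagged_rate = float((pred == -1).mean())
print("decision == raw - offset_:", np.allclose(decision, raw - iso_demo.offset_))
print("flagged exactly where decision < 0:", np.array_equal(pred == -1, decision < 0))
print(f"flagged rate: {flagged_rate:.3f}")
```

The practical point is that `contamination` only moves the zero point of `decision_function`; it does not change the ranking of points. A top-N table built from the negated score is therefore unaffected by the `contamination` value, while anything based on `predict` is not.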

In [None]:
# TODO 5A / 5B
iso = IsolationForest(n_estimators=300, contamination=0.02, random_state=42)
iso.fit(X_scaled)

score_normal = iso.decision_function(X_scaled)
anomaly_score = -score_normal

df_anom = df.copy()
df_anom["anomaly_score"] = anomaly_score

df_anom["anomaly_score"].describe()


In [None]:
# TODO 5C
plt.figure(figsize=(7, 4))
plt.hist(df_anom["anomaly_score"], bins=50, alpha=0.8)
plt.xlabel("Anomaly score (higher = more unusual)")
plt.ylabel("Count")
plt.title("Isolation Forest anomaly score distribution")
plt.show()


In [None]:
# TODO 5D
show_cols = ["anomaly_score", "transaction_amount", "customer_risk_score", "transaction_hour"]
for c in ["merchant_category", "channel"]:
    if c in df_anom.columns:
        show_cols.append(c)

top25 = df_anom.sort_values("anomaly_score", ascending=False).head(25)[show_cols]
top25


## 6) Validation (labels revealed as diagnostics)

Now—and only now—we use outcomes to validate whether:
- clusters differ in fraud prevalence
- top anomalies are enriched for fraud
- top anomalies have higher loss

**TODO 6A**: Fraud rate by K-Means cluster  
**TODO 6B**: Fraud rate among top 2% anomaly scores  
**TODO 6C**: Compare mean and 95th percentile loss overall vs top anomalies

Interpretation prompt:
- Do anomalies align with fraud? With high loss? With neither? What might that imply?

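"Enriched" can be made precise as a lift: the fraud rate in the top-scoring group divided by the overall fraud rate. The helper below is a hypothetical sketch (`enrichment` is not part of the course code) and is shown on a tiny hand-built example; with the exercise data you would pass `df_anom["anomaly_score"]` and `y_fraud`.

```python
import numpy as np


def enrichment(scores, labels, q=0.98):
    """Compare the positive rate in the top (1 - q) share of scores to the base rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    thr = np.quantile(scores, q)
    top = scores >= thr
    base_rate = float(labels.mean())
    top_rate = float(labels[top].mean())
    lift = top_rate / base_rate if base_rate > 0 else float("nan")
    return {
        "n_top": int(top.sum()),
        "base_rate": base_rate,
        "top_rate": top_rate,
        "lift": lift,
    }


# Hand-built toy data (assumption): 100 scores, 5 positives, one of them in the top 2%
scores = np.arange(100)
labels = np.zeros(100, dtype=int)
labels[[99, 50, 10, 5, 3]] = 1

result = enrichment(scores, labels, q=0.98)
print(result)
```

In this toy case the top group holds only two rows, which is a reminder to report `n_top` alongside any lift: a large ratio computed from a handful of rows is weak evidence on its own.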

In [None]:
# TODO 6A
y_fraud = df["is_fraud"].astype(int)
fraud_by_cluster = pd.DataFrame({"cluster": cluster_km, "is_fraud": y_fraud}).groupby("cluster")["is_fraud"].mean()
fraud_by_cluster


In [None]:
# TODO 6B
q = 0.98
thr = np.quantile(df_anom["anomaly_score"], q)
mask_top = df_anom["anomaly_score"] >= thr

fraud_overall = y_fraud.mean()
fraud_top = y_fraud[mask_top].mean()

fraud_overall, fraud_top


In [None]:
# TODO 6C
y_loss = df["transaction_loss_amount"]
loss_top = y_loss[mask_top]

pd.DataFrame({
    "group": ["overall", f"top_anomaly_q{q}"],
    "mean_loss": [y_loss.mean(), loss_top.mean()],
    "p95_loss": [y_loss.quantile(0.95), loss_top.quantile(0.95)],
    "median_loss": [y_loss.median(), loss_top.median()],
})


## Reflection (short)

Answer in 3–6 sentences:

1) Which method (K-Means, DBSCAN, Isolation Forest) produced the most **actionable** view of the data? Why?  
2) In this dataset, did anomaly detection seem to track **fraud**, **loss**, both, or neither?  
3) If you had to propose one next step for a real fraud team, what would it be?

(There is no single correct answer. You are graded on evidence-based reasoning.)
