[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tunnel-ai/way/blob/main/notebooks/04_02_exercise_open.ipynb)

# Module 4 — 04_02 Open Exercise (Unsupervised Learning)


This is a **decision-focused** unsupervised learning exercise: build an unsupervised view of the canonical transaction dataset and defend your choices.

---

## Constraints (to keep results comparable)

- Use the canonical dataset: `generate_transaction_risk_dataset(seed=1955)`
- Do **not** use `is_fraud` or `transaction_loss_amount` as model inputs
- Use **one** numeric feature set that you define (be prepared to justify it)
- Use **at least two** unsupervised methods from:
  - K-Means
  - DBSCAN
  - PCA (as visualization and/or preprocessing)
  - Isolation Forest

## Some outputs that might help you argue your case

1) **Feature set statement** (what you included/excluded and why)  
2) **Two-method comparison** (what each method revealed that the other did not)  
3) **One visualization** in a reduced space (PCA 2D)  
4) **One “investigation list”**: top 25 candidate unusual transactions (your criteria)  
5) **Validation (optional but encouraged)**: After discovery, check whether your “unusual” list is enriched for fraud or high loss

## Decision Log

Defend **two** decisions:

- Decision A (choose one):
  - how you handled scaling / transformations
  - which variables you treated as “behavior” vs “identifiers”
- Decision B (choose one):
  - choice of k (K-Means) OR eps/min_samples (DBSCAN) OR contamination (Isolation Forest)
  - whether you used PCA before clustering/anomaly detection



In [None]:
# (Colab) First-time setup: clone repo + add src/ to Python path
# If you're running locally, you likely don't need this cell.

# !git clone https://github.com/tunnel-ai/way.git
# import sys
# sys.path.insert(0, "/content/way/src")


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import IsolationForest
from sklearn.metrics import silhouette_score


In [None]:
from core.generators.transaction_risk_dgp import generate_transaction_risk_dataset

df = generate_transaction_risk_dataset(seed=1955)
df.head()


## 1) Define your feature set (your first major decision)

**Task**
- Choose a numeric feature set that you believe captures transaction behavior.
- Explicitly exclude outcomes (`is_fraud`, `transaction_loss_amount`) and anything you believe is a pure identifier.


> Tip: Start with a sensible baseline feature set, then adjust if needed.


### Feature set statement 

Use this space to document your reasoning:

- Included features:
- Excluded features:
- Why these choices make sense for clustering / anomaly detection:
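
One way to support the feature set statement is to screen columns mechanically before deciding by hand. The sketch below is only an illustration on a made-up table: the column names, the `screen_features` helper, and the "integer column with one distinct value per row" heuristic for identifiers are all assumptions, not part of the canonical dataset or the course code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transaction table; column names here are illustrative only.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "transaction_id": np.arange(n),                   # unique per row: identifier-like
    "transaction_amount": rng.lognormal(3.0, 1.0, n),
    "transaction_hour": rng.integers(0, 24, n),
    "channel": rng.choice(["web", "pos", "app"], n),  # non-numeric
    "is_fraud": rng.binomial(1, 0.02, n),             # outcome: must be excluded
})

OUTCOMES = {"is_fraud", "transaction_loss_amount"}


def screen_features(frame, outcomes=OUTCOMES):
    """Return numeric columns that are neither outcomes nor row-unique integer IDs."""
    keep = []
    for col in frame.columns:
        if col in outcomes:
            continue  # never feed outcomes to the model
        if not pd.api.types.is_numeric_dtype(frame[col]):
            continue  # this exercise uses a numeric feature set
        is_int = pd.api.types.is_integer_dtype(frame[col])
        if is_int and frame[col].nunique() == len(frame):
            continue  # heuristic: one distinct integer per row looks like an ID
        keep.append(col)
    return keep


features = screen_features(toy)
print(features)
```

A screen like this only narrows the candidates. Whether a surviving column describes behavior is still a judgment you need to defend.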


In [None]:
# Build your numeric feature set here (edit freely)
candidate_numeric = [
    "transaction_amount",
    "transaction_hour",
    "transaction_day",
    "account_age_days",
    "customer_risk_score",
    "prior_transaction_count",
    "prior_fraud_count",
]

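# Keep only candidates that exist in df; names not found are dropped silently, so check the result.
NUM_FEATURES = [c for c in candidate_numeric if c in df.columns]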
X_num = df[NUM_FEATURES].copy()

NUM_FEATURES, X_num.shape


## 2) Scaling / transformations (Decision A)

Most unsupervised methods depend on geometry. You should decide whether to:
- standardize
- apply transforms (e.g., log for heavy-tailed features like amounts)
- or leave raw scales (rarely recommended)

**Task**
- Implement your scaling/transformation choice.
- Briefly justify it in the Decision Log near the bottom.
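
To see why this decision matters, it can help to measure what a transform does before committing to it. The following is a minimal sketch on synthetic data (a lognormal "amount" and a uniform "hour", both invented for illustration). It compares skewness before and after `log1p`, and the share of total variance carried by the amount column before and after standardizing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
amount = rng.lognormal(mean=4.0, sigma=1.0, size=2000)  # heavy right tail (synthetic)
hour = rng.integers(0, 24, size=2000).astype(float)
X_raw = np.column_stack([amount, hour])


def skewness(x):
    """Sample skewness: mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


skew_before = skewness(amount)
skew_after = skewness(np.log1p(amount))

# Share of total variance carried by the amount column on raw scales.
raw_var = X_raw.var(axis=0)
share_raw = float(raw_var[0] / raw_var.sum())

# Same share after log-transforming the amount and standardizing both columns.
X_std = StandardScaler().fit_transform(np.column_stack([np.log1p(amount), hour]))
std_var = X_std.var(axis=0)
share_std = float(std_var[0] / std_var.sum())

print(f"skewness: {skew_before:.2f} -> {skew_after:.2f}")
print(f"variance share of amount: {share_raw:.3f} -> {share_std:.3f}")
```

On raw scales, Euclidean distance is driven almost entirely by the amount column. After standardizing, each feature contributes equally to total variance, which is a choice in itself and not a neutral default.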


In [None]:
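# Default choice (edit freely): log-transform transaction_amount, then standardize every feature.
X_work = X_num.copy()

if "transaction_amount" in X_work.columns:
    # log1p compresses the heavy right tail; it is only defined for values > -1.
    X_work["transaction_amount"] = np.log1p(X_work["transaction_amount"])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_work)

# Sanity check: means should be ~0 and stds ~1.
# (pandas uses ddof=1 while StandardScaler uses ddof=0, so stds land slightly above 1.)
pd.DataFrame(X_scaled, columns=NUM_FEATURES).agg(["mean", "std"]).T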


## 3) PCA (required visualization)

**Task**
- Fit PCA on `X_scaled`
- Plot cumulative explained variance
- Create a 2D PCA scatterplot (unlabeled)

This gives you a shared “map” to compare methods.
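
Two quantities are often useful when reading a PCA fit: how many components are needed to reach a chosen share of variance, and which original features load on each component. The sketch below uses three synthetic features (`f1`, `f2`, `f3`, invented for illustration) and a 90% threshold that is an arbitrary assumption, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
toy = pd.DataFrame({
    "f1": base + 0.1 * rng.normal(size=n),  # f1 and f2 are strongly correlated
    "f2": base + 0.1 * rng.normal(size=n),
    "f3": rng.normal(size=n),               # independent noise
})
X = StandardScaler().fit_transform(toy)

pca_demo = PCA().fit(X)
cum = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90
n_keep = int(np.searchsorted(cum, threshold) + 1)

# Loadings: rows are components, columns are the original features.
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=toy.columns,
    index=[f"PC{i + 1}" for i in range(pca_demo.n_components_)],
)

print(n_keep)
print(loadings.round(2))
```

Reading the loadings row by row tells you what each axis of the 2D map means in terms of the features you chose, which makes later cluster plots easier to interpret.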


In [None]:
pca = PCA(n_components=min(10, len(NUM_FEATURES)), random_state=1955)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_

plt.figure(figsize=(7, 4))
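# Plot against 1..n so the x-axis matches the "Number of components" label.
plt.plot(np.arange(1, len(explained) + 1), np.cumsum(explained), marker="o")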
plt.ylim(0, 1.01)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA explained variance (cumulative)")
plt.grid(True, alpha=0.3)
plt.show()

explained[:5], explained.sum()


In [None]:
pc1, pc2 = X_pca[:, 0], X_pca[:, 1]

plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA projection (unlabeled)")
plt.show()


## 4) Method 1 (choose one): K-Means OR DBSCAN

Pick **one** to run first; you will run a second method in the next section.

### Option A: K-Means
- Decide k
- Fit K-Means
- Report cluster counts + silhouette score
- Visualize clusters in PCA space
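
If you choose K-Means, one common way to gather evidence for k is to sweep a small range and record inertia and silhouette for each value. This is a sketch on synthetic blobs with three well-separated centers (an assumption made so the answer is known in advance). Real transaction data will rarely be this clean.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustration only).
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

results = {}
for k_try in range(2, 7):
    km = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X_demo)
    results[k_try] = {
        "inertia": km.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X_demo, km.labels_),
    }

# Pick the k with the highest silhouette (one criterion among several).
best_k = max(results, key=lambda k_: results[k_]["silhouette"])

for k_try, row in results.items():
    print(k_try, round(row["inertia"], 1), round(row["silhouette"], 3))
print("best k by silhouette:", best_k)
```

Inertia generally keeps falling as k grows, so it cannot pick k on its own. Silhouette gives a peak to look for, but interpretability of the resulting groups is also legitimate evidence.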

### Option B: DBSCAN
- Use a k-distance plot to choose eps
- Fit DBSCAN
- Report noise rate + cluster counts
- Visualize DBSCAN labels in PCA space
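
If you choose DBSCAN, the k-distance curve can be computed with `NearestNeighbors`. The sketch below uses synthetic blobs plus scattered points, and it takes the 95th percentile of the curve as a starting `eps`. That percentile is a crude assumption for illustration; in practice you would plot `k_dist` and look for the elbow.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Three dense blobs plus 25 uniformly scattered points (synthetic).
X_demo, _ = make_blobs(
    n_samples=500,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.7,
    random_state=0,
)
rng = np.random.default_rng(0)
X_demo = np.vstack([X_demo, rng.uniform(-12, 12, size=(25, 2))])

min_samples = 5
# Querying the training points returns each point as its own nearest neighbor
# (distance 0), so the last column is the distance to the (min_samples - 1)-th
# other point. This matches DBSCAN, whose min_samples count includes the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Crude starting point: a high percentile of the k-distance curve (assumption).
eps_guess = float(np.percentile(k_dist, 95))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X_demo)
noise_rate = float(np.mean(labels == -1))  # DBSCAN labels noise as -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"eps={eps_guess:.3f}, clusters={n_clusters}, noise rate={noise_rate:.3f}")
```

Because every point whose k-distance is at most `eps` becomes a core point, the chosen percentile puts an upper bound on the noise rate. That link is worth stating when you justify `eps` and `min_samples`.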

**Task**
- Implement Method 1 and record your tuning decision (Decision B).


In [None]:
# METHOD 1 (default): K-Means
k = 4  # Decision B: choose and justify
kmeans = KMeans(n_clusters=k, n_init=10, random_state=1955)
labels_m1 = kmeans.fit_predict(X_scaled)

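counts_m1 = pd.Series(labels_m1).value_counts().sort_index()
# Silhouette uses pairwise distances; on large data, pass sample_size=... (with random_state) to subsample.
sil_m1 = silhouette_score(X_scaled, labels_m1)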

counts_m1, sil_m1


In [None]:
plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, c=labels_m1, s=10, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA space colored by Method 1 labels")
plt.show()


## 5) Method 2 (choose a different method than Method 1)

Run a second method from the list:
- K-Means
- DBSCAN
- Isolation Forest

**Task**
- Implement Method 2
- Visualize the result in PCA space (if applicable)
- Write 4–8 lines comparing Method 1 vs Method 2:
  - What did one reveal that the other did not?
  - Which is more actionable for investigation?
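
A cross-tabulation is one simple way to put two methods side by side: it shows whether the points flagged by an anomaly detector concentrate in particular clusters. The sketch below runs K-Means and Isolation Forest on synthetic data. The data, the choice of k=2, and the contamination of 0.03 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[-4, 0], scale=0.7, size=(300, 2)),
    rng.normal(loc=[4, 0], scale=0.7, size=(300, 2)),
    rng.uniform(-10, 10, size=(20, 2)),  # scattered points
])

cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

iso_demo = IsolationForest(n_estimators=200, contamination=0.03, random_state=0)
iso_demo.fit(X_demo)
flagged = iso_demo.predict(X_demo) == -1  # predict returns -1 for anomalies, 1 for inliers

# Rows: K-Means cluster; columns: whether Isolation Forest flagged the point.
table = pd.crosstab(
    pd.Series(cluster, name="cluster"),
    pd.Series(flagged, name="flagged"),
)
flag_rate_by_cluster = pd.Series(flagged).groupby(cluster).mean()

print(table)
print(flag_rate_by_cluster)
```

K-Means assigns every point to a cluster, including the scattered ones, while Isolation Forest ranks points without grouping them. The table makes that difference concrete for your written comparison.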


In [None]:
# METHOD 2 (default): Isolation Forest
iso = IsolationForest(n_estimators=300, contamination=0.02, random_state=1955)  # Decision B if you choose this
iso.fit(X_scaled)

anomaly_score = -iso.decision_function(X_scaled)
df_scores = df.copy()
df_scores["anomaly_score"] = anomaly_score

df_scores["anomaly_score"].describe()


In [None]:
plt.figure(figsize=(7, 4))
plt.hist(df_scores["anomaly_score"], bins=50, alpha=0.8)
plt.xlabel("Anomaly score (higher = more unusual)")
plt.ylabel("Count")
plt.title("Isolation Forest anomaly score distribution")
plt.show()


In [None]:
# Highlight top anomalies in PCA space
top_n = 300
idx_top = np.argsort(anomaly_score)[-top_n:]

plt.figure(figsize=(7, 5))
plt.scatter(pc1, pc2, s=8, alpha=0.25, label="all transactions")
plt.scatter(pc1[idx_top], pc2[idx_top], s=15, alpha=0.9, label=f"top {top_n} by anomaly score")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"Top {top_n} anomalies highlighted in PCA space")
plt.legend()
plt.show()


### Method comparison

- Method 1 summary:
- Method 2 summary:
- What Method 1 revealed that Method 2 did not:
- What Method 2 revealed that Method 1 did not:
- If you had limited investigation budget/time, which output would you use first and why:


## 6) Investigation list

Build a top-25 list of transactions you would investigate.

**Your choice:** base the list on one of the following:
- anomaly score (Isolation Forest)
- DBSCAN noise points
- “small cluster” membership in K-Means
- or a hybrid rule you define

**Task**
- Produce a table with top 25 candidates including:
  - your score / reason
  - transaction_amount
  - customer_risk_score
  - transaction_hour
  - channel (if exists)
  - merchant_category (if exists)
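
If you want a hybrid rule, one option is to convert each signal to a percentile rank and average the ranks, recording which signal drove the score. The sketch below combines an Isolation Forest score with the distance to the nearest K-Means centroid on synthetic data. The equal weighting, k=3, and the column names `hybrid` and `reason` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(size=(500, 3)),
    rng.uniform(-8, 8, size=(10, 3)),  # a few scattered points
])

# Signal 1: Isolation Forest score, negated so that higher = more unusual.
iso_score = -IsolationForest(random_state=0).fit(X_demo).decision_function(X_demo)

# Signal 2: distance to the nearest K-Means centroid.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
centroid_dist = km.transform(X_demo).min(axis=1)  # transform() = distance to each centroid

scores = pd.DataFrame({"iso_score": iso_score, "centroid_dist": centroid_dist})

# Percentile ranks put both signals on a common 0-1 scale before averaging.
ranks = scores.rank(pct=True)
scores["hybrid"] = ranks.mean(axis=1)
scores["reason"] = np.where(
    ranks["iso_score"] >= ranks["centroid_dist"],
    "isolation forest",
    "far from centroid",
)

top = scores.sort_values("hybrid", ascending=False).head(25)
print(top.head())
```

Keeping a `reason` column makes the list easier to defend, because each row says why it was selected.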


In [None]:
# Default approach: top anomaly scores
cols = ["transaction_amount", "customer_risk_score", "transaction_hour"]
for c in ["channel", "merchant_category"]:
    if c in df_scores.columns:
        cols.append(c)

investigation = df_scores.sort_values("anomaly_score", ascending=False).head(25)[["anomaly_score"] + cols]
investigation


## 7) Validation (optional)

Now that you have a discovery result, you may use outcomes as a diagnostic lens.

**Task**
- Compute fraud rate and loss summary for your top-25 list
- Compare to the overall dataset

Interpretation prompt:
- If enrichment is weak, what might that mean about your definition of “unusual”?
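
Enrichment is often summarized as lift: the outcome rate in the selected subset divided by the overall rate. The sketch below defines a hypothetical `enrichment` helper and runs it on a made-up outcome vector. The 2% base rate and the picked rows are invented numbers, not results from the canonical dataset.

```python
import numpy as np
import pandas as pd


def enrichment(outcome, selected_index):
    """Compare the outcome rate inside a selected subset with the overall rate."""
    overall = float(outcome.mean())
    subset = float(outcome.loc[selected_index].mean())
    lift = subset / overall if overall > 0 else float("nan")
    return {"overall": overall, "subset": subset, "lift": lift}


# Toy outcome vector: 1,000 rows, the first 20 are positives (made-up numbers).
y = pd.Series(np.r_[np.ones(20), np.zeros(980)].astype(int))

# 25 selected rows: 5 positives (rows 0-4) and 20 negatives (rows 100-119).
picked = pd.Index(list(range(5)) + list(range(100, 120)))

result = enrichment(y, picked)
print(result)
```

With only 25 rows, a single extra fraud case moves the subset rate by four percentage points, so treat the lift on a top-25 list as a rough signal and not a precise estimate.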


In [None]:
# Outcomes (diagnostic only)
y_fraud = df["is_fraud"].astype(int)
y_loss = df["transaction_loss_amount"]

# Fraud rate: overall vs top-25 list
fraud_overall = y_fraud.mean()
fraud_top25 = y_fraud.loc[investigation.index].mean()

# Loss: overall vs top-25 list
loss_overall = y_loss.mean()
loss_top25 = y_loss.loc[investigation.index].mean()

fraud_overall, fraud_top25, loss_overall, loss_top25


## Decision Log: things to think about

### Decision A
- What did you choose for scaling / transformation?
- Why is that reasonable for geometry-based methods?

### Decision B
- What tuning choice did you make (k, eps, contamination, PCA usage)?
- What evidence supports the choice (plot, score, stability, interpretability)?
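
One form of stability evidence is to refit the same model with different random seeds and measure how much the labelings agree. The sketch below uses the adjusted Rand index between K-Means runs on synthetic blobs. The `seed_stability` helper, the seed list, and the candidate values of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustration only).
X_demo, _ = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)


def seed_stability(X, k, seeds=(0, 1, 2, 3, 4)):
    """Mean pairwise adjusted Rand index between K-Means runs with different seeds."""
    runs = [
        KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
        for s in seeds
    ]
    pairs = [
        adjusted_rand_score(runs[i], runs[j])
        for i in range(len(runs))
        for j in range(i + 1, len(runs))
    ]
    return float(np.mean(pairs))


stability = {k: seed_stability(X_demo, k) for k in (3, 5)}
print(stability)
```

A value near 1 means the partition does not depend on the seed. A clearly lower value for some k suggests that the extra clusters are splitting the data arbitrarily, which is worth noting in your justification.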

### Final takeaway
- What “structure” do you believe exists in this dataset?
- What would you do next if this were a real investigation?
