# AI Quest — ML Model Development Problem 2  
**Title:** Segments & Sensitivity: Clustering with a Privacy Twist (Unsupervised)

## Scenario
You must segment users for personalized messaging. Clustering should be **useful** yet **respect privacy** and **avoid disparate impact** on a protected group.

## Objectives
- Data cleaning & scaling
- Compare **KMeans** and **Gaussian Mixture** (or DBSCAN)
- Model selection: choose K via **elbow & silhouette**
- Validation: **stability via bootstrapping**
- **Ethics:** evaluate whether clusters disproportionately isolate a protected group; implement a simple **differential-privacy-like noise** mechanism and discuss the utility trade‑off.

## Deliverables
1. Notebook that fits ≥2 clustering methods, chooses K, and compares solutions
2. Stability analysis over bootstraps (e.g., Adjusted Rand Index)
3. Fairness analysis across clusters (distribution by protected attribute)
4. A small **Data Sheet** Markdown cell describing privacy choices and limitations

## Scoring (auto-checked in notebook)
- Silhouette ≥ 0.35 **without** privacy noise (20 pts)
- Stability ARI median ≥ 0.40 (20 pts)
- After adding noise, silhouette drop ≤ 0.15 absolute (utility preserved) (20 pts)
- Fairness check reported + no cluster with >70% from one group unless justified (20 pts)
- Data Sheet provided (20 pts)

**Total:** 100 pts

## Ethics Notes
- **Fairness:** check protected group representation per cluster
- **Transparency & Explainability:** document K choice, method selection, and noise parameter
- **Privacy:** add Gaussian noise to features; discuss trade-offs
- **Human Agency:** recommend human override when segment-driven actions affect individuals
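
The noise used in this notebook is only "DP-like". For contrast, the sketch below shows how the classical Gaussian mechanism would calibrate per-record noise, assuming each row is first clipped to an L2 norm of `C` (so any two rows differ by at most `2 * C`) and assuming `0 < epsilon < 1`. The helper names `clip_rows` and `gaussian_sigma`, the toy data, and the values of `C`, `epsilon`, and `delta` are all illustrative assumptions, not part of the challenge, and this is not a vetted privacy implementation.

```python
import numpy as np

def clip_rows(X, clip_norm):
    """Scale each row down so its L2 norm is at most clip_norm (hypothetical helper)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    factor = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return X * factor

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism calibration; stated here only for 0 < epsilon < 1."""
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("this calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(5, 4))   # stand-in for scaled features (assumption)

C = 2.0                            # clipping norm (assumption)
X_clip = clip_rows(X_demo, C)
sigma = gaussian_sigma(2 * C, epsilon=0.6, delta=1e-5)
X_priv = X_clip + rng.normal(0.0, sigma, size=X_clip.shape)

print("sigma:", round(float(sigma), 2))
```

Under these assumptions the calibrated noise is far larger than the spread of standardized features, which is one concrete way to frame the utility trade-off in the Data Sheet.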

---

**How to run**
Open `AI_Quest_ML2_Segmentation_Privacy.ipynb` and follow the TODOs.

In [None]:
# Optional: install packages locally
# !pip -q install pandas numpy scikit-learn matplotlib

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score

RANDOM_SEED = 7
np.random.seed(RANDOM_SEED)
print("✅ Imports OK")

## 1) Synthetic data with a protected attribute

In [None]:
n = 1500
# Underlying clusters
centers = np.array([[0,0],[3.5,3.5],[0,5.5]])
X_core = np.vstack([np.random.normal(loc=c, scale=[0.9,0.7], size=(n//3,2)) for c in centers])

# Protected attribute correlated with one region
prot = np.where(X_core[:,0] + 0.5*X_core[:,1] > 4.0, "B", "A")

# Two more behavioral features
f3 = 0.8*X_core[:,0] + np.random.normal(0,1.0,size=n)
f4 = 0.5*X_core[:,1] + np.random.normal(0,1.0,size=n)

df = pd.DataFrame({"f1":X_core[:,0], "f2":X_core[:,1], "f3":f3, "f4":f4, "group":prot})
df.head()

### TODO 1 — Scale and explore K via elbow & silhouette

In [None]:
num_cols = ["f1","f2","f3","f4"]
scaler = StandardScaler()
X = scaler.fit_transform(df[num_cols])

sil = {}
for k in range(2,7):
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_SEED).fit(X)
    sil[k] = silhouette_score(X, km.labels_)
sil
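
The heading also asks for an elbow check. One way to get it is through KMeans's `inertia_` (the within-cluster sum of squares) recorded over a range of K. The sketch below runs on stand-in blobs from `make_blobs` so that it is self-contained; in the notebook you would pass the scaled `X` instead. The second-difference heuristic for locating the elbow is an assumption chosen for illustration and is no substitute for reading the plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): three well-separated blobs.
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 6], [0, 8]],
                       cluster_std=0.8, random_state=7)
X_demo = StandardScaler().fit_transform(X_demo)

ks = list(range(1, 8))
inertia = [KMeans(n_clusters=k, n_init=10, random_state=7).fit(X_demo).inertia_
           for k in ks]

# Crude elbow heuristic: K where the inertia curve bends most sharply
# (largest second difference). Entry i of second_diff corresponds to ks[i + 1].
second_diff = np.diff(inertia, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]

for k, val in zip(ks, inertia):
    print(f"k={k}: inertia={val:.1f}")
print("elbow_k (heuristic):", elbow_k)
```

Plotting `ks` against `inertia` with `plt.plot(ks, inertia, marker="o")` gives the usual elbow curve to set beside the silhouette scores.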

### TODO 2 — Fit two methods (KMeans & GaussianMixture), pick one

In [None]:
best_k = max(sil, key=sil.get)
km = KMeans(n_clusters=best_k, n_init=10, random_state=RANDOM_SEED).fit(X)
gmm = GaussianMixture(n_components=best_k, random_state=RANDOM_SEED).fit(X)

labels_km = km.labels_
labels_gmm = gmm.predict(X)

sil_km = silhouette_score(X, labels_km)
sil_gmm = silhouette_score(X, labels_gmm)
print({"sil_km": sil_km, "sil_gmm": sil_gmm, "best_k": best_k})

labels = labels_km if sil_km >= sil_gmm else labels_gmm
method = "KMeans" if sil_km >= sil_gmm else "GMM"
print("Selected method:", method)
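
Silhouette tends to reward compact, roughly spherical clusters, which is close to what KMeans optimizes, so it may understate a GMM solution. A common complementary criterion for mixtures is BIC, available as `GaussianMixture.bic`. The sketch below is self-contained and uses stand-in blobs; the variable names and the range of K are assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): three well-separated blobs.
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 6], [0, 8]],
                       cluster_std=0.8, random_state=7)

bic = {}
for k in range(1, 7):
    gmm_k = GaussianMixture(n_components=k, n_init=3, random_state=7).fit(X_demo)
    bic[k] = gmm_k.bic(X_demo)  # lower BIC is better

best_k_bic = min(bic, key=bic.get)
for k, val in bic.items():
    print(f"k={k}: BIC={val:.1f}")
print("best_k_bic:", best_k_bic)
```

If BIC and silhouette disagree on K, that disagreement is worth recording in the Data Sheet alongside the final choice.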

### TODO 3 — Stability via bootstrapping (ARI)

In [None]:
def bootstrap_ari(X, labels, method, k, B=20):
    rng = np.random.default_rng(RANDOM_SEED)
    aris = []
    for b in range(B):
        idx = rng.choice(len(X), size=len(X), replace=True)
        Xb = X[idx]
        if method=="KMeans":
            m = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_SEED).fit(Xb)
            lb = m.labels_
        else:
            m = GaussianMixture(n_components=k, random_state=RANDOM_SEED).fit(Xb)
            lb = m.predict(Xb)
        # Compare the bootstrap clustering with the original labels at the same resampled indices (duplicates included)
        aris.append(adjusted_rand_score(labels[idx], lb))
    return np.median(aris), np.percentile(aris, [25,75])

ari_med, ari_iqr = bootstrap_ari(X, labels, method, best_k, B=25)
print({"ari_median": float(ari_med), "ari_IQR": [float(ari_iqr[0]), float(ari_iqr[1])]})
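
A possible variant of the stability check fits each bootstrap model on the resample but scores it on the full dataset, so duplicated rows in the resample do not enter the ARI. The sketch below shows this for KMeans only, on stand-in blobs; `bootstrap_ari_full` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Stand-in data (assumption): three well-separated blobs.
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 6], [0, 8]],
                       cluster_std=0.8, random_state=7)
ref = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_demo)

def bootstrap_ari_full(X, ref_labels, k, B=20, seed=7):
    """Fit on each bootstrap resample, predict on all of X, compare to ref_labels."""
    rng = np.random.default_rng(seed)
    aris = []
    for b in range(B):
        idx = rng.choice(len(X), size=len(X), replace=True)
        model = KMeans(n_clusters=k, n_init=10, random_state=seed + b).fit(X[idx])
        aris.append(adjusted_rand_score(ref_labels, model.predict(X)))
    return np.array(aris)

aris = bootstrap_ari_full(X_demo, ref.labels_, k=3)
q25, q75 = np.percentile(aris, [25, 75])
print({"ari_median": float(np.median(aris)), "ari_IQR": [float(q25), float(q75)]})
```

ARI is invariant to label permutations, so the arbitrary cluster numbering of each bootstrap model does not need to be aligned with the reference.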

### TODO 4 — Fairness: distribution by protected group per cluster

In [None]:
tab = pd.crosstab(pd.Series(labels, name="cluster"), df["group"])
share = tab.div(tab.sum(1), axis=0)
print("Counts:\n", tab, "\n\nShares by cluster:\n", share)

# Simple rule: flag clusters with >70% from one group
flags = (share.max(1) > 0.70)
flags
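
A fixed 70% threshold ignores base rates: if one group makes up 80% of all users, every cluster can exceed 70% without any group being isolated. One complementary view is the representation ratio, which divides each group's share within a cluster by its overall share. The sketch below uses a small hand-built table; the data and the 0.8 / 1.25 flagging band are illustrative assumptions, not thresholds from the challenge.

```python
import pandas as pd

# Toy assignments (assumption): group A is 80% of the population overall.
demo = pd.DataFrame({
    "cluster": [0] * 50 + [1] * 50,
    "group": ["A"] * 45 + ["B"] * 5 + ["A"] * 35 + ["B"] * 15,
})

tab = pd.crosstab(demo["cluster"], demo["group"])
share = tab.div(tab.sum(axis=1), axis=0)              # share of each group within a cluster
base = demo["group"].value_counts(normalize=True)     # overall share of each group
ratio = share.div(base, axis=1)                       # 1.0 means proportional representation

# Flag clusters where any group is clearly under- or over-represented.
flags = ((ratio < 0.8) | (ratio > 1.25)).any(axis=1)
print(ratio.round(2))
print(flags)
```

In this toy table group B is under-represented in cluster 0 (ratio 0.5) and over-represented in cluster 1 (ratio 1.5), even though cluster 1 sits exactly at the 70% line for group A.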

### TODO 5 — Privacy: add noise and compare utility

In [None]:
epsilon_like = 0.6  # smaller -> more noise; a tuning knob, not a calibrated privacy budget
noise_scale = 0.6 / max(epsilon_like, 1e-3)  # toy mapping; the floor avoids division by zero

X_noisy = X + np.random.normal(0, noise_scale, size=X.shape)

if method=="KMeans":
    m2 = KMeans(n_clusters=best_k, n_init=10, random_state=RANDOM_SEED).fit(X_noisy)
    labels2 = m2.labels_
else:
    m2 = GaussianMixture(n_components=best_k, random_state=RANDOM_SEED).fit(X_noisy)
    labels2 = m2.predict(X_noisy)

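sil_orig = silhouette_score(X, labels)  # computed once and reused below
sil2 = silhouette_score(X_noisy, labels2)

# "abs_drop" is the signed difference original - noisy, in absolute silhouette units (not a percentage)
print({"sil_original": float(sil_orig), "sil_noisy": float(sil2), "abs_drop": float(sil_orig - sil2)})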
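
A single noise setting says little about the trade-off. With the default `epsilon_like = 0.6`, the toy mapping gives a noise scale of 1.0, which equals one standard deviation of each scaled feature. The sketch below sweeps several `epsilon_like` values with the same toy mapping and reports the silhouette drop for each. It is self-contained on stand-in blobs with KMeans only; the helper `fit_sil` and the grid of values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Stand-in data (assumption): three well-separated blobs, standardized.
X_demo, _ = make_blobs(n_samples=600, centers=[[0, 0], [6, 6], [0, 8]],
                       cluster_std=0.8, random_state=7)
X_demo = StandardScaler().fit_transform(X_demo)

def fit_sil(X, k=3, seed=7):
    """Fit KMeans and return the silhouette of its labels on the same data."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

sil_clean = fit_sil(X_demo)
rows = []
for eps_like in [0.3, 0.6, 1.2, 2.4, 4.8]:
    scale = 0.6 / eps_like  # same toy mapping as in the notebook cell
    X_n = X_demo + rng.normal(0.0, scale, size=X_demo.shape)
    sil_n = fit_sil(X_n)
    rows.append((eps_like, scale, sil_n, sil_clean - sil_n))

for eps_like, scale, sil_n, drop in rows:
    print(f"epsilon_like={eps_like}: noise_scale={scale:.3f}, "
          f"silhouette={sil_n:.3f}, drop={drop:.3f}")
```

Reading off the largest noise scale whose drop stays within 0.15 gives a defensible parameter choice to report in the Data Sheet.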

### TODO 6 — Data Sheet (Markdown)
Describe: synthetic data, protected attribute, K choice, method, noise parameter, fairness findings, and human oversight recommendations.

In [None]:
# === Auto-Scoring ===
score = 0
sil0 = silhouette_score(X, labels)
sil_pass = sil0 >= 0.35
if sil_pass: score += 20

ari_med, _ = bootstrap_ari(X, labels, method, best_k, B=20)
ari_pass = (ari_med >= 0.40)
if ari_pass: score += 20

sil_noise = silhouette_score(X_noisy, labels2)
abs_drop = sil0 - sil_noise
drop_pass = (abs_drop <= 0.15)
if drop_pass: score += 20

tab = pd.crosstab(pd.Series(labels, name="cluster"), df["group"])
share = tab.div(tab.sum(1), axis=0)
fair_pass = (share.max(1) <= 0.70).all()
# Defaults to "not justified". Set justified = True only after documenting the justification in the Data Sheet:
justified = False
if fair_pass or justified: score += 20

data_sheet_claim = False  # set to True after you add the Data Sheet cell
if data_sheet_claim: score += 20

summary = {
    "silhouette": float(sil0),
    "ari_median": float(ari_med),
    "silhouette_noisy": float(sil_noise),
    "abs_drop": float(abs_drop),
    "fair_clusters_ok_or_justified": bool(fair_pass or justified),
    "score": int(score)
}
print("✅ Final Summary:", summary)