## 05 - Clustering Unknown Contacts
*Unsupervised exploration to find hidden patterns in the murky waters*

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [7]:
project_root = Path.cwd().parent
df_all = pd.read_csv(project_root / 'data' / 'features' / 'contact_features.csv')
df = df_all[df_all['category'] == 'Unknown'].copy()

In [None]:
clustering_features = [
    'pct_business_hours',
    'pct_late_night',
    'avg_hour',
    'avg_duration',
    'max_duration',
    'calls_per_month',
    'days_active',
]

### Feature Selection
*Picking the behavioral fingerprint*

We skip `total_duration` and `total_calls` because `calls_per_month` and `avg_duration` already capture them. Redundant features count the same signal twice in the distance calculation, which skews the clustering.
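One way to check a redundancy claim like this is to look at pairwise correlations among the candidate features. The sketch below uses a small synthetic frame (made-up numbers, not the contact data) in which `calls_per_month` is, as a simplification, an exact rescaling of `total_calls`. The helper name `highly_correlated_pairs` and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, |r|) for pairs whose absolute
    Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if r > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# Synthetic stand-in data; the real features live in contact_features.csv
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'total_calls': rng.integers(1, 50, size=n),
    'avg_duration': rng.uniform(5, 120, size=n),
})
# Assumption for the demo: every contact is observed over a fixed 12-month window
demo['calls_per_month'] = demo['total_calls'] / 12.0
demo['total_duration'] = demo['total_calls'] * demo['avg_duration']

pairs = highly_correlated_pairs(demo)
print(pairs)
```

Any pair this flags is a candidate for dropping one of its two members before clustering.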

In [9]:
X = df[clustering_features].copy()
X.head()

| | pct_business_hours | pct_late_night | avg_hour | avg_duration | max_duration | calls_per_month | days_active |
|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.0 | 10.0 | 4.0 | 4 | 1.0 | 0 |
| 2 | 1.0 | 0.0 | 10.0 | 39.0 | 73 | 2.0 | 0 |
| 3 | 1.0 | 0.0 | 9.0 | 18.0 | 18 | 1.0 | 0 |
| 4 | 0.615385 | 0.015385 | 14.223077 | 18.0 | 462 | 58.208955 | 37 |
| 5 | 0.909091 | 0.0 | 11.181818 | 8.0 | 62 | 0.56314 | 556 |


### Scale Features
*Standardization so duration doesn't dominate*

In [10]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
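To see why standardization matters for distance-based clustering, here is a minimal sketch on three made-up contacts with two features on very different scales. The numbers and the names `X_demo` and `dist` are illustrative, not taken from the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [pct_late_night, max_duration in seconds] (made-up values)
X_demo = np.array([
    [0.00, 100.0],   # reference contact
    [1.00, 110.0],   # opposite late-night share, similar duration
    [0.00, 400.0],   # same late-night share, longer max call
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the seconds column swamps the 0-1 proportion column
raw_night_gap = dist(X_demo[0], X_demo[1])
raw_duration_gap = dist(X_demo[0], X_demo[2])

# After standardization both columns have unit variance
X_demo_scaled = StandardScaler().fit_transform(X_demo)
scaled_night_gap = dist(X_demo_scaled[0], X_demo_scaled[1])
scaled_duration_gap = dist(X_demo_scaled[0], X_demo_scaled[2])

print(f"raw:    night gap {raw_night_gap:.2f}, duration gap {raw_duration_gap:.2f}")
print(f"scaled: night gap {scaled_night_gap:.2f}, duration gap {scaled_duration_gap:.2f}")
```

In raw units the contact with a completely different late-night share looks like a near neighbour; after scaling, the two differences carry comparable weight.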

### Finding Optimal K
*How many clusters should we use?*

- **Elbow method** measures cluster tightness - look for where adding clusters stops helping
- **Silhouette score** measures separation - higher is better
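
For reference, these are the standard definitions behind the two criteria. The notation is introduced here: $C_j$ is cluster $j$, $\mu_j$ its centroid, $a(i)$ the mean distance from point $i$ to the other points in its own cluster, and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster.

```latex
\text{inertia} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \quad -1 \le s(i) \le 1
```

`silhouette_score` reports the mean of $s(i)$ over all samples. Inertia can only be compared across K on the same data, since it tends to shrink as K grows.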

In [None]:
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_scaled, kmeans.labels_))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')

axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score')

plt.tight_layout()
plt.show()

The elbow bends around K=5, while the silhouette score peaks around K=7. We go with K=5: there are 4 known categories, so a 5th cluster might expose a hidden group.
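
When the two criteria disagree like this, it can help to put them side by side in one table rather than reading two plots. The sketch below does that on synthetic blobs from `make_blobs`, not the contact data; the helper name `k_diagnostics` and the `inertia_drop_pct` column are illustrative assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def k_diagnostics(X, k_values, random_state=42):
    """Fit K-Means for each K and collect inertia and silhouette in one table."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit(X)
        rows.append({
            'k': k,
            'inertia': model.inertia_,
            'silhouette': silhouette_score(X, model.labels_),
        })
    table = pd.DataFrame(rows).set_index('k')
    # Percentage drop in inertia relative to the previous K;
    # small values suggest the elbow has already been passed
    table['inertia_drop_pct'] = -table['inertia'].pct_change() * 100
    return table

# Synthetic stand-in for X_scaled
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
diag = k_diagnostics(X_blobs, range(2, 8))
best_k_by_silhouette = int(diag['silhouette'].idxmax())

print(diag.round(3))
print("silhouette-best K:", best_k_by_silhouette)
```

The table only informs the choice. Picking K=5 over the silhouette-best value remains a judgment call based on domain knowledge.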

### K-Means Clustering
*Letting the algorithm find 5 groups*

In [12]:
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(X_scaled)

df['cluster'].value_counts().sort_index()


cluster
0    1218
1     501
2      16
3     217
4      25
Name: count, dtype: int64
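
With clusters as small as 16 and 25 members, it is worth checking that the partition does not depend on the random initialization. A minimal sketch, assuming the adjusted Rand index as the agreement measure and using synthetic blobs in place of `X_scaled`; `seed_stability` is a hypothetical helper name.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def seed_stability(X, n_clusters, seeds):
    """Mean pairwise adjusted Rand index between K-Means runs with different seeds.
    Values near 1.0 mean the runs produce essentially the same partition."""
    labelings = [
        KMeans(n_clusters=n_clusters, random_state=s, n_init=10).fit_predict(X)
        for s in seeds
    ]
    scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(scores))

# Synthetic stand-in: three well-separated groups
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.5,
    random_state=1,
)
stability = seed_stability(X_blobs, n_clusters=3, seeds=[0, 1, 2, 3])
print(f"mean pairwise ARI: {stability:.3f}")
```

The adjusted Rand index ignores label numbering, so two runs that find the same groups under different cluster IDs still score 1.0.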

### Where Did the Muffens Land?
*Tracking our suspicious five*

In [13]:
muffens = ['Duma', 'Eric', 'Alli', 'Park', 'Zohra']
df[df['name'].isin(muffens)][['name', 'cluster'] + clustering_features]

| | name | cluster | pct_business_hours | pct_late_night | avg_hour | avg_duration | max_duration | calls_per_month | days_active |
|---|---|---|---|---|---|---|---|---|---|
| 32 | Alli | 4 | 0.659669 | 0.0 | 12.898343 | 22.0 | 2068 | 26.38484 | 999 |
| 320 | Duma | 4 | 0.304762 | 0.128571 | 14.038095 | 24.5 | 1769 | 19.626168 | 291 |
| 389 | Eric | 3 | 0.310502 | 0.041096 | 14.305936 | 27.0 | 1722 | 6.865204 | 927 |
| 1246 | Park | 3 | 0.728223 | 0.0 | 13.393728 | 5.0 | 372 | 13.909532 | 589 |
| 2062 | Zohra | 3 | 0.689655 | 0.0 | 12.098522 | 40.0 | 818 | 11.825243 | 485 |


Duma and Alli landed in Cluster 4, while Eric, Park, and Zohra are in Cluster 3. Interesting - Duma and Eric split

### Cluster Profiles
*What does each group look like?*

In [14]:
df.groupby('cluster')[clustering_features].mean().round(2)

| cluster | pct_business_hours | pct_late_night | avg_hour | avg_duration | max_duration | calls_per_month | days_active |
|---|---|---|---|---|---|---|---|
| 0 | 0.96 | 0.0 | 12.34 | 64.2 | 103.71 | 1.7 | 20.39 |
| 1 | 0.07 | 0.0 | 14.5 | 55.83 | 102.89 | 1.64 | 19.64 |
| 2 | 0.11 | 0.77 | 19.28 | 81.19 | 104.0 | 1.43 | 43.0 |
| 3 | 0.65 | 0.01 | 13.08 | 46.67 | 314.54 | 1.18 | 554.11 |
| 4 | 0.46 | 0.08 | 13.9 | 1059.3 | 3546.16 | 8.73 | 285.52 |


Duma and Alli are grouped in Cluster 4, which is defined by long calls rather than late-night activity. Eric, Park, and Zohra are in Cluster 3, which is defined by long-term relationships (a mean of 554 days active).
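
A cluster profile can also be read from the fitted model itself. Centroids in standardized units show how many standard deviations each cluster sits from the overall mean on each feature, and `inverse_transform` maps them back to original units. The sketch below uses made-up two-feature data with hypothetical names (`demo`, `demo_scaler`, `demo_kmeans`), not the notebook's `scaler` and `kmeans`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ['avg_duration', 'days_active']

# Made-up data: a large ordinary group plus a small long-call group
demo = pd.DataFrame(
    np.vstack([
        rng.normal([60, 20], [10, 5], size=(200, 2)),
        rng.normal([1000, 300], [100, 30], size=(20, 2)),
    ]),
    columns=features,
)

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)
demo_kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(demo_scaled)

# Centroids in standardized units (distance from the overall mean in std devs)
z_centers = pd.DataFrame(demo_kmeans.cluster_centers_, columns=features)

# The same centroids mapped back to original units
raw_centers = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=features,
)

# Feature with the largest absolute z-score for each cluster
dominant_feature = z_centers.abs().idxmax(axis=1)

print(raw_centers.round(1))
print(dominant_feature)
```

On converged K-Means the back-transformed centroids should agree with the `groupby('cluster').mean()` table, so the standardized view mainly adds a sense of which features are driving each cluster.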

Plot twist - our suspects are not in the *real suspicious cluster*. Cluster 2 has 77% late-night calls with only 16 people

### Cluster 2: The Night Owls
*16 contacts with 77% late-night calls - who are they?*

In [18]:
df[df['cluster'] == 2][['name', 'predicted_category', 'cluster'] + clustering_features].sort_values('pct_late_night', ascending=False)

| | name | predicted_category | cluster | pct_business_hours | pct_late_night | avg_hour | avg_duration | max_duration | calls_per_month | days_active |
|---|---|---|---|---|---|---|---|---|---|---|
| 96 | Basson | Important Contacts | 2 | 0.0 | 1.0 | 23.0 | 10.0 | 10 | 1.0 | 0 |
| 521 | Ganas | Important Contacts | 2 | 0.0 | 1.0 | 21.0 | 2.0 | 2 | 1.0 | 0 |
| 528 | Gavin | Important Contacts | 2 | 0.0 | 1.0 | 21.0 | 39.5 | 77 | 2.0 | 0 |
| 557 | Goodman | Important Contacts | 2 | 0.0 | 1.0 | 21.5 | 38.0 | 42 | 2.0 | 0 |
| 989 | Loots | Service Provider | 2 | 0.0 | 1.0 | 21.0 | 172.0 | 172 | 1.0 | 0 |
| 1373 | Quill | Important Contacts | 2 | 0.0 | 1.0 | 21.0 | 34.0 | 34 | 1.0 | 0 |
| 2058 | Zitha | Service Provider | 2 | 0.0 | 1.0 | 22.0 | 158.0 | 158 | 1.0 | 0 |
| 2061 | Zoghby | Supplier | 2 | 0.0 | 1.0 | 21.0 | 17.0 | 17 | 1.0 | 0 |
| 671 | Human | Important Contacts | 2 | 0.0 | 0.666667 | 16.666667 | 4.0 | 5 | 2.368421 | 8 |
| 1600 | Themba | Important Contacts | 2 | 0.2 | 0.6 | 12.8 | 4.0 | 39 | 2.777778 | 24 |


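Fake muffens. They look suspicious (most of the rows shown are 100% late-night), but their calls are short (average durations range from 2 s to 172 s, mostly under 40 s) and infrequent (1 to 3 calls per month). The pattern is quick late-night check-ins, not intimate conversations.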

### Cluster 4: The Heavy Talkers
*25 contacts with 17-min average calls - Duma's crew*

In [19]:
df[df['cluster'] == 4][['name', 'predicted_category', 'cluster'] + clustering_features].sort_values('pct_late_night', ascending=False)

| | name | predicted_category | cluster | pct_business_hours | pct_late_night | avg_hour | avg_duration | max_duration | calls_per_month | days_active |
|---|---|---|---|---|---|---|---|---|---|---|
| 1471 | Rowan | Service Provider | 4 | 0.0 | 0.5 | 20.0 | 3601.5 | 7200 | 2.0 | 0 |
| 1097 | Ncube | Service Provider | 4 | 0.333333 | 0.333333 | 15.111111 | 461.0 | 2832 | 0.55102 | 460 |
| 1785 | Vosser | Family | 4 | 0.309524 | 0.222222 | 13.555556 | 5.0 | 2566 | 12.038217 | 284 |
| 1958 | Yeates | Family | 4 | 0.256757 | 0.202703 | 14.297297 | 67.5 | 3242 | 15.205479 | 116 |
| 392 | Erskine | Family | 4 | 0.375 | 0.175 | 14.575 | 52.0 | 7200 | 12.121212 | 69 |
| 1058 | Modiba | Service Provider | 4 | 0.296296 | 0.148148 | 14.407407 | 390.0 | 5850 | 2.334294 | 317 |
| 320 | Duma | Family | 4 | 0.304762 | 0.128571 | 14.038095 | 24.5 | 1769 | 19.626168 | 291 |
| 1617 | Tiro | Family | 4 | 0.252525 | 0.060606 | 14.737374 | 73.0 | 4148 | 8.658892 | 313 |
| 99 | Beer | Service Provider | 4 | 0.431034 | 0.051724 | 13.465517 | 308.0 | 6394 | 1.87905 | 896 |
| 286 | Devi | Family | 4 | 0.425287 | 0.034483 | 15.229885 | 16.0 | 3739 | 2.785486 | 907 |


Eish! Muffen-goldmine. New suspects: Rowan, Ncube, Modiba. All predicted as Service Provider by the tree, but no service provider talks for an hour at 8 PM
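
The mismatch described here, a "Service Provider" prediction combined with very long and partly late-night calls, can be written as an explicit filter. The sketch below copies a few rows from the Cluster 4 output. Both thresholds are illustrative assumptions, not the criteria actually used to pick the suspects.

```python
import pandas as pd

# A few rows copied from the Cluster 4 output
cluster4_sample = pd.DataFrame({
    'name': ['Rowan', 'Ncube', 'Modiba', 'Beer', 'Duma', 'Devi'],
    'predicted_category': [
        'Service Provider', 'Service Provider', 'Service Provider',
        'Service Provider', 'Family', 'Family',
    ],
    'max_duration': [7200, 2832, 5850, 6394, 1769, 3739],          # seconds
    'pct_late_night': [0.5, 0.333333, 0.148148, 0.051724, 0.128571, 0.034483],
})

# Assumed cutoffs, chosen for illustration only
MAX_DURATION_THRESHOLD = 45 * 60   # longest call of at least 45 minutes
LATE_NIGHT_THRESHOLD = 0.10        # at least 10% of calls late at night

mismatch = cluster4_sample[
    (cluster4_sample['predicted_category'] == 'Service Provider')
    & (cluster4_sample['max_duration'] >= MAX_DURATION_THRESHOLD)
    & (cluster4_sample['pct_late_night'] >= LATE_NIGHT_THRESHOLD)
]
flagged = mismatch['name'].tolist()
print(flagged)
```

With these cutoffs, Beer is excluded despite a long maximum call because only about 5% of its calls are late at night. Different thresholds would flag a different set.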

### New Suspects
*Rowan, Ncube, Modiba - let's investigate*

In [24]:
calls = pd.read_csv(project_root / 'data' / 'processed' / 'calls_cleaned.csv')

new_suspects = ['Rowan', 'Ncube', 'Modiba']

summary = []
for name in new_suspects:
    person = calls[calls['name'] == name]
    late_night = len(person[(person['hour'] >= 21) | (person['hour'] < 6)])
    summary.append({
        'name': name,
        'total_calls': len(person),
        'period': f"{person['date_stamp'].min()} to {person['date_stamp'].max()}",
        'late_night_calls': late_night,
        'max_call_min': round(person['duration_in_seconds'].max() / 60, 1),
        'peak_hour': person['hour'].mode().iloc[0]
    })

pd.DataFrame(summary)

| | name | total_calls | period | late_night_calls | max_call_min | peak_hour |
|---|---|---|---|---|---|---|
| 0 | Rowan | 2 | 2022-11-22 to 2022-11-22 | 1 | 120.0 | 19 |
| 1 | Ncube | 9 | 2022-09-19 to 2023-12-23 | 3 | 47.2 | 21 |
| 2 | Modiba | 27 | 2023-08-31 to 2024-07-13 | 4 | 97.5 | 8 |


**Verdict:**
- **Rowan:** False alarm. Only 2 calls, both on the same day; likely a one-off support call
- **Ncube:** Minor interest. 9 calls over 15 months: low frequency, but a 47-min max and a peak hour of 21:00
- **Modiba:** New muffen. 27 calls over about 10 months, 97-min max, still active as of the latest call on 2024-07-13. Note that the peak hour in the table is 8, which is 08:00, not 8 PM
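
The time spans quoted in the verdict can be derived directly from the `period` column. A small sketch, with the dates copied from the summary output and an assumed average month length of 30.44 days:

```python
import pandas as pd

# First and last call dates, copied from the summary table
periods = {
    'Rowan': ('2022-11-22', '2022-11-22'),
    'Ncube': ('2022-09-19', '2023-12-23'),
    'Modiba': ('2023-08-31', '2024-07-13'),
}

AVG_DAYS_PER_MONTH = 30.44  # assumption: mean Gregorian month length

spans = pd.DataFrame([
    {
        'name': name,
        'span_days': (pd.Timestamp(end) - pd.Timestamp(start)).days,
    }
    for name, (start, end) in periods.items()
])
spans['span_months'] = (spans['span_days'] / AVG_DAYS_PER_MONTH).round(1)
print(spans)
```

The day counts agree with the `days_active` values in the Cluster 4 table (0 for Rowan, 460 for Ncube, 317 for Modiba).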

---

## Summary

K-Means clustering (K=5) on 1,977 Unknown contacts using 7 behavioral features

| Cluster | Size | Profile |
|---------|------|---------|
| 0 | 1,218 | Business contacts (96% business hours) |
| 1 | 501 | Off-hours callers (7% business hours, mean call hour 14.5) |
| 2 | 16 | Night owls (77% late-night, short calls) |
| 3 | 217 | Long-term regulars (554 days active) |
| 4 | 25 | Heavy talkers (cluster means: 17-min average call, 59-min longest call) |

**Muffens verdict:**
- Duma, Alli → Cluster 4 (long calls)
- Eric, Park, Zohra → Cluster 3 (long-term relationships)
- Modiba → Cluster 4 (new suspect: 97-min max call, ongoing)

**Key finding:** Cluster 4 is the muffen cluster: only 25 contacts, all with unusually long calls. The supervised model labeled Modiba as "Service Provider", but no service provider talks for 97 minutes. Clustering caught what the tree missed.