## 06 - Business Insights
*Final report with key findings and conclusions*

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
project_root = Path.cwd().parent
calls = pd.read_csv(project_root / 'data' / 'processed' / 'calls_cleaned.csv', parse_dates=['date_stamp'])
contacts = pd.read_csv(project_root / 'data' / 'features' / 'contact_features.csv')

### The Data
*What we analyzed*

Phone records from a South African small business owner's wife, provided by a concerned friend of the husband.

In [None]:
#compute the date range once and reuse it
start, end = calls['date_stamp'].min(), calls['date_stamp'].max()
span_days = (end - start).days

print(f'Total calls: {len(calls):,}')
print(f'Unique contacts: {len(contacts):,}')
print(f'Period: {start.strftime("%b %Y")} to {end.strftime("%b %Y")}')
print(f'Duration: {span_days // 365} years, {(span_days % 365) // 30} months')

The top contact is "Husband CEL01" with 2,413 calls - roughly 2-3 per day. The caller is clearly the wife, a mother with multiple children (Daughter ANG01, Daughter AYA01, etc.) who also manages business relationships (HK Computers, Lauzent Computers, etc.).

### Category Behavior
*How different contact types behave*

In [None]:
category_stats = calls.groupby('category').agg(
    total_calls=('category', 'count'),
    pct_business_hours=('is_business_hours', 'mean'),
    median_duration=('duration_in_seconds', 'median')  #median, not mean: robust to a few very long calls
).sort_values('pct_business_hours', ascending=False)

category_stats['pct_of_calls'] = (category_stats['total_calls'] / len(calls) * 100).round(1)
category_stats['pct_business_hours'] = (category_stats['pct_business_hours'] * 100).round(0).astype(int)
category_stats = category_stats[['pct_of_calls', 'pct_business_hours', 'median_duration']]
category_stats.columns = ['% of Calls', '% Business Hours', 'Median Duration (s)']
category_stats

**Key insight:** Unknown contacts behave more like business than family - 64% during work hours vs Family's 51%. This suggests Unknown contains hidden business contacts, not personal relationships.
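
The comparison above is computed per call, so a few very active contacts can pull a category's share up or down. A minimal sketch of the same comparison computed per contact, assuming the `name`, `category`, and `is_business_hours` columns used elsewhere in this notebook. The helper name `business_hours_share_by_contact` and the toy data are illustrative only and not part of the original analysis:

```python
import pandas as pd

def business_hours_share_by_contact(calls: pd.DataFrame) -> pd.Series:
    """Average per-contact share of calls in business hours, by category (in %)."""
    # Step 1: share of business-hours calls for each individual contact
    per_contact = calls.groupby(['category', 'name'])['is_business_hours'].mean()
    # Step 2: average those per-contact shares within each category
    return (per_contact.groupby(level='category').mean() * 100).round(1)

# Toy data (hypothetical): contact A makes most of the Unknown calls
toy_calls = pd.DataFrame({
    'name': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'category': ['Unknown'] * 5 + ['Family'] * 2,
    'is_business_hours': [True, True, True, True, False, True, False],
})
shares = business_hours_share_by_contact(toy_calls)
print(shares)
```

In the toy data, 80% of Unknown calls fall in business hours, but only one of the two Unknown contacts calls during the day. If the per-call and per-contact figures diverge like this on the real data, the 64% figure would deserve a closer look.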

### Classifying the Unknown
*The main objective - what are these 1,977 contacts?*

In [None]:
unknown = contacts[contacts['category'] == 'Unknown']

print(f"Unknown contacts: {len(unknown):,}")
print(f"\nSupervised model predictions:")
print(unknown['predicted_category'].value_counts())

In [None]:
#visualize the breakdown
fig, ax = plt.subplots(figsize=(8, 5))

pred_counts = unknown['predicted_category'].value_counts()
colors = ['#2ecc71', '#3498db', '#9b59b6', '#e74c3c']
bars = ax.bar(pred_counts.index, pred_counts.values, color=colors)

ax.set_title('Unknown Contacts: Predicted Categories')
ax.set_ylabel('Number of Contacts')
ax.set_xlabel('')

for bar, v in zip(bars, pred_counts.values):
    ax.text(bar.get_x() + bar.get_width()/2, v + 20, str(v), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

**The verdict on Unknown contacts:**

| Predicted Category | Count | % of Unknown |
|-------------------|-------|--------------|
| Service Provider | 958 | 48.5% |
| Supplier | 898 | 45.4% |
| Important Contacts | 97 | 4.9% |
| Family | 24 | 1.2% |

**93.9% of Unknown contacts are predicted to be business-related** (Service Provider + Supplier). Only 24 contacts (1.2%) show Family-like behavior - and two of those are our muffens Duma and Eric.

This confirms our hypothesis from exploration: the Unknown category is dominated by unlabeled business contacts, not hidden personal relationships.
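
The percentages in the table can be reproduced from the raw counts. A small sketch, assuming only the counts reported above; `summarize_predictions` is a hypothetical helper, and the Series is rebuilt from those counts rather than loaded from the real `predicted_category` column:

```python
import pandas as pd

def summarize_predictions(predicted: pd.Series) -> pd.DataFrame:
    """Count and percentage share of each predicted category."""
    counts = predicted.value_counts()
    pct = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({'Count': counts, '% of Unknown': pct})

# Counts as reported in the table above
reported = {'Service Provider': 958, 'Supplier': 898, 'Important Contacts': 97, 'Family': 24}
predicted = pd.Series([cat for cat, n in reported.items() for _ in range(n)])

summary = summarize_predictions(predicted)
business = summary.loc[['Service Provider', 'Supplier'], 'Count'].sum()
business_share = business / len(predicted) * 100

print(summary)
print(f'Business-related: {business_share:.1f}%')
```

On the real data, the same helper could be called as `summarize_predictions(unknown['predicted_category'])`.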

### The Muffens
*Suspicious contacts that stood out*

"Muffen" is Norwegian slang for sensing something suspicious. Three Unknown contacts raised red flags:

In [None]:
muffens = ['Duma', 'Eric', 'Modiba']
muffen_calls = calls[calls['name'].isin(muffens)]

profiles = []
for name in muffens:
    person = muffen_calls[muffen_calls['name'] == name]
    late_night = len(person[(person['hour'] >= 21) | (person['hour'] < 6)])
    profiles.append({
        'Name': name,
        'Total Calls': len(person),
        'Period': f"{person['date_stamp'].min().strftime('%b %Y')} - {person['date_stamp'].max().strftime('%b %Y')}",
        '% Business Hours': round(person['is_business_hours'].mean() * 100),
        'Late-Night Calls': late_night,
        'Max Call (min)': round(person['duration_in_seconds'].max() / 60, 1),
        'Status': 'Ended' if person['date_stamp'].max() < pd.Timestamp('2023-01-01') else 'Ongoing'
    })

pd.DataFrame(profiles).set_index('Name')

**Duma** - The intense, short-lived relationship
- 27 calls between 1-5 AM on weekdays
- 30-minute call at 5 AM on a Tuesday (Apr 5, 2022)
- Spiked to 120 calls/month in March, then abrupt decline
- Contact ended November 2022 - classic pattern of discovery or pressure

**Eric** - The slow burn
- 9 late-night calls (9-11 PM range)
- 25-minute call at 11 PM on a Friday (Jul 12, 2024)
- Started slow in 2022, grew steadily through 2023-2024
- Still ongoing as of data cutoff

**Modiba** - The new discovery
- Found through clustering analysis, not manual inspection
- 97-minute max call - longer than anything Duma or Eric had
- Peak calling hour at 8 PM
- Supervised model mislabeled as "Service Provider"
- No service provider talks for 97 minutes at 8 PM
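
Figures such as "peak calling hour" and "max call" can be pulled per contact from the call log. A minimal sketch, assuming the `name`, `hour`, and `duration_in_seconds` columns used above; `contact_profile` is a hypothetical helper and the toy rows are made up for illustration:

```python
import pandas as pd

def contact_profile(calls: pd.DataFrame, name: str) -> dict:
    """Peak calling hour and longest call (in minutes) for one contact."""
    person = calls[calls['name'] == name]
    return {
        # mode() returns the most frequent hour; on ties the smallest hour comes first
        'peak_hour': int(person['hour'].mode().iloc[0]),
        'max_call_min': round(float(person['duration_in_seconds'].max()) / 60, 1),
    }

# Toy data (hypothetical contacts X and Y)
toy_calls = pd.DataFrame({
    'name': ['X', 'X', 'X', 'Y'],
    'hour': [20, 20, 9, 14],
    'duration_in_seconds': [120, 3600, 60, 30],
})
profile = contact_profile(toy_calls, 'X')
print(profile)
```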

### Relationship Timelines
*How the patterns evolved*

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(12, 10))

for i, name in enumerate(muffens):
    person = calls[calls['name'] == name]
    monthly = person.groupby(person['date_stamp'].dt.to_period('M')).size()
    monthly.plot(kind='line', ax=axes[i], marker='o', linewidth=2, markersize=6)
    axes[i].set_title(f'{name}: Calls Over Time')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Calls')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### What the Models Found
*Supervised vs Unsupervised*

**Supervised Learning (RandomForest, 79% accuracy)**

Trained on 114 labeled contacts to predict 1,977 unknowns:
- 94% classified as business (Service Provider + Supplier)
- Only 24 classified as Family
- Duma and Eric → Family (closest available category)
- Modiba → Service Provider (wrong - the model missed this one)

**Unsupervised Learning (K-Means, K=5)**

Let the algorithm find natural groupings:

| Cluster | Size | Profile |
|---------|------|--------|
| 0 | 1,218 | Business contacts (96% business hours) |
| 1 | 501 | Evening callers (7% business hours) |
| 2 | 16 | Night owls (77% late-night, short calls) |
| 3 | 217 | Long-term regulars (554 days active) |
| 4 | 25 | Heavy talkers (17-min avg, 59-min max) |

**Cluster 4 is the muffen cluster.** Only 25 contacts with unusually long calls. This is where Duma and Modiba landed. Clustering caught what the supervised model missed.
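
Cluster profiles like the ones in the table are typically built by grouping contacts on their cluster label and aggregating a few features. A rough sketch under stated assumptions: the `cluster` column name is hypothetical (the notebook does not show what the label column is called), while `pct_business_hours` and `total_calls` are columns from the contact features. The toy values are invented:

```python
import pandas as pd

def profile_clusters(contacts: pd.DataFrame, cluster_col: str = 'cluster') -> pd.DataFrame:
    """Size and typical feature values for each cluster."""
    return contacts.groupby(cluster_col).agg(
        size=('pct_business_hours', 'size'),            # number of contacts in the cluster
        pct_business_hours=('pct_business_hours', 'mean'),
        median_total_calls=('total_calls', 'median'),
    ).round(1)

# Toy data (hypothetical): two daytime contacts and one evening contact
toy_contacts = pd.DataFrame({
    'cluster': [0, 0, 1],
    'pct_business_hours': [90.0, 100.0, 10.0],
    'total_calls': [10, 30, 5],
})
profiles = profile_clusters(toy_contacts)
print(profiles)
```

Small clusters are the interesting ones here: a profile table sorted by `size` makes groups like the 25-contact cluster easy to spot.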

### The Evidence
*Late-night call details*

In [None]:
late_night = calls[
    (calls['name'].isin(['Duma', 'Eric'])) & 
    ((calls['hour'] >= 21) | (calls['hour'] < 6))
][['name', 'date_stamp', 'hour', 'duration_in_seconds', 'day_of_week']].copy()

late_night['duration_min'] = (late_night['duration_in_seconds'] / 60).round(1)
late_night = late_night.sort_values(['name', 'date_stamp'])
late_night[['name', 'date_stamp', 'hour', 'duration_min', 'day_of_week']]

The most concerning entries:
- **Duma, Apr 5 2022, 5 AM Tuesday**: 29.5 minutes - preceded by three short calls at 4 AM (checking if awake?)
- **Eric, Jul 12 2024, 11 PM Friday**: 24.9 minutes - a Friday night conversation
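
To see what led up to a long call, one option is to list the same contact's calls in a window just before it. A minimal sketch, assuming `date_stamp` carries the time of day and not just the date. The helper `calls_before`, the two-hour window, and the toy rows are all illustrative and are not the actual records:

```python
import pandas as pd

def calls_before(calls: pd.DataFrame, name: str, long_call_time, window: str = '2h') -> pd.DataFrame:
    """Calls with one contact in the window immediately before a given call."""
    t = pd.Timestamp(long_call_time)
    person = calls[calls['name'] == name]
    in_window = (person['date_stamp'] >= t - pd.Timedelta(window)) & (person['date_stamp'] < t)
    return person[in_window].sort_values('date_stamp')

# Toy data (hypothetical): three short calls from X, then a 30-minute call
toy_calls = pd.DataFrame({
    'name': ['X', 'X', 'X', 'X', 'X', 'Y'],
    'date_stamp': pd.to_datetime([
        '2022-01-03 04:10', '2022-01-03 04:25', '2022-01-03 04:40',
        '2022-01-03 05:05', '2022-01-02 12:00', '2022-01-03 04:30',
    ]),
    'duration_in_seconds': [8, 12, 5, 1800, 60, 45],
})
lead_up = calls_before(toy_calls, 'X', '2022-01-03 05:05')
print(lead_up[['date_stamp', 'duration_in_seconds']])
```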

---

## Conclusions

### What the data shows

Three Unknown contacts exhibit patterns inconsistent with business relationships:

1. **Duma** - Intense relationship in early 2022 with 27 late-night calls (1-5 AM). Ended abruptly in November 2022, suggesting discovery or external pressure.

2. **Eric** - Slow-building relationship since 2022, still ongoing. Less extreme timing than Duma (9-11 PM vs 1-5 AM) but persistent.

3. **Modiba** - Discovered through clustering. Longest calls of all three (97 min max). Ongoing relationship with evening pattern.

### What the data doesn't show

Phone logs record *when* and *how long*, not *what was said*. Alternative explanations exist:
- International contacts in different time zones
- Night-shift workers
- Close friends with unusual schedules

### The uncomfortable truth

Calls at 4-5 AM on weekdays to someone you're not related to, one of them lasting 30 minutes, that suddenly stop after 10 months? That's a pattern that's hard to explain away.

The data tells a story. I just present it.

---

## Summary

| Metric | Value |
|--------|-------|
| Total calls analyzed | 24,952 |
| Unique contacts | 2,091 |
| Originally labeled | 114 (5.5%) |
| Unknown (classified) | 1,977 (94.5%) |
| Supervised model accuracy | 79% |

**Classification results for Unknown contacts:**

| Category | Count | % |
|----------|-------|---|
| Service Provider | 958 | 48.5% |
| Supplier | 898 | 45.4% |
| Important Contacts | 97 | 4.9% |
| Family | 24 | 1.2% |

**Main finding:** 93.9% of Unknown contacts are business-related. The "Unknown" category was mostly unlabeled business contacts, not hidden personal relationships.

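**Secondary finding:** Three Unknown contacts stand out as suspicious: Duma and Eric, both among the 24 Family-predicted contacts, and Modiba, whom the supervised model labeled Service Provider but clustering flagged. Their late-night or unusually long calls are inconsistent with typical family or business behavior.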

**Output:** Final dataset saved to `data/features/contacts_final.csv` with original labels, predicted categories, and final classifications for all 2,091 contacts.

### Save Final Dataset
*Export contacts with all classifications*

In [None]:
#load clustering results from notebook 05
contacts_clustered = pd.read_csv(project_root / 'data' / 'features' / 'contact_features.csv')

#create final classification column:
#keep the original label where one exists, otherwise use the prediction
is_labeled = contacts_clustered['category'] != 'Unknown'
contacts_clustered['final_category'] = contacts_clustered['category'].where(
    is_labeled, contacts_clustered['predicted_category']
)

#summary of final classifications
print(f"Final category distribution (all {len(contacts_clustered):,} contacts):")
print(contacts_clustered['final_category'].value_counts())

In [None]:
#save final dataset
output_path = project_root / 'data' / 'features' / 'contacts_final.csv'
contacts_clustered.to_csv(output_path, index=False)

print(f"Saved to {output_path}")
print(f"\nColumns in final dataset:")
print(contacts_clustered.columns.tolist())

In [None]:
#preview final dataset
contacts_clustered[['name', 'category', 'predicted_category', 'final_category', 'total_calls', 'pct_business_hours']].head(10)