# Week 6 - Supervised Learning: Classification

**Course:** Applied ML Foundations for SaaS Analytics  
**Week Focus:** Build classification models to predict customer churn, upgrade likelihood, and segment behavior.

---

## 🎯 Learning Objectives

By the end of this week, you will:
- Understand classification algorithms: Logistic Regression, Decision Trees, Random Forests
- Build and evaluate classification models
- Handle class imbalance in SaaS data
- Interpret feature importance
- Optimize for business metrics (precision vs recall trade-offs)
- Deploy models responsibly


## 🏢 Scenario: Churn Prediction Model

Sales wants to identify at-risk customers so they can intervene. Build a model that:
- Predicts which customers will churn in the next 30 days
- Ranks customers by churn risk
- Explains which features drive churn decisions

Goal: Actionable predictions for the sales team.

## ✍️ Hands-on Exercises

1. **Binary Classification**: Train Logistic Regression on churn (Yes/No)
2. **Decision Tree**: Build a tree model and visualize feature splits
3. **Random Forest**: Train an ensemble and compute feature importance
4. **Hyperparameter Tuning**: Use GridSearchCV to find optimal max_depth, n_estimators
5. **Evaluation Metrics**: Compute precision, recall, F1, ROC-AUC for each model
6. **Business Interpretation**: For top 3 features, explain impact on churn prediction
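
For exercise 4, a minimal sketch of a grid search might look like the following. It runs on synthetic data from `make_classification` as a stand-in for the churn table, and the grid values, the 15% positive rate, and the choice of `roc_auc` as the scoring metric are illustrative assumptions, not course requirements.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the churn table: roughly 15% positives (assumed rate)
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Small illustrative grid; widen it once the pipeline works end to end
param_grid = {'max_depth': [3, 5, None], 'n_estimators': [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='roc_auc',  # ranking metric, less distorted by imbalance than accuracy
    cv=3,               # 3-fold cross-validation on the training split only
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.3f}")
# score() uses the same scorer on the refit best model
print(f"Held-out ROC-AUC: {search.score(X_test, y_test):.3f}")
```

The test split is kept out of the search, so the final line gives an estimate that the tuning process has not seen.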

<details>
<summary>💡 Hint: Classification Model Workflow</summary>

**Step 1: Prepare Data**
```python
from sklearn.model_selection import train_test_split
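# stratify=y keeps the churn rate the same in train and test; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)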
```

**Step 2: Train Multiple Models**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

models = {
    'logistic': LogisticRegression(max_iter=1000),  # extra iterations help convergence on unscaled features
    'forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
```
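
One way to use a dictionary like `models` is to loop over it, fitting each model and recording one comparable score. The sketch below is self-contained and uses synthetic data, so the printed numbers say nothing about the real churn dataset. The `results` dictionary is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed ~15% positives)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}  # model name -> held-out ROC-AUC
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC-AUC = {results[name]:.3f}")
```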

**Step 3: Evaluate on Test Set**
- Accuracy: share of all predictions that are correct (misleading on imbalanced data!)
- Precision: of the customers predicted to churn, the share that actually churn
- Recall: of the customers that actually churn, the share the model catches
- F1-Score: harmonic mean of precision and recall
- ROC-AUC: ranking quality across all decision thresholds

**Step 4: Choose Metric for Your Goal**
- Minimize false positives (interventions are costly): optimize precision
- Minimize false negatives (losing a customer costs more than a wasted call): optimize recall
- Balance both: optimize F1 or tune the decision threshold
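
Threshold tuning can be sketched with `precision_recall_curve`: pick the highest threshold that still meets a recall target. The 80% target and the synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.85, 0.15], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, proba)

target_recall = 0.80  # hypothetical business requirement
# precision/recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= target_recall
# Recall never increases as the threshold rises, so the last qualifying
# index is the highest threshold that still reaches the target
best_idx = np.where(meets_target)[0][-1]
chosen_threshold = thresholds[best_idx]

y_pred = (proba >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
```

In practice the threshold would be chosen on a validation split rather than the final test set, so that the reported test metrics stay unbiased.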

</details>

<details>
<summary>✅ Solution: End-to-End Churn Classification</summary>

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# Prepare data (from Week 5 feature engineering)
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')

user_features = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).reset_index()
user_features.columns = ['user_id', 'total_usage', 'num_features']

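# Join on user_id explicitly; customers with no usage rows get 0 for both features
df = subs.merge(user_features, on='user_id', how='left')
df['total_usage'] = df['total_usage'].fillna(0)
df['num_features'] = df['num_features'].fillna(0)
# Target: 1 if the customer has any recorded churn date (not limited to a 30-day window)
df['target'] = df['churn_date'].notna().astype(int)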

# Features and target
X = df[['total_usage', 'num_features']]
y = df['target']

# Split
# stratify=y keeps the churn rate equal in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
y_pred_proba = rf.predict_proba(X_test)[:,1]

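print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Active', 'Churned']))
# Rows are actual classes, columns are predicted classes, both ordered [Active, Churned]
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"\nROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")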

# Feature importance
for feature, importance in zip(X.columns, rf.feature_importances_):
    print(f"{feature}: {importance:.3f}")
```

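**Key insight:** If `num_features` carries the largest importance in the output above, feature adoption is this model's strongest churn predictor, which points toward investing in onboarding. Importance measures association, not causation.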

</details>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

print("=" * 70)
print("WEEK 6: CLASSIFICATION MODEL DEMO")
print("=" * 70)

# Prepare data
subs = pd.read_csv('../data/subscriptions.csv', parse_dates=['signup_date','churn_date'])
feature_usage = pd.read_csv('../data/feature_usage.csv')

user_features = feature_usage.groupby('user_id').agg({
    'usage_count': 'sum',
    'feature_name': 'nunique'
}).reset_index()
user_features.columns = ['user_id', 'total_usage', 'num_features']

df = subs.merge(user_features, on='user_id', how='left')  # explicit join key
df['total_usage'] = df['total_usage'].fillna(0)
df['num_features'] = df['num_features'].fillna(0)
df['target'] = df['churn_date'].notna().astype(int)

print(f"\nDataset: {len(df)} customers")
print(f"Churn rate: {df['target'].mean():.1%}")

# Prepare features
X = df[['total_usage', 'num_features']]
y = df['target']

# stratify=y keeps the churn rate equal in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("\n1. LOGISTIC REGRESSION")
print("-" * 70)
lr = LogisticRegression(max_iter=1000)  # extra iterations help convergence on unscaled features
lr.fit(X_train, y_train)
lr_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
print(f"ROC-AUC: {lr_auc:.4f}")

print("\n2. RANDOM FOREST")
print("-" * 70)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:,1])
print(f"ROC-AUC: {rf_auc:.4f}")

print("\n3. FEATURE IMPORTANCE (Random Forest)")
print("-" * 70)
for feature, importance in zip(X.columns, rf.feature_importances_):
    print(f"{feature:.<30} {importance:.2%}")

print("\n4. BUSINESS APPLICATION")
print("-" * 70)
# Score the held-out test set only: scoring rows the forest was trained on would inflate lift
scored = df.loc[X_test.index].copy()
scored['churn_risk'] = rf.predict_proba(X_test)[:,1]
# Top 20% at-risk customers (ties at the cutoff can push this slightly above 20%)
high_risk = scored[scored['churn_risk'] >= scored['churn_risk'].quantile(0.80)]
print(f"High-risk segment (top 20% of test set): {len(high_risk)} customers")
print(f"Their actual churn rate: {high_risk['target'].mean():.1%}")
print(f"Test-set churn rate: {scored['target'].mean():.1%}")
print(f"Lift: {high_risk['target'].mean() / scored['target'].mean():.1f}x")
print("=" * 70)

## 📚 Key Concepts: Classification Metrics Explained

### Confusion Matrix
```
                     Predicted
                 Positive           Negative
Actual Positive  TP (correct)       FN (miss)
Actual Negative  FP (false alarm)   TN (correct)
```
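
The same four counts can be read from scikit-learn's `confusion_matrix`. Note that its default layout differs from the diagram above: labels are sorted, so the negative class comes first. The toy labels below are hand-made for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hand-made toy labels (not course data): 1 = churned, 0 = active
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Default layout with labels 0 and 1 is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=2 FN=1 FP=1 TN=4

# Passing labels=[1, 0] puts the positive class first, matching the diagram:
# [[TP, FN], [FP, TN]]
matrix_positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(matrix_positive_first)
```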

### Metric Definitions
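- **Accuracy**: (TP + TN) / (TP + TN + FP + FN). Overall correctness; misleading when classes are imbalanced.
- **Precision**: TP / (TP + FP). Of the customers flagged as churners, how many actually churned?
- **Recall**: TP / (TP + FN). Of the customers who actually churned, how many did the model flag?
- **F1**: 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall.
- **ROC-AUC**: The probability that the model scores a randomly chosen positive higher than a randomly chosen negative.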
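
As a worked check of these definitions, the sketch below computes each metric by hand on a small hand-made example and compares it with the corresponding scikit-learn function. The labels, scores, and the 0.5 threshold are illustrative assumptions.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hand-made toy data (not course data): 1 = churned, 0 = active
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]  # default 0.5 threshold

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as a ranking statistic: share of (positive, negative) pairs
# where the positive gets the higher score (ties count as half)
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
# accuracy=0.750 precision=0.667 recall=0.667 f1=0.667 auc=0.867

# The library functions should agree with the hand calculations
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred),
      roc_auc_score(y_true, y_score))
```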

### Churn Prediction Trade-off Example
- **High precision**: Flag only customers the model is very confident will churn (misses some churners)
- **High recall**: Flag anyone at even slight risk (more false alarms)
- **Business choice**: If sales can absorb false alarms but cannot afford to miss true churners, prioritize recall
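
One way to make this business choice explicit is to attach a cost to each error type and compare thresholds by total cost. The dollar figures below are hypothetical placeholders, and the data is synthetic, so only the pattern matters: raising the threshold trades false alarms for missed churners.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

COST_FALSE_ALARM = 50     # hypothetical cost of one unnecessary outreach
COST_MISSED_CHURN = 500   # hypothetical revenue lost per missed churner

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.85, 0.15], random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

costs, misses, false_alarms = {}, {}, {}
for threshold in (0.2, 0.5, 0.8):
    y_pred = (proba >= threshold).astype(int)
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    misses[threshold] = int(fn)
    false_alarms[threshold] = int(fp)
    costs[threshold] = int(fp) * COST_FALSE_ALARM + int(fn) * COST_MISSED_CHURN
    print(f"threshold={threshold:.1f}  false alarms={fp:3d}  "
          f"missed churners={fn:3d}  total cost=${costs[threshold]}")

best_threshold = min(costs, key=costs.get)
print(f"Lowest-cost threshold under these assumed costs: {best_threshold}")
```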

## 🤔 Reflection & Application

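**Question 1:** Your model achieves 95% accuracy but only 10% recall on churners. Is it good?
- No. It predicts that almost everyone stays. If only about 5% of customers churn, predicting "stays" for every customer already reaches about 95% accuracy.
- Always check recall on the churn class separately, and judge the model on a metric tied to the business goal.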
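
The accuracy trap can be reproduced in a few lines. The first part below builds a toy label vector with exactly 5% churners and scores the "everyone stays" baseline. The second part is a sketch of one common imbalance remedy, `class_weight='balanced'`, on synthetic data. It is one option among several (resampling is another), and it is shown as an illustration, not as the course's prescribed approach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Part 1: "everyone stays" baseline on toy labels with exactly 5% churners
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant to this baseline
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)
baseline_accuracy = accuracy_score(y_toy, baseline_pred)
baseline_recall = recall_score(y_toy, baseline_pred)
print(f"Baseline accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
# Baseline accuracy=0.95, recall=0.00

# Part 2: reweighting classes so errors on the rare class count more
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=1, weights=[0.95, 0.05], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3)

recalls = {}
for label, class_weight in [('default', None), ('balanced', 'balanced')]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    model.fit(X_train, y_train)
    recalls[label] = recall_score(y_test, model.predict(X_test))
    print(f"class_weight={label}: recall on churners = {recalls[label]:.2f}")
```

Reweighting usually raises recall at the cost of precision, so it should be evaluated against the same business metric as any threshold change.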

**Question 2:** How do you explain model predictions to sales?
- Feature importance paired with a segment comparison, for example: "Customers who adopted fewer than 3 features churn 5x more often" (illustrative figure)
- SHAP values: Individual prediction reasons
- Decision trees: Visualize decision path
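
For the decision-tree option, scikit-learn's `export_text` prints the learned rules as indented text that can be shared without plotting. This sketch uses synthetic, standardized features, so the split values are not real usage counts, and the `tenure_days` column name is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names are for readability only; tenure_days is a hypothetical feature
feature_names = ['total_usage', 'num_features', 'tenure_days']
X, y = make_classification(n_samples=400, n_features=3, n_informative=2,
                           n_redundant=0, random_state=7)

# A shallow tree keeps the rule list short enough to read aloud
tree = DecisionTreeClassifier(max_depth=2, random_state=7).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is one split, and each `class:` line is the prediction for customers who reach that leaf.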

**Question 3:** When should you retrain the model?
- Monthly at minimum (customer behavior changes)
- Immediately if model performance drops
- When you add new data sources or business rules
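
The "performance drops" trigger can be sketched as a simple check that compares ROC-AUC on a recent labeled batch with the score recorded at training time. The function name `needs_retraining`, the 0.05 tolerance, and the toy batch are all hypothetical.

```python
from sklearn.metrics import roc_auc_score


def needs_retraining(y_true, y_score, baseline_auc, max_drop=0.05):
    """Return (flag, recent_auc); flag is True when AUC fell by more than max_drop."""
    recent_auc = roc_auc_score(y_true, y_score)
    return (baseline_auc - recent_auc) > max_drop, recent_auc


# Toy recent batch: actual outcomes and the scores the model gave earlier
recent_labels = [1, 1, 0, 0]
recent_scores = [0.9, 0.4, 0.6, 0.1]

flag, recent_auc = needs_retraining(recent_labels, recent_scores, baseline_auc=0.85)
print(f"Recent ROC-AUC: {recent_auc:.2f}, retrain: {flag}")
# Recent ROC-AUC: 0.75, retrain: True
```

A real check would need enough recent labeled outcomes for the AUC estimate to be stable; four rows are used here only to keep the arithmetic visible.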

## 📝 Practice Assignment

**Problem:** Build a churn intervention strategy:
1. Train classification model on historical data
2. Generate churn risk scores for all current customers
3. Define segments: high/medium/low risk
4. For each segment, estimate: % likely to churn, cost to intervene, expected retention
5. Recommend: which segment(s) should sales contact?
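
A possible starting point for steps 3 to 5 is sketched below. The risk scores are random placeholders rather than model output, and the segment cutoffs, intervention cost, save rate, and customer value are all hypothetical numbers to be replaced with your own estimates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
customers = pd.DataFrame({
    'customer_id': range(1, 201),
    'churn_risk': rng.beta(2, 8, size=200),  # placeholder scores, not model output
})

# Step 3: segments from assumed risk cutoffs
customers['segment'] = pd.cut(
    customers['churn_risk'], bins=[0, 0.15, 0.35, 1.0],
    labels=['low', 'medium', 'high'], include_lowest=True)

INTERVENTION_COST = 40   # hypothetical cost per contacted customer
SAVE_RATE = 0.30         # hypothetical share of would-be churners an outreach retains
CUSTOMER_VALUE = 600     # hypothetical value of one retained customer

# Step 4: per-segment estimates
summary = customers.groupby('segment', observed=True).agg(
    customers=('customer_id', 'size'),
    avg_risk=('churn_risk', 'mean'),
)
summary['expected_churners'] = summary['customers'] * summary['avg_risk']
summary['expected_saved_value'] = (
    summary['expected_churners'] * SAVE_RATE * CUSTOMER_VALUE)
summary['intervention_cost'] = summary['customers'] * INTERVENTION_COST
summary['net_value'] = summary['expected_saved_value'] - summary['intervention_cost']

# Step 5: segments with positive net value are candidates for outreach
print(summary.round(2))
```

Under these assumed numbers, outreach breaks even at an average risk of about 0.22 (40 / (0.30 × 600)), which is why the low-risk segment shows a negative net value.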

## 🔗 Next Steps

In Week 7, we'll predict continuous values (CLV, revenue) with regression models.