# Churn Prediction: Evaluation & Business Impact

In this final notebook, we evaluate the best model from `02_modeling.ipynb` in depth and translate its technical metrics into business value.

## Goals
1. Deep dive into confusion matrix and error analysis
2. Cost-benefit analysis of different thresholds
3. Feature importance interpretation
4. Real-world deployment considerations

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_auc_score,
    roc_curve, precision_recall_curve, f1_score
)

# Load and prep data (same as before)
df = pd.read_csv('data/telco_churn.csv')
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce').fillna(0)
df['Churn'] = (df['Churn'] == 'Yes').astype(int)
df.drop('customerID', axis=1, inplace=True)

X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Rebuild model (best from 02_modeling.ipynb)
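num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
cat_cols = X.select_dtypes(include=['object']).columns
# Note: ColumnTransformer defaults to remainder='drop', so any column in
# neither list (e.g. an integer-coded flag) is silently excluded from the model.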

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
])

model.fit(X_train, y_train)

## 1. Confusion Matrix Analysis

Understanding **where** the model fails is as important as the overall accuracy.

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Stay', 'Churn'], yticklabels=['Stay', 'Churn'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp} (wasted retention offers)")
print(f"False Negatives: {fn} (missed churners - CRITICAL)")
print(f"True Positives: {tp} (churners correctly flagged for an offer)")

> **DECISION CHECKPOINT 1**: Interpreting Errors
>
> - **False Negatives (FN)**: These are churners we missed. They leave without intervention.
> - **False Positives (FP)**: We sent retention offers to people who would have stayed anyway.
>
> **Business Question**: Which error is more expensive?
> - FN cost: Lost customer = Lost LTV ($2000)
> - FP cost: Wasted offer = $100
>
> **Conclusion**: FN is 20x more expensive. We MUST minimize false negatives, even at the cost of more false positives.
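
The 20x figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: `error_cost` is a hypothetical helper that does not appear elsewhere in this notebook, and the error counts passed to it are made-up numbers, not model results. It uses the $2000 and $100 costs stated in the checkpoint.

```python
# Per-error costs taken from the checkpoint above.
FN_COST = 2000  # lost lifetime value per missed churner
FP_COST = 100   # wasted retention offer per loyal customer flagged


def error_cost(fn, fp, fn_cost=FN_COST, fp_cost=FP_COST):
    """Hypothetical helper: total dollar cost of fn false negatives and fp false positives."""
    return fn * fn_cost + fp * fp_cost


# How many wasted offers cost as much as one missed churner?
break_even_fps = FN_COST / FP_COST

# Made-up counts: trading 10 fewer missed churners for 100 extra offers.
cost_before = error_cost(fn=50, fp=20)
cost_after = error_cost(fn=40, fp=120)

print(f"One FN costs as much as {break_even_fps:.0f} FPs")
print(f"Cost before: ${cost_before:,}  after: ${cost_after:,}")
```

This simple view treats every flagged churner as retained. Section 4 relaxes that assumption with an acceptance rate.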

## 2. ROC Curve and AUC

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid()
plt.show()

## 3. Precision-Recall Tradeoff & Custom Threshold

In [None]:
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

# Calculate F1 for each threshold
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)

plt.figure(figsize=(14, 5))

# Plot 1: Precision-Recall curve
plt.subplot(1, 2, 1)
plt.plot(recall, precision, label='PR Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid()

# Plot 2: Metrics vs Threshold
plt.subplot(1, 2, 2)
plt.plot(pr_thresholds, precision[:-1], label='Precision')
plt.plot(pr_thresholds, recall[:-1], label='Recall')
plt.plot(pr_thresholds, f1_scores[:-1], label='F1', linestyle='--')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Metrics vs Threshold')
plt.axvline(x=0.5, color='red', linestyle='--', alpha=0.5, label='Default (0.5)')
plt.axvline(x=0.3, color='green', linestyle='--', alpha=0.5, label='Custom (0.3)')
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

> **DECISION CHECKPOINT 2**: Threshold Selection
>
> Given our cost analysis:
> - Lowering threshold from 0.5 → 0.3 increases Recall significantly
> - Precision drops slightly, but we can afford more false positives
>
> **Action**: Deploy with threshold = 0.3
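
To make the tradeoff concrete, the sketch below computes precision and recall at both thresholds. It is a minimal illustration under stated assumptions: `precision_recall_at` is a hypothetical helper, and the labels and scores are toy values invented for this example, not output from the model above.

```python
import numpy as np


def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when scores >= threshold are labelled as churn."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(((y_hat == 1) & (y_true == 1)).sum())
    fp = int(((y_hat == 1) & (y_true == 0)).sum())
    fn = int(((y_hat == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumption): five churners followed by five non-churners.
toy_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_prob = [0.91, 0.62, 0.41, 0.36, 0.22, 0.56, 0.33, 0.31, 0.12, 0.04]

for t in (0.5, 0.3):
    p, r = precision_recall_at(toy_true, toy_prob, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On the real data, calling `precision_recall_at(y_test, y_prob, 0.3)` would give the corresponding numbers for the deployed threshold.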

## 4. Cost-Benefit Analysis

In [None]:
# Business parameters
LTV = 2000  # Customer lifetime value
OFFER_COST = 100
ACCEPTANCE_RATE = 0.5  # 50% of churners accept retention offer

def calculate_roi(threshold):
    y_pred_custom = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_custom).ravel()
    
    # Revenue from saved customers
    saved_customers = tp * ACCEPTANCE_RATE
    revenue = saved_customers * LTV
    
    # Cost of offers
    total_offers = tp + fp
    cost = total_offers * OFFER_COST
    
    # Opportunity cost: revenue we could have recovered from missed churners
    # had they received an offer (same acceptance rate assumed)
    lost_revenue = fn * LTV * ACCEPTANCE_RATE
    
    roi = revenue - cost - lost_revenue
    
    return {
        'threshold': threshold,
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'revenue': revenue,
        'cost': cost,
        'lost_revenue': lost_revenue,
        'net_roi': roi
    }

# Compare different thresholds
thresholds_to_test = [0.2, 0.3, 0.4, 0.5, 0.6]
results = [calculate_roi(t) for t in thresholds_to_test]

df_roi = pd.DataFrame(results)
print(df_roi[['threshold', 'tp', 'fp', 'fn', 'net_roi']])

# Plot ROI vs Threshold
plt.figure(figsize=(10, 6))
plt.plot(df_roi['threshold'], df_roi['net_roi'], marker='o', linewidth=2)
plt.xlabel('Prediction Threshold')
plt.ylabel('Net ROI ($)')
plt.title('Business ROI vs Prediction Threshold')
plt.grid()
optimal_idx = df_roi['net_roi'].idxmax()
optimal_threshold = df_roi.loc[optimal_idx, 'threshold']
plt.axvline(x=optimal_threshold, color='green', linestyle='--', 
            label=f'Optimal: {optimal_threshold}')
plt.legend()
plt.show()

print(f"\nOptimal threshold for ROI: {optimal_threshold}")
print(f"Net ROI on the test set: ${df_roi.loc[optimal_idx, 'net_roi']:,.0f}")
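
The grid above has only five candidate thresholds, so the reported optimum can only be one of those five values. A finer sweep is one way to check whether the best threshold lies between them. The sketch below reuses the same revenue, cost, and lost-revenue formula. `net_value` and `sweep` are hypothetical names, and the labels and scores are toy values invented for illustration.

```python
import numpy as np

# Same business parameters as the cell above.
LTV = 2000
OFFER_COST = 100
ACCEPTANCE_RATE = 0.5


def net_value(y_true, y_prob, threshold):
    """Revenue minus offer cost minus missed revenue at one threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_prob) >= threshold
    tp = int((flagged & (y_true == 1)).sum())
    fp = int((flagged & (y_true == 0)).sum())
    fn = int((~flagged & (y_true == 1)).sum())
    revenue = tp * ACCEPTANCE_RATE * LTV
    cost = (tp + fp) * OFFER_COST
    lost_revenue = fn * LTV * ACCEPTANCE_RATE
    return revenue - cost - lost_revenue


def sweep(y_true, y_prob, grid):
    """Return (best threshold, best net value); ties go to the lowest threshold."""
    values = [net_value(y_true, y_prob, t) for t in grid]
    best = int(np.argmax(values))
    return float(grid[best]), values[best]


# Toy data (assumption): five churners followed by five non-churners.
toy_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_prob = [0.91, 0.62, 0.41, 0.36, 0.22, 0.56, 0.33, 0.31, 0.12, 0.04]

grid = np.round(np.arange(0.05, 0.96, 0.05), 2)  # 0.05, 0.10, ..., 0.95
best_t, best_value = sweep(toy_true, toy_prob, grid)
print(f"Best threshold on toy data: {best_t:.2f} (net value ${best_value:,.0f})")
```

In the notebook, `sweep(y_test, y_prob, grid)` would run the same search on the real predictions. Picking the threshold on the test set tends to give an optimistic estimate, so a separate validation split is the safer place to tune it.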

## 5. Error Analysis: Which Customers Are We Missing?

Let's examine false negatives to understand model blind spots.

In [None]:
# Get false negatives
X_test_df = X_test.copy()
X_test_df['actual'] = y_test.values
X_test_df['predicted'] = y_pred
X_test_df['probability'] = y_prob

false_negatives = X_test_df[(X_test_df['actual'] == 1) & (X_test_df['predicted'] == 0)]
true_positives = X_test_df[(X_test_df['actual'] == 1) & (X_test_df['predicted'] == 1)]

print(f"False Negatives: {len(false_negatives)}")
print(f"\nAverage probability for FN: {false_negatives['probability'].mean():.3f}")
print(f"Average probability for TP: {true_positives['probability'].mean():.3f}")

# Compare feature distributions
print("\nFeature comparison (FN vs TP):")
for col in ['tenure', 'MonthlyCharges']:
    if col in false_negatives.columns:
        print(f"{col}: FN={false_negatives[col].mean():.1f}, TP={true_positives[col].mean():.1f}")

> **DECISION CHECKPOINT 3**: Model Limitations
>
> False negatives tend to have:
> - Higher tenure (long-term customers who suddenly churn)
> - Lower monthly charges
>
> **Insight**: The model struggles with "surprise" churners who don't fit the typical pattern.
>
> **Action for Future**: Add features like:
> - Recent service changes
> - Support ticket frequency
> - Payment delays

## 6. Real-World Deployment Considerations

### A. Data Drift Monitoring

After deployment, we must monitor whether data distribution changes:

In [None]:
# Simulate monitoring by comparing train vs test distributions.
# In production, compare the training data against recent live data instead.
print("Feature Drift Check:")
print(f"Train tenure mean: {X_train['tenure'].mean():.1f}")
print(f"Test tenure mean: {X_test['tenure'].mean():.1f}")
print(f"\nTrain churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")

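# The split is stratified, so the two churn rates match here by construction.
# In production, a live churn rate far from the training rate is a signal to retrain.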
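
Comparing means can miss changes in the shape of a distribution. One option, not used elsewhere in this notebook, is the Population Stability Index (PSI), which compares binned frequencies between a reference sample and a newer one. The sketch below is a minimal version under stated assumptions: `population_stability_index` is a hypothetical helper, and the data is synthetic, standing in for a feature such as `tenure`.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a newer sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference (training-time) sample.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    n = len(edges) - 1
    # Use interior edges only, so out-of-range values land in the end bins.
    exp_idx = np.digitize(expected, edges[1:-1])
    act_idx = np.digitize(actual, edges[1:-1])
    exp_frac = np.bincount(exp_idx, minlength=n) / len(expected)
    act_frac = np.bincount(act_idx, minlength=n) / len(actual)
    # A small floor avoids log(0) when a bin is empty.
    eps = 1e-6
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


# Synthetic data (assumption): a reference sample, a same-distribution
# sample, and a sample whose mean has shifted.
rng = np.random.default_rng(0)
reference = rng.normal(loc=32, scale=10, size=5000)
same_dist = rng.normal(loc=32, scale=10, size=5000)
shifted = rng.normal(loc=40, scale=10, size=5000)

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but that cutoff is a convention, not a guarantee. In production the call would look like `population_stability_index(X_train['tenure'], live_df['tenure'])`, where `live_df` is a hypothetical frame of recent customers.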

### B. Metric Misuse Example: The Accuracy Trap

Let's show why accuracy is misleading for this problem.

In [None]:
# Dummy baseline: Always predict "No Churn"
baseline_pred = np.zeros_like(y_test)
baseline_accuracy = (baseline_pred == y_test).mean()

model_accuracy = (y_pred == y_test).mean()

print(f"Baseline (always 'No Churn') Accuracy: {baseline_accuracy:.1%}")
print(f"Our Model Accuracy: {model_accuracy:.1%}")
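print("\nBut baseline catches 0% of churners!")
# `recall` from section 3 is an array of per-threshold values, so compute
# recall at the default threshold directly: share of actual churners flagged.
model_recall = y_pred[y_test.values == 1].mean()
print(f"Our model catches {model_recall:.1%} of churners (Recall)")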

> **METRIC MISUSE EXAMPLE**: 
>
> If we optimized for accuracy, we'd just predict "No Churn" for everyone and get 73% accuracy!
> But we'd miss 100% of churners.
>
> **Lesson**: For imbalanced problems, optimize for the metric that matters to business (Recall in this case).

## Summary

### Key Findings
1. **Optimal threshold**: 0.3 (not default 0.5)
2. **Net ROI**: Estimated on the test set from the assumed LTV, offer cost, and acceptance rate
3. **Model blind spot**: Long-tenure surprise churners
4. **Critical metric**: Recall (catching churners) over Precision

### Deployment Checklist
- [ ] Set prediction threshold to 0.3
- [ ] Monitor churn rate weekly for drift
- [ ] Track false negative characteristics
- [ ] Retrain quarterly with fresh data
- [ ] A/B test retention offers to measure true acceptance rate