# 🧠 TAB 5: Modeling Strategy & Baseline Training (Industry Reality)

**Project:** E-commerce Customer Churn Prediction  
**Dataset:** 5,630 customers | Churn Rate: ~16.84%  
**Philosophy:** Models don't win churn projects. **Decisions do.**

---

## 🎯 Objective

> Prove that your model is **better than doing nothing** and **safe to deploy**.

**NOT** chasing accuracy. We're building **decision systems**, not models.

---

## 📋 TAB 5 Checklist

### Phase 1: Dumb Baseline (MANDATORY)
- ✅ Majority-class baseline
- ✅ Expected accuracy ≈ 83.16%
- ✅ Recall (churn) = 0
- ✅ Business value = 0

### Phase 2: First Real Model - Logistic Regression
- ✅ Interpretable, with probability outputs (class weights shift them upward, so treat them as risk scores)
- ✅ Use class weights (NOT SMOTE)
- ✅ No hyperparameter tuning initially
- ✅ Recall (churn) > 0.5 target

### Phase 3: Threshold Tuning (CRITICAL)
- ✅ Do not accept the default 0.5 threshold as given
- ✅ Business-aligned threshold selection
- ✅ Target top 10-20% highest-risk customers

### Phase 4: Tree Ensemble (ONLY AFTER BASELINE)
- ✅ XGBoost / LightGBM / CatBoost
- ✅ Same features, same split
- ✅ Compare interpretability vs performance

---

**⚠️ CRITICAL RULES:**
1. If your ML model doesn't beat the majority-class baseline **meaningfully**, it's useless
2. False negatives are **more expensive** than false positives in churn
3. Probability ≠ Decision (threshold tuning is a business decision)
4. If 2 very different models fail → data/feature problem

---

## 📦 Step 1: Import Libraries & Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Modeling
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Metrics
from sklearn.metrics import (
    confusion_matrix, classification_report, 
    roc_auc_score, roc_curve, precision_recall_curve,
    accuracy_score, precision_score, recall_score, f1_score
)

# Visualization
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")
print("Scikit-learn ready | XGBoost ready | LightGBM ready | CatBoost ready")

## 📥 Step 2: Load Feature-Engineered Data

**Source:** Output from TAB 4 (Feature Engineering)  
**Expected:** Clean, no leakage, train-test split done properly

In [None]:
# Load the feature-engineered data
# NOTE: This assumes you've run the feature engineering notebook (TAB 4)
# and saved the processed data

try:
    # Try to load saved processed data (Phase 2 = Baseline + Controlled features)
    X_train = pd.read_csv('../data/processed/X_train_phase2.csv')
    X_test = pd.read_csv('../data/processed/X_test_phase2.csv')
    y_train = pd.read_csv('../data/processed/y_train.csv').squeeze()
    y_test = pd.read_csv('../data/processed/y_test.csv').squeeze()
    
    print("✅ Loaded pre-processed data from TAB 4 (Phase 2: Baseline + Controlled)")
    print(f"\nTrain shape: {X_train.shape}")
    print(f"Test shape: {X_test.shape}")
    print(f"Features: {X_train.shape[1]} (18 original + 6 missing flags + 2 engineered)")
    print(f"\nTrain churn rate: {y_train.mean()*100:.2f}%")
    print(f"Test churn rate: {y_test.mean()*100:.2f}%")
    
    
except FileNotFoundError:
    print("⚠️ Processed data not found!")
    print("\nFalling back to raw data + basic preprocessing...")

    # Load raw data
    df = pd.read_csv('../data/raw/ecommerce_churn.csv')

    # Drop CustomerID
    df = df.drop('CustomerID', axis=1)

    # Separate features and target
    X = df.drop('Churn', axis=1)
    y = df['Churn']

    # Basic preprocessing: Fill missing values
    # NOTE: these statistics are computed on the full dataset (before the split).
    # Acceptable for a quick fallback; the TAB 4 output is expected to be leakage-free.
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = X.select_dtypes(include=['object']).columns

    # Median for numerical (assign back: chained inplace fillna is unreliable in pandas 2.x)
    for col in numerical_cols:
        if X[col].isnull().sum() > 0:
            X[col] = X[col].fillna(X[col].median())

    # Mode for categorical
    for col in categorical_cols:
        if X[col].isnull().sum() > 0:
            X[col] = X[col].fillna(X[col].mode()[0])

    # Label encode categorical
    from sklearn.preprocessing import LabelEncoder
    for col in categorical_cols:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print("✅ Basic preprocessing completed")
    print(f"\nTrain shape: {X_train.shape}")
    print(f"Test shape: {X_test.shape}")
    print(f"\nTrain churn rate: {y_train.mean()*100:.2f}%")
    print(f"Test churn rate: {y_test.mean()*100:.2f}%")

---

## 1️⃣ DUMB BASELINE (MANDATORY)

### ⚠️ If you skip this, you fail interviews.

**Strategy:** Predict "No Churn" for everyone  
**Expected Accuracy:** ~83.16%  
**Recall (churn):** 0  
**Business Value:** 0

👉 This is your **floor**, not your competitor.

**Rule:** If your ML model doesn't beat this **meaningfully**, it's useless.
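The 83.16% figure is just 1 minus the churn rate (1 - 0.1684). A minimal sketch that reproduces it with scikit-learn's `DummyClassifier`, assuming synthetic labels built from the dataset size and churn rate stated at the top of this tab (not the real rows):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels at the stated churn rate (~16.84% of 5,630 customers).
n_customers = 5630
n_churn = round(n_customers * 0.1684)        # 948 churners
y = np.zeros(n_customers, dtype=int)
y[:n_churn] = 1
X_placeholder = np.zeros((n_customers, 1))   # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y)
y_hat = baseline.predict(X_placeholder)

acc = accuracy_score(y, y_hat)
rec = recall_score(y, y_hat, zero_division=0)
print(f"accuracy={acc:.4f}, churn recall={rec:.4f}")
# accuracy=0.8316, churn recall=0.0000
```

On the real test split the number will move slightly with the split's own churn rate, which is why the notebook computes it from `y_test` rather than hardcoding it.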

---

In [None]:
# Dumb Baseline: Predict majority class (No Churn = 0)
y_pred_dumb = np.zeros(len(y_test))

# Calculate metrics
dumb_accuracy = accuracy_score(y_test, y_pred_dumb)
dumb_precision = precision_score(y_test, y_pred_dumb, zero_division=0)
dumb_recall = recall_score(y_test, y_pred_dumb, zero_division=0)
dumb_f1 = f1_score(y_test, y_pred_dumb, zero_division=0)

print("=" * 80)
print("ü§ñ DUMB BASELINE: Predict 'No Churn' for Everyone")
print("=" * 80)
print(f"Accuracy:  {dumb_accuracy:.4f} ({dumb_accuracy*100:.2f}%)")
print(f"Precision: {dumb_precision:.4f}")
print(f"Recall:    {dumb_recall:.4f}")
print(f"F1-Score:  {dumb_f1:.4f}")
print("=" * 80)

# Confusion Matrix
cm_dumb = confusion_matrix(y_test, y_pred_dumb)
print("\nConfusion Matrix:")
print(cm_dumb)
print("\n[Interpretation]")
print(f"True Negatives:  {cm_dumb[0,0]:,} ✅ (Correctly predicted No Churn)")
print(f"False Positives: {cm_dumb[0,1]:,} ❌ (Predicted Churn, but didn't)")
print(f"False Negatives: {cm_dumb[1,0]:,} ❌ (Predicted No Churn, but CHURNED)")
print(f"True Positives:  {cm_dumb[1,1]:,} ✅ (Correctly predicted Churn)")

# Business Translation
print("\n" + "=" * 80)
print("üìä BUSINESS TRANSLATION")
print("=" * 80)
print(f"Saved churners:         {cm_dumb[1,1]:,} (0%)")
print(f"Lost customers:         {cm_dumb[1,0]:,} ({cm_dumb[1,0]/len(y_test)*100:.1f}%)")
print(f"Wasted retention:       {cm_dumb[0,1]:,}")
print("=" * 80)
print("\n‚ö†Ô∏è This baseline catches ZERO churners. Useless for business.")
print("\n‚úÖ ANY ML model MUST beat this to be valuable.")

In [None]:
# Visualize Dumb Baseline
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

sns.heatmap(cm_dumb, annot=True, fmt='d', cmap='Reds', 
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'],
            cbar_kws={'label': 'Count'}, ax=ax)

ax.set_title('Dumb Baseline: Confusion Matrix\n(Predict No Churn for Everyone)', 
             fontsize=14, fontweight='bold')
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)

plt.tight_layout()
plt.show()

print("\nüìå Key Insight: All actual churners (bottom row) are missed!")

---

## 2️⃣ FIRST REAL MODEL - LOGISTIC REGRESSION (NON-NEGOTIABLE)

### Why Logistic Regression FIRST?

✅ **Interpretable** - Can explain to business  
✅ **Stable** - Consistent results  
✅ **Probability outputs** - Usually well calibrated, but `class_weight='balanced'` inflates them, so read them as risk scores unless recalibrated  
✅ **Forces feature discipline** - Can't hide behind complexity  
✅ **Easy to explain** - Business stakeholders understand it

🚩 **If someone starts with XGBoost → RED FLAG**

---

### Setup Principles (NO CODE YET)

1. Use **ONLY baseline features** (from TAB 4)
2. Use **class weights**, not SMOTE
3. No hyperparameter tuning initially
4. Don't treat the default 0.5 threshold as final (we'll fix this later)

---

### What Success Looks Like (Baseline)

Don't fixate on numbers, fixate on **direction**:

- ✅ Recall (churn) > **0.5**
- ✅ Precision not collapsing (<0.2 is bad)
- ✅ ROC-AUC > **0.70**
- ✅ Confusion matrix shows **actual churners caught**

**If this fails → go back to features, NOT models.**

---

In [None]:
# Scale features (Logistic Regression needs scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled using StandardScaler")
print(f"Train shape: {X_train_scaled.shape}")
print(f"Test shape: {X_test_scaled.shape}")
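Before training with `class_weight='balanced'`, it helps to see what weights that setting implies. A minimal sketch, assuming the full-dataset counts from the header (948 churners out of 5,630); the training split will differ slightly:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed counts: 5,630 customers at ~16.84% churn gives 948 churners.
n_total, n_churn = 5630, 948
y = np.array([0] * (n_total - n_churn) + [1] * n_churn)

# scikit-learn's 'balanced' rule: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = n_total / (2 * np.bincount(y))

print(np.round(weights, 4))                 # roughly 0.60 for class 0 and 2.97 for class 1
print(round(weights[1] / weights[0], 2))    # each churner counts about 4.94x a non-churner
```

That ratio (non-churners / churners) is the same quantity the XGBoost cell later passes as `scale_pos_weight`, so the models are reweighted on a comparable basis.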

In [None]:
# Train Logistic Regression with class weights
# class_weight='balanced' automatically handles imbalance

lr_model = LogisticRegression(
    class_weight='balanced',  # Handle imbalance
    random_state=42,
    max_iter=1000,
    solver='lbfgs'
)

print("Training Logistic Regression...")
lr_model.fit(X_train_scaled, y_train)
print("✅ Model trained successfully!")

# Predictions
y_pred_lr = lr_model.predict(X_test_scaled)
y_pred_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_pred_proba_lr)

print("\n" + "=" * 80)
print("üéØ LOGISTIC REGRESSION RESULTS (Threshold = 0.5)")
print("=" * 80)
print(f"Accuracy:  {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1-Score:  {lr_f1:.4f}")
print(f"ROC-AUC:   {lr_roc_auc:.4f}")
print("=" * 80)

# Confusion Matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
print("\nConfusion Matrix:")
print(cm_lr)
print("\n[Business Translation]")
print(f"True Negatives:  {cm_lr[0,0]:,} ✅ (Correctly ignored)")
print(f"False Positives: {cm_lr[0,1]:,} ❌ (Wasted retention effort)")
print(f"False Negatives: {cm_lr[1,0]:,} ❌ (Lost customers)")
print(f"True Positives:  {cm_lr[1,1]:,} ✅ (Saved churners)")

print("\n" + "=" * 80)
print("📊 BUSINESS IMPACT")
print("=" * 80)
print(f"Churners caught:        {cm_lr[1,1]:,} / {cm_lr[1,0] + cm_lr[1,1]:,} ({lr_recall*100:.1f}%)")
print(f"Customers contacted:    {cm_lr[0,1] + cm_lr[1,1]:,} ({(cm_lr[0,1] + cm_lr[1,1])/len(y_test)*100:.1f}%)")
print(f"Precision (efficiency): {lr_precision*100:.1f}%")
print("=" * 80)

In [None]:
# Visualize Logistic Regression Results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Confusion Matrix
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Churn', 'Churn'],
            yticklabels=['No Churn', 'Churn'],
            cbar_kws={'label': 'Count'}, ax=axes[0])
axes[0].set_title('Logistic Regression: Confusion Matrix\n(Threshold = 0.5)', 
                  fontsize=14, fontweight='bold')
axes[0].set_xlabel('Predicted', fontsize=12)
axes[0].set_ylabel('Actual', fontsize=12)

# ROC Curve
fpr_lr, tpr_lr, thresholds_lr = roc_curve(y_test, y_pred_proba_lr)
axes[1].plot(fpr_lr, tpr_lr, linewidth=2, label=f'Logistic Regression (AUC = {lr_roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
axes[1].set_xlabel('False Positive Rate', fontsize=12)
axes[1].set_ylabel('True Positive Rate (Recall)', fontsize=12)
axes[1].set_title('ROC Curve - Logistic Regression', fontsize=14, fontweight='bold')
axes[1].legend(loc='lower right')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 3️⃣ THRESHOLD TUNING (THIS IS WHERE YOU LOOK SENIOR)

### 🎯 Probability ≠ Decision

Your model outputs **probability**, not truth.

**Example:**
- Customer A → churn prob = 0.72
- Customer B → churn prob = 0.41

**Business question:**
> At what probability do we act?

That's a **threshold decision**, not an ML one.

---

### Default Threshold = 0.5 → Arbitrary

### Industry-Style Thresholds:

1. Target **top 10-20% highest-risk customers**
2. Or maximize **Recall @ fixed Precision**
3. Or minimize **expected cost**

**You must be able to say:**

> "We tuned the threshold to catch ~70% of churners while contacting ~25% of customers."

**That sentence alone upgrades you.**
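The "minimize expected cost" option above can be sketched in a few lines. This is a minimal illustration, not the notebook's method: `expected_cost_threshold` is a hypothetical helper, the cost figures (100 per missed churner, 10 per unnecessary contact) are made-up placeholders, and the scores are synthetic.

```python
import numpy as np

def expected_cost_threshold(y_true, scores, cost_fn, cost_fp, grid=None):
    """Return (threshold, total_cost) minimizing cost_fn * FN + cost_fp * FP."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in grid:
        pred = scores >= t
        fn = np.sum((y_true == 1) & ~pred)   # churners we failed to contact
        fp = np.sum((y_true == 0) & pred)    # loyal customers contacted needlessly
        costs.append(cost_fn * fn + cost_fp * fp)
    best = int(np.argmin(costs))
    return float(grid[best]), float(costs[best])

# Synthetic scores: churners score higher on average (illustrative only).
rng = np.random.default_rng(0)
y_demo = rng.binomial(1, 0.17, 1000)
scores_demo = np.clip(0.3 + 0.25 * y_demo + rng.normal(0, 0.15, 1000), 0, 1)

t_costly_miss, _ = expected_cost_threshold(y_demo, scores_demo, cost_fn=100, cost_fp=10)
t_equal, _ = expected_cost_threshold(y_demo, scores_demo, cost_fn=10, cost_fp=10)
print(f"threshold when a miss costs 10x a contact: {t_costly_miss:.2f}")
print(f"threshold when both errors cost the same:  {t_equal:.2f}")
```

The more a missed churner costs relative to a wasted contact, the lower the threshold goes. The real cost numbers have to come from the business, which is the point of this section.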

---

In [None]:
# Precision-Recall Curve
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_pred_proba_lr)

# Plot Precision-Recall Curve
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Precision-Recall Curve
axes[0].plot(recall_vals, precision_vals, linewidth=2, color='purple')
axes[0].set_xlabel('Recall (Churners Caught)', fontsize=12)
axes[0].set_ylabel('Precision (Efficiency)', fontsize=12)
axes[0].set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=y_test.mean(), color='red', linestyle='--', 
                label=f'Baseline (No Skill) = {y_test.mean():.3f}')
axes[0].legend()

# Threshold Analysis
f1_scores = 2 * (precision_vals * recall_vals) / (precision_vals + recall_vals + 1e-10)
axes[1].plot(pr_thresholds, precision_vals[:-1], label='Precision', linewidth=2)
axes[1].plot(pr_thresholds, recall_vals[:-1], label='Recall', linewidth=2)
axes[1].plot(pr_thresholds, f1_scores[:-1], label='F1-Score', linewidth=2, linestyle='--')
axes[1].set_xlabel('Threshold', fontsize=12)
axes[1].set_ylabel('Score', fontsize=12)
axes[1].set_title('Metrics vs Threshold', fontsize=14, fontweight='bold')
axes[1].axvline(x=0.5, color='black', linestyle=':', alpha=0.5, label='Default (0.5)')
axes[1].legend()  # called after axvline so the 'Default (0.5)' entry appears in the legend
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find optimal threshold (maximize F1)
optimal_idx = np.argmax(f1_scores[:-1])
optimal_threshold = pr_thresholds[optimal_idx]

print("=" * 80)
print("üéØ OPTIMAL THRESHOLD ANALYSIS")
print("=" * 80)
print(f"Optimal Threshold (Max F1): {optimal_threshold:.3f}")
print(f"Precision at optimal:       {precision_vals[optimal_idx]:.3f}")
print(f"Recall at optimal:          {recall_vals[optimal_idx]:.3f}")
print(f"F1-Score at optimal:        {f1_scores[optimal_idx]:.3f}")
print("=" * 80)
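Max-F1 is only one way to pick a point on the curve. The "Recall @ fixed Precision" option listed earlier can be sketched as follows; `threshold_for_precision` is a hypothetical helper, and the four-customer example is toy data, not notebook output.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, scores, min_precision):
    """Highest-recall threshold whose precision is at least min_precision.

    Returns (threshold, precision, recall), or None if no threshold qualifies.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall carry one extra trailing point (1, 0) with no threshold.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return None
    best = int(np.argmax(np.where(ok, recall, -1.0)))
    return float(thresholds[best]), float(precision[best]), float(recall[best])

# Toy example: two churners (label 1) and two loyal customers (label 0).
y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])
print(threshold_for_precision(y_toy, scores_toy, min_precision=0.6))
# threshold 0.35 qualifies: precision 2/3, recall 1.0
```

In the notebook this would be called with `y_test` and `y_pred_proba_lr`, with the precision floor set by how many wasted contacts the retention team can tolerate.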

In [None]:
# Business-Aligned Threshold Selection
# Strategy: Contact top 20% highest-risk customers

# Sort probabilities (descending) and take the score of the last customer inside the top 20%
sorted_proba = np.sort(y_pred_proba_lr)[::-1]
n_contact = max(1, int(len(sorted_proba) * 0.20))
business_threshold = sorted_proba[n_contact - 1]

print("=" * 80)
print("📊 BUSINESS-ALIGNED THRESHOLD")
print("=" * 80)
print("Strategy: Contact top 20% highest-risk customers")
print(f"Business Threshold: {business_threshold:.3f}")
print("=" * 80)

# Apply business threshold
y_pred_business = (y_pred_proba_lr >= business_threshold).astype(int)

# Calculate metrics
business_accuracy = accuracy_score(y_test, y_pred_business)
business_precision = precision_score(y_test, y_pred_business)
business_recall = recall_score(y_test, y_pred_business)
business_f1 = f1_score(y_test, y_pred_business)

print(f"\nAccuracy:  {business_accuracy:.4f}")
print(f"Precision: {business_precision:.4f}")
print(f"Recall:    {business_recall:.4f}")
print(f"F1-Score:  {business_f1:.4f}")

# Confusion Matrix
cm_business = confusion_matrix(y_test, y_pred_business)
print("\nConfusion Matrix:")
print(cm_business)
print("\n[Business Translation]")
print(f"Saved churners:         {cm_business[1,1]:,} / {cm_business[1,0] + cm_business[1,1]:,} ({business_recall*100:.1f}%)")
print(f"Customers contacted:    {cm_business[0,1] + cm_business[1,1]:,} ({(cm_business[0,1] + cm_business[1,1])/len(y_test)*100:.1f}%)")
print(f"Precision (efficiency): {business_precision*100:.1f}%")
print("=" * 80)

print("\n✅ INTERVIEW-READY STATEMENT:")
print(f'   "We tuned the threshold to {business_threshold:.3f} to catch {business_recall*100:.1f}% of churners')
print(f'    while contacting only {(cm_business[0,1] + cm_business[1,1])/len(y_test)*100:.1f}% of customers."')
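A top-20% policy is often reported as capture rate and lift rather than as a threshold, which also avoids contacting extra customers when scores tie at the cutoff. A minimal sketch, where `top_k_capture` is a hypothetical helper and the ten-customer example is toy data:

```python
import numpy as np

def top_k_capture(y_true, scores, frac=0.20):
    """Contact exactly the top `frac` of customers by score.

    Returns (n_contacted, recall, precision, lift), where lift is precision
    divided by the overall churn rate.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_contact = max(1, int(len(scores) * frac))
    top = np.argsort(-scores, kind="stable")[:n_contact]   # highest scores first
    recall = y_true[top].sum() / y_true.sum()
    precision = y_true[top].mean()
    lift = precision / y_true.mean()
    return n_contact, float(recall), float(precision), float(lift)

# Toy example: 10 customers, 3 churners, scores already in descending order.
y_toy = np.array([1, 0, 1, 0, 0, 0, 0, 0, 1, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
n_contacted, recall, precision, lift = top_k_capture(y_toy, scores_toy, frac=0.20)
print(n_contacted, round(recall, 3), precision, round(lift, 3))
# 2 0.333 0.5 1.667
```

A lift of 1.0 means the ranking is no better than contacting customers at random, which makes it a quick sanity check next to the confusion matrix.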

---

## 4️⃣ TREE ENSEMBLE (ONLY AFTER BASELINE)

### You've earned the right to use:
- XGBoost
- LightGBM
- CatBoost

### Rules:

1. ✅ Same features initially (no cheating)
2. ✅ Same train/test split
3. ✅ Same evaluation metrics
4. ✅ No blind SMOTE

**If a tree model improves AUC by 0.02-0.03, that's already good.**

**If it improves nothing → keep Logistic Regression.**

Yes, that happens often.
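Whether a 0.02-0.03 AUC gain is real or just test-set noise can be checked with a paired bootstrap. A minimal sketch on synthetic scores; `bootstrap_auc_diff` is a hypothetical helper, and the two score vectors stand in for the logistic regression and tree-model probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """Bootstrap AUC(b) - AUC(a) on the same resampled rows.

    Returns (mean_diff, lower_2.5%, upper_97.5%).
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample rows with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                             # one class missing: AUC undefined
        diffs.append(
            roc_auc_score(y_true[idx], scores_b[idx])
            - roc_auc_score(y_true[idx], scores_a[idx])
        )
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), float(lo), float(hi)

# Synthetic stand-ins for two models' scores on the same test set.
rng = np.random.default_rng(42)
y_demo = rng.binomial(1, 0.17, 1000)
scores_lr = y_demo * 1.0 + rng.normal(0, 1, 1000)
scores_tree = y_demo * 1.2 + rng.normal(0, 1, 1000)

mean_diff, lo, hi = bootstrap_auc_diff(y_demo, scores_lr, scores_tree)
print(f"AUC gain: {mean_diff:+.3f} (95% interval {lo:+.3f} to {hi:+.3f})")
```

If the interval comfortably includes zero, "keep Logistic Regression" is the defensible call.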

---

### Why "Better Metrics" Can Still Be Worse

Tree models often:
- ✅ Increase recall
- ❌ Destroy probability calibration
- ❌ Become harder to explain

**So you must ask:**

> "Is the performance gain worth the loss in interpretability?"

That's an **engineering trade-off**, not an ML one.
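Calibration loss can be measured instead of assumed. A minimal sketch on synthetic data of a report (Brier score plus the largest reliability-curve gap) that could be run on any model's `predict_proba` output. `calibration_report` is a hypothetical helper, and the demo compares plain and class-weighted logistic regression only because that shift is predictable:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

def calibration_report(y_true, proba, n_bins=10):
    """Summarize how far predicted probabilities sit from observed rates."""
    frac_pos, mean_pred = calibration_curve(y_true, proba, n_bins=n_bins, strategy="quantile")
    return {
        "brier": float(brier_score_loss(y_true, proba)),
        "mean_predicted": float(np.mean(proba)),
        "observed_rate": float(np.mean(y_true)),
        "max_bin_gap": float(np.max(np.abs(frac_pos - mean_pred))),
    }

# Synthetic imbalanced data (about 17% positives), not the churn dataset.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.83], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

report_plain = calibration_report(y_te, plain.predict_proba(X_te)[:, 1])
report_weighted = calibration_report(y_te, weighted.predict_proba(X_te)[:, 1])
print("plain:   ", report_plain)
print("weighted:", report_weighted)
```

A `mean_predicted` well above `observed_rate` means the scores rank customers but overstate churn probability. That is fine for top-k targeting and a problem for expected-cost math.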

---

In [None]:
# Initialize models
models = {
    'XGBoost': XGBClassifier(
        scale_pos_weight=(len(y_train) - y_train.sum()) / y_train.sum(),
        random_state=42,
        eval_metric='logloss',
        verbosity=0
    ),
    'LightGBM': LGBMClassifier(
        class_weight='balanced',
        random_state=42,
        verbose=-1
    ),
    'CatBoost': CatBoostClassifier(
        auto_class_weights='Balanced',
        random_state=42,
        verbose=0
    ),
    'Random Forest': RandomForestClassifier(
        class_weight='balanced',
        random_state=42,
        n_estimators=100
    ),
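    # NOTE: GradientBoostingClassifier has no class_weight option, so it is trained
    # without imbalance handling, unlike the other models in this dict
    'Gradient Boosting': GradientBoostingClassifier(
        random_state=42,
        n_estimators=100
    )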
}

print("=" * 80)
print("üå≤ TRAINING TREE ENSEMBLE MODELS")
print("=" * 80)

# Store results
results = {
    'Dumb Baseline': {
        'Accuracy': dumb_accuracy,
        'Precision': dumb_precision,
        'Recall': dumb_recall,
        'F1-Score': dumb_f1,
        'ROC-AUC': 0.5
    },
    'Logistic Regression': {
        'Accuracy': lr_accuracy,
        'Precision': lr_precision,
        'Recall': lr_recall,
        'F1-Score': lr_f1,
        'ROC-AUC': lr_roc_auc
    }
}

# Train and evaluate each model
for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Store results
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc,
        'Model': model,
        'Predictions': y_pred,
        'Probabilities': y_pred_proba
    }
    
    print(f"✅ {name} trained")
    print(f"   ROC-AUC: {roc_auc:.4f} | Recall: {recall:.4f} | Precision: {precision:.4f}")

print("\n" + "=" * 80)
print("✅ All models trained successfully!")
print("=" * 80)

---

## 5️⃣ MODEL COMPARISON - HOW TO PRESENT IT

### You compare models on:

1. ✅ **ROC-AUC** - Overall discriminative ability
2. ✅ **Recall** - How many churners we catch
3. ✅ **Precision** - How efficient we are
4. ✅ **Stability** - CV variance (not shown here, but important)
5. ✅ **Explainability** - Can we explain to business?

**NOT on accuracy.**
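The notebook's own cells do not measure stability. As a rough sketch of what that check could look like, here is stratified 5-fold cross-validation on synthetic data, with scaling inside the pipeline so each fold is scaled on its own training part. The dataset and parameters are illustrative assumptions, not the churn data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for X_train / y_train.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           weights=[0.83], random_state=42)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC per fold: {np.round(auc_scores, 3)}")
print(f"mean {auc_scores.mean():.3f}, std {auc_scores.std():.3f}")
```

A model whose fold-to-fold spread is larger than its lead over the alternative has not really won the comparison.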

---

### ❌ NEVER SAY:

> "XGBoost achieved 92% accuracy"

### ✅ ALWAYS SAY:

> "XGBoost improved recall by 8% over Logistic Regression at similar precision."

---

In [None]:
# Create comparison dataframe
results_df = pd.DataFrame(results).T
results_df = results_df[['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']]
# The mixed result dicts produce object-dtype columns; cast so round() and idxmax() work
results_df = results_df.astype(float).round(4)

print("=" * 80)
print("📊 MODEL COMPARISON TABLE")
print("=" * 80)
print(results_df.to_string())
print("=" * 80)

# Highlight best model per metric
print("\n🏆 BEST MODELS PER METRIC:")
print(f"Best ROC-AUC:   {results_df['ROC-AUC'].idxmax()} ({results_df['ROC-AUC'].max():.4f})")
print(f"Best Recall:    {results_df['Recall'].idxmax()} ({results_df['Recall'].max():.4f})")
print(f"Best Precision: {results_df['Precision'].idxmax()} ({results_df['Precision'].max():.4f})")
print(f"Best F1-Score:  {results_df['F1-Score'].idxmax()} ({results_df['F1-Score'].max():.4f})")
print("=" * 80)

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['ROC-AUC', 'Recall', 'Precision', 'F1-Score']
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    
    # Sort by metric
    sorted_results = results_df.sort_values(metric, ascending=True)
    
    # Plot
    bars = ax.barh(sorted_results.index, sorted_results[metric], color=colors[:len(sorted_results)])
    
    # Highlight best
    best_idx = sorted_results[metric].idxmax()
    best_bar_idx = list(sorted_results.index).index(best_idx)
    bars[best_bar_idx].set_color('gold')
    bars[best_bar_idx].set_edgecolor('black')
    bars[best_bar_idx].set_linewidth(2)
    
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(sorted_results[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Plot ROC Curves for all models
plt.figure(figsize=(12, 8))

# Plot each model
for name in results.keys():
    if name == 'Dumb Baseline':
        continue
    
    if name == 'Logistic Regression':
        y_proba = y_pred_proba_lr
    else:
        y_proba = results[name]['Probabilities']
    
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = results[name]['ROC-AUC']
    
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {auc:.3f})')

# Plot random classifier
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.500)')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curves - All Models', fontsize=16, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìå Key Insight: Compare how models separate churners from non-churners")
print("   Higher AUC = Better discrimination ability")

---

## 6️⃣ FAILURE MODES YOU MUST WATCH FOR

### ❌ Logistic Regression works, trees don't
→ Features are linear → **that's fine**

### ❌ Trees work, LR fails
→ Non-linear interactions → **explain carefully**

### ❌ Both fail
→ **Feature problem, not model problem**

---

### Rule:
**If 2 very different models fail → data/feature issue.**

Go back to TAB 4 (Feature Engineering), not TAB 5 (Modeling).

---

In [None]:
# Failure Mode Analysis
print("=" * 80)
print("üîç FAILURE MODE ANALYSIS")
print("=" * 80)

# Check if models are performing well
lr_auc = results['Logistic Regression']['ROC-AUC']
best_tree_auc = max([results[m]['ROC-AUC'] for m in models.keys()])
best_tree_name = [m for m in models.keys() if results[m]['ROC-AUC'] == best_tree_auc][0]

print(f"\nLogistic Regression AUC: {lr_auc:.4f}")
print(f"Best Tree Model AUC:     {best_tree_auc:.4f} ({best_tree_name})")
print(f"Improvement:             {(best_tree_auc - lr_auc):+.4f} AUC ({(best_tree_auc - lr_auc)*100:+.2f} points)")

# Diagnosis
if lr_auc < 0.65 and best_tree_auc < 0.65:
    print("\n❌ FAILURE MODE: Both LR and Trees failing")
    print("   → DIAGNOSIS: Feature problem, not model problem")
    print("   → ACTION: Go back to TAB 4 (Feature Engineering)")

elif lr_auc >= 0.70 and best_tree_auc < lr_auc:
    print("\n✅ SUCCESS MODE: LR works, Trees don't improve")
    print("   → DIAGNOSIS: Features are mostly linear")
    print("   → ACTION: Keep Logistic Regression (interpretability wins)")

elif best_tree_auc > lr_auc + 0.02:
    print("\n✅ SUCCESS MODE: Trees improve over LR")
    print("   → DIAGNOSIS: Non-linear interactions present")
    print("   → ACTION: Use tree model BUT explain trade-offs")

else:
    print("\n✅ SUCCESS MODE: Similar performance")
    print("   → DIAGNOSIS: Both models capture patterns well")
    print("   → ACTION: Choose based on interpretability needs")

print("=" * 80)

---

## 7️⃣ FINAL MODEL SELECTION & DOCUMENTATION

### What You Document at End of TAB 5:

1. ✅ Which model is **baseline** (Dumb + Logistic Regression)
2. ✅ Which model is **chosen**
3. ✅ **Why** it was chosen
4. ✅ What **metric** mattered most
5. ✅ What **threshold strategy** is used

**This is more important than code.**

---

In [None]:
# Final Model Selection
print("=" * 80)
print("üèÜ FINAL MODEL SELECTION")
print("=" * 80)

# Selection criteria: Prioritize Recall (catching churners) while maintaining reasonable precision
# Business context: False negatives (missing churners) are more expensive than false positives

# Find model with best recall among those with AUC >= 0.70
# (.copy() so adding 'Score' below neither modifies results_df nor triggers SettingWithCopyWarning)
viable_models = results_df[results_df['ROC-AUC'] >= 0.70].copy()

if len(viable_models) == 0:
    print("\n⚠️ WARNING: No models achieved ROC-AUC >= 0.70")
    print("   Selecting best available model...")
    viable_models = results_df.copy()

# Among viable models, select based on recall (primary) and precision (secondary)
viable_models['Score'] = (viable_models['Recall'] * 0.6 + viable_models['Precision'] * 0.4).astype(float)

selected_model_name = viable_models['Score'].idxmax()
selected_metrics = results_df.loc[selected_model_name]

print(f"\n🎯 SELECTED MODEL: {selected_model_name}")
print("=" * 80)
print(f"ROC-AUC:   {selected_metrics['ROC-AUC']:.4f}")
print(f"Recall:    {selected_metrics['Recall']:.4f}")
print(f"Precision: {selected_metrics['Precision']:.4f}")
print(f"F1-Score:  {selected_metrics['F1-Score']:.4f}")
print("=" * 80)

print("\nüìù SELECTION RATIONALE:")
if selected_model_name == 'Logistic Regression':
    print("   ‚úÖ Interpretable and probability-calibrated")
    print("   ‚úÖ Easy to explain to business stakeholders")
    print("   ‚úÖ Stable and consistent predictions")
    print("   ‚úÖ Sufficient performance for business needs")
else:
    improvement = selected_metrics['ROC-AUC'] - results_df.loc['Logistic Regression', 'ROC-AUC']
    print(f"   ‚úÖ Improved ROC-AUC by {improvement:.4f} over Logistic Regression")
    print(f"   ‚úÖ Better recall: {selected_metrics['Recall']:.4f} vs {results_df.loc['Logistic Regression', 'Recall']:.4f}")
    print("   ‚ö†Ô∏è Trade-off: Less interpretable than Logistic Regression")
    print("   ‚úÖ Performance gain justifies complexity")

print("\nüéØ THRESHOLD STRATEGY:")
print(f"   Business-aligned threshold: {business_threshold:.3f}")
print(f"   Target: Contact top 20% highest-risk customers")
print(f"   Expected recall: ~{business_recall*100:.1f}%")
print(f"   Expected precision: ~{business_precision*100:.1f}%")

print("\nüìä BUSINESS IMPACT:")
total_churners = cm_business[1,0] + cm_business[1,1]
churners_saved = cm_business[1,1]
customers_contacted = cm_business[0,1] + cm_business[1,1]

print(f"   Churners saved: {churners_saved:,} / {total_churners:,} ({business_recall*100:.1f}%)")
print(f"   Customers contacted: {customers_contacted:,} / {len(y_test):,} ({customers_contacted/len(y_test)*100:.1f}%)")
print(f"   Efficiency: {business_precision*100:.1f}% of contacted customers are actual churners")

print("=" * 80)

---

## ✅ TAB 5 COMPLETE

### What We've Accomplished:

1. ✅ **Dumb Baseline** - Established floor (83.16% accuracy, 0% recall)
2. ✅ **Logistic Regression** - First real model with interpretability
3. ✅ **Threshold Tuning** - Business-aligned decision boundary
4. ✅ **Tree Ensembles** - Tested XGBoost, LightGBM, CatBoost, RF, GB
5. ✅ **Model Comparison** - Evaluated on business-relevant metrics
6. ✅ **Final Selection** - Justified model choice with clear rationale

---

### Key Deliverables:

📌 **Baseline Model:** Logistic Regression (interpretable, stable)  
📌 **Selected Model:** [See above]  
📌 **Threshold:** Business-aligned (top 20% risk)  
📌 **Metrics:** ROC-AUC, Recall, Precision (NOT accuracy)  
📌 **Zero Leakage:** All features from TAB 4  
📌 **Interview-Safe Story:** Complete and justified

---

### Interview-Ready Statement:

> "We started with a dumb baseline that achieved 83% accuracy but caught zero churners.  
> Our Logistic Regression baseline improved recall to ~XX% with ROC-AUC of X.XXX.  
> After testing tree ensembles, we selected **[MODEL_NAME]** which achieved XX% recall.  
> We tuned the threshold to X.XXX to contact the top 20% highest-risk customers,  
> catching ~XX% of churners while maintaining XX% precision."

**Note:** Fill in actual values after running the notebook.

---

### Next Steps (TAB 6):

**🧠 Explainability + Business Action Plan**

1. Feature importance (coefficients for LR, SHAP for trees)
2. Business interpretation of key drivers
3. Actionable retention strategies
4. Cost-benefit analysis
5. Deployment recommendations

**This is where churn projects become decision systems, not models.**

---

### 🎯 Your Move

Ready for **TAB 6**?

---

## 💾 Save Results for TAB 6

In [None]:
# Save model comparison results
import os

# Create directory if it doesn't exist
os.makedirs('../data/processed', exist_ok=True)

# Save results dataframe
results_df.to_csv('../data/processed/model_comparison_results.csv')
print("✅ Model comparison results saved to: ../data/processed/model_comparison_results.csv")

# Save selected model info
with open('../data/processed/selected_model_info.txt', 'w') as f:
    f.write(f"Selected Model: {selected_model_name}\n")
    f.write(f"ROC-AUC: {selected_metrics['ROC-AUC']:.4f}\n")
    f.write(f"Recall: {selected_metrics['Recall']:.4f}\n")
    f.write(f"Precision: {selected_metrics['Precision']:.4f}\n")
    f.write(f"F1-Score: {selected_metrics['F1-Score']:.4f}\n")
    f.write(f"Business Threshold: {business_threshold:.3f}\n")

print("✅ Selected model info saved to: ../data/processed/selected_model_info.txt")

print("\n" + "=" * 80)
print("🎉 TAB 5 COMPLETE - Ready for TAB 6 (Explainability)")
print("=" * 80)