# Lab 3: End-to-End ML Project — Customer Churn Prediction
**Introduction to Data Science & Engineering - Day 3**

| Duration | Difficulty | Framework | Parts |
|---|---|---|---|
| 120 min | Intermediate | scikit-learn, pandas, matplotlib, seaborn | 5 |

In this lab, you'll practice:
- Generating and exploring a churn dataset
- Feature engineering for ML
- Encoding and scaling features
- Training multiple models (Logistic Regression, Random Forest, Gradient Boosting)
- Cross-validation and evaluation
- Confusion matrices and feature importance
- Saving and loading models for deployment

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report, roc_auc_score, roc_curve)
import joblib
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Libraries loaded successfully!")

## Part 1: Generate and Explore the Dataset
We'll create a synthetic subscription service dataset for churn prediction.

In [None]:
np.random.seed(42)
n_customers = 2000

# Base features
tenure = np.random.randint(1, 72, n_customers)  # months; upper bound is exclusive, so values are 1-71
monthly_charges = np.round(np.random.uniform(20, 120, n_customers), 2)
total_charges = np.round(tenure * monthly_charges * np.random.uniform(0.8, 1.1, n_customers), 2)

contract = np.random.choice(['Month-to-month', 'One year', 'Two year'], n_customers, p=[0.5, 0.3, 0.2])
payment = np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], n_customers)
internet = np.random.choice(['DSL', 'Fiber optic', 'No'], n_customers, p=[0.35, 0.45, 0.20])

support_tickets = np.random.poisson(2, n_customers)
num_products = np.random.randint(1, 6, n_customers)

# Churn based on realistic factors
churn_prob = np.zeros(n_customers)
churn_prob += (contract == 'Month-to-month') * 0.25
churn_prob += (tenure < 12) * 0.15
churn_prob += (monthly_charges > 80) * 0.1
churn_prob += (support_tickets > 3) * 0.15
churn_prob += (internet == 'Fiber optic') * 0.05
churn_prob -= (contract == 'Two year') * 0.2
churn_prob -= (num_products > 3) * 0.1
churn_prob = np.clip(churn_prob, 0.05, 0.85)  # the sum above can go negative; keep it a valid probability

churn = np.random.binomial(1, churn_prob)

df = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'tenure_months': tenure,
    'monthly_charges': monthly_charges,
    'total_charges': total_charges,
    'contract_type': contract,
    'payment_method': payment,
    'internet_service': internet,
    'support_tickets': support_tickets,
    'num_products': num_products,
    'age': np.random.randint(18, 75, n_customers),
    'has_partner': np.random.choice([0, 1], n_customers, p=[0.45, 0.55]),
    'has_dependents': np.random.choice([0, 1], n_customers, p=[0.6, 0.4]),
    'churn': churn
})

print(f"Dataset shape: {df.shape}")
print("\nChurn distribution:")
print(df['churn'].value_counts(normalize=True).round(3))
df.head()

### Exercise 1.1: Explore the Data

**Your Task:** Print the dataset shape and churn rate, display descriptive statistics, then create a 2x3 subplot visualizing churn patterns across contract type, tenure, monthly charges, support tickets, internet service, and number of products.
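
As a hint for the bar-chart panels, here is a minimal sketch of how a churn rate per category can be computed. The `toy` frame and its values are hypothetical stand-ins, not the lab dataset; the idea is only that the mean of a 0/1 target within a group is that group's churn rate.

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame({
    "contract_type": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "churn": [1, 0, 0, 0],
})

# The mean of a 0/1 column within each group is the share of churned customers
churn_rate = toy.groupby("contract_type")["churn"].mean()
print(churn_rate)
```

A Series like `churn_rate` can be passed straight to a bar plot, one bar per category.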

In [None]:
# TODO: Print dataset shape and churn rate
# TODO: Display descriptive statistics using df.describe()
pass

In [None]:
def plot_churn_analysis(df):
    """Visualize churn by key features in a 2x3 subplot.
    
    Plots:
    1. Churn rate by contract type (bar)
    2. Tenure distribution by churn (stacked histogram)
    3. Monthly charges by churn (boxplot)
    4. Support tickets by churn (boxplot)
    5. Churn rate by internet service (bar)
    6. Churn rate by number of products (bar)
    """
    # TODO: Create 2x3 figure (18, 10)
    # TODO: Implement each subplot
    pass

plot_churn_analysis(df)

## Part 2: Feature Engineering

### Exercise 2.1: Create New Features

**Your Task:** Engineer derived features from the raw data including tenure-based, charges-based, support ratio, and a composite engagement score.
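
Ratio features need care when the denominator can be zero. The sketch below shows one possible way to guard the division on a hypothetical `toy` frame. Filling the undefined ratios with 0 is an assumption of this sketch; another fill value may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row has zero tenure
toy = pd.DataFrame({"support_tickets": [2, 0, 3], "tenure_months": [4, 10, 0]})

# Swap zero denominators for NaN so the division gives NaN rather than inf,
# then fill those rows with 0 (the fill value is an assumption of this sketch)
safe_tenure = toy["tenure_months"].replace(0, np.nan)
toy["support_per_tenure"] = (toy["support_tickets"] / safe_tenure).fillna(0.0)
print(toy)
```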

In [None]:
def engineer_features(df):
    """Create new features from existing data.
    
    Features to create:
    - tenure_years: tenure_months / 12
    - is_new_customer: 1 if tenure_months <= 6, else 0
    - avg_monthly_charge: total_charges / tenure_months
    - charge_per_product: monthly_charges / num_products
    - support_per_tenure: support_tickets / tenure_months
    - engagement_score: weighted composite (products * 0.3 + tenure_norm * 0.3 + partner * 0.2 + dependents * 0.2)
      where tenure_norm is tenure_months rescaled to a comparable range (e.g. divided by its maximum)
    
    Returns: DataFrame with new features added
    """
    # TODO: Create each derived feature
    # TODO: Handle division by zero in tenure-based features
    pass

df_ml = engineer_features(df.copy())

### Exercise 2.2: Encode Categorical Variables

**Your Task:** One-hot encode the categorical columns and drop the customer_id column.
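
To see what `drop_first=True` does before applying it to the full dataset, here is a small sketch on a hypothetical three-row frame. The first category in sorted order (`DSL` here) becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "internet_service": ["DSL", "Fiber optic", "No"],
})

# drop_first=True removes the first level (sorted order) of each encoded column
encoded = pd.get_dummies(toy, columns=["internet_service"], drop_first=True)

# An identifier carries no predictive signal, so it is removed before modeling
encoded = encoded.drop(columns=["customer_id"])
print(list(encoded.columns))
```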

In [None]:
def encode_features(df_ml):
    """One-hot encode categorical variables.
    
    Encode: contract_type, payment_method, internet_service (drop_first=True)
    Drop: customer_id
    
    Returns: encoded DataFrame
    """
    # TODO: Apply pd.get_dummies with drop_first=True
    # TODO: Drop customer_id column
    pass

df_encoded = encode_features(df_ml)

### Exercise 2.3: Prepare Train/Test Split

**Your Task:** Split the data 80/20 with stratification on the target, then scale numeric features using StandardScaler (fit on train only).
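
The "fit on train only" rule can be illustrated on synthetic numbers. This is a sketch under stated assumptions: the arrays and the names `X_tr`, `X_te`, and `toy_scaler` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 2 numeric features, 20% positive class
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # the same statistics are reused, never refit

print(y_tr.mean(), y_te.mean())
```

Fitting the scaler on the full dataset would leak test-set statistics into training, which is why only `transform` is called on the test rows.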

In [None]:
def prepare_data(df_encoded):
    """Split data and scale numeric features.
    
    - Split 80/20 stratified on churn
    - Scale numeric features with StandardScaler (fit on train only!)
    
    Returns: X_train, X_test, y_train, y_test, scaler
    """
    # TODO: Separate features (X) and target (y = 'churn')
    # TODO: train_test_split with test_size=0.2, stratify=y, random_state=42
    # TODO: Define numeric_features list
    # TODO: Fit scaler on X_train, transform both X_train and X_test
    pass

X_train, X_test, y_train, y_test, scaler = prepare_data(df_encoded)

## Part 3: Train and Evaluate Models

### Exercise 3.1: Train Three Models

**Your Task:** Train Logistic Regression, Random Forest, and Gradient Boosting classifiers. Compute accuracy, precision, recall, F1, and AUC for each.
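
Before wiring the metrics into a loop over models, it may help to check them by hand on a tiny case. The labels and probabilities below are hypothetical, and the 0.5 threshold is an assumption of this sketch (it matches the default used by `predict` for these classifiers).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical true labels and predicted churn probabilities
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.6, 0.2, 0.8, 0.4, 0.9]

# Hard predictions come from thresholding the probabilities
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC uses probabilities, not hard labels

print(accuracy, precision, recall, f1, auc)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 2/3, while AUC is 8/9 because 8 of the 9 positive-negative pairs are ranked correctly.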

In [None]:
def train_and_evaluate(X_train, X_test, y_train, y_test):
    """Train 3 models and compute metrics.
    
    Models: LogisticRegression, RandomForestClassifier, GradientBoostingClassifier
    Metrics: accuracy, precision, recall, f1, auc
    
    Returns: dict of {model_name: {model, y_pred, y_prob, accuracy, precision, recall, f1, auc}}
    """
    # TODO: Define models dict with 3 classifiers
    # TODO: For each model: fit, predict, predict_proba, compute all metrics
    # TODO: Print results for each model
    pass

results = train_and_evaluate(X_train, X_test, y_train, y_test)

### Exercise 3.2: Cross-Validation

**Your Task:** Perform 5-fold stratified cross-validation using F1 scoring on all three models.
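
The call pattern can be tried on synthetic data first. The sketch below uses `make_classification` as a hypothetical stand-in for the churn data and a single model; the lab function would repeat this for each of the three models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced dataset standing in for the churn data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.75, 0.25], random_state=42)

# Each fold keeps roughly the same class ratio as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones the estimator and refits it on every fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```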

In [None]:
def cross_validate_models(results, X_train, y_train):
    """Perform 5-fold stratified cross-validation on all models.
    
    Print mean and std of F1 scores for each model.
    """
    # TODO: Create StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # TODO: Run cross_val_score with scoring='f1' for each model
    # TODO: Print formatted results
    pass

cross_validate_models(results, X_train, y_train)

### Exercise 3.3: Confusion Matrices

**Your Task:** Plot confusion matrices for all three models side by side using seaborn heatmaps.
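
A quick way to confirm how scikit-learn orients the matrix is to compute it on a hand-checkable case. The labels below are hypothetical; the row and column names are added only to make the orientation explicit.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 2 true negatives, 1 false positive, 1 false negative, 2 true positives
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm,
                     index=["Actual Stayed", "Actual Churned"],
                     columns=["Predicted Stayed", "Predicted Churned"])
print(cm_df)
```

The integer array `cm` is what gets passed to the heatmap, which is why `fmt='d'` is the appropriate annotation format.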

In [None]:
def plot_confusion_matrices(results, y_test):
    """Plot confusion matrices for all 3 models side by side.
    
    Use sns.heatmap with annot=True, fmt='d', cmap='Blues'.
    Labels: ['Stayed', 'Churned']
    """
    # TODO: Create 1x3 subplot figure (18, 5)
    # TODO: For each model, compute confusion_matrix and plot heatmap
    pass

plot_confusion_matrices(results, y_test)

### Exercise 3.4: ROC Curves

**Your Task:** Plot ROC curves for all models on a single figure with AUC values in the legend.
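
The numbers behind a ROC curve can be inspected without plotting. This sketch uses four hypothetical scores: `roc_curve` returns the false positive rates and true positive rates that form the x and y coordinates of the curve.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

print(list(zip(fpr, tpr)))
print(f"AUC = {auc:.2f}")
```

The curve starts at (0, 0) and ends at (1, 1); the diagonal between those two points is the reference line for a random classifier.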

In [None]:
def plot_roc_curves(results, y_test):
    """Plot ROC curves for all models on a single figure.
    
    Include diagonal reference line and legend with AUC values.
    """
    # TODO: Create figure (10, 8)
    # TODO: For each model, compute roc_curve and plot
    # TODO: Add diagonal reference line
    # TODO: Add legend, labels, title
    pass

plot_roc_curves(results, y_test)

## Part 4: Feature Importance

### Exercise 4.1: Analyze Feature Importance

**Your Task:** Extract and visualize the top 15 features from Random Forest, then compare importance rankings between Random Forest and Gradient Boosting.
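
The ranking step can be sketched on synthetic data as follows. The feature names and the dataset are hypothetical; the only assumption about the model is that it exposes `feature_importances_`, which both tree ensembles in this lab do after fitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset with named columns
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Pair each column name with its importance and sort from most to least important
importance = (
    pd.DataFrame({"feature": X.columns, "importance": rf.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance.head(3))
```

Impurity-based importances sum to 1 across features, so each value reads as a share of the total.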

In [None]:
def plot_feature_importance(results, X_train):
    """Plot top 15 features from Random Forest by importance.
    
    Also print the top 10 features with their importance values.
    """
    # TODO: Get feature_importances_ from Random Forest model
    # TODO: Create DataFrame, sort by importance
    # TODO: Plot horizontal bar chart
    # TODO: Print top 10
    pass

plot_feature_importance(results, X_train)

In [None]:
def compare_importance(results, X_train):
    """Side-by-side comparison of Random Forest vs Gradient Boosting importance.
    
    Show top 10 features from each model.
    """
    # TODO: Get importances from both models
    # TODO: Create 1x2 subplot (18, 8)
    pass

compare_importance(results, X_train)

## Part 5: Model Deployment Preparation

### Exercise 5.1: Save the Best Model

**Your Task:** Identify the best model by F1 score and save both the model and scaler using joblib.
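
Selecting by a metric and round-tripping through joblib can be tried in isolation. In this sketch the `results_toy` dict, its F1 values, and the file name are hypothetical; it mirrors only the shape of the `results` dict described in Exercise 3.1 and writes to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression

# Hypothetical fitted model and results dict (F1 values are made up)
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
results_toy = {
    "Model A": {"model": model, "f1": 0.61},
    "Model B": {"model": model, "f1": 0.58},
}

# Pick the key whose entry has the highest F1
best_name = max(results_toy, key=lambda name: results_toy[name]["f1"])

# Save and reload inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(results_toy[best_name]["model"], path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions
same = (restored.predict([[0.0], [3.0]]) == model.predict([[0.0], [3.0]])).all()
print(best_name, same)
```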

In [None]:
def save_best_model(results, scaler):
    """Identify and save the best model (by F1 score).
    
    Save both the model and scaler using joblib.
    """
    # TODO: Find model with highest F1 score
    # TODO: Save model as 'churn_model.pkl'
    # TODO: Save scaler as 'churn_scaler.pkl'
    pass

save_best_model(results, scaler)

### Exercise 5.2: Create Prediction Function

**Your Task:** Build a function that loads the saved model and scaler, engineers the same features, encodes categoricals, aligns columns, scales, and predicts churn probability for a new customer.
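
Column alignment is the step most likely to go wrong with a single-row input. The sketch below uses a hypothetical `train_columns` list to stand in for the training columns. Note that it encodes without `drop_first`: a single row has only one level per categorical column, so `drop_first=True` would remove that level and the customer would silently look like the baseline category.

```python
import pandas as pd

# Hypothetical column order seen during training (baseline level already dropped there)
train_columns = ["tenure_months", "contract_type_One year", "contract_type_Two year"]

new_customer = pd.DataFrame([{"tenure_months": 48, "contract_type": "Two year"}])

# Encode every level present in the row; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(new_customer, columns=["contract_type"], dtype=int)

# reindex adds training columns the row lacks (filled with 0), drops any extras,
# and puts everything in the training order
aligned = encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

In a deployed setting the training column list would need to be saved alongside the model, since the training DataFrame is not available at prediction time.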

In [None]:
def predict_churn(customer_data, model_path='churn_model.pkl', scaler_path='churn_scaler.pkl'):
    """Predict churn probability for a new customer.
    
    Steps:
    1. Load model and scaler
    2. Engineer features (same as training)
    3. One-hot encode categoricals
    4. Align columns with training data
    5. Scale numeric features
    6. Predict probability
    
    Returns: dict with 'prediction' and 'churn_probability'
    """
    # TODO: Load model and scaler with joblib
    # TODO: Create DataFrame from customer_data
    # TODO: Engineer same features as in training
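    # TODO: One-hot encode (without drop_first: a single row has one level per column, which drop_first would remove)
    # TODO: Align columns with X_train.columns, e.g. via reindex (fill missing with 0)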
    # TODO: Scale numeric features
    # TODO: Predict probability and return result
    pass

# Test with a high-risk customer
test_customer = {
    'tenure_months': 3, 'monthly_charges': 95.00, 'total_charges': 285.00,
    'contract_type': 'Month-to-month', 'payment_method': 'Electronic check',
    'internet_service': 'Fiber optic', 'support_tickets': 5, 'num_products': 1,
    'age': 32, 'has_partner': 0, 'has_dependents': 0
}
# result = predict_churn(test_customer)

In [None]:
# Test with a loyal customer
loyal_customer = {
    'tenure_months': 48,
    'monthly_charges': 55.00,
    'total_charges': 2640.00,
    'contract_type': 'Two year',
    'payment_method': 'Bank transfer',
    'internet_service': 'DSL',
    'support_tickets': 1,
    'num_products': 4,
    'age': 45,
    'has_partner': 1,
    'has_dependents': 1
}

# result = predict_churn(loyal_customer)
# print(f"Prediction: {result['prediction']}")
# print(f"Churn Probability: {result['churn_probability']:.1%}")

In [None]:
# Clean up saved files
import os
for f in ['churn_model.pkl', 'churn_scaler.pkl']:
    if os.path.exists(f):
        os.remove(f)
print("Cleanup complete!")

## Summary

In this lab, you learned how to:

1. **Generate and explore** a synthetic churn dataset
2. **Engineer features** from raw data (tenure-based, charge-based, engagement)
3. **Encode and scale** features for ML
4. **Train and compare** three models (Logistic Regression, Random Forest, Gradient Boosting)
5. **Evaluate models** with cross-validation, confusion matrices, and ROC curves
6. **Analyze feature importance** to understand model decisions
7. **Save and reload** models with joblib and build a prediction function for new customers

---

*Introduction to Data Science & Engineering | AI Elevate*