# Capstone Project 2: Customer Churn Prediction (Classification)

---

## Learning Objectives

By completing this project you will be able to:

- Frame a business problem as a binary classification task
- Build preprocessing pipelines with `ColumnTransformer` for mixed feature types
- Train and compare Logistic Regression, Random Forest, and Gradient Boosting classifiers
- Evaluate models using confusion matrices, precision, recall, F1, ROC-AUC, and PR-AUC
- Tune decision thresholds using a business-cost framework
- Handle class imbalance with `class_weight`
- Interpret feature importances for stakeholder communication

## Prerequisites

- Python 3.8+
- Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn (1.2 or newer, because the notebook uses `OneHotEncoder(sparse_output=...)`), joblib
- Familiarity with classification metrics and logistic regression

## Table of Contents

1. [Problem Statement & Business Context](#1)
2. [Data Generation](#2)
3. [Exploratory Data Analysis](#3)
4. [Data Splitting](#4)
5. [Baseline Model](#5)
6. [Preprocessing Pipeline](#6)
7. [Model Training](#7)
8. [Evaluation & Comparison](#8)
9. [Threshold Tuning](#9)
10. [Handling Class Imbalance](#10)
11. [Final Test Evaluation](#11)
12. [Feature Importance](#12)
13. [Model Saving](#13)
14. [Conclusions](#14)

<a id="1"></a>
## 1. Problem Statement & Business Context

**Scenario:** A telecom company is experiencing a monthly churn rate of approximately 25%. Acquiring a new customer costs 5-7x more than retaining an existing one. The retention team wants a predictive model to identify at-risk customers so they can offer targeted incentives.

**Business Cost Analysis:**
- **False Negative (missed churner):** The customer leaves. Estimated cost: ~$500 (lost revenue + acquisition cost of replacement).
- **False Positive (retention offer to non-churner):** Unnecessary discount or incentive. Estimated cost: ~$50.
- The cost ratio (FN:FP) is roughly 10:1, which means **recall is more important than precision**, but we still want to avoid flooding loyal customers with unnecessary offers.

**Goal:** Build a classifier that maximizes recall while keeping precision at an acceptable level, and choose an optimal decision threshold that minimizes total business cost.
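
The cost framework above can be sketched in a few lines. This is a minimal illustration, assuming the $500 and $50 estimates, that correct predictions cost nothing, and that predicted probabilities are well calibrated. The function names `total_cost` and `break_even_threshold` are hypothetical helpers, not part of the notebook.

```python
COST_FN = 500  # estimated cost of a missed churner (from the analysis above)
COST_FP = 50   # estimated cost of an unnecessary retention offer


def total_cost(n_false_neg, n_false_pos, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total misclassification cost; correct predictions are assumed to cost nothing."""
    return n_false_neg * cost_fn + n_false_pos * cost_fp


def break_even_threshold(cost_fn=COST_FN, cost_fp=COST_FP):
    """Churn probability above which flagging a customer has the lower expected cost.

    Flag when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


print(total_cost(n_false_neg=10, n_false_pos=40))  # 10*500 + 40*50 = 7000
print(round(break_even_threshold(), 3))            # 50 / 550, about 0.091
```

Under these assumptions the break-even point sits far below 0.5, which is why the default threshold is unlikely to be the cheapest choice. A real model is rarely perfectly calibrated, so an empirical threshold search remains useful.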

<a id="2"></a>
## 2. Data Generation

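We generate a synthetic telecom churn dataset with 1,000 customers and 10 features (a mix of numeric and categorical). Churn labels are sampled from a logistic model intended to give a churn rate near 25%; the cell prints the realized rate.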

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

np.random.seed(42)
n = 1000

# --- Feature generation ---
tenure_months = np.random.exponential(24, n).clip(1, 72).astype(int)
monthly_charges = np.random.normal(65, 20, n).clip(20, 120).round(2)
total_charges = (tenure_months * monthly_charges * np.random.uniform(0.85, 1.05, n)).round(2)
contract_type = np.random.choice(["Month-to-month", "One year", "Two year"], n, p=[0.50, 0.30, 0.20])
internet_service = np.random.choice(["DSL", "Fiber optic", "No"], n, p=[0.35, 0.45, 0.20])
num_support_tickets = np.random.poisson(1.5, n)
has_online_security = np.random.choice([0, 1], n, p=[0.55, 0.45])
is_senior = np.random.choice([0, 1], n, p=[0.85, 0.15])
num_dependents = np.random.choice([0, 1, 2, 3, 4], n, p=[0.40, 0.25, 0.20, 0.10, 0.05])
payment_method = np.random.choice(
    ["Electronic check", "Mailed check", "Bank transfer", "Credit card"],
    n, p=[0.35, 0.20, 0.25, 0.20]
)

# --- Churn label (logistic formula for ~25% churn) ---
churn_logit = (
    -2.0
    - 0.04 * tenure_months
    + 0.02 * monthly_charges
    + 1.2 * (contract_type == "Month-to-month").astype(float)
    + 0.5 * (internet_service == "Fiber optic").astype(float)
    + 0.3 * num_support_tickets
    - 0.4 * has_online_security
    + 0.3 * is_senior
    - 0.2 * num_dependents
    + 0.4 * (payment_method == "Electronic check").astype(float)
    + np.random.normal(0, 0.5, n)  # noise
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
churned = (np.random.uniform(0, 1, n) < churn_prob).astype(int)

df = pd.DataFrame({
    "tenure_months": tenure_months,
    "monthly_charges": monthly_charges,
    "total_charges": total_charges,
    "contract_type": contract_type,
    "internet_service": internet_service,
    "num_support_tickets": num_support_tickets,
    "has_online_security": has_online_security,
    "is_senior": is_senior,
    "num_dependents": num_dependents,
    "payment_method": payment_method,
    "churned": churned,
})

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.1%}")
df.head(10)

In [None]:
df.describe().round(2)

In [None]:
df.info()

<a id="3"></a>
## 3. Exploratory Data Analysis

In [None]:
# Class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Sort by class label (0, 1) so the hard-coded tick and pie labels always match the data
class_counts = df["churned"].value_counts().sort_index()

class_counts.plot(kind="bar", ax=axes[0], color=["steelblue", "salmon"], edgecolor="black")
axes[0].set_title("Churn Class Distribution")
axes[0].set_xticklabels(["Stayed (0)", "Churned (1)"], rotation=0)
axes[0].set_ylabel("Count")

class_counts.plot(kind="pie", ax=axes[1], autopct="%1.1f%%",
                  colors=["steelblue", "salmon"], labels=["Stayed", "Churned"])
axes[1].set_ylabel("")
axes[1].set_title("Churn Proportion")

plt.tight_layout()
plt.show()

In [None]:
# Numeric feature distributions by churn status
numeric_features = ["tenure_months", "monthly_charges", "total_charges",
                    "num_support_tickets", "num_dependents"]

fig, axes = plt.subplots(1, 5, figsize=(22, 4))
for ax, col in zip(axes, numeric_features):
    for label, color in [(0, "steelblue"), (1, "salmon")]:
        subset = df[df["churned"] == label][col]
        ax.hist(subset, bins=25, alpha=0.6, color=color, label=f"Churn={label}", edgecolor="black")
    ax.set_title(col)
    ax.legend(fontsize=8)

plt.suptitle("Numeric Feature Distributions by Churn Status", fontsize=14, y=1.03)
plt.tight_layout()
plt.show()

In [None]:
# Categorical features vs churn
cat_features = ["contract_type", "internet_service", "payment_method"]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, col in zip(axes, cat_features):
    churn_rates = df.groupby(col)["churned"].mean().sort_values(ascending=False)
    churn_rates.plot(kind="bar", ax=ax, color="coral", edgecolor="black")
    ax.set_title(f"Churn Rate by {col}")
    ax.set_ylabel("Churn Rate")
    ax.set_ylim(0, 1)
    ax.tick_params(axis="x", rotation=30)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap (numeric features only)
numeric_df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(8, 6))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0, square=True)
plt.title("Correlation Matrix (Numeric Features)")
plt.tight_layout()
plt.show()

<a id="4"></a>
## 4. Data Splitting

We use stratified splitting to preserve the churn class ratio in both sets.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples (churn rate: {y_train.mean():.1%})")
print(f"Test set:     {X_test.shape[0]} samples (churn rate: {y_test.mean():.1%})")

<a id="5"></a>
## 5. Baseline Model

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report)

dummy = DummyClassifier(strategy="most_frequent", random_state=42)
dummy.fit(X_train, y_train)
y_dummy_pred = dummy.predict(X_test)

print("Baseline (Most Frequent) Performance:")
print(f"  Accuracy:  {accuracy_score(y_test, y_dummy_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_dummy_pred, zero_division=0):.4f}")
print(f"  Recall:    {recall_score(y_test, y_dummy_pred, zero_division=0):.4f}")
print(f"  F1:        {f1_score(y_test, y_dummy_pred, zero_division=0):.4f}")

<a id="6"></a>
## 6. Preprocessing Pipeline

We use `ColumnTransformer` to apply different transformations to numeric and categorical features.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = ["tenure_months", "monthly_charges", "total_charges",
                    "num_support_tickets", "has_online_security",
                    "is_senior", "num_dependents"]
categorical_features = ["contract_type", "internet_service", "payment_method"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(drop="first", sparse_output=False, handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

# Quick check
X_train_processed = preprocessor.fit_transform(X_train)
print(f"Processed feature matrix shape: {X_train_processed.shape}")

<a id="7"></a>
## 7. Model Training

We train three models:
1. **Logistic Regression** - interpretable, good baseline
2. **Random Forest** - handles non-linear relationships
3. **Gradient Boosting** - often top-performing for tabular data

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "Logistic Regression": Pipeline([
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(max_iter=1000, random_state=42))
    ]),
    "Random Forest": Pipeline([
        ("preprocessor", preprocessor),
        ("model", RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1))
    ]),
    "Gradient Boosting": Pipeline([
        ("preprocessor", preprocessor),
        ("model", GradientBoostingClassifier(n_estimators=200, random_state=42))
    ]),
}

trained_models = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    trained_models[name] = pipe
    print(f"{name} trained.")

<a id="8"></a>
## 8. Evaluation & Comparison

In [None]:
from sklearn.metrics import average_precision_score, RocCurveDisplay, PrecisionRecallDisplay

results = []
for name, pipe in trained_models.items():
    y_pred = pipe.predict(X_test)
    y_prob = pipe.predict_proba(X_test)[:, 1]

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "ROC-AUC": roc_auc_score(y_test, y_prob),
        "PR-AUC": average_precision_score(y_test, y_prob),
    })

results_df = pd.DataFrame(results).set_index("Model").round(4)
results_df

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (name, pipe) in zip(axes, trained_models.items()):
    y_pred = pipe.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
                xticklabels=["Stayed", "Churned"], yticklabels=["Stayed", "Churned"])
    ax.set_title(f"{name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")

plt.suptitle("Confusion Matrices", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# ROC and PR curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for name, pipe in trained_models.items():
    RocCurveDisplay.from_estimator(pipe, X_test, y_test, ax=axes[0], name=name)
    PrecisionRecallDisplay.from_estimator(pipe, X_test, y_test, ax=axes[1], name=name)

axes[0].set_title("ROC Curves")
axes[0].plot([0, 1], [0, 1], "k--", label="Random")
axes[0].legend()
axes[1].set_title("Precision-Recall Curves")

plt.tight_layout()
plt.show()

In [None]:
# Identify best model by ROC-AUC
best_model_name = results_df["ROC-AUC"].idxmax()
best_pipe = trained_models[best_model_name]
print(f"Best model: {best_model_name} (ROC-AUC = {results_df.loc[best_model_name, 'ROC-AUC']:.4f})")
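
Picking the best model from a single test split can favor a model by chance, and it uses up the test set for model selection. One common alternative, sketched here under assumptions, is to compare candidates with stratified cross-validation on the training data only. The example uses a synthetic stand-in dataset from `make_classification` (not the churn data) and smaller models to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with roughly 25% positives (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# One ROC-AUC score per fold for each candidate
cv_auc = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="roc_auc")
    for name, est in candidates.items()
}

for name, scores in cv_auc.items():
    print(f"{name}: ROC-AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the notebook, the same call could be made with each full pipeline and `X_train`, `y_train`, which would leave the test set untouched until the final evaluation.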

<a id="9"></a>
## 9. Threshold Tuning

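The default threshold of 0.5 may not be optimal given our asymmetric costs (FN cost >> FP cost). We search for the threshold that minimizes total business cost. Note that the search below runs on the test set, so the minimum cost it reports is an optimistic estimate; in practice, tune the threshold on a validation split or on out-of-fold predictions from the training data.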

In [None]:
y_prob_best = best_pipe.predict_proba(X_test)[:, 1]

# Business cost parameters
cost_fn = 500   # cost of missing a churner
cost_fp = 50    # cost of unnecessary retention offer

thresholds = np.arange(0.05, 0.95, 0.01)
costs = []
f1_scores = []
recalls = []
precisions = []

for t in thresholds:
    y_pred_t = (y_prob_best >= t).astype(int)
    cm = confusion_matrix(y_test, y_pred_t)
    tn, fp, fn, tp = cm.ravel()

    total_cost = fn * cost_fn + fp * cost_fp
    costs.append(total_cost)
    f1_scores.append(f1_score(y_test, y_pred_t, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_t, zero_division=0))
    precisions.append(precision_score(y_test, y_pred_t, zero_division=0))

optimal_idx = np.argmin(costs)
optimal_threshold = thresholds[optimal_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(thresholds, costs, color="red", linewidth=2)
axes[0].axvline(x=optimal_threshold, color="green", linestyle="--", label=f"Optimal: {optimal_threshold:.2f}")
axes[0].axvline(x=0.5, color="gray", linestyle=":", label="Default: 0.50")
axes[0].set_xlabel("Threshold")
axes[0].set_ylabel("Total Business Cost ($)")
axes[0].set_title("Business Cost vs Threshold")
axes[0].legend()

axes[1].plot(thresholds, precisions, label="Precision", linewidth=2)
axes[1].plot(thresholds, recalls, label="Recall", linewidth=2)
axes[1].plot(thresholds, f1_scores, label="F1", linewidth=2)
axes[1].axvline(x=optimal_threshold, color="green", linestyle="--", label=f"Optimal: {optimal_threshold:.2f}")
axes[1].set_xlabel("Threshold")
axes[1].set_ylabel("Score")
axes[1].set_title("Metrics vs Threshold")
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f"At this threshold: Recall={recalls[optimal_idx]:.3f}, "
      f"Precision={precisions[optimal_idx]:.3f}, F1={f1_scores[optimal_idx]:.3f}")
print(f"Min cost: ${costs[optimal_idx]:,.0f} (vs ${costs[np.argmin(np.abs(thresholds - 0.5))]:,.0f} at default 0.5)")
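
A threshold chosen on the test set and then evaluated on the same test set gives optimistic numbers. The sketch below shows one way to avoid that, assuming out-of-fold probabilities from `cross_val_predict` are an acceptable proxy for unseen data. It uses a synthetic stand-in dataset and a plain logistic regression; the names `oof_prob`, `grid`, and `oof_threshold` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Stand-in "training" data with roughly 25% positives (an assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Each probability comes from a model that never saw that row during fitting
oof_prob = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, method="predict_proba"
)[:, 1]

cost_fn, cost_fp = 500, 50  # same cost estimates as in the notebook
grid = np.arange(0.05, 0.95, 0.01)
grid_costs = []
for t in grid:
    y_hat = (oof_prob >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_demo, y_hat, labels=[0, 1]).ravel()
    grid_costs.append(fn * cost_fn + fp * cost_fp)

oof_threshold = grid[int(np.argmin(grid_costs))]
print(f"Threshold chosen on out-of-fold predictions: {oof_threshold:.2f}")
```

The chosen threshold can then be applied once to the held-out test set, so the final cost estimate is not influenced by the search.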

<a id="10"></a>
## 10. Handling Class Imbalance

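We retrain the best model type with balanced class weights to see whether explicitly accounting for class imbalance improves performance. Logistic Regression and Random Forest accept `class_weight='balanced'`; `GradientBoostingClassifier` has no `class_weight` parameter, so for it we pass balanced `sample_weight` values to `fit` instead.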

In [None]:
from sklearn.utils.class_weight import compute_sample_weight

# Balanced variants of the model types that accept class_weight
balanced_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1, class_weight="balanced"),
}

# Choose the balancing strategy that matches the best model type
if best_model_name == "Gradient Boosting":
    # GradientBoostingClassifier has no class_weight parameter, so pass balanced sample weights to fit()
    balanced_pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("model", GradientBoostingClassifier(n_estimators=200, random_state=42))
    ])
    sample_weights = compute_sample_weight("balanced", y_train)
    balanced_pipe.fit(X_train, y_train, model__sample_weight=sample_weights)
else:
    balanced_pipe = Pipeline([
        ("preprocessor", preprocessor),
        ("model", balanced_models[best_model_name])
    ])
    balanced_pipe.fit(X_train, y_train)

y_pred_balanced = balanced_pipe.predict(X_test)
y_prob_balanced = balanced_pipe.predict_proba(X_test)[:, 1]

print(f"Balanced {best_model_name} (default threshold 0.5):")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred_balanced):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred_balanced):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred_balanced):.4f}")
print(f"  F1:        {f1_score(y_test, y_pred_balanced):.4f}")
print(f"  ROC-AUC:   {roc_auc_score(y_test, y_prob_balanced):.4f}")

In [None]:
# Compare confusion matrices: original vs balanced
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, (title, y_p) in zip(axes, [(f"{best_model_name} (Original)", best_pipe.predict(X_test)),
                                    (f"{best_model_name} (Balanced)", y_pred_balanced)]):
    cm = confusion_matrix(y_test, y_p)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
                xticklabels=["Stayed", "Churned"], yticklabels=["Stayed", "Churned"])
    ax.set_title(title)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")

plt.tight_layout()
plt.show()
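
For intuition about what `"balanced"` does, the sketch below reproduces the weights by hand on a hypothetical label vector with a 75/25 split. scikit-learn computes them as `n_samples / (n_classes * np.bincount(y))`, so the minority class receives the larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 75 customers who stayed, 25 who churned
y_toy = np.array([0] * 75 + [1] * 25)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same formula written out explicitly
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With these weights each class contributes the same total weight to the loss (75 x 0.667 and 25 x 2.0 are both about 50), which is what shifts the default decision boundary toward higher recall on churners.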

<a id="11"></a>
## 11. Final Test Evaluation

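We apply the optimal threshold from the cost analysis to the best model for our final evaluation. Because that threshold was chosen on this same test set, treat the numbers below as an optimistic estimate of performance on new customers.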

In [None]:
# Final predictions using the optimal threshold
y_final_prob = best_pipe.predict_proba(X_test)[:, 1]
y_final_pred = (y_final_prob >= optimal_threshold).astype(int)

print("=" * 55)
print("  FINAL MODEL - Test Set Performance")
print(f"  Model: {best_model_name}")
print(f"  Threshold: {optimal_threshold:.2f}")
print("=" * 55)
print(f"  Accuracy:  {accuracy_score(y_test, y_final_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_final_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_final_pred):.4f}")
print(f"  F1:        {f1_score(y_test, y_final_pred):.4f}")
print(f"  ROC-AUC:   {roc_auc_score(y_test, y_final_prob):.4f}")
print("=" * 55)

print("\nClassification Report:")
print(classification_report(y_test, y_final_pred, target_names=["Stayed", "Churned"]))

<a id="12"></a>
## 12. Feature Importance

In [None]:
# Get feature names after preprocessing, using the fitted preprocessor inside the best pipeline
fitted_preprocessor = best_pipe.named_steps["preprocessor"]
cat_encoder = fitted_preprocessor.named_transformers_["cat"].named_steps["onehot"]
cat_feature_names = cat_encoder.get_feature_names_out(categorical_features).tolist()
all_feature_names = numeric_features + cat_feature_names

# Extract importances based on model type
model_obj = best_pipe.named_steps["model"]

if hasattr(model_obj, "feature_importances_"):
    importances = model_obj.feature_importances_
    importance_label = "Impurity-Based Feature Importance"
elif hasattr(model_obj, "coef_"):
    importances = np.abs(model_obj.coef_[0])
    importance_label = "Absolute Coefficient"
else:
    importances = np.zeros(len(all_feature_names))
    importance_label = "N/A"

feat_imp_df = pd.DataFrame({
    "Feature": all_feature_names,
    "Importance": importances
}).sort_values("Importance", ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(feat_imp_df["Feature"], feat_imp_df["Importance"], color="steelblue", edgecolor="black")
plt.xlabel(importance_label)
plt.title(f"Feature Importance ({best_model_name})")
plt.tight_layout()
plt.show()

print("\nTop 5 Features:")
print(feat_imp_df.sort_values("Importance", ascending=False).head(5).to_string(index=False))
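
Impurity-based importances and raw coefficients are not directly comparable across model types. Permutation importance is a model-agnostic alternative that could complement the plot above: it measures how much a score drops when one feature's values are shuffled. This is a minimal sketch on synthetic stand-in data with a small random forest, not the notebook's fitted pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data: 6 features, of which 3 are informative (an assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on held-out data and record the drop in ROC-AUC
perm = permutation_importance(
    rf, X_val, y_val, scoring="roc_auc", n_repeats=5, random_state=0
)

for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

When the estimator is a full pipeline and the input is the raw DataFrame, the result has one importance per original column (for example, a single value for `contract_type`), which is often easier to present to stakeholders than per-dummy importances.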

<a id="13"></a>
## 13. Model Saving

In [None]:
import joblib
import os

os.makedirs("saved_models", exist_ok=True)

# Save model and optimal threshold together
artifact = {
    "pipeline": best_pipe,
    "optimal_threshold": optimal_threshold,
    "model_name": best_model_name,
}

model_path = "saved_models/churn_model.joblib"
joblib.dump(artifact, model_path)
print(f"Model artifact saved to: {model_path}")

# Verify
loaded = joblib.load(model_path)
test_probs = loaded["pipeline"].predict_proba(X_test.iloc[:5])[:, 1]
test_preds = (test_probs >= loaded["optimal_threshold"]).astype(int)
print(f"\nSample predictions (threshold={loaded['optimal_threshold']:.2f}):")
print(f"  Probabilities: {test_probs.round(3)}")
print(f"  Predictions:   {test_preds}")
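
Saving the threshold with the pipeline means downstream code does not have to hard-code it. The sketch below shows one way a consumer might score new customers from such an artifact. `score_customers` is a hypothetical helper, and the tiny pipeline, the toy data, and the 0.2 threshold are stand-ins so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def score_customers(artifact, X_new):
    """Return churn probabilities and retention flags using the stored threshold."""
    probs = artifact["pipeline"].predict_proba(X_new)[:, 1]
    return pd.DataFrame(
        {
            "churn_probability": probs,
            "flag_for_retention": probs >= artifact["optimal_threshold"],
        },
        index=X_new.index,
    )


# Stand-in artifact with the same keys as the one saved above (toy data and model)
X_toy = pd.DataFrame({
    "tenure_months": np.arange(1, 51),
    "monthly_charges": np.linspace(20, 120, 50),
})
y_toy = (X_toy["tenure_months"] < 12).astype(int)
toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_toy, y_toy)
toy_artifact = {"pipeline": toy_pipe, "optimal_threshold": 0.2, "model_name": "toy"}

scored = score_customers(toy_artifact, X_toy.head())
print(scored)
```

In the notebook, the same helper could be called with `joblib.load(model_path)` and a DataFrame that has the original ten feature columns.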

<a id="14"></a>
## 14. Conclusions

### Key Findings

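- **Contract type** and **tenure** are among the strongest drivers of churn in this dataset, as expected from the generating formula: month-to-month contracts and short tenure both raise churn risk. Confirm this against the feature-importance plot for the selected model.
- **Electronic check** as a payment method is associated with higher churn. Here that follows from the positive coefficient in the synthetic generating formula; on real data, the cause (for example, lower commitment) would need investigation.
- The default 0.5 threshold is not optimal for this business problem. Lowering the threshold improves recall and catches more churners at the cost of more false positives, which is acceptable here because a missed churner costs about ten times as much as an unnecessary offer.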
- Balancing class weights increases recall but may decrease precision. The optimal trade-off depends on the actual business cost structure.

### Business Recommendations

1. **Target month-to-month customers** with 1-12 months tenure for proactive retention outreach.
2. **Incentivize contract upgrades** from month-to-month to annual or two-year plans.
3. **Monitor support ticket volume** as an early warning signal.
4. **Deploy the model** with the business-cost-optimized threshold to maximize ROI on retention campaigns.

### Next Steps

1. **A/B test** the model-driven retention offers vs. existing strategy.
2. **Add time-series features** (recent usage trends, billing changes).
3. **Try XGBoost/LightGBM** for potentially better performance.
4. **Build a real-time scoring pipeline** for integration with CRM systems.
5. **Implement model monitoring** to detect data drift and trigger retraining.