# Customer Churn Prediction

This notebook builds baseline churn prediction models from customer behavioral
features derived from RFM (recency, frequency, monetary value) analysis. The goal
is to identify customers who are likely to churn so that they can be targeted with
proactive retention strategies.


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
rfm = pd.read_csv("../data/processed/rfm_features.csv")

rfm.head()


In [None]:
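# Define churn by inactivity: a customer is labelled as churned (1) when their
# recency exceeds the 75th percentile of all customers, and as retained (0) otherwise.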
RECENCY_THRESHOLD = rfm["recency"].quantile(0.75)

rfm["churn"] = (rfm["recency"] > RECENCY_THRESHOLD).astype(int)

rfm["churn"].value_counts()


In [None]:
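# "recency" is excluded: the churn label is derived directly from it, so using it
# as a feature would leak the target and make the scores trivially perfect.
features = ["frequency", "monetary_value"]
X = rfm[features]
y = rfm["churn"]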


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)


In [None]:
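# Fit the scaler on the training split only, then apply it to the test split,
# so that no test-set statistics leak into training.
scaler = StandardScaler()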

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)

print("Logistic Regression Results")
print(classification_report(y_test, y_pred_lr))


In [None]:
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)

# Tree-based models are insensitive to feature scaling, so the unscaled features are used.
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Results")
print(classification_report(y_test, y_pred_rf))


In [None]:
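# Rank first so that heavily tied frequency values cannot produce duplicate
# bin edges, which would make pd.qcut raise a ValueError. Ties are split by row order.
rfm["rfm_segment"] = pd.qcut(
    rfm["frequency"].rank(method="first"),
    q=3,
    labels=["Low", "Medium", "High"]
)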


In [None]:
rfm["value_segment"] = pd.qcut(
    rfm["monetary_value"],
    q=3,
    labels=["Low Value", "Mid Value", "High Value"]
)


In [None]:
cm = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(5, 4))
# Rows are actual classes and columns are predicted classes (0 = retained, 1 = churned).
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Random Forest Confusion Matrix")
plt.show()


## Business Interpretation

Customers predicted as high churn risk can be targeted with retention campaigns,
personalized offers, or engagement incentives. When churn predictions are combined
with customer lifetime value (CLV) estimates, high-value customers with a high churn
probability can be prioritized for immediate intervention, while low-value,
churn-prone customers can be handled with more cost-efficient strategies.
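
One way to make this prioritization concrete is to rank customers by expected
value at risk, the product of churn probability and CLV. The sketch below is
illustrative only: the `clv` column and every value in it are hypothetical (this
notebook does not compute CLV), and the `churn_prob` values stand in for the
positive-class column of something like `rf.predict_proba(X_test)[:, 1]`.

```python
import pandas as pd

# Toy scores for illustration; both numeric columns are made-up values.
scored = pd.DataFrame(
    {
        "customer_id": [101, 102, 103, 104],
        "churn_prob": [0.20, 0.90, 0.85, 0.70],  # e.g. from predict_proba
        "clv": [3000.0, 1200.0, 90.0, 150.0],    # hypothetical CLV estimates
    }
)

# Expected value at risk = probability of churning * value lost if they do.
scored["value_at_risk"] = scored["churn_prob"] * scored["clv"]

# Highest expected loss first: these customers would be contacted first.
priority = scored.sort_values("value_at_risk", ascending=False).reset_index(drop=True)
print(priority[["customer_id", "churn_prob", "clv", "value_at_risk"]])
```

In this toy data, customer 101 ranks second despite a low churn probability,
because the high CLV outweighs it. Whether a product of the two is the right
ranking criterion depends on campaign costs and on how well calibrated the
predicted probabilities are.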


## Actionable Recommendations

- High-CLV customers should be enrolled in loyalty and retention programs.
- Customers with high churn probability and high CLV require immediate intervention.
- Low-value, high-churn customers can be targeted with automated, low-cost campaigns.
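
These recommendations could be encoded as a simple rule table. The sketch below
is one possible reading of them, not a tested policy: the cut-offs `PROB_CUTOFF`
and `CLV_CUTOFF`, the helper `assign_action`, the action labels, and the example
data are all hypothetical and would need to be tuned against real campaign costs.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-offs, chosen only for illustration.
PROB_CUTOFF = 0.5    # churn probability above this counts as "high risk"
CLV_CUTOFF = 500.0   # CLV above this counts as "high value"


def assign_action(scored: pd.DataFrame) -> pd.Series:
    """Map each customer to a retention action from churn risk and CLV."""
    high_risk = scored["churn_prob"] > PROB_CUTOFF
    high_value = scored["clv"] > CLV_CUTOFF
    conditions = [
        high_risk & high_value,    # high CLV, high churn risk
        high_risk & ~high_value,   # low value, high churn risk
        ~high_risk & high_value,   # high CLV, low churn risk
    ]
    actions = ["immediate_intervention", "automated_campaign", "loyalty_program"]
    # np.select picks the first matching condition per row, else the default.
    return pd.Series(
        np.select(conditions, actions, default="no_action"),
        index=scored.index,
        name="action",
    )


# Made-up customers covering each of the four combinations.
customers = pd.DataFrame(
    {"churn_prob": [0.9, 0.8, 0.1, 0.2], "clv": [1200.0, 80.0, 2500.0, 60.0]}
)
customers["action"] = assign_action(customers)
print(customers)
```

Keeping the rules in one function makes the thresholds easy to review and change
as campaign results come in.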

