# Churn Prediction Modeling

This notebook builds predictive models to estimate customer churn risk
using insights derived from exploratory data analysis. The goal is to
establish an interpretable baseline model and evaluate its performance
using applicable classification metrics.


In [1]:
#imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix


In [2]:
#load raw data 
cust_df = pd.read_excel("../data/ecommerce_churn.xlsx", sheet_name="E Comm")

cust_df.head()

Unnamed: 0,CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
0,50001,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3,Laptop & Accessory,2,Single,9,1,11.0,1.0,1.0,5.0,159.93
1,50002,1,,Phone,1,8.0,UPI,Male,3.0,4,Mobile,3,Single,7,1,15.0,0.0,1.0,0.0,120.9
2,50003,1,,Phone,1,30.0,Debit Card,Male,2.0,4,Mobile,3,Single,6,1,14.0,0.0,1.0,3.0,120.28
3,50004,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4,Laptop & Accessory,5,Single,8,0,23.0,0.0,1.0,3.0,134.07
4,50005,1,0.0,Phone,1,12.0,CC,Male,,3,Mobile,5,Single,3,0,11.0,1.0,1.0,3.0,129.6


## Data Preparation

Prepare the dataset for modeling using the preprocessing steps identified
during EDA: missing value treatment, feature engineering, and encoding.
Here the target is separated from the features and the feature columns are
grouped by data type, so each group can be preprocessed reproducibly.


In [3]:
#separate features from target variable
cust_df = cust_df.drop(columns = ['CustomerID'])
X = cust_df.drop(columns = ['Churn'])
y = cust_df['Churn'].astype('int')

#define dtypes for pipeline separation
cols_category = X.select_dtypes(include = ['object']).columns.tolist()
cols_numeric = X.select_dtypes(exclude = ['object']).columns.tolist()

cols_category, cols_numeric

(['PreferredLoginDevice',
  'PreferredPaymentMode',
  'Gender',
  'PreferedOrderCat',
  'MaritalStatus'],
 ['Tenure',
  'CityTier',
  'WarehouseToHome',
  'HourSpendOnApp',
  'NumberOfDeviceRegistered',
  'SatisfactionScore',
  'NumberOfAddress',
  'Complain',
  'OrderAmountHikeFromlastYear',
  'CouponUsed',
  'OrderCount',
  'DaySinceLastOrder',
  'CashbackAmount'])

## Modeling Pipeline

A scikit-learn Pipeline with a ColumnTransformer is used to combine preprocessing
and modeling into a single, reproducible workflow. This ensures that imputation,
encoding, and scaling are learned only from the training data, preventing data
leakage and allowing consistent application to new data.
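
As a minimal illustration of the leakage point above, the sketch below fits a `StandardScaler` inside a `Pipeline` on a toy training array and checks that a test row is scaled with the training mean and standard deviation. The arrays and the names `X_train_toy`, `X_test_toy`, and `toy_pipe` are made up for illustration and are not part of the churn data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy values (hypothetical, not taken from the churn dataset)
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[10.0]])

toy_pipe = Pipeline(steps=[("scaler", StandardScaler())])
toy_pipe.fit(X_train_toy)  # statistics are learned from the training rows only

train_mean = toy_pipe.named_steps["scaler"].mean_[0]
train_std = toy_pipe.named_steps["scaler"].scale_[0]

# The test row is transformed with the training mean and standard deviation
scaled_test = toy_pipe.transform(X_test_toy)[0, 0]
print(train_mean, np.isclose(scaled_test, (10.0 - train_mean) / train_std))
```

The test value never influences `mean_` or `scale_`, which is the property the full pipeline relies on during cross-validation and final evaluation.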


In [4]:
#see if there are enough unique levels to justify another encoding technique
for col in cols_category:
    print(f"{col} has {cust_df[col].nunique()} unique values")

PreferredLoginDevice has 3 unique values
PreferredPaymentMode has 7 unique values
Gender has 2 unique values
PreferedOrderCat has 6 unique values
MaritalStatus has 3 unique values


In [5]:
#split into train and test splits, use stratify to keep class balance uniform
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


In [6]:
#Separate CouponUsed to preserve business logic from EDA
#missing coupon usage implies no coupon was used
coupon_col = ["CouponUsed"]
other_numeric_cols = [col for col in cols_numeric if col != "CouponUsed"]

#Categorical preprocessing:
#impute missing categories and one hot encode
cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

#Numeric preprocessing:
#median imputation with missingness indicators, then scaling
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median", add_indicator=True)),
    ("scaler", StandardScaler())
])

#Coupon preprocessing:
#zero-imputation reflects absence of coupon usage, with missingness indicator
coupon_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)),
    ("scaler", StandardScaler())
])

#Combine preprocessing steps by feature type
full_transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, other_numeric_cols),
        ("coupon", coupon_pipe, coupon_col),
        ("cat", cat_pipe, cols_category),
    ],
    remainder="drop"
)

#Full classification pipeline to prevent data leakage
classification_pipe = Pipeline(steps=[
    ("preprocess", full_transformer),
    ("model", LogisticRegression(
        max_iter=2000,
        class_weight="balanced",
        random_state=42
    ))
])


## Cross-Validation

5-fold stratified cross-validation is run on the training set to estimate how
the model would perform on unseen data and how much its scores vary with
sampling. Preprocessing is included inside the pipeline to prevent data leakage
within each fold. Final results are reported on a held-out test set.
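
To make the "stratified" part concrete, here is a small sketch on synthetic labels showing that each validation fold keeps the overall positive rate. The 20/80 class split and the names `y_toy`, `X_toy`, and `toy_cv` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels (hypothetical): 20 positives, 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((len(y_toy), 1))  # features do not affect how the folds are built

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate inside each validation fold
fold_rates = []
for train_idx, val_idx in toy_cv.split(X_toy, y_toy):
    fold_rates.append(float(y_toy[val_idx].mean()))

print(fold_rates)
```

Each fold holds the same share of positives as the full sample, so metrics such as recall are computed on comparable class mixes in every fold.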


In [7]:
from sklearn.model_selection import StratifiedKFold, cross_validate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_scores = cross_validate(
    classification_pipe,
    X_train, 
    y_train,
    cv = cv,
    scoring=["roc_auc", "precision", "recall", "f1"],
    return_train_score=False
)

{m: (model_scores[f"test_{m}"].mean(), model_scores[f"test_{m}"].std()) for m in ["roc_auc","precision","recall","f1"]}

{'roc_auc': (0.8941929889013199, 0.0157601135502362),
 'precision': (0.46398772429134016, 0.02550091573331306),
 'recall': (0.8199743918053779, 0.015772444430971404),
 'f1': (0.5922518100396525, 0.0224844164119128)}

## Model Evaluation Summary

Cross-validation indicates that the logistic regression pipeline performs
consistently across folds (ROC-AUC ≈ 0.89, standard deviation ≈ 0.02) with high
recall for churners (≈ 0.82) and lower precision (≈ 0.46). Given the class
imbalance, recall is emphasized to reduce missed churn cases, while accepting
lower precision. The low variance across folds suggests the model should
generalize reliably to similar data.


In [8]:
classification_pipe.fit(X_train, y_train)

y_prob = classification_pipe.predict_proba(X_test)[:, 1]
y_pred = classification_pipe.predict(X_test)

print("Test ROC-AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


Test ROC-AUC: 0.8971343328757202
              precision    recall  f1-score   support

           0       0.97      0.79      0.87      1171
           1       0.45      0.86      0.59       237

    accuracy                           0.80      1408
   macro avg       0.71      0.83      0.73      1408
weighted avg       0.88      0.80      0.82      1408

[[926 245]
 [ 33 204]]
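
The class-1 metrics in the report can be re-derived from the confusion matrix printed above. The short sketch below does that arithmetic by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = np.array([[926, 245],
               [33, 204]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 204 / 449: share of flagged customers who churned
recall = tp / (tp + fn)      # 204 / 237: share of churners who were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the churn row of the classification report (0.45, 0.86, 0.59).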


In [9]:
#Evaluate different probability thresholds to understand the precision–recall tradeoff
#and support business decisions (prioritizing recall to minimize missed churners versus precision to reduce unnecessary outreach)

thresholds = [0.2, 0.3, 0.4, 0.5, 0.6]

for t in thresholds:
    y_pred_t = (y_prob >= t).astype(int)
    report = classification_report(y_test, y_pred_t, output_dict=True)
    print(
        f"Threshold={t:.1f}  "
        f"Precision={report['1']['precision']:.3f}  "
        f"Recall={report['1']['recall']:.3f}  "
        f"F1={report['1']['f1-score']:.3f}"
    )


Threshold=0.2  Precision=0.299  Recall=0.937  F1=0.453
Threshold=0.3  Precision=0.353  Recall=0.920  F1=0.511
Threshold=0.4  Precision=0.412  Recall=0.895  F1=0.564
Threshold=0.5  Precision=0.454  Recall=0.861  F1=0.595
Threshold=0.6  Precision=0.529  Recall=0.797  F1=0.636


## Threshold Analysis Interpretation

Adjusting the classification threshold reveals a clear tradeoff between recall
and precision. Lower thresholds prioritize identifying most churners at the cost
of increased false positives, while higher thresholds reduce unnecessary outreach
but miss more churn cases. A threshold around 0.4–0.5 keeps recall at roughly
0.86–0.90 with precision of 0.41–0.45, while F1 keeps rising up to the largest
threshold tested (0.636 at 0.6). The optimal choice therefore depends on
business constraints and retention capacity.
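
One possible way to turn a business constraint into a threshold is to pick the highest threshold that still meets a minimum recall. The sketch below uses `precision_recall_curve` on made-up labels and probabilities; `y_true_toy`, `y_prob_toy`, and the 0.75 recall target are assumptions, standing in for `y_test`, `y_prob`, and whatever target the business sets.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities (stand-ins for y_test and y_prob)
y_true_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true_toy, y_prob_toy)

target_recall = 0.75  # assumed business requirement

# precision and recall have one more entry than thresholds, so drop the last point
meets_target = recall[:-1] >= target_recall
chosen_threshold = float(thresholds[meets_target].max())
chosen_precision = float(precision[:-1][thresholds == chosen_threshold][0])

print(chosen_threshold, chosen_precision)
```

Raising the threshold as far as the recall target allows gives the best precision available under that constraint, which limits unnecessary outreach.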


In [10]:
#Now let's look at feature importance by extracting preprocessing/modeling steps from the pipeline
log_reg = classification_pipe.named_steps["model"]

preprocessor = classification_pipe.named_steps["preprocess"]

num_features = preprocessor.named_transformers_["num"] \
    .named_steps["imputer"] \
    .get_feature_names_out(other_numeric_cols)

coupon_features = preprocessor.named_transformers_["coupon"] \
    .named_steps["imputer"] \
    .get_feature_names_out(coupon_col)

cat_features = preprocessor.named_transformers_["cat"] \
    .named_steps["onehot"] \
    .get_feature_names_out(cols_category)

#Combine all feature names
feature_names = np.concatenate([
    num_features,
    coupon_features,
    cat_features
])

#Create coefficient DataFrame
coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": log_reg.coef_.ravel()
})

coef_df["abs_coefficient"] = coef_df["coefficient"].abs()

coef_df.sort_values("abs_coefficient", ascending=False).head(15)

Unnamed: 0,feature,coefficient,abs_coefficient
37,PreferedOrderCat_Others,2.543767,2.543767
0,Tenure,-1.686479,1.686479
34,PreferedOrderCat_Laptop & Accessory,-1.677821,1.677821
35,PreferedOrderCat_Mobile,-1.147181,1.147181
33,PreferedOrderCat_Grocery,0.731569,0.731569
7,Complain,0.716873,0.716873
36,PreferedOrderCat_Mobile Phone,-0.703809,0.703809
24,PreferredPaymentMode_COD,0.668626,0.668626
6,NumberOfAddress,0.644189,0.644189
23,PreferredPaymentMode_CC,-0.593912,0.593912


## Model Interpretation Summary

Logistic regression coefficients indicate that tenure and product category
preferences are the strongest predictors of churn in this model. Longer-tenured
customers and those purchasing in core product categories (Laptop & Accessory,
Mobile) exhibit lower churn risk, while customers with complaints, a preferred
order category of Others or Grocery, and COD payment preferences are more
likely to churn. These associations closely align with patterns observed
during EDA, reinforcing the importance of customer longevity, service experience,
and purchasing behavior in churn risk.
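
Coefficients can be easier to read as odds ratios. The sketch below exponentiates two coefficients copied from the table above. Because the numeric features pass through `StandardScaler`, each ratio refers to a one-standard-deviation increase in the feature, holding the others fixed, rather than a one-unit increase.

```python
import numpy as np

# Coefficients copied from the coefficient table above
coefs = {"Tenure": -1.686479, "Complain": 0.716873}

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-standard-deviation increase in the scaled feature
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}

print({name: round(ratio, 2) for name, ratio in odds_ratios.items()})
```

Under this reading, a one-standard-deviation increase in tenure multiplies the odds of churn by roughly 0.19, while the same increase in the complaint flag roughly doubles them.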


## Gradient Boosted Tree Model

A gradient boosted tree model is trained to capture non-linear relationships
and feature interactions that logistic regression may miss. It does so by
iteratively fitting weak learners to the residual errors of the previous ones.
Performance will be compared against the baseline model to assess whether the
additional complexity provides meaningful gains.
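
The residual-fitting idea can be sketched by hand for the simplest case, squared-error regression with depth-1 trees. This is an illustration on synthetic data, not what `GradientBoostingClassifier` does internally: the classifier fits its trees to gradients of the log loss rather than to raw residuals. All names and values below are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (hypothetical)
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant prediction
mse_history = [float(np.mean((y_toy - prediction) ** 2))]

for _ in range(50):
    residual = y_toy - prediction                 # errors of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1, random_state=42)
    stump.fit(X_toy, residual)                    # weak learner fitted to the residuals
    prediction += learning_rate * stump.predict(X_toy)
    mse_history.append(float(np.mean((y_toy - prediction) ** 2)))

print(mse_history[0], mse_history[-1])
```

Each stump corrects a small part of what the ensemble still gets wrong, so the training error shrinks round by round. The learning rate controls how much of each correction is kept.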


In [11]:
from sklearn.ensemble import GradientBoostingClassifier

#use "full transformer" from earlier
gb_pipe = Pipeline(steps=[
    ("preprocess", full_transformer),
    ("model", GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=3,
        random_state=42
    ))
])

#Use same cv steps as before to estimate how model will perform on unseen data
gb_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gb_scores = cross_validate(
    gb_pipe,
    X_train,
    y_train,
    cv=gb_cv,
    scoring=["roc_auc", "precision", "recall", "f1"],
    return_train_score=False
)

#See mean/std of metrics 
{m: (gb_scores[f"test_{m}"].mean(), gb_scores[f"test_{m}"].std())
 for m in ["roc_auc", "precision", "recall", "f1"]}



{'roc_auc': (0.9382949514870595, 0.011707998241371885),
 'precision': (0.8342138918129042, 0.03272835462225324),
 'recall': (0.6174332709543977, 0.01826000168339295),
 'f1': (0.7092496755686907, 0.017306669540064837)}

In [12]:
#let's fit and evaluate on test set
gb_pipe.fit(X_train, y_train)

y_prob_gb = gb_pipe.predict_proba(X_test)[:, 1]
y_pred_gb = gb_pipe.predict(X_test)

print("GB Test ROC-AUC:", roc_auc_score(y_test, y_prob_gb))
print(classification_report(y_test, y_pred_gb))


GB Test ROC-AUC: 0.9425425273937311
              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1171
           1       0.82      0.63      0.71       237

    accuracy                           0.91      1408
   macro avg       0.87      0.80      0.83      1408
weighted avg       0.91      0.91      0.91      1408



## Gradient Boosting Observations

This model outperformed the baseline logistic regression model in terms of F1,
precision, and ROC-AUC, but scored lower in terms of recall. At the default
threshold it identified fewer churners than the baseline model, but was more
selective when flagging a customer as a churner. I am interested in how changing
the threshold for predicted churners could increase recall.

In [13]:
# Threshold analysis for Gradient Boosted Tree (same method as baseline)
thresholds = [0.2, 0.3, 0.4, 0.5]

for t in thresholds:
    y_pred_t = (y_prob_gb >= t).astype(int)
    report = classification_report(y_test, y_pred_t, output_dict=True)
    print(
        f"Threshold={t:.1f} | "
        f"Precision={report['1']['precision']:.3f} | "
        f"Recall={report['1']['recall']:.3f} | "
        f"F1={report['1']['f1-score']:.3f}"
    )


Threshold=0.2 | Precision=0.630 | Recall=0.878 | F1=0.734
Threshold=0.3 | Precision=0.714 | Recall=0.831 | F1=0.768
Threshold=0.4 | Precision=0.772 | Recall=0.730 | F1=0.751
Threshold=0.5 | Precision=0.819 | Recall=0.629 | F1=0.711


## Business Recommendation: Churn Intervention Strategy

Based on the model comparison, the gradient boosted tree is the preferred
approach for identifying customers at risk of churn. When the probability
threshold is lowered (around 0.2–0.3), the model captures a similar share of
churners as the baseline logistic regression while flagging far fewer customers
who are unlikely to churn.
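
To put rough numbers on that comparison, the sketch below converts the precision and recall values printed earlier into approximate customer counts on the 1,408-row test set. Because the inputs are rounded to three decimals, the counts for the gradient boosted model are estimates rather than exact confusion-matrix entries.

```python
# Test-set churners (support for class 1 in the reports above)
n_churners = 237

# (precision, recall) pairs copied from the threshold outputs above
scenarios = {
    "logistic @0.5": (0.454, 0.861),
    "gb @0.2": (0.630, 0.878),
    "gb @0.3": (0.714, 0.831),
}

results = {}
for name, (precision, recall) in scenarios.items():
    caught = round(recall * n_churners)   # approximate true positives
    flagged = round(caught / precision)   # approximate customers flagged for outreach
    results[name] = (caught, flagged)
    print(f"{name}: ~{caught} churners caught, ~{flagged} customers flagged")
```

On these estimates the gradient boosted model at 0.2 catches about as many churners as the baseline while contacting roughly 120 fewer customers.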

In practice, this allows retention efforts to be more focused and efficient.
Customers identified as high risk can be prioritized for targeted outreach,
such as personalized incentives, follow-up communication, or service recovery
actions. At the same time, reducing false positives helps limit unnecessary
outreach and associated operational costs.

The probability threshold can be adjusted depending on business needs, such as
available retention budget, customer lifetime value, or campaign capacity.
Overall, this approach balances the goal of preventing churn with the practical
constraints of running scalable retention programs.
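
When campaign capacity is the binding constraint, one simple option is to rank customers by predicted probability and contact only as many as the team can handle. The sketch below illustrates this with invented probabilities; `probs` and `capacity` are hypothetical stand-ins for `y_prob_gb` and a real outreach budget.

```python
import numpy as np

# Hypothetical predicted churn probabilities for six customers
probs = np.array([0.12, 0.81, 0.47, 0.66, 0.05, 0.92])
capacity = 3  # assumed number of customers the retention team can contact

# Indices of the highest-risk customers, most at risk first
top_idx = np.argsort(probs)[::-1][:capacity]

# The implied threshold is the lowest probability that still gets contacted
implied_threshold = float(probs[top_idx].min())

print(top_idx.tolist(), implied_threshold)
```

Ranking this way fixes the outreach volume and lets the effective threshold move with the score distribution, instead of fixing the threshold and letting volume vary.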
