# <span style="color:goldenrod"><b>Customer Churn Prediction and Analysis</b></span>
- Customer churn occurs when customers stop using a business's products, services, or both. Churn directly reduces revenue and market share. Accurately identifying why customers leave helps a business improve its marketing, retention strategies, and customer service, which in turn improves satisfaction and profitability.

- The project showcases the complete workflow of data exploration and preprocessing, feature engineering, model training with advanced classifiers, hyperparameter tuning, and model explanation using SHAP values.

- The Telco customer churn dataset contains customer details related to demographics, account details, and usage patterns.

- This project also addresses class imbalance, a common issue in churn datasets. Oversampling with SMOTE is combined with class-weight adjustments to make the model more robust. Performance is evaluated using multiple metrics, and the final model is interpreted to uncover the key factors driving churn.

- This notebook serves as a comprehensive case study demonstrating the practical application of machine learning workflows to solve a critical business problem with real-world data and modern tools.

<h2><span style="color:coral"><u><center>Importing the libraries</center></u></span></h2>

In [None]:
import numpy as np
import pandas as pd
import re

<h2><span style="color:coral"><u><center>Reading the data</center></u></span></h2>

In [None]:
df = pd.read_excel("../data/Telco_customer_churn.xlsx")

<h2><span style="color:coral"><u><center>Info of the data</center></u></span></h2>

In [None]:
df.info()

In [None]:
df.describe()

<h2><span style="color:coral"><u><center>Separating columns in different lists based on non-object type and object type</center></u></span></h2>

In [None]:
list_of_cols_object = [col for col in df.columns if df[col].dtype == "object"]
list_of_cols_not_object = [col for col in df.columns if df[col].dtype != "object"]
print(list_of_cols_object)
print(list_of_cols_not_object)

<h2><span style="color:coral"><u><center>Checking for duplicated values in all columns</center></u></span></h2>

In [None]:
for col in df.columns:
    if df[col].duplicated().sum() > 0:
        print(col)
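
The per-column loop above prints almost every column, because any categorical or low-cardinality column repeats values by design. A minimal sketch of two narrower checks, fully duplicated rows and repeated customer identifiers, is shown here on a made-up frame (`toy` and its values are illustrative, not taken from the Telco data).

```python
import pandas as pd

# Toy frame with hypothetical values: the last row repeats the first one exactly.
toy = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3", "A1"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

# Rows in which every column matches an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Identifiers that appear more than once; these should be unique per customer.
n_duplicate_ids = int(toy["CustomerID"].duplicated().sum())

print(n_duplicate_rows, n_duplicate_ids)
```

On the real data the same two calls would be applied to `df` and `df["CustomerID"]`.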

<h2><span style="color:coral"><u><center>Checking for null values in all columns</center></u></span></h2>

In [None]:
for col in df.columns:
    if df[col].isnull().sum() > 0:
        print(f"Column name = {col};  total null values = {df[col].isnull().sum()}")

<h2><span style="color:coral"><u><center>Replacing NaN with 0</center></u></span></h2>

In [None]:
display(df.nunique())

In [None]:
df = df.fillna(0)


In [None]:
display(df["Churn Reason"].unique())

<h2><span style="color:coral"><u><center>Regex to convert object type to respective data types</center></u></span></h2>

In [None]:
string_pattern = re.compile(r'^[\d\w\s,_\-]+$')
int_pattern = re.compile(r'^-?\d+$')
float_pattern = re.compile(r'^-?\d*\.\d+$')
datetime_pattern = re.compile(
        r'^\d{4}-\d{2}-\d{2}( \d{2}:\d{2}(:\d{2}(\.\d{3})?)?)?$'
    )

for i in list_of_cols_object:
    col_values = df[i].dropna().astype(str)
    if col_values.apply(lambda x: bool(string_pattern.match(x))).all():
        df[i] = df[i].astype("string")


In [None]:
for i in df.columns:
    # Index with the loop variable `i`; `col` was a leftover from an earlier loop.
    # Check every non-null value rather than a sample, so that errors='coerce'
    # cannot silently turn a stray entry (such as ' ') into NaN.
    values = df[i].dropna().astype(str)
    if values.empty:
        continue
    if values.apply(lambda x: bool(int_pattern.match(x))).all():
        df[i] = pd.to_numeric(df[i], downcast='integer', errors='coerce')
    elif values.apply(lambda x: bool(float_pattern.match(x))).all():
        df[i] = pd.to_numeric(df[i], downcast='float', errors='coerce')
    elif values.apply(lambda x: bool(datetime_pattern.match(x))).all():
        df[i] = pd.to_datetime(df[i], errors='coerce')
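
The pattern checks above can also be wrapped in a small helper that reports which kind a column looks like before anything is converted. The sketch below is illustrative only: `infer_kind` is a hypothetical name, the toy values are made up, and treating a mix of integers and decimals as a float column is an assumption rather than the notebook's rule.

```python
import re
import pandas as pd

INT_PATTERN = re.compile(r"^-?\d+$")
FLOAT_PATTERN = re.compile(r"^-?\d*\.\d+$")

def infer_kind(series: pd.Series) -> str:
    """Return 'int', 'float', or 'other' based on every non-null value."""
    values = series.dropna().astype(str)
    if values.empty:
        return "other"
    if values.map(lambda v: bool(INT_PATTERN.match(v))).all():
        return "int"
    # Assumption: a mix of whole numbers and decimals counts as a float column.
    if values.map(lambda v: bool(INT_PATTERN.match(v) or FLOAT_PATTERN.match(v))).all():
        return "float"
    return "other"

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "tenure": ["1", "34", "2"],
    "charges": ["29.85", "56", "53.85"],
    "contract": ["Monthly", "One year", "Monthly"],
})
kinds = {col: infer_kind(toy[col]) for col in toy.columns}
print(kinds)
```

Inspecting `kinds` first makes it easier to spot a column that would be converted unexpectedly.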

In [None]:
df[["Payment Method", "Churn Reason"]] = df[["Payment Method", "Churn Reason"]].astype("string")

<h2><span style="color:coral"><u><center>Empty values replaced with 0</center></u></span></h2>

In [None]:
df = df.replace(' ', 0)
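
A column-level alternative for a numeric field that contains blank strings is `pd.to_numeric` with `errors="coerce"`, which exposes how many values were missing before they are filled. This is a sketch on made-up values, not the notebook's own step.

```python
import pandas as pd

# Hypothetical values: a blank string stands in for a missing total.
charges = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" turns anything unparseable (including " ") into NaN.
numeric = pd.to_numeric(charges, errors="coerce")
n_missing = int(numeric.isna().sum())

# Fill explicitly, so the choice of 0 is visible in the code.
filled = numeric.fillna(0.0)
print(n_missing, filled.tolist())
```

Counting the coerced values first shows how many rows the fill actually affects.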

<h2><span style="color:coral"><u><center>Feature Engineering</center></u></span></h2>

In [None]:
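# Missing churn reasons were filled with 0 above, so "0" marks a customer who stayed.
# Churn = 1 therefore means a reason was recorded, i.e. the customer left.
df["Churn"] = (df["Churn Reason"] != "0").astype(int)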

In [None]:
x = df.drop(columns=["CustomerID", "Churn"])
y = df["Churn"]

<h3><span style="color:lightgreen"><u><center>Dropping unnecessary features from x, the dataset of independent variables (features)</center></u></span></h3>

In [None]:
x = x.drop(columns=["Count","Country", "State", "City", "Zip Code", "Latitude", "Longitude", "Churn Label", "Churn Value", "Churn Score", "Churn Reason"])

In [None]:
x["Total Charges"] = x["Total Charges"].astype(float)

In [None]:
x.nunique()

<h3><span style="color:lightgreen"><u><center>Automated encoding and feature scaling using the scikit-learn library</center></u></span></h3>

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder

binary_cols = []
onehot_cols = []
numeric_cols = []
for col in x.columns:
    if x[col].nunique() == 2 and x[col].dtype == 'string':
        binary_cols.append(col)
    elif x[col].nunique() > 2 and x[col].dtype == 'string':
        onehot_cols.append(col)
    else:
        numeric_cols.append(col)

# label encoder for binary columns
le = LabelEncoder()
for col in binary_cols:
    x[col] = le.fit_transform(x[col])

# one hot encoder for non-binary string columns - onehot_cols

ohe = OneHotEncoder()
onehot = ohe.fit_transform(x[onehot_cols])
onehot_dense = onehot.toarray()

onehot_df = pd.DataFrame(onehot_dense, columns=ohe.get_feature_names_out(onehot_cols), index=x.index)

x = x.drop(columns=onehot_cols)

x = pd.concat([x, onehot_df], axis=1)

# Scaling the numeric values of the columns in x
scaler = StandardScaler()
x[numeric_cols] = scaler.fit_transform(x[numeric_cols])
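
The cell above fits the encoders and the scaler on the full dataset before the train/test split, so statistics from the test rows influence the transformation. One common way to avoid this is to put the preprocessing inside a scikit-learn `Pipeline` that is fitted on the training rows only. The sketch below uses a made-up toy frame; the column names and values are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical values, not the Telco dataset.
toy_x = pd.DataFrame({
    "Contract": ["Monthly", "One year", "Two year", "Monthly"] * 10,
    "Tenure Months": list(range(1, 41)),
})
toy_y = pd.Series([1, 0, 0, 1] * 10)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ("scale", StandardScaler(), ["Tenure Months"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Encoder categories and scaler statistics are learned from the training rows only.
clf.fit(tx_train, ty_train)
test_accuracy = clf.score(tx_test, ty_test)
print(test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the preprocessing is refitted inside every fold.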

<h2><span style="color:coral"><u><center>Test Train dataset split</center></u></span></h2>

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=42, stratify=y)

<h2><span style="color:coral"><u><center>Logistic regression classification</center></u></span></h2>

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [None]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_test, y_pred))
print("Accuracy = ", accuracy_score(y_test, y_pred))
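
The same report and accuracy printout is repeated after every model in the following sections. A small helper can collect the headline metrics into a dictionary so that models are easier to compare side by side. This is a sketch: `summarize` is a hypothetical name and the labels are made up.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, y_pred):
    """Return accuracy plus precision, recall and F1 for the positive class (label 1)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Hypothetical labels: 3 of 4 churners are found, with 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = summarize(y_true, y_hat)
print(scores)
```

With 3 true positives, 1 false negative, 1 false positive and 5 true negatives, accuracy is 0.8 while precision, recall and F1 are each 0.75, which shows why accuracy alone can look better than the churn-class metrics.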

<h2><span style="color:coral"><u><center>Using SVC for model training</center></u></span></h2>

In [None]:
from sklearn.svm import SVC
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(x_train, y_train)
y_pred_svm = svm_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_svm))
print("Accuracy = ", accuracy_score(y_test, y_pred_svm))

<h2><span style="color:coral"><u><center>Using Decision Tree for model training</center></u></span></h2>

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(x_train, y_train)
y_pred_dt = dt_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_dt))
print("Accuracy = ", accuracy_score(y_test, y_pred_dt))

<h2><span style="color:coral"><u><center>Using Random Forest for model training</center></u></span></h2>

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)
y_pred_rf = rf_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_rf))
print("Accuracy = ", accuracy_score(y_test, y_pred_rf))

<h2><span style="color:coral"><u><center>Using XGBoost for model training</center></u></span></h2>

In [None]:
import xgboost as xgb
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(x_train, y_train)
y_pred_xgb = xgb_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_xgb))
print("Accuracy = ", accuracy_score(y_test, y_pred_xgb))

<h2><span style="color:coral"><u><center>Hyper parameter tuning for better accuracy</center></u></span></h2>

<h3><span style="color:lightgreen"><u><center>Random Forest classifier hyper parameter tuning</center></u></span></h3>

In [None]:
from sklearn.model_selection import GridSearchCV

grid = {
    'n_estimators': [50,100,200,250],
    'max_depth': [None, 10, 20, 25],
    'min_samples_split': [2,5,10,15]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid = grid,
    cv = 5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(x_train, y_train)

print('Best params: ', grid_search.best_params_)
print('Best score: ', grid_search.best_score_)

best_model = grid_search.best_estimator_
y_pred_grid = best_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_grid))
print("Accuracy = ", accuracy_score(y_test, y_pred_grid))

<h3><span style="color:lightgreen"><u><center>XGB classifier hyper parameter tuning</center></u></span></h3>

In [None]:
xgb_grid = {
    'n_estimators': [50, 100, 200, 250, 300],
    'max_depth': [3, 6, 10, 15],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

xgb_grid_search = GridSearchCV(
    estimator = xgb.XGBClassifier(eval_metric='logloss', random_state=42),
    param_grid = xgb_grid,
    cv = 5,
    scoring='accuracy',
    n_jobs=-1
)

xgb_grid_search.fit(x_train, y_train)

print('Best params: ', xgb_grid_search.best_params_)
print('Best score: ', xgb_grid_search.best_score_)

best_model = xgb_grid_search.best_estimator_
y_pred_xgb_grid = best_model.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_xgb_grid))
print("Accuracy = ", accuracy_score(y_test, y_pred_xgb_grid))
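
The XGBoost grid above contains 5 × 4 × 3 × 3 × 3 = 540 combinations, which means 2,700 model fits with 5-fold cross-validation. When that is too slow, `RandomizedSearchCV` evaluates only a fixed number of sampled combinations. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in, so that it runs without XGBoost; the parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would pass x_train and y_train instead.
sx, sy = make_classification(n_samples=200, n_features=8, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.7, 0.8, 1.0],
}
n_grid = 3 * 3 * 3 * 3  # 81 combinations in the full grid

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # evaluate only 5 sampled combinations
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(sx, sy)
n_tried = len(search.cv_results_["params"])
print(n_grid, n_tried, search.best_params_)
```

The trade-off is that the best combination in the full grid may not be sampled; raising `n_iter` narrows that gap at the cost of more fits.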

<h2><span style="color:coral"><u><center>SMOTE and Hyper parameter tuning for better accuracy</center></u></span></h2>

<h3><span style="color:lightgreen"><u><center>Random Forest classifier hyper parameter tuning with SMOTE</center></u></span></h3>

In [None]:
from imblearn.over_sampling import SMOTE
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)
rfc = RandomForestClassifier(random_state=42)
param_grid_rfc = {
    'n_estimators': [50,100,200,250],
    'max_depth': [None, 10, 20, 25],
    'min_samples_split': [2,5,10,15]
}

grid_search_rfc = GridSearchCV(estimator=rfc, param_grid = param_grid_rfc, cv=5, scoring='accuracy', n_jobs = -1)
grid_search_rfc.fit(x_train_smote, y_train_smote)

print('Best params: ', grid_search_rfc.best_params_)
print('Best score: ', grid_search_rfc.best_score_)

best_model_rfc = grid_search_rfc.best_estimator_
y_pred_rfc_grid = best_model_rfc.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_rfc_grid))
print("Accuracy = ", accuracy_score(y_test, y_pred_rfc_grid))

<h3><span style="color:lightgreen"><u><center>XGB classifier hyper parameter tuning with SMOTE</center></u></span></h3>

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=42)

smote = SMOTE(random_state=42)
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)
xgbs = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
param_grid_xgbs = {
    'n_estimators': [50, 100, 200, 250, 300],
    'max_depth': [3, 6, 10, 15],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

grid_search_xgbs = GridSearchCV(estimator=xgbs, param_grid = param_grid_xgbs, cv=5, scoring='accuracy', n_jobs = -1)
grid_search_xgbs.fit(x_train_smote, y_train_smote)

print('Best params: ', grid_search_xgbs.best_params_)
print('Best score: ', grid_search_xgbs.best_score_)

best_model_xgbs = grid_search_xgbs.best_estimator_
y_pred_xgbs_grid = best_model_xgbs.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_xgbs_grid))
print("Accuracy = ", accuracy_score(y_test, y_pred_xgbs_grid))

<h2><span style="color:coral"><u><center>Adding class weight to RFC</center></u></span></h2>

In [None]:
smote = SMOTE(random_state=42)
rfc_balanced = RandomForestClassifier(class_weight='balanced',random_state=42)
param_grid_rfc_balanced = {
    'n_estimators': [50,100,200,250],
    'max_depth': [None, 10, 20, 25],
    'min_samples_split': [2,5,10,15]
}

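# Pass the class-weighted estimator defined above (the unweighted `rfc` was used by mistake)
grid_search_rfc_balanced = GridSearchCV(estimator=rfc_balanced, param_grid = param_grid_rfc_balanced, cv=5, scoring='accuracy', n_jobs = -1)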
grid_search_rfc_balanced.fit(x_train_smote, y_train_smote)

print('Best params: ', grid_search_rfc_balanced.best_params_)
print('Best score: ', grid_search_rfc_balanced.best_score_)

best_model_rfc_balanced = grid_search_rfc_balanced.best_estimator_
y_pred_rfc_grid_balanced = best_model_rfc_balanced.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_rfc_grid_balanced))
print("Accuracy = ", accuracy_score(y_test, y_pred_rfc_grid_balanced))
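
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula by hand on made-up labels and compares it with scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 non-churners (0) and 2 churners (1).
labels = np.array([0] * 8 + [1] * 2)

# 'balanced' uses n_samples / (n_classes * count_per_class).
manual = len(labels) / (2 * np.bincount(labels))
from_sklearn = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
print(manual, from_sklearn)
```

Here the minority class receives a weight of 2.5 against 0.625 for the majority. On a training set that SMOTE has already balanced, both counts are equal and the formula gives 1.0 for each class, so the class weights should have little or no additional effect there.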

<h2><span style="color:coral"><u><center>Using GridSearchCV with multiple metrics: accuracy, F1-score, recall and precision</center></u></span></h2>

In [None]:
from sklearn.metrics import make_scorer, accuracy_score, f1_score, recall_score, precision_score

smote = SMOTE(random_state=42)
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)

rfc_cw = RandomForestClassifier(class_weight='balanced', random_state=42)
param_grid_rfc_cw = {
    'n_estimators': [50,100,200,250],
    'max_depth': [None, 10, 20, 25],
    'min_samples_split': [2,5,10,15]
}
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score),
    'recall': make_scorer(recall_score),
    'precision': make_scorer(precision_score)
}

grid_search_rfc_cw = GridSearchCV(
    estimator=rfc_cw,
    param_grid=param_grid_rfc_cw,
    scoring=scoring,
    refit='f1',  # best_params_ and best_score_ below refer to the F1 score
    cv=5,
    n_jobs=-1
)

grid_search_rfc_cw.fit(x_train_smote, y_train_smote)
print('Best params:', grid_search_rfc_cw.best_params_)
print('Best F1 score:', grid_search_rfc_cw.best_score_)

Y_pred_rfc_cw = grid_search_rfc_cw.predict(x_test)

In [None]:
print(classification_report(y_test, Y_pred_rfc_cw))
print("Accuracy = ", accuracy_score(y_test, Y_pred_rfc_cw))
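
With multi-metric scoring, `refit` names the metric that `best_params_` and `best_score_` refer to, while the remaining metrics stay available in `cv_results_` as `mean_test_<name>` columns. The sketch below shows how to read them, using synthetic data and a tiny grid as illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would pass its own training set.
sx, sy = make_classification(n_samples=200, n_features=8, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={"max_depth": [3, None]},
    scoring={"accuracy": "accuracy", "f1": "f1", "recall": "recall"},
    refit="f1",  # the final model is chosen and refitted by F1
    cv=3,
)
search.fit(sx, sy)

# One row per candidate, one mean_test_<name> column per metric.
results = pd.DataFrame(search.cv_results_)
table = results[["param_max_depth", "mean_test_accuracy", "mean_test_f1", "mean_test_recall"]]
print(table)
```

Comparing the columns shows whether the candidate that wins on F1 also wins on accuracy or recall.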

In [None]:
# rf with hyper parameter tuning, class weighting and SMOTE; accuracy = 0.8639175257731958, Best params:  {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 250}
# rf with hyper parameter tuning and SMOTE; accuracy: 0.8639175257731958, Best params:  {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 250}
# rf with just hyper parameter tuning, accuracy:  0.8097290626807523, Best params:  {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
# rf without anything above - accuracy: 0.7892122072391767
# rf with SMOTE, hyper parameter tuning, class weight and refitting on f1-score; Best F1 score: 0.8634641371184564, Best params: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 250}

<h2><span style="color:coral"><u><center>Using LightGBM for classification on customer churn</center></u></span></h2>

In [None]:
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
classes = np.unique(y_train_smote)
class_counts = np.bincount(y_train_smote)
class_weights = {
    cls: sum(class_counts) / c for cls, c in zip(classes, class_counts)
}

lgbm = lgb.LGBMClassifier(class_weight = class_weights, random_state = 42)
param_grid_lgbm = {
    'num_leaves':[31,50],
    'max_depth':[-1,10,20,30],
    'learning_rate':[0.01, 0.1, 0.2],
    'n_estimators':[100,200]
}
grid_lgbm = GridSearchCV(
    estimator=lgbm,
    param_grid=param_grid_lgbm,
    scoring=make_scorer(f1_score),
    cv=5,
    n_jobs=-1
)
grid_lgbm.fit(x_train_smote, y_train_smote)
print('Best params LightGBM:', grid_lgbm.best_params_)
print('Best F1 LightGBM:', grid_lgbm.best_score_)
y_pred_lgmb = grid_lgbm.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred_lgmb))
print("Accuracy = ", accuracy_score(y_test, y_pred_lgmb))

<h2><span style="color:coral"><u><center>Using CatBoost for classification on customer churn</center></u></span></h2>

In [None]:
from catboost import CatBoostClassifier
classes = np.unique(y_train_smote)
class_counts = np.bincount(y_train_smote)
class_weights = [sum(class_counts) / c for c in class_counts]

catboost = CatBoostClassifier(class_weights = class_weights, verbose=0, random_state=42)
param_grid_cat = {
    'depth':[6,10,16],  # CatBoost supports a tree depth of at most 16 on CPU
    'learning_rate':[0.01, 0.1, 0.2],
    'iterations': [100,200]
}
grid_cat = GridSearchCV(
    estimator=catboost,
    param_grid = param_grid_cat,
    scoring=make_scorer(f1_score),
    cv=5,
    n_jobs=-1
)
grid_cat.fit(x_train_smote, y_train_smote)
print('Best params CatBoost:', grid_cat.best_params_)
print('Best F1 CatBoost:', grid_cat.best_score_)
y_pred_cat = grid_cat.predict(x_test)

In [None]:
# Report the CatBoost predictions (the LightGBM predictions were printed here by mistake)
print(classification_report(y_test, y_pred_cat))
print("Accuracy = ", accuracy_score(y_test, y_pred_cat))

<h2><span style="color:coral"><u><center>SHAP explanation for the best model so far for customer churn- LightGBM</center></u></span></h2>

<h3><span style="color:lightgreen"><u><center>LightGBM is selected based on the best params and best score of 86%</center></u></span></h3>

In [None]:
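import shap
explainer = shap.TreeExplainer(grid_lgbm.best_estimator_)
shap_values = explainer.shap_values(x_train_smote)
# Some SHAP releases return one array per class for binary LightGBM models; keep the churn class
if isinstance(shap_values, list):
    shap_values = shap_values[1]
# expected_value can be a scalar or one value per class; take the last (churn) entry
expected_value = np.ravel(explainer.expected_value)[-1]
shap.summary_plot(shap_values, x_train_smote)
shap.initjs()
shap.force_plot(expected_value, shap_values[0,:], x_train_smote.iloc[0,:])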
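
The summary plot orders features by their mean absolute SHAP value. The same ranking can be produced as a table, which is easier to quote in a report. The sketch below uses a made-up SHAP matrix and feature names purely for illustration; in the notebook the inputs would be the SHAP array and the columns of `x_train_smote`.

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP matrix: 3 samples x 3 features (made-up numbers).
feature_names = ["Tenure Months", "Monthly Charges", "Contract_Two year"]
shap_matrix = np.array([
    [-0.40, 0.10, -0.20],
    [0.30, 0.05, -0.10],
    [-0.50, -0.15, 0.30],
])

# Global importance: average magnitude of each feature's contribution.
importance = (
    pd.Series(np.abs(shap_matrix).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)
print(importance)
```

The sign of the individual values is discarded here, so the table says how strongly a feature moves predictions, not in which direction.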