# Telecom Customer Churn Prediction AI Model
Author: Ömer Saatcioglu

## Introduction
This notebook demonstrates a machine learning model designed to predict which customers are at high risk of churn. We use a synthetic dataset covering telco user activity, billing information, and support interactions. The objective is to identify customers who are likely to cancel or switch away from their subscription, because retaining an existing customer typically costs less than acquiring a new one, which in turn supports overall profitability.

## Data Loading and Exploration
We begin by loading and exploring the synthetic dataset to understand its structure and feature distribution. In practice, we would typically have three separate datasets: CDR (Call Detail Records) for user activities, customer billing information, and support interactions. These datasets would then be integrated into a unified repository for analysis and modeling. For the purpose of this PoC, we’ve consolidated the synthetic data into a single dataset.
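
As a rough illustration of the integration step described above, the sketch below joins three tiny tables with pandas. All table contents and the column names `Month`, `CallMinutes`, `BillAmount`, and `SupportTickets` are hypothetical and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical monthly extracts; every column name here is illustrative only.
cdr = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Month": ["2024-01", "2024-02", "2024-01"],
    "CallMinutes": [120.0, 95.5, 300.0],
})
billing = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Month": ["2024-01", "2024-02", "2024-01"],
    "BillAmount": [29.9, 29.9, 49.9],
})
support = pd.DataFrame({
    "CustomerID": [1],
    "Month": ["2024-02"],
    "SupportTickets": [2],
})

keys = ["CustomerID", "Month"]
# Left joins keep every usage record even when no support ticket was opened.
unified = (
    cdr.merge(billing, on=keys, how="left")
       .merge(support, on=keys, how="left")
)
# Customers without support interactions get an explicit zero instead of NaN.
unified["SupportTickets"] = unified["SupportTickets"].fillna(0).astype(int)
print(unified)
```

A left join starting from the usage table is one reasonable choice here, since most customers never contact support in a given month and should not be dropped for that reason.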

In [None]:
## Install dependencies
%pip install -r requirements.txt

This script generates a synthetic telecommunications dataset spanning 36 months for 27,778 unique customers, which gives approximately 1 million records (27,778 × 36 = 1,000,008) before filtering. It calculates a churn risk score from usage, billing, and support features, then flags a customer as churned when the score exceeds a defined threshold. The script adds a 25% noise factor to the churn risk score to better reflect real-world prediction uncertainty, and it drops all records that follow each customer’s first churn event. Finally, the processed dataset is saved to a CSV file.
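
The generator script itself is not reproduced here. The following is only a minimal sketch of the general idea, under several assumptions: the base risk is drawn at random instead of being derived from features, the noise is modeled as a multiplicative factor of up to ±25%, and the threshold of 0.8 is arbitrary. The column names `Month`, `BaseRisk`, and `RiskScore` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_customers, n_months = 5, 6  # tiny stand-in for 27,778 customers x 36 months

records = pd.DataFrame({
    "CustomerID": np.repeat(np.arange(n_customers), n_months),
    "Month": np.tile(np.arange(n_months), n_customers),
})
# Hypothetical base risk in [0, 1]; the real script derives it from
# usage, billing, and support features.
records["BaseRisk"] = rng.uniform(0, 1, len(records))
# Assumed noise model: scale each score by a random factor in [0.75, 1.25].
noise = rng.uniform(-0.25, 0.25, len(records))
records["RiskScore"] = (records["BaseRisk"] * (1 + noise)).clip(0, 1)
# Hypothetical threshold for flagging churn.
records["HasChurned"] = records["RiskScore"] > 0.8

# Count churn events seen so far per customer, excluding the current row.
churned_int = records["HasChurned"].astype(int)
churn_so_far = churned_int.groupby(records["CustomerID"]).cumsum()
prior_churns = churn_so_far - churned_int
# Keep each customer's history up to and including the first churn month.
truncated = records[prior_churns == 0]
print(len(records), "->", len(truncated), "records after truncation")
```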

In [None]:
## Run the customer churn data generator 
!python3 01-telco-customer-churn-data-generator.py

In [None]:
import pandas as pd

# Load the synthetic telecom data
data_path = "data/synthetic_customer_data_evenly_distributed.csv"
data = pd.read_csv(data_path)

# Display basic information about the dataset
data.info()

# Display the first few rows of the dataset
data.head()

## Data Preprocessing
Before training the model, we preprocess the data: we check for missing values, map the categorical variables to numeric values, drop the identifier and date columns, and finally split the dataset into training and testing sets.
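
One detail worth checking when mapping categorical values: by default `pandas.read_csv` parses empty fields as NaN, so a mapping keyed on the empty string will not match those rows. The sketch below shows the effect on a hypothetical two-row extract, assuming the CSV stores "no interaction" as an empty field.

```python
import io
import pandas as pd

# Hypothetical two-row extract: the second customer had no support interaction.
csv_text = "CustomerID,PrimaryIssueType\n1,Billing\n2,\n"
df = pd.read_csv(io.StringIO(csv_text))

issue_mapping = {"": 0, "Billing": 0.7, "Technical": 0.8, "Service Quality": 0.6}

# The empty field was parsed as NaN, so the "" key never matches.
naive = df["PrimaryIssueType"].map(issue_mapping)
# Restoring the empty string first lets the mapping apply to every row.
fixed = df["PrimaryIssueType"].fillna("").map(issue_mapping)

print(naive.tolist())
print(fixed.tolist())
```

The first list contains a NaN for the second customer, while the second list maps that customer to 0.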

In [None]:
from sklearn.model_selection import train_test_split

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# Mapping for PaymentStatus
payment_mapping = {"Paid": 0, "Partial": 0.5, "Unpaid": 1}

# Mapping for PrimaryIssueType: empty string indicates no support interaction.
issue_mapping = {
    "": 0, 
    "Billing": 0.7, 
    "Technical": 0.8, 
    "Service Quality": 0.6
}

# Mapping for SupportChannel: adjust the numeric values as needed.
support_channel_mapping = {
    "": 0,       # No support channel if there's no interaction.
    "Phone": 1,  # Example value: Phone might be considered more direct.
    "Chat": 0.5, # Example value: Chat might be intermediate.
    "Email": 0.2 # Example value: Email might be considered less immediate.
}

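# Convert to numeric values.
# pandas.read_csv parses empty fields as NaN by default, so restore the empty
# string before mapping; otherwise rows without a support interaction stay NaN.
data["PaymentStatus"] = data["PaymentStatus"].map(payment_mapping)
data["PrimaryIssueType"] = data["PrimaryIssueType"].fillna("").map(issue_mapping)
data["SupportChannel"] = data["SupportChannel"].fillna("").map(support_channel_mapping)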

# The date columns
date_columns = ['BillingCycleStart', 'BillingCycleEnd', 'PaymentDueDate', 'LastInteractionDate']

# Split the data into features and target variable, dropping the identifier and date columns
X = data.drop(columns=['CustomerID', 'HasChurned'] + date_columns)
y = data['HasChurned']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Model Training and Evaluation
We will train two models, a Random Forest classifier (the `BalancedRandomForestClassifier` from imbalanced-learn) and a LightGBM classifier, to predict customer churn. After training, we use each model to generate predictions on the test data and evaluate its performance using several metrics:

- Lift by Quantile: Measures the improvement in targeting effectiveness over random selection for each bin of predicted probabilities.
- Gini Coefficient: Assesses the model’s discriminatory power by rescaling the ROC-AUC score as Gini = 2 × AUC − 1, so that 0 corresponds to random ranking and 1 to perfect separation of churners from non-churners.
- Confusion Matrix: Provides a breakdown of true versus predicted classes, helping to visualize correct and incorrect classifications.
- Classification Report: Summarizes key metrics such as precision, recall, and F1-score for each class.
- Accuracy Score: Indicates the overall proportion of correctly classified instances.

This comprehensive evaluation framework will help us compare the performance of both classifiers and determine which one is better suited for detecting customer churn.
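
To make the lift and Gini definitions above concrete, here is a small worked example on ten hypothetical customers, written in plain Python. The labels and scores are invented for illustration and are not results from the models in this notebook.

```python
# Ten hypothetical test customers: 1 = churned, scores are predicted churn probabilities.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

overall_rate = sum(y_true) / len(y_true)  # 3 churners out of 10 = 0.3

# Lift of the top 20%: churn rate among the two highest-scored customers
# divided by the overall churn rate.
ranked = [label for _, label in sorted(zip(y_score, y_true), reverse=True)]
top = ranked[:2]
top_lift = (sum(top) / len(top)) / overall_rate

# AUC as the share of (churner, non-churner) pairs in which the churner
# receives the higher score; ties count as half.
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
gini = 2 * auc - 1

print(f"Top-20% lift: {top_lift:.2f}")  # 1.0 / 0.3 = 3.33
print(f"AUC: {auc:.3f}, Gini: {gini:.3f}")  # 20/21 = 0.952, 19/21 = 0.905
```

A lift of 3.33 in the top bin means that contacting only the highest-scored 20% of customers would reach churners at more than three times the rate of random selection.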

### Random Forest classifier

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
import matplotlib.pyplot as plt

# Initialize and train the BalancedRandomForestClassifier with fine-tuned hyperparameters
model_BRFC = BalancedRandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features='sqrt',
    max_depth=100,
    bootstrap=True,
    sampling_strategy={False: 5000, True: 1000}, 
    replacement=True  
)
model_BRFC.fit(X_train, y_train)

def compute_lift(y_true, y_scores, n_bins=10):
    """
    Compute lift for each quantile (bin) of predicted probabilities.
    
    Parameters:
        y_true (array-like): True binary labels.
        y_scores (array-like): Predicted probabilities for the positive class.
        n_bins (int): Number of bins (quantiles) to divide the data into.
        
    Returns:
        lift_df (pd.DataFrame): DataFrame containing the count, positive rate, and lift for each bin.
    """
    # Create DataFrame with true labels and predicted scores
    df = pd.DataFrame({'y_true': y_true, 'y_scores': y_scores})
    # Sort descending by predicted probability
    df = df.sort_values('y_scores', ascending=False).reset_index(drop=True)
    # Create bins based on the index quantiles
    df['bin'] = pd.qcut(df.index, q=n_bins, labels=False)
    
    # Overall positive rate in the dataset
    overall_rate = df['y_true'].mean()
    
    # Aggregate data by bin: count and average positive rate per bin
    lift_df = df.groupby('bin').agg(
        count=('y_true', 'count'),
        positive_rate=('y_true', 'mean')
    ).reset_index()
    # Compute lift as the ratio of bin positive rate to overall positive rate
    lift_df['lift'] = lift_df['positive_rate'] / overall_rate
    
    return lift_df

def compute_gini(y_true, y_scores):
    """
    Compute the Gini coefficient using the ROC-AUC score.
    
    Parameters:
        y_true (array-like): True binary labels.
        y_scores (array-like): Predicted probabilities for the positive class.
    
    Returns:
        gini (float): The Gini coefficient.
    """
    auc = roc_auc_score(y_true, y_scores)
    return 2 * auc - 1

y_pred_proba = model_BRFC.predict_proba(X_test)[:, 1]

# Compute lift by deciles
lift_df = compute_lift(y_test, y_pred_proba, n_bins=10)

# Plot the lift by decile using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(lift_df['bin'], lift_df['lift'], color='skyblue')
plt.xlabel("Decile Bin")
plt.ylabel("Lift")
plt.title("Lift by Decile (BalancedRandomForestClassifier)")
plt.xticks(lift_df['bin'])
plt.ylim(0, max(lift_df['lift'])*1.1)  # Extend y-axis slightly for better visual appearance
plt.show()

# Compute Gini coefficient
gini_score = compute_gini(y_test, y_pred_proba)
print(f"\nGini Score: {gini_score:.4f}")

# Make predictions on the test set with BalancedRandomForestClassifier
y_pred = model_BRFC.predict(X_test)
# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
acc_score = accuracy_score(y_test, y_pred)
print("---------------")
print("BalancedRandomForestClassifier Results:")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
print("\nAccuracy Score:")
print(acc_score)
print("---------------")

### LightGBM classifier 

In [None]:
import lightgbm as lgb

# Initialize the LightGBM classifier with sample hyperparameters
model_LGBM = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=7,
    num_leaves=31,
    random_state=42
)

# Train the model
model_LGBM.fit(X_train, y_train)

y_pred_proba = model_LGBM.predict_proba(X_test)[:, 1]

# Compute lift by deciles
lift_df = compute_lift(y_test, y_pred_proba, n_bins=10)

# Plot the lift by decile using a bar chart
plt.figure(figsize=(10, 6))
plt.bar(lift_df['bin'], lift_df['lift'], color='skyblue')
plt.xlabel("Decile Bin")
plt.ylabel("Lift")
plt.title("Lift by Decile (LightGBM)")
plt.xticks(lift_df['bin'])
plt.ylim(0, max(lift_df['lift'])*1.1)  # Extend y-axis slightly for better visual appearance
plt.show()

# Compute Gini coefficient
gini_score = compute_gini(y_test, y_pred_proba)
print(f"\nGini Score: {gini_score:.4f}")

# Make predictions on the test set
y_pred = model_LGBM.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print("---------------")
print("LightGBM Results:")
print("Confusion Matrix:\n", cm) 
print("Classification Report:\n", cr)
print("Accuracy:", acc)
print("---------------")



## Conclusion
In this notebook, we developed a machine learning model to detect churning customers using a synthetic dataset. We employed both a Random Forest classifier and a LightGBM classifier to evaluate our approach.

Our Random Forest classifier demonstrated strong performance in identifying potential churners. A key aspect of its configuration was the explicit `sampling_strategy`, which sets how many majority (non-churn) and minority (churn) samples are drawn for each tree. This tailored balancing slightly reduced recall but significantly increased precision. In practical terms, when the model predicts churn it is much more likely to be correct, which gives a better trade-off between capturing true churners and minimizing false alarms.
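
The effect of the per-tree class balance can be illustrated with a simplified sketch. This is not imbalanced-learn's internal implementation; it only mimics drawing a fixed number of rows per class with replacement, using the same targets as the notebook. The label distribution (95,000 non-churners, 5,000 churners) and the helper `draw_tree_sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced training labels: 95,000 non-churners, 5,000 churners.
y = np.array([False] * 95_000 + [True] * 5_000)
sampling_strategy = {False: 5000, True: 1000}  # same targets as in the notebook

def draw_tree_sample(y, strategy, rng):
    """Return row indices for one tree: a fixed number per class, drawn with replacement."""
    parts = [
        rng.choice(np.flatnonzero(y == label), size=n, replace=True)
        for label, n in strategy.items()
    ]
    return np.concatenate(parts)

idx = draw_tree_sample(y, sampling_strategy, rng)
print(f"Churn share overall:  {y.mean():.3f}")       # 0.050
print(f"Churn share per tree: {y[idx].mean():.3f}")  # 1000 / 6000 = 0.167
```

Under these assumptions each tree sees churners at roughly one in six rather than one in twenty. Keeping the non-churn class larger than the churn class, instead of forcing a 1:1 ratio, tends to make the forest more conservative about predicting churn, which is consistent with the precision gain described above.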

In addition, we explored the LightGBM classifier as an alternative. LightGBM builds trees sequentially using a gradient boosting framework, where each new tree is fitted to the gradient of the loss function (such as log-loss for classification) with respect to the current predictions, so that it corrects the errors made by the previous trees. Its histogram-based split finding and leaf-wise tree growth often result in fast training times and robust performance, especially on large datasets.
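
The boosting idea can be sketched in a few lines of NumPy. This is a deliberately simplified illustration with one-split trees (stumps) on a single hypothetical feature; it omits what makes LightGBM fast and accurate in practice, such as histogram binning, leaf-wise growth, and second-order gradient information. The data, learning rate, and number of trees are arbitrary choices.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1-D feature that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)  # hypothetical feature, e.g. number of support calls
y = (x + rng.normal(0, 2, 200) > 6).astype(float)  # noisy churn labels

learning_rate, n_trees = 0.3, 20
raw = np.zeros_like(x)  # raw score (log-odds), starting at 0
losses = [log_loss(y, sigmoid(raw))]
for _ in range(n_trees):
    # Negative gradient of log-loss with respect to the raw score.
    residual = y - sigmoid(raw)
    t, left_val, right_val = fit_stump(x, residual)
    # Each new stump nudges the raw score toward the remaining errors.
    raw += learning_rate * np.where(x <= t, left_val, right_val)
    losses.append(log_loss(y, sigmoid(raw)))

print(f"Log-loss before boosting: {losses[0]:.3f}")
print(f"Log-loss after {n_trees} stumps: {losses[-1]:.3f}")
```

Each iteration fits a stump to the current residuals and adds a damped version of it to the running score, so the training log-loss falls as trees are added.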

Both models reached the accuracy levels designed into our synthetic data, which supports the viability of our approach. Looking ahead, applying these models to real-world data will likely require further work on hyperparameter tuning and feature engineering. Moreover, explainability techniques such as SHAP could provide deeper insight into how the models arrive at their predictions.

Overall, our findings underscore the potential of both Random Forest and LightGBM classifiers for churn prediction and set the foundation for further refinements and real-world validations.

## Saving the Models
Finally, we will save the trained models to files for future use.

In [None]:
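import os
import pickle

# Paths for the saved models
model_path_BRFC = "models/customer_churn_prediction_model_brfc.pkl"
model_path_LGBM = "models/customer_churn_prediction_model_lgbm.pkl"

# Create the output directory if it does not exist yet
os.makedirs("models", exist_ok=True)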

with open(model_path_BRFC, 'wb') as model_file:
    pickle.dump((model_BRFC, X_train.columns.tolist()), model_file)
print(f"Customer Churn Prediction Random Forest classifier Model Saved to {model_path_BRFC}")

with open(model_path_LGBM, 'wb') as model_file:
    pickle.dump((model_LGBM, X_train.columns.tolist()), model_file)
print(f"Customer Churn Prediction LightGBM classifier Model Saved to {model_path_LGBM}")
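
Because each file stores a `(model, feature_names)` tuple, loading it later could look like the sketch below. A small scikit-learn `LogisticRegression` stands in for the trained churn models, and the feature names `CallMinutes` and `SupportCalls` are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for a trained churn model; the feature names are hypothetical.
train = pd.DataFrame({"CallMinutes": [10, 200, 30, 250], "SupportCalls": [5, 0, 4, 1]})
model = LogisticRegression().fit(train, [True, False, True, False])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Save the model together with its training column order, as in the notebook.
    with open(path, "wb") as f:
        pickle.dump((model, train.columns.tolist()), f)

    # Only unpickle files you created yourself: pickle can execute arbitrary code on load.
    with open(path, "rb") as f:
        loaded_model, feature_names = pickle.load(f)

# New data may arrive with columns in a different order; restore the training order.
new_customers = pd.DataFrame({"SupportCalls": [6], "CallMinutes": [15]})
churn_proba = loaded_model.predict_proba(new_customers[feature_names])[:, 1]
print(f"Predicted churn probability: {churn_proba[0]:.3f}")
```

Storing the column list next to the model makes it possible to reorder incoming data before scoring, which helps avoid silently feeding features in the wrong order.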