# Telecom Customer Churn Prediction AI Model
Author: Ömer Saatcioglu

## Introduction
This notebook demonstrates a machine learning model that predicts which customers are at high risk of churn. It uses a synthetic dataset of telco user activity, billing information, and support interactions. The objective is to identify customers who are likely to cancel their subscription so that they can be targeted for retention: retaining an existing customer typically costs less than acquiring a new one, which improves overall profitability.

## Data Loading and Exploration
We begin by loading and exploring the synthetic dataset to understand its structure and feature distributions. In practice, there would typically be three separate datasets: CDRs (Call Detail Records) for user activity, customer billing information, and support interactions. These would be integrated into a unified repository for analysis and modeling. For this proof of concept (PoC), the synthetic data is consolidated into a single dataset.

In [None]:
## Install dependencies
%pip install -r requirements.txt

The following script generates a synthetic telco dataset spanning 36 months for 27,778 unique customers (around 1 million records). It computes a churn risk score from usage, billing, and support features, flags customers as churned when the score exceeds a threshold, drops each customer's records after their first churn event, and saves the final dataset to a CSV file.
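The "drop records after the first churn event" step can be sketched as follows. This is only an illustration of the idea, not the generator's actual code: the `Month` and `RiskScore` columns and the `CHURN_THRESHOLD` value are hypothetical, and only `CustomerID` and `HasChurned` are column names taken from this notebook.

```python
import pandas as pd

CHURN_THRESHOLD = 0.8  # hypothetical threshold, chosen only for this example


def keep_until_first_churn(df: pd.DataFrame) -> pd.DataFrame:
    """Keep each customer's rows up to and including their first churn month."""
    df = df.sort_values(["CustomerID", "Month"])
    churn_int = df["HasChurned"].astype(int)
    # Churn events seen strictly before the current row, counted per customer.
    prior_churns = churn_int.groupby(df["CustomerID"]).cumsum() - churn_int
    return df[prior_churns == 0].reset_index(drop=True)


# Toy monthly records for two customers (made-up values).
records = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 2, 2],
    "Month": [1, 2, 3, 4, 1, 2, 3],
    "RiskScore": [0.2, 0.9, 0.3, 0.95, 0.1, 0.2, 0.3],
})
records["HasChurned"] = records["RiskScore"] > CHURN_THRESHOLD

trimmed = keep_until_first_churn(records)
print(trimmed)
```

Customer 1 crosses the threshold in month 2, so months 3 and 4 are dropped; customer 2 never churns and keeps all rows.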

In [None]:
## Run the customer churn data generator 
!python3 01-telco-customer-churn-data-generator.py

In [None]:
import pandas as pd

# Load the synthetic telecom data
data_path = "data/synthetic_customer_data_evenly_distributed.csv"
data = pd.read_csv(data_path)

# Display basic information about the dataset
data.info()

# Display the first few rows of the dataset
data.head()
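Because churn datasets are usually imbalanced, it can help to check the churn rate both per record and per customer during exploration. The sketch below uses a small made-up frame in place of the loaded `data`; only the `CustomerID` and `HasChurned` column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the loaded dataset (made-up values).
sample = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "HasChurned": [False, True, False, False, False, False],
})

# Share of monthly records flagged as churned.
row_rate = sample["HasChurned"].mean()

# Share of customers with at least one churn record.
customer_rate = sample.groupby("CustomerID")["HasChurned"].any().mean()

print(f"Churn rate per record:   {row_rate:.3f}")
print(f"Churn rate per customer: {customer_rate:.3f}")
```

With one row per customer per month, the per-record rate is much lower than the per-customer rate, which is one reason the classes look so imbalanced to the model.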

## Data Preprocessing
Before training the model, we preprocess the data: we check for missing values, convert dates to Unix timestamps, map categorical variables to numeric values, and finally split the dataset into training and testing sets.
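One detail worth checking in the categorical mapping step: by default, `pd.read_csv` reads empty fields as `NaN`, not as empty strings, so a mapping keyed on `""` never matches them. The sketch below illustrates this with a tiny in-memory CSV (made-up rows) and the support channel mapping used in this notebook.

```python
import io

import pandas as pd

# Tiny in-memory CSV; customer 2 has no support interaction (empty field).
csv_text = "CustomerID,SupportChannel\n1,Phone\n2,\n3,Email\n"
frame = pd.read_csv(io.StringIO(csv_text))

support_channel_mapping = {"": 0, "Phone": 1, "Chat": 0.5, "Email": 0.2}

# Naive mapping: the empty field was read as NaN, so the "" key is never hit.
naive = frame["SupportChannel"].map(support_channel_mapping)

# Restoring the empty string first lets the "" entry apply.
safe = frame["SupportChannel"].fillna("").map(support_channel_mapping)

print("NaN values after naive mapping:", int(naive.isna().sum()))
print("Values after fillna + mapping: ", safe.tolist())
```

An alternative is to pass `keep_default_na=False` to `pd.read_csv`, although that also disables `NaN` detection for the numeric columns, so `fillna("")` on the affected columns is the narrower change.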

In [None]:
from sklearn.model_selection import train_test_split

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:")
print(missing_values)

# The date columns that need to be converted
date_columns = ['BillingCycleStart', 'BillingCycleEnd', 'PaymentDueDate', 'LastInteractionDate']

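# Convert each date column to datetime and then to a Unix timestamp (seconds) as a float.
# The int64 view is assumed to be in nanoseconds (datetime64[ns]), hence the division by 1e9.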
for col in date_columns:
    data[col] = pd.to_datetime(data[col])
    data[col] = data[col].astype('int64') / 1e9

# Mapping for PaymentStatus
payment_mapping = {"Paid": 0, "Partial": 0.5, "Unpaid": 1}

# Mapping for PrimaryIssueType: empty string indicates no support interaction.
issue_mapping = {
    "": 0, 
    "Billing": 0.7, 
    "Technical": 0.8, 
    "Service Quality": 0.6
}

# Mapping for SupportChannel: adjust the numeric values as needed.
support_channel_mapping = {
    "": 0,       # No support channel if there's no interaction.
    "Phone": 1,  # Example value: Phone might be considered more direct.
    "Chat": 0.5, # Example value: Chat might be intermediate.
    "Email": 0.2 # Example value: Email might be considered less immediate.
}

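# Convert to numeric values.
# pd.read_csv reads empty fields as NaN, so restore the empty string before
# mapping; otherwise the "" entries above would never match.
data["PaymentStatus"] = data["PaymentStatus"].map(payment_mapping)
data["PrimaryIssueType"] = data["PrimaryIssueType"].fillna("").map(issue_mapping)
data["SupportChannel"] = data["SupportChannel"].fillna("").map(support_channel_mapping)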

# Split the data into features and target variable
X = data.drop(columns=['CustomerID', 'HasChurned'])
y = data['HasChurned']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
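Because the dataset holds one row per customer per month, a purely random row split can place records of the same customer in both the training and the test set. The sketch below shows one possible alternative, splitting by `CustomerID` instead. The helper `split_by_customer` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def split_by_customer(df: pd.DataFrame, test_size: float = 0.3, seed: int = 42):
    """Split rows so that each customer ends up in exactly one of the two sets."""
    rng = np.random.default_rng(seed)
    shuffled_ids = rng.permutation(df["CustomerID"].unique())
    n_test = max(1, int(round(len(shuffled_ids) * test_size)))
    is_test = df["CustomerID"].isin(shuffled_ids[:n_test])
    return df[~is_test], df[is_test]


# Toy data: 10 customers with 3 monthly rows each (made-up values).
toy = pd.DataFrame({
    "CustomerID": np.repeat(np.arange(10), 3),
    "HasChurned": [False] * 30,
})

train_rows, test_rows = split_by_customer(toy)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

scikit-learn's `GroupShuffleSplit` offers the same kind of group-aware split; the plain version here is only meant to make the idea explicit.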

## Model Training
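We train a BalancedRandomForestClassifier from the imbalanced-learn library, a Random Forest variant that under-samples the training data for each tree to counter class imbalance. We fit the model on the training data and then make predictions on the test data.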

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier

# Initialize and train the BalancedRandomForestClassifier with tuned hyperparameters
model = BalancedRandomForestClassifier(
    random_state=42,
    n_estimators=200,
    min_samples_split=10,
    min_samples_leaf=2,
    max_features='sqrt',
    max_depth=100,
    bootstrap=True,
    # Number of samples to keep per class when under-sampling (non-churn: 5000, churn: 1000)
    sampling_strategy={False: 5000, True: 1000},
    # Draw the under-sampled rows with replacement
    replacement=True
)
model.fit(X_train, y_train)
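To make the effect of a dictionary-valued `sampling_strategy` more concrete, the sketch below mimics per-class under-sampling with replacement in plain Python. It is a simplified illustration, not imbalanced-learn's implementation, and it uses scaled-down counts (50 and 10 instead of 5000 and 1000) on made-up labels.

```python
import random


def undersample(labels, strategy, seed=0):
    """Draw the requested number of row indices per class, with replacement."""
    rng = random.Random(seed)
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)
    picked = []
    for label, n_samples in strategy.items():
        picked.extend(rng.choices(indices_by_class[label], k=n_samples))
    return picked


# Made-up labels: 5% churners, similar in spirit to an imbalanced churn dataset.
labels = [False] * 950 + [True] * 50

picked = undersample(labels, {False: 50, True: 10})
original_share = sum(labels) / len(labels)
sampled_share = sum(labels[i] for i in picked) / len(picked)

print(f"Churn share before sampling: {original_share:.3f}")
print(f"Churn share after sampling:  {sampled_share:.3f}")
```

A 5:1 strategy raises the churn share the trees see without fully balancing the classes, which pushes the model toward precision rather than recall compared with a 1:1 setting.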

## Model Evaluation
We evaluate the model's performance on the test set using a confusion matrix, a classification report, and the accuracy score.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test set with BalancedRandomForestClassifier
y_pred = model.predict(X_test)
# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
acc_score = accuracy_score(y_test, y_pred)
print("---------------")
print("BalancedRandomForestClassifier Results:")
print("Confusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
print("\nAccuracy Score:")
print(acc_score)
print("---------------")
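For an imbalanced target, precision and recall on the churn class are usually more informative than accuracy. The sketch below derives all three from a 2x2 confusion matrix laid out as scikit-learn returns it for a binary target (rows are true classes, columns are predicted classes). The counts are made up for illustration and are not results from this notebook.

```python
def churn_metrics(conf_matrix):
    """Compute precision, recall, and accuracy for the positive (churn) class."""
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Illustrative counts only: [[TN, FP], [FN, TP]]
example_matrix = [[900, 50], [20, 30]]
metrics = churn_metrics(example_matrix)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the accuracy is 0.93 even though only 37.5% of the predicted churners actually churn, which is why the classification report deserves more attention than the accuracy score.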

## Conclusion
In this notebook, we developed a machine learning model to predict customer churn using a synthetic dataset. Our BalancedRandomForestClassifier performed well in identifying potential churners. By explicitly configuring sampling_strategy to control the balance between the majority (non-churn) and minority (churn) classes, we obtained more precise predictions. Although this configuration slightly reduced recall, it significantly increased precision: when the model predicts churn, it is much more likely to be correct. The result is a better trade-off between capturing churners and minimizing false alarms.

Further steps could include additional hyperparameter tuning, feature engineering, and testing with real-world data.

## Saving the Model
Finally, we will save the trained model to a file for future use.

In [None]:
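import os
import pickle

# Save the model together with the feature column order used during training
model_path = "models/customer_churn_prediction_model.pkl"
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # create models/ if it does not exist
with open(model_path, 'wb') as model_file:
    pickle.dump((model, X_train.columns.tolist()), model_file)
print(f"Customer Churn Prediction BalancedRandomForestClassifier Model Saved to {model_path}")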
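The file stores a tuple of the model and the training column order, so a consumer should unpack both and arrange incoming features in that order before predicting. The sketch below shows the round trip with an in-memory buffer and a placeholder object instead of the real model; the feature names and values are hypothetical.

```python
import io
import pickle

# Hypothetical feature order and a placeholder standing in for the trained model.
feature_names = ["MonthlyCharge", "PaymentStatus", "SupportChannel"]
stand_in_model = {"kind": "placeholder"}

# Same (model, columns) tuple layout as the saved file, written to memory.
buffer = io.BytesIO()
pickle.dump((stand_in_model, feature_names), buffer)
buffer.seek(0)

loaded_model, loaded_columns = pickle.load(buffer)

# Incoming record with keys in arbitrary order (made-up values).
new_row = {"SupportChannel": 0.5, "MonthlyCharge": 42.0, "PaymentStatus": 0}

# Reorder the values to match the training column order.
ordered_values = [new_row[name] for name in loaded_columns]
print(ordered_values)
```

With the real file, `loaded_model.predict` would be called on a frame whose columns follow `loaded_columns`. Since unpickling can execute arbitrary code, only load model files from a trusted source.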