# CUSTOMER CHURN PREDICTION

This notebook outlines the steps taken to build a predictive model for identifying customers at high risk of churn for a telecom company. The dataset used is synthetically generated, with a focus on simulating real-world data challenges, including missing values, outliers, and imbalanced classes. The goal is to demonstrate the full machine learning lifecycle, from data generation and exploration to model building, evaluation, and (optionally) deployment.


# DATA GENERATION

In this section, we generate a synthetic dataset of 5000 customer records. The dataset includes various features like `CustomerID`, `Age`, `Gender`, `ContractType`, `MonthlyCharges`, `TotalCharges`, and others. Additionally, derived features like `average_monthly_charges` and `customer_lifetime_value` are created. We also introduce missing values and outliers to simulate real-world scenarios.


In [None]:
import pandas as pd
import numpy as np

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of records
n_records = 5000

In [None]:
# Generate synthetic data
data = pd.DataFrame({
    'CustomerID': np.arange(1, n_records + 1),
    'Age': np.random.randint(18, 70, size=n_records),
    'Gender': np.random.choice(['Male', 'Female'], size=n_records),
    'ContractType': np.random.choice(['Month-to-month', 'One year', 'Two year'], size=n_records),
    'MonthlyCharges': np.round(np.random.uniform(20, 120, size=n_records), 2),
    'TotalCharges': np.round(np.random.uniform(100, 8000, size=n_records), 2),
    'TechSupport': np.random.choice(['Yes', 'No'], size=n_records),
    'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], size=n_records),
    'Tenure': np.random.randint(1, 72, size=n_records),
    'PaperlessBilling': np.random.choice(['Yes', 'No'], size=n_records),
    'PaymentMethod': np.random.choice(['Cash','UPI','Internet Banking', 'Debit card', 'Credit card'], size=n_records),
    'Churn': np.random.choice(['Yes', 'No'], size=n_records, p=[0.2, 0.8])
})

In [None]:
# Derived features
data['average_monthly_charges'] = data['TotalCharges'] / np.where(data['Tenure'] == 0, 1, data['Tenure'])
data['customer_lifetime_value'] = data['MonthlyCharges'] * data['Tenure']

In [None]:
# Introduce missing values
data.loc[np.random.choice(data.index, size=50, replace=False), 'TotalCharges'] = np.nan

# Introduce outliers
data.loc[np.random.choice(data.index, size=10, replace=False), 'MonthlyCharges'] *= 1.5

In [None]:
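# Save dataset to CSV (create the output directory first if it does not exist)
import os
os.makedirs('data', exist_ok=True)
data.to_csv('data/customer_data.csv', index=False)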

data.head()

# EXPLORATORY DATA ANALYSIS (EDA)

EDA is a crucial step in understanding the dataset characteristics, identifying patterns, and detecting anomalies. In this section, we perform in-depth exploratory data analysis, including summary statistics, distribution analysis, and visualization of relationships between features and the target variable `Churn`.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
data = pd.read_csv('data/customer_data.csv')

print(data.describe())

In [None]:
# Analyze churn distribution
sns.countplot(x='Churn', data=data)
plt.title('Churn Distribution')
plt.show()

In [None]:
# Analyze numerical features
sns.pairplot(data, hue='Churn', vars=['Age', 'MonthlyCharges', 'TotalCharges', 'Tenure'])
plt.show()

In [None]:
# Correlation matrix
numeric_cols = data.select_dtypes(include=[np.number])

corr_matrix = numeric_cols.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Numeric Features')
plt.show()

In [None]:
# Analyze categorical features
for col in ['Gender', 'ContractType', 'TechSupport', 'InternetService', 'PaperlessBilling', 'PaymentMethod']:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=col, hue='Churn', data=data)
    plt.title(f'{col} vs Churn')
    plt.show()

# DATA PREPROCESSING

Data preprocessing involves cleaning and preparing the data for modeling. This section covers handling missing values, encoding categorical variables, scaling numerical features, and splitting the dataset into training and testing sets.
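
One way to keep test rows from influencing the imputation mean and the scaling statistics is to split first and fit the transformers on the training rows only. The sketch below illustrates that ordering on a toy array. The names `X_toy`, `y_toy`, and `prep` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric matrix with one missing value (hypothetical data, not the notebook's)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_toy[3, 1] = np.nan
y_toy = rng.integers(0, 2, size=100)

# Split first, so test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Impute, then scale; both steps learn their parameters from the training rows only
prep = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
X_tr_prepared = prep.fit_transform(X_tr)
X_te_prepared = prep.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

The same fitted `prep` object can then be reused on any later batch of data, which keeps the preprocessing consistent between training and scoring.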


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

In [None]:
# Handle missing values
imputer = SimpleImputer(strategy='mean')
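# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to a single column
data['TotalCharges'] = imputer.fit_transform(data[['TotalCharges']]).ravel()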


In [None]:
# Encode categorical features
label_encoders = {}
for col in ['Gender', 'ContractType', 'TechSupport', 'InternetService', 'PaperlessBilling', 'PaymentMethod', 'Churn']:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

In [None]:
# Feature scaling
num_cols = ['Age', 'MonthlyCharges', 'TotalCharges', 'Tenure',
            'average_monthly_charges', 'customer_lifetime_value']
scaler = StandardScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])

In [None]:
# Split dataset
X = data.drop(columns=['CustomerID', 'Churn'])
y = data['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
X_train.shape, X_test.shape

# FEATURE ENGINEERING

Feature engineering involves creating new features based on domain knowledge or insights from EDA. In this section, we create additional features that may improve the model's performance.


In [None]:
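# Example feature engineering
# Work on explicit copies so the new column is not assigned to a slice of `data`.
X_train = X_train.copy()
X_test = X_test.copy()
# Tenure was standardized above, so the == 0 guard only catches an exact zero;
# values close to zero can still produce very large ratios.
X_train['monthly_to_tenure_ratio'] = X_train['MonthlyCharges'] / np.where(X_train['Tenure'] == 0, 1, X_train['Tenure'])
X_test['monthly_to_tenure_ratio'] = X_test['MonthlyCharges'] / np.where(X_test['Tenure'] == 0, 1, X_test['Tenure'])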


# MODEL BUILDING

In this section, we experiment with two classification algorithms, Decision Trees and Naive Bayes. We also optimize hyperparameters and evaluate model performance using metrics like accuracy, precision, recall, F1-score, and ROC AUC.
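
To make the metrics concrete, the sketch below computes accuracy, precision, recall, and F1-score by hand from a set of confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
# Hypothetical confusion-matrix counts for the positive class (churn = 1)
tp, fp, fn, tn = 30, 20, 10, 140

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the customers flagged as churners, how many churned
recall = tp / (tp + fn)      # of the customers who churned, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.85 precision=0.60 recall=0.75 f1=0.67
```

With imbalanced classes, accuracy can look high while precision and recall on the churn class stay modest, which is why the notebook reports all of them.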


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


### DECISION TREE
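Decision trees are created by recursively partitioning the data into smaller and smaller subsets. At each partition, the data is split on a single feature, and the split is chosen to give the largest reduction in impurity, measured either by Gini impurity (the scikit-learn default) or by entropy, in which case the reduction is called information gain.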
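
As a worked illustration of information gain, the sketch below scores one candidate split on a toy node. The helper names `entropy` and `information_gain` and the label lists are hypothetical; scikit-learn performs the equivalent computation internally when `criterion='entropy'` is selected.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical node with 4 churners (1) and 4 non-churners (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # e.g. rows where a candidate feature is below its threshold
right = [1, 0, 0, 0]   # rows at or above the threshold

print(f"{information_gain(parent, left, right):.3f}")  # prints 0.189
```

A split that separated the two classes perfectly would score a gain of 1 bit here, so 0.189 indicates a fairly weak split.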

In [None]:
# Train Decision Tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)

In [None]:
# Evaluate Decision Tree
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_tree):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_tree):.2f}")
print(f"F1-Score: {f1_score(y_test, y_pred_tree):.2f}")
print(f"ROC AUC: {roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]):.2f}")

In [None]:
# Address class imbalance using SMOTE.
# Resample only the training split so that no synthetic samples leak into the test set.
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Evaluate on the original, untouched test split
X_test_smote, y_test_smote = X_test, y_test
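
SMOTE balances the classes by creating synthetic minority samples, each placed on the line segment between a minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step under simplified assumptions. It is not imbalanced-learn's implementation, and `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [8.0, 8.0]])

def smote_like_sample(points, k=2):
    """Return one synthetic point on the segment between a point and a near neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances to every other minority point; the base itself is excluded
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf
    neighbour = points[rng.choice(np.argsort(dists)[:k])]
    gap = rng.random()  # interpolation factor in [0, 1)
    return base + gap * (neighbour - base)

synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic.shape)  # prints (5, 2)
```

Because every synthetic point is an interpolation of two real ones, interpolating label-encoded categorical columns can yield values that correspond to no real category, which is worth keeping in mind when reading the results.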

In [None]:
# Define the parameter grid to search
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt', 'log2', None],
    'class_weight': ['balanced', None],
    'criterion': ['gini', 'entropy']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=tree,
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=10,
                           n_jobs=-1,
                           verbose=1)

# Fit the model
grid_search.fit(X_train_smote, y_train_smote)

# Get the best estimator
best_tree = grid_search.best_estimator_

# Make predictions with the best model
y_pred_best_tree = best_tree.predict(X_test_smote)


In [None]:
# Evaluate the model
print(f"Best Decision Tree Accuracy: {accuracy_score(y_test_smote, y_pred_best_tree):.2f}")
print(f"Best Precision: {precision_score(y_test_smote, y_pred_best_tree):.2f}")
print(f"Best Recall: {recall_score(y_test_smote, y_pred_best_tree):.2f}")
print(f"Best F1-Score: {f1_score(y_test_smote, y_pred_best_tree):.2f}")
print(f"Best ROC AUC: {roc_auc_score(y_test_smote, best_tree.predict_proba(X_test_smote)[:, 1]):.2f}")

### NAIVE BAYES
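Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the predictors are conditionally independent of one another given the class, and the Gaussian variant used here additionally models each numeric feature as normally distributed within each class.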
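
The sketch below shows the calculation by hand for two features: each class prior is multiplied by one Gaussian likelihood per feature, and the results are normalized into posterior probabilities. The priors, means, and variances are made-up values for illustration, and the helper names `gaussian_pdf` and `posterior` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for two features (values are made up)
priors = {'churn': 0.2, 'stay': 0.8}
stats = {
    'churn': [(90.0, 100.0), (10.0, 25.0)],   # (mean, variance) per feature
    'stay':  [(60.0, 100.0), (40.0, 225.0)],
}

def posterior(x):
    """Posterior class probabilities under the feature-independence assumption."""
    joint = {}
    for cls in priors:
        likelihood = 1.0
        for value, (mean, var) in zip(x, stats[cls]):
            likelihood *= gaussian_pdf(value, mean, var)
        joint[cls] = priors[cls] * likelihood
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}

probs = posterior([85.0, 12.0])
print(max(probs, key=probs.get))  # prints churn
```

Even though the prior favours `stay`, both feature values sit much closer to the `churn` means, so the likelihood terms dominate the posterior.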

In [None]:
# Train Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

In [None]:
# Evaluate Naive Bayes
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, y_pred_nb):.2f}")
print(f"Precision: {precision_score(y_test, y_pred_nb):.2f}")
print(f"Recall: {recall_score(y_test, y_pred_nb):.2f}")
print(f"F1-Score: {f1_score(y_test, y_pred_nb):.2f}")
print(f"ROC AUC: {roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]):.2f}")

In [None]:
# Define a pipeline with scaling and the Naive Bayes model
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Feature scaling
    ('nb', GaussianNB())  # Step 2: Gaussian Naive Bayes model
])

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'nb__var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]  # Main tunable hyperparameter in GaussianNB (class priors can also be set)
}

# Set up the GridSearchCV
grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='accuracy',  # You can change this to 'f1' or another metric if needed
                           cv=5,
                           n_jobs=-1,
                           verbose=1)

# Fit the model
grid_search.fit(X_train_smote, y_train_smote)

# Get the best model
best_nb = grid_search.best_estimator_

# Make predictions
y_pred_best_nb = best_nb.predict(X_test_smote)


In [None]:
# Evaluate the model
print(f"Best Naive Bayes Accuracy: {accuracy_score(y_test_smote, y_pred_best_nb):.2f}")
print(f"Best Precision: {precision_score(y_test_smote, y_pred_best_nb):.2f}")
print(f"Best Recall: {recall_score(y_test_smote, y_pred_best_nb):.2f}")
print(f"Best F1-Score: {f1_score(y_test_smote, y_pred_best_nb):.2f}")
print(f"Best ROC AUC: {roc_auc_score(y_test_smote, best_nb.predict_proba(X_test_smote)[:, 1]):.2f}")

# MODEL SELECTION AND EVALUATION

Based on the performance metrics, we select the best model, `best_tree`. In this section, we analyze the confusion matrix, calculate feature importance, and plot the ROC curve to further understand the model's behavior.
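
When reading the ROC curve, it helps to recall that the area under it equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The sketch below computes that pairwise quantity directly. The scores are hypothetical and `pairwise_auc` is an illustrative helper, not part of this notebook.

```python
from itertools import product

# Hypothetical predicted churn probabilities
scores_pos = [0.9, 0.7, 0.4]        # customers who churned
scores_neg = [0.8, 0.3, 0.2, 0.1]   # customers who stayed

def pairwise_auc(pos, neg):
    """Fraction of (churner, non-churner) pairs ranked correctly; ties count as half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

print(f"{pairwise_auc(scores_pos, scores_neg):.2f}")  # prints 0.83
```

A value of 0.5 corresponds to the diagonal reference line in the ROC plot, meaning the scores rank churners no better than chance.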


In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [None]:
# Confusion Matrix for Decision Tree
conf_matrix_tree = confusion_matrix(y_test_smote, y_pred_best_tree)
sns.heatmap(conf_matrix_tree, annot=True, fmt='d', cmap='Blues')
plt.title('Decision Tree Confusion Matrix')
plt.show()

In [None]:
# Feature Importance for the selected Decision Tree (best_tree, not the untuned baseline)
importances = pd.Series(best_tree.feature_importances_, index=X_train_smote.columns)
importances.sort_values().plot(kind='barh', color='teal')
plt.title('Decision Tree Feature Importances')
plt.show()

In [None]:
def plot_roc(model, X_test_smote, y_test_smote, model_name):
    y_prob = model.predict_proba(X_test_smote)[:, 1] 
    fpr, tpr, thresholds = roc_curve(y_test_smote, y_prob)
    roc_auc = auc(fpr, tpr)
    
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'{model_name} ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{model_name} Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()

# Plot ROC curves for both tuned models
plot_roc(best_tree, X_test_smote, y_test_smote, 'Decision Tree')
plot_roc(best_nb, X_test_smote, y_test_smote, 'Naive Bayes')

# CONCLUSION

In this project, we built a predictive model for customer churn using a synthetically generated dataset. We followed a systematic approach, from data generation and exploratory analysis, through preprocessing and feature engineering, to model building and evaluation. The selected model, `best_tree`, was chosen on the basis of its performance metrics and gives the telecom company a way to identify high-risk customers and take proactive measures to reduce churn.
