# **Customer Churn Prediction and Retention Strategies**

## **Workflow of the Project**

<ul>
    <li>Data Loading</li>
    <li>Exploratory Data Analysis (E.D.A.)</li>
    <li>Data Preprocessing</li>
    <li>Machine Learning Model Development</li>
    <li>Model Evaluation</li>
    <li>Conclusion</li>
</ul>

## Importing the Libraries

In [4]:
# Print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Library for Warning Message
import warnings
warnings.simplefilter("ignore")

# Base Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for ML Parameters
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.inspection import permutation_importance
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import GridSearchCV

# Libraries for ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Libraries for Performance Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_curve,
    roc_auc_score,
    auc,
    precision_recall_curve,
)

## Data Loading

In [6]:
# The training and test sets are stored in two separate files.
df_train = pd.read_csv("customer_churn_dataset_training.csv")
df_test = pd.read_csv("customer_churn_dataset_testing.csv")
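
Since the two files are loaded separately, it can help to confirm early that they share the same columns. The sketch below is only an illustration: `toy_train` and `toy_test` are small hypothetical stand-ins for the real CSV files, and `same_schema` is a helper name introduced here, not part of the original notebook.

```python
import pandas as pd


def same_schema(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True when both frames have identical column names in the same order."""
    return list(left.columns) == list(right.columns)


# Hypothetical stand-ins for df_train and df_test
toy_train = pd.DataFrame({"CustomerID": [2, 3], "Age": [30, 65], "Churn": [1, 1]})
toy_test = pd.DataFrame({"CustomerID": [1, 2], "Age": [22, 41], "Churn": [1, 0]})

print(same_schema(toy_train, toy_test))
```

With the real data, the same call would be `same_schema(df_train, df_test)`.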

## Exploratory Data Analysis (E.D.A.)

Since we have separate datasets for training and testing, we will conduct Exploratory Data Analysis (EDA) on the training data, while the performance of the best machine learning models will be evaluated using the test data.

In [9]:
# Glimpse of the first 5 rows
print("Glimpse of the first 5 rows")
print(" ")
df_train.head()

Glimpse of the first 5 rows
 


,CustomerID,Age,Gender,Tenure,Usage Frequency,Support Calls,Payment Delay,Subscription Type,Contract Length,Total Spend,Last Interaction,Churn
0,2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.0,17.0,1.0
1,3.0,65.0,Female,49.0,1.0,10.0,8.0,Basic,Monthly,557.0,6.0,1.0
2,4.0,55.0,Female,14.0,4.0,6.0,18.0,Basic,Quarterly,185.0,3.0,1.0
3,5.0,58.0,Male,38.0,21.0,7.0,7.0,Standard,Monthly,396.0,29.0,1.0
4,6.0,23.0,Male,32.0,20.0,5.0,8.0,Basic,Monthly,617.0,20.0,1.0


In [10]:
# Converting the Column Names to "LowerCase"
df_train.columns = df_train.columns.str.lower()

# Verify
df_train.head(1)

,customerid,age,gender,tenure,usage frequency,support calls,payment delay,subscription type,contract length,total spend,last interaction,churn
0,2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.0,17.0,1.0


In [11]:
# Setting the "customerid" column as Index
df_train = df_train.set_index("customerid")

# Verify
df_train.head(1)

customerid,age,gender,tenure,usage frequency,support calls,payment delay,subscription type,contract length,total spend,last interaction,churn
2.0,30.0,Female,39.0,14.0,5.0,18.0,Standard,Annual,932.0,17.0,1.0


In [12]:
# Dimension of the Data
print("Dimension of the Data")
print(" ")
print("The dataset consists of {} rows and {} columns, with 'customerid' set as the index column.".format(
        df_train.shape[0], df_train.shape[1]+1))

Dimension of the Data
 
The dataset consists of 440833 rows and 12 columns, with 'customerid' set as the index column.


In [13]:
# Checking of Missing Values
print("Checking of Missing Values")
print(" ")

n_missing = df_train.isna().sum().sum()  # total NaN cells across all columns
if n_missing == 0:
    print("The data has no missing values")
else:
    print("The data has {} missing values in total.".format(n_missing))

Checking of Missing Values
 
The data has 11 missing values in total.


In [14]:
# Show the rows that contain at least one missing value
df_train[df_train.isna().any(axis = 1)]

customerid,age,gender,tenure,usage frequency,support calls,payment delay,subscription type,contract length,total spend,last interaction,churn
,,,,,,,,,,,


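Only one row contains missing values, and it is empty in all 11 columns, which accounts for the 11 missing values counted above. Since it carries no information, we can remove it from the dataset.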
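
One way to confirm that the missing values all come from a single empty row is to count them per column and to count the rows that are empty in every column. The following is a minimal sketch on a hypothetical three-row frame (`toy`), not on the real training data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row is empty in every column
toy = pd.DataFrame(
    {
        "age": [30.0, np.nan, 55.0],
        "gender": ["Female", np.nan, "Male"],
        "churn": [1.0, np.nan, 1.0],
    }
)

missing_per_column = toy.isna().sum()                  # NaN count for each column
fully_empty_rows = int(toy.isna().all(axis=1).sum())   # rows where every column is NaN

print(missing_per_column)
print("Rows empty in every column:", fully_empty_rows)
```

If every column reports the same count and that count equals the number of fully empty rows, dropping those rows removes all missing values.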

In [16]:
# Removing the missing row
df_train.dropna(inplace = True)

# Final Dimension
print("After removing the row with missing values, the final data consists of {} rows and {} columns (incl. index column).".format(
    df_train.shape[0], df_train.shape[1]+1))

After removing the row with missing values, the final data consists of 440832 rows and 12 columns (incl. index column).


### **Data Description**

**Customer Demographics:**
<ul>
    <li>Age</li>
    <li>Gender</li>
</ul>

**Engagement Metrics:**
<ul>
    <li>Tenure</li>
    <li>Usage Frequency</li>
    <li>Support Calls</li>
</ul>

**Payment Behavior:**
<ul>
    <li>Payment Delay</li>
    <li>Total Spend</li>
</ul>

**Subscription Details:**
<ul>
    <li>Subscription Type</li>
    <li>Contract Length</li>
</ul>

**Customer Interaction:**
<ul>
    <li>Last Interaction</li>
</ul>

**Target Variable:**
<ul>
    <li>Churn (Binary indicator of whether a customer left)</li>
</ul>

In [18]:
# Check the data type of each variable
df_train.info()
print(" ")

<class 'pandas.core.frame.DataFrame'>
Index: 440832 entries, 2.0 to 449999.0
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   age                440832 non-null  float64
 1   gender             440832 non-null  object 
 2   tenure             440832 non-null  float64
 3   usage frequency    440832 non-null  float64
 4   support calls      440832 non-null  float64
 5   payment delay      440832 non-null  float64
 6   subscription type  440832 non-null  object 
 7   contract length    440832 non-null  object 
 8   total spend        440832 non-null  float64
 9   last interaction   440832 non-null  float64
 10  churn              440832 non-null  float64
dtypes: float64(8), object(3)
memory usage: 40.4+ MB
 


Columns such as **age, tenure, support calls, and churn** hold whole numbers, so they should be stored as integers rather than floats.
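
Before casting, it may be worth checking that a float column really contains only whole numbers, because an integer cast silently truncates any fractional part. The sketch below uses a hypothetical frame (`toy`) in which `total spend` has a fractional value and is therefore left as a float.

```python
import pandas as pd

# Hypothetical toy frame; "total spend" contains a fractional value
toy = pd.DataFrame(
    {"age": [30.0, 65.0], "tenure": [39.0, 49.0], "total spend": [932.0, 557.5]}
)

candidate_cols = ["age", "tenure", "total spend"]

# A column is safe to cast when every value has a zero fractional part
is_whole = (toy[candidate_cols] % 1 == 0).all()
safe_cols = is_whole[is_whole].index.tolist()

# Cast only the columns that passed the check
toy = toy.astype({col: "int64" for col in safe_cols})
print(toy.dtypes)
```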

In [20]:
# Convert the Numeric columns to integer

float_cols = ["age", "tenure", "usage frequency", "support calls", "payment delay", "last interaction", "churn"]

df_train[float_cols] = df_train[float_cols].astype("int")

In [21]:
# Convert the index to Integer
df_train.index = df_train.index.astype("int")

In [22]:
# Verify the Data Type
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 440832 entries, 2 to 449999
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   age                440832 non-null  int64  
 1   gender             440832 non-null  object 
 2   tenure             440832 non-null  int64  
 3   usage frequency    440832 non-null  int64  
 4   support calls      440832 non-null  int64  
 5   payment delay      440832 non-null  int64  
 6   subscription type  440832 non-null  object 
 7   contract length    440832 non-null  object 
 8   total spend        440832 non-null  float64
 9   last interaction   440832 non-null  int64  
 10  churn              440832 non-null  int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 40.4+ MB


In [23]:
# View the data
df_train.head()

customerid,age,gender,tenure,usage frequency,support calls,payment delay,subscription type,contract length,total spend,last interaction,churn
2,30,Female,39,14,5,18,Standard,Annual,932.0,17,1
3,65,Female,49,1,10,8,Basic,Monthly,557.0,6,1
4,55,Female,14,4,6,18,Basic,Quarterly,185.0,3,1
5,58,Male,38,21,7,7,Standard,Monthly,396.0,29,1
6,23,Male,32,20,5,8,Basic,Monthly,617.0,20,1
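
The cleaning steps so far (lower-casing the column names, setting the index, dropping the empty row, and casting to integers) will likely be needed for the test set as well. A possible way to keep them in one place is sketched below. `clean_churn_frame` and `INT_COLS` are names introduced here for illustration, and the `raw` frame is hypothetical toy data. Unlike the cells above, the sketch drops missing rows before setting the index, so that a missing `customerid` cannot break the integer cast of the index.

```python
import numpy as np
import pandas as pd

# Columns cast to integer in this section
INT_COLS = ["age", "tenure", "usage frequency", "support calls",
            "payment delay", "last interaction", "churn"]


def clean_churn_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this section to a raw churn frame."""
    df = raw.copy()
    df.columns = df.columns.str.lower()             # lower-case column names
    df = df.dropna()                                # drop rows with any missing value
    df = df.set_index("customerid")                 # customer id becomes the index
    df = df.astype({col: "int64" for col in INT_COLS})
    df.index = df.index.astype("int64")
    return df


# Hypothetical raw frame with one fully empty row
raw = pd.DataFrame({
    "CustomerID": [2.0, np.nan, 4.0],
    "Age": [30.0, np.nan, 55.0],
    "Gender": ["Female", np.nan, "Female"],
    "Tenure": [39.0, np.nan, 14.0],
    "Usage Frequency": [14.0, np.nan, 4.0],
    "Support Calls": [5.0, np.nan, 6.0],
    "Payment Delay": [18.0, np.nan, 18.0],
    "Subscription Type": ["Standard", np.nan, "Basic"],
    "Contract Length": ["Annual", np.nan, "Quarterly"],
    "Total Spend": [932.0, np.nan, 185.0],
    "Last Interaction": [17.0, np.nan, 3.0],
    "Churn": [1.0, np.nan, 1.0],
})

clean = clean_churn_frame(raw)
print(clean.dtypes)
```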


#### **Do we have Duplicate Customers?**

In [25]:
is_duplicate_customer = df_train.index.duplicated().any()

if is_duplicate_customer:
    print("There are duplicate customers present in the dataset.")
else:
    print("The dataset has no duplicate customers.")

The dataset has no duplicate customers.
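
Had the check reported duplicates, the affected rows could be listed for inspection before deciding how to handle them. A minimal sketch on a hypothetical frame (`toy`) with one repeated id:

```python
import pandas as pd

# Hypothetical toy frame in which customer id 2 appears twice
toy = pd.DataFrame(
    {"age": [30, 65, 41]},
    index=pd.Index([2, 3, 2], name="customerid"),
)

# keep=False flags every occurrence of a repeated id, not only the later ones
duplicate_rows = toy[toy.index.duplicated(keep=False)]
print(duplicate_rows)
```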


In [26]:
# Check the unique values from Categorical Columns
print("Gender has {} unique values and they are {}.".format(df_train.gender.nunique(), df_train.gender.unique()))
print(" ")
print("Subscription Type has {} unique values and they are {}.".format(df_train["subscription type"].nunique(), df_train["subscription type"].unique()))
print(" ")
print("Contract Length has {} unique values and they are {}.".format(df_train["contract length"].nunique(), df_train["contract length"].unique()))

Gender has 2 unique values and they are ['Female' 'Male'].
 
Subscription Type has 3 unique values and they are ['Standard' 'Basic' 'Premium'].
 
Contract Length has 3 unique values and they are ['Annual' 'Monthly' 'Quarterly'].
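
The three print statements above can also be written as a loop over all non-numeric columns, which avoids naming each column by hand. This is a sketch on a hypothetical frame (`toy`); it assumes that every non-numeric column is categorical, which holds for the columns listed in this dataset.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [30, 65, 55],
    "gender": ["Female", "Female", "Male"],
    "contract length": ["Annual", "Monthly", "Quarterly"],
})

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns

# Collect the sorted distinct values of each categorical column
unique_values = {col: sorted(toy[col].unique()) for col in categorical_cols}

for col, values in unique_values.items():
    print("{} has {} unique values and they are {}.".format(col, len(values), values))
```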
