## Introduction

Hello, in this project we will briefly discuss about the customer churn prediction model that we will build using the artificial neural network.

### Problem Statement

#### What is customer churn?

So, customer churn is simply the rate at which customers leave doing business with an entity. Simply put, churn prediction involves determining the possibility of customers stopping doing business with an entity. In other words, if a consumer has purchased a subscription to a particular service, we must determine the likelihood that the customer would leave or cancel the membership.

It is a critical prediction for many businesses because acquiring new clients often costs more than retaining existing ones. Customer churn measures how and why are customers leaving the business.

There are many ways to calculate the customer churn one of the ways is to divide the number of customers leaving a business in a given time interval by the number of customers that are present at the beginning of the period.

We know that customer churn is important in business problems, the ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it.

Clarify for your better understanding let’s take an example, suppose you have taken a premium subscription of the company product now you think that it’s time to leave the subscription, for this you will contact to the company, the company will try to offer some extra functionalities for not leaving the subscription. This is because it will be a loss for any industry that there is some percent of customers are not using their product.

For prediction of this kind of situation, we will do in depth analysis and use some data science techniques, that will predict the customer churn based on several features.



Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# use pandas to import csv file
df = pd.read_csv("C:/Users/VISHAL/OneDrive/Desktop/I-Neuron Projects/Customer churn/archive/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In this dataset there are 7043 rows and 21 columns are present. There are some categorical and some numerical columns present.

## Preprocess Dataset

Now it’s time to preprocess the data, firstly we will observe the dataset,  this means we have to see the data types of the columns, other functionalities, and parameters of each column.

First, we check the dataset information using the info() method

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


we can see that the datatypes of each column, number of rows present with non-null values, there are 2 int, 1 float, and remaining are string datatype columns.

Second, we check the description of the dataset, here we will only visible the num variables functionalities. we will use describe() method.

In [4]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


Here we can see that describe() method only describe the functionalities of a numerical variable.  From this, we can easily conclude the parameters of each column.

Now we drop unwanted features from our dataset because these unwanted features are like the garbage they will affect our model accuracy so we drop it.

In [5]:
# we didn't require customerID so we drop it
df = df.drop('customerID',axis=1)

We drop customerID because it has no meaning in the dataset and we can easily differentiate each customer using indices of the rows. By dropping this column or dataset should be now ready to process.

When we note the TotalCharges column then we found that it’s a data type of an object but it even would be float. so we have to typecast this column.

In [6]:
#count of string value into the column.
count=0
for i in df.TotalCharges:
    if i==' ':
        count+=1
print('count of empty string:- ',count)
#we will replace this empty string to nan values
df['TotalCharges'] = df['TotalCharges'].replace(" ",np.nan)
# typecasting of the TotalCharges column
df['TotalCharges'] = df['TotalCharges'].astype(float)
df.info()

count of empty string:-  11
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod 

After printing, we found that 11 rows contain” ” empty string which will affect the datatype of the column, so we convert this into nan values and typecast into float64.

So, now TotalCharges has 11 null values, we have to fill it. let’s do it.

## checking Null Values in Customer Churn Data

Null values badly affect our model performance because, these null values are irreverent in nature they are misplaced in the dataset so we have to remove them and replace them with other values if null values are less, but if it was present in large quantity then we just drop it.

Now we have to check for null values, for this, we use the pandas IsNull() method which will give True if the null value is present and False when there are no null values.

In [7]:
# fill null values with mean
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())

To handle null values we fill null values of the TotalCharges column with the mean of the TotalCharges column.

Now we will extract the numerical and categorical columns from the dataset for further processes.

As we see from above table there are 11 null values in TotalCharges coloumn.

To handle null values we fill null values of the TotalCharges column with the mean of the TotalCharges column.

Now we will extract the numerical and categorical columns from the dataset for further processes.

To handle null values we fill null values of the TotalCharges column with the mean of the TotalCharges column.

Now we will extract the numerical and categorical columns from the dataset for further processes.

In [8]:
#numerical variables

num = list(df.select_dtypes(include=['int64','float64']).keys())

#categorical variables

cat = list(df.select_dtypes(include='O').keys())

print('Categorical=',cat)

print('numerical=',num)

Categorical= ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
numerical= ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']


Here we create the num variable for numerical columns and cat for the categorical columns

Now we see the value counts of each category in each categorical column.

In [9]:
# value_counts of the categorical columns
for i in cat:
    print(df[i].value_counts())
# as we see that there is extra categories which we have to convert it into No.
df.MultipleLines = df.MultipleLines.replace('No phone service','No')
df.OnlineSecurity = df.OnlineSecurity.replace('No internet service','No')
df.OnlineBackup = df.OnlineBackup.replace('No internet service','No')
df.DeviceProtection = df.DeviceProtection.replace('No internet service','No')
df.TechSupport = df.TechSupport.replace('No internet service','No')
df.StreamingTV = df.StreamingTV.replace('No internet service','No')
df.StreamingMovies = df.StreamingMovies.replace('No internet service','No')

Male      3555
Female    3488
Name: gender, dtype: int64
No     3641
Yes    3402
Name: Partner, dtype: int64
No     4933
Yes    2110
Name: Dependents, dtype: int64
Yes    6361
No      682
Name: PhoneService, dtype: int64
No                  3390
Yes                 2971
No phone service     682
Name: MultipleLines, dtype: int64
Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64
No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64
No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64
No                     3095
Yes                    2422
No internet service    1526
Name: DeviceProtection, dtype: int64
No                     3473
Yes                    2044
No internet service    1526
Name: TechSupport, dtype: int64
No                     2810
Yes                    2707
No internet service    1526
Name: StreamingTV

On observation we found that there are multiple columns were having some irrelevant categories, so we have to just convert it into a useful manner. For we change the “No Phone Service” category into the “No” category and we do it for all the columns where this “No Phone Service” is present.

## Handling categorical Variables in Customer Churn Data

So, here we have to handle categorical columns, handle means we have to convert categorical values into numerical values because while the training model dataset contains all the numerical values categories won’t w accept.

In [10]:
# we have to handel this all categorical variables
# there are mainly Yes/No features in most of the columns
# we will convert Yes = 1 and No = 0
for i in cat:
    df[i] = df[i].replace('Yes',1)
    df[i] = df[i].replace('No',0)

On observing the count values of the dataset then we found that there are NO and YES are present, so we have to convert it into 1 and 0 which will be easy to process. For all categorical variables, we replace Yes with 1 and No with 0.

In [11]:
# we will convert male = 1 and female = 0
df.gender = df.gender.replace('Male',1)
df.gender = df.gender.replace('Female',0)

In the gender column, we replace Male with 1 and Female with 0.

#### Now convert categorical columns into numeric one
Now will decode categorical values into numeric ones.

In [12]:
df2 = pd.get_dummies(data=df, columns=['InternetService','Contract','PaymentMethod'])
df2.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'Churn',
       'InternetService_0', 'InternetService_DSL',
       'InternetService_Fiber optic', 'Contract_Month-to-month',
       'Contract_One year', 'Contract_Two year',
       'PaymentMethod_Bank transfer (automatic)',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check'],
      dtype='object')

In [13]:
df2.sample(5)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,DeviceProtection,...,InternetService_0,InternetService_DSL,InternetService_Fiber optic,Contract_Month-to-month,Contract_One year,Contract_Two year,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
3583,1,0,1,1,40,0,0,0,1,1,...,0,1,0,0,1,0,0,0,0,1
6056,0,0,0,0,58,1,1,1,1,1,...,0,1,0,0,1,0,0,1,0,0
157,1,0,1,1,22,1,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
6337,0,0,1,1,55,1,1,1,0,0,...,0,0,1,0,1,0,0,0,0,1
6943,0,0,1,1,1,0,0,1,0,0,...,0,1,0,1,0,0,1,0,0,0


We can see that all the categorical columns are now typed cast into the numerical values.

The handling of categorical columns is over now we have to scale our data because there are some columns present where values are much larger which will affect the runtime of the program so we will convert bigger values into smaller ones.

In [15]:
scale_cols = ['tenure','MonthlyCharges','TotalCharges']
# now we scling all the data 
from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler()
df[scale_cols] = scale.fit_transform(df2[scale_cols])

scale_cols contain that columns which are having large numerical values, and with MinMaxScaler we will scale it into values between -1 to 1.

## Independent and Dependent Variables

This is an important step into the model-building part we have to separate all the columns which are important or by which target values are predicted with the target values which e have to predict.

Now we start our model training process, first, we have to divide our dataset into dependent and independent variables.

In [16]:
# independent and dependent variables
x = df2.drop('Churn',axis=1)
y = df2['Churn']

X contains an independent variable that is independent, Y contains a dependent variable which is to target variable. All the columns except Churn are present in the X variable and Churn is present in the Y variable.

## Splitting data

This is the important part is we have to split our data into training and testing parts by which we do further processes.

Now we have to split our dataset into train and test sets, where the training set is used to train the model, and the testing set is used for testing the values of targeted columns.model

In [17]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.2,random_state=10)
print(xtrain.shape)
print(xtest.shape)

(5634, 26)
(1409, 26)


We have just imported the train_test_split() method from the sklearn and we set some parameters where testing size was 30% and the remaining 70% considered as training data.

## Building Neural Network for Customer Churn Data

Now all our preprocessing and splitting part is our, its time for building the neural network, we will use TensorFlow and Keras library for building the artificial neural net.

Firstly we have to import these important libraries for further processes.
now we create our artificial neural net.

In [22]:
import tensorflow as tf
from tensorflow import keras

Tensorflow is used for multiple tasks but has a particular focus on the training and inference of deep neural networks and  Keras acts as an interface for the TensorFlow library.
## Define Model

Now we have to define our model, which means we have to set the parameters and layers of the deep neural network which will be used for training the data.

In [24]:
# define sequential model
model = keras.Sequential([
    # input layer
    keras.layers.Dense(19, input_shape=(19,), activation='relu'),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(10,activation = 'relu'),
    # we use sigmoid for binary output
    # output layer
    keras.layers.Dense(1, activation='sigmoid')
]
)

Here we define sequential model, in the sequential model the input, hidden and output layers are connected into the sequential manner, here we define one input layer which contains all 19 columns as an input, second and third layer is hidden layers which contain 15, 10 hidden neurons and here we apply RelU activation function. Our last layer is the output layer, as our output is in the form of 1 and 0 so, we will use the sigmoid activation function.

Now we compile our sequential model and fit the training data into our model.
## Compile the Customer Churn Model

The compilation of the model is the final step of creating an artificial neural model. The compile defines the loss function, the optimizer, and the metrics which we have to give into parameters.

Here we use compile method for compiling the model, we set some parameters into the compile method.

In [28]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(xtrain, ytrain, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x155b9f6ec08>

Now we evaluate our model by this we can observe the summary of the model.

In [29]:
# evalute the model
model.evaluate(xtest,ytest)



[0.522813618183136, 0.723207950592041]

As above we are performing scaling on the data, that’s why our predicted values are scaled so we have to unscale it into normal form for this we write the following program.

In [30]:
# predict the churn values
ypred = model.predict(xtest)
print(ypred)
# unscaling the ypred values 
ypred_lis = []
for i in ypred:
    if i>0.5:
        ypred_lis.append(1)
    else:
        ypred_lis.append(0)
print(ypred_lis)

[[0.29622328]
 [0.6261416 ]
 [0.7748246 ]
 ...
 [0.73006916]
 [0.4502596 ]
 [0.8873215 ]]
[0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 

Here we just create a list of predicted variables, when the scaling values are greater than 0.5 then it will be considered as 1 otherwise it will be considered as 0. We store these values into the list.

At the conclusion we have to differentiate original and predicted values together, so we find that our model predicted true or false.

For that, we combine original values and predicted values together into a dataframe.

In [31]:
#make dataframe for comparing the orignal and predict values
data = {'orignal_churn':ytest, 'predicted_churn':ypred_lis}
df_check = pd.DataFrame(data)
df_check.head(10)

Unnamed: 0,orignal_churn,predicted_churn
6418,0,0
1948,1,1
4497,0,1
66,0,1
1705,0,1
924,0,1
1051,0,0
7012,0,0
3723,0,0
4590,0,0


ou can easily observe that the original and model predicted values of each customer.
## Performance Matrices

This is used in the classification problems, and the customer churn is also a classification problem so we use performance metrics for checking the model behavior.

At the last, we have to predict the churn which is in the form of 0 and 1 means it was a classification problem, and the performance of the classification problem is observed with the performance metrics.

There are many types of performance metrics for checking the performance of the model but we use the confucion_metrix and classification_report.

In [42]:
# checking for performance metrices
#importing classification_report and confusion metrics
from sklearn.metrics import confusion_matrix, classification_report
#print classification_report
print(classification_report(ytest,ypred_lis))

              precision    recall  f1-score   support

           0       0.89      0.73      0.80      1066
           1       0.46      0.72      0.56       343

    accuracy                           0.72      1409
   macro avg       0.67      0.72      0.68      1409
weighted avg       0.78      0.72      0.74      1409



In [41]:
print(confusion_matrix(ytest,ypred_lis))

[[773 293]
 [ 97 246]]


It can be concluded the confusion matrix shows us that the 316 predictions have been done correctly and that there are only 390 incorrect predictions.