# Churn Prediction

Basic info about project - to be added

**Customer churn** refers to the phenomenon where customers stop doing business with a company or stop using a service. It's a critical metric for businesses, especially those in subscription-based industries, as it directly impacts revenue and growth potential.

Churn rate is calculated with the following formula:



$$ \text{Churn Rate} = \frac{\text{Lost Customers}}{\text{Total Customers at the start of the period}} \times 100 $$
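As a small worked example of the formula above, the sketch below computes the churn rate for one period. The function name `churn_rate` and the figures are hypothetical and chosen only for illustration.

```python
def churn_rate(lost_customers: int, customers_at_start: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return lost_customers / customers_at_start * 100


# Hypothetical figures: 1,000 customers at the start of the month, 50 lost.
rate = churn_rate(lost_customers=50, customers_at_start=1000)
print(f"Churn rate: {rate:.1f}%")  # Churn rate: 5.0%
```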


# Data Exploration and Data Cleaning

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from google.colab import files

In [61]:
#load dataset
data = pd.read_csv("customer_churn_data.csv")
#Explore presented data
data.head(10)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,CUST0000,Male,0,No,Yes,23,No,No phone service,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Bank transfer,49.85,1146.55,No
1,CUST0001,Female,0,Yes,No,43,No,No phone service,DSL,Yes,...,Yes,No,Yes,No,Month-to-month,No,Mailed check,100.7,4330.1,Yes
2,CUST0002,Male,1,No,No,51,Yes,No,DSL,No,...,Yes,Yes,No,No,One year,No,Electronic check,97.33,4963.83,Yes
3,CUST0003,Male,1,No,No,72,Yes,Yes,DSL,Yes,...,Yes,No,No,No,Month-to-month,No,Credit card,101.38,7299.36,No
4,CUST0004,Male,1,No,No,25,Yes,Yes,DSL,No,...,No,Yes,No,Yes,Month-to-month,No,Electronic check,52.22,1305.5,Yes
5,CUST0005,Female,0,Yes,No,35,Yes,No,DSL,No,...,No,Yes,Yes,Yes,One year,No,Credit card,116.96,4093.6,No
6,CUST0006,Male,0,Yes,No,17,No,No phone service,DSL,No,...,No,Yes,No,Yes,One year,Yes,Bank transfer,91.53,1556.01,Yes
7,CUST0007,Male,0,Yes,Yes,18,Yes,No,DSL,No,...,Yes,Yes,No,Yes,One year,No,Mailed check,26.52,477.36,Yes
8,CUST0008,Male,0,No,No,27,No,No phone service,DSL,Yes,...,No,Yes,No,No,One year,No,Mailed check,67.77,1829.79,Yes
9,CUST0009,Female,0,No,No,15,No,No phone service,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,No,Electronic check,86.45,1296.75,Yes


In [62]:
# Explore the column names
columnnames=data.columns
print(columnnames)

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')


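Mean normalization

Numeric features with very different ranges should be rescaled so that no single column dominates machine learning models that are sensitive to feature scale.

`Min-Max Scaling` maps each value onto the interval [0, 1] via the following equation:

$$X_{norm} = \frac{X_{i} - X_{min}}{X_{max} - X_{min}}$$

$X_i$ is the $i^{th}$ sample of the dataset, and $X_{min}$ and $X_{max}$ are the column's minimum and maximum.

The code in this notebook instead applies z-score standardization, which subtracts the column mean $\mu$ and divides by the standard deviation $\sigma$:

$$X_{std} = \frac{X_{i} - \mu}{\sigma}$$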
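As a minimal sketch on made-up values (not taken from the dataset), the snippet below contrasts min-max scaling, as in the formula above, with z-score standardization, which subtracts the mean and divides by the standard deviation. The variable names are illustrative assumptions.

```python
import pandas as pd

# Toy values standing in for a numeric column such as "tenure" (not from the dataset).
x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: maps the column onto the interval [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: the result has mean 0 and standard deviation 1.
# Series.std() uses the sample standard deviation (ddof=1) by default.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(x_zscore.round(3).tolist())
```

Min-max scaling keeps values in a fixed interval but is sensitive to outliers, whereas standardized values are unbounded and centered on zero.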


Three numeric columns deserve attention; the cells below contain (commented-out) histogram code for each of them:

"tenure" - the number of months the customer has stayed with the company,

"MonthlyCharges" - the amount charged to the customer each month,

"TotalCharges" - the total amount charged to the customer.


In [63]:
#from matplotlib import pyplot as plt
#_df_1['tenure'].plot(kind='hist', bins=20, title='tenure')
#plt.gca().spines[['top', 'right',]].set_visible(False)

In [64]:
#from matplotlib import pyplot as plt
#_df_2['MonthlyCharges'].plot(kind='hist', bins=20, title='MonthlyCharges')
#plt.gca().spines[['top', 'right',]].set_visible(False)

In [65]:
#from matplotlib import pyplot as plt
#_df_3['TotalCharges'].plot(kind='hist', bins=20, title='TotalCharges')
#plt.gca().spines[['top', 'right',]].set_visible(False)



All three of these columns span a wide range of values, so they are standardized below (mean subtracted, then divided by the standard deviation) so that machine learning algorithms treat them on a comparable scale.

In [66]:
# Standardize "tenure" (z-score: subtract the mean, divide by the standard deviation)
data['tenure'] = (data['tenure'] - data['tenure'].mean()) / data['tenure'].std()

# Standardize "MonthlyCharges"
data['MonthlyCharges'] = (data['MonthlyCharges'] - data['MonthlyCharges'].mean()) / data['MonthlyCharges'].std()

# Standardize "TotalCharges"
data['TotalCharges'] = (data['TotalCharges'] - data['TotalCharges'].mean()) / data['TotalCharges'].std()

# Show the dataset with the standardized columns
data.head(10)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,CUST0000,Male,0,No,Yes,-0.647985,No,No phone service,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Bank transfer,-0.705018,-0.743586,No
1,CUST0001,Female,0,Yes,No,0.30851,No,No phone service,DSL,Yes,...,Yes,No,Yes,No,Month-to-month,No,Mailed check,1.060324,0.923178,Yes
2,CUST0002,Male,1,No,No,0.691108,Yes,No,DSL,No,...,Yes,Yes,No,No,One year,No,Electronic check,0.943329,1.254971,Yes
3,CUST0003,Male,1,No,No,1.695428,Yes,Yes,DSL,Yes,...,Yes,No,No,No,Month-to-month,No,Credit card,1.083931,2.47775,No
4,CUST0004,Male,1,No,No,-0.552335,Yes,Yes,DSL,No,...,No,Yes,No,Yes,Month-to-month,No,Electronic check,-0.62274,-0.660367,Yes
5,CUST0005,Female,0,Yes,No,-0.074088,Yes,No,DSL,No,...,No,Yes,Yes,Yes,One year,No,Credit card,1.624817,0.799357,No
6,CUST0006,Male,0,Yes,No,-0.934933,No,No phone service,DSL,No,...,No,Yes,No,Yes,One year,Yes,Bank transfer,0.741972,-0.529211,Yes
7,CUST0007,Male,0,Yes,Yes,-0.887109,Yes,No,DSL,No,...,Yes,Yes,No,Yes,One year,No,Mailed check,-1.514958,-1.093944,Yes
8,CUST0008,Male,0,No,No,-0.456686,No,No phone service,DSL,Yes,...,No,Yes,No,No,One year,No,Mailed check,-0.082896,-0.385872,Yes
9,CUST0009,Female,0,No,No,-1.030583,No,No phone service,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,No,Electronic check,0.565611,-0.664948,Yes





Label Encoding:

Label encoding, like the scaling step above, prepares the data for machine learning algorithms: it replaces categorical text values with numbers.  
In the code below, binary columns are mapped to 1 and 0: "Yes" becomes 1 and every other value (including "No phone service" and "No internet service") becomes 0.
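The mapping can be sketched on a toy frame as follows. The helper `encode_yes_no` is a hypothetical name introduced here, and the rows are made up; the guard for numeric columns is an assumption added so that a column already stored as 0/1 is not overwritten.

```python
import pandas as pd

# Toy frame with made-up rows; the column names mirror the dataset.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "MultipleLines": ["No phone service", "Yes", "No"],
    "SeniorCitizen": [0, 1, 1],  # already numeric
})


def encode_yes_no(column: pd.Series) -> pd.Series:
    """Map 'Yes' to 1 and any other text to 0; return numeric columns unchanged."""
    if pd.api.types.is_numeric_dtype(column):
        return column
    return (column == "Yes").astype(int)


for name in df.columns:
    df[name] = encode_yes_no(df[name])

print(df)
```

Without the numeric check, comparing an integer column with `'Yes'` would turn every value into 0.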

In [67]:
# Label-encode the binary columns: 'Yes' -> 1, any other value -> 0.
data['Churn']=data['Churn'].apply(lambda x:1 if x=='Yes' else 0)
data['gender']=data['gender'].apply(lambda x:1 if x=='Female' else 0) #"Female" value has been mapped to 1 and "male" value to 0
data['Partner']=data['Partner'].apply(lambda x:1 if x=='Yes' else 0)
data['Dependents']=data['Dependents'].apply(lambda x:1 if x=='Yes' else 0)
data['PhoneService']=data['PhoneService'].apply(lambda x:1 if x=='Yes' else 0)
data['MultipleLines']=data['MultipleLines'].apply(lambda x:1 if x=='Yes' else 0)
data['OnlineSecurity']=data['OnlineSecurity'].apply(lambda x:1 if x=='Yes' else 0)
data['OnlineBackup']=data['OnlineBackup'].apply(lambda x:1 if x=='Yes' else 0)
data['DeviceProtection']=data['DeviceProtection'].apply(lambda x:1 if x=='Yes' else 0)
data['TechSupport']=data['TechSupport'].apply(lambda x:1 if x=='Yes' else 0)
data['StreamingTV']=data['StreamingTV'].apply(lambda x:1 if x=='Yes' else 0)
data['StreamingMovies']=data['StreamingMovies'].apply(lambda x:1 if x=='Yes' else 0)
data['PaperlessBilling']=data['PaperlessBilling'].apply(lambda x:1 if x=='Yes' else 0)
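# Caution: SeniorCitizen is already stored as 0/1 integers, so comparing it with 'Yes' sets every row to 0 (visible in the output below). Omit this line to keep the original values.
data['SeniorCitizen']=data['SeniorCitizen'].apply(lambda x:1 if x=='Yes' else 0)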


In [68]:
data.head(10)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,CUST0000,0,0,0,1,-0.647985,0,0,No,0,...,0,0,0,0,Month-to-month,1,Bank transfer,-0.705018,-0.743586,0
1,CUST0001,1,0,1,0,0.30851,0,0,DSL,1,...,1,0,1,0,Month-to-month,0,Mailed check,1.060324,0.923178,1
2,CUST0002,0,0,0,0,0.691108,1,0,DSL,0,...,1,1,0,0,One year,0,Electronic check,0.943329,1.254971,1
3,CUST0003,0,0,0,0,1.695428,1,1,DSL,1,...,1,0,0,0,Month-to-month,0,Credit card,1.083931,2.47775,0
4,CUST0004,0,0,0,0,-0.552335,1,1,DSL,0,...,0,1,0,1,Month-to-month,0,Electronic check,-0.62274,-0.660367,1
5,CUST0005,1,0,1,0,-0.074088,1,0,DSL,0,...,0,1,1,1,One year,0,Credit card,1.624817,0.799357,0
6,CUST0006,0,0,1,0,-0.934933,0,0,DSL,0,...,0,1,0,1,One year,1,Bank transfer,0.741972,-0.529211,1
7,CUST0007,0,0,1,1,-0.887109,1,0,DSL,0,...,1,1,0,1,One year,0,Mailed check,-1.514958,-1.093944,1
8,CUST0008,0,0,0,0,-0.456686,0,0,DSL,1,...,0,1,0,0,One year,0,Mailed check,-0.082896,-0.385872,1
9,CUST0009,1,0,0,0,-1.030583,0,0,No,0,...,0,0,0,0,One year,0,Electronic check,0.565611,-0.664948,1


One hot encoding

One-hot encoding is another way to prepare categorical data for machine learning. For example, the "InternetService" column has three possible values (No, DSL, Fiber optic). We drop the "InternetService" column and create three new columns, "No", "DSL" and "Fiber optic", each holding 1 or 0 depending on whether the customer uses that internet service.
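For comparison, pandas also provides `pd.get_dummies` for this step. The sketch below uses a made-up four-row frame and is an alternative to the notebook's manual loop, not the notebook's own method; the `prefix` argument is used so the generated column names state which feature they came from.

```python
import pandas as pd

# Toy frame with made-up rows; only the column name mirrors the dataset.
df = pd.DataFrame({"InternetService": ["No", "DSL", "Fiber optic", "DSL"]})

# One 0/1 indicator column per category; the original column is dropped automatically.
encoded = pd.get_dummies(
    df, columns=["InternetService"], prefix="InternetService", dtype=int
)

print(encoded.columns.tolist())
# ['InternetService_DSL', 'InternetService_Fiber optic', 'InternetService_No']
```

Prefixed names avoid ambiguous columns such as a bare "No", which could otherwise collide with another feature's category.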

In [69]:
data['InternetService'].value_counts()

InternetService
No             2029
DSL            1936
Fiber optic    1915
Name: count, dtype: int64

In [70]:
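# Create one 0/1 indicator column per distinct InternetService value
for x in data['InternetService'].value_counts().keys():
    data[x] = data['InternetService'].apply(lambda d: 1 if d == x else 0)
# Drop the original column now that it has been encoded
data.drop(columns=['InternetService'], inplace=True)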
data.head(10)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,...,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,No,DSL,Fiber optic
0,CUST0000,0,0,0,1,-0.647985,0,0,0,0,...,0,Month-to-month,1,Bank transfer,-0.705018,-0.743586,0,1,0,0
1,CUST0001,1,0,1,0,0.30851,0,0,1,0,...,0,Month-to-month,0,Mailed check,1.060324,0.923178,1,0,1,0
2,CUST0002,0,0,0,0,0.691108,1,0,0,1,...,0,One year,0,Electronic check,0.943329,1.254971,1,0,1,0
3,CUST0003,0,0,0,0,1.695428,1,1,1,0,...,0,Month-to-month,0,Credit card,1.083931,2.47775,0,0,1,0
4,CUST0004,0,0,0,0,-0.552335,1,1,0,0,...,1,Month-to-month,0,Electronic check,-0.62274,-0.660367,1,0,1,0
5,CUST0005,1,0,1,0,-0.074088,1,0,0,0,...,1,One year,0,Credit card,1.624817,0.799357,0,0,1,0
6,CUST0006,0,0,1,0,-0.934933,0,0,0,0,...,1,One year,1,Bank transfer,0.741972,-0.529211,1,0,1,0
7,CUST0007,0,0,1,1,-0.887109,1,0,0,1,...,1,One year,0,Mailed check,-1.514958,-1.093944,1,0,1,0
8,CUST0008,0,0,0,0,-0.456686,0,0,1,0,...,0,One year,0,Mailed check,-0.082896,-0.385872,1,0,1,0
9,CUST0009,1,0,0,0,-1.030583,0,0,0,0,...,0,One year,0,Electronic check,0.565611,-0.664948,1,1,0,0


In [71]:
# Apply the same transformation to the "Contract" column: each of its three values
# (Month-to-month, One year, Two year) becomes a separate column of binary values.
for x in data['Contract'].value_counts().keys():
    data[x]=data['Contract'].apply(lambda d: 1 if d==x else 0)

#drop the "Contract" column
data.drop(columns=['Contract'], inplace=True)

In [72]:
# Apply one-hot encoding to the "PaymentMethod" column as well
for x in data['PaymentMethod'].value_counts().keys():
    data[x]=data['PaymentMethod'].apply(lambda d: 1 if d==x else 0)

#dropping "PaymentMethod" column
data.drop(columns=['PaymentMethod'], inplace=True)

#present updated data
data.head(10)


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,OnlineSecurity,OnlineBackup,...,No,DSL,Fiber optic,Month-to-month,One year,Two year,Credit card,Electronic check,Mailed check,Bank transfer
0,CUST0000,0,0,0,1,-0.647985,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
1,CUST0001,1,0,1,0,0.30851,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
2,CUST0002,0,0,0,0,0.691108,1,0,0,1,...,0,1,0,0,1,0,0,1,0,0
3,CUST0003,0,0,0,0,1.695428,1,1,1,0,...,0,1,0,1,0,0,1,0,0,0
4,CUST0004,0,0,0,0,-0.552335,1,1,0,0,...,0,1,0,1,0,0,0,1,0,0
5,CUST0005,1,0,1,0,-0.074088,1,0,0,0,...,0,1,0,0,1,0,1,0,0,0
6,CUST0006,0,0,1,0,-0.934933,0,0,0,0,...,0,1,0,0,1,0,0,0,0,1
7,CUST0007,0,0,1,1,-0.887109,1,0,0,1,...,0,1,0,0,1,0,0,0,1,0
8,CUST0008,0,0,0,0,-0.456686,0,0,1,0,...,0,1,0,0,1,0,0,0,1,0
9,CUST0009,1,0,0,0,-1.030583,0,0,0,0,...,1,0,0,0,1,0,0,1,0,0


All feature columns have now been converted to numeric values (only the "customerID" identifier remains text), and "tenure", "TotalCharges" and "MonthlyCharges" have been standardized.
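One way to check this is to list the columns whose dtype is still non-numeric. The sketch below runs on a made-up two-row frame; on the real data the same call would be applied to `data`.

```python
import pandas as pd

# Toy frame with made-up rows; the column names mirror the dataset.
df = pd.DataFrame({
    "customerID": ["CUST0000", "CUST0001"],
    "tenure": [-0.65, 0.31],
    "Churn": [0, 1],
})

# Columns that are not numeric and would still need encoding (or exclusion).
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['customerID']
```

An identifier column such as "customerID" is expected to show up here and would normally be left out of the model features rather than encoded.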

In [73]:
# Save the processed dataset to a new CSV file
data.to_csv('formated_customer_churn_data.csv')

In [74]:
# The following line downloads a copy of the newly created dataset (formated_customer_churn_data.csv) to your local machine for backup purposes. To store it locally, remove the '#' from the following code line:

#files.download('formated_customer_churn_data.csv')


# Feature Selection and Correlation Matrix