<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/machine-learning-bookcamp/3_churn_prediction_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Churn prediction project

Churn is when customers stop using the services of a company. Thus, churn prediction
is about identifying customers who are likely to cancel their contracts soon.

If the company can do that, it can offer discounts on these services in an effort to
keep the users.

Imagine that we are working at a telecom company that offers phone and internet
services, and we have a problem: some of our customers are churning. They are no
longer using our services and are moving to a different provider.

We would like to prevent that from happening, so we develop a system for
identifying these customers and offering them an incentive to stay.

We want to target them with promotional messages and give them a discount. We
would also like to understand why the model thinks our customers churn, and for
that, we need to be able to interpret the model’s predictions.

##Setup

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

import seaborn as sns
from matplotlib import pyplot as plt
from IPython.display import display

%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
# /content/gdrive/MyDrive/kaggle-keys is the Google Drive folder that holds kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [None]:
%%shell

# Download the dataset from Kaggle. URL: https://www.kaggle.com/blastchar/telco-customer-churn
kaggle datasets download -d blastchar/telco-customer-churn

unzip -qq telco-customer-churn.zip
rm -rf telco-customer-churn.zip

Downloading telco-customer-churn.zip to /content
  0% 0.00/172k [00:00<?, ?B/s]
100% 172k/172k [00:00<00:00, 51.6MB/s]




##Dataset

In [None]:
# let’s read our dataset
data_df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
len(data_df)

7043

In [None]:
data_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [None]:
data_df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [None]:
data_df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

###Initial data preparation

In [None]:
# convert TotalCharges to numbers; errors="coerce" turns nonnumeric entries (such as spaces) into NaN
total_charges = pd.to_numeric(data_df.TotalCharges, errors="coerce")

# confirm that data indeed contains nonnumeric characters
data_df[total_charges.isnull()][["customerID", "TotalCharges"]]

Unnamed: 0,customerID,TotalCharges
488,4472-LVYGI,
753,3115-CZMZD,
936,5709-LVOEQ,
1082,4367-NUYAO,
1340,1371-DWPAZ,
3331,7644-OMVMY,
3826,3213-VVOLG,
4380,2520-SGTTA,
5218,2923-ARZLG,
6670,4075-WKNIU,


In [None]:
# so, let's set the missing values to zero
data_df["TotalCharges"] = pd.to_numeric(data_df.TotalCharges, errors="coerce")
data_df["TotalCharges"] = data_df["TotalCharges"].fillna(0)
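
The coercion step can be tried in isolation. The following is a minimal sketch on a made-up series (the values and the names `raw`, `parsed`, and `cleaned` are illustrative assumptions, not part of the dataset); it shows a blank string becoming `NaN` and then zero.

```python
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as strings, plus one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" replaces anything that cannot be parsed with NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")

# Fill the resulting NaN values with zero, mirroring the cell above
cleaned = parsed.fillna(0)

print(parsed.isnull().sum())  # number of entries that could not be parsed
print(cleaned.tolist())
```

Filling with zero is a pragmatic choice here; whether it is appropriate for other columns depends on what a missing value means in that column.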

In [None]:
# Let’s make the column names uniform by lowercasing everything and replacing spaces with underscores
data_df.columns = data_df.columns.str.lower().str.replace(" ", "_")
string_columns = list(data_df.dtypes[data_df.dtypes == "object"].index)

for col in string_columns:
  data_df[col] = data_df[col].str.lower().str.replace(" ", "_")

In [None]:
# let’s look at our target variable
data_df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

In [None]:
# converting to Boolean
(data_df.churn == "yes").head()

0    False
1    False
2     True
3    False
4     True
Name: churn, dtype: bool

In [None]:
# converting the Boolean to integer
(data_df.churn == "yes").astype(int).head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

In [None]:
# so, let’s convert the target variable to numbers
data_df.churn = (data_df.churn == "yes").astype(int)

data_df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


Let's split the dataset.

In [None]:
# split such that 80% of the data goes to the train set and the remaining 20% goes to the test set.
df_train_full, df_test = train_test_split(data_df, test_size=0.2, random_state=1)

In [None]:
df_train_full.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
1814,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.7,258.35,0
5946,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,...,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.9,3160.55,1
3881,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,...,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
2389,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,...,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
3676,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,...,no,yes,yes,no,one_year,no,electronic_check,70.4,2044.75,0


In [None]:
# let's split it one more time into train and validation
df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

In [None]:
# Takes the column with the target variable, churn, and saves it outside the dataframe
y_train = df_train.churn.values
y_val = df_val.churn.values

In [None]:
# Deletes the churn columns
del df_train["churn"]
del df_val["churn"]
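
Because the split is done in two steps, the second `test_size` applies to the 80% that is left, not to the whole dataset. The sketch below repeats both calls on a hypothetical stand-in array of 7,043 row indices (the names `rows`, `full_part`, `train_part`, `val_part`, and `test_part` are illustrative) to show the share of the original data that each part ends up with.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataset: one index per row
rows = np.arange(7043)

# Same two-step split as above, with the same parameters
full_part, test_part = train_test_split(rows, test_size=0.2, random_state=1)
train_part, val_part = train_test_split(full_part, test_size=0.33, random_state=11)

# Share of the original rows that lands in each part
parts = {"train": train_part, "val": val_part, "test": test_part}
shares = {name: round(len(part) / len(rows), 2) for name, part in parts.items()}
print(shares)
```

Under these settings the parts are roughly 54%, 26%, and 20% of the original data.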

###Exploratory data analysis

We should always check for any missing values in the dataset because many machine
learning models cannot easily deal with missing data.

In [None]:
# let’s check whether any column contains missing values
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [None]:
# let's check the distribution of values in the target variable
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

We know the absolute numbers, but let’s also check the proportion of churned
users among all customers.

For that, we need to divide the number of customers who churned by the total
number of customers. We know that 1,521 of 5,634 churned, so the proportion is:

$$1521 / 5634 =  0.27$$

This gives us the proportion of churned users, or the probability that a customer will
churn. As we see in the training dataset, approximately `27%` of the customers stopped
using our services, and the rest remained as customers.

The proportion of churned users, or the probability of churning, has a special
name: churn rate.
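
Since the target is encoded as zeros and ones, the churn rate is simply the mean of the column. A minimal sketch on a made-up array (the values and variable names are illustrative assumptions) shows that counting and averaging agree:

```python
import numpy as np

# Toy target: 1 means the customer churned, 0 means they stayed
churn = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# Proportion computed by counting churned customers
rate_by_count = (churn == 1).sum() / len(churn)

# The mean of a 0/1 array is the same proportion
rate_by_mean = churn.mean()

print(rate_by_count, rate_by_mean)
```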

In [None]:
# let's calculate the churn rate
global_mean = df_train_full.churn.mean()
round(global_mean, 3)

0.27

Our churn dataset is an example of a so-called imbalanced dataset.

We can clearly see that: the churn rate in our data is `0.27`, so customers who stayed outnumber those who churned by almost three to one (4,113 versus 1,521).



In [None]:
# let's create two lists for categorical and numerical variables
categorical_cols = [
  'gender', 'seniorcitizen', 'partner', 'dependents',
  'phoneservice', 'multiplelines', 'internetservice',
  'onlinesecurity', 'onlinebackup', 'deviceprotection',
  'techsupport', 'streamingtv', 'streamingmovies',
  'contract', 'paperlessbilling', 'paymentmethod'
]

numerical_cols = ['tenure', 'monthlycharges', 'totalcharges']

In [None]:
# First, we can see how many unique values each variable has
df_train_full[categorical_cols].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

###Feature importance

Knowing how other variables affect the target variable, churn, is the key to understanding
the data and building a good model. This process is called feature importance
analysis.

We have two different kinds of features: categorical and numerical. Each kind has
different ways of measuring feature importance, so we will look at each separately.

####Churn rate

We can look at all the distinct values of a variable. Then, for each value, there’s a
group of customers: all the customers who have this value.

For each such group, we can compute the churn rate, which is the group churn rate.

When we have it, we can compare it with the global churn rate: the churn rate
calculated for all the observations at once.

Let’s check first for the gender variable.



In [None]:
female_mean = df_train_full[df_train_full.gender == "female"].churn.mean()
print(f"gender == female: {round(female_mean, 3) * 100}")

male_mean = df_train_full[df_train_full.gender == "male"].churn.mean()
print(f"gender == male: {round(male_mean, 3) * 100}")

print(f"global mean: {round(global_mean, 3) * 100}")

gender == female: 27.700000000000003
gender == male: 26.3
global mean: 27.0


In [None]:
female_mean / global_mean

1.0253955354648652

In [None]:
male_mean / global_mean

0.9749802969838747

The difference between the group rates for both females
and males is quite small, which indicates that knowing the gender of the customer
doesn’t help us identify whether they will churn.

Now let’s take a look at another variable: partner.

In [None]:
partner_yes = df_train_full[df_train_full.partner == "yes"].churn.mean()
print(f"partner == yes: {round(partner_yes, 3) * 100}")

partner_no = df_train_full[df_train_full.partner == "no"].churn.mean()
print(f"partner == no: {round(partner_no, 3) * 100}")

print(f"global mean: {round(global_mean, 3) * 100}")

partner == yes: 20.5
partner == no: 33.0
global mean: 27.0


In [None]:
partner_yes / global_mean

0.7594724924338315

In [None]:
partner_no / global_mean

1.2216593879412643

As we see, the rates for those who have a partner are quite different from the rates for
those who don’t: `20.5%` and `33%`, respectively.

This means that clients with no partner are more likely to churn than the ones with a partner.

####Risk ratio

In statistics, the ratio between probabilities in different groups is called the
risk ratio, where risk refers to the risk of having the effect.

In our case, the effect is churn, so it’s the risk of churning:

`risk = group rate / global rate`

For gender == female, for example, the risk of churning is about 1.03:

`risk = 27.7% / 27% ≈ 1.03`

Risk is a number between zero and infinity. It has a nice interpretation that tells you
how likely the elements of the group are to have the effect (churn) compared with the
entire population.
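
To make the interpretation concrete, here is a minimal sketch on a made-up eight-row table (the data and the names `toy`, `global_rate`, `group_rate`, and `risk` are illustrative assumptions). A risk above 1 marks a group that churns more than average, and a risk below 1 marks a group that churns less.

```python
import pandas as pd

# Toy data: four monthly and four yearly customers
toy = pd.DataFrame({
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "churn":    [1, 1, 1, 0, 0, 0, 0, 1],
})

global_rate = toy.churn.mean()                      # 4 of 8 churned
group_rate = toy.groupby("contract").churn.mean()   # churn rate per contract type
risk = group_rate / global_rate                     # risk = group rate / global rate

print(risk)
```

In this toy table the monthly group has a risk of 1.5 and the yearly group a risk of 0.5.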

Let’s calculate the risks for gender and partner.

In [None]:
global_mean = df_train_full.churn.mean()

df_group = df_train_full.groupby(by="gender").churn.agg(["mean"])  # calculate the AVG(churn) part
df_group["diff"] = df_group["mean"] - global_mean                  # Calculates the difference between group churn rate and global rate
df_group["risk"] = df_group["mean"] / global_mean                  # Calculates the risk ratio of the group churn rate to the global rate

df_group

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


Let’s now do that for all categorical variables.

In [None]:
for col in categorical_cols:
  df_group = df_train_full.groupby(by=col).churn.agg(["mean"])  # calculate the AVG(churn) part
  df_group["diff"] = df_group["mean"] - global_mean             # Calculates the difference between group churn rate and global rate
  df_group["risk"] = df_group["mean"] / global_mean             # Calculates the risk ratio of the group churn rate to the global rate
  display(df_group)

gender,mean,diff,risk
female,0.276824,0.006856,1.025396
male,0.263214,-0.006755,0.97498


seniorcitizen,mean,diff,risk
0,0.24227,-0.027698,0.897403
1,0.413377,0.143409,1.531208


partner,mean,diff,risk
no,0.329809,0.059841,1.221659
yes,0.205033,-0.064935,0.759472


dependents,mean,diff,risk
no,0.31376,0.043792,1.162212
yes,0.165666,-0.104302,0.613651


phoneservice,mean,diff,risk
no,0.241316,-0.028652,0.89387
yes,0.273049,0.003081,1.011412


multiplelines,mean,diff,risk
no,0.257407,-0.012561,0.953474
no_phone_service,0.241316,-0.028652,0.89387
yes,0.290742,0.020773,1.076948


internetservice,mean,diff,risk
dsl,0.192347,-0.077621,0.712482
fiber_optic,0.425171,0.155203,1.574895
no,0.077805,-0.192163,0.288201


onlinesecurity,mean,diff,risk
no,0.420921,0.150953,1.559152
no_internet_service,0.077805,-0.192163,0.288201
yes,0.153226,-0.116742,0.56757


onlinebackup,mean,diff,risk
no,0.404323,0.134355,1.497672
no_internet_service,0.077805,-0.192163,0.288201
yes,0.217232,-0.052736,0.80466


deviceprotection,mean,diff,risk
no,0.395875,0.125907,1.466379
no_internet_service,0.077805,-0.192163,0.288201
yes,0.230412,-0.039556,0.85348


techsupport,mean,diff,risk
no,0.418914,0.148946,1.551717
no_internet_service,0.077805,-0.192163,0.288201
yes,0.159926,-0.110042,0.59239


streamingtv,mean,diff,risk
no,0.342832,0.072864,1.269897
no_internet_service,0.077805,-0.192163,0.288201
yes,0.302723,0.032755,1.121328


streamingmovies,mean,diff,risk
no,0.338906,0.068938,1.255358
no_internet_service,0.077805,-0.192163,0.288201
yes,0.307273,0.037305,1.138182


contract,mean,diff,risk
month-to-month,0.431701,0.161733,1.599082
one_year,0.120573,-0.149395,0.446621
two_year,0.028274,-0.241694,0.10473


paperlessbilling,mean,diff,risk
no,0.172071,-0.097897,0.637375
yes,0.338151,0.068183,1.25256


paymentmethod,mean,diff,risk
bank_transfer_(automatic),0.168171,-0.101797,0.622928
credit_card_(automatic),0.164339,-0.10563,0.608733
electronic_check,0.45589,0.185922,1.688682
mailed_check,0.19387,-0.076098,0.718121


This way, just by looking at the differences and the risks, we can identify the most
discriminative features: the features that are helpful for detecting churn.

Thus, we expect that these features will be useful for our future models.
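
Reading many small tables by eye gets tedious, so one option is to condense each feature into a single number. The sketch below uses the gap between the highest and lowest group risk as that number; this "risk spread" is an ad hoc summary assumed here for illustration, not a standard statistic, and the data and names (`toy_df`, `spread`, `ranking`) are made up.

```python
import pandas as pd

# Toy data: contract is related to churn, gender is not
toy_df = pd.DataFrame({
    "gender":   ["f", "m", "f", "m", "f", "m", "f", "m"],
    "contract": ["monthly"] * 4 + ["yearly"] * 4,
    "churn":    [1, 1, 1, 0, 0, 0, 0, 1],
})

overall_rate = toy_df.churn.mean()

# For each feature, compute the risk per group and keep the max-min gap
spread = {}
for col in ["gender", "contract"]:
    group_risk = toy_df.groupby(col).churn.mean() / overall_rate
    spread[col] = group_risk.max() - group_risk.min()

# Features with the widest gap separate churners from non-churners best
ranking = pd.Series(spread).sort_values(ascending=False)
print(ranking)
```

A wide spread suggests the feature separates customers well, although small groups can produce extreme risks by chance, so group sizes are worth checking too.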

####Mutual information

Higher values of mutual information mean a higher degree of dependence: if the
mutual information between a categorical variable and the target is high, this categorical
variable will be quite useful for predicting the target. 

On the other hand, if the
mutual information is low, the categorical variable and the target are independent,
and thus the variable will not be useful for predicting the target.
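
The two extremes can be reproduced on made-up data. In this minimal sketch (the lists and variable names are illustrative assumptions), one variable fully determines the target and the other is unrelated to it; `mutual_info_score` reports the result in nats, that is, using the natural logarithm.

```python
from sklearn.metrics import mutual_info_score

# Toy binary target
target = [0, 0, 1, 1, 0, 0, 1, 1]

# "a" always goes with 0 and "b" always goes with 1: full dependence
dependent = ["a", "a", "b", "b", "a", "a", "b", "b"]

# "x" and "y" each appear equally often with 0 and with 1: no dependence
unrelated = ["x", "y", "x", "y", "x", "y", "x", "y"]

mi_dependent = mutual_info_score(dependent, target)
mi_unrelated = mutual_info_score(unrelated, target)

print(mi_dependent, mi_unrelated)
```

For a balanced binary target, the largest possible value is ln 2 (about 0.69), which the fully dependent variable reaches here, while the unrelated one scores zero.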

In [None]:
def calculate_mi(series):
  return mutual_info_score(series, df_train_full.churn)

In [None]:
# Applies the function to each categorical column of the dataset
df_mi = df_train_full[categorical_cols].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name="MI")

display(df_mi.head())
display(df_mi.tail())

Unnamed: 0,MI
contract,0.09832
onlinesecurity,0.063085
techsupport,0.061032
internetservice,0.055868
onlinebackup,0.046923


Unnamed: 0,MI
partner,0.009968
seniorcitizen,0.00941
multiplelines,0.000857
phoneservice,0.000229
gender,0.000117


As we see, `contract`, `onlinesecurity`, and `techsupport` are among the most
important features.

Indeed, we’ve already noted that `contract` and
`techsupport` are quite informative. 

It’s also not surprising that `gender` is among the least important features, so we shouldn’t expect it to be useful for the model.

####Correlation coefficient

Mutual information is a way to quantify the degree of dependency between two categorical
variables, but it doesn’t work when one of the features is numerical, so we cannot
apply it to the three numerical variables that we have.

We can, however, measure the dependency between a binary target variable and a
numerical variable. We can pretend that the binary variable is numerical (containing
only the numbers zero and one) and then use the classical methods from statistics to
check for any dependency between these variables.

One such method is the correlation coefficient. It is a value from –1 to 1.

* Positive correlation means that when one variable goes up, the other variable tends to go up as well.
* Zero correlation means no linear relationship between two variables: knowing one does not help predict the other with a straight line.
* Negative correlation occurs when one variable goes up and the other tends to go down.
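
The three cases in the list above can be sketched with NumPy on made-up arrays (all names and values here are illustrative assumptions). The last array is a reminder that a coefficient of zero rules out only a linear relationship: it depends on `x` exactly, yet its correlation with `x` is zero.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

rising = 2 * x + 1    # goes up with x
falling = -3 * x      # goes down as x goes up
curved = x ** 2       # depends on x, but not linearly

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
r_rising = np.corrcoef(x, rising)[0, 1]
r_falling = np.corrcoef(x, falling)[0, 1]
r_curved = np.corrcoef(x, curved)[0, 1]

print(r_rising, r_falling, r_curved)
```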



In [None]:
df_train_full[numerical_cols].corrwith(df_train_full.churn).to_frame("correlation")

Unnamed: 0,correlation
tenure,-0.351885
monthlycharges,0.196805
totalcharges,-0.196353


* The correlation between `tenure` and churn is `–0.35`: it has a negative sign, so the longer customers stay, the less often they tend to churn.
* `monthlycharges` has a positive coefficient of `0.20`, which means that customers who pay more tend to leave more often.
* `totalcharges` has a negative correlation, which makes sense: the longer people stay with the company, the more they have paid in total, so it’s less likely that they will leave.

In [None]:
df_train_full.groupby(by="churn")[numerical_cols].mean()

churn,tenure,monthlycharges,totalcharges
0,37.531972,61.176477,2548.021627
1,18.070348,74.521203,1545.689415


##Feature engineering

Before we proceed to training, however, we need to perform the feature engineering
step: transforming all categorical variables to numeric features.

Let's use one-hot encoding for categorical variables.

In [None]:
# convert dataframe to a list of dictionaries
train_dict = df_train[categorical_cols + numerical_cols].to_dict(orient="records")
train_dict[0]

Now we can use DictVectorizer for converting the dictionaries to a matrix.

If a feature is categorical, it applies the one-hot encoding
scheme, but if a feature is numerical, it’s left intact.

In [None]:
# convert the list of dictionaries to matrix
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
x_train = dv.transform(train_dict)

print(f"Old shape: {df_train.shape}")
print(f"New shape: {x_train.shape}")
x_train[0]

In [None]:
# let's see the names of all these columns
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
dv.get_feature_names_out()

As we see, for each categorical feature it creates one column for each of its distinct values.

Features such as `tenure` and `totalcharges` keep the original names because
they are numerical; therefore, `DictVectorizer` doesn’t change them.

##Classification model