# Machine Learning for Classification

## 3.1 Churn Prediction Project

**Binary classification** 

$$y_i = g(x_i)$$

$y_i \in \{0, 1\}$, where 1 means the customer churned and 0 means the customer did not churn.
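
As a small illustration of the target encoding, the sketch below maps the `Churn` labels of the first five rows of the dataset (`No, No, Yes, No, Yes`, as displayed by `head()` in the next section) to a binary target. It assumes the encoding used by the code in this notebook, where churn is 1; the variable names are illustrative only.

```python
# Churn labels of the first five rows of the Telco dataset, as displayed by head().
labels = ["No", "No", "Yes", "No", "Yes"]

# Encode the target: 1 if the customer churned, 0 otherwise.
y = [int(label.lower() == "yes") for label in labels]
print(y)  # [0, 0, 1, 0, 1]
```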

In [1]:
import os

import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt

## 3.2 Data Preparation

In [32]:
TELCO_CHURN_DATASET = "./dataset/telco_customer_churn.csv"
telco_churn_df = pd.read_csv(TELCO_CHURN_DATASET)

In [33]:
telco_churn_df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [34]:
print(telco_churn_df.columns)

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')


In [35]:
print(telco_churn_df.shape)

(7043, 21)


In [36]:
telco_churn_df.head().T 

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [37]:
telco_churn_df.columns = telco_churn_df.columns.str.lower().str.replace(" ", "_")

In [38]:
categorical_columns = list(telco_churn_df.dtypes[telco_churn_df.dtypes == "object"].index)

In [39]:
for c in categorical_columns:
    telco_churn_df[c] = telco_churn_df[c].str.lower().str.replace(" ", "_")

In [40]:
print(categorical_columns)

['customerid', 'gender', 'partner', 'dependents', 'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod', 'totalcharges', 'churn']


In [41]:
telco_churn_df.head().T 

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |


In [42]:
tc = pd.to_numeric(telco_churn_df.totalcharges, errors="coerce")

In [43]:
tc.isnull().sum()

11
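
The count above comes from values that cannot be parsed as numbers. Below is a minimal sketch of how `errors="coerce"` behaves, using a made-up series in which `_` stands in for an unparseable entry (an assumption about what the bad values look like after the string cleanup, not a fact taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: two valid numeric strings and one placeholder string.
sample = pd.Series(["29.85", "_", "1889.5"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
parsed = pd.to_numeric(sample, errors="coerce")

n_missing = int(parsed.isnull().sum())  # number of values that failed to parse
filled = parsed.fillna(0)               # replace NaN with 0
print(n_missing, filled.tolist())
```

Filling with 0 is a simple choice that keeps every row; whether 0 is a sensible stand-in for a missing total depends on why the value was missing.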

In [44]:
telco_churn_df["totalcharges"] = pd.to_numeric(telco_churn_df["totalcharges"], errors="coerce")
telco_churn_df["totalcharges"] = telco_churn_df["totalcharges"].fillna(0)

In [45]:
telco_churn_df["totalcharges"].isnull().sum()

0

In [46]:
telco_churn_df["totalcharges"].dtype 

dtype('float64')

In [47]:
telco_churn_df["churn"] = (telco_churn_df["churn"].str.lower() == "yes").astype(int)

In [48]:
telco_churn_df["churn"].value_counts()

churn
0    5174
1    1869
Name: count, dtype: int64

## 3.3 Setting up the Validation Framework

Split the dataset into training, validation, and test sets in a 60%/20%/20% ratio.

In [49]:
from sklearn.model_selection import train_test_split 

In [60]:
df_full_train, df_test = train_test_split(telco_churn_df, test_size=0.2, random_state=42)

In [61]:
len(df_full_train), len(df_test)

(5634, 1409)

In [62]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [63]:
len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
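
The second call uses `test_size=0.25` rather than `0.2` because it is applied to the 80% that remains after the test split, and 25% of 80% is 20% of the full dataset. The arithmetic can be checked with a short sketch. The rounding rule used here (the held-out size is rounded up and the remainder goes to training) reflects my understanding of how scikit-learn resolves fractional sizes and should be treated as an assumption.

```python
import math

n_total = 7043  # number of rows in the dataset

# First split: 20% held out for testing (fractional size rounded up).
n_test = math.ceil(n_total * 0.2)
n_full_train = n_total - n_test

# Second split: 25% of the remaining rows go to validation.
n_val = math.ceil(n_full_train * 0.25)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)  # 4225 1409 1409
```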

In [64]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [65]:
y_train = df_train.churn.values 
y_val = df_val.churn.values 
y_test = df_test.churn.values 

In [66]:
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

## 3.4 Exploratory Data Analysis (EDA)

In [67]:
df_full_train = df_full_train.reset_index(drop=True)

In [69]:
df_full_train.churn.value_counts(normalize=True)

churn
0    0.734469
1    0.265531
Name: proportion, dtype: float64

In [70]:
global_churn_rate = df_full_train.churn.mean()
print(round(global_churn_rate, 2))

0.27


In [71]:
df_full_train.dtypes 

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int32
dtype: object

In [79]:
numerical_columns = df_full_train.select_dtypes(include=["number"]).columns 
print(numerical_columns)

Index(['seniorcitizen', 'tenure', 'monthlycharges', 'totalcharges', 'churn'], dtype='object')


In [81]:
categorical_columns = df_full_train.select_dtypes(include=["object"]).columns
print(categorical_columns)

Index(['customerid', 'gender', 'partner', 'dependents', 'phoneservice',
       'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup',
       'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies',
       'contract', 'paperlessbilling', 'paymentmethod'],
      dtype='object')


In [82]:
df_full_train[categorical_columns].nunique()

customerid          5634
gender                 2
partner                2
dependents             2
phoneservice           2
multiplelines          3
internetservice        3
onlinesecurity         3
onlinebackup           3
deviceprotection       3
techsupport            3
streamingtv            3
streamingmovies        3
contract               3
paperlessbilling       2
paymentmethod          4
dtype: int64

In [84]:
df_full_train[numerical_columns].describe().T 

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| seniorcitizen | 5634.0 | 0.160809 | 0.367388 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 5634.0 | 32.373092 | 24.424539 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| monthlycharges | 5634.0 | 64.864253 | 30.089324 | 18.25 | 35.75 | 70.525 | 89.9375 | 118.6 |
| totalcharges | 5634.0 | 2287.087948 | 2263.197899 | 0.0 | 406.275 | 1405.65 | 3806.6125 | 8684.8 |
| churn | 5634.0 | 0.265531 | 0.441655 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## 3.5 Feature Importance - Churn Rate and Risk Ratio

How to measure feature importance:
1. Difference: difference between the global churn rate and the churn rate of a particular group.
2. Risk ratio:
$$RISK = \cfrac{\text{Group churn rate}}{\text{Global churn rate}}$$

If $RISK > 1$, the group is more likely to churn than the average customer; if $RISK < 1$, it is less likely to churn.

In [85]:
# Churn rate for males and females 
churn_female = df_full_train[df_full_train["gender"] == "female"].churn.mean()
churn_female 

0.2708409173643975

In [86]:
churn_male = df_full_train[df_full_train["gender"] == "male"].churn.mean()
churn_male 

0.26047800484932454

In [87]:
df_full_train.partner.value_counts()

partner
no     2904
yes    2730
Name: count, dtype: int64

In [88]:
churn_partner = df_full_train[df_full_train.partner == "yes"].churn.mean()
churn_partner 

0.20073260073260074

In [91]:
churn_no_partner = df_full_train[df_full_train.partner == "no"].churn.mean()
churn_no_partner 

0.32644628099173556

In [92]:
df_group = df_full_train.groupby("gender").churn.agg(["mean", "count"])
df_group

| gender | mean | count |
|---|---|---|
| female | 0.270841 | 2747 |
| male | 0.260478 | 2887 |
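
The per-group table can be extended with both importance measures. This is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not the Telco data); on the real data the same two assignments would use `df_group` and `global_churn_rate`. The difference is taken as group rate minus global rate, which is an assumed sign convention.

```python
import pandas as pd

# Hypothetical toy data: 4 female and 4 male customers.
toy_df = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "churn": [1, 1, 0, 0, 1, 0, 0, 0],
})

global_rate = toy_df["churn"].mean()  # 3 churners out of 8 customers

# Churn rate and group size per gender.
toy_group = toy_df.groupby("gender")["churn"].agg(["mean", "count"])

# Difference and risk ratio relative to the global churn rate.
toy_group["diff"] = toy_group["mean"] - global_rate
toy_group["risk"] = toy_group["mean"] / global_rate
print(toy_group)
```

Keeping `count` next to `diff` and `risk` helps when reading the table, since a large ratio computed on a very small group is less trustworthy than the same ratio on a large one.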
