# 3 Machine Learning for Classification
- Performing exploratory data analysis for identifying important features
- Encoding categorical variables to use them in machine learning models
- Using logistic regression for classification

Churn is when customers stop using the services of a company. Churn prediction is about identifying customers who are likely to cancel their contracts soon.

Many models can be used for binary classification, including logistic regression, decision trees, and neural networks.

This chapter uses the simplest one: logistic regression. Even though it's the simplest, it's still powerful and has many advantages over other models: it's fast and easy to understand, and its results are easy to interpret. It's a workhorse of machine learning and the most widely used model in the industry.

## 3.1 Churn Prediction Project

#### Problem Statement
Imagine that we work at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offering them an incentive to stay. We want to target them with promotional messages and give them a discount. We would also like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model's predictions.

### 3.1.1 Telco Churn Dataset

### 3.1.2 Initial Data Preparation

In [2]:
from utils import ensure_correct_directory

ensure_correct_directory("chapter-03")

'chapter-03'

In [3]:
import pandas as pd
import numpy as np

In [4]:
import seaborn as sns

from matplotlib import pyplot as plt

%matplotlib inline

In [5]:
df = pd.read_csv("./data/dataset.csv")

In [6]:
len(df)

7043

In [7]:
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [8]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [9]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [10]:
total_charges = pd.to_numeric(df.TotalCharges, errors="coerce")

df[total_charges.isnull()][["customerID", "TotalCharges"]]

| | customerID | TotalCharges |
|---|---|---|
| 488 | 4472-LVYGI | |
| 753 | 3115-CZMZD | |
| 936 | 5709-LVOEQ | |
| 1082 | 4367-NUYAO | |
| 1340 | 1371-DWPAZ | |
| 3331 | 7644-OMVMY | |
| 3826 | 3213-VVOLG | |
| 4380 | 2520-SGTTA | |
| 5218 | 2923-ARZLG | |
| 6670 | 4075-WKNIU | |


In [11]:
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="coerce")
df.TotalCharges = df.TotalCharges.fillna(0)
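
To see what `errors="coerce"` does, here is a minimal sketch on a toy series. The values are made up for illustration, and the sketch assumes (as the `object` dtype of `TotalCharges` suggests) that the missing totals are stored as blank strings rather than as real missing values.

```python
import pandas as pd

# Toy column: numbers stored as strings, with one blank entry (assumed format).
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising a ValueError.
parsed = pd.to_numeric(raw, errors="coerce")

# Count the unparseable entries, then replace them with 0 as in the cell above.
n_missing = int(parsed.isnull().sum())
cleaned = parsed.fillna(0)

print(n_missing)
print(cleaned.tolist())
```

Filling with 0 is a pragmatic choice rather than the only option; it keeps every row in the dataset at the cost of treating an unknown total as no charges at all.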

In [12]:
df.columns = df.columns.str.lower().str.replace(" ", "_")

string_columns = list(df.dtypes[df.dtypes == "object"].index)

for col in string_columns:
  df[col] = df[col].str.lower().str.replace(" ", "_")
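
The effect of this normalization is easier to see on a small example. The sketch below uses a hypothetical two-row frame (`toy`) that imitates the raw column names and values. It selects string columns with `pd.api.types.is_string_dtype`, which is a slightly more defensive check than the `dtypes == "object"` comparison used above and is our substitution, not part of the original cell.

```python
import pandas as pd

# Hypothetical two-row frame that imitates the raw column names and values.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check"],
    "MonthlyCharges": [29.85, 56.95],
})

# Lowercase the column names and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Pick the columns that hold strings and normalize their values the same way.
string_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for col in string_columns:
    toy[col] = toy[col].str.lower().str.replace(" ", "_")

print(toy.columns.tolist())
print(toy["paymentmethod"].tolist())
```

After this step, values such as `Electronic check` become `electronic_check`, which is why the group tables later in the chapter show labels like `fiber_optic` and `no_internet_service`.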


In [13]:
encoded_churn = (df.churn == "yes").astype(int)

df.churn = encoded_churn
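
As a quick illustration of this encoding, the following sketch applies the same expression to a made-up series. It also shows why the 0/1 form is convenient: the mean of the encoded column is the share of churned customers.

```python
import pandas as pd

# Made-up target values, already lowercased by the previous step.
churn = pd.Series(["no", "no", "yes", "no", "yes"])

# The comparison yields booleans; astype(int) maps True to 1 and False to 0.
encoded = (churn == "yes").astype(int)

# The mean of a 0/1 column is the share of ones, i.e. the churn rate.
churn_rate = encoded.mean()

print(encoded.tolist())
print(churn_rate)
```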

In [14]:
from sklearn.model_selection import train_test_split

df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

y_train = df_train.churn.values

y_val = df_val.churn.values

del df_train["churn"]
del df_val["churn"]
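
Because the second split is applied to `df_train_full` rather than to the whole dataset, the final proportions are not simply 0.2 and 0.33. The arithmetic below is a small sketch of the resulting shares; the variable names are ours, and the real row counts can differ slightly because scikit-learn rounds to whole rows.

```python
# Fractions passed to train_test_split in the cell above.
test_fraction = 0.2           # first split: share of all rows held out for testing
val_fraction_of_rest = 0.33   # second split: share of the remaining rows

# Whatever is not used for testing goes into df_train_full.
train_full_fraction = 1 - test_fraction

# The second split divides df_train_full into validation and training parts.
val_fraction = train_full_fraction * val_fraction_of_rest
train_fraction = train_full_fraction * (1 - val_fraction_of_rest)

print(f"train: {train_fraction:.3f}")
print(f"val:   {val_fraction:.3f}")
print(f"test:  {test_fraction:.3f}")
```

So roughly 54% of the data ends up in `df_train`, 26% in `df_val`, and 20% in `df_test`.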

### 3.1.3 Exploratory Data Analysis

In [15]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [16]:
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [17]:
round(1521 / (4113 + 1521), 6)

0.269968

In [18]:
global_mean = df_train_full.churn.mean()

round(global_mean, 3)

0.27

This dataset is an example of an *imbalanced dataset*: the churn rate in our data is 0.27, so non-churning customers outnumber churning ones by roughly 2.7 to 1. The opposite of imbalanced is the balanced case, in which the positive and negative classes are equally distributed among all observations.
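
The class shares and the imbalance ratio can be derived directly from the counts printed by `value_counts()`. This is a minimal sketch that re-enters those two counts by hand; `shares` and `imbalance_ratio` are names introduced here for illustration.

```python
import pandas as pd

# Class counts reported by value_counts() above.
counts = pd.Series({0: 4113, 1: 1521}, name="churn")

# Share of each class among all customers in df_train_full.
shares = counts / counts.sum()

# How many non-churning customers there are per churning customer.
imbalance_ratio = counts.loc[0] / counts.loc[1]

print(shares.round(3).to_dict())
print(round(imbalance_ratio, 2))
```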

Both the categorical and numerical variables in the dataset are important, but they are different and need different treatment. To handle them separately, we create two lists:

- `categorical`, which contains the names of the categorical variables
- `numerical`, which contains the names of the numerical variables

In [19]:
categorical = [
  "gender",
  "seniorcitizen",
  "partner",
  "dependents",
  "phoneservice",
  "multiplelines",
  "internetservice",
  "onlinesecurity",
  "onlinebackup",
  "deviceprotection",
  "techsupport",
  "streamingtv",
  "streamingmovies",
  "contract",
  "paperlessbilling",
  "paymentmethod",
]

numerical = [
  "tenure",
  "monthlycharges",
  "totalcharges"
]

In [20]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

### 3.1.4 Feature Importance

**Feature importance analysis** is the process of identifying how other variables affect the target variable. Knowing this is key to understanding the data and building a good model. It's often done as part of exploratory data analysis to figure out which variables will be useful for the model. It also gives us additional insights about the dataset and helps answer questions like "What makes customers churn?" and "What are the characteristics of people who churn?".

#### Churn Rate

#### Gender

In [25]:
global_mean = df_train_full.churn.mean()
round(global_mean, 3)

0.27

In [28]:
female_mean = df_train_full[df_train_full.gender == "female"].churn.mean()
print("gender == female: ", round(female_mean, 3))

male_mean = df_train_full[df_train_full.gender == "male"].churn.mean()
print("gender == male: ", round(male_mean, 3))


gender == female:  0.277
gender == male:  0.263


#### Partner

In [29]:
partner_yes = df_train_full[df_train_full.partner == "yes"].churn.mean()
print("partner == yes: ", round(partner_yes, 3))
partner_no = df_train_full[df_train_full.partner == "no"].churn.mean()
print("partner == no: ", round(partner_no, 3))

partner == yes:  0.205
partner == no:  0.33


#### Risk Ratio

In statistics, the ratio between probabilities in different groups is called the **risk ratio**, where **risk** refers to the risk of having the effect. In our case, the effect is churn.

`risk = group rate / global rate`

For `gender == female`, the risk of churning is approximately 1.03:

`risk = 27.7% / 27% ≈ 1.03`

Risk is a number between zero and infinity. It has a nice interpretation that tells you how likely the elements of the group are to have the effect (churn in our case) compared with the entire population.

If the difference between the group rate and the global rate is small, the risk is close to 1: this group has the same level of risk as the rest of the population. Customers in the group are as likely to churn as anyone else. In other words, a group with a risk close to 1 is not risky at all.

If the risk is lower than 1, the group has lower risk: the churn rate in this group is smaller than the global churn rate. The value 0.5 means that the clients in this group are half as likely to churn as clients in general.

On the other hand, if the value is higher than 1, the group is risky: there's more churn in the group than in the population. A risk of 2 means that customers from the group are twice as likely to churn.

The term **risk** originally comes from controlled trials, in which one group of patients is given a treatment and the other group isn't. We then assess how effective the medicine is by calculating the rate of negative outcomes in each group and taking the ratio between the rates:

`risk = negative outcome rate in group 1 / negative outcome rate in group 2`

If the medicine turns out to be effective, it's said to reduce the risk of having the negative outcome, and the value of the risk is less than 1.
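
The interpretation rules above can be sketched in a few lines of Python. The helper names `risk_ratio` and `describe_risk` and the `tolerance` cutoff are hypothetical, chosen only for illustration; the group rates are the rounded values printed earlier for gender and partner.

```python
def risk_ratio(group_rate: float, global_rate: float) -> float:
    """Churn rate of a group relative to the global churn rate."""
    return group_rate / global_rate


def describe_risk(risk: float, tolerance: float = 0.05) -> str:
    """Label a risk value; `tolerance` is an arbitrary illustrative cutoff."""
    if abs(risk - 1) <= tolerance:
        return "about the same as everyone else"
    return "more likely to churn" if risk > 1 else "less likely to churn"


global_rate = 0.27  # global churn rate, rounded as in the text

# Group churn rates printed in the cells above.
group_rates = {
    "gender == female": 0.277,
    "gender == male": 0.263,
    "partner == yes": 0.205,
    "partner == no": 0.33,
}

for name, rate in group_rates.items():
    risk = risk_ratio(rate, global_rate)
    print(f"{name}: risk = {risk:.2f} ({describe_risk(risk)})")
```

With these numbers, both gender groups land close to 1, while the two partner groups fall clearly on either side of it, which suggests that `partner` tells us more about churn than `gender` does.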

In [39]:
df_group_gender = df_train_full.groupby(by="gender").churn.agg(["mean"])
df_group_gender["diff"] = (df_group_gender["mean"] - global_mean)
df_group_gender["risk"] = (df_group_gender["mean"] / global_mean)

df_group_gender

| gender | mean | diff | risk |
|---|---|---|---|
| female | 0.276824 | 0.006856 | 1.025396 |
| male | 0.263214 | -0.006755 | 0.97498 |


In [42]:
from IPython.display import display

for col in categorical:
  df_group = df_train_full.groupby(by=col).churn.agg(["mean"])
  df_group["diff"] = df_group["mean"] - global_mean
  df_group["risk"] = df_group["mean"] / global_mean
  display(df_group)

| gender | mean | diff | risk |
|---|---|---|---|
| female | 0.276824 | 0.006856 | 1.025396 |
| male | 0.263214 | -0.006755 | 0.97498 |

| seniorcitizen | mean | diff | risk |
|---|---|---|---|
| 0 | 0.24227 | -0.027698 | 0.897403 |
| 1 | 0.413377 | 0.143409 | 1.531208 |

| partner | mean | diff | risk |
|---|---|---|---|
| no | 0.329809 | 0.059841 | 1.221659 |
| yes | 0.205033 | -0.064935 | 0.759472 |

| dependents | mean | diff | risk |
|---|---|---|---|
| no | 0.31376 | 0.043792 | 1.162212 |
| yes | 0.165666 | -0.104302 | 0.613651 |

| phoneservice | mean | diff | risk |
|---|---|---|---|
| no | 0.241316 | -0.028652 | 0.89387 |
| yes | 0.273049 | 0.003081 | 1.011412 |

| multiplelines | mean | diff | risk |
|---|---|---|---|
| no | 0.257407 | -0.012561 | 0.953474 |
| no_phone_service | 0.241316 | -0.028652 | 0.89387 |
| yes | 0.290742 | 0.020773 | 1.076948 |

| internetservice | mean | diff | risk |
|---|---|---|---|
| dsl | 0.192347 | -0.077621 | 0.712482 |
| fiber_optic | 0.425171 | 0.155203 | 1.574895 |
| no | 0.077805 | -0.192163 | 0.288201 |

| onlinesecurity | mean | diff | risk |
|---|---|---|---|
| no | 0.420921 | 0.150953 | 1.559152 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.153226 | -0.116742 | 0.56757 |

| onlinebackup | mean | diff | risk |
|---|---|---|---|
| no | 0.404323 | 0.134355 | 1.497672 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.217232 | -0.052736 | 0.80466 |

| deviceprotection | mean | diff | risk |
|---|---|---|---|
| no | 0.395875 | 0.125907 | 1.466379 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.230412 | -0.039556 | 0.85348 |

| techsupport | mean | diff | risk |
|---|---|---|---|
| no | 0.418914 | 0.148946 | 1.551717 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.159926 | -0.110042 | 0.59239 |

| streamingtv | mean | diff | risk |
|---|---|---|---|
| no | 0.342832 | 0.072864 | 1.269897 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.302723 | 0.032755 | 1.121328 |

| streamingmovies | mean | diff | risk |
|---|---|---|---|
| no | 0.338906 | 0.068938 | 1.255358 |
| no_internet_service | 0.077805 | -0.192163 | 0.288201 |
| yes | 0.307273 | 0.037305 | 1.138182 |

| contract | mean | diff | risk |
|---|---|---|---|
| month-to-month | 0.431701 | 0.161733 | 1.599082 |
| one_year | 0.120573 | -0.149395 | 0.446621 |
| two_year | 0.028274 | -0.241694 | 0.10473 |

| paperlessbilling | mean | diff | risk |
|---|---|---|---|
| no | 0.172071 | -0.097897 | 0.637375 |
| yes | 0.338151 | 0.068183 | 1.25256 |

| paymentmethod | mean | diff | risk |
|---|---|---|---|
| bank_transfer_(automatic) | 0.168171 | -0.101797 | 0.622928 |
| credit_card_(automatic) | 0.164339 | -0.10563 | 0.608733 |
| electronic_check | 0.45589 | 0.185922 | 1.688682 |
| mailed_check | 0.19387 | -0.076098 | 0.718121 |


#### Mutual Information

The **metrics of importance** can help us measure the degree of dependency between a categorical variable and the target variable. If two variables are dependent, knowing the value of one variable gives us some information about another. On the other hand, if a variable is completely independent of the target variable, it's not useful and can be safely removed from the dataset.

For categorical variables, one such metric is mutual information, which tells us how much information we learn about one variable if we learn the value of the other variable. It's a concept from **information theory**, and in machine learning, we often use it to measure the mutual dependency between two variables.

Higher values of mutual information mean a higher degree of dependence: if the mutual information between a categorical variable and the target is high, this categorical variable will be quite useful for predicting the target.

Mutual information is already implemented in Scikit-learn in the `mutual_info_score` function from the `metrics` package.
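
To make the idea more concrete, the sketch below computes mutual information from its textbook definition, the sum of `p(x, y) * log(p(x, y) / (p(x) * p(y)))` over all value pairs, and compares it with `mutual_info_score`. The helper `mutual_information` and the toy lists are assumptions made for illustration; they are not part of the dataset or of Scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score


def mutual_information(x, y):
    """Mutual information (in nats) of two discrete variables, from the definition."""
    # Joint distribution p(x, y) as a table of relative frequencies.
    joint = pd.crosstab(pd.Series(x), pd.Series(y), normalize=True).to_numpy()
    # Marginals p(x) and p(y), shaped so their product broadcasts to the joint table.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over cells with non-zero probability.
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px * py)[nonzero])))


# Toy data: the target follows `dependent` exactly and ignores `independent`.
target = [0, 0, 1, 1, 0, 0, 1, 1]
dependent = ["a", "a", "b", "b", "a", "a", "b", "b"]
independent = ["u", "v", "u", "v", "u", "v", "u", "v"]

for name, feature in [("dependent", dependent), ("independent", independent)]:
    manual = mutual_information(feature, target)
    library = mutual_info_score(feature, target)
    print(f"{name}: manual = {manual:.4f}, sklearn = {library:.4f}")
```

The fully dependent feature reaches the maximum possible value for a balanced binary target (the natural logarithm of 2, about 0.693), while the independent feature scores 0. Real features, such as the ones in our dataset, fall somewhere in between.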

In [48]:
from sklearn.metrics import mutual_info_score

def calculate_mi(series):
  return mutual_info_score(series, df_train_full.churn)

df_mi = df_train_full[categorical].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name="MI")
df_mi

| | MI |
|---|---|
| contract | 0.09832 |
| onlinesecurity | 0.063085 |
| techsupport | 0.061032 |
| internetservice | 0.055868 |
| onlinebackup | 0.046923 |
| deviceprotection | 0.043453 |
| paymentmethod | 0.04321 |
| streamingtv | 0.031853 |
| streamingmovies | 0.031581 |
| paperlessbilling | 0.017589 |


#### Correlation Coefficient

Mutual information is a way to quantify the degree of dependency between two categorical variables, but it doesn't work when one of the features is numerical.

We can, however, measure the dependency between a binary target variable and a numerical variable by pretending that the binary variable is numerical (containing only the numbers 0 and 1) and then using classical methods from statistics to check for any dependency between these variables.

One such method is the **correlation coefficient** (sometimes referred to as **Pearson's correlation coefficient**). It is a value from -1 to 1.

- Positive correlation means that when one variable goes up, the other variable tends to go up as well. In the case of a binary target, when the values of the variable are high, we see ones more often than zeros. But when the values of the variable are low, zeros become more frequent than ones.
- Zero correlation means there is no linear relationship between the two variables.
- Negative correlation means that when one variable goes up, the other tends to go down.
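
As a rough illustration of how the coefficient behaves with a binary target, the sketch below computes Pearson's correlation from its definition and checks it against `np.corrcoef`. The `pearson` helper and the six made-up customers are assumptions for this example only.

```python
import numpy as np


def pearson(x, y):
    """Pearson's correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Deviations of each value from its mean.
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of co-deviations divided by the product of the deviation norms.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up example: six customers, tenure in months and a 0/1 churn flag.
tenure = [1, 2, 5, 30, 45, 60]
churn = [1, 1, 1, 0, 0, 0]

r_manual = pearson(tenure, churn)
r_numpy = np.corrcoef(tenure, churn)[0, 1]

# Both values agree and are negative: higher tenure goes with fewer churn cases.
print(round(r_manual, 3), round(r_numpy, 3))
```

In this toy data, all the churned customers have low tenure, so the coefficient is strongly negative.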

In [54]:
correlation_coefficient = (df_train_full[numerical]
  .corrwith(df_train_full.churn)
  .to_frame(name="Correlation Coefficient"))

correlation_coefficient

| | Correlation Coefficient |
|---|---|
| tenure | -0.351885 |
| monthlycharges | 0.196805 |
| totalcharges | -0.196353 |


- The correlation of `tenure` is -0.35. It has a negative sign, so the longer customers stay, the less often they tend to churn.
- `monthlycharges` has a positive coefficient of about 0.2, which means that customers who pay more tend to leave more often.
- `totalcharges` has a negative correlation of about -0.2, which makes sense: the longer people stay with the company, the more they have paid in total, so it's less likely that they will leave.

## 3.2 Feature Engineering

### 3.2.1 One-hot Encoding for Categorical Variables

## 3.3 Machine Learning for Classification

### 3.3.1 Logistic Regression

### 3.3.2 Training Logistic Regression

### 3.3.3 Model Interpretation

### 3.3.4 Using the Model

## 3.4 Next Steps

## Additional Readings