# 3 Machine Learning for Classification
- Performing exploratory data analysis for identifying important features
- Encoding categorical variables to use them in machine learning models
- Using logistic regression for classification

Churn is when customers stop using the services of a company. Churn prediction is about identifying customers who are likely to cancel their contracts soon.

Many models can be used for binary classification, including logistic regression, decision trees, and neural networks.

This chapter uses the simplest one: logistic regression. Even though it's the simplest, it's still powerful and has many advantages over other models: it's fast and easy to understand, and its results are easy to interpret. It's a workhorse of machine learning and one of the most widely used models in the industry.

## 3.1 Churn Prediction Project

#### Problem Statement
Imagine that we work at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offering them an incentive to stay. We want to target them with promotional messages and give them a discount. We also would like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model's predictions.

### 3.1.1 Telco Churn Dataset

### 3.1.2 Initial Data Preparation

In [3]:
from utils import ensure_correct_directory

ensure_correct_directory("chapter-03")

'chapter-03'

In [4]:
import pandas as pd
import numpy as np

In [5]:
import seaborn as sns

from matplotlib import pyplot as plt

%matplotlib inline

In [6]:
df = pd.read_csv("./data/dataset.csv")

In [7]:
len(df)

7043

In [8]:
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [9]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |


In [10]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [11]:
total_charges = pd.to_numeric(df.TotalCharges, errors="coerce")

df[total_charges.isnull()][["customerID", "TotalCharges"]]

| | customerID | TotalCharges |
|---|---|---|
| 488 | 4472-LVYGI | |
| 753 | 3115-CZMZD | |
| 936 | 5709-LVOEQ | |
| 1082 | 4367-NUYAO | |
| 1340 | 1371-DWPAZ | |
| 3331 | 7644-OMVMY | |
| 3826 | 3213-VVOLG | |
| 4380 | 2520-SGTTA | |
| 5218 | 2923-ARZLG | |
| 6670 | 4075-WKNIU | |


In [12]:
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors="coerce")
df.TotalCharges = df.TotalCharges.fillna(0)
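
To see what `errors="coerce"` does in isolation, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset). Strings that cannot be parsed as numbers become `NaN`, which `fillna(0)` then replaces with zero.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; these values are invented for illustration
raw = pd.Series(["29.85", " ", "1889.5", " "])

# Entries that cannot be parsed (here: whitespace-only strings) become NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")
n_missing = int(parsed.isnull().sum())
print(n_missing)

# Replace the missing values with zero, mirroring the cell above
filled = parsed.fillna(0)
print(filled.tolist())
```

Filling with zero is a simple choice; whether it is appropriate depends on why the values are missing in the first place.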

In [13]:
df.columns = df.columns.str.lower().str.replace(" ", "_")

string_columns = list(df.dtypes[df.dtypes == "object"].index)

for col in string_columns:
  df[col] = df[col].str.lower().str.replace(" ", "_")


In [22]:
encoded_churn = (df.churn == "yes").astype(int)

df.churn = encoded_churn

In [25]:
from sklearn.model_selection import train_test_split

df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

df_train, df_val = train_test_split(df_train_full, test_size=0.33, random_state=11)

y_train = df_train.churn.values

y_val = df_val.churn.values

del df_train["churn"]
del df_val["churn"]
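
The resulting split sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds the held-out share up when `test_size` is a fraction; this is worth confirming with `len()` on the actual DataFrames.

```python
import math

n_total = 7043  # rows in the full dataset, as reported by len(df)

# First split: 20% held out for testing (assumed to be rounded up)
n_test = math.ceil(0.2 * n_total)
n_train_full = n_total - n_test

# Second split: 33% of the remaining rows held out for validation
n_val = math.ceil(0.33 * n_train_full)
n_train = n_train_full - n_val

print(n_train, n_val, n_test)
for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n / n_total:.0%}")
```

Under that assumption, roughly 54% of the rows end up in training, 26% in validation, and 20% in test.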

### 3.1.3 Exploratory Data Analysis

In [26]:
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [27]:
df_train_full.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [28]:
1521 / (4113 + 1521)

0.26996805111821087

In [30]:
global_mean = df_train_full.churn.mean()

round(global_mean, 3)

0.27

This dataset is an example of an *imbalanced dataset*. The churn rate in our data is 0.27, so customers who stayed outnumber those who churned by almost three to one. The opposite of imbalanced is the balanced case, when positive and negative classes are equally distributed among all observations.
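
To make the imbalance concrete, the following sketch recomputes the class shares from the counts printed by `value_counts()` and shows the accuracy of a naive baseline that always predicts "no churn". The baseline is an illustration added here and is not part of the original analysis.

```python
# Class counts from df_train_full.churn.value_counts()
n_no_churn = 4113
n_churn = 1521
n_total = n_no_churn + n_churn

churn_rate = n_churn / n_total
ratio = n_no_churn / n_churn

print(round(churn_rate, 3))  # 0.27
print(round(ratio, 1))       # 2.7 non-churners for every churner

# A model that always predicts "no churn" is correct for every customer who stayed
baseline_accuracy = n_no_churn / n_total
print(round(baseline_accuracy, 2))  # 0.73
```

A baseline this high is one reason accuracy alone can be misleading on imbalanced data.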

Both the categorical and numerical variables in the dataset are important, but they are different and need different treatment. To handle them separately, we create two lists:

- **categorical**, which will contain the names of categorical variables
- **numerical**, which will contain the names of numerical variables

In [34]:
categorical = [
  "gender",
  "seniorcitizen",
  "partner",
  "dependents",
  "phoneservice",
  "multiplelines",
  "internetservice",
  "onlinesecurity",
  "onlinebackup",
  "deviceprotection",
  "techsupport",
  "streamingtv",
  "streamingmovies",
  "contract",
  "paperlessbilling",
  "paymentmethod",
]

numerical = [
  "tenure",
  "monthlycharges",
  "totalcharges"
]

In [35]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64
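
The `nunique()` counts also suggest a quick heuristic for drafting such lists: columns with only a handful of distinct values are good candidates for the categorical list, regardless of their dtype (`seniorcitizen` is stored as `int64` but has only two values). The sketch below applies this idea to a made-up DataFrame; the data and the `max_categories` threshold are hypothetical, and the result should always be reviewed by hand.

```python
import pandas as pd

# Invented toy data that mimics a few columns of the churn dataset
toy = pd.DataFrame({
    "seniorcitizen": [0, 1, 0, 0, 1, 0],
    "contract": ["month-to-month", "one_year", "two_year",
                 "month-to-month", "one_year", "month-to-month"],
    "tenure": [1, 34, 2, 45, 8, 22],
    "monthlycharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
})

max_categories = 4  # hypothetical cutoff, chosen only for this example

n_unique = toy.nunique()

# Few distinct values: treat as categorical; many: treat as numerical
categorical_guess = list(n_unique[n_unique <= max_categories].index)
numerical_guess = list(n_unique[n_unique > max_categories].index)

print(categorical_guess)
print(numerical_guess)
```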

### 3.1.4 Feature Importance

**Feature importance analysis** is the process of identifying how other variables affect the target variable. Knowing this is key to understanding the data and building a good model. It's often done as part of exploratory data analysis to figure out which variables will be useful for the model. It also gives us additional insights about the dataset and helps answer questions like "What makes customers churn?" and "What are the characteristics of people who churn?".
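
One simple way to start is to compare the churn rate within each group of a categorical variable against the global churn rate. The sketch below does this on a small made-up DataFrame; the data and the column names `diff` and `risk` are illustrative assumptions, and the real analysis would use `df_train_full`.

```python
import pandas as pd

# Invented toy data; not taken from the churn dataset
toy = pd.DataFrame({
    "partner": ["yes", "no", "no", "yes", "no", "yes", "no", "no"],
    "churn":   [0,     1,    0,    0,     1,    0,     1,    0],
})

global_rate = toy.churn.mean()

# Churn rate and group size for each value of the variable
group = toy.groupby("partner").churn.agg(["mean", "count"])

# Difference from the global rate, and the ratio to it
group["diff"] = group["mean"] - global_rate
group["risk"] = group["mean"] / global_rate

print(group)
```

A ratio well above 1 suggests that the group churns more than average, and a ratio well below 1 suggests the opposite. With groups this small the numbers are only illustrative.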

#### Churn Rate



#### Mutual Information

#### Correlation Coefficient

## 3.2 Feature Engineering

### 3.2.1 One-hot Encoding for Categorical Variables

## 3.3 Machine Learning for Classification

### 3.3.1 Logistic Regression

### 3.3.2 Training Logistic Regression

### 3.3.3 Model Interpretation

### 3.3.4 Using the Model

## 3.4 Next Steps

## Additional Readings