# Goal: Choose a Classification Model that Predicts Customer Churn

### Scope: 
In this notebook, I'm selecting a suitable classification model to predict customer churn.

### Out of scope:
This work does *not yet* include the feature importances that identify the key features affecting churn.

# Import Packages

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings("ignore")

# Import the Data Set

In [None]:
telco_data = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [None]:
telco_data.info()

In [None]:
telco_data.describe()

In [None]:
telco_data.head()

* customerID is only an identifier and carries no predictive information
* TotalCharges is stored as text rather than as a number

In [None]:
# Drop the customerID column
telco_data.drop(['customerID'],axis=1,inplace=True)

# Convert TotalCharges to a numeric type; entries that cannot be parsed become NaN
telco_data['TotalCharges'] = pd.to_numeric(telco_data['TotalCharges'], errors='coerce')
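To see what `errors='coerce'` does, here is a minimal sketch on a made-up series (not the real data). The blank string is an assumed example of an entry that cannot be parsed as a number.

```python
import pandas as pd

# Made-up values standing in for TotalCharges; the blank string is an
# assumed example of an entry that cannot be parsed
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], dtype=object)

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())
print(converted)
print("Missing after conversion:", n_missing)
```

Any NaN produced this way then shows up in the `isnull().sum()` check.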

In [None]:
# Check if there is missing data or NaN values
telco_data.isnull().sum()

In [None]:
# Replace the NaN values in TotalCharges with the column mean
telco_data['TotalCharges'] = telco_data['TotalCharges'].fillna(telco_data['TotalCharges'].mean())

# EDA

A look at the customer churn rate

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Churn', data=telco_data)
plt.title('Customer Churn Count')

In [None]:
labels = telco_data['Churn'].value_counts().index

fig = go.Figure(data=[go.Pie(labels=labels, values=telco_data['Churn'].value_counts())])
fig.show()
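The same churn split can be read off as percentages without a plot. This is a small sketch on made-up labels, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up labels, not the real Churn column
churn = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"])

# normalize=True returns proportions instead of raw counts
churn_share = churn.value_counts(normalize=True) * 100
print(churn_share.round(1))
```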

In [None]:
sns.pairplot(telco_data,hue='Churn')

Only 4 numerical features appear in the pair plot. Next, count plots of the categorical features show how each one relates to churn.

In [None]:
cat_features = telco_data.select_dtypes(include='object')

In [None]:
plt.figure(figsize=(15,20))

# The slice [0:15] leaves out Churn, which is still part of cat_features at this point
for i, col_name in enumerate(cat_features.columns[0:15], start=1):
    plt.subplot(5,3,i)
    sns.countplot(x=col_name, hue='Churn', data=telco_data)

# Adjust the spacing once, after all subplots are drawn
plt.tight_layout()

Take a look at the distributions of the 3 numerical features: tenure, monthly charges and total charges

In [None]:
sns.set_style('whitegrid')
telco_data['tenure'].hist(bins=35,alpha=0.7)

plt.title("Tenure Distribution")
plt.xlabel("Tenure")
plt.ylabel("Count")

In [None]:
# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(telco_data['MonthlyCharges'],kde=True)

plt.title("Customer Monthly Charges")
plt.xlabel("Dollars")
plt.ylabel("Count")

In [None]:
# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(telco_data['TotalCharges'],kde=True,bins=35)

plt.title("Customer Total Charges")
plt.xlabel("Dollars")
plt.ylabel("Count")

=> The total charges distribution is skewed to the right.

## EDA Insights
* Churn rate is low for customers with no internet service
* Churn rate is significantly higher for the Month-to-month contract and the Electronic check payment method
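Insights like these can also be checked numerically rather than read off the count plots. A minimal sketch with made-up rows follows; the column names match the data set, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows; values are illustrative, not taken from the real data
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" turns each row into proportions,
# i.e. the churn rate within each contract type
churn_by_contract = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(churn_by_contract)
```

Running the same `pd.crosstab` call on `telco_data` for any categorical column would give the per-category churn rate.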

# Data Preprocessing

## Skewness 

In [None]:
num_features = telco_data[['tenure','MonthlyCharges','TotalCharges']]
skew_features = num_features.skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew':skew_features})
skewness

=> TotalCharges is moderately positively skewed: 0.5 < 0.96 < 1

=> It is fine to train the models with these features as they are
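The 0.5 and 1 cut-offs used above are a common rule of thumb. The hypothetical helper below (`describe_skew` is not part of pandas) spells that rule out; the exact boundaries are a convention, not a strict rule.

```python
import pandas as pd

def describe_skew(value):
    """Label a skewness value using the 0.5 / 1 rule of thumb."""
    magnitude = abs(value)
    if magnitude < 0.5:
        return "approximately symmetric"
    if magnitude <= 1:
        return "moderately skewed"
    return "highly skewed"

# Made-up right-skewed values, not the real TotalCharges column
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 9, 20])
print(toy.skew(), describe_skew(toy.skew()))

# The value reported for TotalCharges above
print(describe_skew(0.96))
```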

## Handling Categorical Data 

In [None]:
# Simplify the "No internet service" response to "No"
Internet_cat = ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']

for cat in Internet_cat:
    telco_data[cat] = telco_data[cat].replace({"No internet service":"No"})
    
# MultipleLines: simplify "No phone service" to "No"
telco_data['MultipleLines'] = telco_data['MultipleLines'].replace({"No phone service":"No"})

In [None]:
# Convert the target column Churn to a binary feature either using LabelEncoder or LabelBinarizer
# Yes (churn) is 1
# No Churn is 0

label_encd = LabelEncoder()
telco_data['Churn'] = label_encd.fit_transform(telco_data['Churn'])

In [None]:
# Convert the categorical features to dummy variables

cat_col = cat_features.drop('Churn',axis=1).columns.tolist()

telco_data_encd = pd.get_dummies(telco_data,prefix_sep="__",columns=cat_col,drop_first=True)
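To make the effect of `drop_first=True` and `prefix_sep="__"` concrete, here is a small sketch on a made-up frame with a single categorical column.

```python
import pandas as pd

# Made-up rows, illustrative only
toy = pd.DataFrame({
    "tenure": [1, 24, 60],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# One dummy column per category, minus the first category in sorted order
encoded = pd.get_dummies(toy, prefix_sep="__", columns=["Contract"], drop_first=True)
print(list(encoded.columns))
```

The dropped category (`Month-to-month` here) becomes the baseline: a row with every `Contract__*` dummy equal to 0 belongs to it.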

In [None]:
# Quick look at the current data set to gather some preliminary insights

data = [go.Heatmap(
        z= telco_data_encd.corr().values,
        x=telco_data_encd.columns.values,
        y=telco_data_encd.columns.values,
        colorscale='RdBu_r',
        opacity = 1.0 )]

layout = go.Layout(
    title='Pearson Correlation of Input Features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700)

fig = go.Figure(data=data, layout=layout)
fig.show()

## Insights:
Features that are most correlated with churn:
* Payment Method - Electronic check, Internet Service - Fiber optic, Monthly Charges, Paperless Billing, Senior Citizen

Features that contribute to high monthly charges/total charges:
* Streaming TV and movies, Fiber optic, Online Backup, Device Protection
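Reading correlations off a large heatmap is error-prone, so a sorted list of each feature's correlation with the target can help. The sketch below uses a made-up frame; on the real data the same expression would be applied to `telco_data_encd`.

```python
import pandas as pd

# Made-up values, illustrative only
toy = pd.DataFrame({
    "Churn":          [1, 1, 0, 0, 1, 0],
    "MonthlyCharges": [90, 85, 30, 40, 95, 35],
    "tenure":         [2, 5, 60, 48, 1, 70],
})

# Pearson correlation of every feature with Churn, strongest positive first
corr_with_churn = toy.corr()["Churn"].drop("Churn").sort_values(ascending=False)
print(corr_with_churn)
```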

# ML Classification Models

The following models are used:

* Logistic Regression
* KNN
* Decision Tree
* Random Forest
* Support Vector Machine

## Evaluate the important metrics for this analysis
Firstly, we want a model that predicts customer churn with high accuracy.

Currently, 73.5% of customers are keeping the service. Because the aim is to retain existing customers, I want to focus on correctly predicting the customers who are likely to churn, so Precision is the second most important metric.

Assuming that it is expensive to lose a customer, I want to give out promotions to those that are predicted to churn. That means minimizing the number of churners wrongly predicted as no-churn (false negatives, FN), so a high Recall is also important.
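A quick worked example shows why accuracy alone is not enough with this class balance. It uses a made-up set of 1,000 customers with the same 73.5% / 26.5% split and a trivial "model" that predicts no-churn for everyone.

```python
# Made-up set of 1,000 customers with the 73.5% / 26.5% split
n_stay, n_churn = 735, 265

# A trivial "model" that predicts no-churn for every customer
tp, fp = 0, 0               # it never predicts churn
tn, fn = n_stay, n_churn    # every churner is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print("accuracy:", accuracy, "recall:", recall)
```

The accuracy equals the share of no-churn customers while the recall is zero, which is why Precision and Recall are tracked alongside accuracy.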


In [None]:
X = telco_data_encd.drop('Churn',axis=1)
y = telco_data_encd['Churn']

## Train - Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
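The split above is purely random. As a possible variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts, which can matter when the classes are imbalanced. A sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy preserves the 30% positive share in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```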

## Tuning Parameters for KNN and SVM
For these two models, I will perform a grid search to find the optimal parameters (k for KNN; C and gamma for SVM)

### For KNN

In [None]:
k_range = list(range(1,41))
param_grid_knn = dict(n_neighbors=k_range)

# GridSearchCV tries every k in param_grid_knn itself, so no explicit loop is needed
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=10,scoring='accuracy')

grid_knn.fit(X_train,y_train)
grid_knn.best_params_
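Besides `best_params_`, a fitted grid search exposes the score behind that choice. The sketch below runs on synthetic data (a stand-in for `X_train` / `y_train`) with a shorter, made-up list of candidate k values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]},
                    cv=5, scoring="accuracy")
grid.fit(X_toy, y_toy)

print(grid.best_params_)                    # k with the highest mean CV accuracy
print(grid.best_score_)                     # that mean CV accuracy
print(grid.cv_results_["mean_test_score"])  # one mean score per candidate k
```

Comparing `mean_test_score` across candidates shows how close the runner-up values of k are to the best one.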

### For SVM

In [None]:
param_grid={'C':[0.1, 1, 10, 100], 'gamma':[1, 0.1, 0.01, 0.001]}

grid_svm = GridSearchCV(SVC(), param_grid, cv=10,scoring='accuracy')

grid_svm.fit(X_train,y_train)
grid_svm.best_params_
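Both KNN and the default RBF-kernel SVC rely on distances between rows, so features on very different scales (for example total charges next to 0/1 dummies) can dominate the result. One possible variation, which is an assumption on my side and not what this notebook does, is to put a `StandardScaler` in a `Pipeline` and tune the SVM through it. A sketch on synthetic data with a reduced, made-up grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling lives inside the pipeline, so it is re-fit on the training part of every CV fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_toy = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01]}

grid = GridSearchCV(pipe, param_grid_toy, cv=5, scoring="accuracy")
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```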

## Train the Models

In [None]:
# Instantiate and fit the models
log = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=29).fit(X_train, y_train)
tree = DecisionTreeClassifier().fit(X_train, y_train)
rfc = RandomForestClassifier().fit(X_train, y_train)
svc = SVC(C=1,gamma=0.001).fit(X_train, y_train)

models = [log, knn, tree, rfc, svc]
models_names = ['Logistic Regression', 'KNN', 'Decision Tree', 'Random Forest', 'Support Vector Machine']

In [None]:
scoring = ['accuracy','precision','recall']
train_accuracy = []
train_precision = []
train_recall = []
train_std = []

train_scoring = {}
for i,model in enumerate(models):
    scores = cross_validate(model, X_train, y_train, cv=10, scoring=scoring)
    
    # cross_validate also returns fit_time and score_time, which are not used here;
    # each metric's per-fold values are stored under the key test_<metric>
    train_accuracy.append(scores['test_accuracy'].mean())
    train_precision.append(scores['test_precision'].mean())
    train_recall.append(scores['test_recall'].mean())
    train_std.append(scores['test_accuracy'].std())
    
    train_scoring[i] = scores['test_accuracy']
    

train_scores = pd.DataFrame(list(zip(train_accuracy,train_precision,train_recall,train_std)),
                            index=models_names,columns=['Accuracy','Precision','Recall','Standard Deviation'])
print('Models Training Scores')
train_scores
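For reference, this is what `cross_validate` returns when it is given a list of metrics. The sketch runs on synthetic data with 5 folds, so the values themselves are not meaningful.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

scores = cross_validate(LogisticRegression(), X_toy, y_toy, cv=5,
                        scoring=["accuracy", "precision", "recall"])

# Two timing entries plus one test_<metric> array per requested metric
print(sorted(scores.keys()))
# Each array holds one value per fold
print(scores["test_accuracy"].shape)
```

The mean of each `test_<metric>` array is what goes into the summary table, and the standard deviation of `test_accuracy` indicates how stable a model is across folds.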

## Compare the Models
Comparing the models' accuracy scores from cross-validation

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.boxplot(list(train_scoring.values()))
ax.set_xticklabels(['Log','KNN','DTree','RFC','SVM'])

plt.title('Models Accuracy Comparison')
plt.ylabel('Accuracy rate')

### Insights
So far, the Logistic Regression model achieves high scores on all the metrics

## Test the Models

In [None]:
accuracy_scores = []
precision_scores = []
recall_scores = []
error_rate = []

for i,model in enumerate(models):
    y_pred = model.predict(X_test)
    conf_matrix = confusion_matrix(y_test,y_pred)
    
    print('\n')
    print(models_names[i])
    print(classification_report(y_test,y_pred))
    print('\n')
    print(conf_matrix)
    
    # Rows are the true classes and columns the predicted classes,
    # so the flattened matrix reads tn, fp, fn, tp
    tn, fp, fn, tp = conf_matrix.ravel()
    
    total = tn + fp + tp + fn

    accuracy  = (tp + tn) / total # Accuracy Rate
    precision = tp / (tp + fp) # Positive Predictive Value
    recall    = tp / (tp + fn) # True Positive Rate
    error = (fp + fn) / total # Misclassification Rate
 
    accuracy_scores.append(accuracy)
    precision_scores.append(precision)
    recall_scores.append(recall)
    error_rate.append(error)
    
scores_df = pd.DataFrame(list(zip(accuracy_scores,precision_scores,recall_scores,error_rate)),index=models_names,columns=['Accuracy','Precision','Recall','Error Rate'])
print('\n')
print('Models Evaluation from Test Set')
scores_df
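The metric formulas above depend on how scikit-learn lays out the confusion matrix: rows are true classes and columns are predicted classes. A small check on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = churn, 0 = no churn
y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# For binary labels the flattened matrix reads tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(cm)
print("tn, fp, fn, tp =", tn, fp, fn, tp)
```

Here `cm[1, 0]` counts actual churners predicted as no-churn (FN) and `cm[0, 1]` counts loyal customers predicted as churn (FP).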

### Insights:
The Logistic Regression model still achieves high scores on all the chosen metrics when applied to the test set.

# Conclusion
Logistic Regression is chosen as the predictive model for customer churn because of its good performance and simplicity.