Churn is a major problem in the telecom industry. Being able to identify which customers are likely to churn, based on a few indicators, can help firms focus on those individuals and keep them from moving to competitors. 
A dataset of customers and their usage patterns is available in the file Assignment03_Telco-Customer-Churn.csv. 

https://drive.google.com/drive/folders/1Jl8iDu7nGmrqCECbrLqmVafgwE5PYfiU


Build a decision tree classifier to predict which customers are likely to churn. Use 10-fold cross-validation to report your results.
1) Tree with no pruning<br>
2) Tree with pre-pruning<br>
3) Tree with post-pruning


Using the best model above, make the following changes:

4) Up-sample the minority class and fit the model with best result<br>
5) Weight the class and fit the model with best result

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [2]:
# Import data
telecom_df = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/Assignments/Assignment03_Telco-Customer-Churn.csv')
telecom_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure(month),PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
# Check info
telecom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure(month)     7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
# TotalCharges is stored as object
telecom_df['TotalCharges'] = pd.to_numeric(telecom_df['TotalCharges'], errors='coerce')
telecom_df.dropna(how='any', inplace=True)

In [5]:
# Check shape
telecom_df.shape

(7032, 21)
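
The drop from 7043 to 7032 rows is consistent with a few `TotalCharges` entries that cannot be parsed as numbers. A minimal sketch of how `errors='coerce'` behaves, using a toy series (the blank entry is an assumption about what the real column contains):

```python
import pandas as pd

# Toy stand-in for the TotalCharges column. The blank string is an assumed
# example of an entry that forces pandas to read the column as object.
charges = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns entries that cannot be parsed into NaN
numeric = pd.to_numeric(charges, errors='coerce')

n_missing = int(numeric.isna().sum())   # entries that could not be parsed
n_kept = len(numeric.dropna())          # rows left after dropping them
print(n_missing, n_kept)
```

Counting the NaN values before calling `dropna` shows how many rows the conversion removes.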

In [6]:
# Drop customerID
telecom_df.drop('customerID', axis=1, inplace=True)

In [7]:
# Check missing values
telecom_df.isna().sum().sum()

0

In [8]:
# Check value counts for categorical variables
cat_var = [var for var in telecom_df.select_dtypes(include='object').columns]
cat_var = cat_var[0:len(cat_var)-1]      #exclude target Churn from the list

for var in cat_var:
    print(telecom_df[var].value_counts(normalize=True)*100)
    print()

Male      50.469283
Female    49.530717
Name: gender, dtype: float64

No     51.749147
Yes    48.250853
Name: Partner, dtype: float64

No     70.150739
Yes    29.849261
Name: Dependents, dtype: float64

Yes    90.32992
No      9.67008
Name: PhoneService, dtype: float64

No                  48.137088
Yes                 42.192833
No phone service     9.670080
Name: MultipleLines, dtype: float64

Fiber optic    44.027304
DSL            34.357224
No             21.615472
Name: InternetService, dtype: float64

No                     49.729807
Yes                    28.654721
No internet service    21.615472
Name: OnlineSecurity, dtype: float64

No                     43.899317
Yes                    34.485210
No internet service    21.615472
Name: OnlineBackup, dtype: float64

No                     43.998862
Yes                    34.385666
No internet service    21.615472
Name: DeviceProtection, dtype: float64

No                     49.374289
Yes                    29.010239
No internet

In [9]:
# Map the target to 1 (churned) and 0 (did not churn)
telecom_df['Churn'] = telecom_df['Churn'].map({'Yes':1, 'No':0})

In [10]:
# Convert categorical variables to dummy (one-hot) columns
telecom_df_onehot = pd.get_dummies(telecom_df, columns=cat_var)
telecom_df_onehot.sample(5)

Unnamed: 0,SeniorCitizen,tenure(month),MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,PhoneService_Yes,MultipleLines_No,MultipleLines_No phone service,MultipleLines_Yes,InternetService_DSL,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No,OnlineBackup_No internet service,OnlineBackup_Yes,DeviceProtection_No,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
6181,0,61,99.9,6241.35,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0
5481,1,1,73.65,73.65,1,0,1,1,0,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0
875,0,3,34.8,113.95,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,1,0,0
2476,0,69,84.7,5878.9,0,0,1,0,1,1,0,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1
4070,1,55,74.0,4052.4,0,0,1,0,1,1,0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0


In [11]:
# Check the class proportions for Churn
telecom_df_onehot['Churn'].value_counts(normalize=True)

0    0.734215
1    0.265785
Name: Churn, dtype: float64

**The classes are imbalanced: only about 26.6% of customers churned**
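
With this imbalance, a model that always predicts "no churn" already reaches an accuracy equal to the majority share. The sketch below illustrates that baseline on synthetic labels (the 734/266 split is an assumption that mirrors the proportions above, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the class shares reported above (assumed:
# 734 non-churners and 266 churners out of 1000).
y_demo = np.array([0] * 734 + [1] * 266)
X_demo = np.zeros((len(y_demo), 1))   # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)   # equals the majority-class share of y_demo
```

Any accuracy reported later is easier to judge against this baseline.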

### Decision Tree with no pruning

In [12]:
# Fit an unpruned decision tree on an 80/20 train/test split
X = telecom_df_onehot.drop('Churn', axis=1)
y = telecom_df_onehot['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=4)

clf = DecisionTreeClassifier(random_state=4)
clf = clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_pred = clf.predict(X_test)

print('Train score:', accuracy_score(y_train, y_train_pred))
print('Test score:', accuracy_score(y_test, y_pred))

Train score: 0.9982222222222222
Test score: 0.7292110874200426


The unpruned tree overfits: accuracy is 0.998 on the training set but drops to 0.729 on the test set.
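
Since the assignment asks for 10-fold cross-validation, the unpruned tree can also be scored that way. A minimal sketch on synthetic data (`make_classification` is only a stand-in for `X` and `y`; the names `X_demo` and `y_demo` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn features (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=4
)

# Stratified folds keep the class shares similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=4)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=4), X_demo, y_demo,
    cv=cv, scoring='accuracy'
)
print(scores.mean(), scores.std())   # mean and spread over the 10 folds
```

Reporting the mean together with the standard deviation gives a sense of how stable the score is across folds.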

### Decision Tree with pre-pruning

In [13]:
# Tune the pre-pruning parameters (max_depth, min_samples_split) with 10-fold CV

params = {'criterion' : ['gini'], 'max_depth' : range(1,11), 'min_samples_split' : range(10, 60, 10)}

clf_gs = GridSearchCV(DecisionTreeClassifier(), cv=10, param_grid=params)

clf_gs.fit(X_train, y_train)

y_train_pred = clf_gs.predict(X_train)
y_pred = clf_gs.predict(X_test)

print('Train score:', accuracy_score(y_train, y_train_pred))
print('Test score:', accuracy_score(y_test, y_pred))

Train score: 0.8183111111111111
Test score: 0.7668798862828714


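Pre-pruning raises test accuracy from 0.729 to 0.767 and narrows the gap between training accuracy (0.818) and test accuracy, so this tree overfits far less than the unpruned one.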

### Decision Tree with post-pruning

In [14]:
# Cost Complexity Pruning

ccp = np.arange(0, 1, 0.1)

for v in ccp:
    clf_p = DecisionTreeClassifier(random_state=4, ccp_alpha=v)
    clf_p_gs = GridSearchCV(clf_p, cv=10, param_grid=params)
    clf_p_gs.fit(X_train, y_train)
    print('For ccp_alpha=', v)
    print(clf_p_gs.best_params_)
    print(clf_p_gs.best_score_)
    print()

For ccp_alpha= 0.0
{'criterion': 'gini', 'max_depth': 7, 'min_samples_split': 50}
0.7934255355461021

For ccp_alpha= 0.1
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.2
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.30000000000000004
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.4
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.5
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.6000000000000001
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.7000000000000001
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.8
{'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 10}
0.734400422242309

For ccp_alpha= 0.9
{'criterion': 'gini', 'ma

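The best cross-validated score (0.793) came from ccp_alpha=0.0, i.e. with no cost-complexity pruning; the max_depth and min_samples_split limits from the grid still apply. For the ccp_alpha values of 0.1 and above shown here, the search falls back to a depth-1 tree scoring 0.734, which equals the share of the majority class in the training set (4131 of 5625).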

In [15]:
# Test the model
# Note: clf_p_gs is the last search fitted in the loop (ccp_alpha=0.9), not the best one
y_pred = clf_p_gs.predict(X_test)

print('Test score:', accuracy_score(y_test, y_pred))

Test score: 0.7334754797441365


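The pre-pruned model (test accuracy 0.767) does better than this one (0.733). Note that `clf_p_gs` holds the last search fitted in the loop, so this test score belongs to the ccp_alpha=0.9 model rather than to the best configuration (ccp_alpha=0.0).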
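
The grid `np.arange(0, 1, 0.1)` is coarse: for a Gini tree on two classes the effective alphas cannot exceed 0.5 and are typically far smaller than 0.1. One possible alternative, sketched here on synthetic data (the names `X_demo`, `y_demo` and `search` are illustrative), is to take the candidate alphas from `cost_complexity_pruning_path` and let cross-validation choose among them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.73, 0.27], random_state=4
)

# Effective alphas at which the fully grown tree loses a subtree
path = DecisionTreeClassifier(random_state=4).cost_complexity_pruning_path(
    X_demo, y_demo
)
# Drop the last alpha (it prunes the tree down to the root) and guard
# against tiny negative values caused by floating-point error
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# Let 10-fold CV pick the alpha; no pre-pruning limits are set here
search = GridSearchCV(
    DecisionTreeClassifier(random_state=4),
    param_grid={'ccp_alpha': alphas}, cv=10
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Because `GridSearchCV` refits the best estimator by default, `search.predict` then uses the selected alpha rather than whichever value was tried last.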

### Up-sample the minority class and fit the model with best result

In [16]:
# Import the relevant package
from imblearn.over_sampling import RandomOverSampler

In [17]:
# Oversample the data
ros = RandomOverSampler(random_state=4)

X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

from collections import Counter
print(sorted(Counter(y_train).items()))         #before oversampling
print(sorted(Counter(y_resampled).items()))     #after oversampling

[(0, 4131), (1, 1494)]
[(0, 4131), (1, 4131)]


In [18]:
# Fit a decision tree on the oversampled data

params = {'criterion' : ['gini'], 'max_depth' : range(1,11), 'min_samples_split' : range(10, 60, 10)}

clf_gs_os = GridSearchCV(DecisionTreeClassifier(), cv=10, param_grid=params)

clf_gs_os.fit(X_resampled, y_resampled)

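# Note: the second value is clf_gs.best_score_ (the pre-pruned search on the
# original data); clf_gs_os.best_score_ would give the oversampled model's CV score
clf_gs_os.best_params_, clf_gs.best_score_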

({'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 10},
 0.7934255355461021)

In [19]:
# Performance on test data
y_pred = clf_gs_os.predict(X_test)

print('Test score:', accuracy_score(y_test, y_pred))

Test score: 0.720682302771855


Oversampling did not improve test accuracy: it fell from 0.767 (pre-pruned tree on the original data) to 0.721.
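
Oversampling before cross-validation puts copies of the same minority rows in both the training and validation folds, which can inflate the CV score. A hedged sketch of resampling inside each fold instead, on synthetic data, scored by recall on the churn class (it uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`, and `max_depth=5` is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=4
)

fold_recalls = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=4)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]
    minority = y_tr == 1
    # Up-sample the minority class of the training fold only
    X_min_up, y_min_up = resample(
        X_tr[minority], y_tr[minority], replace=True,
        n_samples=int((~minority).sum()), random_state=4
    )
    X_bal = np.vstack([X_tr[~minority], X_min_up])
    y_bal = np.concatenate([y_tr[~minority], y_min_up])

    tree = DecisionTreeClassifier(max_depth=5, random_state=4).fit(X_bal, y_bal)
    # The validation fold keeps its original class balance
    fold_recalls.append(recall_score(y_demo[val_idx], tree.predict(X_demo[val_idx])))

print(np.mean(fold_recalls))   # mean recall on the minority class
```

Recall on the churn class is used here because up-sampling aims at finding more churners, which accuracy alone does not show.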

### Weight the class and fit the model with best result

In [20]:
# Fit a decision tree with balanced class weights (training data is already oversampled)

params = {'criterion' : ['gini'], 'max_depth' : range(1,11), 'min_samples_split' : range(10, 60, 10)}

clf_gs_os_cw = GridSearchCV(DecisionTreeClassifier(class_weight='balanced'), cv=10, param_grid=params)

clf_gs_os_cw.fit(X_resampled, y_resampled)

# Note: the second value is clf_gs.best_score_, not clf_gs_os_cw.best_score_
clf_gs_os_cw.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 10},
 0.7934255355461021)

In [21]:
# Performance on test data
y_pred = clf_gs_os_cw.predict(X_test)

print('Test score:', accuracy_score(y_test, y_pred))

Test score: 0.7199715707178393


In [22]:
# Fit a decision tree with manual class weights (training data is already oversampled)

params = {'criterion' : ['gini'], 'max_depth' : range(1,11), 'min_samples_split' : range(10, 60, 10)}

weights = {0:1.0, 1:100.0}

clf_gs_os_cw = GridSearchCV(DecisionTreeClassifier(class_weight=weights), cv=10, param_grid=params)

clf_gs_os_cw.fit(X_resampled, y_resampled)

# Note: the second value is clf_gs.best_score_, not clf_gs_os_cw.best_score_
clf_gs_os_cw.best_params_, clf_gs.best_score_

({'criterion': 'gini', 'max_depth': 10, 'min_samples_split': 10},
 0.7934255355461021)

In [23]:
# Performance on test data
y_pred = clf_gs_os_cw.predict(X_test)

print('Test score:', accuracy_score(y_test, y_pred))

Test score: 0.5408670931058991


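Balanced class weights (test accuracy 0.720) work better than the manual weights {0: 1.0, 1: 100.0} (test accuracy 0.541). Both models were fitted on the already oversampled data, where the classes are equal in size, so class_weight='balanced' changes little compared with the oversampled model (0.721).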
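
Class weighting is usually an alternative to oversampling rather than an addition to it. A minimal sketch, on synthetic data, of applying `class_weight='balanced'` to the original imbalanced training set and comparing it with an unweighted tree (`max_depth=5` and the names below are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, stratify=y_demo, random_state=4
)

results = {}
for cw in (None, 'balanced'):
    # Fit on the original (not oversampled) training data
    tree = DecisionTreeClassifier(max_depth=5, class_weight=cw, random_state=4)
    y_hat = tree.fit(X_tr, y_tr).predict(X_te)
    results[str(cw)] = {
        'recall': recall_score(y_te, y_hat),
        'balanced_accuracy': balanced_accuracy_score(y_te, y_hat),
    }

for name, scores in results.items():
    print(name, scores)
```

Recall and balanced accuracy are reported because they are less dominated by the majority class than plain accuracy.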