# Telecom Churn Prediction

## Introduction

## What is the churn rate?
- Wikipedia states that the churn rate (also called attrition rate) measures the number of individuals or items moving out of a collective group over a specific period. 
- It applies in many contexts, but the mainstream understanding of churn rate is related to the business case of customers that stop buying from you.

## Importance of customer churn prediction:
- The impact of the churn rate is clear, so we need strategies to reduce it. 
- Predicting churn is a good way to create proactive marketing campaigns targeted at the customers that are about to churn. 
- Forecasting customer churn with the help of machine learning is possible. 
- Machine learning and data analysis are powerful ways to identify and predict churn.
- Churn is one of the biggest problems in the telecom industry. 
- Research has shown that the average monthly churn rate among the top 4 wireless carriers in the US is 1.9% - 2%.
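
As a rough illustration of the definition above, the churn rate for a period can be computed as the number of customers lost during the period divided by the number of customers at its start. The function name `churn_rate` and the figures below are made up for illustration; they are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the share of customers lost during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Illustrative numbers only: 1,000 customers at the start of the month, 19 lost.
monthly_rate = churn_rate(1000, 19)
print(f"Monthly churn rate: {monthly_rate:.1%}")
```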

<font color = 'blue'>
Content: 

1. [Load and Check Data](#1)
1. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable](#4)
        * [Numerical Variable](#5)
1. [Basic Data Analysis](#6)
1. [Missing Value](#7)
    * [Find Missing Value](#8)
    * [Fill Missing Value](#9)
1. [Visualization](#10)    
    * [Box plot of numerical features](#11)
1. [Outlier Detection](#12)
1. [Feature Engineering](#13)    
    * [One-hot encoding](#14)
    * [Ascending ranking of correlations between features and churn](#15)
1. [Modeling](#16)
    * [Train - Test Split](#17)        
    * [Trial and Conclusion](#18)
    * [Hyperparameter Tuning -- Grid Search -- Cross Validation](#19)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import matplotlib.pyplot as plt # for visualization
import seaborn as sns # for visualization

from collections import Counter

import warnings
warnings.filterwarnings("ignore") # Suppress all warnings for the rest of the session

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a><br>
# 1 | Load and Check Data

In [None]:
d = pd.read_csv('../input/telecom-customer/Telecom_customer churn.csv')
df = d.copy() # I just made a copy for easier experimentation during code writing.
df

In [None]:
print(df.shape) # row x columns of data
print(df.ndim) # dimension of data
print(df.size) # size of data

In [None]:
df.describe(include=['O']) # Only object columns

In [None]:
df.describe() # Only numerical columns

- The table above gives a statistical summary of the numerical columns. 
- For each feature it reports the count, mean, standard deviation, minimum, quartiles and maximum. 
- These values help us get to know the data set, but they are a description only; they are not an input to the machine learning model.

<a id = "2"></a><br>
# 2 | Variable Description
- 1 rev_Mean: Mean monthly revenue (charge amount)
- 2 mou_Mean: Mean number of monthly minutes of use
- 3 totmrc_Mean: Mean total monthly recurring charge
- 4 da_Mean: Mean number of directory assisted calls
- 5 ovrmou_Mean: Mean overage minutes of use
- 6 ovrrev_Mean: Mean overage revenue
- 7 vceovr_Mean: Mean revenue of voice overage
- 8 datovr_Mean: Mean revenue of data overage
- 9 roam_Mean: Mean number of roaming calls
- 10 change_mou: Percentage change in monthly minutes of use vs previous three month average
- 11 change_rev: Percentage change in monthly revenue vs previous three month average
- 12 drop_vce_Mean: Mean number of dropped (failed) voice calls
- 13 drop_dat_Mean: Mean number of dropped (failed) data calls
- 14 blck_vce_Mean: Mean number of blocked (failed) voice calls
- 15 blck_dat_Mean: Mean number of blocked (failed) data calls
- 16 unan_vce_Mean: Mean number of unanswered voice calls
- 17 unan_dat_Mean: Mean number of unanswered data calls
- 18 plcd_vce_Mean: Mean number of attempted voice calls placed
- 19 plcd_dat_Mean: Mean number of attempted data calls placed
- 20 recv_vce_Mean: Mean number of received voice calls
- 21 recv_sms_Mean: N
- 22 comp_vce_Mean: Mean number of completed voice calls
- 23 comp_dat_Mean: Mean number of completed data calls
- 24 custcare_Mean: Mean number of customer care calls
- 25 ccrndmou_Mean: Mean rounded minutes of use of customer care calls
- 26 cc_mou_Mean: Mean unrounded minutes of use of customer care (see CUSTCARE_MEAN) calls
- 27 inonemin_Mean: Mean number of inbound calls less than one minute
- 28 threeway_Mean: Mean number of three way calls
- 29 mou_cvce_Mean: Mean unrounded minutes of use of completed voice calls
- 30 mou_cdat_Mean: Mean unrounded minutes of use of completed data calls
- 31 mou_rvce_Mean: Mean unrounded minutes of use of received voice calls
- 32 owylis_vce_Mean: Mean number of outbound wireless to wireless voice calls
- 33 mouowylisv_Mean: Mean unrounded minutes of use of outbound wireless to wireless voice calls
- 34 iwylis_vce_Mean: N
- 35 mouiwylisv_Mean: Mean unrounded minutes of use of inbound wireless to wireless voice calls
- 36 peak_vce_Mean: Mean number of inbound and outbound peak voice calls
- 37 peak_dat_Mean: Mean number of peak data calls
- 38 mou_peav_Mean: Mean unrounded minutes of use of peak voice calls
- 39 mou_pead_Mean: Mean unrounded minutes of use of peak data calls
- 40 opk_vce_Mean: Mean number of off-peak voice calls
- 41 opk_dat_Mean: Mean number of off-peak data calls
- 42 mou_opkv_Mean: Mean unrounded minutes of use of off-peak voice calls
- 43 mou_opkd_Mean: Mean unrounded minutes of use of off-peak data calls
- 44 drop_blk_Mean: Mean number of dropped or blocked calls
- 45 attempt_Mean: Mean number of attempted calls
- 46 complete_Mean: Mean number of completed calls
- 47 callfwdv_Mean: Mean number of call forwarding calls
- 48 callwait_Mean: Mean number of call waiting calls
- 49 churn: Instance of churn between 31-60 days after observation date
- 50 months: Total number of months in service
- 51 uniqsubs: Number of unique subscribers in the household
- 52 actvsubs: Number of active subscribers in household
- 53 new_cell: New cell phone user
- 54 crclscod: Credit class code
- 55 asl_flag: Account spending limit
- 56 totcalls: Total number of calls over the life of the customer
- 57 totmou: Total minutes of use over the life of the cus
- 58 totrev: Total revenue
- 59 adjrev: Billing adjusted total revenue over the life of the customer
- 60 adjmou: Billing adjusted total minutes of use over the life of the customer
- 61 adjqty: Billing adjusted total number of calls over the life of the customer
- 62 avgrev: Average monthly revenue over the life of the customer
- 63 avgmou: Average monthly minutes of use over the life of the customer
- 64 avgqty: Average monthly number of calls over the life of the customer
- 65 avg3mou: Average monthly minutes of use over the previous three months
- 66 avg3qty: Average monthly number of calls over the previous three months
- 67 avg3rev: Average monthly revenue over the previous three months
- 68 avg6mou: Average monthly minutes of use over the previous six months
- 69 avg6qty: Average monthly number of calls over the previous six months
- 70 avg6rev: Average monthly revenue over the previous six months
- 71 prizm_social_one: Social group letter only
- 72 area: Geographic area
- 73 dualband: Dualband
- 74 refurb_new: Handset: refurbished or new
- 75 hnd_price: Current handset price
- 76 phones: Number of handsets issued
- 77 models: Number of models issued
- 78 hnd_webcap: Handset web capability
- 79 truck: Truck indicator
- 80 rv: RV indicator
- 81 ownrent: Home owner/renter status
- 82 lor: Length of residence
- 83 dwlltype: Dwelling Unit type
- 84 marital: Marital Status
- 85 adults: Number of adults in household
- 86 infobase: InfoBase match
- 87 income: Estimated income
- 88 numbcars: Known number of vehicles
- 89 HHstatin: Premier household status indicator
- 90 dwllsize: Dwelling size
- 91 forgntvl: Foreign travel dummy variable
- 92 ethnic: Ethnicity roll-up code
- 93 kid0_2: Child 0 - 2 years of age in household
- 94 kid3_5: Child 3 - 5 years of age in household
- 95 kid6_10: Child 6 - 10 years of age in household
- 96 kid11_15: Child 11 - 15 years of age in household
- 97 kid16_17: Child 16 - 17 years of age in household
- 98 creditcd: Credit card indicator
- 99 eqpdays: Number of days (age) of current equipment
- 100 Customer_ID: N

In [None]:
# We want to observe the types of variables in the dataset and whether they contain null values.
df.info()

In [None]:
# From the data description, Customer_ID is a unique identifier, so it carries no information the model can learn from.
df.drop(["Customer_ID"], axis = 1, inplace=True)

In [None]:
# List the columns in three categories: object, integer and float.
def columns_categories(data_set):
    object_columns = []
    float_columns = []
    int_columns = []
    for col in data_set.columns:
        dtype = data_set[col].dtype
        if dtype == 'object':
            object_columns.append(col)
        # The pandas dtype helpers match every integer/float width (int32, int64, ...),
        # which a comparison against the platform-dependent 'int' alias does not.
        elif pd.api.types.is_integer_dtype(dtype):
            int_columns.append(col)
        elif pd.api.types.is_float_dtype(dtype):
            float_columns.append(col)
    print('object(', len(object_columns), '):\n', object_columns)
    print('int(', len(int_columns), '):\n', int_columns)
    print('float(', len(float_columns), '):\n', float_columns)

In [None]:
columns_categories(df)

<a id = "3"></a><br>
# Univariate Variable Analysis
- Categorical Variable: 'churn', 'new_cell', 'crclscod', 'asl_flag', 'prizm_social_one', 'area', 'dualband', 'refurb_new', 'hnd_webcap', 'ownrent', 'dwlltype', 'marital', 'infobase', 'HHstatin', 'dwllsize', 'ethnic', 'kid0_2', 'kid3_5', 'kid6_10', 'kid11_15', 'kid16_17', 'creditcd'

- Numerical Variable: 'months', 'uniqsubs', 'actvsubs', 'totcalls', 'adjqty', 'avg3mou', 'avg3qty', 'avg3rev', 'rev_Mean', 'mou_Mean', 'totmrc_Mean', 'da_Mean', 'ovrmou_Mean', 'ovrrev_Mean', 'vceovr_Mean', 'datovr_Mean', 'roam_Mean', 'change_mou', 'change_rev', 'drop_vce_Mean', 'drop_dat_Mean', 'blck_vce_Mean', 'blck_dat_Mean', 'unan_vce_Mean', 'unan_dat_Mean', 'plcd_vce_Mean', 'plcd_dat_Mean', 'recv_vce_Mean', 'recv_sms_Mean', 'comp_vce_Mean', 'comp_dat_Mean', 'custcare_Mean', 'ccrndmou_Mean', 'cc_mou_Mean', 'inonemin_Mean', 'threeway_Mean', 'mou_cvce_Mean', 'mou_cdat_Mean', 'mou_rvce_Mean', 'owylis_vce_Mean', 'mouowylisv_Mean', 'iwylis_vce_Mean', 'mouiwylisv_Mean', 'peak_vce_Mean', 'peak_dat_Mean', 'mou_peav_Mean', 'mou_pead_Mean', 'opk_vce_Mean', 'opk_dat_Mean', 'mou_opkv_Mean', 'mou_opkd_Mean', 'drop_blk_Mean', 'attempt_Mean', 'complete_Mean', 'callfwdv_Mean', 'callwait_Mean', 'totmou', 'totrev', 'adjrev', 'adjmou', 'avgrev', 'avgmou', 'avgqty', 'avg6mou', 'avg6qty', 'avg6rev', 'hnd_price', 'phones', 'models', 'truck', 'rv', 'lor', 'adults', 'income', 'numbcars', 'forgntvl', 'eqpdays'

<a id = "4"></a><br>
## Categorical Variable

In [None]:
obj_col = df.select_dtypes(include = 'object').columns
obj_col

- Churn relation with categorical columns

In [None]:
# new_cell vs churn
sns.countplot(x= "new_cell", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('new_cell')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# crclscod vs churn
sns.countplot(x= "crclscod", hue="churn", data=df);
plt.xticks(rotation = 90)
plt.show()
df.groupby('crclscod')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# asl_flag vs churn
sns.countplot(x= "asl_flag", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('asl_flag')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# prizm_social_one vs churn
sns.countplot(x= "prizm_social_one", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('prizm_social_one')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# area vs churn
sns.countplot(x= "area", hue="churn", data=df);
plt.xticks(rotation = 90)
plt.show()
df.groupby('area')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# dualband vs churn
sns.countplot(x= "dualband", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('dualband')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# refurb_new vs churn
sns.countplot(x= "refurb_new", hue="churn", data=df);
plt.xticks()
plt.show()

df.groupby('refurb_new')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# hnd_webcap vs churn
sns.countplot(x= "hnd_webcap", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('hnd_webcap')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "ownrent", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('ownrent')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "dwlltype", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('dwlltype')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "marital", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('marital')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "infobase", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('infobase')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "HHstatin", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('HHstatin')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "dwllsize", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('dwllsize')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
sns.countplot(x= "ethnic", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('ethnic')["churn"].value_counts(normalize=True).unstack(fill_value=0)

In [None]:
# creditcd vs churn
sns.countplot(x= "creditcd", hue="churn", data=df);
plt.xticks()
plt.show()
df.groupby('creditcd')["churn"].value_counts(normalize=True).unstack(fill_value=0)
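
The cells above repeat the same group-and-normalize step for every categorical column. A minimal sketch of a reusable helper is shown below; the name `churn_share_by_category` and the toy data are assumptions for illustration, and `pd.crosstab(..., normalize="index")` is used here as a swapped-in equivalent of the `groupby(...).value_counts(normalize=True).unstack(fill_value=0)` pattern.

```python
import pandas as pd


def churn_share_by_category(data, column, target="churn"):
    """Return, for every category of `column`, the share of each target value."""
    # normalize="index" makes each row (category) sum to 1.
    return pd.crosstab(data[column], data[target], normalize="index")


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "new_cell": ["Y", "Y", "N", "N", "N", "U"],
    "churn":    [1,   0,   0,   0,   1,   1],
})
share = churn_share_by_category(toy, "new_cell")
print(share)
```

On the real data, the same call could be looped over `obj_col` to produce one table per categorical column.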

<a id = "5"></a><br>
## Numerical Variable

In [None]:
df.iloc[:,:].hist(bins=50,figsize=(23,74),layout=(20,4));

<a id = "6"></a><br>
# 3 | Basic Data Analysis

In [None]:
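# Count customers per class directly from the target column.
# (count()[1] would count the non-null values of the second column instead.)
stay = (df['churn'] == 0).sum()
churn = (df['churn'] == 1).sum()
print("Number of people who stay: " + str(stay))
print("Number of people who churn: " + str(churn))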

In [None]:
# ratio of those who churn and those who don't (counts computed in the previous cell)
sizes = [stay, churn]
labels='NO','YES'
explode = (0, 0.1)  
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode,autopct='%1.1f%%',shadow=True, startangle=75 )
ax1.axis('equal') 
ax1.set_title("Client Churn Distribution")

ax1.legend(labels)

plt.show()

<a id = "7"></a><br>
# 4 | Missing Value
- Find Missing Value
- Fill Missing Value

<a id = "8"></a><br>
## Find Missing Value

In [None]:
df.columns[df.isnull().any()]

In [None]:
# Features with missing values
miss = df.isnull().sum().sort_values(ascending = False).head(44)
miss_per = (miss/len(df))*100

# Percentage of missing values
pd.DataFrame({'No. missing values': miss, '% of missing data': miss_per.values})

<a id = "9"></a><br>
## Fill Missing Value

In [None]:
# We dropped the columns that seem to have no significant contribution to the model.
df.drop(['numbcars','dwllsize','HHstatin','ownrent','dwlltype','lor','income','adults','prizm_social_one','infobase','crclscod'],axis=1,inplace=True)

In [None]:
df['hnd_webcap']=df['hnd_webcap'].fillna('UNKW') # Handset web capability

df['avg6qty']=df['avg6qty'].fillna(df['avg6qty'].mean()) # Average monthly number of calls over the previous six months
df['avg6rev']=df['avg6rev'].fillna(df['avg6rev'].mean()) # Average monthly revenue over the previous six months
df['avg6mou']=df['avg6mou'].fillna(df['avg6mou'].mean()) # Average monthly minutes of use over the previous six months

df['change_mou']=df['change_mou'].fillna(df['change_mou'].mean()) # Percentage change in monthly minutes of use vs previous three month average
df['change_rev']=df['change_rev'].fillna(df['change_rev'].mean()) # Percentage change in monthly revenue vs previous three month average

df['rev_Mean']=df['rev_Mean'].fillna(df['rev_Mean'].mean())
df['totmrc_Mean']=df['totmrc_Mean'].fillna(df['totmrc_Mean'].mean())
df['da_Mean']=df['da_Mean'].fillna(df['da_Mean'].mean())
df['ovrmou_Mean']=df['ovrmou_Mean'].fillna(df['ovrmou_Mean'].mean())
df['ovrrev_Mean']=df['ovrrev_Mean'].fillna(df['ovrrev_Mean'].mean())
df['vceovr_Mean']=df['vceovr_Mean'].fillna(df['vceovr_Mean'].mean())
df['datovr_Mean']=df['datovr_Mean'].fillna(df['datovr_Mean'].mean())
df['roam_Mean']=df['roam_Mean'].fillna(df['roam_Mean'].mean())
df['mou_Mean']=df['mou_Mean'].fillna(df['mou_Mean'].mean())
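
The repeated `fillna` lines above all apply the same rule: replace missing values with the column mean. A compact sketch of that rule on toy data follows; the frame `toy` and the list name `mean_fill_cols` are hypothetical, and only two of the real column names are reused.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the column names mirror two of the real features.
toy = pd.DataFrame({
    "rev_Mean": [10.0, np.nan, 30.0],
    "mou_Mean": [100.0, 200.0, np.nan],
})

mean_fill_cols = ["rev_Mean", "mou_Mean"]
# DataFrame.mean() skips NaN by default, so each column is filled
# with the mean of its own observed values.
toy[mean_fill_cols] = toy[mean_fill_cols].fillna(toy[mean_fill_cols].mean())
print(toy)
```

Passing a Series of means to `fillna` matches each mean to its column by name, so extending the list is enough to impute more columns.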


In [None]:
#VISUALIZATION OF NAN  VALUES
import missingno as msno
msno.matrix(df);

In [None]:
df.dropna(inplace=True)

In [None]:
sum(df.isnull().sum()>0)

In [None]:
columns_categories(df)

In [None]:
numerical_features = ['months', 'uniqsubs', 'actvsubs', 'totcalls', 'avg3qty', 'avg3rev','rev_Mean', 'mou_Mean', 'totmrc_Mean', 'da_Mean', 'ovrmou_Mean', 'datovr_Mean', 
                      'roam_Mean', 'change_mou', 'change_rev', 'drop_vce_Mean', 'drop_dat_Mean', 'blck_vce_Mean', 'blck_dat_Mean', 'unan_vce_Mean', 'unan_dat_Mean', 
                      'plcd_vce_Mean', 'plcd_dat_Mean', 'recv_vce_Mean', 'recv_sms_Mean', 'custcare_Mean', 'ccrndmou_Mean', 'threeway_Mean', 'mou_cvce_Mean', 
                      'mou_cdat_Mean', 'mou_rvce_Mean', 'owylis_vce_Mean', 'mouowylisv_Mean', 'iwylis_vce_Mean', 'mouiwylisv_Mean', 'peak_vce_Mean', 'peak_dat_Mean', 
                      'mou_peav_Mean', 'mou_pead_Mean', 'opk_vce_Mean', 'opk_dat_Mean', 'mou_opkv_Mean', 'drop_blk_Mean', 'callfwdv_Mean', 'callwait_Mean', 'totmou', 
                      'totrev', 'avgrev', 'avgmou', 'avgqty', 'avg6mou', 'avg6rev', 'hnd_price', 'phones', 'models', 'truck', 'rv', 'forgntvl', 'eqpdays']

In [None]:
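# Dampen large magnitudes: values with |x| >= 1 become sqrt(|x|) (the sign is dropped);
# values strictly between -1 and 1 are left unchanged.
f_sqrt = (lambda x: np.sqrt(abs(x)) if (x>=1) or (x<=-1) else x)
for i in numerical_features:
    df[i] = df[i].apply(f_sqrt)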
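
The element-wise transform above can also be written in vectorized form, which tends to be faster on a table of this size. This is a sketch under the assumption that the intended rule is exactly "sqrt of the absolute value when |x| >= 1, otherwise unchanged"; the helper name `damped_sqrt` is hypothetical.

```python
import numpy as np
import pandas as pd


def damped_sqrt(series):
    """Replace values with |x| >= 1 by sqrt(|x|); keep values in (-1, 1) as they are."""
    magnitude = series.abs()
    return pd.Series(np.where(magnitude >= 1, np.sqrt(magnitude), series),
                     index=series.index)


toy = pd.Series([-9.0, -0.5, 0.0, 0.25, 4.0, 16.0])
print(damped_sqrt(toy).tolist())  # [3.0, -0.5, 0.0, 0.25, 2.0, 4.0]
```

Note that the rule discards the sign of values at or below -1 (here -9.0 becomes 3.0), which matters for signed features such as `change_mou` and `change_rev`.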

<a id = "10"></a><br>
# 5 | Visualization

<a id = "11"></a><br>
# Box plot of numerical features

In [None]:
# Box plot of numerical features
fig, ax = plt.subplots(15, 4, figsize = (20, 50))
ax = ax.flatten()
for i, c in enumerate(numerical_features):
    sns.boxplot(x = df[c], ax = ax[i], palette = 'Set3')
# plt.suptitle('Box Plot', fontsize = 25)
fig.tight_layout()

- columns without outliers : 
- 'months', 'uniqsubs', 'actvsubs', 'totcalls', 'avg3qty', 'avg3rev','rev_Mean', 'mou_Mean', 'totmrc_Mean', 'da_Mean', 'ovrmou_Mean', 'datovr_Mean', 
   'roam_Mean', 'change_mou', 'change_rev', 'drop_vce_Mean', 'drop_dat_Mean', 'blck_vce_Mean', 'blck_dat_Mean', 'unan_vce_Mean', 'unan_dat_Mean', 
   'plcd_vce_Mean', 'plcd_dat_Mean', 'recv_vce_Mean', 'recv_sms_Mean', 'custcare_Mean', 'ccrndmou_Mean', 'threeway_Mean', 'mou_cvce_Mean', 
   'mou_cdat_Mean', 'mou_rvce_Mean', 'owylis_vce_Mean', 'mouowylisv_Mean', 'iwylis_vce_Mean', 'mouiwylisv_Mean', 'peak_vce_Mean', 'peak_dat_Mean', 
   'mou_peav_Mean', 'mou_pead_Mean', 'opk_vce_Mean', 'opk_dat_Mean', 'mou_opkv_Mean', 'drop_blk_Mean', 'callfwdv_Mean', 'callwait_Mean', 'totmou', 
   'totrev', 'avgrev', 'avgmou', 'avgqty', 'avg6mou', 'avg6rev', 'hnd_price', 'phones', 'models', 'truck', 'rv', 'forgntvl', 'eqpdays'

<a id = "12"></a><br>
# 6 | Outlier Detection

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    # Counter collapses repeated row labels; return every row flagged by at least
    # one feature as a plain list, which df.loc and df.drop both accept.
    outlier_indices = Counter(outlier_indices)
    return list(outlier_indices)

In [None]:
df.loc[detect_outliers(df,['uniqsubs', 'actvsubs'])]

In [None]:
# drop outliers
df = df.drop(detect_outliers(df,['uniqsubs', 'actvsubs']),axis = 0).reset_index(drop = True)
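
To make the 1.5 x IQR rule used by `detect_outliers` concrete, here is a small worked example on made-up values (they are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up values with one obvious extreme point.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

q1, q3 = np.percentile(values, [25, 75])   # 2.25 and 4.0
iqr = q3 - q1                              # 1.75
lower = q1 - 1.5 * iqr                     # -0.375
upper = q3 + 1.5 * iqr                     # 6.625

# Anything outside [lower, upper] is flagged as an outlier.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [20]
```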

<a id = "13"></a><br>
# 7 | Feature Engineering

<a id = "14"></a><br>
# One-hot encoding
- Before looking at the correlation, let's make the categorical variables numerical with get_dummies.

In [None]:
# Number of unique values in each object column
encoding_col=[]
for i in df.select_dtypes(include='object'):   
    print(i,'-->',df[i].nunique())
    encoding_col.append(i)

In [None]:
# one-hot encode every object column; drop_first=True drops one dummy level per column
df2 = df.copy()
df2 = pd.get_dummies(df2, drop_first=True, columns = encoding_col, prefix = encoding_col)

In [None]:
display(df.shape)
display(df2.shape)

In [None]:
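# Create correlation matrix (numeric columns only; object columns cannot be correlated)
corr_matrix = df.corr(numeric_only=True).abs()
# print(corr_matrix)

# Select upper triangle of correlation matrix
# (np.bool was removed from recent NumPy releases; the built-in bool does the same job)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features (note: this modifies df only, not the one-hot encoded df2)
df.drop(columns=to_drop, inplace=True)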
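
The upper-triangle trick above keeps one feature from every highly correlated pair. A small sketch on synthetic data shows the effect; the column names and the random data are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,        # linear function of "a", so |correlation| is 1
    "b": rng.normal(size=200),    # independent noise
})

corr = toy.corr().abs()
# Keep only entries above the diagonal so that each pair is inspected once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any column to its left.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped.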

<a id = "15"></a><br>
# Ascending ranking of correlations between features and churn

In [None]:
c = df.corr(numeric_only=True)['churn'].abs()
sc = c.sort_values()
sc

In [None]:
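# Names of the 40 columns with the largest absolute correlation with churn,
# stored as a list so that they can be used to index df2 below.
b = list(sc.tail(40).index)
print(sorted(b))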

In [None]:
plt.figure(figsize=(30,11))
sns.heatmap(df2[b].corr(), annot = True, fmt = ".2f",robust=True,linewidths=1.3,linecolor = 'gold')
plt.show()

In [None]:
# Get Correlation of "churn" with other variables:
plt.figure(figsize=(15,8))
df2[b].corr()['churn'].sort_values(ascending = False).plot(kind='bar')

<a id = "16"></a><br>
# 8 | Modelling

<a id = "17"></a><br>
# Train - Test Split

In [None]:
# Import Machine learning algorithms
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

#Import metric for performance evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report,confusion_matrix, ConfusionMatrixDisplay

#Split data into train and test sets
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV

In [None]:
# dependent and independent variables were determined.
X = df2.drop('churn', axis=1)
y = df2['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train",len(X_train))
print("X_test",len(X_test))
print("y_train",len(y_train))
print("y_test",len(y_test))

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
#Defining the modelling function
def modeling(alg, alg_name, params={}):
    model = alg(**params) #Instantiating the algorithm class and unpacking parameters if any
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
      
    #Performance evaluation
    def print_scores(alg, y_true, y_pred):
        print(alg_name)
        acc_score = accuracy_score(y_true, y_pred)
        print("accuracy: ",acc_score)
        pre_score = precision_score(y_true, y_pred)
        print("precision: ",pre_score)
        rec_score = recall_score(y_true, y_pred)                            
        print("recall: ",rec_score)
        f_score = f1_score(y_true, y_pred, average='weighted')
        print("f1_score: ",f_score)        
    print_scores(alg, y_test, y_pred)
    
    
    cm = confusion_matrix(y_test, y_pred)
    #Create the Confusion Matrix Display Object(cmd_obj). 
    # confusion_matrix orders classes by sorted label value, so 0 (notChurn) comes before 1 (churn).
    cmd_obj = ConfusionMatrixDisplay(cm, display_labels=['notChurn', 'churn'])

    #The plot() function has to be called for the sklearn visualization
    cmd_obj.plot()

    #Use the Axes attribute 'ax_' to get to the underlying Axes object.
    #The Axes object controls the labels for the X and the Y axes. It also controls the title.
    cmd_obj.ax_.set(
                    title='Sklearn Confusion Matrix with labels!!', 
                    xlabel='Predicted Churn', 
                    ylabel='Actual Churn')
    #Finally, call the matplotlib show() function to display the visualization of the Confusion Matrix.
    plt.show()
    
    return model
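
The metrics printed by `modeling` all derive from the four cells of the confusion matrix. The sketch below uses made-up labels to show how scikit-learn lays out those cells for 0/1 targets and how precision and recall follow from them; the `_demo` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = customer stayed, 1 = customer churned.
y_true_demo = [0, 0, 0, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes, both in sorted label order.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm_demo.ravel()
precision_demo = tp / (tp + fp)   # share of predicted churners who really churned
recall_demo = tp / (tp + fn)      # share of real churners that were caught
print(cm_demo)
print("precision:", precision_demo, "recall:", recall_demo)
```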

In [None]:
# Running RandomForestClassifier model
RF_model = modeling(RandomForestClassifier, 'Random Forest')

In [None]:
# LightGBM model
LGBM_model = modeling(lgb.LGBMClassifier, 'Light GBM')

In [None]:
#Decision tree
dt_model = modeling(DecisionTreeClassifier, "Decision Tree Classification")

In [None]:
#Naive bayes 
nb_model = modeling(GaussianNB, "Naive Bayes Classification")

In [None]:
# Ada Boost
ada_model=modeling(AdaBoostClassifier, "Ada Boost Classifier")

In [None]:
# Gradient Boosting
gbm_model=modeling(GradientBoostingClassifier, "Gradient Boosting Classifier")

<a id = "18"></a><br>
## Trial and Conclusion

In [None]:
clf = lgb.LGBMClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

In [None]:
cm = confusion_matrix(y_test, y_pred)
cmd_obj = ConfusionMatrixDisplay(cm, display_labels=['0', '1'])  # classes appear in sorted order: 0, then 1
cmd_obj.plot()
cmd_obj.ax_.set(
                    title='Confusion Matrix with labels!!', 
                    xlabel='Predicted Churn', 
                    ylabel='Actual Churn'
                    )
plt.show()

In [None]:
# Scores recorded for each model (stored under a new name so that df is not overwritten)
results = pd.DataFrame({'Models':['RF','LGBM','DT', 'NB','ADA','GBM'], 'Prediction':[0.616, 0.631, 0.552, 0.539 ,0.614, 0.626]})
ax = results.plot.barh(x='Models', y='Prediction', rot=0)

<a id = "19"></a><br>
# Hyperparameter Tuning -- Grid Search -- Cross Validation
We will compare two ML classifiers and evaluate the mean accuracy of each using stratified cross-validation.
* Random Forest
* Light GBM

In [None]:
random_state = 42
classifier = [RandomForestClassifier(random_state = random_state),
             lgb.LGBMClassifier(random_state = random_state)]

rf_param_grid = {"max_features": [1,3,10],
                "min_samples_split":[2,3,10],
                "min_samples_leaf":[1,3,10],
                "bootstrap":[False],
                "n_estimators":[100,300],
                "criterion":["gini"]}

lgbm_params = {'n_estimators': [100, 500, 1000],
                'subsample': [0.6, 0.8, 1.0],
                'max_depth': [3, 4, 5],
                'learning_rate': [0.1,0.01,0.02],
                "min_child_samples": [5,10,20]}

classifier_param = [rf_param_grid,                   
                   lgbm_params]

In [None]:
# cv_result = []
# best_estimators = []
# for i in range(len(classifier)):
#     clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
#     clf.fit(X_train,y_train)
#     cv_result.append(clf.best_score_)
#     best_estimators.append(clf.best_estimator_)
#     print(cv_result[i])

In [None]:
# cv_results = pd.DataFrame({"Cross Validation Means":cv_result, "ML Models":[ "RandomForestClassifier","LGBMClassifier"]})

# g = sns.barplot(x="Cross Validation Means", y="ML Models", data = cv_results)
# g.set_xlabel("Mean Accuracy")
# g.set_title("Cross Validation Scores")
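
Since the grid search above is commented out (the full grids are expensive to run), here is a much smaller sketch of the same idea: `GridSearchCV` with stratified folds and accuracy scoring. The synthetic data from `make_classification`, the reduced parameter grid and the three folds are all assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic binary classification problem standing in for the churn data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Deliberately tiny grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [10, 30], "min_samples_leaf": [1, 3]}

# Stratified folds keep the class ratio roughly equal in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("mean CV accuracy of the best setting:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on all of the training data with the best parameters, which is what the commented-out loop collects in `best_estimators`.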