## <font color=blue> Project: Churn Fortune Teller </font>


## <font color=blue>Objectives: Exploratory Data Analysis </font>
Produce visualisations to understand the importance of each predictor variable, as well as its
underlying data distribution.<br>
This enables us to determine what each predictor variable means and how strongly it influences
the prediction of customer churn.<br>

K-means clustering is carried out to profile customers by their tenure and monthly charges.

Lastly, predictive models are trained on the dataset to predict which customers will churn.

## <font color=blue>Overview: </font>

1) Importing Data & Examining Data Types<br><br>
2) Data Visualisation
   - 2.1 Categorical Influencers<br>
   - 2.2 Numerical Influencers <br>
   
3) Cluster Analysis Based on Tenure and Monthly Charges<br>
4) Boxplots Of Monthly Charges Against Categorical Predictors<br>
5) Feature Engineering<br>
6) Predictive Modelling (XGBoost Classifier, Random Forest Classifier, KNN Classifier, Logistic Regression)<br>
7) Comparing Model performances<br>
8) Model Interpretation

## 1) Importing Data & Examining Data Types

In [None]:
#Importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualization
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns

churn_data=pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
churn_data.info()

In [None]:
churn_data.TotalCharges = pd.to_numeric(churn_data.TotalCharges, errors='coerce')
churn_data.TotalCharges = churn_data.TotalCharges.ffill()  # fillna(method='ffill') is deprecated in recent pandas

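There are 11 missing values in the 'TotalCharges' column. They are blank strings that become NaN when the column is coerced to numeric, and they are forward-filled in the cell above.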
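To see why the coercion produces missing values, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset. Blank strings cannot be parsed as numbers, so `errors='coerce'` turns them into NaN, which can then be counted and forward-filled.

```python
import pandas as pd

# Toy stand-in for the raw TotalCharges column (illustrative values only)
raw = pd.Series(["29.85", "1889.5", " ", "108.15", " "])

# Unparseable entries such as " " become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isnull().sum())

# Forward fill replaces each NaN with the last valid value above it
filled = numeric.ffill()

print(n_missing)
print(filled.tolist())
```

Note that a forward fill copies the value from the previous row, which belongs to a different customer, so it is a convenience rather than a principled imputation.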

In [None]:
# TotalCharges was already converted and forward-filled above; confirm that no missing values remain
churn_data.isnull().sum()

In [None]:
churn_data.info()

In [None]:
churn_data.nunique() #Number of unique values in each column

In [None]:
print(churn_data["Churn"].value_counts()/len(churn_data)*100)

## 2) Exploratory Data Analysis

In [None]:
#Separating categorical and numerical columns
Id_col     = ['customerID']
target_col = ["Churn"]
cat_cols   = churn_data.nunique()[churn_data.nunique() < 6].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col] #categorical predictor variables
num_cols   = [x for x in churn_data.columns if x not in cat_cols + target_col + Id_col] #numerical predictor variables

__References for plots:__ <br>
Udemy course on plotly & dash: https://github.com/Pierian-Data/Plotly-Dashboards-with-Dash<br>
Plotly website basic tutorials: https://plotly.com/python/line-and-scatter/#line-plot-with-plotly-express<br>
Plotly colors: https://plotly.com/python/discrete-color/<br>

### 2.1 Customer Attrition based on categorical influencers

In [None]:
#Separating churn and non churn customers
churn = churn_data[churn_data["Churn"] == "Yes"]
not_churn = churn_data[churn_data["Churn"] == "No"]

In [None]:

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go#visualization


def plot_pie(column) :
    trace1 = go.Pie(values  = churn[column].value_counts().values.tolist(),
                    labels  = churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    domain  = dict(x = [0,.48]),
                    name    = "Churn Customers",
                    marker  = dict(line = dict(width = 2,
                                               color = "rgb(243,243,243)")),
                    hole    = .6
                   )
    trace2 = go.Pie(values  = not_churn[column].value_counts().values.tolist(),
                    labels  = not_churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    marker  = dict(line = dict(width = 2,
                                               color = "rgb(243,243,243)")
                                  ),
                    domain  = dict(x = [.52,1]),
                    hole    = .6,
                    name    = "Non churn customers" 
                   )
    layout = go.Layout(dict(title = column + " distribution in customer attrition ",
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            annotations = [dict(text = "churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Non churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88,y = .5
                                               )
                                          ]
                           )
                      )
    data = [trace1,trace2]
    fig  = go.Figure(data = data,layout = layout)
    py.iplot(fig)

In [None]:
#for all categorical columns plot pie
for i in cat_cols :
    plot_pie(i)

__Inferences from pie charts:__<br>
1) Gender is not a good indicator of churn <br>
2) Customers who don't have partners are more likely to churn <br>
3) Customers without dependents are also more likely to churn <br>
4) Customers who are on month-to-month contracts are likely to abandon the company's services <br>
5) Customers who have internet service and opt for paperless billing and automatic payment services are more
   likely to churn. These groups of customers tend to be tech savvy, read widely and keep up to date on the latest market
   trends and rates.<br>
6) Customers who enjoy premium streaming services are likely to leave if they are lured by competitors offering
   similar services at competitive prices and with better quality.<br>
7) Customers also tend to leave because of a lack of technical support and online security, as they're unlikely to find success with the company's products.<br>
8) Presence of phone service, especially multiple lines, drives churn.

In [None]:
sns.heatmap(pd.crosstab(churn_data.Dependents, churn_data.Partner, normalize='all', margins=True), annot=True, cmap='ocean')

In [None]:
sns.heatmap(pd.crosstab(churn_data.Dependents, churn_data.SeniorCitizen, normalize='all', margins=True), annot=True, cmap='ocean')


Senior citizens have a higher probability of having dependents.<br>
People without partners generally do not have dependents.

In [None]:
sns.heatmap(pd.crosstab(churn_data.PhoneService, churn_data.MultipleLines, normalize='all', margins=True), annot=True, cmap='ocean')


Those with phone service are about equally likely to have multiple phone lines or a single line.<br>
<br>Multiple lines is therefore not a strong predictor.
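The heatmaps in this section use `normalize='all'`, so each cell is a share of all customers. To read a conditional probability directly (for instance, the share of phone-service customers who have multiple lines), `normalize='index'` can be used instead. A small sketch on made-up rows, not the Telco data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "PhoneService":  ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "MultipleLines": ["Yes", "No", "Yes", "No",
                      "No phone service", "No phone service"],
})

# normalize='index' makes every row sum to 1,
# i.e. each row reads as P(MultipleLines | PhoneService)
cond = pd.crosstab(toy.PhoneService, toy.MultipleLines, normalize="index")
p_multi_given_phone = cond.loc["Yes", "Yes"]

print(cond)
print(p_multi_given_phone)
```

With this normalisation, the row for `PhoneService == "Yes"` can be compared across columns without dividing by the row margin by hand.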

In [None]:
sns.heatmap(pd.crosstab(churn_data.InternetService,churn_data.PaymentMethod, normalize='all', margins=True), annot=True, cmap='ocean')


In [None]:
sns.heatmap(pd.crosstab(churn_data.InternetService,churn_data.PaperlessBilling, normalize='all', margins=True), annot=True, cmap='ocean')


People who opt for paperless billing tend to utilise internet service. <br>
Those with internet service have a clear-cut preference for automatic transfers, especially Fiber optic subscribers.<br>
Those without internet service mostly pay by mailed check.

In [None]:
sns.heatmap(pd.crosstab(churn_data.SeniorCitizen,churn_data.PaymentMethod, normalize='all', margins=True), annot=True, cmap='ocean')


People generally prefer manual transfer, probably due to safety reasons as well as lower cost, compared to automatic transfers.<br>
This is regardless of age.

### 2.2 Customer Churn Analysis based on numeric influencers

In [None]:
#function  for histogram for customer attrition types
def plot_histogram(column) :
    trace1 = go.Histogram(x = churn[column],
                          histnorm = "percent",
                          name = "Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    )
                                        ),   
                         opacity = .6 
                         ) 
    
    trace2 = go.Histogram(x = not_churn[column],
                          histnorm = "percent",
                          name = "Non churn customers",
                          marker = dict(line = dict(width = .5,
                                              color = "black"
                                             )
                                 ),
                          opacity = .6
                         )         
    
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " distribution in customer attrition ",
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = column,
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                             title = "percent",
                                             zerolinewidth=1,
                                             ticklen=5,
                                             gridwidth=2
                                            ),
                           ))
    fig  = go.Figure(data=data,layout=layout)
    py.iplot(fig)

In [None]:
#for all numerical columns plot histogram
for i in num_cols :
    plot_histogram(i)


Inferences from histogram diagrams: <br>
1) 39% of the churn customers have a tenure of about 5 months. <br>
2) Churn customers have monthly charges peaked at around $75 per month.<br>
3) Approximately 55% of churn customers have a cumulative total charge of 900 dollars.

In [None]:
import plotly.express as px

def plotly_scatterplot(xc, yc, colour, template, trendline=None):
    fig1 = px.scatter(churn_data, x=xc, y=yc,
                color=colour, render_mode='svg', template=template,
                hover_name="customerID",
                marginal_x=None,
                marginal_y=None, trendline=trendline)
    return fig1

In [None]:
plotly_scatterplot(xc='MonthlyCharges', yc='TotalCharges', colour='Churn', template='plotly_dark',trendline='ols')

In [None]:
plotly_scatterplot(xc='MonthlyCharges', yc='TotalCharges', colour='Contract', template='plotly')

It is understood from the two scatterplots that:

1) Clients with __lower tenure__ are more likely to churn 

2) Clients with __higher MonthlyCharges__ are also more likely to churn

3) Tenure and MonthlyCharges are **very significant** features in determining churn outcome


## 3. Cluster Analysis Based On Monthly Charges and Tenure

K-means clustering can be used to partition the dataset based on tenure and monthly charges, the significant numeric variables. <br>
The purpose is to group instances with similar traits together.<br>
The K in K-means denotes the number of clusters. <br>
The algorithm initialises the cluster centroids, then repeatedly assigns each point to its nearest centroid and moves each centroid to the mean of its points. <br>
It is bound to converge to a solution after some iterations.<br>
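K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. The sketch below is a simplified plain-Python illustration of that loop (Lloyd's algorithm with random initial centroids), not the scikit-learn implementation used in this notebook. The function name `simple_kmeans` and the toy points are assumptions made for the example.

```python
import random

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on a list of (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids

# Two obvious groups of (scaled tenure, scaled monthly charge) pairs
toy_points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1),
              (0.9, 0.8), (0.85, 0.9), (0.8, 0.85)]
labels, centroids = simple_kmeans(toy_points, k=2)
print(labels)
```

On well-separated toy data like this the loop settles within a few iterations. On real data the result can depend on the starting centroids, which is one reason library implementations offer smarter seeding such as `k-means++`.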


In [None]:
from sklearn.cluster import KMeans 
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

scaler = MinMaxScaler()
churn_data[['MonthlyCharges','tenure']] = scaler.fit_transform(churn_data[['MonthlyCharges','tenure']])


def elbow_plot(data=churn_data[['MonthlyCharges','tenure']]):
    score = []
    for cluster in range(1,11):
        kmeans = KMeans(n_clusters = cluster, init="k-means++", random_state=10)
        kmeans.fit(data)
        score.append(kmeans.inertia_)
    
    plt.plot(range(1,11), score)
    plt.title('The Elbow Method')
    plt.xlabel('no of clusters')
    plt.ylabel('wcss')
    plt.grid()
    return plt.show()

elbow_plot()

•	__Inertia__ is the sum of squared distances between each point and the centroid of its cluster, totalled over all clusters. Therefore, the smaller the inertia, the denser the clusters (the closer together the points are) <br>
•	A tip for choosing the optimal number of clusters is to look at the rate of decrease in inertia for each additional cluster <br>
•	The __optimal number__ of clusters is __4__ since inertia does not decrease noticeably when further clusters are added
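As a quick illustration of what inertia measures, the sketch below computes the within-cluster sum of squares by hand for a toy one-dimensional dataset. The data and the helper name `inertia` are assumptions for the example, and the cluster assignments are fixed by hand rather than fitted.

```python
def inertia(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        centre = sum(members) / len(members)
        total += sum((v - centre) ** 2 for v in members)
    return total

# Toy data with two natural groups, partitioned three different ways
one_cluster = [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]]
two_clusters = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
three_clusters = [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0]]

wcss = [inertia(c) for c in (one_cluster, two_clusters, three_clusters)]
print(wcss)
```

Going from one to two clusters cuts the inertia sharply, while a third cluster barely helps. That levelling-off is the "elbow" the plot is read for.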


In [None]:
#Apply kmeans clustering to the entire dataset
kmeans = KMeans(n_clusters = 4, random_state = 1000).fit(churn_data[['MonthlyCharges','tenure']])
churn_data['cluster'] = kmeans.labels_
churn_data[['MonthlyCharges','tenure']] = scaler.inverse_transform(churn_data[['MonthlyCharges','tenure']])


#Plot a plotly interactive scatter plot
fig = px.scatter(churn_data, x='MonthlyCharges', y='tenure',
                color='cluster', render_mode='svg', template='plotly',
                hover_data=['SeniorCitizen','Dependents','Contract','InternetService',
                            'PaperlessBilling','PaymentMethod'],
                hover_name="customerID",
                marginal_x="violin",
                marginal_y="violin")

fig.update_layout(title='Clusters of customers by monthly charges and tenure',
                  paper_bgcolor='LightBlue')
                      
fig.show()

Overall, clusters are well segregated as seen above.<br>

Cluster 0: High tenure, high monthly charge <br>
Cluster 1: Low tenure, low monthly charge <br>
Cluster 2: Low tenure, high monthly charge <br>
Cluster 3: High tenure, low monthly charge <br>


The pivot table below shows the mean monthly charges and tenure in each cluster, split by senior citizen status.<br>
The figures in the table can be verified by hovering the cursor over the interactive graph above.

In [None]:
cluster_charges = pd.pivot_table(churn_data, index=['cluster'], columns=['SeniorCitizen'],
                     values=['MonthlyCharges','tenure'], margins=True, aggfunc='mean')

sns.heatmap(cluster_charges, annot=True, cmap='ocean')

In [None]:
churn_data['cluster'].value_counts() #count distribution of clusters

In [None]:
sns.countplot(x='cluster', hue='Churn', data=churn_data)  # recent seaborn versions require x= as a keyword

In [None]:
sns.countplot(x='Churn', hue='MultipleLines', data=churn_data)  # recent seaborn versions require x= as a keyword

Overall, clusters are well segregated as seen above.<br>

Cluster 0: High tenure, high monthly charge <br>
Cluster 1: Low tenure, low monthly charge <br>
Cluster 2: Low tenure, high monthly charge <br>
Cluster 3: High tenure, low monthly charge <br>

Clusters in descending order of churn probability: 2, 1, 0, 3 <br><br>

Cluster 3 is defined by high tenure and low monthly charges, __ideal for retaining customers__.<br>
Customers in cluster 2 (low tenure and high monthly charges) have the highest probability of churning.<br>
Cluster 0 customers have high tenure but high monthly charges, and they churn more than cluster 3. This shows that monthly charge is also an important predictor.

In [None]:
sns.countplot(x='cluster', hue='SeniorCitizen', data=churn_data)  # recent seaborn versions require x= as a keyword

Senior citizens fall under clusters 0 and 2, clusters with high monthly charge. This means high monthly charge is problematic for senior citizens.

In [None]:
sns.countplot(x='cluster', hue='Contract', data=churn_data)  # recent seaborn versions require x= as a keyword

Customers on two-year contracts are found in clusters with high tenure.<br>
Customers on month-to-month contracts are found in clusters with low tenure, and they contribute significantly to churn.

## 4. Boxplots Measuring Monthly Cost Against Amenities and Securities

Let's investigate monthly charge

In [None]:
def plot_boxplot(column):
    fig = px.box(churn_data, x=column, y="MonthlyCharges", color="Churn",points="outliers", 
                 hover_name="customerID",template='plotly')
    fig.update_traces(quartilemethod="inclusive")
    fig.update_layout(title='Monthly Charges against {} Segregated By Churn and Non-Churn Customers'.format(column))
    return fig.show()

In [None]:
for cols in ['MultipleLines','OnlineSecurity','StreamingTV','PaperlessBilling','PaymentMethod','SeniorCitizen']:
    plot_boxplot(cols)

__Boxplot Inferences:__<br>
1) Senior citizens tend to have higher monthly costs, even among those who churn. They are likely to churn but bring in substantial revenue.<br>
2) Manual payment by mailed check is less costly. Mailed check payment has the largest range.<br>
3) Presence of internet service increases monthly cost significantly. Adding streaming services raises costs further.<br>
4) Having a phone line increases cost. Multiple phone lines raise monthly cost further.<br>
5) Large cost increases due to internet and phone related services increase the probability of churn.<br>

## 5. Feature Engineering

In [None]:
# A person is a family person if he/she has a partner and dependents.
# These customers may run up higher monthly bills and may churn
churn_data['Family_Person'] = np.where((churn_data['Dependents']=='Yes') & (churn_data['Partner']=='Yes'),1,0)


# Protection is defined by availability of security, backup and customer technical support
churn_data['Protection'] = np.where((churn_data['TechSupport'] == 'Yes') |\
                                    (churn_data['OnlineSecurity'] == 'Yes') |\
                                    (churn_data['OnlineBackup'] == 'Yes') |\
                                    (churn_data['DeviceProtection'] == 'Yes'),1,0)

# Total services - total counts of phone, internet, streaming and protection related services
# InternetService holds 'DSL'/'Fiber optic'/'No' rather than 'Yes', so it is counted separately
churn_data['TotalServices'] = (churn_data[['PhoneService', 'OnlineSecurity',
                                       'OnlineBackup', 'DeviceProtection', 'TechSupport',
                                       'StreamingTV', 'StreamingMovies']]== 'Yes').sum(axis=1) \
                              + (churn_data['InternetService'] != 'No').astype(int)

# Presence of internet determines churn probability as well as numerous facilities availability
churn_data['Has_Internet']=churn_data['InternetService'].replace(['Fiber optic','DSL','No'], [1,1,0])

# Presence of either streaming movies or TV show facilities determine if a customer is using streaming services
churn_data['Streaming'] = np.where((churn_data['StreamingTV']=='Yes') | (churn_data['StreamingMovies']=='Yes'),1,0)

# Mailed check payment does not utilise modern electronic technology
churn_data['Tech_Payment'] = np.where((churn_data['PaymentMethod']!='Mailed check'),1,0)

# A person is a techie if he utilises high-tech payment methods or enjoys streaming services
churn_data['Techie'] = np.where((churn_data['Streaming']==1) | (churn_data['Tech_Payment']==1),1,0)

# Premium services defined by fiber optic usage and multiple phone lines
churn_data['Premium_Services'] = np.where((churn_data['MultipleLines']=='Yes') & (churn_data['InternetService']=='Fiber optic'),1,0)

In [None]:
y=churn_data['Churn']

#Get rid of columns you don't wish to feed to machine learning models
drop_list=["customerID","Churn","gender","cluster","Partner","Dependents","StreamingTV","StreamingMovies","TechSupport",
           "OnlineSecurity","OnlineBackup","DeviceProtection","TotalCharges","MultipleLines","PhoneService","Contract",
          "InternetService","PaymentMethod"]
x=churn_data.drop(drop_list, axis=1)

In [None]:
def label_encoder(dataframe, col_name): # Function for label encoding categorical variables
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    le.fit(dataframe[col_name].unique())
    dataframe[col_name] = le.transform(dataframe[col_name])
    
for i in list(x.columns[x.dtypes =='object']):
    label_encoder(x, i)

## 6. Predictive Modelling
### 6.1 Random Forest

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=8,stratify=y,train_size=0.75)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(bootstrap=True, oob_score=True, n_estimators=30, random_state=8,
                            max_depth=8, class_weight={'No': 1, 'Yes': 1.6},n_jobs=-1,max_features=0.3,min_samples_leaf=3,
                           min_samples_split=3)
rf.fit(x_train, y_train)

In [None]:
print("Training set score: {:.3f}".format(rf.score(x_train, y_train)))
print("Test set score: {:.3f}".format(rf.score(x_test, y_test)))
print("Out-of-bag score: {:.3f}".format(rf.oob_score_))

In [None]:
rf_predictions = rf.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix

ax = sns.heatmap(confusion_matrix(y_test, rf_predictions), annot=True, fmt='d')
ax.set(xlabel='Random Forest Prediction', ylabel='Truth')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,rf_predictions))

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param = dict(max_depth = np.arange(6,15,1), max_features = np.arange(0.1,0.9,0.1))

rf_cv = RandomizedSearchCV(rf, param, cv=5,n_jobs=-1)
# Fit it to the data
rf_cv.fit(x_train, y_train)

In [None]:
# Print the tuned parameters and score
print("Tuned Random Forest Parameters: {}".format(rf_cv.best_params_))
print("Best score is {}".format(rf_cv.best_score_))

In [None]:
rf2 = RandomForestClassifier(bootstrap=True, oob_score=True, n_estimators=30, random_state=8,
                            max_depth=6, class_weight={'No': 1, 'Yes': 1.6},n_jobs=-1,max_features=0.1,min_samples_leaf=5,
                           min_samples_split=5)
rf2.fit(x_train, y_train)

In [None]:
rf2_predictions = rf2.predict(x_test)

In [None]:
from sklearn.metrics import confusion_matrix

ax = sns.heatmap(confusion_matrix(y_test, rf2_predictions), annot=True, fmt='d')
ax.set(xlabel='Random Forest Prediction', ylabel='Truth')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,rf2_predictions))

In [None]:
def plot_feature_importances (model, kind, title, color, dataframe):
    importances = pd.Series(data=model.feature_importances_, index= dataframe.columns)
    # Sort importances
    importances_sorted = importances.sort_values()
    # Draw a horizontal barplot of importances_sorted
    importances_sorted.plot(kind=kind, color=color)
    plt.title(title)
    plt.grid()
    return(plt.show())

fig,ax=plt.subplots(figsize=(10,10))
plot_feature_importances (rf2, 'barh', 'Churn Random Forest Classifier Importances', 'green', x_train)

In [None]:
#DT visualization method 1

from sklearn.tree import export_graphviz

# Export the first tree of the forest; class names follow the sorted label order ('No', 'Yes')
with open("dt2.dot", 'w') as dotfile:
    export_graphviz(rf[0], out_file=dotfile, feature_names=list(x.columns), class_names=['No', 'Yes'])
# Copying the contents of the created file ('dt2.dot' ) to a graphviz rendering agent at http://webgraphviz.com/
# check out https://www.kdnuggets.com/2017/05/simplifying-decision-tree-interpretation-decision-rules-python.html

In [None]:
#DT visualization method 2
# need to install Graphviz first https://graphviz.gitlab.io/_pages/Download/Download_windows.html
from sklearn.tree import export_graphviz
import os

os.environ["PATH"] += os.pathsep + 'C:/Users/anirban/Desktop/customer-churn/codes'

export_graphviz(rf[0], out_file='tree.dot', feature_names=list(x.columns), class_names=['No', 'Yes'])
# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

### 6.2 Extreme Gradient Boosting Classifier

In [None]:
from xgboost import XGBClassifier

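# binary:logistic is the classification objective; reg:squarederror is meant for regression.
# Note: recent xgboost releases may require integer-encoded labels (0/1) instead of 'No'/'Yes'.
xgb=XGBClassifier(objective='binary:logistic', n_estimators=12, max_depth=4,
                  learning_rate=0.5, seed=100,reg_lambda=0, reg_alpha=0.2, colsample_bynode=1,colsample_bytree=1,
                  scale_pos_weight=1.6)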
xgb.fit(x_train, y_train)

In [None]:
print("Training set score: {:.3f}".format(xgb.score(x_train, y_train)))
print("Test set score: {:.3f}".format(xgb.score(x_test, y_test)))

In [None]:
xgb_predictions = xgb.predict(x_test)

In [None]:
ax = sns.heatmap(confusion_matrix(y_test, xgb_predictions), annot=True, fmt='d')
ax.set(xlabel='XGBoost Prediction', ylabel='Truth')

In [None]:
print(classification_report(y_test, xgb_predictions))

In [None]:
fig,ax=plt.subplots(figsize=(10,10))
plot_feature_importances (xgb, 'barh', 'XGBoost Importances', 'maroon', x_train)

### 6.3 K-Nearest Neighbour Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

test_scores = []
train_scores = []

for i in range(1,15):

    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    train_scores.append(knn.score(x_train,y_train))
    test_scores.append(knn.score(x_test,y_test))

In [None]:
## score that comes from testing on the same datapoints that were used for training
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100,list(map(lambda x: x+1, train_scores_ind))))

## score that comes from testing on the datapoints that were split in the beginning to be used for testing solely
max_test_score = max(test_scores)
test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} % and k = {}'.format(max_test_score*100,list(map(lambda x: x+1, test_scores_ind))))


In [None]:
plt.figure(figsize=(12,5))
plt.plot(range(1,15),train_scores,marker='*',label='Train Score')
plt.plot(range(1,15),test_scores,marker='o',label='Test Score')
plt.grid()

In [None]:
#Setup a knn classifier with k neighbors
knn = KNeighborsClassifier(11)
knn.fit(x_train,y_train)
knn.score(x_test,y_test)

In [None]:
#import confusion_matrix
from sklearn.metrics import confusion_matrix
#let us get the predictions using the classifier we had fit above
y_pred = knn.predict(x_test)
sns.heatmap(pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=False), annot=True, fmt='d')
print(classification_report(y_test,y_pred))

In [None]:
#import GridSearchCV
from sklearn.model_selection import GridSearchCV
#In case of classifier like knn the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors':np.arange(1,50)}
knn2 = KNeighborsClassifier()
knn2_cv= GridSearchCV(knn2,param_grid,cv=5)
knn2_cv.fit(x_train,y_train)

print("Best Score:" + str(knn2_cv.best_score_))
print("Best Parameters: " + str(knn2_cv.best_params_))

### 6.4 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=0.01,class_weight={'No': 1, 'Yes': 1.8},penalty='l2').fit(x_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(x_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(x_test, y_test)))

In [None]:
y_pred = logreg.predict(x_test)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d")  # draw the heatmap; printing it only shows the Axes object
print(classification_report(y_test, y_pred)) 

## 7. Comparing Model Performances

The AUC-ROC curve is a performance measurement for classification problems across threshold settings. The ROC curve plots the true positive rate against the false positive rate at each threshold, and the AUC (area under that curve) measures separability: how well the model distinguishes between the classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
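One way to make the AUC concrete: it equals the probability that a randomly chosen positive (churn) case receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch with made-up labels and scores follows; the function name `pairwise_auc` is hypothetical and is not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# 1 = churn, 0 = no churn; scores are illustrative predicted probabilities
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
auc = pairwise_auc(y_true, scores)
print(auc)
```

Here five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. A model that ranks at random would sit near 0.5, which is the diagonal "no skill" line on the ROC plots.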

In [None]:
#Plot ROC charts for the four classifiers to compare performance
from sklearn.metrics import RocCurveDisplay  # replaces plot_roc_curve, which was removed in scikit-learn 1.2

# Collect the fitted classifiers in a list
classifiers = [('Random Forest Classifier', rf), ('XGBoost Classifier', xgb), ('KNN Classifier', knn),
               ('Logistic Regression', logreg)]

for name, cls in classifiers:
    cls_classifier_disp = RocCurveDisplay.from_estimator(cls, x_test, y_test)
    cls_classifier_disp.figure_.suptitle("{} ROC curve".format(name))
    plt.plot([0,1],[0,1],'k--', label='no skill')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
from sklearn import model_selection

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
seed=8
for name, model in classifiers:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)  # random_state requires shuffle=True
    cv_results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: mean-%f, std-(%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
plt.ylabel("Accuracy Scores", fontsize=15)
ax.set_xticklabels(names)
ax.grid()
plt.xticks(rotation=90)
plt.show()

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:
        _, axes = plt.subplots(1, 2, figsize=(20, 5))
    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")
    axes[0].set_title("Learning curve of the model: {}".format(title))

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model: {}".format(title))
    
    return plt

In [None]:
plot_learning_curve(rf, 'Random Forest Classification', x, y, axes=None, ylim=(0.7, 1.01),cv=5, n_jobs=-1)
plot_learning_curve(xgb, 'XGBoost Classification', x, y, axes=None, ylim=(0.7, 1.01),cv=5, n_jobs=-1)
plot_learning_curve(knn, 'KNN Classification', x, y, axes=None, ylim=(0.7, 1.01),cv=5, n_jobs=-1)
plot_learning_curve(logreg, 'Logistic Regression', x, y, axes=None, ylim=(0.7, 1.01),cv=5, n_jobs=-1)
plt.show()

So far, most of the models appear well tuned and do not overfit, as seen by the convergence between training and validation scores. Of the four, random forest has the fastest computation time as the number of samples increases.

## 8. Model Interpretation

The relative importance score generated for each feature is based on Gini impurity. It shows the importance of each feature in separating the churn and non-churn categories.
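For intuition, the Gini impurity of a node is one minus the sum of squared class proportions, and an impurity-based feature importance accumulates the weighted impurity decrease over the splits that use the feature. The sketch below computes that decrease for a single hypothetical split. The class counts are invented for illustration, and `gini` is a helper defined here, not a library function.

```python
def gini(counts):
    """Gini impurity for a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical parent node: 50 'No' and 50 'Yes' customers
parent = [50, 50]
# Hypothetical split (say, on tenure): left child mostly 'Yes', right mostly 'No'
left, right = [10, 40], [40, 10]

n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
impurity_decrease = gini(parent) - weighted_children
print(round(impurity_decrease, 3))
```

A split that separates the classes more cleanly yields a larger decrease, so features that are repeatedly chosen for such splits end up with higher importance scores.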

The commitment status of a customer (the contract he/she signed up for), how long that person has been using the telecom company's service and the monthly charges incurred matter most, regardless of the customer's demographics. This holds for predicting both churn and non-churn customers.

In [None]:
import shap

# load JS visualization code to notebook
shap.initjs()

#Relative importance of each feature in determining target class probability
explainerRF = shap.TreeExplainer(rf2)
shap_values = explainerRF.shap_values(x_test)
shap.summary_plot(shap_values, x_test)

For comparison, a multi-prediction force plot is shown here. It is a combination of many individual force plots that are rotated 90 degrees and stacked horizontally.<br>

The force plot shows that approximately three-quarters of the predictions follow the prediction path dominated by 
tenure, monthly charges and commitment status, for both churn and non-churn customers.

Feel free to explore how the various features interact with each other, segregated by churn status flag.

In [None]:
# visualize the test set predictions similarity with respect to predictor variables
# (shap_values was computed on x_test, so the same rows must be passed as features)
shap.force_plot(explainerRF.expected_value[0], shap_values[0], x_test)

In [None]:
import scikitplot as skplt

# Deriving Class probabilities
predicted_probabilities = rf2.predict_proba(x_test)
# Creating the plot
skplt.metrics.plot_cumulative_gain(y_test, predicted_probabilities)
skplt.metrics.plot_lift_curve(y_test, predicted_probabilities)

__Lift and cumulative charts__:<br><br>
1) __Lift__ is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model.<br>
2) __Cumulative gains__ and lift charts are visual aids for measuring model performance.<br>
3) Both charts consist of a curve and a baseline.

According to the __cumulative gain curve__, this random forest model lets us reach over 80% of churn customers if the marketing budget targets 50% of customers. With a random pick in the absence of the model, targeting 50% of customers would reach only about 50% of churn customers.<br><br>

__Lift__ is calculated as the ratio of the cumulative gains from the classification model and from random selection. For example, if the average incidence of targets is 20% and the incidence within the targeted group is 50%, the lift is 50% / 20% = 2.5. Thus, the model allows addressing two and a half times more targets for this group, compared with addressing without the model (randomly).
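The relationship between the two charts can be checked by hand. The sketch below ranks ten made-up customers by predicted churn probability, then computes the cumulative gain and the lift when only the top 50% are contacted. The data and the helper name `gain_and_lift` are assumptions for illustration and are unrelated to the scikit-plot functions used above.

```python
def gain_and_lift(y_true, scores, fraction):
    """Cumulative gain and lift when targeting the top `fraction` by score."""
    # Sort outcomes from the highest predicted score to the lowest
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda t: -t[0])]
    n_target = int(round(fraction * len(ranked)))
    captured = sum(ranked[:n_target])
    gain = captured / sum(ranked)   # share of all churners reached
    lift = gain / fraction          # improvement over random targeting
    return gain, lift

# 1 = churned, 0 = stayed; scores are illustrative model probabilities
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
gain, lift = gain_and_lift(y_true, scores, fraction=0.5)
print(gain, lift)
```

In this toy case, contacting the top half of customers reaches three of the four churners, a gain of 0.75, which is 1.5 times what random targeting of the same share would be expected to reach.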

In [None]:
churn_data['Churn_Predictions'] = rf2.predict(x)

# confusion_matrix and classification_report expect (y_true, y_pred) in that order
ax = sns.heatmap(confusion_matrix(churn_data['Churn'], churn_data['Churn_Predictions']), annot=True, fmt='d')
ax.set(xlabel='Random Forest Predictions', ylabel='Truth')

print(classification_report(churn_data['Churn'], churn_data['Churn_Predictions']))

**Takeaway:**

Acquiring a customer is far more costly than keeping one. Any company that wants to retain its customers should find value in analysing and lowering its churn rate. Even emerging markets, which witnessed high growth in the past, are now looking to consolidate their customer base and differentiate themselves from their peers to reduce churn rates.
Telecom players use a variety of metrics to determine when customers are about to leave. It is profitable for companies to explore the reasons why customers are leaving, and then target at-risk customers with enticing offers. There are several different tactics companies use to maintain their customer bases.
