# Telco customer churn

## <a id='1'>1. Business understanding</a>

Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. For a telecom business, attracting new customers is therefore much more expensive than retaining existing ones, so a large part of the marketing budget should go into preventing churn. The goal of this notebook is to decide which customers should be offered a special retention deal.

<b>Interesting fact:</b> the earliest adopters of data mining were telecom businesses, which used it to maintain customer retention. <br>(Provost F., Fawcett T)


 <img src="https://images.unsplash.com/photo-1533664488202-6af66d26c44a?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" width="400" height="60" style="float:left"> 

## Overview 

- <a href='#1'>1. Business understanding</a>
- <a href='#2'>2. Data understanding</a>
    - <a href='#2.1'>2.1. Data manipulation</a>
    - <a href='#2.2'>2.2. Exploratory data analysis (EDA)</a>
- <a href='#3'>3. Data preparation</a>
- <a href='#4'>4. Model</a>
- <a href='#5'>5. Evaluation</a>
- <a href='#6'>6. Performance comparison of different models</a>
    - <a href='#6.1'>6.1. Logistic classifier</a>
        - <a href='#6.1.1'>6.1.1 Impact of changing the threshold</a>
    - <a href='#6.2'>6.2. SVM classifier</a>
    - <a href='#6.3'>6.3. Decision tree</a>    
    - <a href='#6.4'>6.4. Naive Bayes (GNB)</a>    
    - <a href='#6.5'>6.5. ROC and AUC</a>    
    - <a href='#6.6'>6.6. Effect of changing the test size on logistic regression model performance</a>    
    - <a href='#6.7'>6.7. Feature selection impact on the performance of all 4 models</a>    
- <a href='#7'>7. Create function to predict churn probability via API</a>
- <a href='#8'>8. Ranking top 20 customers most likely to churn </a>

In [None]:
#import modules

import numpy as np #scientific computing library
import pandas as pd  #data analysis and manipulation library

import matplotlib.pyplot as plt #library for creating static, animated, and interactive visualizations 
import seaborn as sns #data visualization library based on matplotlib
sns.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 2.5})
# sns.set(style="white", context="talk") #set to a specific seaborn plot style

import plotly.offline as py # to create interactive, publication-quality graphs
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

## <a id='2'>2. Data understanding</a>

Each row represents a customer; each column contains one of the customer's attributes, as described in the column metadata.

The dataset includes information about:

* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents

In [None]:
df = pd.read_csv('/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head(4)

In [None]:
print("ATTRIBUTES OVERVIEW"+"\n"+"-"*20+"\n")
print(df.info())
print("\n"+"HOW MANY UNIQUE VALUES PER ATTRIBUTE?"+"\n"+"-"*40+"\n")
print(df.nunique())

* Senior citizen: whether the customer is a senior citizen; in the US someone is usually considered a senior citizen from about 60 to 65 years old onwards.
* Dependents: whether the customer has to provide support for family members.
* Tenure: the number of months the customer has stayed with the company.
* Multiple lines: a multi-line phone system condenses multiple lines into a single device, which means that more than one person can make or receive calls at the same time.

### <a id='2.1'>2.1 Data manipulation</a>

In [None]:
#Check all attributes for cells that contain only a single whitespace
def check_all_columns_for_whitespaces(the_df):
    for c in the_df.columns:
        #count the blank cells per column so that each column is reported only once
        n_blank = (the_df[c].astype(str) == " ").sum()
        if n_blank > 0:
            print('{} has {} cell(s) with a whitespace'.format(c, n_blank))

check_all_columns_for_whitespaces(df)

In [None]:
#Replace whitespaces in TotalCharges with NaN (not a number: a floating-point value that marks missing or undefined data)
df["TotalCharges"] = df["TotalCharges"].replace(" ",np.nan)

#Remove the rows with NaN in TotalCharges from the dataframe
df = df[df["TotalCharges"].notnull()]
#Reset the index
df = df.reset_index(drop=True)

#TotalCharges is of dtype object while it contains continuous numerical values, therefore change it to float
df['TotalCharges'] = df['TotalCharges'].astype(float)

#customerID is only an identifier; its numeric part has no relationship with the target value, so drop it
df = df.drop('customerID',axis=1)

#Replace the 0/1 codes of SeniorCitizen with No/Yes, like the other binary attributes
yn_map = { 0:'No',1:'Yes' }
df['SeniorCitizen'] = df['SeniorCitizen'].map(yn_map)

In [None]:
print(df['InternetService'].unique())
print("Before: "+str(df['MultipleLines'].unique())) #No phone service = No

#Replace "No phone service" with "No"
df['MultipleLines'] = df['MultipleLines'].replace('No phone service','No')
print("After: "+str(df['MultipleLines'].unique()))

In [None]:
#All the following attributes have a third label "No internet service"
#Replace it with the equivalent "No"
to_replace_columns = ['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']

print("Before:"+"\n")
for c in to_replace_columns:
    print(df[c].unique())

for c in to_replace_columns:
    df[c] = df[c].replace('No internet service','No')

print("\n"+"After:"+"\n")
for c in to_replace_columns:
    print(df[c].unique())

In [None]:
print(df['Contract'].unique())
print(df['PaymentMethod'].unique())

In [None]:
#df for churn and non churn customers
churn     = df[df["Churn"] == "Yes"]
not_churn = df[df["Churn"] == "No"]

#Separating categorical and numerical columns
target_col = ["Churn"]
cat_cols   = df.nunique()[df.nunique() < 17].keys().tolist()
cat_cols   = [x for x in cat_cols if x not in target_col] # remove target column
num_cols   = [x for x in df.columns if x not in cat_cols + target_col]

### <a id='2.2'>2.2 Exploratory data analysis (EDA)</a>

### Numerical variables

In [None]:
for c in num_cols:
    sns.histplot(df[c], kde=True) #replaces sns.distplot, which is deprecated in recent seaborn versions
    plt.grid()
    plt.show()

In [None]:
sns.pairplot(df[['tenure','MonthlyCharges','TotalCharges','Churn']], hue = 'Churn',plot_kws = {'alpha': 0.45})
plt.show()

* Customers with low tenure are more likely to churn.
* Customers with high monthly charges are more likely to churn.


In [None]:
#After EDA the continuous numerical values can be binned
#Data binning: a way to group numbers of more or less continuous values into a smaller number of "bins"

#discrete numerical values

#Bin tenure even further
df['tenure_bin_round'] = np.array(np.floor(np.array(df['tenure']) / 4.))
print("Reducing from {} bins to {} bins".format(str(df['tenure'].nunique()),str(df['tenure_bin_round'].nunique())))

#continuous numerical values

#Bin the monthly charges.
df['MonthlyCharges_bin_round'] = np.array(np.floor(np.array(df['MonthlyCharges']) / 10.))
print("Reducing from {} unique values to {} bins".format(str(df['MonthlyCharges'].nunique()),str(df['MonthlyCharges_bin_round'].nunique())))

#Bin the total charges.
df['TotalCharges_bin_round'] = np.array(np.floor(np.array(df['TotalCharges']) / 1000.))
print("Reducing from {} unique values to {} bins".format(str(df['TotalCharges'].nunique()),str(df['TotalCharges_bin_round'].nunique())))
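
As a small illustration of the floor-division binning used in the cell above, the sketch below applies the same bin widths (4 for tenure, 10 for monthly charges) to a few made-up values. The sample values are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical sample values, not taken from the dataset
tenure = np.array([1, 3, 4, 11, 72])                # months
monthly = np.array([18.25, 29.85, 70.70, 118.75])   # monthly charges

# floor(value / width) maps every value to the index of its bin
tenure_bins = np.floor(tenure / 4.).astype(int)
monthly_bins = np.floor(monthly / 10.).astype(int)

print(tenure_bins.tolist())   # [0, 0, 1, 2, 18]
print(monthly_bins.tolist())  # [1, 2, 7, 11]
```

Values that fall in the same interval of width 4 (or 10) end up with the same bin index, which is what reduces the number of unique values.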


In [None]:
#drop original variables after binning
df = df.drop(['tenure','MonthlyCharges','TotalCharges'],axis=1)

### Categorical variables

In [None]:
print("People churning {0:,.2f}%".format(100*(len(churn)/len(df))))
print("People not churning {0:,.2f}%".format(100*(len(not_churn)/len(df))))

In [None]:
sns.countplot(x=df['Churn']) #pass the column as x; recent seaborn versions no longer accept it positionally
plt.show()

In [None]:
for c in cat_cols:
    plt.figure(figsize=(15,4))
    sns.countplot(x=df[c]) #pass the column as x; recent seaborn versions no longer accept it positionally
    plt.grid()
    plt.show()

In [None]:
for c in cat_cols:
    plt.figure(figsize=(15,4))
    sns.countplot(x=c, hue="Churn", data=df)
    plt.grid()
    plt.show()

In [None]:
#Change dtype of binned variables to int
print(type(df['TotalCharges_bin_round'].iloc[2]))

convert_to_int = ['tenure_bin_round','MonthlyCharges_bin_round','TotalCharges_bin_round']

for c in convert_to_int:
    df[c] = df[c].astype(int)
    
print(type(df['TotalCharges_bin_round'].iloc[2]))

## <a id='3'>3. Data preparation</a>

In [None]:
print(df.shape)
df.head(4)

In [None]:
#Attributes
X = df.drop('Churn',axis=1)
print(X.shape)

#Target
y = df['Churn']
print(y.shape)

In [None]:
print(X.shape)
print(y.shape)

In [None]:
X.columns

In [None]:
#Encode categorical variables as integer codes (ordinal encoding)
from sklearn.preprocessing import OrdinalEncoder
ordinal_enc = OrdinalEncoder()
X = pd.DataFrame(ordinal_enc.fit_transform(X),columns=X.columns) #keep the original column names
X = X.astype(int) # convert from float to int
print(X.shape)
print(type(X))

#Encode target labels with value between 0 and n_classes-1.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = pd.DataFrame(le.fit_transform(y))
print(y.shape)
print(type(y))

In [None]:
X.nunique()>2

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
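#handle_unknown='ignore' avoids an error when a rare bin is missing from a training fold during cross-validation
column_trans = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),['InternetService','Contract','PaymentMethod',
                                                         'tenure_bin_round','MonthlyCharges_bin_round',
                                                         'TotalCharges_bin_round']),remainder='passthrough')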
print(column_trans.fit_transform(X).shape)
column_trans.fit_transform(X)

## <a id='4'>4. Model</a>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [None]:
logreg=LogisticRegression(solver='lbfgs',max_iter=7600)
pipe = make_pipeline(column_trans,logreg)

## <a id='5'>5. Evaluation</a>

Use cross-validation (CV) for evaluation of the model.

CV is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. 

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
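
To make the splitting step concrete, here is a minimal sketch of how k folds can be formed by hand. The helper `k_fold_indices` is a hypothetical name used only for this illustration; scikit-learn's `cross_val_score` does the splitting internally (stratified by class for classifiers), so this is not the code the notebook relies on.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(indices, k)     # k groups of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]                # fold i is held out for testing
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Every sample is used for testing exactly once and for training k-1 times; the reported score is the mean over the k held-out folds.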

<div class="img-with-text">
  <img src="https://www.justintodata.com/wp-content/uploads/2020/06/image-8.png" alt="image" style="width:45%"  align="left">
</div>

In [None]:
y = y[0] #select the single column of the dataframe so that y is a Series

In [None]:
from sklearn.model_selection import cross_val_score
#the scores are fractions between 0 and 1, so they are printed without a percent sign
print("Accuracy: {0:,.3f}".format(cross_val_score(pipe,X, y,cv=10,scoring='accuracy').mean()))
print("Recall: {0:,.3f}".format(cross_val_score(pipe,X, y,cv=10,scoring='recall').mean()))
print("Precision: {0:,.3f}".format(cross_val_score(pipe,X, y,cv=10,scoring='precision').mean()))
print("F1: {0:,.3f}".format((cross_val_score(pipe,X, y,cv=10,scoring='f1').mean())))

## <a id='6'>6. Performance comparison of different models</a>

- From now on, the models are trained without the pipeline, so the features with more than 2 labels are no longer one-hot encoded by "make_column_transformer"; they keep their ordinal codes.
- The following features have more than 2 labels:
    - InternetService
    - Contract
    - PaymentMethod
    - tenure_bin_round
    - MonthlyCharges_bin_round
    - TotalCharges_bin_round
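
The difference between the two encodings can be sketched on a toy column. The labels below are illustrative and the variable names are hypothetical; the point is only that ordinal codes put the labels on a single numeric scale, while one-hot encoding gives each label its own 0/1 column.

```python
import pandas as pd

# Hypothetical toy column with three illustrative labels
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year", "One year"]})

# Ordinal codes: one integer column, labels numbered in sorted order
ordinal = toy["Contract"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per label
one_hot = pd.get_dummies(toy["Contract"], dtype=int)

print(ordinal.tolist())          # [0, 1, 2, 1]
print(one_hot.columns.tolist())  # ['Month-to-month', 'One year', 'Two year']
print(one_hot.values.tolist())   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

With ordinal codes a linear model such as logistic regression treats the step from label 0 to 1 the same as the step from 1 to 2, which may or may not be appropriate for a given feature.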

In [None]:
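from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

#plot_confusion_matrix was removed in scikit-learn 1.2; this small wrapper keeps the calls below working
#(assumes scikit-learn 1.0 or newer, where ConfusionMatrixDisplay.from_estimator is available)
def plot_confusion_matrix(estimator, X, y, **kwargs):
    return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)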

### <a id='6.1'>6.1. Logistic classifier</a>

In [None]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
log_predictions = logmodel.predict(X_test)
print(classification_report(y_test,log_predictions))
plot_confusion_matrix(logmodel, X_test, y_test)
plt.show()
plot_confusion_matrix(logmodel, X_test, y_test, normalize='true')
plt.show()

### <a id='6.1.1'>6.1.1. Impact of changing the threshold</a>

- By default, a probability threshold of 0.5 is used for binary classification.
- Changing the threshold changes the sensitivity and specificity of the model; they have an inverse relationship, so raising one lowers the other.
- First put time and effort into building a good model; changing the threshold is something you can do at the end.
- Increase or decrease the threshold depending on the use case.
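
The trade-off in the list above can be sketched with a handful of made-up predictions. The labels and probabilities are hypothetical, and `sensitivity_specificity` is a helper written only for this illustration.

```python
import numpy as np

# Hypothetical true labels (1 = churn) and predicted churn probabilities
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.45, 0.2, 0.7, 0.42, 0.3, 0.2, 0.1, 0.05])

def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as churn when the probability exceeds the threshold."""
    y_pred = (y_prob > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

for t in (0.5, 0.4):
    sens, spec = sensitivity_specificity(y_true, y_prob, t)
    print("threshold {:.1f}: sensitivity {:.2f}, specificity {:.2f}".format(t, sens, spec))
# threshold 0.5: sensitivity 0.50, specificity 0.83
# threshold 0.4: sensitivity 0.75, specificity 0.67
```

Lowering the threshold catches one more churner in this toy data, at the cost of one more false alarm.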

In [None]:
# first 10 prediction responses
print(logmodel.predict(X_test)[0:10]) 
print('\n')
# first 10 predicted probabilities of class membership
print(logmodel.predict_proba(X_test)[0:10]) 
print('\n')
# first 10 predicted probabilities of class 1
print(logmodel.predict_proba(X_test)[0:10,1])

#place results in variable
y_logistic_pred_prob = np.array(logmodel.predict_proba(X_test)[:,1]).reshape(-1,1)

In [None]:
plt.rcParams['font.size'] = 14
plt.hist(logmodel.predict_proba(X_test)[:,1], bins=8)
plt.xlim(0,1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probabilities of churn')
plt.ylabel('Frequency')
plt.show()

If we lower the threshold from 0.5 to 0.4, more customers are predicted to churn, which increases the sensitivity of the classifier.

In [None]:
from sklearn.preprocessing import binarize
y_logistic_pred_class = binarize(y_logistic_pred_prob,threshold=0.4)[:]

In [None]:
y_logistic_pred_prob[0:10][:,0]

In [None]:
y_logistic_pred_class[0:10][:,0]

In [None]:
confusion_matrix(y_test,log_predictions)

In [None]:
confusion_matrix(y_test,y_logistic_pred_class)

In [None]:
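from sklearn.metrics import recall_score #computed from the predictions, so the numbers follow the current split
print("Increase in sensitivity (recall) from {:0.2f} to {:0.2f}".format(recall_score(y_test,log_predictions),recall_score(y_test,y_logistic_pred_class.ravel().astype(int))))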

In [None]:
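from sklearn.metrics import recall_score #specificity is the recall of the negative class (pos_label=0)
print("Decrease in specificity from {:0.2f} to {:0.2f}".format(recall_score(y_test,log_predictions,pos_label=0),recall_score(y_test,y_logistic_pred_class.ravel().astype(int),pos_label=0)))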

### <a id='6.2'>6.2. SVM classifier</a>

In [None]:
from sklearn.svm import SVC
svc_model = SVC(kernel='rbf',random_state=4,probability=True)
svc_model.fit(X_train, y_train)
SVM_predictions = svc_model.predict(X_test)
print(classification_report(y_test,SVM_predictions))
plot_confusion_matrix(svc_model, X_test, y_test)
plt.show()
plot_confusion_matrix(svc_model, X_test, y_test, normalize='true')
plt.show()

### <a id='6.3'>6.3. Decision tree</a>

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
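
The tree in the cells below is built with `criterion='entropy'`. As a rough sketch of what that criterion measures, the example computes the entropy of a set of labels and the information gain of one candidate split. The data and the helper names `entropy` and `information_gain` are hypothetical and only serve this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, mask):
    """Entropy reduction obtained by splitting labels into mask / ~mask."""
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical data: 1 = churn, and a candidate split "customer has low tenure"
churned = np.array([1, 1, 1, 0, 0, 0, 0, 0])
low_tenure = np.array([True, True, True, True, False, False, False, False])

print(round(entropy(churned), 3))                       # entropy before the split
print(round(information_gain(churned, low_tenure), 3))  # gain of the split
```

A decision rule is attractive when the groups it creates are purer (lower entropy) than the group it started from.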

In [None]:
from sklearn import tree

In [None]:
tree_clf = tree.DecisionTreeClassifier(criterion='entropy')
tree_clf = tree_clf.fit(X_train, y_train)
tree_predictions = tree_clf.predict(X_test)

In [None]:
print(classification_report(y_test,tree_predictions))
plot_confusion_matrix(tree_clf, X_test, y_test)
plt.show()
plot_confusion_matrix(tree_clf, X_test, y_test, normalize='true')
plt.show()

In [None]:
import graphviz 
dot_data = tree.export_graphviz(tree_clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("beautiful_tree")

### <a id='6.4'>6.4. Naive Bayes</a>

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
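
Written out, the independence assumption turns the posterior into a product of per-feature terms. For the Gaussian variant used below, each term is a normal density whose mean $\mu_{y,i}$ and variance $\sigma_{y,i}^2$ are estimated per class $y$ and per feature $i$ (the subscript notation is chosen here for illustration):

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)
$$

Since the features in this notebook are integer codes rather than continuous measurements, the Gaussian density should be read as a rough approximation.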

In [None]:
#GaussianNB implements the Gaussian Naive Bayes algorithm for classification. 

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb_predictions = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total {} points : {}".format(X_test.shape[0], (y_test != gnb_predictions).sum()))

In [None]:
print(classification_report(y_test,gnb_predictions))
print('\n')
plot_confusion_matrix(gnb, X_test, y_test)
print('\n')
plot_confusion_matrix(gnb, X_test, y_test, normalize='true')

### <a id='6.5'>6.5. ROC and AUC</a>

“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity or true negative rate.”

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
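
As a minimal sketch of how the curve is constructed, the example below sweeps the threshold over a few made-up predictions, collects the (FPR, TPR) points and integrates them with the trapezoidal rule. The data and the helper `roc_points` are hypothetical; the notebook itself uses `roc_curve` and `auc` from scikit-learn.

```python
import numpy as np

# Hypothetical labels (1 = churn) and predicted churn probabilities
y_true = np.array([1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

def roc_points(y_true, y_prob):
    """Return (fpr, tpr) arrays with one point per candidate threshold."""
    # Start above every score (nothing predicted positive), then lower the threshold
    thresholds = np.concatenate(([np.inf], np.unique(y_prob)[::-1]))
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == 0)
    tpr = np.array([np.sum((y_prob >= t) & (y_true == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((y_prob >= t) & (y_true == 0)) / n_neg for t in thresholds])
    return fpr, tpr

fpr, tpr = roc_points(y_true, y_prob)

# Area under the curve with the trapezoidal rule
auc_value = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(round(auc_value, 3))  # 0.889
```

The value 8/9 equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, which is the usual interpretation of AUC.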

In [None]:
# store the predicted probabilities for class 1
y_logistic_pred_prob = np.array(logmodel.predict_proba(X_test)[:,1]).reshape(-1,1)
y_svm_pred_prob = np.array(svc_model.predict_proba(X_test)[:,1]).reshape(-1,1)
y_tree_prob = np.array(tree_clf.predict_proba(X_test)[:,1]).reshape(-1,1)
y_gnb_prob = np.array(gnb.predict_proba(X_test)[:,1]).reshape(-1,1)

In [None]:
from sklearn.metrics import roc_curve, auc

#logistic tpr and fpr
logistic_fpr, logistic_tpr, threshold_log = roc_curve(y_test, y_logistic_pred_prob)
auc_logistic = auc(logistic_fpr, logistic_tpr)

#SVM tpr and fpr
svm_fpr, svm_tpr, threshold_svm = roc_curve(y_test, y_svm_pred_prob)
auc_svm = auc(svm_fpr, svm_tpr)

#Tree tpr and fpr
tree_fpr, tree_tpr, threshold_tree = roc_curve(y_test, y_tree_prob,drop_intermediate=False)
auc_tree = auc(tree_fpr, tree_tpr)

#GNB tpr and fpr
gnb_fpr, gnb_tpr, threshold_gnb = roc_curve(y_test, y_gnb_prob)
auc_gnb = auc(gnb_fpr, gnb_tpr)


#plot
plt.figure(figsize=(7,7), dpi=100)
plt.plot(svm_fpr, svm_tpr, linestyle='-', label='SVM (auc = %0.3f)' % auc_svm)
plt.plot(logistic_fpr, logistic_tpr, marker='.', label='Logistic (auc = %0.3f)' % auc_logistic)
plt.plot(tree_fpr, tree_tpr, marker='.', label='Tree (auc = %0.3f)' % auc_tree)
plt.plot(gnb_fpr, gnb_tpr, marker='.', label='GNB (auc = %0.3f)' % auc_gnb)

plt.xlabel('FPR = 1-specificity')
plt.ylabel('TPR = sensitivity')
plt.title('ROC')
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score

#Logistic regression
print(cross_val_score(logmodel, X, y, cv=10, scoring='roc_auc').mean())
#SVM
print(cross_val_score(svc_model, X, y, cv=10, scoring='roc_auc').mean())
#Decision tree
print(cross_val_score(tree_clf, X, y, cv=10, scoring='roc_auc').mean())
#GNB
print(cross_val_score(gnb, X, y, cv=10, scoring='roc_auc').mean())

In [None]:
# calculate AUC with metrics.roc_auc_score
from sklearn import metrics
print(metrics.roc_auc_score(y_test, y_logistic_pred_prob))
print(metrics.roc_auc_score(y_test, y_svm_pred_prob))
print(metrics.roc_auc_score(y_test, y_tree_prob))
print(metrics.roc_auc_score(y_test, y_gnb_prob))

### <a id='6.6'>6.6. Effect of changing the test size on model performance (logistic regression)</a>

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

roc_list = []
acc = []
prec = []
rec = []

for spc in np.linspace(0.1,0.9,num=50):
    #separate names, so that the split and the model from section 6.1 are not overwritten
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=spc, random_state=42)
    model = LogisticRegression()
    model.fit(X_tr,y_tr)
    preds = model.predict(X_te)
    #AUC of the fitted model on the held-out part (instead of cross-validating a new model on the test part)
    roc_list.append(roc_auc_score(y_te, model.predict_proba(X_te)[:,1]))
    acc.append(accuracy_score(y_te, preds))
    prec.append(precision_score(y_te, preds))
    rec.append(recall_score(y_te, preds))

In [None]:
plt.figure(figsize=(5,5), dpi=100)
plt.plot(np.linspace(0.1,0.9,num=50), roc_list)
plt.xlim(0,1)
plt.xlabel('test_size')
plt.ylabel('AUC')
plt.title('AUC vs. test_size')
plt.grid(True)
plt.show()

In [None]:
max(roc_list)

In [None]:
plt.figure(figsize=(5,5), dpi=100)
plt.plot(np.linspace(0.1,0.9,num=50), acc)
plt.xlim(0,1)
plt.xlabel('test_size')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. test_size')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(5,5), dpi=100)
plt.plot(np.linspace(0.1,0.9,num=50), prec)
plt.xlim(0,1)
plt.xlabel('test_size')
plt.ylabel('Precision')
plt.title('Precision vs. test_size')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(5,5), dpi=100)
plt.plot(np.linspace(0.1,0.9,num=50), rec)
plt.xlim(0,1)
plt.xlabel('test_size')
plt.ylabel('Recall')
plt.title('Recall vs. test_size')
plt.grid(True)
plt.show()

- For all four performance metrics, 0.2 < test_size < 0.4 seems the best choice.
- Above 0.4, less and less data is used to train the model and overfitting becomes more likely.
- I decided to go for 0.35.

### <a id='6.7'>6.7. Feature selection impact on the performance of all 4 models</a>

In [None]:
print(X.shape)
print(X.columns)

In [None]:
df_joined = X.copy() #copy, so that X itself does not get a Churn column
df_joined['Churn'] = y
df_copy = df_joined.copy()

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(df_joined.corr(),annot=True, fmt=".2")

In [None]:
correlations = pd.DataFrame(df_joined.corr()['Churn'])
correlations = correlations.drop('Churn',axis=0)
correlations.sort_values(by=['Churn'],ascending=False)

In [None]:
correlations.sort_values(by=['Churn'],ascending=False).plot(kind='bar',figsize=(10,5))
plt.grid()
plt.show()

Impact of removing features whose absolute correlation with Churn is below 0.1
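
The selection step can be sketched on a toy table. The column names mirror the notebook, but the values are made up and the threshold of 0.1 is used only as an example.

```python
import pandas as pd

# Hypothetical encoded data; the values are made up for illustration
toy = pd.DataFrame({
    "Contract": [0, 0, 0, 1, 1, 2, 2, 2],
    "gender":   [0, 1, 0, 1, 1, 0, 1, 0],
    "Churn":    [1, 1, 1, 1, 0, 0, 0, 0],
})

threshold = 0.1

# Correlation of every feature with the target, without the target itself
corr_with_target = toy.corr()["Churn"].drop("Churn")

# Features whose absolute correlation stays below the threshold are dropped
to_drop = corr_with_target[corr_with_target.abs() < threshold].index.to_list()
reduced = toy.drop(columns=to_drop)

print(corr_with_target.round(2).to_dict())
print(to_drop)  # ['gender']
```

Note that a correlation close to zero only says that there is no linear relationship with the target; a feature can still be useful in combination with others.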

In [None]:
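#Features with an absolute correlation below the threshold are dropped
#With a threshold of 0 no feature is dropped; set it to e.g. 0.1 to test a reduced feature set
corr_threshold = 0
features_to_drop = correlations[correlations['Churn'].abs() < corr_threshold].index.to_list()
print(features_to_drop)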

In [None]:
df_joined = df_copy
df_joined = df_joined.drop(features_to_drop,axis=1)

In [None]:
X_features_reduced = df_joined.drop('Churn',axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_features_reduced, y, test_size=0.35, random_state=42)

logmodel_features_reduces = LogisticRegression()
logmodel_features_reduces.fit(X_train,y_train)
log_predictions_features_reduces = logmodel_features_reduces.predict(X_test)

svc_model_features_reduces = SVC(kernel='rbf',random_state=4,probability=True)
svc_model_features_reduces.fit(X_train, y_train)
SVM_predictions_features_reduces = svc_model_features_reduces.predict(X_test)

tree_clf_features_reduces = tree.DecisionTreeClassifier(criterion='entropy')
tree_clf_features_reduces = tree_clf_features_reduces.fit(X_train, y_train) #fit the new tree, not the tree_clf from section 6.3
tree_predictions_features_reduces = tree_clf_features_reduces.predict(X_test)

gnb_features_reduces = GaussianNB()
gnb_predictions_features_reduces = gnb_features_reduces.fit(X_train, y_train).predict(X_test)

# store the predicted probabilities for class 1
y_logistic_pred_prob_features_reduces = np.array(logmodel_features_reduces.predict_proba(X_test)[:,1]).reshape(-1,1)
y_svm_pred_prob_features_reduces = np.array(svc_model_features_reduces.predict_proba(X_test)[:,1]).reshape(-1,1)
y_tree_prob_features_reduces = np.array(tree_clf_features_reduces.predict_proba(X_test)[:,1]).reshape(-1,1)
y_gnb_prob_features_reduces = np.array(gnb_features_reduces.predict_proba(X_test)[:,1]).reshape(-1,1)

#logistic tpr and fpr
logistic_fpr, logistic_tpr, threshold_log = roc_curve(y_test, y_logistic_pred_prob_features_reduces)
auc_logistic = auc(logistic_fpr, logistic_tpr)

#SVM tpr and fpr
svm_fpr, svm_tpr, threshold_svm = roc_curve(y_test, y_svm_pred_prob_features_reduces)
auc_svm = auc(svm_fpr, svm_tpr)

#Tree tpr and fpr
tree_fpr, tree_tpr, threshold_tree = roc_curve(y_test, y_tree_prob_features_reduces,drop_intermediate=False)
auc_tree = auc(tree_fpr, tree_tpr)

#GNB tpr and fpr
gnb_fpr, gnb_tpr, threshold_gnb = roc_curve(y_test, y_gnb_prob_features_reduces)
auc_gnb = auc(gnb_fpr, gnb_tpr)


#plot
plt.figure(figsize=(8,8), dpi=100)
plt.plot(svm_fpr, svm_tpr, linestyle='-', label='SVM (auc = %0.3f)' % auc_svm)
plt.plot(logistic_fpr, logistic_tpr, marker='.', label='Logistic (auc = %0.3f)' % auc_logistic)
plt.plot(tree_fpr, tree_tpr, marker='.', label='Tree (auc = %0.3f)' % auc_tree)
plt.plot(gnb_fpr, gnb_tpr, marker='.', label='GNB (auc = %0.3f)' % auc_gnb)

plt.xlabel('FPR = 1-specificity')
plt.ylabel('TPR = sensitivity')
plt.title('ROC with {} features'.format(len(X_test.columns)))
plt.grid(True)
plt.legend()
plt.show()

print('-'*30)
print('log_predictions_features_reduces')
print(classification_report(y_test,log_predictions_features_reduces))
print('-'*30)
print('SVM_predictions_features_reduces')
print(classification_report(y_test,SVM_predictions_features_reduces))
print('-'*30)
print('tree_predictions_features_reduces')
print(classification_report(y_test,tree_predictions_features_reduces))
print('-'*30)
print('gnb_predictions_features_reduces')
print(classification_report(y_test,gnb_predictions_features_reduces))

- Removing features with abs(correlation) < 0.05 did not improve the performance of any of the 4 models.
- I also experimented with a for loop that tries different correlation thresholds, ranging from all the features down to only 1 feature to train the model. The best performance was obtained using all the features.

## <a id='7'>7. Create function to predict churn probability via API</a>

In [None]:
X_test

In [None]:
X_test = X_test.reset_index(drop=True)
y_test = pd.DataFrame(y_test).reset_index(drop=True)
X_test.shape

In [None]:
print(np.array(X_test.iloc[400].to_list()).reshape(1,-1))
print(y_test.iloc[400].to_list())

In [None]:
random_person = np.array(X_test.iloc[np.random.randint(0, len(X_test))].to_list()).reshape(1,-1) #random row of the test set

print(logmodel.predict(random_person))
print(logmodel.predict_proba(random_person))

In [None]:
def will_i_churn_or_not(gender,
                     SeniorCitizen,
                     Partner,
                     Dependents,
                     PhoneService,
                     MultipleLines,
                     InternetService,
                     OnlineSecurity,
                     OnlineBackup,
                     DeviceProtection,
                     TechSupport,
                     StreamingTV,
                     StreamingMovies,
                     Contract,
                     PaperlessBilling,
                     PaymentMethod,
                     tenure_bin_round,
                     MonthlyCharges_bin_round,
                     TotalCharges_bin_round):
    
    my_df = pd.DataFrame(data=[gender,
                     SeniorCitizen,
                     Partner,
                     Dependents,
                     PhoneService,
                     MultipleLines,
                     InternetService,
                     OnlineSecurity,
                     OnlineBackup,
                     DeviceProtection,
                     TechSupport,
                     StreamingTV,
                     StreamingMovies,
                     Contract,
                     PaperlessBilling,
                     PaymentMethod,
                     tenure_bin_round,
                     MonthlyCharges_bin_round,
                     TotalCharges_bin_round]).transpose()
    
    the_probability = logmodel.predict_proba(my_df)
    
    the_prediction = logmodel.predict(my_df)
    
    #the probabilities are fractions between 0 and 1; the format {:.1%} prints them as percentages
    if the_prediction[0] == 0:
        print('The probability you will not churn is {:.1%}'.format(the_probability[0][0]))
    else:
        print('The probability you will churn is {:.1%}'.format(the_probability[0][1]))

In [None]:
random_person = np.array(X_test.iloc[np.random.randint(0, len(X_test))].to_list()).reshape(1,-1) #random row of the test set

In [None]:
tupled_person = tuple(random_person[0])

In [None]:
will_i_churn_or_not(*tupled_person)

## <a id='8'> 8. Ranking top 20 customers most likely to churn</a>

In [None]:
churn_probability = pd.DataFrame(logmodel.predict_proba(X_test)[:,1], columns=['Churn_probability']) 
#for all the predictions (rows), give me the probabilities of churning (1)

In [None]:
churn_probability.sort_values(by='Churn_probability',ascending=False).head(20)