# Customer Churn Prediction: from Python to Modeler and back to Python


Adapted from the original notebook by Heba El-Shimy: https://raw.githubusercontent.com/IBM/customer-churn-prediction/master/notebooks/customer-churn-prediction.ipynb

-------------------

- This article (https://medium.com/@markryan_69718/watson-studio-desktop-first-impressions-5a85309597d0) describes adapting the original Python-based churn solution presented here: https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/
- The article describes a simple Modeler flow (https://github.com/ryanmark1867/shared_ml/blob/master/churn%20flow%20Feb%202019.str?raw=true) that implements a subset of the original Python-based churn solution.
- This notebook comes full circle by reimplementing the Modeler flow "note for note" in a much simplified Python notebook that captures the nuances of the Modeler flow.

This notebook shows screenshots of the Modeler flow followed by the Python that attempts to implement the same function.


# Load Libraries

Import all the libraries that are needed by the Python code in the rest of the notebook.

In [51]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import sklearn.feature_selection
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics
from scipy import stats
from scipy.stats import zscore
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin

testproportion = 0.3


# The Dataset

This notebook uses the same dataset as the Modeler flow described above: https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv


# Modeler flow: Ingest dataset
- the first step in the Modeler flow is to ingest the dataset

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_ingest_data.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>


# Python: Ingest Dataset

- the cell below brings the dataset into a Pandas dataframe directly from the repo using the URL of the CSV file

EXERCISE: 
- comment out the cell below and add a new cell that ingests the CSV file from your local filesystem. In that cell:
- create a variable called <b>path</b> that is the fully qualified name of the CSV file in your filesystem
- use the read_csv API with your new <b>path</b> variable as the argument
- assign the output of the call to <b>customer_data</b>


In [52]:
url="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/WA_Fn-UseC_-Telco-Customer-Churn.csv"
customer_data = pd.read_csv(url)
customer_data.head()



| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


# Modeler flow: Select subset of columns from original dataset
- the next node of the Modeler flow selects a subset of the columns from the original dataset

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_filter_selected.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Select subset of columns from original dataset

The code below implements in Python the selection of a subset of columns from the original dataset. This subset of columns will go through preprocessing and then be used to train the models.

EXERCISE:
- make updates to the notebook to have the models trained on the same set of columns used in the full-blown churn prediction project  https://developer.ibm.com/patterns/predict-customer-churn-using-watson-studio-and-jupyter-notebooks/
- the following graphic shows the columns that were used in the full-blown Python project
- you will need to update the following lists: 
- retain_columns - the list of columns from the original dataset that you want to train the model on
- categorical_columns - the subset of columns that are categorical (where the values need to be replaced with numeric values)
- continuous_columns - the subset of columns that hold numeric values (where the values need to be scaled and outliers need to be dealt with)
    

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/deployed_entry_page.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

In [53]:
# define the columns that we are going to keep
retain_columns = ['MonthlyCharges','TotalCharges','InternetService','PaymentMethod','OnlineSecurity','Churn','Contract','tenure']

In [54]:
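# select a subset of columns based on the retain_columns list;
# .copy() makes customer_data an independent dataframe so that later column
# assignments do not trigger pandas' SettingWithCopyWarning
customer_data = customer_data[retain_columns].copy()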
customer_data.head()
customer_data_orig = customer_data.copy()

# Modeler flow: Fill empty values in TotalCharges column
- in the Modeler flow, TotalCharges is the only column where empty values are filled, so the Python code does the same

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_filler.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Fill empty values in TotalCharges column

In [55]:
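# fill empty values in a column with 0, printing the number of empty values before and after
def fill_empty(df,col):
    print("empty values before filling:",df[col].isnull().sum())
    # assign the filled column back instead of calling fillna(inplace=True) on a column
    # selection, which is not guaranteed to update df in recent pandas versions
    df[col] = df[col].fillna(value=0)
    print("empty values after filling:",df[col].isnull().sum())
    return df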

In [56]:
customer_data = fill_empty(customer_data,'TotalCharges')

empty values before filling: 11
empty values after filling: 0
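
The `fill_empty` helper counts NaN values, so it relies on `read_csv` having already parsed the blanks in TotalCharges as missing. If a copy of the CSV instead stores those blanks as whitespace strings (an assumption, not something this notebook shows), the column would need to be coerced to numeric first. A minimal sketch on a made-up toy column:

```python
import pandas as pd

# Toy stand-in for TotalCharges; the values are illustrative, not read from the dataset
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", " "]})

# Coerce entries that cannot be parsed as numbers (the blank strings here) to NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing_before = int(toy["TotalCharges"].isnull().sum())

# Fill the NaN values with 0 and assign the result back to the column
toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
n_missing_after = int(toy["TotalCharges"].isnull().sum())

print(n_missing_before, n_missing_after)
```

The fill value of 0 follows the notebook; other choices, such as the column median, are possible but would no longer mirror the Modeler flow.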


# Modeler flow: Prep data
The Auto Data Prep node in the Modeler flow incorporates many preparation steps, including:
1. replacing categorical tokens with numerical IDs
2. scaling continuous values
3. replacing outliers (values that are beyond a threshold) with the threshold

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_auto_data_prep.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Prep data 1 - replace categorical tokens with numerical IDs

First we define two subsets of columns:
- categorical_columns - the subset of columns that are categorical (where the values need to be replaced with numeric values)
- continuous_columns - the subset of columns that hold numeric values (where the values need to be scaled and outliers need to be dealt with)

Next, we convert the values in the categorical columns to numeric values.

In [57]:
# identify the subsets of columns that are categorical and continuous
categorical_columns = ['InternetService', 'PaymentMethod', 'OnlineSecurity', 'Churn', 'Contract']
continuous_columns = ['MonthlyCharges', 'TotalCharges','tenure']


In [58]:
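# input dataframe and list of columns to be encoded; return dataframe with those columns encoded
# le maps each column name to its fitted LabelEncoder so the same encoding can be reused later
le = {}
def encode_columns(df,col_list,define_new):
    for col in col_list:
        print("col is",col)
        # fit a new encoder only when define_new is True; otherwise reuse the stored one
        if define_new:
            le[col] = LabelEncoder()
            le[col].fit(df[col].tolist())
        df[col] = le[col].transform(df[col])
    return df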

 

In [59]:
# replace tokens in categorical columns with numeric IDs
customer_data = encode_columns(customer_data,categorical_columns,True)
customer_data.head()

col is InternetService
col is PaymentMethod
col is OnlineSecurity
col is Churn
col is Contract


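| | MonthlyCharges | TotalCharges | InternetService | PaymentMethod | OnlineSecurity | Churn | Contract | tenure |
|---|---|---|---|---|---|---|---|---|
| 0 | 29.85 | 29.85 | 0 | 2 | 0 | 0 | 0 | 1 |
| 1 | 56.95 | 1889.5 | 0 | 3 | 2 | 0 | 1 | 34 |
| 2 | 53.85 | 108.15 | 0 | 3 | 2 | 1 | 0 | 2 |
| 3 | 42.3 | 1840.75 | 0 | 0 | 2 | 0 | 1 | 45 |
| 4 | 70.7 | 151.65 | 1 | 2 | 0 | 1 | 0 | 2 |
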
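
The numeric IDs above are not arbitrary: `LabelEncoder` sorts the distinct labels and uses each label's position as its ID. The sketch below illustrates this on a few InternetService values. "DSL" and "Fiber optic" appear in the output shown earlier, while "No" is assumed here to be the third category.

```python
from sklearn.preprocessing import LabelEncoder

# "DSL" and "Fiber optic" appear in the head() output; "No" is an assumed third category
values = ["DSL", "Fiber optic", "DSL", "No"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the sorted unique labels; a label's position is its numeric ID
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original tokens from the numeric IDs
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Keeping the fitted encoders in the `le` dictionary, as the notebook does, is what makes this reverse lookup possible later.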


# Modeler flow: Prep data 2 - scale continuous values


<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_auto_data_prep_scaling.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Prep data 2 - scale continuous values

The following cells implement in Python the same scaling that is applied to continuous columns by the Auto Data Prep node in Modeler.

We will print out the maximum and minimum values for each of the continuous columns before and after the scaling and outlier updates.

In [60]:
# print the min and max values of the continuous columns before scaling and outlier processing

for col in continuous_columns:
    print("max ",col, " ", customer_data[col].max())
    print("min ",col, " ", customer_data[col].min())

max  MonthlyCharges   118.75
min  MonthlyCharges   18.25
max  TotalCharges   8684.8
min  TotalCharges   0.0
max  tenure   72
min  tenure   0


In [61]:
# input dataframe and list of columns to be zscore scaled; return dataframe with those columns scaled
def scale_columns(df,col_list):
    df[col_list] = df[col_list].apply(zscore)
    return df

In [62]:
# scale continuous columns using zscore
customer_data = scale_columns(customer_data,continuous_columns)
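
To make the scaling step concrete, here is a small sketch of what `scipy.stats.zscore` computes with its default settings: each value minus the column mean, divided by the population standard deviation (`ddof=0`). The input values are illustrative only.

```python
import numpy as np
from scipy.stats import zscore

# Illustrative values only, not taken from the dataset
x = np.array([18.25, 29.85, 56.95, 70.7, 118.75])

# Manual z-score: subtract the mean, divide by the population standard deviation
manual = (x - x.mean()) / x.std(ddof=0)

# zscore with default arguments gives the same result
scaled = zscore(x)

print(np.allclose(scaled, manual))
```

After this transformation the column has mean 0 and, measured with `ddof=0`, standard deviation 1.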

# Modeler flow: Prep data 3 - replace outliers
- replace outliers (values that are beyond a threshold) with the threshold

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_auto_data_prep_outliers_replace.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Prep data 3 - replace outliers

The following code implements in Python the same processing done in the Modeler flow to replace outlier values with the outlier threshold.

In [63]:
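# cap a single value at multiplier*sd; values at or below that threshold pass through unchanged
# (only the upper tail is capped; values below -multiplier*sd are left as they are)
def sd_max(x,sd,multiplier):
    if x > multiplier*sd:
        return multiplier*sd
    else:
        return x

# sd maps each continuous column to its standard deviation so the same thresholds can be reused
sd = {}
def replace_outliers(df,multiplier,define_new):
    for col in continuous_columns:
        # compute the standard deviation only when define_new is True
        if define_new:
            sd[col] = df.loc[:,col].std()
        print("sd",sd)
        df[col] = df[col].apply(sd_max,args=(sd[col],multiplier))
    return df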


In [64]:
# replace values above the outlier boundary (3 standard deviations) with the boundary value
customer_data = replace_outliers(customer_data,3.0,True)

sd {'MonthlyCharges': 1.0000710000355904}
sd {'MonthlyCharges': 1.0000710000355904, 'TotalCharges': 1.0000710000355884}
sd {'MonthlyCharges': 1.0000710000355904, 'TotalCharges': 1.0000710000355884, 'tenure': 1.0000710000355943}
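
The standard deviations printed above are slightly larger than 1 even though the columns were already z-scored. A likely explanation is that `zscore` divides by the population standard deviation (`ddof=0`) while pandas' `std()` defaults to the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The check below assumes n = 7043 rows, the total implied by the train and test shapes reported later in the notebook.

```python
import math

# Row count assumed from the train/test shapes reported later: 4930 + 2113
n = 4930 + 2113

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation
ratio = math.sqrt(n / (n - 1))
print(ratio)
```

The result agrees with the printed values to about ten decimal places, which supports this reading.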


In [65]:
# print the min and max values of the continuous columns after scaling and outlier processing

for col in continuous_columns:
    print("max ",col, " ", customer_data[col].max())
    print("min ",col, " ", customer_data[col].min())

max  MonthlyCharges   1.7943521502604476
min  MonthlyCharges   -1.5458598200734601
max  TotalCharges   2.825805577868443
min  TotalCharges   -1.005779833710855
max  tenure   1.6137012404433893
min  tenure   -1.318164947398796
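
Note that `sd_max` only caps values above the upper threshold. In this run no column exceeds three standard deviations in either direction, so nothing is actually changed. If capping on both sides were wanted, one possible variation (not what the notebook does) is pandas' `clip`, sketched here on made-up z-scored values:

```python
import pandas as pd

# Illustrative z-scored values with one extreme value on each side
s = pd.Series([-4.2, -1.0, 0.0, 1.5, 3.7])

multiplier = 3.0
sd_assumed = 1.0  # assumed standard deviation of a z-scored column
threshold = multiplier * sd_assumed

# clip() caps values below -threshold and above +threshold in a single call
clipped = s.clip(lower=-threshold, upper=threshold)
print(clipped.tolist())
```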


# Modeler flow: data post Auto Data Prep
- values in categorical columns have been replaced with numerical IDs
- the continuous columns have been scaled with a zscore transformation and outliers have been replaced

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_auto_data_prep_preview_post_adp_v2.jpg" width="950" alt="Icon"> </th>
   </tr>
</table>

# Python: data post data prep

Here's a view of the data in Python post data preparation.

In [66]:
customer_data.head(10)

| | MonthlyCharges | TotalCharges | InternetService | PaymentMethod | OnlineSecurity | Churn | Contract | tenure |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.160323 | -0.992611 | 0 | 2 | 0 | 0 | 0 | -1.277445 |
| 1 | -0.259629 | -0.172165 | 0 | 3 | 2 | 0 | 1 | 0.066327 |
| 2 | -0.36266 | -0.958066 | 0 | 3 | 2 | 1 | 0 | -1.236724 |
| 3 | -0.746535 | -0.193672 | 0 | 0 | 2 | 0 | 1 | 0.514251 |
| 4 | 0.197365 | -0.938874 | 1 | 2 | 0 | 1 | 0 | -1.236724 |
| 5 | 1.159546 | -0.643789 | 1 | 2 | 0 | 1 | 0 | -0.992402 |
| 6 | 0.808907 | -0.145738 | 1 | 1 | 0 | 0 | 0 | -0.422317 |
| 7 | -1.163647 | -0.872587 | 0 | 3 | 2 | 0 | 0 | -0.910961 |
| 8 | 1.330711 | 0.338085 | 1 | 2 | 0 | 1 | 0 | -0.177995 |
| 9 | -0.286218 | 0.533044 | 0 | 0 | 2 | 0 | 1 | 1.206498 |


# Modeler flow: Split dataset into test and train
<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_partition.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Split dataset into test and train

The following code implements in Python the processing to split the dataset into test and train subsets for both X (the columns used to train the model) and y (the target: Churn):
- train: subset of the data used to train the models
- test: subset of the data held back from training so it can be used to assess the accuracy of the model on predicting churn on data that it has not seen before

In [67]:
# define label
y_le = customer_data['Churn']
# define input values
X_selected = customer_data.drop(['Churn'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_le,
                                                    test_size=testproportion, random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(4930, 7) (4930,)
(2113, 7) (2113,)
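
The shapes above follow directly from the row count and the test proportion. The sketch below reproduces them on placeholder arrays of the same size (7043 rows, 7 feature columns) and also shows that a fixed `random_state` makes the split repeatable. The placeholder data is an assumption for illustration; only the sizes match the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7043                      # total rows implied by the shapes above
X_toy = np.zeros((n_rows, 7))      # placeholder features, same width as X_selected
y_toy = np.arange(n_rows) % 2      # placeholder binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
print(X_tr.shape, X_te.shape)

# Same seed, same split: the test labels are identical on a second call
_, _, _, y_te_again = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
print(np.array_equal(y_te, y_te_again))
```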


In [68]:
X_test.head()

| | MonthlyCharges | TotalCharges | InternetService | PaymentMethod | OnlineSecurity | Contract | tenure |
|---|---|---|---|---|---|---|---|
| 185 | -1.328164 | -0.994838 | 0 | 2 | 0 | 0 | -1.277445 |
| 2715 | -1.313208 | -0.566163 | 2 | 0 | 1 | 0 | 0.35137 |
| 3825 | -1.5093 | -0.550611 | 2 | 3 | 1 | 2 | 0.799294 |
| 1807 | 0.385148 | -0.972096 | 1 | 2 | 0 | 0 | -1.277445 |
| 132 | -0.472339 | 0.432521 | 0 | 0 | 0 | 2 | 1.410099 |


In [69]:
y_test.head()

185     1
2715    0
3825    0
1807    1
132     0
Name: Churn, dtype: int64

# Modeler flow: Train Support Vector Machine
<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_SVM_train.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Train Support Vector Machine

The following code trains (or "fits") the SVM model using the train subset of the dataset.

In [70]:
# fit SVM using training data
clf_svc = svm.SVC(random_state=42,gamma="auto")
clf_svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)
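
The printed parameters show an RBF kernel with `gamma='auto'`, which scikit-learn defines as 1 / n_features. The sketch below checks this on synthetic data generated with `make_classification`. The data is a stand-in chosen for illustration, not the churn dataset.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in for the churn features: 7 columns and a binary target
X_toy, y_toy = make_classification(n_samples=200, n_features=7, random_state=42)

# gamma="auto" versus the explicit value 1 / n_features
auto = svm.SVC(gamma="auto", random_state=42).fit(X_toy, y_toy)
explicit = svm.SVC(gamma=1.0 / X_toy.shape[1], random_state=42).fit(X_toy, y_toy)

# Both models should produce identical predictions on the same data
same = np.array_equal(auto.predict(X_toy), explicit.predict(X_toy))
print(same)
```

With the seven feature columns used in this notebook, `gamma='auto'` therefore corresponds to a gamma of 1/7.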

# Modeler flow: Train Logistic Regression
<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_logistic_regression_train.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Train Logistic Regression

The following code trains (or "fits") the Logistic Regression model using the train subset of the dataset.

In [71]:
from sklearn.linear_model import LogisticRegression

clf_lr = LogisticRegression(solver = 'lbfgs')
model = clf_lr.fit(X_train, y_train)
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
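
Unlike the SVC configured above (`probability=False`), a fitted `LogisticRegression` can report class probabilities in addition to hard labels. The sketch below, again on synthetic stand-in data, shows how `predict` relates to `predict_proba` for a binary target: the predicted label is 1 when the probability of class 1 exceeds 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the churn dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=7, random_state=42)
clf = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)   # one column per class, ordered as clf.classes_
labels = clf.predict(X_toy)

# Thresholding the class-1 probability at 0.5 gives the same labels as predict()
from_proba = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(labels, from_proba))
```

For churn, the probabilities can be more useful than the labels, since they allow a different cutoff than 0.5 to be chosen.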

# Modeler flow: Evaluate Support Vector Machine
<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_SVM_results2.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Modeler flow: SVM results

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_SVM_results_text.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: SVM results

The following code assesses the accuracy of the trained SVM model by comparing its predictions for the test subset with the actual Churn values for that subset.

In [72]:
# Get accuracy score
y_pred_svc = clf_svc.predict(X_test)
acc_svc = accuracy_score(y_test, y_pred_svc)
print(acc_svc)

0.795551348793185
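
For reference, `accuracy_score` is simply the fraction of predictions that match the true labels. The sketch below uses the first five `y_test` values shown earlier together with made-up predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 0, 1, 0])   # first five y_test values shown earlier
y_pred = np.array([1, 0, 1, 0, 0])   # made-up predictions for illustration

acc = accuracy_score(y_true, y_pred)

# Equivalent manual computation: the share of positions where the two arrays agree
manual = float(np.mean(y_true == y_pred))
print(acc, manual)
```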


# Modeler flow: Evaluate Logistic Regression
<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_logistic_regression_results.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>



# Modeler flow: Logistic Regression results

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/june2019_ML_bootcamp/master/flow_logistic_regression_results_text.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Python: Logistic Regression results

The following code assesses the accuracy of the trained Logistic Regression model by comparing its predictions for the test subset with the actual Churn values for that subset.

In [73]:
y_pred_lr = clf_lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred_lr)
print(acc_lr)

0.7941315664931378
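
Accuracy alone does not show which kind of errors a model makes. As a possible next step (not part of the Modeler flow), a confusion matrix separates missed churners from false alarms. The labels and predictions below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions; 1 = churn, 0 = no churn
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```

The same calls could be applied to `y_test` with `y_pred_svc` or `y_pred_lr` to compare the two models beyond their accuracy scores.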
