# Trading Platform Customer Attrition Risk Prediction using sklearn

Online trading platforms have many users, and the companies that run them want to analyze user activity and predict churn. Because competition is intense, keeping customers happy so that they do not move their investments elsewhere is key to maintaining profitability.

In this notebook, we will use Watson Studio Local (a service on IBM Cloud Pak for Data) to do the following:

1. Ingest the merged customer demographics and trading activity data
2. Visualize the merged dataset to better understand the data and build hypotheses for prediction
3. Use the sklearn library to build a classification model that predicts whether a customer is likely to churn
4. Expose the classification model as a RESTful API endpoint for the end-to-end customer churn risk prediction and risk remediation application

<img src="https://github.com/burtvialpando/CloudPakWorkshop/blob/master/CPD/images/NotebookImage.png?raw=true" width="800" height="500" align="middle"/>


<a id="top"></a>
## Table of Contents

1. [Load libraries](#load_libraries)
2. [Load and visualize merged customer demographics and trading activity data](#load_data)
3. [Prepare data for building classification model](#prepare_data)
4. [Train classification model and test model performance](#build_model)
5. [Save model to ML repository and expose it as REST API endpoint](#save_model)
6. [Summary](#summary)

### Quick set of instructions to work through the notebook

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has 2 types of cells - markdown (text) such as this and code such as the one below. 
2. Each cell with code can be executed independently or together (see options under the Cell menu). When working in this notebook, we will be running one cell at a time because we need to make code changes to some of the cells.
3. To run the cell, position cursor in the code cell and click the Run (arrow) icon. The cell is running when you see the * next to it. Some cells have printable output.
4. Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

<a id="load_libraries"></a>
## 1. Load libraries
[Top](#top)

Running the following cell loads all the libraries needed to load, visualize, and prepare the data and to build ML models for our use case.

In [None]:
# Run once to install the package in your runtime environment; comment the line out afterwards
!pip install sklearn-pandas

In [None]:
# If the following cell doesn't work, uncomment the next line to upgrade the matplotlib package. When the upgrade is done, restart the kernel and start again from the beginning.
#!pip install --user --upgrade matplotlib

In [None]:
import brunel
import pandas as pd
import numpy as np
import sklearn.pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, LabelBinarizer, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import json
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Changed sk-learn version to be compatible with WML client4 on cpd 3.0.1
!pip install scikit-learn==0.22

<a id="load_data"></a>
## 2. Load data example
[Top](#top)

Data can be easily loaded within ICPD using point-and-click functionality. The following image illustrates how to load a merged dataset assuming it is called "customer_demochurn_activity_analyze.csv". The file can be located by its name and inserted into the notebook as a **pandas** dataframe as shown below:

<img src="https://github.com/burtvialpando/CloudPakWorkshop/blob/master/CPD/images/InsertPandasDataFrame.png?raw=true" width="300" height="300" align="middle"/>

The interface comes up with a generic name, so it is good practice to rename the dataframe to match the context of the use case. In this case, we will use df_churn_pd.

In [None]:

df_churn_pd = pd.read_csv('/project_data/data_asset/customer_demochurn_activity_analyze.csv')
df_churn_pd.head()


Data visualization is a key step in the data mining process: it helps us better understand the data before it is prepared for building ML models.

We use the Brunel library, which comes preloaded in the Watson Studio Local environment, to visualize the merged customer data.

The Brunel Visualization Language is a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and business users. More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki

Try Brunel visualization here: http://brunel.mybluemix.net/gallery_app/renderer

In [None]:
df_churn_pd.dtypes

In [None]:
df_churn_pd.describe()

In [None]:
%brunel data('df_churn_pd') stack polar bar x(CHURNRISK) y(#count) color(CHURNRISK) bar tooltip(#all)

In [None]:
%brunel data('df_churn_pd') bar x(STATUS) y(#count) color(STATUS) tooltip(#all) | stack bar x(STATUS) y(#count) color(CHURNRISK: pink-orange-yellow) bin(STATUS) sort(STATUS) percent(#count) label(#count) tooltip(#all) :: width=1200, height=350 

In [None]:
%brunel data('df_churn_pd') bar x(TOTALUNITSTRADED) y(#count) color(CHURNRISK: pink-gray-orange) sort(STATUS) percent(#count) label(#count) tooltip(#all) :: width=1200, height=350 

In [None]:
%brunel data('df_churn_pd') bar x(DAYSSINCELASTTRADE) y(#count) color(CHURNRISK: pink-gray-orange) sort(STATUS) percent(#count) label(#count) tooltip(#all) :: width=1200, height=350 

<a id="prepare_data"></a>
## 3. Data preparation
[Top](#top)

Data preparation is a very important step in building a machine learning model, because a model can perform well only when the data it is trained on is clean and well prepared. As a result, this step consumes the bulk of the time a data scientist spends building models.

During this process, we identify the categorical columns in the dataset. Categories need to be indexed, which means the string labels are converted to label indices. These label indices are then encoded using one-hot encoding into a binary vector with at most a single one-value, indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms that expect continuous features to use categorical features.
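The two steps described above, indexing string labels and then one-hot encoding the indices, can be sketched in plain Python. This is a minimal illustration rather than what the notebook runs: `status_values` is made-up sample data and `one_hot` is a hypothetical helper.

```python
# Made-up sample values for a categorical column (not taken from the dataset)
status_values = ["S", "M", "D", "M", "S"]

# Step 1: index the categories by mapping each distinct label to an integer
categories = sorted(set(status_values))
label_index = {label: i for i, label in enumerate(categories)}

def one_hot(label):
    """Return a binary vector with a single 1 at the label's index."""
    vector = [0] * len(categories)
    vector[label_index[label]] = 1
    return vector

# Step 2: encode every value as a one-hot vector
encoded = [one_hot(v) for v in status_values]
for value, vector in zip(status_values, encoded):
    print(value, vector)
```

scikit-learn's LabelBinarizer produces this kind of output for columns with three or more classes; for a two-class column it emits a single 0/1 column instead.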

The final step in the data preparation process is to assemble all the categorical and non-categorical columns into a single feature array. We use DataFrameMapper from the sklearn-pandas package for this. DataFrameMapper applies a transformer to each listed column and concatenates the outputs, so raw features and features generated by different transformers end up in one feature array that can be used to train ML models.

#### Use the DataFrameMapper class to declare transformations and variable imputations.

* LabelBinarizer - Converts a categorical variable into a dummy variable (aka binary variable)
* StandardScaler - Standardize features by removing the mean and scaling to unit variance, z = (x - u) / s

See docs: 
* https://github.com/scikit-learn-contrib/sklearn-pandas
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer
* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
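The StandardScaler formula above, z = (x - u) / s, can be checked by hand. The sketch below uses only the Python standard library and made-up income values; it uses the population standard deviation (dividing by n), which is what StandardScaler computes.

```python
from statistics import mean, pstdev

# Made-up income values, for illustration only
incomes = [56000.0, 42000.0, 98000.0, 61000.0]

u = mean(incomes)       # mean of the feature
s = pstdev(incomes)     # population standard deviation (divides by n)

# Apply z = (x - u) / s to every value
z_scores = [(x - u) / s for x in incomes]

print([round(z, 3) for z in z_scores])
print("mean:", round(mean(z_scores), 6), "std:", round(pstdev(z_scores), 6))
```

After scaling, the values have a mean of zero and unit variance, so features measured on very different scales (income versus number of children, for example) become comparable.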

In [None]:
# Define the categorical and numeric columns
categoricalColumns = ['GENDER', 'STATUS', 'HOMEOWNER', 'AGE_GROUP']
numericColumns = ['CHILDREN', 'ESTINCOME', 'TOTALDOLLARVALUETRADED', 'TOTALUNITSTRADED', 'LARGESTSINGLETRANSACTION', 'SMALLESTSINGLETRANSACTION', 
                          'PERCENTCHANGECALCULATION', 'DAYSSINCELASTLOGIN', 'DAYSSINCELASTTRADE', 'NETREALIZEDGAINS_YTD', 'NETREALIZEDLOSSES_YTD']

In [None]:
mapper = DataFrameMapper([
    (['GENDER'], LabelBinarizer()),
    (['STATUS'], LabelBinarizer()),
    (['HOMEOWNER'], LabelBinarizer()),
    (['AGE_GROUP'], LabelBinarizer()),
    (['CHILDREN'],  StandardScaler()),
    (['ESTINCOME'],  StandardScaler()),
    (['TOTALDOLLARVALUETRADED'],  StandardScaler()),
    (['TOTALUNITSTRADED'],  StandardScaler()),
    (['LARGESTSINGLETRANSACTION'],  StandardScaler()),
    (['SMALLESTSINGLETRANSACTION'],  StandardScaler()),
    (['PERCENTCHANGECALCULATION'],  StandardScaler()),
    (['DAYSSINCELASTLOGIN'],  StandardScaler()),
    (['DAYSSINCELASTTRADE'],  StandardScaler()),
    (['NETREALIZEDGAINS_YTD'],  StandardScaler()),
    (['NETREALIZEDLOSSES_YTD'],  StandardScaler())], default=False)

In [None]:
df_churn_pd.columns

In [None]:
# Define input data to the model
X = df_churn_pd.drop(['ID','CHURNRISK','AGE','TAXID','CREDITCARD','DOB','ADDRESS_1', 'ADDRESS_2', 'CITY', 'STATE', 'ZIP', 'ZIP4', 'LONGITUDE',
       'LATITUDE'], axis=1)

In [None]:
X.shape

In [None]:
# Define the target variable and encode with value between 0 and n_classes-1
le = LabelEncoder()
y = le.fit_transform(df_churn_pd['CHURNRISK'])

In [None]:
# split the data to training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

<a id="build_model"></a>
## 4. Build Random Forest classification model
[Top](#top)

We instantiate a decision-tree-based classification algorithm, namely RandomForestClassifier. Next, we define a pipeline that chains together the transformers defined during the data preparation step and the estimator. Sklearn standardizes the APIs of machine learning algorithms, which makes it easier to combine multiple algorithms into a single pipeline, or workflow.

The original dataset was split into training and test sets in the previous step. We fit the pipeline to the training data, then apply the trained model to the test data to generate churn risk class predictions.
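The fit-then-predict flow described above can be tried in isolation. The following is a minimal, standalone sketch: it assumes synthetic three-class data from `make_classification` in place of the churn dataset and a plain StandardScaler in place of the notebook's mapper, and all `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the churn dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=5
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5
)

# Chain a transformer and an estimator; fit() runs each step in order
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=5)),
])
demo_pipeline.fit(X_tr, y_tr)

# predict() applies the fitted scaler to the test rows, then the classifier
demo_predictions = demo_pipeline.predict(X_te)
print(demo_predictions[:10])
```

Because the scaler is fitted inside the pipeline, its statistics come from the training rows only, and the same fitted transformation is reused on the test rows.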

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Instantiate the Classifier
random_forest = RandomForestClassifier(random_state=5)

# Define the steps in the pipeline to sequentially apply a list of transforms and the estimator, i.e. RandomForestClassifier
steps = [('mapper', mapper),('RandonForestClassifier', random_forest)]
pipeline = sklearn.pipeline.Pipeline(steps)

# train the model
model=pipeline.fit( X_train, y_train )

model

In [None]:
### call pipeline.predict() on your X_test data to make a set of test predictions
y_prediction = model.predict( X_test )


In [None]:
# show first 10 rows of predictions
y_prediction[:10]

In [None]:
# show first 10 rows of predictions with the corresponding labels
le.inverse_transform(y_prediction)[0:10]

### Model results

In a supervised classification problem such as churn risk classification, we have a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:

1. True Positive (TP) - label is positive and prediction is also positive
2. True Negative (TN) - label is negative and prediction is also negative
3. False Positive (FP) - label is negative but prediction is positive
4. False Negative (FN) - label is positive but prediction is negative

These four numbers are the building blocks for most classifier evaluation metrics. A fundamental point when evaluating a classifier is that pure accuracy (i.e. whether each prediction was correct or incorrect) is generally not a good metric, because a dataset may be highly unbalanced. For example, if a model is designed to predict fraud from a dataset where 95% of the data points are not fraud and 5% are fraud, then a naive classifier that predicts not fraud regardless of input will be 95% accurate. For this reason, metrics like precision and recall are typically used, because they take the type of error into account. In most applications there is some desired balance between precision and recall, which can be captured by combining the two into a single metric called the F-measure.
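The fraud example above can be worked through numerically. This sketch assumes a made-up dataset of 1000 points (5% fraud) and two hypothetical helpers, `confusion_counts` and `precision_recall_f1`; it shows that the naive classifier reaches 95% accuracy while its recall and F1 score for the fraud class are zero.

```python
# Fraud example from the text, assuming 1000 points with 5% fraud (class 1).
# The naive classifier predicts "not fraud" (class 0) for every point.
y_true = [1] * 50 + [0] * 950
y_naive = [0] * 1000

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for a binary problem where 1 is positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

tp, tn, fp, fn = confusion_counts(y_true, y_naive)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "f1:", f1)
```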

In [None]:
# display label mapping to assist with interpretation of the model results
label_mapping=le.inverse_transform([0,1,2])
print('0: ', label_mapping[0])
print('1: ', label_mapping[1])
print('2: ', label_mapping[2])

In [None]:
### test your predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )

### and print the report
print(report)
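The report above lists precision and recall per class. With three churn risk classes, each row can be read as a one-vs-rest problem: the class in question is treated as positive and the other two as negative. The sketch below illustrates that counting with made-up encoded labels; `per_class_scores` is a hypothetical helper, not part of sklearn.

```python
# Made-up encoded labels for three classes (0, 1, 2), for illustration only
y_actual    = [0, 0, 1, 1, 2, 2, 2, 1]
y_predicted = [0, 1, 1, 1, 2, 0, 2, 2]

def per_class_scores(y_actual, y_predicted, cls):
    """Treat `cls` as positive and every other class as negative."""
    pairs = list(zip(y_actual, y_predicted))
    tp = sum(a == cls and p == cls for a, p in pairs)
    fp = sum(a != cls and p == cls for a, p in pairs)
    fn = sum(a == cls and p != cls for a, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for cls in sorted(set(y_actual)):
    precision, recall = per_class_scores(y_actual, y_predicted, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
```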

In [None]:
print('Accuracy:   ',sklearn.metrics.accuracy_score( y_test, y_prediction ))

#### Get the column names of the transformed features

In [None]:
m_step=pipeline.named_steps['mapper']

In [None]:
m_step.transformed_names_

In [None]:
features = m_step.transformed_names_

In [None]:
# Get the features importance
# named_steps returns the fitted forest itself; indexing it with [1] would select a single tree
importances = pipeline.named_steps['RandonForestClassifier'].feature_importances_
indices = np.argsort(importances)

In [None]:
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b',align='center')
plt.yticks(range(len(indices)), (np.array(features))[indices])
plt.xlabel('Relative Importance')

<a id="save_model"></a>
## 5. Save the model into WML Deployment Space
[Top](#top)

Before we save the model we must create a deployment space. Watson Machine Learning provides deployment spaces where the user can save, configure and deploy their models. We can save models, functions and data assets in this space.

The steps involved for saving and deploying the model are as follows:

1. Create a new deployment space. Enter the name of the space in the cell below. If a space with the specified space_name already exists, the existing space will be deleted before the new space is created.
2. Set this deployment space as the default space.
3. Store the model pipeline in the deployment space. Enter the name for the model in the cell below.
4. Deploy the saved model. Enter the deployment name in the cell below.
5. Retrieve the scoring endpoint to score the model with a payload.

We will use the watson_machine_learning_client package to complete these steps.

In [None]:
!pip install watson-machine-learning-client-v4

In [None]:
# Specify names for the space being created, the saved model and the model deployment
space_name = 'deployment-space-analytics-project-workshop'

model_name = 'churn_risk_model'

deployment_name = 'churn_risk_model-deployment'

In [None]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

import os
token = os.environ['USER_ACCESS_TOKEN']

from project_lib.utils import environment
url = environment.get_common_api_url()

wml_credentials = {
"token": token,
"instance_id" : "wml_local",
"url": url,
"version": "3.0.0"
}

client = WatsonMachineLearningAPIClient(wml_credentials)

If a space with specified space_name already exists, delete the existing space before creating a new one.

In [None]:

for space in client.spaces.get_details()['resources']:
    # exact match, so spaces whose names merely contain space_name are not deleted
    if space_name == space['entity']['name']:
        client.spaces.delete(space['metadata']['guid'])
        print(space_name, "is deleted")

### 5.1 Create Deployment Space

In [None]:
# create the space and set it as default
space_meta_data = {
        client.spaces.ConfigurationMetaNames.NAME : space_name
}

stored_space_details = client.spaces.store(space_meta_data)

space_uid = stored_space_details['metadata']['guid']

# set the newly created deployment space as the default
client.set.default_space(space_uid)

In [None]:
# fetching details of the space created
stored_space_details

### 5.2 Store the model in the deployment space

In [None]:
# list all supported software specs
client.software_specifications.list()

In [None]:
# run this line if you do not know the version of scikit-learn that was used to build the model
!pip show scikit-learn

In [None]:
software_spec_uid = client.software_specifications.get_uid_by_name('scikit-learn_0.22-py3.6')

In [None]:
metadata = {
    client.repository.ModelMetaNames.NAME: model_name,
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    client.repository.ModelMetaNames.TYPE: "scikit-learn_0.22"
}

stored_model_details = client.repository.store_model(pipeline,
                                               meta_props=metadata,
                                               training_data=X_train,
                                               training_target=y_train)

In [None]:
stored_model_details

### 5.3 Create a deployment for the stored model

In [None]:
# define the deployment metadata (online deployment)
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: deployment_name,
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

# deploy the model

model_uid = stored_model_details["metadata"]["guid"]
deployment_details = client.deployments.create( artifact_uid=model_uid, meta_props=meta_props)

### 5.4 Score the model

In [None]:
# retrieve the scoring endpoint
scoring_endpoint = client.deployments.get_scoring_href(deployment_details)

print('Scoring Endpoint:   ',scoring_endpoint)

In [None]:
scoring_deployment_id = client.deployments.get_uid(deployment_details)
client.deployments.get_details(scoring_deployment_id)

In [None]:
payload = [{"values": [ ['Young adult','M','S', 2,56000, 'N', 5030, 23, 2257, 125, 3.45, 2, 19, 1200, 251]]}]

In [None]:
payload_metadata = {client.deployments.ScoringMetaNames.INPUT_DATA: payload}
# score
predictions = client.deployments.score(scoring_deployment_id, payload_metadata)
predictions
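The sample payload above passes values by position only, which makes it easy to misorder columns. One option is to build the payload next to an explicit list of column names. The sketch below is a standalone illustration under stated assumptions: the column order in `fields` is inferred from the sample values and should be checked against `X.columns`, `build_scoring_payload` is a hypothetical helper, and the `"input_data"` key is assumed to be the value behind `client.deployments.ScoringMetaNames.INPUT_DATA`. Whether the deployed runtime uses the `fields` entry to name columns is not confirmed here.

```python
import json

# Assumed column order, inferred from the sample payload; verify against X.columns
fields = ["AGE_GROUP", "GENDER", "STATUS", "CHILDREN", "ESTINCOME", "HOMEOWNER",
          "TOTALDOLLARVALUETRADED", "TOTALUNITSTRADED", "LARGESTSINGLETRANSACTION",
          "SMALLESTSINGLETRANSACTION", "PERCENTCHANGECALCULATION",
          "DAYSSINCELASTLOGIN", "DAYSSINCELASTTRADE",
          "NETREALIZEDGAINS_YTD", "NETREALIZEDLOSSES_YTD"]

# Same values as the sample payload in the notebook
values = ["Young adult", "M", "S", 2, 56000, "N", 5030, 23, 2257, 125,
          3.45, 2, 19, 1200, 251]

def build_scoring_payload(fields, rows):
    """Hypothetical helper: pair column names with one or more rows of values."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row needs exactly one value per field")
    return {"input_data": [{"fields": list(fields),
                            "values": [list(row) for row in rows]}]}

named_payload = build_scoring_payload(fields, [values])
print(json.dumps(named_payload, indent=2))
```

The length check catches a missing or extra value before the request is sent, which is cheaper to debug than a scoring error returned by the deployment.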

In [None]:
# display label mapping to assist with interpretation of the model results
label_mapping=le.inverse_transform([0,1,2])
print('0: ', label_mapping[0])
print('1: ', label_mapping[1])
print('2: ', label_mapping[2])

#### Write test data into .csv files for batch scoring and model evaluations

In [None]:
# Write the test data to a .csv so that we can later use it for batch scoring
write_score_CSV=X_test
write_score_CSV.to_csv('/project_data/data_asset/model_batch_score.csv', sep=',', index=False)

In [None]:
# Write the test data to a .csv so that we can later use it for Evaluation
write_eval_CSV=X_test
write_eval_CSV.to_csv('/project_data/data_asset/model_eval.csv', sep=',', index=False)

**Last updated:** 06/01/2020 - Original Notebook by Anjali Shah, updated in later versions by Sidney Phoon. Final edits by Burt Vialpando and Kent Rubin - IBM