# Targeting Direct Marketing with Random Forest Locally
_**Supervised Learning with Random Forest: A Binary Prediction Problem With Unbalanced Classes**_

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio.

---

---

## Contents

1. [Overview](#Overview)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Overview
In this workshop, you will learn how to use the Amazon SageMaker managed notebook environment, <b>SageMaker Studio</b>, to build, train, and deploy a machine learning (ML) model using the scikit-learn (SKLearn) framework.

In this exercise, you have been asked to develop a machine learning model to predict whether a customer will enroll for a certificate of deposit (CD) after the customer has been contacted through mail, email, phone, etc.  The model will be trained on the marketing dataset that contains information on customer demographics, responses to marketing events, and environmental factors. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer.

The data has been labeled for your convenience, and a column in the dataset identifies whether the customer is enrolled for a product offered by the bank. A version of this dataset is publicly available from the ML repository curated by the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/bank+marketing). This tutorial implements a supervised machine learning model, since the data is labeled. (Unsupervised learning occurs when the datasets are not labeled.)

The steps include:

* Downloading training data into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Fitting a model using Random Forest Classifier
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

## Feature Engineering
In this part of the tutorial, you will learn about the highlighted part of the machine learning process:
![png](./image/ML-pipeline.JPG)

---

## Data

In this step you will use your Amazon SageMaker Studio notebook to preprocess the data that you need to train your machine learning model.



Execute each cell by pressing <b>Shift+Enter</b>. While the code runs, an * appears between the square brackets next to the cell. After a few seconds, the code execution will complete and the * will be replaced with a number (1 for the first cell you run).



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import json
from IPython.display import display

import boto3 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score,recall_score, f1_score

Make sure the pandas version is 1.2.4 or later. If it is not, restart the kernel before going further.
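One way to check the installed version is sketched below. The 1.2.4 threshold comes from the note above; the variable names (`installed`, `pandas_is_recent_enough`) are only illustrative and are not used elsewhere in this notebook.

```python
import re

import pandas as pd

# Pull the numeric major.minor.patch prefix out of the version string,
# ignoring any suffix such as "rc1" or ".dev0".
match = re.match(r"(\d+)\.(\d+)\.(\d+)", pd.__version__)
installed = tuple(int(group) for group in match.groups())

# Tuples compare element by element, so (1, 3, 0) >= (1, 2, 4) is True.
pandas_is_recent_enough = installed >= (1, 2, 4)
print(pd.__version__, "OK" if pandas_is_recent_enough else "older than 1.2.4")
```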

---

## Downloading data
Download the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [None]:
# cell 05
# restore the shared variables
%store -r bucket
%store -r prefix
%store -r data_folder
%store -r data_file_path

In the next cell, you will load the dataset into a pandas dataframe.

In [None]:
# cell 06
data = pd.read_csv(data_file_path)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

### Exploration
Let's start exploring the data.  First, let's understand how the features are distributed.

In [None]:
# cell 07
# Frequency tables for each categorical feature
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))

# Summary statistics and histograms for each numeric feature
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=True, figsize=(10, 10))

Notice that:

* Almost 90% of the values for our target variable `y` are "no", so most customers did not subscribe to a term deposit.
* Many of the predictive features take on values of `unknown`.  Some are more common than others.  We should think carefully about what causes a value of "unknown" (are these customers non-representative in some way?) and how that should be handled.
  * Even if `unknown` is included as its own distinct category, what does it mean given that, in reality, those observations likely fall within one of the other categories of that feature?
* Many of the predictive features have categories with very few observations in them.  If we find a small category to be highly predictive of our target outcome, do we have enough evidence to make a generalization about that?
* Contact timing is particularly skewed.  Almost a third in May and less than 1% in December.  What does this mean for predicting our target variable next December?
* There are no missing values in our numeric features.  Or missing values have already been imputed.
  * `pdays` takes a value near 1000 for almost all customers.  Likely a placeholder value signifying no previous contact.
* Several numeric features have a very long tail.  Do we need to handle these few observations with extremely large values differently?
* Several numeric features (particularly the macroeconomic ones) occur in distinct buckets.  Should these be treated as categorical?
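To put numbers on observations like these, a small sketch on made-up data may help. The `toy` frame below is a hypothetical stand-in, not the real dataset; only the column names `job`, `default`, and `pdays` and the `999` placeholder are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "student", "admin.", "unknown"],
    "default": ["no", "unknown", "no", "no", "no"],
    "pdays": [999, 999, 3, 999, 999],
})

# Share of rows equal to "unknown" in each categorical column.
unknown_share = toy[["job", "default"]].eq("unknown").mean()

# Share of rows carrying the pdays placeholder value.
placeholder_share = (toy["pdays"] == 999).mean()

print(unknown_share)
print(placeholder_share)
```

Running the same two expressions against `data` would show which features are dominated by `unknown` or placeholder values.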



Next, let's look at how our features relate to the target that we are attempting to predict.

In [None]:
# cell 08
for column in data.select_dtypes(include=['object']).columns:
    if column != 'y':
        display(pd.crosstab(index=data[column], columns=data['y'], normalize='columns'))

for column in data.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = data[[column, 'y']].hist(by='y', bins=30)
    plt.show()

Notice that:

* Customers who are "blue-collar", are "married", have an "unknown" default status, were contacted by "telephone", and/or were contacted in "may" make up a substantially lower portion of "yes" than "no" for subscribing.
* Distributions for numeric variables are different across "yes" and "no" subscribing groups, but the relationships may not be straightforward or obvious.
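When reading these tables, it helps to keep in mind what `normalize='columns'` means: each column (each outcome) sums to 1, so a cell is the share of that outcome group that falls in a category. The sketch below, on invented data, contrasts this with `normalize='index'`, which gives the subscription rate within each category. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Invented data: three "yes" and three "no" outcomes across two contact types.
toy = pd.DataFrame({
    "contact": ["cellular", "cellular", "telephone", "telephone", "telephone", "cellular"],
    "y": ["yes", "no", "no", "no", "yes", "yes"],
})

# Each column sums to 1: how is each outcome group spread across categories?
by_outcome = pd.crosstab(index=toy["contact"], columns=toy["y"], normalize="columns")

# Each row sums to 1: what fraction of each category subscribed?
within_category = pd.crosstab(index=toy["contact"], columns=toy["y"], normalize="index")

print(by_outcome)
print(within_category)
```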



Now let's look at how our features relate to one another.

In [None]:
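# cell 09 -- use a correlation matrix and scatter matrix to understand how features relate to one another
# Restrict the correlation to numeric columns; pandas 2.0 and later raise an error on text columns
display(data.select_dtypes(include='number').corr())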
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

Notice that:
* Features vary widely in their relationship with one another.  Some with highly negative correlation, others with highly positive correlation.
* Relationships between features are non-linear and discrete in many cases.
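Because the default Pearson correlation only measures linear association, a rank-based measure can be a useful complement when relationships are non-linear. The sketch below uses an invented monotonic relationship (`y = x ** 3`) to show the difference; it illustrates the idea and is not a step of this notebook.

```python
import pandas as pd

# A perfectly monotonic but non-linear relationship.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

Spearman reaches 1 here because the ranks of `x` and `y` agree exactly, while Pearson stays below 1.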

### Transformation

Cleaning up data is part of nearly every machine learning project.  It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  Several common techniques include:

* <b>Handling missing values:</b> Some machine learning algorithms are capable of handling missing values, but most would rather not.  Options include:
 * <b>Removing observations with missing values:</b> This works well if only a very small fraction of observations have incomplete information.
 * <b>Removing features with missing values</b>: This works well if there are a small number of features which have a large number of missing values.
 * <b>Imputing missing values</b>: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* <b>Converting categorical to numeric</b>: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* <b>Oddly distributed data</b>: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data.  In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data.  In others, bucketing values into discrete ranges is helpful.  These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* <b>Handling more complicated data types</b>: Manipulating images, text, or data at varying grains is left for other notebook templates.
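The first three techniques can be sketched on a tiny invented frame as follows. The column names and values are hypothetical, and the choices shown (mean and mode imputation, `log1p`, `get_dummies`) are just one reasonable combination rather than the approach this notebook takes.

```python
import numpy as np
import pandas as pd

# Invented data with one missing numeric value, one missing category,
# and a heavily right-skewed numeric column.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "marital": ["single", "married", None, "married"],
    "balance": [10.0, 1000.0, 100000.0, 100.0],
})

# Impute: mean for the numeric column, mode for the categorical one.
toy["age"] = toy["age"].fillna(toy["age"].mean())
toy["marital"] = toy["marital"].fillna(toy["marital"].mode()[0])

# Compress the long right tail with a natural log (log1p also handles zeros).
toy["log_balance"] = np.log1p(toy["balance"])

# One hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["marital"], dtype=int)
print(encoded)
```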

Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data.  Therefore, let's keep pre-processing simple.

To summarise, we need to A) address some weird values, B) convert the categorical variables to numeric, and C) remove unnecessary data:

* Many records have the value of "999" for pdays. It is very likely a 'magic' number representing that no contact was made before. Considering that, we will create a new column called "no_previous_contact" and set it to "1" when pdays is 999 and "0" otherwise.
* In the job column, there is more than one category for people who don't work, e.g., "student", "retired", and "unemployed". It is very likely that the decision to enroll in a term deposit depends a lot on whether the customer is working. As such, we generate a new column, based on the job column, that flags customers who are not working.
* We will remove the economic features and duration from our data, as they would need to be forecast with high precision to be used as features at inference time.
* We convert categorical variables to numeric using one hot encoding.

In [None]:
# Indicator variable to capture when pdays takes a value of 999
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 

# Indicator for individuals not actively employed
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

# remove unnecessary data
data = data.drop(
    ['duration', 
     'emp.var.rate', 
     'cons.price.idx', 
     'cons.conf.idx', 
     'euribor3m', 
     'nr.employed'
    ], 
    axis=1)

# Convert categorical variables to sets of indicators
model_data = pd.get_dummies(data, dtype=int)         # dtype=int keeps 0/1 indicators (pandas 2.0 and later default to True/False)

# Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
model_data = pd.concat([model_data['y_yes'], model_data.drop(['y_no', 'y_yes'], axis=1)], axis=1)

model_data

### Splitting data
When building a model whose primary goal is to predict a target value on new data, it is important to understand <b>overfitting</b>.  Supervised learning models are designed to minimize the error between their predictions of the target value and the actuals in the data they are given.  This last part is key: in their quest for greater accuracy, machine learning models frequently bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, so predictions on new data can actually become less accurate in exchange for more accurate predictions in the training phase.
The most common way of preventing this is to judge a model not only on its fit to the data it was trained on, but also on <b>"new"</b> data.  There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll randomly split the data into 2 uneven groups and use cross-validation within the training group while tuning.

Use NumPy to split the data into 2 groups. The model will be trained on 80% of the data, and it will then be evaluated on the remaining 20% to give us an estimate of the accuracy we can expect on "new" data.

In [None]:
# cell 12 - split our data into 2 sets: train and validation
train_data, validation_data = np.split(
    model_data.sample(frac=1, random_state=1729), 
    [int(0.8 * len(model_data))])   # Randomly shuffle the data, then split off the first 80% for training and the last 20% for validation

In [None]:
# cell 13
X_train, y_train = train_data.iloc[:, 1:], train_data.iloc[:, 0]

X_val, y_val = validation_data.iloc[:, 1:], validation_data.iloc[:, 0]
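With unbalanced classes, a purely random split can leave the two groups with slightly different shares of positive examples. As an alternative sketch (not what the cells above do), `train_test_split` with `stratify` keeps the class ratio the same in both groups. The toy frame below is invented; only the label name `y_yes` and the seed are borrowed from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 10% positives, mirroring the rough imbalance noted earlier.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "y_yes": [1] * 10 + [0] * 90,
    "feature": rng.normal(size=100),
})

X, y = toy.drop(columns="y_yes"), toy["y_yes"]

# stratify=y preserves the share of positives in both resulting groups.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1729
)

print(len(X_tr), len(X_va), y_tr.mean(), y_va.mean())
```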

In [None]:
y_train.values

In [None]:
%%time

hyper_parameters = {
    "bootstrap": [True],
    "max_depth": [12, 13],
    "max_features": [13, 14],
    "n_estimators": [100, 150]
}

model = RandomForestClassifier()

grid_search = GridSearchCV(model, hyper_parameters, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_)
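By default, `GridSearchCV` ranks candidates using the estimator's own `score` method, which is plain accuracy for classifiers. With roughly 90% of labels being "no", accuracy can look good even for a model that rarely predicts "yes". The sketch below, on synthetic data from `make_classification`, shows two options that may help: `scoring="f1"` and `class_weight="balanced"`. The small grid and data sizes are assumptions chosen to keep the example fast, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic unbalanced problem: about 90% negatives, 10% positives.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    # class_weight="balanced" reweights classes inversely to their frequency.
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"max_depth": [3, 5], "n_estimators": [10, 20]},
    scoring="f1",   # rank candidates by F1 on the positive class, not accuracy
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same `best_params_` and `best_score_` attributes are available on the `grid_search` object fitted above.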

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual values to predicted values.  In particular, we evaluate the model using a <b> confusion matrix </b>.   In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`).

Because the model was trained locally in this notebook, no endpoint or serialization is needed: we can call `predict` directly on the validation features held in memory.

*Note: The features passed to `predict` must NOT include the target variable.*

In [None]:
def evaluate_performance(y_test, y_pred):
    print(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted']))
    print('====================')
    print("Accuracy : ", "{:.2f}".format(accuracy_score(y_test, y_pred)))
    print("Precision : ", "{:.2f}".format(precision_score(y_test, y_pred)))
    print("Recall : ","{:.2f}".format(recall_score(y_test, y_pred)))
    print("F1 :", "{:.2f}".format(f1_score(y_test, y_pred)))

In [None]:
y_pred = grid_search.predict(X_val)
evaluate_performance(y_val, y_pred)
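To see how the reported metrics follow from the confusion matrix, here is a small worked sketch on ten invented labels. Precision is TP / (TP + FP) and recall is TP / (TP + FN); the hand-computed values should agree with scikit-learn's functions.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Ten invented labels: three actual positives, three predicted positives.
y_true = pd.Series([1, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="Actual")
y_hat = pd.Series([1, 0, 0, 0, 1, 0, 1, 0, 0, 0], name="Predicted")

# Rows are actual values, columns are predicted values.
matrix = pd.crosstab(y_true, y_hat)

tp = matrix.loc[1, 1]   # actual 1, predicted 1
fp = matrix.loc[0, 1]   # actual 0, predicted 1
fn = matrix.loc[1, 0]   # actual 1, predicted 0

precision_by_hand = tp / (tp + fp)
recall_by_hand = tp / (tp + fn)

print(matrix)
print(precision_by_hand, precision_score(y_true, y_hat))
print(recall_by_hand, recall_score(y_true, y_hat))
```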

## Conclusion
In this workshop, we have walked through the process of building, training, tuning, and evaluating a model with the Random Forest algorithm.