# Targeting Direct Marketing with Amazon SageMaker XGBoost
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---


## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Background
Direct marketing, whether through mail, email, phone, or another channel, is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook presents an example problem to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls.  The steps include:

* Preparing your Amazon SageMaker notebook
* Downloading data from the internet into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

---

## Preparation

### Note: This notebook has been upgraded to SageMaker API version 2. Please upgrade to sagemaker version 2 using the pip command below.
   `!pip install sagemaker --upgrade`

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
# Import packages
!pip install sagemaker --upgrade   # Upgrade before importing; restart the kernel if an older version was already loaded
import boto3
import re
import sagemaker        # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker import get_execution_role
sess = sagemaker.Session()
print("SageMaker version is: " + sagemaker.__version__)
bucket = sess.default_bucket()
print(bucket)
prefix = 'sagemaker/DEMO-xgboost-dm-10202020'
role = get_execution_role()


Now let's bring in the Python libraries that we'll use throughout the analysis.

In [None]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt # For charts and visualizations
import seaborn as sns
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
%matplotlib inline

---

## Data
Let's start by downloading the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [None]:
!wget https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
!apt-get install unzip -y
!unzip -o bank-additional.zip

Now let's read this into a Pandas data frame and take a look.

In [None]:
data = pd.read_csv('./bank-additional/bank-additional-full.csv')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

In [None]:
# Copy the data so that we can look into original data when required
data_copy=data.copy()
data_copy.shape

### Exploration
We covered exploration in the feature engineering lab on Week 2 - Day 1.

### Transformation

We covered exploration in the feature engineering lab on Week 2 - Day 1.
Here we will only go through the transformations that are absolutely needed for this dataset and the XGBoost algorithm.

In [None]:
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
print(data.head())

### Optional Cell

In [None]:
# Can you find a way to do the above using only pandas, without numpy?
data_copy['no_previous_contact']=data_copy.pdays.<Replace with the Function>(lambda days: 1 if days == 999 else 0 )
data_copy

In [None]:
data['not_working'] = np.where(np.isin(data['job'], ['student', 'retired', 'unemployed']), 1, 0)   # Indicator for individuals not actively employed
model_data = pd.get_dummies(data, dtype=int)                                                       # Convert categorical variables to sets of 0/1 indicators (dtype=int avoids True/False values in the CSV on newer pandas)
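
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The column values are made up for illustration and are not taken from the dataset; `dtype=int` is passed so the indicators come out as 0/1.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({"age": [30, 45, 52], "job": ["student", "admin.", "retired"]})

# Each category becomes its own 0/1 indicator column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one `job_*` indicator set to 1, which is why the target `y` later shows up as the pair `y_no` / `y_yes`.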

Another question to ask yourself before building a model is whether certain features will add value in your final use case.  For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction?  Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather.  So, including this in your model may give you a false sense of precision.

Following this logic, let's remove the economic features and `duration` from our data as they would need to be forecasted with high precision to use as inputs in future predictions.

Even if we were to use values of the economic indicators from the previous quarter, this value is likely not as relevant for prospects contacted early in the next quarter as those contacted later on.

In [None]:
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting.  Supervised learning models are designed to minimize the error between their predictions of the target value and the actuals in the data they are given.  This last part is key: in their quest for greater accuracy, machine learning models frequently bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, so predictions on new data can actually become less accurate as the price of more accurate predictions in the training phase.

The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on "new" data.  There are several ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll simply randomly split the data into 3 uneven groups.  The model will be trained on 70% of the data, then evaluated on 20% of the data to give us an estimate of the accuracy we can expect on "new" data, and the remaining 10% will be held back as a final testing dataset to be used later on.

In [None]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])   # Randomly sort the data then split out first 70%, second 20%, and last 10%
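
The 70/20/10 split can be sanity-checked on a toy frame. The sketch below uses `iloc` slicing rather than `np.split`; it is an equivalent way to cut a shuffled frame at the same positions, not the call used in this notebook, and the 100-row frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for model_data (hypothetical: 100 rows, one column)
toy = pd.DataFrame({"x": np.arange(100)})

# Shuffle once with a fixed seed, then cut at the 70% and 90% marks
shuffled = toy.sample(frac=1, random_state=1729)
n = len(shuffled)
train = shuffled.iloc[: int(0.7 * n)]
validation = shuffled.iloc[int(0.7 * n) : int(0.9 * n)]
test = shuffled.iloc[int(0.9 * n) :]

print(len(train), len(validation), len(test))
```

Because the rows are shuffled before slicing, every row lands in exactly one of the three groups.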

Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also notice that, although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.

In [None]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
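
As a quick check of the layout described above (target first, no header, no index), this sketch writes a toy frame to an in-memory buffer instead of a file. The values are hypothetical.

```python
import io
import pandas as pd

# Toy stand-in for train_data (hypothetical values)
toy = pd.DataFrame({"age": [30, 45], "y_no": [1, 0], "y_yes": [0, 1]})

# Target column first, then the features with both label columns removed
ordered = pd.concat([toy["y_yes"], toy.drop(["y_no", "y_yes"], axis=1)], axis=1)

buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Each output line starts with the label, followed by the feature values in column order.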

Now we'll copy the files to S3 for Amazon SageMaker's managed training to pick up.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

In [None]:
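# Correlations are only defined for numeric columns, so select those first (newer pandas raises on string columns)
data.select_dtypes(include='number').corr().style.background_gradient(cmap='Greens').set_properties(**{'font-size': '10px'})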

In [None]:
sns.pairplot(data)

---

## Training
Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
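
The idea of "many simple models, each addressing the weaknesses of the previous ones" can be sketched in a few lines of NumPy. This is a deliberately simplified illustration: one-split "stumps" fitted to squared-error residuals on synthetic data. It is not XGBoost's actual algorithm, which adds regularization, second-order gradient information, and much more; all names below are made up for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = residual[x <= threshold].mean()
        right = residual[x > threshold].mean()
        error = ((residual - np.where(x <= threshold, left, right)) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left, right)
    return best[1:]

def predict_stump(stump, x):
    threshold, left, right = stump
    return np.where(x <= threshold, left, right)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x)                                    # synthetic target

learning_rate = 0.2                              # plays the role of eta
prediction = np.full_like(y, y.mean())
errors = []
for _ in range(50):
    residual = y - prediction                    # what the ensemble still gets wrong
    stump = fit_stump(x, residual)               # a simple model aimed at that weakness
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

The training error shrinks round after round even though each individual stump is a very weak model on its own.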

[xgboost](https://xgboost.readthedocs.io/en/release_0.90/tutorials/model.html) is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [None]:
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "1.2-1")

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3, and which also specify that the content type is CSV.

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket, prefix), content_type='csv')

Next we'll need to specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and validation set which are passed in.

In [None]:
sagemaker.estimator.Estimator??

In [None]:
print("Training Start time is: "+strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        num_round=100)


xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 


### Options to be aware of for better training

1. Using a single instance (Horovod) or multiple instances for parallel training (Parameter Server)
2. Distributing the data across nodes using S3 key based sharding, or using the data in fully replicated mode
3. Streaming mode or pipe mode for faster training
4. SageMaker Local Mode for training on the notebook instance itself
5. Managed Spot training for saving costs

#### Refer to the parameters passed to the [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html) to understand the various options.

### Optional Start
Train the model in pipe mode and/or with multiple instances. Can you try fully replicated mode vs. S3 key sharding? Do you see any increase in training speed? Does the model overfit as the depth of the trees increases?

In [None]:
# s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv', distribution='ShardedByS3Key')
# s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation'.format(bucket, prefix), content_type='csv')

In [None]:
# #Cell for implementing above Optional task

# print("Training Start time is: "+strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
# xgb = sagemaker.estimator.Estimator(container,
#                                     role, 
#                                     instance_count=1, 
#                                     instance_type='ml.m5.xlarge',
#                                     output_path='s3://{}/{}/output'.format(bucket, prefix),
#                                     input_mode='Pipe',
#                                     sagemaker_session=sess)
# xgb.set_hyperparameters(max_depth=5,
#                         eta=0.2,
#                         gamma=4,
#                         min_child_weight=6,
#                         subsample=0.8,
#                         objective='binary:logistic',
#                         num_round=100)


# xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 


### Optional End

---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

### Optional start

### How do you deploy the model with a custom configuration, such as splitting traffic between models or enabling autoscaling?
Refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-deploy-model.html) for more details.

Steps:

1. Pass a training job name to the `xgb.fit` function above.
2. Use `describe_training_job` on the SageMaker client to get the model artifacts.
3. Create a model using `create_model` on the SageMaker client.
4. Use the cell below to create the endpoint config and the endpoint.

In [None]:

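# # Set the endpoint config and endpoint names.
# # Assumes sm = boto3.client('sagemaker') and that model_name is the model created in step 3.
# endpoint_config_name = 'DEMO-model-config-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# endpoint_name = 'DEMO-model-config-'  + strftime("%Y-%m-%d-%H-%M-%S", gmtime())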
# create_endpoint_config_response = sm.create_endpoint_config(
#     EndpointConfigName = endpoint_config_name,
#     ProductionVariants=[{
#         'InstanceType':'ml.m4.xlarge',
#         'InitialVariantWeight':1,
#         'InitialInstanceCount':1,
#         'ModelName':model_name,
#         'VariantName':'AllTraffic'}])

# print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])


# create_endpoint_response = sm.create_endpoint(
#     EndpointName=endpoint_name,
#     EndpointConfigName=endpoint_config_name)
# print(create_endpoint_response['EndpointArn'])

# resp = sm.describe_endpoint(EndpointName=endpoint_name)
# status = resp['EndpointStatus']
# print("Status: " + status)

### Please perform below tasks as homework

1. Create endpoint with autoscaling
2. Create an endpoint with two models, where the first model gets 75% of the traffic and the second model gets 25% of the traffic.


### Optional end

---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values.  In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`), which produces a simple confusion matrix.

First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [None]:
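def predict(data, rows=500):
    """Score `data` in mini-batches of at most `rows` rows and return a NumPy array of predictions."""
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # The serializer turns each mini-batch into CSV; the endpoint replies with CSV bytes
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma left by the first join before parsing
    return np.fromstring(predictions[1:], sep=',')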

predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy())
predictions
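
The batching steps above can be tried offline by swapping the endpoint for a stand-in function. In the sketch below, `fake_endpoint`, `to_csv_payload`, and `predict_in_batches` are hypothetical names introduced for illustration, and the scoring rule is arbitrary. Only the mini-batch, CSV payload, and parsing pattern mirrors the notebook.

```python
import numpy as np

def fake_endpoint(payload: str) -> bytes:
    """Hypothetical stand-in for xgb_predictor.predict: returns one score per CSV row."""
    rows = [line.split(",") for line in payload.strip().split("\n")]
    scores = [min(1.0, float(row[0]) / 10.0) for row in rows]   # arbitrary toy scoring rule
    return ",".join(str(score) for score in scores).encode("utf-8")

def to_csv_payload(batch: np.ndarray) -> str:
    """Serialize a 2-D array as CSV text, one row per line, no header."""
    return "\n".join(",".join(str(value) for value in row) for row in batch)

def predict_in_batches(data: np.ndarray, rows: int = 500) -> np.ndarray:
    n_batches = int(data.shape[0] / float(rows) + 1)
    pieces = []
    for batch in np.array_split(data, n_batches):
        if batch.shape[0] == 0:                  # nothing to send for an empty batch
            continue
        response = fake_endpoint(to_csv_payload(batch)).decode("utf-8")
        pieces.append(np.array(response.split(","), dtype=float))
    return np.concatenate(pieces)

features = np.arange(12, dtype=float).reshape(6, 2)   # 6 rows, 2 features, no target column
scores = predict_in_batches(features, rows=4)
print(scores)
```

One score comes back per input row, in the original row order, which is what lets the predictions line up with `test_data['y_yes']` later.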

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [None]:
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

So, of the ~4000 potential customers, we predicted 136 would subscribe and 94 of them actually did.  We also missed 389 customers who subscribed but whom we did not predict would.  This is less than desirable, but the model can (and should) be tuned to improve this.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
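
The counts quoted above translate directly into precision and recall. This is just arithmetic on the numbers in the text (your own run may give slightly different counts):

```python
# Counts quoted in the text above
predicted_positive = 136   # customers the model predicted would subscribe
true_positive = 94         # of those, the ones who actually subscribed
false_negative = 389       # subscribers the model did not predict

precision = true_positive / predicted_positive
recall = true_positive / (true_positive + false_negative)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

So roughly 69% of the predicted subscribers were real ones, but only about 19% of all actual subscribers were found, which is the gap that tuning is meant to narrow.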

### Optional section for evaluation

1. Deploy above model with [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html)

### Parameters vs Hyperparameters?

## Automatic Model Tuning
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
For example, suppose that you want to solve a binary classification problem on this marketing dataset. Your goal is to maximize the area under the curve (AUC) metric by training an XGBoost model. You don't know which values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters to use to train the best model. To find the best values for these hyperparameters, you can specify ranges of values that Amazon SageMaker hyperparameter tuning searches to find the combination of values that results in the training job that performs best as measured by the objective metric that you chose. Hyperparameter tuning launches training jobs that use hyperparameter values in the ranges that you specified, and returns the training job with the highest AUC.
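
For intuition about the objective metric, AUC can be read as the share of (subscriber, non-subscriber) pairs in which the subscriber receives the higher score. The sketch below computes it that way on four made-up examples; `pairwise_auc` is a hypothetical helper written for illustration, not how SageMaker computes `validation:auc` internally.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    diff = positives[:, None] - negatives[None, :]   # every positive against every negative
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))   # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because AUC depends only on the ordering of the scores, it is not affected by the class imbalance in the way plain accuracy is.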


In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}


In [None]:
objective_metric_name = 'validation:auc'

In [None]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)


In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False,wait=False)

In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

### Challenge
#### What is the hyperparameter tuning strategy used above?

In [None]:
tuner.<replace>

### Challenge
#### What are two different hyperparameter tuning strategies that SageMaker provides?
#### How can you leverage the results of an existing tuning job for further tuning jobs with 1. identical data and algorithm, 2. transfer learning?
### Optional Section Start

In [None]:
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3,
                            strategy='<Other type of Hyperparameter Tuning Strategy>')

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False,wait=False)

### Optional section end

### Hyperparameter Optimisation 

Common methods are:

1. Naive Grid Search: try every combination of a fixed set of values (for example, learning rate and number of layers). It is hard to scale as the dimensionality increases.
2. Random Search: sample random combinations of hyperparameters. It does better than naive grid search when some of the hyperparameters have little effect.
3. Bayesian Optimization: uses ML to model the search and looks at previous results to find the regions to explore for the best hyperparameters. This is less compute intensive than trying many full training runs on a large dataset and a big neural network.
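
The difference between the first two methods can be sketched on a toy objective. Here `toy_score` is a made-up function standing in for a validation metric; in practice each evaluation would be a full training job. Bayesian optimization is not shown, since it needs a surrogate model that is beyond a short example.

```python
import itertools
import random

def toy_score(eta, max_depth):
    """Hypothetical validation score that peaks at eta=0.3, max_depth=6."""
    return 1.0 - (eta - 0.3) ** 2 - 0.01 * (max_depth - 6) ** 2

# Grid search: every combination of a few hand-picked values (3 x 3 = 9 trials)
grid = list(itertools.product([0.2, 0.5, 0.8], [2, 6, 10]))
best_grid = max(grid, key=lambda params: toy_score(*params))

# Random search: the same budget of 9 trials, sampled from the full ranges
rng = random.Random(0)
samples = [(rng.uniform(0, 1), rng.randint(1, 10)) for _ in range(9)]
best_random = max(samples, key=lambda params: toy_score(*params))

print("grid  :", best_grid, toy_score(*best_grid))
print("random:", best_random, toy_score(*best_random))
```

Grid search only ever sees three distinct values per hyperparameter, while random search sees up to nine for the same budget, which is why it tends to help when only some hyperparameters matter.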

---

## Extensions

This example analyzed a relatively small dataset, but utilized Amazon SageMaker features such as distributed, managed training and real-time model hosting, which could easily be applied to much larger problems.  In order to improve predictive accuracy further, we could tweak the value at which we threshold our predictions to alter the mix of false positives and false negatives, or we could explore techniques like hyperparameter tuning.  In a real-world scenario, we would also spend more time engineering features by hand and would likely look for additional datasets to include which contain customer information not available in our initial dataset.
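
To illustrate how the threshold changes the mix of errors, the sketch below sweeps three cut-offs over a handful of made-up scores. The labels and probabilities are hypothetical and not model output from this notebook.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
actuals = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.90])

def error_counts(threshold):
    """Return (false positives, false negatives) when predicting 1 for scores >= threshold."""
    predicted = (scores >= threshold).astype(int)
    false_positives = int(((predicted == 1) & (actuals == 0)).sum())
    false_negatives = int(((predicted == 0) & (actuals == 1)).sum())
    return false_positives, false_negatives

for threshold in (0.3, 0.5, 0.7):
    fp, fn = error_counts(threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Lowering the threshold catches more subscribers at the cost of more wasted calls; raising it does the opposite. The right balance depends on the relative cost of each kind of error.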

### (Optional) Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)