# Targeting Direct Marketing with Amazon SageMaker XGBoost
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---

---

## Contents

1. [Overview](#Overview)
1. [Preparation](#Preparation)
1. [Data](#Data)
    1. [Exploration](#Exploration)
    1. [Transformation](#Transformation)
1. [Training](#Training)
1. [Hosting](#Hosting)
1. [Evaluation](#Evaluation)
1. [Extensions](#Extensions)

---

## Overview
In this workshop, you will learn how to use <b> Amazon SageMaker </b> to build, train and deploy a machine learning (ML) model using the popular XGBoost Algorithm.   

Amazon SageMaker is a modular, fully managed machine learning service that enables developers and data scientists to build, train, and deploy ML models at scale.  Taking ML models from concept to production is typically complex and time-consuming. You have to manage large amounts of data to train the model, choose the best algorithm for training it, manage the compute capacity while training it, and then deploy the model into a production environment. Amazon SageMaker reduces this complexity by making it much easier to build and deploy ML models. After you choose the right algorithms and frameworks from the wide range of choices available, Amazon SageMaker manages all of the underlying infrastructure to train your model at petabyte scale, and deploy it to production.

In this exercise, you have been asked to develop a machine learning model to predict whether a customer will enroll for a certificate of deposit (CD) after the customer has been contacted through mail, email, phone, etc.  The model will be trained on the marketing dataset that contains information on customer demographics, responses to marketing events, and environmental factors. Because resources and a customer's attention are limited, the goal is to target only the subset of prospects who are likely to engage with a specific offer.  

The data has been labeled for your convenience and a column in the dataset identifies whether the customer is enrolled for a product offered by the bank. A version of this dataset is publicly available  from the ML repository curated by the University of California, Irvine (https://archive.ics.uci.edu/ml/datasets/bank+marketing). This tutorial implements a supervised machine learning model, since the data is labeled. (Unsupervised learning occurs when the datasets are not labeled.)

The steps include:

* Downloading training data into Amazon SageMaker
* Investigating and transforming the data so that it can be fed to Amazon SageMaker algorithms
* Estimating a model using the Gradient Boosting algorithm
* Evaluating the effectiveness of the model
* Setting the model up to make on-going predictions

## Feature Engineering
In this part of the tutorial, you will learn about the highlighted part of machine learning process:
![png](./image/ML-pipeline.JPG)

---

## Data

In this step you will use your Amazon SageMaker Studio notebook to preprocess the data that you need to train your machine learning model.



Execute each cell by pressing <b> Shift+Enter </b> in each of the cells. While the code runs, an * appears between the square brackets next to the cell. After a few seconds, the code execution will complete and the * will be replaced with a number (1 for the first cell you run).



Let's start by:

- Upgrading pandas (a Python library for data analysis and manipulation) to the latest version
- Specifying the S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- Specifying the IAM role arn used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role arn string(s).
- Importing the Python libraries that we will use throughout the analysis and defining a few environment variables in your Jupyter environment.

In [None]:
# cell 01 - upgrade Pandas to the latest version
!pip install -qU pandas

In [None]:
# cell 02 - import the SageMaker SDK and look up the IAM execution role
import re
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.xgboost.estimator import XGBoost

# IAM role used to give training and hosting access to your data
role = get_execution_role()

In [None]:
# cell 03 - bring in the Python libraries used throughout the analysis
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import zipfile                                    # For working with zip archives
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions

In [None]:
# cell 04
pd.__version__

Make sure the pandas version is 1.2.4 or later. If it is not, restart the kernel before going further.

---

## Downloading data
Download the [direct marketing dataset](https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip) from the sample data S3 bucket.

\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014


In [None]:
# cell 05
# restore the shared variables
%store -r bucket
%store -r prefix
%store -r data_folder
%store -r data_file_path

lab_xgb_prefix = f"{prefix}/demo-xgboost"
lab_xgb_prefix

In the next cell, you will load the dataset into a pandas dataframe.

In [None]:
# cell 06
data = pd.read_csv(data_file_path)
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

### Exploration
Let's start exploring the data.  First, let's understand how the features are distributed.

In [None]:
# cell 07
# Frequency tables for each categorical feature
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))

# Summary statistics and histograms for each numeric feature
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=True, figsize=(10, 10))

Notice that:

* Almost 90% of the values for our target variable `y` are "no", so most customers did not subscribe to a term deposit.
* Many of the predictive features take on values of `unknown`.  Some are more common than others.  We should think carefully as to what causes a value of "unknown" (are these customers non-representative in some way?) and how that should be handled.
  * Even if `unknown` is included as its own distinct category, what does it mean given that, in reality, those observations likely fall within one of the other categories of that feature?
* Many of the predictive features have categories with very few observations in them.  If we find a small category to be highly predictive of our target outcome, do we have enough evidence to make a generalization about that?
* Contact timing is particularly skewed.  Almost a third in May and less than 1% in December.  What does this mean for predicting our target variable next December?
* There are no missing values in our numeric features.  Or missing values have already been imputed.
  * `pdays` takes a value near 1000 for almost all customers.  Likely a placeholder value signifying no previous contact.
* Several numeric features have a very long tail.  Do we need to handle these few observations with extremely large values differently?
* Several numeric features (particularly the macroeconomic ones) occur in distinct buckets.  Should these be treated as categorical?



Next, let's look at how our features relate to the target that we are attempting to predict.

In [None]:
# cell 08
for column in data.select_dtypes(include=['object']).columns:
    if column != 'y':
        display(pd.crosstab(index=data[column], columns=data['y'], normalize='columns'))

for column in data.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = data[[column, 'y']].hist(by='y', bins=30)
    plt.show()

Notice that:

* Customers who are "blue-collar" or "married", have an "unknown" default status, were contacted by "telephone", and/or were contacted in "may" make up a substantially lower portion of "yes" than "no" for subscribing.
* Distributions for numeric variables are different across "yes" and "no" subscribing groups, but the relationships may not be straightforward or obvious.



Now let's look at how our features relate to one another.

In [None]:
# cell 09 -- use a correlation matrix and scatter matrix to understand how features are related to one another
display(data.select_dtypes(include='number').corr())   # Correlations are only defined for the numeric columns
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

Notice that:
* Features vary widely in their relationship with one another.  Some with highly negative correlation, others with highly positive correlation.
* Relationships between features are non-linear and discrete in many cases.

### Transformation

Cleaning up data is part of nearly every machine learning project.  It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  Several common techniques include:

* <b>Handling missing values:</b> Some machine learning algorithms are capable of handling missing values, but most would rather not.  Options include:
 * <b>Removing observations with missing values:</b> This works well if only a very small fraction of observations have incomplete information.
 * <b>Removing features with missing values</b>: This works well if there are a small number of features which have a large number of missing values.
 * <b>Imputing missing values</b>: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* <b>Converting categorical to numeric</b>: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* <b>Oddly distributed data</b>: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data.  In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data.  In others, bucketing values into discrete ranges is helpful.  These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* Handling more complicated data types: Manipulating images, text, or data at varying grains is left for other notebook templates.
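
As a small illustration of the encoding and reshaping techniques listed above, the sketch below applies one-hot encoding, a log transform, and bucketing to a made-up three-row frame. The `toy` DataFrame, its `balance` column, and the bucket boundaries are assumptions chosen for the example; they are not taken from the marketing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the marketing dataset
toy = pd.DataFrame({
    "job": ["student", "retired", "student"],
    "balance": [10.0, 1000.0, 100000.0],
})

# One-hot encoding: each distinct value of 'job' becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["job"], dtype=int)

# Log transform: compresses the long right tail of a skewed numeric feature
encoded["log_balance"] = np.log(encoded["balance"])

# Bucketing: map numeric values into discrete, labeled ranges (boundaries are arbitrary here)
encoded["balance_bucket"] = pd.cut(
    toy["balance"], bins=[0, 100, 10_000, np.inf], labels=["low", "mid", "high"]
)

print(encoded)
```

The bucketed column could then be one-hot encoded in the same way as `job` if it were to be fed to a model.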

Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data.  Therefore, let's keep pre-processing simple.

In [None]:
# cell 10 - Transformation
# Create a new indicator variable, 'no_previous_contact', which takes on a value of 1 when pdays = 999
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
# Create a new indicator variable, 'not_working', which takes on a value of 1 if job is 'student', 'retired' or 'unemployed'
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)       # Indicator for individuals not actively employed
# Convert all categorical values into numeric using one-hot encoding
model_data = pd.get_dummies(data, dtype=int)                                                       # 0/1 indicators (recent pandas versions default to booleans, which would be written to CSV as True/False)

Another question to ask yourself before building a model is whether certain features will add value in your final use case.  For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction?  Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather.  So, including this in your model may give you a false sense of precision.

Following this logic, let's remove the economic features [`emp.var.rate`, `cons.price.idx`, `cons.conf.idx`, `euribor3m`,`nr.employed`] and `duration` from our data as they would need to be forecasted with high precision to use as inputs in future predictions.

  Even if we were to use values of the economic indicators from the previous quarter, this value is likely not as relevant for prospects contacted early in the next quarter as those contacted later on.

In [None]:
# cell 11
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

### Splitting data
When building a model whose primary goal is to predict a target value on new data, it is important to understand <b> overfitting</b>.  Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given.  This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown.  These idiosyncrasies then don't repeat themselves in subsequent data, meaning those predictions can actually be made less accurate, at the expense of more accurate predictions in the training phase.
The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on <b>"new"</b> data.  There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc.  For our purposes, we'll simply randomly split the data into 3 uneven groups.  

Use pandas to shuffle the data and split it into 3 groups. The model will be trained on 70% of the data, it will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on "new" data, and 10% will be held back as a final testing dataset which will be used later on.

In [None]:
# cell 12 - split our data into 3 sets: train, validation, test
shuffled = model_data.sample(frac=1, random_state=1729)   # Randomly sort the data
train_end, validation_end = int(0.7 * len(shuffled)), int(0.9 * len(shuffled))
train_data, validation_data, test_data = shuffled.iloc[:train_end], shuffled.iloc[train_end:validation_end], shuffled.iloc[validation_end:]   # First 70%, next 20%, and last 10%

### Transform data into format expected by SageMaker built-in algorithm
Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  
 - For this example, we'll stick to CSV.  
 - The first column must be the target variable and the CSV should not include headers.  

Also, notice that although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.
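
To make the expected layout concrete, here is a minimal sketch that writes a two-row frame to CSV text with the target first and no header. The `toy` frame and its `age` column are hypothetical; only the `y_no` / `y_yes` column names mirror the one-hot encoded target used in this notebook.

```python
import io
import pandas as pd

# Hypothetical frame; 'y_no' and 'y_yes' mimic the one-hot encoded target columns
toy = pd.DataFrame({"age": [30, 45], "y_no": [1, 0], "y_yes": [0, 1]})

# Put the target first, drop both target indicator columns from the features,
# and omit the header row and the index
ordered = pd.concat([toy["y_yes"], toy.drop(["y_no", "y_yes"], axis=1)], axis=1)

buffer = io.StringIO()
ordered.to_csv(buffer, index=False, header=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

Each line starts with the label (`0` or `1`) followed by the feature values, which is the shape described in the bullets above.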

In [None]:
# cell 13
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)

Now we'll upload the final data to S3 for Amazon SageMaker's managed training to pick up.

In [None]:
# cell 14
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(lab_xgb_prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(lab_xgb_prefix, 'validation/validation.csv')).upload_file('validation.csv')

#### Verify Data Upload (Optional)
At this point, you can check the S3 bucket through the console to make sure you have uploaded the `train.csv` and `validation.csv`. The steps are as follows:
1. From the AWS Console Main Page, load S3.
2. Find your bucket name, that corresponds to your chosen region (e.g. "ap-southeast-2")
3. Click on the folder that matches your `prefix` and then on “demo-xgboost/”. You will see a “train/” and “validation/” folder.
4. You can have a look inside of each folder to make sure that the “train.csv” file is there as well as the “validation.csv” file.



---

## Training
We have so far applied data engineering to clean and prepare your data for model building and training.  
In the next section, we will learn about the highlighted phases in the picture ![image.png](attachment:image.png)

Now we know that most of our features have skewed distributions, some are highly correlated with one another, and some appear to have non-linear relationships with our target variable.  Also, for targeting future prospects, good predictive accuracy is preferred to being able to explain why that prospect was targeted.  Taken together, these aspects make <b> gradient boosted trees </b> a good candidate algorithm. Specifically, we will learn how to train, tune and deploy an XGBoost model using SageMaker's built-in <b> XGBoost algorithm </b>.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models.  By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate on gradient boosted trees further and how they differ from similar algorithms.
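
The idea of many simple models correcting one another can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not how XGBoost is implemented: it boosts one-split "stumps" on a single feature with a squared-error loss, and the synthetic data and the helper names (`fit_stump`, `predict_stump`, `boost`) are assumptions made for the example. The `num_round` and `eta` arguments only loosely mirror the hyperparameters of the same name.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:           # candidate split points
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]                                # (threshold, left value, right value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, num_round=50, eta=0.2):
    """Each round fits a stump to the current residuals and adds a shrunken copy of it."""
    prediction = np.full(len(y), y.mean())         # start from a constant model
    stumps = []
    for _ in range(num_round):
        residual = y - prediction                  # what the ensemble still gets wrong
        stump = fit_stump(x, residual)
        prediction = prediction + eta * predict_stump(stump, x)
        stumps.append(stump)
    return prediction, stumps

# Synthetic, made-up data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

baseline_mse = np.mean((y - y.mean()) ** 2)
boosted, stumps = boost(x, y)
boosted_mse = np.mean((y - boosted) ** 2)
print(f"constant model MSE: {baseline_mse:.3f}, boosted stumps MSE: {boosted_mse:.3f}")
```

No single stump can follow the curve, but each round reduces the remaining training error a little, so the sum of stumps fits far better than the constant starting point.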

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

First we'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [None]:
# from sagemaker import image_uris
# cell 15 - specify the XGBoost ECR container location
print(boto3.Session().region_name)
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Specify training and validation data set locations, and content type as CSV. 

In [None]:
# cell 16
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, lab_xgb_prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, lab_xgb_prefix), content_type='csv')

Define training parameters for the SageMaker `Estimator`.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. SageMaker session

You also define the hyperparameters by invoking `set_hyperparameters`.

And then call the `.fit()` function which specifies:
1. S3 location for input data.  In this case we have both a training and validation set which are passed in.

In [None]:
# cell 17
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, lab_xgb_prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

#### Verify Start of Training Job (Optional)
At this point, you can also check from AWS console to verify the training job has started. The steps are as follows:
1. From the AWS Console Main Page, load Amazon SageMaker service.
2. From the left pane, click on `Training` and choose `Training Jobs`.
3. You will see a training job in state `InProgress`.
4. Wait 3-5 minutes until the training job completes and the status becomes `Completed`.

The XGBoost model has now been trained successfully.


---

## Hosting
Now that we've trained the `xgboost` algorithm on our data, let's deploy a model that's hosted behind a real-time endpoint.

In [None]:
# cell 18
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

#### Verify SageMaker endpoint deployment (Optional)
At this point, you can also check from the AWS console to verify that the endpoint is being created. The steps are as follows:
1. From the AWS Console Main Page, load Amazon SageMaker service.
2. From the left pane, click on `Endpoints`.
3. You will see the endpoint in state `Creating`.
4. It will then transition to status `InService`.


---

## Evaluation
There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual values to predicted values.  In particular, we evaluate the model using a <b> confusion matrix </b>.   In this case, we're simply predicting whether the customer subscribed to a term deposit (`1`) or not (`0`).

First we'll need to determine how we pass data into and receive data from our endpoint.  Our data is currently stored as NumPy arrays in memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*
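
The round trip described above can be sketched without an endpoint. The example below turns a small NumPy mini-batch into a CSV payload and parses a comma-separated response back into an array. The feature values and the `response_body` bytes are hypothetical stand-ins, and `np.savetxt` is used here only to show the shape of the payload; it is not a claim about how the SDK's serializer is implemented.

```python
import io
import numpy as np

# Hypothetical mini-batch: two rows, three features each, no target column
batch = np.array([[35, 1, 0], [52, 0, 1]])

# Serialize: one CSV line per row
buffer = io.StringIO()
np.savetxt(buffer, batch, delimiter=",", fmt="%g")
payload = buffer.getvalue()

# Hypothetical response body: one comma-separated score per input row
response_body = b"0.12,0.87"
scores = np.array(response_body.decode("utf-8").split(","), dtype=float)

print(payload)
print(scores)
```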

In [None]:
# cell 19
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [None]:
# cell 20
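def predict(data, predictor, rows=500):
    # Split the data into mini-batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a CSV string of scores; append it to the running string
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array
    return np.fromstring(predictions[1:], sep=',')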

predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [None]:
# cell 21
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

So, of the ~4000 potential customers, the model predicted 53 would subscribe and 101 of them actually did.  We also had 382 customers who actually subscribed but whom the model did not predict would. This is less than desirable, but the model can (and should) be tuned to improve this.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](https://core.ac.uk/download/pdf/55631291.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above. Also, this matrix may vary, if you change the parameters along the way._ 
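
With unbalanced classes it helps to read more than overall accuracy from the confusion matrix. The sketch below derives precision and recall from a cross-tabulation built the same way as in the cell above. The ten labels and scores are invented for illustration and are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and scores for ten customers (not results from this notebook)
actuals = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.1, 0.6, 0.3, 0.2, 0.7, 0.4, 0.8, 0.3])
predicted = np.round(scores).astype(int)

matrix = pd.crosstab(index=actuals, columns=predicted,
                     rownames=["actuals"], colnames=["predictions"])

true_pos = matrix.loc[1, 1]     # subscribed and predicted to subscribe
false_pos = matrix.loc[0, 1]    # did not subscribe but predicted to
false_neg = matrix.loc[1, 0]    # subscribed but not predicted to

precision = true_pos / (true_pos + false_pos)   # share of predicted subscribers who subscribed
recall = true_pos / (true_pos + false_neg)      # share of actual subscribers the model found
accuracy = (predicted == actuals).mean()

print(matrix)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this toy case the accuracy looks reasonable while half of the actual subscribers are missed, which is the kind of gap the confusion matrix makes visible.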


## Automatic Model Tuning (Optional)
Amazon SageMaker `automatic model tuning`, also known as `hyperparameter tuning`, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

For example, suppose that you want to solve a _binary classification_ problem on this marketing dataset. Your goal is to maximize the _area under the curve (auc)_ metric of the algorithm by training an XGBoost Algorithm model. You don't know which values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters to use to train the best model. To find the best values for these hyperparameters, you can specify ranges of values that Amazon SageMaker hyperparameter tuning searches to find the combination of values that results in the training job that performs the best as measured by the objective metric that you chose. Hyperparameter tuning launches training jobs that use hyperparameter values in the ranges that you specified, and returns the training job with highest auc.

In this example, we will tune four hyperparameters:
 - _eta_: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.
 -  _min_child_weight_: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. 
 -  _alpha_: L1 regularization term on weights. Increasing this value makes models more conservative.
 - _max_depth_: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.
 
Next we'll specify the `objective metric` that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using built-in XGBoost algorithm here, it emits two predefined metrics: `validation:auc` and `train:auc`, and we elected to monitor `validation:auc` as you can see below. In this case, we only need to specify the metric name and do not need to provide regex. 

If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.
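
For a custom algorithm, a metric definition pairs a metric name with a regular expression whose capture group holds the value. The sketch below shows how such a regex would pull a number out of a log line. The log line and the exact regex are hypothetical examples; your own training script determines what is actually printed.

```python
import re

# Hypothetical metric definition in the Name/Regex shape used for custom algorithms
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc:([0-9\.]+)"},
]

# Hypothetical log line that a custom training script might print
log_line = "[42] train-auc:0.912 validation-auc:0.874"

extracted = {}
for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # The first capture group is the metric value
        extracted[definition["Name"]] = float(match.group(1))

print(extracted)
```

Testing the regex locally against a sample log line like this is a cheap way to catch a pattern that never matches before launching a tuning job.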


In [None]:
# cell 22
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}


In [None]:
# cell 23
objective_metric_name = 'validation:auc'

Now, we will create a Hyperparameter Tuner object to which we pass:
- The XGBoost estimator we created above
- The hyperparameter ranges
- Objective metric name and definition
- Tuning resource configurations such as number of training jobs to run in total and how many training jobs can be run in parallel

In [None]:
# cell 24
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)


Launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it is completed.

In [None]:
# cell 25
#tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation},include_cls_metadata=False, wait=False)


#### Track progress of the hyperparameter tuning job (Optional)
It will take around 30 minutes to complete. The steps are as follows:
1. From the AWS Console Main Page, load Amazon SageMaker service.
2. From the left pane, click on `Training` and choose `Hyperparameter tuning jobs`.
3. Select the tuning job to see the detail view.
4. You will see the progress of the training jobs.

In [None]:
# cell 26
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

Once the tuning job is completed, you can retrieve the training job with the best performance.  You can also view this information on the AWS Management Console. The steps are as follows:
1. From the AWS Console Main Page, load Amazon SageMaker service.
2. From the left pane, click on `Training` and choose `Hyperparameter tuning jobs`.
3. Select the tuning job to see the detail view.
4. You will see the progress of the training jobs. 

#### Deploy the new best training job (Optional)
Once the tuning job is completed, you can pick the training job with the best performance, deploy, predict and evaluate the model as described earlier.


In [None]:
# cell 27
# return the best training job name
tuner.best_training_job()

In [None]:
# cell 28
# Deploy the best training job to an Amazon SageMaker endpoint (run only after the tuning job has completed)
tuner_predictor = tuner.deploy(initial_instance_count=1,
                               instance_type='ml.m4.xlarge')

In [None]:
# cell 29
# Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [None]:
# cell 30
# Predict
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(),tuner_predictor)

In [None]:
# cell 31
# Compare the tuned model's predictions with actuals in a confusion matrix
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

---

## Extensions

This example analyzed a relatively small dataset, but utilized Amazon SageMaker features such as distributed, managed training and real-time model hosting, which could easily be applied to much larger problems.  In order to improve predictive accuracy further, we could tweak the threshold value we apply to our predictions to alter the mix of false positives and false negatives, or we could explore techniques like hyperparameter tuning.  In a real-world scenario, we would also spend more time engineering features by hand and would likely look for additional datasets to include which contain customer information not available in our initial dataset.
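
The effect of moving the prediction threshold can be sketched on a handful of made-up scores. The labels, scores, and the two thresholds below are assumptions for illustration only; `error_counts` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical labels and predicted probabilities (not results from this notebook)
actuals = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.1, 0.6, 0.3, 0.2, 0.7, 0.4, 0.8, 0.3])

def error_counts(threshold):
    """Return (false positives, false negatives) when predicting 1 for scores above threshold."""
    predicted = np.where(scores > threshold, 1, 0)
    false_pos = int(((predicted == 1) & (actuals == 0)).sum())
    false_neg = int(((predicted == 0) & (actuals == 1)).sum())
    return false_pos, false_neg

for threshold in (0.5, 0.25):
    print(threshold, error_counts(threshold))
```

Lowering the threshold in this toy case finds every subscriber at the cost of contacting more customers who would not subscribe; which trade-off is preferable depends on the relative cost of each kind of error.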

### Clean-up

If you are done with this notebook, please uncomment and run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.  If you also deployed the tuned model, delete its endpoint in the same way.

In [None]:
# cell 32
# xgb_predictor.delete_endpoint(delete_endpoint_config=True)

#### Conclusion
In this workshop we have walked through the process of building, training, tuning and deploying an XGBoost model using SageMaker built-in algorithms. We have also looked at the console view while we performed training, automatic model tuning and model hosting.