# Direct Marketing with Amazon SageMaker XGBoost and Hyperparameter Tuning
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

Last update: Nov. 19th, 2018

---

## Background
Direct marketing, whether through mail, email, phone, or other channels, is a common tactic to acquire customers.  Because resources and a customer's attention are limited, the goal is to only target the subset of prospects who are likely to engage with a specific offer.  Predicting those potential customers based on readily available information like demographics, past interactions, and environmental factors is a common machine learning problem.

This notebook will train a model which can be used to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls. Hyperparameter tuning will be used in order to try multiple hyperparameter settings and produce the best model.

We will use the SageMaker Python SDK, a high level SDK, to simplify the way we interact with SageMaker Hyperparameter Tuning.

---

## Preparation

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  We'll use the default bucket. If you want to use your own, make sure it is in the **same region** as SageMaker.
- The IAM role used to give training access to your data.

In [None]:
import sagemaker
import boto3
import os 
 
bucket = sagemaker.Session().default_bucket()                     
prefix = 'sagemaker/DEMO-hpo-xgboost-dm'

# Role when working on a notebook instance
role = sagemaker.get_execution_role()

---

## Downloading the data set
Let's start by downloading the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository.

In [None]:
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

In [None]:
!head ./bank-additional/bank-additional-full.csv

We need to load this CSV file, inspect it, pre-process it, etc. Please don't write custom Python code to do this!

Instead, developers typically use libraries such as:
* Pandas: a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language: https://pandas.pydata.org/.
* Numpy: a fundamental package for scientific computing with Python: http://www.numpy.org/

Along the way, we'll use functions from these two libraries. You should definitely become familiar with them: they will make your life much easier when working with large datasets.

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd # For munging tabular data

Let's read the CSV file into a Pandas data frame and take a look at the first few lines.

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page
data[:10] # Show the first 10 lines

In [None]:
data.shape # (number of lines, number of columns)

The two classes are extremely unbalanced and it could be a problem for our classifier.

In [None]:
one_class = data[data['y']=='yes']
one_class_count = one_class.shape[0]
print("Positive samples: %d" % one_class_count)

zero_class = data[data['y']=='no']
zero_class_count = zero_class.shape[0]
print("Negative samples: %d" % zero_class_count)

zero_to_one_ratio = zero_class_count/one_class_count
print("Ratio: %.2f" % zero_to_one_ratio)

Let's talk about the data.  At a high level, we can see:

* We have a little over 40K customer records, 20 features plus a target variable ('y') for each customer
* The features are mixed; some numeric, some categorical
* The data appears to be sorted, at least by `time` and `contact`, maybe more

_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable:*
* `y`: Has the client subscribed a term deposit? (binary: 'yes','no')

## Transforming the dataset
Cleaning up data is part of nearly every machine learning project.  It arguably presents the biggest risk if done incorrectly and is one of the more subjective aspects in the process.  Several common techniques include:

* Handling missing values: Some machine learning algorithms are capable of handling missing values, but most would rather not.  Options include:
 * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.
 * Removing features with missing values: This works well if there are a small number of features which have a large number of missing values.
 * Imputing missing values: Entire [books](https://www.amazon.com/Flexible-Imputation-Missing-Interdisciplinary-Statistics/dp/1439868247) have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* Converting categorical to numeric: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* Oddly distributed data: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data.  In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data.  In others, bucketing values into discrete ranges is helpful.  These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* Handling more complicated data types: Manipulating images, text, or data at varying grains.
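
To make the one hot encoding step concrete, here is a minimal pure-Python sketch of the mapping described above. The helper name `one_hot` and the toy records are illustrative assumptions; the notebook itself relies on `pd.get_dummies` for this step.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Keep every other column untouched
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Toy records, made up for illustration
customers = [
    {"age": 41, "marital": "married"},
    {"age": 29, "marital": "single"},
    {"age": 53, "marital": "divorced"},
]
encoded = one_hot(customers, "marital")
print(encoded[0])
```

Each encoded row has exactly one indicator set to 1, which is the property that lets a numeric-only algorithm consume a categorical feature.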

Luckily, some of these aspects have already been handled for us, and the algorithm we are showcasing tends to do well at handling sparse or oddly distributed data.  Therefore, let's keep pre-processing simple.

First of all, many records have the value "999" for pdays, the number of days that passed after a client was last contacted. It is very likely a magic number meaning that no contact was made before. With that in mind, we create a new column called "no_previous_contact" and set it to "1" when pdays is 999 and "0" otherwise.

In [None]:
[np.min(data['pdays']), np.max(data['pdays'])]

In [None]:
# Indicator variable to capture when pdays takes a value of 999
# https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.where.html
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 

In the "job" column, there are categories that mean the customer is not working, e.g., "student", "retired", and "unemployed". Since whether or not a customer is working is very likely to affect their decision to enroll in the term deposit, we generate a new column that shows whether the customer is working, based on the "job" column.

In [None]:
# Indicator for individuals not actively employed
# np.isin is the current replacement for np.in1d, which newer NumPy releases deprecate
# https://numpy.org/doc/stable/reference/generated/numpy.isin.html
data['not_working'] = np.where(np.isin(data['job'], ['student', 'retired', 'unemployed']), 1, 0)

Last but not least, we convert categorical features to numeric ones, as suggested above.

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
model_data = pd.get_dummies(data)  # Convert categorical variables to sets of indicators
model_data[:10]

As you can see, each categorical column (job, marital, education, etc.) has been replaced by a set of new columns, one for each possible value in the category. Accordingly, we now have 67 columns instead of 23 (the 21 original columns plus the two indicators we just added).

In [None]:
model_data.shape

## Selecting features

Another question to ask yourself before building a model is whether certain features will add value in your final use case.  For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction?  Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather.  So, including this in your model may give you a false sense of precision.

Following this logic, let's remove the economic features and `duration` from our data as they would need to be forecasted with high precision to use as inputs in future predictions.

Even if we were to use values of the economic indicators from the previous quarter, this value is likely not as relevant for prospects contacted early in the next quarter as those contacted later on.

In [None]:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

In [None]:
model_data.shape

## Splitting the dataset

We'll then split the dataset into training (70%), validation (20%), and test (10%) datasets and convert the datasets to the right format the algorithm expects. We will use training and validation datasets during training and we will try to maximize the accuracy on the validation dataset.
 
Once the model has been deployed, we'll use the test dataset to evaluate its performance.

Amazon SageMaker's XGBoost algorithm expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, notice that, although repetitive, it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.
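
As a quick sanity check on the 70/20/10 split, the sketch below computes the cut points the same way the next cell does (`int(0.7 * n)` and `int(0.9 * n)`). The row count of 41,188 is an assumption based on the documented size of `bank-additional-full.csv`; check it against `data.shape` in your own run.

```python
n_rows = 41188  # assumed row count of bank-additional-full.csv

# Cut points applied to the shuffled dataset
train_end = int(0.7 * n_rows)
validation_end = int(0.9 * n_rows)

sizes = (train_end, validation_end - train_end, n_rows - validation_end)
print("train, validation, test:", sizes)
```

Under that assumption the test set holds 4,119 rows, which is the same total as the confusion matrix counts quoted later in this notebook (3560 + 78 + 361 + 120).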

In [None]:
# Set the seed to 123 for reproducibility
# https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.sample.html
# https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.split.html
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=123), 
                                                  [int(0.7 * len(model_data)), int(0.9*len(model_data))])  

# Drop the two columns for 'yes' and 'no' and add 'yes' back as first column of the dataframe
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
#pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)

# Dropping the target value, as we will use the CSV file for batch transform
test_data.drop(['y_no', 'y_yes'], axis=1).to_csv('test.csv', index=False, header=False)

In [None]:
!ls -l *.csv

Now we'll copy the files to S3 for Amazon SageMaker training to pick up.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')

SageMaker needs to know where the training and validation sets are located, so let's define that.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

s3_data = {'train': s3_input_train, 'validation': s3_input_validation}

## Training our first model

The problem we're trying to solve is a classification problem: will a given customer react positively to our marketing offer or not? In order to answer this question, let's train a classification model with XGBoost, a popular open source project available in SageMaker.

Please take a few minutes to read:
* The XGBoost algorithm documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
* The Estimator API documentation: https://sagemaker.readthedocs.io/en/latest/estimators.html. The Estimator object is central to training activities in SageMaker, so **you should be very familiar with it**.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

sess = sagemaker.Session()

region = boto3.Session().region_name    
container = get_image_uri(region, 'xgboost')

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c5.2xlarge',
                                    input_mode="File",
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

### Setting hyperparameters
Each built-in algorithm has a set of hyperparameters. Here are the ones for XGBoost: 
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

That probably looks a little weird :) Let's stick to just two parameters: `num_round`, which is required, and `objective`, which we set explicitly because we want a binary classifier:
* Build a binary classifier: 'binary:logistic'
* Train for 100 rounds (but is this the right value? More on this later).

In [None]:
xgb.set_hyperparameters(objective='binary:logistic', num_round=100)

We're all set. Let's train! While the job is running, head out to the SageMaker web console and familiarize yourself with the "Training jobs" section. Take a look at the training log.

In [None]:
xgb.fit(s3_data)

### Questions and things to try
* Imagine you're running this notebook in a different region. Do you need to change anything?
* Why does the Estimator need an IAM role?
* What's the training accuracy? The validation accuracy? Write them down, we'll need them later on.
* What is the recommended instance type for XGBoost? Train on an ml.c4.2xlarge instance and an ml.m5.2xlarge. Compare training times ("billable seconds").
* Try training on two instances. Any improvement? Why?
* What if the dataset is too large to fit in RAM? Can XGBoost still process it?

## Deploying our model
Now let's deploy our model to an HTTPS endpoint. All it takes is one line of code.

While deployment takes place, head out to the SageMaker web console and familiarize yourself with the "Endpoints" section.

In [None]:
xgb_endpoint = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

## Predicting with our model

First we'll need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as NumPy arrays in memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

Now, we'll use a simple function to:
* Loop over our test dataset
* Split it into mini-batches of rows
* Convert those mini-batches to CSV string payloads (of course, we drop the target variable)
* Retrieve mini-batch predictions by invoking the XGBoost endpoint
* Collect predictions and convert from the CSV output our model provides into a NumPy array
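
The steps above can be tried offline with the sketch below. It is only an illustration: `fake_endpoint` is a hypothetical stand-in that returns a constant score per row, and the batching uses fixed-size slices rather than the `np.array_split` call used in the notebook's own `predict` function.

```python
def to_csv_payload(rows):
    """Serialize feature rows into one CSV string, one row per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)

def batches(rows, batch_size):
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def fake_endpoint(payload):
    """Hypothetical stand-in for the endpoint: one comma-separated score per row."""
    n_rows = len(payload.split("\n"))
    return ",".join("0.5" for _ in range(n_rows))

# Toy feature rows, made up for illustration (the target column is already dropped)
rows = [[30, 1, 0], [45, 0, 1], [52, 1, 1], [23, 0, 0], [61, 1, 0]]

predictions = []
for batch in batches(rows, batch_size=2):
    response = fake_endpoint(to_csv_payload(batch))
    predictions.extend(float(score) for score in response.split(","))

print(len(predictions), "predictions for", len(rows), "rows")
```

Batching keeps each request payload small, while still returning exactly one prediction per input row.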

In [None]:
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.array_split.html

def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_endpoint.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

In [None]:
from sagemaker.predictor import csv_serializer

xgb_endpoint.content_type = 'text/csv'
xgb_endpoint.serializer = csv_serializer

# We need to drop the target value, as we're predicting it :)
# .values replaces .as_matrix(), which was removed from recent pandas versions
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).values)
print(predictions)

For each sample, our binary classifier returns a probability between 0 and 1. Since we decided to maximize accuracy, we apply a threshold of 0.5 (this is what `np.round` does in the next cell): anything lower is treated as a 0, anything higher as a 1. 

In [None]:
# https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.crosstab.html

# Also called a 'confusion matrix'
pd.crosstab(index=test_data['y_yes'], 
            columns=np.round(predictions), 
            rownames=['actuals'], colnames=['predictions'])

How well did we do on the test set (your own numbers might vary):
* 3560 true negatives and 120 true positives were correctly predicted.
* 361 positives were incorrectly predicted as negatives (false negatives), so we'll probably miss business opportunities by not engaging with these customers. It looks like this model is too conservative!
* 78 negatives were incorrectly predicted as positives (false positives), so we'll probably waste our time engaging with these customers.

All in all, our accuracy is: (3560+120)/(3560+78+361+120)=0.8934 aka 89.34%. This is consistent with the validation accuracy we observed during the training process. If it wasn't, then it would mean that our validation set and test set have different distributions. That would be a major problem and we would definitely have to fix the way they're built.

Other useful metrics for classifiers are precision, recall and F1 score: 

https://en.wikipedia.org/wiki/F1_score

* **Precision**: the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
* **Recall**: the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
* **F1 score**: the harmonic mean of precision and recall, i.e. 2 * precision * recall / (precision + recall). 1 is the best possible score and 0 is the worst.
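
As a worked example, the sketch below plugs the confusion matrix counts quoted above into these definitions. Your own counts will differ if your run produces a different matrix.

```python
# Counts quoted in the text above (true/false negatives and positives)
tn, fp, fn, tp = 3560, 78, 361, 120

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With these counts, precision is about 0.61 while recall is only about 0.25, which is why the F1 score ends up far below the accuracy.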

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

from sklearn.metrics import precision_recall_fscore_support
score = precision_recall_fscore_support(test_data['y_yes'], np.round(predictions), average='binary')

# Precision, recall, F1 score
print(score)

These false negatives are really hurting recall, which in turn impacts the F1 score. As we saw above, we decided to optimize for accuracy, and this comes at the expense of a rather poor F1 score. If our goal was to maximize the F1 score, we could decide on a different threshold. Keep in mind that you can reduce either false positives or false negatives, but not both. You have to decide which ones have the bigger impact on your application.
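
To illustrate how the threshold drives this trade-off, here is a small sketch that sweeps candidate thresholds and keeps the one with the best F1 score. The labels and scores are made up for illustration, and `f1_at` is a hypothetical helper, not part of any library.

```python
def f1_at(threshold, labels, scores):
    """F1 score when every score >= threshold is predicted as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: three positives among ten samples, made up for illustration
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.03, 0.12, 0.17, 0.22, 0.27, 0.32, 0.37, 0.42, 0.47, 0.83]

thresholds = [t / 100 for t in range(5, 100, 5)]
best = max(thresholds, key=lambda t: f1_at(t, labels, scores))

print("F1 at 0.5:", round(f1_at(0.5, labels, scores), 3))
print("best threshold:", best, "F1:", round(f1_at(best, labels, scores), 3))
```

On this toy data, lowering the threshold recovers the positives that 0.5 misses, at the cost of one extra false positive. In practice the threshold should be chosen on the validation set, not on the test set.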

This trade-off is made more difficult by the class imbalance problem. Fixing it is beyond the scope of this workshop, but we could use techniques like:
* adding real data to the positive class,
* adding synthetic data to the positive class (e.g. over-sampling),
* using the 'scale_pos_weight' hyper parameter to account for class imbalance,
* more pre-processing, more feature engineering, etc.

If you want to dive deeper, this blog post is a good starting point: https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
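
For the `scale_pos_weight` option, a commonly suggested starting value is the ratio of negative to positive samples. The sketch below computes it from the test-set confusion matrix counts quoted earlier, purely as an illustration; in practice you would compute it on the training data, much like the `zero_to_one_ratio` calculated at the start of this notebook.

```python
# Actual class counts implied by the confusion matrix quoted earlier
negatives = 3560 + 78   # true negatives + false positives
positives = 361 + 120   # false negatives + true positives

# Heuristic starting point: number of negatives per positive
scale_pos_weight = negatives / positives
print(f"scale_pos_weight ~ {scale_pos_weight:.2f}")
```

This value is only a starting point and is itself worth tuning.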

## Deleting the endpoint
Once we're done predicting, we can delete the endpoint (and stop paying for it). You can re-deploy by running the appropriate cell above. 

In [None]:
sagemaker.Session().delete_endpoint(xgb_endpoint.endpoint)

## Use batch prediction
Some use cases either don't require or don't work well with HTTPS-based prediction. Imagine having to predict 100GB of bulk data every 24 hours: it wouldn't be efficient to do this with an endpoint.

SageMaker supports batch prediction. Let's apply it to the model we trained earlier: run the next 2 cells and wait for a bit. While this takes place, head out to the SageMaker web console and familiarize yourself with the "Batch transform jobs" section.

In [None]:
transformer = xgb.transformer(instance_count=1, instance_type='ml.c5.2xlarge')

# Reminder: test.csv must only contain features, not the target value
transformer.transform('s3://{}/{}/test/test.csv'.format(bucket, prefix), content_type='text/csv')

In [None]:
transformer.wait()
print(transformer.output_path)

### Copy the output file and display the first 10 predictions
Predictions are written to S3. Let's use the AWS CLI to retrieve them and display the first 10 probabilities.

In [None]:
!aws s3 cp $transformer.output_path/test.csv.out .
!ls -l test.csv.out
!head -10 test.csv.out

This is cool but can we improve our model?

On our first attempt, we used the minimum number of hyperparameters. Surely we can tweak a little more and improve the accuracy :)

## How long should we train for?
In the previous example, we set the number of rounds to 100. How do we know if this is the right value or not? If we don't train long enough, we could be missing out on accuracy. If we train for too long, we could be overfitting... or just wasting time and money.

XGBoost has a hyperparameter named *early_stopping_rounds*. It stops training when the validation metric hasn't improved in *early_stopping_rounds* rounds. Take a minute to read the doc: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
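
The idea behind early stopping can be sketched in a few lines. This is a simplified illustration of the logic, not XGBoost's actual implementation; the function name and the error values are made up.

```python
def early_stopping_round(validation_errors, patience):
    """Return (best_round, stop_round) for a list of per-round validation errors.

    Training stops once `patience` rounds have passed without beating the best error.
    """
    best_error = float("inf")
    best_round = -1
    for current_round, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_round = error, current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    # Never triggered: training ran for all rounds
    return best_round, len(validation_errors) - 1

# Made-up validation errors: improvement stalls after round 3
errors = [0.120, 0.110, 0.105, 0.104, 0.106, 0.105, 0.107, 0.108]
print(early_stopping_round(errors, patience=3))
```

Here the best error is reached at round 3, and training stops at round 6 after three rounds without improvement.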

Let's try it: we're going to train for 10 times longer (1,000 rounds) and stop if accuracy hasn't improved in 100 rounds.

In [None]:
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c5.2xlarge',
                                    input_mode="File",
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='binary:logistic', 
                        num_round=1000,
                        early_stopping_rounds=100)

xgb.fit(s3_data)

### Questions and things to try
* What's the training accuracy? The validation accuracy? Any improvement compared to the previous training?
* Did we stop early? Which round yielded the highest validation accuracy?
* Assuming we stopped much sooner this time, what probably happened during our initial training with 100 rounds?

## How deep should the trees be?
Tree depth is obviously an important parameter for tree-based algorithms. XGBoost has a hyperparameter named *max_depth*: by default, it is set to 6. Is this too high? Too low? Maybe we could improve our accuracy by building a more complex model based on deeper trees. Or maybe the model would generalize better with shallower trees?

Let's try different values for *max_depth*:
* set it first to 4, 
* then 8 
* and finally 0 (i.e. unlimited depth).

In [None]:
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c5.2xlarge',
                                    input_mode="File",
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='binary:logistic', 
                        num_round=1000,
                        early_stopping_rounds=100,
                        max_depth=4)

xgb.fit(s3_data)

### Questions and things to try
Keep track of training and validation accuracies, early stopping, etc. Explain what you see :)

So... Selecting the right value for *max_depth* is not obvious, is it? 

And what about other hyperparameters? **We can't reasonably keep guessing like this**. 

Fortunately, SageMaker supports Automatic Model Tuning.

## Understanding Automatic Model Tuning

We will use SageMaker tuning to automate the search process. Specifically, we specify a range, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameters that we plan to tune. SageMaker hyperparameter tuning will automatically launch multiple training jobs with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will give it a budget (max number of training jobs) and it will complete once that many training jobs have been executed.
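
To get a feel for the budgeted search loop, here is a minimal sketch. Plain random search stands in for SageMaker's own strategy, which additionally uses the results of earlier jobs to pick new candidates. `toy_validation_error` is a hypothetical stand-in for a real training job, and the two ranges are assumptions chosen to mirror the ones used later in this notebook.

```python
import random

def toy_validation_error(params):
    """Hypothetical stand-in for a training job: returns a made-up error."""
    return abs(params["eta"] - 0.3) + 0.01 * abs(params["max_depth"] - 5)

def random_search(max_jobs, seed=0):
    """Try `max_jobs` random settings and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(max_jobs):
        params = {
            "eta": rng.uniform(0.0, 1.0),      # continuous range
            "max_depth": rng.randint(1, 10),   # integer range, both ends included
        }
        error = toy_validation_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

best_params, best_error = random_search(max_jobs=20)
print(best_params, round(best_error, 4))
```

The budget (`max_jobs`) bounds the cost of the search: each extra job is one more chance to find a better setting, but also one more training job to pay for.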

In this example, we are using SageMaker Python SDK to set up and manage the hyperparameter tuning job. We first configure the training jobs the hyperparameter tuning job will launch by initiating an estimator, which includes:
* The container image for the algorithm (XGBoost)
* Configuration for the output of the training jobs
* The values of static algorithm hyperparameters (those that are not specified will be given default values)
* The type and number of instances to use for the training jobs

We will tune four hyperparameters in this example. Don't worry if this sounds over-complicated, we don't need to understand this in detail right now.
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative. 
* *alpha*: L1 regularization term on weights. Increasing this value makes models more conservative. 
* *min_child_weight*: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted. 
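
The "sum of instance weight (hessian)" in *min_child_weight* can be made concrete for a binary logistic objective, where each instance with unit weight contributes p * (1 - p), with p the current predicted probability. The sketch below is a simplified illustration under that assumption; the helper names and probabilities are made up.

```python
def hessian_sum(probabilities):
    """Sum of logistic-loss second derivatives p * (1 - p) over a leaf's instances."""
    return sum(p * (1.0 - p) for p in probabilities)

def can_keep_leaf(probabilities, min_child_weight):
    """A candidate leaf is acceptable only if its hessian sum reaches the minimum."""
    return hessian_sum(probabilities) >= min_child_weight

uncertain_leaf = [0.5, 0.5, 0.5, 0.5]      # each instance contributes 0.25
confident_leaf = [0.99, 0.99, 0.99, 0.99]  # each instance contributes about 0.0099

print(hessian_sum(uncertain_leaf), can_keep_leaf(uncertain_leaf, 1.0))
print(round(hessian_sum(confident_leaf), 4), can_keep_leaf(confident_leaf, 1.0))
```

So for classification, *min_child_weight* is not a plain instance count: a leaf full of confidently predicted samples carries very little weight, and a higher threshold stops the tree from splitting further on them.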

In [None]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'alpha': ContinuousParameter(0, 2),
                        'max_depth': IntegerParameter(1, 10)
                        }

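Next we'll specify the objective metric that we'd like to tune and whether it should be minimized or maximized. For built-in algorithms such as XGBoost, SageMaker already knows how to extract the metric from the CloudWatch logs of the training job, so we don't need to provide a regular expression (Regex) here; a Regex-based metric definition is only needed for custom algorithms.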

In [None]:
objective_metric_name = 'validation:error'
objective_type = 'Minimize'

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- Objective metric name and type
- Tuning resource configurations such as the total number of training jobs to run and how many training jobs can be run in parallel.

In [None]:
from sagemaker.tuner import HyperparameterTuner

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.2xlarge',
                                    input_mode="File",
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(objective='binary:logistic', 
                        num_round=1000,
                        early_stopping_rounds=100)

tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type=objective_type,
                            max_jobs=20,
                            max_parallel_jobs=3)

## Launching Automatic Model Tuning
Now we can launch a hyperparameter tuning job by calling the **fit()** function. Once the job is launched, head out to the SageMaker console to track the progress of the hyperparameter tuning job until it is completed. 

This job will run for about 20 minutes, so there's time for a coffee break too. If you're too hardcore for a coffee break, you can use the time to learn more about the XGBoost algo:
* Documentation: https://xgboost.readthedocs.io/en/latest/
* Research paper: https://arxiv.org/abs/1603.02754

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Let's just run a quick check of the tuning job status to make sure it started successfully. 

In [None]:
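# Caution: this rebinds the name `sagemaker` from the SDK module to a boto3 client.
# Run `import sagemaker` again before re-running any earlier cell that uses the SDK.
sagemaker = boto3.Session().client(service_name='sagemaker') 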

# Get tuning job name
job_name = tuner.latest_tuning_job.job_name
print(job_name)

sagemaker.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name)['HyperParameterTuningJobStatus']

While the job is running, head out to the SageMaker web console and spend a few minutes familiarizing yourself with the "Hyperparameter tuning jobs" section.

In [None]:
# run this cell to check current status of hyperparameter tuning job
tuning_job_result = sagemaker.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)

status = tuning_job_result['HyperParameterTuningJobStatus']
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
    
job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
print("%d training jobs have completed" % job_count)

In [None]:
from pprint import pprint
if tuning_job_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")

Once the tuning job is complete, we can deploy the best model.

## Deploying the best model

In [None]:
tuning_job_result = sagemaker.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)
best_model_name = tuning_job_result['BestTrainingJob']['TrainingJobName']
print(best_model_name)

import time
timestamp = time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = best_model_name + '-ep-' + timestamp
print(endpoint_name)
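
SageMaker limits endpoint names to 63 characters, so a long training job name plus a suffix and a timestamp can overflow. As a hedged sketch (the helper `make_endpoint_name` is hypothetical, not part of the workshop), the name can be built so that only the base is trimmed:

```python
import time

MAX_NAME_LENGTH = 63  # SageMaker limit for endpoint names

def make_endpoint_name(base_name, suffix='-ep-', timestamp=None):
    """Build '<base><suffix><timestamp>', trimming the base so the result fits."""
    if timestamp is None:
        timestamp = time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())
    tail = suffix + timestamp
    room = MAX_NAME_LENGTH - len(tail)
    # Trim the base name only, and drop hyphens left dangling by the cut.
    return base_name[:room].rstrip('-') + tail
```

For example, `make_endpoint_name(best_model_name)` would produce the same name as the cell above whenever that name already fits.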

### Option 1: deploying with a simple configuration
The easiest way to deploy the best model is to use the **deploy()** API of the current HyperparameterTuner object. If you wanted to use a previous tuning job, you could use the **attach()** API to attach to it before calling **deploy()**. More info: https://sagemaker.readthedocs.io/en/latest/tuner.html

While you're waiting, head out to the SageMaker web console and familiarize yourself with the "Endpoints" section.

In [None]:
tuner.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge', endpoint_name=endpoint_name)

Let's load the test set from file and predict the first 10 samples.

In [None]:
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

with open('test.csv') as f:   # close the file once the lines are read
    test_samples = [line.rstrip('\n') for line in f]
test_samples = test_samples[:10] # We'll predict the first 10 samples

for sample in test_samples:
    sample = bytes(sample, 'utf-8')
    print(sample)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                  ContentType='text/csv', 
                                  Body=sample)
    print(response['Body'].read())
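
The endpoint returns raw bytes. The sketch below shows one way to turn them into a class label. It assumes the body holds a single score between 0 and 1 encoded as text (as XGBoost does with a logistic objective); the helper `parse_prediction` and the 0.5 threshold are illustrative choices, not something the workshop prescribes.

```python
def parse_prediction(raw_body, threshold=0.5):
    """Turn the raw response body into (score, label).

    Assumes `raw_body` holds a single number between 0 and 1 encoded as text.
    """
    score = float(raw_body.decode('utf-8').strip())
    label = 1 if score >= threshold else 0
    return score, label
```

With unbalanced classes, a threshold other than 0.5 may give a better trade-off between precision and recall.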

When we're done, we can delete the endpoint.

In [None]:
sagemaker.delete_endpoint(EndpointName=endpoint_name)

### Option 2: deploying with an advanced configuration
SageMaker lets you deploy multiple variants of a model to the same endpoint. This is useful for testing variations of a model in production. Let's try this and deploy the top 2 models trained by the tuning job.

First, let's figure out what the top 2 jobs are.

In [None]:
# Find the two best completed jobs
stats = tuner.analytics().dataframe()              # grab job stats in a Pandas dataframe
stats = stats.nsmallest(2, 'FinalObjectiveValue')   # keep top two performing models (lowest validation error)
stats.head()

In [None]:
top1_model_name = stats.iloc[0]['TrainingJobName']
top2_model_name = stats.iloc[1]['TrainingJobName']
print(top1_model_name)
print(top2_model_name)
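
Whether "best" means smallest or largest depends on the objective of the tuning job: an error rate is minimized, whereas a metric such as AUC is maximized. The following pure-Python sketch makes that choice explicit. The helper `top_jobs` and the job values are made up for illustration.

```python
import heapq

def top_jobs(jobs, n=2, objective_type='Minimize'):
    """Return the n best jobs, best first.

    `jobs` is a list of dicts with 'TrainingJobName' and 'FinalObjectiveValue'.
    `objective_type` is 'Minimize' (e.g. an error rate) or 'Maximize' (e.g. AUC).
    """
    key = lambda job: job['FinalObjectiveValue']
    if objective_type == 'Minimize':
        return heapq.nsmallest(n, jobs, key=key)
    if objective_type == 'Maximize':
        return heapq.nlargest(n, jobs, key=key)
    raise ValueError('objective_type must be "Minimize" or "Maximize"')

# Made-up values for illustration only.
jobs = [
    {'TrainingJobName': 'job-a', 'FinalObjectiveValue': 0.12},
    {'TrainingJobName': 'job-b', 'FinalObjectiveValue': 0.09},
    {'TrainingJobName': 'job-c', 'FinalObjectiveValue': 0.10},
]
best = top_jobs(jobs, n=2, objective_type='Minimize')
print([job['TrainingJobName'] for job in best])  # ['job-b', 'job-c']
```

With the Pandas dataframe used above, the counterpart for a maximized objective would be `nlargest()` instead of `nsmallest()`.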

Then, we use the **create_model()** API to register the two best training jobs as SageMaker models: https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateModel.html

In [None]:
def create_model(model_name):
    # Look up where the training job stored its model artifact in S3
    model_info = sagemaker.describe_training_job(TrainingJobName=model_name)
    model_data = model_info['ModelArtifacts']['S3ModelArtifacts']
    # `container` must hold the same algorithm image that was used for training
    primary_container = {'Image': container, 'ModelDataUrl': model_data}
    # Register the model under the same name as its training job
    create_model_response = sagemaker.create_model(ModelName=model_name,
        ExecutionRoleArn=role,
        PrimaryContainer=primary_container)
    print(create_model_response['ModelArn'])
    
create_model(top1_model_name)
create_model(top2_model_name)

Then, let's define an endpoint configuration with the required infrastructure settings for the endpoint.

In [None]:
# Reuse the timestamp computed earlier so the configuration and endpoint names match
endpoint_config_name = best_model_name + '-epc-' + timestamp
endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[
        {
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':top1_model_name,
        'InitialVariantWeight':2,     # two thirds of the traffic
        'VariantName':'top1'
        },
        {
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':top2_model_name,
        'InitialVariantWeight':1,    # one third of the traffic
        'VariantName':'top2'
        }])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))
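
Variant weights are relative: each variant receives its weight divided by the sum of all weights. A small sketch of that arithmetic follows (the helper `traffic_shares` is hypothetical and only reads the two fields shown):

```python
def traffic_shares(variants):
    """Map each variant name to the fraction of traffic it should receive."""
    total = sum(v['InitialVariantWeight'] for v in variants)
    return {v['VariantName']: v['InitialVariantWeight'] / total
            for v in variants}

variants = [
    {'VariantName': 'top1', 'InitialVariantWeight': 2},
    {'VariantName': 'top2', 'InitialVariantWeight': 1},
]
shares = traffic_shares(variants)
for name, share in shares.items():
    print('{}: {:.1%} of the traffic'.format(name, share))
```

Weights of 2 and 1 therefore give the two-thirds / one-third split mentioned in the comments above.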

Now, we can deploy the endpoint with the **create_endpoint()** API.

In [None]:
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

We have to wait until the endpoint is in service. The waiter returned by the **get_waiter()** API will block until it's ready. While you're waiting, head out to the SageMaker web console and familiarize yourself with the "Endpoint configurations" and "Endpoints" sections.

In [None]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))

# wait until the status has changed
sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)

# print the status of the endpoint
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')
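
boto3 waiters accept a `WaiterConfig` argument to control how often and how many times they poll. The sketch below wraps the wait in a helper. `wait_for_endpoint` and its default values are illustrative assumptions, and `client` is assumed to be a boto3 SageMaker client.

```python
def wait_for_endpoint(client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is in service, then return its status.

    `client` is assumed to be a boto3 SageMaker client. With the values
    above, the wait gives up after about 30 minutes (30 s x 60 attempts).
    """
    waiter = client.get_waiter('endpoint_in_service')
    waiter.wait(EndpointName=endpoint_name,
                WaiterConfig={'Delay': delay_seconds,
                              'MaxAttempts': max_attempts})
    response = client.describe_endpoint(EndpointName=endpoint_name)
    return response['EndpointStatus']
```

If the endpoint fails or the attempts run out, the waiter raises an exception instead of returning.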

Let's load the test set from file and predict the first 10 samples.

In [None]:
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

with open('test.csv') as f:   # close the file once the lines are read
    test_samples = [line.rstrip('\n') for line in f]
test_samples = test_samples[:10] # We'll predict the first 10 samples

for sample in test_samples:
    sample = bytes(sample, 'utf-8')
    print(sample)
    response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                  ContentType='text/csv', 
                                  Body=sample)
    print(response['Body'].read())

Take a look at the "Endpoints" section in the SageMaker console: you should see some CloudWatch metrics for each model. Some of the predictions may also differ from the previous run, as the endpoint splits requests between the two models according to their variant weights.
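
To see which model answered a given request, or to send a request to one model only, the runtime API offers the `TargetVariant` request parameter and the `InvokedProductionVariant` response field. The sketch below is an assumption-laden illustration: `invoke_variant` is a hypothetical helper, `runtime_client` is assumed to be a boto3 SageMaker runtime client, and `TargetVariant` requires a boto3 version recent enough to support it.

```python
def invoke_variant(runtime_client, endpoint_name, csv_row, variant=None):
    """Send one CSV row to the endpoint and report which variant answered.

    `runtime_client` is assumed to be a boto3 SageMaker runtime client.
    If `variant` is given, the request is pinned to that production variant.
    """
    params = {'EndpointName': endpoint_name,
              'ContentType': 'text/csv',
              'Body': csv_row.encode('utf-8')}
    if variant is not None:
        params['TargetVariant'] = variant
    response = runtime_client.invoke_endpoint(**params)
    prediction = response['Body'].read().decode('utf-8')
    return response.get('InvokedProductionVariant'), prediction
```

For instance, `invoke_variant(runtime, endpoint_name, test_samples[0], variant='top2')` would send the first sample to the second model only.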

When we're done, we can delete the endpoint, its configuration and the two models (files will still exist in S3).

In [None]:
sagemaker.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

In [None]:
sagemaker.delete_model(ModelName=top1_model_name)

In [None]:
sagemaker.delete_model(ModelName=top2_model_name)
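
The four clean-up cells can also be grouped into one helper that keeps going if a resource is already gone. This is only a sketch: `cleanup` is a hypothetical name and `client` is assumed to be a boto3 SageMaker client.

```python
def cleanup(client, endpoint_name, endpoint_config_name, model_names):
    """Delete the endpoint, then its configuration, then the models.

    `client` is assumed to be a boto3 SageMaker client. Each deletion is
    attempted even if an earlier one fails; failures are collected and returned.
    """
    steps = [(client.delete_endpoint, {'EndpointName': endpoint_name}),
             (client.delete_endpoint_config,
              {'EndpointConfigName': endpoint_config_name})]
    steps += [(client.delete_model, {'ModelName': name})
              for name in model_names]

    failures = []
    for call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as error:  # e.g. the resource was already deleted
            failures.append((kwargs, str(error)))
    return failures
```

An empty list as the return value means every deletion call succeeded.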

## One more for the road: build a binary classifier with Linear Learner

Just for fun, let's use another built-in algorithm to build a binary classifier: Linear Learner.

Linear Learner supports CSV files, so we can reuse our existing dataset files.

Please take a few minutes to read the documentation :)
https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html

In [None]:
# The training and validation files have 60 columns: the target (0 or 1) in first position, plus 59 features.
!head train.csv

By now, you're hopefully comfortable with the API calls below. Can you figure everything out? :)

In [None]:
# Re-import the SDK: the name `sagemaker` was rebound to a boto3 client earlier
import sagemaker

container = get_image_uri(region, 'linear-learner')

feature_dim = 59

# Same generic Estimator as before, this time with the Linear Learner container
linear = sagemaker.estimator.Estimator(container,
                                       role,
                                       train_instance_count=1,
                                       train_instance_type='ml.c5.2xlarge',
                                       input_mode="File",
                                       output_path='s3://{}/{}/output'.format(bucket, prefix),
                                       sagemaker_session=sess)

linear.set_hyperparameters(predictor_type='binary_classifier',
                           binary_classifier_model_selection_criteria='accuracy',
                           normalize_data=True,
                           positive_example_weight_mult='balanced',
                           feature_dim=feature_dim)

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='text/csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='text/csv')

s3_data = {'train': s3_input_train, 'validation': s3_input_validation}

linear.fit(s3_data)

Metrics are visible at the end of the training log: 
* *validation binary_classification_accuracy*
* *validation binary_f_1*
* *validation precision*
* *validation recall*
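
Instead of scrolling through the log, the final metrics of a finished training job can also be read from the `FinalMetricDataList` field returned by `describe_training_job`. A minimal sketch, assuming `client` is a boto3 SageMaker client (the helper `final_metrics` is hypothetical):

```python
def final_metrics(client, training_job_name):
    """Return {metric name: value} for a finished training job.

    `client` is assumed to be a boto3 SageMaker client. The dictionary is
    empty if the job has not reported any final metrics.
    """
    info = client.describe_training_job(TrainingJobName=training_job_name)
    return {metric['MetricName']: metric['Value']
            for metric in info.get('FinalMetricDataList', [])}
```

Note that the cell above re-imports the SDK under the name `sagemaker`, so a new boto3 client would have to be created before calling this helper.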

How well did Linear Learner do on this dataset? 

Better or worse than XGBoost? 

When would you use one algo or the other? Hint: think about the business impact of false positives and false negatives.
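
To make the hint concrete, here is a small sketch that derives the usual metrics from a confusion matrix and attaches a business value to it. All numbers, the cost model and the helper names are made up for illustration; they are not results from this dataset.

```python
def classification_report(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

def campaign_value(tp, fp, fn, call_cost, deposit_value):
    """Hypothetical cost model: every predicted positive gets a call.

    Returns the net value of the campaign and the value lost to
    false negatives (customers who would have enrolled but were not called).
    """
    net = tp * deposit_value - (tp + fp) * call_cost
    missed = fn * deposit_value
    return {'net': net, 'missed': missed}

# Made-up confusion matrix and prices.
report = classification_report(tp=60, fp=40, fn=20, tn=880)
value = campaign_value(tp=60, fp=40, fn=20, call_cost=5, deposit_value=100)
print(report)
print(value)  # {'net': 5500, 'missed': 2000}
```

Under these assumed prices a false negative costs far more than a false positive, so a model with higher recall would be preferable even at lower precision.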

## Congratulations! You now know a lot about Amazon SageMaker and its built-in algorithms :)

We've seen how to:
* load and pre-process columnar data with Pandas,
* build training, validation and test sets,
* use the high-level SageMaker SDK,
* train a model with a built-in algorithm,
* predict with a model deployed to an HTTPS endpoint,
* predict with a model in batch mode,
* use automatic model tuning to find the best hyperparameters,
* deploy any model on demand and predict with it, using the low-level boto3 SDK.

Now it's your turn to build! We thank you for attending this workshop and we hope you had a good time. Enjoy the rest of AWS re:Invent :)