# Training the Model Using AWS SageMaker

In this notebook, I will train the COMPAS `XGBoost` model using SageMaker.

## Contents

1. Loading the data to S3 buckets
2. Setting the model
3. Hyperparameter tuning
4. Evaluating the model
5. Conclusions

## Loading the data to S3 buckets

In the `Data-Exploration.ipynb` notebook, I loaded and prepared the data. I also split it into training (75%) and testing (25%) sets using Scikit-Learn's `train_test_split()` with a fixed random seed. Finally, I exported all the data to `.csv` files in the `data` folder.

Now, I will create an S3 bucket and upload the `.csv` files to it.

In [37]:
import pandas as pd
import boto3
import sagemaker
import numpy as np

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

Specify the `data_dir` where you've saved your `.csv` files. Decide on a descriptive `prefix` that defines where your data will be uploaded in the default S3 bucket. 

Finally, create a pointer to your training data by calling `sagemaker_session.upload_data` and passing in the required parameters. It may help to look at the [Session documentation](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.Session.upload_data).

In [16]:
data_dir = 'data'

# set prefix, a descriptive name for a directory  
prefix = 'sagemaker/compas_model'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir,
                                           bucket=bucket,
                                           key_prefix=prefix)
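
To make the effect of `key_prefix` concrete, here is a minimal sketch of how files in a local folder could map to S3 object keys under a prefix. It only illustrates the naming scheme; it is not the SDK's implementation, and `planned_s3_keys` is a hypothetical helper that I made up for this example.

```python
import os
import tempfile


def planned_s3_keys(local_dir, key_prefix):
    """Return {local_path: s3_key} for every file under local_dir.

    Illustration only: this mirrors the idea of a recursive directory
    upload, it does not call AWS and is not the SDK's own code.
    """
    mapping = {}
    for dirpath, _, filenames in os.walk(local_dir):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            # Path relative to the uploaded folder, with forward slashes for S3.
            rel = os.path.relpath(local_path, start=local_dir).replace(os.sep, "/")
            mapping[local_path] = f"{key_prefix}/{rel}"
    return mapping


# Demo on a throwaway folder holding two of the file names used in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("train.csv", "val.csv"):
        open(os.path.join(tmp, fname), "w").close()
    keys = sorted(planned_s3_keys(tmp, "sagemaker/compas_model").values())

print(keys)
```

With the prefix used above, each file ends up under `sagemaker/compas_model/` in the default bucket.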



Now if you go to the `AWS Console` and then to the `S3 Management Console`, you should see an S3 bucket named something like `sagemaker-us-east-######`. Your data is inside this bucket.

## Setting the XGBoost Model

Now that I have the training, validation and test data uploaded to S3, I can construct the `XGBoost` model and train it. I will use SageMaker's hyperparameter tuning functionality to train multiple models and use the one that performs the best on the validation set.

Since, in the COMPAS context, I am concerned with reducing the false positive rate while keeping good accuracy, I will tune the model to maximize `validation:map`, the *Mean Average Precision* on the validation set. I do not want the model to label someone (either black or white) as medium/high risk for recidivism if the ground truth is low risk.
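
As a rough illustration of what an average-precision style metric rewards, here is a minimal sketch for a single ranked list of binary labels. The toy labels and scores are made up, and XGBoost's own `map` computation may differ in details such as tie handling or grouping, so treat this as a sketch of the idea rather than the exact metric SageMaker reports.

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k that hold a positive label."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical toy data: positives sit at ranks 1 and 3.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6]
ap = average_precision(toy_labels, toy_scores)
print(round(ap, 4))
```

A negative example ranked above a positive one (rank 2 here) pulls the score down, which is why this kind of metric penalizes confident false positives.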

To begin with, I will need to construct an `estimator` object.

In [13]:
from sagemaker.amazon.amazon_estimator import get_image_uri

# construct the image name for the training container.
container = get_image_uri(sagemaker_session.boto_region_name, 'xgboost')

# Now that I know which container to use, I can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
                                    role,      # The IAM role to use (our current role in this case)
                                    train_instance_count=5,  # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge',  # The type of instance to use for training
                                    output_path=f"s3://{sagemaker_session.default_bucket()}/{prefix}/output",  # Where to save the output (the model artifacts)
                                    sagemaker_session=sagemaker_session)  # The current SageMaker session



Before beginning the hyperparameter tuning, I need to set any model-specific hyperparameters for which I want fixed default values. Quite a few can be set when using the XGBoost algorithm; below are just some of them. If you would like to change the hyperparameters below or modify additional ones, you can find more information on the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

Also, note that since the COMPAS model is a binary classifier, I will set the objective to `'reg:logistic'` instead of the default `'reg:squarederror'`.

In [9]:
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='reg:logistic',
                        early_stopping_rounds=10,
                        num_round=200)
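
With a logistic objective, the model's raw score is passed through the logistic function, so each prediction is a number between 0 and 1 that can be thresholded into a class label. The sketch below shows that step in isolation; the margin values are made up for illustration.

```python
import math


def sigmoid(margin):
    """Map a raw score (margin) to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_label(margin, threshold=0.5):
    """Turn a raw score into a 0/1 label with a strict '>' threshold."""
    return 1 if sigmoid(margin) > threshold else 0


# Hypothetical raw scores, not taken from the trained model.
margins = [-2.0, 0.0, 1.5]
probs = [sigmoid(m) for m in margins]
labels = [to_label(m) for m in margins]
print(labels)
```

The strict `>` comparison matches the thresholding used later in this notebook when the predicted probabilities are converted into labels.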

## Hyperparameter Tuning

Now that I have my `estimator` object completely set up, it is time to create the `hyperparameter tuner`. To do this I need to construct a new object which contains each of the parameters I want SageMaker to tune. In this case, I wish to find the best values for the `max_depth`, `eta`, `min_child_weight`, `subsample`, and `gamma` parameters. Note that for each parameter that I want SageMaker to tune I need to specify both the *type* of the parameter and the *range* of values that parameter may take on.

In addition, I specify the number of models to construct (`max_jobs`) and the number of those that can be trained in parallel (`max_parallel_jobs`). In the cell below I have chosen to train 20 models, of which I ask that SageMaker train 5 at a time in parallel.

In [10]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb,  # The estimator object to use as the basis for the training jobs.
                                               objective_metric_name = 'validation:map',  # The metric used to compare trained models.
                                               objective_type = 'Maximize',  # Whether I wish to minimize or maximize the metric.
                                               max_jobs = 20,  # The total number of models to train
                                               max_parallel_jobs = 5,  # The number of models to train in parallel
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(3, 12),
                                                    'eta'      : ContinuousParameter(0.05, 0.5),
                                                    'min_child_weight': IntegerParameter(2, 8),
                                                    'subsample': ContinuousParameter(0.5, 0.9),
                                                    'gamma': ContinuousParameter(0, 10),
                                               })
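
To get a feel for what the tuner explores, the sketch below draws 20 candidate configurations from the same ranges using plain random search. This is a stand-in for illustration only: SageMaker's default tuning strategy is Bayesian optimization, not random sampling, and the seed is an arbitrary choice of mine.

```python
import math
import random

# Same ranges as in the tuner above: (low, high, kind).
ranges = {
    "max_depth": (3, 12, "int"),
    "eta": (0.05, 0.5, "float"),
    "min_child_weight": (2, 8, "int"),
    "subsample": (0.5, 0.9, "float"),
    "gamma": (0, 10, "float"),
}


def sample_config(rng):
    """Draw one hyperparameter configuration uniformly from the ranges."""
    config = {}
    for name, (low, high, kind) in ranges.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # both ends inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # arbitrary seed, only for reproducibility of the sketch
max_jobs, max_parallel_jobs = 20, 5
configs = [sample_config(rng) for _ in range(max_jobs)]

# With 5 jobs running at a time, 20 jobs need at least this many waves.
n_waves = math.ceil(max_jobs / max_parallel_jobs)
print(n_waves)
```

With `max_jobs = 20` and `max_parallel_jobs = 5`, the jobs run in roughly four waves if they take a similar amount of time.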

Now that I have my `hyperparameter tuner` object completely set up, it is time to train it. To do this I make sure that SageMaker knows our input data is in `.csv` format and then execute the `.fit()` method.

In [11]:
input_data

's3://sagemaker-us-east-2-349061725184/sagemaker/compas_model'
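
The URI printed above packs the bucket name and the key prefix into one string. Here is a minimal sketch of how to take it apart and how to build the URI of a file below it, using only the standard library; `split_s3_uri` and `child_uri` are hypothetical helper names of my own.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def child_uri(uri, name):
    """Build the URI of an object directly below the given prefix."""
    bucket, key = split_s3_uri(uri)
    # posixpath always joins with '/', whatever the local operating system is.
    return f"s3://{bucket}/{posixpath.join(key, name)}"


example_uri = "s3://sagemaker-us-east-2-349061725184/sagemaker/compas_model"
bucket_name, key_prefix = split_s3_uri(example_uri)
train_uri = child_uri(example_uri, "train.csv")
print(bucket_name, key_prefix, train_uri)
```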

In [17]:
# This is a wrapper around the location of our train and validation data,
# to make sure that SageMaker knows our data is in csv format.

import os

# S3 URIs always use '/', so build them directly instead of with os.path.join
train_location = f"{input_data}/train.csv"
val_location = f"{input_data}/val.csv"
test_location = f"{input_data}/X_test.csv"

s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
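
For CSV training data, the built-in XGBoost algorithm expects files without a header row and with the label in the first column (the training log further down reports `label_column=0`). Here is a minimal sketch of that layout, using made-up rows rather than the COMPAS features:

```python
import csv
import io

# Hypothetical toy rows: (label, feature_1, feature_2).
rows = [
    (1, 25, 3),
    (0, 41, 0),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # note: no header row is written
csv_text = buffer.getvalue()

# The first field of every line is the label.
first_line = csv_text.splitlines()[0]
label = int(first_line.split(",")[0])
print(csv_text)
```

If `train.csv` or `val.csv` contained a header or kept the label in another column, training would silently treat the wrong values as targets.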



Now if you go to the `AWS Console` and then to `Amazon SageMaker`, you should see, under the `Training > Training jobs` tab, the training jobs running on the $n$ instances that you specified in `train_instance_count` when creating the `estimator` object.

The `.fit()` method takes care of setting up and fitting a number of different models, each with different hyperparameters. If you wish to wait for this process to finish, you can call the `.wait()` method or monitor the `Training Jobs` tab.

In [18]:
xgb_hyperparameter_tuner.wait()

..................................................................................................................................................!


Once the hyperparameter tuner has finished, I can retrieve information about the best performing model.

In [19]:
xgb_hyperparameter_tuner.best_training_job()

'xgboost-200606-1423-001-117c694a'

In addition, since I'd like to set up a batch transform job to test the best model, I can construct a new `estimator` object from the results of the best training job. The `xgb_attached` object below can now be used as though I constructed an `estimator` with the best performing hyperparameters and then fit it to the training data.

In [20]:
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())



2020-06-06 14:26:05 Starting - Preparing the instances for training
2020-06-06 14:26:05 Downloading - Downloading input data
2020-06-06 14:26:05 Training - Training image download completed. Training in progress.
2020-06-06 14:26:05 Uploading - Uploading generated training model
2020-06-06 14:26:05 Completed - Training job completed
Arguments: train
[2020-06-06:14:25:53:INFO] Running standalone xgboost training.
[2020-06-06:14:25:53:INFO] Setting up HPO optimized metric to be : map
[2020-06-06:14:25:53:INFO] File size need to be processed in the node: 0.07mb. Available memory size in the node: 8478.98mb
[2020-06-06:14:25:53:INFO] Determined delimiter of CSV input is ','
[14:25:53] S3DistributionType set as FullyReplicated
[14:25:53] 2968x8 matrix with 23744 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-06-06:14:25:53:INFO] Determined delimiter of CSV input is ','
[14:2

## Evaluating the Model

Now that I have my best performing model, I can test it. To do this I will use the `batch transform` functionality. To start with, I need to build a `transformer` object from my fitted model.

In [21]:
xgb_transformer = xgb_attached.transformer(instance_count = 1,
                                           instance_type = 'ml.m4.xlarge')



Next I ask SageMaker to begin a `batch transform` job using the trained model and applying it to the test data previously stored in S3. I need to make sure to provide SageMaker with the type of data that I am providing to our model, in my case `text/csv`, so that it knows how to serialize the data. In addition, I need to make sure to let SageMaker know how to split our data up into chunks if the entire data set happens to be too large to send to our model all at once.

Note that when I ask SageMaker to do this, it will execute the `batch transform` job in the background. Since I need to wait for the results of this job before I can continue, I use the `.wait()` method or monitor the `Batch transform jobs` tab under `Inference` in the `Amazon SageMaker` console. An added benefit of `.wait()` is that I get some output from the batch transform job, which lets me know if anything went wrong.

In [22]:
xgb_transformer.transform(test_location,
                          content_type='text/csv',
                          split_type='Line')
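
To illustrate why a line-based split matters, the sketch below groups whole CSV lines into batches that stay under a size limit, so no record is ever cut in half. It is a simplified picture of the idea, not SageMaker's implementation, and the tiny `max_bytes` value is chosen only to make the example visible.

```python
def batch_lines(text, max_bytes):
    """Group whole lines into batches whose encoded size stays within max_bytes.

    A single line larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for line in text.splitlines(keepends=True):
        size = len(line.encode("utf-8"))
        # Start a new batch if adding this line would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append("".join(current))
    return batches


# Hypothetical toy payload: three records of 5 bytes each.
toy_csv = "25,3\n41,0\n33,1\n"
batches = batch_lines(toy_csv, max_bytes=10)
print(batches)
```

Concatenating the batches gives back the original payload, so the predictions come back in the same order as the input rows.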

In [23]:
xgb_transformer.wait()

.....................
2020-06-06T14:44:17.110:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-06-06 14:44:16 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-06-06 14:44:16 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-06-06 14:44:16 +0000] [1] [INFO] Using worker: gevent
[2020-06-06 14:44:16 +0000] [39] [INFO] Booting worker with pid: 39
[2020-06-06 14:44:17 +0000] [40] [INFO] Booting worker with pid: 40
[2020-06-06:14:44:17:INFO] Model loaded successfully for worker : 39
[2020-06-06 14:44:17 +0000] [41] [INFO] Booting worker with pid: 41
[2020-06-06:14:44:17:INFO] Model loaded successfully for worker : 40
[2020-06-06 14:44:17 +0000] [42] [INFO] Booting worker with pid: 42
[2020-06-06:14:44:17:INFO] Model loaded successfully for worker : 41
[2020-06-06:14:44:17:INFO] Model loaded successfully f

Now that the `batch transform job` has finished, the resulting output is stored on S3. Since I wish to analyze the output inside of our notebook I can use a bit of notebook magic to copy the output file from its S3 location and save it locally.

In [30]:
!aws s3 cp --recursive $xgb_transformer.output_path $prefix

download: s3://sagemaker-us-east-2-349061725184/xgboost-200606-1423-001-117c694a-2020-06-06-14-40-53-246/X_test.csv.out to sagemaker/compas_model/X_test.csv.out


To see how well my model performs, I will compute some metrics between the predicted and actual values.

In [42]:
# The batch transform output was copied into the local `prefix` folder by the `aws s3 cp` cell
y_pred_proba = pd.read_csv(os.path.join(prefix, 'X_test.csv.out'), header=None)
y_pred = np.where(y_pred_proba.to_numpy() > 0.5, 1, 0)
y_test = pd.read_csv(os.path.join(data_dir, 'y_test.csv'), header=None)

In [46]:
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import confusion_matrix

acc = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

# Use a separate name so the imported confusion_matrix function is not overwritten
conf_matrix = confusion_matrix(y_test, y_pred)

In [45]:
print(f"Overall Accuracy: {round(acc, 3)}")
print(f"Overall Precision: {round(precision, 3)}")

Overall Accuracy: 1.0
Overall Precision: 1.0


In [47]:
conf_matrix

array([[864,   0],
       [  0, 456]])
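
The accuracy, precision, and false positive rate can be read directly off this matrix. In scikit-learn's layout, rows are the true labels and columns are the predicted labels, so the four cells are true negatives, false positives, false negatives, and true positives. A minimal sketch using the numbers reported above:

```python
# Confusion matrix reported above: rows = true labels, columns = predicted labels.
cm = [[864, 0],
      [0, 456]]
(tn, fp), (fn, tp) = cm

accuracy_from_cm = (tp + tn) / (tp + tn + fp + fn)
precision_from_cm = tp / (tp + fp)
false_positive_rate = fp / (fp + tn)

print(accuracy_from_cm, precision_from_cm, false_positive_rate)
```

The false positive rate is the quantity I care most about in the COMPAS context: the share of truly low-risk people who were labeled as medium/high risk.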

## Conclusions

With Amazon SageMaker, I built an XGBoost model, performed hyperparameter tuning, and tested the model. This was easy to do with the SageMaker Python SDK's high-level API.

By specifying the right metric (precision), I found the hyperparameters that train the best model for that metric and obtained a whopping 100% accuracy and 100% precision on the COMPAS data.