# Playground SageMaker testing

## Using XGBoost in SageMaker 

This notebook tests SageMaker's high-level Python API for hyperparameter tuning, using the built-in XGBoost algorithm.


In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
import sklearn.model_selection

In [2]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer

session = sagemaker.Session() # Get session

role = get_execution_role() # Get role

## Step 1: Downloading the data

The dataset can be loaded with scikit-learn. Here we use the breast cancer dataset, a binary classification problem.

In [3]:
cancer = load_breast_cancer()

## Step 2: Preparing and splitting the data

Split the rows of the dataset into train, validation and test sets.

- First, split the data into train and test sets.
- Second, split the remaining train set into train and validation sets.

In [4]:
# Load the features and the target into DataFrames
X_pd = pd.DataFrame(cancer.data, columns=cancer.feature_names)
Y_pd = pd.DataFrame(cancer.target)

# First split: hold out 30% of the rows as the test set
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_pd, Y_pd, test_size=0.3)

# Second split: hold out 30% of the remaining rows as the validation set
X_train, X_val, Y_train, Y_val = sklearn.model_selection.train_test_split(X_train, Y_train, test_size=0.3)
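To see what the two consecutive 30% splits mean in row counts, here is a small sketch. It assumes scikit-learn's rounding rule (the held-out part is rounded up, the rest is kept) and the 569 rows of the breast cancer dataset; `split_sizes` is a helper name made up for this illustration.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for one fractional split.

    Assumption: the held-out part is rounded up and the remainder is kept,
    which is how scikit-learn's train_test_split treats a float test_size.
    """
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out


n_total = 569  # rows in the breast cancer dataset
n_rest, n_test = split_sizes(n_total, 0.3)   # first split: train+val vs test
n_train, n_val = split_sizes(n_rest, 0.3)    # second split: train vs validation
print(n_train, n_val, n_test)  # 278 120 171
```

So roughly 49% of the rows end up in the train set, 21% in the validation set and 30% in the test set.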

## Step 3: Uploading the data files to S3

When a training job is constructed using SageMaker, a container is executed which performs the training operation. This container is given access to data that is stored in S3. This means that we need to upload the data we want to use for training to S3. In addition, when we perform a batch transform job, SageMaker expects the input data to be stored on S3.

### Save the data locally

First we need to create the test, train and validation csv files which we will then upload to S3.

In [5]:
# Make sure the local data directory exists.
data_dir = '../data/cancer'
os.makedirs(data_dir, exist_ok=True)

In [6]:
# We use pandas to save our test, train and validation data to csv files. The built-in algorithms provided by Amazon
# expect csv files without a header row or an index column, so we leave both out. Also, for the train and
# validation data, it is assumed that the first entry in each row is the target variable.

X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

pd.concat([Y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
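Before uploading, it can be useful to check that a file really has the layout described above: no header, all cells numeric, and a 0/1 label in the first column. The sketch below is one way to do that with the standard library only. `check_label_first_csv` is a hypothetical helper, not part of SageMaker, and the sample rows are made up.

```python
import csv
import io


def check_label_first_csv(text, n_features):
    """Check a headerless CSV whose first column is a 0/1 label.

    Returns the number of data rows; raises ValueError on a layout problem.
    """
    n_rows = 0
    for row in csv.reader(io.StringIO(text)):
        if len(row) != n_features + 1:
            raise ValueError(
                f"row {n_rows}: expected {n_features + 1} columns, got {len(row)}"
            )
        try:
            values = [float(cell) for cell in row]
        except ValueError:
            # A header row is the most likely cause of a non-numeric cell.
            raise ValueError(f"row {n_rows}: non-numeric cell (is there a header?)") from None
        if values[0] not in (0.0, 1.0):
            raise ValueError(f"row {n_rows}: label {row[0]!r} is not 0 or 1")
        n_rows += 1
    return n_rows


sample = "1,0.5,1.5\n0,2.0,3.5\n"  # made-up rows: label first, then two features
print(check_label_first_csv(sample, n_features=2))  # 2
```

For the real files, the text could be read with `open(os.path.join(data_dir, 'train.csv')).read()` and `n_features` set to 30.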

### Upload to S3

Since we are currently running inside of a SageMaker session, we can use the object which represents this session to upload our data to the 'default' S3 bucket. Note that it is good practice to provide a custom prefix (essentially an S3 folder) to make sure that you don't accidentally interfere with data uploaded from some other notebook or project.

In [7]:
prefix = 'cancer-xgboost'

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)

In [8]:
test_location

's3://sagemaker-us-east-1-532612960880/cancer-xgboost/test.csv'
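The returned location follows a simple pattern: bucket, then the key prefix, then the file name. The sketch below rebuilds that pattern so the locations can be predicted without uploading anything. `expected_s3_uri` and the bucket name `example-bucket` are placeholders for illustration, not SageMaker API.

```python
import os
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the S3 URI we expect for a file uploaded under key_prefix."""
    # S3 keys always use forward slashes, so join with posixpath.
    key = posixpath.join(key_prefix, os.path.basename(local_path))
    return f"s3://{bucket}/{key}"


print(expected_s3_uri("example-bucket", "cancer-xgboost", "../data/cancer/test.csv"))
# s3://example-bucket/cancer-xgboost/test.csv
```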

## Step 4: Train the XGBoost model

Now that we have the training and validation data uploaded to S3, we can construct our XGBoost model and train it.
We will use SageMaker's hyperparameter tuning functionality to train multiple models and use the one that performs the best on the validation set.

In [9]:
# We use this utility method to look up the image name of the training container for the built-in XGBoost algorithm.
container = get_image_uri(session.boto_region_name, 'xgboost')

# Now that we know which container to use, we can construct the estimator object.
xgb = sagemaker.estimator.Estimator(container, # The name of the training container
                                    role,      # The IAM role to use
                                    train_instance_count=1, # The number of instances to use for training
                                    train_instance_type='ml.m4.xlarge', # The type of instance to use for training
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix), # Where to store the model artifacts
                                    sagemaker_session=session)

Before beginning the hyperparameter tuning, we should set any model-specific hyperparameters that we want to keep at fixed values. Quite a few can be set when using the XGBoost algorithm; below are just some of them. For more information, see the [XGBoost hyperparameter page](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

In [10]:
xgb.set_hyperparameters(eval_metric='auc',
                        max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=200)
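The `early_stopping_rounds=10` setting means that training stops once the validation metric has gone 10 rounds without improving, even if `num_round` has not been reached. The sketch below illustrates that rule in plain Python. `stopping_round` is a hypothetical helper and the scores are made-up values, not output from this notebook.

```python
def stopping_round(scores, patience, maximize=True):
    """Return (stop_round, best_round) for a sequence of per-round scores.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far; if that never happens, it runs to the end.
    """
    best_score = None
    best_round = 0
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return i, best_round
    return len(scores) - 1, best_round


# Made-up validation AUC values, one per boosting round.
example_scores = [0.90, 0.95, 0.94, 0.95, 0.93, 0.92]
print(stopping_round(example_scores, patience=3))  # (4, 1)
```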

Now that we have our estimator object completely set up, it is time to create the hyperparameter tuner. To do this we need to construct a new object which contains each of the parameters we want SageMaker to tune. In this case, we wish to find the best values for the `max_depth` and `subsample` parameters; the other hyperparameters keep the values set above. Note that for each parameter that we want SageMaker to tune we need to specify both the *type* of the parameter and the *range* of values that parameter may take on.

In addition, we specify the *number* of models to construct (`max_jobs`) and the number of those that can be trained in parallel (`max_parallel_jobs`). In the cell below we have chosen to train `10` models, of which we ask that SageMaker train `3` at a time in parallel. Note that this results in a total of `10` training jobs being executed, which can take some time. With more complicated models this can take even longer, so be aware!
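As a rough way to reason about how long tuning takes, we can count how many sequential rounds of jobs are needed at a given level of parallelism. This is only a lower-bound estimate under the assumption that all jobs take about the same time; the tuner starts new jobs as earlier ones finish rather than in strict rounds. `min_sequential_rounds` is a hypothetical helper, and the values `10` and `3` are the ones passed to the tuner in this notebook.

```python
import math


def min_sequential_rounds(max_jobs, max_parallel_jobs):
    """Smallest number of back-to-back rounds needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


print(min_sequential_rounds(10, 3))  # 4
```

Raising `max_parallel_jobs` shortens the wall-clock time, but it also gives the tuner fewer finished results to learn from before it picks the next set of hyperparameters.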

In [11]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

xgb_hyperparameter_tuner = HyperparameterTuner(estimator = xgb,# The estimator object to use as the basis for the training jobs.
                                               objective_metric_name = 'validation:auc', # The metric used to compare trained models.
                                               objective_type = 'Maximize', # Whether we wish to minimize or maximize the metric.
                                               max_jobs = 10, # The total number of models to train
                                               max_parallel_jobs = 3, # The number of models to train in parallel
                                               hyperparameter_ranges = {
                                                    'max_depth': IntegerParameter(4, 12),
                                                    'subsample': ContinuousParameter(0.75, 0.8)
                                               })
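To make the search space concrete, the sketch below draws candidate configurations uniformly at random from the same two ranges. This is plain random search written for illustration only; it is not how the tuner picks its candidates (SageMaker's default strategy is Bayesian), and `sample_configuration` is a hypothetical helper. It assumes both ends of each range are included.

```python
import random

# The two ranges used for the tuner, as (kind, low, high).
search_space = {
    "max_depth": ("integer", 4, 12),
    "subsample": ("continuous", 0.75, 0.8),
}


def sample_configuration(space, rng):
    """Draw one hyperparameter configuration uniformly from the space."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "integer":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draw is repeatable
candidates = [sample_configuration(search_space, rng) for _ in range(10)]
for candidate in candidates[:3]:
    print(candidate)
```

Note how narrow the `subsample` range is: every candidate differs from the default of `0.8` by at most `0.05`, so most of the variation between jobs comes from `max_depth`.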

Now that we have our hyperparameter tuner object completely set up, it is time to train it. To do this we make sure that SageMaker knows our input data is in csv format and then execute the `fit` method.

In [12]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

xgb_hyperparameter_tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

The `fit()` method takes care of setting up and fitting a number of different models, each with different hyperparameters. If we wish to wait for this process to finish, we can call the `wait()` method.

In [13]:
xgb_hyperparameter_tuner.wait()

..............................................................................................................................................................!


Once the hyperparameter tuner has finished, we can retrieve information about the best performing model.

In [14]:
xgb_hyperparameter_tuner.best_training_job()

'xgboost-190614-0229-003-1fbe7aeb'
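The best job is the one with the highest `validation:auc`. As a reminder of what that metric measures, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. This is an illustration on made-up values, not the implementation XGBoost or SageMaker uses.

```python
def auc(labels, scores):
    """Area under the ROC curve from its pairwise definition (O(n^2))."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```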

In addition, since we'd like to set up a batch transform job to test the best model, we can construct a new estimator object from the results of the best training job. The `xgb_attached` object below can now be used as though we constructed an estimator with the best performing hyperparameters and then fit it to our training data.

In [15]:
xgb_attached = sagemaker.estimator.Estimator.attach(xgb_hyperparameter_tuner.best_training_job())

2019-06-14 02:33:15 Starting - Preparing the instances for training
2019-06-14 02:33:15 Downloading - Downloading input data
2019-06-14 02:33:15 Training - Training image download completed. Training in progress.
2019-06-14 02:33:15 Uploading - Uploading generated training model
2019-06-14 02:33:15 Completed - Training job completed
Arguments: train
[2019-06-14:02:33:02:INFO] Running standalone xgboost training.
[2019-06-14:02:33:02:INFO] Setting up HPO optimized metric to be : auc
[2019-06-14:02:33:02:INFO] File size need to be processed in the node: 0.08mb. Available memory size in the node: 8397.6mb
[2019-06-14:02:33:02:INFO] Determined delimiter of CSV input is ','
[02:33:02] S3DistributionType set as FullyReplicated
[02:33:02] 278x30 matrix with 8340 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-06-14:02:33:02:INFO] Determined delimiter of CSV input is ','
[02:33:

## Step 5: Test the model

Now that we have our best performing model, we can test it. To do this we will use the batch transform functionality. To start with, we need to build a transformer object from our fit model.

In [16]:
xgb_transformer = xgb_attached.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

Next we ask SageMaker to begin a batch transform job using our trained model and applying it to the test data we previously stored in S3. We need to make sure to provide SageMaker with the type of data that we are providing to our model, in our case `text/csv`, so that it knows how to serialize our data. In addition, we need to make sure to let SageMaker know how to split our data up into chunks if the entire data set happens to be too large to send to our model all at once.

Note that when we ask SageMaker to do this it will execute the batch transform job in the background. Since we need to wait for the results of this job before we can continue, we use the `wait()` method. An added benefit of this is that we get some output from our batch transform job which lets us know if anything went wrong.

In [17]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

In [18]:
xgb_transformer.wait()

.............................................!
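With `split_type='Line'`, each line of the input file is treated as one record, so the file can be cut into smaller requests at line boundaries. The sketch below shows the general idea of grouping whole lines into chunks under a size limit. It is a simplified illustration, not SageMaker's actual batching logic; `chunk_lines` and the byte limit are made up for this example.

```python
def chunk_lines(text, max_bytes):
    """Group whole lines into chunks of at most max_bytes each.

    A single line longer than max_bytes still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_bytes = len(line.encode("utf-8"))
        if current and size + line_bytes > max_bytes:
            # The next line would not fit, so close the current chunk.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        chunks.append("".join(current))
    return chunks


rows = "1,2\n3,4\n5,6\n"  # three made-up records of 4 bytes each
print(chunk_lines(rows, max_bytes=8))  # ['1,2\n3,4\n', '5,6\n']
```

Because chunks are only cut between lines, no record is ever split across two requests.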


Now that the batch transform job has finished, the resulting output is stored on S3. Since we wish to analyze the output inside of our notebook we can use a bit of notebook magic to copy the output file from its S3 location and save it locally.

In [19]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

download: s3://sagemaker-us-east-1-532612960880/xgboost-190614-0229-003-1fbe7aeb-2019-06-14-02-43-15-745/test.csv.out to ../data/cancer/test.csv.out


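We can now load the output file and look at the first few predictions. Since the model was trained with the `binary:logistic` objective, each value is the predicted probability of the positive class.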

In [20]:
Y_pred = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
Y_pred.head(10)

          0
0  0.958814
1  0.973482
2  0.019823
3  0.103257
4  0.972016
5  0.971702
6  0.983248
7  0.947141
8  0.143874
9  0.983248
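The notebook stops at the raw probabilities. One simple way to turn them into a test score is to threshold them at 0.5 and compare the resulting class labels with the true labels. The sketch below shows this in plain Python; `to_labels` and `accuracy` are hypothetical helpers, the threshold of 0.5 is an assumption, and the "true" labels in the example are made up rather than taken from `Y_test`.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


probs = [0.958814, 0.973482, 0.019823, 0.103257, 0.972016]  # first five predictions shown above
predicted = to_labels(probs)
print(predicted)  # [1, 1, 0, 0, 1]

made_up_truth = [1, 1, 0, 1, 1]  # illustrative only, not the real test labels
print(accuracy(made_up_truth, predicted))  # 0.8
```

In the notebook itself, the same functions could be applied to `Y_pred[0]` and `Y_test[0]`, since both DataFrames hold their values in a column named `0`.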


## Clean up

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As you continue to complete and execute notebooks you will eventually fill up this disk space, so we remove the files that this notebook created.

In [21]:
# First we remove all of the files contained in the data directory
!rm $data_dir/*

# And then we delete the directory itself
!rmdir $data_dir
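The same cleanup can be done from Python instead of shell commands, which also works when the directory contains subdirectories or is already gone. A minimal sketch, demonstrated on a temporary directory so that it is safe to run anywhere; `remove_data_dir` is a helper name made up for this example.

```python
import os
import shutil
import tempfile


def remove_data_dir(path):
    """Delete a directory and everything in it; do nothing if it is missing."""
    shutil.rmtree(path, ignore_errors=True)


# Demonstrate on a throwaway directory rather than the real data_dir.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "test.csv"), "w") as handle:
    handle.write("0.1,0.2\n")

remove_data_dir(demo_dir)
print(os.path.exists(demo_dir))  # False
```

In the notebook, `remove_data_dir(data_dir)` would replace both shell commands above.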