# Hyper-Parameter Optimization with NVIDIA RAPIDS + AWS SageMaker

After applying domain knowledge, intuition, and experimentation to build a successful model, data scientists typically run hyper-parameter optimization (HPO) to find a champion model and reach the highest performance before deploying to production.

HPO searches over models by trying different settings of 'architecture parameters,' parameters not usually optimized by the learning algorithm -- e.g., *maximum depth* and *number-of-trees* in a random forest model, or the *number-of-layers* and *neurons-per-layer* of a neural network.

Often HPO can improve the generalization quality of a model by 5-15% relative to hand-tuned or default model parameters. But there is a problem: HPO is very computationally expensive (we are searching over model architectures, not just individual parameters) and can be very slow.

In this notebook we show how we can overcome the computational complexity of HPO by combining two superpowers -- the *scaling power* of the cloud, and the *speed* of the GPU. By using these two superpowers we can vastly accelerate HPO, and best of all you can use these superpowers too! Once you've gone through this content you should be able to plug in custom code and data so you can accelerate HPO on **your ML problem**!


# How it Works: HPO on AWS SageMaker

AWS SageMaker provides a work orchestrator for HPO. Given an Estimator object (essentially containerized model code -- more on this soon), data, and hyper-parameter ranges, SageMaker will use a search strategy to try various combinations of hyper-parameters (i.e., experiments) within the admissible ranges and report back on their performance, ultimately reporting the best performing combination.

While we expect the search strategy choices to grow, currently AWS SageMaker only supports **Random** and **Bayesian** search. 

- The **Random** strategy is as its name implies, randomly sampling in the possible ranges with no concern for past experiments.

- The **Bayesian** strategy tries several parallel experiments and then uses regression to pick the next batch of hyper-parameters.

In this notebook we'll be using the Random strategy, though you are welcome to switch to Bayesian by changing the `strategy` setting in the HPO experiment definition (section 3.3).
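
To make the Random strategy concrete, here is a minimal sketch of random search in plain Python. It does not call the SageMaker API; the `sample_random` helper and the `ranges` dictionary are illustrative names, and the bounds simply mirror the ranges used later in this notebook.

```python
import random

# Illustrative search space: (low, high, kind) per hyper-parameter.
ranges = {
    "max_depth": (3, 15, "int"),
    "n_estimators": (100, 500, "int"),
    "max_features": (0.2, 1.0, "float"),
}

def sample_random(ranges, rng):
    """Draw one hyper-parameter combination, ignoring past experiments."""
    sample = {}
    for name, (low, high, kind) in ranges.items():
        if kind == "int":
            sample[name] = rng.randint(low, high)  # both bounds inclusive
        else:
            sample[name] = rng.uniform(low, high)
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
candidates = [sample_random(ranges, rng) for _ in range(4)]
for candidate in candidates:
    print(candidate)
```

Because each draw is independent of earlier results, all candidates could be evaluated in parallel, which is what makes this strategy easy to scale out.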

<img src='../img/HPO_motivation.png' width='1000px'>

# Initialize AWS SageMaker Account & Session Variables

To get things rolling let's make sure we can query our AWS SageMaker execution role and session as well as our account ID and AWS region (we'll need this info later on).

In [1]:
import sagemaker
import uuid
import random

sm_execution_role = sagemaker.get_execution_role()
sm_session = sagemaker.Session()

In [2]:
account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

# 1 - Dataset

# Random Forest Classification of Airline Delays

In this example we'll be leveraging the RAPIDS **cuml.RandomForest** classifier model to try to predict airline arrival delays (see the Dataset section below for more details). To find the best performing model we'll search across three hyper-parameters that control the architecture of the Random Forest:

- **max_depth**: the maximum possible depth of any tree
- **n_estimators**: the number of trees in the forest
- **max_features**: the fraction of features used to determine splits in the trees

In this demo we'll utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). 

Specifically we'll try to classify whether a flight is going to be more than 15 minutes late on arrival, using the last 10 years of data (2009-2019).
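
The classification target described above amounts to a threshold on arrival delay. The sketch below illustrates this with made-up delay values; the function name and the example data are assumptions, and the cleaned dataset may already store the label in a different form.

```python
DELAY_THRESHOLD_MINUTES = 15

def is_delayed(arrival_delay_minutes):
    """Return 1 if the flight arrived more than 15 minutes late, else 0."""
    return int(arrival_delay_minutes > DELAY_THRESHOLD_MINUTES)

# Made-up arrival delays in minutes (negative means early arrival).
example_delays = [-5.0, 0.0, 15.0, 15.5, 120.0]
labels = [is_delayed(delay) for delay in example_delays]
print(labels)  # [0, 0, 0, 1, 1]
```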

For each flight the features in the data include information about time, the airline, source and destination airports, distance, and departure delay.

We have a cleaned version of our dataset on a public S3 bucket, which we specify here and will subsequently use as an input to our HPO Estimators.

In [3]:
target_bucket = 'cloud-ml-examples'
target_bucket_prefix = '10_years'

In [4]:
s3_input_training = 's3://{}/{}'.format(target_bucket, target_bucket_prefix)

In [5]:
s3_input_training

's3://cloud-ml-examples/10_years'

# 2 - BYOContainer / Estimator

To build a RAPIDS enabled SageMaker HPO we first need to build an Estimator. 

An Estimator is a docker container image that captures all the software needed to run an HPO experiment.

The container is augmented with special **entrypoint code** that will be triggered at runtime by each worker. 

The entrypoint code enables us to write custom models and hook them up to data. 

In order to work with SageMaker HPO, the entrypoint logic should parse hyper-parameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyper-parameter setting.
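
The responsibilities listed above can be sketched as a skeleton entrypoint. This is not the notebook's actual train.py: it assumes hyper-parameters arrive as command-line arguments, the default values are invented, and `train_and_score` is a hypothetical placeholder that returns a fixed number instead of training a model.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the three random forest hyper-parameters (defaults are invented)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_features", type=float, default=1.0)
    # Ignore any extra arguments that may be appended at launch time.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)

def train_and_score(hyperparameters):
    """Hypothetical placeholder: load/split data, train, return test accuracy."""
    return 0.0  # a real entrypoint would return the measured accuracy

def format_metric(score):
    """Format the score so a regex like 'test-accuracy: (.*);' can find it."""
    return f"test-accuracy: {score};"

if __name__ == "__main__":
    params = parse_hyperparameters()
    print(format_metric(train_and_score(params)))
```

Printing the score in a fixed, easily matched format is the key design point, since the tuner reads the objective value from the worker's log output.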

We've already built sample entrypoint code leveraging the cuml.RandomForest classifier model. If you would like to make changes by adding your custom model logic feel free to modify the **train.py** file.





## 2.1 - Build Custom Code

<img src='../img/estimator.png'>

If you want to dig into the custom code, check out the train.py script as well as its supporting library rapids_cloud_ml.py.

By default we'll run with 10 years of the airline dataset. If you would like to point the code at your own data, modify the top few lines of train.py and make sure that `dataset_columns` (the columns/features of your dataset) and `target_variable` (the label column which will be the classification target) match your dataset.

## 2.2 - Containerize Code

Now let's turn to building our container so that it can integrate with the AWS SageMaker HPO API.

Our container takes the latest RAPIDS (nightly) image as a starting layer, adds some bits to inter-operate with AWS SageMaker (i.e., github.com/aws/sagemaker-containers), and copies in custom entrypoint code that will run when the Estimator is spawned. The custom logic is described in the section above; for now let's actually build our container and push it to the Amazon Elastic Container Registry (ECR).



### 2.2.1 Define Container Tag

Let's decide on the full name of our container `image_base:image_tag`

In [6]:
image_base = 'cloud-ml-sagemaker'
image_tag = 'runtime-0.14-10.1-18.04'

In [7]:
ecr_fullname=f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"

In [8]:
ecr_fullname

'611520507156.dkr.ecr.us-west-2.amazonaws.com/cloud-ml-sagemaker:runtime-0.14-10.1-18.04'

### 2.2.2 Kick-off image download

Let's be sure we have the latest bits by pulling the nightly RAPIDS build.
> Note: This may take a few minutes since we are downloading the latest nightly RAPIDS image/container.

In [9]:
!docker pull rapidsai/rapidsai-nightly:0.14-cuda10.1-runtime-ubuntu18.04-py3.7

0.14-cuda10.1-runtime-ubuntu18.04-py3.7: Pulling from rapidsai/rapidsai-nightly
Digest: sha256:b3861b13d8229388a1f8b8d8380d14f25becfe44d9e64240b99101ceec3793b2
Status: Image is up to date for rapidsai/rapidsai-nightly:0.14-cuda10.1-runtime-ubuntu18.04-py3.7


### 2.2.3 - Write Dockerfile
This cell writes the Dockerfile to disk; the next cell executes the docker build command.
> Note that we're copying in the custom logic (train.py, rapids_cloud_ml.py) described in section 2.1.

In [None]:
%cd ~/SageMaker/cloud-ml-examples/aws

In [10]:
%%writefile code/Dockerfile
FROM rapidsai/rapidsai-nightly:0.14-cuda10.1-runtime-ubuntu18.04-py3.7

ENV PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    CLOUD_PATH="/opt/ml/code"

RUN apt-get update && apt-get install -y --no-install-recommends build-essential
RUN source activate rapids && pip install sagemaker-containers

COPY code/rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
COPY code/train.py $CLOUD_PATH/train.py
ENV SAGEMAKER_PROGRAM $CLOUD_PATH/train.py

WORKDIR $CLOUD_PATH

Overwriting container/Dockerfile


### 2.2.4 Build and Tag
The build usually takes less than 1 minute.

In [11]:
%%time
!docker build . -t $ecr_fullname -f code/Dockerfile

Sending build context to Docker daemon   3.65MB
Step 1/8 : FROM rapidsai/rapidsai-nightly:0.14-cuda10.1-runtime-ubuntu18.04-py3.7
 ---> c53141e217b9
Step 2/8 : ENV PYTHONUNBUFFERED=TRUE     PYTHONDONTWRITEBYTECODE=TRUE     CLOUD_PATH="/opt/ml/code"
 ---> Using cache
 ---> 715fc2e3a8ed
Step 3/8 : RUN apt-get update && apt-get install -y --no-install-recommends build-essential
 ---> Using cache
 ---> 3f2f34119deb
Step 4/8 : RUN source activate rapids && pip install sagemaker-containers
 ---> Using cache
 ---> 29fb094be5c7
Step 5/8 : COPY container/rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
 ---> Using cache
 ---> 2e51b9998921
Step 6/8 : COPY container/train.py $CLOUD_PATH/train.py
 ---> Using cache
 ---> b9a5d4904758
Step 7/8 : ENV SAGEMAKER_PROGRAM $CLOUD_PATH/train.py
 ---> Using cache
 ---> 8d5ed55ce194
Step 8/8 : WORKDIR $CLOUD_PATH
 ---> Using cache
 ---> b7233f6ef7cf
Successfully built b7233f6ef7cf
Successfully tagged 611520507156.dkr.ecr.us-west-2.amazonaws.com/cloud-ml-sag

## 2.3 - Publish Container to Elastic Container Registry (ECR)
Now that we've built and tagged our container it's time to push it to Amazon's container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.


### 2.3.1 Docker Login to ECR

In [12]:
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)

In [13]:
!{docker_login_str[0]}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


### 2.3.2 Create ECR repository [ if it doesn't already exist]

In [14]:
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name $image_base)

### 2.3.3 Push to ECR
> Note the first push to ECR may take some time (hopefully less than 10 minutes).

In [15]:
ecr_fullname

'611520507156.dkr.ecr.us-west-2.amazonaws.com/cloud-ml-sagemaker:runtime-0.14-10.1-18.04'

In [16]:
!docker push $ecr_fullname

The push refers to repository [611520507156.dkr.ecr.us-west-2.amazonaws.com/cloud-ml-sagemaker]

876fbbfd: Preparing
df249d74: Preparing
6607916d: Preparing
62af22b0: Preparing
295fcef4: Preparing
8835dada: Preparing
bd9a1af1: Preparing
489c106b: Preparing
74f76be4: Preparing
d332a58a: Preparing
f11cbf29: Preparing
a4b22186: Preparing
afb09dc3: Preparing
b5a53aac: Preparing
c8e5063e: Preparing
7c529ced: Layer already exists
runtime-0.14-10.1-18.04: digest: sha256:5fbe69f23047cb46f64391b3c2a4d27351efc04f8cc8de34f5322f1cd9b1bdf3 size: 3689


## 2.4 - Map Container to Estimator using SageMaker Python SDK

Having built our container (plus custom logic) and pushed it to ECR, we can finally compile all of our efforts into an **Estimator** object -- you can think of the Estimator as the software stack that AWS SageMaker will replicate to each worker node.

We'll build the Estimator using our SageMaker execution role, the ECR image we built/tagged, and add an output path to [optionally] save models trained during the HPO experimentation.

For additional options and details see the [Estimator documentation](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator) (e.g., to change the size in GB of the EBS volume to use for storing input data during training, default = 30GB ).

In [17]:
train_instance_type_GPU = 'ml.p3.8xlarge' 
train_instance_type_CPU = 'ml.c5.4xlarge'

train_instance_type = train_instance_type_GPU

In [18]:
train_instance_type

'ml.p3.8xlarge'

In [19]:
sm_estimator = sagemaker.estimator.Estimator( sagemaker_session = sm_session, 
                                              role = sm_execution_role,
                                              image_name = ecr_fullname,
                                              train_instance_count = 1, 
                                              train_instance_type = train_instance_type,                                               
                                              input_mode = 'File', 
                                              output_path = f's3://{target_bucket}/{target_bucket_prefix}/output' )

### 2.4.1 Test the Estimator [ optional ]
Now that we have an AWS SageMaker Estimator built up, we can feed it data and ask it to train. 

This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

To trigger this debugging logic just run the cell below.
> Note: This verification step will use the default hyper-parameter values declared in our custom train code, as SageMaker HPO will not be orchestrating this single run.

In [20]:
job_name = f"estimator-mgpu-CV-1-3-{''.join(random.choices(uuid.uuid4().hex, k=8))}"

sm_estimator.fit(inputs = s3_input_training, job_name=job_name)

2020-05-13 23:03:47 Starting - Starting the training job...
2020-05-13 23:03:48 Starting - Launching requested ML instances......
2020-05-13 23:04:49 Starting - Preparing the instances for training......
2020-05-13 23:06:10 Downloading - Downloading input data
2020-05-13 23:06:10 Training - Downloading the training image..................
2020-05-13 23:09:09 Training - Training image download completed. Training in progress..
2020-05-13 23:09:10,625 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": null,
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {},
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
          

# 3 - HPO
With a working AWS SageMaker Estimator in hand, the hardest part is behind us!

Now all we have to do is tell SageMaker about the space of hyper-parameters in which to search for the best model.

For more documentation check out the AWS SageMaker [HyperParameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

## 3.1 - Define Search Ranges

One of the most important choices when running HPO is the bounds of the hyper-parameter search.

Below we've set the ranges of the hyper-parameters to allow for significant variation in all of the different dimensions.

In [22]:
from sagemaker.analytics import HyperparameterTuningJobAnalytics
from sagemaker.parameter import ContinuousParameter, IntegerParameter, ParameterRange

In [23]:
random_forest_hyperparameter_ranges = {
    'max_depth'    : IntegerParameter    ( 3,  15  ),
    'n_estimators' : IntegerParameter    ( 100, 500 ),
    'max_features' : ContinuousParameter ( 0.2, 1.0 ),
}

## 3.2 - Define Metric
The definitions below specify a regular expression (i.e., a string parsing rule) used to find the metric we use to evaluate performance in the output log of each worker/Estimator. In this case we are only interested in the performance of our model on the test data (i.e., `test-accuracy`), so we have a single metric to track.

For additional details on metrics refer to the [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html).

In [24]:
metric_definitions=[{'Name': 'test-accuracy', 'Regex': 'test-accuracy: (.*);'}]

In [25]:
objective_metric_name = 'test-accuracy'
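
To see what the regular expression above captures, the sketch below applies it to a made-up log line using Python's `re` module. SageMaker performs this matching itself on the job logs; the sample line is only an assumption about what the entrypoint prints.

```python
import re

metric_regex = "test-accuracy: (.*);"
sample_log_line = "test-accuracy: 0.9295;"  # made-up log line

match = re.search(metric_regex, sample_log_line)
value = float(match.group(1)) if match else None
print(value)  # 0.9295
```

Note that `(.*)` is greedy, so if a log line contained further text ending in another semicolon, the captured group would extend up to the last semicolon. Keeping the metric on its own line avoids that ambiguity.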

## 3.3 - Build HPO Tuner using SageMaker Python API

Below we are setting up the parameters that will define the HPO job. By default (to avoid accidentally spawning large compute jobs), we have limited the HPO run to 4 experiments (`max_jobs`), with at most 2 running in parallel (`max_parallel_jobs`).

To run a more realistic large-scale HPO, change `max_jobs` to 100 and `max_parallel_jobs` to 10 (or as high as your instance limit permits).
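
As a rough way to reason about these two settings, the sketch below estimates how many back-to-back rounds of experiments a configuration implies. It assumes all jobs take about the same time, which is a simplification; `sequential_rounds` is an illustrative helper, not part of the SageMaker SDK.

```python
import math

def sequential_rounds(max_jobs, max_parallel_jobs):
    """Minimum number of back-to-back rounds needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)

print(sequential_rounds(4, 2))     # 2
print(sequential_rounds(100, 10))  # 10
```

Under this assumption, total wall-clock time scales with the number of rounds rather than the number of jobs, which is why raising `max_parallel_jobs` shortens the search.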

In [39]:
HPO_experiment = {
    'model_type' : 'rf', 
    'dataset' : 'airline',
    'dataset_samples' : 20000000,
    'compute_type': 'mGPU',
    'strategy': 'Random',
    'sm_estimator' : sm_estimator,
    'metric_definitions' : metric_definitions,
    'objective_metric_name' : objective_metric_name,
    'hyperparameter_ranges' : random_forest_hyperparameter_ranges,
    's3_input_training' : s3_input_training,    
    'objective_type': 'Maximize', 
    'max_jobs': 4,
    'max_parallel_jobs': 2,
    'CV_folds' : 1,
}

In [40]:
hpo = sagemaker.tuner.HyperparameterTuner( estimator = HPO_experiment['sm_estimator'],
                                           metric_definitions = HPO_experiment['metric_definitions'], 
                                           objective_metric_name = HPO_experiment['objective_metric_name'],
                                           objective_type = HPO_experiment['objective_type'],
                                           hyperparameter_ranges = HPO_experiment['hyperparameter_ranges'],
                                           strategy = HPO_experiment['strategy'],  
                                           max_jobs = HPO_experiment['max_jobs'],
                                           max_parallel_jobs = HPO_experiment['max_parallel_jobs'] )

<img src='../img/max_jobs.png' width='800px'>
<img src='../img/max_parallel.png' width='500px'>

## 3.4 - Build HPO Job Name 
Using these HPO parameters we'll build up a unique name for this HPO job. 
> Note that the entrypoint script relies on the job name to determine some configuration options.

In [41]:
# Maximum job length we can submit
MAX_JOB_LEN=32

HPO_experiment['experiment_name'] = f"{HPO_experiment['model_type']}-{HPO_experiment['compute_type']}-CV-{HPO_experiment['CV_folds']}-{HPO_experiment['dataset_samples']}"

available = (MAX_JOB_LEN - len(HPO_experiment['experiment_name']))
if (available < 0):
    print("Invalid job name, must be less than 32 characters")
else:
    k = min(8, available)
    custom_tag = ''.join(random.choices(uuid.uuid4().hex, k=k))
    HPO_experiment['experiment_name'] += f"-{custom_tag}"

In [42]:
tuning_job_name = HPO_experiment['experiment_name']

In [43]:
tuning_job_name

'rf-mGPU-CV-1-20000000-4gpu'
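
Since the entrypoint derives some configuration from the job name, it can help to see how a name in this format breaks apart. The sketch below is a hypothetical parser, not the logic in train.py; `parse_job_name` and the returned keys are illustrative.

```python
def parse_job_name(job_name):
    """Split a job name of the form model-compute-CV-folds-samples[-tag]."""
    parts = job_name.split("-")
    if len(parts) < 5 or parts[2] != "CV":
        raise ValueError(f"unexpected job name format: {job_name}")
    return {
        "model_type": parts[0],
        "compute_type": parts[1],
        "cv_folds": int(parts[3]),
        "dataset_samples": int(parts[4]),
    }

config = parse_job_name("rf-mGPU-CV-1-20000000-4gpu")
print(config)
```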

## 3.5 - Run HPO

In [44]:
import time
start_time = time.perf_counter()

hpo.fit( inputs = HPO_experiment['s3_input_training'], 
         job_name = HPO_experiment['experiment_name'], wait = True, logs = 'All')    
hpo.wait() # block until the .fit call above is completed

HPO_job_total_time = time.perf_counter() - start_time
print(HPO_job_total_time)

......................................................................................................................................................................!
839.910438818999


In [45]:
results_df = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()

In [46]:
results_df

| | FinalObjectiveValue | TrainingElapsedTimeSeconds | TrainingEndTime | TrainingJobName | TrainingJobStatus | TrainingStartTime | max_depth | max_features | n_estimators |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.928403 | 236.0 | 2020-05-13 23:35:17+00:00 | rf-mGPU-CV-1-20000000-4gpu-004-ccc2e1e6 | Completed | 2020-05-13 23:31:21+00:00 | 4.0 | 0.445147 | 251.0 |
| 1 | 0.925242 | 283.0 | 2020-05-13 23:36:04+00:00 | rf-mGPU-CV-1-20000000-4gpu-003-0068929f | Completed | 2020-05-13 23:31:21+00:00 | 13.0 | 0.660389 | 322.0 |
| 2 | 0.929506 | 227.0 | 2020-05-13 23:28:44+00:00 | rf-mGPU-CV-1-20000000-4gpu-002-d17393b7 | Completed | 2020-05-13 23:24:57+00:00 | 7.0 | 0.36364 | 167.0 |
| 3 | 0.929304 | 244.0 | 2020-05-13 23:28:52+00:00 | rf-mGPU-CV-1-20000000-4gpu-001-6901c70a | Completed | 2020-05-13 23:24:48+00:00 | 4.0 | 0.375529 | 473.0 |
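
To pick the champion from these results, select the row with the highest objective value, since the objective type is Maximize. The sketch below does this in plain Python on values copied from the results above; the `results` list and the shortened job labels are illustrative stand-ins for the DataFrame.

```python
# Values copied from the results above (job names shortened to their suffix).
results = [
    {"job": "004-ccc2e1e6", "FinalObjectiveValue": 0.928403,
     "max_depth": 4, "max_features": 0.445147, "n_estimators": 251},
    {"job": "003-0068929f", "FinalObjectiveValue": 0.925242,
     "max_depth": 13, "max_features": 0.660389, "n_estimators": 322},
    {"job": "002-d17393b7", "FinalObjectiveValue": 0.929506,
     "max_depth": 7, "max_features": 0.36364, "n_estimators": 167},
    {"job": "001-6901c70a", "FinalObjectiveValue": 0.929304,
     "max_depth": 4, "max_features": 0.375529, "n_estimators": 473},
]

# Objective type is 'Maximize', so the best job has the largest value.
best = max(results, key=lambda row: row["FinalObjectiveValue"])
print(best["job"], best["FinalObjectiveValue"])  # 002-d17393b7 0.929506
```

With the pandas DataFrame itself, sorting by `FinalObjectiveValue` in descending order would give the same ranking.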


# Performance Gains with RAPIDS & GPUs

We are seeing greater than 40x acceleration of model training on the GPU (cuml vs sklearn Random Forest).
Stay tuned for more performance numbers coming soon!

# Summary
AWS SageMaker + NVIDIA RAPIDS HPO FTW!