!WIP: Please suggest edits/improvements!

# Hyper-Parameter Optimization with NVIDIA RAPIDS + AWS SageMaker

After applying domain knowledge, intuition, and experimentation to build a successful model, data scientists typically run hyper-parameter optimization (HPO) to find a champion model and reach the highest possible performance before deploying to production.

HPO searches over models by trying different settings of 'architecture parameters', that is, parameters not usually optimized by the learning algorithm itself -- e.g., *maximum depth* and *number-of-trees* in a random forest model, or the *number-of-layers* and *neurons-per-layer* of a neural network.

Often HPO can improve the generalization quality of a model by 5-15% relative to hand-tuned or default model parameters. But there is a problem: HPO is very computationally expensive (we are searching over model architectures, not just individual parameters) and can be very slow.

In this notebook we show how to overcome the computational cost of HPO by combining two superpowers -- the *scaling power* of the cloud and the *speed* of the GPU. Together they can vastly accelerate HPO, and best of all, you can use these superpowers too! Once you've gone through this content you should be able to plug in custom code and data so you can accelerate HPO on **your ML problem**!


# Random Forest Classification of Airline Delays

In this example we'll use the RAPIDS **cuml.RandomForest** classifier model to try to predict airline arrival delays (see the Dataset section below for more details). To find the best-performing model we'll search across three hyper-parameters that control the architecture of the Random Forest:

- **max_depth**: the maximum possible depth of any tree
- **n_estimators**: the number of trees in the forest
- **max_features**: the fraction of features used to determine splits in the trees

<img src='figures/tree_depth.png'>
<center>Sample Decision Tree of max_depth = 7</center>

# How it Works: HPO on SageMaker

SageMaker provides a work orchestrator for HPO. Given an Estimator object (essentially containerized model code -- more on this soon), data, and hyper-parameter ranges, SageMaker will use a search strategy to try various combinations of hyper-parameters (i.e., experiments) within the admissible ranges and report back on their performance, ultimately reporting the best-performing combination.

Currently SageMaker supports **Random** and **Bayesian** search strategies. 

- The **Random** strategy does as its name implies: it samples randomly within the allowed ranges, with no regard for past experiments.

- The **Bayesian** strategy tries several parallel experiments and then uses regression to pick the next batch of hyper-parameters.

In this notebook we'll be using the Random strategy, though you are welcome to switch back and forth.
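
To make the Random strategy concrete, here is a minimal local sketch of random search in plain Python. It is not SageMaker's implementation: the `random_search` helper, the `(kind, low, high)` range encoding, and the `toy_objective` function are all hypothetical stand-ins, and the toy objective replaces a real train-and-score run.

```python
import random


def random_search(objective, ranges, n_trials, seed=0):
    """Score `n_trials` independently sampled hyper-parameter settings."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {}
        for name, (kind, low, high) in ranges.items():
            if kind == "int":
                params[name] = rng.randint(low, high)  # inclusive bounds
            else:
                params[name] = rng.uniform(low, high)
        # Each sample is drawn without looking at earlier results.
        trials.append((objective(params), params))
    best = max(trials, key=lambda trial: trial[0])
    return best, trials


# Hypothetical objective standing in for "train a model and return its score".
def toy_objective(params):
    return -abs(params["max_depth"] - 12) - abs(params["max_features"] - 0.5)


# Illustrative ranges using a hypothetical (kind, low, high) encoding.
ranges = {"max_depth": ("int", 5, 19), "max_features": ("float", 0.2, 1.0)}
(best_score, best_params), trials = random_search(toy_objective, ranges, n_trials=10)
print(best_score, best_params)
```

A Bayesian strategy would differ at the sampling step: instead of drawing each setting independently, it would use the scores collected so far to decide where to sample next.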

<img src='figures/HPO_motivation.png' width='70%'>

# Initialize SageMaker Account & Session Variables

To get things rolling, let's make sure we can query our SageMaker execution role and session, as well as our account ID and AWS region [ we'll need this info later on ].

In [None]:
import sagemaker
sm_execution_role = sagemaker.get_execution_role()
sm_session = sagemaker.Session()

In [None]:
account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

# 1. Dataset

In this demo we'll utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). 

Specifically we'll try to classify whether a flight is going to be more than 15 minutes late on arrival. 

For each flight the features in the data include information about time, the airline, source and destination airports, distance, and departure delay.

We have a cleaned version of our dataset on a public S3 bucket, which we specify here and will subsequently use as an input to our HPO Estimators.
> Optional Extension: BYOData

In [None]:
target_bucket = 'rapids-csp'
target_bucket_prefix = 'data'

In [None]:
s3_input_training = 's3://{}/{}'.format(target_bucket, target_bucket_prefix)

# 2. HPO Estimator

To build a RAPIDS enabled SageMaker HPO we first need to build an Estimator. 

An Estimator is a Docker container image that captures all the software needed to run an HPO experiment.

The container is augmented with special **entrypoint code** that will be triggered at runtime by each worker. 

The entrypoint code enables us to write custom models and hook them up to data. 

In order to work with SageMaker HPO, the entrypoint logic should parse hyper-parameters (supplied by SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyper-parameter setting.

We've already built sample entrypoint code leveraging the cuml.RandomForest classifier model. If you would like to make changes by adding your custom model logic feel free to modify the **train.py** file.
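
The entrypoint steps described above can be sketched as a small skeleton. This is not the notebook's actual **train.py**: the function names, the default values, and the stub scorer are hypothetical, and the sketch assumes hyper-parameters arrive as command-line flags (worth verifying against the sagemaker-containers documentation). The `name: value;` output shape follows the metric regexes used in this notebook's HPO definition.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyper-parameters passed as command-line flags."""
    parser = argparse.ArgumentParser()
    # Hypothetical defaults; they apply when a flag is not supplied.
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_features", type=float, default=1.0)
    # Tolerate extra flags that this script does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args


def format_metrics(train_accuracy, test_accuracy):
    """Build log lines in the 'name: value;' shape the metric regexes expect."""
    return [f"train-accuracy: {train_accuracy};",
            f"test-accuracy: {test_accuracy};"]


def run(argv, train_and_score):
    """Parse flags, delegate training and scoring, then emit the final metrics."""
    args = parse_hyperparameters(argv)
    train_accuracy, test_accuracy = train_and_score(args)
    for line in format_metrics(train_accuracy, test_accuracy):
        print(line)
    return args


# Stub standing in for the real load/split/train/evaluate steps.
def stub_scorer(args):
    return 0.0, 0.0  # placeholder values, not real results


args = run(["--max_depth", "7", "--n_estimators", "200"], stub_scorer)
```

Keeping the scoring step behind a single function makes it straightforward to swap in a different model without touching the parsing or metric-emission code.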

<img src='figures/estimator.png' width='85%'>




# 2.1 - Prepare To Build Containerized Estimator

Let us come back to the custom code in a bit, and assume we have a working implementation. 

For now let's focus on how we can build our container so that it will fit with the SageMaker HPO API.

Our container takes the latest RAPIDS [ nightly ] image as a starting layer, adds some bits to inter-operate with SageMaker (i.e., github.com/aws/sagemaker-containers), and copies in custom entrypoint code that will run when the Estimator is spawned. We'll discuss the custom logic in the section below; for now let's actually build our container and push it to the Amazon Elastic Container Registry (ECR).



### Define Container Tag

Next let's decide on the full name of our container, `image_base:image_tag`.

In [None]:
image_base = 'sagemaker-rapids-cloud-ml'
image_tag = '0.13_10.1'

In [None]:
ecr_fullname=f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"

Let's be sure we have the latest bits by pulling the nightly RAPIDS build.

In [None]:
!docker pull rapidsai/rapidsai-nightly:0.13-cuda10.1-base-ubuntu18.04-py3.7

# 2.2 - Write Dockerfile
In this cell we write the Dockerfile to disk; in the next cell we execute the docker build command.
> Note that we're copying in custom logic [ train.py, rapids_csp.py ] that we'll be defining shortly

In [None]:
%%writefile container/Dockerfile
FROM rapidsai/rapidsai-nightly:0.13-cuda10.1-base-ubuntu18.04-py3.7

ENV PYTHONUNBUFFERED=TRUE \
    PYTHONDONTWRITEBYTECODE=TRUE \
    CLOUD_PATH="/opt/ml/code"

RUN apt-get update && apt-get install -y --no-install-recommends build-essential
RUN source activate rapids && pip install sagemaker-containers

COPY container/rapids_csp.py $CLOUD_PATH/rapids_csp.py
COPY container/train.py $CLOUD_PATH/train.py
ENV SAGEMAKER_PROGRAM $CLOUD_PATH/train.py

WORKDIR $CLOUD_PATH

In [None]:
%%time
!docker build . --tag $ecr_fullname -f container/Dockerfile

# 2.3 - Push/Publish Container to Elastic Container Registry (ECR)
Now that we've built and tagged our container, it's time to push it to Amazon's container registry (ECR). Once the image is in ECR, SageMaker will be able to use it to build Estimators and run experiments.


### Docker Login to ECR

In [None]:
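# Note: `aws ecr get-login` is an AWS CLI v1 command; AWS CLI v2 replaces it with `aws ecr get-login-password`.
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)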

In [None]:
!{docker_login_str[0]}

### Create ECR repository [ if it doesn't already exist]

In [None]:
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name $image_base)

### Push to ECR

In [None]:
!docker push $ecr_fullname

# 2.4 - Build Estimator

Having built our container [ + custom logic ] and pushed it to ECR, we can finally compile all of our efforts into an **Estimator** object -- you can think of the Estimator as the software stack that SageMaker will replicate to each worker node.

We'll build the Estimator using our SageMaker execution role, the ECR image we built/tagged, and add an output path to [optionally] save models trained during the HPO experimentation.

For additional options and details see the [Estimator documentation](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator) (e.g., to change the size in GB of the EBS volume to use for storing input data during training, default = 30GB ).

In [None]:
train_instance_type_GPU = 'ml.p3.2xlarge' # 'ml.g4dn.4xlarge'
train_instance_type_CPU = 'ml.c5.4xlarge'

train_instance_type = train_instance_type_GPU

In [None]:
train_instance_type

In [None]:
sm_estimator = sagemaker.estimator.Estimator( sagemaker_session = sm_session, 
                                              role = sm_execution_role,
                                              image_name = ecr_fullname,
                                              train_instance_count = 1, 
                                              train_instance_type = train_instance_type,                                               
                                              input_mode = 'File', 
                                              output_path = f's3://{target_bucket}/{target_bucket_prefix}/output' )

### Testing the Estimator [ optional ]
Now that we have a SageMaker Estimator built up, we can feed it data and ask it to train. 

This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

To trigger this debugging logic, just uncomment and run the cell below.
> Note: This verification step will use the default hyper-parameter values declared in our custom train code, as SageMaker HPO will not be orchestrating this single run.

In [None]:
# sm_estimator.fit(inputs = s3_input_training, job_name = 'estimator-test-01')

# 3. Define HPO
With a working SageMaker Estimator in hand, the hardest part is behind us!

Now all we have to do is tell SageMaker about the space of hyper-parameters in which to search for the best model.

For more documentation check out the SageMaker [HyperParameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

# 3.1 - Defining Search Ranges for HPO

One of the most important choices when running HPO is choosing the bounds of the hyper-parameter search.

Below we've set the ranges of the hyper-parameters to allow for significant variation in all of the different dimensions.

In [None]:
from sagemaker.analytics import HyperparameterTuningJobAnalytics
from sagemaker.parameter import ContinuousParameter, IntegerParameter, ParameterRange

In [None]:
random_forest_hyperparameter_ranges = {
    'max_depth'    : IntegerParameter    ( 5,  19  ),
    'n_estimators' : IntegerParameter    ( 50, 500 ),
    'max_features' : ContinuousParameter ( 0.2, 1.0 ),
}
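
To get a feel for how large this search space is, the short sketch below counts the distinct combinations of the two integer hyper-parameters alone. The `integer_grid_size` helper is hypothetical, and the count assumes both bounds of each integer range are inclusive.

```python
def integer_grid_size(int_ranges):
    """Count distinct combinations across integer ranges (inclusive bounds assumed)."""
    size = 1
    for low, high in int_ranges.values():
        size *= high - low + 1
    return size


# The integer ranges from the cell above, written as plain (low, high) tuples.
int_ranges = {"max_depth": (5, 19), "n_estimators": (50, 500)}
grid_size = integer_grid_size(int_ranges)
print(grid_size)  # 15 depth values * 451 tree counts = 6765
```

The continuous `max_features` range multiplies this further, which is why a search strategy samples the space rather than enumerating it.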

# HPO - Define Metric

In [None]:
metric_definitions=[{'Name': 'train-accuracy', 'Regex': 'train-accuracy: (.*);'},
                    {'Name': 'test-accuracy', 'Regex': 'test-accuracy: (.*);'}]

In [None]:
objective_metric_name = 'test-accuracy'
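
To check that these regexes capture what we expect, the sketch below applies them to a few made-up log lines. It is a local approximation for debugging, not SageMaker's own log-scraping code; the `extract_metrics` helper and the sample log text are hypothetical.

```python
import re

# Same patterns as the metric definitions in this notebook.
metric_definitions = [
    {"Name": "train-accuracy", "Regex": "train-accuracy: (.*);"},
    {"Name": "test-accuracy", "Regex": "test-accuracy: (.*);"},
]


def extract_metrics(log_text, definitions):
    """Return the last value captured for each metric, scanning line by line."""
    found = {}
    for definition in definitions:
        for line in log_text.splitlines():
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found


# Made-up log output, for illustration only.
sample_log = "loading data\ntrain-accuracy: 0.91;\ntest-accuracy: 0.88;\n"
metrics = extract_metrics(sample_log, metric_definitions)
print(metrics)
```

Because `(.*)` is greedy, it is safest for the training script to print each metric on its own line; if two metrics shared a line, the first pattern would capture everything up to the final semicolon. A tighter group such as `([0-9.]+)` avoids that.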

# HPO - Define Tuning Parameters

In [None]:
HPO_experiment = {
    'model_type' : 'rf', 
    'dataset' : 'airline',
    'dataset_samples' : 20000000,
    'compute_type': 'GPU',
    'strategy': 'Random',
    'sm_estimator' : sm_estimator,
    'metric_definitions' : metric_definitions,
    'objective_metric_name' : objective_metric_name,
    'hyperparameter_ranges' : random_forest_hyperparameter_ranges,
    's3_input_training' : s3_input_training,    
    'objective_type': 'Maximize', 
    'max_jobs': 10,
    'max_parallel_jobs': 2,
    'CV_folds' : 1,
}

In [None]:
hpo = sagemaker.tuner.HyperparameterTuner( estimator = HPO_experiment['sm_estimator'],
                                           metric_definitions = HPO_experiment['metric_definitions'], 
                                           objective_metric_name = HPO_experiment['objective_metric_name'],
                                           objective_type = HPO_experiment['objective_type'],
                                           hyperparameter_ranges = HPO_experiment['hyperparameter_ranges'],
                                           strategy = HPO_experiment['strategy'],  
                                           max_jobs = HPO_experiment['max_jobs'],
                                           max_parallel_jobs = HPO_experiment['max_parallel_jobs'] )

<img src='figures/max_jobs.png' width='800px'>
<img src='figures/max_parallel.png' width='500px'>
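
As a rough guide to how `max_jobs` and `max_parallel_jobs` interact, the sketch below computes the minimum number of sequential rounds a tuning run needs. It is a simplification that assumes jobs run in strict waves of equal size; the `minimum_rounds` helper is hypothetical.

```python
import math


def minimum_rounds(max_jobs, max_parallel_jobs):
    """Lower bound on sequential rounds when jobs run in waves of fixed size."""
    return math.ceil(max_jobs / max_parallel_jobs)


# Values from the HPO_experiment dictionary above.
rounds = minimum_rounds(max_jobs=10, max_parallel_jobs=2)
print(rounds)  # 5
```

Raising `max_parallel_jobs` shortens wall-clock time but requires more instances running at once, so it is bounded by your account's instance limits.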

# Build HPO Job Name 
Using these HPO parameters we'll build up a unique name for this HPO job. 
> Note that we'll be using the name to determine some of the custom logic 

In [None]:
custom_tag = 'v100'
HPO_experiment['experiment_name'] = f"{HPO_experiment['model_type']}-{HPO_experiment['compute_type']}-CV-{HPO_experiment['CV_folds']}-{HPO_experiment['dataset_samples']}-{custom_tag}"

In [None]:
tuning_job_name = HPO_experiment['experiment_name']
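
Since the job name is assembled from several fields, it can be worth validating before launching. The sketch below assumes the tuning job name must be at most 32 characters and consist of alphanumerics separated by hyphens; please confirm these constraints against the CreateHyperParameterTuningJob API reference. The helper names are hypothetical.

```python
import re

# Assumed constraints on tuning job names; verify against the AWS API reference.
MAX_NAME_LENGTH = 32
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def build_experiment_name(model_type, compute_type, cv_folds, dataset_samples, custom_tag):
    """Mirror the f-string used in the notebook cell above."""
    return f"{model_type}-{compute_type}-CV-{cv_folds}-{dataset_samples}-{custom_tag}"


def is_valid_job_name(name):
    """Check the assumed length and character constraints."""
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.match(name) is not None


name = build_experiment_name("rf", "GPU", 1, 20000000, "v100")
print(name, len(name), is_valid_job_name(name))
```

Under these assumptions the default name leaves little headroom, so a longer `custom_tag` or a larger `dataset_samples` value could push the name over the limit.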

# Run HPO

In [None]:
import time
start_time = time.perf_counter()

hpo.fit( inputs = HPO_experiment['s3_input_training'], 
         job_name = HPO_experiment['experiment_name'], wait = True, logs = 'All')    
hpo.wait() # block until the .fit call above is completed

HPO_job_total_time = time.perf_counter() - start_time
print(HPO_job_total_time)

In [None]:
results_df = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()

In [None]:
results_df

TODO: describe choices (e.g., instance type), add summary, finalize/confirm defaults with team, add notes on instance limits

# Summary
AWS SageMaker + NVIDIA RAPIDS HPO FTW!

# Detailed Metric Parsing [ Optional]