In [1]:
import sagemaker
from helper_functions import *

<img src='../img/header.png' width='75%'>

Hyper Parameter Optimization (HPO) improves model quality by searching the space of possible 'architecture parameters,' parameters not usually trained during the learning process. This search can significantly boost model quality relative to default parameters and non-expert tuning; however, the search over architectures can take a very long time on a non-accelerated platform. In this notebook, we containerize a RAPIDS workflow and run Bring-Your-Own-Container SageMaker HPO to show how we can overcome the computational complexity of model search.

We accelerate HPO in two key ways: 
* by *scaling within a node* (e.g., multi-GPU where each GPU brings a magnitude higher core count relative to CPUs), and 
* by *scaling across nodes* and running parallel trials on cloud instances.

By combining these two powers, HPO experiments that feel unapproachable and may take multiple days on CPU instances can complete in just hours.

<img src='../img/results.png' width='600'>

For example, when comparing between GPU and CPU instances on 100 HPO experiments using 10 parallel workers, XGBoost with 10 cross-validation folds, and 10 years of the Airline Dataset (~63M flights) we found a <span style="color:#8735fb; font-size:14pt"> **12X** </span> speedup in wall clock time and a <span style="color:#8735fb; font-size:14pt"> **4.5x** </span> reduction in cost. With all these powerful tools at our disposal, every data scientist should feel empowered to uplevel their model before serving it to the world!

<img src='../img/hpo.png' width='85%'>

<span style="color:#8735fb; font-size:22pt"> **Key Choices** </span>

Let's go ahead and choose the configuration options for our HPO run.

Note that we've set the configuration defaults to values on the low end; you are welcome to scale them up.

Here are two reference configurations showing a small and a large scale HPO.

> sample small HPO config : 1_year, XGBoost, 3 CV folds, singleGPU, max_jobs = 10, max_parallel_jobs = 2

> sample big HPO config: 10_year, XGBoost, 10 CV folds, multiGPU, max_jobs = 100, max_parallel_jobs = 10

<span style="color:#8735fb; font-size:18pt"> [ Dataset Size and S3 Bucket ] </span>

We target a large real-world structured dataset of flight logs for US airlines (published monthly since 1987 by the Bureau of Transportation Statistics, [dataset link](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=)) and train a model to predict flight delays.

We host 3 increasingly large versions of this dataset as directories in a public bucket, and offer a `1_year` (2019, 7.2M flights), `3_year` (2016-2019, 18M flights) or `10_year` (2009-2019, 125M flights) configuration.

In [2]:
dataset_directory = '1_year'

assert( dataset_directory in [ '1_year', '3_year', '10_year'] )

In [3]:
s3_data_URI = f"s3://{'sagemaker-rapids-hpo-us-east-1'}/{dataset_directory}"

<span style="color:#8735fb; font-size:18pt"> [ Algorithm ] </span>

From an ML/algorithm perspective, we offer `XGBoost` and `RandomForest` decision tree models, which do quite well on this structured dataset.

In [4]:
algorithm_choice = 'XGBoost'

assert ( algorithm_choice in [ 'XGBoost', 'RandomForest' ])

We can also optionally increase robustness via reshuffles of the train-test split (i.e., cross-validation folds).

In [5]:
cv_folds = 3

assert ( cv_folds >= 1 )
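As a rough illustration of what a reshuffled train-test split looks like, the sketch below draws a fresh random split for every fold using only the standard library. The function name `reshuffled_splits`, the 25% test fraction, and the seed are assumptions made for this example; they are not taken from `rapids_cloud_ml.py`.

```python
import random

def reshuffled_splits(n_rows, cv_folds, test_fraction=0.25, seed=0):
    """Yield one (train_indices, test_indices) pair per fold.

    Hypothetical helper: every fold reshuffles all row indices and
    holds out the first `test_fraction` of them as the test set.
    """
    rng = random.Random(seed)
    n_test = int(n_rows * test_fraction)
    for _ in range(cv_folds):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]

splits = list(reshuffled_splits(n_rows=20, cv_folds=3))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train rows, {len(test_idx)} test rows")
```

Averaging the score over the folds makes the reported metric less sensitive to any single lucky or unlucky split, at the cost of training the model `cv_folds` times per experiment.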

<span style="color:#8735fb; font-size:18pt"> [ Code ] </span>

We enable the option of running different code variations that unlock increasing amounts of parallelism in the compute workflow.

* `singleCPU`* = pandas + sklearn
* `multiCPU`   = dask + pandas + sklearn
* `singleGPU` = cudf + cuml 
* `multiGPU`  = dask + cudf + cuml 

Note that the single-CPU option still uses multiple cores in the model training portion of the workflow; however, to unlock parallelism in each stage of the workflow we use dask. We've integrated all of these options in the rapids_cloud_ml.py file; we also offer individual notebooks that look at the variations in more detail.

In [6]:
code_choice = 'singleGPU' 

assert ( code_choice in [ 'singleCPU', 'singleGPU', 'multiCPU', 'multiGPU'])

<span style="color:#8735fb; font-size:18pt"> [ Search Strategy and Ranges ] </span>

One of the most important choices when running HPO is to choose the bounds of the hyper-parameter search process. Below we've set the ranges of the hyper-parameters to allow for significant variation in all of the different dimensions and we've selected the Random search strategy. You are of course welcome to modify these settings and experiment.

In [7]:
search_strategy = 'Random'

hyperparameter_ranges = {
    'max_depth'    : sagemaker.parameter.IntegerParameter        ( 5, 15 ),
    'n_estimators' : sagemaker.parameter.IntegerParameter        ( 100, 500 ),
    'max_features' : sagemaker.parameter.ContinuousParameter     ( 0.1, 1.0 ),
}

In [8]:
assert ( search_strategy in [ 'Random', 'Bayesian' ])
if 'XGBoost' in algorithm_choice:    
    hyperparameter_ranges['num_boost_round'] = hyperparameter_ranges.pop('n_estimators')
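To make the Random strategy more concrete, here is a minimal sketch of drawing independent configurations from ranges like the ones above. SageMaker performs its own sampling on the service side, so this is only an illustration: the `example_ranges` layout and the `sample_configuration` helper are hypothetical and use plain tuples rather than the `sagemaker.parameter` objects.

```python
import random

# Plain (kind, low, high) tuples stand in for the SageMaker parameter objects.
example_ranges = {
    'max_depth': ('int', 5, 15),
    'num_boost_round': ('int', 100, 500),
    'max_features': ('float', 0.1, 1.0),
}

def sample_configuration(ranges, rng):
    """Draw one configuration uniformly at random from each range."""
    config = {}
    for name, (kind, low, high) in ranges.items():
        if kind == 'int':
            config[name] = rng.randint(low, high)   # both bounds inclusive
        else:
            config[name] = rng.uniform(low, high)
    return config

rng = random.Random(42)
trials = [sample_configuration(example_ranges, rng) for _ in range(10)]
for trial in trials[:3]:
    print(trial)
```

Because every draw is independent of earlier results, random search trials can all run in parallel, whereas a Bayesian strategy uses completed trials to choose the next ones.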

<span style="color:#8735fb; font-size:18pt"> [ Experiment Scale ] </span>

We also need to decide how many total experiments to run, and how many should run in parallel.
> Note that you may need to request an instance limit increase.

In [9]:
max_jobs = 10

In [10]:
max_parallel_jobs = 2

Let's also set the max duration for an individual job to 24 hours so we don't have run-away compute jobs taking too long.

In [11]:
max_duration_of_experiment_seconds = 60*60*24
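A quick back-of-the-envelope calculation shows how these three settings interact. The sketch below computes the number of sequential "waves" of jobs and a rough upper bound on total wall-clock time, assuming every job hits the maximum duration and ignoring instance startup and spot-capacity waiting time. The function names are hypothetical and not part of the notebook's helper module.

```python
import math

def estimate_hpo_waves(max_jobs, max_parallel_jobs):
    """Number of sequential rounds needed when jobs run in parallel batches."""
    return math.ceil(max_jobs / max_parallel_jobs)

def worst_case_wall_clock_seconds(max_jobs, max_parallel_jobs, max_job_seconds):
    """Rough upper bound: every job runs for the full allowed duration."""
    return estimate_hpo_waves(max_jobs, max_parallel_jobs) * max_job_seconds

small_waves = estimate_hpo_waves(max_jobs=10, max_parallel_jobs=2)
large_waves = estimate_hpo_waves(max_jobs=100, max_parallel_jobs=10)
small_bound = worst_case_wall_clock_seconds(10, 2, 60 * 60 * 24)

print(small_waves, large_waves, small_bound)
```

In practice individual jobs usually finish far sooner than the 24-hour cap, so the bound is mainly useful as a sanity check before raising `max_jobs`.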

<span style="color:#8735fb; font-size:18pt"> [ Compute Platform ] </span>

Based on the dataset size and compute choice we will try to recommend an instance choice*; you are of course welcome to select alternate configurations.
> e.g., we recommend ml.m5.24xlarge CPU instances for the 10_year variations since we'll need upwards of 200GB during model training.

In [12]:
instance_type = recommend_instance_type ( code_choice, dataset_directory  ) 

recommended instance type : ml.p3.2xlarge 
instance details          : 1x V100, 16GB GPU memory, 61GB CPU memory


In [13]:
use_spot_instances_flag = True

<span style="color:#8735fb; font-size:22pt"> **Validate** </span>

In [14]:
summarize_choices( s3_data_URI, code_choice, algorithm_choice, cv_folds,
                   instance_type, use_spot_instances_flag, search_strategy, 
                   max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds )

S3 uri           =	s3://sagemaker-rapids-hpo-us-east-1/1_year
compute          =	singleGPU
algorithm        =	XGBoost, 3 cv-fold
instance         =	ml.p3.2xlarge
spot instances   =	True
hpo strategy     =	Random
max_experiments  =	10
max_parallel     =	2
max runtime      =	86400 sec


<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **1. ML Workflow** </span>

<img src='../img/ml_workflow.png' width='800'> 

<span style="color:#8735fb; font-size:20pt"> 1.1 - Dataset </span>

In this demo we'll utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). 

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* locations and distance  ( `Origin`, `Dest`, `Distance` )
* airline / carrier ( `Reporting_Airline` )
* scheduled departure and arrival times ( `CRSDepTime` and `CRSArrTime` )
* actual departure and arrival times ( `DepTime` and `ArrTime` )
* difference between scheduled & actual times ( `ArrDelay` and `DepDelay` )
* binary encoded version of late, aka our target variable ( `ArrDelay15` )

Using these features we'll be trying to build a classifier model to predict whether a flight is going to be more than 15 minutes late on arrival as it prepares to depart.

<span style="color:#8735fb; font-size:20pt"> 1.2 - Python DS Workflow [ ETL, Train, Eval ] </span>

To build a RAPIDS enabled SageMaker HPO we first need to build an Estimator. An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data.

In order to work with SageMaker HPO, the entrypoint logic should parse hyper-parameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyper-parameter setting. We've already built multiple variations of this code. If you would like to make changes by adding your custom model logic, feel free to modify the **train.py** and **rapids_cloud_ml.py** files in the code directory. Uncomment the cells below to load and review the code.

In [15]:
# %load ../code/train.py

In [16]:
# %load ../code/rapids_cloud_ml.py
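For orientation, here is a heavily simplified sketch of the entrypoint contract described above: parse hyper-parameters, run the workflow, and print a score line that a metric regex can pick up. It is not the contents of `train.py`. The function names, the default values, and the placeholder score are assumptions for illustration, and it relies on the training toolkit passing hyper-parameters as command-line arguments.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Read hyper-parameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    # Placeholder defaults, used when a value is not supplied by the tuner.
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--num_boost_round', type=int, default=10)
    parser.add_argument('--max_features', type=float, default=1.0)
    # Tolerate any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args

def train_and_score(args):
    """Placeholder for ETL, training, and evaluation; returns a dummy score."""
    return 0.0

def main(argv=None):
    args = parse_hyperparameters(argv)
    score = train_and_score(args)
    # Emit the result in a fixed format so it can be parsed from the job log.
    line = f'final-score: {score};'
    print(line)
    return line

emitted = main(['--max_depth', '7', '--num_boost_round', '200'])
```

The only hard requirement from the HPO side is the last step: each worker must write its score to the log in a format that matches the metric definition given to the tuner.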

<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **2. Build Estimator** </span>

<img src='../img/estimator.png' width='800'>

As we've already mentioned, the SageMaker Estimator represents the containerized software stack that AWS SageMaker will replicate to each worker node.

The first step in building our Estimator is to augment a RAPIDS container with our ML Workflow code from above, and push this image to Amazon Elastic Container Registry so it is available to SageMaker.

For additional options and details see the [Estimator documentation](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Estimator).


<span style="color:#8735fb; font-size:20pt"> 2.1 - Containerize and Push to ECR </span>

Now let's turn to building our container so that it can integrate with the AWS SageMaker HPO API.

Our container can either be built on top of the latest RAPIDS [ nightly ] image as a starting layer or the RAPIDS stable image.


In [17]:
rapids_stable = 'rapidsai/rapidsai:0.14-cuda10.1-runtime-ubuntu18.04-py3.7'
rapids_nightly = 'rapidsai/rapidsai-nightly:0.15-cuda10.1-runtime-ubuntu18.04-py3.7'

rapids_base_container = rapids_stable
assert ( rapids_base_container in [ rapids_stable, rapids_nightly ] )

To get things rolling, let's make sure we can query our AWS SageMaker execution role and session as well as our account ID and AWS region.

In [18]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

In [19]:
account, region

(['561241433344'], ['us-east-1'])

Let's decide on the full name of our container `image_base:image_tag`

In [20]:
image_base = 'cloud-ml-sagemaker'
image_tag  = rapids_base_container.split(':')[1]

In [21]:
ecr_fullname = f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"

In [22]:
ecr_fullname

'561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker:0.15-cuda10.1-runtime-ubuntu18.04-py3.7'

<span style="color:#8735fb; font-size:18pt"> 2.1.1 - Write Dockerfile </span>

We write the Dockerfile to disk in the next few cells, and then execute the docker build command.
> Note that we're copying in our custom logic [ train.py, rapids_cloud_ml.py ]

In [23]:
workdir='~/SageMaker/cloud-ml-examples/aws/code'

In [24]:
%cd {workdir}

/home/ec2-user/SageMaker/cloud-ml-examples/aws/code


Let's write our selected RAPIDS image layer as the first FROM statement in the Dockerfile.

In [25]:
with open('Dockerfile', 'w') as dockerfile_handle: 
    dockerfile_handle.writelines( 'FROM ' + rapids_base_container + '\n')

Next let's write the remaining pieces of the Dockerfile, namely adding the sagemaker-training-toolkit and copying our python code.

In [26]:
%%writefile -a Dockerfile

# install https://github.com/aws/sagemaker-training-toolkit
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \ 
    && source activate rapids && pip3 install sagemaker-training

# path where sagemaker looks for our code
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container 
COPY rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
COPY train.py $CLOUD_PATH/train.py

# sagemaker entrypoint will be train.py
ENV SAGEMAKER_PROGRAM train.py 

WORKDIR $CLOUD_PATH

Appending to Dockerfile


Lastly, let's ensure that our Dockerfile correctly captured our base image selection.

In [27]:
validate_dockerfile( rapids_base_container )
!cat Dockerfile

FROM rapidsai/rapidsai-nightly:0.15-cuda10.1-runtime-ubuntu18.04-py3.7

# install https://github.com/aws/sagemaker-training-toolkit
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \ 
    && source activate rapids && pip3 install sagemaker-training

# path where sagemaker looks for our code
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container 
COPY rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
COPY train.py $CLOUD_PATH/train.py

# sagemaker entrypoint will be train.py
ENV SAGEMAKER_PROGRAM train.py 

WORKDIR $CLOUD_PATH


<span style="color:#8735fb; font-size:18pt"> 2.1.2 Build and Tag </span>

The build usually takes less than 1 minute.

In [28]:
%%time
!docker build . -t $ecr_fullname -f Dockerfile

Sending build context to Docker daemon  30.21kB
Step 1/7 : FROM rapidsai/rapidsai-nightly:0.15-cuda10.1-runtime-ubuntu18.04-py3.7
 ---> 0aa868e59c3b
Step 2/7 : RUN apt-get update && apt-get install -y --no-install-recommends build-essential     && source activate rapids && pip3 install sagemaker-training
 ---> Using cache
 ---> d4dcac8e3726
Step 3/7 : ENV CLOUD_PATH="/opt/ml/code"
 ---> Using cache
 ---> cdb78892718a
Step 4/7 : COPY rapids_cloud_ml.py $CLOUD_PATH/rapids_cloud_ml.py
 ---> Using cache
 ---> 4fe5d342ef25
Step 5/7 : COPY train.py $CLOUD_PATH/train.py
 ---> Using cache
 ---> 07ca1ddfee71
Step 6/7 : ENV SAGEMAKER_PROGRAM train.py
 ---> Using cache
 ---> 6c0c5ad71ece
Step 7/7 : WORKDIR $CLOUD_PATH
 ---> Using cache
 ---> c7334998c616
Successfully built c7334998c616
Successfully tagged 561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker:0.15-cuda10.1-runtime-ubuntu18.04-py3.7
CPU times: user 9.2 ms, sys: 4.68 ms, total: 13.9 ms
Wall time: 319 ms


<span style="color:#8735fb; font-size:18pt"> 2.1.3 - Publish to Elastic Container Registry (ECR) </span>

Now that we've built and tagged our container, it's time to push it to Amazon's container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.


Docker Login to ECR

In [29]:
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)

In [30]:
!{docker_login_str[0]}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


Create ECR repository [ if it doesn't already exist]

In [31]:
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name $image_base)

Let's now actually push the container to ECR
> Note the first push to ECR may take some time (hopefully less than 10 minutes).

In [32]:
ecr_fullname

'561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker:0.15-cuda10.1-runtime-ubuntu18.04-py3.7'

In [33]:
!docker push $ecr_fullname

The push refers to repository [561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker]

c37f3422: Preparing
3214f96f: Preparing
6b4e1e7d: Preparing
6ce3e91a: Preparing
70bf2e4d: Preparing
4e7152cf: Preparing
74f76be4: Preparing
d332a58a: Preparing
f11cbf29: Preparing
a4b22186: Preparing
afb09dc3: Preparing
b5a53aac: Preparing
c8e5063e: Preparing
c8e5063e: Layer already exists
0.15-cuda10.1-runtime-ubuntu18.04-py3.7: digest: sha256:484b892fec17f5dacd32e5de06ee3a09b66c4a7094074399fab04a7f0dae552d size: 3265


<span style="color:#8735fb; font-size:20pt"> 2.2 - Create Estimator </span>

Having built our container [ +custom logic ] and pushed it to ECR, we can finally compile all of our efforts into an Estimator instance.

In [34]:
estimator_params = {
    'image_name' : ecr_fullname,
    
    'train_instance_type' : instance_type, 
    'train_instance_count' : 1, 
    'train_use_spot_instances': use_spot_instances_flag,
    
    'train_max_run' : max_duration_of_experiment_seconds, # 24 hours by default
    'train_max_wait' : max_duration_of_experiment_seconds + 1,  
    
    'input_mode' : 'File',
    'sagemaker_session' : session,     
    'role' : execution_role,    
}

In [35]:
estimator = sagemaker.estimator.Estimator( **estimator_params  )
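The parameter names above follow version 1 of the SageMaker Python SDK. Version 2 renamed several of them, so if the `Estimator` call rejects these keyword arguments, a translation along the following lines may help. This is a sketch rather than an official migration tool: `V1_TO_V2_NAMES` and `to_sdk_v2_params` are hypothetical names, the example values are placeholders, and the mapping should be checked against the SDK version you have installed.

```python
# Assumed v1 -> v2 keyword renames for sagemaker.estimator.Estimator.
V1_TO_V2_NAMES = {
    'image_name': 'image_uri',
    'train_instance_type': 'instance_type',
    'train_instance_count': 'instance_count',
    'train_use_spot_instances': 'use_spot_instances',
    'train_max_run': 'max_run',
    'train_max_wait': 'max_wait',
}

def to_sdk_v2_params(v1_params):
    """Return a copy of the params dict with v1 keys renamed to v2 keys."""
    return {V1_TO_V2_NAMES.get(key, key): value
            for key, value in v1_params.items()}

# Placeholder values purely for demonstration.
example_v1_params = {
    'image_name': '<account>.dkr.ecr.<region>.amazonaws.com/cloud-ml-sagemaker:<tag>',
    'train_instance_type': 'ml.p3.2xlarge',
    'train_instance_count': 1,
    'train_use_spot_instances': True,
    'train_max_run': 60 * 60 * 24,
    'train_max_wait': 60 * 60 * 24 + 1,
    'input_mode': 'File',
}

example_v2_params = to_sdk_v2_params(example_v1_params)
print(sorted(example_v2_params))
```

Keys that were not renamed, such as `input_mode`, `role`, and `sagemaker_session`, pass through unchanged.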



<span style="color:#8735fb; font-size:20pt"> 2.3 - Test Estimator </span>

Now we are ready to test by asking SageMaker to run the BYOContainer logic inside our Estimator. This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

> Note: This verification step will use the default hyper-parameter values declared in our custom train code, as SageMaker HPO will not be orchestrating a search for this single run.

In [36]:
summarize_choices( s3_data_URI, code_choice, algorithm_choice, cv_folds,
                   instance_type, use_spot_instances_flag, search_strategy, 
                   max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds )

S3 uri           =	s3://sagemaker-rapids-hpo-us-east-1/1_year
compute          =	singleGPU
algorithm        =	XGBoost, 3 cv-fold
instance         =	ml.p3.2xlarge
spot instances   =	True
hpo strategy     =	Random
max_experiments  =	10
max_parallel     =	2
max runtime      =	86400 sec


In [None]:
assert ( input('confirm test run? [ y / n ] : ').lower() == 'y' )

job_name = new_job_name_from_config( dataset_directory, code_choice, 
                                     algorithm_choice, cv_folds,
                                     instance_type  )

estimator.fit(inputs = s3_data_URI, job_name=job_name.lower())

<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **3. Run HPO** </span>

<img src='../img/run_hpo.png'>

With a working SageMaker Estimator in hand, the hardest part is behind us. Now all we have to do is tell SageMaker about the space of hyper-parameters in which to search for the best model.

For more documentation check out the AWS SageMaker [HyperParameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

<span style="color:#8735fb; font-size:20pt"> 3.1 - Define Metric </span>

The definitions below specify regular expressions (i.e., string parsing rules) that find the metrics we use to evaluate performance in the output log of each worker/Estimator. In this case we are only interested in the performance of our model on the test data (i.e., test accuracy, reported in the log as `final-score`), so we have a single metric to track.

For additional details on metrics refer to the [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html).

In [None]:
metric_definitions = [{'Name': 'final-score', 'Regex': 'final-score: (.*);'}]

In [None]:
objective_metric_name = 'final-score'
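To see how such a pattern picks a score out of a job log, the sketch below applies the same regular expression to a few made-up log lines with Python's `re` module. The sample lines and the `extract_metric` helper are invented for illustration; SageMaker performs the equivalent matching on the real training logs.

```python
import re

metric_regex = 'final-score: (.*);'

# Invented log lines standing in for a worker's output.
sample_log_lines = [
    'loading data ...',
    'fold 0 accuracy: 0.91',
    'final-score: 0.9234;',
]

def extract_metric(log_lines, pattern):
    """Return the last value captured by `pattern`, or None if nothing matches."""
    value = None
    for line in log_lines:
        match = re.search(pattern, line)
        if match:
            value = float(match.group(1))
    return value

score = extract_metric(sample_log_lines, metric_regex)
print(score)
```

The trailing `;` in the pattern gives the capture group a clear end marker, so the entrypoint code must print the score in exactly this format.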

<span style="color:#8735fb; font-size:20pt"> 3.2 - Define Tuner </span>

Below we are setting up the parameters that will define the HPO job. By default (to avoid accidentally spawning large compute jobs), we have limited the search to 10 HPO experiments, with at most 2 running in parallel.

To run a more realistic large-scale HPO, change `max_jobs` to 100 and `max_parallel_jobs` to 10 (or as high as your instance limit permits).

In [None]:
hpo = sagemaker.tuner.HyperparameterTuner( estimator = estimator,
                                           metric_definitions = metric_definitions, 
                                           objective_metric_name = objective_metric_name,
                                           objective_type = 'Maximize',
                                           hyperparameter_ranges = hyperparameter_ranges,
                                           strategy = search_strategy,  
                                           max_jobs = max_jobs,
                                           max_parallel_jobs = max_parallel_jobs)

<span style="color:#8735fb; font-size:20pt"> 3.3 - Run HPO </span>

In [None]:
summarize_choices( s3_data_URI, code_choice, algorithm_choice, cv_folds,
                   instance_type, use_spot_instances_flag, search_strategy, 
                   max_jobs, max_parallel_jobs, max_duration_of_experiment_seconds )

Let's be sure we take a moment to confirm before launching all of our HPO experiments.

In [None]:
assert ( input('confirm HPO launch? [ y / n ] : ').lower() == 'y' )

tuning_job_name = new_job_name_from_config( dataset_directory, code_choice, 
                                            algorithm_choice, cv_folds, 
                                            instance_type )
hpo.fit( inputs = s3_data_URI, 
         job_name = tuning_job_name, 
         wait = True, logs = 'All') 

hpo.wait() # block until the .fit call above is completed

<span style="color:#8735fb; font-size:20pt"> 3.4 - Results and Summary </span>

In [None]:
results_df = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()

In [None]:
results_df
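Once the results are in, a common next step is to pick out the best trial. The sketch below does this on a list of plain dictionaries so that it runs without any dependencies; with a real results table you could pass `results_df.to_dict('records')`. The column names `FinalObjectiveValue` and `TrainingJobName` are assumptions about the analytics dataframe, the records are made up, and `best_trial` is a hypothetical helper.

```python
def best_trial(records, objective_key='FinalObjectiveValue'):
    """Return the record with the highest objective, skipping missing values."""
    def has_score(record):
        value = record.get(objective_key)
        # `value == value` is False for NaN, which marks unfinished jobs.
        return isinstance(value, (int, float)) and value == value

    scored = [record for record in records if has_score(record)]
    if not scored:
        return None
    return max(scored, key=lambda record: record[objective_key])

# Made-up records for demonstration only.
example_records = [
    {'TrainingJobName': 'job-a', 'max_depth': 7, 'FinalObjectiveValue': 0.91},
    {'TrainingJobName': 'job-b', 'max_depth': 12, 'FinalObjectiveValue': 0.93},
    {'TrainingJobName': 'job-c', 'max_depth': 5, 'FinalObjectiveValue': None},
    {'TrainingJobName': 'job-d', 'max_depth': 9, 'FinalObjectiveValue': float('nan')},
]

best = best_trial(example_records)
print(best['TrainingJobName'], best['FinalObjectiveValue'])
```

Since the tuner above uses `objective_type = 'Maximize'`, the highest value is the best one; for a minimized metric, swap `max` for `min`.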

AWS SageMaker + NVIDIA RAPIDS HPO FTW!

<span style="color:#8735fb; font-size:20pt"> Rapids References </span>


> [cloud-ml-examples](http://github.com/rapidsai/cloud-ml-examples)

> [RAPIDS HPO](https://rapids.ai/hpo)

> [cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)

<span style="color:#8735fb; font-size:20pt"> SageMaker References </span>

> [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

> [Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

> Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html)