<img src='img/header.png'>

[Hyperparameter Optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization) (HPO) improves model quality by searching over hyperparameters: values that are not typically learned during training but instead control the learning process itself (e.g., model size/capacity). This search can significantly boost model quality relative to default settings and non-expert tuning; however, HPO can take a very long time on a non-accelerated platform. In this notebook, we containerize a RAPIDS workflow and run Bring-Your-Own-Container SageMaker HPO to show how we can overcome the computational complexity of model search.

We accelerate HPO in two key ways: 
* by *scaling within a node* (e.g., multi-GPU where each GPU brings an order of magnitude higher core count relative to CPUs), and 
* by *scaling across nodes* and running parallel trials on cloud instances.
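
To make the idea of parallel trials concrete, here is a minimal sketch of a random search that runs several trials at once. Everything in it is illustrative: `toy_score` is a made-up stand-in for a real training job, the hyperparameter names and ranges are assumptions, and the thread pool only mimics what parallel cloud instances would do. It is not how SageMaker schedules trials internally.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def toy_score(max_depth, learning_rate):
    """Hypothetical stand-in for training + scoring a model (higher is better)."""
    return -((max_depth - 6) ** 2) - (learning_rate - 0.1) ** 2


def random_search(max_jobs=10, max_parallel_jobs=2, seed=0):
    """Run `max_jobs` random trials, at most `max_parallel_jobs` at a time."""
    rng = random.Random(seed)
    # sample every trial's hyperparameters up front so the run is reproducible
    trials = [
        {"max_depth": rng.randint(2, 12), "learning_rate": rng.uniform(0.01, 0.5)}
        for _ in range(max_jobs)
    ]
    # the pool size caps how many trials run concurrently
    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        scores = list(pool.map(lambda params: toy_score(**params), trials))
    # pair each trial with its score and keep the best one
    best_score, best_params = max(zip(scores, trials), key=lambda pair: pair[0])
    return best_params, best_score, scores
```

Calling `random_search(max_jobs=10, max_parallel_jobs=2)` returns the best sampled hyperparameters, their score, and the full list of scores.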

In this notebook, we cover step 2 of the workflow - building a container - which must then be published to the Amazon Elastic Container Registry (ECR) to be used for our Estimator object. This notebook is intended to set up a container in the ECR to be used in the sagemaker_studio/rapids_studio_hpo.ipynb example. 

<img src='img/hpo.png'>

<span style="color:#8735fb; font-size:22pt"> **Preamble** </span>

To get things rolling let's make sure we can query our AWS SageMaker execution role and session as well as our account ID and AWS region.

In [1]:
import sagemaker
from helper_functions import *

In [2]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

In [3]:
account, region

(['561241433344'], ['us-east-1'])

<span style="color:#8735fb; font-size:22pt"> **Key Choices** </span>

Let's go ahead and choose the configuration options for our HPO run.

Below are two reference configurations showing a small and a large scale HPO (sized in terms of total experiments/compute). 

The default values in the notebook are set for the small HPO configuration; however, you are welcome to scale them up.

> **small HPO**: 1_year, XGBoost, 3 CV folds, singleGPU, max_jobs = 10, max_parallel_jobs = 2

> **large HPO**: 10_year, XGBoost, 10 CV folds, multiGPU, max_jobs = 100, max_parallel_jobs = 10
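
To get a feel for how the two reference configurations differ in scale, the sketch below writes them out as plain dictionaries and counts model fits. The dictionary layout and the `total_model_fits` helper are illustrative, and the count assumes that every HPO job trains one model per cross-validation fold.

```python
# the two reference configurations, written out as plain dictionaries
reference_configs = {
    "small": {
        "dataset_directory": "1_year",
        "algorithm_choice": "XGBoost",
        "cv_folds": 3,
        "ml_workflow_choice": "singleGPU",
        "max_jobs": 10,
        "max_parallel_jobs": 2,
    },
    "large": {
        "dataset_directory": "10_year",
        "algorithm_choice": "XGBoost",
        "cv_folds": 10,
        "ml_workflow_choice": "multiGPU",
        "max_jobs": 100,
        "max_parallel_jobs": 10,
    },
}


def total_model_fits(config):
    """Number of model fits, assuming one fit per CV fold in every HPO job."""
    return config["max_jobs"] * config["cv_folds"]


for name, config in reference_configs.items():
    print(name, total_model_fits(config))
```

Under that assumption the small configuration performs 30 fits and the large one 1000, before even accounting for the larger dataset.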

<span style="color:#8735fb; font-size:18pt"> [ Dataset ] </span>

We offer free hosting for several demo datasets that you can try running HPO with, or alternatively you can bring your own dataset (BYOD). 

By default we leverage the `Airline` dataset, which is a large public tracker of US domestic flight logs which we offer in various sizes (1 year, 3 year, and 10 year) and in <a href='https://parquet.apache.org/'>Parquet</a> (compressed column storage) format. The machine learning objective with this dataset is to predict whether flights will be more than 15 minutes late arriving to their destination ([dataset link](https://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&DB_URL=), additional details in <a href='#dataset'>Section 1.1</a>). 

As an alternative we also offer the `NYC Taxi` dataset which captures yellow cab trip details in New York in January 2020, stored in <a href='https://en.wikipedia.org/wiki/Comma-separated_values'>CSV</a> format without any compression. The machine learning objective with this dataset is to predict whether a trip had an above average tip (>$2.20).

We host the demo datasets in public S3 demo buckets in both the **us-east-1** (N. Virginia) and **us-west-2** (Oregon) regions (i.e., `sagemaker-rapids-hpo-us-east-1`, and `sagemaker-rapids-hpo-us-west-2`). You should run the SageMaker HPO workflow in either of these two regions if you wish to leverage the demo datasets since SageMaker requires that the S3 dataset and the compute you'll be renting are co-located. 

Lastly, if you plan to use your own dataset refer to the <a href='#byod'>BYOD checklist in the Appendix</a> to help integrate into the workflow.

| dataset | data_bucket | dataset_directory | # samples | storage type | time span |
|---|---|---|---|---|---|
| Airline Stats Small    | demo    | 1_year   | 6.3M   | Parquet     | 2019         |
| Airline Stats Medium   | demo    | 3_year   | 18M    | Parquet     | 2019-2017    |
| Airline Stats Large    | demo    | 10_year  | 63M    | Parquet     | 2019-2010    |
| NYC Taxi               | demo    | NYC_taxi | 6.3M   | CSV         | 2020 January |
| Bring Your Own Dataset | custom  | custom   | custom | Parquet/CSV | custom       |

In [4]:
# please choose dataset directory
dataset_directory = '3_year' # '1_year', '3_year', '10_year', 'NYC_taxi'
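
As a rough sketch of how the region and the dataset directory combine, the helper below builds an S3 URI for a demo dataset and rejects regions that do not host the demo buckets. The function name `demo_dataset_uri` is hypothetical, and the assumption that each dataset sits directly under the bucket root is ours, so check the actual bucket layout before relying on it.

```python
# regions that host the public demo buckets
DEMO_REGIONS = ("us-east-1", "us-west-2")


def demo_dataset_uri(region, dataset_directory):
    """Return an S3 URI for a demo dataset (the key layout is an assumption)."""
    if region not in DEMO_REGIONS:
        raise ValueError(
            f"demo datasets are only hosted in {DEMO_REGIONS}, got {region!r}"
        )
    return f"s3://sagemaker-rapids-hpo-{region}/{dataset_directory}"
```

For example, `demo_dataset_uri(region[0], dataset_directory)` would point at the 3 year Airline data in the bucket for the current region.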

<span style="color:#8735fb; font-size:18pt"> [ Algorithm ] </span>

From an ML/algorithm perspective, we offer [XGBoost](https://xgboost.readthedocs.io/en/latest/#), [RandomForest](https://docs.rapids.ai/api/cuml/stable/cuml_blogs.html#tree-and-forest-models) and [KMeans](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=kmeans#cuml.KMeans). You are free to switch between these algorithm choices and everything in the example will continue to work.

In [5]:
# please choose learning algorithm
algorithm_choice = 'XGBoost'

assert (algorithm_choice in ['XGBoost', 'RandomForest', 'KMeans'])

We can also optionally increase robustness via reshuffles of the train-test split (i.e., [cross-validation folds](https://scikit-learn.org/stable/modules/cross_validation.html)). Typical values here are between 3 and 10 folds.

In [6]:
# please choose cross-validation folds
cv_folds = 3

assert (cv_folds >= 1)
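
To illustrate what a reshuffled train-test split means, here is a minimal sketch in plain Python. The helper name `reshuffled_splits` and the `test_fraction` parameter are assumptions made for this example; the workflow code in this repository may split the data differently.

```python
import random


def reshuffled_splits(n_samples, cv_folds=3, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs, reshuffling the rows for every fold."""
    n_test = max(1, int(n_samples * test_fraction))
    for fold in range(cv_folds):
        indices = list(range(n_samples))
        # a different seed per fold gives a different shuffle each time
        random.Random(seed + fold).shuffle(indices)
        yield indices[n_test:], indices[:n_test]


for train_idx, test_idx in reshuffled_splits(n_samples=20, cv_folds=3):
    print(len(train_idx), len(test_idx))
```

A model would be trained and scored once per fold, and the scores averaged so that a single lucky or unlucky split has less influence on the result.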

<span style="color:#8735fb; font-size:18pt"> [ ML Workflow Compute Choice ] </span>

We enable the option of running different code variations that unlock increasing amounts of parallelism in the compute workflow. 

* `singleCPU`\*\* = [pandas](https://pandas.pydata.org/) + [sklearn](https://scikit-learn.org/stable/)
* `multiCPU`   = [dask](https://dask.org/) + [pandas](https://pandas.pydata.org/) + [sklearn](https://scikit-learn.org/stable/)

* <span style="color:#8735fb; font-size:14pt"> RAPIDS </span> `singleGPU` = [cudf](https://github.com/rapidsai/cudf) + [cuml](https://github.com/rapidsai/cuml)
* <span style="color:#8735fb; font-size:14pt"> RAPIDS </span> `multiGPU`  = [dask](https://dask.org/) + [cudf](https://github.com/rapidsai/cudf) + [cuml](https://github.com/rapidsai/cuml) 

All of these code paths are available in the `/code/workflows` directory for your reference. 

> \*\* Note that the single-CPU option will leverage multiple cores in the model training portion of the workflow; however, to unlock full parallelism in each stage of the workflow we use [Dask](https://dask.org/).



In [7]:
# please choose code variant
ml_workflow_choice = 'singleGPU' 

assert (ml_workflow_choice in ['singleCPU', 'singleGPU', 'multiCPU', 'multiGPU'])

<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **1. ML Workflow** </span>

<img src='img/ml_workflow.png' width='800'> 

<span style="color:#8735fb; font-size:20pt"> 1.1 - Dataset </span>
<a id ='dataset'></a>

The default settings for this demo are built to utilize the Airline dataset (Carrier On-Time Performance 1987-2020, available from the [Bureau of Transportation Statistics](https://transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time#)). Below are some additional details about this dataset; we plan to offer a companion notebook that does a deep dive on the data science behind it. Note that if you are using an alternate dataset (e.g., NYC Taxi or BYOD) these details are not relevant.

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* Locations and distance  ( `Origin`, `Dest`, `Distance` )
* Airline / carrier ( `Reporting_Airline` )
* Scheduled departure and arrival times ( `CRSDepTime` and `CRSArrTime` )
* Actual departure and arrival times ( `DepTime` and `ArrTime` )
* Difference between scheduled & actual times ( `ArrDelay` and `DepDelay` )
* Binary encoded version of late, aka our target variable ( `ArrDelay15` )

Using these features we will build a classifier model to predict whether a flight is going to be more than 15 minutes late on arrival as it prepares to depart.
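
The binary target can be illustrated with a tiny sketch. The delay values below are made up, the helper name `encode_late_arrival` is hypothetical, and treating a delay of exactly 15 minutes as on time follows the "more than 15 minutes" wording above rather than any confirmed rule in the dataset.

```python
def encode_late_arrival(arr_delay_minutes, threshold=15):
    """Return 1 if a flight arrived more than `threshold` minutes late, else 0."""
    return int(arr_delay_minutes > threshold)


# made-up arrival delays in minutes (negative means the flight arrived early)
sample_delays = [-5.0, 0.0, 15.0, 16.0, 120.0]
labels = [encode_late_arrival(delay) for delay in sample_delays]
print(labels)
```

The classifier then learns to predict this 0/1 label from features that are known before departure.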

<span style="color:#8735fb; font-size:20pt"> 1.2 - Python ML Workflow </span>

To build a RAPIDS-enabled SageMaker HPO we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data. 

In order to work with SageMaker HPO, the entrypoint logic should parse hyperparameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting. We've already built multiple variations of this code.
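
The steps listed above can be sketched as a skeleton like the one below. This is not the repository's `train.py`: the hyperparameter names, the toy data, the scoring function, and the `final-score:` output format are all placeholders chosen for illustration.

```python
import argparse
import random


def parse_hyperparameters(argv):
    """Parse hyperparameters from command-line style arguments."""
    parser = argparse.ArgumentParser()
    # hypothetical hyperparameters; the real entrypoint may use different names
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--n_estimators", type=int, default=100)
    args, _unknown = parser.parse_known_args(argv)  # ignore anything unexpected
    return vars(args)


def load_and_split(n_rows=100, test_fraction=0.2, seed=0):
    """Toy stand-in for loading a dataset and splitting it into train/test rows."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    n_test = int(n_rows * test_fraction)
    return rows[n_test:], rows[:n_test]


def train_and_score(hyperparameters, train_rows, test_rows):
    """Toy stand-in for model training and evaluation; returns a score in (0, 1]."""
    return 1.0 / (1.0 + abs(hyperparameters["max_depth"] - 8))


def run_trial(argv):
    hyperparameters = parse_hyperparameters(argv)
    train_rows, test_rows = load_and_split()
    score = train_and_score(hyperparameters, train_rows, test_rows)
    # emit the final score on stdout so it can be read back from the job logs
    print(f"final-score: {score:.4f}")
    return score
```

A single trial could then be launched with something like `run_trial(["--max_depth", "8"])`, with the tuner supplying a different argument list for every job.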

If you would like to make changes by adding your custom model logic feel free to modify the **train.py** and/or the specific workflow files in the `code/workflows` directory. You are also welcome to uncomment the cells below to load and review the code.

First, let's switch our working directory to the location of the Estimator entrypoint and library code.

In [8]:
%cd code

/home/ec2-user/SageMaker/updated3/cloud-ml-examples/aws/code


In [9]:
# %load train.py

In [10]:
# %load workflows/MLWorkflowSingleGPU.py

<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **2. Build Estimator** </span>

<img src='img/estimator.png' width='800'>

As we've already mentioned, the SageMaker Estimator represents the containerized software stack that AWS SageMaker will replicate to each worker node.

The first step in building our Estimator is to augment a RAPIDS container with our ML Workflow code from above, and push this image to Amazon Elastic Container Registry (ECR) so it is available to SageMaker.


<span style="color:#8735fb; font-size:20pt"> 2.1 - Containerize and Push to ECR </span>

Now let's turn to building our container so that it can integrate with the AWS SageMaker HPO API.

Our container can either be built on top of the latest RAPIDS [ nightly ] image as a starting layer or the RAPIDS stable image.


In [11]:
rapids_base_container = 'rapidsai/rapidsai-cloud-ml:latest'

Let's also decide on the full name of our container.

In [12]:
image_base = 'cloud-ml-sagemaker'
image_tag  = rapids_base_container.split(':')[1]

In [13]:
ecr_fullname = f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"

In [14]:
ecr_fullname

'561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker:latest'

<span style="color:#8735fb; font-size:18pt"> 2.1.1 - Write Dockerfile </span>

We write out the Dockerfile to disk, and in a few cells execute the docker build command. 


Let's now write our selected RAPIDS image layer as the first FROM statement in the Dockerfile.

In [15]:
with open('Dockerfile', 'w') as dockerfile: 
    # write the base image followed by our configuration choices as ENV variables
    dockerfile.write( f'FROM {rapids_base_container} \n\n'
                      f'ENV AWS_DATASET_DIRECTORY="{dataset_directory}"\n'
                      f'ENV AWS_ALGORITHM_CHOICE="{algorithm_choice}"\n'
                      f'ENV AWS_ML_WORKFLOW_CHOICE="{ml_workflow_choice}"\n'
                      f'ENV AWS_CV_FOLDS="{cv_folds}"\n')

Next, let's append the remaining pieces of the Dockerfile, namely setting a few environment variables, copying our Python code into the container, and defining the entrypoint.

In [16]:
%%writefile -a Dockerfile

# ensure printed output/log-messages retain correct order
ENV PYTHONUNBUFFERED=True

# path where SageMaker looks for code when container runs in the cloud
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container 
COPY . $CLOUD_PATH

# make the entrypoint script executable
RUN chmod +x $CLOUD_PATH/entrypoint.sh

WORKDIR $CLOUD_PATH
ENTRYPOINT ["./entrypoint.sh"]

Appending to Dockerfile


Lastly, let's ensure that our Dockerfile correctly captured our base image selection.

In [17]:
validate_dockerfile(rapids_base_container)
!cat Dockerfile

FROM rapidsai/rapidsai-cloud-ml:latest 

ENV AWS_DATASET_DIRECTORY="3_year"
ENV AWS_ALGORITHM_CHOICE="XGBoost"
ENV AWS_ML_WORKFLOW_CHOICE="singleGPU"
ENV AWS_CV_FOLDS="3"

# ensure printed output/log-messages retain correct order
ENV PYTHONUNBUFFERED=True

# path where SageMaker looks for code when container runs in the cloud
ENV CLOUD_PATH="/opt/ml/code"

# copy our latest [local] code into the container 
COPY . $CLOUD_PATH

# make the entrypoint script executable
RUN chmod +x $CLOUD_PATH/entrypoint.sh

WORKDIR $CLOUD_PATH
ENTRYPOINT ["./entrypoint.sh"]
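
For readers curious what a check of this kind might involve, here is a minimal sketch. It is not the implementation of `validate_dockerfile` from `helper_functions`, which we have not reproduced here; `first_instruction` and `check_base_image` are hypothetical names, and the check simply assumes the Dockerfile begins with a `FROM` line.

```python
def first_instruction(dockerfile_path):
    """Return the first non-empty, non-comment line of a Dockerfile."""
    with open(dockerfile_path) as dockerfile:
        for line in dockerfile:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                return stripped
    return ""


def check_base_image(dockerfile_path, expected_base):
    """True when the Dockerfile starts with `FROM <expected_base>`."""
    return first_instruction(dockerfile_path).split() == ["FROM", expected_base]
```

In this notebook it could be called as `check_base_image('Dockerfile', rapids_base_container)`.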


<span style="color:#8735fb; font-size:18pt"> 2.1.2 Build and Tag </span>

The build step will be dominated by the download of the RAPIDS image (base layer). If it's already been downloaded the build will take less than 1 minute.

In [18]:
!docker pull $rapids_base_container

latest: Pulling from rapidsai/rapidsai-cloud-ml
Digest: sha256:d7e654e302cf1c3d0784533875963fb6d4c92a2ba286a665c5c907768314edb1
Status: Image is up to date for rapidsai/rapidsai-cloud-ml:latest
docker.io/rapidsai/rapidsai-cloud-ml:latest


In [19]:
%%time
!docker build . -t $ecr_fullname -f Dockerfile

Sending build context to Docker daemon  148.5kB
Step 1/11 : FROM rapidsai/rapidsai-cloud-ml:latest
 ---> 02d8a62f6d2c
Step 2/11 : ENV AWS_DATASET_DIRECTORY="3_year"
 ---> Using cache
 ---> cc635c1758c8
Step 3/11 : ENV AWS_ALGORITHM_CHOICE="XGBoost"
 ---> Using cache
 ---> 4be2669ce87e
Step 4/11 : ENV AWS_ML_WORKFLOW_CHOICE="singleGPU"
 ---> Using cache
 ---> fc30a90dbb86
Step 5/11 : ENV AWS_CV_FOLDS="3"
 ---> Using cache
 ---> e4d57f0893f0
Step 6/11 : ENV PYTHONUNBUFFERED=True
 ---> Using cache
 ---> fba4d538bb57
Step 7/11 : ENV CLOUD_PATH="/opt/ml/code"
 ---> Using cache
 ---> f323001ce6dd
Step 8/11 : COPY . $CLOUD_PATH
 ---> Using cache
 ---> e4ccba1d9ed2
Step 9/11 : RUN chmod +x $CLOUD_PATH/entrypoint.sh
 ---> Using cache
 ---> 0f9d2cf570af
Step 10/11 : WORKDIR $CLOUD_PATH
 ---> Using cache
 ---> db881b5a4eeb
Step 11/11 : ENTRYPOINT ["./entrypoint.sh"]
 ---> Using cache
 ---> 6917417b835a
Successfully built 6917417b835a
Successfully tagged 561241433344.dkr.ecr.us-east-1.amazonaws.co

<span style="color:#8735fb; font-size:18pt"> 2.1.3 - Publish to Elastic Container Registry (ECR) </span>

Now that we've built and tagged our container it's time to push it to Amazon's container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.


Docker Login to ECR

In [20]:
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)

In [21]:
!{docker_login_str[0]}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
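
If the login cells above fail on your setup, one possible cause is the AWS CLI version: version 2 of the CLI no longer includes `aws ecr get-login` and provides `aws ecr get-login-password` instead. The sketch below builds the equivalent login pipeline as a string; the helper name `ecr_login_command` is hypothetical, and it assumes a private registry addressed by account ID and region.

```python
def ecr_login_command(account_id, region):
    """Build the shell pipeline that logs Docker in to a private ECR registry."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return (
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


print(ecr_login_command("123456789012", "us-east-1"))  # placeholder account ID
```

In a notebook cell the command could be run with `!{ecr_login_command(account[0], region[0])}`.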


Create ECR repository [ if it doesn't already exist]

In [22]:
repository_query = !(aws ecr describe-repositories --repository-names $image_base)
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name $image_base)
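
The same create-if-missing logic can also be expressed with the `boto3` ECR client rather than by inspecting CLI output. The sketch below is an alternative under stated assumptions, not part of the original workflow: `ensure_repository` is a hypothetical helper, and it expects to be handed a client created with `boto3.client("ecr")`.

```python
def ensure_repository(ecr_client, repository_name):
    """Create the ECR repository if it is missing; return True when it was created."""
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        # the lookup failed because the repository does not exist yet
        ecr_client.create_repository(repositoryName=repository_name)
        return True


# example usage (requires boto3 and AWS credentials):
#   import boto3
#   ensure_repository(boto3.client("ecr"), image_base)
```

Catching the specific not-found exception means other failures, such as missing permissions, still surface as errors instead of being mistaken for an absent repository.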

Let's now actually push the container to ECR
> Note the first push to ECR may take some time (hopefully less than 10 minutes).

In [23]:
!docker push $ecr_fullname

The push refers to repository [561241433344.dkr.ecr.us-east-1.amazonaws.com/cloud-ml-sagemaker]

18701c81: Preparing
59dad267: Preparing
3ffe903b: Preparing
32fbf55b: Preparing
189ca3a3: Preparing
b97c445b: Preparing
78e3bf48: Preparing
8b120579: Preparing
87e0621d: Preparing
7ad6008c: Preparing
dd8ed907: Preparing
872b888e: Preparing
512fd434: Preparing
31fc0e08: Preparing
8308da3d: Layer already exists
latest: digest: sha256:29a7a49d0a8499fbbd9bd68b988b5ec7647045036f3baa77c330d35ad3ac1d25 size: 3475


<span style="color:#8735fb; font-size:20pt"> 2.2 - Create Estimator </span>

Having built our container [ +custom logic] and pushed it to ECR, we can finally compile all of our efforts into an Estimator instance in SageMaker Studio. Navigate to sagemaker_studio/rapids_studio_hpo.ipynb to continue.

<span style="display: block; color:#8735fb; font-size:25pt"> **Appendix: Bring Your Own Dataset Checklist** </span>
<a id ='byod'></a>

If you plan to use your own dataset (BYOD) here is a checklist to help you integrate into the workflow:

> - [ ] Dataset should be in either CSV or Parquet format.
> - [ ] Dataset is already pre-processed (and all feature-engineering is done).
> - [ ] Dataset is uploaded to S3 and `data_bucket` and `dataset_directory` have been set to the location of your data.
> - [ ] Dataset feature and target columns have been enumerated in `/code/HPODataset.py`
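
Parts of this checklist can be verified programmatically. The sketch below is a hypothetical helper, not code from `/code/HPODataset.py`: it checks file extensions against the two supported formats and confirms that the feature and target columns you plan to enumerate actually appear in the dataset's column list.

```python
import os

SUPPORTED_EXTENSIONS = (".csv", ".parquet")


def check_byod(file_names, dataset_columns, feature_columns, target_column):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for name in file_names:
        # compare the lower-cased extension against the supported formats
        if os.path.splitext(name)[1].lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"unsupported file format: {name}")
    required = list(feature_columns) + [target_column]
    missing = [column for column in required if column not in dataset_columns]
    if missing:
        problems.append(f"columns not found in dataset: {missing}")
    return problems
```

For example, `check_byod(["part-0.parquet"], ["a", "b", "y"], ["a", "b"], "y")` returns an empty list, while a `.json` file or a missing target column would each add an entry.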


<span style="color:#8735fb; font-size:25pt"> **RAPIDS References** </span>

> [cloud-ml-examples](http://github.com/rapidsai/cloud-ml-examples)

> [RAPIDS HPO](https://rapids.ai/hpo)

> [cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)

<span style="color:#8735fb; font-size:25pt"> **SageMaker References** </span>

> [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

> [Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

> Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), and [blog]()