<span style="color:#8735fb; font-size:22pt"> **Demo Overview** </span>

Automated Model Tuning (AMT), also known as Hyper-Parameter Optimization (HPO), helps find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and can feel prohibitive.


In the notebook demo below, we show how [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) and RAPIDS can work together to tackle model tuning on two fronts: accelerating each experiment through parallel compute on a node's GPUs, and accelerating the search itself by running experiments in parallel across sets of cloud nodes.

For more, check out our [AWS blog](https://aws.amazon.com/blogs/machine-learning/rapids-and-amazon-sagemaker-scale-up-and-scale-out-to-tackle-ml-challenges/).

<span style="color:#8735fb; font-size:22pt"> **0. Preamble** </span>

This notebook was tested in an Amazon SageMaker Studio notebook, on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

To get things rolling, let's make sure we can query our AWS SageMaker execution role and session, as well as our account ID and AWS region.

In [None]:
import sagemaker
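# import only the helpers this notebook calls, instead of a wildcard import
from helper_functions import summarize_hpo_results, validate_dockerfile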

In [None]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

account=!(aws sts get-caller-identity --query Account --output text)
region = [session.boto_region_name]

In [None]:
account, region

<span style="color:#8735fb; font-size:22pt"> **1. Key Choices** </span>

<span style="color:#8735fb; font-size:18pt"> 1.1 - HPO Configurations </span>

Let's go ahead and choose the configuration options for our HPO run. We will be using the default algorithm configurations in [code/Dockerfile](code/Dockerfile) (three-fold cross validation with XGBoost on a single GPU), which are explained in detail in our [extended notebook example](rapids_sagemaker_hpo_extended.ipynb). If you are using your own workflow and training scripts, be sure to write your Dockerfile accordingly.

In [None]:
# please choose dataset S3 bucket and directory
data_bucket = 'sagemaker-rapids-hpo-' + region[0]
dataset_directory = '3_year' # '1_year', '3_year', '10_year', 'NYC_taxi'

# please choose output bucket for trained model(s)
model_output_bucket = session.default_bucket()

In [None]:
s3_data_input = f"s3://{data_bucket}/{dataset_directory}"
s3_model_output = f"s3://{model_output_bucket}/trained-models"

In [None]:
# please choose HPO search ranges
hyperparameter_ranges = {
    'max_depth': sagemaker.parameter.IntegerParameter(5, 15),
    'num_boost_round': sagemaker.parameter.IntegerParameter(100, 500),
    'max_features': sagemaker.parameter.ContinuousParameter(0.1, 1.0),
}
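
To make the search space concrete, here is a minimal plain-Python sketch of what drawing random configurations from these ranges looks like. It is an illustration only: `search_space` and `sample_configuration` are hypothetical names, and this is not SageMaker's own sampler.

```python
import random

# Plain-Python stand-ins for the ranges above (bounds copied from this notebook).
search_space = {
    "max_depth": ("int", 5, 15),
    "num_boost_round": ("int", 100, 500),
    "max_features": ("float", 0.1, 1.0),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random from each range."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # seeded so the draws are repeatable
candidates = [sample_configuration(search_space, rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```

Each printed dictionary corresponds to one experiment that a random search could launch.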

<span style="color:#8735fb; font-size:18pt"> 1.2 - Experiment Scale </span>

We also need to decide how many total experiments to run, and how many should run in parallel. Below we set a very conservative maximum number of jobs so that you don't accidentally spawn large computations when starting out. For meaningful HPO searches this number should be much higher (e.g., in our experiments we often set `max_jobs` to 100). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) to run additional `max_parallel_jobs` parallel workers.

In [None]:
# please choose total number of HPO experiments [we have set this number very low to allow for automated CI testing]
max_jobs = 2

In [None]:
# please choose number of experiments that can run in parallel
max_parallel_jobs = 2

Let's also set the max duration for an individual job to 24 hours so we don't have runaway compute jobs taking too long.

In [None]:
max_duration_of_experiment_seconds = 60 * 60 * 24
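
These three settings together bound how long a search can take. The sketch below computes a rough worst-case training time, assuming every job runs for its full allowed duration; it ignores queueing, Spot interruptions, and container start-up. The function name is hypothetical.

```python
import math


def worst_case_wall_clock_seconds(max_jobs, max_parallel_jobs, max_job_seconds):
    """Upper bound on training time if every job uses its full time budget."""
    # With at most max_parallel_jobs running at once, the jobs need
    # ceil(max_jobs / max_parallel_jobs) back-to-back rounds in the worst case.
    rounds = math.ceil(max_jobs / max_parallel_jobs)
    return rounds * max_job_seconds


one_day = 60 * 60 * 24

# Settings used in this notebook: 2 jobs, 2 in parallel.
print(worst_case_wall_clock_seconds(2, 2, one_day))  # 86400

# A larger, hypothetical search: 100 jobs, 10 in parallel.
print(worst_case_wall_clock_seconds(100, 10, one_day))  # 864000
```

In practice jobs usually finish far sooner than the cap, so treat this as a ceiling for budgeting rather than an estimate.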

<span style="color:#8735fb; font-size:18pt"> 1.3 - Compute Platform </span>

Your instance type should match the needs of the workflow you have chosen. For example, the singleGPU workflow needs an instance with a GPU, such as the p3.2xlarge instance. You can [read about Amazon EC2 Instance Types here](https://aws.amazon.com/ec2/instance-types/).
> For example, for the 10_year dataset option, we suggest ml.g4dn.12xlarge instances (4 GPUs) and ml.m5.24xlarge CPU instances (we will need upwards of 200GB CPU RAM during model training).

In [None]:
# we will recommend a compute instance type, feel free to modify 
instance_type = 'ml.g4dn.2xlarge'  # recommend_instance_type(ml_workflow_choice, dataset_directory)

In addition to choosing our instance type, we can also enable significant savings by leveraging [AWS EC2 Spot Instances](https://aws.amazon.com/ec2/spot/).

We **highly recommend** that you set this flag to `True`, as it typically leads to 60-70% cost savings. Note, however, that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) to enable Spot instances in SageMaker.


In [None]:
# please choose whether spot instances should be used
use_spot_instances_flag = True
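
To see what a 60-70% saving means for a search budget, here is a small arithmetic sketch. The hourly price, job length, and job count are hypothetical placeholders, not real AWS prices; substitute the figures for your instance type and region.

```python
def estimate_cost(hourly_price, hours_per_job, num_jobs, spot_discount=0.0):
    """Total cost of a search; spot_discount is the fraction saved (0.0 to 1.0)."""
    return hourly_price * hours_per_job * num_jobs * (1.0 - spot_discount)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
hourly_price = 1.00   # assumed on-demand price in USD per hour
hours_per_job = 0.5   # assumed average job length
num_jobs = 100        # assumed size of the search

on_demand = estimate_cost(hourly_price, hours_per_job, num_jobs)
spot_best = estimate_cost(hourly_price, hours_per_job, num_jobs, spot_discount=0.70)
spot_worst = estimate_cost(hourly_price, hours_per_job, num_jobs, spot_discount=0.60)

print(f"on-demand: ${on_demand:.2f}")
print(f"spot:      ${spot_best:.2f} to ${spot_worst:.2f}")
```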

<span style="color:#8735fb; font-size:22pt"> **2. Build Estimator** </span>

To build a RAPIDS-enabled SageMaker HPO job, we first need to build a [SageMaker Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html). An Estimator is a container image that captures all the software needed to run an HPO experiment. The container is augmented with entrypoint code that will be triggered at runtime by each worker. The entrypoint code enables us to write custom models and hook them up to data.

In order to work with SageMaker HPO, the entrypoint logic should parse hyperparameters (supplied by AWS SageMaker), load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting. We've already built multiple variations of this code.

If you would like to make changes by adding your custom model logic feel free to modify the **train.py** and/or the specific workflow files in the `code/workflows` directory.
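
The entrypoint steps described above can be sketched as follows. This is a toy outline and not the `train.py` shipped with this example: it assumes hyperparameters arrive as `--name value` command-line arguments (the convention used by the SageMaker Training Toolkit), uses synthetic data, and swaps in a single-threshold stand-in where a real workflow would train XGBoost. All function names and default values are hypothetical.

```python
import argparse
import random


def parse_hyperparameters(argv):
    """Read hyperparameters passed as --name value pairs; defaults are placeholders."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--num_boost_round", type=int, default=100)
    parser.add_argument("--max_features", type=float, default=1.0)
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


def load_and_split(seed=0, n_rows=200, test_fraction=0.25):
    """Generate a synthetic dataset and split it into train and test rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = rng.random()
        rows.append((x, int(x > 0.5)))
    n_test = int(n_rows * test_fraction)
    return rows[n_test:], rows[:n_test]


def train(train_rows, hyperparameters):
    """Stand-in model: one decision threshold (the hyperparameters are ignored here)."""
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def score(threshold, test_rows):
    """Accuracy of the threshold model on held-out rows."""
    correct = sum(int(x > threshold) == y for x, y in test_rows)
    return correct / len(test_rows)


def main(argv=None):
    hyperparameters = parse_hyperparameters(argv)
    train_rows, test_rows = load_and_split()
    model = train(train_rows, hyperparameters)
    final_score = score(model, test_rows)
    # Emit the score in a fixed, parseable format so the tuner can extract it.
    print(f"final-score: {final_score};")
    return final_score


final_score = main(["--max_depth", "7"])
```

The printed line is the only contract between the entrypoint and the tuner, so its format must stay in sync with the metric regular expression used when defining the tuning job.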

First, let's switch our working directory to the location of the Estimator entrypoint and library code.

In [None]:
%cd code

<span style="color:#8735fb; font-size:20pt"> 2.1 - Containerize and Push to ECR </span>

Now let's turn to building our container so that it can integrate with the AWS SageMaker HPO API.

Let's first decide on the full name of our container.

In [None]:
rapids_base_container = 'rapidsai/rapidsai-cloud-ml:latest'

In [None]:
image_base = 'cloud-ml-sagemaker'
image_tag  = rapids_base_container.split(':')[1]

In [None]:
ecr_fullname = f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{image_base}:{image_tag}"
ecr_fullname
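
Because the image name is assembled from several lookups, a quick format check can catch a failed account or region query before the build step. The sketch below is a hedged illustration: the account ID and region are placeholders, the helper name is hypothetical, and the pattern only covers the common `amazonaws.com` form of ECR image URIs.

```python
import re


def build_ecr_image_uri(account_id, region_name, repository, tag):
    """Assemble an ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region_name}.amazonaws.com/{repository}:{tag}"


# AWS account IDs are 12 digits; the rest follows the URI layout used above.
ECR_URI_PATTERN = re.compile(
    r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/[a-z0-9._/-]+:[A-Za-z0-9._-]+$"
)

base_container = "rapidsai/rapidsai-cloud-ml:latest"
tag = base_container.split(":")[1]  # same tag extraction as in the notebook

# Placeholder account ID and region, for illustration only.
example_uri = build_ecr_image_uri("123456789012", "us-east-1", "cloud-ml-sagemaker", tag)
print(example_uri)
print(bool(ECR_URI_PATTERN.match(example_uri)))
```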

Lastly, let's ensure that our Dockerfile correctly captured our base image selection.

In [None]:
validate_dockerfile(rapids_base_container)
!cat Dockerfile

<span style="color:#8735fb; font-size:18pt"> 2.1.1 Build and Publish to ECR</span>

In order to build and push to ECR from SageMaker Studio, we must first install sm-docker, which is provided by the `sagemaker-studio-image-build` package:

In [None]:
!pip install sagemaker-studio-image-build

With sm-docker installed, we can build our custom bring-your-own Docker image directly from Amazon SageMaker Studio without worrying about the underlying setup and configuration of build services. Once the image is in ECR, SageMaker can use it to build Estimators and run experiments. We build and publish to ECR with:

In [None]:
%%time
# reuse the repository name and tag chosen above so the pushed image matches ecr_fullname
!sm-docker build . --repository {image_base}:{image_tag}

<span style="color:#8735fb; font-size:20pt"> 2.2 - Create Estimator </span>

Having built our container (with our custom logic) and pushed it to ECR, we can finally compile all of our efforts into an Estimator instance.

In [None]:
# 'volume_size' - EBS volume size in GB, default = 30
estimator_params = {
    'image_uri': ecr_fullname,
    'role': execution_role,    
    
    'instance_type': instance_type,
    'instance_count': 1,
    
    'input_mode': 'File',
    'output_path': s3_model_output,
    
    'use_spot_instances': use_spot_instances_flag,
    
    'max_run': max_duration_of_experiment_seconds, # 24 hours 
    'sagemaker_session': session,
}

if use_spot_instances_flag:
    estimator_params.update({'max_wait' : max_duration_of_experiment_seconds + 1})

In [None]:
estimator = sagemaker.estimator.Estimator(**estimator_params)

<span style="color:#8735fb; font-size:20pt"> 2.3 - Test Estimator </span>

Now we are ready to test by asking SageMaker to run the BYOContainer logic inside our Estimator. This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

> Note: This verification step will use the default hyperparameter values declared in our custom train code, as SageMaker HPO will not be orchestrating a search for this single run.

In [None]:
estimator.fit(inputs = s3_data_input)

<span style="color:#8735fb; font-size:22pt"> **3. Run HPO** </span>

With a working SageMaker Estimator in hand, the hardest part is behind us. In the key choices section we <a href='#strategy-and-param-ranges'>already defined our search strategy and hyperparameter ranges</a>, so all that remains is to choose a metric to evaluate performance on. For more documentation check out the AWS SageMaker [Hyperparameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

<span style="color:#8735fb; font-size:20pt"> 3.1 - Define Metric </span>

We focus on a single metric, which we call 'final-score', that captures the accuracy of our model on test data unseen during training. You are of course welcome to add additional metrics; see the [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric, we provide a regular expression (i.e., a string parsing rule) to extract the key metric from the output of each Estimator/worker.

In [None]:
metric_definitions = [{'Name': 'final-score', 'Regex': 'final-score: (.*);'}]

In [None]:
objective_metric_name = 'final-score'
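
Before launching a search, it can help to check the regular expression locally against a sample log line. The sketch below uses Python's `re` module as a stand-in for the matching SageMaker performs on worker logs; the log lines are hypothetical, and the tighter pattern is a suggestion rather than part of this example's code.

```python
import re

metric_regex = "final-score: (.*);"  # same pattern as in metric_definitions

# Hypothetical log line in the format the entrypoint is expected to emit.
sample_log_line = "final-score: 0.9312;"
match = re.search(metric_regex, sample_log_line)
parsed_score = float(match.group(1))
print(parsed_score)  # 0.9312

# Caveat: (.*) is greedy, so extra text ending in ';' gets swallowed.
noisy_line = "final-score: 0.93; elapsed: 12s;"
greedy_capture = re.search(metric_regex, noisy_line).group(1)
print(greedy_capture)  # 0.93; elapsed: 12s

# A tighter pattern that captures only digits and a decimal point.
strict_regex = "final-score: ([0-9.]+);"
strict_capture = re.search(strict_regex, noisy_line).group(1)
print(strict_capture)  # 0.93
```

If the entrypoint only ever prints the score followed by a single semicolon, the original pattern is sufficient.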

<span style="color:#8735fb; font-size:20pt"> 3.2 - Define Tuner </span>

Finally, we combine all of the elements we've been building up into a HyperparameterTuner declaration.

In [None]:
hpo = sagemaker.tuner.HyperparameterTuner(estimator=estimator,
                                          metric_definitions=metric_definitions, 
                                          objective_metric_name=objective_metric_name,
                                          objective_type='Maximize',
                                          hyperparameter_ranges=hyperparameter_ranges,
                                          strategy='Random',  
                                          max_jobs=max_jobs,
                                          max_parallel_jobs=max_parallel_jobs)

<span style="color:#8735fb; font-size:20pt"> 3.3 - Run HPO </span>

Let's take a moment to confirm before launching all of our HPO experiments. Depending on your configuration options, running this cell can kick off a massive amount of computation!
> Once this process begins, we recommend that you use the SageMaker UI to keep track of the <a href='../img/gpu_hpo_100x10.png'>health of the HPO process and the individual workers</a>.

In [None]:
import random
import string

tuning_job_name = 'unified-hpo-' + ''.join(random.choices(string.digits, k = 5))
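
If you change the job name prefix, it is worth validating the result before submitting. To our understanding, SageMaker limits tuning job names to 32 characters made up of letters, digits, and hyphens; treat the exact rule encoded below as an assumption and confirm it against the API reference. The helper name is hypothetical.

```python
import random
import re
import string

# Assumed naming rule: starts with an alphanumeric, hyphens allowed in between,
# ends with an alphanumeric, at most 32 characters in total.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_JOB_NAME_LENGTH = 32


def make_tuning_job_name(prefix="unified-hpo-", suffix_length=5, rng=random):
    """Build a job name with a random numeric suffix and validate it."""
    name = prefix + "".join(rng.choices(string.digits, k=suffix_length))
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid tuning job name: {name!r}")
    return name


example_name = make_tuning_job_name()
print(example_name, len(example_name))
```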

In [None]:
hpo.fit( inputs=s3_data_input,
         job_name=tuning_job_name,
         wait=True,
         logs='All')

hpo.wait()  # block until the .fit call above is completed

<span style="color:#8735fb; font-size:20pt"> 3.4 - Results and Summary </span>

Once your job is complete, there are multiple ways to analyze the results. Below we display the performance of the best job and print each HPO trial/job as a row of a dataframe.

In [None]:
hpo_results = summarize_hpo_results(tuning_job_name)

In [None]:
sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()
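
As a sketch of how the per-trial rows can be ranked, the plain-Python example below sorts trials by `FinalObjectiveValue`, the column that holds the objective metric in the analytics dataframe. The rows are made-up placeholders, not results from an actual run, and the helper name is hypothetical.

```python
# Hypothetical trial records shaped like rows of the analytics dataframe.
trials = [
    {"TrainingJobName": "unified-hpo-00001-001", "max_depth": 7, "FinalObjectiveValue": 0.912},
    {"TrainingJobName": "unified-hpo-00001-002", "max_depth": 12, "FinalObjectiveValue": 0.934},
    {"TrainingJobName": "unified-hpo-00001-003", "max_depth": 5, "FinalObjectiveValue": None},
]


def rank_trials(rows, maximize=True):
    """Drop trials without a score (e.g., failed jobs) and sort best-first."""
    scored = [row for row in rows if row["FinalObjectiveValue"] is not None]
    return sorted(scored, key=lambda row: row["FinalObjectiveValue"], reverse=maximize)


ranked = rank_trials(trials)  # objective_type is 'Maximize' in this notebook
best = ranked[0]
print(best["TrainingJobName"], best["FinalObjectiveValue"])
```

With a real dataframe, the same ranking is a one-line sort on the `FinalObjectiveValue` column.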

<span style="color:#8735fb; font-size:30pt">RAPIDS References</span>

> [cloud-ml-examples](http://github.com/rapidsai/cloud-ml-examples)

> [RAPIDS HPO](https://rapids.ai/hpo)

> [cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)

<span style="color:#8735fb; font-size:30pt">SageMaker References</span>

> [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

> [Getting Started with SageMaker Studio](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/index.html)

> [Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)

> [Docker containers with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-basics.html)

> [Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

> Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), and [blog]()