<span style="color:#8735fb; font-size:22pt"> **Demo Overview** </span>

Automated Model Tuning (AMT), also known as Hyper-Parameter Optimization (HPO), helps find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and can feel prohibitive.


In the notebook demo below, we show how SageMaker and RAPIDS working together can tackle model tuning in two ways at once: by accelerating compute within a node's GPUs, and by accelerating the search itself across sets of cloud nodes running parallel experiments.

For example, we find a **12x** speedup in wall clock time (6 hours vs 3+ days) and a **4.5x** reduction in cost when comparing between GPU and CPU EC2 Spot instances on 100 XGBoost AMT/HPO trials using 10 parallel workers on 10 years of the Airline Dataset.

For more, check out our [AWS blog](https://aws.amazon.com/blogs/machine-learning/rapids-and-amazon-sagemaker-scale-up-and-scale-out-to-tackle-ml-challenges/).

<img src='img/v2_overview.png'>

<span style="color:#8735fb; font-size:22pt"> **Preamble** </span>

To get things rolling, let's make sure we can query our AWS SageMaker execution role and session, as well as our account ID and AWS region.

In [1]:
!docker images

REPOSITORY                                                      TAG                                        IMAGE ID       CREATED        SIZE
561241433344.dkr.ecr.us-west-2.amazonaws.com/rapids-sagemaker   22.12-cuda11.5-runtime-ubuntu18.04-py3.9   10246f7a77b9   18 hours ago   13.4GB
rapidsai/rapidsai-core                                          22.12-cuda11.5-runtime-ubuntu18.04-py3.9   9de590bd08c5   4 weeks ago    13.1GB


In [2]:
import sagemaker
import string
import random

In [3]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

account=!(aws sts get-caller-identity --query Account --output text)
region=!(aws configure get region)

In [4]:
account, region

(['561241433344'], ['us-west-2'])

<span style="display: block; color:#8735fb; font-size:22pt"> 1. RAPIDS Cloud ML Container </span>

Next let us download the latest RAPIDS container pre-integrated with the libraries needed for SageMaker integration (e.g., [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit)).

This container also contains the latest code from our [rapidsai/cloud-ml-examples](https://github.com/rapidsai/cloud-ml-examples) repository which will run inside the container. We'll go over the code details in the next section.

In [5]:
estimator_info = {
    'rapids_container': 'rapidsai/rapidsai-cloud-ml:latest',
    'ecr_image': 'sagemaker-rapids-cloud-ml:latest',
    'ecr_repository': 'sagemaker-rapids-cloud-ml'
}

In [6]:
%%time
!docker pull {estimator_info['rapids_container']}

latest: Pulling from rapidsai/rapidsai-cloud-ml

17ec1767: Pulling fs layer
c6d5be4b: Pulling fs layer
ef873090: Pulling fs layer
23ebd5e4: Pulling fs layer
5dbeeff1: Pulling fs layer
a89506ba: Pulling fs layer
ebeca2e9: Pulling fs layer
b64eb750: Pulling fs layer
505799c4: Pulling fs layer
fc106c51: Pulling fs layer
Digest: sha256:5c32980fafebea8f4aadde11a6f25cc9202180d798493ae130161285587ffa25

Once we have pulled the RAPIDS cloud container, we need to publish it to the Amazon Elastic Container Registry (ECR) so that it can be used by SageMaker. The full name of the container on ECR will be as follows:

In [8]:
ECR_container_fullname = f"{account[0]}.dkr.ecr.{region[0]}.amazonaws.com/{estimator_info['ecr_image']}"

In [9]:
ECR_container_fullname

'561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-cloud-ml:latest'

Next we can tag this container with its full ECR name.

In [10]:
!docker tag {estimator_info['rapids_container']} {ECR_container_fullname}

In [11]:
print( f"source      : {estimator_info['rapids_container']}\n"
       f"destination : {ECR_container_fullname}")

source      : rapidsai/rapidsai-cloud-ml:latest
destination : 561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-cloud-ml:latest


In [12]:
!docker images

REPOSITORY                                                               TAG                                        IMAGE ID       CREATED        SIZE
561241433344.dkr.ecr.us-west-2.amazonaws.com/rapids-sagemaker            22.12-cuda11.5-runtime-ubuntu18.04-py3.9   10246f7a77b9   18 hours ago   13.4GB
rapidsai/rapidsai-core                                                   22.12-cuda11.5-runtime-ubuntu18.04-py3.9   9de590bd08c5   4 weeks ago    13.1GB
561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-cloud-ml   latest                                     46e43ff6d6e3   2 months ago   10.6GB
rapidsai/rapidsai-cloud-ml                                               latest                                     46e43ff6d6e3   2 months ago   10.6GB


<span style="color:#8735fb; font-size:18pt"> 1.3 - Publish to Elastic Container Registry (ECR) </span>

Now that we've pulled and tagged our container, it's time to push it to Amazon's container registry (ECR). Once in ECR, AWS SageMaker will be able to leverage our image to build Estimators and run experiments.


Docker Login to ECR

In [13]:
docker_login_str = !(aws ecr get-login --region {region[0]} --no-include-email)

In [14]:
!{docker_login_str[0]}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
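
The `aws ecr get-login` command used above is only available in AWS CLI v1; in AWS CLI v2 it was replaced by `aws ecr get-login-password` piped into `docker login`. The sketch below builds the equivalent v2 command string. The helper name `ecr_login_command` is hypothetical and not part of the original notebook.

```python
def ecr_login_command(account_id: str, region_name: str) -> str:
    """Build the AWS CLI v2 shell command that logs Docker in to ECR.

    Hypothetical helper for illustration; it only assembles a string.
    """
    registry = f"{account_id}.dkr.ecr.{region_name}.amazonaws.com"
    return (
        f"aws ecr get-login-password --region {region_name} | "
        f"docker login --username AWS --password-stdin {registry}"
    )


# account ID and region taken from the notebook output above
print(ecr_login_command("561241433344", "us-west-2"))
```

In a notebook cell, the resulting string could presumably be run the same way as the v1 login string, e.g. `!{ecr_login_command(account[0], region[0])}`.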


Create ECR repository [if it doesn't already exist]

In [15]:
repository_query = !(aws ecr describe-repositories --repository-names {estimator_info['ecr_repository']})
if repository_query[0] == '':
    !(aws ecr create-repository --repository-name {estimator_info['ecr_repository']})

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:561241433344:repository/sagemaker-rapids-cloud-ml",
        "registryId": "561241433344",
        "repositoryName": "sagemaker-rapids-cloud-ml",
        "repositoryUri": "561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-cloud-ml",
        "createdAt": 1673574994.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
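
The cell above decides whether to create the repository by inspecting the first line of the CLI output. As an alternative sketch, the same "create if missing" step can be written against a boto3 ECR client, which raises `RepositoryNotFoundException` when the repository does not exist. The function name `ensure_ecr_repository` is hypothetical, and the client is passed in as an argument so the logic can be exercised without AWS credentials.

```python
def ensure_ecr_repository(ecr_client, repository_name: str) -> bool:
    """Create the ECR repository if it is missing.

    Returns True if the repository was created, False if it already existed.
    `ecr_client` is assumed to behave like boto3.client("ecr").
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[repository_name])
        return False  # repository already exists, nothing to do
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=repository_name)
        return True


# Example usage (requires boto3 and valid AWS credentials):
#   import boto3
#   ensure_ecr_repository(boto3.client("ecr"), estimator_info["ecr_repository"])
```

Catching the specific exception, rather than parsing CLI text, keeps other failures (for example, missing permissions) visible instead of silently triggering a create call.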


Let's now actually push the container to ECR
> Note the first push to ECR may take some time (hopefully less than 10 minutes).

In [16]:
%%time
!docker push {ECR_container_fullname}

The push refers to repository [561241433344.dkr.ecr.us-west-2.amazonaws.com/sagemaker-rapids-cloud-ml]

e327d6f2: Preparing
982c88f5: Preparing
4b6603aa: Preparing
c8dc2a3f: Preparing
7f57a846: Preparing
25675d02: Preparing
3ab326f0: Preparing
53fde97f: Preparing
ab64d304: Preparing
2e83a678: Preparing
82c88f5: Pushed   5.889GB/5.829GB

<span style="color:#8735fb; font-size:22pt"> 2. Data Science Workflow </span>

The data source for this workflow is 3 years of the [Airline On-Time Statistics](https://www.transtats.bts.gov/ONTIME/) dataset from the US Bureau of Transportation Statistics.

The machine learning objective is to predict whether flights will be more than 15 minutes late arriving to their destination. 

The flow of logic can follow two paths:

In **training mode**, the code loads the dataset, drops flights with missing values, splits the data, trains an XGBoost model, and evaluates its performance on the held-out data.

In **serving or inference mode**, the container runs a Flask server which listens for inputs, parses them, and predicts with the trained model loaded in cache.

<span style="color:#8735fb; font-size:20pt"> 2.1 - Default Choices </span>

There are choices that have been pre-selected in this demo -- we list them below and provide a brief description for each.

If you would like to learn more about these choices and be able to change them, please check out the [extended demo notebook](https://github.com/rapidsai/cloud-ml-examples/blob/main/aws/rapids_sagemaker_hpo_extended.ipynb).


Choice | Demo/Default | Other Option(s)
-------|--------|---------
Model  | XGBoost | RandomForest, KMeans
Datasize  | 3 years | 1, 3, or 10 years
Compute/Code | GPU | CPU, GPU/CPU + Dask

> Note: We host the demo datasets in public S3 demo buckets in both the **us-east-1** (N. Virginia) and **us-west-2** (Oregon) regions (i.e., `sagemaker-rapids-hpo-us-east-1` and `sagemaker-rapids-hpo-us-west-2`). You should run the SageMaker HPO workflow in one of these two regions if you wish to leverage the demo datasets, since SageMaker requires that the S3 dataset and the compute you'll be renting are co-located.
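
Because the demo buckets exist only in those two regions, it can help to fail early when the notebook is running elsewhere. A minimal sketch of such a check follows; the helper name `demo_bucket_for_region` and the constant `SUPPORTED_DEMO_REGIONS` are hypothetical, while the bucket naming pattern is the one given in the note above.

```python
SUPPORTED_DEMO_REGIONS = ("us-east-1", "us-west-2")


def demo_bucket_for_region(region_name: str) -> str:
    """Return the public demo bucket name for a region, or raise if unsupported."""
    if region_name not in SUPPORTED_DEMO_REGIONS:
        raise ValueError(
            f"demo datasets are hosted only in {SUPPORTED_DEMO_REGIONS}, "
            f"got {region_name!r}"
        )
    return f"sagemaker-rapids-hpo-{region_name}"


print(demo_bucket_for_region("us-west-2"))
```

With the variables defined earlier in this notebook, the call would be `demo_bucket_for_region(region[0])`.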


<span style="color:#8735fb; font-size:20pt"> 2.2 - Active Choices </span>

In [17]:
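# note: this path selects the 1-year subset (`1_year`), not the 3-year default listed in section 2.1
s3_data_input = f"s3://sagemaker-rapids-hpo-{region[0]}/1_year"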
s3_model_output = f"s3://{session.default_bucket()}/trained-models"

<span style="color:#8735fb; font-size:18pt"> [ Search Ranges and Strategy ] </span>
<a id='strategy-and-param-ranges'></a>

One of the most important choices when running HPO is the bounds of the hyperparameter search. Below we've set the ranges of the hyperparameters to allow for interesting variation. You are of course welcome to revise these ranges based on domain knowledge, especially if you plan to plug in your own dataset.

In [18]:
# please choose HPO search ranges
hyperparameter_ranges = {
    'max_depth': sagemaker.parameter.IntegerParameter(5, 15),
    'num_boost_round': sagemaker.parameter.IntegerParameter(100, 500),
    'max_features': sagemaker.parameter.ContinuousParameter(0.1, 1.0),
}
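
To build intuition for what a random search over these ranges looks like, the sketch below draws a few candidate configurations locally using only the standard library. It is an illustration, not SageMaker's implementation: the `search_space` encoding and the `sample_configuration` helper are hypothetical, and treating both bounds as inclusive is an assumption of this sketch.

```python
import random

# Local stand-in for the ranges above: (kind, low, high) per hyperparameter.
search_space = {
    "max_depth": ("int", 5, 15),
    "num_boost_round": ("int", 100, 500),
    "max_features": ("float", 0.1, 1.0),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config


rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_configuration(search_space, rng))
```

Each printed dictionary corresponds to the hyperparameters one training job would receive; narrowing a range concentrates the same number of jobs on a smaller region of the space.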

<span style="color:#8735fb; font-size:18pt"> [ Experiment Scale ] </span>

We also need to decide how many total experiments to run, and how many should run in parallel. Below we set a very conservative maximum number of jobs so that you don't accidentally spawn large computations when starting out; for meaningful HPO searches this number should be much higher (e.g., in our experiments we often set `max_jobs` to 100). Note that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) to run additional parallel workers (`max_parallel_jobs`).

In [19]:
# please choose total number of HPO experiments[ we have set this number very low to allow for automated CI testing ]
max_jobs = 2

In [20]:
# please choose number of experiments that can run in parallel
max_parallel_jobs = 2
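
As a back-of-the-envelope aid for choosing these two numbers, the sketch below computes how many sequential "waves" of jobs a search needs and a rough wall-clock estimate from that. The helper names are hypothetical, and the estimate ignores instance start-up time and Spot interruptions.

```python
import math


def tuning_waves(max_jobs: int, max_parallel_jobs: int) -> int:
    """Number of sequential rounds needed to run all jobs."""
    return math.ceil(max_jobs / max_parallel_jobs)


def rough_wall_clock_hours(max_jobs: int, max_parallel_jobs: int,
                           hours_per_job: float) -> float:
    """Rough estimate: waves multiplied by the duration of one job."""
    return tuning_waves(max_jobs, max_parallel_jobs) * hours_per_job


# 100 trials on 10 parallel workers (the configuration quoted in the overview)
print(tuning_waves(100, 10))
# the conservative demo defaults set above
print(tuning_waves(2, 2))
```

Doubling `max_parallel_jobs` roughly halves the number of waves, at the price of more instances running at the same time.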

Let's also set the max duration for an individual job to 24 hours so we don't have run-away compute jobs taking too long.

In [21]:
max_duration_of_experiment_seconds = 60 * 60 * 24

<span style="color:#8735fb; font-size:18pt"> [ Compute Platform ] </span>

Based on the dataset size and compute choice, we recommend an instance type below; you are of course welcome to select alternate configurations.
> e.g., for the 10_year dataset option, we suggest ml.p3.8xlarge instances (4 GPUs) and ml.m5.24xlarge CPU instances (we will need upwards of 200GB CPU RAM during model training).

In [23]:
# we will recommend a compute instance type, feel free to modify 
instance_type = 'ml.p3.2xlarge' #recommend_instance_type(ml_workflow_choice, dataset_directory) 

In addition to choosing our instance type, we can also enable significant savings by leveraging [AWS EC2 Spot Instances](https://aws.amazon.com/ec2/spot/).

We **highly recommend** that you set this flag to `True`, as it typically leads to 60-70% cost savings. Note, however, that you may need to request a [quota limit increase](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) to enable Spot instances in SageMaker.


In [24]:
# please choose whether spot instances should be used
use_spot_instances_flag = False

<span style="color:#8735fb; font-size:20pt"> 2.2 - Create Estimator </span>

Having built our container [+ custom logic] and pushed it to ECR, we can finally compile all of our efforts into an Estimator instance.

In [25]:
# 'volume_size' - EBS volume size in GB, default = 30
estimator_params = {
    'image_uri': ECR_container_fullname,
    'role': execution_role,    
    
    'instance_type': instance_type,
    'instance_count': 1,
    
    'input_mode': 'File',
    'output_path': s3_model_output,
    
    'use_spot_instances': use_spot_instances_flag,
    
    'max_run': max_duration_of_experiment_seconds, # 24 hours 
    'sagemaker_session': session,
}

# with spot instances, max_wait must be at least as large as max_run
if use_spot_instances_flag:
    estimator_params.update({'max_wait': max_duration_of_experiment_seconds + 1})

In [26]:
estimator = sagemaker.estimator.Estimator(**estimator_params)

<span style="color:#8735fb; font-size:20pt"> 2.3 - Test Estimator </span>

Now we are ready to test by asking SageMaker to run the BYOContainer logic inside our Estimator. This is a useful step if you've made changes to your custom logic and are interested in making sure everything works before launching a large HPO search. 

> Note: This verification step will use the default hyperparameter values declared in our custom train code, as SageMaker HPO will not be orchestrating a search for this single run.

In [28]:
estimator.fit(inputs = s3_data_input)

INFO:sagemaker:Creating training-job with name: sagemaker-rapids-cloud-ml-2023-01-13-02-09-54-873


2023-01-13 02:09:55 Starting - Starting the training job......
2023-01-13 02:10:28 Starting - Preparing the instances for training......
2023-01-13 02:11:54 Downloading - Downloading input data
2023-01-13 02:11:54 Training - Downloading the training image...............
2023-01-13 02:14:00 Training - Training image download completed. Training in progress..
Running SageMaker HPO entrypoint.
@ entrypoint -> launching training script
2023-01-13 02:14:24,217     INFO hpo_log
parsing configuration from environment settings...
2023-01-13 02:14:24,217     INFO hpo_log   Dataset: Airline
  Compute: single-GPU
  Algorithm: XGBoost
  CV_folds: 3
2023-01-13 02:14:24,217     INFO hpo_log parsing model hyperparameters from command line arguments...log
2023-01-13 02:14:24,218     INFO hpo_log {    'gamma': 0.0,
     'lambda': 1,
     'learning_rate': 0.3,
     'max_depth': 5,
     'num_boost_round': 10,
     'objective': 'binary:logistic',

UnexpectedStatusException: Error for Training job sagemaker-rapids-cloud-ml-2023-01-13-02-09-54-873: Failed. Reason: AlgorithmError: , exit code: 255

<span style="display: block; text-align: center; color:#8735fb; font-size:30pt"> **3. Run HPO** </span>

With a working SageMaker Estimator in hand, the hardest part is behind us. In the key choices section we <a href='#strategy-and-param-ranges'>already defined our search strategy and hyperparameter ranges</a>, so all that remains is to choose a metric to evaluate performance on. For more documentation check out the AWS SageMaker [Hyperparameter Tuner documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

<span style="color:#8735fb; font-size:20pt"> 3.1 - Define Metric </span>

We focus on a single metric, which we call 'final-score', that captures the accuracy of our model on test data unseen during training. You are of course welcome to add additional metrics; see the [AWS SageMaker documentation on Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html). When defining a metric, we provide a regular expression (i.e., a string parsing rule) to extract the key metric from the output of each Estimator/worker.

In [None]:
metric_definitions = [{'Name': 'final-score', 'Regex': 'final-score: (.*);'}]

In [None]:
objective_metric_name = 'final-score'
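
To see which text the capture group in the regular expression above picks up, the pattern can be tried locally with Python's `re` module. This is only a local illustration of the pattern; the sample log line and the helper `extract_final_score` are hypothetical, and the exact line the training script emits may differ.

```python
import re

# same pattern as in metric_definitions above
metric_regex = re.compile(r"final-score: (.*);")


def extract_final_score(log_line: str):
    """Return the captured score as a float, or None if the line has no metric."""
    match = metric_regex.search(log_line)
    return float(match.group(1)) if match else None


# hypothetical log line, made up for illustration
sample_line = "2023-01-13 02:20:00 INFO hpo_log final-score: 0.9231;"
print(extract_final_score(sample_line))
```

Note that `(.*)` is greedy, so a line containing more than one semicolon after the score would capture more than the number; a narrower group such as `([0-9.]+)` is one way to tighten it.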

<span style="color:#8735fb; font-size:20pt"> 3.2 - Define Tuner </span>

Finally we put all of the elements we've been building up together into a HyperparameterTuner declaration. 

In [None]:
hpo = sagemaker.tuner.HyperparameterTuner(estimator=estimator,
                                          metric_definitions=metric_definitions, 
                                          objective_metric_name=objective_metric_name,
                                          objective_type='Maximize',
                                          hyperparameter_ranges=hyperparameter_ranges,
                                          strategy='Random',  
                                          max_jobs=max_jobs,
                                          max_parallel_jobs=max_parallel_jobs)

<span style="color:#8735fb; font-size:20pt"> 3.3 - Run HPO </span>

Let's be sure we take a moment to confirm before launching all of our HPO experiments. Depending on your configuration options running this cell can kick off a massive amount of computation!
> Once this process begins, we recommend that you use the SageMaker UI to keep track of the <a href='../img/gpu_hpo_100x10.png'>health of the HPO process and the individual workers</a>.

In [None]:
tuning_job_name = 'unified-hpo-19-' + ''.join(random.choices(string.digits, k = 5))

In [None]:
hpo.fit( inputs=s3_data_input,
         job_name=tuning_job_name,
         wait=True,
         logs='All')

hpo.wait()  # block until the .fit call above is completed

<span style="color:#8735fb; font-size:20pt"> 3.4 - Results and Summary </span>

In [None]:
sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name).dataframe()

<span style="color:#8735fb; font-size:30pt">RAPIDS References</span>

> [cloud-ml-examples](http://github.com/rapidsai/cloud-ml-examples)

> [RAPIDS HPO](https://rapids.ai/hpo)

> [cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)

<span style="color:#8735fb; font-size:30pt">SageMaker References</span>

> [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)

> [Estimator Parameters](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

> Spot Instances [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), and [blog]()