# Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker

### Import packages and create Amazon SageMaker and Boto3 sessions

In [None]:
import sagemaker
import time
import boto3

In [None]:
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client('sts').get_caller_identity().get('Account')

In [None]:
account, region

### Upload the HIGGS dataset to an S3 bucket

In [None]:
!mkdir -p dataset
!wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
!gunzip dataset/HIGGS.csv.gz

In [None]:
s3_data_dir = session.upload_data(path='dataset', key_prefix='dataset/higgs-dataset')

In [None]:
s3_data_dir

### Download latest RAPIDS container from DockerHub

To build a RAPIDS Docker container that is compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-core).

You will need to extend this container by creating a Dockerfile that copies in the training script and installs the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit), which makes RAPIDS compatible with SageMaker.

In [None]:
estimator_info = {
    'rapids_container':'rapidsai/rapidsai-core:22.12-cuda11.5-runtime-ubuntu18.04-py3.9',
    'ecr_image':'sagemaker-rapids-higgs',
    'ecr_repository':'sagemaker-rapids-higgs'
}

In [None]:
%%time
!docker pull {estimator_info['rapids_container']}

In [None]:
!cat docker/Dockerfile

In [None]:
!docker build -t sagemaker-rapids-higgs docker

In [None]:
!docker images

### Publish to Elastic Container Registry

When running a large-scale training job, either for distributed training or for independent experiments, you need to make sure that the datasets and training scripts are replicated on each instance in your cluster. Thankfully, the more painful of the two, moving datasets, is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready. You simply need to push it to a container registry, and Amazon SageMaker will then pull it onto each of the training compute instances in the cluster.

Note: By default, SageMaker pulls training images from Amazon Elastic Container Registry (Amazon ECR) rather than from an external registry such as DockerHub, so you need to push your SageMaker-compatible RAPIDS container to Amazon ECR to make it available to Amazon SageMaker.
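
As a minimal sketch of how the fully qualified ECR image name is assembled, the helper below builds the URI from its parts. The function name `ecr_image_uri`, the placeholder account ID, and the explicit `latest` tag are illustrative assumptions, not part of this notebook. The host pattern shown applies to standard AWS commercial regions.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the fully qualified URI of an image stored in Amazon ECR."""
    # Registry host for standard commercial regions: <account>.dkr.ecr.<region>.amazonaws.com
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# "123456789012" is a placeholder account ID used only for illustration.
uri = ecr_image_uri("123456789012", "us-east-1", "sagemaker-rapids-higgs")
print(uri)
```

When no tag is given to `docker tag` or `docker push`, Docker falls back to `latest`, which is why the notebook's untagged name and the tagged form above refer to the same image.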

In [None]:
ECR_container_fullname = f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"

In [None]:
ECR_container_fullname 

In [None]:
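# Tag the image built from docker/Dockerfile (not the base RAPIDS image),
# since only the built image contains the training script and the SageMaker Training toolkit.
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}

In [None]:
print( f"source      : {estimator_info['ecr_image']}\n"
       f"destination : {ECR_container_fullname}")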

In [None]:
!docker images

In [None]:
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
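# `aws ecr get-login` was removed in AWS CLI v2; `get-login-password` is the current login command.
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com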

In [None]:
!docker push {ECR_container_fullname}

### Testing your Amazon SageMaker compatible RAPIDS container locally
Before you go off and spend time and money on running a large experiment on a large cluster, it’s always wise to test things locally and make sure the container is doing what it’s supposed to do.

Define some default hyperparameters. Take your best guess. You can find the full list of RandomForest hyperparameters on the [cuML docs](https://docs.rapids.ai/api/cuml/stable/api.html#random-forest) page.

In [None]:
hyperparams={ 
    'n_estimators'       : 15,
    'max_depth'          : 5,
    'n_bins'             : 8,
    'split_criterion'    : 0,      # GINI:0, ENTROPY:1
    'bootstrap'          : 0,      # true: sample with replacement, false: sample without replacement
    'max_leaves'         : -1,     # unlimited leaves
    'max_features'       : 0.2, 
}
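
The training script itself is not shown in this notebook. As a rough sketch of how such a script could receive these values, the example below assumes the SageMaker Training toolkit passes each hyperparameter to the entry point as a string-valued command-line argument. The function names `parse_hyperparameters` and `str_to_bool` are hypothetical, and the defaults simply mirror the dictionary above.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret '1'/'true' (any case) as True; everything else as False."""
    return value.strip().lower() in ("1", "true")


def parse_hyperparameters(argv):
    """Parse hyperparameters passed as command-line strings (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--split_criterion", type=int, default=0)  # GINI:0, ENTROPY:1
    parser.add_argument("--bootstrap", type=str_to_bool, default=False)
    parser.add_argument("--max_leaves", type=int, default=-1)
    parser.add_argument("--max_features", type=float, default=0.2)
    # Ignore any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)


params = parse_hyperparameters(
    ["--n_estimators", "15", "--bootstrap", "0", "--max_features", "0.2"]
)
print(params)
```

Accepting both `0`/`1` and `True`/`False` for `bootstrap` is a design choice in this sketch so that the same parser works for numeric and boolean spellings of the flag.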

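Now, run a single Amazon SageMaker training job to ensure the container performs as expected. Note that the cell below launches one `ml.p3.2xlarge` instance. For a truly local run in SageMaker local mode, set `instance_type='local_gpu'` instead (this assumes Docker with GPU support is available on the notebook instance).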

In [None]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(image_uri=ECR_container_fullname,
                          role=execution_role,
                          instance_count=1,
                          instance_type='ml.p3.2xlarge',
                          hyperparameters=hyperparams,
                          metric_definitions=[{'Name': 'test_acc', 'Regex': 'test_acc: ([0-9\\.]+)'}])

In [None]:
%%time
rapids_estimator.fit(inputs = s3_data_dir)

Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.
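
The `test_acc` metric is captured by applying the `metric_definitions` regex to the job's log output. The sketch below applies the same pattern with Python's `re` module. The log line is a made-up example, and `extract_metric` is a hypothetical helper, so the actual output format of the training script may differ.

```python
import re

# Same pattern as the 'Regex' entry in metric_definitions.
METRIC_REGEX = r"test_acc: ([0-9\.]+)"


def extract_metric(log_line: str):
    """Return the captured metric as a float, or None if the line has no match."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log line, for illustration only.
value = extract_metric("evaluation finished, test_acc: 0.7312")
print(value)
```

If the training script prints the metric in a different format, the regex and the print statement need to be changed together, otherwise no metric is recorded.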

### Define hyperparameter ranges and run a large-scale search experiment
There aren’t many code changes required to go from a single test job to training at scale.

First, rather than define a fixed set of hyperparameters, you’ll define a range using the SageMaker SDK:

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    'n_estimators'        : IntegerParameter(10, 200), 
    'max_depth'           : IntegerParameter(1, 22),
    'n_bins'              : IntegerParameter(5, 24),
    'split_criterion'     : CategoricalParameter([0, 1]),
    'bootstrap'           : CategoricalParameter([True, False]),
    'max_features'        : ContinuousParameter(0.01, 0.5),
}
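
To make the search space concrete, the sketch below draws one candidate configuration from the same ranges using plain uniform random sampling. This only illustrates what a single trial's hyperparameters might look like. It is not how the Bayesian strategy picks candidates, and `SEARCH_SPACE` and `sample_configuration` are hypothetical names.

```python
import random

# The same ranges as above, written as plain tuples for illustration.
SEARCH_SPACE = {
    "n_estimators": ("int", 10, 200),
    "max_depth": ("int", 1, 22),
    "n_bins": ("int", 5, 24),
    "split_criterion": ("choice", [0, 1]),
    "bootstrap": ("choice", [True, False]),
    "max_features": ("float", 0.01, 0.5),
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "int":
            config[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


rng = random.Random(0)  # fixed seed so repeated runs draw the same candidate
candidate = sample_configuration(SEARCH_SPACE, rng)
print(candidate)
```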

Next, you’ll change the instance type to the actual GPU instance you want to train on in the cloud. Here you’ll choose `ml.p3.8xlarge`, an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs. If you have a training script that can leverage multiple GPUs, you can choose an instance with up to 8 GPUs for faster training.

In [None]:
from sagemaker.estimator import Estimator

rapids_estimator = Estimator(image_uri=ECR_container_fullname,
                          role=execution_role,
                          instance_count=1,
                          instance_type='ml.p3.8xlarge',
                          hyperparameters=hyperparams,
                          metric_definitions=[{'Name': 'test_acc', 'Regex': 'test_acc: ([0-9\\.]+)'}])

Now you define a HyperparameterTuner object using the estimator you defined above.

In [None]:
tuner = HyperparameterTuner(rapids_estimator,
                            objective_metric_name='test_acc',
                            hyperparameter_ranges=hyperparameter_ranges,
                            strategy='Bayesian',
                            max_jobs=1,
                            max_parallel_jobs=1,
                            objective_type='Maximize',
                            metric_definitions=[{'Name': 'test_acc', 'Regex': 'test_acc: ([0-9\\.]+)'}])

In [None]:
job_name = 'rapidsHPO' + time.strftime('%Y-%m-%d-%H-%M-%S-%j', time.gmtime())
tuner.fit({'dataset': s3_data_dir}, job_name=job_name)
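
Once the tuning job has finished, you will usually want to know which trial scored best. The SageMaker SDK offers `tuner.best_training_job()` for this. As a sketch of the underlying idea, the helper below picks the best entry from training job summaries shaped like the `TrainingJobSummaries` list returned by the Boto3 call `list_training_jobs_for_hyper_parameter_tuning_job`. The helper name `pick_best_job` and the example summaries are made up for illustration.

```python
def pick_best_job(summaries, maximize=True):
    """Return the summary with the best final objective value, or None."""
    metric_key = "FinalHyperParameterTuningJobObjectiveMetric"
    # Jobs that failed or are still running may not report a final metric yet.
    finished = [s for s in summaries if metric_key in s]
    if not finished:
        return None
    chooser = max if maximize else min
    return chooser(finished, key=lambda s: s[metric_key]["Value"])


# Hypothetical summaries; real ones would come from something like:
#   boto3.client("sagemaker").list_training_jobs_for_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName=job_name)["TrainingJobSummaries"]
example_summaries = [
    {"TrainingJobName": "job-a", "TunedHyperParameters": {"max_depth": "5"},
     "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "test_acc", "Value": 0.70}},
    {"TrainingJobName": "job-b", "TunedHyperParameters": {"max_depth": "12"},
     "FinalHyperParameterTuningJobObjectiveMetric": {"MetricName": "test_acc", "Value": 0.74}},
    {"TrainingJobName": "job-c", "TunedHyperParameters": {"max_depth": "20"}},
]

best = pick_best_job(example_summaries)
print(best["TrainingJobName"], best["TunedHyperParameters"])
```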

## Clean up

- Delete S3 buckets and files you don't need
- Kill training jobs that you don't want running
- Delete container images and the repository you just created

In [None]:
!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}