<span style="display: block;  text-align: center; color:#8735fb; font-size:22pt"> **HPO Benchmarking with RAPIDS and Dask** </span>

Hyper-Parameter Optimization (HPO) helps to find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and time-consuming.

In the notebook demo below, we compare benchmarking results to show how RAPIDS can accelerate HPO tuning jobs relative to CPU.

For instance, we find a x speedup in wall clock time (6 hours vs 3+ days) and a x reduction in cost when comparing between GPU and CPU EC2 instances on 100 XGBoost HPO trials using 10 parallel workers on 10 years of the Airline Dataset.

For more details, check out our AWS blog (link).

<span style="display: block;  color:#8735fb; font-size:22pt"> **1. Preamble** </span>

<span style="display: block; color:#8735fb; font-size:20pt"> 1.1 Create EC2 instance </span>

Create a new EC2 instance with GPUs, the NVIDIA driver, and the NVIDIA container runtime.

Amazon maintains an [Amazon Machine Image (AMI)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-tensorflow-2-12-amazon-linux-2/) that comes with the NVIDIA drivers and container runtime pre-installed. We recommend using this image as the starting point.

1. **Open the EC2 Dashboard**.

2. **Select Launch Instance**.

3. In the AMI selection box under **"Amazon Machine Image (AMI)"**, select the [Deep Learning AMI GPU TensorFlow or PyTorch](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html).

   <img src='img/launch-ec2.png'>

4. Choose a **RAPIDS-compatible instance type**. The GPU must be Pascal architecture or newer (e.g. "p3.8xlarge").

5. Select your SSH key pair (create one if you haven’t already).

6. Under network settings, create a security group, or choose an existing one, that allows SSH access on port 22.

7. Review and **Launch**.



<span style="display: block; color:#8735fb; font-size:20pt"> 1.2 Connect to the instance </span>

Next we need to connect to the instance.

1. Open the EC2 Dashboard.

2. Locate your VM and note the Public IP Address.

3. In your terminal, run `ssh -i <key-pair-name> ec2-user@<ip-address>`.

<span style="display: block; color:#8735fb; font-size:22pt"> **2. ML Workflow** </span>

<span style="display: block; color:#8735fb; font-size:20pt"> 2.1 - Dataset </span>

The data source for this workflow is 3 years of the [Airline On-Time Statistics](https://www.transtats.bts.gov/ONTIME/) dataset from the US Bureau of Transportation Statistics.

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* Locations and distance ( Origin, Dest, Distance )
* Airline / carrier ( Reporting_Airline )
* Scheduled departure and arrival times ( CRSDepTime and CRSArrTime )
* Actual departure and arrival times ( DepTime and ArrTime )
* Difference between scheduled & actual times ( ArrDelay and DepDelay )
* Binary encoded version of late, aka our target variable ( ArrDelay15 )
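
As a rough illustration of how the target relates to the delay columns, the sketch below derives a binary "late" label from `ArrDelay` in plain Python. The 15-minute threshold is an assumption inferred from the column name, and the sample records and the `encode_late` helper are made up for illustration; they are not taken from the dataset or the workflow scripts.

```python
# Hypothetical sample records; the values are illustrative only.
flights = [
    {"Origin": "SFO", "Dest": "JFK", "ArrDelay": 42.0},
    {"Origin": "ATL", "Dest": "ORD", "ArrDelay": -5.0},
    {"Origin": "SEA", "Dest": "DEN", "ArrDelay": 15.0},
]

LATE_THRESHOLD_MINUTES = 15  # assumed cutoff, inferred from the column name


def encode_late(arr_delay, threshold=LATE_THRESHOLD_MINUTES):
    """Return 1 if the flight arrived at least `threshold` minutes late, else 0."""
    return int(arr_delay >= threshold)


labels = [encode_late(flight["ArrDelay"]) for flight in flights]
print(labels)  # [1, 0, 1]
```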



In [None]:
# !aws configure

In [None]:
## DOWNLOAD THE DATASET
!aws s3 cp --recursive s3://sagemaker-rapids-hpo-us-west-2/3_year/ ./data/

<span style="display: block; color:#8735fb; font-size:20pt"> 2.2 - LocalCUDACluster </span>

To maximize efficiency, we launch a `LocalCUDACluster` that uses the available GPUs for distributed computing, then connect a Dask `Client` to submit and manage computations on the cluster. Refer to this (link) for more information on how to achieve this.

We then submit the dataset to the Dask client and instruct Dask to keep it in memory for the duration of the run. This can improve performance by avoiding unnecessary data transfers during the HPO process.

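    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            dataset = ingest_data()
            # persist() returns the in-memory collection, so keep the result
            dataset = client.persist(dataset)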
    

<span style="display: block; color:#8735fb; font-size:20pt"> 2.3 - Python ML Workflow </span>

To work with the RAPIDS container, the entrypoint logic should parse the model-type parameter (supplied manually when the script is run), load and split the data, build and train a model, score the trained model, and emit the final score for the given hyperparameter setting.
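
The entrypoint flow described above can be sketched roughly as follows. This is not the contents of `hpo_gpu.py` or `hpo_cpu.py`: the `--model-type` flag name, the model choices, the synthetic data, and the threshold "model" are all hypothetical stand-ins, chosen so the skeleton runs without a GPU or the dataset.

```python
import argparse
import random


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="HPO entrypoint sketch")
    # `--model-type` and its choices are hypothetical names for illustration.
    parser.add_argument("--model-type", default="XGBoost",
                        choices=["XGBoost", "RandomForest"])
    return parser.parse_args(argv)


def load_and_split(n_rows=1000, test_fraction=0.25, seed=0):
    """Stand-in for data ingestion: synthetic (feature, label) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = rng.random()
        rows.append((x, int(x > 0.5)))
    split = int(n_rows * (1 - test_fraction))
    return rows[:split], rows[split:]


def train(model_type, train_rows):
    """Stand-in for model fitting: use the mean feature value as a threshold."""
    threshold = sum(x for x, _ in train_rows) / len(train_rows)
    return {"model_type": model_type, "threshold": threshold}


def score(model, test_rows):
    """Accuracy of the threshold rule on the held-out rows."""
    correct = sum(int(x > model["threshold"]) == y for x, y in test_rows)
    return correct / len(test_rows)


def main(argv=None):
    args = parse_args(argv)
    train_rows, test_rows = load_and_split()
    model = train(args.model_type, train_rows)
    final_score = score(model, test_rows)
    # Emit the final score so a tuner (or a log parser) can pick it up.
    print(f"final-score: {final_score:.4f}")
    return final_score


main(["--model-type", "XGBoost"])
```

In a real script, the last line would typically call `main()` so that the model type is read from the command line.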

`Optuna` is a hyperparameter optimization library for Python. We create an Optuna `study` object, which provides a framework for defining the search space, objective function, and optimization algorithm for the HPO process.

In [1]:
%cd code

/home/skirui/home/skirui/tco_hpo_gpu_cpu_perf_benchmark/code


In [2]:
ls

Dockerfile  hpo_cpu.py  hpo_gpu.py


<span style="display: block; color:#8735fb; font-size:22pt"> **3. Build RAPIDS Container** </span>

<span style="display: block; color:#8735fb; font-size:20pt"> 3.1 - Containerize and Build </span>

Build from the latest RAPIDS container and install any other necessary dependencies. Your Dockerfile should look something like this:

In [3]:
cat Dockerfile

# FROM nvcr.io/nvidia/rapidsai/rapidsai-core:23.06-cuda11.5-runtime-ubuntu20.04-py3.10

FROM rapidsai/rapidsai:23.06-cuda11.5-runtime-ubuntu20.04-py3.10

#FROM rapidsai/rapidsai:23.06-cuda11.8-runtime-ubuntu22.04-py3.10

RUN mamba install -y -n rapids optuna


In [4]:
!docker images

REPOSITORY                              TAG                                         IMAGE ID       CREATED       SIZE
rapidsai/rapidsai                       23.06-cuda11.8-runtime-ubuntu22.04-py3.10   027b84613f4f   2 weeks ago   17.3GB
rapidsai/rapidsai                       23.06-cuda11.5-runtime-ubuntu20.04-py3.10   4db4b31a94fc   2 weeks ago   16.1GB
nvcr.io/nvidia/rapidsai/rapidsai-core   23.06-cuda11.5-runtime-ubuntu20.04-py3.10   597389c5fd96   2 weeks ago   16GB
<none>                                  <none>                                      d5a111ce0d10   3 weeks ago   4.18GB
rapidsai/rapidsai-core                  23.06-cuda11.5-runtime-ubuntu20.04-py3.10   8a9b2eee5b32   3 weeks ago   16GB
databricksruntime/gpu-conda             cuda11                                      cd895161062c   2 years ago   4.18GB


In [7]:
!nvidia-smi

Tue Jul 11 21:27:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 8000     On   | 00000000:15:00.0 Off |                  Off |
| 34%   41C    P8    24W / 260W |     10MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:2D:00.0  On |                  Off |
| 33%   47C    P8    30W / 260W |    706MiB / 48593MiB |      1%      Default |
|       

In [None]:
!docker build -t rapids-tco-benchmark:v23.06 -f Dockerfile .

Sending build context to Docker daemon  15.87kB
Step 1/2 : FROM rapidsai/rapidsai:23.06-cuda11.5-runtime-ubuntu20.04-py3.10
 ---> 4db4b31a94fc
Step 2/2 : RUN mamba install -y -n rapids optuna
 ---> Running in b73155fe6a06
Transaction

  Prefix: /opt/conda/envs/rapids

  Updating specs:

   - optuna
   - ca-certificates
   - certifi
   - openssl


  Package       Version  Build            Channel                   Size
──────────────────────────────────────────────────────────────────────────
  Install:
──────────────────────────────────────────────────────────────────────────

  + alembic      1.11.1  pyhd8ed1ab_0     conda-forge/noarch       154kB
  + cmaes         0.9.1  pyhd8ed1ab_0     conda-forge/noarch        21kB
  + colorlog      6.7.0  py310hff52083_1  conda-forge/linux-64      18kB
  + greenlet      2.0.2  py310hc6cd4ac_1  conda-forge/linux-64     191kB
  + mako          1.2.4  pyhd8ed1ab_0     conda-forge/noarch        63kB
  + optuna        3.2.0  pyhd8ed1ab_0     conda-for

In [None]:

!nvidia-smi

<span style="display: block; color:#8735fb; font-size:22pt"> **4. Run HPO** </span>

This section should cover the code snippets and what the code does:
* Define the metric
* Define the tuner
* Run
* Results and summary


- Start a tmux session.
- Run the container with the `-v` mount option, exposing the NVIDIA GPUs and the Jupyter port.
- Run the Python scripts inside the container, passing the model type as an argument and writing the benchmark results to a text file.
- Wait until the experiments finish running.
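
The container and script invocations in the steps above could be assembled along these lines. This sketch only builds and prints the commands; it does not run them. The image tag comes from the build step earlier, while the host path, the `/workspace` mount point, the Jupyter port 8888, the `--model-type` flag name, and the results file name are assumptions for illustration.

```python
import shlex

IMAGE = "rapids-tco-benchmark:v23.06"  # tag used in the docker build step


def docker_run_command(host_dir, container_dir="/workspace", jupyter_port=8888):
    """Build a `docker run` command that mounts a directory and exposes GPUs and Jupyter."""
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                      # expose all NVIDIA GPUs
        "-p", f"{jupyter_port}:8888",         # assumes Jupyter listens on 8888
        "-v", f"{host_dir}:{container_dir}",  # mount code and data
        IMAGE,
    ]


def benchmark_command(script, model_type, results_file):
    """Build the in-container shell command; `--model-type` is a hypothetical flag."""
    return (
        f"python {shlex.quote(script)} --model-type {shlex.quote(model_type)} "
        f"> {shlex.quote(results_file)}"
    )


# Hypothetical host path; replace it with the directory holding the code and data.
print(shlex.join(docker_run_command("/path/to/project")))
print(benchmark_command("hpo_gpu.py", "XGBoost", "gpu_results.txt"))
```

Running the printed commands inside a tmux session keeps the experiments alive if the SSH connection drops.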

<span style="display: block; color:#8735fb; font-size:22pt"> **5. Cleanup** </span>