<span style="display: block;  text-align: center; color:#8735fb; font-size:22pt"> **HPO Benchmarking with RAPIDS and Dask** </span>

Hyper-Parameter Optimization (HPO) helps to find the best version of a model by exploring the space of possible configurations. While generally desirable, this search is computationally expensive and time-consuming.

In the notebook demo below, we compare benchmarking results to show how GPUs can accelerate HPO tuning jobs relative to CPUs.

For instance, we find a x speedup in wall clock time and a x reduction in cost when comparing between GPU and CPU EC2 instances on 100 XGBoost HPO trials on 3 years of the Airline Dataset.


<span style="display: block;  color:#8735fb; font-size:22pt"> **Preamble** </span>

You can set up a local environment, but we recommend launching a virtual machine on a cloud service (Azure, AWS, GCP, etc.).

For this notebook, we use an [Amazon Machine Image (AMI)](https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-tensorflow-2-12-amazon-linux-2/) as the starting point.


````{docref} /cloud/aws/
Please follow the instructions in [AWS Elastic Compute Cloud (EC2)](../../cloud/aws/ec2) to launch an EC2 instance with GPUs, the NVIDIA Driver and the NVIDIA Container Runtime.

```{note}
When configuring your instance, ensure you select the [Deep Learning AMI GPU TensorFlow or PyTorch](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) in the AMI selection box under **"Amazon Machine Image (AMI)"**.

![](../../_static/images/examples/xgboost-rf-gpu-cpu-benchmark/amazon-deeplearning-ami.png)
```

Once your instance is running and you have access to Jupyter, save this notebook and run through the cells.

````


<span style="display: block; color:#8735fb; font-size:22pt"> **ML Workflow** </span>


<span style="display: block; font-size:20pt"> Dataset </span>

We use the `Airline` dataset, a large public record of US domestic flight logs, which we offer in various sizes (1 year, 3 years, and 10 years) in [Parquet](https://parquet.apache.org/) (compressed columnar storage) format. The machine learning objective with this dataset is to predict whether a flight will arrive at its destination more than 15 minutes late.

We host the demo dataset in public S3 buckets in both the `us-east-1` and `us-west-2` regions. To reduce network latency and data transfer costs, we recommend that you access the S3 bucket in the same region as your EC2 instance.

For this demo, we use the `3_year` dataset, which includes the following features, among others:

* Locations and distance ( Origin, Dest, Distance )
* Airline / carrier ( Reporting_Airline )
* Scheduled departure and arrival times ( CRSDepTime and CRSArrTime )
* Actual departure and arrival times ( DepTime and ArrTime )
* Difference between scheduled & actual times ( ArrDelay and DepDelay )
* Binary encoded version of late, aka our target variable ( ArrDelay15 )



In [None]:
# configure aws credentials for access to S3 storage
!aws configure

In [None]:
# download dataset from S3 bucket to your current working dir
!aws s3 cp --recursive s3://sagemaker-rapids-hpo-us-west-2/3_year/ ./data/
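
The download command above points at the `us-west-2` bucket. If your instance runs in another region, the bucket location can be derived from the region name. The helper below is a small sketch: only the `us-west-2` bucket name appears in this notebook, so the `us-east-1` name is assumed to follow the same pattern, and `dataset_uri` is a hypothetical helper rather than part of the benchmark code.

```python
# Sketch: build the S3 URI for the dataset copy closest to your instance.
# ASSUMPTION: the us-east-1 bucket follows the same naming pattern as the
# us-west-2 bucket used in the download command above.
SUPPORTED_REGIONS = ("us-east-1", "us-west-2")


def dataset_uri(region: str, size: str = "3_year") -> str:
    """Return the S3 prefix for the given region and dataset size."""
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"no demo bucket known for region {region!r}")
    return f"s3://sagemaker-rapids-hpo-{region}/{size}/"


print(dataset_uri("us-west-2"))  # s3://sagemaker-rapids-hpo-us-west-2/3_year/
```

The returned string can be passed to `aws s3 cp --recursive` in place of the hard-coded path.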

<span style="display: block; font-size:20pt"> Algorithm </span>

From an ML/algorithm perspective, we offer `XGBoost` and `RandomForest`. You are free to switch between these algorithms, and everything in the example will continue to work.

We can also optionally increase robustness by reshuffling the train-test split (i.e., cross-validation folds). Typical values are between 3 and 10 folds; we will use `n_cv_folds = 5`.
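
To illustrate the idea, here is a minimal, standard-library sketch of reshuffled splits. It draws a fresh random train-test split per fold and averages the per-fold scores. The function names, the `test_fraction` value, and the per-fold seeding are assumptions made for illustration; the benchmark script may split the data differently.

```python
import random
from statistics import mean


def reshuffled_splits(n_rows, n_cv_folds=5, test_fraction=0.25):
    """Yield (train_idx, test_idx) pairs, one reshuffled split per fold."""
    n_test = int(n_rows * test_fraction)
    for i_fold in range(n_cv_folds):
        indices = list(range(n_rows))
        # Seeding with the fold number makes each split different but reproducible.
        random.Random(i_fold).shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def cross_validate(n_rows, train_and_score, n_cv_folds=5):
    """Average the score returned by `train_and_score` across all folds."""
    scores = [
        train_and_score(train_idx, test_idx)
        for train_idx, test_idx in reshuffled_splits(n_rows, n_cv_folds)
    ]
    return mean(scores)


# Usage with a dummy scorer that only reports the share of rows held out.
average_score = cross_validate(100, lambda train_idx, test_idx: len(test_idx) / 100)
print(average_score)
```

Averaging over several splits makes the reported score less sensitive to any single lucky or unlucky split, at the cost of training the model `n_cv_folds` times per trial.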

<span style="display: block; font-size:20pt"> Search Range </span>

To work with the RAPIDS container, the entrypoint logic should parse arguments, load and split data, build and train a model, score/evaluate the trained model, and emit an output representing the final score for the given hyperparameter setting.
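
The skeleton below sketches how those steps could fit together. The `--model-type` and `--mode` flags match the `hpo.py` command used to run HPO in this notebook; the accepted values `"RandomForest"` and `"cpu"`, the defaults, the printed score format, and the `load_and_split`, `train`, and `evaluate` callables are assumptions standing in for the real data and model code.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used by the `hpo.py` command in this notebook."""
    parser = argparse.ArgumentParser(description="HPO benchmark entrypoint (sketch)")
    # ASSUMPTION: the choices and defaults below are illustrative.
    parser.add_argument("--model-type", choices=["XGBoost", "RandomForest"], default="XGBoost")
    parser.add_argument("--mode", choices=["gpu", "cpu"], default="gpu")
    return parser.parse_args(argv)


def run_trial(args, params, load_and_split, train, evaluate):
    """Run the entrypoint steps once for a single hyperparameter setting.

    `load_and_split`, `train`, and `evaluate` are hypothetical callables
    supplied by the caller.
    """
    X_train, X_test, y_train, y_test = load_and_split(args.mode)
    model = train(args.model_type, args.mode, params, X_train, y_train)
    score = evaluate(model, X_test, y_test)
    print(f"score: {score}")  # emit the final score (placeholder format)
    return score
```

Passing the data and model steps in as callables keeps the control flow identical for the CPU and GPU paths, so only the implementations behind them need to change.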

`Optuna` is a hyperparameter optimization library in Python. We create an Optuna study object that provides a framework to define the search space, objective function, and optimization algorithm for the HPO process.
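
A minimal sketch of how such a study might be set up is shown below. The hyperparameter names and ranges, and the toy `train_and_score` helper, are illustrative assumptions and not the settings used by the benchmark script. The Optuna calls themselves (`create_study`, `optimize`, `suggest_int`, `suggest_float`) are part of its public API.

```python
def train_and_score(params):
    """Hypothetical stand-in for training a model and returning its score."""
    # Toy score that peaks at max_depth=8 and learning_rate=0.1.
    return 1.0 - abs(params["max_depth"] - 8) / 100 - abs(params["learning_rate"] - 0.1)


def objective(trial):
    """Sample one hyperparameter setting from the search space and score it."""
    params = {
        "max_depth": trial.suggest_int("max_depth", 4, 12),
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.5, log=True),
    }
    return train_and_score(params)


def run_study(n_trials=100):
    """Create a study that maximizes the objective and run the trials."""
    import optuna  # imported here so the functions above work without it

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

Calling `run_study()` returns the best hyperparameter setting found and its score. In a real run, `train_and_score` would train the chosen model on the persisted dataset and return its cross-validated score.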


<span style="display: block; font-size:20pt"> Local Cluster </span>

To maximize efficiency, we launch either a `LocalCluster` for CPU or a `LocalCUDACluster` that uses GPUs for distributed computing. We then connect a Dask `Client` to submit and manage computations on the cluster.

We can then submit and "persist" the dataset to the Dask client, instructing Dask to keep the dataset in memory for faster performance.
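
The following sketch shows one way to wire this up, assuming `dask`, `distributed`, and (for the GPU path) `dask_cuda` and `dask_cudf` are installed, as they are in RAPIDS containers. The function names and the `./data/` default path are illustrative; the benchmark script may configure its clusters with additional options.

```python
def start_client(mode):
    """Start a local Dask cluster for the given mode and return a connected client."""
    if mode not in ("gpu", "cpu"):
        raise ValueError(f"mode must be 'gpu' or 'cpu', got {mode!r}")

    # Imports are deferred so the CPU path does not need dask_cuda installed.
    from dask.distributed import Client, LocalCluster

    if mode == "gpu":
        from dask_cuda import LocalCUDACluster

        cluster = LocalCUDACluster()  # one worker per visible GPU by default
    else:
        cluster = LocalCluster()  # worker count chosen from available CPU cores
    return Client(cluster)


def load_and_persist(client, mode, path="./data/"):
    """Read the Parquet files lazily, then ask the workers to keep them in memory."""
    if mode == "gpu":
        import dask_cudf

        dataset = dask_cudf.read_parquet(path)
    else:
        import dask.dataframe as dd

        dataset = dd.read_parquet(path)
    return client.persist(dataset)
```

Persisting once up front means every HPO trial reuses the same in-memory partitions instead of re-reading the Parquet files from disk.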


<span style="display: block; color:#8735fb; font-size:22pt"> **Build RAPIDS Container** </span>

In [None]:
!nvidia-smi

In [None]:
!cat Dockerfile

In [None]:
!docker images

In [None]:
!docker build -t rapids-tco-benchmark:v23.06 .

In [None]:
!docker images

In [None]:
# !tmux

In [None]:
!docker run -it --gpus all -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    -v /home/ec2-user/tco_hpo_gpu_cpu_perf_benchmark:/rapids/notebooks/host \
    rapids-tco-benchmark:v23.06


<span style="display: block; color:#8735fb; font-size:22pt"> **Run HPO** </span>

Navigate to the host directory inside the container and run the Python script with the following command:

    python ./hpo.py --model-type "XGBoost" --mode "gpu"  > xgboost_gpu.txt 2>&1
