# Ray AIR Lab Exercises

## AIR experiments: easy, medium, hard

In this notebook, we will start with a minimal Ray AIR workflow and try to modify it.

## Starting template

As a reminder and starting point, we can review the AIR workflow diagram and the minimal workflow code:

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Introduction_to_Ray_AIR/e2e_air.png" width=600 loading="lazy"/>

In [None]:
import os
import zipfile
import requests

import ray
from ray import tune
from ray import serve
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from ray.train.xgboost import XGBoostPredictor
from ray.train.batch_predictor import BatchPredictor
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import pandas_read_json
from ray.tune import Tuner, TuneConfig

ray.init()

__Read, preprocess with Ray Data__

In [None]:
dataset = ray.data.read_parquet("s3://anonymous@anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet").repartition(16)

train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

__Fit model with Ray Train__

In [None]:
trainer = XGBoostTrainer(
    label_column="is_big_tip",
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    params={ "objective": "binary:logistic", },
    datasets={"train": train_dataset, "valid": valid_dataset},
)

result = trainer.fit()

__Optimize hyperparams with Ray Tune__

In [None]:
tuner = Tuner(
    trainer,
    param_space={"params": {"max_depth": tune.randint(2, 12)}},
    tune_config=TuneConfig(num_samples=4, metric="train-logloss", mode="min"),
)

checkpoint = tuner.fit().get_best_result().checkpoint

## Labs: 3 options

There are three options to choose from depending on your experience and interest -- and of course if you have extra time you could try all three.

### Basic lab: modify XGBoost train/tune scaling

Add scale in two ways: first run more trials, then use more workers per trial.
1. Change the TuneConfig to generate more samples
1. Observe the Ray cluster autoscale
1. Confirm that XGBoost training is happening using multiple cores __and__ that multiple XGBoost models (trials) are being trained at the same time.
1. Increase the ScalingConfig in the Trainer so that `num_workers` * `num_samples` is more than your total available CPUs and observe what happens when we try to train.
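
Before running the last step, it can help to estimate how many CPUs the experiment will ask for. The following is a minimal sketch in plain Python, not part of the Ray API: the helper names are hypothetical, and it assumes each worker uses one CPU and that each trial also reserves one extra CPU for its trainer process (set `trainer_cpus=0` if your configuration does not). The total of 16 CPUs is an illustrative value; on a live cluster, `ray.cluster_resources()` reports the actual total.

```python
def cpus_requested(num_samples, num_workers, cpus_per_worker=1, trainer_cpus=1):
    """CPUs needed to run every trial at the same time (hypothetical helper)."""
    per_trial = trainer_cpus + num_workers * cpus_per_worker
    return num_samples * per_trial


def max_concurrent_trials(total_cpus, num_workers, cpus_per_worker=1, trainer_cpus=1):
    """How many trials fit on the cluster at once (hypothetical helper)."""
    per_trial = trainer_cpus + num_workers * cpus_per_worker
    return total_cpus // per_trial


# Template values: num_samples=4 trials, num_workers=4 per trial.
print(cpus_requested(num_samples=4, num_workers=4))         # 20
# Assumed cluster size of 16 CPUs, for illustration only.
print(max_concurrent_trials(total_cpus=16, num_workers=4))  # 3
```

If the requested CPUs exceed what the cluster can provide, we would expect some trials to wait until resources free up, which is the behavior this step asks you to observe.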

### Intermediate lab: use LightGBM 

We can use an alternative gradient-boosting library -- Microsoft's LightGBM (https://lightgbm.readthedocs.io/en/stable/) along with Ray, instead of XGBoost.

1. Open the Ray Train documentation page -- https://docs.ray.io/en/latest/train/train.html -- and compare the XGBoost and LightGBM examples on that page
1. You don't need to install anything: LightGBM libraries are already installed in the cluster (the `lightgbm_ray` library has the key integrations to Ray)
1. Add the necessary import statements
1. Modify the code to use the necessary `Trainer` class
1. Observe the parallel training
    1. If you have extra time, try expanding the scale of the training as in the basic lab above

### Advanced lab: use PyTorch

Integrating PyTorch is a little more involved than just using `TorchTrainer`.

First, look at the PyTorch example on the Ray Train doc page. Use that as a starting point for your code.

All necessary libraries and drivers are already installed in your cluster. We'll also start with a simple multi-layer perceptron model and some PyTorch hints supplied here:

For the current dataset, start with the following model:

In [None]:
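from torch import nn

input_size = 6    # number of input features
layer_size = 10   # hidden-layer width
output_size = 1   # one logit for the binary "is_big_tip" label

class BasicMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, x):
        # Returns raw logits; the sigmoid is applied inside the loss function.
        return self.layer2(self.relu(self.layer1(x)))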

When preparing your training code...
* for the loss function, use `nn.BCEWithLogitsLoss()` (since we're comparing neural net output logits to "is_big_tip" values that will appear as 0 or 1)
* train for one epoch -- we want to make sure the training is working and one epoch is enough to see that
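
To see what the suggested loss computes, here is a small plain-Python sketch of binary cross-entropy on logits. It is an illustration only, not the PyTorch implementation: the function names are hypothetical, and it uses the standard numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`, averaged over the batch (which matches the default mean reduction of `nn.BCEWithLogitsLoss`).

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one raw logit and a 0/1 target (illustrative)."""
    # Stable form: avoids overflow in exp() for large positive or negative logits.
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-example losses over a batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# A logit of 0 means "50/50", so the loss is log(2) whichever label is true.
print(bce_with_logits(0.0, 1.0))
# Confident and correct predictions give a small loss; confident and wrong ones a large loss.
print(mean_bce_with_logits([4.0, -4.0], [1.0, 0.0]))
print(mean_bce_with_logits([4.0, -4.0], [0.0, 1.0]))
```

Because the loss takes logits directly, the model's `forward` should not apply a sigmoid itself.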

When preparing the tabular data, we'll need to make some adjustments to prepare the matrices that the neural net expects...
* when the training dataset "shard" is loaded in the per-worker training loop function (from `session.get_dataset_shard`)
    * it will appear as a `dict`
    * with a key for each column in the original dataset
    * and a corresponding value as a Torch tensor (vector) of values for that column
* you'll need to stack/concat the vectors corresponding to the predictor columns into a matrix (using regular PyTorch APIs)
    * and remove any other nuisance columns which might appear (take a look at a record to see what else might be present)
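
The reshaping described above can be sketched without PyTorch, using plain lists in place of tensors. This is only an illustration of the data layout: the column names and values below are made up, and `stack_columns` is a hypothetical helper. With real tensors, a call such as `torch.stack(columns, dim=1)` would play the same role.

```python
def stack_columns(batch, predictors):
    """Turn a dict of column vectors into a row-major matrix (hypothetical helper)."""
    columns = [batch[name] for name in predictors]
    # zip(*columns) walks the columns in parallel, yielding one row at a time.
    return [list(row) for row in zip(*columns)]


# A made-up batch: one key per column, one vector of values per key.
batch = {
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "is_big_tip": [0, 1, 0],
    "__index_level_0__": [7, 8, 9],  # example of a nuisance column
}

# Keep only predictor columns: drop the label and anything starting with "_".
predictors = [
    name for name in batch
    if not name.startswith("_") and name != "is_big_tip"
]

features = stack_columns(batch, predictors)           # shape: (rows, predictors)
labels = [[float(v)] for v in batch["is_big_tip"]]    # shape: (rows, 1)

print(features)  # [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(labels)    # [[0.0], [1.0], [0.0]]
```

The labels are shaped as one value per row so that they line up with the single logit the model produces for each record.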

Ray Datasets have a schema, so you can inspect the column names and automate the schema cleaning with code like this:

In [None]:
train_dataset.schema()

In [None]:
predictors = [s for s in train_dataset.schema().names if not s.startswith('_')]
predictors.remove("is_big_tip")
predictors