## 3. Running an experiment with Ray AI libraries

### Steps to run:

- Read the dataset with Ray Data
- Split into training and validation data
- Define and run a Trainer with Ray Train
- Optimize hyperparameters of this training run with Ray Tune
- Run batch inference on a trained model
- Compute online predictions for a model served with Ray Serve

__Read and split the data with Ray Data__

In [24]:
import ray

# Read the dataset from S3 (Parquet file)
dataset = ray.data.read_parquet("s3://anonymous@anyscale-training-data/intro-to-ray-air/nyc_taxi_2021.parquet")


In [25]:
# Split the dataset into training and validation sets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

2024-11-15 15:50:21,044	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-11-15_08-44-25_924022_2383/logs/ray-data
2024-11-15 15:50:21,044	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet]



2024-11-15 15:50:21,913	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-11-15_08-44-25_924022_2383/logs/ray-data
2024-11-15 15:50:21,913	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet]
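
For intuition, the 70/30 split above can be sketched in plain Python. This is not Ray Data's implementation: `split_rows` is a hypothetical helper, and the sketch assumes that `train_test_split` does not shuffle by default, so the leading rows go to the training set and the trailing `test_size` fraction goes to the validation set.

```python
import random


def split_rows(rows, test_size, shuffle=False, seed=None):
    """Hypothetical helper: split a list of rows into (train, valid) lists."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction strictly between 0 and 1")
    rows = list(rows)
    if shuffle:
        # Optional shuffle with a fixed seed for reproducibility
        random.Random(seed).shuffle(rows)
    n_valid = round(len(rows) * test_size)
    cut = len(rows) - n_valid
    # Leading rows form the training set, trailing rows the validation set
    return rows[:cut], rows[cut:]


rows = list(range(10))
train_rows, valid_rows = split_rows(rows, test_size=0.3)
print(len(train_rows), len(valid_rows))
```

For time-ordered data such as taxi trips, an unshuffled split means the validation set holds the most recent rows, which may or may not be what you want.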



__Fit model with Ray Train__

In [26]:
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig


# Define the Trainer (similar for other frameworks)
trainer = XGBoostTrainer(
    label_column="is_big_tip",
    # What resources to use?
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    params={"objective": "binary:logistic"},
    datasets={"train": train_dataset, "valid": valid_dataset},
    # How to run training (e.g. where to store results and checkpoints)?
    run_config=RunConfig(storage_path="/mnt/cluster_storage/"),
)

# Fit the trainer:
#   - Schedule resources for 1 training run (trial)
#   - Trainer setup & data provisioning
#   - Train the model and report back
result = trainer.fit()


2024-11-15 15:54:19,191	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
2024-11-15 15:54:19,206	INFO data_parallel_trainer.py:340 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.


== Status ==
Current time: 2024-11-15 15:54:19 (running for 00:00:00.11)
Using FIFO scheduling algorithm.
Logical resource usage: 5.0/16 CPUs, 0/2 GPUs (0.0/2.0 anyscale/region:us-west-2, 0.0/2.0 anyscale/accelerator_shape:1xT4, 0.0/2.0 accelerator_type:T4, 0.0/2.0 anyscale/provider:aws)
Result logdir: /tmp/ray/session_2024-11-15_08-44-25_924022_2383/artifacts/2024-11-15_15-54-19/XGBoostTrainer_2024-11-15_15-54-19/driver_artifacts
Number of trials: 1/1 (1 PENDING)





== Status ==
Current time: 2024-11-15 15:54:24 (running for 00:00:05.20)
Using FIFO scheduling algorithm.
Logical resource usage: 5.0/16 CPUs, 0/2 GPUs (0.0/2.0 anyscale/region:us-west-2, 0.0/2.0 anyscale/accelerator_shape:1xT4, 0.0/2.0 accelerator_type:T4, 0.0/2.0 anyscale/provider:aws)
Result logdir: /tmp/ray/session_2024-11-15_08-44-25_924022_2383/artifacts/2024-11-15_15-54-19/XGBoostTrainer_2024-11-15_15-54-19/driver_artifacts
Number of trials: 1/1 (1 RUNNING)





== Status ==
Current time: 2024-11-15 15:54:29 (running for 00:00:10.27)
Using FIFO scheduling algorithm.
Logical resource usage: 5.0/16 CPUs, 0/2 GPUs (0.0/2.0 anyscale/accelerator_shape:1xT4, 0.0/2.0 anyscale/provider:aws, 0.0/2.0 anyscale/region:us-west-2, 0.0/2.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2024-11-15_08-44-25_924022_2383/artifacts/2024-11-15_15-54-19/XGBoostTrainer_2024-11-15_15-54-19/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




2024-11-15 15:54:31,784	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/XGBoostTrainer_2024-11-15_15-54-19' in 0.0222s.
2024-11-15 15:54:31,786	INFO tune.py:1041 -- Total run time: 12.60 seconds (12.56 seconds for the tuning loop).


== Status ==
Current time: 2024-11-15 15:54:31 (running for 00:00:12.58)
Using FIFO scheduling algorithm.
Logical resource usage: 5.0/16 CPUs, 0/2 GPUs (0.0/2.0 anyscale/accelerator_shape:1xT4, 0.0/2.0 anyscale/provider:aws, 0.0/2.0 anyscale/region:us-west-2, 0.0/2.0 accelerator_type:T4)
Result logdir: /tmp/ray/session_2024-11-15_08-44-25_924022_2383/artifacts/2024-11-15_15-54-19/XGBoostTrainer_2024-11-15_15-54-19/driver_artifacts
Number of trials: 1/1 (1 TERMINATED)
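
With `objective` set to `binary:logistic`, the model predicts a probability for the positive class (`is_big_tip`), and metrics such as `valid-logloss` measure binary log loss on those probabilities. The snippet below is a rough sketch of that metric, not XGBoost's implementation; `binary_log_loss` and the sample labels and probabilities are made up for illustration.

```python
import math


def binary_log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    if not labels or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and equally long")
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip to avoid log(0) for over-confident predictions
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 1]            # made-up ground truth
confident = [0.9, 0.1, 0.8, 0.7]  # made-up, mostly correct predictions
uninformed = [0.5, 0.5, 0.5, 0.5]  # constant guess

print(binary_log_loss(labels, confident))
print(binary_log_loss(labels, uninformed))
```

A constant prediction of 0.5 yields ln 2 ≈ 0.693, which is a handy baseline when reading log loss values: lower is better, and anything close to 0.693 is barely better than guessing.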




__Optimize hyperparameters with Ray Tune__

<img src="https://docs.ray.io/en/latest/_images/train-tuner.svg" width=600>

In [27]:
from ray import tune
from ray.tune import Tuner, TuneConfig


tuner = Tuner(
    # Pass the Trainer instance
    trainer,
    # Random search over max_depth
    param_space={"params": {"max_depth": tune.randint(2, 12)}},
    # 3 trials in total, minimize validation (log) loss
    tune_config=TuneConfig(num_samples=3, metric="valid-logloss", mode="min"),
    # Stores results in the same location as the training run; point this elsewhere if needed
    run_config=RunConfig(storage_path="/mnt/cluster_storage/"),
)

# Fit the tuner and get the checkpoint of the best trial
# The 3 trials run in parallel, so this takes about as long as the single run above.
checkpoint = tuner.fit().get_best_result().checkpoint

| Field | Value |
|:--|:--|
| Current time | 2024-11-15 15:59:03 |
| Running for | 00:00:13.29 |
| Memory | 5.2/31.0 GiB |

| Trial name | status | loc | params/max_depth | iter | total time (s) | train-logloss | valid-logloss |
|:--|:--|:--|--:|--:|--:|--:|--:|
| XGBoostTrainer_81312_00000 | TERMINATED | 10.0.17.156:163491 | 5 | 11 | 10.0242 | 0.66008 | 0.660569 |
| XGBoostTrainer_81312_00001 | TERMINATED | 10.0.17.156:163492 | 5 | 11 | 10.0059 | 0.66008 | 0.660569 |
| XGBoostTrainer_81312_00002 | TERMINATED | 10.0.25.71:43057 | 3 | 11 | 9.93615 | 0.662196 | 0.662499 |


2024-11-15 15:58:50,120	INFO data_parallel_trainer.py:340 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
2024-11-15 15:58:50,123	INFO data_parallel_trainer.py:340 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
2024-11-15 15:58:50,127	INFO data_parallel_trainer.py:340 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.



2024-11-15 15:59:03,410	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/mnt/cluster_storage/XGBoostTrainer_2024-11-15_15-58-50' in 0.0516s.
2024-11-15 15:59:03,415	INFO tune.py:1041 -- Total run time: 13.31 seconds (13.24 seconds for the tuning loop).
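
Conceptually, the Tuner above performs a small random search: draw `max_depth` for each trial, train, and keep the trial with the lowest `valid-logloss`. The sketch below mimics that loop in plain Python. It is not how Ray Tune is implemented; `toy_validation_loss` is a made-up stand-in for a real training run, and `random_search` is a hypothetical helper.

```python
import random


def toy_validation_loss(max_depth):
    """Made-up stand-in for a training run; returns a fake validation loss."""
    return 0.65 + 0.01 * abs(max_depth - 6)


def random_search(num_samples, seed=0):
    """Hypothetical helper: sample max_depth per trial and keep the best trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_samples):
        # randrange(2, 12) draws from 2..11, mirroring the exclusive upper
        # bound of tune.randint(2, 12)
        max_depth = rng.randrange(2, 12)
        trials.append(
            {"max_depth": max_depth, "valid-logloss": toy_validation_loss(max_depth)}
        )
    # mode="min": the best trial has the lowest metric value
    best = min(trials, key=lambda trial: trial["valid-logloss"])
    return trials, best


trials, best = random_search(num_samples=3)
for trial in trials:
    print(trial)
print("best:", best)
```

Because the draws are independent, the same value can come up more than once, as with the two trials that both used `max_depth` 5 in the results above. With only 3 samples over 10 possible values, the search covers a small part of the space.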


__Batch inference with Ray Data__

In [28]:
import xgboost
import pandas as pd


class OfflinePredictor:
    def __init__(self):
        # Load model once (expensive, stateful operation)
        self._model = xgboost.Booster()
        self._model.load_model(checkpoint.path + "/model.ubj")

    def __call__(self, batch: dict) -> dict:
        # Make prediction in batches
        dmatrix = xgboost.DMatrix(pd.DataFrame(batch))
        prediction = self._model.predict(dmatrix)
        return {"prediction": prediction}

In [31]:
# Apply the predictor to the validation dataset (minus the labels)
valid_dataset_features = valid_dataset.drop_columns(['is_big_tip'])

# Map batches of features over the predictor
predicted_probabilities = valid_dataset_features.map_batches(OfflinePredictor, concurrency=2)

# Materialize ("take") a batch of 10 predictions from the cluster
predicted_probabilities.take_batch(10)

2024-11-15 16:05:08,459	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-11-15_08-44-25_924022_2383/logs/ray-data
2024-11-15 16:05:08,460	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(drop_columns)->MapBatches(OfflinePredictor)] -> LimitOperator[limit=10]





{'prediction': array([0.6290559 , 0.6290559 , 0.5404948 , 0.6290559 , 0.5552995 ,
        0.5529555 , 0.6290559 , 0.6290559 , 0.60417205, 0.5588189 ],
       dtype=float32)}
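
Passing a class rather than a function to `map_batches` is what lets the model load happen once per worker instead of once per batch; the execution plan in the log above shows an `ActorPoolMapOperator` for this step. The single-worker sketch below illustrates the pattern in plain Python. `CountingPredictor` and `map_batches_sketch` are hypothetical names, and the constant score is a dummy prediction.

```python
class CountingPredictor:
    """Hypothetical stand-in for OfflinePredictor that records how often it is used."""

    init_calls = 0

    def __init__(self):
        # Stands in for the expensive model load
        CountingPredictor.init_calls += 1
        self.batches_seen = 0

    def __call__(self, batch):
        self.batches_seen += 1
        # Dummy "prediction": one constant score per row in the batch
        return {"prediction": [0.5 for _ in batch["x"]]}


def map_batches_sketch(column, predictor_cls, batch_size):
    """Single-worker sketch: build the predictor once, then feed it batches."""
    predictor = predictor_cls()
    outputs = []
    for start in range(0, len(column), batch_size):
        batch = {"x": column[start:start + batch_size]}
        outputs.extend(predictor(batch)["prediction"])
    return predictor, outputs


predictor, predictions = map_batches_sketch(
    list(range(10)), CountingPredictor, batch_size=4
)
print(CountingPredictor.init_calls, predictor.batches_seen, len(predictions))
```

With `concurrency=2`, the expectation is that two such predictor instances exist, each constructed once and each handling a share of the batches.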

__Online prediction with Ray Serve__

In [32]:
import json
from ray import serve
from starlette.requests import Request


@serve.deployment
class OnlinePredictor:
    def __init__(self, checkpoint):
        # Load model once (same as before)
        self._model = xgboost.Booster()
        self._model.load_model(checkpoint.path + "/model.ubj")

    async def __call__(self, request: Request) -> dict:
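        # The client sends a JSON-encoded string, so request.json() returns a str
        request_data = await request.json()
        # Decode that string into a list of records
        data = json.loads(request_data)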
        
        # Same structure as in offline prediction (different input data)
        dmatrix = xgboost.DMatrix(pd.DataFrame(data))
        return {"prediction": self._model.predict(dmatrix)}


# Create the model deployment ("handle")
# Binds to localhost:8000 by default
handle = serve.run(OnlinePredictor.bind(checkpoint=checkpoint))

2024-11-15 16:08:51,404	INFO handle.py:126 -- Created DeploymentHandle 'dx5qijjx' for Deployment(name='OnlinePredictor', app='default').
2024-11-15 16:08:51,405	INFO handle.py:126 -- Created DeploymentHandle 'bw4198ni' for Deployment(name='OnlinePredictor', app='default').
2024-11-15 16:08:54,422	INFO handle.py:126 -- Created DeploymentHandle 'es3qaucg' for Deployment(name='OnlinePredictor', app='default').
2024-11-15 16:08:54,423	INFO api.py:574 -- Deployed app 'default' successfully.


In [33]:
import requests

# Form payload
sample_batch = valid_dataset_features.take_batch(1)
data = pd.DataFrame(sample_batch).to_json(orient="records")

# Send HTTP request
requests.post("http://localhost:8000/", json=data).json()

2024-11-15 16:09:49,192	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-11-15_08-44-25_924022_2383/logs/ray-data
2024-11-15 16:09:49,192	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(drop_columns)] -> LimitOperator[limit=1]



{'prediction': [0.629055917263031]}
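
One detail that is easy to miss: the payload is JSON-encoded twice. `to_json(orient="records")` already returns a JSON string, and `requests.post(..., json=data)` serializes that string again, which is why the deployment calls `json.loads` on the result of `request.json()`. The round trip can be reproduced with the standard library alone. The feature names and values in `records` are made up for illustration.

```python
import json

# Made-up feature row standing in for one record of the validation data
records = [{"passenger_count": 1.0, "trip_distance": 2.5}]

# Client side: to_json(orient="records") yields a JSON *string* ...
data = json.dumps(records)
# ... and requests.post(..., json=data) serializes that string once more
body = json.dumps(data)

# Server side: request.json() undoes the outer layer and returns a str,
first_pass = json.loads(body)
# so a second json.loads is needed to recover the list of records
second_pass = json.loads(first_pass)

print(type(first_pass).__name__, type(second_pass).__name__)
```

Sending the records as a Python list instead (for example from `DataFrame.to_dict(orient="records")`) would likely avoid the double encoding, in which case the extra `json.loads` in the deployment would have to be removed as well.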

In [34]:
# Shutdown Ray Serve
serve.shutdown()

In [35]:
# Cleanup
!rm -rf /mnt/cluster_storage/XGBoostTrainer*

### Recap

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Introduction_to_Ray_AIR/e2e_air.png" width="100%" loading="lazy">|
|:-:|
|Ray AI Libraries enable end-to-end ML development and provide multiple options for integrating with other tools and libraries from the MLOps ecosystem.|
