# Introduction to Ray AI Runtime (AIR)
---
(*Suggested Time to Complete: 30 minutes*)

![Ray AIR](images/Ray_AIR.png)
*Figure 1*

Ray AI Runtime (AIR) is a unified set of libraries built on top of Ray for distributed data processing, model training, tuning, model serving, and reinforcement learning, all in Python. AIR provides simple scalable machine learning for individual workloads and end-to-end workflows, bringing together an ecosystem of integrations.

Before we lay out each library and the job it does, let's take a moment to motivate Ray AIR by taking a high-level view of the data science and machine learning workflow.

Developing a machine learning system is an iterative and often cyclical process that roughly touches on the following stages:

1. Business Needs: work with stakeholders to identify business requirements and align on which metric to optimize
2. Data Collection: source and collect data and labels
3. Feature Engineering: turn raw data into usable material
4. Model Training: the learning part of machine learning
5. Model Evaluation: try to improve upon your baseline model with hyperparameter tuning or more feature engineering or even a more relevant set of data
6. Deployment: deploy your solution to production and/or serve your model to the end user

Each of the five native libraries that Ray AIR wraps tackles a specific part of the ML workflow outlined above, and because this abstraction layer is built on top of Ray Core, each library is distributed by nature.

1. [Ray Data](https://docs.ray.io/en/latest/data/dataset.html): load and manipulate data
2. [Ray Train](https://docs.ray.io/en/latest/train/train.html): scales model training for popular ML frameworks
3. [Ray Tune](https://docs.ray.io/en/latest/tune/index.html): scales experiment execution and hyperparameter tuning
4. [Ray Serve](https://docs.ray.io/en/latest/serve/index.html): scales model serving
5. [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html): scales reinforcement learning workloads

Let's contextualize Ray Data, Train, Tune, and Serve with a common ML pipeline and discuss how each library maps onto the distinct steps we need to distribute in this end-to-end example. Towards the end, we will examine a reinforcement learning workload for RLlib.

**Learning Objectives**
1. Understand the high-level data science libraries that compose Ray AIR: Data, Train, Tune, Serve, and RLlib
2. Understand how to use Ray AIR as a unified toolkit to write an end-to-end ML application in Python
3. Practice key concepts from each stage of the ML pipeline
    - Data - use out-of-the-box preprocessors
    - Train - use AIR Trainers for supported ML frameworks
    - Tune - use AIR Tuners for hyperparameter search
    - BatchPredictor - use AIR `BatchPredictor` to load model from best checkpoint for batch inference
    - Serve - use `PredictorDeployment` for online inference
    - RLlib - distribute RL workloads with RLlib


# End-to-End Example: Breast Cancer Biopsy Classification
***
To illustrate Ray AIR's capabilities, we will walk through a practical, end-to-end example of building a typical machine learning pipeline using Ray Data, Train, Tune, and Serve. We will build a binary classification model based on the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)), which contains 569 samples of malignant and benign biopsies. Our workflow will load the data, perform some basic preprocessing, train the model with XGBoost, tune hyperparameters, perform batch inference, and finally deploy our application online.

<!-- As we encounter each stage, we will formally introduce the Ray library that does the heavy lifting. -->

## 1. Ray Data
***
First up, we want to load in the breast cancer dataset and transform its raw data into features that will be input to our machine learning model. Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as map, filter, and repartition and are compatible with a variety of file formats, data sources, and distributed frameworks.

**Key Concepts**

`Dataset`: The standard way to load and exchange data in Ray AIR. In AIR, Datasets are used extensively for data loading, preprocessing, and batch inference.

`Preprocessors`: Preprocessors are primitives that can be used to transform input data into features. Preprocessors operate on Datasets, which makes them scalable and compatible with a variety of datasources and dataframe libraries. A Preprocessor is fitted during Training, and applied at runtime in both Training and Serving on data batches in the same way. AIR comes with a collection of built-in preprocessors, and you can also define your own with simple templates which you can read more about in our [User Guide](https://docs.ray.io/en/latest/ray-air/preprocessors.html).

### Import Relevant Packages + Starting Ray
To start, we'll import Ray (check out our [installation instructions](https://docs.ray.io/en/latest/ray-overview/installation.html)) and start a Ray cluster on our machine that can use all of the available cores as workers. We call `ray.is_initialized()` to check for a running Ray instance and shut it down first, so that only one Ray cluster is active.

In [None]:
import logging, os, random, warnings
import pandas as pd
import ray

warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

# `ray.is_initialized` is a function; without the call it is always truthy
if ray.is_initialized():
    ray.shutdown()

ray.init(logging_level=logging.ERROR)

### Create Ray Dataset
Here, we read in the data from an S3 `.csv` datasource. 

In [None]:
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# split data into training (70%) and validation (30%) subsets
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
# the test set is the validation set with the label column removed
test_dataset = valid_dataset.drop_columns(["target"])

**👾 Coding Exercise 👾**

There exist many [`Dataset` API elements](https://docs.ray.io/en/latest/data/api/dataset.html#) available for common transformations and operations. Using the above as a reference:
1. inspect the schema
2. visualize the first five samples
3. evaluate whether this dataset contains extreme outliers; if so, drop them

In [None]:
### YOUR CODE HERE ###
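
As a hint for step 3, one common heuristic is the 1.5 × IQR rule: flag values that fall far outside the interquartile range. This rule is an assumption chosen for illustration, not something the dataset or Ray prescribes. The sketch below applies it to a made-up list of values in plain Python; the function name `iqr_bounds` and the sample values are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up feature values with one obviously extreme entry.
values = [11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, 15.0, 40.0]

low, high = iqr_bounds(values)
# Keep only the values that fall inside the fences.
kept = [v for v in values if low <= v <= high]
print(f"fences: ({low}, {high}); kept {len(kept)} of {len(values)} values")
```

With a Ray `Dataset`, the same bounds could be used inside a filtering step; which columns to check and how strict to be remain judgment calls.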

### Preprocessing
In the code below, we define a `StandardScaler` preprocessor because we wish to normalize the *approximately normal* `mean radius` and `mean texture` columns.

What's nice about a Ray AIR `Preprocessor` is that it is automatically incorporated into an end-to-end application built with AIR.

- During Training: `Preprocessor` is passed into a `Trainer` to `fit` and `transform` input `Dataset`s.
- During Tuning: each `Trial` will instantiate its own copy of the `Preprocessor`, and the fitting and transformation logic will occur once per `Trial`
- During Checkpointing: the `Preprocessor` is saved in the `Checkpoint` if it was passed into the `Trainer`
- During Predicting: if the `Checkpoint` contains a `Preprocessor`, then it will be used to call `transform_batch` on input batches prior to performing inference

This kind of heavy lifting behind the scenes is one of the major advantages of Ray AIR.

In [None]:
from ray.data.preprocessors import StandardScaler

# create a preprocessor to scale some columns
preprocessor = StandardScaler(columns=["mean radius", "mean texture"])
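
To make the fit/transform split concrete, here is a minimal plain-Python sketch of what standard scaling computes. It is not Ray's implementation: the helper names and sample values are made up, and the use of the population standard deviation is an assumption.

```python
import statistics

def fit_standard_scaler(values):
    """Learn the statistics (mean and standard deviation) from training data."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return mean, std

def transform(values, mean, std):
    """Rescale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Made-up training values standing in for a single feature column.
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]

mean, std = fit_standard_scaler(train_values)
scaled_train = transform(train_values, mean, std)

# New data is transformed with the statistics learned from the training data,
# mirroring how a fitted preprocessor is reused at prediction time.
scaled_new = transform([14.0, 20.0], mean, std)
print(scaled_train, scaled_new)
```

The key design point is that statistics are fitted once on training data and then reused unchanged for validation and inference data.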

## 2. Ray Train
***
Following data preprocessing, we can move forward with defining our model and fitting it on our training set. Ray Train is a lightweight library for distributed deep learning that allows you to easily supercharge your distributed training on Ray.

**Key Concept**

`Trainer`: Trainers are wrapper classes around third-party training frameworks such as XGBoost, PyTorch, and TensorFlow. They are built to help integrate with core Ray Actors (for distribution), Ray Tune, and Ray Datasets.

### Define AIR `Trainer`

In Ray AIR, `Trainer`s provide a way to scale out training with popular machine learning frameworks. As part of Ray Train, `Trainer`s enable users to run distributed multi-node training with fault tolerance. Fully integrated with the Ray ecosystem, `Trainer`s leverage Ray Data to enable scalable preprocessing and performant distributed data ingestion. `Trainer`s can also be composed with `Tuner`s for distributed hyperparameter tuning. After training, a `Trainer` outputs the trained model in the form of a `Checkpoint`, which can be used for batch or online inference.

There are three broad categories of Trainers that AIR offers:

- Deep Learning Trainers (PyTorch, TensorFlow, Horovod)
- Tree-based Trainers (XGBoost, LightGBM)
- Other ML frameworks (HuggingFace, Scikit-Learn, RLlib)

In the example below, we will use an `XGBoostTrainer` to perform binary classification on our breast cancer biopsy samples. All Trainers extend the `BaseTrainer` interface, so to construct one you provide:

- a `scaling_config` which specifies how many parallel training workers and what type of resources (CPUs/GPUs) to use per worker during training.
- a `run_config` which configures a variety of runtime parameters such as fault tolerance, logging, and callbacks
- a collection of datasets and a preprocessor for the provided datasets which configures preprocessing and the datasets to ingest from
- `resume_from_checkpoint` which is a checkpoint path to resume from, should your training run be interrupted

In summary, the steps for our example are:

1. define the parallelism for Ray compute
2. define the XGBoost parameters for training
3. supply the preprocessor for fitting and transforming dataset during training and validation
4. provide the datasets for training and validation
5. invoke `trainer.fit` to fit the model

In [None]:
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(
        # Number of workers to use for data parallelism.
        num_workers=2,
        # Whether to use GPU acceleration.
        use_gpu=False),
    label_column="target",
    num_boost_round=20,
    params={
        # XGBoost specific params
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    # our train and validation dataset and preprocessor
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
)

Trainer objects produce a Result object after calling `.fit()`. These objects contain training metrics as well as checkpoints to retrieve the best model.

In [None]:
# fit the trainer
result = trainer.fit()
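
The two `eval_metric` entries configured above, `logloss` and `error`, can be computed by hand from predicted probabilities. The following plain-Python sketch illustrates the definitions on made-up labels and probabilities; it is not XGBoost's code, and the helper names are hypothetical.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def error_rate(labels, probs, threshold=0.5):
    """Fraction of samples misclassified when thresholding the probability."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)

# Made-up labels and predicted probabilities for four samples.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]

print("logloss:", round(log_loss(labels, probs), 4))
print("error:", error_rate(labels, probs))
```

Lower is better for both metrics: `logloss` penalizes confident wrong predictions heavily, while `error` only counts which side of the threshold each prediction falls on.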

## 3. Ray Tune
***
Ray Tune is a Python library for fast hyperparameter tuning at scale. Easily distribute your trial runs to quickly find the best hyperparameters.

**Key Concept**

`Tuner`: Tuners offer scalable hyperparameter tuning as part of Ray Tune. Tuners can work seamlessly with any Trainer but also can support arbitrary training functions.

### Use AIR `Tuner` for Hyperparameter Search
What if you want to optimize hyperparameters during training and use the best configuration for the model? In that case, you can create a `Tuner` and pass it your `Trainer` (or a training function) as an argument, along with the rest of the `Tuner` configuration.

1. define the hyperparameter space
2. define `TuneConfig` for number of trials and parallelism
3. invoke `tuner.fit()`

In [None]:
from ray import tune

param_space = {"params": {"max_depth": tune.randint(1, 9)}}
metric = "train-logloss"
our_mode = "min"

In [None]:
from ray.tune.tuner import Tuner, TuneConfig
from ray.air.config import RunConfig

tuner = Tuner(
    trainer,
    param_space=param_space,
    tune_config=TuneConfig(num_samples=5, metric=metric, mode=our_mode),
)
# Execute tuning.
result_grid = tuner.fit()
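
Conceptually, the search above draws `num_samples=5` random configurations from the parameter space and keeps the one with the lowest metric value. The plain-Python sketch below mimics that loop with a made-up objective; it is only an illustration, not how Ray Tune schedules trials, and `toy_objective` is a hypothetical stand-in for a real training run.

```python
import random

def sample_config(rng):
    """Draw max_depth like tune.randint(1, 9): lower bound inclusive, upper exclusive."""
    return {"max_depth": rng.randrange(1, 9)}

def toy_objective(config):
    """Hypothetical loss that happens to be smallest at max_depth == 5."""
    return abs(config["max_depth"] - 5) * 0.1

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Run five independent "trials", recording each loss with its configuration.
trials = []
for _ in range(5):
    config = sample_config(rng)
    trials.append((toy_objective(config), config))

# mode="min": the best trial is the one with the lowest metric value.
best_loss, best_config = min(trials, key=lambda trial: trial[0])
print(best_config, best_loss)
```

In the real pipeline each trial is a full training run, which is why distributing trials across a cluster pays off.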

## 4. Ray AIR Checkpoints
***
The AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a format for models that are used across different components of the Ray AI Runtime. This common format allows easy interoperability among AIR components and seamless integration with external supported machine learning frameworks.

**Key Concept**

`Checkpoints`: The AIR trainers, tuners, and custom pretrained models generate a framework-specific `Checkpoint` object. Checkpoints are a common interface for models that are used across different AIR components and libraries.

`BatchPredictor`: loads the best model from a checkpoint to perform batch inference

### Use AIR `BatchPredictor` for Batch Prediction
Once you have trained and tuned your model, create a batch predictor from the best model using `best_result.checkpoint` and perform batch inference.

In [None]:
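from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

# pick the best trial using the metric and mode set in `TuneConfig`
best_result = result_grid.get_best_result()

batch_predictor = BatchPredictor.from_checkpoint(best_result.checkpoint, XGBoostPredictor)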

predicted_probabilities = batch_predictor.predict(test_dataset)
print("PREDICTED PROBABILITIES")
predicted_probabilities.show()
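
Conceptually, batch inference splits a dataset into chunks and applies the model to each chunk in turn. Here is a minimal plain-Python sketch of that idea. The model is a made-up logistic function of a single feature, and both helper names are hypothetical; Ray's `BatchPredictor` additionally distributes the batches across workers.

```python
import math

def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most `batch_size` rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def toy_predict(batch):
    """Hypothetical model: a logistic function of one made-up feature."""
    return [1.0 / (1.0 + math.exp(-(row["mean radius"] - 14.0))) for row in batch]

# Made-up unlabeled rows, analogous to a test set without the target column.
rows = [{"mean radius": r} for r in (10.0, 12.0, 14.0, 16.0, 18.0)]

predictions = []
for batch in iter_batches(rows, batch_size=2):
    predictions.extend(toy_predict(batch))

print(predictions)
```

Because each batch is scored independently, the work parallelizes naturally, which is what makes batch prediction a good fit for a distributed runtime.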

## 5. Ray Serve
***
Ray Serve lets you serve machine learning models in real-time or batch using a simple Python API. Serve individual models or create composite model pipelines, where you can independently deploy, update, and scale individual components.

**Key Concept**

`Deployments`: Deploy the model as an inference service by using Ray Serve and the `PredictorDeployment` class.

### Use `PredictorDeployment` for Online Inference
Deploy the best model as an inference service by using Ray Serve and the `PredictorDeployment` class. After deploying the service, you can send requests to it.

In [None]:
from ray import serve
from fastapi import Request
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import pandas_read_json

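serve.run(
    PredictorDeployment.options(name="XGBoostService", num_replicas=2, route_prefix="/rayair").bind(
        # serve the checkpoint of the best tuning trial rather than the untuned model
        XGBoostPredictor, result_grid.get_best_result().checkpoint, http_adapter=pandas_read_json
    )
)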

In [None]:
import requests

sample_input = test_dataset.take(1)
sample_input = dict(sample_input[0])

output = requests.post("http://localhost:8000/rayair", json=[sample_input]).json()
print(output)

In [None]:
ray.shutdown()

## Summary
You've just created a Ray Dataset, preprocessed some features, built a model with XGBoost, searched a hyperparameter space for the best configuration, loaded the best model from a checkpoint to perform batch inference, and served that model for online inference. Through this end-to-end example, you explored how to use Ray AIR to distribute an entire ML pipeline.

### Key Concepts

- `Datasets`
- `Preprocessors`
- `Trainers`
- `Tuner`
- `Checkpoints`
- `BatchPredictor`
- `Deployments`

### Next Up

Now that you've seen how you can use Ray AIR's unified toolkit to scale an end-to-end machine learning application, let's see how we can use it to scale individual workloads. In the next section, we will cover a reinforcement learning example.

# RL Example

## Ray RLlib
RLlib is the industry-standard reinforcement learning Python framework built on Ray. Designed for quick iteration and a fast path to production, it includes 25+ of the latest algorithms, all implemented to run at scale and in multi-agent mode.

### Summary
#### Key Concepts
#### Key API Elements in This Section
#### Next

# Extra Resources
---
If you would like to practice your new skills further with in-depth examples beyond the embedded coding exercises, take a look at this list of suggested resources:
- Watch the Ray Summit Talk on [Introduction to Ray AIR](https://github.com/ray-project/hackathon5-algo)
- Check out the [Ray AIR Documentation](https://docs.ray.io/en/latest/ray-air/getting-started.html)
- Understand its [Components and APIs](https://docs.ray.io/en/latest/ray-air/package-ref.html)
- Ray AIR [User Guides](https://docs.ray.io/en/latest/ray-air/user-guides.html) and [Examples](https://docs.ray.io/en/latest/ray-air/examples/index.html)


# Next Steps
---
🎉 Congratulations! You have completed the Introduction to Ray AI Runtime tutorial! We discussed each library in Ray AIR (Data, Train, Tune, Serve, RLlib) and saw example machine learning workloads for each. In the next module, we will introduce the ecosystem of integrated libraries that runs on Ray Core's distributed execution engine. With Ray Clusters, you can deploy your workloads on AWS, GCP, Azure, or Kubernetes.

From here, you can learn and get more involved with our active community of developers and researchers by checking out the following resources:
- 💻 [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
- 💬 [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
- 📣 [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
- 🙋‍♀️ [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.
- 🪲 [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.