# An Introduction to the Ray AI Runtime


You can run this notebook directly in
[Colab TODO](https://colab.research.google.com/github/XXX).
<a target="_blank" href="https://colab.research.google.com/github/XXX">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

TODO: Make this 2.5 or 2.6 later.
The book has been written for Ray 2.4.0, which you can install using `pip install ray==2.4.0`.

To run the examples for this chapter, you will also need to install the following dependencies:

In [None]:
# TODO pin all versions here
! pip install "ray[air]>=2.4.0" "accelerate>=0.16.0" "transformers>=4.26.0"
! pip install "numpy<1.24" "torch>=1.12.0" "datasets" "evaluate" "deepspeed"

## Overview

In this chapter we’ll introduce you to the core concepts of the Ray AI Runtime (AIR) and how you can
use it to build and deploy common ML workflows. To introduce its components we’ll build
an AIR application that fine-tunes an open-source language model, deploys it for online
inference and uses the model for offline batch inference.
We will also tell you when and why to use AIR and give you a brief overview of its ecosystem.
We close with an in-depth discussion of the relationship of AIR with other systems.

## Why and When to Use AIR?

Running ML workloads with Ray has been a constant evolution over the last couple
of years. Ray RLlib and Tune were the first libraries built on top of Ray Core.
Components like Ray Train, Serve, and more recently Datasets followed shortly
after. The addition of Ray AIR as an umbrella for all other Ray ML libraries is the
result of active discussions with and feedback from the ML community. Ray, as a
Python-native tool with good GPU support and stateful primitives (Ray actors) for
complex ML workloads, is a natural candidate for building a runtime like AIR.

Ray AIR is a unified toolkit for your ML workloads that offers many third-party
integrations for model training or accessing custom data sources. In the spirit of the
other ML libraries built on top of Ray Core, AIR hides lower-level abstractions and
provides an intuitive API that was inspired by common patterns from tools such as
scikit-learn.

At its core, Ray AIR was built for data scientists and ML engineers alike. As
a data scientist, you can use it to build and scale your end-to-end experiments or
individual subtasks such as preprocessing, training, tuning, scoring, or serving of ML
models. As an ML engineer, you can go so far as to build a custom ML platform on
top of AIR or simply leverage its unified API to integrate it with other libraries from
your ecosystem. And Ray always gives you the flexibility to drop down and delve into
the lower-level Ray Core API.

As part of the Ray ecosystem, AIR can leverage all its benefits, which include a
seamless transition from experimentation on a laptop to production workflows on a
cluster. You often see data science teams “hand over” their ML code to teams responsible
for production systems. In practice this can be expensive and time-consuming,
as this process often involves modifying or even rewriting parts of the code. As we
will see, Ray AIR helps you with this transition because AIR takes care of concerns
such as scalability, reliability, and robustness for you.

Ray AIR already has a respectable number of integrations today, but it’s also fully
extensible. And as we will show you in the next section, its unified API provides a
smooth workflow that allows you to drop-in-replace many of its components. For
instance, you can use the same interface to define an XGBoost or PyTorch Trainer
with AIR, which makes experimentation with various ML models convenient.

At the same time, by choosing AIR you can avoid the problem of working with
several (distributed) systems and writing glue code for them that’s difficult to deal
with. Teams working with many moving parts often experience rapid deprecation
of integrations and a high maintenance burden. These issues can lead to migration
fatigue, a reluctance to adopt new ideas due to the anticipated complexity of system
changes.

### Workloads to run with AIR

Before we look at AIR’s components in detail, let’s zoom out
a little and discuss in principle which kinds of workloads you can run with it. We’ve
tackled all of these workloads already throughout the book, but it’s good to recap
them systematically. As the name suggests, AIR is built to capture common tasks in
AI projects. These tasks can be roughly classified in the following way:

- Stateless computation: Tasks like preprocessing data or computing model predictions on a batch of data
    are stateless. Stateless workloads can be computed independently in parallel.
    If you recall our treatment of Ray tasks from Chapter 2, stateless computation
    is exactly what they were built for. AIR primarily uses Ray tasks for stateless
    workloads. Many big data processing tools fall into this category.
- Stateful computation: In contrast, model training and hyperparameter tuning are stateful operations, as
    they update the model state during their respective training procedure. Updating
    stateful workers in such distributed training is a difficult topic that Ray handles
    for you. AIR uses Ray actors for stateful computations.
- Composite workloads: Combining stateless and stateful computation, for instance by first processing
    features and then training a model, is quite common in AI workloads. In fact,
    it’s rare for end-to-end projects to exclusively use one or the other. Running such
    advanced composite workloads in a distributed fashion can be described as big
    data training, and AIR is built to handle both the stateless and stateful parts efficiently.
- Online serving: Lastly, AIR is built to handle scalable online serving of (multiple) models. The
    transition from the previous three workloads to serving is frictionless by design,
    as you still operate within the same AIR ecosystem.
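
To make the distinction between stateless and stateful computation concrete, here is a minimal plain-Python analogy. It deliberately does not use Ray: the thread pool stands in for Ray tasks, the `RunningMean` object stands in for a Ray actor, and all names are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(batch):
    """Stateless: the output depends only on the input batch."""
    return [x / 10.0 for x in batch]


class RunningMean:
    """Stateful: every update depends on everything seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        for x in batch:
            self.count += 1
            self.mean += (x - self.mean) / self.count
        return self.mean


batches = [[10, 20], [30, 40], [50, 60]]

# Stateless batches can be processed independently and in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    features = list(pool.map(preprocess, batches))

# The stateful part has to route all batches through one shared object.
model = RunningMean()
for batch in features:
    model.update(batch)

print(features)
print(model.count, model.mean)
```

Chaining the two steps, as done here, is a tiny version of a composite workload: a stateless preprocessing stage feeding a stateful training stage.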

You can use these types of workloads in different scenarios, too. For instance, you can
use AIR to replace and scale out a single component of an existing pipeline. Or you
can create your own end-to-end machine learning apps with AIR.
You can even use AIR to build your own AI platform, as we will see later.

![AIR Workloads](./images/AIR_workloads.png)

## The Key Components of Ray AIR

AIR’s design philosophy is to provide you with the ability to tackle your ML workloads
in a single script, run by a single system.


### Datasets and Preprocessors

The standard way to load data in Ray AIR is with Ray Datasets. AIR Preprocessors are
used to transform input data into features for ML experiments.
Since these preprocessors operate on Datasets and leverage the Ray ecosystem, they
allow you to scale your preprocessing steps efficiently. During training an AIR Preprocessor
is fitted to the specified training data and can then later be used for both
training and serving. AIR comes packaged with many common preprocessors that
cover many use cases. If you don’t find the one you need, you can easily define a
custom preprocessor on your own.
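
The fit-then-reuse pattern described above can be sketched in a few lines of plain Python. This is not an AIR class: `SimpleStandardScaler` and the row format are assumptions made for illustration, and a real AIR Preprocessor operates on Ray Datasets rather than lists of dictionaries.

```python
import statistics


class SimpleStandardScaler:
    """Hypothetical stand-in for a fitted preprocessor (not an AIR class)."""

    def __init__(self, column):
        self.column = column
        self.mean = None
        self.std = None

    def fit(self, rows):
        # Learn the statistics from the training data only.
        values = [row[self.column] for row in rows]
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, rows):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        return [
            {**row, self.column: (row[self.column] - self.mean) / self.std}
            for row in rows
        ]


train_rows = [{"age": 20.0}, {"age": 30.0}, {"age": 40.0}]
serving_rows = [{"age": 30.0}, {"age": 50.0}]

# Fit once on the training data, then reuse the same statistics everywhere.
scaler = SimpleStandardScaler("age").fit(train_rows)
train_features = scaler.transform(train_rows)
serving_features = scaler.transform(serving_rows)

print(train_features)
print(serving_features)
```

The important design point is that the serving rows are scaled with the statistics learned during training, which keeps the features consistent between the two phases.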

![AIR Data](./images/preprocessor_table.png)

### Trainers

Once you have your training and test datasets ready and your preprocessors defined,
you can move on to specifying a Trainer that runs an ML algorithm on your data.
Trainers provide a consistent wrapper for training frameworks such as TensorFlow, PyTorch, or
Hugging Face. In this example we’ll focus on the last of these, but it’s important to note that
all other framework integrations work exactly the same way in terms of the Ray AIR
API.
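
To illustrate what a consistent wrapper buys you, the following plain-Python sketch defines a shared interface with two interchangeable implementations. The class names, the `num_workers` argument, and the returned dictionaries are all hypothetical; AIR’s actual Trainers are considerably richer and are not reproduced here.

```python
from abc import ABC, abstractmethod


class Trainer(ABC):
    """Hypothetical common interface, not AIR's real Trainer class."""

    def __init__(self, dataset, num_workers=1):
        self.dataset = dataset          # list of (x, y) pairs
        self.num_workers = num_workers  # stand-in for a scaling configuration

    @abstractmethod
    def fit(self):
        """Train and return a dict describing the fitted model."""


class MeanTrainer(Trainer):
    def fit(self):
        ys = [y for _, y in self.dataset]
        return {"kind": "mean", "value": sum(ys) / len(ys)}


class LinearTrainer(Trainer):
    def fit(self):
        # Least-squares slope for a line through the origin.
        num = sum(x * y for x, y in self.dataset)
        den = sum(x * x for x, _ in self.dataset)
        return {"kind": "linear", "slope": num / den}


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# The calling code is identical for both trainers, so swapping models is cheap.
results = {
    cls.__name__: cls(data, num_workers=2).fit()
    for cls in (MeanTrainer, LinearTrainer)
}
print(results)
```

Because the calling code never changes, trying a different model type amounts to swapping one class name.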

Trainers provide scalable ML training that operates on AIR Datasets and preprocessors.
On top of that, they’re also built to integrate well with Ray Tune for HPO, as
we’ll see next.
To summarize this section, the following figure shows how AIR Trainers fit ML models on
Ray Datasets given preprocessors and a scaling configuration.

![AIR Trainers](./images/AIR_trainer.png)

### Tuners and Checkpoints

Tuners, introduced with Ray 2.0 as part of AIR, offer scalable hyperparameter tuning
through Ray Tune. Tuners work seamlessly with AIR Trainers, but also support arbitrary
training functions. In our example, instead of calling fit() on your trainer
instance from the previous section, you can pass your trainer into a Tuner. To do
so, a Tuner needs to be instantiated with a parameter space to search over and a so-called
TuneConfig. This config holds all Tune-specific settings, such as the metric you
want to optimize. You can also pass an optional RunConfig that lets you configure runtime-specific
aspects such as the log verbosity of your Tune run.
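
Conceptually, a tuner combines a trainable, a parameter space, and a metric with an optimization direction. The sketch below shows that idea with an exhaustive grid search in plain Python. It is not Ray Tune’s implementation or API: `train_fn`, `grid_search`, and the toy loss are assumptions made for illustration.

```python
import itertools


def train_fn(config):
    """Hypothetical training function that returns a metrics dict."""
    # Toy loss with its minimum at lr=0.1 and batch_size=32.
    loss = (config["lr"] - 0.1) ** 2 + abs(config["batch_size"] - 32) / 100
    return {"loss": loss}


def grid_search(train_fn, param_space, metric, mode):
    """Evaluate every combination in param_space and return the best trial."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    keys = sorted(param_space)
    trials = []
    for values in itertools.product(*(param_space[key] for key in keys)):
        config = dict(zip(keys, values))
        trials.append((config, train_fn(config)))
    pick = min if mode == "min" else max
    return pick(trials, key=lambda trial: trial[1][metric])


param_space = {"lr": [0.01, 0.1, 1.0], "batch_size": [16, 32]}
best_config, best_metrics = grid_search(
    train_fn, param_space, metric="loss", mode="min"
)
print(best_config, best_metrics)
```

A real tuner adds smarter search algorithms, early stopping, and distributed execution of trials, but the inputs stay the same: what to run, what to search over, and what to optimize.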

Whenever you run AIR Trainers or Tuners, they generate framework-specific Checkpoints.
You can use these checkpoints to load models for usage across several AIR
libraries, such as Tune, Train, or Serve. You can get a Checkpoint by accessing the
result of a .fit() call on either a Trainer or a Tuner.
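
The idea of a checkpoint as an exchange format can be sketched with the standard library alone. The directory layout, file names, and helper functions below are hypothetical and do not mirror how AIR Checkpoints are stored; the point is only that training writes a self-describing artifact that a later stage can load without knowing how it was produced.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(model_state, metrics, directory):
    """Write model state and metrics to a directory (hypothetical layout)."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    (path / "model.json").write_text(json.dumps(model_state))
    (path / "metrics.json").write_text(json.dumps(metrics))
    return path


def load_checkpoint(directory):
    """Read back what save_checkpoint wrote."""
    path = Path(directory)
    model_state = json.loads((path / "model.json").read_text())
    metrics = json.loads((path / "metrics.json").read_text())
    return model_state, metrics


def predict(model_state, x):
    return model_state["slope"] * x + model_state["intercept"]


with tempfile.TemporaryDirectory() as tmp:
    # "Training" produces a checkpoint ...
    ckpt = save_checkpoint(
        {"slope": 2.0, "intercept": 1.0}, {"loss": 0.05}, Path(tmp) / "ckpt"
    )
    # ... and a later stage restores it without re-training.
    restored_state, restored_metrics = load_checkpoint(ckpt)

prediction = predict(restored_state, 3.0)
print(restored_metrics, prediction)
```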

Having checkpoints is great because they’re AIR’s native model exchange format.
You can also use them to pick up trained models at a later stage, without having
to worry about custom ways to store and load the models in question. Figure 10-3
schematically shows how AIR Tuners work with AIR Trainers.

![AIR Tuners](./images/AIR_tuner.png)

### Running batch prediction

TODO: this needs to be adapted for new "map_batches" paradigm

![AIR Batch Inference](./images/AIR_predictor.png)

### Online Serving Deployments

Instead of using batch prediction and interacting with the model in question
directly, you can leverage Ray Serve to deploy an inference service that you can
query over HTTP. You do that by deploying the PredictorDeployment class
with the checkpoint from training.
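
To show what querying a model over HTTP looks like from end to end, here is a minimal sketch that uses only Python’s standard library. It is not Ray Serve and does not use PredictorDeployment: the handler, the JSON payload shape, and the toy linear model are assumptions made for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a model restored from a checkpoint.
MODEL = {"slope": 2.0, "intercept": 1.0}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        preds = [MODEL["slope"] * x + MODEL["intercept"] for x in payload["inputs"]]
        body = json.dumps({"predictions": preds}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def query(url, inputs):
    """Send a batch of inputs to the service and return the parsed response."""
    data = json.dumps({"inputs": inputs}).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # Bypass any configured proxies, since the server runs on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 picks a free port
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
try:
    result = query(f"http://127.0.0.1:{server.server_port}/", [1.0, 2.0])
finally:
    server.shutdown()
    server.server_close()

print(result)
```

A serving framework takes over everything this sketch leaves out, such as replicas, request batching, autoscaling, and loading the model from a checkpoint.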

![AIR Deployments](./images/AIR_deployment.png)

Here's an overview of all components at once:

![AIR Overview](./images/air_plan.png)


It’s important to stress again that we’ve been using a single Python script for this
example and a single distributed system in Ray AIR to do all the heavy lifting. In
fact, you can use this example script and scale it out to a large cluster that uses CPUs
for preprocessing and GPUs for training and separately configure the deployment
simply by modifying the parameters of the scaling configuration and similar options
in that script. This isn’t as easy or common as it may seem, and it is not unusual
for data scientists to have to use multiple frameworks (e.g., one for data loading and
processing, one for training, and one for serving).

Note: You can also use Ray AIR with RLlib, but the integration is still in
its early stages. For instance, to integrate RLlib with AIR Trainers,
you’d use the RLTrainer that allows you to pass in all arguments
that you’d pass to a standard RLlib algorithm. After training, you
can store the resulting RL model in an AIR Checkpoint, just as
with any other AIR Trainer. To deploy your trained RL model,
you can use Serve’s PredictorDeployment class by passing your
checkpoint along with the RLPredictor class.
This API might be subject to change, but you can see an example of
how this works in the AIR documentation.

TODO: maybe 10.9 on distributed model training can be interesting for examples?

## An Example of Training and Deploying Large Language Models with AIR

TODO

### Data Loading and Preprocessing

TODO

### Fine-Tuning and Hyperparameter Optimization

TODO

### Running Batch Inference

TODO

### Running Online Model Serving

TODO

## An Overview of Ray AIR Integrations

Next, we'll show you the full extent of integrations currently available for Ray.
We do so by discussing this ecosystem as seen from Ray AIR so that we can discuss
it in the context of a representative AIR workflow.
Clearly, we simply can’t give you examples for all the libraries in Ray’s ecosystem.
Where appropriate, we’ll point you to more advanced resources to deepen your understanding.

![AIR Data Table](./images/data_eco_table.png)

![AIR Train Table](./images/training_eco_table.png)

![AIR Tune Table](./images/tune_eco_table.png)

![AIR Serve Table](./images/serve_eco_table.png)

### An Overview of Ray’s Integrations

Let’s summarize all the integrations mentioned in this chapter (and throughout the
book) in one concise diagram. In the following figure we list all integrations currently available:

![AIR Eco](./images/Ray_extended_eco.png)

## How AIR compares to related systems

Now that you know much more about Ray and its libraries, this chapter is also the
right place to compare what Ray offers to similar systems. As you’ve seen, Ray’s ecosystem
is quite complex, can be seen from different angles, and is used for different
purposes. That means many aspects of Ray can be compared to other tools in the
market. We’ll also comment on how to integrate Ray into more complex workflows in existing
ML platforms.

We’ve not made any direct comparisons with other systems up to this point, for the
simple reason that it makes little sense to compare Ray to something if you don’t
have a good grasp of what Ray is yet. As Ray is quite flexible and comes with a lot
of components, it can be compared to different types of tools in the broader ML
ecosystem.
Let’s start with a comparison of the more obvious candidates, namely, Python-based
frameworks for cluster computing.

### Distributed Python Frameworks

If you consider frameworks for distributed computing that offer full Python support
and don’t lock you into any cloud offering, the current “big three” are Dask, Spark,
and Ray. While there are certain technical and context-dependent performance differences
between these frameworks, it’s best to compare them in terms of the workloads
you want to run on them. Table XXX compares the most common workload
types:

![AIR Dask Spark Table](./images/dask_spark_ray_table.png)

### Ray AIR and the Broader ML Ecosystem

Ray AIR focuses primarily on AI compute, for instance by providing any kind of
distributed training via Ray Train, but it’s not built to cover every aspect of an AI
workload. For instance, AIR chooses to integrate with tracking and monitoring tools
for ML experiments, as well as with data storage solutions, rather than providing
native solutions.

On the other side of the spectrum, you can find categories of tools for which Ray AIR
can be considered an alternative. For instance, there are many framework-specific
toolkits such as TorchX or TFX that tie in tightly with their respective frameworks. In
contrast, AIR is framework-agnostic, thereby preventing vendor lock-in, and offers
similar tooling.

It’s also interesting to briefly touch on how Ray AIR compares to specific cloud offerings.
Some major cloud services offer comprehensive toolkits to tackle ML workloads
in Python. To name just one, AWS SageMaker is a great all-in-one package that
allows you to connect well with your AWS ecosystem. AIR does not aim to replace
tools like SageMaker. Instead, it aims to provide alternatives for compute-intensive
components like training, evaluation, and serving.

AIR also represents a valid alternative to ML workflow frameworks such as Kubeflow
or Flyte. In contrast to many container-based solutions, AIR offers an intuitive,
high-level Python API and offers native support for distributed data.

Sometimes the situation is not as clear-cut, and Ray AIR can be seen or used as either
an alternative or a complementary component in the ML ecosystem.
For instance, as open source systems, Ray and AIR in particular can be used within
hosted ML platforms such as SageMaker, but you can also build your own ML
platforms with them. Also, as mentioned, AIR can’t always compete with dedicated big
data processing systems like Spark or Dask, but often Ray Datasets can be enough to
suit your processing needs.

As we mentioned earlier, it is central to AIR’s design philosophy to have
the ability to express your ML workloads in a single script and execute it on Ray
as a single distributed system. Since Ray handles all the task placement and execution
on your cluster for you under the hood, there’s usually no need to explicitly
orchestrate your workloads (or stitch together many complex distributed systems).
Of course, this philosophy should not be taken too literally: sometimes you need
multiple systems or to split up tasks into several stages. On the other hand, dedicated
workflow orchestration tools like Argo or Airflow can be very useful when used in
a complementary fashion. For instance, you might want to run Ray as a step in the
Lightning MLOps framework.


### How to Integrate AIR into Your ML Platform

Now that you have a deeper understanding of the relationship of Ray, and AIR in
particular, to other ecosystem components, let’s summarize what it takes to build your
own ML platform and integrate Ray with other ecosystem components.

The core of your ML system built with AIR consists of a set of Ray Clusters, each
responsible for different jobs. For instance, one cluster might run preprocessing, train
a PyTorch model, and run inference; another one might simply pick up previously
trained models for batch inference and model serving, and so on. You can leverage
the Ray Autoscaler to fulfill your scaling needs and could deploy the whole system
on Kubernetes with KubeRay. You can then augment this core system with other
components as you see fit, for example:

- You might want to add other compute steps to your setup, such as running
    data-intensive preprocessing tasks with Spark.
- You can use a workflow orchestrator such as Airflow, Oozie, or SageMaker
    Pipelines to schedule and create your Ray Clusters and run Ray AIR apps and
    services. Each AIR app can be part of a larger orchestrated workflow, for instance
    by tying into a Spark ETL job from the first bullet point.
- You can also create your Ray AIR clusters for interactive use with Jupyter notebooks,
    for instance hosted by Google Colab or Databricks Notebooks.
- If you need access to a feature store such as Feast or Tecton, Ray Train, Datasets,
    and Serve have an integration for such tools.
- For experiment tracking or metric stores, Ray Train and Tune provide integration
    with tools such as MLflow and Weights & Biases.
- You can also retrieve and store your data and models from external storage
    solutions like S3, as shown in the following figure.

![AIR ML Platform Table](./images/AIR_ML_platform.png)

## Summary

In this chapter you’ve seen how all the Ray libraries we’ve introduced come together
to form the Ray AI Runtime. You’ve learned about all the key concepts that allow
you to build scalable ML projects, from experimentation to production. In particular,
you’ve seen how Ray Datasets are used for stateless computations such as feature
preprocessing, and how Ray Train and Tune are used for stateful computations
such as model training. Seamlessly combining these types of computations in
complex AI workloads and scaling them out to large clusters is a key strength of AIR.
Deploying your AIR projects comes essentially for free, as AIR fully integrates with
Ray Serve as well.

You also learned about Ray AIR’s ecosystem.
You should now be able to go out there and run your own AIR experiments,
together with all the tools you’re already using or intend to use in the future. We’ve
also discussed Ray’s limits, how it compares to various related systems, and how you
can use Ray with other tools to augment or build out your own ML platforms.