# Scalable Batch Inference with Ray

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is it right for you?

This module focuses on the batch inference task in the computer vision (CV) context and presents several approaches for scaling it with Ray. It is right for you if:

* You observe performance bottlenecks when working on batch inference problems in your CV projects.
* You want to scale or increase throughput of your existing batch inference pipelines.
* You wish to explore different architectures for scaling batch inference with Ray Core and Ray AIR.

### Prerequisites

For this notebook you should have:

* Practical Python and machine learning experience.
* Familiarity with batch inference tasks in ML.
* Familiarity with Ray and Ray AIR equivalent to completing these training modules:
  * [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)
  * [Introduction to Ray AIR](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Introduction_to_Ray_AIR.ipynb)
  * [Ray Core](https://github.com/ray-project/ray-educational-materials/tree/main/Ray_Core)

### Learning objectives

Upon completion of this notebook, you will be able to:

* Evaluate common design patterns for distributed batch inference.
* Decide which design pattern to use for your batch inference problem in the CV context.
* Implement scalable batch inference with Ray and tune its performance.

### What will you do?

* Learn about and evaluate several distributed batch inference design patterns with Ray Core and Ray AI Runtime.
* Implement distributed batch inference through hands-on coding exercises.

## Part 1: Scalable batch inference design patterns with Ray

One of the end goals for machine learning models is to generate predictions over a set of unseen data. In this notebook, you will look closely at the inference stage of the ML workflow and explore ways to scale it.

Ray Core and Ray AIR APIs allow you to scale batch inference to millions of examples and offer various performance tuning opportunities.

|<img src="../../_static/assets/Scaling_inference/example_ml_workflow.png" width="70%" loading="lazy">|
|:--|
|Example machine learning workflow. It starts with reading raw data and preprocessing it. These stages are followed by training and tuning jobs that eventually produce a trained model with the desired quality and performance metrics. Trained models are then used for inference, often on large-scale data sets.|

### What is (batch) inference?

<div class="alert alert-info">
  <strong>Batch inference</strong> (also known as offline inference) is the process of generating predictions on a batch of data.
</div>

Unlike *online inference*, where predictions are returned as soon as possible after an observation is produced, batch inference generates predictions over a large number of inputs when an immediate response is not required or feasible. For example, batch inference is relevant when generating product recommendations from historical customer data or forecasting from time-aggregated observations.

|<img src="../../_static/assets/Scaling_inference/batch_inference.png" width="70%" loading="lazy">|
|:--|
|Batch inference feeds a batch of data into a trained model and outputs predictions.|

Below, you will encounter five conceptual architectures for performing batch inference on Ray. Each one offers scalability and performance customization.

### Stateless inference using Ray Tasks

In the most basic approach, inference is performed sequentially: the model scores incoming batches of data one after another. However, throughput is limited by a single machine or GPU, so this approach does not scale.

A straightforward, easy-to-implement way to scale out batch inference is with Ray tasks. A task is an arbitrary Python function that is evaluated remotely in the compute cluster. This is an example of a [Remote Procedure Call](https://en.wikipedia.org/wiki/Remote_procedure_call) (RPC), and Ray is an example of an RPC system.

A Ray task is *stateless* because it computes an output (e.g. predictions) determined purely by its input data, without storing or modifying any internal state. A typical example of a stateless function in deep learning is the [SGD optimizer](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks. Ray assigns batches to workers via tasks as soon as a worker becomes available. Each task loads the trained model and outputs predictions independent of the other concurrent inference jobs. This approach scales with the number of available CPUs or GPUs.|

<img src="../../_static/assets/Scaling_inference/code_task.png" width="70%" loading="lazy">

### Stateful inference using Ray Actors

Loading large, complex models into memory can be expensive. In addition, you may want the flexibility to capture some persistent internal state. This second approach avoids loading and discarding the model for each batch. The overall recipe is:

1. Create a number of replicas (i.e. Ray Actors) of the trained model.
2. Feed data into these model replicas in parallel and retrieve the inference results.

Ray actors are *stateful* because they keep an internal state, just like instances of Python classes. The [Adam optimizer](https://arxiv.org/abs/1412.6980), commonly used in deep learning, is an example of a stateful object.
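To make the stateless/stateful distinction concrete, here is a minimal plain-Python sketch (no Ray involved). The names `sgd_step` and `AdamState` are hypothetical and exist only for this illustration; the Adam update follows the standard formulation from the linked paper, reduced to a single scalar parameter.

```python
import math


def sgd_step(weight, grad, lr=0.1):
    """Stateless: the result depends only on the arguments."""
    return weight - lr * grad


class AdamState:
    """Stateful: keeps running moment estimates between calls."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running first-moment estimate
        self.v = 0.0  # running second-moment estimate
        self.t = 0  # number of steps taken so far

    def step(self, weight, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2**self.t)
        return weight - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# With a zero gradient, the stateless update leaves the weight unchanged.
unchanged = sgd_step(1.0, 0.0)

# The stateful optimizer remembers the earlier gradient, so the weight
# still moves on the second call even though that gradient is zero.
adam = AdamState()
first = adam.step(1.0, 2.0)
second = adam.step(1.0, 0.0)
print(unchanged, first, second)
```

A Ray task behaves like `sgd_step` (nothing survives the call), while a Ray actor behaves like an `AdamState` instance (attributes persist across method calls).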

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful inference using Ray Actors. Ray Actors hold model replicas which generate predictions from batches of data. You can scale out this approach with the number of actors. Actors can be reused multiple times and run inference for many batches of data.|

<img src="../../_static/assets/Scaling_inference/code_actor.png" width="70%" loading="lazy">

### Stateful inference using Ray ActorPool utility

When using Ray Actors, you need to implement *load balancing* to keep resource utilization high. You must track when actors become available and assign them new work until the entire process completes.

This design pattern introduces the Ray [ActorPool](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-util-actorpool) utility, which wraps a list of actors and automatically handles load balancing (futures management). This convenient abstraction lets you focus on the inference logic and leaves resource utilization to Ray.

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using ActorPools for Batch Inference.|

In essence, the ActorPool wraps around the `n` actors so that you do not have to track idle actors or distribute workloads manually.

<img src="../../_static/assets/Scaling_inference/code_actorpool.png" width="70%" loading="lazy">

### Batch inference using Ray AIR Datasets

The previous approaches leave several aspects unoptimized:
* dispatching file splits one at a time may be inefficient for small batches or cause OutOfMemory errors if batches are too large (e.g. on GPUs)
* you may want to have multiple tasks sent to an Actor at once (i.e. pipelining task submission)
* data fetching and batch preprocessing could be parallelized as well

While you could control how Ray executes by implementing performance optimizations through Ray Core primitives, [Ray AIR](https://docs.ray.io/en/latest/ray-air/getting-started.html) offers high-level composable APIs that have these optimizations built-in.

[Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html) allows for:
1. parallel reading and preprocessing of source data
2. dynamic autoscaling of the actor pool
3. automatic batching and pipelining of data

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

In Ray AIR, a trained model is loaded into a `Checkpoint` object (which could come from training or tuning). An AIR `Predictor` loads the model from the `Checkpoint` to perform inference. Then, using the preprocessed batches provided by Ray Datasets, you generate predictions for the test data.

<img src="../../_static/assets/Scaling_inference/code_dataset.png" width="70%" loading="lazy">

### Batch inference using Ray AIR BatchPredictor

Finally, Ray AIR's [`BatchPredictor`](https://docs.ray.io/en/latest/ray-air/package-ref.html#batch-predictor) takes in a [`Checkpoint`](https://docs.ray.io/en/latest/ray-air/package-ref.html#checkpoint) that represents the saved model. This high-level abstraction offers simple, composable APIs that let you preprocess data in batches with [BatchMapper](https://docs.ray.io/en/latest/ray-air/package-ref.html#generic-preprocessors) and instantiate a distributed predictor from the checkpoint data.

As with the rest of Ray AIR, you specify what you want done through a set of declarative key-value arguments rather than concerning yourself with how to instruct Ray to scale.

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

The AIR `BatchPredictor` takes both the `Checkpoint` and `Predictor` to replace the process of manually performing inference on a large dataset.

<img src="../../_static/assets/Scaling_inference/code_batchpredictor.png" width="70%" loading="lazy">

## Part 2: Data and model - computer vision transformers for semantic segmentation

To demonstrate each architecture, you will implement each approach by running inference on a variation of the object detection task: semantic segmentation.

### MIT ADE20K - scene parsing benchmark

Semantic, or image, segmentation takes a scene and classifies image objects into semantic [categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit?usp=sharing) pixel-by-pixel. Often used as a standard for assessing segmentation model quality, the [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/) (also known as "SceneParse150") provides the largest open source data set for scene parsing.

|<img src="../../_static/assets/Scaling_inference/scene.png" width="70%" loading="lazy">|
|:--|
|Test image on the left vs. predicted result on the right. [Source](https://github.com/CSAILVision/semantic-segmentation-pytorch) *Date accessed: November 10, 2022*|

Data set highlights:

* 20k annotated, scene-centric training images
* 3.3k test images
* 150 total categories such as person, car, bed, sky, and more

### SegFormer - transformer-based framework for semantic segmentation

[SegFormer](https://arxiv.org/pdf/2105.15203.pdf) is a simple and powerful semantic segmentation method whose architecture consists of a hierarchical Transformer encoder and a lightweight All-MLP decoder. What sets SegFormer apart from previous approaches boils down to two key features:

1. a novel hierarchically structured Transformer encoder that does not depend on positional encoding, avoiding interpolation when the test resolution differs from the training resolution
2. a lightweight MLP decoder that avoids the complexity of earlier decoder designs

SegFormer has demonstrated success on benchmarks such as Cityscapes and the [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/). You will use a pre-trained version to perform inference on test images from MIT ADE20K (SceneParse150).

|<img src="../../_static/assets/Scaling_inference/segformer_architecture.png" width="70%" loading="lazy">|
|:--|
|SegFormer architecture taken from the [original paper](https://arxiv.org/pdf/2105.15203.pdf). *Date accessed: November 10, 2022*|


## Part 3: Sequential batch inference

To begin, you will build a basic version of batch inference that is sequential.

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="90%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited by that single machine.|

In [None]:
import torch
import numpy as np
import pandas as pd
from PIL import Image

# set the seed to a fixed value for reproducibility
torch.manual_seed(201)

### Load pre-trained model from the HuggingFace Hub

In [None]:
from utils import get_labels

In [None]:
id2label, label2id = get_labels()

print(f"Total number of labels: {len(id2label)}")
print(f"Example labels: {list(id2label.values())[:5]}")

`get_labels`, a utility function, provides two dictionary mappings from [HuggingFace](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json):
* `id2label`
* `label2id`

which allow you to convert between ids (int) and labels (str) for the 150 available categories of objects in images.
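As a rough sketch of how the two mappings relate, the snippet below builds `label2id` by inverting `id2label`. The three entries are an illustrative excerpt written out by hand, not loaded from the real label file, and the actual `get_labels` utility may construct the dictionaries differently.

```python
# Illustrative excerpt only; the real mapping has 150 entries.
id2label = {0: "wall", 1: "building", 2: "sky"}

# Invert the mapping so that labels can be looked up by name.
label2id = {label: idx for idx, label in id2label.items()}

print(label2id["sky"])  # 2
print(id2label[label2id["wall"]])  # wall
```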

#### Load SegFormer

In [None]:
from transformers import SegformerForSemanticSegmentation

In [None]:
MODEL_NAME = "nvidia/segformer-b0-finetuned-ade-512-512"

segformer = SegformerForSemanticSegmentation.from_pretrained(
    MODEL_NAME, id2label=id2label, label2id=label2id
)

print(f"Number of model parameters: {segformer.num_parameters()/(10**6):.2f} M")

From [HuggingFace](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512), you specify the b0-sized (the smallest, ranging up to b5) SegFormer model.

This pre-trained model contains 3.75 million parameters and is fine-tuned on the MITADE20K (SceneParse150) dataset on images with a 512 x 512 resolution. Keep this in mind when comparing strengths and weaknesses of various batch inference approaches.

#### Create feature extractor

In [None]:
from transformers import SegformerFeatureExtractor

In [None]:
segformer_feature_extractor = SegformerFeatureExtractor.from_pretrained(
    MODEL_NAME, reduce_labels=True
)
segformer_feature_extractor

[Feature extractors](https://huggingface.co/docs/transformers/main_classes/feature_extractor) preprocess input features by normalizing, resizing, padding, and converting raw images into the desired cleaned shape.

The [`reduce_labels`](https://huggingface.co/docs/transformers/model_doc/segformer#segformer) flag ensures that the "background" of an image isn't counted as its own separate category when computing loss. 

### Prepare SceneParse150 dataset

#### Load dataset from the HuggingFace Hub

In [None]:
from datasets import load_dataset
from utils import convert_image_to_rgb

In [None]:
SMALL_DATA = True

<div class="alert alert-warning">
  <strong>SMALL_DATA</strong>: defaults to `True`, which downloads only 160 images from the data set. Set to `False` (recommended) to work with the full testing dataset (3352 images).
</div>

If you set `SMALL_DATA` to `False`, the download takes some time (depending on your connection speed) because over 20k images are downloaded to the local machine or cluster.

In [None]:
DATASET_NAME = "scene_parse_150"

# A small training sample is enough for visualization in either mode.
train_dataset = load_dataset(DATASET_NAME, split="train[:10]")

# Use a 160-image slice of the test set in small-data mode, else the full test set.
test_split = "test[:160]" if SMALL_DATA else "test"
test_dataset = load_dataset(DATASET_NAME, split=test_split)

Download the MIT ADE20K (SceneParse150) dataset from HuggingFace's datasets repository using the `load_dataset` utility.

In [None]:
train_dataset

In [None]:
test_dataset = test_dataset.map(convert_image_to_rgb)
test_dataset

Each example in the data set includes:
* `image` - the PIL image
* `annotation` - human annotations of image regions (annotation mask is `None` in testing set)
* `category` - overall category of the scene (e.g. driveway, voting booth, dairy_outdoor).

For the datasets:
* `train_dataset` - a small sample of images retrieved for visualization purposes. The full training data set contains 20210 images.
* `test_dataset` - used for batch inference. The full test data set contains 3352 images.

#### Display example images

In [None]:
from utils import display_example_images

In [None]:
# try running multiple times!
display_example_images(train_dataset)

### Run sequential batch inference on the single batch and visualize predictions

In [None]:
def predict(model, feature_extractor, images):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    inputs = feature_extractor(images=images, return_tensors="pt")
    with torch.no_grad():
        outputs = model(pixel_values=inputs.pixel_values.to(device))

    image_sizes = [image.size[::-1] for image in images]
    segmentation_maps_postprocessed = (
        feature_extractor.post_process_semantic_segmentation(
            outputs=outputs, target_sizes=image_sizes
        )
    )

    return [j.detach().cpu().numpy() for j in segmentation_maps_postprocessed]

The `predict` function forms the basis for inference, and you will encounter it again throughout this notebook's exploration of approaches.

Inputs

* `model` - which model to use; in this case, SegFormer b0 fine-tuned on 512x512 ADE20K
* `feature_extractor` - the preprocessing mechanism associated with the model
* `images` - list of PIL images to run inference on

The `device` is not passed in; the function selects a GPU (`cuda`) when one is available and falls back to the CPU otherwise. The 150 category labels are already attached to the model when it is loaded.

Output
* list of segmentation maps

Core Logic

1. `inputs` - the images are converted using `feature_extractor` into `3x512x512` (`CxHxW`) arrays, regardless of the original size of each input image.
2. `outputs` (the inference step) - the 512x512 images are passed to the model, which produces 150 score maps of size 128x128 per image, one for each available category. Each map scores how strongly each region of the image belongs to that category.
3. `image_sizes` - flip the dimensions of each image from the PIL order (`WxH`) to the order HuggingFace expects (`HxW`).
4. `feature_extractor.post_process_semantic_segmentation` - HuggingFace utility that generates segmentation maps from the raw outputs, resized to the original image sizes.
5. Return a list of segmentation maps, detached from the computation graph and moved from GPU to CPU as NumPy arrays.
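Steps 3 and 4 can be illustrated with a toy example in plain Python. This is only a sketch under the assumption that the segmentation map assigns each pixel the category with the highest score; the real utility also resizes the score maps and operates on tensors. The helper names `flip_size` and `argmax_map` and the toy scores are hypothetical.

```python
def flip_size(pil_size):
    """Convert PIL's (width, height) into (height, width)."""
    return pil_size[::-1]


def argmax_map(scores):
    """Pick the highest-scoring category index for every pixel.

    `scores` is indexed as scores[category][row][col].
    """
    n_categories = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [
            max(range(n_categories), key=lambda c: scores[c][row][col])
            for col in range(width)
        ]
        for row in range(height)
    ]


# Toy scores for a 2x2 image and 3 categories (made-up numbers).
toy_scores = [
    [[0.1, 0.7], [0.2, 0.1]],  # category 0
    [[0.8, 0.2], [0.3, 0.2]],  # category 1
    [[0.1, 0.1], [0.5, 0.7]],  # category 2
]

target_size = flip_size((640, 480))
segmentation_map = argmax_map(toy_scores)
print(target_size)  # (480, 640)
print(segmentation_map)  # [[1, 0], [2, 2]]
```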

#### Run batch prediction on 16 images

In [None]:
from utils import get_image_indices

In [None]:
BATCH_SIZE = 16

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE)
image_indices

Get `BATCH_SIZE` random image IDs from the test data set to run inference on. Each time you run this code, different images from the test set are selected.

In [None]:
batch = [test_dataset[i]["image"] for i in image_indices]
batch

In [None]:
segmentation_maps = predict(
    model=segformer,
    feature_extractor=segformer_feature_extractor,
    images=batch,
)

Time how long it takes to run `predict` on a batch of images.
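One possible way to take this measurement is a small wrapper around `time.perf_counter`. The sketch below uses a hypothetical `slow_square` function as a stand-in for `predict` so that it runs on its own; `timed` is likewise a name introduced only for this illustration.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_square(x):
    """Hypothetical stand-in for an expensive call such as predict."""
    time.sleep(0.05)
    return x * x


result, seconds = timed(slow_square, 4)
print(f"result={result}, elapsed={seconds:.3f}s")
```

In the notebook, the same wrapper could be applied as `timed(predict, model=segformer, feature_extractor=segformer_feature_extractor, images=batch)`. The `%%time` cell magic is an alternative when running in Jupyter.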

In [None]:
segmentation_maps[0]

#### Visualize example predictions

In [None]:
from utils import visualize_predictions

In [None]:
visualize_predictions(image=batch[0], segmentation_maps=segmentation_maps[0])

### Run sequential batch inference on 10 batches

#### Prepare batches

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
image_indices_grouped = np.split(np.asarray(image_indices), N_BATCHES)
image_indices_grouped

In [None]:
batches = []

for image_idx in image_indices_grouped:
    batch = [test_dataset[int(i)]["image"] for i in image_idx]
    batches.append(batch)

batches[0]

To prepare batches, retrieve `N_BATCHES` batches of `BATCH_SIZE` images each from the test set. The code above first fetches the indices of shuffled images, splits them into `N_BATCHES` equal groups, and then builds a list of images for each group.
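The index-grouping step can be sketched with the standard library alone. The implementation of `get_image_indices` is not shown in this notebook, so `make_index_batches` below is a hypothetical equivalent that samples unique indices and slices them into equal groups. Note that `np.split` with an integer number of sections raises an error when the array cannot be divided evenly, which is why the total is always `BATCH_SIZE * N_BATCHES`.

```python
import random


def make_index_batches(n_items, batch_size, n_batches, seed=201):
    """Sample unique indices and group them into equally sized batches."""
    needed = batch_size * n_batches
    if needed > n_items:
        raise ValueError(f"need {needed} items but only {n_items} are available")
    rng = random.Random(seed)
    indices = rng.sample(range(n_items), needed)  # shuffled, no repeats
    return [indices[i : i + batch_size] for i in range(0, needed, batch_size)]


index_batches = make_index_batches(n_items=160, batch_size=16, n_batches=10)
print(len(index_batches), len(index_batches[0]))  # 10 16
```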

#### Run batch prediction

In [None]:
predictions = []

In [None]:
for batch in batches:
    segmentation_maps = predict(
        model=segformer,
        feature_extractor=segformer_feature_extractor,
        images=batch,
    )
    predictions.append(segmentation_maps)

Notice that increasing the number of batches by a factor of 10 increases the runtime roughly tenfold, which is the linear scaling you expect from a sequential approach.
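A back-of-the-envelope model makes this scaling explicit and gives a baseline to compare the parallel approaches against. It is an idealized sketch: it assumes every batch takes the same time and ignores model loading, scheduling, and data transfer. The 2.0 seconds per batch is a made-up figure, not a measurement.

```python
import math


def estimated_runtime(n_batches, seconds_per_batch, n_workers=1):
    """Idealized runtime: batches are processed in rounds of n_workers."""
    rounds = math.ceil(n_batches / n_workers)
    return rounds * seconds_per_batch


one_batch = estimated_runtime(1, 2.0)
sequential = estimated_runtime(10, 2.0)
two_workers = estimated_runtime(10, 2.0, n_workers=2)
four_workers = estimated_runtime(10, 2.0, n_workers=4)
print(one_batch, sequential, two_workers, four_workers)  # 2.0 20.0 10.0 6.0
```

With four workers, ten batches need three rounds, so the estimate is 6.0 seconds rather than 5.0: the last round leaves two workers idle.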

In [None]:
predictions[0][0]

Inspect the resulting `predictions` list to see the predicted segmentation maps.

### Summary: sequential batch inference implementation

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="90%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited by that single machine.|

#### Key concepts

<div class="alert alert-info">
  <strong>Batch inference</strong> (also known as offline inference) is the process of generating predictions on a batch of data.
</div>

## Part 4: Distributed, stateless batch inference with Ray Tasks

This first Ray-based approach moves from sequential to parallel inference by passing the model to stateless functions (Ray tasks) that generate predictions.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks.|

### Initialize Ray runtime

In [None]:
import ray

In [None]:
if ray.is_initialized():
    ray.shutdown()

ray.init()

### Put the model and feature extractor in the object store

In [None]:
segformer_ref = ray.put(segformer)
segformer_feature_extractor_ref = ray.put(segformer_feature_extractor)

When you pass an object as an argument to a remote function, Ray calls `ray.put()` implicitly to store that object in the local object store, making it available to all local tasks. However, when that object is large, you want to avoid re-copying it every time it is passed to a remote function or method.

By explicitly storing both the model and the feature extractor in the object store once, you avoid creating multiple copies, which improves performance.

<div class="alert alert-warning">
  <strong>Pro Tip</strong>: Avoid passing the same large argument (such as a model) by value to multiple tasks; use <a href="https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-put">ray.put()</a> and pass by reference instead (`model_ref` instead of `model`). Passing the same large argument by value repeatedly <a href="https://docs.ray.io/en/latest/ray-core/patterns/pass-large-arg-by-value.html">harms performance</a>.
</div>

### Implement remote function for inference

In [None]:
@ray.remote
def inference_task(model, feature_extractor, images):
    return predict(
        model=model,
        feature_extractor=feature_extractor,
        images=images,
    )

Notice that `inference_task` wraps the `predict()` function from before, and `@ray.remote` marks it as a remote function.

A stateless (lambda-style) way of parallelizing prediction is to create Ray tasks that obtain the trained model each time they are called. This makes the prediction task stateless, but at the cost of incurring the model loading overhead on every call.

When called, each Ray task loads the trained model from the local object store to perform inference. This pattern works well for small models, for which loading is not a significant bottleneck.

<div class="alert alert-warning">
  <strong>Pro Tip</strong>: Batches should be large enough to avoid <a href="https://docs.ray.io/en/latest/ray-core/patterns/too-fine-grained-tasks.html">too fine grained tasks</a> anti-pattern.
</div>

### Run parallel batch inference on 10 batches

In [None]:
prediction_refs = []
predictions = []

In [None]:
for batch in batches:
    task_ref = inference_task.remote(
        model=segformer_ref,
        feature_extractor=segformer_feature_extractor_ref,
        images=batch,
    )
    prediction_refs.append(task_ref)

In [None]:
predictions = ray.get(prediction_refs)

Ray schedules the tasks to execute in parallel using the available resources. For each batch:
* call `inference_task.remote` to submit a task (returns immediately)
* append the returned object reference `task_ref` to the list `prediction_refs`

Lastly, you call `ray.get()` on the list of prediction references to retrieve the final results. This step takes the longest to execute because it blocks until all tasks have completed.

In [None]:
predictions[0][0]

**Coding Exercise**

You have seen how the sequential version and stateless inference using Ray Tasks perform on 10 batches of 16 images each. Try scaling the number of batches as well as the number of images per batch to see the effect on performance.

Hint: `BATCH_SIZE` and `N_BATCHES` are set in Part 3 under "Prepare batches".

Note: In order to perform inference on more than 160 total images, you need to set the `SMALL_DATA` flag to `False` to download the complete testing set. 

### Summary: stateless inference - Ray Tasks

#### Key concepts

<div class="alert alert-info">
  <strong>Object store</strong>: Ray's distributed shared-memory store that makes remote objects available anywhere in a Ray cluster.
</div>

<div class="alert alert-info">
  <strong>Stateless inference</strong>: inference that depends only on an inputted trained model and does not preserve state once predictions are generated.
</div>

#### Key API elements

* `ray.init()` - start Ray runtime and connect to the Ray cluster
* `@ray.remote` - decorator for functions and classes, specifying that they will be executed as a task (remote function) or actor (remote class) in a different process
* `.remote` - postfix for calling remote functions and instantiating remote classes; remote operations are asynchronous
* `ray.put()` - put an object in the in-memory object store and return its object reference; use this reference to pass the object to any remote function or method call
* `ray.get()` - get a remote object or a list of remote objects from the object store

## Part 5: Distributed, stateful batch inference with Ray Actors

Moving from stateless to stateful, Ray Actors offer the advantage of holding some mutable internal state as well as avoiding reloading large models for each inference job.

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful batch inference using Ray Actors.|

### Implement remote class for inference

In [None]:
@ray.remote
class PredictionActor:
    def __init__(self, model, feature_extractor):
        self.model = model
        self.feature_extractor = feature_extractor

    def predict(self, images):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(device)
        self.model.eval()

        inputs = self.feature_extractor(images=images, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(pixel_values=inputs.pixel_values.to(device))

        image_sizes = [image.size[::-1] for image in images]
        segmentation_maps_postprocessed = (
            self.feature_extractor.post_process_semantic_segmentation(
                outputs=outputs, target_sizes=image_sizes
            )
        )

        return [j.detach().cpu().numpy() for j in segmentation_maps_postprocessed]

Once again, `@ray.remote` declares that the class will be a Ray Actor. Each actor can then execute remote method calls and maintain its own internal state.

Each instance of `PredictionActor` holds its own replica of the model and feature extractor, which avoids loading them every time `predict` is called.

Note: the `predict` method contains the same core logic as the function you encountered previously, with minor tweaks to fit this pattern.

### Create list of Ray Actors

In [None]:
N_ACTORS = 2

idle_actors = []
for i in range(N_ACTORS):
    idle_actors.append(
        PredictionActor.remote(
            model=segformer_ref, feature_extractor=segformer_feature_extractor_ref
        )
    )

idle_actors

Create a list of `idle_actors` holding each instance of `PredictionActor` to keep a running record of which actors are available for assignment.

Note: `N_ACTORS` is initially set to 2 here, which may limit performance. Ideally, you set the number of actors in proportion to the resources you have available, such as the number of CPUs and/or GPUs.
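As a starting point for sizing, the sketch below derives an actor count from the local CPU count while leaving one core free. The helper `suggest_n_actors` is a hypothetical name, and the "minus one" rule is a heuristic rather than a Ray recommendation. On a multi-node cluster, `ray.cluster_resources()` reports cluster-wide totals and would be the more appropriate source.

```python
import os


def suggest_n_actors(reserved=1):
    """Suggest an actor count from local CPUs, keeping `reserved` cores free."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, available - reserved)


N_ACTORS = suggest_n_actors()
print(f"Suggested number of actors: {N_ACTORS}")
```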

### Run parallel batch inference on 10 batches and assess scalability

In [None]:
def prediction_results_postprocessing(predictions, segmentation_maps):
    predictions.append(segmentation_maps)

`prediction_results_postprocessing` is a simple function in this tutorial that exists to abstract away the final processing step. In practice, it will likely be much more complex.

In [None]:
predictions = []
future_to_actor_mapping = {}

To set up batch inference, create:
* `predictions` - list of final predictions
* `future_to_actor_mapping` - a dictionary that maps ObjectReferences to the actor that promised them

In [None]:
while batches:
    if idle_actors:
        actor = idle_actors.pop()
        batch = batches.pop()
        future = actor.predict.remote(images=batch)
        future_to_actor_mapping[future] = actor
    else:
        [ready], _ = ray.wait(list(future_to_actor_mapping.keys()), num_returns=1)
        actor = future_to_actor_mapping.pop(ready)
        idle_actors.append(actor)
        prediction_results_postprocessing(
            predictions=predictions, segmentation_maps=ray.get(ready)
        )

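# Process any leftover results at the end and release their actors.
for future, actor in future_to_actor_mapping.items():
    prediction_results_postprocessing(
        predictions=predictions, segmentation_maps=ray.get(future)
    )
    # Mark the actor as idle so it can be reused or terminated later.
    idle_actors.append(actor)
future_to_actor_mapping.clear()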

While there are batches left to assign:
* if any actors are idle
    * take an idle actor and assign it a batch
    * store the ObjectReference as a key in `future_to_actor_mapping` with the actor as the value
* else
    * use [ray.wait()](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-wait) to block until the first future is ready, which also [limits pending tasks](https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html)
    * pop the actor that computed the result and add it back to the list of `idle_actors`
    * send the prediction via `ray.get(ready)` to the postprocessing function

Finally, to ensure that all results have been retrieved, call `ray.get()` on any futures remaining in the `future_to_actor_mapping` dictionary.

|<img src="../../_static/assets/Scaling_inference/distributed_timeline.png" width="70%" loading="lazy">|
|:--|
|Timeline of distributed batch inference where a scheduler orchestrates batch assignment as soon as a worker is available.|

In [None]:
predictions[0][0]

Print out the first prediction to verify that `predictions` contains results.

#### Optional: terminate actors after the prediction

In [None]:
# Terminate each idle actor; the returned ObjectRefs are not needed.
for actor in idle_actors:
    actor.__ray_terminate__.remote()

**Coding Exercise**

In this tutorial, the default setting for `N_ACTORS` is 2. Try setting the number of actors to the number of CPUs/GPUs you have available minus 1. How does this affect runtime performance?

### Summary: stateful inference with Ray Actors

#### Key concepts

<div class="alert alert-info">
  <strong>Stateful inference</strong>: inference carried out over stateful processes where Ray actors hold model replicas and can mutate and persist state.
</div>

## Part 6: Distributed batch inference with Ray ActorPool utility

Building off of the previous approach, the ActorPool utility wraps the list of actors to automatically handle futures management with the trade-off of giving up more granular control.

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using Actor Pools for Batch Inference.|

### Prepare batches

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
image_indices_grouped = np.split(np.asarray(image_indices), N_BATCHES)

In [None]:
batches = []

for image_idx in image_indices_grouped:
    batch = [test_dataset[int(i)]["image"] for i in image_idx]
    batches.append(batch)

Recreate the batches for inference since the last batches were popped from the existing list.
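
Note that `np.split` with an integer number of sections raises a `ValueError` when the array cannot be divided evenly (`np.array_split` tolerates uneven splits). If your image count is not a multiple of the batch size, a plain-Python helper such as the hypothetical `chunk` below is one way to group indices, with a smaller final batch.

```python
def chunk(items, batch_size):
    """Split items into consecutive batches; the last one may be smaller."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


indices = list(range(10))
batches_of_indices = chunk(indices, 4)
```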

### Create ActorPool

In [None]:
from ray.util.actor_pool import ActorPool

In [None]:
N_ACTORS = 2

actors = [
    PredictionActor.remote(
        model=segformer_ref, feature_extractor=segformer_feature_extractor_ref
    )
    for _ in range(N_ACTORS)
]

Just as before, you instantiate `N_ACTORS` instances of the `PredictionActor` class, each with its own model and feature extractor replica.

In [None]:
actor_pool = ActorPool(actors)

Then, wrap the actors in an [ActorPool](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-util-actorpool) utility to automatically handle futures management.

### Run parallel batch inference on 10 batches and assess scalability

In [None]:
def actor_call(actor, batch_of_images):
    return actor.predict.remote(images=batch_of_images)

`actor_call` takes an `actor` and a batch of images, and returns an ObjectRef for the resulting image segmentation predictions.

In [None]:
predictions = []

In [None]:
for segmentation_maps in actor_pool.map_unordered(actor_call, batches):
    prediction_results_postprocessing(
        predictions=predictions, segmentation_maps=segmentation_maps
    )

`map_unordered` takes in:
- `actor_call`: a function that takes `(actor, data_item)` as argument and returns an ObjectRef computing the result over the value. The actor will be considered busy until the ObjectRef completes.
- `data`: a list of values that `actor_call(actor, data_item)` should be applied to

Note: `map_unordered` is slightly more efficient than the similar method `actor_pool.map` because it yields results as soon as they finish, and here the order of the results doesn't matter.
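
The difference between ordered and unordered mapping can be seen in a small sketch that does not need Ray. It uses the standard library's `concurrent.futures` as a stand-in: `executor.map` plays the role of an ordered map, and `as_completed` plays the role of `map_unordered`. The `slow_square` function is a hypothetical workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x):
    time.sleep(0.05 * x)  # larger inputs take longer
    return x * x


items = [3, 1, 2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Ordered: results come back in input order, so a slow first item
    # holds up every result behind it.
    ordered = list(executor.map(slow_square, items))

    # Unordered: results are consumed in completion order.
    futures = [executor.submit(slow_square, x) for x in items]
    unordered = [f.result() for f in as_completed(futures)]
```

Both lists contain the same values. Only the ordered variant guarantees that position `i` of the output corresponds to position `i` of the input, so use it when you need to match predictions back to inputs by index.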

In [None]:
predictions[0][0]

#### Optional: terminate actors after the prediction

In [None]:
if not actor_pool.has_next():
    while actor_pool.has_free():
        actor = actor_pool.pop_idle()
        actor.__ray_terminate__.remote()

**Coding Exercise**

While the `ActorPool` utility offers a good level of abstraction above orchestrating actors directly, there are [methods](https://docs.ray.io/en/latest/ray-core/package-ref.html?highlight=actorpool#ray-util-actorpool) available to you to schedule tasks, inspect in-flight jobs, and retrieve idle actors.

Before terminating the actors after the prediction, try looking into the actor pool by printing out which actors are idle and which tasks remain during the inference step.

### Summary: stateful inference with Ray ActorPool utility

#### Key API elements

* `ActorPool()` - wraps the list of actors that run inference


## Part 7: Distributed batch inference with Ray Datasets

Moving towards the higher-level APIs offered by Ray AIR, you'll see how to use Ray Datasets to parallelize the preprocessing and batching of data.

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

### Create Ray dataset with 160 images

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
data = [test_dataset[i]["image"] for i in image_indices]

Once again, you retrieve `BATCH_SIZE * N_BATCHES` (160) random images, this time storing them in a single flat `data` list rather than in pre-grouped batches.

In [None]:
dataset = ray.data.from_items(data)
dataset.show(limit=3)

Then, you create a Ray Dataset from the list of data, and you can inspect that each item is a PIL image.

### Implement class that computes predictions

In [None]:
class PredictionClass:
    def __init__(self, model, feature_extractor):
        self.model = model
        self.feature_extractor = feature_extractor

    def __call__(self, batch):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(device)
        self.model.eval()

        inputs = self.feature_extractor(images=batch, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(pixel_values=inputs.pixel_values.to(device))

        image_sizes = [image.size[::-1] for image in batch]
        segmentation_maps_postprocessed = (
            self.feature_extractor.post_process_semantic_segmentation(
                outputs=outputs, target_sizes=image_sizes
            )
        )

        return [j.detach().cpu().numpy() for j in segmentation_maps_postprocessed]

Each [instance](https://docs.ray.io/en/latest/data/transforming-datasets.html#transform-datasets-writing-udfs) of `PredictionClass` contains a replica of the model and feature extractor; the device is selected inside `__call__`.

Define the `__call__` method of the class to make it a callable class and specify the target method. The core logic of `__call__` remains the same as previous `predict()` functions.
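
One detail in `__call__` is easy to miss: `image.size[::-1]`. A PIL image reports its size as `(width, height)`, while `post_process_semantic_segmentation` expects `target_sizes` as `(height, width)`. The sketch below shows the reversal on plain tuples; `to_height_width` is a hypothetical helper used only for illustration.

```python
def to_height_width(size):
    """Convert a (width, height) tuple into (height, width)."""
    return tuple(size[::-1])


pil_style_sizes = [(640, 480), (1024, 768)]  # (width, height)
target_sizes = [to_height_width(s) for s in pil_style_sizes]
```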

Given a `batch` (list), the [return type](https://docs.ray.io/en/latest/data/transforming-datasets.html#batch-udf-output-types) must be one of:

* `pandas.DataFrame`
* `pyarrow.Table`
* `numpy.ndarray`
* `Dict[str, numpy.ndarray]`
* `list`

### Run parallel batch inference on 160 images and assess scalability

In [None]:
from ray.data import ActorPoolStrategy

In Ray Datasets, transformations can either be carried out by Ray Tasks or Actors. With `ActorPoolStrategy`, you can specify an [autoscaling](https://docs.ray.io/en/latest/data/transforming-datasets.html#compute-strategy) pool of `min` to `max` actors to carry out the transforms.

In [None]:
predictions_dataset = dataset.map_batches(
    PredictionClass,
    batch_size=1,
    num_gpus=0,
    num_cpus=1,
    compute=ActorPoolStrategy(min_size=1, max_size=2),
    fn_constructor_args=(segformer, segformer_feature_extractor),
)

Use the Dataset `map_batches()` [method](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset.map_batches) to apply the model to the Dataset in parallel. You can specify the batch size, the resources each actor requires, and the autoscaling options for the actor pool.

Note: don't forget to pass `fn_constructor_args` to construct `PredictionClass`.
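
To see why a class plus constructor arguments is useful, consider this Ray-free sketch. It assumes a simplified model of what happens on a single worker: the class is constructed once, then called once per batch. `AddOffset` and `apply_in_batches` are hypothetical names, and the real `map_batches` does considerably more (scheduling, serialization, autoscaling).

```python
class AddOffset:
    """Hypothetical stateful transform: setup in __init__, work in __call__."""

    instances_created = 0

    def __init__(self, offset):
        AddOffset.instances_created += 1  # count expensive setups
        self.offset = offset

    def __call__(self, batch):
        return [x + self.offset for x in batch]


def apply_in_batches(fn_cls, items, batch_size, fn_constructor_args=()):
    fn = fn_cls(*fn_constructor_args)  # constructed once, reused for all batches
    out = []
    for start in range(0, len(items), batch_size):
        out.extend(fn(items[start:start + batch_size]))
    return out


result = apply_in_batches(
    AddOffset, [1, 2, 3, 4, 5], batch_size=2, fn_constructor_args=(10,)
)
```

Here three batches are processed but the constructor runs only once, which is the property that makes actor-based transforms attractive when setup (such as loading a model) is costly.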

In [None]:
predictions_dataset.take(limit=1)

**Coding Exercise**

In this approach, you control the actors using an `ActorPoolStrategy`, which sets a lower and upper limit on the dynamic autoscaling of the pool. Try varying the `min_size` and `max_size` of the actor pool in the inference step to see the effect on runtime performance.

After running inference, you can inspect predictions to probe the resulting image array.

### Summary: batch inference with Ray AIR Datasets

#### Key concepts
* parallel reading and [preprocessing](https://docs.ray.io/en/master/data/transforming-datasets.html) of the source data
* managing the autoscaling of the ActorPool using `ActorPoolStrategy`
* declarative keyword arguments instead of fine-grained control over Ray primitives

#### Key API elements
* `map_batches` - a function to apply a transformation and/or model class to all batches

## Part 8: Distributed batch inference with Ray AIR BatchPredictor

With the last approach, you will use Ray's highest-level API for batch inference: BatchPredictor.

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

### Implement Predictor for image data

In [None]:
from ray.train.predictor import Predictor

In [None]:
class SemanticSegmentationPredictor(Predictor):
    def __init__(self, model, feature_extractor):
        super().__init__()
        self.model = model
        self.feature_extractor = feature_extractor

    def _predict_pandas(self, batch):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(device)
        self.model.eval()

        batch = [batch["value"][0]]
        inputs = self.feature_extractor(images=batch, return_tensors="pt")

        with torch.no_grad():
            outputs = self.model(pixel_values=inputs.pixel_values.to(device))

        image_sizes = [image.size[::-1] for image in batch]
        segmentation_maps_postprocessed = (
            self.feature_extractor.post_process_semantic_segmentation(
                outputs=outputs, target_sizes=image_sizes
            )
        )

        df = pd.DataFrame(columns=["segmentation_maps"])
        df.loc[0, "segmentation_maps"] = segmentation_maps_postprocessed

        return df

    @classmethod
    def from_checkpoint(cls, checkpoint, **kwargs):
        checkpoint_data = checkpoint.to_dict()
        return cls(
            model=checkpoint_data["model"],
            feature_extractor=checkpoint_data["feature_extractor"],
        )

Before you run inference, define a custom [Predictor](https://docs.ray.io/en/latest/ray-air/package-ref.html#predictor), `SemanticSegmentationPredictor`, with the same replicas and core `predict()` logic as before, with a few tweaks to fit this pattern.

BatchPredictor also supports framework-specific predictors such as TorchPredictor and TensorflowPredictor. It additionally provides framework-native batch conversions, the ability to resume from an AIR checkpoint, column retention, and aggregation of batch metrics.

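Note: `batch` in `_predict_pandas` is a pandas DataFrame rather than a list. The implementation above reads only the first row (`batch["value"][0]`), so it relies on `batch_size=1`.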

### Implement BatchPredictor

In [None]:
from ray.air import Checkpoint
from ray.train.batch_predictor import BatchPredictor

In [None]:
batch_predictor = BatchPredictor(
    checkpoint=Checkpoint.from_dict(
        {"model": segformer, "feature_extractor": segformer_feature_extractor}
    ),
    predictor_cls=SemanticSegmentationPredictor,
)

[`BatchPredictor`](https://docs.ray.io/en/latest/ray-air/predictors.html#batch-prediction) takes a [`Checkpoint`](https://docs.ray.io/en/latest/ray-air/package-ref.html#checkpoint) representing the saved model, and allows you to perform inference on an input dataset.
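
The checkpoint-to-predictor step relies on the `from_checkpoint` classmethod defined on the predictor class. The sketch below illustrates that factory pattern in plain Python, under the assumption that a checkpoint can be modeled as a dictionary. `ToyPredictor` and `ScaledPredictor` are hypothetical names and are not part of Ray.

```python
class ToyPredictor:
    """Hypothetical predictor rebuilt from dictionary-style checkpoint data."""

    def __init__(self, model, scale):
        self.model = model
        self.scale = scale

    @classmethod
    def from_checkpoint(cls, checkpoint_data, **kwargs):
        # Using `cls` (not a hard-coded class name) lets subclasses inherit
        # this factory and still get instances of their own type.
        return cls(model=checkpoint_data["model"], scale=checkpoint_data["scale"])

    def predict(self, batch):
        return [self.model(x) * self.scale for x in batch]


class ScaledPredictor(ToyPredictor):
    pass


checkpoint_data = {"model": abs, "scale": 2}
predictor = ScaledPredictor.from_checkpoint(checkpoint_data)
outputs = predictor.predict([-1, 2, -3])
```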

### Run parallel batch inference on 160 images and assess scalability

In [None]:
predictions_dataset = batch_predictor.predict(data=dataset, batch_size=1)

Perform batch inference by using the simple API `batch_predictor.predict()` without specifying *how* execution should occur.

In [None]:
predictions_dataset.take(limit=1)

Once again, you can inspect the predictions to look at the resulting segmentation maps in this pandas DataFrame.

**Coding Exercise**

In our example, we used a custom `Predictor`, but Ray AIR's BatchPredictor offers support for a number of framework-specific predictors. Referring to this [user guide] for assistance, try to implement the same inference logic, but this time, use a [HuggingFacePredictor](https://docs.ray.io/en/master/train/api.html?highlight=huggingfacepredictor#ray.train.huggingface.HuggingFacePredictor.predict) instead.

In [None]:
### YOUR CODE HERE ###

**Solution**

In [None]:
### SAMPLE IMPLEMENTATION ###

import tempfile
from ray.train.huggingface import HuggingFaceCheckpoint, HuggingFacePredictor

with tempfile.TemporaryDirectory() as tmpdir:
    huggingface_checkpoint = HuggingFaceCheckpoint.from_model(
        model=segformer, path=tmpdir
    )
    predictor = BatchPredictor.from_checkpoint(
        checkpoint=huggingface_checkpoint,
        predictor_cls=HuggingFacePredictor,
        feature_extractor=segformer_feature_extractor,  # passed to HF pipeline
        task="image-segmentation",  # passed to HF pipeline
        device=-1,
    )

predictions_dataset = predictor.predict(data=dataset, batch_size=1)
predictions_dataset.take(1)

### Shutdown Ray runtime

In [None]:
ray.shutdown()

Disconnect the worker, and terminate processes started by `ray.init()`.

### Summary: BatchPredictor

#### Key API elements

* `BatchPredictor` - takes a predictor class and checkpoint and provides an interface to run batch scoring on Ray datasets; this batch predictor wraps around a predictor class and executes it in a distributed way when calling `predict()`
* `Checkpoint` - a common interface for accessing models across different AIR components and libraries
* `predict()` - run batch scoring on Dataset


## Part 9: Architectures for scalable batch inference with Ray - final summary

Each of the five approaches introduced in this module is a valid way to scale batch inference on Ray. The one you choose depends on how much control you want over how Ray executes the work.

### Batch inference using Ray Core - parallelism control

If you want to specify how Ray should execute batch inference, then use Ray Tasks, Ray Actors, or the ActorPool utility.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="100%" loading="lazy">|
|:-:|:-:|:-:|
|Ray Tasks|Ray Actors|`ActorPool`|

### Batch inference using Ray AI Runtime - high-level API for productivity

If you want Ray to manage distribution and inference at scale through high-level APIs, then Ray AIR is the right way to go.

For data scientists and machine learning practitioners who care more about getting models to scale for batch inference than about underlying primitives and under-the-hood execution details, Ray AIR is a desirable option.

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="90%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="90%" loading="lazy">|
|:-:|:-:|
|`Dataset.map_batches()`|`BatchPredictor`|

### Per-pattern summary

The table below summarizes some finer points of comparison between the approaches.

||Ray Tasks|Ray Actors|`ActorPool`|`Dataset`|`BatchPredictor`|
|:-:|:-:|:-:|:-:|:-:|:-:|
| Summary | Launch Ray Tasks, loading the model and data batch to execute predictions and write results. | Launch Ray Actors, holding a model replica to reuse for all inference jobs across data batches. | Use ActorPool utility to abstract away futures management of actors. | Use Ray Datasets to parallelize preprocessing and inference. | Use BatchPredictor to load model from Checkpoint and use a given custom class for generating predictions. |
| Exposed Ray primitives | ✅ | ✅ | ✅ |  |  |
| Persistent internal state |  | ✅ | ✅ | ✅ | ✅ |
| Parallelized data fetching |  |  |  | ✅ | ✅ |
| Dynamic autoscaling of actor pool |  |  |  | ✅ | ✅ |
| Automatic batching and pipelining of data|  |  |  | ✅ | ✅ |


# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [Ray documentation](https://docs.ray.io/en/latest)
* [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
* [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
* [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
* [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.
* [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">