# Scalable Batch Inference with Ray

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is it right for you?

This module focuses on batch inference, presenting several approaches for scaling inference on Ray through hands-on examples. It is right for you if:

* you observe performance bottlenecks when working on model (batch) inference problems
* you want to scale or increase throughput of your existing batch inference pipelines
* you wish to explore different architectures for batch inference with Ray Core and Ray AIR

### Prerequisites

For this notebook you should have:

* practical Python and machine learning experience
* familiarity with the batch inference pattern in ML
* familiarity with Ray and Ray AIR equivalent to completing these training modules:
  * [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)
  * [Introduction to Ray AIR](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Introduction_to_Ray_AIR.ipynb)
  * [Ray Core](https://github.com/ray-project/ray-educational-materials/tree/main/Ray_Core)

### Learning objectives

Upon completion of this notebook, you will be able to:

* understand common design patterns for batch inference
* identify multiple approaches for scaling batch inference with Ray
* compare the benefits and drawbacks of different inference architectures on Ray for different use cases

### What will you do?

* using a semantic segmentation task, encounter several batch inference implementation approaches using:
  * Ray Tasks
  * Ray Actors
  * Ray ActorPool utility
  * Ray AIR Datasets
  * Ray AIR BatchPredictor
* explore parallelized inference through hands-on coding exercises

## Part 1: Scalable batch inference design patterns with Ray

The end goal for machine learning models is to generate performant predictions over a set of unseen data. In this module, you will parallelize batch inference using Ray Core's API as well as the high-level abstractions available in Ray AI Runtime.

|<img src="../../_static/assets/Scaling_inference/example_ml_workflow.png" width="70%" loading="lazy">|
|:--|
|A simplified machine learning workflow.|

### What is batch inference?

<div class="alert alert-info">
  <strong>Batch inference</strong> (also known as offline inference): the process of generating predictions on a batch of data.
</div>

Unlike *online inference*, where predictions are returned as soon as possible after an observation is produced, batch inference generates predictions over a large volume of input data when an immediate response is not required or not feasible. For example, batch inference is relevant when generating product recommendations with historical customer data or forecasting using time-aggregated observations.

|<img src="../../_static/assets/Scaling_inference/batch_inference.png" width="70%" loading="lazy">|
|:--|
|Batch inference feeds a batch of data into a trained model and outputs predictions.|

Below, you will conceptually encounter five architectures for performing batch inference on Ray.

### Stateless inference using Ray Tasks

In the most naive approach, inference could be performed sequentially where the pre-trained model scores incoming batches of data one after another. The first step towards parallelizing this process using Ray involves stateless inference by:

1. exporting the model's mathematical core into a language agnostic format
2. restoring the architecture and weights of a trained model in a stateless function (i.e. Ray tasks)

A Ray task is *stateless* because it computes an output (e.g. predictions) determined purely by its input (e.g. the trained model) without keeping track of any new information.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks.|

In the figure above, Ray assigns batches to workers via Tasks as soon as a worker becomes available. Each Task loads the trained model and outputs predictions independent of the other concurrent inference jobs.

<img src="../../_static/assets/Scaling_inference/code_task.png" width="70%" loading="lazy">

### Stateful inference using Ray Actors

Loading large, complex models into memory can be computationally expensive. In addition, you may want the flexibility to capture some persistent internal state. This second approach avoids loading and discarding the model after each batch by:

1. creating a number of replicas (i.e. Ray Actors) of the trained model
2. feeding data into these model replicas in parallel and retrieving inference results

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful inference using Ray Actors.|

Ray Actors hold stateful model replicas which generate predictions from batches of data.

<img src="../../_static/assets/Scaling_inference/code_actor.png" width="70%" loading="lazy">
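The difference between the stateless and stateful approaches can be sketched in plain Python, without Ray. In the minimal example below, `load_model`, `stateless_predict`, and `StatefulPredictor` are hypothetical names, and the "model" is just a function that adds one to each input. The point is only to count how often the model gets loaded in each style.

```python
LOAD_COUNT = 0


def load_model():
    """Pretend to load a model and count how often loading happens."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda batch: [x + 1 for x in batch]


def stateless_predict(batch):
    model = load_model()  # loaded and discarded on every call
    return model(batch)


class StatefulPredictor:
    def __init__(self):
        self.model = load_model()  # loaded once and kept as state

    def predict(self, batch):
        return self.model(batch)


demo_batches = [[1, 2], [3, 4], [5, 6]]

stateless_out = [stateless_predict(b) for b in demo_batches]
stateless_loads = LOAD_COUNT

LOAD_COUNT = 0
predictor = StatefulPredictor()
stateful_out = [predictor.predict(b) for b in demo_batches]
stateful_loads = LOAD_COUNT

print(f"stateless loads: {stateless_loads}, stateful loads: {stateful_loads}")
```

Both styles return the same predictions; the stateful version simply pays the loading cost once per replica instead of once per batch.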

### Stateful inference using Ray ActorPool utility

When using Ray Actors, you need to keep a backlog of tasks "in-flight" and track when Actors become available to assign new work until the entire process completes. To avoid this hassle, Ray provides a convenient [ActorPool](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-util-actorpool) utility which wraps a list of actors and automatically handles futures management.

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using ActorPools for Batch Inference.|

In essence, the ActorPool wraps around the `n` actors so you do not have to track idle actors or distribute workloads manually.

<img src="../../_static/assets/Scaling_inference/code_actorpool.png" width="70%" loading="lazy">

### Batch inference using Ray AIR Datasets

In the previous few approaches, there exist some unoptimized aspects to discuss:
* dispatching file splits one at a time may be inefficient for small batches or cause OutOfMemory errors if batches are too large (e.g. on GPUs)
* you may want to have multiple tasks sent to an Actor at once (i.e. pipelining task submission)
* data fetching and batch preprocessing could be parallelized as well

While you could control how Ray executes by implementing performance optimizations through Ray Core primitives, [Ray AIR](https://docs.ray.io/en/latest/ray-air/getting-started.html) offers high-level composable APIs that have these optimizations built-in.

[Ray Datasets](https://docs.ray.io/en/latest/data/dataset.html) allows for:
1. parallel reading and preprocessing of source data
2. dynamic autoscaling of the actor pool
3. automatic batching and pipelining of data

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

In Ray AIR, a trained model is stored in a `Checkpoint` object (which could come from training or tuning). An AIR `Predictor` loads the model from the `Checkpoint` to perform inference. Then, using the preprocessed batches provided by Ray Datasets, you generate predictions on the test data.

<img src="../../_static/assets/Scaling_inference/code_dataset.png" width="70%" loading="lazy">

### Batch inference using Ray AIR BatchPredictor

Finally, Ray AIR's [`BatchPredictor`](https://docs.ray.io/en/latest/ray-air/package-ref.html#batch-predictor) takes in a [`Checkpoint`](https://docs.ray.io/en/latest/ray-air/package-ref.html#checkpoint) which represents the saved model. This high-level abstraction offers simple and composable APIs that let you preprocess data in batches with [BatchMapper](https://docs.ray.io/en/latest/ray-air/package-ref.html#generic-preprocessors) and instantiate a distributed predictor from checkpoint data.

As a part of Ray AIR, you specify what you want done through a set of declarative key-value arguments rather than concerning yourself with how to instruct Ray to scale.

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

The AIR `BatchPredictor` takes both the `Checkpoint` and `Predictor` to replace the process of manually performing inference on a large dataset.

<img src="../../_static/assets/Scaling_inference/code_batchpredictor.png" width="70%" loading="lazy">

## Part 2: Data and model - computer vision transformers for semantic segmentation

To demonstrate each architecture, you will implement each approach by running inference on a variation on an object detection task: semantic segmentation.

### MIT ADE20K - scene parsing benchmark

Semantic, or image, segmentation takes a scene and classifies image objects into semantic [categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit?usp=sharing) pixel-by-pixel. Often used as a standard for assessing segmentation model quality, the [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/) (also known as "SceneParse150") provides the largest open source data set for scene parsing.

|<img src="../../_static/assets/Scaling_inference/scene.png" width="70%" loading="lazy">|
|:--|
|Test image on the left vs. predicted result on the right. [Source](https://github.com/CSAILVision/semantic-segmentation-pytorch) *Date accessed: November 10, 2022*|

Data set highlights:

* 20k annotated, scene-centric training images
* 3.3k test images
* 150 total categories such as person, car, bed, sky, and more

### SegFormer - transformer-based framework for semantic segmentation

[SegFormer](https://arxiv.org/pdf/2105.15203.pdf) is a simple and powerful semantic segmentation method whose architecture consists of a hierarchical Transformer encoder and a lightweight All-MLP decoder. What sets SegFormer apart from previous approaches boils down to two key features:

1. a novel hierarchically structured Transformer encoder which does not depend on positional encoding, avoiding interpolation when test resolution differs from training
2. a lightweight MLP decoder that avoids the need for complex decoders

SegFormer has demonstrated success on benchmarks such as Cityscapes and the [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/). You will use a pre-trained version to perform inference on test images from MIT ADE20K (SceneParse150).

|<img src="../../_static/assets/Scaling_inference/segformer_architecture.png" width="70%" loading="lazy">|
|:--|
|SegFormer architecture taken from the [original paper](https://arxiv.org/pdf/2105.15203.pdf). *Date accessed: November 10, 2022*|


## Part 3: Sequential batch inference

To begin, you will build a basic, sequential version of batch inference.

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="70%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited to what a single machine can deliver.|

In [None]:
import torch
import numpy as np
import pandas as pd
from PIL import Image

# set the seed to a fixed value for reproducibility
torch.manual_seed(201)

### Load pre-trained model from the HuggingFace Hub

In [None]:
from utils import get_labels

In [None]:
id2label, label2id = get_labels()

print(f"Total number of labels: {len(id2label)}")
print(f"Example labels: {list(id2label.values())[:5]}")

`get_labels`, a utility function, provides two dictionary mappings from [HuggingFace](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json):
* `id2label`
* `label2id`

which allows you to convert between ids (int) and labels (str) for the 150 available categories of objects in images.
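As a rough illustration of what these two mappings look like, the sketch below builds them from a tiny hand-written JSON snippet. The three entries and the helper name `build_label_maps` are assumptions for illustration only; the real `get_labels` utility lives in `utils` and may be implemented differently.

```python
import json

# Illustrative stand-in for the label file. JSON object keys are always strings.
RAW_LABELS = '{"0": "wall", "1": "building", "2": "sky"}'


def build_label_maps(raw_json):
    """Return (id2label, label2id) with integer ids (hypothetical helper)."""
    id2label = {int(key): value for key, value in json.loads(raw_json).items()}
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label_demo, label2id_demo = build_label_maps(RAW_LABELS)
print(id2label_demo[2])  # label for id 2
print(label2id_demo["wall"])  # id for the label "wall"
```

Converting the keys to `int` matters because the model's predicted class ids are integers, while keys read from JSON are strings.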

#### Load SegFormer

In [None]:
from transformers import SegformerForSemanticSegmentation

In [None]:
MODEL_NAME = "nvidia/segformer-b0-finetuned-ade-512-512"

segformer = SegformerForSemanticSegmentation.from_pretrained(
    MODEL_NAME, id2label=id2label, label2id=label2id
)

print(f"Number of model parameters: {segformer.num_parameters()/(10**6):.2f} M")

From [HuggingFace](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512), you specify the b0-sized (the smallest, ranging up to b5) SegFormer model.

This pre-trained model contains 3.75 million parameters and is fine-tuned on the MIT ADE20K dataset on images with a 512 x 512 resolution. Keep this in mind when comparing strengths and weaknesses of various batch inference approaches.

#### Create feature extractor

In [None]:
from transformers import SegformerFeatureExtractor

In [None]:
segformer_feature_extractor = SegformerFeatureExtractor.from_pretrained(
    MODEL_NAME, reduce_labels=True
)
segformer_feature_extractor

[Feature extractors](https://huggingface.co/docs/transformers/main_classes/feature_extractor) preprocess input features by normalizing, resizing, padding, and converting raw images into the desired cleaned shape.

The [`reduce_labels`](https://huggingface.co/docs/transformers/model_doc/segformer#segformer) flag ensures that the "background" of an image isn't counted as its own separate category when computing loss. 
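To make the effect of `reduce_labels` concrete, here is a minimal plain-Python sketch of the remapping as the Hugging Face documentation describes it: every label is shifted down by one, and the background label `0` is replaced by the ignore value `255`. The function name `reduce_label_ids` is hypothetical, and this is not the library's own implementation.

```python
IGNORE_INDEX = 255  # value that the loss conventionally skips


def reduce_label_ids(mask):
    """Shift labels down by one; background (0) becomes IGNORE_INDEX."""
    return [
        [
            IGNORE_INDEX if label in (0, IGNORE_INDEX) else label - 1
            for label in row
        ]
        for row in mask
    ]


# A toy 2x3 annotation mask: 0 is background, 1..150 are object categories.
annotation = [[0, 1, 1], [2, 0, 150]]
reduced = reduce_label_ids(annotation)
print(reduced)
```

After the remapping, the 150 categories occupy ids 0 to 149, which lines up with the keys of `id2label`.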

### Prepare SceneParse150 dataset

#### Load dataset from the HuggingFace Hub

In [None]:
from datasets import load_dataset
from utils import convert_image_to_rgb

In [None]:
SMALL_DATA = True

<div class="alert alert-warning">
  <strong>SMALL_DATA</strong>: default `True` - set to download only 160 images from the data set. Set to `False` (recommended) to work with full data set (3352 images).
</div>

In [None]:
DATASET_NAME = "scene_parse_150"

if SMALL_DATA:
    train_dataset = load_dataset(DATASET_NAME, split="train[:10]")
    test_dataset = load_dataset(DATASET_NAME, split="test[:160]")
else:
    train_dataset = load_dataset(DATASET_NAME, split="train[:10]")
    test_dataset = load_dataset(DATASET_NAME, split="test")

In [None]:
train_dataset

In [None]:
test_dataset = test_dataset.map(convert_image_to_rgb)
test_dataset

If you set `SMALL_DATA` to `False` it will take some time (depending on your connection download speed), because you download data - over 20k images to the local machine or cluster.

Inspecting the training dataset, features include the image preprocessed by the FeatureExtractor, human annotations of image regions (annotation mask is `None` in testing set), and the category of the scene generally (e.g. driveway, voting booth, dairy_outdoor). Data splits are 20,210 images for training, 3,352 images for testing, and 2000 images for validation.

A few explanations:
* some images use mode "L" instead of "RGB". "L" stands for luminance: 8-bit, single-channel grayscale pixels.
* "scene_parse_150" is the name of the data set in the Hugging Face datasets repository.
* If you set `SMALL_DATA` to `False`, loading will take some time (depending on your connection's download speed) because you download 3352 images to the local machine or cluster.
* You downloaded:
  * `test_dataset`: dataset that you will use for batch inference purposes,
  * `train_dataset`: small training set sample (10 images) for visualization purposes.
* The `load_dataset` utility loads the SceneParse150 dataset from Hugging Face's `datasets` library.

#### Display example images

In [None]:
from utils import display_example_images

In [None]:
# try running multiple times!
display_example_images(train_dataset)

### Run sequential batch inference on the single batch and visualize predictions

In [None]:
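def predict(model, feature_extractor, images, device):
    # Make sure the model weights live on the same device as the inputs.
    model = model.to(device)

    # Resize and normalize the PIL images into a batch of PyTorch tensors.
    inputs = feature_extractor(images=images, return_tensors="pt")
    outputs = model(pixel_values=inputs.pixel_values.to(device))

    _target_sizes = [
        image.size[::-1] for image in images
    ]  # PIL returns (WxH), HF expects (HxW)
    # Upsample the logits to each original image size and pick a class per pixel.
    _segmentation_maps = feature_extractor.post_process_semantic_segmentation(
        outputs=outputs, target_sizes=_target_sizes
    )

    return [j.detach().cpu().numpy() for j in _segmentation_maps]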

The `predict` function forms the basis for inferencing, and you will reencounter it multiple times throughout this notebook's exploration of approaches.

Inputs

* `model` - which model to use; in this case, SegFormer b0 fine-tuned on 512x512 ADE20K
* `feature_extractor` - the preprocessing mechanism associated with the model
* `images` - a batch (list) of PIL images to run inference on
* `device` - the `torch.device` (CPU or GPU) that the input tensors are moved to

Core Logic

1. The input image is converted using `feature_extractor` into three 512x512 images by color channel. This is independent of the original size of the input image.
2. The 512x512 images are then passed to the model, which then produces 150 128x128 images, one image for each available category. Each image is a mask representing the part of the image that belongs to a category.
3. In order to display the predicted regions on top of the original image, the 150 128x128 images are upsampled to the size of the original.
4. The 150 images are then collapsed into a single image using `argmax`, where each pixel has the label ID of the category predicted for that pixel.
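The last step can be illustrated on a toy example. The sketch below collapses per-category score maps into one label map in plain Python; `collapse_to_label_map` is a hypothetical helper, and the scores are made-up numbers rather than model output. It also shows the `(width, height)` to `(height, width)` reversal that `predict` applies to PIL image sizes.

```python
def collapse_to_label_map(class_scores):
    """class_scores[c][y][x] -> label_map[y][x] holding the best class id."""
    n_classes = len(class_scores)
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    return [
        [
            max(range(n_classes), key=lambda c: class_scores[c][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Three categories scored on a 2x2 image (toy numbers).
scores = [
    [[0.9, 0.1], [0.2, 0.3]],  # category 0
    [[0.05, 0.8], [0.1, 0.3]],  # category 1
    [[0.05, 0.1], [0.7, 0.4]],  # category 2
]
label_map = collapse_to_label_map(scores)
print(label_map)

# PIL reports image size as (width, height); reversing gives (height, width).
pil_size = (640, 480)
target_size = pil_size[::-1]
print(target_size)
```

With NumPy, the same collapse is a single call to `np.argmax(scores, axis=0)`.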

#### Run batch prediction on 16 images

In [None]:
from utils import get_image_indices

In [None]:
BATCH_SIZE = 16

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE)
image_indices

Get `BATCH_SIZE` random image IDs from the test data set to run inference on. Each time you run this code, a different set of images is selected from the test set.

In [None]:
batch = [test_dataset[i]["image"] for i in image_indices]
batch

In [None]:
%%time

segmentation_maps = predict(
    model=segformer,
    feature_extractor=segformer_feature_extractor,
    images=batch,
    device=dev,
)

Time how long it takes to run `predict` on a single batch of images.

In [None]:
segmentation_maps[0]

#### Analyse performance

<div class="alert alert-info">
  <strong>Performance</strong>: time needed to run inference on a batch of data. Measured in seconds.
</div>

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|0.45s        |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

The average time is approximately 0.45 seconds for a single image with the SegFormer model with 3.75M parameters (b0 variant).


*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

#### Visualize example predictions

In [None]:
from utils import visualize_predictions

In [None]:
visualize_predictions(image=batch[0], segmentation_maps=segmentation_maps[0])

### Run sequential batch inference on 10 batches

#### Prepare batches

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
image_indices_grouped = np.split(np.asarray(image_indices), N_BATCHES)
image_indices_grouped

In [None]:
batches = []

for image_idx in image_indices_grouped:
    batch = [test_dataset[int(i)]["image"] for i in image_idx]
    batches.append(batch)

batches[0]

#### Run batch prediction

In [None]:
predictions = []

In [None]:
for batch in batches:
    segmentation_maps = predict(
        model=segformer,
        feature_extractor=segformer_feature_extractor,
        images=batch,
        device=dev,
    )
    predictions.append(segmentation_maps)

In [None]:
predictions[0][0]

Notice that increasing the number of batches by a factor of 10 increases the runtime by roughly the same factor, which is the kind of linear scaling you expect from a sequential approach.

##### Analyse performance

Some experimental results:

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|4.7s         |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Performance is a linear function of the number of batches. Single batch performance was 0.45s -> 10 batches is 4.7s.
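The linear relationship above can be checked with a few lines of arithmetic. This is only a sketch using the two timings reported for the M1 MacBook Pro; `estimate_sequential_runtime` is a hypothetical helper name.

```python
def estimate_sequential_runtime(seconds_per_batch, n_batches):
    """Linear model: total time grows in proportion to the number of batches."""
    return seconds_per_batch * n_batches


estimated_s = estimate_sequential_runtime(seconds_per_batch=0.45, n_batches=10)
measured_s = 4.7  # reported above for 10 batches

print(f"estimated: {estimated_s:.2f}s, measured: {measured_s:.2f}s")
print(f"relative gap: {abs(measured_s - estimated_s) / measured_s:.1%}")
```

A small gap between the estimate and the measurement is expected, since run-to-run timing varies.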


### Summary: sequential batch inference implementation

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="70%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited to what a single machine can deliver.|

|Compute       |1 image|10 images|100 images|
|:-------------|:------|:--------|:---------|
|M1 MacBook Pro|0.45s  |4.7s     |53s       |
|cluster 1 AWS |x.xxs  |.        |.         |
|cluster 2 AWS |x.xxs  |.        |.         |

#### Key Concepts

<div class="alert alert-info">
  <strong>Batch inference</strong> (also known as offline inference): the process of generating predictions on a batch of data.
</div>

## Part 4: Stateless inference using Ray Tasks

In the first approach using Ray, this implementation transitions from sequential to parallel inferencing by loading the model across stateless functions to generate predictions.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks.|

### Initialize Ray runtime

In [None]:
import ray

if ray.is_initialized():
    ray.shutdown()

cluster_info = ray.init()
cluster_info.address_info

### Put the model and feature extractor in the object store

In [None]:
segformer_ref = ray.put(segformer)
segformer_feature_extractor_ref = ray.put(segformer_feature_extractor)

When passing an object as an argument to a remote function, Ray calls `ray.put()` implicitly to store that object in the local object store, making it available to all local tasks. However, when that object is large, you want to avoid re-copying it every time the object is passed to a remote function or method.

By explicitly storing both the model and feature extractor into the object store, you avoid having multiple copies which improves performance.

<div class="alert alert-warning">
  <strong>Pro Tip</strong>: Avoid passing the same large argument (like model) by value to multiple tasks, use <a href="https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-put">ray.put()</a> and pass by reference instead (`model_ref`, instead of `model`). Passing the same large argument by value repeatedly <a href="https://docs.ray.io/en/latest/ray-core/patterns/pass-large-arg-by-value.html">harms performance</a>.
</div>

### Implement remote function for inference

In [None]:
@ray.remote
def inference_task(model, feature_extractor, images, device):
    return predict(
        model=model,
        feature_extractor=feature_extractor,
        images=images,
        device=device,
    )

Notice here that `inference_task` wraps the `predict()` function from before, and `@ray.remote` specifies this as the remote function.

A stateless (lambda-style) way of parallelizing prediction is to create Ray tasks that load the trained model internally when called. This keeps the prediction task stateless, but at the cost of loading the model on every call.

When called, each Ray task loads the trained model from the local object store to perform inference. This pattern works well for small models which do not encounter the same level of bottleneck issues upon model loading.

<div class="alert alert-warning">
  <strong>Pro Tip</strong>: Batches should be large enough to avoid <a href="https://docs.ray.io/en/latest/ray-core/patterns/too-fine-grained-tasks.html">too fine grained tasks</a> anti-pattern.
</div>
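One simple way to control task granularity is to group items into batches before submitting them. The helper `chunk` below is a hypothetical plain-Python sketch, not part of Ray or of this notebook's `utils` module.

```python
def chunk(items, batch_size):
    """Split items into consecutive batches; the last one may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


image_ids = list(range(10))
id_batches = chunk(image_ids, batch_size=4)
print(id_batches)
```

Unlike `np.split` with an integer number of sections, this approach does not require the number of items to divide evenly into batches.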

### Run parallel batch inference on 10 batches

In [None]:
prediction_refs = []
predictions = []

In [None]:
%%time

for batch in batches:
    task_ref = inference_task.remote(
        model=segformer_ref,
        feature_extractor=segformer_feature_extractor_ref,
        images=batch,
        device=dev,
    )
    prediction_refs.append(task_ref)

In [None]:
%%time

predictions = ray.get(prediction_refs)

In [None]:
predictions[0][0]

Ray schedules each task to execute in parallel using the available resources. For each batch:
* call `inference_task.remote` to assign a task (returns immediately)
* store the Object Reference `task_ref` to a list `prediction_refs`

Lastly, you use `ray.get()` on the list of prediction references to retrieve the final results, and this step takes the longest to execute because it waits on all processes to complete in order to access predictions.

#### Performance analysis

<div class="alert alert-info">
  <strong>Performance</strong>: time needed to run inference on a batch of data. Measured in seconds.
</div>

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|13s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Distributed batch inference yields an approximately 4x performance gain compared to the sequential implementation.

* Parallel: 13s.
* Sequential: 53s.

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*
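The speedup quoted above is simply the ratio of the two measured runtimes. A minimal sketch, where `speedup` is a hypothetical helper name:

```python
def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved run is than the baseline."""
    if improved_seconds <= 0:
        raise ValueError("improved_seconds must be positive")
    return baseline_seconds / improved_seconds


# Timings reported above for the M1 MacBook Pro.
sequential_s = 53
parallel_s = 13
observed_speedup = speedup(sequential_s, parallel_s)
print(f"speedup: {observed_speedup:.2f}x")
```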

### Summary: stateless inference - Ray Tasks

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |13s       |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

Average time per prediction is 0.125s. That yields a roughly 4x speedup compared to the sequential approach, which takes approximately 0.45s per prediction.

#### Key Concepts

<div class="alert alert-info">
  <strong>Object store</strong>: Ray's distributed shared-memory store that makes remote objects available anywhere in a Ray cluster.
</div>

<div class="alert alert-info">
  <strong>Stateless inference</strong>: inference that depends only on an inputted trained model and does not preserve state once predictions are generated.
</div>

#### Key API Elements

* `ray.init()` - start Ray runtime and connect to the Ray cluster
* `@ray.remote` - functions and classes decorator specifying that it will be executed as a task (remote function) or actor (remote class) in a different process
* `.remote` - postfix to the remote functions and classes. Remote operations are asynchronous
* `ray.put()` - put an object in the in-memory object store and return an object reference (`ObjectRef`). Use this reference to pass the object to any remote function or method call
* `ray.get()` - get a remote object or a list of remote objects from the object store

## Part 5: Stateful inference with Ray Actors

Moving from stateless to stateful, Ray Actors offer the advantage of holding some mutable internal state as well as avoiding reloading large models for each inference job.

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful batch inference using Ray Actors.|

### Implement remote class for inference

In [None]:
@ray.remote
class PredictionActor:
    def __init__(self, model, feature_extractor):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Keep the model on the same device that the inputs are moved to in `predict`.
        self.model = model.to(self.device)
        self.feature_extractor = feature_extractor

    def predict(self, images):
        inputs = self.feature_extractor(images=images, return_tensors="pt")
        outputs = self.model(pixel_values=inputs.pixel_values.to(self.device))

        _target_sizes = [
            image.size[::-1] for image in images
        ]  # PIL returns (WxH), HF expects (HxW)
        _segmentation_maps = self.feature_extractor.post_process_semantic_segmentation(
            outputs=outputs, target_sizes=_target_sizes
        )

        return [j.detach().cpu().numpy() for j in _segmentation_maps]

Once again, `@ray.remote` declares which class will be a Ray Actor. This actor can then execute remote method calls and maintain its own internal state.

Each instance of `PredictionActor` will hold its own replica of the model, feature extractor, and device to avoid loading these every time a call to `predict` is made.

Note: the `predict` function contains the same core logic as the ones you have encountered previously, with minor tweaks to fit this pattern.

### Create list of Ray Actors

In [None]:
N_ACTORS = 2

idle_actors = []
for i in range(N_ACTORS):
    idle_actors.append(
        PredictionActor.remote(
            model=segformer_ref, feature_extractor=segformer_feature_extractor_ref
        )
    )

idle_actors

Create a list of `idle_actors` filled with each instance of `PredictionActor` to maintain a revolving record of which actors are available for assignment.

### Run parallel batch inference on 10 batches and assess scalability

In [None]:
def prediction_results_postprocessing(predictions, segmentation_maps):
    predictions.append(segmentation_maps)

`prediction_results_postprocessing`, while a simple function in this tutorial, exists to abstract away the final processing step; in practice, it will likely be much more complex.

In [None]:
predictions = []
future_to_actor_mapping = {}

To set up batch inference, create:
* `predictions` - list of final predictions
* `future_to_actor_mapping` - a dictionary that maps each ObjectRef to the actor that promised it

In [None]:
%%time

while batches:
    if idle_actors:
        actor = idle_actors.pop()
        batch = batches.pop()
        future = actor.predict.remote(images=batch)
        future_to_actor_mapping[future] = actor
    else:
        [ready], _ = ray.wait(list(future_to_actor_mapping.keys()), num_returns=1)
        actor = future_to_actor_mapping.pop(ready)
        idle_actors.append(actor)
        prediction_results_postprocessing(
            predictions=predictions, segmentation_maps=ray.get(ready)
        )

# Process any leftover results at the end.
for future in future_to_actor_mapping.keys():
    prediction_results_postprocessing(
        predictions=predictions, segmentation_maps=ray.get(future)
    )

While there are batches left to process:
* if any actors are idle
    * take an idle actor and assign it a batch
    * store the ObjectRef as a key in `future_to_actor_mapping` with the actor as the value
* else
    * use [`ray.wait()`](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-wait) to block until the first in-flight future is ready
    * pop the actor that produced that result and add it back to the list of `idle_actors`
    * send the prediction via `ray.get(ready)` to the postprocessing function

Finally, to ensure that all objects have been retrieved, call `ray.get()` on any remaining futures left in the `future_to_actor_mapping` dictionary.

This approach is based on Ray's [limit pending tasks](https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html) design pattern.
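The same bookkeeping can be sketched without Ray. The example below is a plain-Python analogue, not Ray code: it swaps Ray actors for a `ThreadPoolExecutor` and `ray.wait()` for `concurrent.futures.wait`. The names `fake_predict` and `run_with_limited_in_flight` are hypothetical, and the "inference" just doubles each number.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_predict(batch):
    """Stand-in for model inference: sleep briefly, then transform the batch."""
    time.sleep(0.01)
    return [item * 2 for item in batch]


def run_with_limited_in_flight(work_items, max_in_flight=2):
    pending = list(work_items)  # copy so the caller's list is not consumed
    in_flight = set()
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while pending or in_flight:
            # Top up: submit work while a slot is free.
            while pending and len(in_flight) < max_in_flight:
                in_flight.add(pool.submit(fake_predict, pending.pop()))
            # Block until at least one submission finishes, then collect it.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            results.extend(future.result() for future in done)
    return results


demo_results = run_with_limited_in_flight([[1, 2], [3, 4], [5, 6]])
print(sorted(map(tuple, demo_results)))
```

Because the outer loop runs until both the pending work and the in-flight set are empty, no separate "leftover results" pass is needed in this variant. Results arrive in completion order, so the example sorts them before printing.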

In [None]:
predictions[0][0]

Print out the first prediction to verify that `predictions` contains results.

|<img src="../../_static/assets/Scaling_inference/distributed_timeline.png" width="70%" loading="lazy">|
|:--|
|Timeline of distributed batch inference where a scheduler orchestrates batch assignment as soon as a worker is available.|

#### Optional: terminate actors after the prediction

In [None]:
[actor.__ray_terminate__.remote() for actor in idle_actors]

#### Performance analysis

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|15s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Results for 7 actors

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

### Summary: stateful inference with Ray Actors

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.4s    |15s       |.        |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

<div class="alert alert-info">
  <strong>Stateful inference</strong>: inference carried out over stateful processes where Ray actors hold model replicas and can mutate and persist state
</div>


## Part 6: Stateful inference with Ray ActorPool utility

Building off of the previous approach, the ActorPool utility wraps the list of actors to automatically handle futures management with the trade-off of giving up more granular control.

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using Actor Pools for Batch Inference.|

### Prepare batches

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
image_indices_grouped = np.split(np.asarray(image_indices), N_BATCHES)

In [None]:
batches = []

for image_idx in image_indices_grouped:
    batch = [test_dataset[int(i)]["image"] for i in image_idx]
    batches.append(batch)
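
`np.split` with an integer number of sections requires the array to divide evenly, which holds here because exactly `BATCH_SIZE * N_BATCHES` indices are requested. As a minimal sketch in plain Python, assuming a hypothetical helper `chunk_indices` that is not part of the notebook, the same grouping can be written so that a trailing partial batch is also allowed:

```python
def chunk_indices(indices, batch_size):
    """Group indices into consecutive batches of at most `batch_size` items."""
    return [indices[i : i + batch_size] for i in range(0, len(indices), batch_size)]


# 160 indices with a batch size of 16 give 10 full batches.
even_batches = chunk_indices(list(range(160)), 16)

# 10 indices with a batch size of 4 give two full batches and one partial batch.
ragged_batches = chunk_indices(list(range(10)), 4)
```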

### Create ActorPool

In [None]:
from ray.util.actor_pool import ActorPool

In [None]:
N_ACTORS = 2

actors = [
    PredictionActor.remote(
        model=segformer_ref, feature_extractor=segformer_feature_extractor_ref
    )
    for _ in range(N_ACTORS)
]

Just as before, you create `N_ACTORS` instances of the `PredictionActor` class, each with references to the model and feature extractor.

In [None]:
actor_pool = ActorPool(actors)

Then, wrap the actors in an [ActorPool](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-util-actorpool) utility to automatically handle futures management.

### Run parallel batch inference on 10 batches and assess scalability

In [None]:
def actor_call(actor, batch_of_images):
    return actor.predict.remote(images=batch_of_images)

`actor_call` takes an `actor` and a batch of images and returns an `ObjectRef` for the image segmentation predictions computed by that actor.

In [None]:
predictions = []

In [None]:
%%time

for segmentation_maps in actor_pool.map_unordered(actor_call, batches):
    prediction_results_postprocessing(
        predictions=predictions, segmentation_maps=segmentation_maps
    )

`map_unordered` takes:
- `fn` (here `actor_call`): a function that takes `(actor, data_item)` as arguments and returns an `ObjectRef` that computes the result for that item. The actor is considered busy until the `ObjectRef` completes.
- `values` (here `batches`): a list of values to which `actor_call(actor, data_item)` is applied

Note: `map_unordered` is slightly more efficient than the similar `actor_pool.map` method because it returns results as they complete rather than in submission order, and the order of the results does not matter here.
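
The difference between ordered and unordered result collection can be illustrated without Ray. The sketch below assumes Python's standard `concurrent.futures` as a stand-in: `executor.map` is analogous to `actor_pool.map` (results in input order) and `as_completed` is analogous to `actor_pool.map_unordered` (results in completion order). The function `slow_square` and the simulated delays are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(item):
    """Square a value after a simulated amount of work."""
    value, delay = item
    time.sleep(delay)
    return value * value


# (value, seconds of simulated work); the first item is the slowest.
items = [(1, 0.3), (2, 0.05), (3, 0.1)]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Ordered: results follow the input order, so the slow first item
    # holds back results that are already finished.
    ordered = list(executor.map(slow_square, items))

    # Unordered: each result is yielded as soon as its work completes.
    futures = [executor.submit(slow_square, item) for item in items]
    unordered = [future.result() for future in as_completed(futures)]
```

Both lists contain the same values; only the order in which they become available differs, which matters when postprocessing can start on whichever batch finishes first.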

In [None]:
predictions[0][0]

#### Performance analysis

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|15s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Results for 7 actors

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

#### Optional: terminate actors after the prediction

In [None]:
if not actor_pool.has_next():
    while actor_pool.has_free():
        actor = actor_pool.pop_idle()
        actor.__ray_terminate__.remote()

### Summary: stateful inference with Ray ActorPool utility

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.4s    |15s       |.        |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key API Elements

* `ActorPool()` - wraps the list of actors that run inference


## Part 7: Batch inference with Ray AIR Datasets

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

### Create Ray dataset with 160 images

In [None]:
from ray import data

In [None]:
BATCH_SIZE = 16
N_BATCHES = 10

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_indices = get_image_indices(dataset=test_dataset, n=BATCH_SIZE * N_BATCHES)
# Named `images` so that the list does not shadow the imported `data` module.
images = [test_dataset[i]["image"] for i in image_indices]

In [None]:
dataset = ray.data.from_items(images)
dataset.show(limit=3)

### Implement class that computes predictions

In [None]:
class PredictionClass:
    def __init__(self, model, feature_extractor):
        self.model = model
        self.feature_extractor = feature_extractor
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def __call__(self, batch):
        segmentation_maps = []

        inputs = self.feature_extractor(images=batch, return_tensors="pt")
        outputs = self.model(pixel_values=inputs.pixel_values.to(self.device))

        _target_sizes = [
            image.size[::-1] for image in batch
        ]  # PIL returns (WxH), HF expects (HxW)
        _segmentation_maps = self.feature_extractor.post_process_semantic_segmentation(
            outputs=outputs, target_sizes=_target_sizes
        )

        return [j.detach().cpu().numpy() for j in _segmentation_maps]

The `batch` argument of `__call__` is a list of images.

The return type must be one of:

* `pandas.DataFrame`
* `pyarrow.Table`
* `numpy.ndarray`
* `Dict[str, numpy.ndarray]`
* `list`

For more details, see:

* [Writing user-defined functions (UDFs)](https://docs.ray.io/en/latest/data/transforming-datasets.html#transform-datasets-writing-udfs)
* [`Dataset.map_batches` API reference](https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset.map_batches)
* [Batch UDF output types](https://docs.ray.io/en/latest/data/transforming-datasets.html#batch-udf-output-types)

### Run parallel batch inference on 160 images and assess scalability

In [None]:
from ray.data import ActorPoolStrategy

In [None]:
%%time

predictions_dataset = dataset.map_batches(
    PredictionClass,
    batch_size=1,
    num_gpus=0,
    num_cpus=1,
    compute=ActorPoolStrategy(min_size=1, max_size=2),
    fn_constructor_args=(segformer, segformer_feature_extractor),
)

In [None]:
predictions_dataset.take(limit=1)

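Don't forget to pass `fn_constructor_args`, which supplies the arguments used to construct `PredictionClass`.

`ActorPoolStrategy` tells Ray Datasets to run the transformation on an autoscaling pool of actors, here with between `min_size=1` and `max_size=2` actors.

Try different `batch_size` values and compare the run times.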
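
As a rough mental model of how a callable class, `fn_constructor_args`, and `batch_size` fit together, consider the single-process sketch below. It is a conceptual simplification and not Ray's implementation: `ToyPredictionClass` and `apply_in_batches` are hypothetical names, and with `ActorPoolStrategy` Ray constructs one instance per actor rather than one overall.

```python
class ToyPredictionClass:
    """Hypothetical stand-in for PredictionClass; the 'model' is a multiplier."""

    instances_created = 0

    def __init__(self, scale):
        ToyPredictionClass.instances_created += 1
        self.scale = scale

    def __call__(self, batch):
        # Receives a list of items and returns a list of results.
        return [self.scale * x for x in batch]


def apply_in_batches(fn_cls, items, batch_size, fn_constructor_args=()):
    """Construct the callable once, then apply it to consecutive batches."""
    fn = fn_cls(*fn_constructor_args)  # expensive setup happens only here
    results = []
    for start in range(0, len(items), batch_size):
        results.extend(fn(items[start : start + batch_size]))
    return results


outputs = apply_in_batches(
    ToyPredictionClass, [1, 2, 3, 4, 5], batch_size=2, fn_constructor_args=(10,)
)
```

A larger `batch_size` means fewer calls to `__call__`, each over more items, which is the trade-off to explore when tuning it.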

### Summary: batch inference with Ray AIR Datasets

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |18.9s     |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 8: Ray AIR BatchPredictor

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

### Implement Predictor for image data

In [None]:
from ray.train.predictor import Predictor

In [None]:
class SemanticSegmentationPredictor(Predictor):
    def __init__(self, model, feature_extractor):
        super().__init__()
        self.model = model
        self.feature_extractor = feature_extractor
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def _predict_pandas(self, batch):
        # Uses only the first row of the DataFrame; assumes batch_size=1.
        batch = [batch["value"][0]]
        inputs = self.feature_extractor(images=batch, return_tensors="pt")
        outputs = self.model(pixel_values=inputs.pixel_values.to(self.device))

        _target_sizes = [
            image.size[::-1] for image in batch
        ]  # PIL returns (WxH), HF expects (HxW)
        _segmentation_maps = self.feature_extractor.post_process_semantic_segmentation(
            outputs=outputs, target_sizes=_target_sizes
        )

        df = pd.DataFrame(columns=["segmentation_maps"])
        df.loc[0, "segmentation_maps"] = _segmentation_maps

        return df

    @classmethod
    def from_checkpoint(cls, checkpoint, **kwargs):
        # A classmethod receives the class as `cls`, not an instance.
        checkpoint_data = checkpoint.to_dict()
        return cls(
            model=checkpoint_data["model"],
            feature_extractor=checkpoint_data["feature_extractor"],
        )

The `batch` passed to `_predict_pandas` is a `pandas.DataFrame`.

For more details, see:

* [Batch prediction with Ray AIR Predictors](https://docs.ray.io/en/latest/ray-air/predictors.html#batch-prediction)
* [Predictor API reference](https://docs.ray.io/en/latest/ray-air/package-ref.html#predictor)

A Ray AIR `Predictor` is a class that loads a model from a `Checkpoint` to perform inference.
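
The checkpoint-to-predictor pattern can be sketched in plain Python. The classes below are hypothetical stand-ins and do not use Ray: `ToyCheckpoint` only mimics the `from_dict()` and `to_dict()` calls used in this part, and `ToyPredictor` shows why a `from_checkpoint` classmethod builds the instance through `cls`.

```python
class ToyCheckpoint:
    """Hypothetical dict-backed checkpoint."""

    def __init__(self, data):
        self._data = dict(data)

    @classmethod
    def from_dict(cls, data):
        return cls(data)

    def to_dict(self):
        return dict(self._data)


class ToyPredictor:
    def __init__(self, model):
        self.model = model

    @classmethod
    def from_checkpoint(cls, checkpoint, **kwargs):
        # `cls` is the class the method was called on, so subclasses
        # get instances of their own type.
        return cls(model=checkpoint.to_dict()["model"])

    def predict(self, batch):
        return [self.model(x) for x in batch]


class UppercasePredictor(ToyPredictor):
    def predict(self, batch):
        return [str(y).upper() for y in super().predict(batch)]


toy_checkpoint = ToyCheckpoint.from_dict({"model": lambda x: f"label-{x}"})
toy_predictor = UppercasePredictor.from_checkpoint(toy_checkpoint)
toy_result = toy_predictor.predict([1, 2])
```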

### Implement BatchPredictor

In [None]:
from ray.air import Checkpoint
from ray.train.batch_predictor import BatchPredictor

In [None]:
batch_predictor = BatchPredictor(
    checkpoint=Checkpoint.from_dict(
        {"model": segformer, "feature_extractor": segformer_feature_extractor}
    ),
    predictor_cls=SemanticSegmentationPredictor,
)

### Run parallel batch inference on 160 images and assess scalability

In [None]:
%%time

predictions_dataset = batch_predictor.predict(data=dataset, batch_size=1)

In [None]:
predictions_dataset.count()

In [None]:
predictions_dataset.take(limit=1)

### Summary: BatchPredictor

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |13s       |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 9: Architectures for scalable batch inference with Ray - recap

### Choosing an approach

Each of the five approaches introduced in this module is a valid way to scale batch inference on Ray. Which one you choose depends on how much control you want over how Ray executes the work.

### Batch inference using Ray Core - parallelism control

If you want to specify how Ray should execute batch inference, then use Ray Tasks, Ray Actors, or the ActorPool utility.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="100%" loading="lazy">|
|:-:|:-:|:-:|
|Ray Tasks|Ray Actors|`ActorPool`|

### Batch inference using Ray AI Runtime - high level API for productivity

If you want Ray to manage distribution and inference at scale through high-level APIs, then Ray AIR is the right choice.

For data scientists and machine learning practitioners who care more about scaling models for batch inference than about the underlying primitives and under-the-hood execution details, Ray AIR is a desirable option.

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="90%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="90%" loading="lazy">|
|:-:|:-:|
|`Dataset.map_batches()`|`BatchPredictor`|

### Per-pattern summary

The table below summarizes some finer points of comparison between the different approaches.

||Ray Tasks|Ray Actors|`ActorPool`|`Dataset.map_batches()`|`BatchPredictor`|
|:--|:--|:--|:--|:--|:--|
|Benefits|<ul><li>good for small models</li><li>good for near-identical inference jobs</li></ul>|<ul><li>keeps track of state</li><li>avoids reloading model during remote calls</li></ul>|<ul><li>convenient way to handle multiple actors</li></ul>|<ul><li>parallelize data fetching and preprocessing</li><li>manages the autoscaling of the ActorPool</li></ul>|<ul><li>pipeline task submission</li><li>connect with other Ray AIR components like Checkpoint and Predictor</li></ul>|
|Drawbacks|<ul><li>large models are reloaded every time a task executes</li><li>does not keep track of state</li></ul>|<ul><li>requires futures management</li><li>manual autoscaling</li></ul>|<ul><li>give up control over how Ray Actors execute</li></ul>|<ul><li>relinquish control of batch preprocessing</li></ul>|<ul><li>not the right choice for custom batch preprocessing or custom checkpointing</li></ul>|

# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [Ray documentation](https://docs.ray.io/en/latest)
* [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
* [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
* [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
* [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in to meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.
* [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests and bug reports, and get help via GitHub issues.

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">