# Scalable Batch Inference with Ray

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is it right for you?

This module focuses on batch inference, presenting several approaches for scaling inference on Ray through hands-on examples. It is right for you if:

* you observe performance bottlenecks when working on model (batch) inference problems
* you want to scale or increase throughput of your existing batch inference pipelines
* you wish to explore different architectures for batch inference with Ray Core and Ray AIR

### Prerequisites

For this notebook you should have:

* practical Python and machine learning experience
* familiarity with the batch inference pattern in ML
* familiarity with Ray and Ray AIR equivalent to completing these training modules:
  * [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)
  * [Introduction to Ray AIR](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Introduction_to_Ray_AIR.ipynb)
  * [Ray Core](https://github.com/ray-project/ray-educational-materials/tree/main/Ray_Core)

### Learning objectives

Upon completion of this notebook, you will be able to:

* understand common design patterns for batch inference
* identify multiple approaches for scaling batch inference with Ray
* compare the benefits and drawbacks of different inference architectures on Ray for different use cases

### What will you do?

* using a semantic segmentation task, encounter several batch inference implementation approaches using:
  * Ray Tasks
  * Ray Actors
  * Ray ActorPool utility
  * Ray AIR Datasets
  * Ray AIR BatchPredictor
* explore parallelized inference through hands-on coding exercises

## Part 1: Scalable batch inference design patterns with Ray

The end goal for machine learning models is to generate performant predictions over a set of unseen data. In this module, you will parallelize batch inference using Ray Core's API as well as the high-level abstractions available in Ray AI Runtime.

|<img src="../../_static/assets/Scaling_inference/example_ml_workflow.png" width="70%" loading="lazy">|
|:--|
|Example of a machine learning workflow.|

### Stateless inference - Ray Tasks

Loading complex models into memory can be expensive, and sequential processing of requests limits speed. *Stateless inference* allows an ML system to handle a high volume of requests by:

1. exporting the model's mathematical core into a language agnostic format
2. restoring the architecture and weights of a trained model in a stateless function (i.e. Ray tasks)

A Ray task is *stateless* because its output (e.g. predictions) is determined purely by its inputs (e.g. the trained model). Performing online inference involves loading the model for every request and synchronously serving results.

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks.|

In the figure above, you perform batch inference by preprocessing your big dataset into batches that are assigned to workers via Ray tasks. Each task loads the trained model and outputs predictions on batches as they are assigned.

**Code Snippet**:

```python
object_refs = [task.remote(input) for _ in range(10)]
```

### Stateful inference with Ray Actors

When your deployed model takes too long to generate immediate results, online prediction may not be the right approach. In addition, some situations require predictions to be generated over large volumes of data such as curating personalized playlists. You can use *batch inference*, which is an asynchronous method of batching observations for prediction in advance to process a high volume of samples efficiently.

Setting up distributed batch inference with Ray involves:

1. creating a number of replicas of your model; in Ray, these replicas are represented as Actors (i.e., stateful processes) that can be assigned to GPUs and hold instantiated model objects

2. feeding data into these model replicas in parallel, and retrieving inference results

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful inference using Ray Actors.|

Stateful inference follows the same flow as stateless inference using Ray tasks, but it replaces the Ray tasks with Ray actors and leverages Ray's object store, so the model does not have to be loaded for every batch.

**Code Snippet**:

```python
actors = [ActorCls.remote(input) for _ in range(10)]
```
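To make the contrast between the stateless and stateful patterns concrete, here is a minimal Ray-free sketch in plain Python. The names `load_model`, `stateless_predict`, and `StatefulPredictor` are hypothetical, and the "model" is a trivial stand-in function. The only point is to count how often the model is loaded under each pattern.

```python
# Counts how many times a (fake) model is loaded under each pattern.
load_count = 0


def load_model():
    """Stand-in for an expensive model load; returns a trivial 'model'."""
    global load_count
    load_count += 1
    return lambda batch: [x * 2 for x in batch]


def stateless_predict(batch):
    """Stateless pattern: the model is loaded inside every call."""
    model = load_model()
    return model(batch)


class StatefulPredictor:
    """Stateful pattern: the model is loaded once and kept as state."""

    def __init__(self):
        self.model = load_model()

    def predict(self, batch):
        return self.model(batch)


batches = [[1, 2], [3, 4], [5, 6]]

stateless_results = [stateless_predict(b) for b in batches]
stateless_loads = load_count

load_count = 0
predictor = StatefulPredictor()
stateful_results = [predictor.predict(b) for b in batches]
stateful_loads = load_count

print(f"stateless loads: {stateless_loads}, stateful loads: {stateful_loads}")
```

Both patterns return the same predictions; they differ only in how many times the load step runs.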

### Stateful inference with Ray ActorPool utility

Ray provides a convenient [ActorPool utility](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-util-actorpool) which wraps the above list of actors to avoid futures management.

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using Actor Pools for Batch Inference.|

Building off of the stateful inference diagram, an Actor Pool wraps around the `n` actors so you do not have to manage idle actors and manually distribute workloads.

**Code Snippet**:

```python
from ray.util.actor_pool import ActorPool
actor_pool = ActorPool(actors)
```

### Batch inference with Ray AIR Datasets

Ray Datasets allows for parallel reading and preprocessing of source data along with autoscaling of the ActorPool. As a part of Ray AIR, you specify what you want done through a set of declarative key-value arguments rather than concerning yourself with how to instruct Ray to scale.

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

In Ray AIR, a trained model is loaded into a `Checkpoint` object (which could come from training or tuning). An AIR `Predictor` loads the model from the `Checkpoint` to perform inference. Then, using the preprocessed batches provided by Ray Datasets, you generate predictions on the testing data.

**Code Snippet**:

```python
batches = data.map_batches(
              MyModel,
              num_gpus=1,
              batch_size=1024,
              compute=ray.data.ActorPoolStrategy(min_size=10, max_size=50)
          )
```

### Batch inference with high-level API - BatchPredictor

Ray AIR's [`BatchPredictor`](https://docs.ray.io/en/latest/ray-air/package-ref.html#batch-predictor) takes in a [`Checkpoint`](https://docs.ray.io/en/latest/ray-air/package-ref.html#checkpoint) which represents the saved model. This high-level abstraction offers simple and composable APIs that enable preprocessing data in batches with [BatchMapper](https://docs.ray.io/en/latest/ray-air/package-ref.html#generic-preprocessors) and instantiating a distributed predictor given checkpoint data.

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

Finally, you can use an AIR `BatchPredictor` that takes both the `Checkpoint` and `Predictor` to replace the process of manually performing inference on a large dataset.

**Code Snippet**:

```python
batch_predictor = BatchPredictor(
                      Checkpoint,
                      Predictor
                  )
```

## Part 2: Data and model used in this notebook - vision transformers for semantic segmentation

### SceneParse150 - MIT Scene Parsing Benchmark

Image segmentation takes a scene and classifies image objects [into semantic categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit?usp=sharing) pixel-by-pixel. [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/) (SceneParse150) provides the largest open source dataset for scene parsing, and in this notebook, you will be scaling inference on image regions depicted in these samples.

|<img src="../../_static/assets/Scaling_inference/scene.png" width="70%" loading="lazy">|
|:--|
|Test image on the left vs. predicted result on the right. [Source](https://github.com/CSAILVision/semantic-segmentation-pytorch) *Date accessed: November 10, 2022*|

Dataset Highlights:

* 20k annotated, scene-centric training images
* 2k validation images
* 150 total categories such as person, car, bed, sky, and more

### SegFormer - modern transformer for vision tasks

[SegFormer](https://arxiv.org/pdf/2105.15203.pdf) is a simple and powerful semantic segmentation method whose architecture consists of a hierarchical Transformer encoder and a lightweight All-MLP decoder. What sets SegFormer apart from previous approaches boils down to two key features:

1. a novel hierarchically structured Transformer encoder that does not depend on positional encoding, which avoids interpolating positional codes when the test resolution differs from the training resolution
2. a lightweight All-MLP decoder that avoids complex decoder designs

SegFormer has demonstrated success on benchmarks such as Cityscapes and the [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/). You will use a pretrained version to perform inference on test images from the SceneParse150 dataset.

|<img src="../../_static/assets/Scaling_inference/segformer_architecture.png" width="70%" loading="lazy">|
|:--|
|SegFormer architecture taken from the [original paper](https://arxiv.org/pdf/2105.15203.pdf). *Date accessed: November 10, 2022*|


## Part 3: Sequential batch inference implementation

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="70%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited by the performance of a single machine.|

In [None]:
# imports
import torch
import numpy as np
import pandas as pd
from PIL import Image

In [None]:
torch.manual_seed(201)

Setting the seed to a constant value ensures that multiple runs of the notebook produce the same results.

### Load pre-trained model from the HuggingFace Hub

In [None]:
from utils import get_labels
from transformers import SegformerForSemanticSegmentation

In [None]:
MODEL_NAME = "nvidia/segformer-b0-finetuned-ade-512-512"

There are five different SegFormer models to choose from. The model chosen here is the smallest of the five. These models are pretrained and then fine-tuned on the MIT ADE20K dataset at a resolution of 512x512.

https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512

In [None]:
id2label, label2id = get_labels()
print(f"total labels: {len(id2label)}")
print(f"example labels: {list(id2label.values())[:5]}")

The function `get_labels` downloads from the `huggingface_hub` library the mappings `id2label` and `label2id` between the label IDs and the labels for the categories of objects in the images.
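As a rough illustration of how the two mappings relate, the sketch below builds `label2id` as the inverse of `id2label`. The IDs used here are made up for the example; the real mapping is whatever `get_labels` returns.

```python
# Hypothetical label IDs for illustration; the real mapping comes from get_labels().
example_id2label = {0: "person", 1: "car", 2: "bed", 3: "sky"}

# label2id is simply the inverse of id2label.
example_label2id = {label: idx for idx, label in example_id2label.items()}

print(example_label2id["sky"])
print(example_id2label[example_label2id["car"]])
```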

#### Load SegFormer

In [None]:
model = SegformerForSemanticSegmentation.from_pretrained(
    MODEL_NAME, id2label=id2label, label2id=label2id
)
print(f"number of model parameters: {model.num_parameters()/(10**6):.2f} M")

The model loaded here is the smallest of the five segformers.

#### Create feature extractor

In [None]:
# "reduce_labels" is to drop background from loss compute: https://huggingface.co/docs/transformers/model_doc/segformer#segformer
from transformers import SegformerFeatureExtractor

feature_extractor = SegformerFeatureExtractor.from_pretrained(
    MODEL_NAME, reduce_labels=True
)
feature_extractor

Every Hugging Face model has an associated feature extractor that preprocesses the input features. The flag `reduce_labels` removes the background from the loss computation.

### Prepare SceneParse150 dataset

#### Load dataset from the HuggingFace Hub

In [None]:
# Load dataset from Hugging Face
from datasets import load_dataset

DATASET_NAME = (
    "scene_parse_150"  # name of the dataset on the HuggingFace's datasets repository.
)

# split here only for fast-debug, remove before real use.
# ds = load_dataset(DATASET_NAME, split="train[:50]")  # for dry run only
dataset_dict = load_dataset(DATASET_NAME)
dataset_dict

This can take some time because it downloads the full dataset, over 20k images, to the local machine or cluster.

The `load_dataset` utility loads the SceneParse150 dataset from Hugging Face's `datasets` library.

In [None]:
train_dataset = dataset_dict["train"]

print(f"train_dataset\n{train_dataset}\n")

#### Display example images

In [None]:
from utils import display_example_images

In [None]:
display_example_images(train_dataset)

To get a feel for what this dataset consists of, `display_example_images` shows a few examples from the training split. Hugging Face datasets also come with a `train_test_split` method, which you could use to hold back part of the data (for example, 20%) for testing; this notebook works directly with the predefined `train` split.

### Run inference on a few images and visualize predictions

In [None]:
from utils import visualize_predictions

In [None]:
def predict(model, image, labels, device):
    inputs = feature_extractor(
        images=image, segmentation_maps=labels, return_tensors="pt"
    )
    outputs = model(
        pixel_values=inputs.pixel_values.to(device), labels=inputs.labels.to(device)
    )
    loss = outputs.loss.detach().cpu().numpy()

    upsampled_logits = torch.nn.functional.interpolate(
        outputs.logits.cpu(),
        size=image.size[::-1],
        mode="bilinear",
        align_corners=False,
    )
    return upsampled_logits.argmax(dim=1)[0], loss

The `predict` function uses the input `model` to predict the label of each pixel in the input image. The prediction takes the following steps:

1. The input image is converted using `feature_extractor` into three 512x512 images representing the three color channels of the input image. This is independent of the original size of the input image.
2. The 512x512 images are then passed to the model, which then produces 150 128x128 images, one image for each available category. Each image is a mask representing the part of the image that belongs to that category.
3. In order to display the predicted regions on top of the original image, the 150 128x128 images are upsampled to the size of the original.
4. The 150 images are then collapsed into a single image using `argmax`, where each pixel has the label ID of the category predicted for that pixel.
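The final `argmax` step can be sketched in plain Python on a tiny example. This is only an illustration of the idea, using nested lists instead of tensors; `collapse_to_label_map` is a hypothetical helper and the scores are made up.

```python
def collapse_to_label_map(score_maps):
    """Pick, for each pixel, the category index with the highest score.

    score_maps[c][y][x] is the score of category c at pixel (y, x).
    """
    n_categories = len(score_maps)
    height = len(score_maps[0])
    width = len(score_maps[0][0])
    return [
        [
            max(range(n_categories), key=lambda c: score_maps[c][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Three categories on a 2x2 image (illustrative values only).
score_maps = [
    [[0.9, 0.1], [0.2, 0.3]],   # category 0
    [[0.05, 0.8], [0.1, 0.3]],  # category 1
    [[0.05, 0.1], [0.7, 0.4]],  # category 2
]

label_map = collapse_to_label_map(score_maps)
print(label_map)  # [[0, 1], [2, 2]]
```

In the notebook, the same collapse happens over 150 category maps at the resolution of the original image.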

#### Run inference on train set

In [None]:
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

j = np.random.randint(train_dataset.num_rows)

random_image = train_dataset[j]["image"]
labels = train_dataset[j]["annotation"]

segmentation, loss = predict(model=model, image=random_image, labels=labels, device=dev)

visualize_predictions(image=random_image, predictions=segmentation, loss=loss)

Each time you run this code, a different image from the training set is passed to `predict`, and the categories assigned to each pixel of the image are displayed with a different color.

### Run sequential batch inference on data

#### Run inference on a single image

In [None]:
j = np.random.randint(train_dataset.num_rows)

random_image = train_dataset[j]["image"]
labels = train_dataset[j]["annotation"]

In [None]:
%%time

segmentation, loss = predict(model=model, image=random_image, labels=labels, device=dev)

Time how long it takes to run `predict` on a single image.

##### Performance analysis

<div class="alert alert-info">
  <strong>Performance</strong>: time needed to run inference on the batch of data. Measured in seconds.
</div>

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|0.45s        |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

The average time is approximately 0.45 seconds for a single image with the SegFormer b0 variant (3.7M parameters).


*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

#### Run batch inference on 10 images

In [None]:
from utils import get_image_indices

In [None]:
N_IMAGES = 10

image_indices = get_image_indices(dataset=train_dataset, n=N_IMAGES)
image_indices

Get 10 random image IDs from the training data set to run inference on.

In [None]:
%%time

predictions = []

for i in image_indices:
    image = train_dataset[i]["image"]
    labels = train_dataset[i]["annotation"]
    segmentation, loss = predict(model=model, image=image, labels=labels, device=dev)
    predictions.append((segmentation, loss))

Time how long it takes to run `predict` in series on the 10 random images.

##### Performance analysis

Some experimental results:

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|4.7s         |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Run time grows linearly with the number of images in the batch: a single image took 0.45s, and 10 images took 4.7s.
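The arithmetic behind this observation can be written out as a short sketch. The helper names are hypothetical, and the numbers are the M1 MacBook Pro timings quoted above; the extrapolation assumes strictly linear scaling, which real runs only approximate.

```python
def per_image_seconds(total_seconds, n_images):
    """Average wall-clock time per image."""
    return total_seconds / n_images


def extrapolate(total_seconds, n_images, target_images):
    """Predict total time for target_images, assuming linear scaling."""
    return per_image_seconds(total_seconds, n_images) * target_images


# Timings quoted above for the M1 MacBook Pro.
single = per_image_seconds(0.45, 1)
batch_of_10 = per_image_seconds(4.7, 10)
predicted_100 = extrapolate(4.7, 10, 100)

print(f"{single:.2f}s vs {batch_of_10:.2f}s per image")
print(f"predicted time for 100 images: {predicted_100:.0f}s")
```

The summary table for this part reports 53s for 100 images, slightly above the 47s this extrapolation gives.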

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

### Summary: sequential batch inference implementation

|<img src="../../_static/assets/Scaling_inference/single_sequential_timeline.png" width="70%" loading="lazy">|
|:--|
|Sequential inference on a single worker. Performance is limited by the performance of a single machine.|

|Compute       |1 image|10 images|100 images|
|:-------------|:------|:--------|:---------|
|M1 MacBook Pro|0.45s  |4.7s     |53s       |
|cluster 1 AWS |x.xxs  |.        |.         |
|cluster 2 AWS |x.xxs  |.        |.         |

#### Key Concepts

#### Key API Elements

## Part 4: Stateless inference - Ray Tasks

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="70%" loading="lazy">|
|:--|
|Stateless inference using Ray Tasks.|

### Initialize Ray runtime

In [None]:
import ray

if ray.is_initialized():
    ray.shutdown()

cluster_info = ray.init()
cluster_info.address_info

### Put the model and feature extractor in the object store

In [None]:
model_ref = ray.put(model)
feature_extractor_ref = ray.put(feature_extractor)

Place the model and feature extractor in the object store to avoid copying every time the model is passed to a remote function or method.

### Implement remote function for inference

In [None]:
@ray.remote
def inference_task(model, image, labels, device):
    return predict(model=model, image=image, labels=labels, device=device)

The most naive way to parallelize prediction is to create Ray tasks that load the trained model internally when called. This makes the prediction task "stateless", but at the cost of loading the model every single time. This is akin to what serverless solutions like AWS Lambda do, and the pattern can be worthwhile for tiny models, for which the model loading step does not become the bottleneck.

### Run batch inference on 100 images and assess scalability

In [None]:
N_IMAGES = 100

image_indices = get_image_indices(dataset=train_dataset, n=N_IMAGES)

In [None]:
%%time

prediction_refs = []
for i in image_indices:
    task_ref = inference_task.remote(
        model=model_ref,
        image=train_dataset[i]["image"],
        labels=train_dataset[i]["annotation"],
        device=dev,
    )
    prediction_refs.append(task_ref)

predictions = ray.get(prediction_refs)

Each call to the remote function `inference_task.remote` returns immediately. Ray then schedules each task to execute in parallel using the available resources. Calling `ray.get` waits for all the predictions to finish and returns the final results.
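The "submit now, collect later" behavior can be illustrated without Ray using the standard library's `concurrent.futures`. This is only an analogy: `executor.submit` plays the role of `inference_task.remote`, and `future.result()` plays the role of `ray.get`. `slow_square` is a hypothetical stand-in for an inference call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an inference call that takes a while."""
    time.sleep(0.1)
    return x * x


with ThreadPoolExecutor(max_workers=4) as executor:
    start = time.perf_counter()
    # submit() returns a future right away, like inference_task.remote().
    futures = [executor.submit(slow_square, x) for x in range(4)]
    submit_seconds = time.perf_counter() - start

    # result() blocks until the value is ready, like ray.get().
    results = [f.result() for f in futures]
    total_seconds = time.perf_counter() - start

print(results)
print(f"submit: {submit_seconds:.3f}s, total: {total_seconds:.3f}s")
```

Submitting is nearly instantaneous; the waiting happens only when the results are collected.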

#### Performance analysis

<div class="alert alert-info">
  <strong>Performance</strong>: time needed to run inference on the batch of data. Measured in seconds.
</div>

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|13s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Distributed batch inference yields an approximately 4x performance gain compared to the sequential implementation.

* Parallel: 13s.
* Sequential: 53s.
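The speedup figure follows directly from the two timings. A minimal sketch of the calculation, using the numbers quoted in this notebook for 100 images on the M1 MacBook Pro (`speedup` is a hypothetical helper):

```python
def speedup(baseline_seconds, parallel_seconds):
    """How many times faster the parallel run is than the baseline."""
    return baseline_seconds / parallel_seconds


# Timings quoted in this notebook for 100 images on the M1 MacBook Pro.
sequential_100 = 53.0
ray_tasks_100 = 13.0

gain = speedup(sequential_100, ray_tasks_100)
per_image = ray_tasks_100 / 100

print(f"speedup: {gain:.1f}x, per image: {per_image:.2f}s")
```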

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

### Summary: stateless inference - Ray Tasks

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |13s       |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

Average time per prediction is about 0.125s (125s for 1k images). That is roughly a 4x speedup over the sequential approach, which takes approximately 0.45s to 0.53s per image.

#### Key Concepts

#### Key API Elements


## Part 5: Stateful inference with Ray Actors

|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="70%" loading="lazy">|
|:--|
|Stateful batch inference using Ray Actors.|

### Implement remote class for inference

In [None]:
@ray.remote
class PredictionActor:
    def __init__(self, model, feature_extractor):
        self.model = model
        self.feature_extractor = feature_extractor
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def predict(self, image, labels):
        inputs = self.feature_extractor(
            images=image, segmentation_maps=labels, return_tensors="pt"
        )
        outputs = self.model(
            pixel_values=inputs.pixel_values.to(self.device),
            labels=inputs.labels.to(self.device),
        )
        loss = outputs.loss.detach().cpu().numpy()

        upsampled_logits = torch.nn.functional.interpolate(
            outputs.logits.cpu(),
            size=image.size[::-1],
            mode="bilinear",
            align_corners=False,
        )

        return upsampled_logits.argmax(dim=1)[0], loss

The `predict` method is the same as in the sequential implementation.

The benefit of using actors over tasks is that actors allow keeping track of state. In this particular case, each instance of `PredictionActor` will hold its own copy of the model, to avoid having to load the model every time a call to `predict` is made.

### Create list of Ray Actors

In [None]:
N_ACTORS = 7

idle_actors = []
for i in range(N_ACTORS):
    idle_actors.append(
        PredictionActor.remote(model=model_ref, feature_extractor=feature_extractor_ref)
    )

idle_actors

The list is named `idle_actors` because the actors are not doing anything yet.

### Run batch inference on 100 images with Ray Actors and assess scalability

In [None]:
def prediction_results_postprocessing(results, predictions):
    predictions.append(results)

The purpose of `prediction_results_postprocessing` is to abstract away the final processing step. In this demo, the postprocessing step is a very simple one, but in practice it will likely be much more complex.

In [None]:
N_IMAGES = 100
preds = []
future_to_actor_mapping = {}

image_indices = get_image_indices(dataset=train_dataset, n=N_IMAGES)
data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

The variable `future_to_actor_mapping` will hold a mapping from futures to actors, to be able to determine which actors are idle by looking at the finished futures.

The `data` variable is a list of image-annotation pairs, where the annotation contains the correct labeling of all the segments in the image.

In [None]:
%%time

while data:
    if idle_actors:
        actor = idle_actors.pop()
        image, labels = data.pop()
        future = actor.predict.remote(image=image, labels=labels)
        future_to_actor_mapping[future] = actor
    else:
        [ready], _ = ray.wait(list(future_to_actor_mapping.keys()), num_returns=1)
        actor = future_to_actor_mapping.pop(ready)
        idle_actors.append(actor)
        prediction_results_postprocessing(ray.get(ready), preds)

# Process any leftover results at the end.
for future in future_to_actor_mapping.keys():
    prediction_results_postprocessing(ray.get(future), preds)

The `while` loop goes over all the image-annotation pairs, and if there is an idle actor, that actor is assigned the next image-annotation pair to work on. If no actors are idle, the loop waits until an actor finishes, and then assigns the next image-annotation pair to that actor. The `future_to_actor_mapping` is used to keep track of what each actor is working on, so that when a task is finished we know which actor finished it.

Finally, once all the data has been assigned, we wait for all remaining actors to finish their tasks.
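The same scheduling logic can be sketched without Ray using `concurrent.futures` from the standard library. This is an analogy, not the notebook's implementation: integer worker IDs stand in for actors, `wait(..., return_when=FIRST_COMPLETED)` stands in for `ray.wait(..., num_returns=1)`, and `fake_predict`, `work_items`, and `collected` are hypothetical names.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_predict(worker_id, item):
    """Stand-in for actor.predict; sleeps briefly and tags the result."""
    time.sleep(0.01 * (item % 3))
    return worker_id, item * 10


N_WORKERS = 3
work_items = list(range(10))
idle_workers = list(range(N_WORKERS))  # worker IDs play the role of actors
future_to_worker = {}
collected = []

with ThreadPoolExecutor(max_workers=N_WORKERS) as executor:
    while work_items:
        if idle_workers:
            # Hand the next item to an idle worker.
            worker = idle_workers.pop()
            future = executor.submit(fake_predict, worker, work_items.pop())
            future_to_worker[future] = worker
        else:
            # Block until at least one worker finishes, then free it.
            done, _ = wait(list(future_to_worker), return_when=FIRST_COMPLETED)
            for future in done:
                idle_workers.append(future_to_worker.pop(future))
                collected.append(future.result())

    # Collect the results that are still in flight.
    for future in future_to_worker:
        collected.append(future.result())

print(sorted(value for _, value in collected))
```

At most `N_WORKERS` items are in flight at any time, and every item ends up in `collected` exactly once.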

In [None]:
preds[0]

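This scheduling loop is based on the Ray design pattern for [limiting the number of pending tasks](https://docs.ray.io/en/master/ray-core/patterns/limit-pending-tasks.html).

See also the [`ray.wait()` API reference](https://docs.ray.io/en/latest/ray-core/package-ref.html#ray-wait).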

|<img src="../../_static/assets/Scaling_inference/sequential_timeline.png" width="70%" loading="lazy">|
|:--|
|Timeline of sequential batch assignment spread across three workers.|

|<img src="../../_static/assets/Scaling_inference/distributed_timeline.png" width="70%" loading="lazy">|
|:--|
|Timeline of distributed batch inference where a scheduler orchestrates batch assignment as soon as a worker is available.|

#### Performance analysis

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|15s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Results are for 7 actors.

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

### Summary: stateful inference with Ray Actors

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.4s    |15s       |.        |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 6: Stateful inference with Ray ActorPool utility

|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="70%" loading="lazy">|
|:--|
|Using Actor Pools for Batch Inference.|

### Create ActorPool

In [None]:
from ray.util.actor_pool import ActorPool

In [None]:
N_ACTORS = 7

actors = [
    PredictionActor.remote(model=model_ref, feature_extractor=feature_extractor_ref)
    for _ in range(N_ACTORS)
]

In [None]:
actor_pool = ActorPool(actors)

Just as before, each actor is an instance of `PredictionActor`, and `ActorPool` wraps the actors collectively to manage futures automatically.

### Run batch inference on 100 images with ActorPool and assess scalability

In [None]:
def actor_call(actor, data_item):
    image, labels = data_item
    return actor.predict.remote(image=image, labels=labels)

`actor_call` submits one data item to the given actor and returns an ObjectRef for the resulting image segmentation prediction.

In [None]:
N_IMAGES = 100
preds = []

In [None]:
image_indices = get_image_indices(dataset=train_dataset, n=N_IMAGES)
data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

In [None]:
%%time

for result in actor_pool.map_unordered(actor_call, data):
    prediction_results_postprocessing(result, preds)

`map_unordered` takes in:
- `actor_call`: a function that takes `(actor, data_item)` as argument and returns an ObjectRef computing the result over the value. The actor will be considered busy until the ObjectRef completes.
- `data`: a list of values that `actor_call(actor, data_item)` should be applied to
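What "unordered" means here can be shown with a Ray-free analogy from the standard library: `concurrent.futures.as_completed` also yields results in completion order rather than submission order. The names `fake_predict` and `work_items` are hypothetical, and the delays are made up to make the items finish at different times.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_predict(delay_and_value):
    """Stand-in for an inference call whose duration varies per item."""
    delay, value = delay_and_value
    time.sleep(delay)
    return value


# (delay in seconds, value) pairs; the first item is the slowest.
work_items = [(0.3, "a"), (0.1, "b"), (0.2, "c")]

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fake_predict, item) for item in work_items]
    # as_completed yields futures as they finish, not in submission order.
    completion_order = [f.result() for f in as_completed(futures)]

print(completion_order)
```

Every result is returned exactly once, but a slow item no longer holds up the results of faster ones.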

In [None]:
preds[0]

#### Performance analysis

|Compute       |Performance  |
|:-------------|:------------|
|M1 MacBook Pro|15s          |
|cluster 1 AWS |x.xxs        |
|cluster 2 AWS |x.xxs        |

Results are for 7 actors.

*Results are not representative and are meant to provide you with an intuitive understanding of the performance.*

### Summary: stateful inference with Ray ActorPool utility

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.4s    |15s       |.        |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 7: Batch inference with Ray AIR Datasets

|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="70%" loading="lazy">|
|:--|
|Ray Datasets replace the 'Batch preprocessing' stage.|

### Create Ray dataset with 100 images

In [None]:
import ray.data

In [None]:
N_IMAGES = 100
image_indices = get_image_indices(dataset=train_dataset, n=N_IMAGES)

data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

In [None]:
dataset = ray.data.from_items(data)
dataset.show(limit=3)

In [None]:
dataset.take(limit=1)[0][0]

### Implement class that computes predictions

In [None]:
class PredictionClass:
    def __init__(self, model, feature_extractor):
        self.model = model
        self.feature_extractor = feature_extractor
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def __call__(self, batch):
        predictions_list = []

        for data_item in batch:
            image, labels = data_item

            inputs = self.feature_extractor(
                images=image, segmentation_maps=labels, return_tensors="pt"
            )
            outputs = self.model(
                pixel_values=inputs.pixel_values.to(self.device),
                labels=inputs.labels.to(self.device),
            )
            loss = outputs.loss.detach().cpu().numpy()

            upsampled_logits = torch.nn.functional.interpolate(
                outputs.logits.cpu(),
                size=image.size[::-1],
                mode="bilinear",
                align_corners=False,
            )
            upsampled_logits = upsampled_logits.argmax(dim=1)[0]

            predictions_list.append(
                {"prediction": upsampled_logits.detach().cpu().numpy(), "loss": loss}
            )

        return predictions_list

The `batch` argument of `__call__` is a list.

The return type must be one of:

* `pandas.DataFrame`
* `pyarrow.Table`
* `numpy.ndarray`
* `Dict[str, numpy.ndarray]`
* `list`

https://docs.ray.io/en/latest/data/transforming-datasets.html#transform-datasets-writing-udfs

https://docs.ray.io/en/latest/data/api/dataset.html#ray.data.Dataset.map_batches

https://docs.ray.io/en/latest/data/transforming-datasets.html#batch-udf-output-types
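The contract a callable-class UDF has to meet (a list of items in, a list of results out, with setup done once in the constructor) can be sketched in a single process. This is not how Ray Datasets is implemented; `ScaleValues` and `map_batches_locally` are hypothetical names used only to show the shape of the interaction.

```python
class ScaleValues:
    """Callable class: setup in __init__, per-batch work in __call__."""

    def __init__(self, factor):
        self.factor = factor  # stands in for a loaded model

    def __call__(self, batch):
        # Receives a list of items and returns a list of per-item results.
        return [{"prediction": item * self.factor} for item in batch]


def map_batches_locally(items, fn_cls, batch_size, fn_constructor_args=()):
    """Single-process sketch: split items into batches and apply fn_cls."""
    fn = fn_cls(*fn_constructor_args)  # constructed once, reused per batch
    results = []
    for start in range(0, len(items), batch_size):
        results.extend(fn(items[start:start + batch_size]))
    return results


outputs = map_batches_locally(
    list(range(5)), ScaleValues, batch_size=2, fn_constructor_args=(10,)
)
print(outputs)
```

With `batch_size=2`, the five items are processed as batches of 2, 2, and 1, and the per-item results are concatenated in order.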

### Run batch inference on 100 images

In [None]:
from ray.data import ActorPoolStrategy

In [None]:
%%time

results_dataset = dataset.map_batches(
    PredictionClass,
    batch_size=1,
    num_gpus=0,
    compute=ActorPoolStrategy(min_size=1, max_size=7),
    fn_constructor_args=(model, feature_extractor),
)

In [None]:
results_dataset.take(limit=1)

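Don't forget to pass `fn_constructor_args`; `map_batches` uses them to construct `PredictionClass` on each actor.

`ActorPoolStrategy` tells `map_batches` to run the callable class on an autoscaling pool of actors, here between `min_size=1` and `max_size=7` actors.

Try different `batch_size` values and compare the timings.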

### Summary: batch inference with Ray AIR Datasets

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |18.9s     |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 8: Ray AIR BatchPredictor

|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="70%" loading="lazy">|
|:--|
|Using Ray AIR's `BatchPredictor` for Batch Inference.|

### Implement Predictor for image data

In [None]:
from ray.train.predictor import Predictor

In [None]:
class SemanticSegmentationPredictor(Predictor):
    def __init__(self, model, feature_extractor):
        super().__init__()
        self.model = model
        self.feature_extractor = feature_extractor
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def _predict_pandas(self, batch):
        image, labels = batch["value"][0]

        inputs = self.feature_extractor(
            images=image, segmentation_maps=labels, return_tensors="pt"
        )
        outputs = self.model(
            pixel_values=inputs.pixel_values.to(self.device),
            labels=inputs.labels.to(self.device),
        )
        loss = outputs.loss.detach().cpu().numpy()

        upsampled_logits = torch.nn.functional.interpolate(
            outputs.logits.cpu(),
            size=image.size[::-1],
            mode="bilinear",
            align_corners=False,
        )
        upsampled_logits = upsampled_logits.argmax(dim=1)[0]

        df = pd.DataFrame(columns=["prediction", "loss"])
        df.loc[0, "prediction"] = upsampled_logits.detach().cpu().numpy()
        df.loc[0, "loss"] = loss

        return df

    @classmethod
    def from_checkpoint(cls, checkpoint, **kwargs):
        checkpoint_data = checkpoint.to_dict()
        return cls(
            model=checkpoint_data["model"],
            feature_extractor=checkpoint_data["feature_extractor"],
        )

The `batch` argument of `_predict_pandas` is a pandas `DataFrame`.

https://docs.ray.io/en/latest/ray-air/predictors.html#batch-prediction

https://docs.ray.io/en/latest/ray-air/package-ref.html#predictor

A Ray AIR `Predictor` is a class that loads a model from a `Checkpoint` to perform inference.
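The "rebuild the predictor from a checkpoint" pattern can be sketched in plain Python. `DictCheckpoint` and `ToyPredictor` are hypothetical stand-ins, not Ray classes; they only mirror the shape of `Checkpoint.from_dict`/`to_dict` and a `from_checkpoint` classmethod.

```python
class DictCheckpoint:
    """Hypothetical stand-in for a checkpoint that wraps a dictionary."""

    def __init__(self, data):
        self._data = dict(data)

    def to_dict(self):
        return dict(self._data)


class ToyPredictor:
    """Hypothetical predictor that is rebuilt from checkpoint contents."""

    def __init__(self, model):
        self.model = model

    @classmethod
    def from_checkpoint(cls, checkpoint, **kwargs):
        # Restore everything the predictor needs from the checkpoint.
        return cls(model=checkpoint.to_dict()["model"])

    def predict(self, batch):
        return [self.model(item) for item in batch]


toy_checkpoint = DictCheckpoint({"model": lambda x: x + 1})
toy_predictor = ToyPredictor.from_checkpoint(toy_checkpoint)
print(toy_predictor.predict([1, 2, 3]))  # [2, 3, 4]
```

Because the predictor is constructed from the checkpoint alone, any worker that receives the checkpoint can create its own predictor instance.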

### Implement BatchPredictor

In [None]:
from ray.air import Checkpoint
from ray.train.batch_predictor import BatchPredictor

In [None]:
batch_predictor = BatchPredictor(
    checkpoint=Checkpoint.from_dict(
        {"model": model, "feature_extractor": feature_extractor}
    ),
    predictor_cls=SemanticSegmentationPredictor,
)

### Run batch inference on 100 images and assess scalability

In [None]:
preds = batch_predictor.predict(data=dataset, batch_size=1)

In [None]:
preds.count()

### Summary: BatchPredictor

|Compute       |10 images|100 images|1k images|10k images|
|:-------------|:-------|:---------|:--------|:---------|
|M1 MacBook Pro|1.3s    |13s       |125s     |n.a.      |
|cluster 1 AWS |x.xxs   |.         |.        |.         |
|cluster 2 AWS |x.xxs   |.         |.        |.         |

#### Key Concepts

#### Key API Elements


## Part 9: Architectures for scalable batch inference with Ray - recap

|<img src="../../_static/assets/Scaling_inference/task_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_inference.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/actor_pool.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/ray_datasets.png" width="100%" loading="lazy">|<img src="../../_static/assets/Scaling_inference/air_batchpredictor.png" width="100%" loading="lazy">|
|:-:|:-:|:-:|:-:|:-:|
|Ray Tasks|Ray Actors|`ActorPool`|`Dataset.map_batches()`|`BatchPredictor`|

# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [Ray documentation](https://docs.ray.io/en/latest)
* [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
* [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
* [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
* [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.
* [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.

<img src="../../_static/assets/Generic/ray_logo.png" width="20%" loading="lazy">