# Introduction to Ray Serve

This template introduces Ray Serve, a scalable model-serving framework built on Ray. You will learn **what** Ray Serve is, **why** it is a good fit for online ML inference, and **how** to build, deploy, and operate a real model service — starting from a familiar PyTorch classifier and progressively adding features like composition, autoscaling, batching, fault tolerance, and observability.

**Part 1: Core**

1. Why Ray Serve?
2. Build Your First Deployment (MNIST Classifier)
3. Integrating with FastAPI
4. Composing Deployments
5. Resource Specification and Fractional GPUs
6. Autoscaling
7. Observability

**Part 2: Advanced topics**

8. Dynamic Request Batching
9. Model Multiplexing
10. Asynchronous Inference

## Imports

In [1]:
from typing import Any

import json
import logging
import time

import numpy as np
import requests
import torch
from torchvision import transforms

import ray
from ray import serve
from ray.serve.handle import DeploymentHandle
from ray.serve import metrics
from fastapi import FastAPI
from pydantic import BaseModel
from starlette.requests import Request
from matplotlib import pyplot as plt

### Note on Storage

Throughout this tutorial, we use `/mnt/cluster_storage` to represent a shared storage location. In a multi-node cluster, Ray workers on different nodes cannot access the head node's local file system. Use a [shared storage solution](https://docs.anyscale.com/configuration/storage#shared) accessible from every node.

---

## 1. Why Ray Serve?

Consider using Ray Serve when your serving workload has one or more of the following needs:

| **Challenge** | **Ray Serve Solution** |
|---|---|
| **Scalability** — needs to handle variable or high traffic | Autoscaling replicas based on request queue depth; scales across a Ray cluster |
| **Hardware utilization** — GPUs underutilized by one-at-a-time inference | Dynamic request batching and fractional GPU allocation |
| **Model composition** — multiple models or processing stages | Compose heterogeneous deployments with independent scaling; Efficient data transfer between deployments through the Ray object store |
| **Expensive startup** — large model weights to load | Stateful replicas (Ray actors) keep models in memory across requests |
| **Slow iteration speed** — Kubernetes YAML, container builds | Python-first API; develop locally, deploy distributed with the same code |

#### Key Ray Serve Features

- [Response streaming](https://docs.ray.io/en/latest/serve/tutorials/streaming.html)
- [Dynamic request batching](https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html)
- [Model multiplexing](https://docs.ray.io/en/latest/serve/model-multiplexing.html)
- [Fractional compute resource usage](https://docs.ray.io/en/latest/serve/configure-serve-deployment.html)

---

## 2. Build Your First Deployment

Let's migrate a standard PyTorch classifier to Ray Serve. We start with a familiar offline `MNISTClassifier` and turn it into an online service.

### 2.1 The Offline Classifier

Here is a standard PyTorch inference class that loads a TorchScript model and classifies images.

In [2]:
class OfflineMNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        return self.predict(batch)
    
    def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to("cuda")

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

Download the pre-trained model to shared storage:

In [3]:
!aws s3 cp s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt /mnt/cluster_storage/model.pt

download: s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt to ../../../mnt/cluster_storage/model.pt


### 2.2 Migrating to Ray Serve

To turn this into an online service, we make three changes:

1. Add the `@serve.deployment()` decorator — this turns the class into a **Deployment**, Ray Serve's fundamental unit that can be independently scaled and configured
2. Change `__call__` to accept a Starlette `Request` object
3. Parse the incoming JSON body

In [4]:
@serve.deployment()
class OnlineMNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

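    async def __call__(self, request: Request) -> dict[str, Any]:
        # The test client sends a JSON-encoded *string*, so request.json()
        # returns a str; json.loads() then decodes it into the batch dict.
        batch = json.loads(await request.json())
        return await self.predict(batch)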
    
    async def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to("cuda")

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch
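
The double decode in `__call__` (`json.loads` applied to the result of `request.json()`) is needed only because the test client in this tutorial serializes the payload with `json.dumps` and then passes that string to `requests.post(..., json=...)`, which serializes it a second time. The sketch below reproduces that round trip with the standard library alone; the `payload` values are made up for illustration.

```python
import json

# Hypothetical payload; the tutorial sends image arrays under the same key.
payload = {"image": [[0.0, 1.0], [1.0, 0.0]]}

# Client side: the tutorial builds a JSON string first ...
json_request = json.dumps(payload)
# ... and requests.post(json=json_request) JSON-encodes that string again.
wire_body = json.dumps(json_request)

# Server side: request.json() decodes the body once and yields a str.
decoded_once = json.loads(wire_body)
# The explicit json.loads() in __call__ recovers the original dict.
batch = json.loads(decoded_once)

print(type(decoded_once).__name__, batch == payload)
```

If the client passed the dict directly (`json={"image": images}`), a single `await request.json()` on the server would be enough.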

### 2.3 Deploy and Test

Use `.bind()` to pass constructor arguments and `serve.run()` to deploy. Setting `num_replicas=1` creates a single **Replica** — a Ray actor that holds your model in memory and processes requests.

`.options()` configures the deployment — replicas, resources, autoscaling, and more. See the [full list of deployment configuration options](https://docs.ray.io/en/latest/serve/configure-serve-deployment.html).

In [5]:
mnist_deployment = OnlineMNISTClassifier.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
)

mnist_app = mnist_deployment.bind(local_path="/mnt/cluster_storage/model.pt")

> **Note:** `.bind()` is a lazy call — it captures the constructor arguments without creating instances. Replicas are created when `serve.run()` is called.

`serve.run()` creates an **Application** — a group of deployments deployed together — and starts the Serve system:

In [6]:
mnist_handle = serve.run(mnist_app, name="mnist_classifier", blocking=False)

2026-02-18 02:49:14,197	INFO worker.py:1821 -- Connecting to existing Ray cluster at address: 10.0.140.139:6379...
2026-02-18 02:49:14,209	INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at https://session-phfra92v85r9zs48xih8i8wr56.i.anyscaleuserdata.com
2026-02-18 02:49:14,212	INFO packaging.py:463 -- Pushing file package 'gcs://_ray_pkg_9a5034543e25b79b6f1e2feb4fa1b1c85a4a0f51.zip' (0.07MiB) to Ray cluster...
2026-02-18 02:49:14,213	INFO packaging.py:476 -- Successfully pushed file package 'gcs://_ray_pkg_9a5034543e25b79b6f1e2feb4fa1b1c85a4a0f51.zip'.
(ProxyActor pid=4600) INFO 2026-02-18 02:49:20,336 proxy 10.0.140.139 -- Proxy starting on node 1349328c4c289b18250dbf2618fd3c610d08a4b682f4f4c62dc0157e (HTTP port: 8000).
INFO 2026-02-18 02:49:20,456 serve 4350 -- Started Serve in namespace "serve".
(ProxyActor pid=4600) INFO 2026-02-18 02:49:20,450 proxy 10.0.140.139 -- Got updated endpoints: {}.
(ServeController pid=4544

(autoscaler +15s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +15s) [autoscaler] [1xT4:8CPU-32GB] Attempting to add 1 node to the cluster (increasing from 0 to 1).
(autoscaler +15s) [autoscaler] [1xT4:8CPU-32GB|g4dn.2xlarge] [us-west-2c] [on-demand] Launched 1 instance.


INFO 2026-02-18 02:51:52,835 serve 4350 -- Application 'mnist_classifier' is ready at http://0.0.0.0:8000/.
(ProxyActor pid=2658, ip=10.0.190.36) INFO 2026-02-18 02:51:53,745 proxy 10.0.190.36 -- Proxy starting on node 9df8eee9130fc167b8bafaf41dcf9fe950124132c1f1501ff7a4a2f7 (HTTP port: 8000).
(ProxyActor pid=2658, ip=10.0.190.36) INFO 2026-02-18 02:51:53,866 proxy 10.0.190.36 -- Got updated endpoints: {Deployment(name='OnlineMNISTClassifier', app='mnist_classifier'): EndpointInfo(route='/', app_is_cross_language=False, route_patterns=None)}.


#### Under the hood

When `serve.run()` returns, Ray Serve has started three types of actors:

| Actor | Role |
|---|---|
| **Controller** | Global singleton. Manages the control plane, creates/destroys replicas, runs the autoscaler. |
| **Proxy** | Runs a Uvicorn HTTP server. Accepts incoming HTTP requests and forwards them to replicas. In the logs above, one proxy starts on the head node and another on the worker node that hosts the replica. |
| **Replica** | Executes your deployment code. Each replica is a Ray actor with its own request queue. |

<img src="https://docs.ray.io/en/latest/_images/architecture-2.0.svg" width="800">

These actors are self-healing: if a replica crashes, the Controller detects and replaces it; if the Proxy crashes, the Controller restarts it; if the Controller itself crashes, Ray restarts it. Application exceptions (bugs in your code) return HTTP 500 but don't take down the replica. For critical workloads, implement client-side retries with exponential backoff. See [End-to-End Fault Tolerance](https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html) for details.
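
The client-side retry advice above can be sketched as follows. This is a minimal illustration, not part of Ray Serve: the function name `call_with_backoff`, its parameters, and the choice to retry on any 5xx status are assumptions made for the example. The request itself is injected as a callable so the retry logic stays independent of the HTTP library.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-5xx status or attempts run out."""
    status = 0
    for attempt in range(max_attempts):
        status = send()
        # Assumption: treat any 5xx (for example 500 or 503) as retryable.
        if status < 500:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: the base delay doubles on each
            # attempt and is scaled by a random factor between 0.5 and 1.0.
            delay = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
    # All attempts failed; return the last status so the caller can decide.
    return status
```

With the client used in this tutorial, `send` could be something like `lambda: requests.post("http://localhost:8000/", json=json_request).status_code`. The jitter spreads out retries so that many clients recovering from the same outage do not hit the proxy at the same instant.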

#### Test via HTTP

When you send a request to `localhost:8000`, the **Proxy** receives it, the **Router** selects a replica, and the replica executes your `__call__` method:

In [7]:
images = np.random.rand(2, 1, 28, 28).tolist()
json_request = json.dumps({"image": images})
response = requests.post("http://localhost:8000/", json=json_request)
print("Predicted labels:", response.json()["predicted_label"])

Predicted labels: [1, 1]


(ProxyActor pid=2658, ip=10.0.190.36) INFO 2026-02-18 02:51:53,901 proxy 10.0.190.36 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7bb0ae51eed0>.


Predicted labels: [1, 6]


#### Test via DeploymentHandle

You can also call a deployment directly from Python through the `DeploymentHandle` returned by `serve.run()`, which bypasses the HTTP proxy:

In [8]:
batch = {"image": np.random.rand(10, 1, 28, 28)}
response = await mnist_handle.predict.remote(batch)
print("Predicted labels:", response["predicted_label"])

INFO 2026-02-18 02:37:30,355 serve 226002 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7cd508d6ce60>.


Predicted labels: [1 1 6 1 1 6 1 1 6 1]


(ServeReplica:mnist_classifier:OnlineMNISTClassifier pid=2572, ip=10.0.144.150) INFO 2026-02-18 02:37:30,294 mnist_classifier_OnlineMNISTClassifier inxgt6c9 648694cd-5c5a-4c47-b7ff-9f2e6f9b7b91 -- POST / 200 372.8ms


(ServeReplica:mnist_classifier:OnlineMNISTClassifier pid=2587, ip=10.0.190.36) INFO 2026-02-18 02:52:52,370 mnist_classifier_OnlineMNISTClassifier 7swb4ss8 951f3120-8266-4003-b2a1-890fd269f55c -- POST / 200 384.8ms


INFO 2026-02-18 02:53:51,980 serve 4350 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7b50c5602e40>.


Predicted labels: [6 6 1 1 6 6 6 1 6 6]


In [9]:
serve.shutdown()

(ServeReplica:mnist_classifier:OnlineMNISTClassifier pid=2572, ip=10.0.144.150) INFO 2026-02-18 02:37:30,442 mnist_classifier_OnlineMNISTClassifier inxgt6c9 ad8887c4-d53a-4f3e-9ea1-61cae84dd08a -- CALL predict OK 68.5ms
(ServeController pid=226139) INFO 2026-02-18 02:37:30,611 controller 226139 -- Removing 1 replica from Deployment(name='OnlineMNISTClassifier', app='mnist_classifier').
(ServeController pid=226139) INFO 2026-02-18 02:37:32,644 controller 226139 -- Replica(id='inxgt6c9', deployment='OnlineMNISTClassifier', app='mnist_classifier') is stopped.


(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.
(raylet) Task ServeController.listen_for_change failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


(ServeReplica:mnist_classifier:OnlineMNISTClassifier pid=2587, ip=10.0.190.36) INFO 2026-02-18 02:53:52,067 mnist_classifier_OnlineMNISTClassifier 7swb4ss8 424d7b88-4d7b-439d-a10e-286afab392d7 -- CALL predict OK 67.4ms


(ServeController pid=4544) INFO 2026-02-18 02:54:52,042 controller 4544 -- Removing 1 replica from Deployment(name='OnlineMNISTClassifier', app='mnist_classifier').


(raylet) Task ServeController.listen_for_change failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.
(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


(ServeController pid=4544) INFO 2026-02-18 02:54:54,053 controller 4544 -- Replica(id='7swb4ss8', deployment='OnlineMNISTClassifier', app='mnist_classifier') is stopped.


---

## 3. Integrating with FastAPI

Ray Serve integrates with FastAPI to provide HTTP routing, Pydantic validation, and auto-generated OpenAPI docs. Use `@serve.ingress(fastapi_app)` to designate a FastAPI app as the HTTP entrypoint.

Here we wrap the same model logic as `OnlineMNISTClassifier` in a FastAPI-powered deployment to demonstrate the integration:

In [10]:
fastapi_app = FastAPI()

@serve.deployment
@serve.ingress(fastapi_app)
class MNISTFastAPIService:
    """Same model logic as OnlineMNISTClassifier, but using FastAPI for HTTP routing."""
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path)
        self.model.to("cuda")
        self.model.eval()

    @fastapi_app.post("/predict")
    async def predict(self, request: Request):
        batch = json.loads(await request.json())
        images = torch.tensor(batch["image"]).float().to("cuda")
        with torch.no_grad():
            logits = self.model(images).cpu().numpy()
        return {"predicted_label": np.argmax(logits, axis=1).tolist()}

In [11]:
app = MNISTFastAPIService.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
).bind(local_path="/mnt/cluster_storage/model.pt")
serve.run(app, name="mnist_fastapi", blocking=False)

INFO 2026-02-18 02:56:03,951 serve 4350 -- Started Serve in namespace "serve".
(ProxyActor pid=6955) INFO 2026-02-18 02:56:03,883 proxy 10.0.140.139 -- Proxy starting on node 1349328c4c289b18250dbf2618fd3c610d08a4b682f4f4c62dc0157e (HTTP port: 8000).
(ProxyActor pid=6955) INFO 2026-02-18 02:56:03,947 proxy 10.0.140.139 -- Got updated endpoints: {}.
(ServeController pid=6893) INFO 2026-02-18 02:56:04,052 controller 6893 -- Deploying new version of Deployment(name='MNISTFastAPIService', app='mnist_fastapi') (initial target replicas: 1).
(ProxyActor pid=6955) INFO 2026-02-18 02:56:04,055 proxy 10.0.140.139 -- Got updated endpoints: {Deployment(name='MNISTFastAPIService', app='mnist_fastapi'): EndpointInfo(route='/', app_is_cross_language=False, route_patterns=None)}.
(ServeController pid=6893) INFO 2026-02-18 02:56:04,156 controller 6893 -- Adding 1 replica to Deployment(name='MNISTFastAPIService', app='mnist_fastapi').
(ProxyActor pid=695

DeploymentHandle(deployment='MNISTFastAPIService')

In [12]:
images = np.random.rand(2, 1, 28, 28).tolist()
response = requests.post("http://localhost:8000/predict", json=json.dumps({"image": images}))
print("Predicted labels:", response.json()["predicted_label"])

(ProxyActor pid=3192, ip=10.0.190.36) INFO 2026-02-18 02:56:09,144 proxy 10.0.190.36 -- Got updated endpoints: {Deployment(name='MNISTFastAPIService', app='mnist_fastapi'): EndpointInfo(route='/', app_is_cross_language=False, route_patterns=[RoutePattern(methods=['GET', 'HEAD'], path='/docs'), RoutePattern(methods=['GET', 'HEAD'], path='/docs/oauth2-redirect'), RoutePattern(methods=['GET', 'HEAD'], path='/openapi.json'), RoutePattern(methods=['POST'], path='/predict'), RoutePattern(methods=['GET', 'HEAD'], path='/redoc')])}.
(ProxyActor pid=3192, ip=10.0.190.36) INFO 2026-02-18 02:56:09,181 proxy 10.0.190.36 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7b5bcc37c680>.


Predicted labels: [1, 1]


Visit `http://localhost:8000/docs` for the auto-generated interactive API documentation.

For more details on HTTP handling in Ray Serve, see the [HTTP Guide](https://docs.ray.io/en/latest/serve/http-guide.html).
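
The `predict` endpoint above reads a raw Starlette `Request`, so FastAPI does not validate the body. One way to get the Pydantic validation mentioned at the start of this section is to declare request and response models. The names `PredictRequest` and `PredictResponse` below are hypothetical and not part of the original service; the sketch only shows the validation behaviour, outside of any deployment.

```python
from pydantic import BaseModel, ValidationError


class PredictRequest(BaseModel):
    # Batch of images shaped (N, 1, 28, 28), sent as nested lists of floats.
    image: list[list[list[list[float]]]]


class PredictResponse(BaseModel):
    predicted_label: list[int]


# A well-formed body parses into a typed object.
ok = PredictRequest(image=[[[[0.0, 0.5], [1.0, 0.25]]]])

# A malformed body is rejected before any model code runs.
try:
    PredictRequest(image="not-an-image")
    rejected = False
except ValidationError:
    rejected = True

print(len(ok.image), rejected)
```

In a FastAPI route, annotating the handler parameter as `body: PredictRequest` makes FastAPI parse and validate the JSON body automatically and respond with HTTP 422 on invalid input. This expects the client to send the dict itself (`json={"image": images}`) rather than a pre-serialized string.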

In [13]:
serve.shutdown()

(ServeReplica:mnist_fastapi:MNISTFastAPIService pid=3123, ip=10.0.190.36) INFO 2026-02-18 02:56:09,594 mnist_fastapi_MNISTFastAPIService 4r6mu94r 67b52ae0-4f89-4c07-994e-c1e0f590ab28 -- POST /predict 200 359.2ms
(ServeController pid=6893) INFO 2026-02-18 02:56:09,733 controller 6893 -- Removing 1 replica from Deployment(name='MNISTFastAPIService', app='mnist_fastapi').
(ServeController pid=6893) INFO 2026-02-18 02:56:11,749 controller 6893 -- Replica(id='4r6mu94r', deployment='MNISTFastAPIService', app='mnist_fastapi') is stopped.


(raylet) Task ServeController.listen_for_change failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.
(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


Now that we have a working single-deployment service, let's see how to compose multiple deployments into a pipeline.

---

## 4. Composing Deployments

Ray Serve lets you compose multiple deployments into a single application. This is useful when you need:
- **Independent scaling** — each component scales separately
- **Hardware disaggregation** — CPU preprocessing + GPU inference
- **Reusable components** — share a preprocessor across models

### 4.1 Define a Preprocessor

In [14]:
@serve.deployment
class OnlineMNISTPreprocessor:
    def __init__(self):
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (0.5,))
        ])
        
    async def run(self, batch: dict[str, Any]) -> dict[str, Any]:
        images = batch["image"]
        images = [self.transform(np.array(image, dtype=np.uint8)).cpu().numpy() for image in images]
        return {"image": images}

### 4.2 Build a Composed Application

Wire the preprocessor and classifier together via an ingress deployment:

In [15]:
@serve.deployment
class ImageServiceIngress:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request: Request):
        batch = json.loads(await request.json())
        response = await self.preprocessor.run.remote(batch)
        return await self.model.predict.remote(response)

In [16]:
image_classifier_app = ImageServiceIngress.bind(
    preprocessor=OnlineMNISTPreprocessor.bind(),
    model=OnlineMNISTClassifier.options(
        num_replicas=1,
        ray_actor_options={"num_gpus": 0.1},
    ).bind(local_path="/mnt/cluster_storage/model.pt"),
)

handle = serve.run(image_classifier_app, name="image_classifier", blocking=False)

INFO 2026-02-18 02:56:16,051 serve 4350 -- Started Serve in namespace "serve".
(ProxyActor pid=7148) INFO 2026-02-18 02:56:15,980 proxy 10.0.140.139 -- Proxy starting on node 1349328c4c289b18250dbf2618fd3c610d08a4b682f4f4c62dc0157e (HTTP port: 8000).
(ProxyActor pid=7148) INFO 2026-02-18 02:56:16,047 proxy 10.0.140.139 -- Got updated endpoints: {}.
(ServeController pid=7083) INFO 2026-02-18 02:56:16,144 controller 7083 -- Deploying new version of Deployment(name='OnlineMNISTPreprocessor', app='image_classifier') (initial target replicas: 1).
(ServeController pid=7083) INFO 2026-02-18 02:56:16,145 controller 7083 -- Deploying new version of Deployment(name='OnlineMNISTClassifier', app='image_classifier') (initial target replicas: 1).
(ServeController pid=7083) INFO 2026-02-18 02:56:16,146 controller 7083 -- Deploying new version of Deployment(name='ImageServiceIngress', app='image_classifier') (initial target replicas: 1).
(ProxyActor pi

### 4.3 Test the Composed App

In [17]:
ds = ray.data.read_images("s3://anyscale-public-materials/ray-ai-libraries/mnist/50_per_index/", include_paths=True)
image_batch = ds.take_batch(10)

2026-02-18 02:56:24,235	INFO logging.py:397 -- Registered dataset logger for dataset dataset_1_0
2026-02-18 02:56:24,304	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_1_0. Full logs are in /tmp/ray/session_2026-02-18_02-46-49_468518_2351/logs/ray-data
2026-02-18 02:56:24,304	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_1_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> LimitOperator[limit=10] -> TaskPoolMapOperator[ReadFiles]
2026-02-18 02:56:24,306	INFO streaming_executor.py:687 -- [dataset]: A new progress UI is available. To enable, set `ray.data.DataContext.get_current().enable_rich_progress_bars = True` and `ray.data.DataContext.get_current().use_ray_tqdm = False`.
2026-02-18 02:56:24,306	INFO progress_bar.py:155 -- Progress bar disabled because stdout is a non-interactive terminal.
2026-02-18 02:56:25,392	INFO progress_bar.py:213 -- === Ray Data Progress {ListFiles} ===
2026-02-18 02:56:25,393	INFO progress_bar.py:215

In [18]:
json_request = json.dumps({"image": image_batch["image"].tolist()})
response = requests.post("http://localhost:8000/", json=json_request)
print("Predicted labels:", response.json()["predicted_label"])

(ServeReplica:image_classifier:ImageServiceIngress pid=3272, ip=10.0.190.36) INFO 2026-02-18 02:56:30,207 image_classifier_ImageServiceIngress 6a9x1pnn 4b9bcafb-8e14-4b64-8d6e-754d2f75903b -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7b0bedd185f0>.
(ServeReplica:image_classifier:OnlineMNISTPreprocessor pid=3270, ip=10.0.190.36) INFO 2026-02-18 02:56:30,227 image_classifier_OnlineMNISTPreprocessor tfgu42hj 4b9bcafb-8e14-4b64-8d6e-754d2f75903b -- CALL run OK 4.1ms


Predicted labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [19]:
serve.shutdown()

(ServeReplica:image_classifier:ImageServiceIngress pid=3272, ip=10.0.190.36) INFO 2026-02-18 02:56:30,606 image_classifier_ImageServiceIngress 6a9x1pnn 4b9bcafb-8e14-4b64-8d6e-754d2f75903b -- POST / 200 415.8ms
(ServeReplica:image_classifier:OnlineMNISTClassifier pid=3271, ip=10.0.190.36) INFO 2026-02-18 02:56:30,601 image_classifier_OnlineMNISTClassifier wf7hdn2g 4b9bcafb-8e14-4b64-8d6e-754d2f75903b -- CALL predict OK 358.8ms
(ServeController pid=7083) INFO 2026-02-18 02:56:30,727 controller 7083 -- Removing 1 replica from Deployment(name='OnlineMNISTPreprocessor', app='image_classifier').
(ServeController pid=7083) INFO 2026-02-18 02:56:30,727 controller 7083 -- Removing 1 replica from Deployment(name='OnlineMNISTClassifier', app='image_classifier').
(ServeController pid=7083) INFO 2026-02-18 02:56:30,727 controller 7083 -- Removing 1 replica from Deployment(name='ImageServiceIngress', app='image_classifier').
(ServeController pid=708

(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


With the composition pattern in hand, let's explore how to fine-tune resource allocation for each deployment.

---

## 5. Resource Specification and Fractional GPUs

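Each deployment can declare the resources that every replica needs through `ray_actor_options`. For small models like our MNIST classifier, you can request **fractional GPUs** to pack multiple replicas onto a single GPU. The fraction is used for scheduling only: Ray does not enforce a GPU memory limit, so make sure the replicas that share a device fit in its memory together: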

In [20]:
mnist_app = OnlineMNISTClassifier.options(
    num_replicas=4,
    ray_actor_options={"num_gpus": 0.1},  # 10% of a GPU per replica → up to 10 replicas per GPU
).bind(local_path="/mnt/cluster_storage/model.pt")

mnist_handle = serve.run(mnist_app, name="mnist_classifier", blocking=False)

(ProxyActor pid=7559) INFO 2026-02-18 02:56:36,835 proxy 10.0.140.139 -- Proxy starting on node 1349328c4c289b18250dbf2618fd3c610d08a4b682f4f4c62dc0157e (HTTP port: 8000).
INFO 2026-02-18 02:56:36,898 serve 4350 -- Started Serve in namespace "serve".
(ProxyActor pid=7559) INFO 2026-02-18 02:56:36,895 proxy 10.0.140.139 -- Got updated endpoints: {}.
(ServeController pid=7494) INFO 2026-02-18 02:56:36,988 controller 7494 -- Deploying new version of Deployment(name='OnlineMNISTClassifier', app='mnist_classifier') (initial target replicas: 4).
(ProxyActor pid=7559) INFO 2026-02-18 02:56:36,991 proxy 10.0.140.139 -- Got updated endpoints: {Deployment(name='OnlineMNISTClassifier', app='mnist_classifier'): EndpointInfo(route='/', app_is_cross_language=False, route_patterns=None)}.
(ProxyActor pid=7559) INFO 2026-02-18 02:56:37,000 proxy 10.0.140.139 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7287501f02f0>.
(Se

#### Request routing

With multiple replicas, Serve uses the **Power of Two Choices** algorithm by default: randomly sample 2 replicas, pick the one with the shorter queue. You can also implement [custom routing logic](https://docs.ray.io/en/latest/serve/advanced-guides/custom-request-router.html) by subclassing `RequestRouter`.
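
To make the Power of Two Choices idea concrete, here is a minimal standalone sketch. It is not Serve's router: the `pick_replica` function, the replica names, and the queue lengths are hypothetical, and the sketch assumes at least one replica exists.

```python
import random


def pick_replica(queue_lengths: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct replicas at random and return the less loaded one."""
    # Sample up to two distinct replicas uniformly at random.
    candidates = rng.sample(sorted(queue_lengths), k=min(2, len(queue_lengths)))
    # Route to whichever sampled replica has the shorter queue.
    return min(candidates, key=lambda name: queue_lengths[name])


rng = random.Random(0)
queues = {"replica-a": 5, "replica-b": 0, "replica-c": 3}
picks = [pick_replica(queues, rng) for _ in range(1000)]

# The busiest replica is never chosen: any pair that contains it
# also contains a replica with a shorter queue.
print("replica-a" in picks)  # False
```

Comparing only two candidates keeps each routing decision cheap while still steering traffic away from the most loaded replicas.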

Test the fractional GPU deployment:

In [21]:
images = np.random.rand(2, 1, 28, 28).tolist()
response = requests.post("http://localhost:8000/", json=json.dumps({"image": images}))
print("Predicted labels:", response.json()["predicted_label"])

Predicted labels: [1, 1]


In [22]:
serve.shutdown()

(ServeReplica:mnist_classifier:OnlineMNISTClassifier pid=3699, ip=10.0.190.36) INFO 2026-02-18 02:56:42,439 mnist_classifier_OnlineMNISTClassifier ar6t5cad 6c229dbb-04d6-4858-be65-2c1c3758ba8a -- POST / 200 357.2ms
(ServeController pid=7494) INFO 2026-02-18 02:56:42,609 controller 7494 -- Removing 4 replicas from Deployment(name='OnlineMNISTClassifier', app='mnist_classifier').
(ServeController pid=7494) INFO 2026-02-18 02:56:44,642 controller 7494 -- Replica(id='ar6t5cad', deployment='OnlineMNISTClassifier', app='mnist_classifier') is stopped.
(ServeController pid=7494) INFO 2026-02-18 02:56:44,642 controller 7494 -- Replica(id='3ykvd446', deployment='OnlineMNISTClassifier', app='mnist_classifier') is stopped.
(ServeController pid=7494) INFO 2026-02-18 02:56:44,643 controller 7494 -- Replica(id='7d3n141m', deployment='OnlineMNISTClassifier', app='mnist_classifier') is stopped.
(ServeController pid=7494) INFO 2026-02-18 02:56:44,644

(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


Next, let's see how Ray Serve can automatically scale replicas up and down based on traffic.

---

## 6. Autoscaling

Ray Serve automatically adjusts the number of replicas based on traffic. The key settings are:

- **`target_ongoing_requests`**: the desired average number of in-flight requests per replica. The autoscaler adds replicas when the observed average exceeds this target.
- **`max_ongoing_requests`**: the upper limit on concurrent requests per replica. Set it 20-50% higher than `target_ongoing_requests`.
- **`max_queued_requests`**: where `max_ongoing_requests` limits concurrency per replica, this setting limits how many requests can wait in the caller's queue. Once that queue is full, new requests immediately receive HTTP 503.
- **`upscale_delay_s`** / **`downscale_delay_s`**: how long to wait before adding or removing replicas.
- **`look_back_period_s`**: the time window for averaging ongoing requests when making scaling decisions.

### Autoscaling in action

With `initial_replicas=0` and `min_replicas=0`, no GPU resources are allocated until a request arrives:

In [23]:
mnist_app = OnlineMNISTClassifier.options(
    ray_actor_options={"num_cpus": 0.5, "num_gpus": 0.1},
    autoscaling_config={
        "target_ongoing_requests": 10,
        "initial_replicas": 0,
        "min_replicas": 0,
        "max_replicas": 8,
        "upscale_delay_s": 5,
        "downscale_delay_s": 60,
        "look_back_period_s": 5,
    },
).bind(local_path="/mnt/cluster_storage/model.pt")

mnist_handle = serve.run(mnist_app, name="mnist_classifier", blocking=False)

(ProxyActor pid=7758) INFO 2026-02-18 02:56:48,824 proxy 10.0.140.139 -- Proxy starting on node 1349328c4c289b18250dbf2618fd3c610d08a4b682f4f4c62dc0157e (HTTP port: 8000).
INFO 2026-02-18 02:56:48,890 serve 4350 -- Started Serve in namespace "serve".
(ProxyActor pid=7758) INFO 2026-02-18 02:56:48,886 proxy 10.0.140.139 -- Got updated endpoints: {}.
(ServeController pid=7695) INFO 2026-02-18 02:56:48,986 controller 7695 -- Registering autoscaling state for deployment Deployment(name='OnlineMNISTClassifier', app='mnist_classifier')
(ServeController pid=7695) INFO 2026-02-18 02:56:48,987 controller 7695 -- Deploying new version of Deployment(name='OnlineMNISTClassifier', app='mnist_classifier') (initial target replicas: 0).
(ProxyActor pid=7758) INFO 2026-02-18 02:56:48,990 proxy 10.0.140.139 -- Got updated endpoints: {Deployment(name='OnlineMNISTClassifier', app='mnist_classifier'): EndpointInfo(route='/', app_is_cross_language=False, route_pa

Send requests to trigger scale-up:

In [24]:
batch = {"image": np.random.rand(10, 1, 28, 28)}
[mnist_handle.predict.remote(batch) for _ in range(200)]

INFO 2026-02-18 02:56:50,059 serve 4350 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7b50801b9880>.


[<ray.serve.handle.DeploymentResponse at 0x7b50801bb0e0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50802e22d0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801bb4d0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801bb890>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801bbb00>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801bbce0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801bbec0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dc0e0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dc2c0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dc4a0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dc680>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dc860>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dca40>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dccb0>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dce90>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dd070>,
 <ray.serve.handle.DeploymentResponse at 0x7b50801dd250>,
 <ray.serve.ha

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-auto-scaling.png" width="800">

In [25]:
serve.shutdown()

(ServeController pid=7695) INFO 2026-02-18 02:56:50,117 controller 7695 -- Upscaling Deployment(name='OnlineMNISTClassifier', app='mnist_classifier') from 0 to 1 replicas. Current ongoing requests: 13.00, current running replicas: 0.


(ServeController pid=7695) INFO 2026-02-18 02:56:50,228 controller 7695 -- Deregistering autoscaling state for deployment Deployment(name='OnlineMNISTClassifier', app='mnist_classifier')
(ServeController pid=7695) INFO 2026-02-18 02:56:50,229 controller 7695 -- Deregistering autoscaling state for application mnist_classifier


(raylet) Task ServeController.listen_for_change failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.
(raylet) Task ServeController.graceful_shutdown failed. There are infinite retries remaining, so the task will be retried. Error: The actor is dead because it was killed by `ray.kill`.


For advanced use cases, Ray Serve supports [custom autoscaling policies](https://docs.ray.io/en/latest/serve/advanced-guides/advanced-autoscaling.html#custom-autoscaling-policies) that go beyond queue-depth — such as pre-scaling based on time of day, scaling on CPU/memory utilization, or targeting a P90 latency SLA.

---

## 7. Observability

### Metrics

Ray Serve exposes metrics at multiple granularity levels through the Serve dashboard and Grafana:

- **Throughput metrics** — QPS and error QPS, available per application, per deployment, and per replica

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-throughput-metrics.png" width="800">

- **Latency metrics** — P50, P90, P99 latencies at the same granularity levels

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-latency-metrics.png" width="800">

- **Deployment metrics** — replica count and queue size per deployment

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-replica-metrics.png" width="400">

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-queuesize-metrics.png" width="400">

Access these through the Ray Dashboard by navigating to **Ray Dashboard > Serve > VIEW IN GRAFANA**.
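
To make the percentile labels concrete, here is a minimal sketch that computes P50, P90, and P99 from a handful of latency samples using the nearest-rank definition. The sample values are made up, and the `percentile` helper is illustrative only. Dashboards typically estimate percentiles from histogram buckets rather than from raw samples, so their numbers can differ slightly.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 18, 16, 250, 17, 19]
summary = {f"P{p}": percentile(latencies_ms, p) for p in (50, 90, 99)}
```

In this sample, the single slow request leaves P50 and P90 untouched but dominates P99, which is why tail percentiles are the usual choice for latency SLAs.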

### Custom metrics

Define custom metrics using `ray.serve.metrics`:

```python
@serve.deployment(num_replicas=2)
class InstrumentedService:
    def __init__(self):
        self.request_counter = metrics.Counter(
            "my_request_counter",
            description="Total requests processed.",
            tag_keys=("model",),
        )
        self.request_counter.set_default_tags({"model": "mnist"})

    async def __call__(self, request: Request):
        self.request_counter.inc()
        return "ok"
```

To create custom dashboards for monitoring your custom metrics, see [Custom dashboards and alerting](https://docs.anyscale.com/monitoring/custom-dashboards-and-alerting).

Here is what the custom metric looks like in the Anyscale dashboard.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/serve-custom-request-counter.png" width="400">

### Tracing

For end-to-end request tracing across composed deployments, use the Anyscale Tracing integration. A single request's trace displays the hierarchical structure of how it flows through your deployment graph:

```text
1. proxy_http_request (Root) - Duration: 245ms
   └── 2. proxy_route_to_replica (APIGateway) - Duration: 240ms
       └── 3. replica_handle_request (APIGateway) - Duration: 235ms
           └── 4. proxy_route_to_replica (UserService) - Duration: 180ms
               └── 5. replica_handle_request (UserService) - Duration: 175ms
```

For details, see the [Anyscale Tracing guide](https://docs.anyscale.com/monitoring/tracing/).

### Alerts

Ray exports its metrics in Prometheus format, and Grafana can both chart them and alert on them. [Grafana alerting](https://grafana.com/docs/grafana/latest/alerting/) lets you define alert rules on Prometheus metrics, for example firing when P90 latency exceeds your SLA or when error QPS spikes. Grafana supports multiple notification channels, including Slack and PagerDuty.

For a comprehensive overview of monitoring and debugging on Anyscale, see the [Anyscale monitoring guide](https://docs.anyscale.com/monitoring) and [custom dashboards and alerting](https://docs.anyscale.com/monitoring/custom-dashboards-and-alerting).

---

# Part 2: Advanced Topics

The following sections cover additional serving patterns, operational features, and production concerns. They don't require running code in sequence and can be read as reference material.

---

## 8. Dynamic Request Batching

When your model processes a batch of inputs more efficiently than one input at a time (as in GPU inference), batching improves throughput. Ray Serve provides the `@serve.batch` decorator:

```python
@serve.deployment
class BatchMNISTClassifier:
    def __init__(self, local_path: str):
        self.model = torch.jit.load(local_path).to("cuda").eval()

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def __call__(self, images_list: list[np.ndarray]) -> list[dict]:
        # images_list is a list of individual request payloads, automatically batched
        stacked = torch.tensor(np.stack(images_list)).float().to("cuda")
        with torch.no_grad():
            logits = self.model(stacked).cpu().numpy()
        predictions = np.argmax(logits, axis=1)
        return [{"predicted_label": int(p)} for p in predictions]
```

Under the hood:
- Requests are buffered in a queue
- Once `max_batch_size` requests arrive (or `batch_wait_timeout_s` elapses), the batch is sent to your method
- Responses are split and returned individually

This is most effective for **vectorized operations on CPUs** and **parallelizable operations on GPUs**.
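
The buffering steps listed under "Under the hood" can be sketched with plain `asyncio`. This is a toy model of the behavior, not Ray Serve's implementation: `MiniBatcher`, `double_all`, and the chosen sizes are hypothetical names and values used only for illustration.

```python
import asyncio


class MiniBatcher:
    """Toy stand-in for the buffering that @serve.batch performs."""

    def __init__(self, batch_fn, max_batch_size, batch_wait_timeout_s):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.queue = asyncio.Queue()
        self.worker = None

    async def submit(self, item):
        # Each caller enqueues its input with a future and awaits its own result.
        if self.worker is None:
            self.worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.batch_wait_timeout_s
            # Keep filling until the batch is full or the wait timeout elapses.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = await self.batch_fn([item for item, _ in batch])
            # Split the batched output and hand each caller its own result.
            for (_, future), result in zip(batch, results):
                future.set_result(result)


batch_sizes = []


async def double_all(batch):
    batch_sizes.append(len(batch))  # record how many requests were grouped
    return [x * 2 for x in batch]


async def main():
    batcher = MiniBatcher(double_all, max_batch_size=4, batch_wait_timeout_s=0.05)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    batcher.worker.cancel()
    return results


results = asyncio.run(main())
```

Ten concurrent callers each get back only their own result, while `double_all` is invoked with groups of at most four inputs. The last, partially filled group is released by the timeout rather than by reaching `max_batch_size`.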

---

## 9. Model Multiplexing

When serving many models with the same shape but different weights (such as per-customer fine-tuned models), model multiplexing lets a shared pool of replicas efficiently serve all of them. The router inspects the `serve_multiplexed_model_id` request header and routes each request to a replica that already has that model loaded, avoiding redundant loading. Each replica caches up to `max_num_models_per_replica` models and evicts the least recently used one when full.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/intro-ai-libraries/model_multiplexing_architecture.png" width="800">

For the full API walkthrough — including code examples, client headers, and `DeploymentHandle` options — see the [Model Multiplexing docs](https://docs.ray.io/en/latest/serve/model-multiplexing.html).
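
The per-replica caching behavior can be pictured as a small LRU cache. The sketch below is plain Python rather than the Ray Serve API: `ModelCache` and `fake_load` are hypothetical, and only the `max_num_models_per_replica` name is borrowed from the description above.

```python
from collections import OrderedDict
from typing import Callable


class ModelCache:
    """Toy LRU cache mimicking how a replica keeps a bounded set of loaded models."""

    def __init__(self, load_fn: Callable[[str], object], max_num_models_per_replica: int):
        self.load_fn = load_fn
        self.capacity = max_num_models_per_replica
        self.models = OrderedDict()  # ordered from least to most recently used

    def get(self, model_id: str):
        if model_id in self.models:
            self.models.move_to_end(model_id)  # mark as most recently used
            return self.models[model_id]
        if len(self.models) >= self.capacity:
            self.models.popitem(last=False)  # evict the least recently used model
        self.models[model_id] = self.load_fn(model_id)
        return self.models[model_id]


loads = []


def fake_load(model_id: str) -> str:
    loads.append(model_id)  # record every (expensive) load
    return f"weights-for-{model_id}"


cache = ModelCache(fake_load, max_num_models_per_replica=2)
for model_id in ["a", "b", "a", "c", "b"]:
    cache.get(model_id)
```

The second request for `"a"` is a cache hit, so only four loads happen for five requests. Loading `"c"` evicts `"b"`, which is why `"b"` has to be loaded again. This is the redundant loading that model-aware routing tries to avoid.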

---

## 10. Asynchronous Inference

Synchronous APIs block until processing completes, which is problematic for long-running tasks such as video processing or document analysis. Asynchronous inference decouples request submission from result retrieval — clients submit a task, receive a task ID immediately, and poll for the result later.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/async_inference_architecture.png" width="900">

The architecture consists of an HTTP ingress that enqueues tasks into a broker (such as Redis or RabbitMQ), a `@task_consumer` deployment that pulls and processes tasks, and a backend that stores results and status. This provides natural backpressure, built-in retries, and dead letter queues for failed tasks.

For the full walkthrough — including configuration, code examples, and monitoring — see the [Asynchronous Inference docs](https://docs.ray.io/en/latest/serve/asynchronous-inference.html).
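
The submit-then-poll flow can be sketched without any broker. The example below is a single-process toy, not the Ray Serve API: `InMemoryTaskService`, its method names, and the status strings are hypothetical, with an in-memory queue standing in for the broker and a dict standing in for the result backend.

```python
import queue
import uuid


class InMemoryTaskService:
    """Toy stand-in for the ingress, broker, consumer, and result backend."""

    def __init__(self):
        self.broker = queue.Queue()  # stands in for Redis or RabbitMQ
        self.results = {}  # stands in for the result backend

    def submit(self, payload) -> str:
        # Enqueue the task and return an ID immediately; no work happens here.
        task_id = str(uuid.uuid4())
        self.results[task_id] = {"status": "PENDING", "result": None}
        self.broker.put((task_id, payload))
        return task_id

    def poll(self, task_id: str) -> dict:
        return self.results[task_id]

    def consume_one(self, handler) -> None:
        # A consumer pulls one task, processes it, and records the outcome.
        task_id, payload = self.broker.get_nowait()
        try:
            self.results[task_id] = {"status": "SUCCESS", "result": handler(payload)}
        except Exception as exc:
            self.results[task_id] = {"status": "FAILURE", "result": repr(exc)}


service = InMemoryTaskService()
task_id = service.submit("report.pdf")
status_before = service.poll(task_id)["status"]  # the client polls before any work is done
service.consume_one(lambda name: f"processed {name}")
status_after = service.poll(task_id)
```

The client never waits on the handler: it holds only a task ID, and the queue absorbs bursts until a consumer is free. Retries and dead letter queues would hook into the failure branch of `consume_one`.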

---

## Summary and Next Steps

In this template, you learned how to:

- **Build** a Ray Serve deployment from a standard PyTorch model
- **Integrate** with FastAPI for HTTP routing and validation
- **Compose** multiple deployments into a pipeline
- **Configure** autoscaling, fractional GPUs, and resource allocation
- **Monitor** with built-in metrics, custom metrics, tracing, and alerts
- **Understand** batching, model multiplexing, and async inference patterns

### Next Steps

1. [Ray Serve documentation](https://docs.ray.io/en/latest/serve/index.html) — full API reference
2. [Production guide](https://docs.ray.io/en/latest/serve/production-guide/index.html) — deploying and managing Serve in production
3. [Anyscale monitoring guide](https://docs.anyscale.com/monitoring) — dashboards, alerts, and debugging
4. [Configure Serve deployments](https://docs.ray.io/en/latest/serve/configure-serve-deployment.html) — full configuration options