# Introduction to Ray Serve with PyTorch



💻 **Launch Locally**: You can run this notebook locally, but performance will be reduced.

🚀 **Launch on Cloud**: A Ray cluster is recommended for running this notebook. Click [here](http://console.anyscale.com/register) to start one on Anyscale.

This notebook introduces Ray Serve, a framework for building and deploying scalable ML applications, using a PyTorch model as the running example.

<div class="alert alert-block alert-info">
    
<b>Here is the roadmap for this notebook:</b>

<ul>
    <li><b>1.</b> When to consider Ray Serve</li>
    <li><b>2.</b> Overview of Ray Serve</li>
    <li><b>3.</b> Implement an image classification service</li>
    <li><b>4.</b> Development workflow with Ray Serve </li>
</ul>
</div>

**Imports**

In [None]:
import subprocess
from typing import Any

import json
import numpy as np
import requests
import torch
from ray import serve
from starlette.requests import Request

## 1. When to Consider Ray Serve

Consider using Ray Serve for your project if it meets one or more of the following criteria:

| **Challenge** | **Details** | **Ray Serve Solution** |
|---------------|------------------|--------------------------|
| **Slow iteration speed for ML engineers** | - Developers need to containerize and rollout components on Kubernetes to test changes<br>- Developers need to use complex protocols (e.g. gRPC) to achieve acceptable performance | - Provides a Python-first API to develop lightweight services<br>- Services are lightweight [Ray actors](https://docs.ray.io/en/latest/ray-core/actors.html)<br>- Ray Serve can be run locally for development |
| **Need to efficiently compose multiple components** | - Requires efficient data sharing between components<br>- Implementing performant streaming protocols (e.g. gRPC) is a complex task | - Relies on [Ray's object store](https://docs.ray.io/en/latest/ray-core/objects.html) to share data optimally<br>- Avoids the need to implement gRPC streaming |
| **Poor utilization of expensive hardware** | Suffering from poor utilization due to naive request handling | - Offers [dynamic batching of requests](https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html) to improve hardware utilization<br>- Leverages Ray Core's support for accelerators and custom resources:<br>&nbsp;&nbsp;&nbsp;&nbsp;• [Multi-node/multi-GPU serving](https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html)<br>&nbsp;&nbsp;&nbsp;&nbsp;• [Fractional compute resource usage](https://docs.ray.io/en/latest/serve/configure-serve-deployment.html)<br>- RayTurbo Serve offers [replica compaction](https://www.anyscale.com/blog/new-feature-replica-compaction) |
| **High-latency outliers when juggling many models** | Stuck with naive load balancing and expensive state loading (e.g. ML models) | - Provides [model multiplexing](https://docs.ray.io/en/latest/serve/model-multiplexing.html) to avoid unnecessary load times<br>- Routes to replicas that already have a model loaded |
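
To make the "dynamic batching" row more concrete, here is a toy sketch of the idea in plain `asyncio`. It is not Ray Serve's implementation, and `MicroBatcher`, `double_all`, and the size and timeout values are hypothetical names and numbers chosen for illustration. Concurrent callers each submit one item; the batcher groups them and invokes the batch function once per group, either when the batch is full or when a short wait expires.

```python
import asyncio


class MicroBatcher:
    """Toy illustration of dynamic batching (not Ray Serve's implementation)."""

    def __init__(self, batch_fn, max_batch_size=4, wait_s=0.01):
        self.batch_fn = batch_fn            # called once per batch with a list of items
        self.max_batch_size = max_batch_size
        self.wait_s = wait_s                # how long a partial batch may wait
        self.pending = []                   # (item, future) pairs not yet processed
        self.flush_task = None              # timer that flushes a partial batch

    async def submit(self, item):
        future = asyncio.get_running_loop().create_future()
        self.pending.append((item, future))
        if len(self.pending) >= self.max_batch_size:
            self._flush()                   # batch is full: process it now
        elif self.flush_task is None:
            self.flush_task = asyncio.create_task(self._flush_later())
        return await future

    async def _flush_later(self):
        await asyncio.sleep(self.wait_s)
        self.flush_task = None
        self._flush()

    def _flush(self):
        if self.flush_task is not None:     # a full batch makes the timer unnecessary
            self.flush_task.cancel()
            self.flush_task = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.batch_fn([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            future.set_result(result)


batch_sizes = []


def double_all(items):
    """Stand-in for a model that is cheaper to call on a whole batch."""
    batch_sizes.append(len(items))
    return [2 * x for x in items]


async def main():
    batcher = MicroBatcher(double_all, max_batch_size=4, wait_s=0.01)
    return await asyncio.gather(*(batcher.submit(i) for i in range(6)))


# In a notebook cell, use `results = await main()` instead of asyncio.run.
results = asyncio.run(main())
print(results)      # [0, 2, 4, 6, 8, 10]
print(batch_sizes)  # [4, 2]
```

Six concurrent requests result in only two calls to the batch function. In Ray Serve, the same effect is obtained declaratively; see the dynamic batching guide linked in the table.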


## 2. Overview of Ray Serve

Ray Serve is a scalable framework for serving ML applications, built on top of Ray.

### Applications

Here is a high-level overview of the architecture of a Ray Serve Application.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

A Ray Serve cluster is made up of one or more Applications.

An Application is composed of one or more Deployments that work together. Key characteristics:
- Applications are coarse-grained units of functionality
- They can be **independently upgraded** without affecting other applications running on the same cluster
- They provide isolation and separate deployment lifecycles

### Deployments

A Deployment is the fundamental building block in Ray Serve's architecture.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

Deployments enable:
- Separation of concerns (e.g., different models, business logic, data transformations)
- **Independent scaling**, including autoscaling capabilities
- Multiple replicas for handling concurrent requests


### Replicas
Each Replica is a worker process (a Ray actor) with its own request processing queue. Each replica can:

- Specify its own hardware and resource requirements (e.g., GPUs)
- Specify its own runtime environment (e.g., libraries)
- Maintain state (e.g., models)

This architecture provides a clean separation of concerns while enabling high scalability and efficient resource utilization.

## 3. Implement an image classification service

Let’s jump right in and get a simple ML service up and running on Ray Serve. 

Here is an image classification service that performs inference on a batch of handwritten digits using an `MNISTClassifier` model.

In [None]:
class MNISTClassifier:
    def __init__(self, remote_path: str, local_path: str, device: str):
        subprocess.run(f"aws s3 cp {remote_path} {local_path} --no-sign-request", shell=True, check=True)
        
        self.device = device
        self.model = torch.jit.load(local_path).to(device).eval()

    def __call__(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        return self.predict(batch)
    
    def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to(self.device)

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

First, we need to load the classifier model:

In [None]:
storage_folder = '/mnt/cluster_storage'  # Change this to a local folder if you run the notebook locally
model_path = f"{storage_folder}/model.pt"  # Where the downloaded model is saved
classifier = MNISTClassifier(remote_path="s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt", local_path=model_path, device="cpu")

Then we can run inference to generate predicted labels:

In [None]:
output = classifier({"image": np.random.rand(1, 1, 28, 28).astype(np.float32)})  # Example input (B, C, H, W)
output["predicted_label"]  # Should be a numpy array with the predicted label
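
To see what the final step of `predict` does, here is a small worked example with made-up logits. It assumes the model emits one score per digit class, i.e. an array of shape `(B, 10)`; the values below are invented for illustration and do not come from the real model.

```python
import numpy as np

# Hypothetical logits for a batch of 3 images over the 10 digit classes.
logits = np.zeros((3, 10), dtype=np.float32)
logits[0, 7] = 4.2  # image 0 scores highest for digit 7
logits[1, 0] = 1.5  # image 1 scores highest for digit 0
logits[2, 3] = 0.9  # image 2 scores highest for digit 3

# argmax over axis=1 picks the best-scoring class for each image.
predicted_label = np.argmax(logits, axis=1)
print(predicted_label)        # [7 0 3]
print(predicted_label.shape)  # (3,)
```

The result has one label per image, so its length always matches the batch dimension of the input.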

Now, if we want to migrate to an online inference setting, we can transform this into a Ray Serve Deployment by applying the `@serve.deployment` decorator:


In [None]:
@serve.deployment() # this is the decorator to add
class OnlineMNISTClassifier:
    # same code as MNISTClassifier.__init__
    def __init__(self, remote_path: str, local_path: str, device: str):
        subprocess.run(f"aws s3 cp {remote_path} {local_path} --no-sign-request", shell=True, check=True)
        
        self.device = device
        self.model = torch.jit.load(local_path).to(device).eval()

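    async def __call__(self, request: Request) -> dict[str, Any]:  # __call__ now takes a Starlette Request object
        # request.json() parses the HTTP body. The test client below posts a JSON-encoded
        # string, so the parsed body is itself a str that needs a second json.loads.
        batch = json.loads(await request.json())
        return await self.predict(batch)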
    
    # same code as MNISTClassifier.predict
    async def predict(self, batch: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
        images = torch.tensor(batch["image"]).float().to(self.device)

        with torch.no_grad():
            logits = self.model(images).cpu().numpy()

        batch["predicted_label"] = np.argmax(logits, axis=1)
        return batch

We have now defined our Ray Serve deployment:

In [None]:
OnlineMNISTClassifier

We can now build an Application using the `OnlineMNISTClassifier` deployment:

In [None]:
model_path = f"{storage_folder}/model.pt" # Use your local path
mnist_app = OnlineMNISTClassifier.bind(remote_path="s3://anyscale-public-materials/ray-ai-libraries/mnist/model/model.pt", local_path=model_path, device="cpu")
mnist_app

<div class="alert alert-block alert-info">

**Note:** `.bind` takes the arguments that will be passed to the Deployment's constructor and returns an Application. The class itself is instantiated later, inside each replica.

</div>


We can then run the application:

In [None]:
mnist_app_handle = serve.run(mnist_app, name='mnist_classifier', blocking=False)
mnist_app_handle

We can test it as an HTTP endpoint:

In [None]:
images = np.random.rand(2, 1, 28, 28).tolist()
json_request = json.dumps({"image": images})
response = requests.post("http://localhost:8000/", json=json_request)
response.json()["predicted_label"]
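
The request above is JSON-encoded twice: once by `json.dumps` in the notebook and once more by `requests`, which serializes whatever is passed as `json=`. This is why the deployment's `__call__` applies `json.loads` to the already parsed body. The standard-library sketch below reproduces both encoding steps without any network call; `payload` is a small stand-in for the image data.

```python
import json

payload = {"image": [[0.1, 0.2], [0.3, 0.4]]}  # stand-in for the image batch

json_request = json.dumps(payload)  # first encoding: dict -> str
body = json.dumps(json_request)     # second encoding, as done for `json=json_request`

decoded_once = json.loads(body)           # what `await request.json()` yields: a str
decoded_twice = json.loads(decoded_once)  # the extra json.loads recovers the dict

print(type(decoded_once).__name__)  # str
print(decoded_twice == payload)     # True
```

An alternative design would send the dict directly with `json={"image": images}` and drop the second `json.loads` on the server. Both sides would have to change together.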

We can also call the deployment directly from Python, without HTTP, through the `DeploymentHandle` that `serve.run` returned:

In [None]:
batch = {"image": np.random.rand(10, 1, 28, 28)}
response = await mnist_app_handle.predict.remote(batch)
response["predicted_label"]

## 4. Development workflow with Ray Serve

1. Define the application in a `main.py` file.
2. Deploy the application with `serve run`.
3. Optionally, specify configuration in a `config.yaml` file:
    - Use `serve build` to scaffold a basic `config.yaml` file.
    - This is useful if you want to decouple the deployment configuration from the code.
4. After making a change:
    - Re-run the application with `serve run`.
    - Note that there is experimental support for hot-reloading changes to the application (using `serve run --reload`).

In [None]:
# run the app with default config
!cd intro/ && serve run main:mnist_app --non-blocking --name app1

In [None]:
# build and optionally customize config
!cd intro/ && serve build -o config.yaml main:mnist_app 

In [None]:
# update the running app
!cd intro/ && serve run config.yaml --non-blocking

To **parameterize how the application is built**, use the "application builder" pattern: set the import path to a callable that returns an application. Arguments given as `key=value` on the `serve run` command line are passed to that callable.

To view an example, see `app_builder.py`.

In [None]:
!cd intro/ && serve run app_builder:build_app --non-blocking --name app1 device=cpu

For more details on the recommended development workflow, see the [docs here](https://docs.ray.io/en/latest/serve/advanced-guides/dev-workflow.html#development-workflow).


For unit testing and debugging, Ray Serve provides a local testing mode. For more details, see the [docs here](https://docs.ray.io/en/latest/serve/advanced-guides/dev-workflow.html#local-testing-mode)

In [None]:
# Run this cell for file cleanup 
!rm {storage_folder}/model.pt