# Online Model Serving with Ray Serve
© 2025, Anyscale. All Rights Reserved

💻 **Launch Locally**: You can run this notebook locally.

🚀 **Launch on Cloud**: You can also run this notebook on a Ray cluster (click [here](http://console.anyscale.com/register) to start a Ray cluster on Anyscale).

Model serving is the process of deploying machine learning (ML) models to production so that
applications and users can access them. It involves exposing an API or interface through which clients send requests to the model
and receive predictions in response. Several libraries and frameworks are available for model serving,
each with its own features and capabilities. In this notebook, we use Ray Serve and FastAPI to deploy a sentiment analysis
model.


### What is Ray Serve?
Ray Serve is a scalable model serving library that allows you to deploy and manage machine learning models in production.
With Ray Serve, you can easily create a scalable and distributed serving architecture that
can handle high traffic and large workloads. It is built on top of Ray, a distributed computing framework
that allows you to run Python code in parallel across multiple machines. 
Ray Serve provides a simple API for deploying and managing models, as well as features like autoscaling,
load balancing, and versioning.

Ray Serve is designed to be easy to use and integrate with existing machine learning workflows.
It supports a wide range of machine learning frameworks, including TensorFlow, PyTorch, and Scikit-learn.
Ray Serve also provides a simple way to deploy models as REST APIs, using FastAPI,
making it easy to integrate with web applications and other services.

More information: https://docs.ray.io/en/latest/serve/index.html


### Why not just use FastAPI or Flask?
We could have used FastAPI or Flask alone to create a REST API for the model,
but Ray Serve adds features such as autoscaling and load balancing that
make it a better fit for production deployments. Ray Serve also lets you
deploy multiple models and manage their versions, which is useful in a production environment
where you need to serve several models or update existing ones.

### Outline
<div class="alert alert-block alert-info">
<ul>
    <li>Architecture
    <li>Import Libraries
    <li>FastAPI service to accept HTTP requests and scaling with Ray Serve
    <li>Simulate Client: Send test requests
    <li>Shut down the Serve app and the Ray cluster
</ul>
</div>

## Architecture

![Architecture Diagram](https://lz-public-demo.s3.us-east-1.amazonaws.com/anyscale101/01_examples/03_Ray_Serve_architecture.svg?sanitize=true)

### Import libraries
In addition to Ray and Ray Serve, we import FastAPI to create the web service and Hugging Face Transformers to download the ML model.

In [1]:
# Import ray serve and FastAPI libraries
import ray
from ray import serve
from fastapi import FastAPI

# library for pre-trained models
from transformers import pipeline

## Create a FastAPI web service and deploy a model
FastAPI is used to create a web service, 'app', that accepts HTTP requests.

The MySentimentModel class loads the ML model and defines the *predict* method for online inference. The *@serve.deployment* decorator turns the class into a Ray Serve deployment.

*@app.get()* creates a GET '/predict' route. Similarly, *@app.post()* can be used for POST requests. See https://docs.ray.io/en/latest/serve/http-guide.html for more details.

In this example, the *application_logic()* method stands in for any transformation or business logic that should run before the input is sent to the ML model for inference. See the inline comments for further explanation.

### Scaling deployment
The *num_replicas* parameter sets the number of replicas (instances) of the deployment. Ray Serve automatically load-balances incoming requests across the replicas. Further options, passed through *ray_actor_options*, reserve resources for each replica, such as GPUs (including fractional GPUs) or a specific *accelerator_type*. See the configuration options here: https://docs.ray.io/en/latest/serve/configure-serve-deployment.html.
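
As a rough illustration of fractional GPUs, the sketch below keeps the deployment options in a plain dictionary and computes how many GPUs the replicas would reserve in total. It assumes the `ray_actor_options` and `num_gpus` keys described in the configuration guide linked above; the values and the helper `total_gpus` are hypothetical and only serve the example.

```python
# Keyword arguments that could be passed to @serve.deployment(...).
# The values are illustrative assumptions, not tuned recommendations.
gpu_sharing_options = {
    "num_replicas": 2,                       # two copies of the model
    "ray_actor_options": {"num_gpus": 0.5},  # each replica reserves half a GPU
}


def total_gpus(options: dict) -> float:
    """Return the number of GPUs all replicas reserve together."""
    per_replica = options.get("ray_actor_options", {}).get("num_gpus", 0)
    return options.get("num_replicas", 1) * per_replica


# Two replicas at half a GPU each fit on a single GPU.
print(total_gpus(gpu_sharing_options))
```

The dictionary would be applied with `@serve.deployment(**gpu_sharing_options)` in place of `@serve.deployment(num_replicas=2)`.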

In [2]:

# Define a simple FastAPI app
app = FastAPI()

# Define a Ray Serve deployment
# This decorator registers the class as a Ray Serve deployment
@serve.deployment(num_replicas=2) # num_replicas specifies the number of replicas for load balancing
@serve.ingress(app) # This decorator allows the FastAPI app to be served by Ray Serve
class MySentimentModel:
    def __init__(self):
        # Load a pre-trained sentiment analysis model
        self.model = pipeline("sentiment-analysis",
                              model="distilbert-base-uncased-finetuned-sst-2-english")

    # Define any necessary application logic or transformation logic
    def application_logic(self, text):
        """Apply any necessary application logic to the input text."""
        # Simple application logic: truncate the text to 50 characters and lowercase it
        if len(text) > 50:
            return text[:50].lower()  # Truncate and convert to lowercase
        else:
            return text.lower()

    @app.get("/predict")  # Define an endpoint for predictions
    def predict(self, text: str):
        """Predict sentiment for the given text."""
        # Apply the application logic to the input text before inference
        text = self.application_logic(text)

        # Use the model to make a prediction
        result = self.model(text)
        return {"text": text, "sentiment": result}
 




### Deploy the model

In [3]:

serve.run(MySentimentModel.bind())  # Build the application from the deployment and deploy it to Ray Serve


2025-07-11 09:16:58,624 INFO worker.py:1908 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(ProxyActor pid=56603) INFO 2025-07-11 09:16:59,927 proxy 127.0.0.1 -- Proxy starting on node c2986ec5ee9361515a52f3c506843d5d247ffc6732250203b8a47f21 (HTTP port: 8000).
(ProxyActor pid=56603) INFO 2025-07-11 09:16:59,946 proxy 127.0.0.1 -- Got updated endpoints: {}.
INFO 2025-07-11 09:16:59,975 serve 18071 -- Started Serve in namespace "serve".
(ServeController pid=56591) INFO 2025-07-11 09:17:00,075 controller 56591 -- Deploying new version of Deployment(name='MySentimentModel', app='default') (initial target replicas: 2).
(ServeController pid=56591) INFO 2025-07-11 09:17:00,178 controller 56591 -- Adding 2 replicas to Deployment(name='MySentimentModel', app='default').
(ProxyActor pid=56603) INFO 2025-07-11 09:17:00,076 proxy 127.0.0.1 -- Got updated endpoints: {Deployment(name='MySentimentModel', ap

DeploymentHandle(deployment='MySentimentModel')

(ServeReplica:default:MySentimentModel pid=56599) INFO 2025-07-11 09:20:39,996 default_MySentimentModel kcd1ofsq 1d920c68-1c59-4e2e-a64c-04f87d31bab7 -- GET /predict 200 128.0ms
(ServeReplica:default:MySentimentModel pid=56595) INFO 2025-07-11 09:21:11,350 default_MySentimentModel t4o4t2pu dda6dadc-f201-421d-9bbc-638d92b9f065 -- GET /predict 200 543.2ms
(ServeReplica:default:MySentimentModel pid=56599) INFO 2025-07-11 09:21:15,969 default_MySentimentModel kcd1ofsq e3611603-54d1-491a-8b05-1f693f413243 -- GET /predict 200 36.5ms
(ServeReplica:default:MySentimentModel pid=56599) INFO 2025-07-11 09:22:16,285 default_MySentimentModel kcd1ofsq 5be0e806-0bd5-4992-b15d-c59fd52e6195 -- GET /predict 200 422.7ms
(ServeReplica:default:MySentimentModel pid=56599) INFO 2025-07-11 09:22:58,953 default_MySentimentModel kcd1ofsq 8890c548-abf6-49b2-b923-b525fcc403da -- GET /predict 200 76.5ms
(ServeController pid=56591) INFO 2025-07-11 09:23:20,248 c

## Simulate Client: Send test requests

We use the *requests* library to send HTTP requests to the deployed model.

Note: if Serve fails to start, the most likely cause is a previous Serve instance that was not shut down properly. Restart the notebook, or see the end of this notebook for how to gracefully shut down Ray Serve and the Ray cluster.

In [4]:
import requests # used to send HTTP requests to the deployed model

In [7]:

# Query the deployed model
response = requests.get("http://localhost:8000/predict", params={"text": "I love Ray Serve!"})
print(response.json())  # Should print the sentiment analysis result


{'text': 'i love ray serve!', 'sentiment': [{'label': 'POSITIVE', 'score': 0.9998507499694824}]}
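
The `params` argument above ends up as a URL query string. As a small sketch using only the standard library, the snippet below builds approximately the URL that such a request targets; the exact encoding done by *requests* may differ in minor details, so treat this as an illustration.

```python
from urllib.parse import urlencode

base_url = "http://localhost:8000/predict"

# urlencode turns the dictionary into a query string:
# spaces become '+' and '!' becomes '%21'
query = urlencode({"text": "I love Ray Serve!"})
full_url = f"{base_url}?{query}"

print(full_url)
```

Pasting the printed URL into a browser or passing it to `curl` is a quick way to check the endpoint without Python.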


In [8]:
text = "Mars landscape is a tough place to live, but it has its own beauty. Venus is even tougher, with its extreme heat and pressure. "
response = requests.get("http://localhost:8000/predict", params={"text": text})
print(response.json())  # Prints the result; note that the application logic truncates the text to 50 characters

{'text': 'mars landscape is a tough place to live, but it ha', 'sentiment': [{'label': 'NEGATIVE', 'score': 0.9590334296226501}]}
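
The truncation visible in this output can be reproduced without a running deployment. The sketch below restates the preprocessing step as a standalone function (an equivalent one-line form, since slicing a shorter string returns it unchanged); the constant name `MAX_CHARS` is introduced here for illustration only.

```python
MAX_CHARS = 50  # same limit as in the deployment's application_logic


def application_logic(text: str) -> str:
    """Truncate the text to MAX_CHARS characters and lowercase it."""
    return text[:MAX_CHARS].lower()


long_text = (
    "Mars landscape is a tough place to live, but it has its own beauty. "
    "Venus is even tougher, with its extreme heat and pressure. "
)
print(application_logic(long_text))
print(application_logic("I love Ray Serve!"))
```

Because the model only sees the first 50 characters, anything after that point in the sentence has no influence on the predicted sentiment.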


In [9]:
text = "National Parks are a great way to experience nature and wildlife."
response = requests.get("http://localhost:8000/predict", params={"text": text})
print(response.json())  # Prints the result; note that the application logic truncates the text to 50 characters

{'text': 'national parks are a great way to experience natur', 'sentiment': [{'label': 'POSITIVE', 'score': 0.9996353387832642}]}
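
Client code usually needs only the label and score rather than the whole response. Here is a minimal sketch, assuming the response shape shown in the outputs above; the helper name `summarize` is hypothetical.

```python
def summarize(body: dict) -> str:
    """Format the first prediction in a response body as 'LABEL (score%)'."""
    top = body["sentiment"][0]  # the pipeline returns a list with one entry per input
    return f"{top['label']} ({top['score']:.2%})"


# Response body copied from the output above
body = {
    "text": "national parks are a great way to experience natur",
    "sentiment": [{"label": "POSITIVE", "score": 0.9996353387832642}],
}
print(summarize(body))
```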


### Shut down Ray Serve and the Ray cluster

In [10]:
# Stop Ray Serve
serve.shutdown()  # Shuts down Ray Serve; the Ray cluster keeps running

In [11]:
ray.shutdown()  # Shuts down the Ray cluster

### Summary
In this notebook, we deployed a sentiment analysis model from Hugging Face using Ray Serve and FastAPI. We used *num_replicas* to set the number of instances of the model. Ray Serve also offers autoscaling options that add replicas when traffic is high and scale down to zero when there is no traffic.
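
As a pointer for that next step, the sketch below shows what an autoscaling configuration might look like as a plain dictionary. It assumes the `autoscaling_config` option with its `min_replicas` and `max_replicas` keys from the Ray Serve configuration guide; the bounds are illustrative and the helper `check_bounds` is hypothetical.

```python
# Keyword arguments that could be passed to @serve.deployment(...)
# instead of a fixed num_replicas. Values are illustrative assumptions.
autoscaling_options = {
    "autoscaling_config": {
        "min_replicas": 0,  # allow scaling down to zero when idle
        "max_replicas": 4,  # upper bound under heavy traffic
    },
}


def check_bounds(options: dict) -> bool:
    """Return True if the replica bounds are consistent (0 <= min <= max)."""
    config = options["autoscaling_config"]
    return 0 <= config["min_replicas"] <= config["max_replicas"]


print(check_bounds(autoscaling_options))
```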