In [None]:
import ray
from ray.air.config import ScalingConfig
from ray import serve
import requests, json
from starlette.requests import Request
from typing import Dict
from transformers import SegformerForSemanticSegmentation, SegformerFeatureExtractor
from PIL import Image, ImageEnhance
import numpy as np
import matplotlib.pyplot as plt
import torch
import pickle
from io import BytesIO

# Ray Serve

## Intro

### Outline

-   Deployments
    -   Resources (CPU/GPU/custom)
    -   Runtime environments support, usage (functionality)
    -   Bound deployments, ServeHandles
-   Scaling and Performance
    -   Replicas
        -   num_replicas, autoscaling_config, max_concurrent_queries
    -   Request batching
-   Composition Patterns
    -   Imperative
    -   Declarative / Graph Deployment API
-   Architecture / Under-the-hood
    -   Ray cluster perspective - processes / workers / actors
    -   Request routing, queuing, load balancing in Serve

### Example scenario: computer vision services

For our example use case, we’ll see how to leverage Ray Serve to host a CV segmentation
model and how to enhance it using additional services such as image preprocessing.

### Context: Ray AIR

Ray AIR is the Ray AI Runtime, a set of high-level easy-to-use APIs for
ingesting data, training models – including reinforcement learning
models – tuning those models and then serving them.

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Introduction_to_Ray_AIR/e2e_air.png" width=600 loading="lazy"/>

Key principles behind Ray and Ray AIR are
* Performance
* Developer experience and simplicity

# Ray Serve

Serve is a microservices framework for serving ML – the model serving
component of Ray AIR.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

# Deployments

`Deployment` is the fundamental user-facing element of Serve.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

## Our First Service

Let’s jump right in and get something simple up and running on Ray
Serve.

In [None]:
@serve.deployment
class TextConverter:
    def convert(self, text):
        return "***" + str.upper(text) + "***"

In [None]:
app_handle = serve.run(TextConverter.bind())

## Key APIs and concepts

`Deployment` represents a service and is created with the `@serve.deployment` decorator
* As end users, we don't instantiate `Deployment`s directly
* Ray will create them as actors, per our scaling requirements

A __bound deployment__ is created with the `.bind` class method on the deployment class
* e.g., `TextConverter.bind()` above creates a bound deployment
* `.bind` allows us to provide constructor params for the deployment class (`TextConverter` has no constructor params, so we pass none)
* bound deployments *can* be passed to other deployments via `.bind` -- this is one way to compose services
* We can pass a bound deployment to `serve.run(...)`
    * to start a service
    * to obtain a `ServeHandle`

A `ServeHandle` can be used to invoke services through the Python API
* At runtime, services can call other services via serve handles
    * Bound deployments provided to deployment constructors via `.bind` become serve handles at runtime

In [None]:
print(type(app_handle))
print(app_handle)

Look at Actors in the dashboard. Why are deployment replicas actors?

In [None]:
app_handle.convert.remote("cat")

In [None]:
ray.get(app_handle.convert.remote("cat"))

Ok, we have a minimal deployment built and running!

In [None]:
serve.shutdown()

What do we want to do next?
* Support some image processing
* Support HTTP

Then...
* Image segmentation with SegFormer
* Service composition -- e.g., grayscale/resize/sharpen/etc. and then segment

And finally...
* Manage GPUs (and resources generally)
* Multiple replicas, autoscaling

In [None]:
@serve.deployment
class Threshold:
    def __init__(self, threshold: int):
        self._threshold = threshold # initial state
    
    def get_response(self, image):
        new_image = np.zeros_like(image)
        new_image[image > self._threshold] = 255
        new_image[new_image < 255] = 0
        return new_image

app_handle = serve.run(Threshold.bind(threshold=128), name='hello_image_world')

In [None]:
im = Image.open("images/cat.jpg")

im

In [None]:
im = im.resize((512,384))

im

In [None]:
np.array(im.getdata()).shape

In [None]:
arr = np.array(im.getdata())
arr = arr.reshape(-1, 512, 3)

plt.imshow(arr)

In [None]:
plt.imshow(arr.mean(axis=2), cmap='gray')

In [None]:
output_ref = app_handle.get_response.remote(arr)

In [None]:
plt.imshow(ray.get(output_ref))

In [None]:
serve.delete('hello_image_world')

Add HTTP ... this is a bit messier just because of conversion between bytes, arrays, and HTTP tools

In [None]:
@serve.deployment
class Threshold:
    def __init__(self, threshold: int):
        self._threshold = threshold # initial state

    def get_response(self, image):
        new_image = np.zeros_like(image)
        new_image[image > self._threshold] = 255
        new_image[new_image < 255] = 0
        return new_image
    
    # a lot of boilerplate as HTTP adapter for images + ndarrays (a text/JSON example would be about 3 lines)
    async def __call__(self, request: Request) -> Dict:
        import numpy as np
        import io
        from imageio import v3 as iio
        from fastapi import Response

        # async collect POST body
        body = await request.body()
        
        # unpickle serialized data
        image = pickle.loads(body)
        
        # get NDArray for our image processing
        data = np.array(image)
        
        # invoke existing business logic
        transformed_data = self.get_response(data)
        
        # convert to image
        transformed_image = Image.fromarray(transformed_data.astype(np.uint8))
        
        # prepare output buffer
        with io.BytesIO() as buf:
            iio.imwrite(buf, transformed_image, plugin="pillow", format="JPEG")
            im_bytes = buf.getvalue()
        
        # prepare and return HTTP Response
        headers = {'Content-Disposition': 'inline'}
        return Response(im_bytes, headers=headers, media_type='image/jpeg')

app_handle = serve.run(Threshold.bind(threshold=128), name='hello_image_world')

Threshold our cat via HTTP

(if we are working with arrays on the client side and want to make an image from an array, we'd call `Image.fromarray(my_array)`)

In [None]:
response = requests.post("http://localhost:8000/", data = pickle.dumps(im)) # uncompressed

response

In [None]:
Image.open(BytesIO(response.content))

In [None]:
serve.delete('hello_image_world')

## Build a semantic segmentation service on SegFormer

At this point, we've done all the hard work -- we know the structure of our service code.

In this use case, we're going to build and test
* SegFormer-based segmentation service
* Image prep service (as a demo, we'll just convert the image to grayscale, but feel free to experiment with other transformations)
* an Ingress service, to separate the HTTP handling code from our other components

### Segmentation service

In [None]:
@serve.deployment
class Segmenter:
    def __init__(self, model_name):
        self.model = SegformerForSemanticSegmentation.from_pretrained(model_name)
        self.feature_extractor = SegformerFeatureExtractor.from_pretrained(model_name, do_reduce_labels=True)

    def segment(self, image) -> np.ndarray: # can process PIL Image, or torch/np tensor

        batch = [image]
        # Set the device on which PyTorch will run.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(device)  # Move the model to specified device.
        self.model.eval()  # Set the model in evaluation mode on test data.

        # The feature extractor processes raw images.
        inputs = self.feature_extractor(images=batch, return_tensors="pt")

        # The model is applied to input images in the inference step.
        with torch.no_grad():
            outputs = self.model(pixel_values=inputs.pixel_values.to(device))

        # Post-process the output for display.
        image_sizes = [image.size[::-1] for image in batch]
        segmentation_maps_postprocessed = (
            self.feature_extractor.post_process_semantic_segmentation(
                outputs=outputs, target_sizes=image_sizes
            )
        )

        return [j.detach().cpu().numpy() for j in segmentation_maps_postprocessed][0]

In [None]:
segmenter = Segmenter.bind("nvidia/segformer-b0-finetuned-ade-512-512")

In [None]:
app_handle = serve.run(segmenter, name='seg')

In [None]:
out = app_handle.segment.remote(im)

In [None]:
plt.imshow(ray.get(out))

In [None]:
serve.delete('seg')

Ok, our segmenter service works!

#### Next, we'll create a simple image processor service which does some preprocessing -- ours will "sharpen" -- and then segments.

The idea is that a user can call our image processor to do the combined tasks, and it will let us demonstrate how to __compose__ two Ray Serve services.

In [None]:
@serve.deployment
class SharpenThenSegment:
    def __init__(self, segmenter_service):
        self._segmenter_service = segmenter_service

    async def process(self, image):
        # first we do some custom preprocessing -- in this case a sharpening
        enhancer = ImageEnhance.Sharpness(image)
        processed_image = enhancer.enhance(8)
        # now we use the segmenter -- note the *remote* and *await*
        result = await self._segmenter_service.segment.remote(processed_image)
        return await result

In [None]:
segmenter = Segmenter.bind("nvidia/segformer-b0-finetuned-ade-512-512")
sharpen_then_segment = SharpenThenSegment.bind(segmenter)

app_handle = serve.run(sharpen_then_segment, name='composition_test')

In [None]:
result = app_handle.process.remote(im)

plt.imshow(ray.get(result))

In [None]:
serve.delete('composition_test')

Now we'll wrap all of this behind an ingress service, to demonstrate more composition and factor out the HTTP handling.

In [None]:
@serve.deployment
class Ingress:
    def __init__(self, processor_service):
        self._processor_service = processor_service
    
    # a lot of boilerplate as HTTP adapter for images + ndarrays (a text/JSON example would be about 3 lines)
    async def __call__(self, request: Request) -> Dict:
        import numpy as np
        import io
        from imageio import v3 as iio
        from fastapi import Response

        # async collect POST body
        body = await request.body()
        
        # unpickle serialized data
        image = pickle.loads(body)
        
        # invoke existing business logic; await to get obj ref to result
        ref = await self._processor_service.process.remote(image)
        
        # await the actual data, since we need it for the remaining conversion steps
        transformed_data = await ref
        
        # convert to image
        transformed_image = Image.fromarray(transformed_data.astype(np.uint8))
        
        # prepare output buffer
        with io.BytesIO() as buf:
            iio.imwrite(buf, transformed_image, plugin="pillow", format="JPEG")
            im_bytes = buf.getvalue()
        
        # prepare and return HTTP Response
        headers = {'Content-Disposition': 'inline'}
        return Response(im_bytes, headers=headers, media_type='image/jpeg')

In [None]:
segmenter = Segmenter.bind("nvidia/segformer-b0-finetuned-ade-512-512")
sharpen_then_segment = SharpenThenSegment.bind(segmenter)
ingress = Ingress.bind(sharpen_then_segment)

app_handle = serve.run(ingress, name='composition')

In [None]:
response = requests.post("http://localhost:8000/", data = pickle.dumps(im))

Image.open(BytesIO(response.content))

In [None]:
serve.delete('composition')

## Specifying service resources and scaling

### Resources

Resources -- typically GPUs, although CPUs and custom resources work the same way -- can be specified on a per-deployment basis and, if we want, in fractional units, via the `ray_actor_options` parameter on the `@serve.deployment` decorator.

Resources can include
* `num_cpus`
* `num_gpus`
* `resources` dictionary containing custom resources
    * custom resources are tracked and accounted as symbols (or tags) in order to match actors to workers
    
Example
```python
@serve.deployment(ray_actor_options={'num_cpus': 2, 'num_gpus': 2, 'resources': {"my_super_accelerator": 1}})
class Demo:
    ...
```

The purpose of the declarative resource mechanism is to allow Ray to place code on suitable nodes in a heterogeneous cluster without our having to know which nodes have which resources or where our code should run.

> Best practice: if some nodes have a distinguishing feature, mark and request it as a resource, rather than trying to determine which nodes are present and where your code will run.

For more details, see https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#resource-management-cpus-gpus
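
The matching of declared resources to nodes can be pictured with a small toy model. This is a deliberately simplified sketch and not Ray's scheduler: the `place` function, the node names, and the first-fit policy are assumptions made for illustration. It shows why fractional GPU requests let two replicas share one GPU and why a custom resource acts as a tag that steers placement.

```python
# Toy placement model, not Ray's scheduler: pick the first node whose
# free resources cover the request, then subtract what was requested.
def place(request, nodes):
    for name, free in nodes.items():
        if all(free.get(res, 0) >= amount for res, amount in request.items()):
            for res, amount in request.items():
                free[res] -= amount
            return name
    return None  # nothing fits; a real cluster would wait or scale up


# Hypothetical two-node cluster.
nodes = {
    "cpu-node": {"CPU": 8},
    "gpu-node": {"CPU": 4, "GPU": 1, "my_super_accelerator": 1},
}

half_gpu = {"CPU": 1, "GPU": 0.5}
placements = [
    place(half_gpu, nodes),                     # fits on gpu-node
    place(half_gpu, nodes),                     # shares the same GPU
    place(half_gpu, nodes),                     # GPU is now exhausted
    place({"my_super_accelerator": 1}, nodes),  # matched by custom tag
]
print(placements)
```

The third request is not placed because both halves of the single GPU are already claimed, while the custom-resource request lands on the only node that advertises that tag.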

### Replicas and autoscaling

Each deployment can have its own resource management and autoscaling configuration, with several options for scaling.

By default -- if nothing is specified, as in our examples above -- each deployment runs a single replica. We can specify a larger, constant number of replicas in the decorator:
```python
@serve.deployment(num_replicas=3)
```

For autoscaling, instead of `num_replicas`, we provide an `autoscaling_config` dictionary. With autoscaling, we can specify a minimum and maximum range for the number of replicas, the initial replica count, a load target, and more.

Here is an example of an extended configuration -- see https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#scaling-and-resource-allocation for more details:

```python
@serve.deployment(
    autoscaling_config={
        'min_replicas': 1,
        'initial_replicas': 2,
        'max_replicas': 5,
        'target_num_ongoing_requests_per_replica': 10,
    }
)
```

`min_replicas` can also be set to zero to create a "serverless" style design: in exchange for potentially slower startup, no actors (or their CPU/GPU resources) need to be permanently reserved.
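
As a rough intuition for how these settings interact, the sketch below computes a replica count from the current load. It is a simplified approximation and not Serve's actual autoscaling algorithm, which also applies smoothing and upscale/downscale delays; `desired_replicas` is a hypothetical helper written for this illustration.

```python
import math


def desired_replicas(total_ongoing_requests, config):
    """Simplified: replicas needed to hit the per-replica load target,
    clamped to the configured [min_replicas, max_replicas] range."""
    target = config["target_num_ongoing_requests_per_replica"]
    wanted = math.ceil(total_ongoing_requests / target)
    return max(config["min_replicas"], min(config["max_replicas"], wanted))


config = {
    "min_replicas": 1,
    "max_replicas": 5,
    "target_num_ongoing_requests_per_replica": 10,
}

# Idle, moderate, and heavy load.
results = {load: desired_replicas(load, config) for load in (0, 25, 80)}
print(results)
```

With this configuration, an idle service stays at the minimum of one replica, 25 ongoing requests call for three replicas, and 80 ongoing requests are capped at the maximum of five.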

In [None]:
@serve.deployment(ray_actor_options={'num_gpus': 0.3}, autoscaling_config={ 'min_replicas': 2, 'max_replicas': 3 })
class Segmenter:
    def __init__(self, model_name):
        self.model = SegformerForSemanticSegmentation.from_pretrained(model_name)
        self.feature_extractor = SegformerFeatureExtractor.from_pretrained(model_name, do_reduce_labels=True)

    def segment(self, image) -> np.ndarray: # can process PIL Image, or torch/np tensor

        batch = [image]
        device = 'cuda:0' # explicitly name GPU for demo
        self.model.to(device)
        self.model.eval()  # Set the model in evaluation mode on test data.

        inputs = self.feature_extractor(images=batch, return_tensors="pt")

        with torch.no_grad():
            outputs = self.model(pixel_values=inputs.pixel_values.to(device))

        image_sizes = [image.size[::-1] for image in batch]
        segmentation_maps_postprocessed = (
            self.feature_extractor.post_process_semantic_segmentation(
                outputs=outputs, target_sizes=image_sizes
            )
        )

        return [j.detach().cpu().numpy() for j in segmentation_maps_postprocessed][0]

In [None]:
segmenter = Segmenter.bind("nvidia/segformer-b0-finetuned-ade-512-512")
sharpen_then_segment = SharpenThenSegment.bind(segmenter)
ingress = Ingress.bind(sharpen_then_segment)

app_handle = serve.run(ingress, name='resource_and_scaling')

Notice the two replicas in the Serve and Actors dashboards

In [None]:
Image.open(BytesIO(requests.post("http://localhost:8000/", data = pickle.dumps(im)).content))

Observe that GPU memory is being consumed

In [None]:
serve.delete('resource_and_scaling')

## Alternative composition pattern: Deployment Graph API

What is the Deployment Graph API?

* The Deployment Graph API lets us separate the flow of calls from the logic inside our services.

Why might we want to use the Deployment Graph (DAG) API to separate flow from logic?

* It may be valuable to add a layer of indirection – or abstraction – so that we can more easily create and compose reusable services
* The DAG API lets us use similar patterns across the Ray platform (e.g., Ray Workflow)
    * We can learn one general pattern for graphs and use that intuition in multiple places in our Ray applications
* Although we compose one DAG, we retain the key Ray Serve features of granular autoscaling and resource allocation

Let’s reproduce our image processing flow using the Deployment Graph API

#### Getting started with deployment graphs

For this example, we'll have a linear graph (flow)

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment_graph_simple.png' width=900/>

In [None]:
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver

`InputNode` is a special type of graph node, defined by Ray Serve, which represents values supplied to our service endpoint. 

In [None]:
@serve.deployment
async def unpack_request(request: Request):
    body = await request.body()
    image = pickle.loads(body)
    return image

@serve.deployment
def sharpener(image):
    enhancer = ImageEnhance.Sharpness(image)
    sharpened_image = enhancer.enhance(8)
    return sharpened_image

# in sequence, we'll segment next -- but we can re-use the segmenter we've already defined

@serve.deployment
def pack_response(image):
    import io
    from imageio import v3 as iio
    from fastapi import Response

    transformed_data = image  # segmentation map (ndarray) from the previous step
    transformed_image = Image.fromarray(transformed_data.astype(np.uint8))
    with io.BytesIO() as buf:
        iio.imwrite(buf, transformed_image, plugin="pillow", format="JPEG")
        im_bytes = buf.getvalue()
        
    headers = {'Content-Disposition': 'inline'}
    return Response(im_bytes, headers=headers, media_type='image/jpeg')

Here is a minimal, linear pipeline that performs the processing we implemented earlier.

We build up the graph step by step, `bind`ing each deployment to its dependencies.
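
The step-by-step binding can be pictured with a toy deferred-call graph in plain Python. This is only a sketch of the idea, not Ray Serve's DAG classes: `Node`, `bind`, and the four string-based stand-in functions are hypothetical, and text handling replaces the image steps so the example has no dependencies.

```python
# Toy deferred-call graph; illustrative only, not Ray Serve's DAG classes.
class Node:
    def __init__(self, fn, upstream):
        self.fn, self.upstream = fn, upstream

    def execute(self, request):
        # An upstream of None stands in for the input placeholder.
        if self.upstream is None:
            value = request
        else:
            value = self.upstream.execute(request)
        return self.fn(value)


def bind(fn, upstream=None):
    """Record the call; nothing runs until execute() is given a request."""
    return Node(fn, upstream)


# Stand-ins for unpack_request / sharpener / segment / pack_response.
def unpack(request):
    return request["body"]


def sharpen(text):
    return text.strip()


def segment(text):
    return text.split()


def pack(tokens):
    return {"status": 200, "tokens": tokens}


# Flow is declared here, separately from the logic in the functions above.
graph = bind(pack, bind(segment, bind(sharpen, bind(unpack))))
result = graph.execute({"body": "  a sleeping cat  "})
print(result)
```

The functions know nothing about each other; only the final `bind` chain fixes their order, which is the separation of flow from logic that the graph API aims for.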

In [None]:
segmenter = Segmenter.bind("nvidia/segformer-b0-finetuned-ade-512-512")

with InputNode() as http_request:
    input_image = unpack_request.bind(http_request)
    sharpened_image = sharpener.bind(input_image)
    segmented = segmenter.segment.bind(sharpened_image)    
    response = pack_response.bind(segmented)
    
graph = DAGDriver.bind(response)

We start the application by calling `serve.run()` on the DAGDriver, a Ray Serve component which routes HTTP requests through your call graph.

In [None]:
app_handle = serve.run(graph, name='basic_linear')

In [None]:
Image.open(BytesIO(requests.post("http://localhost:8000/", data = pickle.dumps(im)).content))

In [None]:
serve.delete('basic_linear')

## Architecture / under-the-hood

### Ray cluster perspective: actors

In Ray, user code is executed by worker processes. These workers can run tasks (stateless functions) or actors (stateful class instances).

Ray Serve is built on actors, allowing deployments to collect expensive state once (such as loading a ML model) and to reuse it across many service requests.

Although you may never need to code any Ray tasks or actors yourself, your Ray Serve application has full access to those cluster capabilities and you may wish to use them to implement other functionality (e.g., service or operations that don't need to accept HTTP traffic). More information is at https://docs.ray.io/en/latest/ray-core/walkthrough.html

### Serve design

Under the hood, a few other actors are used to make up a Serve instance.

* Controller: A global actor unique to each Serve instance is responsible for managing other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller.

* HTTP Proxy: By default there is one HTTP proxy actor on the head node that accepts incoming requests, forwards them to replicas, and responds once they are completed. For scalability and high availability, you can also run a proxy on each node in the cluster via the location field of http_options.

* Deployment Replicas: Actors that execute the code in response to a request. Each replica processes requests from the HTTP proxy.

<img src='https://docs.ray.io/en/latest/_images/architecture-2.0.svg' width=700 />

Incoming requests, once resolved to a particular deployment, are queued. The requests from the queue are assigned round-robin to available replicas as long as capacity is available. This design provides load balancing and elasticity. 

Capacity can be managed with the `max_concurrent_queries` parameter to the deployment decorator. This value defaults to 100 and represents the maximum number of queries that will be sent to a replica of this deployment without receiving a response. Each replica has its own queue to collect and smooth incoming request traffic.
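
The queueing behavior described above can be sketched as a toy router. This is a simplified single-process model and not Serve's router: `ToyRouter` and its methods are hypothetical names, and the cap is set to 1 (instead of the default of 100) so that back-pressure shows up after just two requests.

```python
from collections import deque
from itertools import cycle


class ToyRouter:
    """Round-robin assignment with a per-replica cap on in-flight requests."""

    def __init__(self, replica_names, max_concurrent_queries):
        self.in_flight = {name: 0 for name in replica_names}
        self.cap = max_concurrent_queries
        self.pending = deque()
        self._order = cycle(replica_names)

    def _next_available(self):
        # Try each replica at most once, continuing the round-robin order.
        for _ in range(len(self.in_flight)):
            name = next(self._order)
            if self.in_flight[name] < self.cap:
                return name
        return None

    def _drain(self):
        assigned = []
        while self.pending:
            name = self._next_available()
            if name is None:
                break  # every replica is at capacity; keep requests queued
            self.in_flight[name] += 1
            assigned.append((self.pending.popleft(), name))
        return assigned

    def submit(self, request):
        self.pending.append(request)
        return self._drain()

    def complete(self, replica):
        self.in_flight[replica] -= 1
        return self._drain()


router = ToyRouter(["replica-a", "replica-b"], max_concurrent_queries=1)
events = [
    router.submit("req-1"),
    router.submit("req-2"),
    router.submit("req-3"),        # both replicas busy, so it waits
    router.complete("replica-a"),  # frees capacity; req-3 is assigned
]
print(events)
```

The first two requests alternate between replicas, the third stays queued because both replicas are at their cap, and it is dispatched as soon as one replica reports completion.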

In [None]:
serve.shutdown()