# Multilingual Chat with Ray Serve

In [None]:
import ray
import requests, json
from starlette.requests import Request
from typing import Dict

from ray import serve

## Ray Serve is a microservices framework for serving ML – the model serving component of Ray

Ray Serve provides resource management, scaling, a straightforward component framework, FastAPI compatibility, and direct integration with the rest of the Ray ecosystem for scale-out compute.

In [None]:
ray.init()

### Chatbot using Huggingface LLM

To get started with Ray Serve, our main task is to take a standard Python class representing a service and
* add the `@serve.deployment` decorator
* optionally specify properties like resource requirements, environment, scaling, and more

Here, we take a "chatbot hello world" using a model from the Huggingface hub and a FastAPI-style HTTP wrapper, and create our first Ray Serve deployment.

Note that reserving a resource -- even a fractional GPU -- is done declaratively in the same line as the deployment decorator.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Chat:
    def __init__(self, model: str):
        # configure stateful elements of our service such as loading a model
        self._tokenizer = AutoTokenizer.from_pretrained(model)
        self._model = AutoModelForSeq2SeqLM.from_pretrained(model).to(0)

    async def __call__(self, request: Request) -> Dict:
        # path to handle HTTP requests
        data = await request.json()
        data = json.loads(data)
        # after decoding the payload, we delegate to get_response for logic
        return {"response": self.get_response(data["user_input"], data["history"])}

    def get_response(self, user_input: str, history: list[str]) -> str:
        # this method receives calls directly (from Python) or from __call__ (from HTTP)
        history.append(user_input)
        # the history is client-side state and will be a list of raw strings;
        # for the default config of the model and tokenizer, history should be joined with '</s><s>'
        inputs = self._tokenizer("</s><s>".join(history), return_tensors="pt").to(0)
        reply_ids = self._model.generate(**inputs, max_new_tokens=500)
        response = self._tokenizer.batch_decode(
            reply_ids.cpu(), skip_special_tokens=True
        )[0]
        return response

We've defined a service using the `@serve.deployment` decorator.
* Deployments can be service endpoints
* Or they can be individual components accessed from other services
* Each deployment can have its own resource management and autoscaling configuration
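
To make the history handling in `get_response` concrete, here is a minimal sketch of the prompt assembly on its own, with no model or tokenizer loaded. The helper name `build_prompt` and the sample turns are made up for illustration. Unlike the deployment, which calls `history.append(...)`, this sketch copies the list so the caller's history is left untouched.

```python
# Stand-alone illustration of the prompt assembly used in Chat.get_response.
# `build_prompt` is a hypothetical helper, not part of the deployment above.
SEPARATOR = "</s><s>"  # separator mentioned in the get_response comments


def build_prompt(user_input: str, history: list[str]) -> str:
    """Return the single string that would be handed to the tokenizer."""
    turns = history + [user_input]  # new list, so `history` is not mutated
    return SEPARATOR.join(turns)


history = ["Hello!", "Hi, how are you?"]
prompt = build_prompt("I'm fine, thanks.", history)
print(prompt)
```

Because the history lives on the client, every call resends the full conversation, so the prompt grows with each turn.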

Here is an example of an extended configuration -- see https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#scaling-and-resource-allocation for more details

```python
@serve.deployment(
    autoscaling_config={
        'min_replicas': 1,
        'initial_replicas': 2,
        'max_replicas': 5,
        'target_num_ongoing_requests_per_replica': 10,
    }
)
```
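
As a rough mental model of how these settings interact, the sketch below computes a replica count from the current load. This is a deliberately simplified assumption, not Ray's actual autoscaling algorithm, which also applies delays and smoothing; the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(
    total_ongoing: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified estimate: enough replicas to hit the per-replica target,
    clamped to the configured bounds."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Values taken from the configuration above: target 10, min 1, max 5.
for load in (0, 10, 25, 80):
    print(load, desired_replicas(load, 10, 1, 5))
```

With these numbers, 25 ongoing requests would call for 3 replicas, while 80 would be capped at the maximum of 5.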

In [None]:
chat = Chat.bind(model="facebook/blenderbot-400M-distill")

We created the object that represents this service. We do that by using the `.bind()` class method, which captures the "recipe" for instantiating replicas of our service. The params in `.bind()` -- in this case, the name of the Huggingface model we want to use -- will be provided to our deployment class constructor.
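
The idea of a "recipe" can be illustrated with a toy class that stores a constructor and its arguments and only builds instances on request. This is an analogy under stated assumptions, not how Ray Serve implements `.bind()`; `Recipe` and `ToyChat` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Recipe:
    """Hypothetical stand-in: remembers a class and its constructor arguments."""

    cls: type
    kwargs: dict[str, Any] = field(default_factory=dict)

    def build(self) -> Any:
        # Construction happens only here and can be repeated, once per replica.
        return self.cls(**self.kwargs)


class ToyChat:
    def __init__(self, model: str):
        self.model = model  # a real service would load the model here


recipe = Recipe(ToyChat, kwargs={"model": "facebook/blenderbot-400M-distill"})
replicas = [recipe.build() for _ in range(2)]
print([replica.model for replica in replicas])
```

Nothing expensive runs when the recipe is created; each replica gets its own independently constructed instance.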

In [None]:
handle = serve.run(chat, name="basic_chat")

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

`serve.run(...)` exposes our service endpoint for calls. We can run multiple simultaneous applications, each with its own collection of deployments, resource management, scaling, etc. The `name` parameter in `serve.run(...)` becomes an identifier not only for the application but also for each deployment that is launched to support that application.

Let's test out the service using Ray's Python API first.

In [None]:
message = "I can't wait to see more XFL football."
history = []
response_handle = handle.get_response.remote(message, history)
response = ray.get(response_handle)
response

We prepare a message and a chat history list, then call our chat service via Python.

In [None]:
history += [message, response]
history

In [None]:
message = "That is accurate... but there is much more to it."
response_handle = handle.get_response.remote(message, history)
response = ray.get(response_handle)
response

In [None]:
history += [message, response]
history

And we'll check that our HTTP endpoint is alive as well.

In [None]:
message = (
    "Even between XFL and NFL American football leagues, there are big differences."
)

json_doc = json.dumps({"user_input": message, "history": history})

json_doc

In [None]:
requests.post("http://localhost:8000/", json=json_doc).json()
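
Note that `json_doc` is already a JSON string, and passing it through `json=` encodes it a second time. That is why `__call__` decodes twice: once with `request.json()` and once with `json.loads(data)`. The stand-alone sketch below reproduces the round trip with the standard library only; the sample payload is made up.

```python
import json

payload = {"user_input": "Hello", "history": []}

json_doc = json.dumps(payload)    # first encoding: dict -> str
wire_body = json.dumps(json_doc)  # second encoding, as done for `json=json_doc`

first = json.loads(wire_body)     # corresponds to `await request.json()`
second = json.loads(first)        # corresponds to `json.loads(data)`

print(type(first).__name__, type(second).__name__)
```

Passing the dictionary itself as `json=` would make a single decode sufficient, but the `json.loads(data)` line in `__call__` would then have to be removed as well.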

Since we'll be creating a new application example, we can delete the old one -- that allows Ray to remove the replicas of our Chat deployment.

In [None]:
serve.delete("basic_chat")

## Goals for Model Serving in Production

As we assemble more complex applications, Ray Serve delivers increasingly valuable features
* The same platform -- so zero integration -- for preprocessing, training, batch prediction and other parts of the ML workflow
* Fine-grained resource allocation and scaling at the level of critical functions and classes
    * Instead of or in addition to scaling at the level of containers, like with Kubernetes, or of nodes
* Automatic batching
* Minimal data movement and data-locality-aware scheduling

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

## Composing Services with Ray for Chatbot en Français

### Roadmap for Additional Services

The underlying chatbot model we’ve used only supports English interaction. But we can use the following recipe to add French language support:

1. Implement a translation service between French and English
1. Implement a language detection service
1. Implement a routing (dispatch) service:
    1. If the incoming prompt is French, then
        1. Route the inbound prompt through the FR-EN translator
        1. Pass the EN prompt to the chat model
        1. Pass the EN output from the chat model through the EN-FR translator
        1. Return the French response
    1. Otherwise (if the prompt is in English), pass it straight to the chatbot as we did earlier and return the (English) response
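
The routing recipe above can be sketched with plain functions before any Ray code is involved. All four helpers below are toy stand-ins invented for illustration; they tag strings instead of calling real models.

```python
from typing import Callable

TextFn = Callable[[str], str]


def route(
    prompt: str, detect: TextFn, fr_to_en: TextFn, en_to_fr: TextFn, chat: TextFn
) -> str:
    """Dispatch a prompt through translation only when it is French."""
    if detect(prompt) == "fr":
        english_prompt = fr_to_en(prompt)
        english_reply = chat(english_prompt)
        return en_to_fr(english_reply)
    return chat(prompt)


# Toy stand-ins (assumptions, not real services)
def toy_detect(text: str) -> str:
    return "fr" if text.startswith("Bonjour") else "en"


def toy_fr_to_en(text: str) -> str:
    return f"EN({text})"


def toy_en_to_fr(text: str) -> str:
    return f"FR({text})"


def toy_chat(text: str) -> str:
    return f"reply to {text}"


french = route("Bonjour le monde", toy_detect, toy_fr_to_en, toy_en_to_fr, toy_chat)
english = route("Hello world", toy_detect, toy_fr_to_en, toy_en_to_fr, toy_chat)
print(french)
print(english)
```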

Let’s look at using Ray Serve to implement model inference with these composed and conditional-flow elements using Python method calls (https://docs.ray.io/en/latest/serve/key-concepts.html#servehandle-composing-deployments).
We’ll implement parts 1 and 2 first…

### Translation service

In [None]:
from transformers import pipeline

In [None]:
@serve.deployment(ray_actor_options={"num_gpus": 0.25})
class Translate:
    def __init__(self, task: str, model: str):
        self._pipeline = pipeline(task=task, model=model, device=0)

    def get_response(self, user_input: str) -> str:
        outputs = self._pipeline(user_input)
        response = outputs[0]["translation_text"]
        return response


translate_en_fr = Translate.bind(task="translation_en_to_fr", model="t5-small")
translate_fr_en = Translate.bind(
    task="translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en"
)

Notice how we have two different services but they are built on the same reusable code by calling `.bind()` with different initialization parameters.

*We don’t need to define new deployments for every service we use.*

This time we haven't published an application (via `serve.run()`) because these components will be invoked only by our main service deployment.
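
It can help to tally the GPU fractions requested so far. The sketch below assumes one replica of each deployment; whether the replicas actually share one physical GPU depends on the cluster and on scheduling, and the fractions are logical reservations rather than enforced memory limits.

```python
from fractions import Fraction

# GPU requests taken from the `ray_actor_options` used in this notebook,
# assuming a single replica per deployment.
gpu_requests = {
    "Chat": Fraction(1, 2),
    "Translate (en->fr)": Fraction(1, 4),
    "Translate (fr->en)": Fraction(1, 4),
}

total = sum(gpu_requests.values())
print(total)
```

Under that assumption, the three GPU-backed deployments together request exactly one GPU.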

### Language detection

We can create the language detection service in a similar way. 

> This service is lighter weight because we’re using https://github.com/pemistahl/lingua-py … which leverages traditional NLP and n-grams for detection instead of a deep learning model. It can handle more traffic than, e.g., the chat model -- and it won't require a GPU. So we can benefit from Ray Serve's fine-grained resource allocation.
    
Lingua is optimized for strong detection on very short text snippets, like tweets, so it should be useful for our chat exchanges.

In [None]:
from lingua import Language, LanguageDetectorBuilder

In [None]:
@serve.deployment
class LangDetect:
    def __init__(self):
        languages = [Language.ENGLISH, Language.FRENCH]
        self._detector = LanguageDetectorBuilder.from_languages(*languages).build()

    def get_response(self, user_input: str) -> str:
        output = self._detector.detect_language_of(user_input)
        if output == Language.ENGLISH:
            return "en"
        else:
            return "fr"


lang_detect = LangDetect.bind()

### Combine services

Let's bring the whole system together. We'll implement a service which represents our external endpoint for HTTP or Python invocations.
* This service will have references to the deployments we've built so far, and will implement some conditional logic to ensure the correct language is used
* Note that even if the user is interacting in French, we need to return the English response as well so that the client can use it to build the chat history

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment_endpoint.png' width=700/>

In [None]:
@serve.deployment
class Endpoint:
    def __init__(self, chat, lang_detect, translate_en_fr, translate_fr_en):
        # assign dependent service handles to instance variables
        self._chat = chat
        self._lang_detect = lang_detect
        self._translate_en_fr = translate_en_fr
        self._translate_fr_en = translate_fr_en

    async def __call__(self, request: Request) -> Dict:
        data = await request.json()
        data = json.loads(data)
        return {
            "response": await self.get_response(data["user_input"], data["history"])
        }

    async def get_response(self, user_input: str, history: list[str]):
        lang_obj_ref = await self._lang_detect.get_response.remote(user_input)

        # if we didn't need the literal value of the language yet, we could pass that (future) object reference to other services
        # here, though, we need the value in order to decide whether to call the translation services
        # we get the Python value by awaiting the object reference
        lang = await lang_obj_ref

        if lang == "fr":
            user_input = await self._translate_fr_en.get_response.remote(user_input)

        response = response_en = await self._chat.get_response.remote(
            user_input, history
        )

        if lang == "fr":
            response = await self._translate_en_fr.get_response.remote(response_en)
            user_input = await user_input

        response = await response
        response_en = await response_en

        return response + "|" + user_input + "|" + response_en


endpoint = Endpoint.bind(chat, lang_detect, translate_en_fr, translate_fr_en)

endpoint_handle = serve.run(endpoint, name="multilingual_chat")

We've implemented control flow through our services and used the async/await pattern in several places so that we don't unnecessarily block.

Then we construct the service endpoint and start a new application serving that endpoint.
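
The benefit of `await` is easier to see when several requests are in flight. The following sketch uses plain `asyncio` with toy coroutines in place of the deployments (all names and the simulated delays are assumptions); it mirrors the shape of `Endpoint.get_response` but does not use Ray.

```python
import asyncio


async def detect(text: str) -> str:
    await asyncio.sleep(0.01)  # simulated service latency
    return "fr" if "bonjour" in text.lower() else "en"


async def translate(text: str, tag: str) -> str:
    await asyncio.sleep(0.01)
    return f"{tag}({text})"


async def chat(text: str) -> str:
    await asyncio.sleep(0.01)
    return f"reply to {text}"


async def get_response(user_input: str) -> str:
    lang = await detect(user_input)  # value needed now to choose a branch
    if lang == "fr":
        user_input = await translate(user_input, "EN")
    response = response_en = await chat(user_input)
    if lang == "fr":
        response = await translate(response_en, "FR")
    return "|".join([response, user_input, response_en])


async def main() -> list[str]:
    # Two requests run concurrently; each `await` lets the other make progress.
    return await asyncio.gather(get_response("Bonjour"), get_response("Hello"))


results = asyncio.run(main())
print(results)
```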

In [None]:
message = "I can't wait to see more XFL football."
history = []
response = ray.get(endpoint_handle.get_response.remote(message, history))
response.split("|")[0]

In [None]:
history += response.split("|")[1:]
history

In [None]:
message = "C'est exact... mais il y a bien plus que cela."
response = ray.get(endpoint_handle.get_response.remote(message, history))
response.split("|")[0]

In [None]:
history += response.split("|")[1:]
history
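
The client-side handling of the pipe-delimited reply can be wrapped in a small helper. This is a sketch with a hypothetical function name and a made-up reply string; it assumes none of the three fields contains a `|` character, which the endpoint above does not guarantee.

```python
def parse_reply(raw: str) -> tuple[str, list[str]]:
    """Split 'display|user_input_en|response_en' into the text to show
    and the two English turns to append to the history."""
    display, user_input_en, response_en = raw.split("|", 2)
    return display, [user_input_en, response_en]


history: list[str] = []
display, new_turns = parse_reply("Bonjour !|Hi!|Hello!")
history += new_turns
print(display, history)
```

Returning a dictionary from the endpoint instead of a delimited string would avoid the delimiter assumption altogether.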

> We haven’t looked under the hood at how all of this works on Ray. The details aren't covered in this session, but there is more information at https://docs.ray.io/en/latest/serve/architecture.html

At this point we have a service which can support the many functional and operational properties we expect to need in production, including scalability, separation of concerns, and composability.

In [None]:
serve.delete("multilingual_chat")
serve.shutdown()

## Ray Serve: Summary

__Ray Serve is a microservices framework for serving ML – the model serving component of Ray.__

Key features include
* High-performance operation at large scale with heterogeneous resources
* Fine-grained resource allocation and scheduling
* Autoscaling
* Simple interaction with preprocessing, training, tuning, and other parts of the ML workflow on Ray
* Out-of-the-box integration with third-party tools like PyTorch, Huggingface, Gradio, Weights and Biases, Delta Lake, and more
* Straightforward composition and orchestration of services in a declarative or imperative style, all using Python

