# Multilingual Chat with Ray Serve

*Install dependencies not present on Google Colab*

In [None]:
is_google_colab = 'google.colab' in str(get_ipython())

if is_google_colab:
    %pip install 'ray[default,air,serve]' lingua-language-detector 'transformers[torch]==4.27' numpy==1.24.2 sentencepiece

*On Google Colab, we have to restart the runtime after that install since we've updated some dependencies in place*

In [None]:
import os

if is_google_colab:
    os.kill(os.getpid(), 9)

In [None]:
import ray
import requests, json
from starlette.requests import Request
from typing import Dict

from ray import serve

In [None]:
ray.init(num_cpus=8)

We can observe the performance of our Ray cluster using the Ray Dashboard. Starting Ray with `ray.init(...)` will output a link to the dashboard, which by default is served on port 8265.

In [None]:
if 'google.colab' in str(get_ipython()):
    from google.colab.output import serve_kernel_port_as_iframe, serve_kernel_port_as_window
    serve_kernel_port_as_window(8265)

### Chatbot using Huggingface LLM

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
@serve.deployment
class Chat:
    def __init__(self, model: str):
        # configure stateful elements of our service such as loading a model
        self._tokenizer = AutoTokenizer.from_pretrained(model)
        self._model =  AutoModelForSeq2SeqLM.from_pretrained(model)

    async def __call__(self, request: Request) -> Dict:
        # path to handle HTTP requests
        data = await request.json()
        data = json.loads(data)
        # after decoding the payload, we delegate to get_response for logic
        return {'response': self.get_response(data['user_input'], data['history']) }
    
    def get_response(self, user_input: str, history: list[str]) -> str:
        # this method receives calls directly (from Python) or from __call__ (from HTTP)
        history.append(user_input)
        # the history is client-side state and will be a list of raw strings;
        # for the default config of the model and tokenizer, history should be joined with '</s><s>'
        inputs = self._tokenizer('</s><s>'.join(history), return_tensors='pt')
        reply_ids = self._model.generate(**inputs, max_new_tokens=500)
        response = self._tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]
        return response

We've defined a service using the `@serve.deployment` decorator.
* Deployments can be service endpoints
* Or they can be individual components accessed from other services
* Each deployment can have its own resource management and autoscaling configuration

Here is an example of an extended configuration -- see https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#scaling-and-resource-allocation for more details

```python
@serve.deployment(
    autoscaling_config={
        'min_replicas': 1,
        'initial_replicas': 2,
        'max_replicas': 5,
        'target_num_ongoing_requests_per_replica': 10,
    }
)
```
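To build intuition for how these settings interact, here is a deliberately simplified model in plain Python. It assumes the autoscaler aims for roughly `target_num_ongoing_requests_per_replica` in-flight requests per replica and clamps the result to the configured bounds. `desired_replicas` is a hypothetical helper, not part of Ray, and the real autoscaler applies smoothing and delays that this sketch ignores.

```python
import math

# Same values as the configuration shown above.
autoscaling_config = {
    'min_replicas': 1,
    'initial_replicas': 2,
    'max_replicas': 5,
    'target_num_ongoing_requests_per_replica': 10,
}

def desired_replicas(total_ongoing_requests: int, config: dict) -> int:
    """Simplified model (an assumption, not Ray's algorithm):
    enough replicas that each one carries about the target load."""
    target = config['target_num_ongoing_requests_per_replica']
    wanted = math.ceil(total_ongoing_requests / target)
    # Clamp to the configured lower and upper bounds.
    return max(config['min_replicas'], min(config['max_replicas'], wanted))

for load in (0, 25, 200):
    print(load, desired_replicas(load, autoscaling_config))
```

Under this model, an idle service settles at `min_replicas`, and a burst of traffic can never push the deployment past `max_replicas`.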

In [None]:
chat = Chat.bind(model='facebook/blenderbot-400M-distill')

We create the object that represents this service by using the `.bind()` class method, which captures the "recipe" for instantiating replicas of our service. The params in `.bind()` -- in this case, the name of the Huggingface model we want to use -- will be provided to our deployment class constructor.

In [None]:
handle = serve.run(chat, name='basic_chat')

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment.png' width=600/>

`serve.run(...)` exposes our service endpoint for calls. We can run multiple simultaneous applications with their own collections of deployments, resource management, scaling, etc. The `name` parameter in `serve.run(...)` becomes an identifier not only for the application but for each deployment that is launched to support that application.

Let's test out the service using Ray's Python API first.

In [None]:
message = 'My friends are cool but they eat too many carbs.'
history = []
response_handle = handle.get_response.remote(message, history)
response = ray.get(response_handle)
response

We prepare a message and a chat history list and call our chat service via Python.

In [None]:
history += [message, response]
history

In [None]:
message = "I'm not sure."
response_handle = handle.get_response.remote(message, history)
response = ray.get(response_handle)
response

In [None]:
history += [message, response]
history
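The chat history is client-side state: the service appends the new message to the list it receives and joins all turns with `'</s><s>'` before tokenizing. The sketch below reproduces only that string handling, with no model involved. `build_prompt` is a hypothetical helper name and the reply text is a made-up placeholder.

```python
SEPARATOR = '</s><s>'  # the joiner used in Chat.get_response

def build_prompt(user_input: str, history: list[str]) -> str:
    """Return the text that would be handed to the tokenizer."""
    turns = list(history) + [user_input]  # copy, so the caller's list is untouched
    return SEPARATOR.join(turns)

history = []
message = 'My friends are cool but they eat too many carbs.'
print(build_prompt(message, history))

# The client extends the history itself after each turn.
reply = '<placeholder model reply>'
history += [message, reply]
print(build_prompt("I'm not sure.", history))
```

The client is responsible for extending `history` after every exchange, as the cells above do; the service keeps no conversation state between calls.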

And we'll check that our HTTP endpoint is alive as well.

In [None]:
message = 'I love carbs myself -- bread is tasty.'

json_doc = json.dumps({ 'user_input' : message, 'history' : history })

json_doc

In [None]:
requests.post('http://localhost:8000/', json = json_doc).json()
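Note that the payload is JSON-encoded twice: once by `json.dumps` in the client cell and once more by the `json=` argument of `requests.post`. That is why the service calls `json.loads` on the result of `request.json()`. The sketch below replays both layers using only the standard library; the history values are made-up placeholders.

```python
import json

message = 'I love carbs myself -- bread is tasty.'
history = ['first message', 'first reply']  # placeholder turns

# Client side, layer 1: the notebook builds a JSON string explicitly.
json_doc = json.dumps({'user_input': message, 'history': history})

# Client side, layer 2: passing a str via json= serializes it again,
# so the HTTP body is a JSON string that itself contains JSON.
body = json.dumps(json_doc)

# Service side: request.json() removes the outer layer and yields a str...
outer = json.loads(body)
print(type(outer).__name__)

# ...and the explicit json.loads in __call__ recovers the dict.
data = json.loads(outer)
print(data['user_input'])
```

If the client passed the dict itself as `json=`, a single `request.json()` on the service side would be enough; the two sides just need to agree on the number of layers.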

Since we'll be creating a new application example, we can delete the old one -- that allows Ray to remove the replicas of our Chat deployment.

In [None]:
serve.delete('basic_chat')

## Goals for Model Serving in Production

As we assemble more complex applications, Ray Serve delivers increasingly valuable features:
* The same platform -- so zero integration -- for preprocessing, training, batch prediction and other parts of the ML workflow
* Fine-grained resource allocation and scaling at the level of critical functions and classes
    * Instead of or in addition to scaling at the level of containers, like with Kubernetes, or of nodes
* Automatic batching
* Minimal data movement and data-locality-aware scheduling

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/serve_architecture.png' width=700/>

## Composing Services with Ray for Chatbot en Français

### Roadmap for Additional Services

The underlying chatbot model we’ve used only supports English interaction. But we can use the following recipe to add French language support:

1. Implement a translation service between French and English
1. Implement a language detection service
1. Implement a routing (dispatch) service:
    1. If the incoming prompt is French, then
        1. Route the inbound prompt through the FR-EN translator
        1. Pass the EN prompt to the chat model
        1. Pass the EN output from the chat model through the EN-FR translator
        1. Return the French response
    1. Otherwise (if the prompt is in English), pass it straight to the chatbot as we did earlier and return the (English) response
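
The routing recipe above can be sketched in plain Python before any Ray code is involved. Everything here is a stand-in: `route` is a hypothetical function, and the toy detector, translators, and chat callables only tag strings so the flow is visible.

```python
def route(user_input, history, detect, fr_to_en, en_to_fr, chat) -> str:
    """Dispatch a prompt according to its detected language."""
    if detect(user_input) == 'fr':
        prompt_en = fr_to_en(user_input)     # FR -> EN
        reply_en = chat(prompt_en, history)  # the chat model only sees English
        return en_to_fr(reply_en)            # EN -> FR
    # English prompts go straight to the chat model.
    return chat(user_input, history)

# Toy stand-ins (assumptions for illustration only).
toy_detect = lambda text: 'fr' if text.startswith('FR:') else 'en'
toy_fr_to_en = lambda text: text.replace('FR:', 'EN:', 1)
toy_en_to_fr = lambda text: text.replace('EN:', 'FR:', 1)
toy_chat = lambda text, history: text + ' (reply)'

print(route('FR:bonjour', [], toy_detect, toy_fr_to_en, toy_en_to_fr, toy_chat))
print(route('hello', [], toy_detect, toy_fr_to_en, toy_en_to_fr, toy_chat))
```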

Let’s look at using Ray Serve to implement model inference with these composed and conditional-flow elements using Python method calls (https://docs.ray.io/en/latest/serve/key-concepts.html#servehandle-composing-deployments). Later we’ll look at an alternative approach using Ray’s Deployment Graph API.

We’ll implement parts 1 and 2 first…

### Translation service

In [None]:
from transformers import pipeline

In [None]:
@serve.deployment
class Translate:
    def __init__(self, task: str, model: str):
        self._pipeline = pipeline(task=task, model=model)
    
    def get_response(self, user_input: str) -> str:
        outputs = self._pipeline(user_input)
        response = outputs[0]['translation_text']
        return response
        
translate_en_fr = Translate.bind(task='translation_en_to_fr', model='t5-small')
translate_fr_en = Translate.bind(task='translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')

Notice how we have two different services but they are built on the same reusable code by calling `.bind()` with different initialization parameters.

*We don’t need to define new deployments for every service we use.*

This time we haven't published an application (via `serve.run()`) because these components will be invoked only by our main service deployment.

### Language detection

We can create the language detection service in a similar way. 

> This service is lighter weight because we’re using https://github.com/pemistahl/lingua-py … which leverages traditional NLP and n-grams for detection instead of a deep learning model. It can handle more traffic than, e.g., the chat model -- and it won't require a GPU. So we can benefit from Ray Serve's fine-grained resource allocation.
    
Lingua is optimized for strong detection on very short text snippets, like tweets, so it should be useful for our chat exchanges.

In [None]:
from lingua import Language, LanguageDetectorBuilder

In [None]:
@serve.deployment
class LangDetect:
    def __init__(self):
        languages = [Language.ENGLISH, Language.FRENCH]
        self._detector = LanguageDetectorBuilder.from_languages(*languages).build()
    
    def get_response(self, user_input: str) -> str:
        output = self._detector.detect_language_of(user_input)
        if (output == Language.ENGLISH):
            return 'en'
        else:
            return 'fr'
        
lang_detect = LangDetect.bind()

### Combine services

Let's bring the whole system together. We'll implement a service which represents our external endpoint for HTTP or Python invocations.
* This service will have references to the deployments we've built so far, and will implement some conditional logic to ensure the correct language is used
* Note that even if the user is interacting in French, we need to return the English response as well so that the client can use it to build the chat history

In [None]:
@serve.deployment
class Endpoint:
    def __init__(self, chat, lang_detect, translate_en_fr, translate_fr_en):
        # assign dependent service handles to instance variables
        self._chat = chat
        self._lang_detect = lang_detect
        self._translate_en_fr = translate_en_fr
        self._translate_fr_en = translate_fr_en

    async def __call__(self, request: Request) -> Dict:
        data = await request.json()
        data = json.loads(data)
        return {'response': await self.get_response(data['user_input'], data['history']) }
    
    async def get_response(self, user_input: str, history: list[str]):
        lang_obj_ref = await self._lang_detect.get_response.remote(user_input)
        
        # if we didn't need the literal value of the language yet, we could pass that (future) object reference to other services
        # here, though, we need the value in order to decide whether to call the translation services
        # we get the Python value by awaiting the object reference
        lang = await lang_obj_ref

        if (lang == 'fr'):
            user_input = await self._translate_fr_en.get_response.remote(user_input)

        response = response_en = await self._chat.get_response.remote(user_input, history)
        
        if (lang == 'fr'):
            response = await self._translate_en_fr.get_response.remote(response_en)
            user_input = await user_input
            
        response = await response
        response_en = await response_en
        
        return response  + '|' + user_input + '|' + response_en

endpoint = Endpoint.bind(chat, lang_detect, translate_en_fr, translate_fr_en)

endpoint_handle = serve.run(endpoint, name = 'multilingual_chat')

We've implemented control flow through our services and used the async/await pattern in several places so that we don't unnecessarily block.

Then we construct the service endpoint and start a new application serving that endpoint.

In [None]:
message = 'My friends are cool but they eat too many carbs.'
history = []
response = ray.get(endpoint_handle.get_response.remote(message, history))
response.split('|')[0]

In [None]:
history += response.split('|')[1:]
history
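The endpoint packs three values into one `'|'`-delimited string: the reply to display, the English prompt, and the English reply. A small helper makes the client-side unpacking explicit. `unpack_reply` is a hypothetical name and the sample string is made up.

```python
def unpack_reply(reply: str) -> tuple[str, list[str]]:
    """Split 'display|user_input_en|response_en' into its parts.

    Raises ValueError if the reply does not contain exactly three fields,
    for example when one of the texts itself contains a '|' character.
    """
    display, user_input_en, response_en = reply.split('|')
    return display, [user_input_en, response_en]

sample = 'Réponse affichée|English prompt|English reply'  # placeholder text
display, new_turns = unpack_reply(sample)

chat_history = []
chat_history += new_turns  # only the English turns go into the history
print(display)
print(chat_history)
```

Because the delimiter can legitimately appear in chat text, a structured return value such as a dict would be more robust; the delimited string simply keeps the demo compact.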

In [None]:
message = 'Je ne suis pas sûr.'
response = ray.get(endpoint_handle.get_response.remote(message, history))
response.split('|')[0]

In [None]:
history += response.split('|')[1:]
history

> We haven’t looked under the hood at how all of this works on Ray. The details aren't covered in this session, but there is more information at https://docs.ray.io/en/latest/serve/architecture.html

At this point we have a service which can support the many functional and operational properties we expect to need in production, including scalability, separation of concerns, and composability.

In [None]:
serve.delete('multilingual_chat')

## Deployment Graph API

What is the Deployment Graph API?

* The Deployment Graph API lets us separate the flow of calls from the logic inside our services.

Why might we want to use the Deployment Graph (DAG) API to separate flow from logic?

* It may be valuable to add a layer of indirection – or abstraction – so that we can more easily create and compose reusable services
* The DAG API lets us use similar patterns across the Ray platform (e.g., Ray Workflow)
    * We can learn one general pattern for graphs and use that intuition in multiple places in our Ray applications
* Although we compose one DAG, we retain the key Ray Serve features of granular autoscaling and resource allocation

Let’s reproduce our chat service flow using the Deployment Graph API

### Getting Started with Deployment Graphs

As a first step, to keep things simple, let’s assume for a moment that we are always interacting with the service in French. 

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment_graph_simple.png' width=900/>

In [None]:
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver

`InputNode` is a special type of graph node, defined by Ray Serve, which represents values supplied to our service endpoint. 

We can only have one `InputNode` but we can get access to multiple parameters from that node using a Python context manager.

In [None]:
with InputNode() as inp:
    user_input = inp[0]
    history = inp[1]

Here is a minimal, linear pipeline that allows us to begin a chat in French.

We build up the graph step by step, `bind`ing each deployment to its dependencies.

In [None]:
user_input_en = translate_fr_en.get_response.bind(user_input)    # French->English translator depends on the user input text
chat_response = chat.get_response.bind(user_input_en, history)   # the chat deployment requires the English user input and the history
output = translate_en_fr.get_response.bind(chat_response)        # English->French translator depends on the English chat output
serve_dag = DAGDriver.bind(output)                               # the graph returns the output from the English->French translator

handle = serve.run(serve_dag, name='basic_linear')

We start the application by calling `serve.run()` on the DAGDriver, a Ray Serve component which routes HTTP requests through your call graph.
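
To see what "binding a node to its dependencies" means, here is a conceptual toy in plain Python. It is unrelated to Ray's actual classes: `ToyNode` and `ToyInput` are hypothetical, and the lambdas are stand-ins for the translators and the chat model. The point is that building the graph runs nothing; work happens only when inputs are pushed through it.

```python
class ToyNode:
    """Remembers a function plus the nodes (or plain values) it depends on."""
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps

    def execute(self, *inputs):
        # Resolve each dependency first, then call this node's function.
        args = [d.execute(*inputs) if isinstance(d, ToyNode) else d
                for d in self.deps]
        return self.fn(*args)

class ToyInput(ToyNode):
    """Stands in for one positional input of the graph."""
    def __init__(self, index):
        self.index = index

    def execute(self, *inputs):
        return inputs[self.index]

toy_user_input, toy_history = ToyInput(0), ToyInput(1)
toy_input_en = ToyNode(lambda s: s.upper(), toy_user_input)  # stand-in FR->EN
toy_chat_response = ToyNode(
    lambda s, h: f'{s}! ({len(h)} earlier turns)', toy_input_en, toy_history)
toy_output = ToyNode(lambda s: s.lower(), toy_chat_response)  # stand-in EN->FR

# Nothing has run yet; execution starts only when inputs arrive.
print(toy_output.execute('bonjour', []))
```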

In [None]:
ray.get(handle.predict.remote('Mes amis sont cool mais ils mangent trop de glucides.', []))

In [None]:
serve.delete('basic_linear')

How can we continue the chat?

We need to supply English history ... but we only have French responses so far.

We can use the pattern of adding a __combine node__ to our graph in order to merge the 3 elements we need to output (English chat message, English chat response, and French chat response).

Combining multiple values is a common requirement -- e.g., in collecting values from a model ensemble.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/ensemble.png' width=900 />

In [None]:
@serve.deployment
def combine(user_input_en:str, chat_response_en: str, chat_response_fr:str)->str:
    return chat_response_fr + '|' + user_input_en + '|' + chat_response_en

The combine node implemented here is a very simple deployment: it's built from a single function definition instead of a class.

In [None]:
translate_en_fr = Translate.bind(task='translation_en_to_fr', model='t5-small')
translate_fr_en = Translate.bind(task='translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')
chat = Chat.bind(model='facebook/blenderbot-400M-distill')

Even though the definitions of the `Translate` and `Chat` deployments have not changed, we call `.bind()` again to create new DAG nodes since we're composing a new DAG.

In [None]:
with InputNode() as inp:
    user_input = inp[0]
    history = inp[1]
    user_input_en = translate_fr_en.get_response.bind(user_input)
    chat_response_en = chat.get_response.bind(user_input_en, history)
    chat_response_fr = translate_en_fr.get_response.bind(chat_response_en)

# We route the user input, the English chat response, and the French chat response into the combine node
output = combine.bind(user_input_en, chat_response_en, chat_response_fr)

# and we serve the output of the combine node
serve_dag = DAGDriver.bind(output)

handle = serve.run(serve_dag, name='enhanced_linear')

In [None]:
ray.get(handle.predict.remote('Mes amis sont cool mais ils mangent trop de glucides.', []))

In [None]:
serve.delete('enhanced_linear')

Using this pattern, we are getting everything back that we would need to offer a conversation service with the chatbot ... but only in French!

### Adding Conditional Flow

Our real chatbot is a bit more complex. It has a conditional flow where we invoke the translation service only when the user is *not* interacting in English.

We can add the remaining elements of our service and the basic API changes will be fairly minimal. But there is one aspect that requires us to do a little bit of thinking and employ a new pattern.

#### Static Graphs and Conditional Control Flow

The graph we define with the DAG API is static – it’s created ahead of time. 

In the first DAG demo, we were always invoking the same sequence of services, so the static character of the graph might not have been obvious… but now we’re focusing on it so you can see where things might get a bit more complicated.

To implement branching flow control with the DAG API, we’ll use a special pattern so that the same graph always runs … but certain nodes (in our case, translator nodes) behave differently based on data they receive.

<img src='https://technical-training-assets.s3.us-west-2.amazonaws.com/Ray_Serve/deployment_graph_complex.png' width=900 />

In [None]:
@serve.deployment
class Translate:
    def __init__(self, task: str, model: str):
        self._pipeline = pipeline(task=task, model=model)
    
    def get_response(self, user_input:str, user_lang:str) -> str:
        if (user_lang == 'en'):
            return user_input # no-op
        else:
            outputs = self._pipeline(user_input)
            response = outputs[0]['translation_text']
            return response
        
translate_en_fr = Translate.bind(task='translation_en_to_fr', model='t5-small')
translate_fr_en = Translate.bind(task='translation_fr_to_en', model='Helsinki-NLP/opus-mt-fr-en')

The if-else control flow inside `get_response()` calls the translation logic only when the user is *not* using English.

In [None]:
lang_detect = LangDetect.bind()
chat = Chat.bind(model='facebook/blenderbot-400M-distill')

with InputNode() as inp:
    user_input = inp[0]
    history = inp[1]
    user_lang = lang_detect.get_response.bind(user_input)
    user_input_en = translate_fr_en.get_response.bind(user_input, user_lang)
    chat_response_en = chat.get_response.bind(user_input_en, history)
    chat_response_fr = translate_en_fr.get_response.bind(chat_response_en, user_lang)
    output = combine.bind(user_input_en, chat_response_en, chat_response_fr)
    serve_dag = DAGDriver.bind(output)

handle = serve.run(serve_dag, name='full_chatbot')

In this code, the translation services are always part of the graph and participate in the data flow. So the graph is static, even though the translation behavior is dynamic.
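
The same idea can be checked without Ray. In the sketch below, every step always runs in the same order, and the language tag decides whether a translator does real work or passes its input through. All names are hypothetical, and the string-tagging lambdas are stand-ins for the real models.

```python
def make_translator(translate):
    """Wrap a translation function so it becomes a no-op for English users."""
    def get_response(text: str, user_lang: str) -> str:
        if user_lang == 'en':
            return text  # no-op: the node still runs, but changes nothing
        return translate(text)
    return get_response

def toy_combine(user_input_en, chat_response_en, chat_response_fr):
    return chat_response_fr + '|' + user_input_en + '|' + chat_response_en

# Toy stand-ins (assumptions for illustration only).
toy_lang_detect = lambda text: 'fr' if text.startswith('FR:') else 'en'
toy_fr_en = make_translator(lambda t: t.replace('FR:', 'EN:', 1))
toy_en_fr = make_translator(lambda t: t.replace('EN:', 'FR:', 1))
toy_chat = lambda text, history: text + ' (reply)'

def run_static_graph(user_input, history):
    # The sequence of calls is fixed; only the data varies.
    user_lang = toy_lang_detect(user_input)
    user_input_en = toy_fr_en(user_input, user_lang)
    chat_response_en = toy_chat(user_input_en, history)
    chat_response_fr = toy_en_fr(chat_response_en, user_lang)
    return toy_combine(user_input_en, chat_response_en, chat_response_fr)

print(run_static_graph('FR:bonjour', []))
print(run_static_graph('hello', []))
```

For an English prompt the first field of the combined output is simply the English reply, so the client code that reads `split('|')[0]` works unchanged for both languages.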

In [None]:
message = 'Mes amis sont cool mais ils mangent trop de glucides.'
history = []

response = ray.get(handle.predict.remote(message, history))

response.split('|')[0]

In [None]:
history += response.split('|')[1:]
history

In [None]:
ray.get(handle.predict.remote('Truly bread is delightful', history))

In [None]:
serve.delete('full_chatbot')

## Ray Serve: Summary

__Ray Serve is a microservices framework for serving ML – the model serving component of Ray.__

Key features include
* High-performance operation at large scale with heterogeneous resources
* Fine-grained resource allocation and scheduling
* Autoscaling
* Simple interaction with preprocessing, training, tuning, and other parts of the ML workflow on Ray
* Out-of-the-box integration with third-party tools like PyTorch, Huggingface, Gradio, Weights and Biases, Delta Lake, and more
* Straightforward composition and orchestration of services in a declarative or imperative style, all using Python



## Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [**Ray documentation**](https://docs.ray.io/en/latest)

* [**Official Ray site**](https://www.ray.io/)  
Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.

* [**Join the community on Slack**](https://forms.gle/9TSdDYUgxYs8SA9e8)  
Find friends to discuss your new learnings in our Slack space.

* [**Use the discussion board**](https://discuss.ray.io/)  
Ask questions, follow topics, and view announcements on this community forum.

* [**Join a meetup group**](https://www.meetup.com/Bay-Area-Ray-Meetup/)  
Tune in to meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.

* [**Open an issue**](https://github.com/ray-project/ray/issues/new/choose)  
Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.

* [**Become a Ray contributor**](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html)  
We welcome community contributions to improve our documentation and Ray framework.