# Reffie Take-Home Interview: Building a smart reply feature

Trung Le

May 8, 2024

In [1]:
## Import necessary libraries

from langchain_community.chat_models import ChatOllama
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts.chat import ChatMessage
from langchain.memory import ChatMessageHistory

To implement the smart reply feature using LLMs, I use the following external libraries:
- `ollama` for running LLMs locally
- `langchain` for modularizing the LLM-building workflow and supporting multiple LLMs
- `pydantic` for coercing LLM output into a structured format

### Defining the `pydantic` model

The Smart Reply feature requires exactly three reply suggestions. Therefore, I construct a Pydantic model `SmartReplies` to facilitate structuring and validating the LLM output as a class whose instances have exactly three fields. This Pydantic model allows easy validation and serialization of data, which makes it a convenient interface for the frontend.

`PydanticOutputParser` parses the LLM output into an instance of `SmartReplies`, or raises an error if the output cannot form a valid model (in `langchain`, the underlying pydantic `ValidationError` surfaces as an `OutputParserException`).

In [2]:
# Defining the Pydantic class
class SmartReplies(BaseModel):
    reply_1: str = Field(description = "Smart Reply 1")
    reply_2: str = Field(description = "Smart Reply 2")
    reply_3: str = Field(description =" Smart Reply 3")

        
parser = PydanticOutputParser(pydantic_object=SmartReplies)
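
To make the parse-or-fail behaviour concrete, here is a minimal stdlib-only sketch of what the parser does conceptually. It is not `langchain`'s implementation: `parse_smart_replies` and `REQUIRED_KEYS` are hypothetical names, and a plain `ValueError` stands in for whatever error the library raises.

```python
import json

# Hypothetical stand-in for the three fields of SmartReplies.
REQUIRED_KEYS = ("reply_1", "reply_2", "reply_3")


def parse_smart_replies(raw: str) -> dict:
    """Parse raw LLM text into a dict holding the three reply fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    not_text = [key for key in REQUIRED_KEYS if not isinstance(data[key], str)]
    if not_text:
        raise ValueError(f"fields must be strings: {not_text}")
    # Assumption: extra keys are silently dropped rather than rejected.
    return {key: data[key] for key in REQUIRED_KEYS}
```

A well-formed object such as `{"reply_1": "a", "reply_2": "b", "reply_3": "c"}` passes through unchanged, while an object with only `reply_1` raises.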

The parser also generates format instructions, which can be passed to the LLM in the prompt to guide its output:

In [3]:
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"reply_1": {"title": "Reply 1", "description": "Smart Reply 1", "type": "string"}, "reply_2": {"title": "Reply 2", "description": "Smart Reply 2", "type": "string"}, "reply_3": {"title": "Reply 3", "description": " Smart Reply 3", "type": "string"}}, "required": ["reply_1", "reply_2", "reply_3"]}
```


### Building our LLM workflow using `langchain`

`langchain` provides a framework that makes it easy to build LLM applications. One of its strengths is the use of chains, i.e. sequencing multiple components together. Here I use a simple chain: prompt + model + output parser.
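
As a rough illustration of how piping components together works, the toy sketch below composes three steps with the `|` operator. It only mimics the idea: `Step`, `toy_prompt`, `fake_model`, and `split_parser` are hypothetical names, the "model" returns a fixed string, and none of this is `langchain`'s actual implementation.

```python
from typing import Any, Callable


class Step:
    """Toy pipeline step; a hypothetical stand-in, not a langchain class."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value: Any) -> Any:
        return self.fn(value)


# Stand-ins for the three components: prompt, model, output parser.
toy_prompt = Step(lambda inputs: f"Suggest replies to: {inputs['message']}")
fake_model = Step(lambda text: "yes; no; maybe")  # fixed output, no real LLM
split_parser = Step(lambda output: [part.strip() for part in output.split(";")])

toy_chain = toy_prompt | fake_model | split_parser
result = toy_chain.invoke({"message": "Lunch at noon?"})
print(result)  # ['yes', 'no', 'maybe']
```

Each step only needs to accept the previous step's output, which is what makes swapping the model or the parser cheap.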

#### Prompt

Since this is a conversation between two humans, with the LLM acting as an assistant that suggests replies, I construct the prompt so that the model recognizes its third-party role. I specifically ask the model to generate 3 distinct suggestions, semantically different where appropriate, based on the conversation history. A variable `num_words` controls the number of words per reply. Finally, I inject the formatting instructions from the parser.

#### Model

For the model, I use Llama3, which I consider one of the strongest openly available LLMs at the time of writing, with Ollama simplifying running it on a local machine. I did not use proprietary models because their APIs cost money, and a simple POC does not need an advanced proprietary model. I also tested Google's Gemma model, but Llama3 produced better outputs in my tests.

I specifically use the chat/instruct version of the LLM and call it through `ChatOllama`, since it supports conversations as opposed to plain text completion.

In [4]:
# Define ChatOllama model
model = ChatOllama(model = 'llama3', temperature=0.5)

# Define system prompt
system_prompt = """You are a helpful assistant to the responder. Your role is to suggest EXACTLY 3 distinct responses for the responder in a conversation with the human.

Each suggestion contains fewer than {num_words} words. Each suggestion represents what the responder would most likely send to the other person, based on the conversation history. Please reply like a human would, be conversational.

If appropriate, the suggestions should not be semantically similar. For example, if suggesting a time to meet, your suggestions should be: yes, no, maybe.

If the human is requesting truths, please suggest truthfully. If the human is asking open-ended questions, please suggest creatively.

The conversation history is below: \n{chat_history}\n. Follow this format:\n {format_instruction}. Do not include "properties" from schema.
"""

# Build the prompt template: the system prompt above plus the incoming message from the human
template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human",  "{message}")
])

# Construct the chain
chain = template | model | parser

### Adding conversational memory

To make our smart replies more tailored to the conversation, it is important that the LLM retains the history of the conversation. Here I use `langchain`'s `ChatMessageHistory`, a wrapper class that stores the messages exchanged between different entities (e.g. Human and AI).

As noted earlier, since the model takes a third-party viewpoint, it is important to distinguish the role behind each message (i.e. the human, the responder, or the AI itself). While `langchain` supports injecting `ChatMessageHistory` into the chain, the Ollama model does not support chat history with roles other than AI, Human and System. Therefore I inject the history into the system message, as set up in the prompt above.

I define a function `parse_history` that takes a `ChatMessageHistory` instance and produces a string representing the history between the human and the responder.

In [5]:
def parse_history(chat_history: ChatMessageHistory) -> str:
    result = []
    for message in chat_history.messages:
        if hasattr(message, 'type'):
            if message.type == 'human':
                result.append(f"human: {message.content}")
            elif message.type == 'ai':
                result.append(f"ai: {message.content}")
            elif message.type == 'chat':
                 result.append(f"{message.role}: {message.content}")
    return '\n'.join(result)
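
To show the string this produces without needing `langchain` installed, the sketch below repeats the same logic and feeds it lightweight stand-in objects. The helper `msg` and the `fake_history` object are hypothetical; they only mimic the `type`, `role`, and `content` attributes that the function reads.

```python
from types import SimpleNamespace


def parse_history(chat_history) -> str:
    # Same logic as the function above, minus the langchain type annotation.
    result = []
    for message in chat_history.messages:
        if hasattr(message, "type"):
            if message.type == "human":
                result.append(f"human: {message.content}")
            elif message.type == "ai":
                result.append(f"ai: {message.content}")
            elif message.type == "chat":
                result.append(f"{message.role}: {message.content}")
    return "\n".join(result)


def msg(type_, content, role=None):
    """Hypothetical stand-in for a langchain message object."""
    return SimpleNamespace(type=type_, content=content, role=role)


fake_history = SimpleNamespace(messages=[
    msg("human", "Hi. Can I see what time it is?"),
    msg("chat", "It's 9:35PM", role="responder"),
])

history_text = parse_history(fake_history)
print(history_text)
# human: Hi. Can I see what time it is?
# responder: It's 9:35PM
```

The custom `responder` role survives in the output, which is the point of flattening the history into text instead of passing message objects to the model.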

We are now ready to run a conversation with the help of Smart Replies!

To facilitate the conversation, I add two helper functions, `get_suggestions` and `reply`, assuming the role of the responder.

- `get_suggestions`: Takes the human's message and the conversation history and produces Smart Replies. I limit each suggestion to `7` words, although the LLM may not always respect this limit. When `save_to_history=True`, the human's message is also added to the history.
- `reply`: Registers the responder's reply in the conversation history.
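
Because the word limit is only a request to the LLM, a small check after the fact can flag suggestions that ignore it. The sketch below is one possible way to do that; `over_limit` is a hypothetical helper, and it assumes the prompt's wording ("fewer than `num_words`") and a simple whitespace split as the definition of a word.

```python
from typing import Dict, List


def over_limit(replies: Dict[str, str], num_words: int = 7) -> List[str]:
    """Return the names of replies that do NOT have fewer than num_words words."""
    return [
        name
        for name, text in replies.items()
        if len(text.split()) >= num_words  # whitespace split: an assumption
    ]


sample = {
    "reply_1": "Sounds good to me!",
    "reply_2": "Maybe the diner down the street has some good late-night options",
}
flagged = over_limit(sample)
print(flagged)  # ['reply_2']
```

With a pydantic v1 model such as `SmartReplies`, the input dict could come from `response.dict()`; flagged replies could then be truncated, dropped, or regenerated.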

In [6]:
def get_suggestions(message: str, chat_history: ChatMessageHistory, save_to_history: bool = False) -> SmartReplies:
    response = chain.invoke({"message": message,
                  "num_words": 7,
                  "format_instruction": parser.get_format_instructions(),
                  "chat_history": parse_history(chat_history)})
    if save_to_history:
        chat_history.add_user_message(message)
    return response

def reply(message: str, chat_history: ChatMessageHistory):
    chat_history.add_message(ChatMessage(role = "responder", content = message))

## Example conversation!

In this example, I have a conversation with a human (which is also myself!) about checking the time and coordinating to get some food. This example demonstrates that the AI can remember the history and can suggest diverse response options for the responder.

In [7]:
# Test run

memory_1 = ChatMessageHistory()

get_suggestions(message="Hi. Can I see what time it is?", chat_history=memory_1, save_to_history=True)

SmartReplies(reply_1="Let me check! It's currently 10:45 AM.", reply_2='I can do that for you! The current time is 10:45 AM.', reply_3="Ah, sure thing! As of now, it's 10:45 AM.")

In [8]:
reply(message="It's 9:35PM", chat_history=memory_1)

In [9]:
get_suggestions(message="So is it nighttime or morning time?", chat_history=memory_1, save_to_history=True)

SmartReplies(reply_1="It's night time, isn't it?", reply_2='Not yet, still evening', reply_3='Yeah, the sun has set')

In [10]:
reply(message="Nighttime", chat_history=memory_1)

In [11]:
get_suggestions(message="What is current time in 24hr format?", chat_history=memory_1, save_to_history=True)

SmartReplies(reply_1='9:35 PM', reply_2='0935 hours', reply_3='21:35')

In [12]:
reply(message="It's 21:35", chat_history=memory_1)

In [13]:
get_suggestions(message="Shall we get some late night food?", chat_history=memory_1, save_to_history=True)

SmartReplies(reply_1='Sounds good to me!', reply_2="I'm down for something", reply_3='What did you have in mind?')

In [14]:
reply(message="Sounds good to me!", chat_history=memory_1)

In [15]:
get_suggestions(message="Where shall we go?", chat_history=memory_1, save_to_history=True)

SmartReplies(reply_1='How about that new cafe that just opened?', reply_2='Maybe the diner down the street has some good late-night options', reply_3='I was thinking more like a convenience store or something')

## Next steps

This notebook demonstrates a simple application using LLMs to generate smart replies based on the conversation. It utilizes `langchain`, `ollama`, and `pydantic` to build up a flexible LLM chain.

Given more time, the following additions could improve the system:
- *Adding more conversational/company data*: depending on the use case, and assuming Reffie is the responder, adding data from previous customer conversations and the sales/marketing playbook can help the model produce more tailored responses. RAG can be used to support this.
- *Trying out different types of memory*: Injecting the full conversation history makes the model process more tokens and can eventually exhaust the context window. `langchain` offers buffer memory and summary memory, which I would like to explore further; they may limit the tokens processed and improve model reliability.
- *Testing different prompts*: The current prompt works reasonably well, but refining it could improve the precision and relevance of responses. One potential direction is to shorten it while keeping the suggestions relevant.
- *Deciding whether to suggest at all*: As in Gmail, there are cases where the conversation does not need reply suggestions, or is so complex that suggestions are not useful. A conditional check may help here, and it could also save the processing cost of unnecessary model calls.
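
Two of these ideas are simple enough to sketch without any library. The code below is only an illustration under stated assumptions: `last_k` keeps a sliding window of the most recent history lines (a hand-rolled stand-in for buffer-style memory, not `langchain`'s classes), and `should_suggest` is a hypothetical gate whose word-count threshold is an arbitrary choice, not a tested heuristic.

```python
from typing import List


def last_k(history_lines: List[str], k: int = 6) -> List[str]:
    """Keep only the k most recent history lines to bound prompt size."""
    if k <= 0:
        return []
    return history_lines[-k:]


def should_suggest(message: str, max_words: int = 40) -> bool:
    """Skip suggestions for empty messages or very long ones.

    Assumption: message length is used as a crude proxy for complexity.
    """
    words = message.split()
    return 0 < len(words) <= max_words


lines = [
    "human: Hi. Can I see what time it is?",
    "responder: It's 9:35PM",
    "human: Shall we get some late night food?",
]
print(last_k(lines, k=2))
print(should_suggest("Where shall we go?"))  # True
```

In the notebook's flow, `should_suggest` would run before `get_suggestions`, and the windowed lines would be joined with newlines in place of the full `parse_history` output.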