# How to format inputs to Chat models

EasyLLM can be used as an abstraction layer to replace `gpt-3.5-turbo` and `gpt-4` with open source models.

You can migrate your own applications away from the OpenAI API by simply changing the client.

Chat models take a series of messages as input, and return an AI-written message as output.

This guide illustrates the chat format with a few example API calls.

### 1. Import the easyllm library

In [None]:
# if needed, install and/or upgrade to the latest version of the EasyLLM Python library
%pip install --upgrade easyllm 

In [6]:
# import the EasyLLM Python library for calling the EasyLLM API
import easyllm

### 2. An example chat API call

A chat API call has two required inputs:
- `model`: the name of the model you want to use (e.g., `meta-llama/Llama-2-70b-chat-hf`), or leave it empty to just call the API
- `messages`: a list of message objects, where each object has two required fields:
    - `role`: the role of the messenger (either `system`, `user`, or `assistant`)
    - `content`: the content of the message (e.g., `Write me a beautiful poem`)
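
The message shape described in the list above can be checked before sending a request. The following is a minimal sketch, not part of EasyLLM: `validate_messages` and `ALLOWED_ROLES` are hypothetical names introduced here for illustration only.

```python
# Hypothetical helper (not part of EasyLLM): checks the message shape described above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Raise ValueError if a message lacks a known role or string content."""
    if not isinstance(messages, list) or not messages:
        raise ValueError("messages must be a non-empty list")
    for i, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str):
            raise ValueError(f"message {i}: content must be a string")
    return messages


messages = validate_messages([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write me a beautiful poem"},
])
print(len(messages))
```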

Compared to the OpenAI API, the `huggingface` module also exposes `prompt_builder` and `stop_sequences` parameters, which you can use to customize the prompt and the stop sequences. The EasyLLM package ships with built-in helpers for popular models for both parameters, e.g. `build_llama2_prompt` and `llama2_stop_sequences`.

Let's look at an example chat API call to see how the chat format works in practice.
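
To make the idea of a stop sequence concrete, here is a small conceptual sketch of cutting generated text at the first stop sequence it contains. It is an illustration only, not how EasyLLM or the inference backend implements stopping; `truncate_at_stop` is a hypothetical helper, and `"</s>"` is assumed here as an example stop sequence.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty strings, which would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# "</s>" is an assumed stop sequence, chosen for illustration.
generated = " Apple who?</s><s>[INST] unwanted continuation"
print(truncate_at_stop(generated, ["</s>"]))
```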

In [1]:

from easyllm.clients import huggingface
from easyllm.prompt_utils import llama2_stop_sequences, build_llama2_prompt
# Example EasyLLM Python library request
MODEL = "meta-llama/Llama-2-70b-chat-hf"
huggingface.prompt_builder = build_llama2_prompt

# The module automatically loads the HuggingFace API key from the environment variable HUGGINGFACE_TOKEN or from the HuggingFace CLI configuration file.
# huggingface.api_key="hf_xxx"

response = huggingface.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Apple."},
    ],
    temperature=0.9,
    top_p=0.6,
    max_tokens=1024,
)
response

{'id': 'hf-VH-mBO0hVT',
 'object': 'chat.completion',
 'created': 1690987326,
 'model': 'meta-llama/Llama-2-70b-chat-hf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant', 'content': ' Apple who?'},
   'finish_reason': 'eos_token'}],
 'usage': {'prompt_tokens': 149, 'completion_tokens': 5, 'total_tokens': 154}}

As you can see, the response object has a few fields:
- `id`: the ID of the request
- `object`: the type of object returned (e.g., `chat.completion`)
- `created`: the timestamp of the request
- `model`: the full name of the model used to generate the response
- `usage`: the number of tokens used to generate the replies, counting prompt, completion, and total
- `choices`: a list of completion objects (only one, unless you set `n` greater than 1)
    - `message`: the message object generated by the model, with `role` and `content`
    - `finish_reason`: the reason the model stopped generating text (`eos_token` in the example above, or `length` if the `max_tokens` limit was reached)
    - `index`: the index of the completion in the list of choices

Extract just the reply with:

In [2]:
print(response['choices'][0]['message']['content'])

 Apple who?
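
Beyond the reply text, the other fields can be read the same way. The sketch below works on a plain dictionary copied from the response shown above, so it runs without an API call; `summarize_response` is a hypothetical helper, not an EasyLLM function.

```python
# A plain dict copied from the example response above.
response = {
    "id": "hf-VH-mBO0hVT",
    "object": "chat.completion",
    "created": 1690987326,
    "model": "meta-llama/Llama-2-70b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " Apple who?"},
            "finish_reason": "eos_token",
        }
    ],
    "usage": {"prompt_tokens": 149, "completion_tokens": 5, "total_tokens": 154},
}


def summarize_response(response):
    """Collect the reply, finish reason, and token total of the first choice."""
    choice = response["choices"][0]
    return {
        "reply": choice["message"]["content"].strip(),
        "finish_reason": choice["finish_reason"],
        # `length` signals that the max_tokens limit cut the reply short
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
    }


summary = summarize_response(response)
print(summary["reply"])
```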


Even non-conversation-based tasks can fit into the chat format by placing the instruction in the first user message.

For example, to ask the model to explain asynchronous programming in the style of a math teacher, we can structure the conversation as follows:

In [3]:
# example with a system message
response = huggingface.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain asynchronous programming in the style of math teacher."},
    ],
)

print(response['choices'][0]['message']['content'])


 Hello, my dear students! Today, we're going to learn about a fascinating topic that will help us understand how to make our programs more efficient and responsive: asynchronous programming.

Imagine you're working on a project with your classmates, and you need to finish a task, but you're waiting for someone else to finish their part first. You can't start your work until they're done, but you don't want to just sit there twiddling your thumbs. That's where asynchronous programming comes in!

Asynchronous programming is like working on a project with your classmates, but instead of waiting for them to finish, you can start working on something else in the meantime. You can think of it like this:

1. You give a task to your classmate, and they start working on it.
2. While they're working, you start working on another task that doesn't depend on their work.
3. When your classmate finishes their task, they give it back to you, and you can continue working on your task.

This way, you c

In [4]:
# example without a system message and with the debug flag on:
response = huggingface.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
    ],
    debug=True,
)

print(response['choices'][0]['message']['content'])


08/02/2023 16:42:33 - DEBUG - easyllm.utils - Prompt sent to model will be:
<s>[INST] Explain asynchronous programming in the style of the pirate Blackbeard. [/INST]
08/02/2023 16:42:33 - DEBUG - easyllm.utils - Url:
https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf
08/02/2023 16:42:33 - DEBUG - easyllm.utils - Stop sequences:
[]
08/02/2023 16:42:33 - DEBUG - easyllm.utils - Generation parameters:
{'do_sample': True, 'return_full_text': False, 'max_new_tokens': 1024, 'top_p': 0.6, 'temperature': 0.9, 'stop_sequences': [], 'repetition_penalty': 1.0, 'top_k': 10, 'seed': 42}
08/02/2023 16:42:58 - DEBUG - easyllm.utils - Response at index 0:
index=0 message=ChatMessage(role='assistant', content=" Ahoy matey! Yer lookin' fer a tale of asynchronous programming, eh? Well, settle yerself down with a pint o' grog and listen close, for Blackbeard's got a story fer ye.\n\nAsynchronous programming, me hearty, be like sailin' a ship through treacherous waters. Ye gotta kee
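
The debug log shows how the message list is flattened into a single Llama 2 prompt string. As a rough sketch of that layout, inferred only from the logged prompts in this guide, the function below rebuilds the same string. `sketch_llama2_prompt` is a hypothetical name and is not EasyLLM's `build_llama2_prompt`; it assumes the conversation starts with a user turn and alternates user and assistant turns, and the exact whitespace of the real builder may differ.

```python
def sketch_llama2_prompt(messages):
    """Hypothetical rebuild of the Llama 2 chat layout seen in the debug log."""
    messages = list(messages)
    system = ""
    if messages and messages[0]["role"] == "system":
        # The system message is folded into the first user turn.
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]

    prompt = ""
    for i, message in enumerate(messages):
        if message["role"] == "user":
            prefix = system if i == 0 else ""
            prompt += f"<s>[INST] {prefix}{message['content']} [/INST]"
        elif message["role"] == "assistant":
            # An assistant reply closes the turn opened by the previous user message.
            prompt += f" {message['content']}</s>"
        else:
            raise ValueError(f"unexpected role: {message['role']!r}")
    return prompt


print(sketch_llama2_prompt([
    {"role": "user", "content": "Explain asynchronous programming in the style of the pirate Blackbeard."},
]))
```

Since the earlier example assigns a function to `huggingface.prompt_builder`, a custom builder of this shape could presumably be plugged in the same way, though that is an assumption to verify against the EasyLLM documentation.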

### 3. Few-shot prompting

In some cases, it's easier to show the model what you want rather than tell the model what you want.

One way to show the model what you want is with faked example messages.

For example:

In [5]:
# An example of a faked few-shot conversation to prime the model into translating business jargon to simpler speech
response = huggingface.ChatCompletion.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful, pattern-following assistant."},
        {"role": "user", "content": "Help me translate the following corporate jargon into plain English."},
        {"role": "assistant", "content": "Sure, I'd be happy to!"},
        {"role": "user", "content": "New synergies will help drive top-line growth."},
        {"role": "assistant", "content": "Things working well together will increase revenue."},
        {"role": "user", "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."},
        {"role": "assistant", "content": "Let's talk later when we're less busy about how to do better."},
        {"role": "user", "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."},
    ],
)

print(response["choices"][0]["message"]["content"])


08/02/2023 16:42:58 - DEBUG - easyllm.utils - Prompt sent to model will be:
<s>[INST] <<SYS>>
You are a helpful, pattern-following assistant.
<</SYS>>

Help me translate the following corporate jargon into plain English. [/INST] Sure, I'd be happy to!</s><s>[INST] New synergies will help drive top-line growth. [/INST] Things working well together will increase revenue.</s><s>[INST] Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage. [/INST] Let's talk later when we're less busy about how to do better.</s><s>[INST] This late pivot means we don't have time to boil the ocean for the client deliverable. [/INST]
08/02/2023 16:42:58 - DEBUG - easyllm.utils - Url:
https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf
08/02/2023 16:42:58 - DEBUG - easyllm.utils - Stop sequences:
[]
08/02/2023 16:42:58 - DEBUG - easyllm.utils - Generation parameters:
{'do_sample': True, 'return_full_text': False, 'max_new_tokens': 1024, 'top_
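
When a few-shot prompt has many examples, writing the message list by hand gets repetitive. One possible way to assemble it from example pairs is sketched below; `build_few_shot_messages` is a hypothetical helper introduced for illustration, not an EasyLLM function.

```python
def build_few_shot_messages(system, examples, query):
    """Turn (user, assistant) example pairs plus a final query into a messages list."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The final user message is the one the model should actually answer.
    messages.append({"role": "user", "content": query})
    return messages


few_shot_messages = build_few_shot_messages(
    system="You are a helpful, pattern-following assistant.",
    examples=[
        ("New synergies will help drive top-line growth.",
         "Things working well together will increase revenue."),
        ("Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
         "Let's talk later when we're less busy about how to do better."),
    ],
    query="This late pivot means we don't have time to boil the ocean for the client deliverable.",
)
print([m["role"] for m in few_shot_messages])
```

The resulting list can be passed as the `messages` argument in the same way as the hand-written list above.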

Not every attempt at engineering conversations will succeed at first.

If your first attempts fail, don't be afraid to experiment with different ways of priming or conditioning the model.

As an example, one developer discovered an increase in accuracy when they inserted a user message that said "Great job so far, these have been perfect" to help condition the model into providing higher quality responses.

For more ideas on how to improve the reliability of the models, consider reading our guide on [techniques to increase reliability](../techniques_to_improve_reliability.md). It was written for non-chat models, but many of its principles still apply.