<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/mistral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a chat bot in 10 minutes

Since the release of OpenAI's [ChatGPT](https://openai.com/blog/chatgpt) in November 2022, machine learning has been thrust into the public eye.
The power and versatility of Large Language Models (LLMs) have been brought to the forefront, showcasing their ability to understand and generate human-like text with unprecedented accuracy. This advancement opens up exciting possibilities for developing sophisticated chatbots and virtual assistants.

In this tutorial, we will leverage the [🤗 Transformers](https://huggingface.co/docs/transformers/index) library and an open-source LLM to build a powerful chatbot in under 10 minutes. Unlike many closed-source chatbots, such as ChatGPT, the chatbot that we build will be **fully open-source**. This means that you have full control over the model, its weights and the generation parameters. Furthermore, any user responses will be kept locally and not transmitted to a third party. This is helpful for confidential applications, such as those in a financial domain.

## Set-Up Environment

Since we're using larger models than the previous tutorial, we'll need to run them on a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit "Connect T4" in the top right-hand corner of the screen.

Once we've done that, we can go ahead and install the necessary Python packages:

In [None]:
!pip install --upgrade --quiet transformers accelerate bitsandbytes gradio

## Baseline Implementation

In this example, we'll use Transformers' lower-level model + tokenizer API. Let's go ahead and import the two Python classes we'll need to run the Mistral model. The first of these classes, `AutoModelForCausalLM`, is our
model class. This is the Python class that holds the model weights and graph definition. The second class, `AutoTokenizer`, is the tokenizer class,
which converts the input string to a token id (discrete number) representation:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

We can now load the pre-trained model weights from the Hugging Face Hub. For this example, we'll use [Mistral's 7B LLM](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), which has been instruction-tuned to act as a chatbot.

Now, a 7B parameter LLM in full precision (float32) requires roughly 28GB of memory just to load the weights (7 billion parameters at 4 bytes each), and still around 14GB in half precision (float16). The GPU typically assigned on the Google Colab free tier only has a capacity of 16GB. This means we are at risk of an *out-of-memory (OOM)* error, where the memory required by the model exceeds that of the GPU. To circumvent this, we'll load the weights in [4-bit precision](https://huggingface.co/blog/4bit-transformers-bitsandbytes), which should reduce the memory of the weights roughly by a factor of 8 relative to float32. This is extremely simple in Transformers: just pass the flag [`load_in_4bit`](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained.load_in_4bit) when we load the pre-trained weights.
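
To make the memory arithmetic concrete, here is a minimal back-of-the-envelope sketch. It assumes exactly 7 billion parameters and counts the weights only; the helper `weight_memory_gb` is a hypothetical name used for illustration, and real quantised checkpoints carry some extra overhead that is ignored here.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: exactly 7 billion parameters; activations, the KV cache and
# quantisation overhead are not included.
NUM_PARAMS = 7_000_000_000

BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions only the 4-bit and 8-bit variants leave comfortable headroom on a 16GB GPU once the memory needed during generation is taken into account.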

In [None]:
checkpoint_id = "sanchit-gandhi/Mistral-7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(checkpoint_id, low_cpu_mem_usage=True, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_id)

The advantage of using the `AutoModel` and `AutoTokenizer` API over model-specific classes is that we can easily swap the checkpoint id for any other checkpoint
on the Hugging Face Hub and re-use our code without any changes. The auto-classes will take care of loading the correct model and tokenizer classes for us! That means if a better LLM is released (highly probable!), we can quickly update our code to leverage the new model.

Great! We've loaded our model into memory, so now we're ready to define our pre-processing strategy. This looks a bit different to what we had previously when we did summarisation. Instead of just tokenising the input text, we need to tokenise a series of user/assistant exchanges. The precise formatting for tokenisation is quite
fiddly. Luckily for us, the tokenizer provides a convenient method, [`apply_chat_template`](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template), to format the tokens correctly for us. We just need to define which messages were sent by the user, and which ones were provided by the assistant:
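
To get a feel for what the chat template does, the following is a simplified, hand-written approximation of a Mistral-style format. The function `format_chat` and the exact whitespace are assumptions made for illustration; the real template ships with the checkpoint and may differ in its details. The tokenizer method additionally converts the formatted string to token ids, which this sketch leaves out.

```python
# Simplified, hand-written approximation of a Mistral-style chat format.
# This is NOT the tokenizer's own template: spacing and special tokens
# are assumptions made for illustration.
BOS, EOS = "<s>", "</s>"


def format_chat(messages):
    """Render alternating user/assistant messages as a single prompt string."""
    prompt = BOS
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Roles must alternate user/assistant, starting with user.")
        if message["role"] == "user":
            # User turns are wrapped in instruction markers
            prompt += f"[INST] {message['content']} [/INST]"
        else:
            # Assistant turns are closed with the end-of-sequence token
            prompt += f" {message['content']}{EOS}"
    return prompt


example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Tell me a joke."},
]
print(format_chat(example))
```

To inspect the string produced by the real template, passing `tokenize=False` to `apply_chat_template` returns the formatted text rather than token ids.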

In [None]:
messages = [
    {"role": "user", "content": "Can you tell me about the current trends in the stock market?"},
    {"role": "assistant", "content": "Certainly! Recently, there's been a growing interest in technology stocks, particularly those involved in artificial intelligence and renewable energy. However, it's important to consider market volatility and global economic factors in your investment decisions."},
    {"role": "user", "content": "What should I consider when investing in tech stocks?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

Perfect! Now, our model is currently sitting on the GPU:

In [None]:
model.device

Whereas our inputs are on the CPU:

In [None]:
encodeds.device

Every PyTorch model expects the inputs to be on the same device as the model. If the model is a race car and we are the passengers, we need to be in the race car before we can get going! Let's move the inputs to the correct device before generating:

In [None]:
model_inputs = encodeds.to(model.device)

Now that we've pre-processed our inputs, we can generate our response using the model in much the same way as before:

In [None]:
generated_ids = model.generate(model_inputs, max_new_tokens=1024, do_sample=True)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_text[0])

Looks like a very reasonable response! We can see the change between the user input and the assistant response marked by the `[INST]` and `[/INST]` tokens. This was handled automatically for us by the tokenizer. Note that we've generated with [*sampling*](https://huggingface.co/blog/how-to-generate#sampling), so there's an element of randomness in the generations, meaning each one will likely differ from the last. Feel free to re-run generation a few times to get a feel for the kind of variety that sampling brings.
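
The difference between sampling and always picking the most likely token can be sketched with a toy example. The four-word vocabulary and its scores below are made up for illustration, and the code only mimics a single decoding step; it is not how `generate` is implemented internally.

```python
import math
import random

# Hypothetical next-token scores (logits) for a toy four-word vocabulary.
logits = {"stocks": 2.0, "bonds": 1.5, "markets": 1.0, "bananas": -1.0}


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores.values())
    exps = {token: math.exp(score - max_score) for token, score in scores.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


probs = softmax(logits)

# Greedy decoding: always pick the most likely token, so the result never changes.
greedy_token = max(probs, key=probs.get)

# Sampling: draw tokens in proportion to their probability, so results vary.
rng = random.Random(0)  # seeded only to make this sketch repeatable
sampled_tokens = rng.choices(list(probs), weights=list(probs.values()), k=10)

print("greedy :", greedy_token)
print("sampled:", sampled_tokens)
```

Removing the seed gives a different list of sampled tokens on each run, which is the same effect we observe when re-running generation with `do_sample=True`.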

> **Note:** the responses generated by the model are purely for demonstration purposes. They should not be taken as financial recommendations.

## Streaming Outputs

While the generated response looks good, we had to wait a significant amount of time for the model to finish generating. We know from our knowledge of Transformer models that they generate in an *auto-regressive* fashion, that is, one token at a time:

<figure class="image table text-center m-0 w-full">
    <video
        style="max-width: 90%; margin: auto;"
        autoplay loop muted playsinline
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_2_1080p.mov"
    ></video>
</figure>

Rather than waiting for the entire text sequence to finish generating, we can print each token as it is generated. While this doesn't change the overall latency of the model (the total generation time is the same as before), the *perceived* latency is lower, since the first generated token is returned to the user as soon as it's ready.
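
The gap between total latency and perceived latency can be illustrated with a small simulation. Here `fake_generate` is a hypothetical stand-in for a model that produces one token per step, with a short sleep in place of each forward pass; the timings are illustrative only.

```python
import time


def fake_generate(num_tokens=5, seconds_per_token=0.01):
    """Stand-in for an auto-regressive model: yields one token per step."""
    for step in range(num_tokens):
        time.sleep(seconds_per_token)  # pretend this is one forward pass
        yield f"token{step}"


start = time.perf_counter()
time_to_first_token = None
tokens = []
for token in fake_generate():
    if time_to_first_token is None:
        # With streaming, the user sees output from this moment on
        time_to_first_token = time.perf_counter() - start
    tokens.append(token)
total_time = time.perf_counter() - start  # without streaming, the user waits this long

print(f"first token after {time_to_first_token:.3f}s, all tokens after {total_time:.3f}s")
```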

Transformers provides a [`TextStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer) class to easily print the predictions on-the-fly. We simply have to pass the streamer to our generate function, and the rest is taken care of for us. Let's run an example using the text streamer:

In [None]:
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generated_ids = model.generate(model_inputs, max_new_tokens=1024, do_sample=True, streamer=streamer)

Interacting with the LLM this way already feels more personalised, as we watch it generate one token at a time. It also feels faster to the user, as we get the first output as soon as it is ready.

## Prompt Engineering

We can ask the model to perform a more advanced task by modifying the prompt. In the following example, we'll provide the model with a summary of [Apple's Q3 Earnings Report](https://www.apple.com/uk/newsroom/2023/08/apple-reports-third-quarter-results/), and ask it to assess the potential implications for the US economy. This is an example of *prompt engineering*, where we control the functionality of the model by changing the form of the input prompt:

In [None]:
APPLE_EARNINGS = (
    "Apple today announced financial results for its fiscal 2023 third quarter ended July 1, 2023. The Company posted quarterly revenue of $81.8 billion, down 1 percent year over year, and quarterly earnings per diluted share of $1.26, up 5 percent year over year. "
    "'We are happy to report that we had an all-time revenue record in Services during the June quarter, driven by over 1 billion paid subscriptions, and we saw continued strength in emerging markets thanks to robust sales of iPhone,' said Tim Cook, Apple’s CEO. 'From education to the environment, we are continuing to advance our values, while championing innovation that enriches the lives of our customers and leaves the world better than we found it.' "
    "'Our June quarter year-over-year business performance improved from the March quarter, and our installed base of active devices reached an all-time high in every geographic segment,' said Luca Maestri, Apple’s CFO. 'During the quarter, we generated very strong operating cash flow of $26 billion, returned over $24 billion to our shareholders, and continued to invest in our long-term growth plans.' "
    "Apple’s board of directors has declared a cash dividend of $0.24 per share of the Company’s common stock. The dividend is payable on August 17, 2023 to shareholders of record as of the close of business on August 14, 2023."
)

messages = [
    {"role": "user", "content": (
        "I have the latest earnings report of Apple. Based on this report's financial performance indicators, including revenue, profits, and growth projections, "
        "can you analyze and advise on the potential impact this company's performance might have on the broader US economy? "
        "Please consider factors such as the company's market influence, sector trends, and any ripple effects that might be observed in related industries "
        f"or the economy as a whole. Here is the report: {APPLE_EARNINGS}"
        )
    }
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(model.device)

In [None]:
generated_ids = model.generate(model_inputs, max_new_tokens=1024, do_sample=True, streamer=streamer)

## Gradio Demo

While it's interesting for us learners to interact with the model through Python code, it's by no means the easiest way to share our model with others. To finish this section, we'll demonstrate how the Mistral model can be wrapped into a simple chat interface and shared with anyone in the community.

To achieve this, we'll use the [Gradio library](https://www.gradio.app). Gradio is an open-source Python library that simplifies the creation of web interfaces for machine learning models. It allows developers and researchers to quickly build and share customisable UI components for model testing and feedback, enhancing the accessibility and usability of AI models.

First, we'll define an end-to-end function that takes the user's message and the chat history, and generates a response. Broadly speaking, this simply requires chaining the three stages of prediction:
1. Tokenize (pre-process) the message and chat history
2. Define the generation arguments
3. Generate the predicted ids and stream the output

We'll make a small modification to our streamer: instead of using the base [`TextStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer), we'll use the [`TextIteratorStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextIteratorStreamer), which makes the generated text available through an iterator in a non-blocking way. Because we run `generate` in a separate thread, we can pass each new piece of text to the interface as soon as it arrives.
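
The pattern behind this non-blocking behaviour is a producer thread that pushes text onto a queue while the main thread iterates over it. The sketch below illustrates that idea using only the standard library; `ToyIteratorStreamer` and `toy_generate` are hypothetical stand-ins written for this example and should not be read as the library's actual implementation.

```python
from queue import Queue
from threading import Thread

_END = object()  # sentinel marking the end of the stream


class ToyIteratorStreamer:
    """Hypothetical, simplified stand-in: a queue that can be iterated over."""

    def __init__(self, timeout=None):
        self.queue = Queue()
        self.timeout = timeout

    def put(self, text):
        self.queue.put(text)

    def end(self):
        self.queue.put(_END)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until the producer supplies the next piece of text
        item = self.queue.get(timeout=self.timeout)
        if item is _END:
            raise StopIteration
        return item


def toy_generate(streamer, words):
    """Stand-in for model.generate: pushes text pieces as they are 'generated'."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()


toy_streamer = ToyIteratorStreamer(timeout=10.0)
worker = Thread(
    target=toy_generate,
    kwargs={"streamer": toy_streamer, "words": ["Hello", "from", "a", "thread"]},
)
worker.start()

partial = ""
snapshots = []
for text in toy_streamer:
    partial += text
    snapshots.append(partial)  # each snapshot is what the UI would display
worker.join()

print(snapshots[-1])
```

Running the producer in its own thread is what keeps the consuming loop free to update the interface after every new piece of text.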

In [None]:
from transformers import TextIteratorStreamer
from threading import Thread

MAX_INPUT_TOKEN_LENGTH = 4096

def generate(message, chat_history):
    # Step 1: pre-process the inputs
    conversation = []
    for user, assistant in chat_history:
        conversation.extend([{"role": "user", "content": user}, {"role": "assistant", "content": assistant}])

    conversation.append({"role": "user", "content": message})

    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")

    # in case our inputs exceed the maximum length, keep only the most recent tokens
    if input_ids.shape[1] > MAX_INPUT_TOKEN_LENGTH:
        input_ids = input_ids[:, -MAX_INPUT_TOKEN_LENGTH:]
        gr.Warning(f"Trimmed input from conversation as it was longer than {MAX_INPUT_TOKEN_LENGTH} tokens.")

    input_ids = input_ids.to(model.device)
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    # Step 2: define generation arguments
    generate_kwargs = dict(
        {"input_ids": input_ids},
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
    )

    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    # Step 3: generate and stream outputs
    outputs = ""
    for text in streamer:
        outputs += text
        yield outputs
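
The pre-processing in Step 1 can be checked in isolation, without loading the model. The sketch below assumes that the chat history arrives as a list of `(user, assistant)` pairs, as the loop in `generate` does; `build_conversation` and `keep_most_recent` are hypothetical helper names introduced for this example.

```python
def build_conversation(message, chat_history):
    """Flatten (user, assistant) pairs plus the new message into role/content dicts."""
    conversation = []
    for user, assistant in chat_history:
        conversation.append({"role": "user", "content": user})
        conversation.append({"role": "assistant", "content": assistant})
    conversation.append({"role": "user", "content": message})
    return conversation


def keep_most_recent(token_ids, max_length):
    """Drop the oldest tokens so that at most max_length (a positive int) remain."""
    return token_ids[-max_length:]


history = [("Hi", "Hello! How can I help?")]
conversation = build_conversation("What is an LLM?", history)

print([turn["role"] for turn in conversation])
print(keep_most_recent(list(range(10)), max_length=4))
```

Trimming from the left keeps the most recent part of the conversation, but it may cut through the middle of an earlier turn, so the trimmed prompt no longer follows the chat template exactly.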

With the generation function defined, we can now create a Gradio demo in just 3 lines of additional code:

In [None]:
import gradio as gr

chat_interface = gr.ChatInterface(generate)
chat_interface.queue().launch(share=True)