# Project 02 - Custom Chatbot with Memory and User Interface

> In this project, you will learn how to add improvements to the chat system we've created by implementing a memory system, allowing the bot to remember the conversation history. This will give it a better understanding of the message context based on past interactions (chatbots with this capability are known as "Context-Aware Chatbots"). Additionally, we will explore how to easily create a user-friendly interface (UI) for your application using a versatile library called Streamlit.

We'll learn how to implement this using Colab and also how to adapt the code to run in your local environment, which might be more beneficial.

To make our application more flexible, we will prepare it to work with different models and LLM providers, both open-source (running locally or via cloud) and proprietary (via API only).

This way, if you're using Colab, the application will still function even if you select CPU instead of GPU.

We'll discuss the advantages of each method later as we develop the integration.

## [ ! ] How to Run Locally

* To run the code for this project in a local environment, install the necessary dependencies with the commands shown in the Installation and Configuration section below; the same installation commands work locally. For more details, check out the video lessons on local setup with Streamlit.

* You can already run it locally as shown in the lesson. If you run into configuration errors in your local environment, we recommend using Colab first to avoid disrupting the learning flow. That said, running locally is the ideal setup: Streamlit requires a .py file, whereas Colab works with .ipynb notebooks, so either way you will need to combine everything into a single .py file.

* Additionally, when running locally, you can see changes more quickly. After starting Streamlit (e.g., `streamlit run proj02.py` in a terminal, without the `!` prefix that is only needed in notebook cells), you just need to edit and save the .py script, then refresh the Streamlit page to see the updates. In other words, you don't need to re-run the `streamlit run` command.

Before running your code locally, make sure all the libraries listed in the pip install command are installed. If you haven't installed them yet, you can do so directly from the terminal in VS Code (if you're using that IDE) or from a regular terminal/prompt.

## Installation and Configuration

We need to install the libraries our application depends on, such as LangChain and Streamlit (to create the user interface), along with some other packages we used previously.

> If you are running locally: you also need to install pytorch, if you do not have it installed already (remember that Colab already has it installed by default, just import it).
* To avoid compatibility issues, we recommend this command: `pip install torch==2.3.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/test/cu121`

In [1]:
! pip install -q streamlit langchain sentence-transformers
! pip install -q langchain_community langchain-huggingface langchain_ollama langchain_openai

> Installing Localtunnel

If you are running Colab, you also need to install Localtunnel so that we can connect to the application generated with Streamlit.

This will be explained in the step where the interface is initialized.

In [3]:
! npm install localtunnel



### Loading environment variables with dotenv

We will use the **dotenv** library, which simplifies the management of environment variables by storing them in a .env file.

In [4]:
! pip install python-dotenv



#### Creating the .env file

The `%%writefile` command saves the contents of the notebook cell to an external file with the specified name.

In [5]:
%%writefile .env
HUGGINGFACE_API_KEY=##########
HUGGINGFACEHUB_API_TOKEN=##########
OPENAI_API_KEY=##########
TAVILY_API_KEY=##########
SERPAPI_API_KEY=##########
LANGCHAIN_API_KEY=##########

Writing .env
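
Once the keys from the `.env` file have been loaded into the environment, they can be read through `os.environ`. As a minimal sketch (the helper name `missing_keys` is hypothetical and not part of dotenv), you could confirm that the keys you need are set without ever printing their values:

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Hypothetical helper; defaults to the real process environment.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# A plain dict stands in for the environment in this example
example_env = {"OPENAI_API_KEY": "placeholder", "HUGGINGFACEHUB_API_TOKEN": ""}
print(missing_keys(["OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"], example_env))
```

Calling `missing_keys([...])` without the second argument checks the actual environment instead of the example dict.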


## Code explanations (step by step)

First, we will do all the necessary imports.

In [6]:
import streamlit as st
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import MessagesPlaceholder

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

import torch
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

from dotenv import load_dotenv

load_dotenv()

True

### Explanations about model provider selection

We are designing our application to allow model loading from different model providers:

1. Hugging Face Hub

 * Inference runs on an external server, so you don't need to worry about local environment setup.
 * When to use: Great for those who need fast inference with open-source models at no cost, on simpler setups without a GPU. Note that very large models require a Pro subscription on Hugging Face (not needed for the models we use in this course).

2. OpenAI (ChatGPT)

 * Runs via API and requires an internet connection.
 * When to use: Ideal for users who don’t want to deal with setup or lack the hardware to run models locally. It also suits those willing to pay a few cents for every 1M tokens generated, which ends up being a low cost even for extensive use (you can check the current rates on OpenAI's pricing page).

3. Ollama

 * Runs locally, optimized for local environments.
 * When to use: Perfect for users who want a free solution and have the necessary hardware, or who want to work offline. Currently, this is the recommended option for local use. That’s why we aren’t using the regular Hugging Face pipeline (the one we learned first in this course), but you can still implement it later, especially if you’re using a GPU on Colab or have your own GPU.

With this in mind, we can proceed to create the model loading function.

### Model loading function

> `[ ! ]` We’ll define a variable now to make it easier to switch between methods later (can be useful either to compare different models or when you just want to use another model).

Note: The `# @param` is just a way to easily parameterize the code in Colab. By changing the value in the field next to it, the corresponding value in the code will be updated.

In [None]:
model_class = "hf_hub" # @param ["hf_hub", "openai", "ollama"]

> **Loading Functions**

We'll wrap the loading code in functions to keep it organized.

The loading code won’t be explained here, as it’s the same code used in the first Colab; we simply copied it and placed it inside functions.

`[!] Note:` We’ve only included the model name and temperature as parameters for now, since we don’t need to vary other attributes. If you want to make your application more flexible, you can modify the functions to accept additional parameters.

> **Model Selection**

We’ll set it up so that by default, the Meta-Llama-3-8B-Instruct model is loaded, as we’ve already confirmed it works well for conversations in Portuguese. If we don’t pass the model parameter (which corresponds to the model name) to this function, it will load this model by default.

In [None]:
def model_hf_hub(model="meta-llama/Meta-Llama-3-8B-Instruct", temperature=0.1):
  llm = HuggingFaceEndpoint(
      repo_id=model,
      temperature=temperature,
      max_new_tokens=512,
      return_full_text=False,
      #model_kwargs={
      #    "max_length": 64,
      #    #"stop": ["<|eot_id|>"],
      #}
  )
  return llm

Below, we write functions for the other model loading methods.

Again, the loading code will not be explained here because it is the same code used in the first Colab, just copied and placed inside functions. During the lesson, you can simply open the Colab link and copy and paste it into your project; there is no need to type it manually.

In [None]:
def model_openai(model="gpt-4o-mini", temperature=0.1):
    llm = ChatOpenAI(
        model=model,
        temperature=temperature
        # other parameters...
    )
    return llm

def model_ollama(model="phi3", temperature=0.1):
    llm = ChatOllama(
        model=model,
        temperature=temperature,
    )
    return llm

> **`[ ! ]`** Here you could set functions for other services too, like Groq or Google for example (see the end of the 1st Colab)

### Defining the process: Prompt, chain and model response

In this step we will basically:
1. define the prompt template for our virtual assistant
 * and also see how we can further customize the prompt so that it is better suited to our objective
2. load the model
3. create the execution chain (prompt + llm + text post-processing)

Let's wrap this inside a function that returns the model response (here we call it `model_response`).

In [None]:
def model_response(user_query, chat_history, model_class):

    ## Loading the LLM
    if model_class == "hf_hub":
        llm = model_hf_hub()
    elif model_class == "openai":
        llm = model_openai()
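    elif model_class == "ollama":
        llm = model_ollama()
    else:
        # Fail with a clear message instead of an undefined `llm` later on
        raise ValueError(f"Unknown model_class: {model_class!r}")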

    ## Prompt definition
    system_prompt = """
    You are a helpful assistant answering general questions. Please respond in {language}.
    """
    # corresponds to the language we want in our output
    language = "the same language the user is using to chat" #or, force to answer in a specific language e.g. english

    # Adapting to the pipeline (for open source model with hugging face)
    if model_class.startswith("hf"):
        user_prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    else:
        user_prompt = "{input}"

    prompt_template = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", user_prompt)
    ])

    ## Creating the chain
    chain = prompt_template | llm | StrOutputParser()

    ## Response return / Stream
    return chain.stream({
        "chat_history": chat_history,
        "input": user_query,
        "language": language
    })

> **Detailed explanation**

**Loading the LLM**
* Depending on the value of `model_class`, the function initializes the corresponding language model by calling the appropriate loading function (which we defined before).
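
As an alternative way to express this selection, the sketch below uses a dictionary that maps each provider name to its loading function. This is not the code used in the lesson: `load_model` and `demo_loaders` are hypothetical names, and the lambdas are stand-ins for `model_hf_hub`, `model_openai` and `model_ollama` so that the example runs without any API keys.

```python
def load_model(model_class, loaders):
    """Pick a loader by name and call it, failing loudly on unknown names."""
    try:
        loader = loaders[model_class]
    except KeyError:
        options = ", ".join(sorted(loaders))
        raise ValueError(
            f"Unknown model_class {model_class!r}; expected one of: {options}"
        ) from None
    return loader()

# Stand-in loaders for illustration only; in the app these would be
# model_hf_hub, model_openai and model_ollama.
demo_loaders = {
    "hf_hub": lambda: "hf_hub model",
    "openai": lambda: "openai model",
    "ollama": lambda: "ollama model",
}
print(load_model("ollama", demo_loaders))
```

One advantage of this layout is that adding another provider only requires a new dictionary entry, and a mistyped name produces an explicit error message.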

**Prompt definition**
(`prompt_template = ChatPromptTemplate.from_messages...`)
* The `model_response` function generates the chatbot response based on the user query (`user_query`), the chat history (`chat_history`) and the chosen model provider (`model_class`).
 * system_prompt: Defines the system message, which instructs the assistant to respond in the same language the user is using to chat. You can change the value of the `language` variable to English or any other language if you want to make sure the LLM always answers in that language. This is optional, since models are generally good at matching the user's language on their own.
 * user_prompt: Format of the user input that will be used in the prompt.
 * prompt_template: Creates a prompt template that combines the system_prompt, the chat history and the user_prompt. For this we use `MessagesPlaceholder`, which gives full control over which messages are rendered during formatting. It is useful when you are not sure which role to use for your message prompt templates or when you want to insert a list of messages during formatting (our case).

* about `if model_class.startswith("hf"):`
 * If we are loading an open-source model through Hugging Face, this adjustment is necessary (as explained in Colab 1, in the section about open-source model templates).
  * If so, the user prompt is wrapped in this template. `.startswith` is a quick way to check: if model_class starts with "hf" (because we name the Hugging Face options that way), the template is applied.
  * Otherwise, the plain prompt is used.
 * Note: at the moment, this is necessary with LangChain. If it becomes unnecessary in the future, we will remove this section.
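
To see what the template adjustment produces, here is a small standalone sketch of the same branching. The function name `build_user_prompt` and the constant `LLAMA3_USER_TEMPLATE` are hypothetical; the template string itself is the one used in `model_response`.

```python
LLAMA3_USER_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

def build_user_prompt(model_class):
    """Return the user prompt format for the given provider name."""
    if model_class.startswith("hf"):
        # Hugging Face option: wrap the input in the Llama 3 special tokens
        return LLAMA3_USER_TEMPLATE
    # Other providers: pass the input through unchanged
    return "{input}"

# Show the final text for each case
print(build_user_prompt("hf_hub").format(input="Hello!"))
print(build_user_prompt("openai").format(input="Hello!"))
```

Keep in mind that these special tokens are specific to the Llama 3 family; if you pass a different model to `model_hf_hub`, its template would likely need to change as well.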

**Chain Creation**

* Combines the prompt template, the chosen language model and an output parser (StrOutputParser) into a chain.
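
If the `|` syntax looks unfamiliar, the toy sketch below shows the general idea of pipe-style composition in plain Python. It is only an illustration and not LangChain's implementation: the `Step` class and the three stand-in steps are hypothetical.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds the result to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

# Stand-ins for prompt_template, llm and StrOutputParser
prompt = Step(lambda data: f"User asked: {data['input']}")
fake_llm = Step(lambda text: {"content": text.upper()})
parser = Step(lambda message: message["content"])

chain = prompt | fake_llm | parser
print(chain.invoke({"input": "hello"}))
```

Each step receives the output of the previous one, which is why the order prompt, model, parser matters.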

**Returning the response / Stream** (`return chain.stream...`)

* Runs the chain with the provided parameters and returns the model's response incrementally, which is known as streaming.

* Basically, each token is streamed back as it is generated. This lets the user watch the progress instead of staring at a blank screen until processing is 100% complete (more explanations about streaming in Colab 1).

* Note: at the moment streaming is not visible with the Hugging Face implementation, which returns the text all at once; you will see it if you select "openai" or "ollama" (or another provider that supports this feature). Since inference through the Hugging Face Hub is very fast, this is not a problem: the result appears quickly anyway.

* Switching to stream mode was quite simple: we changed `chain.invoke` (the normal mode we used before) to `chain.stream`. Since chains are runnables, they implement the [runnable interface](https://python.langchain.com/docs/expression_language/interface/), which already supports this mode.
* (It is worth running with `chain.invoke` first and then with `chain.stream`, just for comparison.)
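
The difference between the two modes can be sketched with a plain Python generator. The functions `fake_stream` and `fake_invoke` below are hypothetical stand-ins with no LangChain involved: one yields the answer piece by piece, the other returns it all at once.

```python
def fake_stream(text):
    """Yield a response piece by piece, like a streaming chain would."""
    for word in text.split(" "):
        yield word + " "

def fake_invoke(text):
    """Return the whole response at once."""
    return "".join(fake_stream(text))

answer = "Jupiter is the largest planet"
chunks = list(fake_stream(answer))
print(chunks)               # the pieces arrive one by one
print(fake_invoke(answer))  # the same text, delivered in a single string
```

Joining the streamed chunks gives the same text as the single call; only the delivery differs, which is what makes streaming feel more responsive in the interface.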

### Session state management

In this section, we will be initializing the conversation history, using the session state.

The code below checks whether `chat_history` is present in the session state (`st.session_state`). If it is not, it initializes it with a welcome message from the virtual assistant. The `content` argument holds this message, which our assistant uses to start the conversation.

* More about the session state:

https://docs.streamlit.io/develop/api-reference/caching-and-state/st.session_state

In [None]:
if "chat_history" not in st.session_state:
    st.session_state.chat_history = [
        AIMessage(content="Hi, I'm your virtual assistant! How can I help you?"),
    ]
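
The reason for the `not in` check is that Streamlit reruns the script from top to bottom on every interaction. The sketch below simulates this with a plain dict standing in for `st.session_state`; `run_script` is a hypothetical name used only for this illustration.

```python
def run_script(session_state):
    """Simulate one top-to-bottom rerun of the app script."""
    # Initialize the history only if it is not there yet
    if "chat_history" not in session_state:
        session_state["chat_history"] = ["Hi, I'm your virtual assistant!"]
    return session_state["chat_history"]

state = {}  # stands in for st.session_state
run_script(state).append("What is the largest planet?")  # first run plus a user message
history = run_script(state)  # second rerun: the history is preserved
print(len(history))
```

Without the check, every rerun would reset the history to the welcome message and the bot would lose its memory.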

### Defining the conversation

Now we will render the message history.

The snippet below iterates over the chat_history and displays each message in the Streamlit interface, differentiating between messages from the AI (AIMessage) and from the user (HumanMessage).

* This loop goes through all the messages stored in the chat history, which is saved in the variable `st.session_state.chat_history`. This history contains both the messages sent by the user and the responses generated by the AI.
* The condition `if isinstance...` checks whether the current message (message) was generated by the AI. To do this, it uses the isinstance function, which checks whether the message object is an instance of the AIMessage class.
  * If the message was generated by the AI, this code block creates a new chat bubble (the message box) identified as "AI" using st.chat_message("AI"). Inside this bubble, the message content is displayed in the interface using st.write.
  * If the message was not generated by the AI, the code checks whether it was sent by the user, that is, whether message is an instance of the HumanMessage class. In this case, it creates a bubble identified as "Human" with st.chat_message("Human"), and the content of the user's message is displayed in the interface using st.write again.

In [None]:
for message in st.session_state.chat_history:
    if isinstance(message, AIMessage):
        with st.chat_message("AI"):
            st.write(message.content)
    elif isinstance(message, HumanMessage):
        with st.chat_message("Human"):
            st.write(message.content)

### User Input

This code snippet is responsible for capturing user input, updating the chat history with the new message, displaying the messages in the interface, and generating a response from the AI.

* `st.chat_input` creates an input field in the Streamlit interface where the user can type their message. The submitted text is stored in the user_query variable.
 * The second line of code is a condition that checks whether the user actually typed something. The check `user_query is not None` ensures that a message was submitted, and `user_query != ""` ensures that it is not an empty string. If both conditions are true, the code inside the if block is executed.
 * The 3rd line: the message typed by the user (user_query) is converted to a HumanMessage object and added to the chat history, `st.session_state.chat_history`. This keeps track of the messages sent by the user.
 * `with st.chat_message("Human")` - we create a message in the conversation labeled "Human" in the interface to display the user's message. The st.markdown function is used to format and display the message (user_query) in the Streamlit interface.
 * `st.chat_message("AI")` - we create a message labeled "AI" in the interface to display the response. Here the model_response() function is called to generate the AI response based on the user input (user_query), the chat history (st.session_state.chat_history), and the specified model class (model_class).
 * The AI response is streamed with `st.write_stream`, which displays the response as it is generated and returns the complete text once the stream ends. The chat history is then printed to the console for debugging.
 * Finally, the AI response (resp) is converted to an AIMessage object and added to the chat history, `st.session_state.chat_history`. This keeps track of the messages generated by the AI and completes the interaction cycle in the chatbot.
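
As a side note on the condition `user_query is not None and user_query != ""`: for `None`, an empty string and a non-empty string, it gives the same result as Python's plain truthiness test. The small sketch below compares the two (`has_text` is a hypothetical helper name).

```python
def has_text(user_query):
    """Same check as in the app: something was submitted and it is not empty."""
    return user_query is not None and user_query != ""

for value in (None, "", "Hello"):
    # bool(value) is what a plain `if user_query:` would evaluate
    print(repr(value), has_text(value), bool(value))
```

So a shorter `if user_query:` would behave the same way here; the explicit version simply makes the two cases easier to read.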

In [None]:
user_query = st.chat_input("Enter your message here...")
if user_query is not None and user_query != "":
    st.session_state.chat_history.append(HumanMessage(content=user_query))

    with st.chat_message("Human"):
        st.markdown(user_query)

    with st.chat_message("AI"):
        resp = st.write_stream(model_response(user_query, st.session_state.chat_history, model_class))
        print(st.session_state.chat_history)

    st.session_state.chat_history.append(AIMessage(content=resp))

Now, we have all the code necessary for the functioning of our application's logic.

## Launching the Interface

In the next step, we just need to define some Streamlit settings and then gather all the code in a .py file, so that we can run it in Colab too.







#### - Streamlit Settings

Let's set up a few simple configs before we start our application. Streamlit supports a lot of other settings, but for now we'll keep it simple.

```
st.set_page_config(page_title="Your AI assistant 🤖", page_icon="🤖")
st.title("Your AI assistant 🤖")
```



* `st.set_page_config()` - This line defines the page configuration in Streamlit, specifying the browser tab title as the first parameter and the page icon (called a favicon) as the second. For the title, we initially leave it as "Your AI assistant 🤖" and the icon as a robot emoji.
* `st.title()` - This line defines the main title of the application interface, which will be displayed prominently at the top of the page. We will set the same title as the page in the browser tab.

#### - Creating the final script

Before we can start our application, we need to gather all of this code into a single .py script.

We use `%%writefile proj02.py` in Google Colab to save the code to a file called proj02.py. This is necessary because when working in Colab, we usually write code directly into notebook cells. However, to run a Streamlit application, the code needs to be in a Python (.py) file. So the %%writefile command allows the notebook cell to be saved as an external file.

Gathering the final code into a single .py file is essential for using Streamlit in Colab because the `!streamlit run proj02.py` command expects the application code to be in a file. This makes it easier to run the application because Streamlit can load and execute all of the code at once, without having to deal with multiple cells or scattered files. It is a convenient way to consolidate the code so that Streamlit works properly in the Colab environment.

Therefore, the block below was assembled by copying all the code we generated before (with the exception of the installation commands, which start with `!`)

In [6]:
%%writefile proj02.py

import streamlit as st
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import MessagesPlaceholder

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

import torch
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

from dotenv import load_dotenv

load_dotenv()

# Streamlit Settings
st.set_page_config(page_title="Your AI assistant 🤖", page_icon="🤖")
st.title("Your AI assistant 🤖")

model_class = "hf_hub" # @param ["hf_hub", "openai", "ollama"]

## Model Providers
def model_hf_hub(model="meta-llama/Meta-Llama-3-8B-Instruct", temperature=0.1):
  llm = HuggingFaceEndpoint(
      repo_id=model,
      temperature=temperature,
      max_new_tokens=512,
      return_full_text=False,
      #model_kwargs={
      #    "max_length": 64,
      #    #"stop": ["<|eot_id|>"],
      #}
  )
  return llm

def model_openai(model="gpt-4o-mini", temperature=0.1):
    llm = ChatOpenAI(
        model=model,
        temperature=temperature
        # other parameters...
    )
    return llm

def model_ollama(model="phi3", temperature=0.1):
    llm = ChatOllama(
        model=model,
        temperature=temperature,
    )
    return llm


def model_response(user_query, chat_history, model_class):

    ## Loading the LLM
    if model_class == "hf_hub":
        llm = model_hf_hub()
    elif model_class == "openai":
        llm = model_openai()
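    elif model_class == "ollama":
        llm = model_ollama()
    else:
        # Fail with a clear message instead of an undefined `llm` later on
        raise ValueError(f"Unknown model_class: {model_class!r}")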

    ## Prompt definition
    system_prompt = """
    You are a helpful assistant answering general questions. Please respond in {language}.
    """
    # corresponds to the language we want in our output
    language = "the same language the user is using to chat" #or, force to answer in a specific language e.g. english

    # Adapting to the pipeline (for open source model with hugging face)
    if model_class.startswith("hf"):
        user_prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    else:
        user_prompt = "{input}"

    prompt_template = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", user_prompt)
    ])

    ## Creating the Chain
    chain = prompt_template | llm | StrOutputParser()

    ## Response return / Stream
    return chain.stream({
        "chat_history": chat_history,
        "input": user_query,
        "language": language
    })


if "chat_history" not in st.session_state:
    st.session_state.chat_history = [
        AIMessage(content="Hi, I'm your virtual assistant! How can I help you?"),
    ]

for message in st.session_state.chat_history:
    if isinstance(message, AIMessage):
        with st.chat_message("AI"):
            st.write(message.content)
    elif isinstance(message, HumanMessage):
        with st.chat_message("Human"):
            st.write(message.content)

user_query = st.chat_input("Enter your message here...")
if user_query is not None and user_query != "":
    st.session_state.chat_history.append(HumanMessage(content=user_query))

    with st.chat_message("Human"):
        st.markdown(user_query)

    with st.chat_message("AI"):
        resp = st.write_stream(model_response(user_query, st.session_state.chat_history, model_class))
        print(st.session_state.chat_history)

    st.session_state.chat_history.append(AIMessage(content=resp))

Writing proj02.py


### Running Streamlit

With our script ready, simply execute the command below to run our application through Streamlit.
This will make the Streamlit application run in the background.

In [7]:
!streamlit run proj02.py &>/content/logs.txt &

Note:
* The `&` at the end allows Colab to continue executing other cells without waiting for the Streamlit application to finish.

* The `&>/content/logs.txt` part redirects the command's output to a file called `logs.txt`. We do this because Colab does not give us a terminal that updates in real time (at least in the free version), so the information Streamlit prints would otherwise not be visible.

* When running locally, `&>/content/logs.txt &` is not necessary.

> If you are running locally, just open the link that appears in the terminal (Local URL, or Network URL if you are on another device on the same network).

* For Colab, you need one more command to open our application (see below)



### Access with LocalTunnel

Before connecting with localtunnel, you need to get the external IP, which will be used as the password when launching the application in this next step.

There are two ways to do this:

1) with the command below

In [8]:
!wget -q -O - ipv4.icanhazip.com

35.236.216.236


2) Or, alternatively, do it this way:

* Open the Colab side panel
* Click on the logs.txt file, which contains what would normally be displayed in the terminal
* Copy the IP address from the External URL line: only the numbers with the dots, without the http:// or the port
 * For example: `35.184.1.10`
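
If you prefer to extract the IP programmatically, a regular expression can do it. This is only a sketch: `external_ip` is a hypothetical helper, and the sample text is modeled on what Streamlit prints at startup, which is an assumption since the exact layout may differ between versions.

```python
import re

def external_ip(log_text):
    """Return the IP from the 'External URL' line, or None if absent."""
    match = re.search(
        r"External URL:\s*https?://(\d{1,3}(?:\.\d{1,3}){3})", log_text
    )
    return match.group(1) if match else None

# Sample log text (assumed format, for illustration only)
sample_log = """
  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.184.1.10:8501
"""
print(external_ip(sample_log))
```

In Colab you could pass the contents of the log file instead of the sample, for example `external_ip(open("/content/logs.txt").read())`.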

Now, just run the command below.

This command uses npx localtunnel to "expose" the locally running Streamlit application to the internet. The application is hosted on port 8501, and localtunnel provides a public URL through which the application can be accessed.

Then, open the link that appears in the output and enter the IP in the Tunnel Password field. Click the button and wait for the interface to initialize.

In [9]:
!npx localtunnel --port 8501

your url is: https://fine-bees-move.loca.lt
^C


Note:
* If you get an error, reload the page and wait a few more moments.
* If you are using a method that is not via API, it is normal for it to take a little longer on the first run.
* If speed is a very important factor, we recommend using solutions where processing is done on an external server and accessed via API, such as Hugging Face, OpenAI or Groq

## Testing the application

> Just a suggestion of what to type to test if it can understand the context of the conversation (send the messages in this order, 1 per line)



```
What is the largest planet in the solar system?
and the smallest?
Thanks for the answers!
Can you read our entire conversation history?
What was the first question I asked?
and the second?
Generate Python code that writes the Fibonacci sequence
```



---
## Creating your own prompt
> **Bonus tip: Structure for creating your own prompt**

You can modify the prompt as you wish to suit your purpose. You can use this format:

* Introduction: Start with a brief introduction to the topic, defining the basic concept.

* Explanation: Provide a detailed but simple explanation of the concept. Use practical examples or analogies when necessary to facilitate understanding.

* Steps or Components: If the concept has several components or steps, list and explain each one concisely.

* Applications: Give examples of how this concept is applied in practice or in real contexts.

* Summary: Conclude with a summary of the main ideas presented.

* Additional Guidance: If relevant, offer additional tips or guidance for further exploration of the topic.

Relevant keywords to add to your prompt and inform how you want your answer to be:
* Clear, Objective, Simple, Practical example, Analogy, Detailed explanation, Summary
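
To make the structure above concrete, here is one possible way to assemble such a prompt in code. It is a sketch under assumptions: the function name `build_system_prompt`, its parameters and the example role text are all hypothetical, and you would adapt the wording to your own purpose.

```python
def build_system_prompt(role, sections, keywords):
    """Assemble a system prompt from a role, answer sections and style keywords."""
    lines = [f"You are {role}.", "Structure every answer as follows:"]
    # Number the sections so the model sees the expected order
    lines += [f"{i}. {name}" for i, name in enumerate(sections, start=1)]
    lines.append("Your answers should be: " + ", ".join(keywords) + ".")
    return "\n".join(lines)

prompt = build_system_prompt(
    "a patient tutor who explains technical concepts",
    ["Introduction", "Explanation", "Steps or Components", "Applications", "Summary"],
    ["clear", "objective", "with a practical example"],
)
print(prompt)
```

The resulting text could then be used as the `system_prompt` in `model_response`, keeping the rest of the chain unchanged.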

Other ideas:
* explain [x] to a layperson; explain it in an easy way, as if you were explaining it to a child; explain like I'm five ...

Going further:
* you can also look for prompt frameworks to make LLMs perform their intended role in the best way possible. For example, the [COSTAR](https://medium.com/@frugalzentennial/unlocking-the-power-of-costar-prompt-engineering-a-guide-and-example-on-converting-goals-into-dc5751ce9875) framework, which ensures that all key aspects that influence an LLM’s answer are considered, resulting in more personalized output responses.
* When the goal is to make the model play a specific role or act in a certain way, it is called role-playing, and research on this has grown a lot (such as [this paper](https://arxiv.org/abs/2406.00627)).



## Alternative to Streamlit

Creating our own application with Streamlit gives us a lot of freedom, mainly because building it "from scratch" lets us shape it exactly the way we want. But there are also ready-made options with the interface already built, which require no coding. Since our intention here is to work with the source code and not depend solely on a program or interface, we did not cover them. If you are interested in an alternative like this, we have some recommendations:

* Open WebUI - https://github.com/open-webui/open-webui
* GPT4All - https://gpt4all.io/index.html
* AnythingLLM - https://anythingllm.com

These solutions have several other interesting features and integrations, so it may be a good idea to check them out if you are interested in exploring more LLMs.