<img src="https://wdsa-ccwi2024.it/wp-content/uploads/2023/02/logo-2-175x127.png" alt="WDSA/CCWI Logo" width="150"/>

# Basics of Retrieval Augmented Generation (RAG) with HuggingFace and LangChain

## Introduction


This notebook demonstrates the basics of RAG and its use for improving the capabilities of Large Language Models (LLMs). RAG provides relevant and up-to-date context without the need for fine-tuning. RAG also reduces hallucinations, which makes LLMs considerably more usable in real-world applications. We will show the benefits of RAG by i) using it to "teach" an LLM about the *Water Network Tool for Resilience* (WNTR) toolbox, and ii) creating a quick application to chat with online blogs.

This notebook is partially based on [this HuggingFace cookbook recipe](https://huggingface.co/learn/cookbook/en/rag_zephyr_langchain) and [this LangChain guide](https://python.langchain.com/v0.1/docs/use_cases/question_answering/chat_history/).



**What is RAG?**

RAG is a popular approach to address the issue of LLMs not being aware of specific content because it is missing from their training data, or hallucinating even when they have seen it before. Such specific content may be proprietary, sensitive, or recent and frequently updated. If you have sufficient computing capabilities and your data is static, you may consider fine-tuning the LLM. In many cases, however, fine-tuning can be costly and, when done repeatedly (e.g. to address data drift), leads to "model shift", where the model's behavior changes in undesirable ways.

**RAG (Retrieval Augmented Generation)** does not require model fine-tuning. Instead, RAG works by providing an LLM with additional context that is retrieved from relevant data so that it can generate a better-informed response.

Here's a quick illustration of how it works:

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png" alt="RAG Diagram" width="600"/>


* The external data is converted into *embedding* vectors with a separate **embeddings model**. These models are smaller language models which translate parts of documents into vectors (a format AI can understand, similar to how tokens are transformed in an LLM). The vectors capture the *semantic meaning* of the text. These vectors are stored in a **vector database**. Embedding models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.

* When a **query** is made, the embeddings model converts this query into its embedding vector, which is then used to search the vector database for documents with similar embeddings. This retrieval process ensures that the most relevant documents related to the query are identified.

* The retrieved documents, which are similar to the query, are then combined with the original query to form a new, more informative **prompt**. This prompt provides the LLM with the necessary context to generate a response that is both accurate and relevant to the query.

* The LLM processes the combined prompt (query + similar documents) and generates a **response** that benefits from the augmented context provided by the retrieved documents.

* RAG also gives you the opportunity to swap your LLM for a more powerful one when it becomes available, or switch to a smaller distilled version, should you need faster inference.

This approach leverages the strengths of both *retrieval-based* and *generation-based* models, ensuring that responses are grounded in specific, relevant data while also being generated in a coherent and contextually appropriate manner.
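
The steps above can be sketched end to end in plain Python. This is only a toy under stated assumptions: the "embedding" is a bag-of-words count vector rather than a neural embeddings model, the "vector database" is a list, the three documents are made up for illustration, and no LLM is called; the sketch stops at building the augmented prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': word counts (a real embeddings model returns dense vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up documents, embedded once and stored in a toy "vector database"
documents = [
    "WNTR simulates the resilience of water distribution networks",
    "FAISS performs similarity search over dense vectors",
    "Bread is baked from flour, water, and yeast",
]
vector_db = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Embed the query and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(vector_db, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, k=1):
    """Combine the retrieved documents with the query into an augmented prompt."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What does WNTR simulate?")
print(prompt)
```

In a real pipeline the resulting prompt would be passed to the LLM, which generates the response.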



## Part 1: RAG with WNTR documentation

### Tools


We will implement a basic RAG pipeline to provide knowledge of the [Water Network Tool for Resilience (WNTR) toolbox](https://github.com/USEPA/WNTR) to an LLM. We will use [HuggingFace](https://huggingface.co/) to download and access the LLM and embeddings models. We will use [LangChain](https://www.langchain.com/) to build the RAG pipeline.

As you may know, **WNTR** is an open-source Python package developed by the U.S. EPA and Sandia National Laboratories to simulate and analyze the resilience of water distribution networks against disruptive events like natural disasters and infrastructure failures. It integrates hydraulic and water quality simulations, damage estimates, and resilience metrics, and it provides a Python interface to EPANET.

**HuggingFace** is a leading AI platform and community that provides open-source tools and resources for developing, sharing, and using AI models. It is best known for its *Transformers* library, which offers a vast collection of pre-trained models for various tasks like NLP, computer vision, and audio processing, simplifying the integration of sophisticated models into applications. The platform also features the Hugging Face Hub, where users can collaborate on models, datasets, and applications, fostering a collaborative environment for AI development.

**LangChain** is a framework that simplifies the development of applications powered by LLMs. It is made of modular components that handle tasks such as prompt management, document loading, preprocessing, and interfacing with vector databases, which together make it possible to build complex RAG pipelines.

### Install all required libraries

The following instructions will install all the necessary libraries for this tutorial. You do not need to pay attention to it, but you must run all these cells.

In [None]:
# installing libraries for the tutorial, these will be discussed later; feel free to check online if you need more info on each of them
!pip install --upgrade --quiet torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu langchain langchain-community langchain-huggingface GitPython

In [None]:
# Override the default locale setting to ensure UTF-8 encoding is used.
# The locale is a set of parameters that defines the user's language,
# country, and any special variant preferences, affecting how data is presented.
import locale
locale.getpreferredencoding = lambda: "UTF-8"

### LLM Model

In this notebook, we'll demonstrate RAG using "small" open LLMs. You can choose between HuggingFace's `Zephyr-7b` (7B parameters) or Microsoft's `Phi-3 Mini-128k-Instruct` (3.8B parameters). You can add other small models available on [HuggingFace](https://huggingface.co/models). Due to resource constraints, we suggest starting with Phi-3, a powerful model for its size. Even so, we will apply quantization techniques to fit the model on our hardware. The `128k` refers to the length of the context window in tokens.

The code below performs the setup for using a pretrained language model with quantization and tokenization. *Quantization* is a technique that reduces the precision of the model's weights, decreasing memory usage and increasing inference speed without significantly sacrificing accuracy. *Tokenization* is the process of converting text into *tokens*, breaking down the input into smaller units such as words or subwords. Tokens still need to be converted into embeddings that the model can understand and process.
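
To make the idea of quantization concrete, here is a minimal sketch of uniform "absmax" quantization to 4-bit signed integers. It is an illustration only: the `nf4` scheme used later in this notebook relies on a non-uniform set of levels and block-wise scaling, and the weight values below are made up.

```python
def quantize(weights, bits=4):
    """Map floats to signed integer codes using a single absmax scale."""
    levels = 2 ** (bits - 1) - 1                      # 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels or 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, -0.30, 0.12, 0.02]                   # made-up example weights
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

print(codes)      # small integers that fit in 4 bits
print(restored)   # close to, but not exactly, the original weights
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the price is a rounding error of at most half a quantization step per weight.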



To access HuggingFace's resources, we need to login and authenticate our account. You can do this with the simple lines of code below, and by pasting a "WRITE" token.

In [None]:
from huggingface_hub import login
login()

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Choose the model to be used. Uncomment the desired model_name line.
#model_name = 'HuggingFaceH4/zephyr-7b-beta'
model_name = 'microsoft/Phi-3-mini-128k-instruct'

# Configuration for loading the model with 4-bit quantization
# For more info check https://huggingface.co/docs/transformers/en/main_classes/quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model with the specified quantization configuration
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Load the tokenizer corresponding to the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Creating the LLM Chain

This code snippet demonstrates how to create a language model **chain** (`llm_chain`) using HuggingFace and LangChain. The process involves setting up a text generation pipeline with HuggingFace, integrating it with LangChain, and defining a prompt template to structure the input for the language model. The final `llm_chain` is constructed by piping the prompt template into the language model, allowing for streamlined input-output processing.

* A text generation pipeline is a component in HuggingFace that enables the generation of text based on a given prompt, using a specified model and tokenizer. It allows for customization of various parameters like temperature, repetition penalty, and maximum number of tokens to generate coherent and contextually relevant text. *Temperature* is a positive value (typically between 0 and 2) that controls the randomness of the text generation; lower values make the output more deterministic, while higher values increase diversity. For RAG we usually select very low values. *Repetition penalty* reduces the likelihood of repetitive sequences in the generated text, ensuring more varied and natural responses.

* On the other hand, LangChain pipelines, known as chains (hence the name) are created using the `|` operator, known as the *pipe* operator. This chains together different components (like prompt templates, language models, and output parsers) into a single, cohesive workflow. The pipe operator originates from Unix/Linux, where it is used to pass the output of one command as input to another, enabling the construction of complex command sequences. LangChain pipelines differ from HuggingFace pipelines as they enable the integration of various processing steps and components in a modular and extensible manner, facilitating complex input-output transformations and handling within a single framework.
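
The effect of temperature can be illustrated with the standard softmax formula, in which the model's scores (logits) are divided by the temperature before being turned into probabilities. The sketch below uses made-up logits for three candidate tokens and is not tied to any particular model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                        # made-up scores for three tokens
sharp = softmax_with_temperature(logits, 0.1)   # low temperature
flat = softmax_with_temperature(logits, 2.0)    # high temperature

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

With a low temperature almost all probability mass lands on the highest-scoring token, so sampling becomes nearly deterministic; with a high temperature the distribution flattens and the output becomes more varied.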

First, we create the HuggingFace pipeline...

In [None]:
from langchain_huggingface import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline

# Create a text generation pipeline using the specified model and tokenizer
text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.1,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

# Wrap the pipeline in a Langchain HuggingFacePipeline object
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

... next, the code below creates a LangChain pipeline from the HuggingFace one. You can see that we are using a structured prompt template with *placeholders* and *roles*.

* Placeholders like `{context}` and `{question}` in the template will be replaced with actual values when used. `{context}` provides the background information to the model to help generate a relevant answer. `{question}` is the specific query the user wants the model to answer. [These placeholders](https://www.w3schools.com/python/ref_string_format.asp) are general Python features, not specific to LLMs, allowing for flexible and dynamic input handling.

* As for the roles, `system` usually defines the structure and rules of the interaction (e.g., prompt template); `user` is the entity asking the question, providing `context` and `question`; `assistant` is the AI model generating the response based on the provided context and question.

We prefer our prompts to be structured to ensure the model receives all necessary information in a consistent format, improving the accuracy and relevance of its responses.
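
As a quick illustration of how placeholders are filled, the sketch below uses Python's built-in `str.format` on a shortened template. The template text and the example values are assumptions made for this illustration; they are not the exact prompt used in this notebook.

```python
# A shortened template with the same two placeholders and role markers
template = (
    "<|system|>\nUse the following context to help:\n{context}\n"
    "<|user|>\n{question}\n"
    "<|assistant|>\n"
)

# Placeholders are replaced by the values passed as keyword arguments
filled = template.format(
    context="WNTR is a Python package.",
    question="What is WNTR?",
)
print(filled)
```

The filled prompt ends with the assistant role marker, which signals to the model where its own response should begin.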

In [None]:
# Define the template for the prompt, including placeholders for context and question
prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """
# Create a PromptTemplate object with specified input variables and template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Combine the prompt template and the language model into a single chain
llm_chain = prompt | llm
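
To see how the `|` operator can chain components, here is a minimal sketch that overloads Python's `__or__` method. The `Step` class is a hypothetical stand-in written for this illustration; it is not LangChain's implementation, and the "LLM" is a stub that appends a fixed string.

```python
class Step:
    """Hypothetical stand-in for a chainable component."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

# Two toy steps: fill a prompt, then "generate" from it
fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}\nAnswer:")
fake_llm = Step(lambda prompt: prompt + " (model output would go here)")

chain = fill_prompt | fake_llm
result = chain.invoke({"context": "", "question": "What is WNTR?"})
print(result)
```

As with the real chain, invoking the composed object runs each step in order, passing the output of one step as the input of the next.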

### Testing the LLM Chain

Now it is time to test our LLM chain. We ask two simple questions on WNTR to verify what the LLM already knows about the topic... it does not seem to know a lot. We also time the responses to check how long it takes to produce them, given the constrained computational resources of Google Colab. Lastly, we use the "pretty print" module `pprint` to better visualize the answers.

In [None]:
question = "What does WNTR stand for in the context of water distribution systems analysis?"

In [None]:
import time

start_time = time.time()
llm_response = llm_chain.invoke({"context":"", "question": question})
end_time = time.time()

execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")

In [None]:
import pprint
pprint.pprint(llm_response)

In [None]:
question = "What can you tell me about the Pressure Dependent Demand model of EPANET?"
start_time = time.time()
llm_response = llm_chain.invoke({"context":"", "question": question})
end_time = time.time()

execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
pprint.pprint(llm_response)

The output above demonstrates how we get both the input and the output when querying our LangChain pipeline. In this example, we can see that the `context` is empty because we only provided a question. The answer appears after the assistant role identifier, demarcating where the model's response begins.

It is evident that some post-processing is necessary to extract only the answer, similar to what users expect from tools like ChatGPT. However, ChatGPT is more than just a language model; it is a refined application. Using LangChain and other tools, we can create similar applications, but this is beyond the scope of this basic tutorial.



### Load the data for RAG


We use the `GitLoader` from the `langchain_community.document_loaders` module to clone the `WNTR` GitHub repository locally and extract all the `.rst` files from a specified directory within the repository. `.rst` files are [reStructuredText](https://en.wikipedia.org/wiki/ReStructuredText) files, primarily used for documentation in Python packages. They are readable by humans and can be converted to various formats like HTML and PDF.
The `file_to_load` function is used to filter and ensure that only the `.rst` files located directly in the `./WNTR/documentation` directory are loaded, excluding any subdirectories. One great thing about LangChain is that the developers and the community provide [many loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/) for different types of sources, including online websites and PDFs.





In [None]:
from langchain_community.document_loaders import GitLoader

In [None]:
import os

def file_to_load(file_path, base_directory='./WNTR/documentation'):
    """
    Returns True if the file is a .rst file located directly in the
    specified base directory (files in subfolders are excluded).

    Returns:
    bool: True if the file should be loaded, False otherwise.
    """
    # Check if the file is in the specified base directory and ends with .rst
    if os.path.dirname(file_path) == base_directory:
        return file_path.endswith('.rst')
    return False
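
A quick sanity check of the filter logic is sketched below. The function is repeated so that the snippet runs on its own, and the three file paths are hypothetical examples rather than actual files in the repository.

```python
import os

def file_to_load(file_path, base_directory='./WNTR/documentation'):
    """Same logic as above: keep .rst files directly inside base_directory."""
    if os.path.dirname(file_path) == base_directory:
        return file_path.endswith('.rst')
    return False

# Hypothetical paths used only to exercise the filter
candidates = [
    './WNTR/documentation/example.rst',          # .rst in the base directory
    './WNTR/documentation/subfolder/other.rst',  # .rst, but inside a subfolder
    './WNTR/documentation/example.png',          # right folder, wrong extension
]
kept = [path for path in candidates if file_to_load(path)]
print(kept)
```

Only the first path passes the filter, because the second lives in a subfolder and the third is not a `.rst` file.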

In [None]:
loader = GitLoader(
    clone_url="https://github.com/reganmurray/WNTR/",
    repo_path="./WNTR/",
    branch="master",
    file_filter=file_to_load,
)

In [None]:
# show the downloaded folder
!ls

In [None]:
# load the data
docs = loader.load()
print(len(docs))

In [None]:
# let's show the acknowledgements document
pprint.pprint(docs[0].page_content)

### Document Chunking, Embeddings, and Vector Database Creation


When doing RAG with large documents, it's essential to chunk them into smaller pieces. This is necessary because many embedding models and vector databases have limitations on the maximum input size they can handle. Chunking ensures that each segment of the document can be processed efficiently and that the embeddings capture the relevant context without exceeding these size limitations.
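
The idea of chunking with overlap can be sketched with a fixed-size sliding window over characters. This is a simplification: the `RecursiveCharacterTextSplitter` used in this notebook also tries to split on natural separators such as paragraphs and sentences, which this toy version does not do.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so information that straddles a chunk boundary is less likely to be lost.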

To create document chunk embeddings, we'll use the `HuggingFaceEmbeddings` class and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. There are many other embedding models available on the Hub, and you can keep an eye on the best-performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

*Vector databases* store numerical representations (i.e., the embeddings) of data like text and images. They create an index to quickly find similar vectors to a query vector. This is done using special algorithms and data structures. Vector databases help find the most similar vectors to a query by measuring *distances* like cosine similarity or Euclidean distance in multidimensional spaces.
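
The two distance measures mentioned above can be compared with a brute-force search over a tiny, made-up index of two-dimensional vectors. Real vector databases hold vectors with hundreds of dimensions and use specialized index structures instead of scanning every entry.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.dist(a, b)

# Made-up 2D "embeddings" for three items
index = {"pipes": [1.0, 0.0], "pumps": [0.8, 0.6], "bread": [0.0, 1.0]}
query = [0.9, 0.1]

# Highest similarity and lowest distance both identify the nearest neighbour
by_cosine = max(index, key=lambda name: cosine_similarity(query, index[name]))
by_euclid = min(index, key=lambda name: euclidean_distance(query, index[name]))
print(by_cosine, by_euclid)
```

Note the opposite conventions: a similarity is maximized, while a distance is minimized.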

For the vector database, we will use `FAISS`, a popular library developed by Facebook AI that offers efficient similarity search and clustering of dense vectors. We will access both the embeddings model and FAISS via the LangChain API.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking documents into smaller pieces to ensure they fit the input size limits of the embedding model
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=32)

chunked_docs = splitter.split_documents(docs)

In [None]:
# len(chunked_docs)>len(docs)
print(len(chunked_docs))

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Using the HuggingFaceEmbeddings with FAISS to create a vector database from the chunked document embeddings
db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))

We need a way to return (retrieve) the documents given a query. For that, we'll use the `as_retriever` method using the `db` as a backbone:
- `search_type="similarity"` means we want to perform similarity search between the query and documents
- `search_kwargs={'k': 5}` instructs the retriever to return top 5 results.

There are other types of strategies for searching, such as [Maximal Marginal Relevance](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/mmr/) which aims to return results that are both relevant and diverse. Sometimes, you might want to retrieve very contrasting chunks to cover a broader spectrum of information related to the query. This can be particularly useful when you want to avoid redundancy and ensure diverse perspectives in the results.
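
A minimal sketch of the Maximal Marginal Relevance idea follows. At each step it picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. The two-dimensional vectors and the angles used to build them are made up for illustration.

```python
import math

def unit(degrees):
    """Made-up 2D embedding: a unit vector at the given angle."""
    return [math.cos(math.radians(degrees)), math.sin(math.radians(degrees))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query, candidates, k, lambda_mult=0.5):
    """Greedy MMR: trade off relevance to the query against similarity to earlier picks."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = unit(20)
candidates = [unit(15), unit(10), unit(60)]   # the first two are near-duplicates

top_similarity = sorted(
    range(len(candidates)), key=lambda i: cosine(query, candidates[i]), reverse=True
)[:2]
top_mmr = mmr(query, candidates, k=2)
print(top_similarity, top_mmr)
```

Plain similarity search returns the two near-duplicate candidates, whereas MMR keeps the best match and then prefers the more distinct third candidate. In LangChain, this strategy is selected by passing `search_type="mmr"` to `as_retriever`.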


In [None]:
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 5}
)

The vector database and retriever are now set up. Now, we can finalize our RAG chain by combining the `llm_chain` with the retriever.

### Creating the RAG Chain

Now that we have created the retriever from the vector database, we can use it in a chain with the language model to generate responses based on retrieved contexts. We will build the chain using LangChain's `RunnablePassthrough` class, which simply passes the input (i.e., the question) through without any changes. It is used here to structure the input before passing it to the language model chain `llm_chain` we created before. Note that the `context` field is now filled by the retriever, which will fetch relevant documents based on the input query.
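
In LangChain, a dictionary placed at the head of a chain is treated as a parallel step: every entry receives the same input, and the results are collected under the corresponding keys. The plain-Python sketch below mimics that behavior; `toy_retriever`, `passthrough`, and `run_parallel` are hypothetical helpers written for this illustration.

```python
def toy_retriever(question):
    """Stub retriever: returns canned text instead of searching a vector store."""
    return [f"(document relevant to: {question})"]

def passthrough(value):
    """Return the input unchanged, like RunnablePassthrough."""
    return value

def run_parallel(mapping, value):
    """Call every entry of the mapping on the same input and collect the results."""
    return {key: func(value) for key, func in mapping.items()}

inputs = run_parallel(
    {"context": toy_retriever, "question": passthrough},
    "What is WNTR?",
)
print(inputs)
```

The resulting dictionary has exactly the two keys expected by the prompt template, `context` and `question`.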



In [None]:
from langchain_core.runnables import RunnablePassthrough

# Set up the RAG chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)


### Testing the RAG chain

Let's see the difference RAG makes in generating answers to the library-specific questions.

In [None]:
question = "What does WNTR stand for in the context of water distribution systems analysis?"
rag_response = rag_chain.invoke(question)
pprint.pprint(rag_response)

If you scroll to the bottom of the printed text above, you will see that the LLM now returns an accurate answer! At the top, you will also see the document chunks retrieved by the RAG chain to produce that answer; these ground the LLM's response and help curb hallucinations. However, with all this retrieved information in the output, it is hard to find what we actually want, i.e., the answer itself.

### Parsing the output of the LLM

Indeed, *parsing* the output of an LLM is crucial because the raw output is often not immediately usable. With LangChain, we can implement various parsers to clean and structure this output. Here, we create a simple parser that removes everything except the assistant's response and we add it at the end of the chain. We do so by creating a new subclass from the `BaseOutputParser` Class of LangChain. You can find a quick recap of these basic Object Oriented Programming concepts in Python [here](https://www.geeksforgeeks.org/create-a-python-subclass/).


In [None]:
from langchain_core.output_parsers import BaseOutputParser

class ReturnOnlyAssistantText(BaseOutputParser):
    def parse(self, text: str):
        # Find the position of the substring
        index = text.find('<|assistant|>')
        if index == -1:
            # If the substring is not found, return an empty string or handle it as needed
            return ""
        # Return the text after the substring
        return text[index + len('<|assistant|>'):]

# Set up the RAG chain with parser
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
    | ReturnOnlyAssistantText()
)

In [None]:
question = "What can you tell me about the Pressure Dependent Demand model of EPANET?"
rag_response = rag_chain.invoke(question)
pprint.pprint(rag_response)

As you can see, even adding a simple parser makes the output of the LLM more usable. This is important in real applications to ensure consistency, accuracy, and relevance in the data being processed and utilized by downstream systems.

## Part 2: Conversational RAG

By now you should be familiar with RAG: the approach uses a retriever to fetch relevant documents based on a query and a generator to produce a natural language response using the retrieved documents. This method ensures that the LLM responses are both accurate and contextually relevant.

Now we are going to look into *conversational RAG*, which maintains context over multiple interactions, making it more interesting and effective than single-shot retrieval. This approach enables chat systems to interact more naturally, much like you would with a human, and is fundamental in creating intelligent assistants. Most readers will already be familiar with this concept from interfaces like *ChatGPT*, *Bing*, or *Gemini*. While we will keep it simple here, we will demonstrate the basics of building such a system using LangChain.

Conversational RAG is illustrated by the following flowchart:

<img src="https://python.langchain.com/v0.1/assets/images/conversational_retrieval_chain-5c7a96abe29e582bc575a0a0d63f86b0.png" alt="RAG Diagram" width="1000"/>

The system is divided into two main components: the `history aware retriever` and the `question answer chain`.

The process starts with the user's input query, which is combined with the chat history to maintain context. This ensures that the conversation remains coherent and relevant to previous interactions. For the history aware retriever, a prompt is created to contextualize the query. This prompt is processed by the LLM to produce a contextualized query, which is more specific and tailored to the ongoing conversation.

In the subsequent question answer chain, the contextualized query is then used to fetch relevant documents from the document store. The retrieved documents are then used to create a prompt that aims to answer the user's question. This new prompt is again processed by the LLM, which generates the final answer. The answer is then returned to the user, completing the cycle.

We are essentially using two chains, linked together, each of them making use of an LLM. The first chain contextualizes the query, and the second chain generates the final answer based on the retrieved documents.
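
The two-stage flow can be sketched with stubs in place of the LLM calls. Everything below is a toy under stated assumptions: `contextualize` simply appends the previous user question instead of asking an LLM to rewrite it, `retrieve` counts shared words instead of comparing embeddings, and the two documents and the chat history are made up.

```python
import re

def tokens(text):
    """Lowercased set of words, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def contextualize(chat_history, question):
    """Stage 1 stub: make the question standalone using the previous user question."""
    previous = [text for role, text in chat_history if role == "human"]
    return f"{question} (earlier question: {previous[-1]})" if previous else question

def retrieve(query, store):
    """Return the document sharing the most words with the query."""
    return max(store, key=lambda doc: len(tokens(query) & tokens(doc)))

def answer(context, question):
    """Stage 2 stub: a real system would prompt the LLM with context and question."""
    return f"[context: {context}] reply to: {question}"

store = [
    "Bread needs flour and yeast",
    "IUWM integrates water supply, wastewater, and stormwater planning",
]
chat_history = [("human", "What is IUWM?"), ("ai", "Integrated Urban Water Management.")]
follow_up = "How does it help cities?"

naive_doc = retrieve(follow_up, store)               # the raw follow-up has no useful keywords
standalone = contextualize(chat_history, follow_up)  # first chain
grounded_doc = retrieve(standalone, store)           # retrieval with the standalone question
reply = answer(grounded_doc, follow_up)              # second chain
print(naive_doc)
print(grounded_doc)
```

The follow-up question alone gives the retriever nothing to work with, so it falls back to an irrelevant document; once the question is made standalone, the relevant document is found.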

To illustrate how to build a basic system with LangChain, we will "chat" with the LLM about the content of a [blog](https://blog.dhigroup.com/best-practices-to-address-aging-urban-water-infrastructure/) from DHI on best practices to address aging urban water infrastructure. This example will also showcase how easy it is to access data from websites in LangChain using the `WebBaseLoader`.


> Disclaimer: We are using a pretty basic LLM, which is also quantized. Therefore, performance might not be as good as expected. For instance, if you are familiar with *ChatGPT*, you might notice differences in response quality. In general, simple models can perform well in these contexts too, but they require more sophisticated engineering to achieve good performance. This is beyond the scope of this course.


### Get the documents from the web and create the retriever

In [None]:
from langchain_community.document_loaders import WebBaseLoader

In [None]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://blog.dhigroup.com/best-practices-to-address-aging-urban-water-infrastructure/",)
)
docs = loader.load()

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
db = FAISS.from_documents(chunks,
                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))

# Retrieve and generate using the relevant snippets of the blog.
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 3})


### Test the LLM

First, we check the basic LLM response with respect to a specific aspect of the blog. We can see that the answer is pretty generic.

In [None]:
llm_response = llm_chain.invoke({"context":"", "question": "What does creating a holistic plan entail for municipalities?"})
pprint.pprint(llm_response)

### Creation of the conversational RAG pipeline

Now we create our conversational RAG pipeline.

First, we create the `history_aware_retriever`. We define a system prompt to contextualize the user's query, instructing the model to reformulate the latest user question into a standalone question that can be understood without referring to the chat history.

While it may appear counterintuitive to create standalone questions when chat history is available, this process improves clarity, reduces ambiguity, and enhances retrieval accuracy, ensuring the system fetches relevant information more effectively and generates more precise responses.

A `ChatPromptTemplate` is created using this system prompt, along with placeholders for the chat history (i.e., `MessagesPlaceholder`) and the user input. Finally, the `create_history_aware_retriever` function combines the LLM, the retriever, and the contextualized query prompt to create a retriever that maintains context over multiple interactions. This setup allows the system to handle user queries more effectively by considering the conversation history.


In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

# Define the system prompt to contextualize the query.
contextualize_q_system_prompt = (
    """Given a chat history and the latest human question \
    which might reference context in the chat history, \
    formulate a standalone question which can be understood \
    without the chat history. Do NOT answer the question \
    just reformulate the human question if you can \
    and otherwise return it as it is."""
)

# Create a ChatPromptTemplate using the system prompt, chat history, and user input
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# Create a history-aware retriever using the LLM, retriever, and contextualize query prompt
# This combines the contextualization step with the retrieval process
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

Now we create the `question_answer_chain`. To do so, we define a system prompt to instruct the assistant on how to answer questions concisely using retrieved context. A different `ChatPromptTemplate` is created with this system prompt, incorporating placeholders for the chat history and user input. This template ensures that the model uses relevant context and maintains conversation continuity.


In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Define the system prompt for the question-answering task
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.

<START CONTEXT>:{context}<END CONTEXT>"""

# Create a ChatPromptTemplate using the QA system prompt, chat history, and user input
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# Create a chain that combines documents using the specified LLM and QA prompt
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)


Finally, the `rag_chain` is formed by combining the `history_aware_retriever` and the `question_answer_chain`. This setup enables the system to effectively retrieve relevant documents and generate contextually aware answers to user queries.

In [None]:
# Create the retrieval chain by combining the history-aware retriever and the QA chain
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

### Test the chat

We can finally test the chat! While creating a full-fledged conversational user interface is beyond the scope of this tutorial, we can simulate the conversation using the `AIMessage` and `HumanMessage` classes from LangChain, as illustrated below. As you can see, the three questions build on each other and on the responses provided by the LLM at each stage.

To focus on the Q/A pairs without printing the context and other details, we define and use a function `extract_answers` to extract and display only the questions and answers.

> Note that the model, being simple and quantized, might have the tendency to hallucinate, potentially asking new intermediate questions that were not actually posed by the user.


In [None]:
from langchain_core.messages import AIMessage, HumanMessage

# Initialize an empty chat history
chat_history = []

# First question from the user
question = "What does IUWM stand for?"

# Invoke the RAG chain to get the AI response for the first question
ai_msg_1 = rag_chain.invoke({"input": question, "chat_history": chat_history})

# Update chat history with the human question and AI response
chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=ai_msg_1["answer"]),
    ]
)

# Second question
second_question = "How does data collection and analysis improve it?"
ai_msg_2 = rag_chain.invoke({"input": second_question, "chat_history": chat_history})

chat_history.extend(
    [
        HumanMessage(content=second_question),
        AIMessage(content=ai_msg_2["answer"]),
    ]
)

# Third question
third_question = "How can IUWM plans address financial challenges and integrate green infrastructure options?"
ai_msg_3 = rag_chain.invoke({"input": third_question, "chat_history": chat_history})

# print(ai_msg_3["answer"])
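
The invoke-then-record pattern repeated for each question can be wrapped in a small helper so that the question and its answer are always stored together. In the sketch below, `ask` is a hypothetical helper and `EchoChain` is a stand-in for `rag_chain` that only mimics its `invoke` interface; plain `(role, text)` tuples are used in place of `HumanMessage` and `AIMessage` to keep the snippet self-contained.

```python
class EchoChain:
    """Hypothetical stand-in for rag_chain: returns a dict with an "answer" key."""

    def invoke(self, inputs):
        turn = len(inputs["chat_history"]) // 2 + 1
        return {"answer": f"answer #{turn} to: {inputs['input']}"}

def ask(chain, question, chat_history):
    """Invoke the chain, then record both sides of the turn in chat_history."""
    response = chain.invoke({"input": question, "chat_history": chat_history})
    chat_history.extend([("human", question), ("ai", response["answer"])])
    return response["answer"]

history = []
questions = [
    "What does IUWM stand for?",
    "How does data collection and analysis improve it?",
]
for q in questions:
    print(ask(EchoChain(), q, history))
```

Because the same `question` variable is used for both the call and the history update, the helper removes the risk of recording the wrong question for a given answer.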

In [None]:
# this simple function only returns the Q/A and not the context
def extract_answers(raw_answer):
    marker = "<END CONTEXT>"
    if marker in raw_answer:
        return raw_answer.split(marker)[-1]
    else:
        return ""

# Example usage
result = extract_answers(ai_msg_3["answer"])
print(result)