In this tutorial, we will build a Conversational AI System that can answer questions by retrieving the answers from a document.

This whole tutorial is divided into 2 parts:
- Part 1: Indexing and Storing the Data
    - Step 1: Install all the dependencies (Python and Database)
        - Python Dependencies
        - Database Dependencies
        - PGVector Extension
    - Step 2: Load the LLMs and Embeddings Model
        - LLMs
        - Embeddings Model
    - Step 3: Setup the database
    - Step 4: Index the documents
        - Download the data
        - Load the data
        - Index and store the embeddings
- Part 2: Querying the indexed data using LLMs
    - Step 1: Load the LLMs and Embeddings Model
        - LLMs
        - Embeddings Model
    - Step 2: Load the index from the database
    - Step 3: Setup the query engine
    - Step 4: RAG pipeline in action
    - Step 5: Using gradio to build a UI

In this notebook, we will only cover the second part. The first part is covered in the notebook `1. Indexing and Storing the Embeddings in PGVector.ipynb`.

# Part 2: Querying the indexed data using LLMs

## Step 1: Load the LLMs and Embeddings Model

Similar to the first part, we will load the LLMs and Embeddings Model. For more details, refer to the first part.

### PHI-2 as an LLM (or SLM?)

In the first part, we discussed all the components of `PHI-2`. Refer to the first part if you want to know more about the `PHI-2` model: its performance, how it works, how to load it, and so on. In this part, we will only load the model, with a few additional parameters that we will discuss below.

As in the first part, we will define our model name. We also list additional models to compare with the `PHI-2` model; the cell below selects `Mistral-7B-Instruct`.

In [1]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# model_name = "mistralai/Mistral-7B-v0.1"
# model_name = "microsoft/phi-2"

There are 2 additional parameters that we will pass to the model:
- `system_prompt`: The system prompt is a set of instructions or contextual information provided to the LLM to guide its behavior and responses.
- `query_wrapper_prompt`: The query wrapper prompt is a template used to frame or structure the user query before it is sent to the language model for processing.
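
To make the two prompts concrete, here is a minimal sketch of how a wrapper template and a system prompt could be combined into the final text the model sees. The `build_prompt` helper is hypothetical and uses plain string formatting; the way llama-index assembles prompts internally may differ.

```python
# Hypothetical helper for illustration only; not part of llama-index.
SYSTEM_PROMPT = "You are a Q&A assistant."
QUERY_WRAPPER = "<|USER|>{query_str}<|ASSISTANT|>"


def build_prompt(query, system_prompt=SYSTEM_PROMPT, wrapper=QUERY_WRAPPER):
    """Wrap the user query in the template, then prepend the system prompt."""
    wrapped = wrapper.format(query_str=query)
    return f"{system_prompt}\n{wrapped}"


print(build_prompt("What is this document about?"))
```

The system prompt stays the same for every request, while the wrapper is filled in again for each new query.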

In [2]:
import torch
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index.llms import HuggingFaceLLM

# Context Window specifies how many tokens to use as context for the LLM
context_window = 2048
# Max New Tokens specifies how many new tokens to generate for the LLM
max_new_tokens = 256
# Device specifies which device to use for the LLM
device = "cuda"

# This is the prompt that will be used to instruct the model behavior
system_prompt = "You are a Q&A assistant. Your goal is to answer questions as accurately as possible based on the instructions and context provided."

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

# Create the LLM using the HuggingFaceLLM class
llm = HuggingFaceLLM(
    context_window=context_window,
    max_new_tokens=max_new_tokens,
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map=device,
    # Load the weights in bfloat16 to reduce GPU memory usage
    model_kwargs={"torch_dtype": torch.bfloat16},
)
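
The two size parameters interact: tokens reserved for generation are not available for the prompt. Below is a rough back-of-the-envelope check, assuming the answer budget is simply subtracted from the context window (the exact accounting inside llama-index may differ).

```python
context_window = 2048  # total tokens the model works with per call
max_new_tokens = 256   # tokens reserved for the generated answer

# Tokens left for the system prompt, the retrieved context, and the user query
prompt_budget = context_window - max_new_tokens
print(prompt_budget)  # 1792
```

An answer that needs more than `max_new_tokens` tokens is cut off at that limit.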


### BGE as an Embeddings Model

We load the `BGE` model in exactly the same way as we did in the first part. Refer to the first part if you want to know more about the `BGE` model.

Here, the embedding model is used to embed the user query. We then use this embedding to run a similarity search over the indexed documents, which returns the most relevant documents.
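
As a rough illustration of similarity search, the sketch below ranks a few toy vectors by cosine similarity to a query vector. Cosine similarity is one common choice; treat it as an assumption here, since the metric actually used depends on how the vector store is configured. The vectors and document names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; the BGE model above produces 1024 dimensions
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partial overlap with the query
}

# Sort document names from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```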

In [3]:
embedding_model_name = "BAAI/bge-large-en-v1.5"

In [4]:
from langchain.embeddings.huggingface import HuggingFaceBgeEmbeddings
from llama_index.embeddings import LangchainEmbedding

# Create the embedding model using the HuggingFaceBgeEmbeddings class
embed_model = LangchainEmbedding(
  HuggingFaceBgeEmbeddings(model_name=embedding_model_name)
)

# Get the embedding dimension of the model by doing a forward pass with a dummy input
embed_dim = len(embed_model.get_text_embedding("Hello world")) # 1024

## Step 2: Load the index from the database

### Setting up the database configuration

Similar to the first part, we will set up the database configuration. Refer to the first part if you want to know more about the database configuration.

In [5]:
connection_string = "postgresql://postgres:test123@localhost:5432"
db_name = "ragdb"
table_name = 'embeddings'

### Setting up the Service context

Similar to the first part, we will set up the service context. Refer to the first part if you want to know more about the service context.

In [6]:
from llama_index import ServiceContext
from llama_index import set_global_service_context

# Set the chunk size and overlap that controls how the documents are chunked
chunk_size = 1024
chunk_overlap = 32

# Create the service context
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

# Set the global service context
set_global_service_context(service_context)
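
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that splits a token list into overlapping windows. The `chunk_tokens` function is hypothetical; the splitter llama-index actually uses is more elaborate (for example, it may respect sentence boundaries), so treat this only as an illustration of the two parameters.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers so the overlap is easy to see
chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```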

### Loading the Indexed data from the database

Similar to the first part, we will create a `vector_store` using the already indexed data in the database from the first part. Refer to the first part if you want to know more about how to create a `vector_store` from the documents.

In [7]:
from sqlalchemy import make_url
from llama_index.vector_stores import PGVectorStore

# Creates a URL object from the connection string
url = make_url(connection_string)

# Create the vector store
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=table_name,
    embed_dim=embed_dim,
)
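
To see which pieces are pulled out of the connection string, the sketch below parses the same string with the standard library's `urllib.parse.urlsplit`. This is only a stand-in for illustration: `make_url` returns SQLAlchemy's own URL object, whose attribute names differ slightly (for example `host` instead of `hostname`).

```python
from urllib.parse import urlsplit

connection_string = "postgresql://postgres:test123@localhost:5432"
parts = urlsplit(connection_string)

print(parts.username)  # postgres
print(parts.password)  # test123
print(parts.hostname)  # localhost
print(parts.port)      # 5432
```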

Now, we will use the `vector_store` to load the indexed data from the database.

In the first part, we used the `from_documents` method of the `VectorStoreIndex` class to create the index and save it in the database. Now, we will use the `from_vector_store` method of the same class to load the already indexed and saved data from the database.

In [8]:
from llama_index import VectorStoreIndex

# Load the index from the vector store of the database
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

## Step 3: Setup the query engine

Now that we have loaded the indexed data from the database and have the LLMs and Embeddings Model ready, we will set up the query engine.

The query engine is the component responsible for querying the indexed data using the LLMs and Embeddings Model.

In essence, the query engine first embeds the user query using the Embeddings Model. Then, it uses this embedding to retrieve the most relevant documents from the indexed data. Once it has retrieved the most relevant documents, it uses them as context. Finally, this context, along with the user query, is passed to the LLMs to generate the answer.
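
The flow described above can be sketched in a few lines of plain Python. Everything here is a toy stand-in: `answer_query`, `toy_embed`, `echo_llm`, and the prompt layout are hypothetical and only mirror the order of the steps, not the llama-index implementation.

```python
def answer_query(query, chunks, embed, llm, top_k=2):
    """Embed the query, pick the top_k most similar chunks, then ask the LLM."""
    query_vec = embed(query)

    def score(chunk):
        # Dot product as a simple similarity measure
        return sum(q * c for q, c in zip(query_vec, embed(chunk)))

    context = sorted(chunks, key=score, reverse=True)[:top_k]
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)


# Toy stand-ins so the sketch runs without any model
VOCAB = ["leonard", "motel", "tattoo"]


def toy_embed(text):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def echo_llm(prompt):
    # A real LLM would generate an answer; here we return the prompt unchanged
    return prompt


chunks = ["Leonard checks into a motel", "A tattoo on his arm", "Unrelated text"]
result = answer_query("where is the motel", chunks, toy_embed, echo_llm, top_k=1)
print(result)
```

Because `echo_llm` returns its input, the printed result is the assembled prompt, which makes it easy to check which chunk was selected as context.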

Now, we will set up the query engine. Creating a query engine is a 2-step process:

- Step 1: Create a `retriever` using `VectorIndexRetriever`. The retriever is responsible for retrieving the most relevant documents from the indexed data, which it accesses through `index`. We set `similarity_top_k` to 2, which means that we want to retrieve the 2 most relevant documents from the indexed data. You can play around with this parameter to see how it affects the performance of the query engine.

- Step 2: Create a `response_synthesizer` using `get_response_synthesizer`. Here we set `response_mode` to `refine`. There are several options available for `response_mode`; follow this [link](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/root.html#configuring-the-response-mode) to know more about them. The `refine` option builds the answer by going through the retrieved chunks one at a time, which provides us with more refined answers. You can play around with this parameter to see how it affects the performance of the query engine.

We finally combine the `retriever` and `response_synthesizer` to create a `query_engine` using `RetrieverQueryEngine`.
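
To give an intuition for the `refine` response mode, here is a simplified sketch of a refine loop: the first chunk produces an initial answer, and each further chunk is used to revise it. The `refine_answer` function and its prompt wording are hypothetical; the prompts llama-index actually uses are different.

```python
def refine_answer(query, chunks, llm):
    """Build an answer from the first chunk, then revise it with each later chunk."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Context: {chunk}\nQuestion: {query}\nAnswer:"
        else:
            prompt = (
                f"Question: {query}\nExisting answer: {answer}\n"
                f"New context: {chunk}\n"
                "Refine the existing answer if the new context helps:"
            )
        answer = llm(prompt)
    return answer


# Stub LLM that records every prompt it receives
calls = []


def counting_llm(prompt):
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer("who is Leonard?", ["chunk one", "chunk two"], counting_llm)
print(final, len(calls))  # answer v2 2
```

With `similarity_top_k=2`, a loop like this calls the LLM twice per query, which would be consistent with the two `pad_token_id` log lines that appear after each query in Step 4.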

There are other ways as well to create a query engine. The simplest way to create a query engine is to use `index.as_query_engine()` method. This method will create a query engine with default parameters. But, if you want to customize the query engine, you can use the method shown above.

In [9]:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import VectorIndexRetriever
from llama_index.response_synthesizers import get_response_synthesizer

# Create the retriever that manages the index and the number of results to return
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# Create the response synthesizer that will be used to synthesize the response
response_synthesizer = get_response_synthesizer(
    response_mode='refine',
)

# Create the query engine that will be used to query the retriever and synthesize the response
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

## Step 4: RAG pipeline in action

Wow! It was a long journey to reach here. But, we are finally here. Now, we will be able to see our implemented RAG pipeline in action.

Before we start, let's first address a critical question: will `PHI-2` give us correct and relevant answers, given that it is a foundational model?

The `PHI-2` model that we are using is not an instruction-fine-tuned model like `Mistral-7B-Instruct`. That's why I mentioned earlier that we will be comparing `PHI-2` with other models, specifically the `Mistral-7B` and `Mistral-7B-Instruct` models. We will see that the `Mistral-7B-Instruct` model gives us the most relevant answers, while `PHI-2` gives us some relevant answers along with some irrelevant responses. The `Mistral-7B` model suffers from the same problem as the `PHI-2` model.

In [10]:
import textwrap

# Create the text wrapper that will be used to wrap the response
# This is optional and can be removed if you don't want to wrap the response
# This is done to make the response more readable
wrapper = textwrap.TextWrapper(width=75) 

In [11]:
response = query_engine.query('what is this document about?')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [12]:
print(wrapper.fill(text=response.response))

Based on the context provided, this document is a part of the 'MEMENTO'
project's 'Pink Revisions' from September 7, 1999. The document is labeled
as page 116 and contains text and annotations. The provided context
suggests that this document is related to a conversation between a
character named Leonard and another person, with Leonard asking a question.
However, without further information, it's still impossible to determine
the exact content or subject matter of the document.


In [13]:
response = query_engine.query('what is the name of the movie?')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [14]:
print(wrapper.fill(text=response.response))

Memento is a movie that follows the character Leonard as he searches for
his wife's killer. In one scene, he chases a man named Dodd and later
discovers a file card that reads "Tattoo: Access to Drugs." In another
scene, Leonard is seen at a motel, exiting with a shopping bag and later
burning a stuffed toy and an old paperback book by a reservoir. The movie
is known for its nonlinear storytelling and use of color to distinguish
different sequences.


In [15]:
response = query_engine.query('what is the name of Director and the year of release?')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [16]:
print(wrapper.fill(text=response.response))

The name of the director is Christopher Nolan and "Memento" was released in
the year 2000.


In [17]:
response = query_engine.query('who is the writer of the movie?')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [18]:
print(wrapper.fill(text=response.response))

The movie "Memento" is based on a short story by Jonathan Nolan. However,
the screenplay was written by Christopher Nolan.


In [19]:
response = query_engine.query('can you pointout the pecularities of this movie in terms of screenplay, like what confused the audience?')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [20]:
print(wrapper.fill(text=response.response))

Based on the provided context from the "Memento" screenplay, the movie is
known for its unique narrative structure, which can be confusing for some
audiences. The story is told in a reverse chronological order, with scenes
presented in a non-linear fashion. This unconventional storytelling method
can make it challenging for viewers to follow the sequence of events and
understand the cause-and-effect relationships between different scenes.
Additionally, the protagonist, Leonard Shelby, suffers from anterograde
amnesia, which means he cannot remember new information. He relies on
notes, photographs, and tattoos to navigate his world. This narrative style
and the protagonist's condition create a sense of disorientation and
uncertainty, which can be intriguing but also confusing for some viewers.
The context from the screenplay further illustrates this, as it shows
Leonard being suspected of being a bad actor and undergoing more tests, but
the significance of these events and their relatio

As can be seen from the above results, which were generated with the `Mistral-7B-Instruct` model selected in Step 1, this model gives us the most relevant answers. The `PHI-2` model, being a foundational model, lacks the ability to follow instructions. That's why it gives us some relevant answers along with some irrelevant answers. The `Mistral-7B` model suffers from the same problem as the `PHI-2` model.

Finally! It was a long journey. But, we finally made it. We implemented the RAG pipeline from scratch. We also saw how to index the documents and query the indexed data using LLMs. 

## Step 5: Using gradio to build a UI

Here is an additional step that you can try out. You can use `gradio` to build a UI for the RAG pipeline. `gradio` is a Python library that allows you to quickly create UIs for your machine learning models. You can read more about `gradio` [here](https://gradio.app/).

For `gradio`, you first need to define a function that takes a query along with an additional parameter named `history`. This function is then passed to `gradio`, which creates a UI for you. You can play around with the UI to see how the RAG pipeline works.
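
Before wiring the function into the UI, it can help to check its shape without launching `gradio`. The sketch below builds such a function around any callable that answers a query; `make_predict`, `stub_predict`, and the empty-input message are hypothetical names chosen for illustration.

```python
def make_predict(query_fn):
    """Return a chat function with the (message, history) signature."""

    def predict(message, history):
        # history holds the earlier turns; it is ignored here,
        # so each question is answered independently
        if not message.strip():
            return "Please enter a question."
        return str(query_fn(message))

    return predict


# Stub query function standing in for query_engine.query
stub_predict = make_predict(lambda q: f"stub answer to: {q}")
print(stub_predict("what is this document about?", []))
```

In the notebook, passing `query_engine.query` as `query_fn` would give a function that can be handed to `gr.ChatInterface`.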

In [21]:
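def predict(message, history):
    # `history` is supplied by gr.ChatInterface but is not used here,
    # so every question is answered independently of earlier turns
    response = query_engine.query(message)
    return str(response)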

In [22]:
import gradio as gr

gr.ChatInterface(predict).launch(share=True)

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://2b4a874fb3ce4df436.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




