# Implementing Advanced RAG Techniques with LangChain & Gemma/Gemini

| | |
|-|-|
|Author(s) | [Tahreem Rasul](https://github.com/tahreemrasul) |

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/tahreemrasul/advanced_rag_techniques/blob/main/Designing_%26_Building_Advanced_RAG_from_scratch.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/tahreemrasul/advanced_rag_techniques/blob/main/Designing_%26_Building_Advanced_RAG_from_scratch.ipynb">
      <img width="28px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

# Install Dependencies

In [None]:
!pip install --quiet langchain langchain-community langchain-groq langchain-google-genai langchain_experimental pypdf faiss-gpu rank_bm25 cohere sentence-transformers==2.2.2 chainlit pyngrok


# Basic RAG

## Indexing

Let's start by importing all necessary libraries!

In [None]:
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain_community.vectorstores.faiss import DistanceStrategy
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from google.colab import userdata

### Step 1:
Remember, the first step in indexing is simply parsing/reading all your input data.

Add a sample PDF document that you want to use to the `content/sample_data/` folder in the `Files` tab on the left.

We will be using the `PyPDFLoader()` class from `LangChain` to load our PDF document.

In [None]:
file_path = "/content/sample_data/2010.11929v2.pdf"
loader = PyPDFLoader(file_path)
doc = loader.load()

Let's see how our data is structured!

`loader.load()` returns a list with one `Document` per page. We can access a particular page by indexing into the list and reading its `.page_content` attribute. The cell below also keeps only the first nine pages (`doc = doc[:9]`) for the rest of the notebook.

In [None]:
print(len(doc))
print(doc[8].page_content)

doc = doc[:9]

22
Published as a conference paper at ICLR 2021
[Figure 7 plot residue omitted. Panels: RGB embedding filters (first 28 principal components); position embedding similarity by input patch row and column (cosine similarity); mean attention distance (pixels) vs. network depth (layer) for ViT-L/16 heads.]
Figure 7: Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Sim-
ilarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position
embedding of the patch with the indicated row and column and the position embeddings of all other
patches. Right: Size of attended area by head and network depth. Each dot shows the mean attention
distance across images for one of 16 heads at one layer. See Appendix D.7 for details.
et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch
prediction for self-supervision, mimicking the masked language modeling task used in BERT. With
s

### Step 2:
The second step in the indexing stage involves splitting the document into smaller chunks. We do this for several reasons:

* The context window of the embedding model (which we will use to create vectors) is limited.
* The LLM context window is usually large, but we still want to restrict the data we supply as context to cut costs.
* Supplying smaller, more focused context helps reduce hallucination.

We will be using the **`RecursiveCharacterTextSplitter()`** from `LangChain` to split our pdf document into smaller chunks. `RecursiveCharacterTextSplitter()` uses hierarchical separators (e.g., new lines, spaces) to split text while preserving sentences or paragraphs.
It maintains coherence but does not account for semantic relationships.

We have kept a chunk size of 1000 and an overlap of 100 characters.
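
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification, not how `RecursiveCharacterTextSplitter()` works internally (the real splitter also tries separators such as new lines and spaces first). The helper name `chunk_with_overlap` and the sample text are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2500-character sample, using the same sizes as this notebook
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample, chunk_size=1000, chunk_overlap=100)
print([len(c) for c in chunks])
```

Each chunk starts with the last 100 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.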

In [None]:
# text split
text_splitter_basic = RecursiveCharacterTextSplitter(chunk_size=1000,
                                                     chunk_overlap=100)
split_docs_basic = text_splitter_basic.split_documents(doc)

Let's see the number of chunks we have created, as well as an example chunk. We can see the data has been split into chunks of a few sentences each.

In [None]:
print(f"Number of chunks with Recursive Splitter: {len(split_docs_basic)}")
print("\n")
print("EXAMPLE CHUNK:")
print("\n")
print(split_docs_basic[7])

Number of chunks with Recursive Splitter: 40


EXAMPLE CHUNK:


page_content='from the input image and applies full self-attention on top. This model is very similar to ViT,
but our work goes further to demonstrate that large scale pre-training makes vanilla transformers
competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020)
use a small patch size of 2×2pixels, which makes the model applicable only to small-resolution
images, while we handle medium-resolution images as well.
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms
of self-attention, e.g. by augmenting feature maps for image classiﬁcation (Bello et al., 2019) or by
further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018;
Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classiﬁcation (Wu
et al., 2020), unsupervised object discovery (Locatello et al., 2020),

### Step 3:
The last step in the indexing stage involves passing our text data through an embedding model to create embedding vectors.

We will be using open-source embedding models from `HuggingFace`.

Finally, we will create a `FAISS` index to store the embeddings. FAISS is an open-source vector similarity search library from Meta. We will supply it the embedding model, the chunks we created, and optionally a distance strategy used to calculate scores during retrieval. The `from_documents()` method handles both embedding creation and index creation behind the scenes, so we don't have to do them separately.
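
The two steps that `from_documents()` hides (embed every chunk, then store the vectors for search) can be sketched in plain Python. This is only an illustration: `toy_embed` is a hypothetical stand-in that uses normalized word counts instead of a real embedding model, and `ToyIndex` is a hypothetical brute-force index, not FAISS.

```python
import math
from collections import Counter


def toy_embed(text: str, vocab: list[str]) -> list[float]:
    """Hypothetical embedding: unit-normalized word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyIndex:
    """Hypothetical brute-force index: keeps every vector and scans all of them."""

    def __init__(self, vocab, vectors, texts):
        self.vocab, self.vectors, self.texts = vocab, vectors, texts

    @classmethod
    def from_texts(cls, texts: list[str]) -> "ToyIndex":
        vocab = sorted({word for t in texts for word in t.lower().split()})
        vectors = [toy_embed(t, vocab) for t in texts]  # step 1: embed each chunk
        return cls(vocab, vectors, texts)               # step 2: store for search

    def search(self, query: str, k: int = 1):
        q = toy_embed(query, self.vocab)
        scored = [
            (sum((a - b) ** 2 for a, b in zip(q, vec)), text)  # squared L2 distance
            for vec, text in zip(self.vectors, self.texts)
        ]
        return sorted(scored)[:k]  # smallest distance first


texts = [
    "vision transformers use self attention",
    "cnns use convolution layers",
    "bm25 ranks documents by keywords",
]
index = ToyIndex.from_texts(texts)
print(index.search("self attention in transformers", k=1))
```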

In [None]:
# db creation
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2")
db_basic = FAISS.from_documents(documents=split_docs_basic,
                                 embedding=embedding_model,
                                 distance_strategy=DistanceStrategy.COSINE)



## Retrieval

Now that we have created our index (another name for our vector database), let's ask it a question about the document and see which chunks it retrieves. The chunks should be relevant to our question and should contain enough information to answer it during generation. Typically, chunks are retrieved based on semantic similarity between the question embedding and the chunk embeddings. The score returned by `similarity_search_with_score()` is a distance, so the lower the score, the more similar a chunk is to the question.
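
As a small worked example of why a lower score means a closer match, the sketch below ranks three made-up 2D vectors by cosine distance (one minus cosine similarity). The vectors and labels are hypothetical, and the exact score that FAISS reports depends on how the index is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    # 0.0 means same direction, 1.0 means orthogonal (unrelated)
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
chunks = {
    "same direction": [2.0, 0.0],
    "partly related": [1.0, 1.0],
    "unrelated": [0.0, 3.0],
}

# Sort ascending: the smallest distance is the best match
ranked = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
for name in ranked:
    print(f"{name}: {cosine_distance(query, chunks[name]):.4f}")
```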

**This is one of the modules where we will discuss advancements later on.**

In [None]:
question = "Why do vision transformers have low inductive bias than CNNs?"
similar_vectors = db_basic.similarity_search_with_score(query=question, k=5)

In [None]:
print(len(similar_vectors))

5


In [None]:
for i, (document, score) in enumerate(similar_vectors):
  print(f"----------------Retrieved vector # {i + 1}----------------\n")
  print(f"Score: {score}")
  print(f"Source: {document.metadata.get('source', 'Unknown')}")
  print(f"Page: {document.metadata.get('page', 'Unknown')}")
  print(f"Content:\n{document.page_content}\n")

----------------Retrieved vector # 1----------------

Score: 0.6993114352226257
Source: /content/sample_data/2010.11929v2.pdf
Page: 3
Content:
Published as a conference paper at ICLR 2021
The MLP contains two layers with a GELU non-linearity.
z0= [xclass;x1
pE;x2
pE;···;xN
pE] +Epos,E∈R(P2·C)×D,Epos∈R(N+1)×D(1)
z′
ℓ= MSA(LN( zℓ−1)) +zℓ−1, ℓ = 1...L (2)
zℓ= MLP(LN( z′
ℓ)) +z′
ℓ, ℓ = 1...L (3)
y= LN( z0
L) (4)
Inductive bias. We note that Vision Transformer has much less image-speciﬁc inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and transla-
tionally equivariant, while the self-attention layers are global. The two-dimensional neighborhood
structure is used very sparingly: in the beginning of the model by cutting the image into patches and
at ﬁne-tuning time for adjusting the position embeddings for images of different resolution (as 

## Generation

This is the final stage in our RAG pipeline. We now supply the retrieved chunks along with a system prompt to our LLM. The LLM uses this information to create a comprehensive response. If the LLM does not find relevant information in the supplied context, it is instructed to say that it doesn't know.

We first start off with creating a system prompt. We will keep placeholder variables `{context}` and `{question}` to fill in the actual chunks we retrieve and the user query.
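
The placeholder filling can be illustrated with plain string formatting. This is a sketch of the idea only: `build_prompt`, the shortened template, and the sample chunks are hypothetical, and the actual chain in this notebook does this step internally.

```python
def build_prompt(template: str, chunks: list[str], question: str) -> str:
    """Join the retrieved chunks into one context string and fill both placeholders."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Shortened, hypothetical version of a QA template with the two placeholders
template = (
    "Use the following pieces of context {context} to answer the question at the end. "
    "If you don't know the answer, just say that you don't know.\n"
    "Question: {question}\n"
    "Answer:"
)

retrieved_chunks = [
    "In CNNs, locality is baked into each layer.",
    "In ViT, the self-attention layers are global.",
]
prompt = build_prompt(template, retrieved_chunks, "Why do ViTs have less inductive bias?")
print(prompt)
```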

### Securing your API Keys

We will be using both the open-source Gemma model from Google and the closed-source Gemini model. For Gemma, we will be using the Groq API. Head over to https://console.groq.com/keys to create your own key free of cost and start using the Gemma model.

To use Gemini, create your key by logging into https://aistudio.google.com/app/apikey. Note that for this you would need to add your payment details.

Store your keys in the `Secrets` tab inside the following variables:
* GROQ_API_KEY
* GEMINI_API_KEY

Run the cell below to add the key you have.
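
If you run this notebook outside Colab, the `Secrets` tab is not available. One possible alternative, sketched below, is to read the same variable names from environment variables. The helper `load_api_key` is hypothetical and is not used by the cells that follow.

```python
import os
from typing import Mapping


def load_api_key(api_choice: str, env: Mapping[str, str] = os.environ) -> str:
    """Look up the key for the chosen provider in the given environment mapping."""
    names = {"GROQ": "GROQ_API_KEY", "GEMINI": "GEMINI_API_KEY"}
    if api_choice not in names:
        raise ValueError(f"Unknown provider: {api_choice!r}")
    key = env.get(names[api_choice])
    if not key:
        raise ValueError(f"{names[api_choice]} is not set")
    return key


# Demo with an explicit mapping so the example does not depend on your real environment
demo_env = {"GROQ_API_KEY": "demo-key"}
print(load_api_key("GROQ", demo_env))
```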

In [None]:
# @title API Key Selection { display-mode: "form" }
# Dropdown for API key selection
api_choice = "GROQ"  # @param ["GROQ", "GEMINI"]

api_key_GROQ, api_key_GEMINI = None, None

# Load the corresponding API key based on the user selection
if api_choice == "GROQ":
    api_key_GROQ = userdata.get('GROQ_API_KEY')
    print("Loaded GROQ API key.")
else:
    api_key_GEMINI = userdata.get('GEMINI_API_KEY')
    print("Loaded GEMINI API key.")

Loaded GROQ API key.


In [None]:
# Generation
qa_template = """
Use the following pieces of context {context} to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Question: {question}
Answer:
"""
generation_prompt = PromptTemplate(input_variables=["context", "question"],
                        template=qa_template)

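# Both branches bind `llm`, so the chain below works with either key
if api_key_GROQ:
  llm = ChatGroq(temperature=0,
                model_name="gemma2-9b-it",
                api_key=api_key_GROQ)

elif api_key_GEMINI:
  llm = ChatGoogleGenerativeAI(temperature=0,
                               model="gemini-1.5-pro",
                               google_api_key=api_key_GEMINI)
  llm_gemini = llm  # alias used by the Gemini-specific cell further down

else:
  raise ValueError("Please supply a valid API key to proceed")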

generation_chain = RetrievalQA.from_llm(llm=llm,
                                        retriever=db_basic.as_retriever(search_kwargs={"k": 5}),
                                        prompt=generation_prompt)

In [None]:
print(generation_chain.invoke({"query": question})['result'])

According to the provided text, Vision Transformers (ViTs) have much less image-specific inductive bias than Convolutional Neural Networks (CNNs). 

Here's why:

* **CNNs:**  CNNs incorporate inductive biases like locality (weights connect to nearby pixels), two-dimensional neighborhood structure, and translation equivariance (output remains similar even with shifted input) directly into their layer design. These biases help them learn patterns in images effectively.

* **ViTs:** In contrast, ViTs primarily rely on self-attention layers which are global, meaning they consider relationships between all image patches simultaneously.  The only local and translationally equivariant components are the MLP layers.  The initial position embeddings in ViTs don't inherently carry any two-dimensional structure information. 


Let me know if you have any other questions. 



In [None]:
# Only runs when the Gemini key was selected above; the chain expects just the "query" key
if api_key_GEMINI:
  generation_chain_gemini = RetrievalQA.from_llm(llm=llm_gemini,
                                          retriever=db_basic.as_retriever(search_kwargs={"k": 5}),
                                          prompt=generation_prompt)
  print(generation_chain_gemini.invoke({"query": question})["result"])

# Advanced RAG

# Query Rewriting

Query rewriting is a **pre-retrieval** enhancement. There are several query rewriting methods available. We will implement the simplest: **zero-shot prompting**.

At its core, zero-shot prompting is a prompt engineering technique. By zero-shot, we mean that we do not give any examples in the prompt. We will start by defining a `zero_shot_prompt_template`.

In [None]:
zero_shot_prompt_template = """
You are a helpful assistant that generates multiple questions based on a single input query.
If there are multiple common ways of phrasing a user question or common synonyms for key words in the question, make sure to return multiple versions
of the query with the different phrasings.
If there are acronyms or words you are not familiar with, do not try to rephrase them.
Return 3 different versions of the original question.

{format_instructions}

Original question: {question}

Expanded queries:
"""

Next, we will define some Pydantic models to parse the LLM's expanded queries into a structured output. This is not necessary if you are rewriting the query into a single output. However, if you are asking for multiple queries, a structured output makes them much easier to extract reliably.

We will use the Pydantic models to construct a parser, whose format instructions we inject into the prompt alongside the original query.
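
To show what "structured output" buys us, here is a standard-library sketch that validates a JSON reply with the same nested shape as the Pydantic models in this section (`expanded_queries` containing a `queries` list of exactly three strings). The function `parse_expanded_queries` and the sample reply are hypothetical; the notebook itself relies on `PydanticOutputParser` for this.

```python
import json


def parse_expanded_queries(raw: str, expected: int = 3) -> list[str]:
    """Parse the LLM reply and check that it holds exactly `expected` string queries."""
    data = json.loads(raw)
    queries = data["expanded_queries"]["queries"]
    if not isinstance(queries, list) or len(queries) != expected:
        raise ValueError(f"Expected a list of {expected} queries")
    if not all(isinstance(q, str) for q in queries):
        raise ValueError("Every query must be a string")
    return queries


# Hypothetical reply that follows the format instructions
raw_output = json.dumps({
    "expanded_queries": {
        "queries": [
            "Why do ViTs have less inductive bias than CNNs?",
            "What gives CNNs more inductive bias than ViTs?",
            "How does inductive bias differ between ViTs and CNNs?",
        ]
    }
})
for q in parse_expanded_queries(raw_output):
    print(q)
```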

In [None]:
from typing import List
from pydantic import BaseModel, Field
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser

# Define your Pydantic models
class ExpandedQuery(BaseModel):
    queries: List[str] = Field(..., min_items=3, max_items=3)

class ExpandedQueries(BaseModel):
    expanded_queries: ExpandedQuery

# Initialize the output parser with your Pydantic model
parser = PydanticOutputParser(pydantic_object=ExpandedQueries)

# Retrieve the format instructions to include in the prompt
format_instructions = parser.get_format_instructions()

# Create the PromptTemplate with the adjusted prompt
zero_shot_prompt = PromptTemplate(
    template=zero_shot_prompt_template,
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions}
)

Now, we will use a simple `LLMChain` from `LangChain` and supply it with our original question, the LLM of choice, and the zero-shot prompt we constructed in the last step. We will get three queries back, since the prompt asks for three versions of the original query.

In [None]:
question

'Why do vision transformers have low inductive bias than CNNs?'

In [None]:
# Create the LLMChain with the output parser
query_rewriting_chain = LLMChain(llm=llm,
                                 prompt=zero_shot_prompt,
                                 output_parser=parser)

# Invoke the chain with your question
response = query_rewriting_chain.invoke({"question": question})

In [None]:
# Access the list of queries from the response
queries_list = response['text'].expanded_queries.queries

for query in queries_list:
  print(query)

Why do vision transformers have lower inductive bias compared to CNNs?
What is the reason for vision transformers having less inductive bias than CNNs?
How does the inductive bias of vision transformers compare to CNNs?


## Retrieval post query rewriting

Now that we have constructed three queries from the original, let us retry retrieval with each of the new queries to see which chunks we retrieve. You will see that each query returns very similar chunks, with slight variations in order and score.
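
Because the rewritten queries return overlapping chunks, one possible way to use them together is to merge the result lists and keep each chunk once with its best (lowest) score. The notebook does not do this; the sketch below only illustrates the idea, and `merge_results`, the chunk ids, and the scores are all hypothetical.

```python
def merge_results(results_per_query, k: int = 5):
    """Deduplicate chunks across queries, keeping the lowest distance seen for each."""
    best = {}
    for results in results_per_query:
        for chunk_id, score in results:
            if chunk_id not in best or score < best[chunk_id]:
                best[chunk_id] = score
    # Lower distance means more similar, so sort ascending
    return sorted(best.items(), key=lambda item: item[1])[:k]


# Hypothetical (chunk id, distance) pairs for two rewritten queries
results_query_1 = [("page3-chunk2", 0.73), ("page6-chunk1", 0.81)]
results_query_2 = [("page3-chunk2", 0.70), ("page2-chunk4", 0.79)]

merged = merge_results([results_query_1, results_query_2])
print(merged)
```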

In [None]:
# retrieve 5 chunks with rewritten queries
for query in queries_list:
  similar_vectors = []
  similar_vectors = db_basic.similarity_search_with_score(query=query, k=5)
  print(f"**********{ query }*********")
  for i, (document, score) in enumerate(similar_vectors):
    print(f"----------------Retrieved vector # {i + 1}----------------\n")
    print(f"Score: {score}")
    print(f"Source: {document.metadata.get('source', 'Unknown')}")
    print(f"Page: {document.metadata.get('page', 'Unknown')}")
    print(f"Content:\n{document.page_content}\n")

**********Why do vision transformers have lower inductive bias compared to CNNs?*********
----------------Retrieved vector # 1----------------

Score: 0.7333710193634033
Source: /content/sample_data/2010.11929v2.pdf
Page: 3
Content:
Published as a conference paper at ICLR 2021
The MLP contains two layers with a GELU non-linearity.
z0= [xclass;x1
pE;x2
pE;···;xN
pE] +Epos,E∈R(P2·C)×D,Epos∈R(N+1)×D(1)
z′
ℓ= MSA(LN( zℓ−1)) +zℓ−1, ℓ = 1...L (2)
zℓ= MLP(LN( z′
ℓ)) +z′
ℓ, ℓ = 1...L (3)
y= LN( z0
L) (4)
Inductive bias. We note that Vision Transformer has much less image-speciﬁc inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and transla-
tionally equivariant, while the self-attention layers are global. The two-dimensional neighborhood
structure is used very sparingly: in the beginning of the model by cutting the image into patches and
at ﬁne

## Generation post query rewriting

Let's see our final results with each of the three queries to see how each answer looks.

In [None]:
for query in queries_list:
  generation_chain = RetrievalQA.from_llm(llm=llm,
                                        retriever=db_basic.as_retriever(search_kwargs={"k": 5}),
                                        prompt=generation_prompt)
  print(f"**********{ query }*********")
  print(generation_chain.invoke({"query": query})["result"])
  print("\n")

**********Why do vision transformers have lower inductive bias compared to CNNs?*********
According to the provided text, Vision Transformers (ViTs) have less image-specific inductive bias than CNNs because:

* **Locality:** CNNs have convolutional layers that inherently capture local patterns due to their structure. ViTs, on the other hand, primarily rely on self-attention, which is global in nature.

* **Two-dimensional neighborhood structure:** CNNs are designed to process data in a grid-like structure, preserving the spatial relationships between pixels. ViTs treat images as sequences of patches, losing this inherent two-dimensional structure.

* **Translation equivariance:** CNNs often exhibit translation equivariance, meaning their output is relatively unchanged when the input image is shifted. While ViTs can achieve some degree of translation equivariance through their positional embeddings, this is not as deeply ingrained in their architecture as it is in CNNs. 


Essentially, 

# Semantic Chunking

RAG systems, especially for complex documents, are sensitive to how the data is indexed. If chunks are too small, retrieval misses important information; if they are too large, we might be supplying unnecessary information.

In semantic chunking, the splits are made based on the cosine distance between embeddings of sequential chunks. So we start by dividing the text into small but coherent groups, perhaps using a recursive chunker.

Next we vectorize each chunk using an embedding model. Finally, we look at the cosine distances between the embeddings of subsequent chunks and choose breakpoints where the distances are large. Ideally, this helps to create groups of text that are both coherent and semantically distinct.

We will use the **`SemanticChunker`** from `LangChain` and the **`HuggingFaceBgeEmbeddings`** model. We will create the splits using the `percentile` method. In this method, the distances between all consecutive sentences are calculated, and a split is made wherever the distance exceeds the 95th percentile.
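
The percentile rule can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `SemanticChunker` implementation: the sentence embeddings are made-up 2D vectors, and `percentile` and `semantic_groups` are hypothetical helpers.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_groups(sentences, embeddings, pct: float = 95.0):
    """Start a new group wherever the distance to the previous sentence exceeds the threshold."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    groups, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large jump in meaning: close the current group
            groups.append(current)
            current = []
        current.append(sentence)
    groups.append(current)
    return groups


# Hypothetical sentences and embeddings: the topic changes after the third sentence
sentences = ["s0", "s1", "s2", "s3", "s4"]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0], [0.05, 1.0]]
print(semantic_groups(sentences, embeddings))
```

With only four distances, the 95th percentile sits just below the single large jump, so the text is split once, between `s2` and `s3`.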

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings

SENTENCE_TRANSFORMERS_HOME = "/content/sample_data/models"
MODEL_KWARGS = {"device": "cpu"}
ENCODE_KWARGS = {"normalize_embeddings": True}
embedding_model_name = "BAAI/bge-small-en-v1.5"

embedding_model = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                           model_kwargs=MODEL_KWARGS,
                                           encode_kwargs=ENCODE_KWARGS,
                                           cache_folder=SENTENCE_TRANSFORMERS_HOME)

text_splitter_semantic = SemanticChunker(embedding_model,
                                         breakpoint_threshold_type="percentile")



Once we create the splitter, the rest of the process is similar to the traditional indexing process we discussed in the [Indexing section](https://colab.research.google.com/drive/14vLf0xw9ow-eQ3wAA4FZlPv0n4grBi4B#scrollTo=xE9dU9ITM6n3&line=1&uniqifier=1).

This means we will:

1. split the original document
2. create embeddings
3. create the database

In [None]:
split_docs_semantic = text_splitter_semantic.split_documents(doc)
print(f"No of chunks created with semantic chunking: {len(split_docs_semantic)}")
print(split_docs_semantic[0].page_content)

No of chunks created with semantic chunking: 26
Published as a conference paper at ICLR 2021
ANIMAGE IS WORTH 16X16 W ORDS :
TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗,
Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†
∗equal technical contribution,†equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby }@google.com
ABSTRACT
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform
very w

In [None]:
db_semantic = FAISS.from_documents(documents=split_docs_semantic,
                                   embedding=embedding_model,
                                   distance_strategy=DistanceStrategy.COSINE)

## Retrieval Post Semantic Chunking

Let's retry retrieval post semantic chunking to see if we have retrieved better results.

We can see chunks are more coherent and contain relevant information within a single chunk. We did all this without worrying about chunk size & overlap! 😁

In [None]:
similar_vectors = db_semantic.similarity_search_with_score(query=question, k=5)

In [None]:
for i, (document, score) in enumerate(similar_vectors):
    print(f"----------------Retrieved vector # {i + 1}----------------\n")
    print(f"Score: {score}")
    print(f"Source: {document.metadata.get('source', 'Unknown')}")
    print(f"Page: {document.metadata.get('page', 'Unknown')}")
    print(f"Content:\n{document.page_content}\n")

----------------Retrieved vector # 1----------------

Score: 0.2917124629020691
Source: /content/sample_data/2010.11929v2.pdf
Page: 6
Content:
Figure 4 contains the results. Vision Transformers overﬁt more than ResNets with
comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than
ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true
for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive
bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from
data is sufﬁcient, even beneﬁcial. Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work. 7

----------------Retrieved vector # 2----------------

Score: 0.3098767101764679
Source: /content/sample_dat

## Generation Post Semantic Chunking
Let's retry generation post semantic chunking to see if we generate a more coherent response that is more in line with the original answer.

It seems that the LLM was able to reference exact text from the document with greater ease, and also summarized it for clarity.

In [None]:
generation_chain = RetrievalQA.from_llm(llm=llm,
                                        retriever=db_semantic.as_retriever(search_kwargs={"k": 5}),
                                        prompt=generation_prompt)
print(generation_chain.invoke({"query": question})["result"])

The text states: "In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch."


Essentially, CNNs have built-in assumptions about how visual information is structured (local connections, spatial relationships) that Vision Transformers lack. ViT relies more on learning these patterns directly from the data. 



# Hybrid Retrieval & Rerank

Finally, we will discuss how to retrieve better results. A RAG pipeline often struggles at the retrieval stage: if we rely only on semantic similarity, important keywords in the document can be missed.

## Hybrid Search

We will be using the **`BM25Retriever`** from `LangChain` for keyword-based retrieval. BM25, also known as Okapi BM25, is a ranking function used in information retrieval systems to estimate the relevance of documents to a given search query.

The BM25 retriever is good at keyword matching (as opposed to semantic matching). Typically, in RAG, we retrieve document chunks based on semantic similarity. Combining keyword-based retrieval with regular semantic search is known as hybrid search.
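
For intuition about what BM25 rewards, here is a compact sketch of the scoring function. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the `log(1 + ...)` IDF variant that keeps scores non-negative; the `rank_bm25` package used by `BM25Retriever` may differ in these details. The function name `bm25_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents that contain each term
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms missing from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Long documents are penalized through the length normalization term
            norm = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "inductive bias in cnns",
    "vision transformers use attention",
    "bm25 is a ranking function",
]
scores = bm25_scores("inductive bias", docs)
print(scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A purely semantic retriever has no such exact-match guarantee.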

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

For semantic similarity, we will supply our previously created vector database as a retriever. We will use the **`EnsembleRetriever`** to combine the semantic and keyword retrievers. We have given more weight to the keyword retriever, but you can adjust this based on your needs.
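
The LangChain documentation describes `EnsembleRetriever` as combining results with Reciprocal Rank Fusion. The sketch below shows a weighted version of that idea so you can see how the weights shift the final order. The constant `c=60`, the function name `weighted_rrf`, and the document ids are assumptions made for illustration.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked results from the two retrievers
semantic_ranking = ["A", "B", "C"]
keyword_ranking = ["C", "A", "D"]

fused = weighted_rrf([semantic_ranking, keyword_ranking], weights=[0.2, 0.8])
print(fused)
```

With the keyword list weighted at 0.8, its top hit `C` overtakes `A` even though `A` is ranked first by the semantic list.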

In [None]:
keyword_retriever = BM25Retriever.from_documents(split_docs_semantic)
keyword_retriever.k =  5

vectorstore_retriever = db_semantic.as_retriever(search_kwargs={"k": 5})

# Weights follow the order of `retrievers`: 0.2 for semantic, 0.8 for keyword
ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retriever,
                                                   keyword_retriever],
                                       weights=[0.2, 0.8])

## Retrieval post rerank

In [None]:
ensemble_docs = ensemble_retriever.invoke(question)
for i, document in enumerate(ensemble_docs):
    print(f"----------------Retrieved vector # {i + 1}----------------\n")
    print(f"Source: {document.metadata.get('source', 'Unknown')}")
    print(f"Page: {document.metadata.get('page', 'Unknown')}")
    print(f"Content:\n{document.page_content}\n")

----------------Retrieved vector # 1----------------

Source: /content/sample_data/2010.11929v2.pdf
Page: 6
Content:
Figure 4 contains the results. Vision Transformers overﬁt more than ResNets with
comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than
ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true
for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive
bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from
data is sufﬁcient, even beneﬁcial. Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work. 7

----------------Retrieved vector # 2----------------

Source: /content/sample_data/2010.11929v2.pdf
Page: 3
Content:
Published as a c

## Generation post rerank

In [None]:
hybrid_chain = RetrievalQA.from_llm(llm=llm,
                                    retriever=ensemble_retriever,
                                    prompt=generation_prompt)

In [None]:
print(hybrid_chain.invoke({"query": question})["result"])

Vision Transformers (ViTs) have less inductive bias than Convolutional Neural Networks (CNNs) because they lack the built-in assumptions about image structure that CNNs possess. 

Here's a breakdown:

* **CNNs:** CNNs are designed with convolutional layers that inherently learn spatial hierarchies and local patterns. They have:
    * **Locality:**  Filters operate on small, local regions of the image, capturing features at different scales.
    * **Translation Equivariance:**  The network's output is relatively unchanged when an image is shifted.
    * **2D Structure:** The architecture itself is designed around 2D grids, enforcing a spatial understanding.

* **ViTs:** ViTs treat images as sequences of patches (like words in a sentence).  They rely primarily on self-attention, which:
    * **Global Relationships:**  Self-attention considers relationships between *all* patches in an image simultaneously, not just local ones.
    * **Less Spatial Awareness:** ViTs don't have an inherent 

We can see that the answer is a lot more detailed and is also factually aligned with the original source.

## Reranking with Cohere (Optional)

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

In [None]:
api_key_COHERE = userdata.get('COHERE_API_KEY')
compressor = CohereRerank(cohere_api_key=api_key_COHERE)

In [None]:
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=vectorstore_retreiver)
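Conceptually, the compression retriever performs a two-stage lookup: the base retriever fetches candidates, then the reranker re-scores each candidate against the query and keeps the best ones. The sketch below mimics that flow with a toy word-overlap scorer standing in for the Cohere model. The `rerank` and `overlap_score` helpers and the candidate texts are hypothetical; only the `relevance_score` field name mirrors the metadata key seen in the output below.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score `candidates` against `query` and keep the `top_n` best."""
    scored = [
        {"page_content": text, "relevance_score": score_fn(query, text)}
        for text in candidates
    ]
    scored.sort(key=lambda item: item["relevance_score"], reverse=True)
    return scored[:top_n]


def overlap_score(query, text):
    """Toy stand-in for a reranking model: fraction of query words found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


# Hypothetical first-stage candidates.
candidates = [
    "CNNs exploit the relationships between neighboring pixels",
    "Vision Transformers have less inductive bias than CNNs",
    "Figure 17. Architecture of LeNet-5.",
]
reranked = rerank("inductive bias of vision transformers", candidates, overlap_score, top_n=2)
for item in reranked:
    print(round(item["relevance_score"], 2), item["page_content"])
```

A real reranker uses a learned model rather than word overlap, but the shape of the operation is the same: score every candidate, sort, and truncate.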

In [None]:
compressed_docs = compression_retriever.get_relevant_documents(question)
# Print the relevant documents from using the embeddings and reranker
print(compressed_docs)

[Document(metadata={'source': '/content/sample_data/2304.05133v2.pdf', 'page': 25, 'relevance_score': 0.20657375}, page_content='1. Vectorization causes the input image to loose all of its spatial structure, which could have\nbeen helpful during training.\n2. Let e.g. n0,1=n0,2= 1000, thenn0= 106and the weight matrix W[0]∈Rn1×106contains\nan enormous number of optimization variables. This can make training very slow or even\ninfeasible.\nOn the contrary, convolutional neural networks are designed to exploit the relationships between\nneighboring pixels. In fact, the input of a CNN is typically a matrix or even a three-dimensional\ntensor, which is then passed through the layers while maintaining this structure. CNNs take\nsmall patches, e.g. squares or cubes, from the input images and learn features from them.\nConsequently, they can subsequently recognize these features in other images, even when they\nappear in other parts of the image.\nFigure 17. Architecture of LeNet-5.'), Documen

In [None]:
for i, document in enumerate(compressed_docs):
    print(f"----------------Retrieved vector # {i + 1}----------------\n")
    print(f"Score: {document.metadata.get('relevance_score', 'Unknown')}")
    print(f"Source: {document.metadata.get('source', 'Unknown')}")
    print(f"Page: {document.metadata.get('page', 'Unknown')}")
    print(f"Content:\n{document.page_content}\n")

----------------Retrieved vector # 1----------------

Score: 0.20657375
Source: /content/sample_data/2304.05133v2.pdf
Page: 25
Content:
1. Vectorization causes the input image to loose all of its spatial structure, which could have
been helpful during training.
2. Let e.g. n0,1=n0,2= 1000, thenn0= 106and the weight matrix W[0]∈Rn1×106contains
an enormous number of optimization variables. This can make training very slow or even
infeasible.
On the contrary, convolutional neural networks are designed to exploit the relationships between
neighboring pixels. In fact, the input of a CNN is typically a matrix or even a three-dimensional
tensor, which is then passed through the layers while maintaining this structure. CNNs take
small patches, e.g. squares or cubes, from the input images and learn features from them.
Consequently, they can subsequently recognize these features in other images, even when they
appear in other parts of the image.
Figure 17. Architecture of LeNet-5.

-------------

In [None]:
hybrid_chain = RetrievalQA.from_llm(llm=llm,
                                    retriever=compression_retriever,
                                    prompt=generation_prompt)

In [None]:
print(hybrid_chain.invoke({"query": question})["result"])

While the provided context discusses the advantages and disadvantages of convolutional neural networks (CNNs), it doesn't offer information about vision transformers (ViTs) or their inductive bias.  

To answer your question about why ViTs have lower inductive bias than CNNs, we need information about how ViTs process data. 

Here's a general explanation:

* **CNNs** have a strong inductive bias towards spatial locality. They assume that neighboring pixels in an image are more related than distant pixels. This is achieved through convolutional filters that learn features from small, localized patches of the input.

* **Vision Transformers (ViTs)**, on the other hand, treat an image as a sequence of patches and process them like words in a sentence. They use self-attention mechanisms to learn relationships between all patches in the image, regardless of their spatial distance. This global perspective gives ViTs a lower inductive bias towards spatial locality.


Let me know if you have a

# End to End Advanced RAG App

In [None]:
import chainlit as cl
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain_community.vectorstores.faiss import DistanceStrategy
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_groq import ChatGroq
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA, LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from google.colab import userdata

In [None]:
SENTENCE_TRANSFORMERS_HOME = "/content/sample_data/models"
MODEL_KWARGS = {"device": "cpu"}
ENCODE_KWARGS = {"normalize_embeddings": True}
embedding_model_name = "BAAI/bge-small-en-v1.5"



api_key_GROQ = userdata.get('GROQ_API_KEY')

# Generation
qa_template = """
Use the following pieces of context {context} to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Question: {question}
Answer:
"""

# Query Rewriting prompt
zero_shot_prompt_template = """
You are a helpful assistant that rewrites the original search query for clarity and better understanding.
If the original query is confusing, rewrite it with different phrasing to elaborate on what the user is trying to ask.
If there are acronyms or words you are not familiar with, do not try to rephrase them.
Return rewritten query.

Original question: {question}

Rewritten Query:
"""
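To see what the rewriting model actually receives, we can render the template by hand. This sketch uses plain `str.format`, which, as far as we know, matches how `PromptTemplate` fills its default f-string style templates. The example question is made up for illustration.

```python
# Copy of the query rewriting template, so this snippet runs on its own.
zero_shot_prompt_template = """
You are a helpful assistant that rewrites the original search query for clarity and better understanding.
If the original query is confusing, rewrite it with different phrasing to elaborate on what the user is trying to ask.
If there are acronyms or words you are not familiar with, do not try to rephrase them.
Return rewritten query.

Original question: {question}

Rewritten Query:
"""

# Hypothetical terse user question.
example_question = "why ViT less inductive bias vs CNN?"

rendered_prompt = zero_shot_prompt_template.format(question=example_question)
print(rendered_prompt)
```

The prompt ends with `Rewritten Query:`, so the model's completion is the rewritten query itself, which is then passed on to the retrieval chain.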

In [None]:
# define all variables to be used on system start
@cl.on_chat_start
async def ingestion_retrieval():
    files = None

    # Wait for the user to upload a file
    while files is None:
        files = await cl.AskFileMessage(
            content="Please upload a text or pdf file to begin!", accept=["text/plain", "application/pdf"],
            max_size_mb=8
        ).send()

    uploaded_file = files[0]

    loader = PyPDFLoader(uploaded_file.path)
    doc = loader.load()

    # Sending a pdf with the local file path
    elements = [
        cl.Pdf(name=uploaded_file.name, display="side", path=uploaded_file.path),
        cl.Text(name=uploaded_file.name, display="side", content=doc[0].page_content)
    ]

    # Reminder: The name of the pdf must be in the content of the message
    await cl.Message(content=f"You have uploaded {uploaded_file.name}. "
                             f"Click on it to view it in sidebar.", elements=elements).send()

    await cl.Message(content=f"Ingesting {uploaded_file.name} "
                             f"in a database. This operation may take a while").send()

    # embedding model (must be created before the semantic splitter that uses it)
    embedding_model = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                           model_kwargs=MODEL_KWARGS,
                                           encode_kwargs=ENCODE_KWARGS,
                                           cache_folder=SENTENCE_TRANSFORMERS_HOME)

    # text splitter
    text_splitter_semantic = SemanticChunker(embedding_model,
                                         breakpoint_threshold_type="percentile")
    split_docs_semantic = text_splitter_semantic.split_documents(doc)

    # db creation
    db_semantic = FAISS.from_documents(documents=split_docs_semantic,
                                   embedding=embedding_model,
                                   distance_strategy=DistanceStrategy.COSINE)

    # hybrid retriever definition
    keyword_retriever = BM25Retriever.from_documents(split_docs_semantic)
    keyword_retriever.k =  5
    vectorstore_retreiver = db_semantic.as_retriever(search_kwargs={"k": 5})
    ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver,
                                                      keyword_retriever],
                                          weights=[0.2, 0.8])

    # llm definition
    generation_prompt = PromptTemplate(input_variables=["context", "question"],
                        template=qa_template)
    llm = ChatGroq(temperature=0,
               model_name="gemma2-9b-it",
               api_key=api_key_GROQ)

    # retrieval chain definition
    hybrid_chain = RetrievalQA.from_llm(llm=llm,
                                        retriever=ensemble_retriever,
                                        prompt=generation_prompt)



    # query rewriting definition
    zero_shot_prompt = PromptTemplate(template=zero_shot_prompt_template,
                                      input_variables=["question"])
    query_rewriting_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

    # set chainlit user session variables. We will reuse them later!!
    cl.user_session.set("retrieval_chain", hybrid_chain)
    cl.user_session.set("db", db_semantic)
    cl.user_session.set("query_rewriting_chain", query_rewriting_chain)

    await cl.Message(content=f"The RAG system is now ready for use. Please send in your questions!").send()

In [None]:
# define function to handle incoming user requests
@cl.on_message
async def generation(message: cl.Message):
    question = message.content
    retrieval_chain = cl.user_session.get("retrieval_chain")
    query_rewriting_chain = cl.user_session.get("query_rewriting_chain")
    db = cl.user_session.get("db")

    rewritten_query = await query_rewriting_chain.acall({"question": question},
                                                        callbacks=[cl.AsyncLangchainCallbackHandler()])

    # LLMChain returns a dict; the rewritten query string is under the 'text' key
    response = await retrieval_chain.acall({"context": db.as_retriever(), "query": rewritten_query['text']},
                                 callbacks=[cl.AsyncLangchainCallbackHandler()])

    await cl.Message(response['result']).send()

In [None]:
%%bash
cat << \EOF >  advanced_rag_chatbot.py
# RUN: chainlit run advanced_rag_chatbot.py

import chainlit as cl
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain_community.vectorstores.faiss import DistanceStrategy
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_groq import ChatGroq
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA, LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
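from langchain_community.document_loaders import PyPDFLoader
import os

SENTENCE_TRANSFORMERS_HOME = "/content/sample_data/models"
MODEL_KWARGS = {"device": "cpu"}
ENCODE_KWARGS = {"normalize_embeddings": True}
embedding_model_name = "BAAI/bge-small-en-v1.5"

# Assumption: the Groq API key is exported as the GROQ_API_KEY environment variable
# before `chainlit run`; Colab's `userdata` may not be reachable from this standalone script.
api_key_GROQ = os.environ.get("GROQ_API_KEY")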

# Generation
qa_template = """
Use the following pieces of context {context} to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Question: {question}
Answer:
"""

# Query Rewriting prompt
zero_shot_prompt_template = """
You are a helpful assistant that rewrites the original search query for clarity and better understanding.
If the original query is confusing, rewrite it with different phrasing to elaborate on what the user is trying to ask.
If there are acronyms or words you are not familiar with, do not try to rephrase them.
Return rewritten query.

Original question: {question}

Rewritten Query:
"""

# define all variables to be used on system start
@cl.on_chat_start
async def ingestion_retrieval():
    files = None

    # Wait for the user to upload a file
    while files is None:
        files = await cl.AskFileMessage(
            content="Please upload a text or pdf file to begin!", accept=["text/plain", "application/pdf"],
            max_size_mb=8
        ).send()

    uploaded_file = files[0]

    loader = PyPDFLoader(uploaded_file.path)
    doc = loader.load()

    # Sending a pdf with the local file path
    elements = [
        cl.Pdf(name=uploaded_file.name, display="side", path=uploaded_file.path)
    ]

    # Reminder: The name of the pdf must be in the content of the message
    await cl.Message(content=f"You have uploaded {uploaded_file.name}. "
                             f"Click on it to view it in sidebar.", elements=elements).send()

    await cl.Message(content=f"Ingesting {uploaded_file.name} "
                             f"in a database. This operation may take a while").send()

    # text splitter
    embedding_model = HuggingFaceBgeEmbeddings(model_name=embedding_model_name,
                                           model_kwargs=MODEL_KWARGS,
                                           encode_kwargs=ENCODE_KWARGS,
                                           cache_folder=SENTENCE_TRANSFORMERS_HOME)
    text_splitter_semantic = SemanticChunker(embedding_model,
                                         breakpoint_threshold_type="percentile")
    split_docs_semantic = text_splitter_semantic.split_documents(doc)

    # db creation
    db_semantic = FAISS.from_documents(documents=split_docs_semantic,
                                   embedding=embedding_model,
                                   distance_strategy=DistanceStrategy.COSINE)

    # hybrid retriever definition
    keyword_retriever = BM25Retriever.from_documents(split_docs_semantic)
    keyword_retriever.k =  5
    vectorstore_retreiver = db_semantic.as_retriever(search_kwargs={"k": 5})
    ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver,
                                                      keyword_retriever],
                                          weights=[0.2, 0.8])

    # llm definition
    generation_prompt = PromptTemplate(input_variables=["context", "question"],
                        template=qa_template)
    llm = ChatGroq(temperature=0,
               model_name="gemma2-9b-it",
               api_key=api_key_GROQ)

    # retrieval chain definition
    hybrid_chain = RetrievalQA.from_llm(llm=llm,
                                        retriever=ensemble_retriever,
                                        prompt=generation_prompt)



    # query rewriting definition
    zero_shot_prompt = PromptTemplate(template=zero_shot_prompt_template,
                                      input_variables=["question"])
    query_rewriting_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

    # set chainlit user session variables. We will reuse them later!!
    cl.user_session.set("retrieval_chain", hybrid_chain)
    cl.user_session.set("db", db_semantic)
    cl.user_session.set("query_rewriting_chain", query_rewriting_chain)

    await cl.Message(content=f"The RAG system is now ready for use. Please send in your questions!").send()

# define function to handle incoming user requests
@cl.on_message
async def generation(message: cl.Message):
    question = message.content
    retrieval_chain = cl.user_session.get("retrieval_chain")
    query_rewriting_chain = cl.user_session.get("query_rewriting_chain")
    db = cl.user_session.get("db")

    rewritten_query = await query_rewriting_chain.acall({"question": question},
                                                        callbacks=[cl.AsyncLangchainCallbackHandler()])

    response = await retrieval_chain.acall({"context": db.as_retriever(), "query": rewritten_query['text']},
                                 callbacks=[cl.AsyncLangchainCallbackHandler()])

    await cl.Message(response['result']).send()

EOF

In [None]:
# CHAINLIT
!chainlit run advanced_rag_chatbot.py -w &> /content/logs.txt &

In [None]:
# Replace the placeholder with your own ngrok authtoken; do not hardcode real tokens in a shared notebook
!ngrok config add-authtoken YOUR_NGROK_AUTHTOKEN

from pyngrok import ngrok
ngrok_tunnel = ngrok.connect(8000)
print('Public URL:', ngrok_tunnel.public_url)

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml
Public URL: https://b81a-34-16-193-82.ngrok-free.app


In [None]:
ngrok.kill()

In [None]:
!ps -ef |grep chainlit | awk '{print $2}' | xargs kill -9
!ps -ef |grep ngrok | awk '{print $2}' | xargs kill -9

kill: (19062): No such process
^C
kill: (19068): No such process
^C
