# RAG (Retrieval-Augmented Generation) Tutorial

In this tutorial, we will walk through the process of building a Retrieval-Augmented Generation (RAG) system, combining document retrieval with text generation using Large Language Models.

# Pre-processing & Creating the Vector Database

## Step 1: Install the Necessary Libraries
We'll start by installing the libraries required for retrieval and generation.

In [None]:
!pip install langchain langchain_community unstructured sentence_transformers tiktoken chromadb langchain_chroma langchain_groq


## Step 2: Import Necessary Modules

Now, import the essential modules required for handling data (e.g., reading CSV files), working with embeddings, and managing document vectors.

In [None]:
import os
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
import markdown

## Step 3: Load the Dataset

Download the dataset, then load the CSV file with pandas. In this example, we assume the CSV file is named `example.csv`. Let's load it and display the first few rows.

In [None]:
!kaggle datasets download -d khaledzsa/example

In [None]:
df = pd.read_csv('example.csv')
df.head()

Unnamed: 0,title,content
0,Geographic distribution and population,"According to the Indian census of 2001, there ..."
1,Language and literature,Malayalam is the language spoken by the Malaya...
2,Arrival of Cove Reber and Saosin EP (2004-2006),After the audition process and several guest v...
3,Formation and Translating the Name (2003-2004),"The original lineup for Saosin, consisting of ..."
4,Red Hot Organization and Tommy Boy Records dis...,"In 1996, Coolio appeared on the Red Hot Organi..."


## Step 4: Set Up a Directory for Markdown Files

Create a directory to store your markdown files. The directory is named `data/markdown_files` in this example. The `os.makedirs` method ensures that the directory is created if it doesn't already exist.

Storing each record as Markdown gives the model more structure to work with when retrieving information. For example, a leading `#` marks a line as a title, as in `# Title`.

In [None]:
directory = 'data/markdown_files'
os.makedirs(directory, exist_ok=True)

In [None]:
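# Write one markdown file per row (at most the first 5,000 rows).
for i in range(min(5_000, len(df))):
    title = df['title'].iloc[i]
    content = df['content'].iloc[i]

    # The title becomes a level-1 heading, followed by the body text.
    markdown_content = f"# {title}\n\n"
    markdown_content += f"{content}\n\n"

    with open(f'{directory}/{i}.md', 'w', encoding='utf-8') as file:
        file.write(markdown_content)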

## Step 5: Read Markdown Files from the Directory

Now, we will read all markdown (`.md`) files in the specified directory. Each markdown file is converted into HTML format using the `markdown` module, and the result is stored in a list called `markdown_texts`.

In [None]:
markdown_texts = []
for filename in os.listdir(directory):
    if filename.endswith(".md"):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            markdown_content = file.read()
            html_content = markdown.markdown(markdown_content)
            markdown_texts.append(html_content)
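
Because the files are converted to HTML, tags such as `<p>` end up inside the stored text. If plain text is preferred, one option is to strip the tags before chunking. The sketch below is not part of the original pipeline; the `TextExtractor` class and `html_to_text` helper are hypothetical names, built only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


plain = html_to_text("<h1>Title</h1>\n<p>Body text.</p>")
print(repr(plain))
```

Whether to keep or strip the tags is a design choice: keeping them preserves some structure, while stripping them saves characters in each chunk.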

## Step 6: Split the Text into Chunks

Once we have all the markdown texts, we can split them into smaller chunks using the `RecursiveCharacterTextSplitter`. It tries to split on natural boundaries first (paragraphs, then lines, then words) and only falls back to splitting between characters when needed, so chunks tend to keep logical sections together. Here we split the text into chunks of at most 500 characters, with a 50-character overlap.


![Alt text](https://miro.medium.com/v2/resize:fit:1400/1*jPdizCAKT6c_PrLoi9NYEA.png "Split Chunks")

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = text_splitter.create_documents(markdown_texts)
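
To make the chunk size and overlap parameters concrete, here is a simplified sliding-window sketch in plain Python. It is only an illustration of the overlap idea, not how `RecursiveCharacterTextSplitter` works internally: the hypothetical `sliding_window_chunks` helper ignores paragraph and word boundaries entirely.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: boundaries such as paragraphs
    or words are ignored.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.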

## Step 7: Initialize the Embedding Model & Create a Vector Store Using Chroma

Next, we initialize the embedding model using `SentenceTransformerEmbeddings`. We'll use the pre-trained model "all-MiniLM-L6-v2" for generating embeddings from the text chunks.

In [None]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(documents, embedding_function, persist_directory="./chroma_db")



After we get the embeddings, we'll use the Chroma library to create a vector store. The vector store will store the document embeddings, allowing for efficient similarity searches later. The embeddings will be persisted in the `chroma_db` directory.
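
The idea behind a similarity search can be sketched in plain Python: every text becomes a vector, and the query vector is compared against each stored vector. The three-dimensional vectors and chunk labels below are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and Chroma uses an approximate nearest-neighbor index with its own distance metric rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "vector store": label -> made-up embedding.
toy_store = {
    "chunk about languages": [0.9, 0.1, 0.0],
    "chunk about music": [0.1, 0.9, 0.2],
    "chunk about geography": [0.7, 0.0, 0.6],
}
query_vector = [1.0, 0.0, 0.1]  # made-up embedding of the query

# Rank stored chunks from most to least similar to the query.
ranked = sorted(
    toy_store,
    key=lambda name: cosine_similarity(query_vector, toy_store[name]),
    reverse=True,
)
print(ranked)
```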

# Using the Vector Database (Saved Embeddings)

## Step 1: Import Libraries

To build a queryable system, we will import additional modules to use Chroma for storage, build a retrieval chain, and perform natural language processing with Groq.

In [None]:
import os
import json
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

## Step 2: Define Persistent Directory for Chroma DB

Specify the persistent directory where the Chroma DB will be stored. Here, we set the directory as `chroma_db` and initialize the vector store with the embedding function.

In [None]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Must point to the same directory that was used when the vector store was created.
persist_directory = "./chroma_db"
db = Chroma(persist_directory=persist_directory, embedding_function=embedding_function)
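
If the path is wrong, the store may simply come up empty and queries return nothing, which is easy to miss. A small sanity check before loading can help. The sketch below uses only the standard library; `vector_store_exists` is a hypothetical helper, and the placeholder file merely stands in for persisted data.

```python
import tempfile
from pathlib import Path


def vector_store_exists(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(persist_directory)
    return path.is_dir() and any(path.iterdir())


# Demonstration with a temporary directory instead of a real Chroma store.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = vector_store_exists(tmp)  # directory exists but is empty
    (Path(tmp) / "placeholder.bin").write_bytes(b"")  # stand-in for persisted data
    populated_result = vector_store_exists(tmp)

print(empty_result, populated_result)  # False True
```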



## Step 3: Define a Function to Query the Chroma Vector Store

Next, let's define a function `query_chroma_db` that takes a query, searches the Chroma vector store for similar documents based on embeddings, and returns the `top_k` most relevant results.

In [None]:
def query_chroma_db(query, db, top_k=5):
    # Pass top_k through so the caller controls how many chunks are returned.
    docs = db.similarity_search(query, k=top_k)
    results = [doc.page_content for doc in docs]
    return results

In [None]:
query_chroma_db(" The number of Malayalam speakers in Lakshadweep", db)

['<p>According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of',
 'but is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in India spoke a second language and 19.64% of the total knew three or more languages.  Large numbers o

## Step 4: Create & Test the Retrieval with a Sample Query

You can now test the system by querying the Chroma vector store. Here is an example query asking for information about Malayalam speakers in Lakshadweep.

First, let's create the prompt template. It should contain the `context` and the user's `question`.

In [None]:
PROMPT_TEMPLATE="""
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

prompt_template = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
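
To see what the model will actually receive, it can help to fill the template by hand. The sketch below uses plain `str.format` as an approximation of what the prompt template does with its two input variables. The context sentence is taken from the dataset excerpt used in this tutorial, and the question wording is a made-up example.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
Context: {context}
Question: {question}
Your answer:
"""

# Substitute the two placeholders with concrete values.
filled_prompt = PROMPT_TEMPLATE.format(
    context="The number of Malayalam speakers in Lakshadweep is 51,100.",
    question="How many Malayalam speakers are there in Lakshadweep?",
)
print(filled_prompt)
```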

Now let's define the LLM that we will use in our RAG system. In this case, we will use `llama3-8b-8192` through `ChatGroq`.

In [None]:
groq_api_key = "your_api_key"
llm = ChatGroq(temperature=0, groq_api_key=groq_api_key, model_name="llama3-8b-8192")
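
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been stored in an environment variable: both the variable name `GROQ_API_KEY` and the `get_groq_api_key` helper are illustrative assumptions, not part of the original tutorial.

```python
import os


def get_groq_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable before creating the LLM.")
    return key
```

The returned value can then be passed as `groq_api_key=get_groq_api_key()` when constructing `ChatGroq`.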

Now create a chain that combines the defined `llm` and `prompt_template`. Setting `verbose=True` lets you see detailed logs of the chain's execution, including the constructed prompt and the model's output.

In [None]:
MODEL = LLMChain(llm=llm,
                 prompt=prompt_template,
                 verbose=True)



In this step, we'll create a function called `query_rag` that will handle the process of combining document retrieval with question answering. The function will:

1. Perform a similarity search on the Chroma database (`db`) to retrieve the top 4 relevant documents for the query.
2. Concatenate the content of the retrieved documents to form the context for the language model.
3. Pass both the retrieved context and the user query to the model to generate a response using Retrieval-Augmented Generation (RAG).
4. Return the model's response.

In [None]:
def query_rag(query: str):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    context_text = "\n\n".join([doc.page_content for doc, _score in similarity_search_results])

    rag_response = MODEL.invoke({"context": context_text, "question": query})

    return rag_response

Now let's see what the output looks like.

In [None]:
response = query_rag("The number of Malayalam speakers in Lakshadweep")
response



Prompt after formatting:
Answer the question based only on the following context:
Context: <p>According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of

but is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in 

{'context': '<p>According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of\n\nbut is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in India spoke a second language and 19.64% of the total knew three or more languages.  Large

In [None]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')

Context:
<p>According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of

but is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in India spoke a second language and 19.64% of the total knew three or more languages.  Large numbe

The `query_rag` function can be improved by filtering out weakly related chunks before building the prompt. Note that the score returned by `similarity_search_with_score` on a Chroma store is a distance, not a percentage similarity: lower scores mean closer matches. Filtering therefore means keeping only the chunks whose score is at or below a cutoff such as 0.8:

1. **Filter by Score:** Only include documents whose distance score is 0.8 or lower.
2. **Reduce Token Count:** By filtering out less relevant documents, the context provided to the model will be shorter, saving on token usage in the prompt.

In [None]:
query = "The number of Malayalam speakers in Lakshadweep"
similarity_search_results = db.similarity_search_with_score(query, k=4)

In [None]:
print("First: ", similarity_search_results[0][0].page_content)
print("Second: ", similarity_search_results[1][0].page_content)
print("Third: ", similarity_search_results[2][0].page_content)
print("Fourth: ", similarity_search_results[3][0].page_content)

First:  <p>According to the Indian census of 2001, there were 30,803,747 speakers of Malayalam in Kerala, making up 93.2% of the total number of Malayalam speakers in India, and 96.7% of the total population of the state. There were a further 701,673 (2.1% of the total number) in Karnataka, 557,705 (1.7%) in Tamil Nadu and 406,358 (1.2%) in Maharashtra. The number of Malayalam speakers in Lakshadweep is 51,100, which is only 0.15% of the total number, but is as much as about 84% of the population of
Second:  but is as much as about 84% of the population of Lakshadweep. In all, Malayalis made up 3.22% of the total Indian population in 2001. Of the total 33,066,392 Malayalam speakers in India in 2001, 33,015,420 spoke the standard dialects, 19,643 spoke the Yerava dialect and 31,329 spoke non-standard regional variations like Eranadan. As per the 1991 census data, 28.85% of all Malayalam speakers in India spoke a second language and 19.64% of the total knew three or more languages.  Larg

In [None]:
print(similarity_search_results[0][1])
print(similarity_search_results[1][1])
print(similarity_search_results[2][1])
print(similarity_search_results[3][1])

0.278103232383728
0.44460198283195496
0.8611245155334473
0.8742256164550781
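
The scores grow from the first result to the fourth, while the first result is the most relevant one, which is consistent with the score being a distance where lower means closer. Under that assumption, a cutoff keeps the results at or below a maximum distance. The sketch below applies this to the four scores printed above; `filter_by_max_distance` is a hypothetical helper, and the labels stand in for the actual documents.

```python
# (label, score) pairs using the four scores printed above.
scored_chunks = [
    ("First", 0.278103232383728),
    ("Second", 0.44460198283195496),
    ("Third", 0.8611245155334473),
    ("Fourth", 0.8742256164550781),
]


def filter_by_max_distance(results, max_distance):
    """Keep only the results whose distance is at or below max_distance."""
    return [label for label, score in results if score <= max_distance]


kept = filter_by_max_distance(scored_chunks, max_distance=0.80)
print(kept)  # ['First', 'Second']
```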


In this step, we'll refine the `query_rag` function by introducing a threshold parameter. Because the scores are distances (lower means more similar), the threshold acts as a maximum rather than a minimum. Here's how it works:

1. **Threshold Parameter:** The function takes a `threshold` value (a float) that defines the largest acceptable distance score.
2. **Filter Results:** Only documents with a score at or below the specified threshold will be included in the context.
3. **Concatenate Context:** The filtered documents are concatenated to form the context, which is passed to the language model.
4. **Generate Response:** The model then generates a response based on the provided context and query.


In [None]:
def query_rag_with_threshold(query: str, threshold: float):
    similarity_search_results = db.similarity_search_with_score(query, k=4)
    # Scores are distances: keep only the chunks that are close enough to the query.
    context_text = "\n\n".join([doc.page_content for doc, score in similarity_search_results if score <= threshold])
    rag_response = MODEL.invoke({"context": context_text, "question": query})
    return rag_response

With a threshold of 0.80, only the first two chunks printed above (scores of about 0.28 and 0.44) pass the filter, so the context shrinks from four chunks to two.

In [None]:
response = query_rag_with_threshold("The number of Malayalam speakers in Lakshadweep", 0.80)
response

In [None]:
print(f'Context:\n{response["context"]}\n\nQuestion:\n{response["question"]}\n\nText: \n{response["text"]}')