# Creating Retrieval Augmented Generation with Python

This notebook uses free, locally hosted LLMs to build a Retrieval Augmented Generation (RAG) pipeline that reads PDF documents with LangChain. LangChain is a framework for developing LLM-powered applications with an emphasis on composability and reliability.

__What is RAG?__
Retrieval Augmented Generation (RAG) helps an LLM by giving it access to external data, so that it can generate a response with additional context. This context can be anything from recent news to audio transcripts of a lecture or, in this notebook, recipes from a cookbook PDF and cooking blog posts.

Here are the four key steps:

- Load a vector database with encoded documents.
- Encode the query into a vector using a sentence transformer.
- Retrieve the context most relevant to the query from the vector database.
- Prompt the LLM with the retrieved context together with the query.

![title](rag_model.webp)
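
The four steps above can be sketched without any external library. This is only a toy illustration: the word-count `embed` function stands in for a real sentence transformer, the in-memory list stands in for a vector database, and the three sample sentences are made up for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a sentence transformer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: load a "vector database" with encoded documents (made-up sample text)
documents = [
    "fish tacos need tortillas and red cabbage",
    "beer potage is a thick polish soup",
    "hot spiced wine is served warm",
]
vector_db = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    # Step 2: encode the query; Step 3: rank stored documents by similarity
    query_vec = embed(query)
    ranked = sorted(vector_db, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _vec in ranked[:k]]

def build_prompt(query, k=1):
    # Step 4: combine the retrieved context with the query
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("what goes in fish tacos"))
```

The rest of the notebook follows the same shape, with MiniLM producing the vectors, Chroma storing them, and Mistral receiving the final prompt.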

The main value propositions of LangChain are:

- __Components:__ The set of abstractions needed to work with language models. Components are modular and easy to use for a wide range of LLM use cases.
- __Off-the-shelf chains:__ Structured assemblies of components and modules that accomplish a specific task, such as summarization or Q&A.

LangChain has six main components for building LLM applications: Model I/O, Data Connections, Chains, Memory, Agents, and Callbacks. The framework also integrates with many tools for developing full-stack applications, such as OpenAI, Hugging Face Transformers, and vector stores like Pinecone and Chroma, among others.

__Libraries involved:__
- LangChain
- Chroma
- pypdf (used through LangChain's PyPDFDirectoryLoader)

__Models used:__
- Embedding: MiniLM
- Text-Generation: Mistral

__MiniLM__ is optimized for fast inference with small memory usage. It’s excellent for creating embeddings and handling NLP tasks on resource-constrained systems.

__Mistral__ is a recent LLM that has gained attention for its performance, especially in terms of efficiency and scalability. Its key features are an efficient architecture designed to reduce computational overhead, open weights, a small model size relative to its performance, and suitability for real-time use cases.

## Steps

### Preinstallation
- Install all relevant libraries through pip
- Set up an Ollama server with the Mistral model pulled and saved locally

### Application steps
1. Load the PDFs with PyPDFDirectoryLoader
2. Use LangChain to split the documents into chunks
3. Save a local version of the MiniLM model
4. Declare embeddings with MiniLM
5. Set up a persistent Chroma DB with MiniLM embeddings
6. Add the document chunks to the Chroma DB
7. Create a prompt template with LangChain
8. Search the Chroma DB for the most relevant chunks (based on embeddings) and return the results
9. Set up the text-generation model, Mistral
10. Send the prompt to the model and format the response

### Additional
1. Scrape web pages for some recipes
2. Use BeautifulSoupTransformer to extract the text from selected tags
3. Repeat steps 6 to 10 from above.

Credits:
- https://www.youtube.com/watch?v=uj1VnDPR9xo (https://github.com/pixegami/rag-tutorial-v2/blob/main/query_data.py)
- https://medium.com/the-modern-scientist/building-generative-ai-applications-using-langchain-and-openai-apis-ee3212400630
- https://www.comet.com/site/blog/top-5-web-scraping-methods-including-using-llms/
- https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146

## Set up

In [1]:
# install libraries
# LLM library
!pip install langchain
# Vector storage
!pip install chromadb
# Loading PDFs
!pip install pypdf
# Unit testing
!pip install pytest





In [2]:
!pip install sentence_transformers



In [3]:
!pip install -U langchain-huggingface



In [4]:
# Reader
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

# Chroma
from langchain.vectorstores.chroma import Chroma

In [47]:
from sentence_transformers import SentenceTransformer
from langchain.embeddings import HuggingFaceEmbeddings

from transformers import pipeline, AutoTokenizer

In [6]:
# Set data path
DATA_PATH = "data"

# Set Chroma path
CHROMA_PATH = "chroma"

## Creating PDF Library

In [7]:
# Load documents
document_loader = PyPDFDirectoryLoader(DATA_PATH)
documents = document_loader.load()

Ignoring wrong pointing object 14 0 (offset 0)
Ignoring wrong pointing object 39 0 (offset 0)
Ignoring wrong pointing object 42 0 (offset 0)
Ignoring wrong pointing object 3204 0 (offset 0)


Once the data is loaded, we use a text splitter to split the documents into fixed-size chunks for storage in the vector database. LangChain offers several text splitters, such as splitting by character or by code syntax.

In [8]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=80,
    length_function=len,
    is_separator_regex=False,
)

chunks = text_splitter.split_documents(documents)
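
As a rough illustration of what `chunk_size` and `chunk_overlap` control, the sketch below cuts a string into fixed-size windows that share a few characters with their neighbours. `split_with_overlap` is a hypothetical helper written for this example, not how `RecursiveCharacterTextSplitter` is implemented: the real splitter also tries to break on separators such as paragraph and line breaks, so its chunks are often shorter than the limit.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop early so the last window is not just a repeat of the previous overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

pieces = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
for piece in pieces:
    print(piece)
```

Each window repeats the last two characters of the previous one. The overlap is there so that a sentence falling on a chunk boundary still appears whole in at least one chunk.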

In [9]:
# Print to check that content of documents is available
chunks[0]

Document(metadata={'source': 'data\\Stardew Cookbook.pdf', 'page': 0}, page_content='Cookbook')

In [10]:
# Define the model name
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Load the model and save it locally
model = SentenceTransformer(model_name)

# Define where to save the model
model.save("model")

In [11]:
# Declare embeddings
from langchain_huggingface import HuggingFaceEmbeddings

# Define local path where model is saved
local_model_path = "model"

# Declare HuggingFace embeddings using the locally saved MiniLM model
embeddings = HuggingFaceEmbeddings(model_name=local_model_path)

In [12]:
def calculate_chunk_ids(chunks):
    """
    Take a list of chunks and assign each one an ID
    like "data/sample.pdf:6:2"
    in the order of Page Source : Page Number : Chunk Index,
    then return the chunks with the ID added to their metadata.
    """
    last_page_id = None
    current_chunk_index = 0
    
    for chunk in chunks:
        source = chunk.metadata.get("source")
        page = chunk.metadata.get("page")
        current_page_id = f"{source}:{page}"
            
        # If the page ID is the same as the last one, increment the index
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0
        
        # Calculate the chunk ID
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id
        
        # Add it to the page meta-data
        chunk.metadata["id"] = chunk_id
    
    return chunks
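
To see which IDs this counting logic produces, the sketch below applies the same idea to plain dictionaries. The dictionaries stand in for the `metadata` of LangChain `Document` objects, and `assign_ids` and the sample entries are hypothetical names and data used only for illustration.

```python
def assign_ids(metadatas):
    """Apply the page-based counting logic to plain metadata dictionaries."""
    last_page_id = None
    index = 0
    for meta in metadatas:
        page_id = f"{meta['source']}:{meta['page']}"
        # Same page as the previous chunk: count up; new page: restart at 0
        index = index + 1 if page_id == last_page_id else 0
        meta["id"] = f"{page_id}:{index}"
        last_page_id = page_id
    return metadatas

# Made-up sample: three chunks from page 6, then one chunk from page 7
sample = [
    {"source": "data/sample.pdf", "page": 6},
    {"source": "data/sample.pdf", "page": 6},
    {"source": "data/sample.pdf", "page": 6},
    {"source": "data/sample.pdf", "page": 7},
]
ids = [meta["id"] for meta in assign_ids(sample)]
print(ids)
# ['data/sample.pdf:6:0', 'data/sample.pdf:6:1', 'data/sample.pdf:6:2', 'data/sample.pdf:7:0']
```

The scheme relies on chunks from the same page arriving next to each other. If the same page showed up again later in the list, its counter would restart and the IDs would collide.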

In [13]:
# Add to chroma
# Load the existing database
chroma_db = Chroma(persist_directory = CHROMA_PATH, embedding_function=embeddings)

# Calculate Page IDs
chunks_with_ids = calculate_chunk_ids(chunks)

# Add or update the documents
existing_items = chroma_db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

# Only add documents that don't exist in the DB
new_chunks = []
for chunk in chunks_with_ids:
    if chunk.metadata["id"] not in existing_ids:
        new_chunks.append(chunk)

if len(new_chunks):
    print(f"Adding new documents: {len(new_chunks)}")
    new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
    chroma_db.add_documents(new_chunks, ids=new_chunk_ids)
    chroma_db.persist()
else:
    print("No new documents to add")



Number of existing documents in DB: 1439
No new documents to add


## Retrieval of Prompt

In [14]:
# Set up Prompt Template
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [20]:
from langchain.prompts import ChatPromptTemplate

# Example query
query_text = "What is the recipe for Fish Tacos?"

# Search the DB for the most relevant chunks (based on embeddings)
# Find the top 5 results based on the similarity
results = chroma_db.similarity_search_with_score(query_text, k=5)

# Create the context for the search results
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
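
The scores returned next to each document can also be used to drop weak matches before the context is built. The sketch below assumes that a lower score means a closer match, which is how LangChain documents the distance returned by Chroma's `similarity_search_with_score`. The tuples, the scores, and the threshold of 1.0 are made up for illustration, and `filter_by_distance` is a hypothetical helper.

```python
def filter_by_distance(results, max_distance):
    """Keep (document, score) pairs whose score is at or below max_distance."""
    return [(doc, score) for doc, score in results if score <= max_distance]

# Stand-in results: (text, distance) tuples with made-up scores
fake_results = [
    ("Fish Tacos recipe", 0.42),
    ("Crispy Bass recipe", 0.77),
    ("Farming tips", 1.35),
]

kept = filter_by_distance(fake_results, max_distance=1.0)
context_text = "\n\n---\n\n".join(text for text, _score in kept)
print(context_text)
```

A sensible threshold depends on the embedding model and the distance metric, so it is worth printing the scores for a few known queries before picking one.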

In [16]:
# Load the template for the prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)
print(prompt)

Human: 
Answer the question based only on the following context:

74 
 Recipe: Fish Tacos  
 
Description: It smells delicious.  
Game ingredients: Tuna, Tortilla, Red Cabbage, Mayonnaise  
This recipe restores 165 energy and 66 health. It can be obtained from Linus after achieving 7 hearts and 
it gives a +2 Fishing bonus. It sells for 500g.  
Difficulty: Medium, 1 hour. Makes 4 tacos.   
This recipe uses Tortilla.  It’s optional, you can use store -bought instead to decrease the total time from 1 
hour to 4 0 minutes. You can also use the breading recipe from Crispy Bass  for the fish in this recipe, or 
you can bake it as explained below.   
-4 tortillas   
-¼ small red cabbage  
-1 large fish fillet, thawed   
-1 large carrot  
-1 green onion  
-¼ small red onion  
-½ small tomato

---

75 
 Thinly slice the cab bage, julienne the carrot (cut the carrot on an angle and then slice them thinly as 
shown below), and then finely chop the green onion, red onion, and the tomato. Combine 

## Generate Response with Mistral

In [17]:
from langchain_community.llms.ollama import Ollama

model = Ollama(model="mistral")
response_text = model.invoke(prompt)

sources = [doc.metadata.get("id", None) for doc, _score in results]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)



Response:  The recipe for Fish Tacos, as provided in the context, consists of the following steps:

1. Thinly slice a quarter of small red cabbage and julienne one large carrot. Finely chop one green onion, one-fourth of a small red onion, and half of a small tomato. Combine all the vegetables in a medium bowl. Add cumin, garlic powder, chili powder, mustard powder, dried mint, and a dash of salt. Mix well.

2. Cook four tortillas on a hot pan and transfer them to a plate once they’re done.

3. Prepare one large fish fillet, thawed. When the fish finishes cooking, let it cool for 5 minutes and then cut it into four equal pieces.

4. Arrange the fish on the tortillas and spoon on some mayonnaise. Top with the vegetable mixture and then garnish with cilantro leaves. Pour on your favorite hot sauce or dab different types on the plate for varied flavors.
Sources: ['data\\Stardew Cookbook.pdf:73:0', 'data\\Stardew Cookbook.pdf:74:0', 'data\\Stardew Cookbook.pdf:70:0', 'data\\Stardew Cookboo
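
Because the retrieve, prompt, and generate steps are repeated for every question, they can be wrapped in a single helper. The sketch below is one possible shape, not the notebook's own code. `query_rag` is a hypothetical function, plain `str.format` stands in for `ChatPromptTemplate`, and the `FakeDB` and `FakeLLM` classes are stand-ins that only mimic the two methods used here, so that the example runs without Chroma or Ollama.

```python
from types import SimpleNamespace

PROMPT = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

def query_rag(query_text, db, llm, k=5):
    """Retrieve the top-k chunks, build the prompt, and ask the model."""
    results = db.similarity_search_with_score(query_text, k=k)
    context_text = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
    prompt = PROMPT.format(context=context_text, question=query_text)
    response_text = llm.invoke(prompt)
    sources = [doc.metadata.get("id") for doc, _score in results]
    return response_text, sources

class FakeDB:
    """Stand-in for the vector store: always returns one made-up chunk."""
    def similarity_search_with_score(self, query, k=5):
        doc = SimpleNamespace(page_content="Recipe: Fish Tacos",
                              metadata={"id": "sample.pdf:0:0"})
        return [(doc, 0.1)][:k]

class FakeLLM:
    """Stand-in for the model: returns a fixed answer."""
    def invoke(self, prompt):
        return "stub answer"

answer, sources = query_rag("What is the recipe for Fish Tacos?", FakeDB(), FakeLLM())
print(f"Response: {answer}\nSources: {sources}")
```

In the notebook, the same helper could presumably be called with the real objects, for example `query_rag(query_text, chroma_db, model)`.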

## Add more to Database from online articles

In [25]:
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.vectorstores import FAISS
import nest_asyncio

nest_asyncio.apply()

articles = ["http://www.geekychef.com/2020/04/beer-potage.html",
           "http://www.geekychef.com/2020/01/date-palm-cocktail.html",
           "http://www.geekychef.com/2014/03/hot-spiced-wine.html",
           "http://www.geekychef.com/2008/12/laura-moons-chili.html",
           "http://www.geekychef.com/2008/12/butterbeer.html",
           "http://www.geekychef.com/2013/10/five-flavor-soup.html",
           "https://www.geekychef.com/2023/02/steak-sandwich.html"]

# Scrapes the blog posts above
loader = AsyncChromiumLoader(articles)
docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [31]:
from langchain_community.document_transformers import BeautifulSoupTransformer

# Transform the loaded HTML using BeautifulSoupTransformer
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(docs, tags_to_extract=["h1", "p"])

# Converts HTML to plain text
# html2text = Html2TextTransformer()
# docs_transformed = html2text.transform_documents(docs)

# # Chunk text
new_documents = text_splitter.split_documents(docs_transformed)

In [32]:
# print chunk
print(new_documents[0])

page_content='Beer Potage In The Blood of Elves, Ciri and Triss are enjoying dinner at Kaer Morhen, dinner being "beer potage, thick with cheese and croutons." Potage is a French word, meaning a thick and creamy soup. Initially, I was thinking this was along the lines of German beer and cheddar soup, but apparently Polish beer soup is a little different. It's heavier and sweet, almost like a smooth porridge. It's very filling. I recommend using Żywiec Porter, a wonderful Polish import, but any full bodied dark beer will do.  So sit down, toss a coin to your Witcher, and eat some hearty beer potage.   Ingredients  For the Rye Croutons:  For the soup:  Directions   To make the rye croutons:  For the soup: Ta zupa jest absolutnie pyszna!' metadata={'source': 'http://www.geekychef.com/2020/04/beer-potage.html'}
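
The printed chunk contains the headings "Ingredients" and "Directions" but no items under them. One possible explanation, not verified against the pages themselves, is that the items sit in tags other than `h1` and `p`, such as list items. The sketch below uses Python's standard `html.parser` to show how restricting extraction to certain tags drops everything else. `TagTextExtractor` and the HTML snippet are made up for illustration and are not how `BeautifulSoupTransformer` is implemented.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text that appears inside the chosen tags."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags = set(tags_to_extract)
        self.depth = 0    # number of wanted tags currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one wanted tag
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html, tags_to_extract):
    parser = TagTextExtractor(tags_to_extract)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Made-up HTML snippet with the ingredients in list items
html = "<h1>Beer Potage</h1><p>Ingredients</p><ul><li>dark beer</li><li>rye bread</li></ul>"
print(extract_text(html, ["h1", "p"]))        # Beer Potage Ingredients
print(extract_text(html, ["h1", "p", "li"]))  # Beer Potage Ingredients dark beer rye bread
```

If the ingredient lists do live in list items, adding `"li"` to `tags_to_extract` in the cell above would be one way to bring them into the database.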


In [38]:
def calculate_web_chunk(chunks):
    """
    Take a list of URL chunks and assign each one an ID
    like "source:2"
    in the order of URL Source : Chunk Index,
    then return the chunks with the ID added to their metadata.
    """
    current_chunk_index = 0
    prev_source = None

    for chunk in chunks:
        source = chunk.metadata.get("source")

        # Same URL as the previous chunk: increment the index;
        # new URL: restart the index at 0
        if source == prev_source:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        chunk_id = f"{source}:{current_chunk_index}"

        # Add it to the chunk metadata
        chunk.metadata["id"] = chunk_id
        prev_source = source

    return chunks

In [39]:
# Calculate Page IDs
url_chunks_with_ids = calculate_web_chunk(new_documents)

# Add or update the documents
existing_items = chroma_db.get(include=[]) # IDs are always included by default
existing_ids = set(existing_items["ids"])
print(f"Number of existing documents in DB: {len(existing_ids)}")

# Only add url documents that don't exist in the DB
new_chunks = []
for chunk in url_chunks_with_ids:
    if chunk.metadata["id"] not in existing_ids:
        new_chunks.append(chunk)

if len(new_chunks):
    print(f"Adding new documents: {len(new_chunks)}")
    new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
    chroma_db.add_documents(new_chunks, ids=new_chunk_ids)
    chroma_db.persist()
else:
    print("No new documents to add")

Number of existing documents in DB: 1439
Adding new documents: 18




In [42]:
# Example query
query_text = "What is the recipe for Beer Potage?"

# Search the DB for the most relevant chunks (based on embeddings)
# Find the top 5 results based on the similarity
results = chroma_db.similarity_search_with_score(query_text, k=5)

# Create the context for the search results
context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])

In [43]:
# Load the template for the prompt
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)
print(prompt)

Human: 
Answer the question based only on the following context:

Beer Potage In The Blood of Elves, Ciri and Triss are enjoying dinner at Kaer Morhen, dinner being "beer potage, thick with cheese and croutons." Potage is a French word, meaning a thick and creamy soup. Initially, I was thinking this was along the lines of German beer and cheddar soup, but apparently Polish beer soup is a little different. It's heavier and sweet, almost like a smooth porridge. It's very filling. I recommend using Żywiec Porter, a wonderful Polish import, but any full bodied dark beer will do.  So sit down, toss a coin to your Witcher, and eat some hearty beer potage.   Ingredients  For the Rye Croutons:  For the soup:  Directions   To make the rye croutons:  For the soup: Ta zupa jest absolutnie pyszna!

---

• Cocktail Shaker
• Serving glass (Hurricane glass if you want to 
be accurate to the in-game image)
• Spoon or stirring rod
• Strainer
Non-Alcoholic Version Only
• Small pot
• Wooden spoon (or oth

In [52]:
response_text = model.invoke(prompt)

sources = [doc.metadata.get("id", None) for doc, _score in results]
formatted_response = f"Response: {response_text}\nSources: {sources}"
print(formatted_response)

Response:  To make the Beer Potage, you'll need the following ingredients:

For the Rye Croutons:
- 1 loaf of rye bread, cubed
- 2 tablespoons of olive oil
- Salt and pepper to taste

For the soup:
- 4 cups of dark beer (such as Żywiec Porter)
- 1 large onion, chopped
- 3 potatoes, peeled and diced
- 2 carrots, peeled and sliced
- 1 parsnip, peeled and diced
- 2 cloves of garlic, minced
- Salt and pepper to taste
- 4 tablespoons of butter
- 1 cup of sour cream (optional)
- Chopped chives for garnish (optional)

Directions:

1. Preheat the oven to 350°F (175°C). Toss the rye bread cubes with olive oil, salt, and pepper. Spread them on a baking sheet and bake for about 10 minutes or until golden brown. Set aside.

2. In a large pot, heat the dark beer over medium heat. Add the chopped onion, potatoes, carrots, parsnip, and garlic. Season with salt and pepper.

3. Reduce the heat to low, cover the pot, and let it simmer for about 20-30 minutes or until the vegetables are tender.

4. Using