# Retrieval Augmented Generation (RAG)

## Why RAG?

Large language models (LLMs) have several limitations on their own:

1. **Knowledge Cutoff**: *They don't know information after their training date*

   * **Training data freeze**: Models are trained on data up to a specific cutoff date
   * **Growing knowledge gap**: Information becomes increasingly outdated as time passes
   * **Missing recent events**: No knowledge of current news, updates, releases, or discoveries
   * **Example**: Can't answer "Who won the 2024 election?" or "Latest iPhone features"
  
2. **No Real-time Data**: *Can't access current information*

    * **Static knowledge only**: Cannot connect to internet, APIs, or live databases
    * **No dynamic information**: Cannot fetch stock prices, weather, traffic, or breaking news
    * **Frozen in time**: Information reflects training period, not current state
    * **Example**: Cannot provide today's weather or current market conditions
  
3. **Hallucination**: *May generate plausible but incorrect information*

    * **Confident fabrication**: Creates believable but false information when uncertain
    * **Pattern-based guessing**: Fills knowledge gaps with plausible-sounding responses
    * **No verification mechanism**: Cannot fact-check or validate generated content
    * **Example**: May invent fake statistics, URLs, or medical advice
  
4. **Domain-Specific Knowledge**: *Limited knowledge about your private/company data*

    * **Public data only**: Only knows information available during training
    * **No private access**: Cannot read internal documents, policies, or proprietary data
    * **Generic responses**: Cannot provide company-specific procedures or information
    * **Example**: Cannot answer questions about internal APIs, company policies, or customer data

5. **Memory Limitations**: *Can't remember previous conversations*

    * **No conversation history**: Each session starts fresh with no memory of past interactions
    * **Context window limits**: Forgets earlier parts of long conversations
    * **No user preferences**: Cannot learn or adapt to individual user needs over time
    * **Example**: User must re-explain context and preferences in every new session

## How RAG Solves These Problems

RAG bridges these gaps by:

* **Retrieving current information** from updated knowledge bases
* **Grounding responses** in verified, sourced content
* **Accessing private data** through custom document collections
* **Maintaining context** through conversation and document history

## Part 0: Introduction, Installations and Environment

**Indexing**

1. **Load**: First we need to load our data. This is done with *Document Loaders*.
2. **Split**: *Text splitters* break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a *VectorStore and Embeddings model*.

![RAG_VectorStore.png](RAG_VectorStore.png)

**Retrieval and Generation**

4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a *Retriever*.
5. **Generate**: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieve_Generate.png](Retrieve_Generate.png)
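The five steps above can be sketched end to end in plain Python. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embeddings model, the `store` list stands in for a vector store, and the sample text and helper names are all hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Load: a hard-coded string stands in for a Document Loader.
raw_text = "Cats sleep most of the day. Dogs enjoy long walks. Parrots can mimic human speech."

# 2. Split: one chunk per sentence (real splitters work with size limits).
chunks = [s.strip() for s in raw_text.split(".") if s.strip()]

# 3. Store: keep (chunk, vector) pairs in a list (stand-in for a vector store).
store = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Retrieve: rank stored chunks by similarity to the question.
def retrieve(question: str, k: int = 1) -> list:
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 5. Generate: build the prompt that would be sent to a chat model.
def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What do dogs enjoy?"))
```

The final model call is left out on purpose; the sketch stops at the prompt so that it runs without an API key.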

In [1]:
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain sentence-transformers

In [2]:
! pip install bs4

In [3]:
import os
from dotenv import load_dotenv
 
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

## Part 1: Overview
We’ll build an app that answers questions about a website's content. The website we use is the "LLM Powered Autonomous Agents" blog post by Lilian Weng, so the questions we ask will be about the contents of that post.

We can create a simple indexing pipeline and RAG chain to do this in ~50 lines of code.

In [6]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

retriever = vectorstore.as_retriever()

In [11]:
from langchain import hub
from langchain_openai import ChatOpenAI

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [12]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
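To make the `|` syntax in the chain above less opaque, here is a toy sketch of how pipe-style composition can work. It is not LangChain's implementation; the `Step` class, `as_step` helper, and all the fake components are hypothetical stand-ins that only mimic the shape of the chain.

```python
class Step:
    """Wraps a function so that `a | b` means 'run a, then feed its output to b'."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, x):
        return self.fn(x)

    def __or__(self, other):
        nxt = as_step(other)
        return Step(lambda x: nxt.invoke(self.invoke(x)))

    def __ror__(self, other):
        # Handles `dict | Step`, where the left operand is a plain dict.
        return as_step(other) | self

def as_step(obj):
    """Coerce dicts and plain callables into Step objects."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        # Every value receives the same input; results are collected by key.
        branches = {key: as_step(value) for key, value in obj.items()}
        return Step(lambda x: {key: step.invoke(x) for key, step in branches.items()})
    return Step(obj)

# Hypothetical stand-ins for the retriever, formatter, prompt, and model.
fake_retriever = Step(lambda q: [f"doc about {q}"])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda x: x)
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: p.upper())

toy_chain = (
    {"context": fake_retriever | join_docs, "question": passthrough}
    | fake_prompt
    | fake_llm
)
result = toy_chain.invoke("task decomposition")
print(result)
```

The dictionary at the head of the chain fans the same question out to two branches, which is why the prompt receives both a `context` and a `question` key.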

In [13]:
# Question
rag_chain.invoke("What is Task Decomposition?")

'Task decomposition is the process of breaking down a complex task into smaller, manageable sub-tasks or steps. This can be achieved through various methods, including simple prompting, task-specific instructions, or human inputs. Techniques like Chain of Thought (CoT) and Tree of Thoughts further enhance this process by structuring reasoning and exploring multiple possibilities at each step.'

In [15]:
# import
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# load the document
loader = TextLoader("shakespeare.txt")
documents = loader.load()

# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma with a specific collection name
db = Chroma.from_documents(
    docs, 
    embedding_function,
    collection_name="shakespeare"
)

# query it
query = "What is Malcolm?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)



MALCOLM. Let us seek out some desolate shade and there
    Weep our sad bosoms empty.
  MACDUFF. Let us rather
    Hold fast the mortal sword, and like good men
    Bestride our downfall'n birthdom. Each new morn
    New widows howl, new orphans cry, new sorrows
    Strike heaven on the face, that it resounds
    As if it felt with Scotland and yell'd out
    Like syllable of dolor.
  MALCOLM. What I believe, I'll wall;
    What know, believe; and what I can redress,
    As I shall find the time to friend, I will.
    What you have spoke, it may be so perchance.
    This tyrant, whose sole name blisters our tongues,
    Was once thought honest. You have loved him well;
    He hath not touch'd you yet. I am young, but something
    You may deserve of him through me, and wisdom  
    To offer up a weak, poor, innocent lamb
    To appease an angry god.
  MACDUFF. I am not treacherous.
  MALCOLM. But Macbeth is.
    A good and virtuous nature may recoil


In [16]:
from langchain_core.prompts import PromptTemplate

# Combine the content of the retrieved documents for context
context = docs[0].page_content

# Create a prompt for the language model
prompt = PromptTemplate.from_template(
    "Based on the following context, answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
)

# Initialize the language model (e.g., OpenAI)
llm_model="gpt-4o-mini"
llm = ChatOpenAI(temperature=0.0, model=llm_model) # Replace with your model of choice

# Generate an answer using the language model
answer = llm.invoke(prompt.format(context=context, query=query))

# Print the generated answer
print(answer)

content='Malcolm is a character in Shakespeare\'s play "Macbeth." He is the son of King Duncan and is portrayed as a noble and virtuous figure. In the context provided, he is discussing the state of Scotland under Macbeth\'s tyrannical rule and is contemplating the need for action against the tyranny. He expresses a sense of responsibility and a desire to restore order, while also testing Macduff\'s loyalty and intentions.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 85, 'prompt_tokens': 286, 'total_tokens': 371, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_34a54ae93c', 'id': 'chatcmpl-BeP2GVtfXY1k0pZ3DcHpAuWKECSpK', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--e6e91c2c-ead6-477e-9a

## Part 2: Indexing
Several algorithms can be used to index the splits generated from the source document: Hierarchical Navigable Small World (HNSW, used by Chroma), Inverted File Index (IVF, e.g. in FAISS and Pinecone), Locality-Sensitive Hashing (LSH), and tree-based indexing.
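As one illustration of what such an index does, the sketch below mimics the inverted-file idea: vectors are grouped under their nearest centroid, and a query scans only the closest group instead of every vector. The 2-D vectors and fixed centroids are hypothetical; a real index would learn centroids (for example with k-means) and usually probe more than one group.

```python
import math

# Hypothetical 2-D "embeddings"; real ones have hundreds of dimensions.
vectors = {
    "cat": (0.9, 0.1),
    "kitten": (0.8, 0.2),
    "car": (0.1, 0.9),
    "truck": (0.2, 0.8),
}

# Fixed centroids for illustration only.
centroids = [(1.0, 0.0), (0.0, 1.0)]

def nearest_centroid(vec):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

# Build: the "inverted file" maps each centroid to the vectors assigned to it.
inverted_file = {i: [] for i in range(len(centroids))}
for name, vec in vectors.items():
    inverted_file[nearest_centroid(vec)].append(name)

def search(query, k=1):
    """Scan only the bucket of the closest centroid, then rank within it."""
    bucket = inverted_file[nearest_centroid(query)]
    return sorted(bucket, key=lambda name: math.dist(query, vectors[name]))[:k]

print(inverted_file)
print(search((0.95, 0.05), k=2))
```

The trade-off is speed for recall: a true nearest neighbour that happens to sit in a different bucket is never examined.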


In [1]:
# Documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat."

In [2]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base")



8

In [22]:
from langchain_openai import OpenAIEmbeddings

embd = OpenAIEmbeddings()
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)

print(f'Length of query: {len(query_result)}')
print(f'Length of document: {len(document_result)}')

Length of query: 1536
Length of document: 1536


In [23]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.8806915835035412
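Cosine similarity depends only on the angle between two vectors, not on their lengths. A small sketch with hand-picked 2-D vectors (hypothetical values, chosen so the results are easy to check by hand) makes the range of the score concrete:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite direction: similarity is (approximately) -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Values close to 1 therefore indicate that two texts point in nearly the same direction in embedding space.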


In [24]:
# Load blog
import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

There are different text splitters available.

**RecursiveCharacterTextSplitter** is the recommended one for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together for as long as possible, since those tend to be the most semantically related pieces of text.
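The recursive strategy described above can be sketched in a few lines. This is a simplified illustration, not the library's code: the function name `recursive_split` is hypothetical, sizes are measured in characters, there is no overlap, and the separator at a chunk boundary is dropped.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    # The empty separator is the last resort: cut fixed-size character slices.
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))
        elif part:
            pieces.append(part)
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph here.\n\n"
    "Second paragraph is quite a bit longer than the first one.\n\n"
    "Third."
)
sample_chunks = recursive_split(sample, chunk_size=30)
for chunk in sample_chunks:
    print(repr(chunk))
```

Short paragraphs survive intact, while the long middle paragraph is broken at word boundaries because no coarser separator could make it fit.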

In [25]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, 
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(blog_docs)
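The effect of `chunk_size` and `chunk_overlap` is easiest to see on a plain sliding window. The sketch below is an assumption-laden simplification: integers stand in for tokens, and the hypothetical `sliding_chunks` helper cuts at fixed positions, whereas the real splitter also respects separators, so its boundaries will differ.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Fixed-size windows where each window repeats the last `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks

toy_tokens = list(range(10))  # stand-in for token ids
windows = sliding_chunks(toy_tokens, chunk_size=4, chunk_overlap=1)
print(windows)
```

Each window shares its first token with the end of the previous one, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.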

## Part 3: Retrieval

In [26]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

In [27]:
docs = retriever.invoke("What is Task Decomposition?")

In [29]:
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='Component One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a

In [30]:
len(docs)

4
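The retriever returned four documents because it keeps only the top `k` matches, and `k` was left at its default. A minimal sketch of that selection step, with made-up similarity scores and a hypothetical `top_k` helper:

```python
import heapq

def top_k(scored_docs, k=4):
    """Return the k highest-scoring (score, text) pairs, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[0])

# Hypothetical similarity scores for six chunks.
scored = [
    (0.91, "Task Decomposition section"),
    (0.42, "Citation"),
    (0.88, "Planning overview"),
    (0.67, "Memory types"),
    (0.80, "Tool use"),
    (0.15, "References"),
]

best_four = top_k(scored)
best_one = top_k(scored, k=1)
print([text for _, text in best_four])
```

With the LangChain retriever itself, the same knob is exposed through `vectorstore.as_retriever(search_kwargs={"k": 1})`.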

## Part 4: Generation

In [31]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})])
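What the template ultimately produces is a single string with both placeholders filled in. The sketch below reproduces that with `str.format`; the `Doc` dataclass and the two sample passages are hypothetical stand-ins for retrieved documents, and `format_docs` mirrors the helper used in Part 1.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document object."""
    page_content: str

TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""

def format_docs(docs):
    """Join only the text of each document, leaving out metadata and repr noise."""
    return "\n\n".join(doc.page_content for doc in docs)

sample_docs = [
    Doc("CoT asks the model to think step by step."),
    Doc("Tree of Thoughts explores several reasoning paths."),
]
prompt_text = TEMPLATE.format(
    context=format_docs(sample_docs),
    question="What is Task Decomposition?",
)
print(prompt_text)
```

Joining `page_content` explicitly keeps the context free of object wrappers such as `Doc(...)`, which would otherwise be sent to the model as part of the prompt.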

In [32]:
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [33]:
chain = prompt | llm

In [34]:
chain.invoke({"context":docs,"question":"What is Task Decomposition?"})

AIMessage(content='Task Decomposition is the process of breaking down a complicated task into smaller, more manageable steps. It involves techniques such as Chain of Thought (CoT), which encourages a model to think step by step to simplify complex tasks, and Tree of Thoughts, which explores multiple reasoning possibilities at each step by creating a tree structure of thoughts. Task decomposition can be achieved through simple prompting, task-specific instructions, or human inputs.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 83, 'prompt_tokens': 959, 'total_tokens': 1042, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_34a54ae93c', 'id': 'chatcmpl-BedTLjlvBINZsliU5hyQsuimn8Njj', 'service_tier': 'default', 'finish_rea

In [35]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

In [36]:
prompt_hub_rag

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [37]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_hub_rag
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

'Task Decomposition is the process of breaking down a complex task into smaller, manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts enhance this process by guiding models to think step by step and explore multiple reasoning possibilities. This approach helps in planning and executing tasks more effectively.'