# Simple RAG (Retrieval-Augmented Generation) System

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## Key Components

1. PDF processing and text extraction
2. Text chunking for manageable processing
3. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and embeddings from a configurable provider (Google Gemini in the code below)
4. Retriever setup for querying the processed documents
5. Evaluation of the RAG system

## Method Details

### Document Preprocessing

1. The PDF is loaded using PyPDFLoader.
2. The text is split into chunks using RecursiveCharacterTextSplitter with specified chunk size and overlap.
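
To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` helper below cuts at fixed character offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it ignores separators entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=4)
```

The overlap means the tail of one chunk reappears at the head of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.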

### Text Cleaning

A custom function `replace_t_with_space` is applied to clean the text chunks. Judging by its name, it replaces tab characters with spaces, which addresses a common formatting artifact of PDF text extraction.
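
The helper's source is not shown in this notebook, so the following is only a sketch of what such a cleaning step might look like, assuming (from the name) that it replaces tab characters with spaces. The `Doc` class is a hypothetical stand-in for LangChain's `Document`, which also exposes a `page_content` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def replace_tabs_with_spaces(docs: list[Doc]) -> list[Doc]:
    """Replace every tab character in each chunk with a single space.

    Assumed behaviour, inferred from the name `replace_t_with_space`.
    Chunks are modified in place and the same list is returned.
    """
    for doc in docs:
        doc.page_content = doc.page_content.replace("\t", " ")
    return docs


chunks = [Doc("Climate\tchange"), Doc("no tabs here")]
cleaned = replace_tabs_with_spaces(chunks)
```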

### Vector Store Creation

1. An embedding model creates vector representations of the text chunks. The code uses Google Gemini embeddings by default, with Amazon Bedrock as a tested alternative.
2. A FAISS vector store is created from these embeddings for efficient similarity search.
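
Conceptually, a vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below does this by brute force with Euclidean distance in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `TinyVectorStore` is a hypothetical class, not the FAISS API.

```python
import math


class TinyVectorStore:
    """Hypothetical brute-force store: exact nearest neighbours by L2 distance."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []
        self._texts: list[str] = []

    def add(self, text: str, vector: list[float]) -> None:
        self._texts.append(text)
        self._vectors.append(vector)

    def search(self, query: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return up to k stored texts closest to query, nearest first."""
        scored = [
            (text, math.dist(query, vector))
            for text, vector in zip(self._texts, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
        return scored[:k]


store = TinyVectorStore()
# Made-up 3-d "embeddings"; a real embedding model would produce these.
store.add("greenhouse gases trap heat", [0.9, 0.1, 0.0])
store.add("deforestation releases carbon", [0.7, 0.3, 0.1])
store.add("history of the printing press", [0.0, 0.1, 0.9])

top = store.search([1.0, 0.0, 0.0], k=2)
```

A scan like this compares the query against every stored vector, which is fine for a handful of chunks. FAISS performs the same kind of nearest-neighbour lookup far more efficiently on large collections.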

### Retriever Setup

1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.

### Encoding Function

The `encode_pdf` function encapsulates the entire process of loading, chunking, cleaning, and encoding the PDF into a vector store.

## Key Features

1. Modular Design: The encoding process is encapsulated in a single function for easy reuse.
2. Configurable Chunking: Allows adjustment of chunk size and overlap.
3. Efficient Retrieval: Uses FAISS for fast similarity search.
4. Evaluation: Includes a function to evaluate the RAG system's performance.

## Usage Example

The code includes a test query: "What is the main cause of climate change?". This demonstrates how to use the retriever to fetch relevant context from the processed document.

## Evaluation

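The notebook imports `simple_rag_evaluation` from `evalute_rag` to assess the retriever. For a given question it reports the number of documents retrieved, the context length, and a relevance score between 0 and 1 produced with the selected LLM. Two further helpers, `evaluate_relevance` and `evaluate_faithfulness`, score how well the context matches the question and how well a generated answer is supported by the context.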
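
The notebook's evaluation passes an LLM to its scoring functions. As a complementary, deterministic check, one could also measure how often the retriever returns a chunk containing an expected phrase. The sketch below computes such a hit rate; the `hit_rate` function, the fake retriever, and the test cases are all hypothetical and are not part of `evalute_rag`.

```python
from typing import Callable


def hit_rate(
    retrieve: Callable[[str], list[str]],
    cases: list[tuple[str, str]],
) -> float:
    """Fraction of questions for which some retrieved chunk contains the expected phrase.

    retrieve: maps a question to a list of chunk texts.
    cases: (question, expected_phrase) pairs; matching is case-insensitive.
    """
    if not cases:
        return 0.0
    hits = 0
    for question, expected in cases:
        chunks = retrieve(question)
        if any(expected.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases)


# Hypothetical retriever that always returns the same two chunks.
def fake_retrieve(question: str) -> list[str]:
    return [
        "Greenhouse gases, such as carbon dioxide (CO2), trap heat.",
        "Burning fossil fuels releases carbon dioxide.",
    ]


cases = [
    ("What are greenhouse gases?", "carbon dioxide"),
    ("What causes sea level rise?", "thermal expansion"),
]
score = hit_rate(fake_retrieve, cases)
```

With the real retriever created later in this notebook, `retrieve` could be a small wrapper such as `lambda q: [d.page_content for d in chunks_query_retriever.invoke(q)]`. A substring check is crude, so it works best as a quick regression test alongside the LLM-based scores.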

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses pretrained embedding models (Google Gemini in this notebook) for text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections.

# Package Installation and Imports

The cell below installs the packages required to run this notebook.


In [None]:
# Install required packages
!pip install -q pypdf
!pip install -q PyMuPDF
!pip install -q python-dotenv
!pip install -q langchain-community
!pip install -q langchain_google_genai
!pip install -q rank_bm25
!pip install -q faiss-cpu
!pip install -q deepeval

In [1]:
import os
os.chdir("..")
os.getcwd()

'd:\\Rahul-Github\\Daily-Task\\all_rag_techniques'

In [2]:
os.listdir()

['.deepeval',
 '.python-version',
 '.venv',
 'data',
 'evaluation',
 'evalute_rag.py',
 'helper_functions.py',
 'notebook',
 'pyproject.toml',
 'README.md',
 'test.py',
 'uv.lock',
 '__pycache__']

In [3]:
import os
import sys
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

# Original path append replaced for Colab compatibility

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from helper_functions import (
    EmbeddingProvider, 
    retrieve_context_per_question, 
    replace_t_with_space, 
    get_langchain_embedding_provider, 
    show_context
)

# Import the new free evaluation functions
from evalute_rag import simple_rag_evaluation, get_llm

In [4]:
path = "data/Understanding_Climate_Change.pdf"

### Encode document

In [5]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Gemini embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings (Tested with Gemini and Amazon Bedrock)
    embeddings = get_langchain_embedding_provider(EmbeddingProvider.GOOGLE_GENAI)
    #embeddings = get_langchain_embedding_provider(EmbeddingProvider.AMAZON_BEDROCK)

    # Create vector store
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore

In [6]:
chunks_vector_store = encode_pdf(path, chunk_size=1000, chunk_overlap=200)

### Create retriever

In [7]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 2})

### Test retriever

In [8]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

Context 1:
Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an extended period. Over the past century, human 
activities, particularly the burning of fossil fuels and deforestation, have significantly 
contributed to climate change. 
Historical Context 
The Earth's climate has changed throughout history. Over the past 650,000 years, there have 
been seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about 
11,700 years ago marking the beginning of the modern climate era and human civilization. 
Most of these climate changes are attributed to very small variations in Earth's orbit that 
change the amount of solar energy our planet receives. During the Holocene epoch, which


Context 2:
Chapter 2: Causes of Climate

In [10]:
# Simple RAG evaluation using free models
print("=== Simple RAG Evaluation with Free Models ===")
result = simple_rag_evaluation(
    retriever=chunks_query_retriever,
    test_question="What is the main cause of climate change?",
    llm_provider="gemini"  # Using free Google Gemini
)

print("\nEvaluation Results:")
print(f"Question: {result['question']}")
print(f"Documents Retrieved: {result['num_docs_retrieved']}")
print(f"Context Length: {result['context_length']} characters")
print(f"Relevance Score: {result['relevance_score']:.2f}/1.0")
print(f"Model Used: {result['model_used']}")

print("\nContext Preview:")
print(result['context_preview'])

=== Simple RAG Evaluation with Free Models ===

Evaluation Results:
Question: What is the main cause of climate change?
Documents Retrieved: 2
Context Length: 1749 characters
Relevance Score: 1.00/1.0
Model Used: gemini

Context Preview:
Understanding Climate Change 
Chapter 1: Introduction to Climate Change 
Climate change refers to significant, long-term changes in the global climate. The term 
"global climate" encompasses the planet's overall weather patterns, including temperature, 
precipitation, and wind patterns, over an exte...

In [11]:
# Alternative free evaluation options
print("=== Alternative Free Models for Evaluation ===")

# Test different questions
test_questions = [
    "What are greenhouse gases?",
    "How does deforestation contribute to climate change?",
    "What are renewable energy sources?"
]

print("\nTesting multiple questions with free Gemini model:")
for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. Testing: {question}")
    try:
        result = simple_rag_evaluation(
            retriever=chunks_query_retriever,
            test_question=question,
            llm_provider="gemini"
        )
        print(f"   Relevance Score: {result['relevance_score']:.2f}")
        print(f"   Documents Retrieved: {result['num_docs_retrieved']}")
    except Exception as e:
        print(f"   Error: {e}")

print("\n" + "="*50)
print("Free Model Options Available:")
print("1. Google Gemini (gemini-2.0-flash) - Default, requires GOOGLE_API_KEY")
print("2. Ollama (local models) - Completely free and local")
print("3. Groq (fast inference) - Free tier available")
print("\nTo use different models, install:")
print("- Ollama: pip install langchain-ollama")
print("- Groq: pip install langchain-groq")

=== Alternative Free Models for Evaluation ===

Testing multiple questions with free Gemini model:

1. Testing: What are greenhouse gases?
   Relevance Score: 1.00
   Documents Retrieved: 2

2. Testing: How does deforestation contribute to climate change?
   Relevance Score: 1.00
   Documents Retrieved: 2

3. Testing: What are renewable energy sources?
   Relevance Score: 1.00
   Documents Retrieved: 2

Free Model Options Available:
1. Google Gemini (gemini-2.0-flash) - Default, requires GOOGLE_API_KEY
2. Ollama (local models) - Completely free and local
3. Groq (fast inference) - Free tier available

To use different models, install:
- Ollama: pip install langchain-ollama
- Groq: pip install langchain-groq

In [12]:
# Advanced evaluation with custom functions
from evalute_rag import evaluate_test_cases, evaluate_relevance, evaluate_faithfulness

print("=== Advanced Evaluation with Custom Functions ===")

# Example test case
test_question = "What are the main greenhouse gases?"
expected_answer = "The main greenhouse gases are carbon dioxide, methane, nitrous oxide, and fluorinated gases."

# Get context from retriever
# `invoke` replaces the deprecated `get_relevant_documents` in recent LangChain releases
context_docs = chunks_query_retriever.invoke(test_question)
context = "\n".join([doc.page_content for doc in context_docs])

# Generate an answer using our helper functions
llm = get_llm("gemini")
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

answer_prompt = PromptTemplate.from_template("""
Based on the following context, answer the question:

Context: {context}

Question: {question}

Answer:
""")

answer_chain = answer_prompt | llm | StrOutputParser()
generated_answer = answer_chain.invoke({"context": context, "question": test_question})

print(f"Question: {test_question}")
print(f"Generated Answer: {generated_answer}")
print(f"Expected Answer: {expected_answer}")

# Evaluate using custom functions
relevance_score = evaluate_relevance(test_question, context, llm)
faithfulness_score = evaluate_faithfulness(generated_answer, context, llm)

print("\nEvaluation Scores:")
print(f"Relevance (context to question): {relevance_score:.2f}")
print(f"Faithfulness (answer to context): {faithfulness_score:.2f}")

print(f"\nContext Preview: {context[:200]}...")

=== Advanced Evaluation with Custom Functions ===
Question: What are the main greenhouse gases?
Generated Answer: According to the text, the main greenhouse gases are carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O).
Expected Answer: The main greenhouse gases are carbon dioxide, methane, nitrous oxide, and fluorinated gases.

Evaluation Scores:
Relevance (context to question): 1.00
Faithfulness (answer to context): 1.00

Context Preview: Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2)...
