## RAG Workflow 
This notebook demonstrates the basic workflow of Retrieval-Augmented Generation (RAG) for testing purposes.

#### Steps included:
1. Importing necessary libraries from LangChain, HuggingFace, and other tools.
2. Loading PDF documents from a directory using LangChain's DirectoryLoader.
3. Splitting the documents into overlapping text chunks for better retrieval.
4. Creating vector embeddings of these chunks and storing them in a Chroma vector database.
5. Setting up an LLM (Google Gemini) for answer generation based on retrieved chunks.
6. Running retrieval QA to answer user queries based on the ingested documents.
7. Benchmarking different text splitter chunk sizes for retrieval and similarity performance.
 
The notebook is intended for experimentation and prototyping on academic PDFs.

### Import Libraries

In [2]:
import os
import pandas as pd
import time
from dotenv import load_dotenv
import numpy as np

# Langchain Components
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAI
from langchain_classic.chains.retrieval_qa.base import RetrievalQA
from sklearn.metrics.pairwise import cosine_similarity

### Data Loading and Preparation
- The data loaded here are PDF documents read from the `../data` directory.

In [3]:
# Load all PDF documents from the data directory
loader = DirectoryLoader(
    path="../data",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)

documents = loader.load()

100%|██████████| 2/2 [00:02<00:00,  1.10s/it]


In [4]:
# Preview loaded documents
documents[:5]

[Document(metadata={'producer': 'pdfTeX-1.40.17', 'creator': 'LaTeX with hyperref package', 'creationdate': '2019-05-28T00:07:51+00:00', 'author': '', 'keywords': '', 'moddate': '2019-05-28T00:07:51+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\BERT-Pre-training-of-Deep-Bidirectional-Transformers-paper.pdf', 'total_pages': 16, 'page': 0, 'page_label': '1'}, page_content='BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout}@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT, which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to 
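
As the preview shows, `PyPDFLoader` yields one `Document` per PDF page, with `source` and `page` keys in its metadata. A quick sanity check after loading is to count pages per source file. The sketch below is a minimal illustration: `FakeDocument` and `pages_per_source` are hypothetical names introduced here so the example runs without LangChain, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (one PDF page)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)


# Illustrative data only; real input would be the `documents` list loaded above.
sample_docs = [
    FakeDocument("page text", {"source": "a.pdf", "page": 0}),
    FakeDocument("page text", {"source": "a.pdf", "page": 1}),
    FakeDocument("page text", {"source": "b.pdf", "page": 0}),
]

page_counts = pages_per_source(sample_docs)
for source, n_pages in sorted(page_counts.items()):
    print(f"{source}: {n_pages} page(s)")
```

Because the helper only reads `metadata`, it should also work on the real `documents` list, assuming each entry exposes a `metadata` dict as in the preview.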

### Embedding and Vector Store Creation
- Embedding model used: `BAAI/bge-base-en-v1.5`
- LLM used: `gemini-2.0-flash`

In [5]:
# -- Set up Embedding model --
embeddings_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# -- Set up LLM model --
# Load Gemini API Key
load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")
# Validate before assigning: os.environ raises TypeError when given None
if not gemini_api_key:
    raise ValueError("GEMINI_API_KEY not found in environment variables.")
os.environ["GOOGLE_API_KEY"] = gemini_api_key

llm = GoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.1)

# -- Set up Question and Ground Truth Pairs --
questions = [
    "What does BERT stand for?",
    "What kind of attention mechanism is introduced in 'Attention Is All You Need'?",
    "How is positional encoding implemented in the Transformer model?",
    "What are the pre-training objectives used in BERT?",
    "Compare the architecture of BERT and the Transformer model.",
]

ground_truths = [
    "Bidirectional Encoder Representations from Transformers.",
    "Self-attention mechanism and multi-head attention.",
    "Positional encoding is implemented using sine and cosine functions.",
    "BERT's pre-training objectives include masked language modeling and next sentence prediction.",
    "BERT uses only the encoder part of the Transformer model."
]



In [6]:
# Function to build vector store
def build_vector_store(docs, chunk_size, chunk_overlap):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""]
    )

    chunks = splitter.split_documents(docs)
    chroma = Chroma.from_documents(
        chunks, 
        embeddings_model,
        persist_directory=f"../chroma_db/{chunk_size}_{chunk_overlap}"
    )

    return chroma

# Function to evaluate a single configuration
def evaluate_config(docs, chunk_size, chunk_overlap, k=3):
    print(f"\nEvaluating for chunk_size={chunk_size}, overlap={chunk_overlap}")
    vector_store = build_vector_store(docs, chunk_size, chunk_overlap)
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": k})

    # -- Set up RAG Chain ---
    rag_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )

    # --- Evaluate the RAG Chain ---
    results = []
    for q, gt in zip(questions, ground_truths):
        # Time the full chain call (retrieval plus LLM generation)
        start = time.time()
        result = rag_chain.invoke({"query": q})
        end = time.time()

        response = result["result"]
        # Separate name so the `docs` argument is not shadowed
        source_docs = result["source_documents"]

        # Compute cosine similarity between response and ground truth
        emb_pred = embeddings_model.embed_query(response)
        emb_true = embeddings_model.embed_query(gt)
        similarity = cosine_similarity([emb_pred], [emb_true])[0][0]

        results.append({
            "question": q,
            "ground_truth": gt,
            "response": response,
            "mean_similarity": similarity,
            "retrieval_time": round(end - start, 3)
        })

    df = pd.DataFrame(results)
    summary = {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "avg_similarity": np.mean(df["mean_similarity"]),
        "avg_retrieval_time": np.mean(df["retrieval_time"])
    }
    print(pd.DataFrame([summary]))
    return df, summary
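
To make the roles of `chunk_size` and `chunk_overlap` concrete, the sketch below slices a string into fixed-width character windows. It is a deliberate simplification, not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on the listed separators); `sliding_window_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
no_overlap = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=0)
with_overlap = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=2)

print(no_overlap)
print(with_overlap)
```

With an overlap of 2, the last two characters of each window reappear at the start of the next one, which is how overlap keeps a sentence that straddles a boundary retrievable from either chunk.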

### Chunking Parameters Evaluation
- Compares different chunking parameters (`chunk_size`, `chunk_overlap`).
- The comparison is based on the responses produced by the LLM, the mean time per query (recorded as `retrieval_time`, which covers retrieval plus generation), and the mean cosine similarity between each response and its ground truth.
- Top-k: `k=3` is chosen, so the retriever returns only the 3 chunks most similar to the query.
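
The two ideas in the list above, cosine similarity and top-k selection, can be sketched in plain Python. This is an illustration under stated assumptions: `cosine` and `top_k` are hypothetical helpers, the 2-D vectors are toy data rather than real embeddings, and the scoring Chroma actually applies depends on the distance metric configured for the collection.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for embeddings
toy_query = [1.0, 0.0]
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

top_indices = top_k(toy_query, toy_chunks, k=3)
print(top_indices)
```

The same `cosine` function mirrors the metric the notebook applies between the embedded response and the embedded ground truth.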

In [10]:
config = [
    {"chunk_size": 200, "chunk_overlap": 0},
    {"chunk_size": 512, "chunk_overlap": 50},
    {"chunk_size": 1024, "chunk_overlap": 100}
]

all_details, summaries = [], []
for cfg in config:
    df, summary = evaluate_config(documents, cfg["chunk_size"], cfg["chunk_overlap"])
    all_details.append(df)
    summaries.append(summary)

comparison_df = pd.DataFrame(summaries)


Evaluating for chunk_size=200, overlap=0
   chunk_size  chunk_overlap  avg_similarity  avg_retrieval_time
0         200              0        0.764071              1.6798

Evaluating for chunk_size=512, overlap=50
   chunk_size  chunk_overlap  avg_similarity  avg_retrieval_time
0         512             50        0.682983               1.522

Evaluating for chunk_size=1024, overlap=100
   chunk_size  chunk_overlap  avg_similarity  avg_retrieval_time
0        1024            100        0.726011              1.3054


In [11]:
comparison_df

| | chunk_size | chunk_overlap | avg_similarity | avg_retrieval_time |
|---|---|---|---|---|
| 0 | 200 | 0 | 0.764071 | 1.6798 |
| 1 | 512 | 50 | 0.682983 | 1.522 |
| 2 | 1024 | 100 | 0.726011 | 1.3054 |


Based on the table above, we observed that:
- chunk size 200, chunk overlap 0: highest similarity but longest retrieval time
- chunk size 512, chunk overlap 50: lowest similarity and intermediate retrieval time
- chunk size 1024, chunk overlap 100: intermediate similarity and fastest retrieval time
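
These observations can also be derived programmatically rather than read off the table. The sketch below copies the three summary rows from the output above into a plain list (the name `summary_rows` is introduced here for illustration) and picks the best configuration on each metric. With only five questions and a single run, small differences may not be meaningful.

```python
# Values copied from the comparison table above
summary_rows = [
    {"chunk_size": 200, "chunk_overlap": 0,
     "avg_similarity": 0.764071, "avg_retrieval_time": 1.6798},
    {"chunk_size": 512, "chunk_overlap": 50,
     "avg_similarity": 0.682983, "avg_retrieval_time": 1.522},
    {"chunk_size": 1024, "chunk_overlap": 100,
     "avg_similarity": 0.726011, "avg_retrieval_time": 1.3054},
]

# Highest similarity to the ground truth
best_quality = max(summary_rows, key=lambda row: row["avg_similarity"])
# Lowest mean time per query
fastest = min(summary_rows, key=lambda row: row["avg_retrieval_time"])

print("Highest similarity:", best_quality["chunk_size"], best_quality["chunk_overlap"])
print("Fastest:", fastest["chunk_size"], fastest["chunk_overlap"])
```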

Next, we can look into the responses generated by each config.

In [12]:
# Access first config (200, 0)
df_200 = all_details[0]

# Access second config (512, 50)
df_512 = all_details[1]

# Access third config (1024, 100)
df_1024 = all_details[2]

print("\nResponse for chunk size 200, overlap 0:")
for q, ans in zip(df_200["question"], df_200["response"]):
    print(f"Q: {q}\nR: {ans}\n")

print("\nResponse for chunk size 512, overlap 50:")
for q, ans in zip(df_512["question"], df_512["response"]):
    print(f"Q: {q}\nR: {ans}\n")

print("\nResponse for chunk size 1024, overlap 100:")
for q, ans in zip(df_1024["question"], df_1024["response"]):
    print(f"Q: {q}\nR: {ans}\n")


Response for chunk size 200, overlap 0:
Q: What does BERT stand for?
R: Bidirectional Encoder Representations from

Q: What kind of attention mechanism is introduced in 'Attention Is All You Need'?
R: The paper "Attention Is All You Need" introduces the self-attention mechanism (sometimes called intra-attention).

Q: How is positional encoding implemented in the Transformer model?
R: The context states that positional encodings are added to the input embeddings to inject information about the relative or absolute position of the tokens in the sequence.

Q: What are the pre-training objectives used in BERT?
R: The provided context mentions "No NSP" as a training objective used with the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE. It also mentions that the BERT model is initialized with pre-trained parameters and fine-tuned using labeled data. However, the context doesn't explicitly list *all* the pre-training objectives used in BERT.

Q: Compare the arch