## Building a RAG pipeline

This notebook implements a full RAG pipeline for the Lender Fees PDF document: PDF parsing, chunking, embedding, retrieval, and reranking, with the goal of improving the quality of search results.

## Installing Python packages into my local environment

In [None]:
!pip install -q llama-index llama-index-llms-gemini pymupdf
!pip install -q llama-index-embeddings-huggingface
!pip install llama-index-retrievers-bm25


## Importing libraries

In [None]:
from google.colab import files
import fitz
import os
from llama_index.core import Document
from typing import List
from llama_index.llms.gemini import Gemini
from llama_index.core import Settings
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore

## Set up Google API Key

In [None]:
GOOGLE_API_KEY = ""
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
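Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key may already be set in the environment, the helper below reads it from there and only prompts when it is missing. The name `ensure_api_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_api_key(name="GOOGLE_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting once if it is missing or empty."""
    key = os.environ.get(name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in the cell output
        key = prompt(f"Enter {name}: ").strip()
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` before creating the LLM would leave `GOOGLE_API_KEY` set in `os.environ`, the same variable the cell above assigns.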

## Load PDF and convert to a LlamaIndex-compatible format

1. Open the PDF: Uses PyMuPDF (fitz.open) to open the PDF file you give it.

2. Go through each page: Loops through the pages one by one.

3. Grab the text: For each page, it pulls out the text. If a page has no text (blank or images only), it skips it.

4. Make a document object: For pages that do have text, it creates a Document object. This object stores:
    - The page's text
    - Some extra info (metadata): the PDF's file name, the page number, and the total number of pages in the PDF.

5. Keep all pages together: Adds each of these Document objects into a list.

6. Close the PDF: After reading everything, it closes the PDF file.

7. Tell you what it did: Prints the file path and how many pages had text.

8. Return the result: Finally, it gives back the list of all Document objects with text and metadata.

👉 In short: This function takes a PDF file, pulls the text out of each page, attaches some basic info (like page number), and returns it as a list of neat, ready-to-use objects.

In [None]:
def load_pdf(pdf_path: str) -> List[Document]:
    """Load a PDF and convert it to LlamaIndex Document format using PyMuPDF."""
    doc = fitz.open(pdf_path)
    documents = []

    for i, page in enumerate(doc):
        text = page.get_text()
        if not text.strip():
            continue
        documents.append(
            Document(
                text=text,
                metadata={
                    "file_name": os.path.basename(pdf_path),
                    "page_number": i + 1,
                    "total_pages": len(doc)
                }
            )
        )
    doc.close()
    print(f"Processed {pdf_path}:")
    print(f"Extracted {len(documents)} pages with content")
    return documents

## Initialize Gemini and Embedding Model

1. Initialize Gemini LLM:
  - Creates a Gemini LLM using the "gemini-2.0-flash" model.
  - Tells the Settings object: "Use this LLM for all future tasks."

2. Set up the embedding model:
  - Loads a HuggingFace embedding model called "BAAI/bge-small-en".
  - An embedding model turns text into number vectors so the system can compare meaning between pieces of text.
  - Then it tells Settings: "Use this embedding model going forward."

**Note:** `bge-small-en` is an efficient embedding model developed by BAAI that transforms English text into 384-dimensional vectors (embeddings) for semantic search and other tasks. It is a small, fast, and high-performing model based on transformer architecture, optimized for both similarity matching and retrieval using contrastive learning. Key features include its small size, strong performance on the MTEB benchmark, and special instruction-based handling for retrieval tasks.

3. Create a semantic text splitter:
  - This tool takes a big chunk of text (like a PDF or long document) and splits it into smaller pieces.
  - Instead of cutting at random places (like fixed length), it uses the embedding model to check for changes in meaning and splits where topics shift.
  
    - buffer_size = 1 → sets how many neighboring sentences are grouped with each sentence when its meaning is evaluated.
      - With 1, each sentence is embedded together with the one before and the one after it, which smooths the comparison between neighbors. This affects where the cuts fall; it does not copy sentences into adjacent chunks.
    - breakpoint_percentile_threshold = 95 → controls how "sensitive" it is to meaning changes.
      - The splitter scores how much the meaning changes between neighboring sentences (using embeddings), takes the 95th percentile of those scores, and cuts only at the biggest jumps (roughly the top 5% of changes).
        - Higher (e.g., 98) = fewer, bigger chunks; it splits only when the meaning shift is very strong.
        - Lower (e.g., 80) = more, smaller chunks; it splits more often, even on smaller changes.
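The percentile rule can be tried on toy data. The sketch below is a simplified illustration, not the library's implementation: it ignores `buffer_size`, uses made-up 2-D vectors in place of real embeddings, and the helper names (`cosine_distance`, `percentile`, `split_points`) are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_points(embeddings, threshold_pct):
    """Return each index i where a cut falls between sentence i and sentence i + 1."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    if not distances:
        return []  # Zero or one sentence: nothing to split
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Made-up embeddings for six sentences: the topic shifts after the third one.
toy = [(1.0, 0.0), (0.98, 0.2), (0.95, 0.3), (0.1, 1.0), (0.2, 0.98), (0.0, 1.0)]

print(split_points(toy, 95))  # [2] -> a single cut at the topic shift
print(split_points(toy, 20))  # A lower threshold produces more cuts
```

With the threshold at 95, only the one large jump in meaning exceeds the cutoff, so the six sentences become two chunks. Lowering the threshold lets smaller jumps through and yields more, smaller chunks.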

In [None]:
# Initialize Gemini LLM
llm = Gemini(model="models/gemini-2.0-flash")
Settings.llm = llm

# Initialize embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")
Settings.embed_model = embed_model
# Create the semantic splitter, reusing the embedding model
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,  # How sensitive to change in meaning
    embed_model=embed_model,
)


## Pre-process PDF and create the vector index

This function takes a PDF, splits it into smart chunks, converts those chunks into embeddings, and builds a semantic search index you can query later.

In [None]:
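def process_and_index_pdf(pdf_path):
    """Load a PDF, split it into semantic chunks, and build a vector index."""
    documents = load_pdf(pdf_path)  # One Document per page with text
    nodes = splitter.get_nodes_from_documents(documents)  # Semantic chunks
    vector_index = VectorStoreIndex(nodes)  # Embeds and indexes each chunk
    print(f"Indexed {len(nodes)} document chunks")  # Count chunks, not pages
    return vector_index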

## Build RAG Pipeline

1. Collect the text chunks: Pulls every chunk of text that was created when you indexed the PDF and counts them.

2. Pick how many results to fetch ("top-k"):
- If there's only 1 chunk, fetch 1.
- If there are 2 or more, fetch 2.

*(That's what safe_top_k ensures.)*

3. Set up two ways to find relevant chunks:
- Vector search: finds chunks that mean something similar to the question.
- Keyword search (BM25): finds chunks that share exact words with the question.

4. Combine them (HybridRetriever): Runs both searches, merges the results, removes duplicates, sorts by score (best first), and keeps only the top k.

5. (Optional) Re-rank with a smarter model:
If there's more than one chunk total, it uses a cross-encoder reranker to double-check which of the top candidates best match the question, and reorders them. The reranker runs as a post-processing step on the retrieved results.

6. Broaden the search with query fusion:
- Asks the LLM to rewrite the user's question in different ways (num_queries=3 counts the original question, so 2 rewrites are generated)
- Searches with each
- Blends the results so you don't miss answers due to phrasing variations.

7. Build the final Q&A engine: Wires the fusion retriever (plus the optional reranker) into a query engine that you can call with a question to get the best matching chunk(s) and an answer.

8. Return the result: You get back a ready-to-use query_engine.
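The blending in step 6 is configured in this notebook with `mode="reciprocal_rerank"`, which refers to reciprocal rank fusion. The sketch below shows the general idea on plain lists of chunk IDs. It is an illustration rather than the library's code: the function name, the example IDs, and the constant `k=60` (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60.0, top_k=2):
    """Fuse ranked lists of IDs: each appearance adds 1 / (k + rank) to the ID's score."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):  # rank 0 is the best hit
            fused[chunk_id] += 1.0 / (k + rank)
    ordered = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


# Hypothetical results for the original question and two rewrites.
results_per_query = [
    ["fees", "rates", "terms"],
    ["rates", "fees", "closing"],
    ["fees", "closing", "rates"],
]

print(reciprocal_rank_fusion(results_per_query))  # ['fees', 'rates']
```

Because only ranks are used, the fusion does not depend on how each retriever scales its scores. A chunk that appears near the top for several phrasings ends up ahead of one that ranks well for a single phrasing.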

In [None]:
def build_rag_pipeline(index):
    nodes = list(index.docstore.docs.values()) # Gets all chunks of text that were created when PDF was indexed
    num_nodes = len(nodes) # Stores how many chunks there are
    safe_top_k = min(2, max(1, num_nodes)) # top-k is 1 for a single chunk, otherwise 2

    vector_retriever = index.as_retriever(similarity_top_k=safe_top_k) # Uses embeddings to find chunks that are semantically similar
    bm25_retriever = BM25Retriever.from_defaults( # Uses keyword search to find exact terms in chunks found in the query
        nodes=nodes,
        similarity_top_k=safe_top_k
    )

    class HybridRetriever(BaseRetriever): # Custom class to combine both vector and keyword search
        def __init__(self, vector_retriever, keyword_retriever, top_k=2):
            self.vector_retriever = vector_retriever
            self.keyword_retriever = keyword_retriever
            self.top_k = top_k
            super().__init__()

        def _retrieve(self, query_bundle, **kwargs):
            vector_nodes = self.vector_retriever.retrieve(query_bundle)
            keyword_nodes = self.keyword_retriever.retrieve(query_bundle)
            all_nodes = list(vector_nodes) + list(keyword_nodes)
            unique_nodes = {node.node_id: node for node in all_nodes}
            sorted_nodes = sorted(
                unique_nodes.values(),
                key=lambda x: x.score if x.score is not None else 0.0,  # score can be None
                reverse=True
            )
            return sorted_nodes[:self.top_k]

    hybrid_retriever = HybridRetriever( # Creates instance of class defined above
        vector_retriever=vector_retriever,
        keyword_retriever=bm25_retriever,
        top_k=safe_top_k
    )

    if num_nodes > 1:
        reranker = SentenceTransformerRerank( # Checks which chunk is most relevant to original query
            model="cross-encoder/ms-marco-MiniLM-L-12-v2", # More powerful than L-6 version
            top_n=min(2, num_nodes)
        )
        node_postprocessors = [reranker]
    else:
        node_postprocessors = []

    fusion_retriever = QueryFusionRetriever( # Searches with several phrasings of the user's query and fuses the results
        retrievers=[hybrid_retriever],
        llm=llm,
        similarity_top_k=safe_top_k,  # Stay consistent with the retrievers above
        num_queries=3,  # Total queries including the original (2 LLM-generated rewrites)
        mode="reciprocal_rerank"
    )

    query_engine = RetrieverQueryEngine.from_args( # Takes fusion retriever and reranker and combines them
        retriever=fusion_retriever,
        llm=llm,
        node_postprocessors=node_postprocessors
    )
    return query_engine # Returns output
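
One caveat with the `HybridRetriever` above: it sorts vector similarity scores and BM25 scores in a single list, although the two are on different scales, so one retriever's hits can dominate the other's. The sketch below shows one possible way to make them comparable before merging: min-max normalization per retriever. This is not what the notebook does; `Hit`, `normalize`, and `merge_hits` are hypothetical names, and the scores are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    node_id: str
    score: Optional[float]


def normalize(hits):
    """Min-max scale one retriever's scores to [0, 1]; a missing score counts as 0."""
    scores = [h.score or 0.0 for h in hits]
    if not scores:
        return []
    low, high = min(scores), max(scores)
    span = high - low
    return [
        Hit(h.node_id, 1.0 if span == 0 else (s - low) / span)
        for h, s in zip(hits, scores)
    ]


def merge_hits(vector_hits, keyword_hits, top_k=2):
    """Merge two result lists, keeping the best normalized score per node ID."""
    best = {}
    for hit in normalize(vector_hits) + normalize(keyword_hits):
        current = best.get(hit.node_id)
        if current is None or hit.score > current.score:
            best[hit.node_id] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


# Made-up scores: cosine similarities on the left, BM25 scores on the right.
vector_hits = [Hit("a", 0.82), Hit("b", 0.79), Hit("c", 0.40)]
keyword_hits = [Hit("c", 7.5), Hit("a", 3.1), Hit("d", None)]

merged = merge_hits(vector_hits, keyword_hits)
print([hit.node_id for hit in merged])  # ['a', 'c']
```

Sorted by raw score, every BM25 hit with a score above 1 would outrank all vector hits. After normalization, the top result of each retriever competes on equal footing.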

## Upload PDF Document

In [None]:
print("Please select a PDF file to upload.")
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

Please select a PDF file to upload.


Saving Sriram Srinivasan Data Analyst Resume.pdf to Sriram Srinivasan Data Analyst Resume.pdf


## Run the query using the RAG pipeline

In [None]:
index = process_and_index_pdf(pdf_path)
rag_engine = build_rag_pipeline(index)
print("Type 'Exit' to stop.")
while True:
    user_input = input()
    if user_input == 'Exit':
        break
    response = rag_engine.query(user_input)
    print('\nFinal Response:\n ---------------------- \n')
    print(response)

Processed Sriram Srinivasan Data Analyst Resume.pdf:
Extracted 1 pages with content


DEBUG:bm25s:Building index from IDs objects


Indexed 1 document chunks


Type 'Exit' to stop.

Final Response:
 ---------------------- 

Sriram Srinivasan


Final Response:
 ---------------------- 

I graduated with a Master of Science in Computer Science in May 2024 and a Bachelor of Science in Computer Science in May 2022.


Final Response:
 ---------------------- 

The latest employer is TruBridge, where the individual works as a Healthcare Data Analyst.

