# Methodology

This notebook implements a modular Retrieval-Augmented Generation (RAG) pipeline for question answering over PDF documents. The system extracts text from a PDF, splits it into chunks, embeds the chunks and the question, retrieves the most similar chunks, and generates an answer with a transformer-based model. A final step scores the generated answer against a reference. Each step is documented below.



In [1]:
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import fitz  # PyMuPDF
import pdfplumber
import torch
import numpy as np




## Step 1: Extract Text from PDF

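We use PyMuPDF to extract the text of each page and join the pages into a single string. For a simple single-column document like the sample, the extracted text follows the reading order, which matters for downstream retrieval. A pdfplumber helper for table extraction is also defined, although the rest of the pipeline uses only the plain text.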


In [None]:
# PDF extraction
def extract_text_pymupdf(path):
    """Return the text of every page, joined with newlines."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def extract_tables_pdfplumber(path):
    """Return every table pdfplumber detects, page by page."""
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables

pdf_path = "../data/pdfchat_sample.pdf"
raw_text = extract_text_pymupdf(pdf_path)
print(raw_text[:1000])  # Preview first 1000 characters


The History of Paper 
 
Paper has been one of the most transformative inventions in human history. Originating in 
China around 105 AD, it replaced earlier writing surfaces like papyrus and parchment. Cai 
Lun, a Chinese court official, is credited with refining the papermaking process using 
mulberry bark, hemp, and rags. 
 
Over the centuries, papermaking spread across Asia, the Middle East, and Europe. By the 
13th century, paper mills were operating in Spain and Italy, revolutionizing communication 
and record-keeping. 
 
In the modern era, paper is made primarily from wood pulp, and its uses range from books 
and newspapers to packaging and hygiene products. Despite the rise of digital media, 
paper remains a vital part of global infrastructure. 
 
Fun Fact: 
The word “paper” comes from “papyrus,” a plant-based writing material used in ancient 
Egypt. 
 
Page 2 
 
Environmental Impact 
 
While paper is recyclable and biodegradable, its production can be resource-intensive. 
Sustai

## Step 2: Chunk the Extracted Text

To enable retrieval, we split the raw text into fixed-size chunks of 500 whitespace-separated words, with no overlap between consecutive chunks. Each chunk is the unit that gets embedded and retrieved. The sample document is shorter than 500 words, so it yields a single chunk.


In [3]:
#chunking
def chunk_text(text, chunk_size=500):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

chunks = chunk_text(raw_text)
print(f"Total chunks: {len(chunks)}")


Total chunks: 1
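
Because the chunks above do not overlap, a sentence that straddles a chunk boundary is split between two retrieval units. The following is a minimal sketch of an overlapping variant; the function name `chunk_text_with_overlap` and the `overlap` parameter are illustrative assumptions and are not used elsewhere in this notebook.

```python
def chunk_text_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Toy input: 12 numbered words, windows of 5 words that overlap by 2.
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_text_with_overlap(sample, chunk_size=5, overlap=2):
    print(chunk)
```

With `overlap=0` the loop produces the same chunks as `chunk_text`; a larger overlap trades more chunks (and more embedding work) for fewer split sentences.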


## Step 3: Embed Chunks Using BERT

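Each chunk is embedded into a dense 768-dimensional vector using BERT (bert-base-uncased): the final-layer token states are mean-pooled into one vector per chunk. Inputs longer than the model's 512-token limit are truncated, so the tail of a long chunk may not contribute to its embedding. These embeddings are used for similarity-based retrieval.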


In [12]:
# Embedding with BERT
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text, tokenizer, model):
    # Inputs longer than the model's maximum length are truncated
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token states into a single vector for the text
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

chunk_embeddings = [embed(c, bert_tokenizer, bert_model) for c in chunks]
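
The `embed` function averages over every token position. That is safe here because each chunk is encoded on its own, so no padding tokens are added. If several chunks were encoded in one batch, padded positions would enter the mean unless they are masked out. The sketch below illustrates mask-aware mean pooling on plain Python lists; `masked_mean_pool` and the toy values are assumptions for illustration, not part of the notebook.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: two real tokens followed by one padding token.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

masked = masked_mean_pool(token_vectors, attention_mask)
unmasked = masked_mean_pool(token_vectors, [1, 1, 1])
print("masked mean:  ", masked)    # averages only the two real tokens
print("unmasked mean:", unmasked)  # pulled toward the padding vector
```

In a batched PyTorch version the same idea would multiply the hidden states by the tokenizer's attention mask before summing and divide by the number of real tokens per sequence.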


## Step 4: Embed Question and Retrieve Relevant Chunks

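The user’s question is embedded using DistilBERT (distilbert-base-uncased) with the same mean pooling. Cosine similarity is computed between the question embedding and each chunk embedding, and the top-k highest-scoring chunks (k=3 by default) are selected as context for answer generation. Both encoders produce 768-dimensional vectors, so the similarity can be computed, but BERT and DistilBERT are not trained to share an embedding space; embedding chunks and questions with one and the same model is the more reliable choice. With only one chunk in the sample document, that chunk is always returned.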


In [13]:
# Question Embedding + Retrieval
retrieval_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
retrieval_model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed_question(question):
    inputs = retrieval_tokenizer(question, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = retrieval_model(**inputs)
    # Mean-pool the token states into a single question vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

def retrieve_top_chunks(question, chunk_embeddings, chunks, k=3):
    q_embed = embed_question(question)
    # One similarity score per chunk
    scores = cosine_similarity([q_embed], chunk_embeddings)[0]
    # Indices of the k highest scores, best first (fewer if there are fewer chunks)
    top_indices = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]

question = "Where was paper first invented?"
top_chunks = retrieve_top_chunks(question, chunk_embeddings, chunks)
print("\n\n".join(top_chunks))


The History of Paper Paper has been one of the most transformative inventions in human history. Originating in China around 105 AD, it replaced earlier writing surfaces like papyrus and parchment. Cai Lun, a Chinese court official, is credited with refining the papermaking process using mulberry bark, hemp, and rags. Over the centuries, papermaking spread across Asia, the Middle East, and Europe. By the 13th century, paper mills were operating in Spain and Italy, revolutionizing communication and record-keeping. In the modern era, paper is made primarily from wood pulp, and its uses range from books and newspapers to packaging and hygiene products. Despite the rise of digital media, paper remains a vital part of global infrastructure. Fun Fact: The word “paper” comes from “papyrus,” a plant-based writing material used in ancient Egypt. Page 2 Environmental Impact While paper is recyclable and biodegradable, its production can be resource-intensive. Sustainable forestry practices and re
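
To make the ranking step concrete, here is a small standard-library sketch of cosine similarity and top-k selection on toy two-dimensional vectors. The helper names `cosine` and `top_k` and the vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity` and `np.argsort`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]  # slicing returns fewer pairs when fewer vectors exist

# Toy "chunk embeddings" and a query pointing along the first axis.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.0]
for index, score in top_k(toy_query, toy_chunks, k=2):
    print(index, round(score, 3))
```

Cosine similarity depends only on direction, so the query `[2.0, 0.0]` matches `[1.0, 0.0]` perfectly despite the different length.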

## Step 5: Generate Answer Using FLAN-T5

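The selected chunks are concatenated and passed to FLAN-T5 (flan-t5-base), an instruction-tuned sequence-to-sequence model. The model generates an answer conditioned on the retrieved context and the user’s question. The prompt places the question after the context and the tokenizer truncates overlong inputs, so a very long context could push the question out of the prompt; this does not appear to be an issue for the short sample document.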


In [6]:
#Answer Generation with FLAN-T5
flan_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
flan_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def generate_answer(question, context_chunks):
    context = "\n".join(context_chunks)
    prompt = f"Answer the question based on this context:\n{context}\n\nQuestion: {question}"
    inputs = flan_tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = flan_model.generate(**inputs, max_length=200)
    return flan_tokenizer.decode(outputs[0], skip_special_tokens=True)

answer = generate_answer(question, top_chunks)
print(f"\n Answer: {answer}")



 Answer: China
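
Since the prompt ends with the question and truncation removes tokens from the end, one way to protect the question is to trim the context before building the prompt. The sketch below is a rough illustration under stated assumptions: `build_prompt` and `max_words` are hypothetical, and whitespace-separated words are only a crude proxy for model tokens (a real budget would be counted with the tokenizer).

```python
def build_prompt(question, context_chunks, max_words=350):
    """Trim the context to a word budget so the trailing question is kept."""
    instruction = "Answer the question based on this context:"
    question_part = f"Question: {question}"
    # Words left for the context after reserving room for the fixed parts.
    budget = max_words - len(instruction.split()) - len(question_part.split())
    context_words = "\n".join(context_chunks).split()
    context = " ".join(context_words[:max(budget, 0)])
    return f"{instruction}\n{context}\n\n{question_part}"

# A deliberately oversized context: the question still closes the prompt.
long_context = ["paper " * 1000]
prompt = build_prompt("Where was paper first invented?", long_context)
print(len(prompt.split()), "words in prompt")
print(prompt.splitlines()[-1])
```

The prompt template is the same as in `generate_answer`; only the context length changes.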



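## Step 6: Evaluate Generated Answer

To assess the quality of generated answers, we use ROUGE and BLEU. Both compare the generated output against a reference answer: ROUGE reports n-gram and longest-common-subsequence overlap, while BLEU combines n-gram precision with a brevity penalty for short outputs. Because both metrics reward overlap with the full reference sentence, the one-word answer "China" scores low even though it is correct.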


In [None]:
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

reference = "Paper was first invented in China."
prediction = answer  # From FLAN-T5

# ROUGE: predictions and references are parallel lists of strings
rouge_scores = rouge.compute(predictions=[prediction], references=[reference])

# BLEU: each prediction may have several references, hence the nested list
bleu_scores = bleu.compute(predictions=[prediction], references=[[reference]])

print("ROUGE:", rouge_scores)
print("BLEU:", bleu_scores)



ROUGE: {'rouge1': np.float64(0.2857142857142857), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.2857142857142857), 'rougeLsum': np.float64(0.2857142857142857)}
BLEU: {'bleu': 0.0, 'precisions': [1.0, 0.0, 0.0, 0.0], 'brevity_penalty': 0.0024787521766663585, 'length_ratio': 0.14285714285714285, 'translation_length': 1, 'reference_length': 7}
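
The reported numbers can be reproduced by hand, which helps explain why a correct answer scores so low. The sketch below recomputes the ROUGE-1 F-score and the BLEU brevity penalty for this example. The two regular-expression tokenizers are simplified stand-ins for the tokenizers inside the metric libraries, chosen because they give the same token counts as the output above (one prediction token, seven BLEU reference tokens).

```python
import math
import re

reference = "Paper was first invented in China."
prediction = "China"

# ROUGE-1 with a simplified tokenizer: lowercase alphanumeric runs only.
ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
overlap = sum(min(pred_tokens.count(t), ref_tokens.count(t)) for t in set(pred_tokens))
precision = overlap / len(pred_tokens)   # 1 of 1 predicted unigrams matches
recall = overlap / len(ref_tokens)       # 1 of 6 reference unigrams is covered
rouge1_f = 2 * precision * recall / (precision + recall)

# BLEU brevity penalty with a tokenizer that keeps punctuation as tokens.
ref_len = len(re.findall(r"\w+|[^\w\s]", reference))
pred_len = len(re.findall(r"\w+|[^\w\s]", prediction))
brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)

print(f"ROUGE-1: P={precision:.3f} R={recall:.3f} F1={rouge1_f:.4f}")
print(f"BLEU: lengths {pred_len}/{ref_len}, brevity penalty={brevity_penalty:.6f}")
```

The ROUGE-1 F-score works out to 2/7 ≈ 0.2857 and the brevity penalty to exp(-6) ≈ 0.0025, matching the output above. The overall BLEU score is exactly 0 because a one-token prediction contains no bigrams, so the higher-order precisions are 0 and the geometric mean collapses.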
