# PDF Q&A Chatbot Using RAG

This project implements a Retrieval-Augmented Generation (RAG) system for question answering over PDF documents.

**Key Features:**
- **PDF Processing**: Extract and clean text from PDF documents using ``LlamaParse``
- **Semantic Search**: Embed document chunks with ``Cohere`` and store them in a ``Pinecone`` vector database
- **Context-Aware Answers**: Retrieve the relevant chunks and generate answers grounded in them with an LLM


This notebook demonstrates the complete RAG pipeline from document ingestion to answer generation, allowing for natural language queries against your PDF knowledge base.


### Project Architecture

__Here is a high-level overview of the application:__



<div style="text-align:left">
    <img src="https://github.com/saidibnerradi603/pdf-rag-qa/blob/master/assets/images/Project_Pipeline.png?raw=true" width="100%"/>
</div>



**Key Components:**

1. **Input Processing**
   - Users submit natural language questions (e.g., "What is the difference between supervised and unsupervised learning?")
   - The system converts these questions into vector embeddings using the same embedding model used for document processing

2. **Indexing Pipeline**
   - PDF documents are loaded and parsed into plain text or markdown format
   - Text is divided into semantically meaningful chunks for more precise retrieval
   - Each chunk is converted into vector embeddings that capture semantic meaning
   - These embeddings are stored in a vector database for efficient similarity search

3. **Retrieval System**
   - The system performs semantic search using the query embedding
   - Top-K most relevant chunks are identified and ranked by similarity score
   - Retrieved chunks contain contextually relevant information (e.g., definitions of supervised vs. unsupervised learning)

4. **Context Augmentation**
   - The original user question is combined with retrieved text chunks
   - A structured prompt is created that instructs the LLM how to use the retrieved information
   - The system organizes the context to facilitate accurate answer generation

5. **Response Generation**
   - The LLM processes the augmented context and generates a comprehensive answer
   - The response is grounded in the retrieved document content rather than potentially outdated training data
   - The final answer is presented to the user in a clear, readable format
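
The five components above can be sketched end to end with a toy example. This is only an illustration under simplifying assumptions: word overlap (Jaccard similarity) stands in for the embedding model, a Python list stands in for the vector database, and the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical. The final prompt would be sent to an LLM in the real pipeline.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Rank chunks by Jaccard word overlap with the question (toy stand-in for embeddings)."""
    q = tokenize(question)

    def score(chunk):
        c = tokenize(chunk)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:k]

def build_prompt(question, retrieved):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n---\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Indexing: in this toy version the "index" is just a list of chunks.
chunks = [
    "Supervised learning trains a model on labeled examples.",
    "Unsupervised learning finds structure in unlabeled data.",
    "Pinecone is a managed vector database.",
]

question = "How does supervised learning work?"
top = retrieve(question, chunks, k=2)      # Retrieval
prompt = build_prompt(question, top)       # Augmentation
print(prompt)                              # Generation would pass this to an LLM
```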


Now we'll build our PDF Q&A chatbot step by step, implementing each component of the RAG pipeline:


In [1]:
from dotenv import load_dotenv
import os
import warnings
warnings.filterwarnings('ignore') 

load_dotenv()

True
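
Before running the rest of the notebook, it can help to confirm that the `.env` file actually provided the API keys. The sketch below is a hedged example: the variable names are assumptions based on the defaults these client libraries conventionally read, and `missing_keys` is a hypothetical helper, so adjust both to match your own setup.

```python
import os

# Assumed variable names; change them if your .env file uses different ones.
REQUIRED_KEYS = [
    "LLAMA_CLOUD_API_KEY",
    "COHERE_API_KEY",
    "PINECONE_API_KEY",
    "GOOGLE_API_KEY",
]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in the given mapping."""
    return [name for name in required if not env.get(name)]

missing = missing_keys(os.environ)
if missing:
    print("Missing API keys:", ", ".join(missing))
```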

# 1. Indexing

## 1.1 Document loading and parsing




**Overview of ``LlamaParse``**


`LlamaParse` is a document parsing service developed by **LlamaIndex**, specifically designed for large language models (LLMs).

Key Features:

-   Support for various document formats, such as PDF, Word, PowerPoint, and Excel
-   Customized output formats through natural language instructions
-   Advanced table and image extraction capabilities
-   Multilingual support
-   Multiple output format support

`LlamaParse` is available as a standalone API and is also integrated into the LlamaCloud platform. The service aims to improve LLM-based applications, such as RAG (Retrieval-Augmented Generation), by parsing and refining documents.

Users can process up to 10,000 pages per month for free, with additional capacity available through paid plans. `LlamaParse` is currently offered in public beta and is continuously expanding its features.

In [3]:
from llama_cloud_services import LlamaParse

parser = LlamaParse( 
    num_workers=8,                
    split_by_page=False,
    verbose=True,
    result_type="text",
    disable_ocr=True,
    disable_image_extraction=True,
)


# Parse asynchronously; top-level `await` works inside a notebook cell
result = await parser.aparse("../assets/files/raw_pdfs/1706.03762v7.pdf")

Started parsing the file under job_id aa4bb716-582b-4756-9fa5-e3c4649b8eb2
.

The result object is a fully typed ``JobResult`` object. You can interact with it to parse and transform various parts of the result:



In [4]:
data=result.get_text_documents()

In [5]:
print("File Name:", data[0].metadata["file_name"])
print("Text Preview:", data[0].text[:200], "...")
print("Document ID:", data[0].id_)


File Name: ../assets/files/raw_pdfs/1706.03762v7.pdf
Text Preview:     arXiv:1706.03762v7 [cs.CL] 2 Aug 2023

  Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic o ...
Document ID: abe37a4b-0d50-49f5-b056-babd7e1c4250


In [6]:
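# Write the parsed PDF content to a .txt file.
# Mode "x" raises FileExistsError if the file already exists (e.g. when the cell is re-run).
with open(f"../assets/files/parsed_pdfs/{data[0].id_}.txt", "x", encoding="utf-8") as f:
    f.write(data[0].text)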

## 1.2 Text Cleaning and Normalization

In [7]:
import re

def clean_text_file(input_path, output_path):
    with open(input_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    cleaned_lines = []
    for line in lines:
        # Strip leading and trailing whitespace
        stripped = line.strip()
        if stripped:
            cleaned_lines.append(stripped)
    
    # Join lines with single newlines
    cleaned_text = '\n'.join(cleaned_lines)
    
    # Safeguard: collapse any remaining runs of newlines (empty lines were already dropped above)
    cleaned_text = re.sub(r'\n{2,}', '\n', cleaned_text)
    
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(cleaned_text)
    
    print(f"Cleaned text saved to {output_path}")


In [9]:
input_file = f"../assets/files/parsed_pdfs/{data[0].id_}.txt"
output_file = f"../assets/files/cleaned_text/{data[0].id_}.txt"
    
os.makedirs(os.path.dirname(output_file), exist_ok=True)
    
clean_text_file(input_file, output_file)

Cleaned text saved to ../assets/files/cleaned_text/abe37a4b-0d50-49f5-b056-babd7e1c4250.txt
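
To see what the cleaning step does without touching the file system, here is a small in-memory sketch. `clean_text` is a hypothetical helper that approximately mirrors `clean_text_file` (it uses `splitlines()` instead of `readlines()`), and the sample string is made up for illustration.

```python
def clean_text(text):
    """Strip each line and drop the empty ones, like clean_text_file but in memory."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "   arXiv:1706.03762v7  \n\n\n  Attention Is All You Need\n   \nAbstract  \n"
cleaned = clean_text(raw)
print(cleaned)
```

Leading and trailing whitespace disappears from every line, and blank lines are removed, so the text becomes one compact block with single newlines.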


## 1.3 Chunking

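Split the processed document into smaller, manageable ``chunks``, so that retrieval can return the specific passages that answer a question rather than the whole document.
We use LangChain's `RecursiveCharacterTextSplitter`, which tries to split on natural boundaries such as paragraphs and lines before falling back to smaller units, and we overlap consecutive chunks so that context is not lost at chunk boundaries.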
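
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally prefers to cut at separators; `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

pieces = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 200-character overlap used in this notebook, just at a smaller scale.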


In [10]:
with open(output_file, "r", encoding="utf-8") as f:
    file_content = f.read()

In [11]:
from langchain_core.documents import Document

doc = Document(
    page_content=file_content,
    metadata={"source": data[0].id_}
)

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,  
    chunk_overlap=200,
)

split_docs = splitter.split_documents([doc])


print("First split chunk:\n", split_docs[0])

print("Total number of chunks:\n", len(split_docs))


First split chunk:
 page_content='arXiv:1706.03762v7 [cs.CL] 2 Aug 2023
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com
Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗
Google Research    University of Toronto          Google Brain
llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple

## 1.4 Data Embedding & Vector Storage

In this step, we convert the processed document chunks into numerical vectors that can be compared by similarity. Each chunk of text is transformed into an embedding using the **Cohere Embed v3** model (`embed-english-v3.0`). This model generates 1,024-dimensional vectors that capture the semantic meaning of the text. These embeddings are then **stored in a Pinecone vector database**, enabling efficient similarity searches during the retrieval phase of the RAG workflow.
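
The Pinecone index in this notebook is created with `metric="cosine"`. As a rough illustration of what that metric measures, the sketch below computes cosine similarity for small hand-made vectors. The function is a plain-Python assumption for teaching purposes, not how Pinecone computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # unrelated directions
print(same_direction, orthogonal)
```

Because the score depends only on direction, two chunks with similar meaning can score highly even if their vectors differ in length.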

In [17]:
from pinecone import Pinecone

pc = Pinecone()

In [18]:
from pinecone import ServerlessSpec


index_name = "langchain-pdf-index"  

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)

In [19]:
from langchain_cohere import CohereEmbeddings
# Import only the vector store; importing `Pinecone` here would shadow the client class from the `pinecone` package
from langchain_pinecone import PineconeVectorStore

embeddings = CohereEmbeddings(model="embed-english-v3.0")
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

In [20]:
vector_store.add_documents(documents=split_docs)

['805e1583-614f-49db-a905-535b567f5903',
 '27ab21ad-27f9-4fd2-8241-8618fc63d5db',
 '823291ca-eabb-43b6-9ae5-a21d8cb24291',
 '9639e4e5-bc7d-4f20-a593-279457a0de6b',
 '8a24c4f8-c492-4872-b7d2-4f67b2c866f1',
 '185e2256-8c00-49c3-ac86-b3fb63b4fc05',
 '11b4df35-370c-4818-b0b5-e5498642c2a6',
 '1e35606e-e1c8-4633-8bf3-89a4188c3df5',
 'd79783a9-4833-4602-bb31-58438ca882ab',
 'fd2c891d-1f4f-4361-ab8e-df5d0ccf8a5a',
 '6eab9bc8-05a2-4652-b95c-a14c9b73d030',
 '6a7e4fce-7b20-43d4-962b-f53fee881d52',
 'be6b6204-4cf2-46aa-9622-c0b5ac96054d',
 'abf3e444-8e1b-453e-8ec1-3f56582e5fcb',
 '839667d6-6e3a-49db-a9b0-13dabe819d88',
 'aecf9a20-d42e-4d3f-bff0-986e87bdefa3',
 'eb2796ca-72d1-4405-8f78-fa481cc9f2ae',
 '0eaa7ee0-9ce4-44c4-9914-da35cc433fe7',
 '9e0ac292-90c5-4b9e-91d2-a3a38563be45',
 'e131c4df-2453-41d0-8b23-d614c47829eb',
 'd1de589f-7cfc-4e8b-afd3-5a37642ca087',
 '2a7c2265-2981-41e0-b5bc-a9c233ed28b2',
 '417dee18-c216-4d84-9ac9-60905ba53b45',
 '4c05c92e-f8ad-4f2f-8ef7-9d5785b85d6c']

# 2. Retrieval


After generating and storing embeddings in a vector database, the retrieval step allows the system to find the most relevant document chunks for a user query. When a query is submitted, it is converted into an embedding using the same **Cohere Embed v3** model. The vector database (e.g., **Pinecone**) is then searched to find chunks with the highest similarity to the query embedding. These retrieved chunks are passed to the LLM to generate accurate, context-aware answers.
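
Conceptually, a filtered top-K search does two things: it keeps only the records whose metadata matches the filter, then returns the K highest-scoring ones. The sketch below shows this with an exhaustive scan over an in-memory list. It is an assumption-laden illustration: `search`, `dot`, and the record layout are hypothetical, the vectors are assumed to be unit length so the dot product equals cosine similarity, and a real vector database typically uses an index rather than scanning every record.

```python
import heapq

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def search(query_vec, records, k=5, metadata_filter=None):
    """Return the k records most similar to query_vec among those matching the filter."""
    metadata_filter = metadata_filter or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return heapq.nlargest(k, candidates, key=lambda r: dot(query_vec, r["vector"]))

records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"source": "doc-1"}},
    {"id": "b", "vector": [0.6, 0.8], "metadata": {"source": "doc-1"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"source": "doc-2"}},
]

hits = search([0.0, 1.0], records, k=2, metadata_filter={"source": "doc-1"})
hit_ids = [r["id"] for r in hits]
print(hit_ids)  # ['b', 'a']
```

Record `c` is the closest match to the query, but it is excluded because its `source` does not pass the filter.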


In [21]:
user_query = "What is the main idea behind the Transformer model?"


results = vector_store.similarity_search(
    user_query,
    k=5,
    filter={"source": data[0].id_},
)
for res in results:
    print(f"*{res.page_content}")
    print("===============================================================\n\n")

*sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduc-
tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to draw global dependencies between input and output.
The

In [22]:
# We can also transform the vector store into a retriever for easier usage in our chain.
retriever = vector_store.as_retriever(
    search_type="similarity", # or mmr
    search_kwargs={"k": 5},
)

res=retriever.invoke(user_query, filter={"source":data[0].id_})


In [23]:
for i, doc in enumerate(res, 1):
    source = doc.metadata.get("source")
    content_preview = doc.page_content[:500]  # first 500 chars for preview
    
    print(f"---\nDocument {i}")
    print(f"Source: {source}")
    print("Content Preview:")
    print(content_preview)  
    print("---\n")

---
Document 1
Source: abe37a4b-0d50-49f5-b056-babd7e1c4250
Content Preview:
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factorization tricks [21] and conditional
computation [32], while also improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains.
Attention mechanisms have become an integral 
---

---
Document 2
Source: abe37a4b-0d50-49f5-b056-babd7e1c4250
Content Preview:
it more difficult to learn dependencies between distant positions [12]. In the Transformer this is
reduced to a constant number of operations, albeit at the cost of reduced effective resolution due
to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as
described in section 3.2.
Self-attention, some

# 3. Augmentation


In this step, the retrieved document chunks are combined with the user query to create a prompt for the language model. The context includes relevant information from the documents, which helps the model generate accurate and context-aware answers. Proper formatting, separators, and instructions can be added to ensure the LLM understands how to use the retrieved information effectively.
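
One way to apply the formatting and separators mentioned above is to label each chunk and stop adding chunks once a size budget is reached. The sketch below is a hedged example: `build_context` and its `max_chars` budget are hypothetical and are not part of this notebook's pipeline.

```python
def build_context(chunks, max_chars=6000, separator="\n\n---\n\n"):
    """Join labeled chunks in rank order, stopping before the character budget is exceeded."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, 1):
        piece = f"[Chunk {i}]\n{chunk.strip()}"
        extra = len(piece) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break  # adding this chunk would exceed the budget
        parts.append(piece)
        used += extra
    return separator.join(parts)

context = build_context(["  aaa  ", "bbb"], max_chars=1000)
print(context)
```

Labeling chunks makes it easier to inspect which passages the model saw, and the budget keeps the prompt from growing without bound when many chunks are retrieved.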


In [24]:
import textwrap
from IPython.display import Markdown

# Convert plain text to Markdown format
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))


We use Google's ``gemini-2.5-flash`` model to generate answers; the `to_markdown` helper above renders each response as a Markdown block quote.


### 3.1 Simple Example: Using Gemini with LangChain


In [25]:
from langchain_google_genai import ChatGoogleGenerativeAI

model = ChatGoogleGenerativeAI(model="gemini-2.5-flash")


# to_markdown(model.invoke("What is Retrieval-Augmented Generation ?").content)



### 3.2 Format the retrieved documents

In [26]:
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content.strip() for doc in docs)

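# Note: this formats every chunk in split_docs (the whole document), which matches the output below.
# To build the context from only the retrieved chunks, pass `res` instead.
context_text = format_docs(split_docs)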
print(context_text)

arXiv:1706.03762v7 [cs.CL] 2 Aug 2023
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.com
Llion Jones∗     Aidan N. Gomez∗ †         Łukasz Kaiser∗
Google Research    University of Toronto          Google Brain
llion@google.com    aidan@cs.toronto.edu    lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transfo

### 3.3 Prompt template

In [32]:
from langchain_core.prompts import PromptTemplate

# Q&A assistant prompt template for PDF documents

template = """
You are an expert Q&A assistant specialized in answering questions about PDF documents.

Your responses must be based **entirely** on the context provided below, as well as your general knowledge.

### Guidelines:
1. Provide a **clear**, **detailed**, and **well-structured** answer formatted in Markdown.
2. If the context lacks sufficient details, reply with:
   > The provided context does not contain enough information to answer this question.
3. Keep your tone professional and concise.

---

### Context:
{context}

---

### User Question:
{question}



"""

prompt = PromptTemplate.from_template(template)


print(prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} template='\nYou are an expert Q&A assistant specialized in answering questions about PDF documents.\n\nYour responses must be based **entirely** on the context provided below, as well as your general knowledge.\n\n### Guidelines:\n1. Provide a **clear**, **detailed**, and **well-structured** answer formatted in Markdown.\n2. If the context lacks sufficient details, reply with:\n   > The provided context does not contain enough information to answer this question.\n3. Keep your tone professional and concise.\n\n---\n\n### Context:\n{context}\n\n---\n\n### User Question:\n{question}\n\n\n\n'


### 3.4 Invoke the prompt with the context and question

In [28]:
final_prompt = prompt.invoke({"context": context_text, "question": user_query})

print("--- Final Augmented Prompt ---\n")
print(final_prompt.to_string())

--- Final Augmented Prompt ---


You are an expert Q&A assistant specialized in answering questions about PDF documents.

Your responses must be based **entirely** on the context provided below, as well as your general knowledge.

### Guidelines:
1. Provide a **clear**, **detailed**, and **well-structured** answer formatted in Markdown.
2. If the context lacks sufficient details, reply with:
   > The provided context does not contain enough information to answer this question.
3. Keep your tone professional and concise.

---

### Context:
arXiv:1706.03762v7 [cs.CL] 2 Aug 2023
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗  Noam Shazeer∗        Niki Parmar∗  Jakob Uszkoreit∗
Google Brain         Google Brain     Google Research    Google Research
avaswani@google.com    noam@google.com    nikip@google.com    usz@google.co

# 4. Generation


In this step, the language model is used to generate an answer to the user query based on the constructed prompt and retrieved context. The model takes the formatted context and question, applies its reasoning capabilities, and produces a coherent, detailed, and context-aware response. The output is typically in Markdown format, making it ready for display or further processing.
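
The prompt template instructs the model to reply with a fixed sentence when the context is insufficient. A downstream application may want to detect that case instead of displaying it as a normal answer. The sketch below is one possible check; `is_fallback` is a hypothetical helper, and it assumes the model reproduces the sentence closely, which is not guaranteed.

```python
FALLBACK = "The provided context does not contain enough information to answer this question."

def is_fallback(answer):
    """Return True if the answer contains the template's fallback sentence."""
    # Drop Markdown quote markers and normalize whitespace and case before comparing.
    normalized = " ".join(answer.replace(">", " ").split()).lower()
    return FALLBACK.lower() in normalized

print(is_fallback("> The provided context does not contain enough information to answer this question."))
print(is_fallback("The Transformer relies entirely on attention mechanisms."))
```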


In [29]:
response_message = model.invoke(final_prompt)

print("--- Final Generated Answer ---\n")
to_markdown(response_message.content)

--- Final Generated Answer ---



> The main idea behind the Transformer model is to propose a new network architecture for sequence transduction models that relies **entirely on attention mechanisms**, specifically multi-head self-attention, thereby **dispensing with recurrence and convolutions entirely**.
> 
> Key aspects of this idea include:
> *   **Sole Reliance on Attention**: Unlike previous dominant models that used complex recurrent (RNNs) or convolutional neural networks, often augmented with attention, the Transformer uses attention as its sole building block to draw global dependencies between input and output sequences.
> *   **Encoder-Decoder Structure**: It maintains the overall encoder-decoder architecture common in competitive neural sequence transduction models, but implements both the encoder and decoder using stacked self-attention and point-wise, fully connected layers.
> *   **Enhanced Parallelization**: By eschewing recurrent layers, the Transformer allows for significantly more parallelization during training, leading to faster training times.
> *   **Improved Long-Range Dependency Learning**: It connects all positions in a sequence with a constant number of sequential operations, making it easier to learn long-range dependencies compared to recurrent layers (O(n) sequential operations) or convolutional layers (O(logk(n)) or O(n/k) path length).

## RAG Chain with LCEL

We combine all previous components into a single **RAG chain** using LangChain Expression Language (LCEL). The chain takes a user question, retrieves the relevant chunks, formats them into the context, fills the prompt template, and passes the result to the LLM for generation. This modular setup lets the entire retrieval-augmented generation process run as one pipeline.
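
To build intuition for how the `|` operator composes steps, here is a minimal plain-Python sketch. It is not LangChain's implementation: `Step` and `parallel` are hypothetical stand-ins, and the retriever is a stub that returns fixed strings.

```python
class Step:
    """A tiny stand-in for a runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new Step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

# Stub components standing in for the retriever, formatter, and prompt template.
retrieve = Step(lambda question: ["chunk about attention", "chunk about encoders"])
format_context = Step(lambda docs: "\n\n---\n\n".join(docs))
passthrough = Step(lambda value: value)
fill_prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}")

chain = parallel(context=retrieve | format_context, question=passthrough) | fill_prompt
result = chain.invoke("What is self-attention?")
print(result)
```

The same question flows into both branches: one retrieves and formats the context, the other passes the question through unchanged, and the combined dictionary fills the prompt.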


In [30]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda

rag_chain = (
    {
        # The context is generated by retrieving documents and formatting them.
        "context": retriever | RunnableLambda(format_docs), 
        # The question is passed through directly from the input.
        "question": RunnablePassthrough()
    }
    # The dictionary is piped into the prompt template.
    | prompt
    # The prompt is piped into the LLM.
    | model
)

print("RAG chain created successfully!")

RAG chain created successfully!


In [31]:
question = "Can you explain how the Transformer model uses self-attention to capture dependencies in sequences?"
final_response = rag_chain.invoke(question)

print(f"Question: {question}\n")
print("Answer:\n")
to_markdown(final_response.content)

Question: Can you explain how the Transformer model uses self-attention to capture dependencies in sequences?

Answer:



> The Transformer model leverages self-attention to effectively capture dependencies in sequences by entirely eschewing recurrence and convolutions, relying solely on attention mechanisms.
> 
> Here's how it uses self-attention for this purpose:
> 
> 1.  **Core Mechanism:** Self-attention, also known as intra-attention, is an attention mechanism designed to relate different positions of a single sequence to compute a representation of that sequence. Unlike traditional recurrent networks, which process symbols sequentially, self-attention allows the model to draw global dependencies between input and output.
> 
> 2.  **Dependency Capture Regardless of Distance:** A key advantage of self-attention is its ability to model dependencies between positions "without regard to their distance in the input or output sequences." In other models like ConvS2S or ByteNet, the number of operations required to relate distant positions grows with their distance, making it harder to learn long-range dependencies. In the Transformer, this is reduced to a constant number of operations.
> 
> 3.  **Multi-Head Attention:** To counteract the reduced effective resolution that can occur from averaging attention-weighted positions, the Transformer employs Multi-Head Attention. This mechanism allows the model to jointly attend to information from different representation subspaces at different positions, enhancing its ability to capture complex dependencies.
> 
> 4.  **Architectural Integration:**
>     *   The Transformer follows an encoder-decoder architecture, where both components extensively use self-attention.
>     *   **Encoder:** Each layer of the encoder stack contains a multi-head self-attention mechanism.
>     *   **Decoder:** Each layer of the decoder stack also includes a multi-head self-attention mechanism (modified to prevent attending to subsequent positions) and an additional multi-head attention sub-layer that attends over the output of the encoder stack.
> 
> 5.  **Parallelization and Path Length:** By relying entirely on self-attention, the Transformer allows for significantly more parallelization during training, as it removes the fundamental constraint of sequential computation present in RNNs. Furthermore, self-attention reduces the path length between any combination of positions in the input and output, which is critical for learning long-range dependencies efficiently.
> 
> In essence, the Transformer uses stacked multi-head self-attention layers to directly relate all positions within a sequence, allowing it to weigh the importance of different parts of the input when processing each part, thereby capturing dependencies across arbitrary distances in a highly parallelizable manner.

End!