# **PDF RAG**

Authored by [Kalyan KS](https://www.linkedin.com/in/kalyanksnlp/). For LLM, RAG and agent updates, you can follow me on [Twitter](https://x.com/kalyan_kpl).

- Step-1 : Extract the PDF text
- Step-2 : Chunk the extracted PDF text
- Step-3 : Create a vector store with the PDF chunks
- Step-4 : Create a retriever which returns the relevant chunks
- Step-5 : Build context from the relevant chunk texts
- Step-6 : Build the RAG chain from the RAG prompt, the LLM and an output parser.
- Step-7 : Run the RAG chain to get the answer.

## **Install and import libraries**

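- `PyPDFLoader` uses the `pypdf` Python library to extract text from a PDF document.
- The embedding model and the LLM run locally through `sentence-transformers`, `transformers` and `torch`. The install cell below does not cover them, so they must already be available in the environment.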

In [1]:
!pip install -qU langchain langchain-community langchain-text-splitters
!pip install -qU langchain-openai langchain-chroma pypdf


In [20]:
import torch
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

In [3]:
from typing import List
from langchain.schema import Document

def pdf_extract(pdf_path: str) -> List[Document]:
    """
    Extracts text from a PDF file using PyPDFLoader.

    Parameters:
    pdf_path (str): The file path of the PDF to be extracted.

    Returns:
    List[Document]: A list of Document objects containing the extracted text from the PDF.
    """

    print("PDF file text is extracted...")
    loader = PyPDFLoader(pdf_path)
    pdf_text = loader.load()

    return pdf_text

In [4]:
# Download the PDF file
import requests

pdf_url = 'https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf'
response = requests.get(pdf_url)

pdf_path = 'attention_is_all_you_need.pdf'
with open(pdf_path, 'wb') as file:
    file.write(response.content)
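
The download cell above writes whatever the server returns, including an error page. A minimal sketch of a sanity check, assuming only that PDF files begin with the bytes `%PDF-`. The helper name `save_pdf_bytes` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def save_pdf_bytes(content: bytes, path: str) -> int:
    """Write `content` to `path` only if it looks like a PDF.

    Hypothetical helper: it checks the `%PDF-` signature that PDF files
    start with, so an HTML error page is not saved under a .pdf name.
    Returns the number of bytes written.
    """
    if not content.startswith(b"%PDF-"):
        raise ValueError("Downloaded content does not look like a PDF file")
    return Path(path).write_bytes(content)
```

In the cell above this could be called as `save_pdf_bytes(response.content, pdf_path)`, ideally after `response.raise_for_status()` so that HTTP errors are reported first.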

In [5]:
pdf_text = pdf_extract(pdf_path)

PDF file text is extracted...


## **Chunk PDF text**

In [7]:
def pdf_chunk(pdf_text: List[Document]) -> List[Document]:
    """
    Splits extracted PDF text into smaller chunks using RecursiveCharacterTextSplitter.

    Parameters:
    pdf_text (List[Document]): A list of Document objects containing extracted text from a PDF.

    Returns:
    List[Document]: A list of chunked Document objects.
    """

    print("PDF file text is chunked....")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = text_splitter.split_documents(pdf_text)

    return chunks

In [None]:
chunks = pdf_chunk(pdf_text)

In [10]:
print(f"Number of chunks = {len(chunks)}")

Number of chunks = 40
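
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size character chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to split on separators such as paragraph and line breaks. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

With `chunk_size=1000` and `chunk_overlap=100` as in `pdf_chunk`, the window in this simplified scheme would advance 900 characters at a time.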


In [12]:
print(chunks[0])

page_content='Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser ∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly' metadata={'prod

## **Create Vector Store**

In [25]:
def create_vector_store(chunks: List[Document], db_path: str, model_name: str) -> Chroma:
    """
    Creates a Chroma vector store from chunked documents.

    Parameters:
    chunks (List[Document]): A list of chunked Document objects.
    db_path (str): The directory path to persist the vector store.
    model_name (str): The name of the sentence-transformers embedding model.

    Returns:
    Chroma: A Chroma vector store containing the embedded documents.
    """
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_kwargs = {
        "device": device,
        "trust_remote_code": True
    }
    embedding_model = SentenceTransformerEmbeddings(
                        model_name=model_name, 
                        model_kwargs=model_kwargs
                    )

    print("Chroma vector store is created...\n")
    db = Chroma.from_documents(documents=chunks, embedding=embedding_model, persist_directory=db_path)

    return db

In [26]:
db = create_vector_store(
        chunks=chunks, 
        db_path="./chroma_alibaba_gte.db", 
        model_name="Alibaba-NLP/gte-multilingual-base"
    )

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Chroma vector store is created...



## **Retrieve relevant chunks**

In [27]:
def retrieve_context(db: Chroma, query: str) -> List[Document]:
    """
    Retrieves relevant document chunks from the Chroma vector store based on a query.

    Parameters:
    db (Chroma): The Chroma vector store containing embedded documents.
    query (str): The query string to search for relevant document chunks.

    Returns:
    List[Document]: A list of retrieved relevant document chunks.
    """

    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print("Relevant chunks are retrieved...\n")
    relevant_chunks = retriever.invoke(query)

    return relevant_chunks

In [28]:
query = "What is the attention mechanism?"

relevant_chunks = retrieve_context(db, query)

Relevant chunks are retrieved...



In [29]:
print(f"Number of relevant chunks = {len(relevant_chunks)}")

Number of relevant chunks = 2
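
Conceptually, a `similarity` search with `k=2` embeds the query, scores it against every stored chunk vector, and returns the two best matches. A minimal sketch under stated assumptions: the three-dimensional vectors below are made up for illustration (a real embedding model produces much longer vectors), cosine similarity is used as the score (the metric Chroma actually applies depends on how the collection is configured), and `top_k` is a hypothetical helper, not a Chroma API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    store: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every (text, vector) pair against the query and keep the best k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up chunk texts and vectors, for illustration only
toy_store = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about training data", [0.0, 0.2, 0.9]),
    ("chunk about self-attention", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]

for text, score in top_k(toy_query, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

A vector store does the same ranking over the persisted embeddings, typically with an index so that not every vector has to be compared.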


In [30]:
for i, doc in enumerate(relevant_chunks):
    print(f"Reference {i}\n'{doc}'\n")

Reference 0
'page_content='Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser ∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly' me

## **Build context**

In [31]:
def build_context(relevant_chunks: List[Document]) -> str:
    """
    Builds a context string from retrieved relevant document chunks.

    Parameters:
    relevant_chunks (List[Document]): A list of retrieved relevant document chunks.

    Returns:
    str: A concatenated string containing the content of the relevant chunks.
    """

    print("Context is built from relevant chunks")
    context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

    return context

In [32]:
context = build_context(relevant_chunks)

Context is built from relevant chunks


In [33]:
print(context)

Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser ∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly

the number of operations requ

## **Combine all the steps into one function**

In [39]:
import os
from typing import Dict

def get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    """
    Creates or loads a vector store for a given PDF file and extracts relevant chunks based on a query.

    Args:
        inputs (Dict[str, str]): A dictionary containing the following keys:
            - 'pdf_path' (str): Path to the PDF file.
            - 'query' (str): The user query.
            - 'db_path' (str): Path to the vector database.
            - 'model_name' (str): Name of the embedding model.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'context' (str): Extracted relevant context.
            - 'query' (str): The user query.
    """
    pdf_path, query, db_path, model_name = inputs['pdf_path'], inputs['query'], inputs['db_path'], inputs['model_name']

    # Create new vector store if it does not exist
    if not os.path.exists(db_path):
        print("Creating a new vector store...\n")
        pdf_text = pdf_extract(pdf_path)
        chunks = pdf_chunk(pdf_text)
        db = create_vector_store(chunks, db_path, model_name)

    # Load the existing vector store
    else:
        print("Loading the existing vector store\n")
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model_kwargs = {
            "device": device,
            "trust_remote_code": True
        }
        embedding_model = SentenceTransformerEmbeddings(
                            model_name=model_name, 
                            model_kwargs=model_kwargs
                        )
        db = Chroma(persist_directory=db_path, embedding_function=embedding_model)

    relevant_chunks = retrieve_context(db, query)
    context = build_context(relevant_chunks)

    return {'context': context, 'query': query}

## **Build RAG chain**

In [40]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto",)
model = model.to(device)

# Build pipeline
hf_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device,  # reuse the device chosen above (CUDA if available, otherwise CPU)
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)


Device set to use cuda:0


In [49]:
from langchain.schema import BaseOutputParser

class AssistantOnlyOutputParser(BaseOutputParser):
    def parse(self, text: str) -> str:
        # Extract everything after "Assistant:"
        if "Assistant:" in text:
            return text.split("Assistant:")[-1].strip()
        return text.strip()  # fallback
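
The parser relies on the generated text containing an `Assistant:` marker, with the answer following the last occurrence. The same logic as a plain function, run on two made-up strings (the sample texts are illustrative, not actual model output, and `extract_assistant_text` is a hypothetical name):

```python
def extract_assistant_text(text: str) -> str:
    """Return the text after the last "Assistant:" marker, or all of it."""
    if "Assistant:" in text:
        return text.split("Assistant:")[-1].strip()
    return text.strip()  # no marker found: keep the whole text


with_marker = "Human: What is self-attention?\nAssistant: It relates positions in a sequence. "
without_marker = "  It relates positions in a sequence.\n"

print(extract_assistant_text(with_marker))
print(extract_assistant_text(without_marker))
```

Because the split keeps only what follows the last marker, an answer that itself contains the string `Assistant:` would be cut at that point.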


In [50]:
template = """ You are an AI model trained for question answering. You should answer the
  given question based on the given context only.
  Question : {query}
  \n
  Context : {context}
  \n
  If the answer is not present in the given context, respond as: The answer to this question is not available
  in the provided content.
  """

rag_prompt = ChatPromptTemplate.from_template(template)

llm = HuggingFacePipeline(pipeline=hf_pipe)

# Use the custom parser instead of StrOutputParser so that only the text after "Assistant:" is kept
str_parser = AssistantOnlyOutputParser()

rag_chain = (
    RunnableLambda(get_context)
    | rag_prompt
    | llm
    | str_parser
)
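
The `|` operator chains the steps so that the output of each becomes the input of the next: the input dictionary goes into `get_context`, the resulting `context` and `query` fill the prompt, the prompt goes to the LLM, and the parser cleans the generated text. A toy sketch of this composition idea in plain Python. The `Step` class and the stand-in functions are hypothetical and are not how LangChain implements its runnables:

```python
from typing import Any, Callable


class Step:
    """Wraps a function so that steps can be chained with `|`."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # (a | b) is a new step that runs a, then feeds its result to b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the real components (all hypothetical)
fake_get_context = Step(
    lambda inputs: {"context": "Self-attention relates positions.", "query": inputs["query"]}
)
fake_prompt = Step(lambda d: f"Question : {d['query']}\nContext : {d['context']}")
fake_llm = Step(lambda prompt: prompt + "\nAssistant: It relates positions.")
fake_parser = Step(lambda text: text.split("Assistant:")[-1].strip())

toy_chain = fake_get_context | fake_prompt | fake_llm | fake_parser
print(toy_chain.invoke({"query": "What is self-attention?"}))
```

The toy chain is invoked with a single dictionary, just as `rag_chain.invoke(...)` receives one dictionary holding the PDF path, query, database path and model name.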

## **Run RAG chain**

In [51]:
# Set Embedding Model
model_name="Alibaba-NLP/gte-multilingual-base"

# Set the chroma DB path
db_path="./chroma_alibaba_gte.db"

# RAG query
query = "What is self-attention?"



In [53]:
answer = rag_chain.invoke(
        {
            'pdf_path':pdf_path, 
            'query':query, 
            'db_path':db_path,
            'model_name':model_name
        }
    )

Loading the existing vector store



Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Relevant chunks are retrieved...

Context is built from relevant chunks


In [54]:
print(f"Query:{query}\n")
print(f"Generated answer:{answer}")

Query:What is self-attention?

Generated answer:Self-attention is an attention mechanism that relates different positions within a single sequence to compute a representation of the sequence. It has been successfully used in various tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. Unlike traditional methods like Convolutional Neural Networks (ConvNets) and Recurrent Neural Networks (RNNs), which rely on position-specific weights, self-attention computes weighted sums across all positions in the sequence, providing a more flexible and powerful way to capture relationships between elements in the sequence.

In the context of transformer models, self-attention helps reduce the dependency between distant positions in the sequence, making it possible to learn longer-range dependencies efficiently. However, this comes at the cost of lower effective resolution due to averaging over multiple positions. T

In [None]:
answer

'Self-attention is an attention mechanism that relates different positions within a single sequence to compute a representation of the sequence. It has been successfully used in various tasks such as reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations. Unlike traditional methods like Convolutional Neural Networks (ConvNets) and Recurrent Neural Networks (RNNs), which rely on position-specific weights, self-attention computes weighted sums across all positions in the sequence, providing a more flexible and powerful way to capture relationships between elements in the sequence.\n\nIn the context of transformer models, self-attention helps reduce the dependency between distant positions in the sequence, making it possible to learn longer-range dependencies efficiently. However, this comes at the cost of lower effective resolution due to averaging over multiple positions. To address this issue, multi-head attention me