# Online PDF RAG System - LangGraph

Today we'll build a RAG (Retrieval-Augmented Generation) system that downloads a PDF from a URL and indexes it for question answering!

Instead of using hardcoded local PDFs, this system will:
1. Download a PDF from an arXiv URL: https://arxiv.org/pdf/2509.22613
2. Save it to our data folder
3. Process it through the same RAG pipeline we would use for a local file: load, chunk, embed, retrieve, and generate

Because the only input is a URL, the same steps work for any publicly accessible PDF!

## Dependencies

Since we'll be relying on OpenAI's models for both embeddings and answer generation, we need to provide our OpenAI API key.

We also use the `requests` library to download the PDF from its URL.

In [1]:
import os
import getpass
import requests
from pathlib import Path

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [2]:
import nest_asyncio

# Jupyter already runs an event loop; this patch lets libraries start nested loops inside it
nest_asyncio.apply()

## Task 1: Online PDF RAG System

Now let's create a RAG system that dynamically downloads and processes PDFs from online URLs!

> NOTE: This approach allows us to work with any PDF available online, making our system much more flexible than hardcoded local files.

## PDF Download and Setup

First, let's download the PDF from arXiv and save it to our data folder.

In [3]:
# Download PDF from arXiv URL
pdf_url = "https://arxiv.org/pdf/2509.22613"
data_folder = Path("data")
pdf_filename = "arxiv_paper.pdf"
pdf_path = data_folder / pdf_filename

# Create data folder if it doesn't exist
data_folder.mkdir(exist_ok=True)

# Download and save the PDF
print(f"Downloading PDF from {pdf_url}...")
response = requests.get(pdf_url, timeout=30)  # fail instead of hanging on a stalled connection
response.raise_for_status()  # raise an HTTPError on a 4xx/5xx response

with open(pdf_path, "wb") as f:
    f.write(response.content)

print(f"✅ PDF downloaded and saved to: {pdf_path}")
print(f"📄 File size: {pdf_path.stat().st_size / 1024:.1f} KB")

Downloading PDF from https://arxiv.org/pdf/2509.22613...
✅ PDF downloaded and saved to: data/arxiv_paper.pdf
📄 File size: 4741.6 KB
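
A successful HTTP status does not guarantee that the response body is a PDF, since a server can return an HTML page with status 200. Here is a minimal sketch of a sanity check, assuming only the standard `%PDF-` file signature. The helper name `looks_like_pdf` is hypothetical and is not part of `requests`.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    # PDF files begin with the ASCII marker "%PDF-" followed by a version number.
    return data.startswith(b"%PDF-")


# Toy inputs standing in for a real response body
pdf_bytes = b"%PDF-1.7\n% rest of the file"
html_bytes = b"<html><body>Not found</body></html>"

print(looks_like_pdf(pdf_bytes))
print(looks_like_pdf(html_bytes))
```

In the download cell, a check like this could run on `response.content` before the file is written, raising an error if it returns `False`.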


## Retrieval

The 'R' in 'RAG': we load the downloaded PDF, split it into chunks, and index those chunks for similarity search.

#### Data Collection and Processing

First, let's load the downloaded PDF. `PyMuPDFLoader` returns one document per page.

In [4]:
from langchain_community.document_loaders import PyMuPDFLoader

# Load the downloaded PDF
pdf_loader = PyMuPDFLoader(str(pdf_path))
documents = pdf_loader.load()

print(f"✅ Loaded {len(documents)} pages from the PDF")
print(f"📄 First page preview: {documents[0].page_content[:200]}...")

✅ Loaded 23 pages from the PDF
📄 First page preview: Preprint as an Arxiv Paper
BENEFITS
AND
PITFALLS
OF
REINFORCEMENT
LEARNING FOR LANGUAGE MODEL PLANNING:
A THEORETICAL PERSPECTIVE
Siwei Wang1†, Yifei Shen1†, Haoran Sun2†, Shi Feng3†, Shang-Hua Teng4,...


Now we can split the pages into chunks of at most 750 tokens each, measuring length with `tiktoken` rather than by character count!

In [5]:
import tiktoken
# The splitters live in the standalone langchain_text_splitters package in current LangChain releases
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Build the tokenizer once instead of on every length check
encoding = tiktoken.encoding_for_model("gpt-4o")

def tiktoken_len(text):
    """Return the number of gpt-4o tokens in the text"""
    return len(encoding.encode(text))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=750,      # maximum chunk length, in tokens
    chunk_overlap=0,     # adjacent chunks share no tokens
    length_function=tiktoken_len,
)

document_chunks = text_splitter.split_documents(documents)

print(f"✅ Created {len(document_chunks)} text chunks from the PDF")
print(f"📄 First chunk preview: {document_chunks[0].page_content[:200]}...")

✅ Created 41 text chunks from the PDF
📄 First chunk preview: Preprint as an Arxiv Paper
BENEFITS
AND
PITFALLS
OF
REINFORCEMENT
LEARNING FOR LANGUAGE MODEL PLANNING:
A THEORETICAL PERSPECTIVE
Siwei Wang1†, Yifei Shen1†, Haoran Sun2†, Shi Feng3†, Shang-Hua Teng4,...
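
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), and it counts whitespace-separated words as a stand-in for `tiktoken` tokens. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace-separated words stand in for real tokens (an assumption of this sketch)
words = "reinforcement learning for language model planning a theoretical perspective".split()

no_overlap = chunk_tokens(words, chunk_size=4)
with_overlap = chunk_tokens(words, chunk_size=4, chunk_overlap=2)

for chunk in with_overlap:
    print(chunk)
```

With `chunk_overlap=0`, text that straddles a boundary is split between two chunks. A small overlap costs extra storage but keeps such text intact in at least one chunk.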


### 📊 Creating Embeddings & Vector Store

Now we'll create embeddings for our document chunks and store them in a vector database for efficient retrieval.

In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
import uuid
import time

start_time = time.time()

# Initialize embeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

# Initialize Qdrant client (in-memory)
qdrant_client = QdrantClient(':memory:')

# Create collection
collection_name = f"pdf_documents_{uuid.uuid4().hex[:8]}"
qdrant_client.create_collection(
    collection_name=collection_name,
    # 1536 is the default output dimension of text-embedding-3-small
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Initialize vector store
vector_store = QdrantVectorStore(
    client=qdrant_client, 
    collection_name=collection_name, 
    embedding=embeddings,
)

# Add documents to vector store
vector_store.add_documents(documents=document_chunks)

end_time = time.time()
print("✅ Vector store created successfully!")
print(f"📁 Collection: {collection_name}")
print(f"⏱️ Processing time: {end_time - start_time:.2f} seconds")
print(f"🔢 Total documents indexed: {len(document_chunks)}")

✅ Vector store created successfully!
📁 Collection: pdf_documents_419c8480
⏱️ Processing time: 2.61 seconds
🔢 Total documents indexed: 41
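
Conceptually, a collection configured with `Distance.COSINE` ranks chunks by the cosine similarity between the query embedding and each chunk embedding. Here is a minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors in place of real 1536-dimensional embeddings. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical, and a production vector store would typically use an optimized index rather than this linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" invented for illustration only
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

results = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```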


### 🤖 Setting up the RAG System with LangGraph

Now we'll define retrieval and generation functions and wire them together as a two-node LangGraph graph: `retrieve` fetches the most relevant chunks, and `generate` answers the question from them.

In [7]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

# Initialize the ChatOpenAI model
llm = ChatOpenAI(model="gpt-4o-mini")

# Create a retriever from the vector store
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

class State(TypedDict):
    question: str
    documents: list
    answer: str

def retrieve(state: State):
    """Retrieve documents relevant to the question"""
    question = state["question"]
    # invoke() replaces the deprecated get_relevant_documents()
    documents = retriever.invoke(question)
    return {"documents": documents}

def generate(state: State):
    """Generate an answer based on the retrieved documents"""
    question = state["question"]
    documents = state["documents"]
    
    # Create context from documents
    context = "\n\n".join([doc.page_content for doc in documents])
    
    # Create prompt
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful AI assistant. Answer the question based on the provided context.
        If you can't find the answer in the context, say so. Be concise and accurate.
        
        Context:
        {context}"""),
        ("human", "{question}")
    ])
    
    # Generate response
    chain = prompt | llm
    response = chain.invoke({"context": context, "question": question})
    
    return {"answer": response.content}

# Create the graph
workflow = StateGraph(State)
workflow.add_node("retrieve", retrieve)
workflow.add_node("generate", generate)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)

# Compile the graph
rag_app = workflow.compile()

print("✅ RAG system initialized successfully!")
print("🔍 Retriever configured to return top 5 relevant documents")
print("💬 Language model: gpt-4o-mini")
print("🌟 Ready to answer questions about the PDF content!")

✅ RAG system initialized successfully!
🔍 Retriever configured to return top 5 relevant documents
💬 Language model: gpt-4o-mini
🌟 Ready to answer questions about the PDF content!
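
Each node receives the current state and returns only the keys it wants to update. The graph merges that partial update into the state before calling the next node. The sketch below imitates this flow with plain functions and dictionaries. It illustrates the idea under simplified assumptions and is not LangGraph's implementation. `run_pipeline`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins that make no API calls.

```python
from typing import Any, Callable

ToyState = dict[str, Any]  # simplified stand-in for the TypedDict state


def fake_retrieve(state: ToyState) -> ToyState:
    """Keyword match over a tiny corpus, standing in for vector search."""
    corpus = ["cats purr", "dogs bark", "cats nap"]
    words = state["question"].lower().split()
    hits = [text for text in corpus if any(w in text.split() for w in words)]
    return {"documents": hits}  # partial update: only the "documents" key


def fake_generate(state: ToyState) -> ToyState:
    """String formatting, standing in for the LLM call."""
    context = " | ".join(state["documents"])
    return {"answer": f"Based on: {context}"}  # partial update: only "answer"


def run_pipeline(nodes: list[Callable[[ToyState], ToyState]], initial: ToyState) -> ToyState:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial)
    for node in nodes:
        state.update(node(state))
    return state


final_state = run_pipeline([fake_retrieve, fake_generate], {"question": "What do cats do"})
print(final_state["documents"])
print(final_state["answer"])
```

The final state contains all three keys (`question`, `documents`, `answer`), which is why the result of invoking the graph can be inspected for both the retrieved documents and the generated answer.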


### 🧪 Testing the RAG System

Let's test our RAG system with some sample questions!

In [None]:
def ask_question(question: str):
    """Helper function to ask questions to our RAG system"""
    print(f"🔍 Question: {question}")
    print("="*50)
    
    result = rag_app.invoke({"question": question})
    
    print(f"📚 Retrieved {len(result['documents'])} relevant documents")
    print(f"💡 Answer: {result['answer']}")
    print("\n" + "="*70 + "\n")
    
    return result

# Test with sample questions
sample_questions = [
    "What is this document about?",
    "What are the main findings or conclusions?",
    "Who are the authors of this paper?"
]

print("🚀 Testing RAG System with Sample Questions\n")

for question in sample_questions:
    try:
        ask_question(question)
    except Exception as e:
        print(f"❌ Error with question '{question}': {e}")
        print()

### 🎯 Interactive Q&A

Now you can ask your own questions about the PDF content!

In [None]:
# Ask your own questions here!
# Example usage:
# ask_question("Your question here")

# Uncomment and modify the line below to ask your own question:
# ask_question("What specific methodology was used in this research?")