# Lab 26: Complete PDF RAG System - Document Question Answering

## Learning Objectives
In this lab, you will learn how to:
- Build a complete end-to-end RAG (Retrieval Augmented Generation) system for PDF documents
- Load and process PDF documents using PyPDFLoader for text extraction
- Implement intelligent text chunking strategies for optimal retrieval performance
- Create vector embeddings and store them in Chroma vector database
- Build a question-answering chain that retrieves relevant context and generates accurate answers
- Understand the complete RAG pipeline: Load → Split → Embed → Store → Retrieve → Generate

## Overview
This lab builds a working RAG system that answers questions about a PDF document by combining document retrieval with LLM generation. You'll build the full pipeline: extract the PDF text, split it into chunks, index the chunks in a vector store, and generate answers from the retrieved context. It brings together the RAG concepts from the previous labs in one practical application.

In [None]:
# Complete RAG System Implementation - Essential Imports
# This lab demonstrates a full end-to-end RAG system for PDF document question answering
# combining document loading, processing, embedding, storage, retrieval, and generation

# Document Loading and Processing
from langchain_community.document_loaders import PyPDFLoader  # PDF document loader for text extraction
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Separator-aware text chunking

# Vector Storage and Embeddings  
from langchain_openai import OpenAIEmbeddings  # High-quality semantic embeddings
from langchain_chroma import Chroma  # Vector database for similarity search

# LLM and Chain Components
from langchain_openai import ChatOpenAI  # OpenAI's chat model for answer generation
from langchain_core.prompts import PromptTemplate  # Structured prompt templates
from langchain_core.runnables import RunnablePassthrough  # Data flow management
from langchain_core.output_parsers import StrOutputParser  # Clean string output

In [None]:
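# OpenAI API Configuration
# Configure authentication for OpenAI services (embeddings and chat model)
import os
from getpass import getpass

# Both the text-embedding-3-large model and ChatOpenAI read the key from OPENAI_API_KEY
# Use the key already set in the environment, otherwise prompt for it;
# avoid hardcoding a real key in the notebook
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")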

In [None]:
# Initialize PDF Document Loader
# PyPDFLoader extracts the text of a PDF file page by page
# Each page becomes one Document whose metadata records the source path and page number

# Load the employee handbook PDF document
# This could be any PDF: manuals, reports, research papers, documentation, etc.
loader = PyPDFLoader("data/handbook.pdf")

print("📄 PDF Loader initialized for: data/handbook.pdf")
print("🔧 PyPDFLoader will extract text from each page individually")
print("📊 Page metadata will be preserved for better document organization")

In [None]:
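# Load and Extract PDF Content
# loader.load() extracts the text of every page and returns one Document per page,
# each with metadata (source file path and page number)
# Note: load_and_split() would additionally re-split the pages with a default
# text splitter, so its result is not guaranteed to be one document per page

pages = loader.load()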

print(f"📚 Successfully loaded PDF with {len(pages)} pages")
print("🔍 Each page is now a separate document with preserved metadata")
print("📋 Page documents include source file path and page numbers")

# Display information about the loaded content
if pages:
    print(f"📄 First page preview: {pages[0].page_content[:200]}...")
    print(f"🏷️ Page metadata: {pages[0].metadata}")

In [None]:
# Text Chunking for Retrieval
# RecursiveCharacterTextSplitter prefers to split on paragraph, line, and word boundaries,
# so chunks tend to end at natural breaks rather than mid-word

# Configure the text splitter:
# - chunk_size=200: maximum chunk length in characters; small chunks give precise retrieval
#   but each one carries little context
# - chunk_overlap=50: up to 50 characters (25% of chunk_size) shared between neighboring chunks
# The overlap reduces the chance that a sentence is cut off at a chunk boundary
# These values suit a small demo; tune them for your own documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)

# Split all pages into optimized chunks
chunks = text_splitter.split_documents(pages)

print(f"📦 Split {len(pages)} pages into {len(chunks)} chunks")
print("📏 Chunk size: up to 200 characters with up to 50 characters of overlap")
print("🎯 Small chunks make retrieval more precise; overlap keeps context across boundaries")

# Display chunk statistics
if chunks:
    avg_length = sum(len(chunk.page_content) for chunk in chunks) / len(chunks)
    print(f"📊 Average chunk length: {avg_length:.1f} characters")
    print(f"🔍 Sample chunk: {chunks[0].page_content[:150]}...")

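To make the two parameters concrete, here is a minimal sketch of overlapping chunks using a plain fixed-width character window. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter looks for separators such as paragraph breaks first); the function name `sliding_window_chunks` and the sample text are assumptions made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(500))  # 500 placeholder characters
chunks = sliding_window_chunks(sample, chunk_size=200, chunk_overlap=50)

print(len(chunks), [len(c) for c in chunks])
# The last 50 characters of one chunk are the first 50 of the next
print(chunks[0][-50:] == chunks[1][:50])
```

With `chunk_size=200` and `chunk_overlap=50` the window advances 150 characters at a time, so a larger overlap means more chunks (and more embedding calls) for the same text.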
In [None]:
# Initialize the Embedding Model
# OpenAI's text-embedding-3-large maps each text to a vector that captures its meaning
# Retrieval quality in a RAG system depends heavily on these embeddings

# text-embedding-3-large:
# - Produces 3072-dimensional vectors by default
# - Is the larger of OpenAI's two text-embedding-3 models (the other is text-embedding-3-small)
# - Costs more per token than text-embedding-3-small
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("🚀 OpenAI Embeddings initialized with text-embedding-3-large")
print("📐 Generates 3072-dimensional vectors for semantic search")
print("🎯 Optimized for high-quality document retrieval in RAG systems")

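Semantic search compares embedding vectors with a similarity measure. The sketch below shows cosine similarity, one common choice, on tiny hand-written vectors. The three-dimensional vectors and their names are hypothetical stand-ins, and the metric a Chroma collection actually uses depends on how that collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; real ones here have 3072 dimensions
sick_leave_question = [0.9, 0.1, 0.0]
sick_leave_chunk = [0.8, 0.2, 0.1]
parking_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(sick_leave_question, sick_leave_chunk))
print(cosine_similarity(sick_leave_question, parking_chunk))
```

The question vector points in nearly the same direction as the sick-leave chunk, so that pair scores much higher than the question paired with the parking chunk.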
In [None]:
# Create Vector Database from Document Chunks
# Chroma.from_documents() automates the embedding and storage process
# This creates a searchable knowledge base from the PDF content
# Without a persist_directory argument the collection lives in memory and is rebuilt on every run

# The from_documents() method performs multiple operations:
# 1. Generates embeddings for each chunk using text-embedding-3-large
# 2. Creates Chroma vector database instance
# 3. Stores embedded chunks with original text and metadata
# 4. Builds similarity search indexes for fast retrieval
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)

print("🗄️ Vector store created successfully!")
print(f"📚 Embedded and stored {len(chunks)} document chunks")
print("🔍 Vector database ready for semantic similarity search")
print("⚡ Optimized indexes enable fast retrieval for question answering")

In [None]:
# Initialize Document Retriever
# Convert vector store into a retriever interface for seamless integration with LangChain
# The retriever will find the most relevant document chunks for answering questions

# as_retriever() creates a standardized interface that:
# - Accepts text queries and converts them to embeddings
# - Performs similarity search against stored document vectors
# - Returns the most similar chunks (4 by default), ranked by similarity
# - Integrates seamlessly with LangChain chains and pipelines
retriever = vectorstore.as_retriever()

print("🔍 Document retriever initialized")
print("📊 Retriever will find most relevant chunks for each question")
print("🎯 Uses semantic similarity to match questions with document content")
print("⚡ Ready for integration with question-answering chain")

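The retrieval step boils down to "score every chunk against the query, keep the top k". Here is a toy sketch of that ranking. It is deliberately simplified: `embed` counts words instead of calling an embedding model, so it matches on shared words rather than meaning, and the handbook snippets are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 4) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical handbook snippets, invented for this illustration
handbook_chunks = [
    "Employees accrue sick leave each month.",
    "Parking permits are issued by the facilities team.",
    "Unused sick leave carries over to the next year.",
    "The office is closed on public holidays.",
]

top = retrieve("How does sick leave work?", handbook_chunks, k=2)
print(top)
```

A smaller `k` keeps the prompt short and focused; a larger `k` lowers the risk of missing the relevant chunk at the cost of more tokens sent to the LLM.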
In [None]:
# Document Formatting Utility Function
# Converts retrieved document chunks into a clean, readable format for the LLM
# Essential for preparing context that the language model can effectively process

def format_docs(docs):
    """
    Format retrieved documents for LLM consumption.
    
    Args:
        docs: List of retrieved document chunks from vector search
    
    Returns:
        str: Formatted string with document content separated by double newlines
    """
    # Join document content with clear separators for better LLM comprehension
    # Double newlines create clear boundaries between different document chunks
    return "\n\n".join(doc.page_content for doc in docs)

print("📝 Document formatter function defined")
print("🔧 Converts retrieved chunks into clean context for LLM processing")
print("📋 Maintains clear separation between different document chunks")

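One possible variation, sketched under assumptions, is to prefix each chunk with where it came from so that answers can cite a page. The `Doc` dataclass below is a minimal stand-in for a LangChain `Document`, the function name `format_docs_with_sources` is hypothetical, and the sample chunks are invented. The `source` and `page` metadata keys are the ones PyPDFLoader sets; its `page` value is a zero-based index.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only these two fields are used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs: list[Doc]) -> str:
    """Join chunks with blank lines, prefixing each with its source file and page."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        parts.append(f"[{source}, page {page}]\n{doc.page_content}")
    return "\n\n".join(parts)


# Hypothetical retrieved chunks, invented for this illustration
docs = [
    Doc("Sick leave accrues monthly.", {"source": "data/handbook.pdf", "page": 11}),
    Doc("Unused leave carries over.", {"source": "data/handbook.pdf", "page": 12}),
]
formatted = format_docs_with_sources(docs)
print(formatted)
```

Because the function only reads `page_content` and `metadata`, it should also work on real `Document` objects returned by a retriever.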
In [None]:
# Initialize Language Model for Answer Generation
# ChatOpenAI provides the generative component of the RAG system
# Will process retrieved context and generate accurate, contextual answers

# Name the model explicitly rather than relying on the library default
# (gpt-3.5-turbo here: inexpensive and fast)
# The model will:
# - Receive structured prompts with context and questions
# - Be instructed to answer only from the provided document context
# - Be instructed to say it does not know when the context lacks the answer
llm = ChatOpenAI(model="gpt-3.5-turbo")

print("🤖 ChatOpenAI language model initialized")
print("💬 Using GPT-3.5-turbo for answer generation")
print("📚 Model will generate answers based on retrieved document context")
print("✅ Ready to process questions with factual, context-based responses")

In [None]:
# Create Structured Prompt Template for RAG Question Answering
# The prompt template ensures consistent, accurate responses based on document context
# Important for keeping answers factual and reducing hallucination

# Design a prompt template with clear instructions:
# 1. Define the AI's role as a factual question-answer bot
# 2. Emphasize responding only from provided context
# 3. Include fallback behavior for unknown information
# 4. Use clear variable placeholders for dynamic content
template = """You are a question-answering bot. Be factual in your response.
Answer the question using only the context below.
If the context does not contain the answer, say that you don't know.

Context:
{context}

Question: {question}
"""

# Convert template string into a LangChain PromptTemplate object
prompt = PromptTemplate.from_template(template)

print("📝 RAG prompt template created")
print("🎯 Template ensures factual responses based only on document context") 
print("🚫 Instructs the model to say it doesn't know rather than speculate")
print("🔧 Variables: {question} for user query, {context} for retrieved documents")

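Filling a prompt template is ordinary placeholder substitution. The sketch below uses Python's built-in `str.format` to show what the rendered prompt looks like; `PromptTemplate.from_template` uses the same `{name}` placeholder style by default. The variable name `demo_template` and the sample context are assumptions made for illustration.

```python
demo_template = (
    "Answer the question using only the context below.\n"
    "If the context does not contain the answer, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

# Hypothetical values, invented for this illustration
rendered = demo_template.format(
    context="Sick leave accrues monthly.\n\nUnused leave carries over.",
    question="What's the sick leave policy?",
)
print(rendered)
```

Printing the rendered prompt like this is a quick way to check that the context and the question end up where you expect. Literal braces in a template need to be doubled (`{{` and `}}`) so they are not read as placeholders.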
In [None]:
# Build Complete RAG Question-Answering Chain
# This chain orchestrates the entire RAG pipeline: Retrieve → Format → Generate → Parse
# Demonstrates advanced LangChain composition using LCEL (LangChain Expression Language)

# RAG Chain Architecture:
# 1. Input: User question
# 2. Parallel Processing (LCEL converts the dict into a RunnableParallel that feeds the question to both branches):
#    - retriever | format_docs: Finds relevant chunks and formats them as context
#    - RunnablePassthrough(): Passes the original question unchanged
# 3. prompt: Combines formatted context and question into structured prompt
# 4. llm: Generates answer based on prompt
# 5. StrOutputParser(): Extracts clean string response

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("⛓️ Complete RAG chain constructed!")
print("🔄 Pipeline: Question → Retrieve → Format → Generate → Parse")
print("📊 Parallel processing: Context retrieval + Question passthrough")
print("🎯 End-to-end system ready for document-based question answering")

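The `|` composition can look like magic, so here is a toy sketch of the idea in plain Python. It is not LCEL's implementation: the `Step` class, the `fan_out` helper, and the fake retriever, prompt, and LLM are all hypothetical stand-ins. LCEL accepts a plain dict for the parallel step and converts it itself; this sketch spells that step out as `fan_out`.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b) runs a first, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def fan_out(branches: dict) -> Step:
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


# Hypothetical stand-ins for the retriever, formatter, prompt, and LLM
fake_retriever = Step(lambda question: ["Sick leave accrues monthly."])
join_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda value: value)
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt_text: f"(answer based on) {prompt_text}")

toy_chain = (
    fan_out({"context": fake_retriever | join_docs, "question": passthrough})
    | fake_prompt
    | fake_llm
)
result = toy_chain.invoke("What's the sick leave policy?")
print(result)
```

The question string enters once, is sent to both branches, and the resulting dict has exactly the keys the prompt expects, which mirrors the data flow of the real chain.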
In [None]:
# Execute RAG System with Sample Question
# Test the complete pipeline with a question about the employee handbook
# This demonstrates real-world usage of document-based question answering

print("❓ Testing RAG system with employee handbook question...")
print("🔍 Question: 'What's the sick leave policy?'")
print("📚 System will search through PDF content for relevant information")
print()

# Invoke the complete RAG chain:
# 1. Retrieves relevant chunks about sick leave policy
# 2. Formats context for the language model
# 3. Generates accurate answer based on document content
# 4. Returns clean, factual response
response = chain.invoke("What's the sick leave policy?")

print("💬 RAG System Response:")
print("=" * 50)
print(response)
print("=" * 50)
print()
print("✅ RAG chain returned a response")
print("🎯 Compare the response with the handbook to confirm it is grounded in the document")

## Key Takeaways and Production Insights

### What You've Accomplished
1. **Complete RAG System**: Built an end-to-end pipeline for PDF document question answering
2. **Document Processing**: Implemented intelligent PDF loading and text chunking strategies
3. **Vector Storage**: Created searchable knowledge base with high-quality embeddings
4. **Intelligent Retrieval**: Built semantic search that finds contextually relevant information
5. **Answer Generation**: Combined retrieval with LLM generation for accurate, grounded responses

### Technical Architecture
- **Document Loading**: PyPDFLoader for page-by-page PDF text extraction with source and page metadata
- **Text Processing**: RecursiveCharacterTextSplitter with chunk size 200 and overlap 50 (demo values; tune them for your documents)
- **Embeddings**: OpenAI text-embedding-3-large (3072-dimensional vectors)
- **Vector Database**: Chroma for efficient storage and similarity search
- **LLM Integration**: ChatOpenAI with carefully crafted prompts for factual responses

### RAG Pipeline Components
1. **Load**: Extract text from the PDF page by page, keeping source and page metadata
2. **Split**: Break pages into overlapping chunks sized for retrieval
3. **Embed**: Convert text chunks into high-dimensional vector representations
4. **Store**: Build searchable vector database with efficient similarity indexes
5. **Retrieve**: Find most relevant chunks based on semantic similarity to questions
6. **Generate**: Produce accurate answers grounded in retrieved document context

### Production Considerations
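- **Scalability**: The in-memory Chroma store used here suits a single document; larger collections need a persistent or hosted vector store
- **Accuracy**: Restricting answers to retrieved context reduces hallucination but does not eliminate it; answers are only as good as the retrieved chunks
- **Performance**: Chunk size, overlap, and embedding model all affect retrieval quality and latency; tune them on your own documents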
- **Flexibility**: Can be adapted for various document types and domains

### Real-World Applications
- **Enterprise Knowledge Bases**: Employee handbooks, policy documents, procedures
- **Customer Support**: Product manuals, FAQ systems, troubleshooting guides
- **Research Tools**: Academic papers, technical documentation, regulatory documents
- **Legal Applications**: Contract analysis, compliance documentation, case law research