# Lab 23: Advanced PDF Processing with Text Splitting

This lab combines PDF loading with structure-aware text splitting to prepare documents for retrieval-augmented generation (RAG) applications. You'll learn:
- How to load PDF documents and split them into optimized chunks
- Using `RecursiveCharacterTextSplitter` for intelligent text segmentation
- Understanding chunk size and overlap parameters for better retrieval
- Preparing documents for vector databases and similarity search
- Building effective knowledge bases with properly chunked content
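
To make the chunk size and overlap parameters concrete before touching any library code, here is a minimal sketch of fixed-size chunking with a sliding window. The function `sliding_window_chunks` is a hypothetical helper written for illustration; it is deliberately simpler than `RecursiveCharacterTextSplitter`, which also tries to break at natural boundaries.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Illustrative helper only, not part of LangChain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough. The cost is storage: more overlap means more duplicated text in the index.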

In [None]:
# Import PyPDFLoader, which reads a PDF into LangChain Document objects (requires the pypdf package)
# Text splitting is handled separately, later in this lab
from langchain_community.document_loaders import PyPDFLoader

In [None]:
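# Set the OpenAI API key for downstream RAG steps such as embeddings or LLM calls
# Loading and splitting do not call the API, so this cell is optional for this lab
# Replace the placeholder with your own key and avoid committing a real key to version control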
import os
os.environ["OPENAI_API_KEY"] = "your-api-key"

In [None]:
# Initialize PyPDFLoader with the handbook PDF file path
# This prepares the document for loading and subsequent text splitting
loader = PyPDFLoader("data/handbook.pdf")

In [None]:
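# Load the PDF into Document objects, each with metadata (page number, source file)
# PyPDFLoader yields one Document per page; load_and_split() also runs a default text splitter,
# so an unusually long page may come back as more than one Document
# Use loader.load() instead if you need exactly one Document per page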
pages = loader.load_and_split()

In [None]:
# Display the structure of all loaded pages
# Shows Document objects with page_content and metadata for each page
pages

In [None]:
# Count the Document objects returned by the loader
# For typical PDFs this matches the page count and indicates document size before text splitting
len(pages)

In [None]:
# Display content of the first page (index 0)
# This shows the raw text content before intelligent chunking
pages[0].page_content

In [None]:
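# Import RecursiveCharacterTextSplitter for structure-aware text chunking
# It tries paragraph, line, and word boundaries in turn before cutting mid-word
# Older LangChain releases expose the same class as langchain.text_splitter.RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter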

In [None]:
# Configure RecursiveCharacterTextSplitter
# chunk_size=200: maximum chunk length, measured in characters by default
# chunk_overlap=50: maximum number of characters shared by consecutive chunks, to preserve context across boundaries
# These small values make the splitting easy to inspect; tune them for your own documents and retrieval needs
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
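
The idea behind recursive splitting can be sketched in plain Python. The code below is a simplified illustration, not LangChain's implementation: `recursive_split` and `_merge` are hypothetical names, overlap is left out, and separators are dropped at chunk boundaries. The separator order `"\n\n"`, `"\n"`, `" "`, `""` mirrors the splitter's defaults.

```python
def _merge(pieces, sep, chunk_size):
    """Greedily join adjacent small pieces with sep while the result fits in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        if not piece:
            continue  # skip empty strings produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            buffer.append(part)
        else:
            # Flush the small parts collected so far, then split the big part further
            chunks.extend(_merge(buffer, sep, chunk_size))
            buffer = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(_merge(buffer, sep, chunk_size))
    return chunks


text = "Intro line.\n\nA longer paragraph with several words in it.\n\nEnd."
result = recursive_split(text, chunk_size=20)
print(result)
# ['Intro line.', 'A longer paragraph', 'with several words', 'in it.', 'End.']
```

Short paragraphs survive intact, while the long one is broken at word boundaries rather than mid-word. This is why recursive splitting tends to produce more readable chunks than a plain fixed-width window.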

In [None]:
# Split every page into chunks of at most 200 characters
# The splitter prefers paragraph, line, and word boundaries, so chunks rarely break mid-word
# Each chunk carries a copy of its source page's metadata (source file, page number)
chunks = text_splitter.split_documents(pages)
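
How metadata travels from a page to its chunks can be shown with a small stand-in. In this sketch, `Doc` is a hypothetical substitute for LangChain's `Document`, and `split_with_metadata` uses naive fixed-size cuts rather than the recursive strategy. Recording a `start_index` is modeled on the splitter's optional `add_start_index=True` setting.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(docs, chunk_size):
    """Cut each Doc into fixed-size pieces, copying the parent's metadata onto every piece."""
    out = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            meta = dict(doc.metadata)    # copy, so the parent's metadata is not mutated
            meta["start_index"] = start  # character offset of the chunk within its page
            out.append(Doc(doc.page_content[start:start + chunk_size], meta))
    return out


pages_demo = [
    Doc("abcdefghij", {"source": "data/handbook.pdf", "page": 0}),
    Doc("klmno", {"source": "data/handbook.pdf", "page": 1}),
]
chunks_demo = split_with_metadata(pages_demo, chunk_size=4)
for chunk in chunks_demo:
    print(chunk.page_content, chunk.metadata)
```

Because every chunk keeps `source` and `page`, a retrieval hit can be traced back to the page it came from, which is useful for citations in a RAG answer.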

In [None]:
# Inspect the results of text splitting: total count, one specific chunk (index 2), then the full list
# A notebook cell displays only its last bare expression, so the first two values are printed explicitly
print(len(chunks))
print(chunks[2])
chunks
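
Beyond eyeballing individual chunks, a quick length summary helps judge whether the chosen parameters behave as expected. The helper below, `summarize_lengths`, is a hypothetical convenience function that works on plain strings, so it runs without LangChain.

```python
def summarize_lengths(texts, chunk_size):
    """Return basic statistics on chunk lengths, including how many exceed chunk_size."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "over_limit": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "over_limit": sum(1 for n in lengths if n > chunk_size),
    }


demo_texts = ["alpha", "beta gamma", "delta epsilon zeta"]
summary = summarize_lengths(demo_texts, chunk_size=12)
print(summary)
# {'count': 3, 'min': 5, 'max': 18, 'mean': 11.0, 'over_limit': 1}
```

With the lab's data, the call would be `summarize_lengths([c.page_content for c in chunks], chunk_size=200)`. Many very short chunks suggest the text has frequent hard breaks, which may call for a larger `chunk_size`.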