# Lab 21: Document Loading from PDF Files

This lab demonstrates how to load and process PDF documents using LangChain's document loaders. You'll learn:
- How to use `PyPDFLoader` for extracting text from PDF files
- Loading and splitting PDF documents into manageable chunks
- Understanding document structure and metadata
- Accessing specific pages and content from PDF files
- Preparing PDF data for further processing in LangChain workflows

In [None]:
# Import LangChain's PDF document loader
# PyPDFLoader reads PDFs with the pypdf package, which must be installed alongside langchain-community
from langchain_community.document_loaders import PyPDFLoader

In [None]:
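# Configure the OpenAI API key for optional downstream processing
# PyPDFLoader does not need it; set it only if later steps call an OpenAI model
# Replace the placeholder with your own key and avoid committing a real key to version control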
import os
os.environ["OPENAI_API_KEY"] = "your-api-key"
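
Hardcoding a key in a notebook makes it easy to leak by accident. A minimal sketch of one alternative, assuming the key is either already exported in the shell or can be typed in interactively. The helper name `ensure_api_key` is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_api_key(name: str = "OPENAI_API_KEY", prompt=getpass) -> str:
    """Return the key stored under `name`, asking for it only if it is missing.

    Illustrative helper only; it is not a LangChain function.
    """
    key = os.environ.get(name)
    if not key:
        # Ask without echoing the value, then keep it for later cells
        key = prompt(f"Enter {name}: ")
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` in place of the direct assignment leaves an existing environment value untouched and only prompts when nothing is set.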

In [None]:
# Initialize PDF loader with the path to the handbook PDF file
# The loader will extract text content from the specified PDF document
loader = PyPDFLoader("data/handbook.pdf")
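
A wrong relative path is a common stumbling block, because the path is resolved against the notebook's working directory. The sketch below checks the path before it is handed to the loader. `resolve_pdf` is a hypothetical helper written for this lab, not something LangChain provides.

```python
from pathlib import Path


def resolve_pdf(path: str) -> Path:
    """Return `path` as a Path after checking that it points to a PDF file."""
    pdf = Path(path)
    if not pdf.is_file():
        # Show the absolute location so a wrong working directory is obvious
        raise FileNotFoundError(f"No file found at {pdf.resolve()}")
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got {pdf.name!r}")
    return pdf
```

With this in place, the loader could be created as `PyPDFLoader(str(resolve_pdf("data/handbook.pdf")))`.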

In [None]:
# Load the PDF and split it into Document objects
# PyPDFLoader produces one Document per page; load_and_split() then passes them through
# a text splitter (RecursiveCharacterTextSplitter by default), so a long page can yield several chunks
# Each Document keeps metadata such as the source path and the page number
pages = loader.load_and_split()
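
To see whether the splitter divided any page, the Documents can be grouped by their page metadata. This is a minimal sketch, assuming each Document carries a `page` key in `metadata` as PyPDFLoader's output does. The `SimpleNamespace` objects are made-up stand-ins with the same two attributes as a LangChain Document, and `chunks_per_page` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def chunks_per_page(docs):
    """Count how many Document objects came from each source page."""
    return Counter(doc.metadata.get("page") for doc in docs)


# Stand-in data: page 1 was long enough to be split into two chunks
sample = [
    SimpleNamespace(page_content="intro", metadata={"source": "demo.pdf", "page": 0}),
    SimpleNamespace(page_content="part one", metadata={"source": "demo.pdf", "page": 1}),
    SimpleNamespace(page_content="part two", metadata={"source": "demo.pdf", "page": 1}),
]
counts = chunks_per_page(sample)
print(dict(counts))  # {0: 1, 1: 2}
```

Running `chunks_per_page(pages)` on the real output shows at a glance whether every page stayed in one piece.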

In [None]:
# Display all pages (commented out to avoid overwhelming output)
# Uncomment to see the full structure of all extracted pages
#pages

In [None]:
# Count the Document objects returned by the loader
# For short pages this equals the page count; pages split into several chunks would raise it
len(pages)

3
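
Beyond the count, a compact overview of each Document helps confirm what was extracted. The sketch below assumes the `source` and `page` metadata keys and uses a made-up stand-in object in place of a real Document. `summarize` is a hypothetical helper name.

```python
from types import SimpleNamespace


def summarize(docs, preview_chars: int = 40):
    """Build one summary row per Document: index, metadata, length, preview."""
    rows = []
    for index, doc in enumerate(docs):
        text = doc.page_content
        rows.append({
            "index": index,
            "source": doc.metadata.get("source"),
            "page": doc.metadata.get("page"),
            "chars": len(text),
            # Collapse whitespace so the preview fits on one line
            "preview": " ".join(text.split())[:preview_chars],
        })
    return rows


# Stand-in with the same attributes as a LangChain Document
demo_docs = [
    SimpleNamespace(
        page_content="Welcome\nto the   handbook",
        metadata={"source": "demo.pdf", "page": 0},
    )
]
for row in summarize(demo_docs):
    print(row)
```

Passing the real `pages` list instead of `demo_docs` gives one row per extracted Document.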

In [None]:
# Access the text of the second Document (index 1)
# The page_content attribute holds the extracted text; metadata holds the source and page number
# List indices are zero-based, so pages[0] is the first Document
pages[1].page_content
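
A list index only matches a page number while every page yields exactly one Document. A lookup by metadata is more robust. This sketch assumes the `page` value in `metadata` is zero-based, as in PyPDFLoader's output; `documents_for_page` is a hypothetical helper and the sample objects are stand-ins.

```python
from types import SimpleNamespace


def documents_for_page(docs, page_number: int):
    """Return every Document extracted from a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number starts at 1")
    zero_based = page_number - 1  # metadata is assumed to count pages from 0
    return [doc for doc in docs if doc.metadata.get("page") == zero_based]


# Stand-in data: the second page was split into two chunks
sample_docs = [
    SimpleNamespace(page_content="cover", metadata={"page": 0}),
    SimpleNamespace(page_content="policy A", metadata={"page": 1}),
    SimpleNamespace(page_content="policy B", metadata={"page": 1}),
]
second_page = documents_for_page(sample_docs, 2)
print([doc.page_content for doc in second_page])  # ['policy A', 'policy B']
```

Joining the `page_content` of the returned Documents reconstructs the text of a page that was split.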

