# LangChain PDF Data Ingestion Demo

This notebook shows how to load a PDF with LangChain's `PyPDFLoader`, inspect the text of each page, and split that text into chunks for downstream processing.

In [None]:
# If running in a fresh environment, install required packages
# !pip install langchain langchain-community langchain-text-splitters pypdf python-dotenv

In [None]:
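from dotenv import load_dotenv
load_dotenv()  # read variables from a local .env file, if one exists

try:  # current package layout
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_text_splitters import CharacterTextSplitter
except ImportError:  # legacy import paths in older LangChain releases
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import CharacterTextSplitter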

In [None]:
# Load a sample PDF file (replace with your own PDF path if needed)
pdf_path = "./temp/sample.pdf"

# If you don't have a sample PDF, generate one (for example with reportlab) or copy any PDF to this path.
# This cell assumes ./temp/sample.pdf already exists.

loader = PyPDFLoader(pdf_path)
documents = loader.load()  # one Document per page

print(f"Loaded {len(documents)} pages from PDF.")
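
The loading cell above assumes `./temp/sample.pdf` already exists. If no PDF is at hand, one option is to write a tiny one-page file by hand. The sketch below does that with the standard library only. It is a stand-in for the reportlab route mentioned above, not a replacement for a real PDF library: the helper names `build_minimal_pdf` and `write_sample_pdf` are hypothetical, and the text is assumed to be a single line of Latin-1 characters.

```python
from pathlib import Path


def build_minimal_pdf(text: str) -> bytes:
    """Return the bytes of a one-page PDF that draws `text` in Helvetica."""
    # Escape the characters that are special inside a PDF string literal.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 18 pt, move to (72, 720), draw the text.
    stream = f"BT /F1 18 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    # Objects 1..5: catalog, page tree, page, content stream, font.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


def write_sample_pdf(path: str, text: str = "Hello from a sample PDF.") -> Path:
    """Write the minimal PDF to `path`, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(build_minimal_pdf(text))
    return target
```

Calling `write_sample_pdf("./temp/sample.pdf")` before the loading cell should give the loader a file to open. For anything beyond a single line of text, a proper PDF library is the better tool.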

In [None]:
# Extract and print text from each page
for page_number, doc in enumerate(documents, start=1):
    print(f"--- Page {page_number} ---")
    print(doc.page_content[:500])  # Print first 500 characters for brevity
    print()

In [None]:
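# Chunk PDF text for processing.
# CharacterTextSplitter splits on "\n\n" by default; chunk_size and chunk_overlap are counted in characters.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = []
for doc in documents:
    # split_text returns plain strings, so page metadata is not carried over
    chunks.extend(text_splitter.split_text(doc.page_content))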

print(f"Total chunks created: {len(chunks)}")
print("First chunk:")
print(chunks[0] if chunks else "No chunks found.")
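
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is not how `CharacterTextSplitter` is implemented (that class first splits on a separator and then merges the pieces); the function name `chunk_with_overlap` is hypothetical and only illustrates how consecutive windows share characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk begins with the last character of the previous one, which is the effect `chunk_overlap` is meant to have: context at a chunk boundary is not lost entirely.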

## Next Steps

The `chunks` list now holds plain-text strings that you can embed, index in a vector store, or pass to other LangChain components for querying.
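
As a rough illustration of what querying chunks involves, the sketch below ranks strings against a query using bag-of-words cosine similarity. This is a toy stand-in for real embeddings, not a LangChain API: the names `tokenize`, `cosine_similarity`, and `rank_chunks` are hypothetical, and the sample strings are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_chunks(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the `top_k` chunks most similar to the query, best first."""
    query_vector = Counter(tokenize(query))
    scored = [(cosine_similarity(query_vector, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up strings standing in for the notebook's `chunks` list.
sample_chunks = [
    "LangChain loads PDF files page by page.",
    "Bananas are yellow.",
    "Text splitters cut pages into chunks.",
]
best_score, best_chunk = rank_chunks("pdf files", sample_chunks, top_k=1)[0]
print(best_chunk)
```

An embedding model replaces the word counts with dense vectors, but the ranking step (score every chunk against the query, keep the best few) has the same shape.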