## 📄 PDF Document Loading and Processing

**Overview:**
This section demonstrates various approaches to loading and processing PDF documents for RAG systems. PDFs are complex documents that require specialized handling due to their formatting, layout, and potential inclusion of images, tables, and other visual elements.

**What You'll Learn:**
- Different PDF loaders and their strengths
- Handling common PDF extraction challenges
- Text cleaning and normalization techniques
- Advanced processing with metadata enhancement
- Comparison of basic vs. advanced processing approaches

**Key Challenges with PDFs:**
- Text extraction artifacts (ligatures, spacing issues)
- Complex layouts and formatting
- Embedded images and tables
- Scanned documents requiring OCR
- Inconsistent metadata
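
A scanned page usually yields little or no text when run through a text extractor, which gives a cheap way to spot pages that may need OCR. The sketch below is a minimal heuristic, not part of any loader: the function name `pages_needing_ocr`, the `min_chars` threshold of 20, and the sample page strings are all assumptions chosen for illustration.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-indexed page numbers whose extracted text is nearly empty."""
    flagged = []
    for page_number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so blank lines do not hide an empty page
        visible_chars = sum(1 for ch in text if not ch.isspace())
        if visible_chars < min_chars:
            flagged.append(page_number)
    return flagged


# Hypothetical per-page extraction results: the second page came back empty
sample_pages = [
    "Attention Is All You Need. The dominant sequence transduction models ...",
    "  \n \n",
    "3 Model Architecture. Most competitive neural sequence transduction models ...",
]
print(pages_needing_ocr(sample_pages))
```

In practice the threshold would need tuning per corpus, since title pages and figure-only pages can also be short.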

In [None]:
"""
PDF Loading Demonstration
========================
This section demonstrates different PDF loading libraries and their capabilities.
We'll compare PyPDFLoader and PyMuPDFLoader to understand their strengths and use cases.
"""

print("🚀 Demonstrating PDF Loading Techniques")

🚀 Demonstrating PDF Loading Techniques


In [None]:
# Import PDF loaders from LangChain community package
from langchain_community.document_loaders import (
    PyPDFLoader,     # Basic PDF loader built on the pypdf library
    PyMuPDFLoader    # Advanced PDF loader using PyMuPDF (fitz) library
)



In [None]:
"""
PyPDFLoader - Basic PDF Processing
==================================
PyPDFLoader is the most straightforward PDF loader in LangChain.
It uses the pypdf library (the successor to PyPDF2) for text extraction.

Characteristics:
- Simple and reliable for standard text PDFs
- Preserves page structure and numbering
- Good for documents with clear text layout
- May struggle with complex formatting or images
"""

### PyPDFLoader
print("1️⃣ PyPDFLoader")

try:
    # Initialize PyPDFLoader with PDF file path
    pypdf_loader = PyPDFLoader("data/pdf/attention.pdf")
    
    # Load the PDF - returns list of Document objects (one per page)
    pypdf_docs = pypdf_loader.load()
    
    print(f"  ✅ Loaded {len(pypdf_docs)} pages")
    print(f"  📄 Page 1 content preview: {pypdf_docs[0].page_content[:100]}...")
    
    # Display metadata from the second page (if exists)
    if len(pypdf_docs) > 1:
        print(f"  📋 Sample metadata: {pypdf_docs[1].metadata}")

except Exception as e:
    print(f"  ❌ Error loading PDF with PyPDFLoader: {e}")
    print("  💡 Make sure the PDF file exists in the data/pdf/ directory")

PyPdfloader
[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data/pdf/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\n

In [None]:
"""
PyMuPDFLoader - Advanced PDF Processing
======================================
PyMuPDFLoader uses the PyMuPDF (fitz) library, which generally provides:
- Faster text extraction than pure-Python parsers
- Better handling of complex layouts
- Support for image extraction
- Additional metadata fields (for example, file_path and format)
"""

# Method 2: PyMuPDFLoader (Fast and accurate)
print("\n2️⃣ PyMuPDFLoader")
try:
    # Initialize PyMuPDFLoader - generally faster than PyPDFLoader
    pymupdf_loader = PyMuPDFLoader("data/pdf/attention.pdf")
    
    # Load PDF with enhanced extraction capabilities
    pymupdf_docs = pymupdf_loader.load()
    
    print(f"  ✅ Loaded {len(pymupdf_docs)} pages")
    print(f"  📄 Page 1 content preview: {pymupdf_docs[0].page_content[:100]}...")
    
    # PyMuPDFLoader often provides more detailed metadata
    if pymupdf_docs:
        print(f"  📋 Enhanced metadata: {pymupdf_docs[0].metadata}")
        
except Exception as e:
    print(f"  ❌ Error loading with PyMuPDFLoader: {e}")
    print("  💡 Install PyMuPDF: pip install pymupdf")


3️⃣ PyMuPDFLoader
  Loaded 15 pages
  Includes detailed metadata
[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': 'data/pdf/attention.pdf', 'file_path': 'data/pdf/attention.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'trapped': '', 'modDate': 'D:20240410211143Z', 'creationDate': 'D:20240410211143Z', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of

In [7]:
# 📊 PDF Loader Comparison
print("\n📊 PDF Loader Comparison:")
print("\nPyPDFLoader:")
print("  ✅ Simple and reliable")
print("  ✅ Good for most PDFs")
print("  ✅ Preserves page numbers")
print("  ❌ Basic text extraction")
print("  Use when: Standard text PDFs")

print("\nPyMuPDFLoader:")
print("  ✅ Fast processing")
print("  ✅ Good text extraction")
print("  ✅ Image extraction support")
print("  Use when: Speed is important")


📊 PDF Loader Comparison:

PyPDFLoader:
  ✅ Simple and reliable
  ✅ Good for most PDFs
  ✅ Preserves page numbers
  ❌ Basic text extraction
  Use when: Standard text PDFs

PyMuPDFLoader:
  ✅ Fast processing
  ✅ Good text extraction
  ✅ Image extraction support
  Use when: Speed is important
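
The two loaders also differ in which metadata keys they return, which matters if downstream code filters on a specific key. The sketch below compares the key sets copied from the two loader outputs shown earlier in this section; the variable names are illustrative only.

```python
# Metadata keys copied from the PyPDFLoader output shown earlier
pypdf_keys = {
    "producer", "creator", "creationdate", "author", "keywords", "moddate",
    "ptex.fullbanner", "subject", "title", "trapped", "source",
    "total_pages", "page", "page_label",
}

# Metadata keys copied from the PyMuPDFLoader output shown earlier
pymupdf_keys = {
    "producer", "creator", "creationdate", "source", "file_path",
    "total_pages", "format", "title", "author", "subject", "keywords",
    "moddate", "trapped", "modDate", "creationDate", "page",
}

only_pypdf = sorted(pypdf_keys - pymupdf_keys)
only_pymupdf = sorted(pymupdf_keys - pypdf_keys)
shared = sorted(pypdf_keys & pymupdf_keys)

print("Only in PyPDFLoader:  ", only_pypdf)
print("Only in PyMuPDFLoader:", only_pymupdf)
print("Shared key count:     ", len(shared))
```

Code that has to work with either loader is safest relying on the shared keys such as `source`, `page`, and `total_pages`. Key names can change between library versions, so this comparison reflects only the outputs captured here.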


### Handling PDF Challenges 

**🎯 Purpose of This Section**

PDFs are notoriously difficult to parse because they:

- Store text as positioned glyphs rather than as a continuous stream of characters, so reading order and word spacing have to be reconstructed
- Can have formatting issues
- May contain scanned images (requiring OCR)
- Often have extraction artifacts

**Common PDF Extraction Issues:**
- **Ligatures**: Characters like 'fi' become 'ﬁ'
- **Excessive whitespace**: Multiple spaces and line breaks
- **Page headers/footers**: Unwanted repeated text
- **Broken sentences**: Text split across lines incorrectly
- **Encoding problems**: Special characters become garbled
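
Several of these issues can be handled with the standard library alone. The sketch below is one possible approach, not the method used later in this notebook: it uses Unicode NFKC normalization to expand ligatures, drops lines that contain only a "Page X of Y" marker, and rejoins words hyphenated across a line break. The function name `normalize_extracted_text`, the `PAGE_MARKER` pattern, and the sample string are assumptions for illustration.

```python
import re
import unicodedata

# Matches a line that contains nothing but a "Page X of Y" marker
PAGE_MARKER = re.compile(r"^\s*Page \d+ of \d+\s*$", re.IGNORECASE)


def normalize_extracted_text(text: str) -> str:
    # NFKC folds compatibility characters, including ligatures such as U+FB01, into plain letters
    text = unicodedata.normalize("NFKC", text)
    # Drop lines that consist only of a page marker
    lines = [line for line in text.split("\n") if not PAGE_MARKER.match(line)]
    text = "\n".join(lines)
    # Rejoin words hyphenated across a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse all remaining whitespace, including line breaks, to single spaces
    return " ".join(text.split())


sample = "The e\ufb03cient work\ufb02ow optimi-\nzation improved\nresults.\n\nPage 1 of 10\n"
print(normalize_extracted_text(sample))
```

Two caveats apply. NFKC also rewrites other compatibility characters (superscripts and full-width forms, for example), and the hyphen rule will wrongly merge genuine compounds that happen to break at the hyphen, such as "self-" followed by "attention" on the next line.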

In [None]:
"""
Text Cleaning Demonstration
===========================
This example shows how raw PDF text often contains extraction artifacts
and how we can clean it for better processing in RAG systems.

Common issues demonstrated:
- Excessive whitespace
- Ligature characters (ﬁ, ﬂ)
- Page numbers and headers
- Broken formatting
"""

# Example of raw PDF extraction with typical problems
raw_pdf_text = """Company Financial Report


    The ﬁnancial performance for ﬁscal year 2024
    shows signiﬁcant growth in proﬁtability.
    
    
    
    Revenue increased by 25%.
    
The company's efﬁciency improved due to workﬂow
optimization.


Page 1 of 10
"""

def clean_text(text):
    """
    Basic text cleaning function for PDF extraction artifacts
    
    Args:
        text: Raw text from PDF extraction
        
    Returns:
        Cleaned text with reduced artifacts
    """
    # Remove excessive whitespace and normalize spacing
    text = " ".join(text.split())
    
    # Fix common ligature issues
    text = text.replace("ﬁ", "fi")  # fi ligature
    text = text.replace("ﬂ", "fl")  # fl ligature
    
    return text

# Apply the cleaning function and show before/after
cleaned = clean_text(raw_pdf_text)
print("BEFORE CLEANING:")
print(repr(raw_pdf_text))
print("\nAFTER CLEANING:")
print(repr(cleaned))

print("\n📝 Cleaning Results:")
print("✅ Removed excessive whitespace")
print("✅ Fixed ligature characters")
print("✅ Normalized text structure")

BEFORE:
"Company Financial Report\n\n\n    The ﬁnancial performance for ﬁscal year 2024\n    shows signiﬁcant growth in proﬁtability.\n\n\n\n    Revenue increased by 25%.\n\nThe company's efﬁciency improved due to workﬂow\noptimization.\n\n\nPage 1 of 10\n"

AFTER:
"Company Financial Report The financial performance for fiscal year 2024 shows significant growth in profitability. Revenue increased by 25%. The company's efficiency improved due to workflow optimization. Page 1 of 10"


In [None]:
# Import required libraries for smart PDF processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
"""
SmartPDFProcessor - Basic PDF Processing Class
=============================================
This class demonstrates a more sophisticated approach to PDF processing
with text cleaning, metadata enhancement, and intelligent chunking.

Features:
- Automatic text cleaning for common PDF artifacts
- Enhanced metadata enrichment
- Smart chunking with context preservation
- Empty page filtering
- Character count tracking
"""

from langchain_core.documents import Document
from typing import List

class SmartPDFProcessor:
    """
    Advanced PDF processing with error handling and text cleaning
    
    This processor combines PDF loading, text cleaning, and intelligent
    chunking to create high-quality documents for RAG systems.
    """
    
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        """
        Initialize the PDF processor
        
        Args:
            chunk_size (int): Maximum size of each text chunk
            chunk_overlap (int): Number of characters to overlap between chunks
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        # Initialize text splitter with basic separator
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=[" "],  # Simple space-based splitting
        ) 
 
    def process_pdf(self, pdf_path: str) -> List[Document]:
        """
        Process PDF with smart chunking and metadata enhancement
        
        Args:
            pdf_path (str): Path to the PDF file to process
            
        Returns:
            List[Document]: List of processed document chunks with enhanced metadata
        """
        # Load PDF using PyPDFLoader
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()

        # Process each page individually
        processed_chunks = []

        for page_num, page in enumerate(pages):
            # Clean text to remove common PDF artifacts
            cleaned_text = self._clean_text(page.page_content)

            # Skip nearly empty pages to avoid noise
            if len(cleaned_text.strip()) < 50:
                continue

            # Create chunks with enhanced metadata
            chunks = self.text_splitter.create_documents(
                texts=[cleaned_text],
                metadatas=[{
                    **page.metadata,  # Preserve original metadata
                    "page": page_num + 1,  # Add 1-indexed page number
                    "total_pages": len(pages),  # Total page count
                    "chunk_method": "smart_pdf_processor",  # Processing method
                    "char_count": len(cleaned_text)  # Character count for analysis
                }]
            )
            
            # Add all chunks from this page to results
            processed_chunks.extend(chunks)

        return processed_chunks

    def _clean_text(self, text: str) -> str:
        """
        Clean extracted text to remove common PDF artifacts
        
        Args:
            text (str): Raw text from PDF extraction
            
        Returns:
            str: Cleaned text with artifacts removed
        """
        # Remove excessive whitespace and normalize spacing
        text = " ".join(text.split())
        
        # Fix common PDF extraction ligature issues
        text = text.replace("ﬁ", "fi")  # Fix fi ligature
        text = text.replace("ﬂ", "fl")  # Fix fl ligature
        
        return text

print("✅ SmartPDFProcessor class defined successfully!")
print("🚀 Ready for basic PDF processing with text cleaning!")

In [None]:
# Initialize SmartPDFProcessor with default settings
preprocessor = SmartPDFProcessor()

In [None]:
# Inspect the preprocessor object to see its structure
preprocessor

<__main__.SmartPDFProcessor at 0x7e5c62edf980>

In [None]:
"""
Testing SmartPDFProcessor
=========================
Process a PDF file and examine the results to understand how the 
smart processor handles text cleaning and metadata enhancement.
"""

# Process a PDF if available
try:
    # Use the processor to handle the PDF file
    smart_chunks = preprocessor.process_pdf("data/pdf/attention.pdf")
    print(f"✅ Processed into {len(smart_chunks)} smart chunks")

    # Show enhanced metadata if chunks were created
    if smart_chunks:
        print("\n📋 Sample chunk metadata:")
        for key, value in smart_chunks[0].metadata.items():
            print(f"  {key}: {value}")

except Exception as e:
    print(f"❌ Processing error: {e}")
    print("💡 Make sure the PDF file exists in the data/pdf/ directory")

Processed into 49 smart chunks

Sample chunk metadata:
  producer: pdfTeX-1.40.25
  creator: LaTeX with hyperref
  creationdate: 2024-04-10T21:11:43+00:00
  author: 
  keywords: 
  moddate: 2024-04-10T21:11:43+00:00
  ptex.fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
  subject: 
  title: 
  trapped: /False
  source: data/pdf/attention.pdf
  total_pages: 15
  page: 1
  page_label: 1
  chunk_method: smart_pdf_processor
  char_count: 2857
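
To make the `chunk_size` and `chunk_overlap` parameters concrete, the sketch below implements a plain character sliding window. It is a deliberately simplified illustration and not the algorithm inside `RecursiveCharacterTextSplitter`, which splits on separators and merges the pieces. The function name `sliding_window_chunks` and the sample string are assumptions.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last three characters of the previous one, which is the context that overlap is meant to carry across chunk boundaries.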


### Advanced PDF Processing with SmartPDFProcessor2

**Enhanced Features:**
- Uses PyMuPDFLoader for text extraction
- Broader text cleaning and normalization (ligatures, dashes, quotes, page markers)
- Heuristic table detection (flags table-like text; it does not extract tables)
- Richer metadata, including per-page content analysis
- An `include_images` option that is stored for future use; the class shown here does not yet extract or describe images

**Key Improvements over V1:**
- **Content Analysis**: Heuristically flags tables, bullets, and headers
- **Quality Assessment**: Rates extraction quality as high, medium, low, or insufficient
- **Content Classification**: Categorizes each page's content type
- **Visual Element Support**: Placeholder only (the `include_images` flag)
- **Error Handling**: Catches exceptions in `process_pdf` and returns an empty list
- **Enhanced Metadata**: Page-level analysis plus chunk-level IDs and indices

In [15]:
"""
SmartPDFProcessor2 - Advanced PDF Processing
==========================================
This enhanced processor uses PyMuPDFLoader for text extraction and adds
content analysis and richer metadata on top of the basic processor.

Key Features:
- Text extraction with PyMuPDF
- Heuristic table detection (no table extraction)
- Enhanced metadata enrichment
- Error handling around the whole pipeline
- include_images flag reserved for image handling (not implemented here)
"""

import re
from typing import List, Dict

import pandas as pd  # Used for the processing timestamp in the metadata
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

class SmartPDFProcessor2:
    """Advanced PDF processor with content analysis and enriched metadata"""
    
    def __init__(self, 
                 chunk_size: int = 1000, 
                 chunk_overlap: int = 200,
                 include_images: bool = True,
                 min_chunk_size: int = 50):
        """
        Initialize the advanced PDF processor
        
        Args:
            chunk_size: Maximum size of text chunks
            chunk_overlap: Overlap between chunks for context preservation
            include_images: Reserved flag for image handling; stored but not used by this class
            min_chunk_size: Minimum size for chunks to be included
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.include_images = include_images
        self.min_chunk_size = min_chunk_size
        
        # Initialize text splitter with intelligent separators
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
            length_function=len
        )
    
    def process_pdf(self, pdf_path: str) -> List[Document]:
        """
        Process PDF with advanced extraction and analysis
        
        Args:
            pdf_path: Path to the PDF file
            
        Returns:
            List of processed Document objects with enhanced metadata
        """
        try:
            # Load PDF using PyMuPDFLoader for better extraction
            loader = PyMuPDFLoader(pdf_path)
            pages = loader.load()
            
            processed_documents = []
            
            for page_num, page in enumerate(pages):
                # Extract and clean text
                cleaned_text = self._advanced_text_cleaning(page.page_content)
                
                # Skip nearly empty pages
                if len(cleaned_text.strip()) < self.min_chunk_size:
                    continue
                
                # Analyze page content
                page_analysis = self._analyze_page_content(cleaned_text)
                
                # Create enhanced metadata
                enhanced_metadata = self._create_enhanced_metadata(
                    page, page_num, len(pages), page_analysis
                )
                
                # Create chunks with enhanced metadata
                chunks = self.text_splitter.create_documents(
                    texts=[cleaned_text],
                    metadatas=[enhanced_metadata]
                )
                
                # Add chunk-specific metadata
                for chunk_idx, chunk in enumerate(chunks):
                    chunk.metadata.update({
                        "chunk_index": chunk_idx,
                        "total_chunks_in_page": len(chunks),
                        "chunk_id": f"page_{page_num + 1}_chunk_{chunk_idx + 1}"
                    })
                
                processed_documents.extend(chunks)
            
            print(f"✅ Successfully processed {len(pages)} pages into {len(processed_documents)} chunks")
            return processed_documents
            
        except Exception as e:
            print(f"❌ Error processing PDF: {str(e)}")
            return []
    
    def _advanced_text_cleaning(self, text: str) -> str:
        """
        Advanced text cleaning for better readability and processing
        
        Args:
            text: Raw text from PDF
            
        Returns:
            Cleaned and normalized text
        """
        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        
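        # Fix common PDF extraction artifacts: ligatures, dashes, and curly quotes
        ligature_fixes = {
            'ﬁ': 'fi', 'ﬂ': 'fl', 'ﬀ': 'ff', 'ﬃ': 'ffi', 'ﬄ': 'ffl',
            '\u2013': '-', '\u2014': '-',  # en dash, em dash
            '\u2018': "'", '\u2019': "'",  # curly single quotes
            '\u201c': '"', '\u201d': '"',  # curly double quotes
        }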
        
        for old, new in ligature_fixes.items():
            text = text.replace(old, new)
        
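        # Remove "Page X of Y" markers
        text = re.sub(r'\bPage \d+ of \d+\b', '', text, flags=re.IGNORECASE)
        # Remove a number at the very end of the page text, assumed to be a page
        # number (this also drops a legitimate trailing number such as a year)
        text = re.sub(r'\b\d+\s*$', '', text)
        
        # Join line breaks inside a sentence. Because the first substitution above
        # already replaced every newline with a space, this pattern cannot match
        # here; it only matters if that step is changed to keep newlines.
        text = re.sub(r'(?<=[a-z])\n(?=[a-z])', ' ', text)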
        
        # Clean up multiple spaces again
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
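    def _analyze_page_content(self, text: str) -> Dict[str, object]:
        """
        Analyze page content to identify structure and elements
        
        Note: _advanced_text_cleaning has already collapsed all newlines, so the
        line-based checks below (has_headers, paragraph_count) treat the whole
        page as a single line.
        
        Args:
            text: Cleaned text content
            
        Returns:
            Dictionary with content analysis results
        """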
        analysis = {
            "word_count": len(text.split()),
            "char_count": len(text),
            "has_headers": bool(re.search(r'^[A-Z][A-Za-z\s]{10,50}$', text, re.MULTILINE)),
            "has_bullets": bool(re.search(r'[•·▪▫◦‣⁃]', text)),
            "has_numbers": bool(re.search(r'\d+', text)),
            "has_tables": self._detect_tables(text),
            "paragraph_count": len([p for p in text.split('\n\n') if len(p.strip()) > 20]),
            "language_quality": self._assess_language_quality(text)
        }
        
        return analysis
    
    def _detect_tables(self, text: str) -> bool:
        """
        Detect if text likely contains tabular data
        
        Args:
            text: Text content to analyze
            
        Returns:
            Boolean indicating presence of table-like structures
        """
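        # Look for patterns indicating tables. On text cleaned by
        # _advanced_text_cleaning, tabs and runs of spaces have already been
        # collapsed, so only the pipe and number-sequence patterns can match.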
        table_indicators = [
            r'\t',  # Tab characters
            r'\s{3,}',  # Multiple spaces (column separation)
            r'\|',  # Pipe characters
            r'(?:\d+\s+){3,}',  # Multiple numbers in sequence
        ]
        
        return any(re.search(pattern, text) for pattern in table_indicators)
    
    def _assess_language_quality(self, text: str) -> str:
        """
        Assess the quality of extracted text
        
        Args:
            text: Text to assess
            
        Returns:
            Quality assessment string
        """
        if len(text) < 50:
            return "insufficient"
        
        # Calculate ratio of alphabetic characters
        alpha_ratio = sum(c.isalpha() for c in text) / len(text)
        
        # Check for common OCR errors
        ocr_errors = len(re.findall(r'[^\w\s\.,!?;:()\-"]', text))
        
        if alpha_ratio > 0.7 and ocr_errors < len(text) * 0.05:
            return "high"
        elif alpha_ratio > 0.5:
            return "medium"
        else:
            return "low"
    
    def _create_enhanced_metadata(self, page: Document, page_num: int, 
                                total_pages: int, analysis: Dict) -> Dict:
        """
        Create comprehensive metadata for the document chunk
        
        Args:
            page: Original page document
            page_num: Current page number (0-indexed)
            total_pages: Total number of pages
            analysis: Page content analysis results
            
        Returns:
            Enhanced metadata dictionary
        """
        base_metadata = page.metadata.copy()
        
        enhanced_metadata = {
            **base_metadata,
            "page_number": page_num + 1,
            "total_pages": total_pages,
            "processor": "SmartPDFProcessor2",
            "extraction_method": "PyMuPDFLoader",
            "processing_timestamp": str(pd.Timestamp.now()),
            "content_analysis": analysis,
            "extraction_quality": analysis.get("language_quality", "unknown"),
            "has_structured_content": analysis.get("has_tables", False),
            "content_type": self._classify_content_type(analysis)
        }
        
        return enhanced_metadata
    
    def _classify_content_type(self, analysis: Dict) -> str:
        """
        Classify the type of content based on analysis
        
        Args:
            analysis: Content analysis results
            
        Returns:
            Content type classification
        """
        if analysis.get("has_tables"):
            return "structured_data"
        elif analysis.get("has_bullets"):
            return "list_content"
        elif analysis.get("paragraph_count", 0) > 3:
            return "narrative_text"
        elif analysis.get("word_count", 0) < 100:
            return "sparse_content"
        else:
            return "mixed_content"

# Import pandas for timestamp functionality
import pandas as pd

print("✅ SmartPDFProcessor2 class defined successfully!")
print("🚀 Ready for advanced PDF processing with image and table support!")

✅ SmartPDFProcessor2 class defined successfully!
🚀 Ready for advanced PDF processing with image and table support!
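
`_advanced_text_cleaning` starts by collapsing every run of whitespace, newlines included, into a single space. That keeps the text compact but removes the paragraph and line structure that `paragraph_count`, `has_headers`, and the `"\n\n"` and `"\n"` splitter separators rely on. The sketch below shows one possible alternative that keeps paragraph breaks. It is an assumption-laden illustration, not a drop-in replacement: the function name `clean_preserving_paragraphs` and the sample text are hypothetical, and switching to it would change the chunk counts reported in this notebook.

```python
import re


def clean_preserving_paragraphs(text: str) -> str:
    # Normalize line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, but leave newlines alone
    text = re.sub(r"[ \t]+", " ", text)
    # Strip stray spaces at the start and end of every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Join a single line break that falls inside a sentence
    text = re.sub(r"(?<=[a-z,])\n(?=[a-z])", " ", text)
    # Reduce three or more newlines to exactly one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "Model  Architecture\n\n\n\n"
    "Most competitive models have an\nencoder-decoder structure.\n\n"
    "The encoder maps an input sequence."
)
cleaned = clean_preserving_paragraphs(sample)
paragraphs = cleaned.split("\n\n")
print(len(paragraphs))
print(paragraphs[1])
```

With the blank lines intact, a paragraph count based on `"\n\n"` becomes meaningful and the splitter can prefer paragraph boundaries over mid-sentence cuts.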


In [None]:
# Initialize the advanced PDF processor with custom settings
smart_processor_v2 = SmartPDFProcessor2(
    chunk_size=800,        # Smaller chunks for better granularity
    chunk_overlap=150,     # More overlap for better context preservation
    include_images=True,   # Stored on the instance; the class does not use it yet
    min_chunk_size=30      # Lower threshold for chunk inclusion
)

print("🚀 SmartPDFProcessor2 initialized with advanced settings:")
print(f"  - Chunk size: {smart_processor_v2.chunk_size}")
print(f"  - Chunk overlap: {smart_processor_v2.chunk_overlap}")
print(f"  - Include images: {smart_processor_v2.include_images}")
print(f"  - Minimum chunk size: {smart_processor_v2.min_chunk_size}")

🚀 SmartPDFProcessor2 initialized with advanced settings:
  - Chunk size: 800
  - Chunk overlap: 150
  - Include images: True
  - Minimum chunk size: 30


In [18]:
# Test the advanced PDF processor
try:
    print("🔄 Processing PDF with SmartPDFProcessor2...")
    advanced_chunks = smart_processor_v2.process_pdf("data/pdf/attention.pdf")
    
    if advanced_chunks:
        print(f"\n📊 Processing Results:")
        print(f"  - Total chunks created: {len(advanced_chunks)}")
        print(f"  - First chunk length: {len(advanced_chunks[0].page_content)} characters")
        
        print(f"\n📋 Enhanced Metadata Sample:")
        sample_metadata = advanced_chunks[0].metadata
        for key, value in sample_metadata.items():
            if key == "content_analysis":
                print(f"  {key}:")
                for sub_key, sub_value in value.items():
                    print(f"    - {sub_key}: {sub_value}")
            else:
                print(f"  {key}: {value}")
        
        print(f"\n📝 Sample Content Preview:")
        print(f"  {advanced_chunks[0].page_content[:200]}...")
        
    else:
        print("❌ No chunks were created - check PDF path or content")
        
except Exception as e:
    print(f"❌ Error testing SmartPDFProcessor2: {str(e)}")
    print("💡 Make sure the PDF file exists and is accessible")

🔄 Processing PDF with SmartPDFProcessor2...
✅ Successfully processed 15 pages into 64 chunks

📊 Processing Results:
  - Total chunks created: 64
  - First chunk length: 742 characters

📋 Enhanced Metadata Sample:
  producer: pdfTeX-1.40.25
  creator: LaTeX with hyperref
  creationdate: 2024-04-10T21:11:43+00:00
  source: data/pdf/attention.pdf
  file_path: data/pdf/attention.pdf
  total_pages: 15
  format: PDF 1.5
  title: 
  author: 
  subject: 
  keywords: 
  moddate: 2024-04-10T21:11:43+00:00
  trapped: 
  modDate: D:20240410211143Z
  creationDate: D:20240410211143Z
  page: 0
  page_number: 1
  processor: SmartPDFProcessor2
  extraction_method: PyMuPDFLoader
  processing_timestamp: 2025-09-11 18:43:56.371007
  content_analysis:
    - word_count: 393
    - char_count: 2850
    - has_headers: False
    - has_bullets: False
    - has_numbers: True
    - has_tables: False
    - paragraph_count: 1
    - language_quality: high
  extraction_quality: high
  has_structured_content: False
  c
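
The `content_analysis` entry is a nested dictionary. Some vector stores accept only scalar metadata values (strings, numbers, booleans), so it may be necessary to flatten the metadata before indexing. The sketch below is one simple way to do that in plain Python; `flatten_metadata`, the dot separator, and the sample dictionary are assumptions for illustration.

```python
from typing import Any, Dict


def flatten_metadata(metadata: Dict[str, Any], separator: str = ".") -> Dict[str, Any]:
    """Replace nested dictionaries with top-level keys such as 'content_analysis.word_count'."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, dict):
            # Recurse so that deeper nesting is flattened as well
            for sub_key, sub_value in flatten_metadata(value, separator).items():
                flat[f"{key}{separator}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# A reduced, hypothetical version of the metadata printed above
sample_metadata = {
    "page_number": 1,
    "extraction_quality": "high",
    "content_analysis": {"word_count": 393, "has_tables": False},
}
flat = flatten_metadata(sample_metadata)
for key, value in flat.items():
    print(f"{key}: {value}")
```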

In [19]:
# Compare processors if both were successful
try:
    if 'smart_chunks' in locals() and 'advanced_chunks' in locals():
        print("\n🔍 Processor Comparison:")
        print(f"  SmartPDFProcessor (v1):  {len(smart_chunks)} chunks")
        print(f"  SmartPDFProcessor2 (v2): {len(advanced_chunks)} chunks")
        
        print("\n📋 Metadata Comparison:")
        print("\nV1 Metadata keys:")
        print(f"  {list(smart_chunks[0].metadata.keys())}")
        
        print("\nV2 Metadata keys:")
        print(f"  {list(advanced_chunks[0].metadata.keys())}")
        
        print("\n🎯 Key Improvements in V2:")
        print("  ✅ PyMuPDFLoader for better text extraction")
        print("  ✅ Advanced text cleaning and normalization")
        print("  ✅ Content analysis (tables, bullets, headers)")
        print("  ✅ Language quality assessment")
        print("  ✅ Content type classification")
        print("  ✅ Chunk-level metadata tracking")
        print("  ✅ include_images flag reserved for future image handling")
        
except Exception as e:
    print(f"Comparison error: {e}")


🔍 Processor Comparison:
  SmartPDFProcessor (v1):  49 chunks
  SmartPDFProcessor2 (v2): 64 chunks

📋 Metadata Comparison:

V1 Metadata keys:
  ['producer', 'creator', 'creationdate', 'author', 'keywords', 'moddate', 'ptex.fullbanner', 'subject', 'title', 'trapped', 'source', 'total_pages', 'page', 'page_label', 'chunk_method', 'char_count']

V2 Metadata keys:
  ['producer', 'creator', 'creationdate', 'source', 'file_path', 'total_pages', 'format', 'title', 'author', 'subject', 'keywords', 'moddate', 'trapped', 'modDate', 'creationDate', 'page', 'page_number', 'processor', 'extraction_method', 'processing_timestamp', 'content_analysis', 'extraction_quality', 'has_structured_content', 'content_type', 'chunk_index', 'total_chunks_in_page', 'chunk_id']

🎯 Key Improvements in V2:
  ✅ PyMuPDFLoader for better text extraction
  ✅ Advanced text cleaning and normalization
  ✅ Content analysis (tables, bullets, headers)
  ✅ Language quality assessment
  ✅ Content type classification
  ✅ Chunk-l
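
Chunk-level metadata is useful at retrieval time, for example to fetch the chunks on either side of a hit and give the model more context. The sketch below derives neighbor IDs from the `page_<n>_chunk_<m>` format that `SmartPDFProcessor2` assigns to `chunk_id`. The helper name `neighbor_ids` is hypothetical, and looking the IDs up in a store is left out.

```python
from typing import List


def neighbor_ids(chunk_id: str, total_chunks_in_page: int) -> List[str]:
    """Return the IDs of the chunks directly before and after chunk_id on the same page."""
    # chunk_id has the form "page_<page>_chunk_<n>", with n starting at 1
    prefix, _, position = chunk_id.rpartition("_")
    n = int(position)
    neighbors = []
    if n > 1:
        neighbors.append(f"{prefix}_{n - 1}")
    if n < total_chunks_in_page:
        neighbors.append(f"{prefix}_{n + 1}")
    return neighbors


print(neighbor_ids("page_1_chunk_2", total_chunks_in_page=4))
```

Neighbors are limited to the same page here because the processor chunks each page independently.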

In [20]:
"""
PDF Processing Summary and Best Practices 🎯
===========================================
This notebook demonstrated comprehensive PDF processing techniques for RAG systems.

Key Takeaways:
1. Different PDF loaders serve different purposes
2. Text cleaning is crucial for quality extraction
3. Metadata enhancement improves retrieval
4. Content analysis helps categorize information
5. Error handling prevents system failures

Best Practices for Production:
- Use PyMuPDFLoader for complex PDFs
- Always implement text cleaning
- Enrich metadata for better search
- Analyze content structure
- Handle errors gracefully
- Monitor extraction quality

Next Steps:
- Word document processing
- Excel and CSV handling
- Web scraping techniques
- Database integration
- OCR for scanned documents
"""

print("✅ PDF Processing Module Completed!")
print("📚 Ready to tackle more complex document types...")
print("🚀 Next: Microsoft Word documents and structured data files")

✅ PDF Processing Module Completed!
📚 Ready to tackle more complex document types...
🚀 Next: Microsoft Word documents and structured data files
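
As a starting point for monitoring extraction quality, the `extraction_quality` label written by `SmartPDFProcessor2` can be aggregated across chunks. The sketch below is a minimal example under that assumption; the function name `quality_shares` and the sample metadata list are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def quality_shares(chunk_metadata: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Return the fraction of chunks carrying each extraction_quality label."""
    labels = [str(meta.get("extraction_quality", "unknown")) for meta in chunk_metadata]
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical metadata for four chunks; the last one has no quality label
sample_metadata = [
    {"extraction_quality": "high"},
    {"extraction_quality": "high"},
    {"extraction_quality": "low"},
    {},
]
shares = quality_shares(sample_metadata)
print(shares)
```

A rising share of `low` or `unknown` labels for a given source is a reasonable signal to inspect that document or route it to OCR.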
