# 🧠 Hybrid Document Chunking Workshop

Welcome to the **Hybrid Document Chunking Workshop**! This notebook demonstrates how to use Docling's hybrid chunking for RAG (Retrieval-Augmented Generation) applications, with structured output support.

## 🎯 Learning Objectives

By the end of this workshop, you will:
- Understand hybrid chunking and its advantages over simple text splitting
- Configure advanced document processing pipelines
- Process various document formats with OCR, table extraction, and figure exports
- Generate structured output with organized folder hierarchies
- Implement tokenization-aware chunking strategies
- Use LLM-powered image descriptions for multimodal RAG
- Analyze and visualize chunks for optimal RAG performance
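
To make the first objective concrete, here is a minimal sketch contrasting fixed-width splitting with a structure-aware split. It is not Docling's `HybridChunker`: the helper names (`naive_split`, `heading_aware_split`) and the sample text are invented for illustration, and word counts stand in for real tokenizer counts.

```python
from typing import List

SAMPLE = """# Intro
Docling parses documents into structured elements.

# Tables
Tables keep their rows and columns.
Cells stay attached to their headers."""


def naive_split(text: str, width: int) -> List[str]:
    """Cut the text every `width` characters, ignoring structure."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def heading_aware_split(text: str, max_words: int) -> List[str]:
    """Split on Markdown headings, then cap each piece at `max_words` words.

    Every piece is prefixed with its heading so it stays self-describing.
    """
    chunks: List[str] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the text collected under the current heading in word-capped pieces.
        words = " ".join(body).split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append(f"{heading}\n{piece}" if heading else piece)

    for line in text.splitlines():
        if line.startswith("#"):
            flush()
            heading, body = line.lstrip("#").strip(), []
        elif line.strip():
            body.append(line.strip())
    flush()
    return chunks


naive_chunks = naive_split(SAMPLE, 40)
aware_chunks = heading_aware_split(SAMPLE, max_words=8)

for chunk in aware_chunks:
    print(repr(chunk))
```

The fixed-width pieces can start or end mid-word and lose track of which section they came from, while the heading-aware pieces each carry their section title. Hybrid chunking follows the same idea, but works on the parsed document structure and a real tokenizer.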

## 📋 Workshop Sections

1. **🔧 Setup & Dependencies** - Install packages with UV
2. **⚙️  Advanced Pipeline Configuration** - All processing options explained
3. **📊 Structured Output System** - Organized folder hierarchies  
4. **🖼️  Figure & Table Exports** - Visual content extraction
5. **🤖 LLM Image Descriptions** - AI-powered multimodal processing
6. **🧩 Hybrid Chunking Engine** - Smart, context-aware chunking
7. **📈 Analysis & Visualization** - Comprehensive chunk quality analysis
8. **🎛️  Interactive Configuration Testing** - Compare different settings
9. **💡 Best Practices** - Production-ready recommendations

Let's dive deep! 🌊


## 1. 🔧 Setup & Dependencies with UV

uv is a fast Python package installer and dependency manager. The cell below uses it to install the packages this workshop needs.


In [None]:
! echo "::group::Install Dependencies"
%pip install uv
! uv pip install "git+https://github.com/ibm-granite-community/utils.git" \
    transformers \
    pillow \
    pandas \
    seaborn \
    python-dotenv \
    langchain_community \
    'langchain_huggingface[full]' \
    docling \
    replicate \
    matplotlib
! echo "::endgroup::"

In [None]:
# Core imports for the workshop
import json
import sys
import warnings
from pathlib import Path
from typing import Dict, List, Any, Optional

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown

# Suppress PyTorch MPS warnings on Mac
warnings.filterwarnings("ignore", message=".*pin_memory.*not supported on MPS.*")
warnings.filterwarnings("ignore", category=UserWarning, module="torch.*")

# Docling imports
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.chunking import HybridChunker
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
from docling_core.types.doc import PictureItem, TableItem
from transformers import AutoTokenizer
# Standard-library helpers used by the image description step
import base64
import os

# Optional imports for LLM image descriptions (Replicate client + .env loading)
try:
    from dotenv import load_dotenv
    from langchain_community.llms import Replicate
    REPLICATE_AVAILABLE = True
    # Load environment variables (e.g. REPLICATE_API_TOKEN) from a local .env file
    load_dotenv()
except ImportError:
    REPLICATE_AVAILABLE = False
    print("⚠️  Replicate or python-dotenv not available. Image descriptions will be disabled.")

print("✅ All imports successful!")
print("🚀 Ready for advanced document processing!")

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

## 2. ⚙️ Advanced Pipeline Configuration

The `AdvancedPipelineConfig` class gathers every document processing option in one place. It mirrors the options of the command-line processor (`document_processor.py`), including structured output, exports, and LLM integration.


In [None]:
class AdvancedPipelineConfig:
    """Comprehensive configuration class replicating document_processor.py functionality."""
    
    def __init__(
        self,
        # Core processing options
        do_ocr: bool = True,
        do_table_structure: bool = True,
        generate_page_images: bool = False,  # Will be enabled if export_figures=True
        generate_picture_images: bool = False,  # Will be enabled if export_figures=True
        
        # Chunking options
        chunk_max_tokens: int = 512,
        chunk_merge_peers: bool = True,
        embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
        
        # Export options (the key new features!)
        export_figures: bool = False,
        export_tables: bool = False,
        images_scale: float = 2.0,  # Image resolution scale (1.0 = 72 DPI)
        
        # Structured output options
        organize_output: bool = True,  # Create organized folder structure
        save_metadata: bool = True,   # Save comprehensive document metadata
        export_markdown: bool = True, # Export as markdown
        output_dir: str = "workshop_output",
        
        # LLM Image description options
        describe_images: bool = False,
        llm_model: str = "yorickvp/llava-13b",
        
        # Supported formats
        allowed_formats: Optional[List[InputFormat]] = None,
    ):
        """Initialize comprehensive pipeline configuration.
        
        Key New Features Explained:
        
        🏗️ STRUCTURED OUTPUT (organize_output=True):
            Creates organized folder hierarchy:
            output_dir/files/document_name/
            ├── json/               # Main JSON output + markdown
            ├── metadata/           # Separate metadata file
            └── exports/            # All visual exports
                ├── figures/        # Extracted figures/pictures
                ├── tables/         # Table images
                └── pages/          # Page screenshots
        
        🖼️ FIGURE EXPORTS (export_figures=True):
            • Extracts all figures and pictures from documents
            • Saves as high-resolution PNG files
            • Automatically generates page screenshots
            • Preserves visual content for multimodal RAG
        
        📊 TABLE EXPORTS (export_tables=True):
            • Extracts tables to multiple formats: CSV, HTML, Markdown
            • Saves table images as PNG files
            • Preserves table structure and content
            • Enables both text and visual table retrieval
        
        🤖 LLM IMAGE DESCRIPTIONS (describe_images=True):
            • Sends each exported image to the vision model set in llm_model
              (default "yorickvp/llava-13b", accessed through Replicate)
            • Creates searchable text descriptions
            • Tracks costs and token usage
            • Enables semantic search over visual content
        
        📋 COMPREHENSIVE METADATA (save_metadata=True):
            • Processing configuration details
            • Document statistics and structure info
            • File metadata and format detection
            • Chunk quality metrics
        """
        # Core processing
        self.do_ocr = do_ocr
        self.do_table_structure = do_table_structure
        self.generate_page_images = generate_page_images
        self.generate_picture_images = generate_picture_images
        
        # Chunking
        self.chunk_max_tokens = chunk_max_tokens
        self.chunk_merge_peers = chunk_merge_peers
        self.embedding_model = embedding_model
        
        # Export options
        self.export_figures = export_figures
        self.export_tables = export_tables
        self.images_scale = images_scale
        
        # Structured output
        self.organize_output = organize_output
        self.save_metadata = save_metadata
        self.export_markdown = export_markdown
        self.output_dir = Path(output_dir)
        
        # LLM features
        self.describe_images = describe_images
        self.llm_model = llm_model
        
        # Multi-format support
        self.allowed_formats = allowed_formats or [
            InputFormat.PDF,
            InputFormat.DOCX,
            InputFormat.PPTX,
            InputFormat.XLSX,
            InputFormat.HTML,
            InputFormat.MD,
            InputFormat.IMAGE,
        ]
        
        # Auto-enable image generation if we're exporting figures
        if self.export_figures:
            self.generate_page_images = True
            self.generate_picture_images = True
    
    def to_pipeline_options(self) -> Optional[PdfPipelineOptions]:
        """Convert configuration to Docling PdfPipelineOptions."""
        try:
            options = PdfPipelineOptions(
                do_ocr=self.do_ocr,
                do_table_structure=self.do_table_structure,
                generate_page_images=self.generate_page_images,
                generate_picture_images=self.generate_picture_images,
            )
            
            # Set images scale for high-quality exports
            if self.export_figures or self.generate_page_images or self.generate_picture_images:
                options.images_scale = self.images_scale
                
            return options
        except Exception as e:
            print(f"Warning: Could not create pipeline options: {e}")
            return None
    
    def summary(self) -> str:
        """Return a comprehensive summary of the configuration."""
        formats_str = ', '.join([f.name for f in self.allowed_formats])
        
        return f"""🔧 Advanced Pipeline Configuration:
        
📊 CORE PROCESSING:
        - OCR: {'✓' if self.do_ocr else '✗'}
        - Table Structure: {'✓' if self.do_table_structure else '✗'}
        - Page Images: {'✓' if self.generate_page_images else '✗'}
        - Picture Images: {'✓' if self.generate_picture_images else '✗'}
        
🧩 CHUNKING:
        - Method: Hybrid (tokenization-aware)
        - Max Tokens/Chunk: {self.chunk_max_tokens}
        - Merge Peers: {'✓' if self.chunk_merge_peers else '✗'}
        - Embedding Model: {self.embedding_model}
        
📁 STRUCTURED OUTPUT:
        - Organize Output: {'✓' if self.organize_output else '✗'}
        - Save Metadata: {'✓' if self.save_metadata else '✗'}
        - Export Markdown: {'✓' if self.export_markdown else '✗'}
        - Output Directory: {self.output_dir}
        
🖼️ EXPORTS:
        - Export Figures: {'✓' if self.export_figures else '✗'}
        - Export Tables: {'✓' if self.export_tables else '✗'}
        - Images Scale: {self.images_scale}x ({int(self.images_scale * 72)} DPI)
        
🤖 LLM FEATURES:
        - Describe Images: {'✓' if self.describe_images else '✗'}
        - LLM Model: {self.llm_model}
        
⚙️ FORMATS:
        - Supported: {formats_str}"""

# Create a comprehensive configuration (matching the example command)
config = AdvancedPipelineConfig(
    chunk_max_tokens=256,
    organize_output=True,
    export_figures=True,
    export_tables=True,
    save_metadata=True,
    describe_images=True,
    llm_model="yorickvp/llava-13b",
    output_dir="workshop_output"
)

print("✅ AdvancedPipelineConfig class defined!")
print("\n" + "="*60)
print(config.summary())
print("="*60)
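
Two derived settings in the class above can be checked without Docling: the scale-to-DPI conversion shown in `summary()`, and the rule that `export_figures=True` switches on both image-generation flags. The sketch below is a stripped-down stand-in. `ExportSettings` and `effective_dpi` are hypothetical names, not part of Docling or of `AdvancedPipelineConfig`.

```python
from dataclasses import dataclass

BASE_DPI = 72  # the workshop treats images_scale=1.0 as 72 DPI


@dataclass
class ExportSettings:
    """Hypothetical stand-in for the export-related fields of the config."""
    export_figures: bool = False
    generate_page_images: bool = False
    generate_picture_images: bool = False
    images_scale: float = 2.0

    def __post_init__(self) -> None:
        # Exporting figures needs rendered images, so force both flags on.
        if self.export_figures:
            self.generate_page_images = True
            self.generate_picture_images = True

    def effective_dpi(self) -> int:
        """Resolution implied by the scale factor, truncated as in summary()."""
        return int(self.images_scale * BASE_DPI)


settings = ExportSettings(export_figures=True, images_scale=2.0)
print(settings.generate_page_images, settings.effective_dpi())
```

Higher scale factors produce sharper exports at the cost of larger files and slower conversion, so it is worth raising `images_scale` only when the exported images feed a vision model or are shown to users.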


## 3. 📊 Structured Output System Explained

The structured output system writes every processed document into a fixed folder hierarchy, so the JSON, metadata, and exports are easy to locate. The cell below walks through each component:


In [None]:
def explain_output_structure():
    """Show the complete output structure with explanations."""
    
    structure_diagram = """
📁 STRUCTURED OUTPUT HIERARCHY:

workshop_output/
└── files/                           # Categorized by type (files vs audio)
    └── document_name/               # One folder per document
        ├── json/                    # Main content
        │   ├── document_name.json   # Complete processed data
        │   └── document_name.md     # Markdown export
        │
        ├── metadata/                # Metadata only
        │   └── document_name_metadata.json
        │
        └── exports/                 
            ├── figures/             
            │   ├── document-picture-1.png
            │   ├── document-picture-2.png
            │   └── ...
            │
            ├── tables/              
            │   ├── document-table-1.png      # Visual
            │   ├── document-table-1.csv      # Data  
            │   ├── document-table-1.html     # Formatted
            │   ├── document-table-1.md       # Markdown
            │   └── ...
            │
            ├── pages/               # Page screenshots
            │   ├── document-page-1.png
            │   ├── document-page-2.png
            │   └── ...
            │
            └── document-image-descriptions.json  # Only when describe_images=True
    """
    
    explanations = {
        "📄 json/": "Contains the main processed data as JSON and the original text as Markdown. This is your primary content for RAG.",
        
        "📋 metadata/": "Separate metadata file with processing config, document stats, file info, and quality metrics.",
        
        "🖼️ figures/": "All pictures, diagrams, charts, and visual elements extracted as high-res PNG files for multimodal RAG.",
        
        "📊 tables/": "Tables in multiple formats: PNG images for visual retrieval, CSV for data analysis, HTML for web display, Markdown for text processing.",
        
        "📖 pages/": "Screenshots of each document page, useful for layout-aware applications and visual document search.",
        
        "🤖 image-descriptions.json": "LLM-generated descriptions of all images, with cost tracking and metadata. Makes visual content searchable via text."
    }
    
    print(structure_diagram)
    print("\n🔍 COMPONENT EXPLANATIONS:\n")
    
    for component, explanation in explanations.items():
        print(f"{component}")
        print(f"   {explanation}")
        print()
    

explain_output_structure()
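
The hierarchy above can also be expressed as paths, which is handy when a downstream step needs to find the exports for a given document. This is a small sketch under the assumption that `organize_output=True`. `expected_layout` and the sample file name `reports/annual_report.pdf` are hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_layout(output_dir: str, source_file: str) -> Dict[str, Path]:
    """Return the folders the hierarchy above implies for one source document."""
    # One folder per document, named after the file stem, under "files/".
    doc_root = Path(output_dir) / "files" / Path(source_file).stem
    exports = doc_root / "exports"
    return {
        "json": doc_root / "json",
        "metadata": doc_root / "metadata",
        "figures": exports / "figures",
        "tables": exports / "tables",
        "pages": exports / "pages",
    }


layout = expected_layout("workshop_output", "reports/annual_report.pdf")
for name, folder in layout.items():
    print(f"{name:<9} {folder.as_posix()}")
```

Because the document folder is derived from the file stem only, two inputs with the same name in different directories would map to the same output folder.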

## 4. 🧠 Comprehensive Document Processor

The `ComprehensiveDocumentProcessor` class ties the previous pieces together: it converts a document, chunks it with `HybridChunker`, extracts tables and headers, and writes the structured output.



In [None]:
class ComprehensiveDocumentProcessor:
    """Full-featured document processor with all advanced capabilities."""
    
    def __init__(self, config: Optional[AdvancedPipelineConfig] = None):
        """Initialize the comprehensive processor."""
        self.config = config or AdvancedPipelineConfig()
        
        # Initialize DocumentConverter
        self.converter = self._initialize_converter()
        
        # Initialize HybridChunker
        self.hybrid_chunker = None
        self.tokenizer = None
        self._initialize_hybrid_chunker()
        
        # Store conversion result for exports
        self._conversion_result = None
    
    def _initialize_converter(self) -> DocumentConverter:
        """Initialize DocumentConverter with multi-format support."""
        try:
            pipeline_options = self.config.to_pipeline_options()
            
            format_options = {}
            if InputFormat.PDF in self.config.allowed_formats and pipeline_options:
                format_options[InputFormat.PDF] = PdfFormatOption(
                    pipeline_cls=StandardPdfPipeline,
                    backend=PyPdfiumDocumentBackend,
                    pipeline_options=pipeline_options
                )
            
            converter = DocumentConverter(
                allowed_formats=self.config.allowed_formats,
                format_options=format_options if format_options else None
            )
            
            formats_str = ', '.join([f.name for f in self.config.allowed_formats])
            print(f"✅ Initialized converter for formats: {formats_str}")
            return converter
            
        except Exception as e:
            print(f"⚠️ Warning: Could not apply format options: {e}")
            return DocumentConverter()
    
    def _initialize_hybrid_chunker(self):
        """Initialize the hybrid chunker with tokenizer."""
        try:
            print(f"🔧 Loading tokenizer: {self.config.embedding_model}")
            self.tokenizer = AutoTokenizer.from_pretrained(self.config.embedding_model)
            
            self.hybrid_chunker = HybridChunker(
                tokenizer=self.tokenizer,
                max_tokens=self.config.chunk_max_tokens,
                merge_peers=self.config.chunk_merge_peers
            )
            print(f"✅ Initialized HybridChunker (max_tokens={self.config.chunk_max_tokens})")
            
        except Exception as e:
            print(f"❌ Error: Could not initialize hybrid chunker: {e}")
            raise
    
    def process_document(self, file_path: str) -> Dict[str, Any]:
        """Process a document with full structured output."""
        try:
            print(f"🔄 Processing document: {Path(file_path).name}")
            
            # Convert document
            result = self.converter.convert(file_path)
            doc = result.document
            self._conversion_result = result
            
            # Create output structure
            doc_folder = self._get_output_folder(file_path)
            
            # Extract comprehensive metadata
            metadata = self._extract_comprehensive_metadata(result, file_path, doc_folder)
            
            # Create chunks
            chunks = self._create_hybrid_chunks(doc)
            
            # Extract tables and headers
            tables = self._extract_tables(doc)
            headers = self._extract_headers(doc)
            
            # Handle exports (figures, tables)
            exports_info = {}
            if self.config.export_figures or self.config.export_tables:
                exports_info = self._handle_exports(result, doc, file_path, doc_folder)
            
            # Prepare complete processed data (export Markdown once and reuse it)
            full_text = doc.export_to_markdown()
            processed_data = {
                "metadata": metadata,
                "content": {
                    "full_text": full_text,
                    # DoclingDocument serializes to a plain dict via export_to_dict()
                    "structured_content": doc.export_to_dict() if hasattr(doc, 'export_to_dict') else {},
                },
                "chunks": chunks,
                "tables": tables,
                "headers": headers,
                "exports": exports_info,
                "document_stats": {
                    "total_characters": len(full_text),
                    "total_words": len(full_text.split()),
                    "total_chunks": len(chunks),
                    "total_tables": len(tables),
                    "total_headers": len(headers),
                }
            }
            
            # Create structured output if enabled
            if self.config.organize_output:
                output_structure = self._create_output_structure(doc_folder, processed_data)
                processed_data["output_structure"] = output_structure
            
            print(f"✅ Document processed successfully! Created {len(chunks)} chunks")
            return processed_data
            
        except Exception as e:
            return {
                "error": f"Failed to process document: {str(e)}",
                "metadata": {"source_file": str(file_path)},
            }
    
    def _get_output_folder(self, file_path: str) -> Path:
        """Determine output folder structure."""
        if self.config.organize_output:
            return self.config.output_dir / "files" / Path(file_path).stem
        else:
            return self.config.output_dir
    
    def _create_hybrid_chunks(self, doc) -> List[Dict[str, Any]]:
        """Create chunks using Docling's HybridChunker."""
        print("🧩 Creating hybrid chunks...")
        chunks = []
        
        try:
            chunk_iter = self.hybrid_chunker.chunk(dl_doc=doc)
            
            for i, chunk in enumerate(chunk_iter):
                contextualized_text = self.hybrid_chunker.contextualize(chunk=chunk)
                
                chunk_data = {
                    "chunk_id": i,
                    "text": chunk.text,
                    "contextualized_text": contextualized_text,
                    "token_count": len(self.tokenizer.encode(chunk.text)) if self.tokenizer else len(chunk.text.split()),
                    "char_count": len(chunk.text),
                    "contextualized_char_count": len(contextualized_text),
                    "metadata": {
                        "headings": getattr(chunk.meta, 'headings', []) if hasattr(chunk, 'meta') else [],
                        "page_info": getattr(chunk.meta, 'page_info', []) if hasattr(chunk, 'meta') else [],
                        "content_type": getattr(chunk.meta, 'content_type', None) if hasattr(chunk, 'meta') else None,
                        "chunk_type": "hybrid"
                    }
                }
                chunks.append(chunk_data)
            
            print(f"✅ Created {len(chunks)} hybrid chunks")
            return chunks
                
        except Exception as e:
            print(f"❌ Error: Hybrid chunking failed: {e}")
            raise
    
    def _extract_tables(self, doc) -> List[Dict[str, Any]]:
        """Extract table information from the document."""
        tables = []
        try:
            if getattr(doc, 'tables', None):
                # Preferred path: read tables directly from the document object
                for i, table in enumerate(doc.tables):
                    tables.append({
                        "table_id": i,
                        "content": str(table),
                        "extraction_method": "direct_attribute"
                    })
            elif hasattr(doc, 'export_to_dict'):
                # Fallback: read the serialized document.
                # Running both paths would list every table twice.
                for i, table in enumerate(doc.export_to_dict().get('tables', [])):
                    tables.append({
                        "table_id": i,
                        "content": table,
                        "extraction_method": "structured_json"
                    })
                    
            if tables:
                print(f"✅ Extracted {len(tables)} tables")
                
        except Exception as e:
            print(f"⚠️ Warning: Could not extract tables: {e}")
        
        return tables
    
    def _extract_headers(self, doc) -> List[Dict[str, Any]]:
        """Extract header information from the document."""
        headers = []
        try:
            markdown_content = doc.export_to_markdown()
            lines = markdown_content.split('\n')
            
            for i, line in enumerate(lines):
                line = line.strip()
                if line.startswith('#'):
                    level = len(line) - len(line.lstrip('#'))
                    text = line.lstrip('#').strip()
                    if text:
                        headers.append({
                            "level": level,
                            "text": text,
                            "line_number": i,
                        })
                        
            if headers:
                print(f"✅ Extracted {len(headers)} headers")
                
        except Exception as e:
            print(f"⚠️ Warning: Could not extract headers: {e}")
        
        return headers
    
    def _extract_comprehensive_metadata(self, result, file_path: str, doc_folder: Path) -> Dict[str, Any]:
        """Extract comprehensive metadata about the document and processing."""
        file_path_obj = Path(file_path)
        
        metadata = {
            "source_file": str(file_path),
            "file_name": file_path_obj.name,
            "file_stem": file_path_obj.stem,
            "file_type": file_path_obj.suffix.lower(),
            "file_size_bytes": file_path_obj.stat().st_size if file_path_obj.exists() else 0,
            "title": getattr(result.document, 'title', None) or file_path_obj.stem,
            "output_folder": str(doc_folder),
            "processing_config": {
                "chunking_method": "hybrid",
                "max_tokens_per_chunk": self.config.chunk_max_tokens,
                "ocr_enabled": self.config.do_ocr,
                "table_structure_enabled": self.config.do_table_structure,
                "export_figures": self.config.export_figures,
                "export_tables": self.config.export_tables,
                "organize_output": self.config.organize_output,
                "describe_images": self.config.describe_images,
                "llm_model": self.config.llm_model if self.config.describe_images else None,
                "embedding_model": self.config.embedding_model,
            },
        }
        
        # Document structure metadata
        doc = result.document
        if hasattr(doc, 'pages'):
            metadata["page_count"] = len(doc.pages) if doc.pages else 0
        
        if hasattr(doc, 'tables'):
            metadata["table_count"] = len(doc.tables) if doc.tables else 0
        
        # Content statistics
        full_text = doc.export_to_markdown()
        metadata["content_stats"] = {
            "total_characters": len(full_text),
            "total_words": len(full_text.split()),
            "total_lines": len(full_text.split('\n')),
        }
        
        return metadata
    
    def _create_output_structure(self, doc_folder: Path, processed_data: Dict[str, Any]) -> Dict[str, str]:
        """Create organized output folder structure and save files."""
        folders = {
            "json_folder": doc_folder / "json",
            "metadata_folder": doc_folder / "metadata",
            "exports_folder": doc_folder / "exports",
            "figures_folder": doc_folder / "exports" / "figures",
            "tables_folder": doc_folder / "exports" / "tables", 
            "pages_folder": doc_folder / "exports" / "pages",
        }
        
        # Create directories
        for folder in folders.values():
            folder.mkdir(parents=True, exist_ok=True)
        
        # Save JSON output
        json_file = folders["json_folder"] / f"{processed_data['metadata']['file_stem']}.json"
        with json_file.open('w', encoding='utf-8') as f:
            json.dump(processed_data, f, indent=2, ensure_ascii=False)
        
        # Save metadata separately
        metadata_file = folders["metadata_folder"] / f"{processed_data['metadata']['file_stem']}_metadata.json"
        with metadata_file.open('w', encoding='utf-8') as f:
            json.dump(processed_data['metadata'], f, indent=2, ensure_ascii=False)
        
        # Save markdown if enabled
        if self.config.export_markdown and hasattr(self._conversion_result, 'document'):
            markdown_file = folders["json_folder"] / f"{processed_data['metadata']['file_stem']}.md"
            with markdown_file.open('w', encoding='utf-8') as f:
                f.write(self._conversion_result.document.export_to_markdown())
        
        print(f"📁 Created structured output in: {doc_folder}")
        return {k: str(v) for k, v in folders.items()}
    
    # We'll add the export methods in the next cell due to length...

print("✅ ComprehensiveDocumentProcessor class defined!")

print("🔧 Class capabilities:")
print("  • Document formats: PDF, DOCX, PPTX, XLSX, HTML, Markdown, images")
print("  • Chunking: Hybrid chunking with configurable token limits")
print("  • Exports: Figures, tables, pages, markdown")
print("  • Output organization: Structured folders with metadata")
print("  • LLM integration: Image descriptions and content analysis")


In [None]:
def _handle_exports(self, result, doc, file_path: str, doc_folder: Path) -> Dict[str, Any]:
    """Handle figure and table exports based on configuration."""
    exports_info = {
        "figures_exported": 0,
        "tables_exported": 0,
        "pages_exported": 0,
        "export_directory": str(doc_folder / "exports"),
        "exported_files": [],
        "image_descriptions": []
    }
    
    if not (self.config.export_figures or self.config.export_tables):
        return exports_info
        
    # Create output directories
    exports_folder = doc_folder / "exports"
    figures_folder = exports_folder / "figures"
    tables_folder = exports_folder / "tables"
    pages_folder = exports_folder / "pages"
    
    for folder in [exports_folder, figures_folder, tables_folder, pages_folder]:
        folder.mkdir(parents=True, exist_ok=True)
    
    doc_filename = Path(file_path).stem
    
    # Export figures if enabled
    if self.config.export_figures:
        figure_info = self._export_figures(result, doc, doc_filename, figures_folder, tables_folder, pages_folder, exports_folder)
        exports_info.update(figure_info)
        
    # Export tables if enabled
    if self.config.export_tables:
        table_info = self._export_tables(result, doc, doc_filename, tables_folder)
        exports_info.update(table_info)
        
    return exports_info

def _export_figures(self, result, doc, doc_filename: str, figures_folder: Path, tables_folder: Path, pages_folder: Path, exports_folder: Path) -> Dict[str, Any]:
    """Export figures, page images, and pictures with optional LLM descriptions."""
    print("🖼️  Exporting figures...")
    exported_files = []
    image_descriptions = []
    
    try:
        # Save page images to pages subfolder
        page_counter = 0
        if hasattr(result, 'document') and hasattr(result.document, 'pages'):
            for page_no, page in result.document.pages.items():
                if hasattr(page, 'image') and page.image:
                    page_counter += 1
                    page_image_filename = pages_folder / f"{doc_filename}-page-{page.page_no}.png"
                    with page_image_filename.open("wb") as fp:
                        page.image.pil_image.save(fp, format="PNG")
                    exported_files.append(str(page_image_filename))
        
        # Save images of figures and tables
        table_counter = 0
        picture_counter = 0
        
        if hasattr(result, 'document'):
            for element, _level in result.document.iterate_items():
                if isinstance(element, TableItem):
                    table_counter += 1
                    element_image_filename = tables_folder / f"{doc_filename}-table-{table_counter}.png"
                    try:
                        with element_image_filename.open("wb") as fp:
                            element.get_image(result.document).save(fp, "PNG")
                        exported_files.append(str(element_image_filename))
                        
                        # Add LLM description if enabled
                        if self.config.describe_images and REPLICATE_AVAILABLE:
                            desc_result = self._describe_image_with_llm(element_image_filename, self.config.llm_model)
                            desc_result.update({
                                "type": "table",
                                "image_filename": element_image_filename.name,
                                "sequence_number": table_counter
                            })
                            image_descriptions.append(desc_result)
                            
                    except Exception as e:
                        print(f"⚠️ Warning: Could not export table {table_counter} image: {e}")
                
                if isinstance(element, PictureItem):
                    picture_counter += 1
                    element_image_filename = figures_folder / f"{doc_filename}-picture-{picture_counter}.png"
                    try:
                        with element_image_filename.open("wb") as fp:
                            element.get_image(result.document).save(fp, "PNG")
                        exported_files.append(str(element_image_filename))
                        
                        # Add LLM description if enabled
                        if self.config.describe_images and REPLICATE_AVAILABLE:
                            desc_result = self._describe_image_with_llm(element_image_filename, self.config.llm_model)
                            desc_result.update({
                                "type": "picture/figure",
                                "image_filename": element_image_filename.name,
                                "sequence_number": picture_counter
                            })
                            image_descriptions.append(desc_result)
                            
                    except Exception as e:
                        print(f"⚠️ Warning: Could not export picture {picture_counter} image: {e}")
        
        # Save consolidated image descriptions if any were generated
        if self.config.describe_images and image_descriptions:
            consolidated_descriptions = {
                "document_name": doc_filename,
                "timestamp": pd.Timestamp.now().isoformat(),
                "total_images": len(image_descriptions),
                "total_cost": sum(desc.get("cost", 0) for desc in image_descriptions),
                "total_input_tokens": sum(desc.get("input_tokens", 0) for desc in image_descriptions),
                "total_output_tokens": sum(desc.get("output_tokens", 0) for desc in image_descriptions),
                "model_used": self.config.llm_model,
                "descriptions": image_descriptions
            }
            
            consolidated_filename = exports_folder / f"{doc_filename}-image-descriptions.json"
            with consolidated_filename.open("w", encoding="utf-8") as fp:
                json.dump(consolidated_descriptions, fp, indent=2, ensure_ascii=False)
            exported_files.append(str(consolidated_filename))
            print(f"🤖 Generated {len(image_descriptions)} LLM image descriptions")
        
        print(f"✅ Exported {len(exported_files)} figure files")
        return {
            "figures_exported": table_counter + picture_counter,
            "pages_exported": page_counter,
            "figure_files": exported_files,
            "image_descriptions": image_descriptions
        }
        
    except Exception as e:
        print(f"❌ Warning: Figure export failed: {e}")
        return {"figures_exported": 0, "pages_exported": 0, "figure_files": [], "image_descriptions": []}

def _export_tables(self, result, doc, doc_filename: str, tables_folder: Path) -> Dict[str, Any]:
    """Export tables to various formats (CSV, HTML, Markdown)."""
    print("📊 Exporting tables...")
    exported_files = []
    
    try:
        table_counter = 0
        if hasattr(doc, 'tables'):
            for table_ix, table in enumerate(doc.tables):
                table_counter += 1
                
                try:
                    table_df = table.export_to_dataframe(doc=doc)
                    
                    # Save as CSV
                    csv_filename = tables_folder / f"{doc_filename}-table-{table_ix + 1}.csv"
                    table_df.to_csv(csv_filename, index=False)
                    exported_files.append(str(csv_filename))
                    
                    # Save as HTML
                    html_filename = tables_folder / f"{doc_filename}-table-{table_ix + 1}.html"
                    with html_filename.open("w", encoding="utf-8") as fp:
                        fp.write(table.export_to_html(doc=doc))
                    exported_files.append(str(html_filename))
                    
                    # Save as Markdown (DataFrame.to_markdown requires the optional `tabulate` package)
                    md_filename = tables_folder / f"{doc_filename}-table-{table_ix + 1}.md"
                    with md_filename.open("w", encoding="utf-8") as fp:
                        fp.write(f"## Table {table_ix + 1}\n\n")
                        fp.write(table_df.to_markdown(index=False))
                    exported_files.append(str(md_filename))
                    
                except Exception as e:
                    print(f"⚠️ Warning: Could not export table {table_ix + 1}: {e}")
        
        print(f"✅ Exported {table_counter} tables in {len(exported_files)} files")
        return {
            "tables_exported": table_counter,
            "table_files": exported_files
        }
        
    except Exception as e:
        print(f"❌ Warning: Table export failed: {e}")
        return {"tables_exported": 0, "table_files": []}

def _describe_image_with_llm(self, image_path: Path, model: str = "yorickvp/llava-13b:80537f9eead1a5bfa72d5ac6ea6414379be41d4d4f6679fd776e9535d1eb58bb") -> Dict[str, Any]:
    """Describe an image using a LLaVA model hosted on Replicate."""
    # Short model name (without the version hash) for reporting
    model_name = model.split(":")[0]
    try:
        import replicate

        # Convert the image to a base64 data URI (exported figures are saved as PNG)
        with open(image_path, "rb") as image_file:
            image_data = base64.b64encode(image_file.read()).decode("utf-8")
        image_data_uri = f"data:image/png;base64,{image_data}"

        prompt_text = "Describe this image in detail. Focus on the main content, text, data, charts, diagrams, or any other relevant information that would be useful for document understanding and search."

        # Run the model passed in by the caller (defaults to LLaVA 13B)
        output = replicate.run(
            model,
            input={
                "image": image_data_uri,
                "top_p": 1,
                "prompt": prompt_text,
                "max_tokens": 1024,
                "temperature": 0.2
            }
        )

        # LLaVA returns an iterator of text fragments; join them into one string
        description = "".join(str(item) for item in output).strip()

        return {
            "success": True,
            "description": description,
            "prompt": prompt_text,
            "image_path": str(image_path),
            "model": model_name,
            "input_tokens": 0,  # Replicate doesn't provide token info
            # Whitespace-separated word count, used as a rough stand-in for tokens
            "output_tokens": len(description.split()) if description else 0,
            "cost": 0.0,  # Replicate doesn't provide cost info in response
            "timestamp": pd.Timestamp.now().isoformat(),
            "error": None
        }

    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "description": None,
            "prompt": None,
            "image_path": str(image_path),
            "model": model_name,
            "input_tokens": 0,
            "output_tokens": 0,
            "cost": 0.0,
            "timestamp": pd.Timestamp.now().isoformat()
        }


# Add these methods to our ComprehensiveDocumentProcessor class
ComprehensiveDocumentProcessor._handle_exports = _handle_exports
ComprehensiveDocumentProcessor._export_figures = _export_figures  
ComprehensiveDocumentProcessor._export_tables = _export_tables
ComprehensiveDocumentProcessor._describe_image_with_llm = _describe_image_with_llm

print("✅ Export functionality added to ComprehensiveDocumentProcessor!")

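The data URI step inside `_describe_image_with_llm` is easy to check in isolation. The following is a minimal sketch, not part of the workshop's processor: `to_data_uri` and `from_data_uri` are hypothetical helper names, and the MIME type is guessed from the file extension instead of being fixed to PNG.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(path: Path, default_mime: str = "application/octet-stream") -> str:
    """Encode a file as a base64 data URI (hypothetical helper for illustration)."""
    # Guess the MIME type from the extension; fall back to a generic type
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or default_mime};base64,{payload}"


def from_data_uri(uri: str) -> bytes:
    """Decode the payload of a base64 data URI back into raw bytes."""
    header, _, payload = uri.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URI")
    return base64.b64decode(payload)
```

Round-tripping an exported figure through both helpers should return the original bytes, which is a quick way to confirm that the image sent to the model is intact.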

## 5. 🎯 Complete Demo: Processing a Document with Full Features

Now let's demonstrate the complete functionality by processing a document with all features enabled, just like the command-line example!


In [None]:
# Let's first download the PDF locally for processing
import requests

url = "https://arxiv.org/pdf/2501.17887"
document_path = "docling_paper.pdf"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Fail early instead of saving an HTTP error page as a PDF
with open(document_path, "wb") as f:
    f.write(response.content)
print(f"✅ Downloaded sample PDF: {document_path}")

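Before starting a long conversion, it can help to confirm that the downloaded file really is a PDF. Well-formed PDF files begin with the marker `%PDF-`, so a quick header check catches a saved error page or an empty file. This is a small sketch; `looks_like_pdf` is a hypothetical helper, not part of Docling or the workshop code.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF header marker."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as fp:
        # Only the first five bytes are needed for the check
        return fp.read(5) == b"%PDF-"
```

Calling `looks_like_pdf(document_path)` right after the download gives a cheap sanity check before the document is passed to the processor.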
In [None]:
# Initialize the comprehensive processor with full features
processor = ComprehensiveDocumentProcessor(config)

# Show which document will be processed
print(f"Using document path: {document_path}")

print("\nConfiguration Summary:")
print("="*60)
print(config.summary())

In [None]:
# Process the document with full features
print("🚀 Starting comprehensive document processing...")
print("This may take a few minutes depending on document size and features enabled.")

result = processor.process_document(document_path)

# Analyze and display results
if "error" in result:
    print(f"❌ Error processing document: {result['error']}")
    print("💡 Make sure the document path is correct and you have the required API keys if using image descriptions.")
else:
    print(f"\n🎉 Document processed successfully!")
    
    # Display comprehensive summary
    stats = result["document_stats"]
    metadata = result["metadata"]
    exports = result.get("exports", {})
    
    print(f"\n📊 PROCESSING RESULTS:")
    print("="*60)
    print(f"📄 Document: {metadata['file_name']}")
    print(f"📁 Output folder: {metadata['output_folder']}")
    print(f"📏 File size: {metadata['file_size_bytes']:,} bytes")
    if "page_count" in metadata:
        print(f"📖 Pages: {metadata['page_count']}")
    
    print(f"\n🧩 CONTENT ANALYSIS:")
    print(f"   • Total characters: {stats['total_characters']:,}")
    print(f"   • Total words: {stats['total_words']:,}")
    print(f"   • Total chunks: {stats['total_chunks']}")
    print(f"   • Total tables: {stats['total_tables']}")
    print(f"   • Total headers: {stats['total_headers']}")
    
    if stats['total_chunks'] > 0:
        avg_chars_per_chunk = stats['total_characters'] / stats['total_chunks']
        avg_words_per_chunk = stats['total_words'] / stats['total_chunks']
        print(f"   • Avg chars/chunk: {avg_chars_per_chunk:.0f}")
        print(f"   • Avg words/chunk: {avg_words_per_chunk:.0f}")
    
    print(f"\n🎨 EXPORTS SUMMARY:")
    if exports:
        if exports.get('figures_exported', 0) > 0:
            print(f"   • Figures exported: {exports['figures_exported']}")
        if exports.get('tables_exported', 0) > 0:
            print(f"   • Tables exported: {exports['tables_exported']}")
        if exports.get('pages_exported', 0) > 0:
            print(f"   • Pages exported: {exports['pages_exported']}")
        if exports.get('figure_files'):
            print(f"   • Export files created: {len(exports['figure_files'])}")
        if exports.get('image_descriptions'):
            total_cost = sum(desc.get("cost", 0) for desc in exports['image_descriptions'])
            total_tokens = sum(desc.get("input_tokens", 0) + desc.get("output_tokens", 0) for desc in exports['image_descriptions'])
            print(f"   • LLM descriptions: {len(exports['image_descriptions'])}")
            print(f"   • Total LLM cost: ${total_cost:.4f}")
            print(f"   • Total LLM tokens: {total_tokens:,}")
    else:
        print("   • No exports configured")
    
    if "output_structure" in result:
        print(f"\n📁 STRUCTURED OUTPUT CREATED:")
        structure = result["output_structure"]
        for folder_type, folder_path in structure.items():
            folder_name = folder_type.replace("_folder", "")
            print(f"   • {folder_name}: {folder_path}")

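When image descriptions are enabled, the export step also writes a consolidated `<doc>-image-descriptions.json` file. Here is a short sketch of how that file could be inspected afterwards. It assumes the structure written by `_export_figures`: a top-level `descriptions` list whose entries carry `success`, `type`, and `output_tokens`. The name `summarize_descriptions` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def summarize_descriptions(json_path: Path) -> Dict[str, Any]:
    """Summarize a consolidated image-descriptions file (hypothetical helper)."""
    with json_path.open("r", encoding="utf-8") as fp:
        payload = json.load(fp)

    descriptions = payload.get("descriptions", [])
    succeeded = [d for d in descriptions if d.get("success")]
    return {
        "total": len(descriptions),
        "succeeded": len(succeeded),
        "failed": len(descriptions) - len(succeeded),
        # Count successful descriptions per element type (table, picture/figure)
        "by_type": dict(Counter(d.get("type", "unknown") for d in succeeded)),
        # output_tokens is a word count in this workshop, not a true token count
        "approx_output_words": sum(d.get("output_tokens", 0) for d in succeeded),
    }
```

A high `failed` count usually points to a missing API token or a network problem, and each failed entry's `error` field holds the message returned by the call.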
### ✅ What We've Done

- **🔧 UV Package Management**: Modern Python dependency management
- **⚙️ Advanced Pipeline Configuration**: All document processing options
- **📊 Structured Output System**: Organized, production-ready folder hierarchies  
- **🖼️ Figure & Table Exports**: Multi-format visual content extraction
- **🤖 LLM Image Descriptions**: AI-powered multimodal processing capabilities
- **🧩 Hybrid Chunking**: Smart, context-aware chunking for optimal RAG
- **📈 Comprehensive Analysis**: Quality metrics and visualization tools


### 📚 Resources

- **Docling Documentation**: [docling-project.github.io](https://docling-project.github.io/docling/)
- **UV Package Manager**: [docs.astral.sh/uv](https://docs.astral.sh/uv/)