# MultiModal Extraction using Docling

This notebook demonstrates how to extract multimodal page data (images, text, cells, segments) from documents and export to Parquet format.

## Overview

**Multimodal extraction** captures rich document information including:
- **Page Images**: Rendered page images at configurable resolution
- **Text Content**: Plain text and markdown representations
- **Cells**: Structured layout cells with bounding boxes
- **Segments**: Document segments with hierarchy
- **Metadata**: Page dimensions, DPI, hashes, etc.

## Use Cases

- **Vision-Language Models**: Training data for multimodal LLMs
- **Document Understanding**: Combined visual + textual analysis
- **Layout Analysis**: Preserve spatial information with content
- **Archival**: High-fidelity document preservation
- **Dataset Creation**: Build datasets for ML training

## Workflow

1. Configure PDF pipeline with image generation
2. Convert document and extract pages
3. Generate multimodal records per page
4. Export to Parquet format
5. Optional: Load and visualize with HuggingFace Datasets


## 1. Installation and Setup

Install required packages for multimodal extraction.


In [None]:
# Install required packages

# Uncomment and run if packages are not installed

# !pip install docling
# !pip install docling-core
# !pip install pandas
# !pip install pyarrow  # Required for Parquet export
# !pip install pillow   # For image handling
# !pip install datasets # Optional: for loading Parquet with HuggingFace


In [None]:
# Import required libraries
import datetime
import logging
import time
from pathlib import Path
from typing import List, Dict, Any

import pandas as pd

# Docling imports
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.utils.export import generate_multimodal_pages
from docling.utils.utils import create_hash

# Configure logging
logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)

print("✓ All imports successful!")


✓ All imports successful!


## 2. Configuration

Configure document processing parameters and paths.


In [None]:
# Configuration parameters
IMAGE_RESOLUTION_SCALE = 2.0  # Image scale: 1.0 = 72 DPI, 2.0 = 144 DPI, etc.

# Paths
input_doc_path = "/Users/yashpatil/Developer/AI/SunnySavita/sample/projectOverview.pdf"  # Change to your document
output_dir = Path("/Users/yashpatil/Developer/AI/SunnySavita/multimodal_output")

# Create output directory
output_dir.mkdir(parents=True, exist_ok=True)

print("Configuration:")
print(f"  - Input document: {input_doc_path}")
print(f"  - Output directory: {output_dir}")
print(f"  - Image resolution scale: {IMAGE_RESOLUTION_SCALE}x (={IMAGE_RESOLUTION_SCALE * 72:.0f} DPI)")


Configuration:
  - Input document: /Users/yashpatil/Developer/AI/SunnySavita/sample/projectOverview.pdf
  - Output directory: /Users/yashpatil/Developer/AI/SunnySavita/multimodal_output
  - Image resolution scale: 2.0x (=144 DPI)
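The relationship between the scale factor, the effective DPI, and the rendered pixel size can be sketched in a few lines. This is a minimal illustration, not part of Docling: the helper names `scale_to_dpi` and `expected_pixel_size` are hypothetical, and rounding to the nearest pixel is an assumption about how the renderer behaves.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)


def scale_to_dpi(scale: float) -> float:
    """Convert an image scale factor to an effective DPI."""
    return scale * POINTS_PER_INCH


def expected_pixel_size(width_pt: float, height_pt: float, scale: float) -> Tuple[int, int]:
    """Estimate the rendered size in pixels (assumes rounding to the nearest pixel)."""
    return round(width_pt * scale), round(height_pt * scale)


print(scale_to_dpi(2.0))                        # 144.0
print(expected_pixel_size(595.0, 841.9, 2.0))   # (1190, 1684)
```

For an A4 page (595.0 x 841.9 points) at scale 2.0, this estimate gives 1190 x 1684 pixels. Doubling the scale roughly quadruples the number of pixels per page, which is worth keeping in mind when choosing a resolution.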


## 3. Initialize Document Converter with Image Generation

Configure the pipeline to generate and preserve page images during conversion.


In [None]:
# Configure pipeline options
# Key: generate_page_images must be True to export images
pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True  # Critical for multimodal export
pipeline_options.do_table_structure = True

# Initialize converter
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

print("✓ Document converter initialized")
print("  - Page images: ENABLED")
print(f"  - Image scale: {pipeline_options.images_scale}x")
print(f"  - Table extraction: {pipeline_options.do_table_structure}")


✓ Document converter initialized
  - Page images: ENABLED
  - Image scale: 2.0x
  - Table extraction: True


## 4. Convert Document

Convert the document with image generation enabled.


In [None]:
# Convert document
print(f"Converting document: {input_doc_path}")
print("This may take a moment...")

start_time = time.time()
conv_res = doc_converter.convert(input_doc_path)
conversion_time = time.time() - start_time

print(f"\n✓ Document converted in {conversion_time:.2f} seconds")
print(f"  - Document: {conv_res.input.file.name}")
print(f"  - Pages: {len(conv_res.document.pages)}")
print(f"  - Hash: {conv_res.input.document_hash}")


2025-11-10 01:29:24,042 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-10 01:29:24,072 - INFO - Going to convert document batch...
2025-11-10 01:29:24,073 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 8930b91cc2f7c6e23c0e5fd2d07fa5f5
2025-11-10 01:29:24,087 - INFO - Loading plugin 'docling_defaults'
2025-11-10 01:29:24,089 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-11-10 01:29:24,091 - INFO - Loading plugin 'docling_defaults'
2025-11-10 01:29:24,094 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']


Converting document: /Users/yashpatil/Developer/AI/SunnySavita/sample/projectOverview.pdf
This may take a moment...


2025-11-10 01:29:39,166 - INFO - Auto OCR model selected ocrmac.
2025-11-10 01:29:39,170 - INFO - Accelerator device: 'mps'
2025-11-10 01:29:41,103 - INFO - Accelerator device: 'mps'
2025-11-10 01:29:41,554 - INFO - Processing document projectOverview.pdf
2025-11-10 01:29:44,272 - INFO - Finished converting document projectOverview.pdf in 20.23 sec.



✓ Document converted in 20.23 seconds
  - Document: projectOverview.pdf
  - Pages: 5
  - Hash: c6850c84582b4c487f5577356a0d9fa49847974fb387db086978d18357694fef
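Because conversion can take a while (about 20 seconds for this 5-page file), it may be worth checking the input path before starting. The following is a minimal sketch using only the standard library; `check_input_document` is a hypothetical helper, and restricting the input to `.pdf` is an assumption that matches the PDF pipeline configured above.

```python
from pathlib import Path
from typing import Tuple


def check_input_document(path_str: str, allowed_suffixes: Tuple[str, ...] = (".pdf",)) -> Path:
    """Return the input path if it exists and has an expected suffix, else raise."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Input document not found: {path}")
    if path.suffix.lower() not in allowed_suffixes:
        raise ValueError(
            f"Unsupported file type '{path.suffix}'; expected one of {allowed_suffixes}"
        )
    return path


# Demo with a hypothetical path that does not exist
try:
    check_input_document("sample/does_not_exist.pdf")
except FileNotFoundError as exc:
    print(f"Skipping conversion: {exc}")
```

Calling a check like this on `input_doc_path` before `doc_converter.convert(...)` surfaces a wrong path immediately instead of after the pipeline has been initialized.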


## 5. Generate Multimodal Records

Extract multimodal data (images, text, cells, segments) for each page.


In [None]:
# Generate multimodal records for each page
rows = []

print("Generating multimodal records...")

for (
    content_text,      # Plain text content
    content_md,        # Markdown content
    content_dt,        # DoclingDocument content
    page_cells,        # Layout cells
    page_segments,     # Document segments
    page,              # Page object with image
) in generate_multimodal_pages(conv_res):
    
    # Calculate DPI from scale
    dpi = page._default_image_scale * 72
    
    # Create page record
    page_record = {
        "document": conv_res.input.file.name,
        "hash": conv_res.input.document_hash,
        "page_hash": create_hash(
            conv_res.input.document_hash + ":" + str(page.page_no - 1)
        ),
        "image": {
            "width": page.image.width,
            "height": page.image.height,
            "bytes": page.image.tobytes(),
        },
        "cells": page_cells,
        "contents": content_text,
        "contents_md": content_md,
        "contents_dt": content_dt,
        "segments": page_segments,
        "extra": {
            "page_num": page.page_no,
            "width_in_points": page.size.width,
            "height_in_points": page.size.height,
            "dpi": dpi,
        },
    }
    
    rows.append(page_record)
    print(f"  ✓ Page {page.page_no}: {page.image.width}x{page.image.height}px @ {dpi:.0f} DPI")

print(f"\n✓ Generated {len(rows)} multimodal page records")


Generating multimodal records...
  ✓ Page 0: 1190x1684px @ 144 DPI
  ✓ Page 1: 1190x1684px @ 144 DPI
  ✓ Page 2: 1190x1684px @ 144 DPI
  ✓ Page 3: 1190x1684px @ 144 DPI
  ✓ Page 4: 1190x1684px @ 144 DPI

✓ Generated 5 multimodal page records
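The `page_hash` field is derived from the document hash and a page index, so each page gets a stable identifier. The sketch below illustrates that idea with the standard library only. It is a stand-in, not Docling's `create_hash`: the use of SHA-256 is an assumption, and `sha256_hex` and `page_hash` are hypothetical names.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hash a string and return the hex digest (stand-in for create_hash)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def page_hash(document_hash: str, page_index: int) -> str:
    """Derive a per-page identifier from the document hash and a page index."""
    return sha256_hex(f"{document_hash}:{page_index}")


doc_hash = "c6850c84582b4c487f5577356a0d9fa49847974fb387db086978d18357694fef"
first = page_hash(doc_hash, 0)
second = page_hash(doc_hash, 1)

print(first == page_hash(doc_hash, 0))  # True: same inputs give the same hash
print(first != second)                  # True: different pages give different hashes
```

Because the identifier depends only on the document hash and the page index, re-running the export on the same file yields the same page hashes, which helps when deduplicating or joining records across runs.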


## 6. Create DataFrame and Export to Parquet

Convert records to a pandas DataFrame and save as Parquet file.


In [None]:
# Convert to DataFrame with flattened structure
df_result = pd.json_normalize(rows)

# Generate timestamped filename
now = datetime.datetime.now()
output_filename = output_dir / f"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet"

# Export to Parquet
df_result.to_parquet(output_filename, engine='pyarrow')

print("✓ Multimodal data exported to Parquet")
print(f"  - File: {output_filename}")
print(f"  - Size: {output_filename.stat().st_size / 1024:.2f} KB")
print(f"  - Records: {len(df_result)}")
print(f"\nDataFrame shape: {df_result.shape}")
print(f"Columns: {list(df_result.columns)}")


✓ Multimodal data exported to Parquet
  - File: /Users/yashpatil/Developer/AI/SunnySavita/multimodal_output/multimodal_2025-11-10_013050.parquet
  - Size: 2580.47 KB
  - Records: 5

DataFrame shape: (5, 15)
Columns: ['document', 'hash', 'page_hash', 'cells', 'contents', 'contents_md', 'contents_dt', 'segments', 'image.width', 'image.height', 'image.bytes', 'extra.page_num', 'extra.width_in_points', 'extra.height_in_points', 'extra.dpi']
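The dotted column names such as `image.width` and `extra.dpi` come from flattening the nested `image` and `extra` dictionaries. The sketch below mimics that flattening with a small recursive function so the naming scheme is easy to see. It is an illustration only: `flatten` is a hypothetical helper, not the pandas implementation, and the sample record is made up.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys; lists are left as-is."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


record = {
    "document": "example.pdf",
    "cells": [{"text": "Hello"}],
    "image": {"width": 1190, "height": 1684},
    "extra": {"page_num": 0, "dpi": 144.0},
}

print(list(flatten(record)))
# ['document', 'cells', 'image.width', 'image.height', 'extra.page_num', 'extra.dpi']
```

When reading the Parquet file back, these dotted names are ordinary column labels, so they are accessed as `row['image.width']` rather than `row['image']['width']`.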


## 7. Inspect the Data

Preview the extracted multimodal data.


In [None]:
# Display DataFrame info
print("DataFrame Information:")
print("=" * 80)
df_result.info()  # info() prints directly and returns None

print("\n" + "=" * 80)
print("Sample Data (first row):")
print("=" * 80)

# Show first record (excluding image bytes for readability)
sample = df_result.iloc[0].to_dict()
for key, value in sample.items():
    if key == 'image.bytes':
        print(f"{key}: <binary data, {len(value)} bytes>")
    elif isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    else:
        print(f"{key}: {value}")


DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   document                5 non-null      object 
 1   hash                    5 non-null      object 
 2   page_hash               5 non-null      object 
 3   cells                   5 non-null      object 
 4   contents                5 non-null      object 
 5   contents_md             5 non-null      object 
 6   contents_dt             5 non-null      object 
 7   segments                5 non-null      object 
 8   image.width             5 non-null      int64  
 9   image.height            5 non-null      int64  
 10  image.bytes             5 non-null      object 
 11  extra.page_num          5 non-null      int64  
 12  extra.width_in_points   5 non-null      float64
 13  extra.height_in_points  5 non-null      float64
 14  extra.dpi              

## 8. Text Content Preview

Display text content from the first few pages.



In [None]:
# Preview text content from first 2 pages
num_pages_to_show = min(2, len(df_result))

for i in range(num_pages_to_show):
    row = df_result.iloc[i]
    print(f"\n{'=' * 80}")
    print(f"Page {row['extra.page_num']}")
    print(f"{'=' * 80}")
    print(f"Dimensions: {row['extra.width_in_points']:.1f} x {row['extra.height_in_points']:.1f} points")
    print(f"Image: {row['image.width']}x{row['image.height']}px @ {row['extra.dpi']:.0f} DPI")
    print("\nText Content (first 500 chars):")
    print("-" * 80)
    content = row['contents']
    print(content[:500])
    if len(content) > 500:
        print("...")



Page 0
Dimensions: 595.0 x 841.9 points
Image: 1190x1684px @ 144 DPI

Text Content (first 500 chars):
--------------------------------------------------------------------------------
DeepWiki sunnysavita10/document_portal Menu Document Portal Overview Relevant source files Purpose and Scope This document provides a high-level introduction to the Document Portal system, a comprehensive document processing platform that combines AI-powered analysis, comparison, and conversational capabilities. The system enables users to upload documents and perform various operations including metadata extraction, document comparison, and interactive question-answering through retrieval-augme
...

Page 1
Dimensions: 595.0 x 841.9 points
Image: 1190x1684px @ 144 DPI

Text Content (first 500 chars):
--------------------------------------------------------------------------------
High-Level Architecture System Components Overview Sources: requirements.txt 1-22 API Endpoints and Processing Flow Ask Devin a

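## 9. Load with HuggingFace Datasets (Optional)

Load the exported Parquet file and reconstruct page images from the stored bytes.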

In [None]:
# Optional: Load with HuggingFace Datasets and reconstruct images
# Uncomment to run (requires: pip install datasets pillow)

# from datasets import Dataset
# from PIL import Image
# import io
# 
# # Load the Parquet file
# multimodal_df = pd.read_parquet(output_filename)
# 
# # Convert to HuggingFace Dataset
# dataset = Dataset.from_pandas(multimodal_df)
# 
# # Function to reconstruct images from bytes
# def reconstruct_image(example):
#     """Reconstruct PIL Image from raw bytes."""
#     img = Image.frombytes(
#         'RGB',
#         (example["image.width"], example["image.height"]),
#         example["image.bytes"],
#         'raw'
#     )
#     example["reconstructed_image"] = img
#     return example
# 
# # Apply transformation
# dataset = dataset.map(reconstruct_image)
# 
# print("✓ Dataset loaded with reconstructed images")
# print(f"  - Records: {len(dataset)}")
# print(f"  - Features: {dataset.features}")
# 
# # Display first image
# first_record = dataset[0]
# print(f"\nFirst page image: {first_record['reconstructed_image'].size}")
# first_record['reconstructed_image'].show()  # Opens image viewer


## 10. Visualize Page Images (Alternative)

Display page images directly from the extracted data using matplotlib.

In [None]:
# Visualize page images using matplotlib
# Uncomment to display images

# import matplotlib.pyplot as plt
# from PIL import Image
# import io
# 
# # Number of pages to visualize
# num_pages = min(3, len(df_result))
# 
# fig, axes = plt.subplots(1, num_pages, figsize=(15, 5))
# if num_pages == 1:
#     axes = [axes]
# 
# for i in range(num_pages):
#     row = df_result.iloc[i]
#     
#     # Reconstruct image from bytes
#     img = Image.frombytes(
#         'RGB',
#         (row['image.width'], row['image.height']),
#         row['image.bytes'],
#         'raw'
#     )
#     
#     # Display
#     axes[i].imshow(img)
#     axes[i].set_title(f"Page {row['extra.page_num']}\n{row['image.width']}x{row['image.height']}px")
#     axes[i].axis('off')
# 
# plt.tight_layout()
# plt.show()
# 
# print(f"✓ Displayed {num_pages} page images")


## 12. Use Cases and Applications

### Training Vision-Language Models (VLMs)

The exported Parquet files are ideal for training multimodal models:

```python
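# Load as HuggingFace dataset for training
from datasets import load_dataset

# split='train' returns a Dataset rather than a DatasetDict
dataset = load_dataset('parquet', data_files='multimodal_*.parquet', split='train')

# Use with PyTorch DataLoader
# Note: variable-length columns (e.g. cells) may need a custom collate_fn
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)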
```

### Document Understanding Pipeline

Combine visual and textual features:

```python
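# Extract features for each page.
# vision_model, language_model, and combine are placeholders for your own
# components; 'reconstructed_image' is available only after the image
# reconstruction step has been applied to the dataset.
for page in dataset:
    image = page['reconstructed_image']
    text = page['contents']
    cells = page['cells']
    
    # Process with vision model
    visual_features = vision_model(image)
    
    # Process with language model
    text_features = language_model(text)
    
    # Combine for downstream tasks
    combined_features = combine(visual_features, text_features)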
```

### Document Archival System

Preserve documents with high fidelity:
- Full-resolution page images
- Structured text content
- Layout information (cells, segments)
- Searchable metadata

### Advantages

- **Efficient Storage**: Parquet format with compression
- **Fast Access**: Columnar format for quick queries
- **Rich Metadata**: Complete document structure preserved
- **ML-Ready**: Direct integration with ML frameworks
- **Scalable**: Process batches of documents efficiently


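## 11. Complete Pipeline Function

A reusable function to process any document and export multimodal data.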

In [None]:
def export_multimodal_document(
    input_path: str,
    output_dir: Path,
    image_scale: float = 2.0,
    include_tables: bool = True
) -> Dict[str, Any]:
    """
    Convert a document and export multimodal page data to Parquet.
    
    Args:
        input_path: Path to input document (PDF, DOCX, etc.)
        output_dir: Directory to save Parquet file
        image_scale: Image resolution scale (1.0 = 72 DPI)
        include_tables: Enable table structure extraction
    
    Returns:
        Dictionary with export info and file path
    """
    # Configure pipeline
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = image_scale
    pipeline_options.generate_page_images = True
    pipeline_options.do_table_structure = include_tables
    
    # Initialize converter
    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    
    # Convert document
    print(f"Converting: {input_path}")
    start_time = time.time()
    conv_res = doc_converter.convert(input_path)
    
    # Generate multimodal records
    rows = []
    for (content_text, content_md, content_dt, page_cells, 
         page_segments, page) in generate_multimodal_pages(conv_res):
        
        dpi = page._default_image_scale * 72
        rows.append({
            "document": conv_res.input.file.name,
            "hash": conv_res.input.document_hash,
            "page_hash": create_hash(
                conv_res.input.document_hash + ":" + str(page.page_no - 1)
            ),
            "image": {
                "width": page.image.width,
                "height": page.image.height,
                "bytes": page.image.tobytes(),
            },
            "cells": page_cells,
            "contents": content_text,
            "contents_md": content_md,
            "contents_dt": content_dt,
            "segments": page_segments,
            "extra": {
                "page_num": page.page_no,
                "width_in_points": page.size.width,
                "height_in_points": page.size.height,
                "dpi": dpi,
            },
        })
    
    # Create DataFrame and export
    df_result = pd.json_normalize(rows)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    now = datetime.datetime.now()
    output_filename = output_dir / f"multimodal_{now:%Y-%m-%d_%H%M%S}.parquet"
    df_result.to_parquet(output_filename, engine='pyarrow')
    
    elapsed = time.time() - start_time
    
    return {
        "input_path": input_path,
        "output_file": str(output_filename),
        "num_pages": len(rows),
        "file_size_kb": output_filename.stat().st_size / 1024,
        "processing_time": elapsed,
        "image_dpi": image_scale * 72,
    }

print("✓ Multimodal export function defined")
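One detail to be aware of: the output filename is based on a timestamp with one-second resolution, so two documents exported to the same directory within the same second would share a filename. A possible variant, sketched below, includes the input file's stem in the name. `build_output_path` is a hypothetical helper and is not used by the function above.

```python
import datetime
from pathlib import Path


def build_output_path(output_dir: Path, input_path: str, now: datetime.datetime) -> Path:
    """Build a Parquet filename that includes the input document's stem and a timestamp."""
    stem = Path(input_path).stem
    return output_dir / f"multimodal_{stem}_{now:%Y-%m-%d_%H%M%S}.parquet"


now = datetime.datetime(2025, 11, 10, 1, 30, 50)
print(build_output_path(Path("multimodal_output"), "docs/report_a.pdf", now).name)
# multimodal_report_a_2025-11-10_013050.parquet
print(build_output_path(Path("multimodal_output"), "docs/report_b.pdf", now).name)
# multimodal_report_b_2025-11-10_013050.parquet
```

Including the stem keeps the timestamp convention while making it easier to tell which Parquet file came from which document.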


### Example: Use the Pipeline Function

In [None]:
# Example usage of the pipeline function
# Uncomment to process a document

# result = export_multimodal_document(
#     input_path="/path/to/your/document.pdf",
#     output_dir=Path("multimodal_output"),
#     image_scale=2.0,  # 144 DPI
#     include_tables=True
# )
# 
# print("\nExport Complete!")
# print(f"  - Output file: {result['output_file']}")
# print(f"  - Pages processed: {result['num_pages']}")
# print(f"  - File size: {result['file_size_kb']:.2f} KB")
# print(f"  - Processing time: {result['processing_time']:.2f} seconds")
# print(f"  - Image DPI: {result['image_dpi']:.0f}")
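To process a whole folder rather than a single file, the export function can be wrapped in a small loop. The sketch below is one way to do that, assuming the inputs are PDFs in a single directory. `process_batch` is a hypothetical helper, and `export_fn` stands in for a call such as `export_multimodal_document`.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def process_batch(
    input_dir: Path,
    export_fn: Callable[[str], Dict[str, Any]],
    pattern: str = "*.pdf",
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Run export_fn on every matching file; collect results and failures separately."""
    results: List[Dict[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    for doc_path in sorted(input_dir.glob(pattern)):
        try:
            results.append(export_fn(str(doc_path)))
        except Exception as exc:  # keep going if one document fails
            failures.append((str(doc_path), str(exc)))
    return results, failures


# Hypothetical usage with the function defined in this notebook:
# results, failures = process_batch(
#     Path("/path/to/your/documents"),
#     lambda p: export_multimodal_document(p, Path("multimodal_output")),
# )
# print(f"Exported {len(results)} documents, {len(failures)} failed")
```

Collecting failures instead of stopping at the first error means one unreadable file does not block the rest of the batch, and the failure list can be reviewed afterwards.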



## Summary

This notebook demonstrated:

1. **Setup**: Configured Docling with image generation enabled
2. **Conversion**: Processed document with multimodal extraction
3. **Data Generation**: Created per-page records with images, text, cells, and segments
4. **Export**: Saved to Parquet format for efficient storage
5. **Loading**: Showed how to load and reconstruct data with HuggingFace Datasets
6. **Visualization**: Displayed page images and content
7. **Pipeline**: Created reusable function for batch processing

### Key Parameters

- **`images_scale`**: Controls image resolution (1.0 = 72 DPI, 2.0 = 144 DPI)
- **`generate_page_images`**: Must be `True` for multimodal export
- **Output format**: Parquet with flattened JSON structure

### Next Steps

- Process multiple documents in batch
- Integrate with ML training pipelines
- Build custom visualization tools
- Create document search and retrieval systems
- Train vision-language models on extracted data

### Resources

- **Docling Documentation**: https://github.com/DS4SD/docling
- **Parquet Format**: https://parquet.apache.org/
- **HuggingFace Datasets**: https://huggingface.co/docs/datasets/

---

**Ready for multimodal document processing! 🚀**
