# Document Processing with Vertector

This notebook demonstrates document processing capabilities including:
- PDF, DOCX, PPTX, XLSX processing
- Classic vs VLM pipeline selection
- Export to multiple formats
- Table extraction
- OCR configuration

## Setup and Imports

In [1]:
from pathlib import Path
from vertector_data_ingestion import (
    UniversalConverter,
    LocalMpsConfig,
    CloudGpuConfig,
    CloudCpuConfig,
    ExportFormat,
    PipelineType,
    HardwareDetector,
    setup_logging,
)

# Setup logging
setup_logging(log_level="INFO")

2025-12-31 15:57:29 | INFO     | vertector_data_ingestion.monitoring.logger:setup_logging:52 - Logging initialized at INFO level


## Hardware Detection

First, let's detect available hardware to optimize our configuration.

In [2]:
# Detect hardware
hw_info = HardwareDetector.get_device_info()

print("Hardware Information:")
print("=" * 50)
for key, value in hw_info.items():
    print(f"{key}: {value}")

# Get optimized config
hw_config = HardwareDetector.detect()
print(f"\nRecommended device: {hw_config.device_type}")

2025-12-31 15:57:49 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 15:57:49 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support


Hardware Information:
device_type: mps
batch_size: 8
use_fp16: False
num_workers: 4
use_mlx: True
platform: darwin
chip: M1

Recommended device: HardwareType.MPS
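
The detected device can drive the choice among the three config classes imported earlier. The sketch below is a hypothetical helper, not part of `vertector_data_ingestion`: it takes the `device_type` string reported by `get_device_info()` and a mapping of factories. Only `"mps"` appears in this notebook's output; the `"cuda"` and `"cpu"` keys, and which config class belongs to each, are assumptions. Strings stand in for the config classes so the snippet runs without the package installed.

```python
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def select_config(
    device_type: str,
    factories: Mapping[str, Callable[[], T]],
    fallback: str = "cpu",
) -> T:
    """Build the config registered for device_type, or the fallback if it is unknown."""
    factory = factories.get(device_type.lower(), factories[fallback])
    return factory()


# Stand-in factories; in the notebook these could be the config classes themselves,
# e.g. {"mps": LocalMpsConfig, "cuda": CloudGpuConfig, "cpu": CloudCpuConfig}.
# The "cuda" and "cpu" keys are assumed, not confirmed by the output above.
factories = {
    "mps": lambda: "LocalMpsConfig",
    "cuda": lambda: "CloudGpuConfig",
    "cpu": lambda: "CloudCpuConfig",
}

print(select_config("mps", factories))
print(select_config("tpu", factories))  # unknown device falls back to "cpu"
```

Keeping the mapping outside the function makes the fallback explicit and lets the same helper be reused if more device types are added.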


## Basic Document Conversion

Convert a document using default settings.

In [3]:
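# Initialize the converter with the config that matches the detected hardware.
# LocalMpsConfig fits the MPS device reported above; on other machines,
# use CloudGpuConfig() or CloudCpuConfig() instead.
config = LocalMpsConfig()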
converter = UniversalConverter(config)

# Convert a document (replace with your file)
# Note: For audio files (.wav, .mp3), install the 'asr' extra: uv sync --extra asr
doc_path = Path("../test_documents/arxiv_sample.pdf")

2025-12-31 15:57:58 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 15:57:58 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2025-12-31 15:57:58 | INFO     | vertector_data_ingestion.core.pipeline_router:__init__:57 - Hardware detected: mps
2025-12-31 15:57:58 | INFO     | vertector_data_ingestion.core.universal_converter:__init__:45 - Initialized UniversalConverter on mps
2025-12-31 15:57:58 | INFO     | vertector_data_ingestion.core.universal_converter:_ensure_models_available:68 - Checking model availability...


In [4]:
if doc_path.exists():
    # Use unified convert() method
    doc = converter.convert(doc_path)
    
    print(f"Document: {doc.metadata.source_path.name}")
    print(f"Pages: {doc.metadata.num_pages}")
    print(f"Pipeline: {doc.metadata.pipeline_type}")
    print(f"Processing time: {doc.metadata.processing_time:.2f}s")
else:
    print(f"File not found: {doc_path}")
    print("\nReplace with path to your PDF, DOCX, PPTX, or other document.")

Consider using the pymupdf_layout package for a greatly improved page layout analysis.


2025-12-31 15:59:02 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:100 - Using Classic pipeline for PDF with tables: arxiv_sample.pdf
2025-12-31 15:59:02 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting arxiv_sample.pdf with classic pipeline
2025-12-31 15:59:02,610 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 15:59:02,697 - INFO - Going to convert document batch...
2025-12-31 15:59:02,698 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 870e160bad93d15722a8ae8d62725e09
2025-12-31 15:59:02,737 - INFO - Loading plugin 'docling_defaults'
2025-12-31 15:59:02,740 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-31 15:59:02,748 - INFO - Loading plugin 'docling_defaults'
2025-12-31 15:59:02,761 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocr

Document: arxiv_sample.pdf
Pages: 9
Pipeline: classic
Processing time: 121.70s
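
The router's log messages in this notebook name three outcomes: VLM for a scanned PDF, Classic for a PDF with tables, and Classic as the default. A rough sketch of rules that would produce those messages follows. It is inferred from the log text alone; the function name, the `is_scanned` and `has_tables` flags, and the order of the checks are assumptions, not the library's implementation.

```python
from enum import Enum


class Pipeline(str, Enum):
    CLASSIC = "classic"
    VLM = "vlm"


def choose_pipeline(is_scanned: bool, has_tables: bool) -> Pipeline:
    """Hypothetical routing rules reconstructed from the log messages."""
    if is_scanned:
        # Matches "Using VLM pipeline for scanned PDF"
        return Pipeline.VLM
    if has_tables:
        # Matches "Using Classic pipeline for PDF with tables"
        return Pipeline.CLASSIC
    # Matches "Using Classic pipeline (default)"
    return Pipeline.CLASSIC


print(choose_pipeline(is_scanned=True, has_tables=False).value)
print(choose_pipeline(is_scanned=False, has_tables=True).value)
print(choose_pipeline(is_scanned=False, has_tables=False).value)
```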


## Export to Different Formats

In [5]:
if doc_path.exists():
    # Export to Markdown
    markdown = converter.export(doc, ExportFormat.MARKDOWN)
    print("Markdown Output (first 500 chars):")
    print("=" * 50)
    print(markdown[:500])
    print("...\n")
    
    # Save to file - output_dir is automatically managed.
    # Note: convert_and_export() converts the source again before exporting.
    output_path = converter.convert_and_export(
        source=doc_path,
        output_name="document.md",
        format=ExportFormat.MARKDOWN
    )
    print(f"Saved to: {output_path}")

Markdown Output (first 500 chars):
## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com

Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com

Ahmed S. Nassar IBM Research

Rueschlikon, Switzerland ahn@zurich.ibm.com

Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com

Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com

Figure 1: Four examples of complex page layouts across diff
...



2025-12-31 16:00:41 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:100 - Using Classic pipeline for PDF with tables: arxiv_sample.pdf
2025-12-31 16:00:41 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting arxiv_sample.pdf with classic pipeline
2025-12-31 16:00:41,831 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 16:00:41,836 - INFO - Going to convert document batch...
2025-12-31 16:00:41,836 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 870e160bad93d15722a8ae8d62725e09
2025-12-31 16:00:41,836 - INFO - Accelerator device: 'mps'
2025-12-31 16:00:42,408 - INFO - Accelerator device: 'mps'
2025-12-31 16:00:43,006 - INFO - Processing document arxiv_sample.pdf
2025-12-31 16:00:54,825 - INFO - Finished converting document arxiv_sample.pdf in 13.00 sec.
2025-12-31

Saved to: output/document.md
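
The log above shows `convert_and_export()` running the conversion a second time. When the exported Markdown string is already in hand, it can also be written to disk directly. The following is a minimal sketch using only the standard library; `save_text` is an illustrative helper, not a library function, and a temporary directory stands in for the notebook's `output/` folder.

```python
import tempfile
from pathlib import Path


def save_text(text: str, output_dir: Path, name: str) -> Path:
    """Write text to output_dir/name, creating the directory if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / name
    target.write_text(text, encoding="utf-8")
    return target


# Placeholder content; in the notebook, pass the `markdown` variable instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("## Example heading\n", Path(tmp) / "output", "document.md")
    print(f"Saved to: {saved.parent.name}/{saved.name}")
    print(saved.read_text(encoding="utf-8"))
```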


## Pipeline Selection

Compare the Classic and VLM pipelines on the same document, forcing each one with the `use_vlm` argument.

In [7]:
import time

if doc_path.exists():
    # perf_counter() is a monotonic clock intended for measuring elapsed time
    print("Classic Pipeline:")
    start = time.perf_counter()
    classic_doc = converter.convert(doc_path, use_vlm=False)
    classic_time = time.perf_counter() - start
    print(f"  Time: {classic_time:.2f}s")
    
    print("\nVLM Pipeline:")
    start = time.perf_counter()
    vlm_doc = converter.convert(doc_path, use_vlm=True)
    vlm_time = time.perf_counter() - start
    print(f"  Time: {vlm_time:.2f}s")
    print(f"\nVLM is {vlm_time/classic_time:.2f}x slower (but more accurate)")

2025-12-31 16:05:14 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting arxiv_sample.pdf with classic pipeline
2025-12-31 16:05:14,310 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 16:05:14,316 - INFO - Going to convert document batch...
2025-12-31 16:05:14,317 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 870e160bad93d15722a8ae8d62725e09
2025-12-31 16:05:14,318 - INFO - Accelerator device: 'mps'


Classic Pipeline:


2025-12-31 16:05:15,260 - INFO - Accelerator device: 'mps'
2025-12-31 16:05:15,740 - INFO - Processing document arxiv_sample.pdf
2025-12-31 16:05:28,163 - INFO - Finished converting document arxiv_sample.pdf in 13.85 sec.
2025-12-31 16:05:28 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:199 - Converted arxiv_sample.pdf in 13.86s (9 pages, 0.6 pages/sec)
2025-12-31 16:05:28 | INFO     | vertector_data_ingestion.core.pipeline_router:_build_vlm_options:233 - Using Granite-Docling-MLX (258M params, default)
2025-12-31 16:05:28 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting arxiv_sample.pdf with vlm pipeline
2025-12-31 16:05:28,173 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 16:05:28,177 - INFO - Going to convert document batch.

  Time: 13.87s

VLM Pipeline:


2025-12-31 16:05:28,858 - INFO - Processing document arxiv_sample.pdf
2025-12-31 16:08:18,440 - INFO - Finished converting document arxiv_sample.pdf in 170.27 sec.
2025-12-31 16:08:18 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:199 - Converted arxiv_sample.pdf in 170.27s (9 pages, 0.1 pages/sec)


  Time: 170.27s

VLM is 12.28x slower (but more accurate)
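
The ratio and the pages-per-second figures in the logs can be rechecked from the printed timings. The short calculation below uses only numbers shown in the output above (9 pages, 13.87 s, 170.27 s); the variable names are illustrative.

```python
pages = 9
classic_time = 13.87  # seconds, from "Time: 13.87s"
vlm_time = 170.27     # seconds, from "Time: 170.27s"

slowdown = vlm_time / classic_time
classic_rate = pages / classic_time
vlm_rate = pages / vlm_time

print(f"Slowdown: {slowdown:.2f}x")                  # 12.28x
print(f"Classic: {classic_rate:.1f} pages/sec")      # 0.6 pages/sec
print(f"VLM: {vlm_rate:.1f} pages/sec")              # 0.1 pages/sec
print(f"VLM: {vlm_time / pages:.1f} sec per page")   # 18.9 sec per page
```

These figures come from a single run of each pipeline on an M1, so they are indicative rather than a benchmark. The notebook measures only speed; it does not measure accuracy.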


## Batch Processing

Process multiple documents with the same unified API.

In [9]:
documents_dir = Path("../test_documents/")

if documents_dir.exists():
    pdf_files = list(documents_dir.glob("*.pdf"))[:5]
    
    if pdf_files:
        print(f"Processing {len(pdf_files)} documents...")
        
        # Same convert() method - just pass a list!
        results = converter.convert(pdf_files, parallel=True)
        
        print("\nResults:")
        for doc in results:
            print(f"  {doc.metadata.source_path.name}: {doc.metadata.num_pages} pages")
    else:
        print("No PDF files found")
else:
    print(f"Create a '{documents_dir}' directory with PDFs to test batch processing")

2025-12-31 16:09:33 | INFO     | vertector_data_ingestion.core.universal_converter:convert_batch:331 - Converting 3 documents in parallel
2025-12-31 16:09:36 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:97 - Using VLM pipeline for scanned PDF: ocr_test.pdf


Processing 3 documents...


2025-12-31 16:09:36 | INFO     | vertector_data_ingestion.core.pipeline_router:determine_pipeline:106 - Using Classic pipeline (default) for 2112.13734v2.pdf
2025-12-31 16:09:36 | INFO     | vertector_data_ingestion.core.pipeline_router:_build_vlm_options:233 - Using Granite-Docling-MLX (258M params, default)
2025-12-31 16:09:36 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting 2112.13734v2.pdf with classic pipeline
2025-12-31 16:09:36 | INFO     | vertector_data_ingestion.core.universal_converter:_convert_with_retry:180 - Converting ocr_test.pdf with vlm pipeline
2025-12-31 16:09:37,055 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 16:09:37,098 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-31 16:0


Results:
  ocr_test.pdf: 1 pages
  2112.13734v2.pdf: 4 pages
  arxiv_sample.pdf: 9 pages


## Summary

This notebook demonstrated:
- Unified `convert()` API for single and batch processing
- Hardware detection
- Pipeline selection
- Export capabilities

Next: Check out `02_audio_transcription.ipynb`