# Document Processing with Vertector

This notebook walks through the core document processing workflow:
- Hardware detection and configuration
- Converting PDF, DOCX, PPTX, and XLSX documents
- Exporting converted documents (Markdown is shown here)
- Classic vs VLM pipeline selection
- Batch processing

## Setup and Imports

In [1]:
from pathlib import Path
from vertector_data_ingestion import (
    UniversalConverter,
    LocalMpsConfig,
    CloudGpuConfig,
    CloudCpuConfig,
    ExportFormat,
    PipelineType,
    HardwareDetector,
    setup_logging,
)

# Setup logging
setup_logging(log_level="INFO")

2026-01-15 02:40:20 | INFO     | vertector_data_ingestion.monitoring.logger:setup_logging:51 - Logging initialized at INFO level


## Hardware Detection

First, let's detect available hardware to optimize our configuration.

In [2]:
# Detect hardware
hw_info = HardwareDetector.get_device_info()

print("Hardware Information:")
print("=" * 50)
for key, value in hw_info.items():
    print(f"{key}: {value}")

# Get optimized config
hw_config = HardwareDetector.detect()
print(f"\nRecommended device: {hw_config.device_type}")

2026-01-15 02:40:46 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support
2026-01-15 02:40:46 | INFO     | vertector_data_ingestion.core.hardware_detector:detect:50 - Detected Apple Silicon with MPS support


Hardware Information:
device_type: mps
batch_size: 8
use_fp16: False
num_workers: 4
use_mlx: True
platform: darwin
chip: M1

Recommended device: HardwareType.MPS
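
The detected device type can drive the choice of configuration class. The following is a minimal sketch, not the library's own logic: the `pick_config_name` helper and the `CONFIG_BY_DEVICE` mapping are hypothetical, and the `"cuda"` and `"cpu"` device strings are assumptions (only `mps` appears in the output above). It returns class names as strings so that it runs without the package installed.

```python
# Hypothetical mapping from a device-type string to a config class name.
# Only "mps" is confirmed by the output above; "cuda" and "cpu" are assumed.
CONFIG_BY_DEVICE = {
    "mps": "LocalMpsConfig",
    "cuda": "CloudGpuConfig",
    "cpu": "CloudCpuConfig",
}


def pick_config_name(device_type: str) -> str:
    """Return the config class name for a device, falling back to the CPU config."""
    return CONFIG_BY_DEVICE.get(device_type.lower(), "CloudCpuConfig")


print(pick_config_name("mps"))  # LocalMpsConfig
```

Falling back to the CPU config for unknown devices keeps the notebook usable on machines without an accelerator.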


## Basic Document Conversion

Convert a document using default settings.

In [None]:
# Initialize converter with hardware-optimized config
config = LocalMpsConfig()  # or CloudGpuConfig() or CloudCpuConfig()
converter = UniversalConverter(config)

# Convert a document (replace with your file)
# Note: For audio files (.wav, .mp3), install the 'asr' extra: uv sync --extra asr
doc_path = Path("../test_documents/sample1.pdf")

In [None]:
if doc_path.exists():
    # Use unified convert() method
    doc = converter.convert(doc_path)
    
    print(f"Document: {doc.metadata.source_path.name}")
    print(f"Pages: {doc.metadata.num_pages}")
    print(f"Pipeline: {doc.metadata.pipeline_type}")
    print(f"Processing time: {doc.metadata.processing_time:.2f}s")
else:
    print(f"File not found: {doc_path}")
    print("\nReplace with path to your PDF, DOCX, PPTX, or other document.")
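
Before calling `convert()`, it can help to reject files that are clearly not documents. This is a sketch under stated assumptions: the suffix set comes from the formats named in this notebook, `looks_convertible` is a hypothetical helper, and the converter itself may accept more (or fewer) types than listed here.

```python
from pathlib import Path

# Suffixes taken from the formats named in this notebook.
# The converter's real list of supported types is not shown, so this is an assumption.
DOCUMENT_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def looks_convertible(path: Path) -> bool:
    """Check the file suffix case-insensitively; does not touch the filesystem."""
    return path.suffix.lower() in DOCUMENT_SUFFIXES


print(looks_convertible(Path("../test_documents/sample1.pdf")))  # True
print(looks_convertible(Path("notes.txt")))  # False
```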

## Export to Different Formats

In [None]:
if doc_path.exists():
    # Export to Markdown
    markdown = converter.export(doc, ExportFormat.MARKDOWN)
    print("Markdown Output (first 500 chars):")
    print("=" * 50)
    print(markdown[:500])
    print("...\n")
    
    # Save to file - output_dir is automatically managed
    output_path = converter.convert_and_export(
        source=doc_path,
        output_name="prompt.md",
        format=ExportFormat.MARKDOWN
    )
    print(f"Saved to: {output_path}")
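
Rather than hard-coding `output_name`, the name can be derived from the source file. The sketch below is illustrative only: `output_name_for` and `SUFFIX_BY_FORMAT` are hypothetical, and plain strings stand in for `ExportFormat` members because only `MARKDOWN` appears in this notebook.

```python
from pathlib import Path

# Assumed suffixes for illustration. Plain strings are used instead of
# ExportFormat members, since only MARKDOWN is shown in this notebook.
SUFFIX_BY_FORMAT = {"markdown": ".md", "json": ".json", "text": ".txt"}


def output_name_for(source: Path, fmt: str) -> str:
    """Build an output file name from the source stem and the export format."""
    return source.with_suffix(SUFFIX_BY_FORMAT[fmt]).name


print(output_name_for(Path("../test_documents/sample1.pdf"), "markdown"))  # sample1.md
```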

## Pipeline Selection

Compare Classic vs VLM pipelines.

In [None]:
import time

if doc_path.exists():
    # perf_counter() is a monotonic clock, better suited to timing than time.time()
    print("Classic Pipeline:")
    start = time.perf_counter()
    classic_doc = converter.convert(doc_path, use_vlm=False)
    classic_time = time.perf_counter() - start
    print(f"  Time: {classic_time:.2f}s")
    
    print("\nVLM Pipeline:")
    start = time.perf_counter()
    vlm_doc = converter.convert(doc_path, use_vlm=True)
    vlm_time = time.perf_counter() - start
    print(f"  Time: {vlm_time:.2f}s")
    
    # Report the measured ratio only; this cell does not measure accuracy
    print(f"\nVLM took {vlm_time / classic_time:.2f}x the classic pipeline's time")
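
The timing pattern in this cell can be factored into a small helper. This is a generic sketch using only the standard library; `timed` and `slowdown` are hypothetical names, and `sum` stands in for the call to `converter.convert(...)`.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slowdown(baseline_s: float, candidate_s: float) -> float:
    """Ratio of candidate to baseline time; inf if the baseline is zero."""
    return candidate_s / baseline_s if baseline_s > 0 else float("inf")


# Stand-in workload; in the notebook this would be converter.convert(...)
total, elapsed = timed(sum, range(1000))
print(total, f"{slowdown(2.0, 5.0):.2f}x")
```

A single run is noisy, and the first call may include model loading, so repeating each measurement and comparing medians gives a fairer picture.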

## Batch Processing

Process multiple documents with the same unified API.

In [None]:
documents_dir = Path("../test_documents/")

if documents_dir.exists():
    pdf_files = sorted(documents_dir.glob("*.pdf"))[:5]  # sorted for a deterministic selection
    
    if pdf_files:
        print(f"Processing {len(pdf_files)} documents...")
        
        # Same convert() method - just pass a list!
        results = converter.convert(pdf_files, parallel=True)
        
        print("\nResults:")
        for result in results:  # avoid shadowing `doc` from the earlier cells
            print(f"  {result.metadata.source_path.name}: {result.metadata.num_pages} pages")
    else:
        print("No PDF files found")
else:
    print(f"Directory not found: {documents_dir}. Create it and add PDFs to test batch processing.")
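
How `parallel=True` is implemented inside the library is not shown here. As a general illustration of order-preserving parallel processing, the sketch below uses a thread pool from the standard library; `convert_many` and the stand-in conversion function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def convert_many(
    paths: Sequence[Path],
    convert_one: Callable[[Path], T],
    max_workers: int = 4,
) -> List[T]:
    """Apply convert_one to each path concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(convert_one, paths))


# Stand-in for a real conversion: return the file stem
stems = convert_many([Path("a.pdf"), Path("b.pdf"), Path("c.pdf")], lambda p: p.stem)
print(stems)  # ['a', 'b', 'c']
```

Because results come back in input order, they can be paired with the original paths using `zip(paths, results)`.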

## Summary

This notebook demonstrated:
- Unified `convert()` API for single and batch processing
- Hardware detection
- Pipeline selection
- Export capabilities

Next: Check out `02_audio_transcription.ipynb`