Release v4.3.0 · xberg-io/xberg

Added

Blank Page Detection

is_blank field on PageInfo and PageContent: Pages with fewer than 3 non-whitespace characters and no tables or images are flagged as blank. Detection uses a two-phase approach: text-only analysis during extraction, then refinement after table/image assignment. Available across all 9 language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir, WASM). Closes #378.
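
The rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: the `Page` dataclass and both function names are hypothetical, and only the threshold (fewer than 3 non-whitespace characters) and the two-phase order come from the note above.

```python
from dataclasses import dataclass

# Threshold from the release note: fewer than 3 non-whitespace characters.
MIN_NON_WHITESPACE_CHARS = 3


@dataclass
class Page:
    """Hypothetical stand-in for a page record; not the library's PageInfo type."""
    text: str
    table_count: int = 0
    image_count: int = 0
    is_blank: bool = False


def text_looks_blank(text: str) -> bool:
    """Phase 1: text-only check, usable while text is still being extracted."""
    non_whitespace = sum(1 for ch in text if not ch.isspace())
    return non_whitespace < MIN_NON_WHITESPACE_CHARS


def refine_blank_flag(page: Page) -> Page:
    """Phase 2: after tables and images are assigned, any of them clears the flag."""
    page.is_blank = (
        text_looks_blank(page.text)
        and page.table_count == 0
        and page.image_count == 0
    )
    return page
```

The second phase matters for scanned or image-only pages: they carry almost no text, so the text-only check alone would mark them blank.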

PaddleOCR Backend

PaddleOCR backend via ONNX Runtime: New OCR backend (kreuzberg-paddle-ocr) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
PaddleOCR support in all bindings: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the paddle-ocr feature flag.
PaddleOCR CLI support: The kreuzberg-cli binary supports --ocr-backend paddle-ocr for PaddleOCR extraction.

Unified OCR Element Output

Structured OCR element data: Extraction results now include OcrElement data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.

Shared ONNX Runtime Discovery

ort_discovery module: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.

Document Structure Output

DocumentStructure support across all bindings: Added structured document output with include_document_structure configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.

Native DOC/PPT Extraction

OLE/CFB-based extraction: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.

musl Linux Support

Re-enabled musl targets: Added x86_64-unknown-linux-musl and aarch64-unknown-linux-musl targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves glibc 2.38+ requirement for prebuilt CLI binaries on older distros like Ubuntu 22.04 (#364).

Fixed

MSG Extraction Hang on Large Attachments (#372)

Fixed .msg (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the msg_parser crate with direct OLE/CFB parsing using the cfb crate — attachment binary data is now read directly without hex-encoding overhead.
Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.

Rotated PDF Text Extraction

Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips /Rotate entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.

CSV and Excel Extraction Quality

Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
Fixed Excel extraction producing low quality scores (0.22) by outputting clean tab/newline-delimited cell text.
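
To illustrate the difference between debug-style and delimited output, here is a small sketch. The function name `rows_to_text` and the sample rows are hypothetical; the actual extractor is not shown in these notes, and the choice of a tab delimiter for both cases is an assumption made for the example.

```python
from typing import Sequence


def rows_to_text(rows: Sequence[Sequence[str]], delimiter: str = "\t") -> str:
    """Join cells with a delimiter and rows with newlines."""
    return "\n".join(delimiter.join(cells) for cells in rows)


rows = [["name", "qty"], ["apple", "3"]]

# Debug-style output keeps brackets and quotes that are not part of the data.
debug_text = repr(rows)

# Delimited output contains only the cell text and separators.
clean_text = rows_to_text(rows)
```

Here `debug_text` still carries list brackets and quote characters, while `clean_text` holds only the cell values separated by tabs and newlines.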

XML Extraction Quality

Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.

WASM Table Extraction

Fixed WASM adapter not recognizing page_number field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.

DOCX Formatting Output (#376)

Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers.
Fixed heading hierarchy: Title style maps to #, Heading1 to ##, through Heading5+ clamped at ######.
Fixed bullet lists, numbered lists, and nested list indentation.
Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
Fixed table cell formatting being stripped — bold/italic inside table cells is now preserved.
Added 16 integration tests covering formatting, headings, lists, tables, and document structure.
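
The heading mapping described above can be sketched as follows. The function name `heading_prefix` is hypothetical, and matching on the style names "Title" and "HeadingN" is an assumption for illustration; only the mapping itself (Title to #, Heading1 to ##, clamped at ######) is taken from the notes.

```python
import re
from typing import Optional

# Markdown supports at most six heading levels.
MAX_MARKDOWN_HEADING_LEVEL = 6


def heading_prefix(style_name: str) -> Optional[str]:
    """Return the markdown heading marker for a paragraph style, or None."""
    if style_name == "Title":
        return "#"
    match = re.fullmatch(r"Heading([1-9][0-9]*)", style_name)
    if match is None:
        return None  # not a heading style
    # Title occupies level 1, so HeadingN is shifted down by one level,
    # then clamped so deep headings never exceed six hashes.
    level = min(int(match.group(1)) + 1, MAX_MARKDOWN_HEADING_LEVEL)
    return "#" * level
```

Under this sketch, Heading5 and every deeper level all produce the same six-hash marker.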

Typst Table Content Extraction

Fixed Typst extract_table_content double-counting opening parenthesis, which caused the table parser to consume all remaining document content after a #table() call.
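
A simplified sketch shows why counting the opening parenthesis twice is harmful. This is not the project's parser: `extract_call_args` is a hypothetical helper, it ignores string literals and comments, and the exact mechanism of the original bug is inferred from the description above.

```python
def extract_call_args(source: str, open_index: int) -> str:
    """Return the text between the "(" at open_index and its matching ")"."""
    if source[open_index] != "(":
        raise ValueError("open_index must point at '('")
    depth = 1            # the opening parenthesis is counted here, exactly once
    i = open_index + 1   # scanning starts after it, so it is not counted again
    while i < len(source):
        ch = source[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return source[open_index + 1:i]
        i += 1
    raise ValueError("unbalanced parentheses")


src = "#table(columns: 2, [a], [b(c)])\nAfter the table"
args = extract_call_args(src, src.index("("))
```

If the scan started at `open_index` instead of `open_index + 1`, the depth would reach 2 on the first character and never return to 0 at the real closing parenthesis, so a parser that keeps reading would swallow the rest of the document.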

PaddleOCR Recognition Model

Fixed PaddleOCR recognition model failing to load with ShapeInferenceError on ONNX Runtime 1.23.x.
Fixed incorrect detection model filename in Docker and CI action.

Python Bindings

Fixed OcrConfig constructor silently ignoring paddle_ocr_config and element_config keyword arguments.
Fixed keyword extraction results being silently dropped in Python bindings. Closes #379.

TypeScript/Node.js Bindings

Fixed PaddleOCR config and element config being silently dropped by the NAPI-RS binding layer.
Fixed ocr_elements missing from extraction result conversion.

Ruby Bindings

Fixed kreuzberg-pdfium-render vendored crate not included in gemspec.
Fixed PaddleOCR config and element config not being parsed in Ruby binding config layer.
Fixed ocr_elements missing from Ruby extraction result conversion.

Go Bindings

Fixed PdfMetadata deserialization failing when keyword extraction produces object arrays instead of simple strings.
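
The general technique for this kind of fix is to accept both shapes when decoding. The sketch below shows it in Python rather than Go, and it rests on assumptions: `normalize_keywords` is a hypothetical helper, and the `text` field name for the object form is a guess made for illustration, not the documented schema.

```python
import json
from typing import Any, List


def normalize_keywords(raw: List[Any]) -> List[str]:
    """Accept keywords given as plain strings or as objects with a text field."""
    keywords = []
    for item in raw:
        if isinstance(item, str):
            keywords.append(item)
        elif isinstance(item, dict) and isinstance(item.get("text"), str):
            # "text" is an assumed field name for the object form.
            keywords.append(item["text"])
        else:
            raise ValueError(f"unsupported keyword entry: {item!r}")
    return keywords


payload = json.loads('["invoice", {"text": "tax", "score": 0.5}]')
keywords = normalize_keywords(payload)
```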

C# Bindings

Fixed keyword extraction data being inaccessible: ExtractedKeywords was marked [JsonIgnore] and was therefore excluded from metadata serialization.

PHP Bindings

Fixed the document, elements, and ocrElements properties being inaccessible on ExtractionResult.
Fixed ExtractionConfig::toArray() not serializing include_document_structure.
Fixed wrapper function names for document extractor management.
Added missing OCR backend management functions.
Fixed page_count metadata key mismatch.

Elixir Bindings

Fixed NIF config parser not forwarding include_document_structure, result_format, output_format, html_options, max_concurrent_extractions, and security_limits options.
Added missing document extractor management NIFs.

CI

Fixed PHP E2E tests not actually running in CI.

Changed

Build System

Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
Bumped vendored Tesseract from 5.5.1 to 5.5.2.
Bumped vendored Leptonica from 1.86.0 to 1.87.0.

Removed

LibreOffice Dependency

LibreOffice is no longer required: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB.

msg_parser Dependency

Replaced msg_parser crate with direct CFB parsing for MSG extraction.

Guten OCR Backend

Removed all references to the unused Guten OCR backend from Node.js and PHP bindings.
