is_blank field on PageInfo and PageContent: Pages with fewer than 3 non-whitespace characters and no tables or images are flagged as blank. Detection uses a two-phase approach: text-only analysis during extraction, then refinement after table/image assignment. Available across all 9 language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir, WASM). Closes #378.
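The two-phase check described above can be sketched roughly as follows. This is an illustration only, not Kreuzberg's implementation: the `Page` record and both function names are hypothetical, and only the threshold (fewer than 3 non-whitespace characters) and the table/image condition come from the entry itself.

```python
from dataclasses import dataclass

# Threshold taken from the entry above: fewer than 3 non-whitespace characters.
MIN_NON_WHITESPACE_CHARS = 3


@dataclass
class Page:
    """Hypothetical page record; not Kreuzberg's actual PageInfo/PageContent."""
    text: str
    table_count: int = 0
    image_count: int = 0
    is_blank: bool = False


def text_looks_blank(text: str) -> bool:
    """Phase 1: text-only check that can run during extraction."""
    non_whitespace = sum(1 for ch in text if not ch.isspace())
    return non_whitespace < MIN_NON_WHITESPACE_CHARS


def refine_blank_flag(page: Page) -> Page:
    """Phase 2: once tables and images are assigned, a page holding any is not blank."""
    page.is_blank = (
        text_looks_blank(page.text)
        and page.table_count == 0
        and page.image_count == 0
    )
    return page
```

Under these assumptions, a page whose text is empty but which holds one image passes phase 1 and is then cleared in phase 2.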
PaddleOCR Backend
PaddleOCR backend via ONNX Runtime: New OCR backend (kreuzberg-paddle-ocr) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
PaddleOCR support in all bindings: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the paddle-ocr feature flag.
PaddleOCR CLI support: The kreuzberg-cli binary supports --ocr-backend paddle-ocr for PaddleOCR extraction.
Unified OCR Element Output
Structured OCR element data: Extraction results now include OcrElement data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
Shared ONNX Runtime Discovery
ort_discovery module: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.
Document Structure Output
DocumentStructure support across all bindings: Added structured document output with include_document_structure configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.
Native DOC/PPT Extraction
OLE/CFB-based extraction: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.
musl Linux Support
Re-enabled musl targets: Added x86_64-unknown-linux-musl and aarch64-unknown-linux-musl targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. This removes the glibc 2.38+ requirement that kept prebuilt CLI binaries from running on older distros such as Ubuntu 22.04 (#364).
Fixed .msg (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the msg_parser crate with direct OLE/CFB parsing using the cfb crate; attachment binary data is now read directly, without hex-encoding overhead.
Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.
Rotated PDF Text Extraction
Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips /Rotate entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.
CSV and Excel Extraction Quality
Fixed CSV extraction producing near-zero quality scores (0.024). The extractor now outputs proper delimited text instead of debug format.
Improved XML text extraction for namespaced elements, CDATA sections, and mixed content, which raises quality scores.
WASM Table Extraction
Fixed WASM adapter not recognizing page_number field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.
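One way to guard against this kind of naming mismatch is to look a field up under both spellings. The sketch below shows the idea in Python; the real adapter is not written in Python, and `read_field` is a hypothetical helper rather than part of any Kreuzberg binding.

```python
from typing import Any, Mapping


def read_field(record: Mapping[str, Any], snake_name: str, default: Any = None) -> Any:
    """Return a field by its snake_case name, falling back to the camelCase spelling."""
    if snake_name in record:
        return record[snake_name]
    # Derive the camelCase spelling, e.g. "page_number" becomes "pageNumber".
    head, *rest = snake_name.split("_")
    camel_name = head + "".join(part.capitalize() for part in rest)
    return record.get(camel_name, default)
```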
Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers.
Fixed heading hierarchy: the Title style maps to #, Heading1 to ##, and so on, with Heading5 and deeper clamped at ######.
Fixed bullet lists, numbered lists, and nested list indentation.
Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
Fixed table cell formatting being stripped: bold and italic inside table cells are now preserved.
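The heading mapping among the DOCX fixes above can be expressed as a small function. This is a sketch under the assumption that style names arrive exactly as "Title" and "HeadingN"; the function name is hypothetical and the real extractor may resolve styles differently.

```python
import re
from typing import Optional

MAX_MARKDOWN_HEADING_LEVEL = 6


def heading_prefix(style_name: str) -> Optional[str]:
    """Map a DOCX style name to a markdown heading prefix, or None for other styles."""
    if style_name == "Title":
        return "#"
    match = re.fullmatch(r"Heading(\d+)", style_name)
    if match is None:
        return None
    # Heading1 sits one level below Title, so HeadingN gets N + 1 hashes, capped at 6.
    level = min(int(match.group(1)) + 1, MAX_MARKDOWN_HEADING_LEVEL)
    return "#" * level
```

Clamping keeps the output valid markdown, since markdown headings stop at six levels.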
Fixed Typst extract_table_content double-counting the opening parenthesis, which caused the table parser to consume all remaining document content after a #table() call.
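To illustrate why the count matters: a depth counter that starts at the opening parenthesis must count it exactly once, otherwise the depth never returns to zero and the scan runs to the end of the document. The following is a simplified sketch, not the actual extract_table_content code; `find_matching_paren` is a hypothetical name, and the sketch ignores parentheses inside strings or content blocks.

```python
def find_matching_paren(source: str, open_index: int) -> int:
    """Return the index of the ')' closing the '(' at open_index, or -1 if unbalanced."""
    if source[open_index] != "(":
        raise ValueError("open_index must point at '('")
    depth = 0
    for index in range(open_index, len(source)):
        char = source[index]
        if char == "(":
            depth += 1  # the opening parenthesis is counted here, exactly once
        elif char == ")":
            depth -= 1
            if depth == 0:
                return index
    return -1
```

Starting `depth` at 1 while also scanning from `open_index` would reproduce the double count: the final `)` would leave the depth at 1 and the function would never report a match.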
PaddleOCR Recognition Model
Fixed PaddleOCR recognition model failing to load with ShapeInferenceError on ONNX Runtime 1.23.x.
Fixed incorrect detection model filename in Docker and CI action.
Python Bindings
Fixed OcrConfig constructor silently ignoring paddle_ocr_config and element_config keyword arguments.
Fixed keyword extraction results being silently dropped in Python bindings. Closes #379.
TypeScript/Node.js Bindings
Fixed PaddleOCR config and element config being silently dropped by the NAPI-RS binding layer.
Fixed ocr_elements missing from extraction result conversion.
Ruby Bindings
Fixed kreuzberg-pdfium-render vendored crate not included in gemspec.
Fixed PaddleOCR config and element config not being parsed in Ruby binding config layer.
Fixed ocr_elements missing from Ruby extraction result conversion.
Go Bindings
Fixed PdfMetadata deserialization failing when keyword extraction produces object arrays instead of simple strings.
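A tolerant reader for such a mixed keyword list might look like the sketch below. It is written in Python for illustration rather than Go, and the object shape is an assumption: the `text` key is a hypothetical field name that the source does not confirm.

```python
from typing import Any, Iterable, List


def keyword_texts(raw_keywords: Iterable[Any], text_key: str = "text") -> List[str]:
    """Flatten a keyword list whose items are plain strings or objects into strings."""
    texts: List[str] = []
    for item in raw_keywords:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and isinstance(item.get(text_key), str):
            # "text" is an assumed field name, not confirmed by the source.
            texts.append(item[text_key])
        else:
            raise TypeError(f"unsupported keyword entry: {item!r}")
    return texts
```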
C# Bindings
Fixed keyword extraction data being inaccessible: ExtractedKeywords was marked [JsonIgnore] and therefore excluded from metadata serialization.
PHP Bindings
Fixed document, elements, and ocrElements properties being inaccessible on ExtractionResult.
Fixed ExtractionConfig::toArray() not serializing include_document_structure.
Fixed wrapper function names for document extractor management.
Added missing OCR backend management functions.
Fixed page_count metadata key mismatch.
Elixir Bindings
Fixed NIF config parser not forwarding include_document_structure, result_format, output_format, html_options, max_concurrent_extractions, and security_limits options.
Added missing document extractor management NIFs.
CI
Fixed PHP E2E tests not actually running in CI.
Changed
Build System
Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
Bumped vendored Tesseract from 5.5.1 to 5.5.2.
Bumped vendored Leptonica from 1.86.0 to 1.87.0.
Removed
LibreOffice Dependency
LibreOffice is no longer required: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800 MB.
msg_parser Dependency
Replaced msg_parser crate with direct CFB parsing for MSG extraction.
Guten OCR Backend
Removed all references to the unused Guten OCR backend from Node.js and PHP bindings.