v4.4.0

Goldziher released this 28 Feb 08:23

4b12757

Added

R language bindings: Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
PHP async extraction: Non-blocking extraction via DeferredResult pattern with Tokio thread pool. Includes extractFileAsync(), extractBytesAsync(), batchExtractFilesAsync(), batchExtractBytesAsync() across OOP, procedural, and static APIs. Framework bridges for Amp v3+ (AmpBridge) and ReactPHP (ReactBridge).
C FFI distribution: Official C shared library (libkreuzberg) with cbindgen-generated header, cmake packaging (find_package(kreuzberg)), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes full API reference documentation and test coverage.
Go FFI bindings: Go package (packages/go/v4) consuming the C FFI shared library with prebuilt binaries published as GitHub release assets for all four platforms.
C as 13th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.
R distribution via r-universe: Switched R package distribution from CRAN to r-universe for faster release cycles and easier native compilation.
WASM native OCR (ocr-wasm feature): Tesseract OCR compiled directly into the WASM binary via kreuzberg-tesseract, enabling OCR in all environments (Browser, Node.js, Deno, Bun) without browser-specific APIs. Supports 43 languages with tessdata downloaded from CDN into memory.
WASM Node.js/Deno PDFium support: PDFium initialization now works in Node.js and Deno by loading the WASM module from the filesystem. Configurable via KREUZBERG_PDFIUM_PATH environment variable.
WASM full-feature build: OCR, Excel, and archive extraction are now enabled by default in the WASM package. All wasm-pack build targets include the ocr-wasm feature.
WASM Excel extraction (excel-wasm feature): Calamine-based Excel/spreadsheet extraction available in WASM without requiring Tokio runtime.
WASM archive extraction: ZIP, TAR, 7z, and GZIP archive extraction now available in WASM via synchronous extractor implementations.
WASM PDF annotations: PDF annotations (text notes, highlights, links, stamps) are now exposed in the WASM TypeScript API via the annotations field on ExtractionResult.
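
The KREUZBERG_PDFIUM_PATH override mentioned above follows a common pattern: use the environment variable when it is set, otherwise fall back to a default location. The following is a minimal sketch of that precedence in Rust, not kreuzberg's actual loader. The function names and the default path `./pdfium.wasm` are hypothetical; only the variable name comes from these notes.

```rust
use std::path::PathBuf;

/// Name of the override variable mentioned in the release notes.
const PDFIUM_PATH_VAR: &str = "KREUZBERG_PDFIUM_PATH";

/// Hypothetical fallback location; the real default is not stated in the notes.
const DEFAULT_PDFIUM_PATH: &str = "./pdfium.wasm";

/// Pick the override when it is set and non-blank, otherwise the default.
/// Taking the value as a parameter keeps the function pure and easy to test.
fn resolve_pdfium_path(override_value: Option<&str>) -> PathBuf {
    match override_value.map(str::trim) {
        Some(path) if !path.is_empty() => PathBuf::from(path),
        _ => PathBuf::from(DEFAULT_PDFIUM_PATH),
    }
}

/// Convenience wrapper that reads the process environment.
#[allow(dead_code)]
fn pdfium_path_from_env() -> PathBuf {
    let value = std::env::var(PDFIUM_PATH_VAR).ok();
    resolve_pdfium_path(value.as_deref())
}
```

Treating a blank value the same as an unset one is a design choice of this sketch; it avoids trying to load a module from an empty path.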

Fixed

DOCX equations not extracted: OMML math content was completely ignored by the DOCX parser, causing all equation text to be silently dropped. Math runs are now extracted as regular text.
DOCX line breaks ignored: <w:br/> elements were not handled, causing adjacent text segments to merge. Line breaks now insert whitespace.
PPTX/PPSX table content lost: Tables were rendered as HTML without whitespace between tags, causing the entire table to tokenize as a single unreadable blob. Tables now render as markdown pipe tables with proper cell separation.
PPTX/PPSX/PPTM image markers pollute text: Image references injected spurious numeric tokens into extracted content. Image markers now use a clean ![image]() format.
DOCX image markers pollute text: Drawing references injected spurious numeric tokens. Image markers now use the ![alt](image) format.
EPUB double-lossy conversion: XHTML content was converted through an XHTML-to-markdown-to-plain-text pipeline, losing content at each stage. Replaced with direct roxmltree traversal that extracts text content from XHTML elements without intermediate markdown.
Excel float formatting drops numeric precision: format_cell_to_string() formatted whole-number floats as "1.0" instead of "1", causing numeric token mismatches in quality scoring.
HTML metadata extraction pollutes content: The extract_metadata option was left enabled, causing YAML frontmatter to be prepended to the content string. Set extract_metadata = false in the metadata extraction path.
Markdown extractor loses tokens through AST reconstruction: The extractor now returns raw text content directly (after frontmatter extraction) while still parsing the AST for table and image extraction.
SVG text extraction includes element prefixes: SVG extraction now targets only text-bearing elements without prefixes.
XML ground truth uses raw source: Regenerated all 20 ground truth files.
Elixir benchmark UTF-8 locale: Erlang VM running with latin1 native encoding corrupted UTF-8 strings from Rust NIFs. Added ERL_LIBS path configuration.
WASM OCR not working (enableOcr() regression): The function now bridges both JS-side and Rust-side registries so OCR works end-to-end.
WASM tessdata CDN URL returns 404: Updated to use the official tesseract-ocr/tessdata_fast GitHub repository.
XML UTF-16 parsing fails on files with odd byte count: The decoder now truncates to the nearest even byte boundary.
R bindings crash on strings with embedded NUL bytes: NUL bytes are now stripped before passing strings to R.
R bindings %||% operator incompatible with R < 4.4: Added a package-local polyfill for backwards compatibility.
API returns HTTP 500 for unsupported file formats (#414): UnsupportedFormat errors are now mapped to HTTP 400 with a clear UnsupportedFormatError response.
PDF markdown extraction missing headings/bold for flat structure trees (#391): Pages with font size variation but no heading tags are now enriched via K-means font-size clustering.
PaddleOCR backend not found when using backend="paddleocr" (#403): The OCR backend registry now resolves the "paddleocr" alias to the canonical "paddle-ocr" name.
WASM metadata serialization: Switched from serde_wasm_bindgen to serde_json + JSON.parse() for output serialization.
WASM config deserialization: Config keys are now converted to snake_case before passing to the WASM boundary.
WASM PDFium module loading: The build script now locates and copies the actual PDFium ESM module from the Cargo build output.
Email header extraction loses display names: From, To, CC, and BCC fields now use "Display Name" <email@example.com> format when a display name is available.
Email date header normalized to RFC 3339: Now preserves the raw Date header value and only falls back to RFC 3339 normalization when the raw value is unavailable.
Docker builds fail due to missing snippet-runner exclusion: Added snippet-runner to the sed exclusion patterns in all three Dockerfiles.
WASM Deno e2e tests skip OCR fixtures: The e2e generator now calls enableOcr() after initWasm() in every generated test file.
WASM Deno e2e tests ignore pages config: Added mapPageConfig() to the test helper template.
C FFI NULL callback crash: Plugin registration now rejects NULL callback function pointers to prevent segfaults.
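
Three of the fixes above are small normalizations at the string or byte level: whole-number floats printed without a trailing ".0", UTF-16 input truncated to an even byte count before decoding, and NUL bytes stripped from strings. The sketch below illustrates each idea with the Rust standard library only. All three function names are hypothetical and this is not kreuzberg's implementation; in particular, `format_float_cell` is not `format_cell_to_string()`, and little-endian byte order is an assumption of the example.

```rust
/// Format a float cell so whole numbers print as "1" rather than "1.0".
/// Hypothetical helper for illustration only.
fn format_float_cell(value: f64) -> String {
    // Below 2^53 every whole-number f64 converts to i64 exactly.
    const MAX_EXACT: f64 = 9_007_199_254_740_992.0;
    if value.is_finite() && value.fract() == 0.0 && value.abs() < MAX_EXACT {
        (value as i64).to_string()
    } else {
        value.to_string()
    }
}

/// Decode UTF-16 after dropping a trailing odd byte.
/// Little-endian input is an assumption of this sketch.
fn decode_utf16le_truncated(bytes: &[u8]) -> String {
    // Round the length down to the nearest even byte boundary.
    let even_len = bytes.len() - bytes.len() % 2;
    let units: Vec<u16> = bytes[..even_len]
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    // Invalid surrogates become U+FFFD instead of causing an error.
    String::from_utf16_lossy(&units)
}

/// Remove embedded NUL characters before handing a string to a runtime
/// that cannot represent them.
fn strip_nul(input: &str) -> String {
    input.replace('\0', "")
}
```

For example, `format_float_cell(1.0)` yields `"1"` while `format_float_cell(2.5)` stays `"2.5"`, and a five-byte buffer holding "hi" plus one stray byte decodes to `"hi"`.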

Removed

polars dependency: Removed unused polars crate and table_from_arrow_to_markdown dead code from the excel feature.
