R language bindings: Added kreuzberg R package via extendr with full extraction API (sync/async, batch, bytes), typed error conditions, S3 result class with accessors, config discovery, OCR/chunking configuration, plugin system, and 32 documentation snippets.
PHP async extraction: Non-blocking extraction via DeferredResult pattern with Tokio thread pool. Includes extractFileAsync(), extractBytesAsync(), batchExtractFilesAsync(), batchExtractBytesAsync() across OOP, procedural, and static APIs. Framework bridges for Amp v3+ (AmpBridge) and ReactPHP (ReactBridge).
C FFI distribution: Official C shared library (libkreuzberg) with cbindgen-generated header, cmake packaging (find_package(kreuzberg)), pkg-config support, and prebuilt binaries for Linux x86_64/aarch64, macOS arm64, and Windows x86_64. Includes full API reference documentation and test coverage.
Go FFI bindings: Go package (packages/go/v4) consuming the C FFI shared library with prebuilt binaries published as GitHub release assets for all four platforms.
C as 13th e2e test language: The e2e-generator now produces C test files exercising the FFI API, with 15 passing test cases.
R distribution via r-universe: Switched R package distribution from CRAN to r-universe for faster release cycles and easier native compilation.
WASM native OCR (ocr-wasm feature): Tesseract OCR compiled directly into the WASM binary via kreuzberg-tesseract, enabling OCR in all environments (Browser, Node.js, Deno, Bun) without browser-specific APIs. Supports 43 languages with tessdata downloaded from CDN into memory.
WASM Node.js/Deno PDFium support: PDFium initialization now works in Node.js and Deno by loading the WASM module from the filesystem. Configurable via KREUZBERG_PDFIUM_PATH environment variable.
WASM full-feature build: OCR, Excel, and archive extraction are now enabled by default in the WASM package. All wasm-pack build targets include the ocr-wasm feature.
WASM Excel extraction (excel-wasm feature): Calamine-based Excel/spreadsheet extraction available in WASM without requiring Tokio runtime.
WASM archive extraction: ZIP, TAR, 7z, and GZIP archive extraction now available in WASM via synchronous extractor implementations.
WASM PDF annotations: PDF annotations (text notes, highlights, links, stamps) are now exposed in the WASM TypeScript API via the annotations field on ExtractionResult.
Fixed
DOCX equations not extracted: OMML math content was completely ignored by the DOCX parser, causing all equation text to be silently dropped. Math runs are now extracted as regular text.
DOCX line breaks ignored: <w:br/> elements were not handled, causing adjacent text segments to merge. Line breaks now insert whitespace.
PPTX/PPSX table content lost: Tables were rendered as HTML without whitespace between tags, causing the entire table to tokenize as a single unreadable blob. Tables now render as markdown pipe tables with proper cell separation.
PPTX/PPSX/PPTM image markers pollute text: Image references injected spurious numeric tokens into extracted content. Image markers now use a clean ![image]() format.
EPUB double-lossy conversion: XHTML content was converted through an XHTML-to-markdown-to-plain-text pipeline, losing content at each stage. Replaced with direct roxmltree traversal that extracts text content from XHTML elements without intermediate markdown.
Excel float formatting drops numeric precision: format_cell_to_string() formatted whole-number floats as "1.0" instead of "1", causing numeric token mismatches in quality scoring.
HTML metadata extraction pollutes content: The extract_metadata option was left enabled, causing YAML frontmatter to be prepended to the content string. Set extract_metadata = false in the metadata extraction path.
Markdown extractor loses tokens through AST reconstruction: The extractor now returns raw text content directly (after frontmatter extraction) while still parsing the AST for table and image extraction.
SVG text extraction includes element prefixes: SVG extraction now targets only text-bearing elements without prefixes.
XML ground truth uses raw source: Regenerated all 20 ground truth files.
Elixir benchmark UTF-8 locale: Erlang VM running with latin1 native encoding corrupted UTF-8 strings from Rust NIFs. Added ERL_LIBS path configuration.
WASM OCR not working (enableOcr() regression): The function now bridges both JS-side and Rust-side registries so OCR works end-to-end.
WASM tessdata CDN URL returns 404: Updated to use the official tesseract-ocr/tessdata_fast GitHub repository.
XML UTF-16 parsing fails on files with odd byte count: The decoder now truncates to the nearest even byte boundary.
R bindings crash on strings with embedded NUL bytes: NUL bytes are now stripped before passing strings to R.
R bindings %||% operator incompatible with R < 4.4: Added a package-local polyfill for backwards compatibility.
API returns HTTP 500 for unsupported file formats (#414): UnsupportedFormat errors are now mapped to HTTP 400 with a clear UnsupportedFormatError response.
PDF markdown extraction missing headings/bold for flat structure trees (#391): Pages with font size variation but no heading tags are now enriched via K-means font-size clustering.
PaddleOCR backend not found when using backend="paddleocr" (#403): The OCR backend registry now resolves the "paddleocr" alias to the canonical "paddle-ocr" name.
WASM metadata serialization: Switched from serde_wasm_bindgen to serde_json + JSON.parse() for output serialization.
WASM config deserialization: Config keys are now converted to snake_case before passing to the WASM boundary.
WASM PDFium module loading: The build script now locates and copies the actual PDFium ESM module from the Cargo build output.
Email header extraction loses display names: From, To, CC, and BCC fields now use "Display Name" <email@example.com> format when a display name is available.
Email date header normalized to RFC 3339: The extractor now preserves the raw Date header value and falls back to RFC 3339 normalization only when the raw value is unavailable.
Docker builds fail due to missing snippet-runner exclusion: Added snippet-runner to the sed exclusion patterns in all three Dockerfiles.
WASM Deno e2e tests skip OCR fixtures: The e2e generator now calls enableOcr() after initWasm() in every generated test file.
WASM Deno e2e tests ignore pages config: Added mapPageConfig() to the test helper template.
C FFI NULL callback crash: Reject NULL callback function pointers in plugin registration to prevent segfaults.
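Two of the fixes above are small enough to sketch: the UTF-16 even-boundary truncation and the whole-number float formatting. The Python below is an illustration only, not the project's Rust code. The function names (truncate_to_even, decode_utf16_lenient, format_float_cell) are hypothetical, and reading the byte order from the BOM is an assumption about how the real decoder behaves.

```python
import math


def truncate_to_even(data: bytes) -> bytes:
    """Drop a trailing odd byte so the buffer holds only whole 16-bit code units."""
    return data[: len(data) - (len(data) % 2)]


def decode_utf16_lenient(data: bytes) -> str:
    """Decode UTF-16 input that may carry one stray trailing byte.

    Assumption: byte order is taken from the BOM. Python's "utf-16" codec
    falls back to the platform's native order when no BOM is present.
    """
    return truncate_to_even(data).decode("utf-16")


def format_float_cell(value: float) -> str:
    """Format a spreadsheet float, writing whole numbers without a trailing ".0"."""
    if math.isfinite(value) and value.is_integer():
        return str(int(value))
    # Non-integral values (and nan/inf) keep Python's shortest round-trip repr.
    return repr(value)


if __name__ == "__main__":
    # A BOM-prefixed payload with one extra byte appended.
    payload = "<a/>".encode("utf-16") + b"\x00"
    print(decode_utf16_lenient(payload))
    print(format_float_cell(1.0), format_float_cell(2.5))
```

Without the truncation step, Python's strict decoder raises UnicodeDecodeError on the odd-length payload. Dropping the final byte loses at most half a code unit, which could never have decoded to a character anyway.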
Removed
polars dependency: Removed unused polars crate and table_from_arrow_to_markdown dead code from the excel feature.