OpenAPI schema for /extract endpoint: Implemented utoipa::ToSchema on all extraction result types (ExtractionResult, Metadata, Chunk, ExtractedImage, Element, DjotContent, PageContent, Table, and all nested types), enabling full OpenAPI documentation for the extraction endpoint
Unified ChunkingConfig: Merged internal chunking config into a single ChunkingConfig struct with canonical field names (max_characters, overlap) and serde aliases (max_chars, max_overlap) for backwards compatibility. Added trim and chunker_type fields. ChunkerType enum is no longer feature-gated behind chunking
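The alias behaviour described above can be illustrated with a small, language-neutral sketch. This is not Kreuzberg's implementation (the changelog says the real one uses serde aliases on the Rust struct); the `from_dict` helper, the `ALIASES` table, and the default values are hypothetical and exist only to show how legacy keys can map onto the canonical field names. Rejecting input that supplies both spellings of a field is a design choice of this sketch.

```python
from dataclasses import dataclass

# Legacy key -> canonical key, using the names listed in the changelog entry.
ALIASES = {"max_chars": "max_characters", "max_overlap": "overlap"}


@dataclass
class ChunkingConfig:
    # Placeholder defaults for illustration; not the library's actual defaults.
    max_characters: int = 1000
    overlap: int = 200

    @classmethod
    def from_dict(cls, raw: dict) -> "ChunkingConfig":
        """Build a config, accepting either canonical or legacy key names."""
        normalized = {}
        for key, value in raw.items():
            canonical = ALIASES.get(key, key)
            if canonical in normalized:
                # Both spellings of the same field were supplied.
                raise ValueError(f"duplicate field: {canonical}")
            normalized[canonical] = value
        return cls(**normalized)


legacy = ChunkingConfig.from_dict({"max_chars": 500, "max_overlap": 50})
current = ChunkingConfig.from_dict({"max_characters": 500, "overlap": 50})
```

Both calls produce the same configuration, which is the point of keeping the old names as aliases: existing config files keep working while new ones use the canonical names.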
OCR
KREUZBERG_OCR_LANGUAGE="all" support: Setting the language to "all" or "*" automatically detects and uses all installed Tesseract languages from the tessdata directory, eliminating manual enumeration (#344)
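A rough sketch of how such detection could work, assuming the usual Tesseract layout in which each installed language is a `<code>.traineddata` file in the tessdata directory and multiple languages are joined with `+`. The function name `resolve_ocr_languages` is hypothetical, and skipping `osd` (orientation and script detection data) is an assumption of this sketch, not something the changelog states.

```python
import tempfile
from pathlib import Path


def resolve_ocr_languages(setting: str, tessdata_dir: Path) -> str:
    """Expand "all" or "*" into a '+'-joined list of installed languages."""
    if setting not in ("all", "*"):
        # An explicit language string is passed through unchanged.
        return setting
    # Each installed language model is stored as <code>.traineddata.
    langs = sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))
    # Assumption: "osd" is detection data, not a recognition language.
    langs = [lang for lang in langs if lang != "osd"]
    return "+".join(langs)


# Demonstration against a throwaway directory with fake model files.
with tempfile.TemporaryDirectory() as tmp:
    tessdata = Path(tmp)
    for name in ("eng.traineddata", "deu.traineddata", "osd.traineddata", "README"):
        (tessdata / name).touch()
    detected = resolve_ocr_languages("all", tessdata)   # "deu+eng"
    explicit = resolve_ocr_languages("eng", tessdata)   # "eng"
```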
Fixed
Ruby Bindings
Cow<'static, str> type conversions: Fixed Magnus bindings to properly convert Cow<'static, str> fields (mime_type, format, colorspace) using .as_ref() instead of passing directly to FFI methods
Vendor workspace bytes dependency: Added bytes to the Ruby vendor workspace Cargo.toml via the vendoring script, fixing workspace dependency resolution failures
Tempfile GC in batch test: Kept Tempfile references alive in batch_operations_spec.rb to prevent garbage collection before batch_extract_files_sync reads them
Python Bindings
Runtime ExtractedImage import: Defined ExtractedImage, Metadata, OutputFormat, and ResultFormat as Python-level runtime types instead of importing from compiled Rust bindings (these are stub-only types, not #[pyclass] exports)
Overhauled _internal_bindings.pyi type stubs: Exhaustive audit against Rust source to ensure all types, fields, and optionality match exactly
Removed duplicate types.py: Deleted kreuzberg/types.py which contained 43 duplicate type definitions conflicting with _internal_bindings.pyi
Consolidated duplicate test files: Merged unique tests from test_embeddings_advanced.py, test_images_extraction.py, test_tables_extraction.py into their canonical counterparts and deleted the duplicates
C# Bindings
Attributes deserialization on ARM64: Added AttributesDictionaryConverter to handle both array-of-arrays and object JSON formats for LinkMetadata.Attributes and HtmlImageMetadata.Attributes
Overhauled type definitions from audit against Rust source
Fixed keyword deserialization: Properly discriminate between simple string keywords and extracted keyword objects
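The two deserialization fixes above both amount to accepting more than one JSON shape for the same field. The sketch below shows the idea in Python rather than C#, under stated assumptions: the "array-of-arrays" format is taken to mean a list of `[key, value]` pairs, and the keyword object's `text` and `score` fields are hypothetical names chosen for illustration. It does not reproduce the actual `AttributesDictionaryConverter`.

```python
import json


def normalize_attributes(value):
    """Accept {"k": "v"} or [["k", "v"], ...] and return a plain dict."""
    if value is None:
        return {}
    if isinstance(value, dict):
        return dict(value)
    if isinstance(value, list):
        # Assumed shape: each inner list is a [key, value] pair.
        return {key: val for key, val in value}
    raise TypeError(f"unsupported attributes shape: {type(value).__name__}")


def normalize_keyword(value):
    """Accept a bare string or a keyword object and return a uniform dict."""
    if isinstance(value, str):
        return {"text": value, "score": None}
    if isinstance(value, dict):
        # "text" and "score" are hypothetical field names for this sketch.
        return {"text": value["text"], "score": value.get("score")}
    raise TypeError(f"unsupported keyword shape: {type(value).__name__}")


as_object = normalize_attributes(json.loads('{"href": "/a", "rel": "nofollow"}'))
as_pairs = normalize_attributes(json.loads('[["href", "/a"], ["rel", "nofollow"]]'))
keywords = [
    normalize_keyword(k)
    for k in json.loads('["pdf", {"text": "ocr", "score": 0.9}]')
]
```

Discriminating on the decoded value's type, rather than trying one format and catching a failure, keeps the two cases explicit and makes an unexpected third shape fail loudly.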
Java Bindings
Test timeout prevention: Added @Timeout(60) to all concurrency and async test methods
Surefire timeout reduction: Reduced forkedProcessTimeoutInSeconds from 3600s to 600s
Overhauled type definitions from audit against Rust source
TypeScript Bindings
Overhauled type definitions from audit against NAPI-RS Rust source
PHP Bindings
Overhauled type definitions from audit against ext-php-rs Rust source
Go Bindings
Overhauled type definitions from audit against Rust source
Consolidated config tests
Elixir Bindings
Overhauled all struct types from audit against Rust source: Exhaustive audit of every Elixir struct against the Rust core types to ensure field-level correctness