v0.3.46 | extraction quality, raw RGBA output, JBIG2 decode, editor fixes, and FIPS CI hardening.
Added
- Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering
  (#446, #481): page.render_pixmap() (Python), renderToPixmap()
  (Node.js / Go), and Page.RenderToRgba() (C#) expose the premultiplied
  RGBA8888 buffer directly from tiny_skia::Pixmap::data(), eliminating
  the encode→decode roundtrip for callers that need raw pixels (PIL,
  sharp, System.Drawing.Bitmap, image.RGBA). Downscaling is now
  SIMD-accelerated via fast_image_resize (ARM NEON, x86 AVX2), replacing
  the previous bilinear path. Concurrent render_* calls on the same
  PdfDocument are now safe: all rendering functions take &PdfDocument
  (shared reference) and all interior-mutable state is already guarded
  by a per-field Mutex, so the FFI layer no longer produces aliased &mut
  references and concurrent renders run without a global serialisation
  bottleneck. Requested by @mara004 and @potatochipcoconut.
- ConversionOptions::exclude_regions / include_region (#484): New
  spatial filtering fields allow callers to exclude rectangular regions
  from extraction output or restrict extraction to a single bounding
  rectangle. Backed by the SpatialCollectionFiltering trait methods
  filter_by_rect / exclude_rects.
- PageFontStats (#484): New layout::PageFontStats struct computed in
  O(n) over spans; exposes dominant_em, dominant_line_height,
  dominant_char_width, and body_font_name. All layout heuristics now
  derive absolute thresholds from these measurements instead of
  hardcoded constants, improving correctness across a wider range of
  font sizes.
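Because the raw buffer is premultiplied, callers that pass it to an API expecting straight (non-premultiplied) alpha may need to convert it first. The following is a minimal sketch of that conversion, assuming a tightly packed RGBA8888 byte slice; the function name unpremultiply_rgba8888 is hypothetical and is not part of pdf_oxide.

```rust
/// Convert a premultiplied RGBA8888 buffer to straight alpha, in place.
/// Hypothetical helper for illustration; not a pdf_oxide API.
/// Trailing bytes that do not form a full 4-byte pixel are left untouched.
pub fn unpremultiply_rgba8888(buf: &mut [u8]) {
    for px in buf.chunks_exact_mut(4) {
        let a = px[3] as u32;
        if a == 0 {
            // Fully transparent: colour is undefined, so clear it.
            px[0] = 0;
            px[1] = 0;
            px[2] = 0;
        } else if a < 255 {
            for c in &mut px[..3] {
                // Divide by alpha with rounding, then clamp to the u8 range.
                let v = (*c as u32 * 255 + a / 2) / a;
                *c = v.min(255) as u8;
            }
        }
        // a == 255: premultiplied and straight values are identical.
    }
}
```

Opaque pixels pass through unchanged, so for pages rendered onto an opaque background the conversion is effectively a no-op.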
Fixed
- JBIG2-compressed scanner PDFs render as blank pages (#332): The
  pass-through Jbig2Decoder returned compressed bytes unchanged, causing
  a dimension mismatch and a silent image drop. This release integrates
  hayro-jbig2 v0.3 (pure-Rust, Apache-2.0 OR MIT); embedded JBIG2
  bitstreams are decoded via hayro_jbig2::Image::new_embedded, with
  JBIG2Globals loaded from /DecodeParms when present. BitsPerComponent
  is overridden to 8 post-decode so to_dynamic_image() does not attempt
  CCITT bilevel decompression of already-decoded pixels. Reported by
  @frederikhors, who also confirmed the original vertical-flip /
  glyph-substitution symptom is resolved in v0.3.45.
- add_text on existing PDF produces blank or discarded content (#483):
  DocumentEditor::add_text on a page of an existing PDF either blanked
  the page or (when combined with select_pages) silently returned the
  unmodified original. Root causes: the storage-side page-index mapping
  after select_pages was off by one, and add_text failed to preserve the
  existing content stream when writing the new text layer. Both are
  fixed; an end-to-end regression suite is added. Reported by
  @stephenjudkins.
- Text extraction corpus quality improvements across 166 PDFs (#484):
  Systematic audit driven by @Goldziher's calibrated 166-document corpus
  (the kreuzberg test suite), which provides per-document ground-truth
  .txt files and a word-F1 harness. Multiple extraction failures were
  identified and fixed:
  - Newline/CR-only spans treated as line breaks: Spans consisting
    entirely of \n or \r bytes are now emitted as a single newline
    rather than verbatim byte sequences, eliminating spurious blank
    lines from some PDF generators.
  - Annotation text double-emitted: append_non_widget_annotation_text
    was called after the main span assembly pass even though
    annotation_content_spans() already inlined annotation /Contents
    into the span list. The redundant call is removed.
  - Markup annotation /Contents correctly filtered: Per ISO 32000-1
    §12.5.6.2, /Contents on Highlight, Underline, StrikeOut, Squiggly,
    Caret, Ink, FileAttachment, and Redact annotations is popup/tooltip
    text, not page content. These subtypes are now excluded from
    annotation_content_spans and append_non_widget_annotation_text.
  - No space inserted between adjacent CJK characters:
    should_insert_space now returns false when both the trailing and
    leading characters are CJK (Hiragana, Katakana, CJK Unified
    Ideographs, Hangul, CJK Extension B).
  - Unicode ligatures preserved; adjacent CJK spans merged: Latin
    ligatures (U+FB00–U+FB06) are now preserved in the span stream
    rather than dropped. Adjacent CJK spans from the same run are merged
    into a single span, eliminating inter-character noise.
  - Lower→upper CID range boundary split restored: The CID range
    boundary split now consistently applies the lower→upper ordering
    correction that was accidentally dropped; the fix propagates to
    Markdown and HTML output paths.
  - Non-adjacent subscript/superscript spans merged:
    merge_sub_superscript_spans handles spans separated by intervening
    content, using em-relative thresholds [-0.1×em, +0.25×em] instead of
    hardcoded absolute values so detection scales with body font size.
  - Column-spanning decimals split at table cell boundaries: Decimal
    numbers that span two adjacent table cells are split at the cell
    boundary rather than merged into a single token.
  - Position-aware space insertion between adjacent MCID spans: Spaces
    between MCID-tagged spans are inserted based on actual rendered
    x-positions rather than unconditionally inserted or omitted.
  - Boundary split on letter→digit transition only:
    char_widths_boundary_split now splits only at a letter-to-digit
    boundary (e.g. Theorem1), removing false splits on UpperCamelCase
    terms that previously broke word-shape heuristics.
  - Same-line threshold formula fixed: same_line_threshold now uses
    (min_fs × 1.2).max(max_fs × 0.3), handling mixed-size lines
    (heading + caption on the same line) without cliff effects.
  - Bare-word identifiers and corrupt StructTreeRoot handled: The parser
    now tolerates bare-word tokens as dictionary values; a corrupt or
    absent StructTreeRoot no longer aborts extraction.
  - Standard-14 font matching strips SUBSET+ prefix; accepts canonical
    PostScript aliases: Per ISO 32000-1 §9.6.2.2 Annex D, standard font
    names are matched after stripping any ABCDEF+ prefix.
    HelveticaOblique (no hyphen) is now accepted alongside
    Helvetica-Oblique.
  - Explicit /DW tracked in FontInfo: a new bool field, has_explicit_dw,
    is added; has_explicit_widths() returns true when /DW is explicitly
    present, enabling correct width lookup for CIDFonts that declare
    only /DW (no /W array).
  - CIDFont width fallback corrected: When /DW is absent and a CID is
    not in the /W array, get_glyph_width now falls through to
    default_width rather than cid_default_width, matching real-world PDF
    behaviour.
  - Word extractor honours split_boundary_before: Words that straddle a
    table-cell or column boundary are no longer merged.
  - Ligature expansion option: ConversionOptions gains
    expand_ligatures: bool (default false). When enabled, Latin
    ligatures (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st) are expanded
    to component letters.
  - Extraction warnings API: PdfDocument::warnings() (clones) and
    take_warnings() (drains) expose non-fatal extraction warnings
    (missing MCIDs, encrypted-PDF fallback) accumulated during a run.
- Same-line span reorder: x-gap validation guard (#413). After the
  row-aware sort, mixed-baseline glyphs (superscripts, subscripts) could
  appear before their base glyphs. The reorder_same_line_runs helper now
  validates that a candidate run is horizontally contiguous before
  X-sorting it; runs with a large X gap are left in row-aware order,
  preventing disjoint footer/header content from being collapsed into a
  fake same-line sequence. Fixes "8th" ordering (was "th8"). Contributed
  by @RolandWArnold in PR #413.
- Layout word-merge O(n²) → O(n): The word-merge pass previously
  re-scanned the entire accumulator for every candidate span; it is now
  O(n) via an index map.
- Wide spatial false-positive tables rejected via dense-row-ratio: Table
  detection now computes the fraction of rows with dense (≥50%) column
  coverage and rejects candidates below the threshold, eliminating false
  positives on wide but sparsely populated layouts.
- Bare-identifier lexer leniency confined to dict-value position: The
  lexer's tolerance for bare (unquoted) name-like tokens is now
  restricted to dictionary value positions, preventing mis-tokenisation
  of content streams where the same byte sequences are valid operators.
- Typographic Unicode spaces normalised in extracted spans:
  Non-breaking, thin, en, em, and other Unicode space variants in span
  text are normalised to ASCII space before the word-spacing heuristics
  run, eliminating invisible gaps in the extracted output.
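Three of the heuristics described in this section are small enough to sketch as standalone functions. The function names mirror those in the notes, but the signatures and bodies below are assumptions made for illustration, not the library's actual code: the gap_wide_enough parameter is hypothetical, the exact set of space characters is a guess, and how the threshold is compared against baseline distance is not stated in the notes.

```rust
/// Same-line threshold using the formula quoted in the notes:
/// (min_fs × 1.2).max(max_fs × 0.3). Inputs are the smaller and larger
/// font size of the two spans being compared.
pub fn same_line_threshold(min_fs: f32, max_fs: f32) -> f32 {
    (min_fs * 1.2).max(max_fs * 0.3)
}

/// True for the scripts named in the notes; ranges are the Unicode blocks.
pub fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x309F         // Hiragana
        | 0x30A0..=0x30FF       // Katakana
        | 0x4E00..=0x9FFF       // CJK Unified Ideographs
        | 0xAC00..=0xD7AF       // Hangul Syllables
        | 0x20000..=0x2A6DF     // CJK Unified Ideographs Extension B
    )
}

/// Simplified stand-in covering only the CJK rule: never insert a space
/// between two CJK characters. `gap_wide_enough` is a hypothetical input
/// standing for whatever geometric test applies to other character pairs.
pub fn should_insert_space(prev: char, next: char, gap_wide_enough: bool) -> bool {
    if is_cjk(prev) && is_cjk(next) {
        return false;
    }
    gap_wide_enough
}

/// Map typographic space variants to ASCII space.
/// The set of characters handled here is an assumption.
pub fn normalise_spaces(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{00A0}'                  // no-break space
            | '\u{2000}'..='\u{200A}'   // en quad .. hair space (en, em, thin)
            | '\u{202F}'                // narrow no-break space
            | '\u{205F}'                // medium mathematical space
            | '\u{3000}' => ' ',        // ideographic space
            other => other,
        })
        .collect()
}
```

For a heading at 24 pt next to a caption at 8 pt, the first term of the threshold (8 × 1.2 = 9.6) dominates; for 36 pt next to 6 pt, the second term (36 × 0.3 = 10.8) takes over, which is how the formula avoids a cliff between the two regimes.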
Performance
- Rendering: per-segment font re-parsing eliminated. The text
  rasterizer no longer re-parses font data on every span segment; Arc
  clones across the hot render loop and redundant CJK subsetter
  invocations are also eliminated, reducing CPU time for text-heavy
  pages by 30–60%.
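The notes do not say how the re-parsing was removed. One common way to get this effect is to parse each font once, cache the result, and borrow it for every segment of a span. The sketch below illustrates that general idea only; ParsedFont, FontCache, and draw_span are hypothetical names, and the "parse" and "draw" steps are placeholders.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed font face.
pub struct ParsedFont {
    pub glyph_count: usize,
}

/// Caches parsed fonts by name so each font is parsed at most once.
pub struct FontCache {
    parsed: HashMap<String, ParsedFont>,
    /// Number of times the (expensive) parse step actually ran.
    pub parse_calls: usize,
}

impl FontCache {
    pub fn new() -> Self {
        FontCache { parsed: HashMap::new(), parse_calls: 0 }
    }

    /// Return the cached font, parsing `data` only on first use.
    pub fn get_or_parse(&mut self, name: &str, data: &[u8]) -> &ParsedFont {
        if !self.parsed.contains_key(name) {
            self.parse_calls += 1;
            // Placeholder "parse": a real parser would read the font tables.
            let font = ParsedFont { glyph_count: data.len() };
            self.parsed.insert(name.to_owned(), font);
        }
        &self.parsed[name]
    }
}

/// Resolve the font once per span, then borrow it for every segment
/// instead of re-parsing (or cloning a handle) inside the loop.
/// Returns a placeholder count of glyphs "drawn".
pub fn draw_span(cache: &mut FontCache, name: &str, data: &[u8], segments: &[&str]) -> usize {
    let font = cache.get_or_parse(name, data);
    segments
        .iter()
        .map(|seg| seg.chars().count().min(font.glyph_count))
        .sum()
}
```

Calling draw_span repeatedly with the same font name leaves parse_calls at 1, which is the property the release note describes at a high level.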
Dependencies
- fast_image_resize added (#454): New dependency enabling
  SIMD-accelerated (ARM NEON, x86 AVX2) image downscaling for the
  raw-RGBA render path.
CI
- FIPS release workflow now validates on pull requests:
  release-fips.yml now triggers on PRs to main that touch source,
  language-binding, or workflow files. The full build across all five
  platforms and all four language bindings runs without publishing, so
  the tag push is a pure deployment step after a confirmed-green PR.
- macOS x86_64 FIPS builds moved to free runners: All four
  macos-13-xlarge jobs (paid Intel Larger Runner, causing indefinite
  queue waits on plans without access) now run on macos-latest (free ARM
  runner cross-compiling to x86_64-apple-darwin).
- Cargo registry caching added to all 20 FIPS build jobs: Per-target
  cache keys ($runner_os-$target-fips-cargo-$lock_hash) are restored
  before each build, substantially reducing re-run time on warm caches.
Community contributors
- @RolandWArnold contributed the same-line x-gap validation fix in
  PR #413. Roland diagnosed that reorder_same_line_runs was collapsing
  disjoint footer/header spans into a fake same-line sequence and
  designed the horizontal-contiguity guard that prevents it. The fix
  also correctly handles superscript/subscript ordering ("8th" instead
  of "th8").
- @Goldziher (Na'aman Hirschfeld) filed #484 with a calibrated
  166-document corpus, per-document ground-truth .txt files, and a
  word-F1 harness, providing the systematic test bed that drove the bulk
  of the extraction improvements in this release.
- @stephenjudkins (Stephen Judkins) filed #483 with a minimal,
  precisely scoped reproduction of the add_text regression that made the
  root-cause analysis straightforward.
- @mara004 and @potatochipcoconut requested the raw RGBA pixel buffer
  API in comments on #325 with clear use cases across PIL, sharp,
  System.Drawing.Bitmap, and Go's image.RGBA, and engaged on the
  pixel-format details (premultiplied vs straight alpha, tiny-skia
  format constraints) that shaped the final API design.
- @frederikhors reported the JBIG2 blank-page symptom in a comment on
  #332 and confirmed that both the JBIG2 fix and the earlier
  vertical-flip regression are resolved.
Installation
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop, Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.