Skip to content

v0.3.46 | extraction quality, raw RGBA output, JBIG2 decode, editor fixes, and FIPS CI hardening.

Choose a tag to compare

@github-actions github-actions released this 11 May 05:07
· 219 commits to main since this release
a6c5120

Added

  • Raw RGBA pixel buffer, SIMD downscaling, and thread-safe rendering
    (#446,
    #481)

    page.render_pixmap() (Python), renderToPixmap() (Node.js / Go),
    and Page.RenderToRgba() (C#) expose the premultiplied RGBA8888
    buffer directly from tiny_skia::Pixmap::data(), eliminating the
    encode→decode roundtrip for callers that need raw pixels (PIL,
    sharp, System.Drawing.Bitmap, image.RGBA). Downscaling is now SIMD-accelerated via
    fast_image_resize (ARM NEON, x86 AVX2), replacing the previous
    bilinear path. Concurrent render_* calls on the same
    PdfDocument are now safe: all rendering functions take &PdfDocument
    (shared reference) and all interior-mutable state is already guarded by
    per-field Mutex, so the FFI layer no longer produces aliased &mut
    references and concurrent renders run without a global serialisation
    bottleneck.
    Requested by @mara004 and @potatochipcoconut.

  • ConversionOptions::exclude_regions / include_region
    (#484)
    — New
    spatial filtering fields allow callers to exclude rectangular regions
    from extraction output or restrict extraction to a single bounding
    rectangle. Backed by SpatialCollectionFiltering trait methods
    filter_by_rect / exclude_rects.

  • PageFontStats
    (#484)
    — New
    layout::PageFontStats struct computed in O(n) over spans; exposes
    dominant_em, dominant_line_height, dominant_char_width, and
    body_font_name. All layout heuristics now derive absolute thresholds
    from these measurements instead of hardcoded constants, improving
    correctness across a wider range of font sizes.

Fixed

  • JBIG2-compressed scanner PDFs render as blank pages
    (#332)

    The pass-through Jbig2Decoder returned compressed bytes unchanged,
    causing a dimension mismatch and a silent image drop. Integrates
    hayro-jbig2 v0.3 (pure-Rust, Apache-2.0 OR MIT); embedded JBIG2
    bitstreams are decoded via hayro_jbig2::Image::new_embedded, with
    JBIG2Globals loaded from /DecodeParms when present.
    BitsPerComponent is overridden to 8 post-decode so
    to_dynamic_image() does not attempt CCITT bilevel decompression of
    already-decoded pixels. Reported by @frederikhors, who also confirmed
    the original vertical-flip / glyph-substitution symptom is resolved
    in v0.3.45.

  • add_text on existing PDF produces blank or discarded content
    (#483)

    DocumentEditor::add_text on a page of an existing PDF either blanked
    the page or (when combined with select_pages) silently returned the
    unmodified original. Root causes: the storage-side page-index mapping
    after select_pages was off by one, and add_text failed to preserve
    the existing content stream when writing the new text layer. Both are
    fixed; an end-to-end regression suite is added. Reported by
    @stephenjudkins.

  • Text extraction corpus quality improvements across 166 PDFs
    (#484)

    Systematic audit driven by @Goldziher's calibrated 166-document corpus
    (the kreuzberg test suite),
    which provides per-document ground-truth .txt files and a word-F1
    harness. Multiple extraction failures identified and fixed:

    • Newline/CR-only spans treated as line breaks — Spans consisting
      entirely of \n or \r bytes are now emitted as a single newline
      rather than verbatim byte sequences, eliminating spurious blank lines
      from some PDF generators.
    • Annotation text double-emittedappend_non_widget_annotation_text
      was called after the main span assembly pass even though
      annotation_content_spans() already inlined annotation /Contents
      into the span list. The redundant call is removed.
    • Markup annotation /Contents correctly filtered — Per ISO
      32000-1 §12.5.6.2, /Contents on Highlight, Underline, StrikeOut,
      Squiggly, Caret, Ink, FileAttachment, and Redact annotations is
      popup/tooltip text, not page content. These subtypes are now excluded
      from annotation_content_spans and append_non_widget_annotation_text.
    • No space inserted between adjacent CJK characters
      should_insert_space now returns false when both the trailing and
      leading characters are CJK (Hiragana, Katakana, CJK Unified
      Ideographs, Hangul, CJK Extension B).
    • Unicode ligatures preserved; adjacent CJK spans merged — Latin
      ligatures (U+FB00–U+FB06) are now preserved in the span stream
      rather than dropped. Adjacent CJK spans from the same run are merged
      into a single span, eliminating inter-character noise.
    • Lower→upper CID range boundary split restored — The CID range
      boundary split now consistently applies the lower→upper ordering
      correction that was accidentally dropped; the fix propagates to
      Markdown and HTML output paths.
    • Non-adjacent subscript/superscript spans merged
      merge_sub_superscript_spans handles spans separated by intervening
      content, using em-relative thresholds [-0.1×em, +0.25×em] instead
      of hardcoded absolute values so detection scales with body font size.
    • Column-spanning decimals split at table cell boundaries
      Decimal numbers that span two adjacent table cells are split at the
      cell boundary rather than merged into a single token.
    • Position-aware space insertion between adjacent MCID spans
      Spaces between MCID-tagged spans are inserted based on actual
      rendered x-positions rather than always or never.
    • Boundary split on letter→digit transition only
      char_widths_boundary_split now splits only at a letter-to-digit
      boundary (e.g. Theorem1), removing false splits on UpperCamelCase
      terms that previously broke word-shape heuristics.
    • Same-line threshold formula fixedsame_line_threshold now
      uses (min_fs × 1.2).max(max_fs × 0.3), handling mixed-size lines
      (heading + caption on the same line) without cliff effects.
    • Bare-word identifiers and corrupt StructTreeRoot handled
      Parser now tolerates bare-word tokens as dictionary values; a
      corrupt or absent StructTreeRoot no longer aborts extraction.
    • Standard-14 font matching strips SUBSET+ prefix; accepts
      canonical PostScript aliases
      — Per ISO 32000-1 §9.6.2.2 Annex D,
      standard font names are matched after stripping any ABCDEF+ prefix.
      HelveticaOblique (no hyphen) is now accepted alongside
      Helvetica-Oblique.
    • Explicit /DW tracked in FontInfohas_explicit_dw: bool
      added; has_explicit_widths() returns true when /DW is
      explicitly present, enabling correct width lookup for CIDFonts that
      declare only /DW (no /W array).
    • CIDFont width fallback corrected — When /DW is absent and a
      CID is not in the /W array, get_glyph_width now falls through to
      default_width rather than cid_default_width, matching real-world
      PDF behaviour.
    • Word extractor honours split_boundary_before — Words that
      straddle a table-cell or column boundary are no longer merged.
    • Ligature expansion optionConversionOptions gains
      expand_ligatures: bool (default false). When enabled, Latin
      ligatures (U+FB00–U+FB06: ff, fi, fl, ffi, ffl, ſt, st) are
      expanded to component letters.
    • Extraction warnings APIPdfDocument::warnings() (clones) and
      take_warnings() (drains) expose non-fatal extraction warnings
      (missing MCIDs, encrypted-PDF fallback) accumulated during a run.
  • Same-line span reorder: x-gap validation guard
    (#413)

    After the row-aware sort, mixed-baseline glyphs (superscripts,
    subscripts) could appear before their base glyphs. The
    reorder_same_line_runs helper now validates that a candidate run is
    horizontally contiguous before X-sorting it; runs with a large X gap
    are left in row-aware order, preventing disjoint footer/header content
    from being collapsed into a fake same-line sequence. Fixes "8th"
    ordering (was "th8"). Contributed by
    @RolandWArnold in
    PR #413.

  • Layout word-merge O(n²) → O(n) — The word-merge pass previously
    re-scanned the entire accumulator for every candidate span; it is now
    O(n) via an index map.

  • Wide spatial false-positive tables rejected via dense-row-ratio
    Table detection now computes the fraction of rows with dense (≥50%)
    column coverage and rejects candidates below the threshold, eliminating
    false positives on wide but sparsely populated layouts.

  • Bare-identifier lexer leniency confined to dict-value position
    The lexer's tolerance for bare (unquoted) name-like tokens is now
    restricted to dictionary value positions, preventing mis-tokenisation
    of content streams where the same byte sequences are valid operators.

  • Typographic Unicode spaces normalised in extracted spans
    Non-breaking, thin, en, em, and other Unicode space variants in span
    text are normalised to ASCII space before the word-spacing heuristics
    run, eliminating invisible gaps in the extracted output.

Performance

  • Rendering: per-segment font re-parsing eliminated — The text
    rasterizer no longer re-parses font data on every span segment; Arc
    clones across the hot render loop and redundant CJK subsetter
    invocations are also eliminated, reducing CPU time for text-heavy
    pages by 30–60%.

Dependencies

  • fast_image_resize added
    (#454)

    New dependency enabling SIMD-accelerated (ARM NEON, x86 AVX2) image
    downscaling for the raw-RGBA render path.

CI

  • FIPS release workflow now validates on pull requests
    release-fips.yml now triggers on PRs to main that touch source,
    language-binding, or workflow files. The full build across all five
    platforms and all four language bindings runs without publishing,
    so the tag push is a pure deployment step after a confirmed-green PR.
  • macOS x86_64 FIPS builds moved to free runners — All four
    macos-13-xlarge (paid Intel Larger Runner, causing indefinite queue
    waits on plans without access) replaced with macos-latest
    (free ARM runner cross-compiling to x86_64-apple-darwin).
  • Cargo registry caching added to all 20 FIPS build jobs
    Per-target cache keys ($runner_os-$target-fips-cargo-$lock_hash)
    are restored before each build, substantially reducing re-run time
    on warm caches.

Community contributors

  • @RolandWArnold — contributed
    the same-line x-gap validation fix in
    PR #413. Roland
    diagnosed that reorder_same_line_runs was collapsing disjoint
    footer/header spans into a fake same-line sequence and designed the
    horizontal-contiguity guard that prevents it. The fix also correctly
    handles superscript/subscript ordering ("8th" instead of "th8").
  • @Goldziher (Na'aman Hirschfeld) —
    filed #484 with a
    calibrated 166-document corpus, per-document ground-truth .txt files,
    and a word-F1 harness, providing the systematic test bed that drove the
    bulk of the extraction improvements in this release.
  • @stephenjudkins (Stephen
    Judkins) — filed #483
    with a minimal, precisely-scoped reproduction of the add_text
    regression that made the root-cause analysis straightforward.
  • @mara004 and
    @potatochipcoconut
    requested the raw RGBA pixel buffer API in comments on
    #325 with clear
    use cases across PIL, sharp, System.Drawing.Bitmap, and
    Go's image.RGBA, and engaged on the pixel-format details (premultiplied
    vs straight alpha, tiny-skia format constraints) that shaped the final
    API design.
  • @frederikhors — reported the
    JBIG2 blank-page symptom in a comment on
    #332 and
    confirmed that both the JBIG2 fix and the earlier vertical-flip
    regression are resolved.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform Architecture Archive
Linux x86_64 (glibc) pdf_oxide-linux-x86_64-*.tar.gz
Linux x86_64 (musl) pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux ARM64 pdf_oxide-linux-aarch64-*.tar.gz
macOS x86_64 (Intel) pdf_oxide-macos-x86_64-*.tar.gz
macOS ARM64 (Apple Silicon) pdf_oxide-macos-aarch64-*.tar.gz
Windows x86_64 pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.