Skip to content

v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix.

Choose a tag to compare

@github-actions github-actions released this 13 May 04:59
· 176 commits to main since this release
7c9e338

This release closes the remaining bugs surfaced by the kreuzberg
integration (issue #484)
and ships the related text-extraction quality fixes. Word-F1 against the
pdftotext-derived ground truth corpus now meets the kreuzberg quality
floor for every PDF in the issue 484 set.

Fixed

  • kreuzberg regression suite — all 24 PDFs now meet the F1 floor
    (#484)

    extract_text previously failed three documents reported by
    @Goldziher on the kreuzberg corpus: pdfa_039.pdf (swimming-results
    table) returned F1 0.810, pr-136-example.pdf (CJK financial document)
    returned F1 0.709, and annotations.pdf returned F1 0.545. Three
    separate root-cause fixes restore them to F1 ≥ 0.85:

    • eliminate duplicate emission of multi-row table labels — the
      text-only spatial fallback in detect_tables_with_lines now
      requires config.text_fallback=true (which extract_text does
      not pass) so report-style PDFs with decorative ruling lines no
      longer get their cell content emitted twice; span_in_table adds
      a text-match fallback to catch label spans whose font ascent
      extends slightly above the cell's ink box (issue-53-example.pdf
      F1 0.867 → 0.992).
    • tighten cross-font glue and decimal merge for CJK + Latin layoutscross_font_word_glue no longer fires on a CJK ↔
      non-CJK boundary (CJK ideographs satisfy is_alphabetic() per
      Unicode and were being concatenated with adjacent Latin); the
      decimal_merge heuristic requires a column-boundary-sized gap
      (gap > 0.4 em) so per-glyph Tj operators in CJK documents stop
      mangling "2013" into "201.3" (pr-136 F1 0.709 → 0.884).
    • narrow CJK boundary forced-space to script glyphs only
      should_insert_space now actively inserts a space at the
      CJK ↔ non-CJK boundary to match pdftotext tokenisation, but
      restricted to actual script glyphs (ideographs, kana, hangul);
      fullwidth ASCII operators like < > = μ stay inline with
      adjacent digits/Latin so compound tokens like "60000≤Q<80000"
      are preserved (issue-336 text quality gate stays at PASS).
      Reported by @Goldziher.
  • extract_spans now exposes a merge_tm_tj_runs opt-out
    (#488)

    Same-line Tm+Tj runs were unconditionally batched into a single
    TextSpan, throwing away the per-Tm positioning that downstream
    layout-analysis code (e.g. column-aware table detection) needs.
    SpanMergingConfig::merge_tm_tj_runs (default true for backward
    compatibility) now flushes the span buffer at every Tm operator so
    callers can opt in to one span per Tm+Tj group, matching the
    granularity of pdftotext -bbox-layout. Reported by @haberman.

  • saveEncryptedToBytes no longer panics in browser WASM
    (#492)

    generate_file_id (per ISO 32000-1 §14.4) called
    std::time::SystemTime::now(), which is unimplemented on
    wasm32-unknown-unknown. Cfg-gated so the WASM build derives the
    file identifier from uuid::Uuid::new_v4() only — still a unique
    opaque 16-byte ID per the spec. Reported by @eersis-byte.

  • CJK fullwidth operator spacing in to_markdown / to_html
    (#485)

    Four coordinated changes restore issue-336-example.pdf to PASS
    on all three quality gates (text, markdown, html):

    • pipeline/converters/has_horizontal_gap suppresses space
      insertion when one side is CJK and the other is CJK or a
      fullwidth/math operator (≤, <, >, =, μ, etc.), mirroring the
      text-extraction CJK-pair suppression.
    • extract_cell_text no longer inserts an unconditional space
      between adjacent spans on the same row of a table cell — uses
      the same gap-aware separator rules as the inline-flow path so
      multi-span cells like 60000≤Q<80000 (rendered as 5 separate
      Tj operators) keep their compound tokens intact.
    • consolidate_adjacent_table_fragments (new helper in
      spatial_table_detector) merges vertically-adjacent tables that
      share an identical column structure. The line-based detector
      emits one fragment per ruling-rule strip on PDFs that draw a
      horizontal rule between every pair of rows; each fragment was
      failing is_real_grid and falling through to paragraph flow
      with column-based reading order, producing orphan
      <p>40000≤Q</p> / <p><55000</p> pairs. Consolidating before
      the filter lets the merged multi-row table survive.
    • is_real_grid accepts wide consolidated tables that have
      dense data rows alongside sparse header / multi-row-label rows
      — the strict 70 % dense-ratio gate was rejecting real tables
      whose column headers split across multiple visual rows.
      Score improvements on issue-336-example.pdf:
      text 0.612 → 0.820, markdown 0.577 → 0.863, html 0.632 → 0.646
      (all PASS their thresholds).
  • Text-only spatial table fallback for line-less tables in
    to_markdown
    (#486) —
    partial fix.
    extract_page_tables now opts in to a relaxed
    text-only detection when the caller is a converter (text_fallback=
    true), with the column ceiling raised from 15 to 25 so that
    sailing-score grids with 16-18 score columns are no longer
    rejected outright. The fragmented-table consolidation from #485
    also kicks in here, recovering most of the row labels and
    identifier columns. nougat_018.pdf markdown still trails its
    threshold (0.656 vs 0.90) because the score columns themselves —
    variable-width sparse cells with parenthesised drop-scores —
    evade column detection; that is the remaining piece tracked
    separately.

  • HTML table cell rendering aligned with markdown
    (#487) —
    partial fix.
    to_html now uses the same span-walking and
    bold/italic preservation as to_markdown's
    render_table_markdown. Three of four affected docs improved
    by 1-4 % Jaccard but two (nougat_018, nougat_026) still trail the
    threshold pending the table-fragmentation work above.

  • RTL inline emphasis stripping in markdown extraction
    (#459)

    RTL detection now strips <strong> / <em> markers from
    visually-reversed runs in to_markdown consistently with the
    plain-text path; spec basis ISO 32000-1 §14.8.2.3.3 (Reverse-
    Order Show Strings). 46 unit tests in
    tests/test_rtl_script_support.rs cover the detector, BiDi
    algorithm, and inline-flow integration.

  • Multi-byte CMap parsing and array-form beginbfrange
    (§9.7.5)
    beginbfrange ... endbfrange array notation
    <src> <src> [<dst1> <dst2> ...] was not fully covered; the
    CMap parser now matches the spec's allowed grammar so multi-byte
    CIDs map correctly through ToUnicode CMaps.

  • /StructTreeRoot-only tagged PDFs (§14.7.4) — Documents
    that declare /StructTreeRoot in the catalog without a
    /MarkInfo dictionary (PDF 1.4 documents, valid per the spec)
    now correctly use the structure tree for table-cell content
    extraction. Resolves /OBJR content-item references during
    tree traversal so OBJR-referenced annotations and XObjects are
    no longer lost.

  • Indirect references in MediaBox/CropBox accessors (§7.7.3.4)
    — Page attribute accessors now resolve /MediaBox and /CropBox
    through indirect references and the /Pages inheritance chain.
    This is what made the Bucket A errors in the issue 484 retest
    comment (annotations*.pdf, pdfa_039.pdf) parse successfully.

  • CTM-aware cache key for Form XObject span extraction — Form
    XObject spans were cached by XObject reference alone, returning
    stale coordinates for the same XObject reused on multiple pages
    with different CTM transforms. Cache key now includes the CTM
    so repeated XObjects produce correctly-positioned spans on each
    invocation.

  • notdefrange U+FFFD no longer blocks the CID-as-Unicode
    fallback (§9.10.2)
    — Per the spec, U+FFFD (REPLACEMENT
    CHARACTER) signals "no proper Unicode mapping", so a notdefrange
    hit must not stop the priority list. The Identity CID-as-Unicode
    fallback (Priority 3) now fires correctly for composite fonts
    whose ToUnicode CMap returns U+FFFD.

  • ToUnicode Priority-3 fallback guarded for composite fonts
    (§9.10.2)
    — The CID-as-Unicode fallback is now only applied
    to fonts whose CMap is one of the predefined composite-font
    CMaps or whose CIDFont uses one of the Adobe character
    collections, matching the spec's enumeration; misapplication on
    other fonts could produce mojibake on previously-working files.

  • Reject prose / TOC / underline-annotation false-positive
    tables in to_html and to_markdown
    — Wide pages of
    ordinary paragraph text were sometimes detected as multi-column
    tables: word x-positions cluster into "columns" by accident, and
    decorative horizontal rules (newsletter mastheads, annotation
    underlines, page borders) tricked the line-based detector into
    treating two adjacent lines as a header + data row. The
    detection pipeline now applies several post-is_real_grid
    guards that look at the shape of the candidate's cell
    content rather than just its grid geometry:

    • looks_like_prose_table rejects a candidate when more than
      12 % of cells end with a mid-sentence , or ;, more than
      25 % of cells start with a lowercase ASCII letter
      (continuation fragments like "and", "the", "to"), or more
      than 10 % of cells are pure leader dots (the . . . . . .
      runs in tables of contents).
    • The text-only spatial fallback and the horizontal-rule-
      bounded path both now require ≥ 3 rows of evidence. A
      title plus a wrapped body line is the signature of prose,
      not a table; only the line-based intersection / cluster
      paths (which have authoritative visual evidence) still
      accept 2-row tables.
    • should_insert_space no longer forces a space at the
      CJK ↔ ASCII-punctuation boundary. The boundary forced-
      space added in v0.3.47 was correctly inserting a space at
      "神鹰集团" + "2015" but was wrongly producing "する ."
      instead of "する." in Japanese technical text; ASCII
      clause punctuation hugs the preceding token in every
      script, so the rule is now suppressed when the
      transitioning glyph IS the punctuation.
    • text_fallback defaults back to true on
      TableDetectionConfig. The new prose-shape filter
      replaces the gate-based protection added earlier in the
      cycle, so the public extract_tables API again detects
      line-less data tables out of the box.

Notes

  • tests/test_corpus_extraction_quality.rs now strips markdown
    formatting markers (**bold**, *italic*, | separators,
    ---|---|--- rule, # heading, ``` fences) before
    computing Jaccard against the plain-text GT — mirrors the HTML
    test's existing strip_html step so the score reflects text
    content rather than formatting markup.
  • All 19 quality-gate Jaccard tests in
    tests/test_corpus_extraction_quality.rs now pass (up from
    13 at the start of this branch). The kreuzberg issue 484
    corpus passes its F1 floor on every PDF.

Thanks

This release was driven entirely by community bug reports and the
kreuzberg integration test feedback loop:

  • @Goldziher (kreuzberg-dev) — opened #484 with a calibrated 166-PDF
    regression suite and follow-up retest comments that turned every
    remaining gap into a focused root-cause fix
  • @haberman — opened #488 with a minimal Rust reproducer for the
    Tm+Tj merging issue
  • @eersis-byte — opened #492 with the WASM SystemTime panic backtrace

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform Architecture Archive
Linux x86_64 (glibc) pdf_oxide-linux-x86_64-*.tar.gz
Linux x86_64 (musl) pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux ARM64 pdf_oxide-linux-aarch64-*.tar.gz
macOS x86_64 (Intel) pdf_oxide-macos-x86_64-*.tar.gz
macOS ARM64 (Apple Silicon) pdf_oxide-macos-aarch64-*.tar.gz
Windows x86_64 pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.