Release v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix. · yfedoseev/pdf_oxide

This release closes the remaining bugs surfaced by the kreuzberg
integration (issue #484)
and ships the related text-extraction quality fixes. Word-F1 against the
pdftotext-derived ground truth corpus now meets the kreuzberg quality
floor for every PDF in the issue 484 set.

Fixed

kreuzberg regression suite — all 24 PDFs now meet the F1 floor
(#484) —
extract_text previously failed three documents reported by
@Goldziher on the kreuzberg corpus: pdfa_039.pdf (swimming-results
table) returned F1 0.810, pr-136-example.pdf (CJK financial document)
returned F1 0.709, and annotations.pdf returned F1 0.545. Three
separate root-cause fixes restore them to F1 ≥ 0.85:
- eliminate duplicate emission of multi-row table labels — the
  text-only spatial fallback in detect_tables_with_lines now
  requires config.text_fallback=true (which extract_text does
  not pass) so report-style PDFs with decorative ruling lines no
  longer get their cell content emitted twice; span_in_table adds
  a text-match fallback to catch label spans whose font ascent
  extends slightly above the cell's ink box (issue-53-example.pdf
  F1 0.867 → 0.992).
- tighten cross-font glue and decimal merge for CJK + Latin layouts — cross_font_word_glue no longer fires on a CJK ↔
  non-CJK boundary (CJK ideographs satisfy is_alphabetic() per
  Unicode and were being concatenated with adjacent Latin); the
  decimal_merge heuristic requires a column-boundary-sized gap
  (gap > 0.4 em) so per-glyph Tj operators in CJK documents stop
  mangling "2013" into "201.3" (pr-136 F1 0.709 → 0.884).
- narrow CJK boundary forced-space to script glyphs only —
  should_insert_space now actively inserts a space at the
  CJK ↔ non-CJK boundary to match pdftotext tokenisation, but
  restricted to actual script glyphs (ideographs, kana, hangul);
  fullwidth ASCII operators like ＜＞＝ μ stay inline with
  adjacent digits/Latin so compound tokens like "60000≤Q＜80000"
  are preserved (issue-336 text quality gate stays at PASS).
  Reported by @Goldziher.
extract_spans now exposes a merge_tm_tj_runs opt-out
(#488) —
Same-line Tm+Tj runs were unconditionally batched into a single
TextSpan, throwing away the per-Tm positioning that downstream
layout-analysis code (e.g. column-aware table detection) needs.
SpanMergingConfig::merge_tm_tj_runs (default true for backward
compatibility) now flushes the span buffer at every Tm operator so
callers can opt in to one span per Tm+Tj group, matching the
granularity of pdftotext -bbox-layout. Reported by @haberman.
saveEncryptedToBytes no longer panics in browser WASM
(#492) —
generate_file_id (per ISO 32000-1 §14.4) called
std::time::SystemTime::now(), which is unimplemented on
wasm32-unknown-unknown. Cfg-gated so the WASM build derives the
file identifier from uuid::Uuid::new_v4() only — still a unique
opaque 16-byte ID per the spec. Reported by @eersis-byte.
CJK fullwidth operator spacing in to_markdown / to_html
(#485) —
Four coordinated changes restore issue-336-example.pdf to PASS
on all three quality gates (text, markdown, html):
- pipeline/converters/has_horizontal_gap suppresses space
 insertion when one side is CJK and the other is CJK or a
 fullwidth/math operator (≤, ＜, ＞, ＝, μ, etc.), mirroring the
 text-extraction CJK-pair suppression.
- extract_cell_text no longer inserts an unconditional space
 between adjacent spans on the same row of a table cell — uses
 the same gap-aware separator rules as the inline-flow path so
 multi-span cells like 60000≤Q＜80000 (rendered as 5 separate
 Tj operators) keep their compound tokens intact.
- consolidate_adjacent_table_fragments (new helper in
 spatial_table_detector) merges vertically-adjacent tables that
 share an identical column structure. The line-based detector
 emits one fragment per ruling-rule strip on PDFs that draw a
 horizontal rule between every pair of rows; each fragment was
 failing is_real_grid and falling through to paragraph flow
 with column-based reading order, producing orphan
 40000≤Q / ＜55000 pairs. Consolidating before
 the filter lets the merged multi-row table survive.
- is_real_grid accepts wide consolidated tables that have
 dense data rows alongside sparse header / multi-row-label rows
 — the strict 70 % dense-ratio gate was rejecting real tables
 whose column headers split across multiple visual rows.
 Score improvements on issue-336-example.pdf:
 text 0.612 → 0.820, markdown 0.577 → 0.863, html 0.632 → 0.646
 (all PASS their thresholds).
Text-only spatial table fallback for line-less tables in
to_markdown
(#486) —
partial fix. extract_page_tables now opts in to a relaxed
text-only detection when the caller is a converter (text_fallback=
true), with the column ceiling raised from 15 to 25 so that
sailing-score grids with 16-18 score columns are no longer
rejected outright. The fragmented-table consolidation from #485
also kicks in here, recovering most of the row labels and
identifier columns. nougat_018.pdf markdown still trails its
threshold (0.656 vs 0.90) because the score columns themselves —
variable-width sparse cells with parenthesised drop-scores —
evade column detection; that is the remaining piece tracked
separately.
HTML table cell rendering aligned with markdown
(#487) —
partial fix. to_html now uses the same span-walking and
bold/italic preservation as to_markdown's
render_table_markdown. Three of four affected docs improved
by 1-4 % Jaccard but two (nougat_018, nougat_026) still trail the
threshold pending the table-fragmentation work above.
RTL inline emphasis stripping in markdown extraction
(#459) —
RTL detection now strips  /  markers from
visually-reversed runs in to_markdown consistently with the
plain-text path; spec basis ISO 32000-1 §14.8.2.3.3 (Reverse-
Order Show Strings). 46 unit tests in
tests/test_rtl_script_support.rs cover the detector, BiDi
algorithm, and inline-flow integration.
Multi-byte CMap parsing and array-form beginbfrange
(§9.7.5) — beginbfrange ... endbfrange array notation
<src> <src> [<dst1> <dst2> ...] was not fully covered; the
CMap parser now matches the spec's allowed grammar so multi-byte
CIDs map correctly through ToUnicode CMaps.
/StructTreeRoot-only tagged PDFs (§14.7.4) — Documents
that declare /StructTreeRoot in the catalog without a
/MarkInfo dictionary (PDF 1.4 documents, valid per the spec)
now correctly use the structure tree for table-cell content
extraction. Resolves /OBJR content-item references during
tree traversal so OBJR-referenced annotations and XObjects are
no longer lost.
Indirect references in MediaBox/CropBox accessors (§7.7.3.4)
— Page attribute accessors now resolve /MediaBox and /CropBox
through indirect references and the /Pages inheritance chain.
This is what made the Bucket A errors in the issue 484 retest
comment (annotations*.pdf, pdfa_039.pdf) parse successfully.
CTM-aware cache key for Form XObject span extraction — Form
XObject spans were cached by XObject reference alone, returning
stale coordinates for the same XObject reused on multiple pages
with different CTM transforms. Cache key now includes the CTM
so repeated XObjects produce correctly-positioned spans on each
invocation.
notdefrange U+FFFD no longer blocks the CID-as-Unicode
fallback (§9.10.2) — Per the spec, U+FFFD (REPLACEMENT
CHARACTER) signals "no proper Unicode mapping", so a notdefrange
hit must not stop the priority list. The Identity CID-as-Unicode
fallback (Priority 3) now fires correctly for composite fonts
whose ToUnicode CMap returns U+FFFD.
ToUnicode Priority-3 fallback guarded for composite fonts
(§9.10.2) — The CID-as-Unicode fallback is now only applied
to fonts whose CMap is one of the predefined composite-font
CMaps or whose CIDFont uses one of the Adobe character
collections, matching the spec's enumeration; misapplication on
other fonts could produce mojibake on previously-working files.
Reject prose / TOC / underline-annotation false-positive
tables in to_html and to_markdown — Wide pages of
ordinary paragraph text were sometimes detected as multi-column
tables: word x-positions cluster into "columns" by accident, and
decorative horizontal rules (newsletter mastheads, annotation
underlines, page borders) tricked the line-based detector into
treating two adjacent lines as a header + data row. The
detection pipeline now applies several post-is_real_grid
guards that look at the shape of the candidate's cell
content rather than just its grid geometry:
- looks_like_prose_table rejects a candidate when more than
  12 % of cells end with a mid-sentence , or ;, more than
  25 % of cells start with a lowercase ASCII letter
  (continuation fragments like "and", "the", "to"), or more
  than 10 % of cells are pure leader dots (the . . . . . .
  runs in tables of contents).
- The text-only spatial fallback and the horizontal-rule-
  bounded path both now require ≥ 3 rows of evidence. A
  title plus a wrapped body line is the signature of prose,
  not a table; only the line-based intersection / cluster
  paths (which have authoritative visual evidence) still
  accept 2-row tables.
- should_insert_space no longer forces a space at the
  CJK ↔ ASCII-punctuation boundary. The boundary forced-
  space added in v0.3.47 was correctly inserting a space at
  "神鹰集团" + "2015" but was wrongly producing "する ."
  instead of "する." in Japanese technical text; ASCII
  clause punctuation hugs the preceding token in every
  script, so the rule is now suppressed when the
  transitioning glyph IS the punctuation.
- text_fallback defaults back to true on
  TableDetectionConfig. The new prose-shape filter
  replaces the gate-based protection added earlier in the
  cycle, so the public extract_tables API again detects
  line-less data tables out of the box.

Notes

tests/test_corpus_extraction_quality.rs now strips markdown
formatting markers (**bold**, *italic*, | separators,
---|---|--- rule, # heading, ``` fences) before
computing Jaccard against the plain-text GT — mirrors the HTML
test's existing strip_html step so the score reflects text
content rather than formatting markup.
All 19 quality-gate Jaccard tests in
tests/test_corpus_extraction_quality.rs now pass (up from
13 at the start of this branch). The kreuzberg issue 484
corpus passes its F1 floor on every PDF.

Thanks

This release was driven entirely by community bug reports and the
kreuzberg integration test feedback loop:

@Goldziher (kreuzberg-dev) — opened #484 with a calibrated 166-PDF
regression suite and follow-up retest comments that turned every
remaining gap into a focused root-cause fix
@haberman — opened #488 with a minimal Rust reproducer for the
Tm+Tj merging issue
@eersis-byte — opened #492 with the WASM SystemTime panic backtrace

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform	Architecture	Archive
Linux	x86_64 (glibc)	`pdf_oxide-linux-x86_64-*.tar.gz`
Linux	x86_64 (musl)	`pdf_oxide-linux-x86_64-musl-*.tar.gz`
Linux	ARM64	`pdf_oxide-linux-aarch64-*.tar.gz`
macOS	x86_64 (Intel)	`pdf_oxide-macos-x86_64-*.tar.gz`
macOS	ARM64 (Apple Silicon)	`pdf_oxide-macos-aarch64-*.tar.gz`
Windows	x86_64	`pdf_oxide-windows-x86_64-*.zip`

Changelog

See CHANGELOG.md for full details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix.

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Fixed

Notes

Thanks

Installation

Platform Support

Changelog

Contributors

Uh oh!