v0.3.47 | text-extraction quality, CJK + RTL fixes, table-detection hardening, and a WASM SystemTime fix.
This release closes the remaining bugs surfaced by the kreuzberg
integration (issue #484)
and ships the related text-extraction quality fixes. Word-F1 against the
pdftotext-derived ground truth corpus now meets the kreuzberg quality
floor for every PDF in the issue 484 set.
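For readers unfamiliar with the metric, a word-level F1 score of the kind quoted throughout these notes can be sketched as follows. This is only an illustration: it assumes whitespace tokenisation and bag-of-words overlap, which may differ from the scoring the kreuzberg suite actually uses, and `word_counts` and `word_f1` are hypothetical helpers, not part of the pdf_oxide API.

```rust
use std::collections::HashMap;

/// Count whitespace-separated tokens in `text`.
fn word_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: bag-of-words F1 between extracted text and ground truth.
fn word_f1(extracted: &str, truth: &str) -> f64 {
    let got = word_counts(extracted);
    let want = word_counts(truth);
    let got_total: usize = got.values().sum();
    let want_total: usize = want.values().sum();
    if got_total == 0 || want_total == 0 {
        // Two empty texts count as a perfect match; one empty side scores zero.
        return if got_total == want_total { 1.0 } else { 0.0 };
    }
    // Each word contributes the smaller of its two occurrence counts.
    let overlap: usize = got
        .iter()
        .map(|(w, &n)| n.min(want.get(*w).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / got_total as f64;
    let recall = overlap as f64 / want_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under these assumptions, duplicated cell content lowers precision and dropped text lowers recall, so both kinds of extraction bug pull the score down.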
Fixed
-
kreuzberg regression suite: all 24 PDFs now meet the F1 floor (#484).
`extract_text` previously failed three documents reported by
@Goldziher on the kreuzberg corpus: `pdfa_039.pdf` (swimming-results
table) returned F1 0.810, `pr-136-example.pdf` (CJK financial document)
returned F1 0.709, and `annotations.pdf` returned F1 0.545. Three
separate root-cause fixes restore them to F1 ≥ 0.85:
  - eliminate duplicate emission of multi-row table labels: the
    text-only spatial fallback in `detect_tables_with_lines` now
    requires `config.text_fallback=true` (which `extract_text` does
    not pass) so report-style PDFs with decorative ruling lines no
    longer get their cell content emitted twice; `span_in_table` adds
    a text-match fallback to catch label spans whose font ascent
    extends slightly above the cell's ink box (`issue-53-example.pdf`
    F1 0.867 → 0.992).
  - tighten cross-font glue and decimal merge for CJK + Latin layouts:
    `cross_font_word_glue` no longer fires on a CJK ↔ non-CJK boundary
    (CJK ideographs satisfy `is_alphabetic()` per Unicode and were
    being concatenated with adjacent Latin); the `decimal_merge`
    heuristic requires a column-boundary-sized gap (gap > 0.4 em) so
    per-glyph Tj operators in CJK documents stop mangling "2013" into
    "201.3" (pr-136 F1 0.709 → 0.884).
  - narrow CJK boundary forced-space to script glyphs only:
    `should_insert_space` now actively inserts a space at the
    CJK ↔ non-CJK boundary to match pdftotext tokenisation, but only
    for actual script glyphs (ideographs, kana, hangul); fullwidth
    ASCII operators like < > = μ stay inline with adjacent
    digits/Latin so compound tokens like "60000≤Q<80000" are
    preserved (issue-336 text quality gate stays at PASS).
Reported by @Goldziher.
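The CJK boundary behaviour described in this entry can be illustrated with a small sketch. Every name here (`is_cjk_script_glyph`, `naive_glue`, `guarded_glue`, `is_script_boundary`) is hypothetical, and the code-point ranges are a handful of common Unicode blocks chosen for illustration; the library's real classification may cover more ranges or use different rules.

```rust
/// Hypothetical classifier: code points in the Hiragana, Katakana,
/// CJK ideograph, and Hangul syllable blocks. Fullwidth operators
/// (U+FF00 block) and math symbols are deliberately excluded.
fn is_cjk_script_glyph(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x309F       // Hiragana
            | 0x30A0..=0x30FF // Katakana
            | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xAC00..=0xD7A3 // Hangul Syllables
    )
}

/// Naive glue test: any two alphabetic characters look like one word.
/// CJK ideographs are alphabetic per Unicode, so this glues them to Latin.
fn naive_glue(prev: char, next: char) -> bool {
    prev.is_alphabetic() && next.is_alphabetic()
}

/// Guarded glue test: refuses to glue across a CJK / non-CJK boundary.
fn guarded_glue(prev: char, next: char) -> bool {
    naive_glue(prev, next) && is_cjk_script_glyph(prev) == is_cjk_script_glyph(next)
}

/// Boundary test for a forced space: one side is a CJK script glyph and
/// the other is an ASCII letter or digit. Operators never qualify.
fn is_script_boundary(prev: char, next: char) -> bool {
    (is_cjk_script_glyph(prev) && next.is_ascii_alphanumeric())
        || (is_cjk_script_glyph(next) && prev.is_ascii_alphanumeric())
}
```

With these definitions, `naive_glue('团', 'A')` is true while `guarded_glue('团', 'A')` is false, and a digit next to `≤` is never treated as a script boundary, so a token like "60000≤Q<80000" stays intact.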
-
`extract_spans` now exposes a `merge_tm_tj_runs` opt-out (#488).
Same-line Tm+Tj runs were unconditionally batched into a single
`TextSpan`, throwing away the per-Tm positioning that downstream
layout-analysis code (e.g. column-aware table detection) needs.
Setting `SpanMergingConfig::merge_tm_tj_runs` to `false` (the default
is `true` for backward compatibility) flushes the span buffer at every
Tm operator, so callers can opt in to one span per Tm+Tj group,
matching the granularity of `pdftotext -bbox-layout`. Reported by
@haberman.
-
`saveEncryptedToBytes` no longer panics in browser WASM (#492).
`generate_file_id` (per ISO 32000-1 §14.4) called
`std::time::SystemTime::now()`, which is unimplemented on
`wasm32-unknown-unknown`. The call is now cfg-gated so the WASM build
derives the file identifier from `uuid::Uuid::new_v4()` only, which is
still a unique opaque 16-byte ID per the spec. Reported by
@eersis-byte.
-
CJK fullwidth operator spacing in `to_markdown` / `to_html` (#485).
Four coordinated changes restore `issue-336-example.pdf` to PASS
on all three quality gates (text, markdown, html):
  - `pipeline/converters/has_horizontal_gap` suppresses space
    insertion when one side is CJK and the other is CJK or a
    fullwidth/math operator (≤, <, >, =, μ, etc.), mirroring the
    text-extraction CJK-pair suppression.
  - `extract_cell_text` no longer inserts an unconditional space
    between adjacent spans on the same row of a table cell; it uses
    the same gap-aware separator rules as the inline-flow path so
    multi-span cells like `60000≤Q<80000` (rendered as 5 separate
    Tj operators) keep their compound tokens intact.
  - `consolidate_adjacent_table_fragments` (new helper in
    `spatial_table_detector`) merges vertically-adjacent tables that
    share an identical column structure. The line-based detector
    emits one fragment per ruling-rule strip on PDFs that draw a
    horizontal rule between every pair of rows; each fragment was
    failing `is_real_grid` and falling through to paragraph flow
    with column-based reading order, producing orphan
    `<p>40000≤Q</p>` / `<p><55000</p>` pairs. Consolidating before
    the filter lets the merged multi-row table survive.
  - `is_real_grid` accepts wide consolidated tables that have
    dense data rows alongside sparse header / multi-row-label rows;
    the strict 70 % dense-ratio gate was rejecting real tables
    whose column headers split across multiple visual rows.
Score improvements on `issue-336-example.pdf`:
text 0.612 → 0.820, markdown 0.577 → 0.863, html 0.632 → 0.646
(all PASS their thresholds).
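The fragment consolidation step described in this entry can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code in `spatial_table_detector`: the `Fragment` type, its fields, the `max_gap` parameter, the integer column boundaries, and the top-down y axis are all hypothetical.

```rust
/// Hypothetical table fragment: column boundaries plus vertical extent.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    /// Column x-boundaries rounded to whole points so they compare exactly.
    columns: Vec<i32>,
    /// Top and bottom edges; this sketch assumes y grows downward.
    top: f32,
    bottom: f32,
    rows: usize,
}

/// Merge vertically adjacent fragments that share an identical column structure.
fn consolidate(mut fragments: Vec<Fragment>, max_gap: f32) -> Vec<Fragment> {
    // Visit fragments from the top of the page to the bottom.
    fragments.sort_by(|a, b| a.top.total_cmp(&b.top));
    let mut merged: Vec<Fragment> = Vec::new();
    for frag in fragments {
        if let Some(last) = merged.last_mut() {
            let same_columns = last.columns == frag.columns;
            let adjacent = frag.top - last.bottom <= max_gap;
            if same_columns && adjacent {
                // Extend the previous table instead of starting a new one.
                last.bottom = last.bottom.max(frag.bottom);
                last.rows += frag.rows;
                continue;
            }
        }
        merged.push(frag);
    }
    merged
}
```

Running the merge before any grid filter is what gives a multi-row table a chance to pass: a one-row strip on its own carries too little evidence, while the merged table carries the row count of all its strips.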
-
Text-only spatial table fallback for line-less tables in
`to_markdown` (#486), partial fix.
`extract_page_tables` now opts in to a relaxed text-only detection
when the caller is a converter (`text_fallback=true`), with the
column ceiling raised from 15 to 25 so that sailing-score grids with
16-18 score columns are no longer rejected outright. The
fragmented-table consolidation from #485 also kicks in here,
recovering most of the row labels and identifier columns.
`nougat_018.pdf` markdown still trails its threshold (0.656 vs 0.90)
because the score columns themselves (variable-width sparse cells
with parenthesised drop-scores) evade column detection; that is the
remaining piece tracked separately.
-
HTML table cell rendering aligned with markdown (#487), partial fix.
`to_html` now uses the same span-walking and bold/italic
preservation as `to_markdown`'s `render_table_markdown`. Three of
four affected docs improved by 1-4 % Jaccard but two (nougat_018,
nougat_026) still trail the threshold pending the
table-fragmentation work above.
-
RTL inline emphasis stripping in markdown extraction (#459).
RTL detection now strips `<strong>` / `<em>` markers from
visually-reversed runs in `to_markdown` consistently with the
plain-text path; spec basis ISO 32000-1 §14.8.2.3.3 (Reverse-Order
Show Strings). 46 unit tests in `tests/test_rtl_script_support.rs`
cover the detector, BiDi algorithm, and inline-flow integration.
-
Multi-byte CMap parsing and array-form `beginbfrange` (§9.7.5).
The `beginbfrange ... endbfrange` array notation
`<src> <src> [<dst1> <dst2> ...]` was not fully covered; the
CMap parser now matches the spec's allowed grammar so multi-byte
CIDs map correctly through ToUnicode CMaps.
-
`/StructTreeRoot`-only tagged PDFs (§14.7.4). Documents
that declare `/StructTreeRoot` in the catalog without a
`/MarkInfo` dictionary (PDF 1.4 documents, valid per the spec)
now correctly use the structure tree for table-cell content
extraction. Tree traversal also resolves `/OBJR` content-item
references, so OBJR-referenced annotations and XObjects are
no longer lost.
-
Indirect references in MediaBox/CropBox accessors (§7.7.3.4).
Page attribute accessors now resolve `/MediaBox` and `/CropBox`
through indirect references and the `/Pages` inheritance chain.
This is what lets the PDFs behind the Bucket A errors in the issue
484 retest comment (`annotations*.pdf`, `pdfa_039.pdf`) parse
successfully.
-
CTM-aware cache key for Form XObject span extraction. Form
XObject spans were cached by XObject reference alone, returning
stale coordinates for the same XObject reused on multiple pages
with different CTM transforms. The cache key now includes the CTM
so repeated XObjects produce correctly-positioned spans on each
invocation.
-
`notdefrange` U+FFFD no longer blocks the CID-as-Unicode
fallback (§9.10.2). Per the spec, U+FFFD (REPLACEMENT
CHARACTER) signals "no proper Unicode mapping", so a `notdefrange`
hit must not stop the priority list. The Identity CID-as-Unicode
fallback (Priority 3) now fires correctly for composite fonts
whose ToUnicode CMap returns U+FFFD.
-
ToUnicode Priority-3 fallback guarded for composite fonts
(§9.10.2). The CID-as-Unicode fallback is now only applied
to fonts whose CMap is one of the predefined composite-font
CMaps or whose CIDFont uses one of the Adobe character
collections, matching the spec's enumeration; misapplication on
other fonts could produce mojibake on previously-working files.
-
Reject prose / TOC / underline-annotation false-positive
tables in `to_html` and `to_markdown`. Wide pages of
ordinary paragraph text were sometimes detected as multi-column
tables: word x-positions cluster into "columns" by accident, and
decorative horizontal rules (newsletter mastheads, annotation
underlines, page borders) tricked the line-based detector into
treating two adjacent lines as a header + data row. The
detection pipeline now applies several post-`is_real_grid`
guards that look at the shape of the candidate's cell
content rather than just its grid geometry:
  - `looks_like_prose_table` rejects a candidate when more than
    12 % of cells end with a mid-sentence `,` or `;`, more than
    25 % of cells start with a lowercase ASCII letter
    (continuation fragments like "and", "the", "to"), or more
    than 10 % of cells are pure leader dots (the `. . . . . .`
    runs in tables of contents).
  - The text-only spatial fallback and the
    horizontal-rule-bounded path both now require ≥ 3 rows of
    evidence. A title plus a wrapped body line is the signature
    of prose, not a table; only the line-based intersection /
    cluster paths (which have authoritative visual evidence)
    still accept 2-row tables.
  - `should_insert_space` no longer forces a space at the
    CJK ↔ ASCII-punctuation boundary. The boundary forced-space
    added in v0.3.47 was correctly inserting a space at
    "神鹰集团" + "2015" but was wrongly producing "する ."
    instead of "する." in Japanese technical text; ASCII
    clause punctuation hugs the preceding token in every
    script, so the rule is now suppressed when the
    transitioning glyph IS the punctuation.
  - `text_fallback` defaults back to `true` on
    `TableDetectionConfig`. The new prose-shape filter
    replaces the gate-based protection added earlier in the
    cycle, so the public `extract_tables` API again detects
    line-less data tables out of the box.
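The prose-shape thresholds listed for `looks_like_prose_table` can be expressed compactly. The sketch below applies only the three stated ratios (12 %, 25 %, 10 %); it is not the library's implementation, the `fraction` helper is hypothetical, and it assumes the caller passes trimmed, non-empty cell texts.

```rust
/// Hypothetical helper: fraction of cells for which `pred` holds.
fn fraction(cells: &[&str], pred: impl Fn(&str) -> bool) -> f64 {
    if cells.is_empty() {
        return 0.0;
    }
    cells.iter().filter(|&&c| pred(c)).count() as f64 / cells.len() as f64
}

/// Sketch of the prose-shape guard using the thresholds from these notes.
fn looks_like_prose_table(cells: &[&str]) -> bool {
    // Cells ending with a mid-sentence comma or semicolon.
    let mid_sentence = fraction(cells, |c| c.ends_with(',') || c.ends_with(';'));
    // Cells starting with a lowercase ASCII letter (continuation fragments).
    let lowercase_start = fraction(cells, |c| {
        c.chars().next().map_or(false, |ch| ch.is_ascii_lowercase())
    });
    // Cells made only of dots and whitespace (table-of-contents leaders).
    let leader_dots = fraction(cells, |c| {
        !c.is_empty() && c.chars().all(|ch| ch == '.' || ch.is_whitespace())
    });
    mid_sentence > 0.12 || lowercase_start > 0.25 || leader_dots > 0.10
}
```

For example, a candidate whose cells are `["Name", "Score", "Alice", "42"]` trips none of the three ratios, while one built from wrapped sentence fragments or from table-of-contents leader rows is rejected.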
Notes
- `tests/test_corpus_extraction_quality.rs` now strips markdown
  formatting markers (`**bold**`, `*italic*`, `|` separators,
  `---|---|---` rule, `# heading`, ``` fences) before
  computing Jaccard against the plain-text ground truth; this
  mirrors the HTML test's existing `strip_html` step so the score
  reflects text content rather than formatting markup.
- All 19 quality-gate Jaccard tests in
  `tests/test_corpus_extraction_quality.rs` now pass (up from
  13 at the start of this branch). The kreuzberg issue 484
  corpus passes its F1 floor on every PDF.
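As a rough illustration of why stripping markers matters for the score, the sketch below removes the marker kinds listed in the first note and then computes a word-set Jaccard similarity. Both `strip_markdown` and `jaccard` are hypothetical stand-ins written for this example; the test suite's real stripping rules and tokenisation are not shown in these notes and may differ.

```rust
use std::collections::HashSet;

/// Hypothetical stripper for the marker kinds listed above, line by line.
fn strip_markdown(md: &str) -> String {
    let mut out = Vec::new();
    for line in md.lines() {
        let trimmed = line.trim();
        // Drop code-fence lines.
        if trimmed.starts_with("```") {
            continue;
        }
        // Drop table rule lines such as ---|---|---.
        if !trimmed.is_empty()
            && trimmed.chars().all(|c| matches!(c, '-' | '|' | ':' | ' '))
        {
            continue;
        }
        // Remove heading markers, emphasis markers, and cell separators.
        let text = trimmed
            .trim_start_matches('#')
            .replace("**", "")
            .replace('*', "")
            .replace('|', " ");
        out.push(text.trim().to_string());
    }
    out.join("\n")
}

/// Jaccard similarity over the sets of whitespace-separated words.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<&str> = a.split_whitespace().collect();
    let sb: HashSet<&str> = b.split_whitespace().collect();
    if sa.is_empty() && sb.is_empty() {
        return 1.0;
    }
    let intersection = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    intersection / union
}
```

Without the stripping step, tokens such as `**bold**` or `|` never match the plain-text ground truth, so the score would penalise formatting rather than missing text.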
Thanks
This release was driven entirely by community bug reports and the
kreuzberg integration test feedback loop:
- @Goldziher (kreuzberg-dev): opened #484 with a calibrated 166-PDF
  regression suite and follow-up retest comments that turned every
  remaining gap into a focused root-cause fix
- @haberman: opened #488 with a minimal Rust reproducer for the
  Tm+Tj merging issue
- @eersis-byte: opened #492 with the WASM `SystemTime` panic backtrace
Installation
Rust (crates.io)
`cargo add pdf_oxide`
Python (PyPI)
`pip install pdf_oxide`
JavaScript/WASM (npm)
`npm install pdf-oxide-wasm`
CLI (Homebrew)
`brew install yfedoseev/tap/pdf-oxide`
CLI (Scoop, Windows)
`scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide`
`scoop install pdf-oxide`
CLI (Shell installer)
`curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh`
CLI (cargo-binstall)
`cargo binstall pdf_oxide_cli`
MCP Server (for AI assistants)
`cargo install pdf_oxide_mcp`
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.