v0.3.57 | Community contributions + extraction-quality sweep — separation plates, OCG ink filtering, two-phase images, rendered-advance metrics, plus multi-column reading order, page-rotation, CJK/UTF-8 CMap decoding, RTL logical order, indirect-ref page boxes, and font-cache correctness
Added
- `TextChar::rendered_advance`: per-glyph cursor advance to the next character's origin, including character spacing (Tc) and word spacing (Tw) per the PDF Tx formula, distinct from the shape-only `advance_width`. Enables accurate word-boundary detection and cursor reconstruction. Thanks @haberman. (#602)
- Separation plate rendering: `render_separations(page, dpi)` / `render_separation(page, ink_name, dpi)` (Rust + Python) emit one grayscale image per ink, pixel value = ink coverage (0 = none, 255 = full tint). Routes DeviceCMYK / Separation / DeviceN content per ISO 32000-1 §8.6 and honours the reserved colorant names `/All` and `/None` per §8.6.6.4 so registration / crop marks land on every plate. New `SeparationPlate` namedtuple in Python. Thanks @RayVR. (#605)
- OCG (Optional Content Group) ink filtering for text extraction: `extract_text_filtered(page, excluded_layers, excluded_inks)` and the Python equivalent route through the full text-assembly pipeline (structure-tree ordering, table detection) while filtering by PDF layer and DeviceN/Separation ink. Handles OCMD membership dictionaries and DeviceN all-or-nothing ink semantics. Thanks @RayVR. (#600)
- `page_image_handles()` two-phase image API: enumerate image handles on a page first, then materialize pixels on demand, including images nested inside Form XObjects via recursion. Avoids decoding every image up front. Thanks @kh3rld. (#588)
- Optional Content Group (PDF "layer") name on extracted paths: `PathContent` gains a `layer: Option<String>` carrying the human-readable OCG name from the surrounding `BDC /OC … EMC` markers (e.g. `A-GRID`, `S-COLS` from Revit/AutoCAD exports), surfaced in the Python path dict too. Resolves OCMD membership dictionaries via `/OCGs` (§8.11.3.2, depth-bounded), decodes names through PDFDocEncoding/UTF-16, and honours Form-XObject-scoped `/Resources/Properties` with leak-isolation across XObject boundaries (§14.6.2, §8.10.1). Thanks @willywg. (#587)
- olmOCR-bench regression harness: `tools/benchmark-harness/olmocr/` runs the public `allenai/olmOCR-bench` corpus (999 single-page PDFs, checkable substring/order/absent assertions) for CI regression tracking. Corpus fetched on demand (gitignored, not vendored). (#567)
- Configurable non-text drop heuristics: `NonTextDetector` thresholds (`non_ascii_drop_threshold`, `drop_suspicious_unicode`) are now configurable so callers can tune the markdown garbage-glyph filter rather than relying on hard-coded constants. (PDX-7)
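The difference between the shape-only `advance_width` and `rendered_advance` can be illustrated with the horizontal displacement formula from ISO 32000-1 §9.4.4, tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th. The Python below is a minimal sketch under stated assumptions, not pdf_oxide's implementation: all function and parameter names are hypothetical, `w0` is the glyph width already divided by 1000, and word spacing is applied only to the single-byte space code 32, as the specification requires.

```python
def shape_advance(w0: float, font_size: float) -> float:
    """Glyph width scaled by font size only; ignores Tc and Tw (hypothetical helper)."""
    return w0 * font_size


def rendered_advance(w0, font_size, tc=0.0, tw=0.0, th=1.0, tj=0.0, is_space=False):
    """Cursor displacement to the next glyph origin:
    ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th."""
    word_spacing = tw if is_space else 0.0  # Tw applies only to single-byte code 32
    return ((w0 - tj / 1000.0) * font_size + tc + word_spacing) * th


def is_word_gap(prev_origin_x, prev_rendered_advance, next_origin_x,
                font_size, ratio=0.2):
    """Flag a word boundary when the next glyph starts well past the cursor.
    The 0.2 em ratio is an arbitrary illustrative threshold, not a library default."""
    gap = next_origin_x - (prev_origin_x + prev_rendered_advance)
    return gap > ratio * font_size
```

With a 10 pt font and `w0 = 0.5`, the shape-only advance is 5.0, while a character spacing of 0.2 moves the cursor by about 5.2. Measuring the gap from the rendered cursor position, rather than from the glyph's shape width, keeps tracked-out text from being split into spurious words.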
Changed
- `TextChar` gained a required `rendered_advance` field: external callers constructing `TextChar { .. }` literals must add `rendered_advance` (set it equal to `advance_width` to preserve prior behaviour). (#602)
- Documented the three plain-text APIs: `extract_text`, `to_plain_text`, and markdown-strip now carry guidance on when each is the right choice and why their output differs, so callers stop picking the wrong mode per-PDF. (#554)
Fixed
- Hebrew and Arabic text now extracts in correct reading order (#557): right-to-left runs were emitted in visual (reversed) order; they now read in logical order in plain-text, Markdown/HTML, and tagged (structure-tree) extraction alike. Previously a tagged Hebrew document such as אבג דהו came out reversed. Latin text is never reordered.
- Two-column references and bibliographies are read column-by-column (#549, #536, #607): pages whose left and right columns share the same line baselines were read straight across, interleaving the two columns line by line (`…genetic exchange Kashtan, N., … divergence in prokaryotes reveals…`). They now read down the left column, then the right. Validated across the corpus: 15 academic pages jumped to ~0.98–0.99 similarity vs pdftotext + PyMuPDF, with no regression to tables.
- Chinese / Japanese / Korean text in UTF-8 CMap fonts is now extracted (#610): Type0 fonts encoded with a UTF-8 CMap (`Uni-Utf8-H` and the Adobe `UniGB-/UniCNS-/UniJIS-/UniKS-UTF8-H` family) previously returned no text at all; their 1–4-byte codes are now decoded correctly, recovering Latin and CJK including rare 4-byte ideographs.
- Non-embedded Japanese (JIS) fonts no longer produce garbled Latin: text using the bare predefined `H`/`V` CMaps with an Adobe-Japan1 collection (e.g. あいうえお) was emitted as nonsense ASCII; it now decodes to the correct kana/kanji.
- Pages with indirect-reference page boxes no longer come back empty: when a page's `/MediaBox` or `/CropBox` stored its coordinates as indirect references (`/MediaBox [4 0 R 5 0 R 6 0 R 7 0 R]`, ISO 32000-1 §7.3.10) the page collapsed to zero area and dropped all text; the references are now resolved per element.
- 180°-rotated pages read in the right order: a page with `/Rotate 180` was extracted in unrotated coordinates, so its lines and words came out fully reversed (a rotated English agreement read bottom-up, words backwards). The page geometry is now corrected before reading-order assembly. (90°/270° remain a follow-up.)
- Signature and form-field text stored only in the widget appearance is recovered: signed-signature fields and form widgets whose value lives in the `/AP` appearance stream (not a `/V` entry) were dropped from extraction; their visible text is now included.
- Unchecked checkboxes no longer inject `[ ]` noise: an unchecked checkbox widget previously emitted a stray `[ ]` marker into the surrounding text; it now contributes nothing.
- Page numbers and running headers no longer leak into body text (#553): a standalone page number or running-header line isolated on its own baseline is no longer spliced into the adjacent paragraph.
- Glyph corruption between documents that reuse a font name (#597, #598): Type 3 fonts (whose glyphs are document-scoped content streams) are no longer shared via the cross-document font cache, and the cache key now includes glyph-width metrics, so two fonts that share a BaseFont name but differ in `/Widths` no longer alias to one another.
- Type 3 font spacing now honours the font's `FontMatrix` (#606): glyph advances for Type 3 fonts were scaled by a hard-coded 1/1000 em; they now apply the font's own `FontMatrix[0]`, so Type 3 fonts with a non-standard (e.g. identity `[1 0 0 1 0 0]`) matrix get correct character and word spacing. Thanks @haberman.
- Faster, no double rescan on damaged PDFs (#572): a reconstructed cross-reference table now seeds the object-scan cache, removing a redundant second full-file sweep on corrupt/polyglot PDFs.
- Form XObject image cache poisoning when fonts/XObjects collide on basename: the OCG ink-filtering work also fixed three latent bugs in OCG/ink handling: a parser edge case, a Form XObject cache keyed too coarsely, and ink-state restore on graphics-state pop. Thanks @RayVR. (#600)
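The two-column reading-order fix in the list above can be illustrated with a toy example. This is a minimal sketch, not pdf_oxide's detection logic: the `Line` tuple, both function names, and the assumption that the gutter position is already known are all hypothetical. A real implementation also has to find the gutter and avoid splitting tables.

```python
from typing import List, Tuple

# (x_left, y_top, text); y grows downward in this toy coordinate system
Line = Tuple[float, float, str]


def read_straight_across(lines: List[Line]) -> List[str]:
    """Naive order: sort by baseline, then by x. Interleaves columns that share baselines."""
    return [text for _, _, text in sorted(lines, key=lambda ln: (ln[1], ln[0]))]


def read_two_columns(lines: List[Line], gutter_x: float) -> List[str]:
    """Column-aware order: all of the left column top-down, then the right column."""
    left = [ln for ln in lines if ln[0] < gutter_x]
    right = [ln for ln in lines if ln[0] >= gutter_x]
    ordered = sorted(left, key=lambda ln: ln[1]) + sorted(right, key=lambda ln: ln[1])
    return [text for _, _, text in ordered]


page = [
    (50.0, 100.0, "L1"), (320.0, 100.0, "R1"),
    (50.0, 120.0, "L2"), (320.0, 120.0, "R2"),
]
print(read_straight_across(page))              # ['L1', 'R1', 'L2', 'R2']
print(read_two_columns(page, gutter_x=300.0))  # ['L1', 'L2', 'R1', 'R2']
```

The first call reproduces the interleaving described in the bullet; the second reads down the left column before starting the right.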
Performance
- `extract_text` no longer hangs on heavily OCR-layered scans (#575): superscript-baseline snapping was quadratic in the number of text spans; it is now windowed, so pages with tens of thousands of OCR spans extract promptly instead of stalling.
- Regression guards added for previously-fixed word-spacing, character-clustering scaling, `to_html` table handling, and multi-column detection, so they cannot silently regress.
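As a rough illustration of how an all-pairs comparison can be "windowed", the sketch below finds spans whose baselines lie within a tolerance of each other by sliding a window over the sorted baselines instead of testing every pair. It is an assumption about the general technique, not the library's code: the function name, the flat list of baseline values, and the tolerance parameter are all hypothetical.

```python
from typing import List, Tuple


def near_baseline_pairs(baselines: List[float], tolerance: float) -> List[Tuple[int, int]]:
    """Return index pairs whose baselines differ by at most `tolerance`.

    Sorting costs O(n log n); after that each span is compared only with the
    spans inside its sliding window, not with every other span on the page.
    """
    order = sorted(range(len(baselines)), key=lambda i: baselines[i])
    pairs = []
    start = 0  # left edge of the window, as a position in `order`
    for pos, j in enumerate(order):
        # Drop spans that are now too far below the current baseline.
        while baselines[j] - baselines[order[start]] > tolerance:
            start += 1
        for k in range(start, pos):
            pairs.append((order[k], j))
    return pairs
```

When the spans are spread over many lines, each window stays small and the total work is close to linear after the sort. The worst case, where every span shares one baseline, still yields every pair.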
Installation
Rust (crates.io)
`cargo add pdf_oxide`
Python (PyPI)
`pip install pdf_oxide`
JavaScript/WASM (npm)
`npm install pdf-oxide-wasm`
CLI (Homebrew)
`brew install yfedoseev/tap/pdf-oxide`
CLI (Scoop, Windows)
`scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide`
`scoop install pdf-oxide`
CLI (Shell installer)
`curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh`
CLI (cargo-binstall)
`cargo binstall pdf_oxide_cli`
MCP Server (for AI assistants)
`cargo install pdf_oxide_mcp`
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.