Release v0.3.54 | Text-extraction fidelity pass. Hebrew / RTL reads in correct reading order thanks to a new geometric visual-vs-logical detector that finally fires on Hebrew (Pass 0's Arabic-Presentation-Forms gate never matched Hebrew because Hebrew has no presentation forms in Unicode). Bullets and `fi` / `fl` ligatures decode to their canonical codepoints via a new ISO 32000-1 §9.10.2 fallback level — Type0 Identity-H fonts without a `CIDToGIDMap` now consult the embedded TrueType `cmap` + `post` tables before the CID-as-Unicode last resort. Tight two-column prose bodies read column-by-column instead of row-by-row interleave (the known v0.3.53 issue), gated by a new table-vs-prose classifier so the v0.3.53 corpus-revert lesson holds — tables stay byte-identical. Two-column reference PDFs (Bibles / dictionaries / encyclopedias) inherit the same reading-order fix. Four fixes; zero new public surface. · yfedoseev/pdf_oxide

Fixed

Hebrew / RTL visual-vs-logical detection
(#537)
— Hebrew PDFs that store text in visual order (the PDF
content stream draws glyphs left-to-right even though the
script reads right-to-left) now extract in correct logical
order. New per-RTL-run X-coordinate-monotonicity detector
gates the existing UAX #9 bidi::reorder_visual_to_logical
pass; logical-order PDFs (the pdfium hebrew_mirrored.pdf
test fixture and Arabic CID-TrueType samples) stay
byte-identical. Pass 0's Arabic-Presentation-Forms trigger
was the only Arabic-specific gate; Hebrew has no
presentation forms in Unicode, so the geometric gate is
what unblocks it. Reported by alexagr with the Magic
Palace Eilat hotel PDF as the canonical case; reproduced
empirically before the fix landed.
ToUnicode CMap miss: extended ISO 32000-1 §9.10.2
fallback chain
(#535)
— Type0 Identity-H fonts without a CIDToGIDMap (the
PowerPoint / Acrobat-exported subset-font shape that
produced ~25,350 "ToUnicode CMap MISS" warnings on the
PowerPoint corpus) now consult the embedded TrueType
font's cmap + post tables to recover glyph names,
then look up the Adobe Glyph List (including the AGL §6
uniXXXX / uXXXXX synthetic-name patterns). Resolves
the bullet → ❍ substitution and the fi / fl
ligature corruption. The post-extraction band-aid
normalize_bullet_glyphs stays retired (it was wrong for
documents that legitimately use ❍); the fix is in the
decoder, where it belongs. Same shape pdf.js / MuPDF /
PDFBox 3.x use.
Multi-column prose reading order
(#534)
— tight two-column PROSE bodies (~12pt gutters) now read
column-by-column instead of row-by-row interleave. The
v0.3.53 known issue, explicitly documented in that
release's notes. New table-vs-prose classifier (region
is "prose" when lines are tall stacks of wide content,
"table" when lines are short cells; the v0.3.53 revert
comment in xycut.rs:73-101 is the spec) gates a
tight-gutter column cut so the same XY-cut recursion no
longer corrupts table cells — the v0.3.53 lesson where two
prior attempts were reverted by the 70-PDF corpus sweep
after the google_doc_document.pdf population table
("273.879.7501" → "1273.879.750") regressed. Adds
recursive band-separation of full-width header / footer
rows BEFORE column detection so the multi-column body
gets analyzed as its own region. Same 70-PDF sweep is the
acceptance gate.
Two-column reference PDFs: column-aware reading order
(#536)
— reference-style two-column PDFs (Bibles, dictionaries,
encyclopedias, academic editions with marginal section
numbers) now extract column-by-column thanks to the #534
classifier. The cascade where the spatial-table-detector
latched onto the interleaved structure and rendered the
whole multi-column body as a Markdown table-with-
one-word-per-cell is resolved: clean column reading order
produces clean prose, not a degenerate table. Concerns
about marginal-label classification, structural heading
isolation, cross-page section continuity, and
header / footer separation that the reporter usefully
called out are captured for the v0.4.0
structured-extraction track and acknowledged on the
issue. Reported by lggcs (Luke) with a 1256-page
French Louis Segond Bible as the canonical case;
reproduced empirically before the fix landed.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform	Architecture	Archive
Linux	x86_64 (glibc)	`pdf_oxide-linux-x86_64-*.tar.gz`
Linux	x86_64 (musl)	`pdf_oxide-linux-x86_64-musl-*.tar.gz`
Linux	ARM64	`pdf_oxide-linux-aarch64-*.tar.gz`
macOS	x86_64 (Intel)	`pdf_oxide-macos-x86_64-*.tar.gz`
macOS	ARM64 (Apple Silicon)	`pdf_oxide-macos-aarch64-*.tar.gz`
Windows	x86_64	`pdf_oxide-windows-x86_64-*.zip`

Changelog

See CHANGELOG.md for full details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Fixed

Installation

Platform Support

Changelog

Uh oh!