v0.3.58 | Structure-tree reading order with /Suspects handling, structured page extraction, two-column reference/verse routing, math-font punctuation + emoji spacing, mixed bidirectional text, /Rotate 90°/270°, image colour-space resolution, and a dependency refresh
Added
- `PdfDocument::extract_structured(page) -> StructuredPage`: additive typed page surface (#536). Returns the page's text grouped into reading-order `StructuredRegion`s with a `RegionRole` (`BodyBlock`, `StructuralHeading { level }`, `MarginalLabel`, `Header`, `Footer`, `PageNumber`, `Artifact`) and a best-effort `column_index` for two-column bodies. Roles reuse signals already on each span: `/Artifact` marked content (ISO 32000-1:2008 §14.8.2.2), structure-tree heading levels (§14.7.2), and geometry (§14.8.2.3.1). A trustworthy tagged PDF therefore yields tree-driven roles for free. New public types `StructuredPage` / `StructuredRegion` / `RegionRole` in `pdf_oxide::structured` (serde-serializable). Available across all bindings with idiomatic names: Python `extract_structured`, JS/WASM `extractStructured`, Go `ExtractStructured`, Ruby `extract_structured`, PHP `extractStructured`, Java `extractStructured`, C# `ExtractStructured`, and the C ABI `pdf_document_extract_structured_to_json` (all returning the serialized `StructuredPage`). Thanks @lggcs.
- `PdfDocument::prefers_structure_reading_order() -> bool`: read-only introspection accessor reporting whether text extraction will use the Tagged-PDF logical structure order (a depth-first traversal of `/StructTreeRoot`, ISO 32000-1:2008 §14.8.2.3.1 / §14.7.1) rather than geometric page-content order for this document. (#608)
- `PdfImageHandle::indexed_base() -> Option<ColorSpace>`: for an Indexed image (`[/Indexed base hival lookup]`, ISO 32000-1:2008 §8.6.6.3), the de-indexed base colour space; `None` for non-Indexed images. (#588)
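As a rough illustration of how a consumer might use role-tagged regions, the sketch below keeps headings and body text and drops page furniture. The types are local stand-ins that only mirror the role names listed above; they are not the crate's definitions. The `text` field, the `main_content` helper, and the sample data are hypothetical.

```rust
// Illustrative stand-ins only: these mirror the role names from the release
// notes, but they are NOT the definitions in `pdf_oxide::structured`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum RegionRole {
    BodyBlock,
    StructuralHeading { level: u8 },
    MarginalLabel,
    Header,
    Footer,
    PageNumber,
    Artifact,
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct StructuredRegion {
    role: RegionRole,
    text: String,                // hypothetical field name
    column_index: Option<usize>, // best-effort; None outside two-column bodies
}

/// Keep headings and body blocks in the order given; drop headers, footers,
/// page numbers, marginal labels, and artifacts.
fn main_content(regions: &[StructuredRegion]) -> Vec<String> {
    regions
        .iter()
        .filter_map(|r| match &r.role {
            RegionRole::BodyBlock => Some(r.text.clone()),
            RegionRole::StructuralHeading { level } => {
                Some(format!("{} {}", "#".repeat(*level as usize), r.text))
            }
            _ => None,
        })
        .collect()
}

/// Made-up page: regions are assumed to arrive already in reading order.
fn sample_page() -> Vec<StructuredRegion> {
    let region = |role, text: &str, column_index| StructuredRegion {
        role,
        text: text.to_string(),
        column_index,
    };
    vec![
        region(RegionRole::Header, "Running header", None),
        region(RegionRole::StructuralHeading { level: 1 }, "Introduction", None),
        region(RegionRole::BodyBlock, "Left column text.", Some(0)),
        region(RegionRole::BodyBlock, "Right column text.", Some(1)),
        region(RegionRole::PageNumber, "7", None),
    ]
}

fn main() {
    for line in main_content(&sample_page()) {
        println!("{}", line);
    }
}
```

Because the regions are already grouped in reading order, a filter like this does not need to look at geometry at all; `column_index` is only needed if the caller wants to treat the columns separately.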
Fixed
- AcroForm fill: non-ASCII (CJK/Japanese) field values no longer mojibake (#616). `DocumentEditor::set_form_field_value` wrote field values (and field names `/T`, tooltips `/TU`) as raw UTF-8 bytes, which a conformant reader (Acrobat, Stirling-PDF, Preview, pdf.js) interprets as PDFDocEncoding, so `山田太郎` rendered as `å±±ç"°…`. Values are now encoded as proper PDF text strings per ISO 32000-1:2008 §7.9.2.2: a PDFDocEncoding-compatible literal for ASCII/Latin-1, and UTF-16BE with a `U+FEFF` BOM for anything above U+00FF (`/V <FEFF5C717530592A90CE>`). ASCII values stay plain literals.
- AcroForm fill: a filled field no longer disappears on save (#617). After `set_form_field_value` + `save_to_bytes`, re-reading the document returned 0 form fields, because a standalone terminal field lost its `/FT` (field type) on rewrite, leaving an untyped widget annotation that `FormExtractor` could not classify (`/FT` is an inheritable, required key, §12.7.4.1). Parentless terminal fields now re-emit `/FT` (and `/Ff`), so fill → save → re-read round-trips correctly. (Both #616 and #617 surfaced while building the issue #611 Japanese form-fill round-trip; they fix the published crate, so every binding's form-fill benefits.)
- Two-column reference lists and short-verse bodies read column-by-column (#536). Pages whose left/right columns share line baselines but have short lines (bibliographies, Bible/lexicon verse editions) were read straight across, because the table-safety guard rejected anything with a low per-line character count. A length-independent admission path now routes them down the left column then the right when a single persistent central gutter is present, gated by concentration, coverage, column char-balance, and a grid-row signal so multi-cell numeric tables stay off the column path. Display-equation rows are excluded from the gutter-coverage measurement so dense-math pages still route. Validated table-safe: `google_doc_document.pdf` markdown is byte-identical.
- Punctuation glyphs parked at non-standard codes in symbolic fonts decode correctly (#536). When a font's `/Differences` names a code `period` / `comma` / `hyphen` / `minus` (ISO 32000-1:2008 §9.6.6.1) but its program / base encoding resolves that code to a non-sensible symbol, the Adobe Glyph List value (§9.10.2) is now preferred, recovering the intended punctuation. Tightly gated: correctly-mapped fonts and genuine `logicalnot` (¬) / math-symbol glyphs are untouched. (A figure axis that draws a symbol-font glyph whose encoding genuinely is `¬` where a decimal point belongs, i.e. the font itself is wrong, remains out of scope, since recovering it would require unsafe context guessing that corrupts legitimate `5×3` / `2±1`.)
- Mixed Arabic/Latin/numeral lines settle in logical order. A right-to-left line embedding European/Arabic-Indic numerals or Latin words (e.g. a date `14 april 1434 ٤٣٤١`) now gives each embedded left-to-right sub-run its own LTR level per the Unicode Bidirectional Algorithm (UAX #9 §3.3.4), instead of only reversing pure-RTL runs. The pass is gated to confidently-RTL mixed lines; pure-RTL, pure-LTR, and all ASCII/Latin extraction are byte-for-byte unchanged.
- Image `decode()` resolves resource-name colour spaces (#588). `decode()` on handles from `page_image_handles()` previously failed on an image whose `/ColorSpace` is a resource name (e.g. `/CS0` resolved via `/Resources/ColorSpace`, ISO 32000-1:2008 §8.6.6 / §8.9.7), returning `Unsupported color space`. The active resource colour-space map is now threaded into the handle (page, Form-XObject, and inline scopes) so such images decode. Indexed image handles now report `ColorSpace::Indexed` (1-component sample layout, §8.6.6.3) with the de-indexed base available via the new `indexed_base()` accessor. This is an API refinement of the v0.3.57-only handle surface; decoded pixels are unchanged.
- Suspect tagged PDFs now extract in geometric reading order (#608). A document advertising `/MarkInfo /Suspects true` (the `/TagSuspect /Ordering` signal that the producer could not guarantee page content order matches logical structure order, §14.8.2.3.1) was previously read through its structure tree by `extract_text`, which could emit content out of visual order. Such documents now fall back to geometric order, while trustworthy tagged PDFs (`/Marked`, or a catalog `/StructTreeRoot` on PDF-1.4 files that predate `/MarkInfo`) read in logical structure order across all four text accessors (`extract_text`, `to_plain_text`, `to_markdown`, `to_html`). A single shared trustworthiness predicate gates every reading-order path so they cannot drift apart, and a marked-content element whose spans cross multiple visual lines is now emitted in reading order. Non-suspect and untagged documents are byte-for-byte unchanged.
- `/Rotate 90` and `/Rotate 270` pages now read in display orientation. Pages with a 90°- or 270°-clockwise `/Rotate` (ISO 32000-1:2008 §7.7.3.3) were previously extracted in raw user-space coordinates and came out sideways, with lines and words in the wrong order. Every span is now mapped into the displayed coordinate frame, including the page width/height swap for 90°/270° (§8.3.3), before reading-order assembly, so all four rotation angles read upright. Annotation text on rotated pages is mapped into the same displayed frame. Unrotated and 180° pages are byte-for-byte unchanged (verified on the `issue14415` 180° contract, which now reads in correct order).
- Space after an emoji is no longer dropped. A pictographic glyph (e.g. 📄) immediately followed by a word (`📄 README.md`) previously merged into `📄README.md`, because the residual gap after the wide glyph fell below the proportional-font space threshold. An emoji→letter boundary with a positive gap now keeps the inter-token space. Word segmentation is reader latitude (ISO 32000-1:2008 §9.10); the rule is gated on pictographic codepoints (arrows and math-operator blocks excluded), so technical text is unaffected.
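The text-string rule behind the #616 fix above can be sketched in a few lines. This is a minimal illustration, not the crate's implementation: the function names are hypothetical, and the single-byte path is deliberately conservative (printable ASCII plus U+00A1 to U+00FF except U+00AD), on the assumption that those are the code points where Latin-1 and PDFDocEncoding bytes coincide. Everything else falls through to UTF-16BE with a BOM. Escaping of `(`, `)`, and `\` inside a literal string is out of scope here.

```rust
/// Conservative assumption for this sketch: code points whose Latin-1 byte
/// means the same thing in PDFDocEncoding.
fn is_single_byte_safe(c: char) -> bool {
    let cp = c as u32;
    (0x20..=0x7E).contains(&cp) || ((0xA1..=0xFF).contains(&cp) && cp != 0xAD)
}

/// Encode `s` as the bytes of a PDF text string (hypothetical helper).
/// Single-byte when every character is safe, otherwise UTF-16BE with BOM.
fn encode_pdf_text_string(s: &str) -> Vec<u8> {
    if s.chars().all(is_single_byte_safe) {
        // Every char is <= U+00FF here, so the cast to u8 is lossless.
        return s.chars().map(|c| c as u8).collect();
    }
    let mut out = vec![0xFE, 0xFF]; // U+FEFF byte-order mark, big-endian
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_be_bytes());
    }
    out
}

/// Render bytes as a PDF hex string such as `<FEFF...>`.
fn to_hex_string(bytes: &[u8]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{:02X}", b)).collect();
    format!("<{}>", hex)
}

fn main() {
    // ASCII stays a plain single-byte literal.
    println!("{:?}", encode_pdf_text_string("Taro"));
    // CJK switches the whole string to UTF-16BE with a BOM.
    println!("{}", to_hex_string(&encode_pdf_text_string("山田太郎")));
}
```

The encoding choice is made per string, not per character: one character outside the single-byte set moves the whole value to UTF-16BE, which is why the BOM appears exactly once at the start.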
Changed
- `PdfImageHandle::decode()` / `raw_compressed_bytes()` now borrow `&self` and `PdfImageHandle` is `Clone` (refining the v0.3.57 two-phase image API, #588). A single handle now supports an inspect → raw-bytes → decode flow without re-enumerating the page. Borrowing is a strict superset of the previous by-value signatures, so existing callers still compile.
- Indexed image handles now report `ColorSpace::Indexed` instead of the de-indexed base (refining the v0.3.57-only handle metadata, #588). The base remains available via the new `indexed_base()` accessor; this makes the handle's `color_space` reflect the 1-component sample layout per ISO 32000-1:2008 §8.6.6.3 and is consistent across the direct-array, indirect-reference, and inline-image forms. Decoded pixels are unchanged (`decode()` re-resolves the palette independently).
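To make the "1-component sample layout" versus "de-indexed base" distinction concrete, here is a minimal sketch of palette expansion for an `[/Indexed base hival lookup]` colour space. It is not the crate's `decode()` path. It assumes 8-bit samples (one index per byte) and a lookup table already available as raw bytes; the `deindex` function and its parameters are hypothetical names.

```rust
/// Expand Indexed samples into base-colour-space components.
///
/// `samples`: one palette index per pixel (assumes 8 bits per sample).
/// `lookup`: (hival + 1) entries of `base_components` bytes each.
/// Returns None if the lookup table is too short for `hival`.
fn deindex(
    samples: &[u8],
    lookup: &[u8],
    base_components: usize,
    hival: u8,
) -> Option<Vec<u8>> {
    let entries = hival as usize + 1;
    if base_components == 0 || lookup.len() < entries * base_components {
        return None;
    }
    let mut out = Vec::with_capacity(samples.len() * base_components);
    for &sample in samples {
        // Out-of-range indices are clamped to the valid range 0..=hival.
        let index = sample.min(hival) as usize;
        let start = index * base_components;
        out.extend_from_slice(&lookup[start..start + base_components]);
    }
    Some(out)
}

fn main() {
    // Three-entry RGB palette: red, green, blue (hival = 2).
    let lookup = [255, 0, 0, 0, 255, 0, 0, 0, 255];
    let samples = [0, 2, 1];
    let rgb = deindex(&samples, &lookup, 3, 2).expect("lookup table long enough");
    println!("{:?}", rgb);
}
```

Each input pixel is a single index, which is why the handle reports one component per sample, while the output has as many components per pixel as the base colour space.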
Dependencies
- Rust: `serde_json` 1.0.149 → 1.0.150 (#578), `log` 0.4.29 → 0.4.30 (#592), `quick-xml` 0.40.0 → 0.40.1 (#579), `cbc` 0.2.0 → 0.2.1 (pulls `cipher` 0.5.2 / `crypto-common` 0.2.2, #577).
- Go: `ebitengine/purego` 0.10.0 → 0.10.1 (#593).
- Ruby: `rubocop-rspec` 2.20 → 3.9 (#581).
- CI actions: `actions/setup-java` v4 → v5.2.0 (#585), `github/codeql-action` v3 4.35.5 → 4.36.0 (#582), `taiki-e/install-action` 2.79.2 → 2.79.12 (#580), `EmbarkStudios/cargo-deny-action` 2.0.18 → 2.0.20 (#584), `golangci/golangci-lint-action` 9.2.0 → 9.2.1 (#583).
Installation
Rust (crates.io)
    cargo add pdf_oxide
Python (PyPI)
    pip install pdf_oxide
JavaScript/WASM (npm)
    npm install pdf-oxide-wasm
CLI (Homebrew)
    brew install yfedoseev/tap/pdf-oxide
CLI (Scoop, Windows)
    scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
    scoop install pdf-oxide
CLI (Shell installer)
    curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
    cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
    cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.