
v0.3.58 | Structure-tree reading order with /Suspects handling, structured page extraction, two-column reference/verse routing, math-font punctuation + emoji spacing, mixed bidirectional text, /Rotate 90°/270°, image colour-space resolution, and a dependency refresh


@github-actions github-actions released this 31 May 17:21
e7fe552

Added

  • PdfDocument::extract_structured(page) -> StructuredPage — additive typed page surface (#536). Returns the page's text grouped into reading-order StructuredRegions with a RegionRole (BodyBlock, StructuralHeading { level }, MarginalLabel, Header, Footer, PageNumber, Artifact) and a best-effort column_index for two-column bodies. Roles reuse signals already on each span — /Artifact marked content (ISO 32000-1:2008 §14.8.2.2), structure-tree heading levels (§14.7.2), and geometry (§14.8.2.3.1) — so a trustworthy tagged PDF yields tree-driven roles for free. New public types StructuredPage / StructuredRegion / RegionRole in pdf_oxide::structured (serde-serializable). Available across all bindings with idiomatic names — Python extract_structured, JS/WASM extractStructured, Go ExtractStructured, Ruby extract_structured, PHP extractStructured, Java extractStructured, C# ExtractStructured, and the C ABI pdf_document_extract_structured_to_json (all returning the serialized StructuredPage). Thanks @lggcs.
  • PdfDocument::prefers_structure_reading_order() -> bool — read-only introspection accessor reporting whether text extraction will use the Tagged-PDF logical structure order (a depth-first traversal of /StructTreeRoot, ISO 32000-1:2008 §14.8.2.3.1 / §14.7.1) rather than geometric page-content order for this document. (#608)
  • PdfImageHandle::indexed_base() -> Option<ColorSpace> — for an Indexed image ([/Indexed base hival lookup], ISO 32000-1:2008 §8.6.6.3) the de-indexed base colour space; None for non-Indexed images. (#588)
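
To make the Indexed layout concrete, here is a minimal sketch of the palette lookup behind [/Indexed base hival lookup]: each image sample is one index, and the lookup data holds one base-colour tuple per index. The Palette type and its deindex method are hypothetical stand-ins written for this note, not pdf_oxide types, and clamping out-of-range indices to hival is this sketch's own choice.

```rust
/// Hypothetical stand-in for an Indexed colour space's palette
/// (`[/Indexed base hival lookup]`); not a pdf_oxide type.
struct Palette {
    /// Components per entry in the base colour space (3 for DeviceRGB).
    base_components: usize,
    /// Highest valid index (`hival`).
    hival: u8,
    /// Lookup data: `(hival + 1) * base_components` bytes.
    lookup: Vec<u8>,
}

impl Palette {
    /// Map one 1-component sample to its tuple in the base colour space.
    /// Out-of-range indices are clamped to `hival` (a choice made by this
    /// sketch); `None` means the lookup data is shorter than expected.
    fn deindex(&self, index: u8) -> Option<&[u8]> {
        let clamped = index.min(self.hival) as usize;
        let start = clamped * self.base_components;
        self.lookup.get(start..start + self.base_components)
    }
}

fn main() {
    // Two-entry RGB palette: index 0 is red, index 1 is blue.
    let palette = Palette {
        base_components: 3,
        hival: 1,
        lookup: vec![255, 0, 0, 0, 0, 255],
    };
    assert_eq!(palette.deindex(0), Some(&[255u8, 0, 0][..]));
    assert_eq!(palette.deindex(1), Some(&[0u8, 0, 255][..]));
    // Index 9 is above hival, so it resolves to the last entry.
    assert_eq!(palette.deindex(9), Some(&[0u8, 0, 255][..]));
}
```

In these terms, the handle's reported colour space describes the single index component, while indexed_base() corresponds to the colour space of the tuples the lookup returns.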

Fixed

  • AcroForm fill: non-ASCII (CJK/Japanese) field values no longer turn into mojibake (#616). DocumentEditor::set_form_field_value wrote field values (and field names /T, tooltips /TU) as raw UTF-8 bytes, which a conformant reader (Acrobat, Stirling-PDF, Preview, pdf.js) interprets as PDFDocEncoding, so 山田太郎 rendered as å±±ç"°…. Values are now encoded as proper PDF text strings per ISO 32000-1:2008 §7.9.2.2: a PDFDocEncoding-compatible literal for ASCII/Latin-1, and UTF-16BE with a U+FEFF BOM for anything above U+00FF (/V <FEFF5C717530592A90CE>). ASCII values stay plain literals.

  • AcroForm fill: a filled field no longer disappears on save (#617) — after set_form_field_value + save_to_bytes, re-reading the document returned 0 form fields, because a standalone terminal field lost its /FT (field type) on rewrite, leaving an untyped widget annotation FormExtractor could not classify (/FT is an inheritable, required key — §12.7.4.1). Parentless terminal fields now re-emit /FT (and /Ff), so fill → save → re-read round-trips correctly. (Both #616 and #617 surfaced building the issue #611 Japanese form-fill round-trip; they fix the published crate, so every binding's form-fill benefits.)

  • Two-column reference lists and short-verse bodies read column-by-column (#536) — pages whose left/right columns share line baselines but have short lines (bibliographies, Bible/lexicon verse editions) were read straight across, because the table-safety guard rejected anything with a low per-line character count. A length-independent admission path now routes them down the left column then the right when a single persistent central gutter is present, gated by concentration, coverage, column char-balance, and a grid-row signal so multi-cell numeric tables stay off the column path. Display-equation rows are excluded from the gutter-coverage measurement so dense-math pages still route. Validated table-safe: google_doc_document.pdf markdown is byte-identical.

  • Punctuation glyphs parked at non-standard codes in symbolic fonts decode correctly (#536). When a font's /Differences array names a code period, comma, hyphen, or minus (ISO 32000-1:2008 §9.6.6.1) but its font program / base encoding resolves that code to an unrelated symbol, the Adobe Glyph List value (§9.10.2) is now preferred, recovering the intended punctuation. Tightly gated: correctly mapped fonts and genuine logicalnot (¬) / math-symbol glyphs are untouched. (A figure axis that draws a symbol-font glyph whose encoding genuinely is ¬ where a decimal point belongs, meaning the font itself is wrong, remains out of scope, since recovering it would require unsafe context guessing that corrupts legitimate 5×3 / 2±1.)

  • Mixed Arabic/Latin/numeral lines settle in logical order — a right-to-left line embedding European/Arabic-Indic numerals or Latin words (e.g. a date 14 april 1434 ٤٣٤١) now gives each embedded left-to-right sub-run its own LTR level per the Unicode Bidirectional Algorithm (UAX #9 §3.3.4), instead of only reversing pure-RTL runs. The pass is gated to confidently-RTL mixed lines; pure-RTL, pure-LTR, and all ASCII/Latin extraction are byte-for-byte unchanged.

  • Image decode() resolves resource-name colour spaces (#588). Calling decode() on a handle from page_image_handles() previously failed on an image whose /ColorSpace is a resource name (e.g. /CS0 resolved via /Resources/ColorSpace, ISO 32000-1:2008 §8.6.6 / §8.9.7), returning Unsupported color space. The active resource colour-space map is now threaded into the handle (page, Form-XObject, and inline scopes) so such images decode. Indexed image handles now report ColorSpace::Indexed (1-component sample layout, §8.6.6.3) with the de-indexed base available via the new indexed_base() accessor. This is an API refinement of the v0.3.57-only handle surface; decoded pixels are unchanged.

  • Suspect tagged PDFs now extract in geometric reading order (#608) — a document advertising /MarkInfo /Suspects true (the /TagSuspect /Ordering signal that the producer could not guarantee page content order matches logical structure order, §14.8.2.3.1) was previously read through its structure tree by extract_text, which could emit content out of visual order. Such documents now fall back to geometric order, while trustworthy tagged PDFs — /Marked, or a catalog /StructTreeRoot on PDF-1.4 files that predate /MarkInfo — read in logical structure order across all four text accessors (extract_text, to_plain_text, to_markdown, to_html). A single shared trustworthiness predicate gates every reading-order path so they cannot drift apart, and a marked-content element whose spans cross multiple visual lines is now emitted in reading order. Non-suspect and untagged documents are byte-for-byte unchanged.

  • /Rotate 90 and /Rotate 270 pages now read in display orientation — pages with a 90°- or 270°-clockwise /Rotate (ISO 32000-1:2008 §7.7.3.3) were previously extracted in raw user-space coordinates and came out sideways, with lines and words in the wrong order. Every span is now mapped into the displayed coordinate frame — including the page width/height swap for 90°/270° (§8.3.3) — before reading-order assembly, so all four rotation angles read upright. Annotation text on rotated pages is mapped into the same displayed frame. Unrotated and 180° pages are byte-for-byte unchanged (verified on the issue14415 180° contract, which now reads in correct order).

  • Space after an emoji is no longer dropped — a pictographic glyph (e.g. 📄) immediately followed by a word (📄 README.md) previously merged into 📄README.md, because the residual gap after the wide glyph fell below the proportional-font space threshold. An emoji→letter boundary with a positive gap now keeps the inter-token space. Word segmentation is reader latitude (ISO 32000-1:2008 §9.10); the rule is gated on pictographic codepoints (arrows and math-operator blocks excluded), so technical text is unaffected.
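
The text-string split described in the #616 entry above can be sketched as follows. This is an illustrative approximation, not the pdf_oxide implementation: encode_pdf_text_string is a hypothetical name, and the sketch deliberately takes the literal path only for printable ASCII, which is stricter than the Latin-1 threshold the entry describes.

```rust
/// Encode `text` as a PDF string object: a literal for plain ASCII,
/// otherwise a hex string of UTF-16BE code units behind a U+FEFF BOM.
/// Hypothetical helper for illustration; not a pdf_oxide function.
fn encode_pdf_text_string(text: &str) -> String {
    // Assumption: only printable ASCII takes the literal path. This avoids
    // the codes where PDFDocEncoding and Latin-1 disagree.
    if text.chars().all(|c| (' '..='~').contains(&c)) {
        let mut out = String::from("(");
        for c in text.chars() {
            // Escaping backslash and parentheses is always safe in a literal.
            if matches!(c, '\\' | '(' | ')') {
                out.push('\\');
            }
            out.push(c);
        }
        out.push(')');
        return out;
    }

    // Everything else: BOM, then each UTF-16 code unit as four hex digits.
    let mut out = String::from("<FEFF");
    for unit in text.encode_utf16() {
        out.push_str(&format!("{:04X}", unit));
    }
    out.push('>');
    out
}

fn main() {
    assert_eq!(encode_pdf_text_string("Taro (test)"), "(Taro \\(test\\))");
    assert_eq!(encode_pdf_text_string("山田太郎"), "<FEFF5C717530592A90CE>");
}
```

Writing the raw UTF-8 bytes of 山田太郎 into a literal instead would give a reader nothing to distinguish them from PDFDocEncoding, which is the failure the entry describes.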

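For the /Rotate entry in the Fixed list above, the mapping from raw user space into the displayed frame can be sketched as below. It assumes a page box with its origin at (0, 0) and a bottom-left origin with y pointing up in both frames; to_display_frame and display_size are hypothetical helpers for illustration, not pdf_oxide functions.

```rust
/// Map a point from raw user space into the displayed frame of a page whose
/// unrotated size is `width` x `height` and whose /Rotate is `rotate`
/// degrees clockwise. Assumes the page box origin is (0, 0).
fn to_display_frame(x: f64, y: f64, width: f64, height: f64, rotate: i32) -> (f64, f64) {
    match rotate.rem_euclid(360) {
        90 => (y, width - x),
        180 => (width - x, height - y),
        270 => (height - y, x),
        // 0, or a value that is not a multiple of 90: leave the point as-is.
        _ => (x, y),
    }
}

/// Displayed page size: width and height swap for 90 and 270 degrees.
fn display_size(width: f64, height: f64, rotate: i32) -> (f64, f64) {
    match rotate.rem_euclid(360) {
        90 | 270 => (height, width),
        _ => (width, height),
    }
}

fn main() {
    // US Letter page; take the top-left corner of the unrotated page.
    let (w, h) = (612.0, 792.0);
    // Rotated 90 degrees clockwise, it lands at the displayed top-right.
    assert_eq!(display_size(w, h, 90), (792.0, 612.0));
    assert_eq!(to_display_frame(0.0, h, w, h, 90), (792.0, 612.0));
    // Rotated 270 degrees, it lands at the displayed bottom-left.
    assert_eq!(to_display_frame(0.0, h, w, h, 270), (0.0, 0.0));
    // Rotated 180 degrees, it lands at the bottom-right.
    assert_eq!(to_display_frame(0.0, h, w, h, 180), (612.0, 0.0));
}
```

Applying a mapping of this kind to every span before reading-order assembly lets the usual top-to-bottom, left-to-right sort operate on what the viewer actually shows.
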
Changed

  • PdfImageHandle::decode() / raw_compressed_bytes() now borrow &self and PdfImageHandle is Clone (refining the v0.3.57 two-phase image API, #588). A single handle now supports an inspect → raw-bytes → decode flow without re-enumerating the page. Borrowing is a strict superset of the previous by-value signatures, so existing callers still compile.
  • Indexed image handles now report ColorSpace::Indexed instead of the de-indexed base (refining the v0.3.57-only handle metadata, #588). The base remains available via the new indexed_base() accessor; this makes the handle's color_space reflect the 1-component sample layout per ISO 32000-1:2008 §8.6.6.3 and is consistent across the direct-array, indirect-reference, and inline-image forms. Decoded pixels are unchanged (decode() re-resolves the palette independently).
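
The inspect → raw-bytes → decode flow that borrowing enables can be illustrated with a small stand-in. ImageHandle, its field, and its method bodies are invented for this sketch and only mirror the method names mentioned above; the real pdf_oxide signatures and return types may differ.

```rust
/// Hypothetical stand-in for an image handle; not the pdf_oxide type.
#[derive(Clone)]
struct ImageHandle {
    compressed: Vec<u8>,
}

impl ImageHandle {
    /// Borrowing accessor: the handle remains usable afterwards.
    fn raw_compressed_bytes(&self) -> &[u8] {
        &self.compressed
    }

    /// Borrowing decode. As a placeholder for real decoding, this simply
    /// inverts every byte.
    fn decode(&self) -> Vec<u8> {
        self.compressed.iter().map(|b| !*b).collect()
    }
}

fn main() {
    let handle = ImageHandle { compressed: vec![0x00, 0x0F] };

    // Inspect the raw bytes, then decode, all on the same handle.
    let raw_len = handle.raw_compressed_bytes().len();
    let pixels = handle.decode();

    // Clone yields an independent handle that can be decoded separately.
    let copy = handle.clone();

    assert_eq!(raw_len, 2);
    assert_eq!(pixels, vec![0xFF, 0xF0]);
    assert_eq!(copy.decode(), pixels);
}
```

With method-call syntax, handle.decode() reads the same whether the receiver is taken by value or by reference, so call sites written against a by-value method keep their shape.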

Dependencies

  • Rust: serde_json 1.0.149 → 1.0.150 (#578), log 0.4.29 → 0.4.30 (#592), quick-xml 0.40.0 → 0.40.1 (#579), cbc 0.2.0 → 0.2.1 (pulls cipher 0.5.2 / crypto-common 0.2.2, #577). Go: ebitengine/purego 0.10.0 → 0.10.1 (#593). Ruby: rubocop-rspec 2.20 → 3.9 (#581). CI actions: actions/setup-java v4 → v5.2.0 (#585), github/codeql-action v3 4.35.5 → 4.36.0 (#582), taiki-e/install-action 2.79.2 → 2.79.12 (#580), EmbarkStudios/cargo-deny-action 2.0.18 → 2.0.20 (#584), golangci/golangci-lint-action 9.2.0 → 9.2.1 (#583).

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform   Architecture            Archive
Linux      x86_64 (glibc)          pdf_oxide-linux-x86_64-*.tar.gz
Linux      x86_64 (musl)           pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux      ARM64                   pdf_oxide-linux-aarch64-*.tar.gz
macOS      x86_64 (Intel)          pdf_oxide-macos-x86_64-*.tar.gz
macOS      ARM64 (Apple Silicon)   pdf_oxide-macos-aarch64-*.tar.gz
Windows    x86_64                  pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.