v0.3.60 | Converter performance sweep (no double per-page extraction, cached structure-tree traversal) + Arabic/Persian CIDFont extraction, ZapfDingbats coverage, graceful encrypted-PDF text extraction, and an `extract_tables` opt-out for speed-first text extraction
Added
- `TextChar::ascent` and `TextChar::descent`: glyph ascent and descent in device space (pre-multiplied by effective font size, matching the units of `advance_width` and `rendered_advance`). Sourced from the font's `FontDescriptor` (`/Ascent` / `/Descent`), with fallbacks to built-in metrics for the 14 standard PDF fonts and then Poppler-compatible defaults (0.95em / −0.35em). For Type0/CID fonts the values are now read from the CIDFont descendant's descriptor (§9.7.4) rather than silently falling back to 0.95em / −0.35em. Use `origin_y + ascent` / `origin_y + descent` directly to get glyph bounding-box edges. Thanks @haberman.
- Arabic/Persian CIDFont text extraction (Adobe-Arabic-1 / Adobe-Persian-1): Type0 CIDFonts that declare `/CIDSystemInfo /Ordering (Arabic)` or `(Persian)` and ship without an embedded `/ToUnicode` (Nazanin, Yagut, Mitra, Lotus and similar) now decode through the existing Arabic-block CID→Unicode mapping instead of falling through to Latin-Extended-B garbage. Both predefined-CMap dispatch sites (`character_mapper.rs`, `font_dict.rs`) gained `"Arabic"` / `"Persian"` ordering arms (ISO 32000-1:2008 §9.7.3 / §9.7.5 / §9.10.3 step-3 identity fallback).
- ZapfDingbats circled-digit and arrow glyphs (①–➓, ➔ ➾, → ↔ ↕): the standard-14 ZapfDingbats built-in encoding now maps the circled-digit ranges (`①–⑩`, `❶–❿`, `➀–➉`, `➊–➓`) and the arrow ranges (ISO 32000-1:2008 Annex D.6, octal codes 254–376) that were previously dropped, recovering this content from ZapfDingbats showcase documents.
- Symbol-font math operators ≤ ≥ ∞: the Adobe Symbol built-in encoding now maps `lessequal` / `greaterequal` / `infinity` (Annex D.5, octal 243/263/245), previously unmapped.
- CLI `text --format structured` and MCP `extract` `format: "structured"` (#626): both surfaces now expose the library's `extract_structured` API, emitting `StructuredPage` JSON: typed regions (`kind` = `RegionRole`: body, heading, marginal label, header/footer, page number, artifact) with per-region `column_index`, so two-column PDFs (Bibles, dictionaries, papers with side notes) come out as separate column blocks instead of line-interleaved. Previously this was reachable only from the Rust library and Python binding. Thanks @lggcs.
- `extract_tables` keyword on the Python `extract_text`: `doc.extract_text(page, extract_tables=False)` skips the table-detection sweep for speed-first raw-text extraction (the dense-academic-page hot spot). Default `True` reproduces previous behaviour byte-for-byte.
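
As a rough illustration of the ascent/descent entry above, the sketch below derives a glyph's vertical edges from `origin_y + ascent` and `origin_y + descent`, and shows the 0.95em / −0.35em defaults scaled by font size. `GlyphMetrics`, `vertical_edges` and `default_metrics` are hypothetical stand-ins written for this example; they are not the library's `TextChar` type or API.

```rust
/// Hypothetical stand-in for the fields named in the release note;
/// this is NOT the library's `TextChar` type.
#[derive(Debug, Clone, Copy)]
struct GlyphMetrics {
    /// Baseline position of the glyph.
    origin_y: f32,
    /// Ascent, already multiplied by the effective font size (positive).
    ascent: f32,
    /// Descent, already multiplied by the effective font size (negative).
    descent: f32,
}

/// Returns the two vertical bounding-box edges as
/// `(origin_y + ascent, origin_y + descent)`.
fn vertical_edges(g: GlyphMetrics) -> (f32, f32) {
    (g.origin_y + g.ascent, g.origin_y + g.descent)
}

/// Last-resort defaults quoted in the note (0.95em / -0.35em),
/// scaled here by an assumed effective font size.
fn default_metrics(font_size: f32) -> (f32, f32) {
    (0.95 * font_size, -0.35 * font_size)
}
```

Because the values are pre-multiplied by font size, no further scaling is needed before adding them to the baseline.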
Changed
- `TextChar` and `FontInfo` gain two new `pub` fields (`ascent: f32`, `descent: f32`): source-breaking for downstream code that constructs these structs with struct-literal syntax; add the two new fields to fix. Neither struct is `#[non_exhaustive]`.
- Encrypted PDFs that cannot be decrypted with the empty password now extract empty text instead of erroring: `extract_text` / `extract_spans` / `to_markdown` / `to_html` and their whole-document variants warn and return empty output, matching `pdftotext` / PyMuPDF (ISO 32000-1:2008 §7.6). `page_count` still returns `Err(EncryptedPdf)` so callers that query document structure are not silently handed zero pages; image extraction still fails closed.
Fixed
- Decimal points in Computer-Modern math subsets no longer decode as `¬`: when a CM/Symbol subset draws its decimal from the `logicalnot` slot, a `¬` directly between two digits (e.g. `1¬00`) is recovered as `.` (`1.00`); spaced logic/set `¬` is left untouched.
- Oversized drop-cap / table-title initials re-attach to their word: a lone uppercase initial set in a larger font (so it became its own span) is merged with the body run to its right before reading-order sorting, fixing `TABLE` → `T … ABLE` stranding on regulatory tables. Gated to genuine initials (oversized vs the page's median body text, touching its continuation, on the same baseline) so inline math (`A_st`), word-spaced capitals (`A Perspective`), and tall initials reaching the line above are left alone.
- Rotated text runs are read as their own blocks instead of scrambling the page: a run drawn with a rotated text matrix (`atan2(b,a)` of `T_m × CTM`, ISO 32000-1 §9.4.4), such as the vertical `arXiv:…` margin stamp, rotated plot axis labels, or rotated table column headers, was interleaved into the horizontal row-band / XY-cut sort, whose axis-aligned assumptions it violates. Such runs now carry a `rotation_degrees` and are stably lifted out of the horizontal flow (which keeps its exact prior order; pages with no rotated text are byte-identical) and re-emitted as their own blocks ordered in an upright frame. Recovers e.g. the rotated `Array`/`Boolean`/… headers of a PDF-syntax matrix and stops a chart's axis labels from fusing into its data (`0CardSort` → `0` + `CardSort`).
- Borderless numeric results tables keep one value per column: a dense ML/benchmark grid laid out on a tight, regular numeric pitch (no ruling lines) had adjacent columns fused by greedy clustering (`0.69 0.76` sharing one cell). When the spans are predominantly numeric and a finer text-edge column set recurs across ≥3 rows on a regular pitch, and still forms a valid grid, each value now lands in its own column. The validity probe guarantees the refinement never demotes an otherwise-valid table to prose.
- `to_markdown` / `to_html` now honour `exclude_regions` and `include_region` (#609): these `ConversionOptions` region filters were applied only by the plain-text path (`extract_text` / `to_plain_text`), so markdown and HTML emitted the whole page regardless of the requested exclusions. The filter now runs up front for all three surfaces (shared `apply_region_filters`), before table/heading/reading-order processing, so excluded content is gone everywhere; tables in excluded regions drop too. No-op when neither field is set (default output unchanged). Thanks @alexanderameye.
- Dense numeric tables are no longer flattened into bold-label + run-on numbers: the spatial-table quality gate rejected any ≥5-column table whose cells are >70% single-word, to suppress prose accidentally split into one-word columns. A genuine numeric data table (financial/metrics slides, benchmark grids) is legitimately almost all single tokens, since every cell is a number, so it was wrongly rejected and emitted as flattened text. The gate now bypasses that rule when a table is numeric-dominated (≥50% of cells are data values like `5,012` / `+2%` / `240`); number-heavy prose stays below the threshold and is still rejected. Recovers e.g. the tracemonkey SunSpider benchmark table and arXiv result/count matrices that were previously flattened.
- A stray superscript ordinal is no longer promoted to a heading: when a superscript ordinal (`st`/`nd`/`rd`/`th`) is split from its number ("May 5th" → "May 5" + superscript "th"), the lone suffix was emitted as its own `#### th` heading under `detect_headings`, fragmenting the document outline. Heading detection now rejects a bare ordinal suffix.
- Scanned/image pages are marked instead of rendering silently blank: a page with no extractable text that classifies as scanned/image now emits a `> [OCR REQUIRED — page N]` block-quote in `to_markdown` / `to_markdown_all`, so a reader of a scanned document sees where content was lost and OCR is needed rather than half the document silently missing. Gated to genuinely scanned/image pages (legitimately-blank pages are untouched) and suppressible via the new `ConversionOptions::annotate_skipped_pages` (default `true`).
- OCR reading-order sort no longer panics the host process: the detection-box reading-order comparator (`ocr::engine`) used a non-transitive rule ("compare X when the Y gap is < 10 px, else compare Y"), which Rust's sort detects on image-text pages with a few near-aligned labels and turns into a `comparison function does not correctly implement a total order` panic, aborting the host across every binding (C#/Go/Node/Rust). It is now a genuine total order: group by a fixed 10 px Y band, then X, then Y.
- Text extraction no longer panics on spans with out-of-range coordinates: a span whose centre maps to the `i32` limits could overflow the dedup-grid neighbour scan in an overflow-checked build; the cell-index arithmetic now saturates.
- Line-end hyphen rejoin is more conservative: a hard `-` wrap only rejoins when both fragments are lowercase-alpha (keeps `COVID-19` / `well-Known`); soft hyphens (U+00AD) still rejoin per §14.8.2.2.3.
- Bogus `U+FFFF` / `U+FFFE` ToUnicode placeholders fall back instead of being emitted: some producers stuff the BMP noncharacters into a `/ToUnicode` CMap as a "no glyph" marker (e.g. an Identity-H subset mapping every CID to `<ffff>`). Noncharacters are never valid interchange text, so for Identity-encoded fonts these are now treated as a CMap miss and routed through the CID→GID→embedded-`cmap` / CID-as-Unicode fallback (recovering real text when the embedded font carries a usable `cmap`), consistent with the existing notdefrange-`U+FFFD` handling.
- Filling a merged field+widget form field no longer blanks the widget: when an AcroForm text field's dictionary is also its widget annotation (the common single-widget case, ISO 32000-1:2008 §12.7.4.1), `DocumentEditor::set_form_field_value` followed by `save_to_bytes` wrote the value onto a freshly-allocated bare field object (no `/T`, `/Subtype`, `/Rect`, `/AP`) appended to the page `/Annots`, and left the real widget the reader displays with an empty `/V`. Readers (PyMuPDF, Acrobat) then showed an empty field and could not build an appearance stream. The value (and `/AS` for buttons) is now written in place on the existing field/widget object, preserving its full dictionary and its `/Fields` / `/Annots` membership, with `/NeedAppearances` set so viewers regenerate; CJK values round-trip (`山田太郎` → `/V <FEFF…>`). Thanks @mitslabo.
- Subset CFF (Type 1C) fonts resolve glyphs through the PDF `/Encoding` (#629): for a simple CFF font the byte→glyph-name mapping is now taken from the font dictionary's `/Encoding` (its `/BaseEncoding`, one of WinAnsi / MacRoman / StandardEncoding per Annex D, plus `/Differences`) and resolved to a GID through the CFF Charset (parsed over the full `nGlyphs`), per ISO 32000-1:2008 §9.6.6, instead of the font program's own, frequently sparse, built-in Encoding table. Aggressively-subset CFF fonts (common in prepress/packaging artwork) whose internal Encoding mapped only a handful of bytes previously dropped every other byte to `.notdef`, painting a single fallback glyph; they now render the full subset. Thanks @RayVR.
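
To make the OCR reading-order fix above concrete, here is a minimal sketch of a comparator that groups by a fixed 10 px Y band, then X, then Y. `DetBox`, `band`, `reading_order` and `sort_boxes` are hypothetical names for this example and not the code in `ocr::engine`; the sketch also assumes image coordinates with Y increasing downward.

```rust
use std::cmp::Ordering;

/// Hypothetical detection box, reduced to its anchor point (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct DetBox {
    x: f32,
    y: f32,
}

/// Fixed band height from the release note.
const BAND_PX: f32 = 10.0;

/// Maps a Y coordinate to its fixed band index. Because the band depends
/// only on the box itself (never on the other operand), the comparison
/// below is transitive.
fn band(y: f32) -> i64 {
    (y / BAND_PX).floor() as i64
}

/// Total order: band first, then X, then Y as the final tie-break.
/// `f32::total_cmp` keeps the order total even if a NaN slips in.
fn reading_order(a: &DetBox, b: &DetBox) -> Ordering {
    band(a.y)
        .cmp(&band(b.y))
        .then(a.x.total_cmp(&b.x))
        .then(a.y.total_cmp(&b.y))
}

/// Sorts boxes into reading order using the comparator above.
fn sort_boxes(boxes: &mut [DetBox]) {
    boxes.sort_by(reading_order);
}
```

With boxes at Y = 0, 8 and 16, a pairwise "gap < 10 px" rule treats the first and second as one row and the second and third as one row, yet orders the first and third by Y; that inconsistency is what a fixed band avoids.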
Performance
All performance changes preserve output (byte-identical for inputs that previously decoded correctly), except where noted; a small number carry a floating-point/tie-break corpus note in the release notes.
- Converters no longer extract every page twice: `to_markdown` / `to_html` extracted a page's spans once for their own use and again inside the table-detection path (`extract_page_tables` → `extract_words` → `page_reading_order`). Postprocessed spans are now memoized per page (invalidated by redaction), removing the second full glyph-decode + postprocess pass.
- `to_html` structure-tree traversal is now cached (O(pages²) → O(pages)): `to_html`'s reading-order context used the un-cached per-page `traverse_structure_tree` walk; on a tagged N-page document that is O(N × tree) ≈ O(N²). It now uses the document-level all-pages traversal cache (the pattern `to_markdown` already used, #608).
- `find_table_elements` is computed once per document: the converter table path walked the whole structure tree per page (another O(pages²) on tagged docs); a single all-pages walk now buckets `Table` elements by page for O(1) per-page lookup.
- XY-cut depth bound + clustering memoization: `partition_indexed` gained a recursion depth cap (bounds the O(n)-deep singleton-peel pathology on header/footer-heavy pages); and `classify_region_kind` is now computed once per node instead of being re-run by each prose detector.
- Type0/CID `char_to_unicode` is memoized per font: composite-font glyphs re-ran the full decode cascade (and re-allocated) on every occurrence; a per-font cache makes repeat decodes O(1), also removing the redundant CJK-fallback rescan and TJ/RTL double-decodes.
- Superscript/subscript band detection is O(n): `apply_super_sub_script_substitutions` replaced its per-span Y-window walk (O(n²) on wide rows) with a sliding-window maximum.
- TJ-distribution analysis is O(1) per query: `analyze_tj_distribution` maintains running mean/variance accumulators instead of re-scanning the offset history (was O(n²)/page on justified text).
- Table detection fail-fast gate: a page with thousands of line/rectangle paths (an engineering drawing/chart, not a ruled table) skips the O(E²) cell-reconstruction sweep.
- Large contiguous ToUnicode `bfrange`s are stored compressed: a `<0000><FFFF>`-style range no longer persists ~65 536 individual `String`s in the cached CMap; long contiguous runs are collapsed to range entries resolved by binary search. Document-order overrides (§9.10.3, #619) are preserved, because the compression runs over the final mapping state.
- Global font cache uses a concurrent read lock: the cross-document font cache moved from `Mutex<LRU>` to `RwLock<FIFO>`, so concurrent readers don't serialize (single-threaded behaviour unchanged; cache is correctness-neutral, #408/#595 isolation preserved).
- Superscript/subscript neighbour scan is spatially indexed: `span_is_token_internal`'s same-line scan (O(n) per candidate) now queries a Y-band index, byte-identically.
- Misc: span-sort caches its row-band key once per span (`sort_by_row_band`); stroke/fill dedup uses a grid index (exact IoU preserved); the Layer-4 font-set cache guard memoizes each font's identity hash per object (no per-page re-load + re-hash); whole-document converters pre-reserve output buffers and write page wrappers without per-page `format!` temporaries; detector binning/cloning micro-optimised.
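
The `bfrange` compression entry above can be sketched as follows: contiguous code→Unicode runs collapse into range entries, and lookups use a binary search. This is a simplified illustration, not the library's CMap code. `RangeEntry`, `compress` and `lookup` are hypothetical names, each code is assumed to map to a single scalar value rather than a `String`, and the input is assumed to be the final mapping state, sorted by code with no duplicates.

```rust
/// One contiguous run: codes `start..=end` map to `dst_start + (code - start)`.
/// Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RangeEntry {
    start: u32,
    end: u32,
    dst_start: u32,
}

/// Collapses (code, unicode) pairs into range entries.
/// Assumes `pairs` is sorted by code and contains no duplicate codes.
fn compress(pairs: &[(u32, u32)]) -> Vec<RangeEntry> {
    let mut out: Vec<RangeEntry> = Vec::new();
    for &(code, dst) in pairs {
        if let Some(last) = out.last_mut() {
            // Extend the current run only when both the source code and the
            // destination value continue it by exactly one.
            let next_src = u64::from(last.end) + 1;
            let next_dst = u64::from(last.dst_start) + u64::from(last.end - last.start) + 1;
            if next_src == u64::from(code) && next_dst == u64::from(dst) {
                last.end = code;
                continue;
            }
        }
        out.push(RangeEntry { start: code, end: code, dst_start: dst });
    }
    out
}

/// Binary search for the first range whose `end` is at or past `code`,
/// then check that the range actually starts at or before `code`.
fn lookup(ranges: &[RangeEntry], code: u32) -> Option<u32> {
    let idx = ranges.partition_point(|r| r.end < code);
    let r = ranges.get(idx)?;
    if r.start <= code {
        Some(r.dst_start + (code - r.start))
    } else {
        None
    }
}
```

Under these assumptions, an identity mapping over `0x0000..=0xFFFF` is stored as one entry instead of 65 536, and lookups cost O(log r) in the number of ranges.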
Installation
Rust (crates.io)
    cargo add pdf_oxide
Python (PyPI)
    pip install pdf_oxide
JavaScript/WASM (npm)
    npm install pdf-oxide-wasm
CLI (Homebrew)
    brew install yfedoseev/tap/pdf-oxide
CLI (Scoop, Windows)
    scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
    scoop install pdf-oxide
CLI (Shell installer)
    curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
    cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
    cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.