v0.3.60 | Converter performance sweep (no double per-page extraction, cached structure-tree traversal) + Arabic/Persian CIDFont extraction, ZapfDingbats coverage, graceful encrypted-PDF text extraction, and an `extract_tables` opt-out for speed-first text extraction

@github-actions github-actions released this 04 Jun 04:04
2e52cec

Added

  • TextChar::ascent and TextChar::descent — glyph ascent and descent in device space (pre-multiplied by effective font size, matching the units of advance_width and rendered_advance). Sourced from the font's FontDescriptor (/Ascent / /Descent), with fallbacks to built-in metrics for the 14 standard PDF fonts and then Poppler-compatible defaults (0.95em / −0.35em). For Type0/CID fonts the values are now read from the CIDFont descendant's descriptor (§9.7.4) rather than silently falling back to 0.95em / −0.35em. Use origin_y + ascent / origin_y + descent directly to get glyph bounding-box edges. Thanks @haberman.

  • Arabic/Persian CIDFont text extraction (Adobe-Arabic-1 / Adobe-Persian-1) — Type0 CIDFonts that declare /CIDSystemInfo /Ordering (Arabic) or (Persian) and ship without an embedded /ToUnicode (Nazanin, Yagut, Mitra, Lotus and similar) now decode through the existing Arabic-block CID→Unicode mapping instead of falling through to Latin-Extended-B garbage. Both predefined-CMap dispatch sites (character_mapper.rs, font_dict.rs) gained "Arabic"/"Persian" ordering arms (ISO 32000-1:2008 §9.7.3 / §9.7.5 / §9.10.3 step-3 identity fallback).

  • ZapfDingbats circled-digit and arrow glyphs (①–➓, ➔ ➾, → ↔ ↕) — the standard-14 ZapfDingbats built-in encoding now maps the circled-digit ranges (①–⑩, ❶–❿, ➀–➉, ➊–➓) and the arrow ranges (ISO 32000-1:2008 Annex D.6, octal codes 254–376) that were previously dropped, recovering this content from ZapfDingbats showcase documents.

  • Symbol-font math operators ≤ ≥ ∞ — the Adobe Symbol built-in encoding now maps lessequal/greaterequal/infinity (Annex D.5, octal 243/263/245), previously unmapped.

  • CLI text --format structured and MCP extract format: "structured" (#626) — both surfaces now expose the library's extract_structured API, emitting StructuredPage JSON: typed regions (kind = RegionRole — body, heading, marginal label, header/footer, page number, artifact) with per-region column_index, so two-column PDFs (Bibles, dictionaries, papers with side notes) come out as separate column blocks instead of line-interleaved. Previously this was reachable only from the Rust library and Python binding. Thanks @lggcs.

  • extract_tables keyword on the Python extract_text: doc.extract_text(page, extract_tables=False) skips the table-detection sweep for speed-first raw-text extraction (the dense-academic-page hot spot). The default, True, reproduces previous behaviour byte-for-byte.
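
The ascent/descent entry above says glyph bounding-box edges come from origin_y + ascent and origin_y + descent. A minimal sketch of that arithmetic follows. GlyphMetrics is a hypothetical stand-in that copies only the three field names mentioned in the notes; it is not the library's TextChar, and line_extent is an illustrative helper, not a library function.

```rust
/// Hypothetical stand-in for the fields the notes name on `TextChar`.
/// All values are in device space, already scaled by the effective font size.
#[derive(Debug, Clone, Copy)]
struct GlyphMetrics {
    origin_y: f32,
    ascent: f32,  // positive, above the baseline
    descent: f32, // negative, below the baseline
}

/// Returns (descent edge, ascent edge) for one glyph.
fn glyph_vertical_edges(g: &GlyphMetrics) -> (f32, f32) {
    (g.origin_y + g.descent, g.origin_y + g.ascent)
}

/// Illustrative helper: the smallest vertical range covering every glyph
/// in a run, or `None` for an empty run.
fn line_extent(glyphs: &[GlyphMetrics]) -> Option<(f32, f32)> {
    let first = glyphs.first()?;
    let init = glyph_vertical_edges(first);
    Some(glyphs[1..].iter().fold(init, |(lo, hi), g| {
        let (d, a) = glyph_vertical_edges(g);
        (lo.min(d), hi.max(a))
    }))
}
```

Because the values are already multiplied by the font size, no further scaling is needed before adding them to origin_y.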

Changed

  • TextChar and FontInfo gain two new pub fields (ascent: f32, descent: f32). This is source-breaking for downstream code that constructs these structs with struct-literal syntax; add the two new fields to fix. Neither struct is #[non_exhaustive].
  • Encrypted PDFs that cannot be decrypted with the empty password now extract empty text instead of erroring: extract_text/extract_spans/to_markdown/to_html and their whole-document variants warn and return empty output, matching pdftotext/PyMuPDF (ISO 32000-1:2008 §7.6). page_count still returns Err(EncryptedPdf) so callers that query document structure are not silently handed zero pages; image extraction still fails closed.
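
For the struct-literal break described above, one low-friction pattern is to route construction through a single helper, so a newly added field is fixed in one place. The sketch below uses a hypothetical CharRecord struct, not the real TextChar or FontInfo, whose full field lists are not given here. Filling the new fields with 0.95em / −0.35em is an assumption that borrows the fallback defaults quoted in these notes.

```rust
/// Hypothetical struct that, like the ones in the notes, gained two
/// public `f32` fields and is not `#[non_exhaustive]`.
#[derive(Debug, Clone, PartialEq)]
pub struct CharRecord {
    pub ch: char,
    pub origin_y: f32,
    pub ascent: f32,  // newly required in every struct literal
    pub descent: f32, // newly required in every struct literal
}

/// Single construction site for tests and fixtures. The default metrics
/// are an assumption: 0.95em above and 0.35em below the baseline.
pub fn make_char_record(ch: char, origin_y: f32, font_size: f32) -> CharRecord {
    CharRecord {
        ch,
        origin_y,
        ascent: 0.95 * font_size,
        descent: -0.35 * font_size,
    }
}
```

Call sites that previously wrote the literal by hand then change to make_char_record('A', 700.0, 10.0) and are unaffected by later field additions.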

Fixed

  • Decimal points in Computer-Modern math subsets no longer decode as ¬ — when a CM/Symbol subset draws its decimal from the logicalnot slot, a ¬ directly between two digits (e.g. 1¬00) is recovered as . (1.00); spaced logic/set ¬ is left untouched.
  • Oversized drop-cap / table-title initials re-attach to their word — a lone uppercase initial set in a larger font (so it became its own span) is merged with the body run to its right before reading-order sorting, fixing TABLET … ABLE stranding on regulatory tables. Gated to genuine initials (oversized vs the page's median body text, touching its continuation, on the same baseline) so inline math (A_st), word-spaced capitals (A Perspective), and tall initials reaching the line above are left alone.
  • Rotated text runs are read as their own blocks instead of scrambling the page — a run drawn with a rotated text matrix (atan2(b,a) of T_m × CTM, ISO 32000-1 §9.4.4) — the vertical arXiv:… margin stamp, rotated plot axis labels, rotated table column headers — was interleaved into the horizontal row-band / XY-cut sort, whose axis-aligned assumptions it violates. Such runs now carry a rotation_degrees and are stably lifted out of the horizontal flow (which keeps its exact prior order — pages with no rotated text are byte-identical) and re-emitted as their own blocks ordered in an upright frame. Recovers e.g. the rotated Array/Boolean/… headers of a PDF-syntax matrix and stops a chart's axis labels from fusing into its data (0CardSort0 + CardSort).
  • Borderless numeric results tables keep one value per column — a dense ML/benchmark grid laid out on a tight, regular numeric pitch (no ruling lines) had adjacent columns fused by greedy clustering (0.69 0.76 sharing one cell). When the spans are predominantly numeric and a finer text-edge column set recurs across ≥3 rows on a regular pitch — and still forms a valid grid — each value now lands in its own column. The validity probe guarantees the refinement never demotes an otherwise-valid table to prose.
  • to_markdown / to_html now honour exclude_regions and include_region (#609) — these ConversionOptions region filters were applied only by the plain-text path (extract_text / to_plain_text), so markdown and HTML emitted the whole page regardless of the requested exclusions. The filter now runs up front for all three surfaces (shared apply_region_filters), before table/heading/reading-order processing, so excluded content is gone everywhere; tables in excluded regions drop too. No-op when neither field is set (default output unchanged). Thanks @alexanderameye.
  • Dense numeric tables are no longer flattened into bold-label + run-on numbers — the spatial-table quality gate rejected any ≥5-column table whose cells are >70% single-word, to suppress prose accidentally split into one-word columns. A genuine numeric data table (financial/metrics slides, benchmark grids) is legitimately almost all single tokens — every cell is a number — so it was wrongly rejected and emitted as flattened text. The gate now bypasses that rule when a table is numeric-dominated (≥50% of cells are data values like 5,012 / +2% / 240); number-heavy prose stays below the threshold and is still rejected. Recovers e.g. the tracemonkey SunSpider benchmark table and arXiv result/count matrices that were previously flattened.
  • A stray superscript ordinal is no longer promoted to a heading — when a superscript ordinal (st/nd/rd/th) is split from its number ("May 5th" → "May 5" + superscript "th"), the lone suffix was emitted as its own #### th heading under detect_headings, fragmenting the document outline. Heading detection now rejects a bare ordinal suffix.
  • Scanned/image pages are marked instead of rendering silently blank — a page with no extractable text that classifies as scanned/image now emits a > [OCR REQUIRED — page N] block-quote in to_markdown/to_markdown_all, so a reader of a scanned document sees where content was lost and OCR is needed rather than half the document silently missing. Gated to genuinely scanned/image pages (legitimately-blank pages are untouched) and suppressible via the new ConversionOptions::annotate_skipped_pages (default true).
  • OCR reading-order sort no longer panics the host process: the detection-box reading-order comparator (ocr::engine) used a non-transitive rule ("compare X when the Y gap is < 10 px, else compare Y"). Rust's sort detects the inconsistency on image-text pages with a few near-aligned labels and raises a "comparison function does not correctly implement a total order" panic, aborting the host across every binding (C#/Go/Node/Rust). It is now a genuine total order: group by a fixed 10 px Y band, then X, then Y.
  • Text extraction no longer panics on spans with out-of-range coordinates — a span whose centre maps to the i32 limits could overflow the dedup-grid neighbour scan in an overflow-checked build; the cell-index arithmetic now saturates.
  • Line-end hyphen rejoin is more conservative — a hard - wrap only rejoins when both fragments are lowercase-alpha (keeps COVID-19 / well-Known); soft hyphens (U+00AD) still rejoin per §14.8.2.2.3.
  • Bogus U+FFFF/U+FFFE ToUnicode placeholders fall back instead of being emitted — some producers stuff the BMP noncharacters into a /ToUnicode CMap as a "no glyph" marker (e.g. an Identity-H subset mapping every CID to <ffff>). Noncharacters are never valid interchange text, so for Identity-encoded fonts these are now treated as a CMap miss and routed through the CID→GID→embedded-cmap / CID-as-Unicode fallback (recovering real text when the embedded font carries a usable cmap), consistent with the existing notdefrange-U+FFFD handling.
  • Filling a merged field+widget form field no longer blanks the widget: when an AcroForm text field's dictionary is also its widget annotation (the common single-widget case, ISO 32000-1:2008 §12.7.4.1), DocumentEditor::set_form_field_value followed by save_to_bytes wrote the value onto a freshly-allocated bare field object (no /T, /Subtype, /Rect, /AP) appended to the page /Annots, and left the real widget the reader displays with an empty /V. Readers (PyMuPDF, Acrobat) then showed an empty field and could not build an appearance stream. The value (and /AS for buttons) is now written in place on the existing field/widget object, preserving its full dictionary and its /Fields / /Annots membership, with /NeedAppearances set so viewers regenerate; CJK values round-trip (山田太郎 → /V <FEFF…>). Thanks @mitslabo.
  • Subset CFF (Type 1C) fonts resolve glyphs through the PDF /Encoding (#629) — for a simple CFF font the byte→glyph-name mapping is now taken from the font dictionary's /Encoding (its /BaseEncoding — WinAnsi / MacRoman / StandardEncoding per Annex D — plus /Differences) and resolved to a GID through the CFF Charset (parsed over the full nGlyphs), per ISO 32000-1:2008 §9.6.6 — instead of the font program's own, frequently sparse, built-in Encoding table. Aggressively-subset CFF fonts (common in prepress/packaging artwork) whose internal Encoding mapped only a handful of bytes previously dropped every other byte to .notdef, painting a single fallback glyph; they now render the full subset. Thanks @RayVR.
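
The OCR comparator fix above replaces a gap-based rule with a key-based one: "group by a fixed 10 px Y band, then X, then Y". A minimal sketch of such a comparator follows. DetBox, BAND_PX and the function names are illustrative assumptions, not the contents of ocr::engine. Because each box maps to a fixed key (band, x, y), the ordering is transitive by construction.

```rust
use std::cmp::Ordering;

/// Hypothetical detection box, reduced to its reference point.
#[derive(Debug, Clone, Copy, PartialEq)]
struct DetBox {
    x: f32,
    y: f32,
}

/// Band height in pixels, taken from the rule quoted in the notes.
const BAND_PX: f32 = 10.0;

/// Fixed band index: depends on one box only, never on a pairwise gap.
fn band(y: f32) -> i64 {
    (y / BAND_PX).floor() as i64
}

/// Total order: band first, then X, then Y. `total_cmp` keeps the
/// comparison well-defined even for NaN coordinates.
fn reading_order(a: &DetBox, b: &DetBox) -> Ordering {
    band(a.y)
        .cmp(&band(b.y))
        .then(a.x.total_cmp(&b.x))
        .then(a.y.total_cmp(&b.y))
}

fn sort_reading_order(boxes: &mut [DetBox]) {
    boxes.sort_by(reading_order);
}
```

The trade-off of fixed bands is that two boxes 1 px apart can land in different bands when they straddle a band boundary; the pairwise-gap rule avoided that but could not be sorted consistently.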

Performance

All performance changes preserve output (byte-identical for inputs that previously decoded correctly), except where noted; a small number carry a floating-point/tie-break corpus note in the release notes.

  • Converters no longer extract every page twice: to_markdown/to_html extracted a page's spans once for their own use and again inside the table-detection path (extract_page_tables → extract_words → page_reading_order). Postprocessed spans are now memoized per page (invalidated by redaction), removing the second full glyph-decode + postprocess pass.
  • to_html structure-tree traversal is now cached (O(pages²) → O(pages)): to_html's reading-order context used the un-cached per-page traverse_structure_tree walk; on a tagged N-page document that is O(N × tree) ≈ O(N²). It now uses the document-level all-pages traversal cache (the pattern to_markdown already used, #608).
  • find_table_elements is computed once per document — the converter table path walked the whole structure tree per page (another O(pages²) on tagged docs); a single all-pages walk now buckets Table elements by page for O(1) per-page lookup.
  • XY-cut depth bound + clustering memoization: partition_indexed gained a recursion depth cap (bounds the O(n)-deep singleton-peel pathology on header/footer-heavy pages), and classify_region_kind is now computed once per node instead of being re-run by each prose detector.
  • Type0/CID char_to_unicode is memoized per font — composite-font glyphs re-ran the full decode cascade (and re-allocated) on every occurrence; a per-font cache makes repeat decodes O(1), also removing the redundant CJK-fallback rescan and TJ/RTL double-decodes.
  • Superscript/subscript band detection is O(n): apply_super_sub_script_substitutions replaced its per-span Y-window walk (O(n²) on wide rows) with a sliding-window maximum.
  • TJ-distribution analysis is O(1) per query: analyze_tj_distribution maintains running mean/variance accumulators instead of re-scanning the offset history (was O(n²)/page on justified text).
  • Table detection fail-fast gate — a page with thousands of line/rectangle paths (an engineering drawing/chart, not a ruled table) skips the O(E²) cell-reconstruction sweep.
  • Large contiguous ToUnicode bfranges are stored compressed — a <0000><FFFF>-style range no longer persists ~65 536 individual Strings in the cached CMap; long contiguous runs are collapsed to range entries resolved by binary search. Document-order overrides (§9.10.3, #619) are preserved — the compression runs over the final mapping state.
  • Global font cache uses a concurrent read lock — the cross-document font cache moved from Mutex<LRU> to RwLock<FIFO>, so concurrent readers don't serialize (single-threaded behaviour unchanged; cache is correctness-neutral, #408/#595 isolation preserved).
  • Superscript/subscript neighbour scan is spatially indexed: span_is_token_internal's same-line scan (O(n) per candidate) now queries a Y-band index, byte-identically.
  • Misc — span-sort caches its row-band key once per span (sort_by_row_band); stroke/fill dedup uses a grid index (exact IoU preserved); the Layer-4 font-set cache guard memoizes each font's identity hash per object (no per-page re-load + re-hash); whole-document converters pre-reserve output buffers and write page wrappers without per-page format! temporaries; detector binning/cloning micro-optimised.
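
The bfrange entry above describes collapsing long contiguous runs into range entries that are resolved by binary search. A minimal sketch of that idea follows. It assumes the final mapping is available as (code, value) pairs sorted by strictly ascending code, and it models each target as a single u32 scalar rather than a string. RangeEntry, compress and lookup are illustrative names, not the library's types.

```rust
/// One contiguous run: codes first_code..=last_code map to
/// first_value, first_value + 1, and so on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RangeEntry {
    first_code: u32,
    last_code: u32,
    first_value: u32,
}

/// Collapse sorted (code, value) pairs into runs. A pair extends the
/// current run only if both its code and its value continue the run.
fn compress(pairs: &[(u32, u32)]) -> Vec<RangeEntry> {
    let mut out: Vec<RangeEntry> = Vec::new();
    for &(code, value) in pairs {
        if let Some(last) = out.last_mut() {
            if last.last_code.checked_add(1) == Some(code) {
                let offset = code - last.first_code;
                if last.first_value.checked_add(offset) == Some(value) {
                    last.last_code = code;
                    continue;
                }
            }
        }
        out.push(RangeEntry { first_code: code, last_code: code, first_value: value });
    }
    out
}

/// Binary search for the first run whose last_code is >= code, then
/// check that the code actually falls inside that run.
fn lookup(ranges: &[RangeEntry], code: u32) -> Option<u32> {
    let idx = ranges.partition_point(|r| r.last_code < code);
    let r = ranges.get(idx)?;
    if r.first_code <= code {
        Some(r.first_value + (code - r.first_code))
    } else {
        None
    }
}
```

Running the compression over the final mapping state, as the notes describe, means any later override has already replaced the earlier value before runs are formed, so an overridden code simply breaks a run in two.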

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform   Architecture            Archive
Linux      x86_64 (glibc)          pdf_oxide-linux-x86_64-*.tar.gz
Linux      x86_64 (musl)           pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux      ARM64                   pdf_oxide-linux-aarch64-*.tar.gz
macOS      x86_64 (Intel)          pdf_oxide-macos-x86_64-*.tar.gz
macOS      ARM64 (Apple Silicon)   pdf_oxide-macos-aarch64-*.tar.gz
Windows    x86_64                  pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.