Skip to content

v0.3.56 | Text-extraction fidelity sweep — XY-cut routing, typed extraction status, OCR API repair, Persian font support, encryption authentication enforcement

Choose a tag to compare

@github-actions github-actions released this 29 May 03:49
· 171 commits to main since this release
1bfd69e

Added

  • ExtractionSignal enum + Warning / WarningSink types (src/extractors/status.rs, src/extractors/warnings.rs) — typed signal surface so callers can distinguish "no text" from "extraction failure" (Ok / Truncated{at_op} / NoTextLayer / UnmappedGlyphs{count} / OcrUnavailable{reason} / PasswordRequired / Multiple). Foundation for flatten_warnings() and per-call status accessors.
  • PdfPermissions accessor (src/encryption/permissions.rs) — decodes /P flags per PDF spec §7.6.3.2 Table 22; surfaces print, modify, copy, annotate, form_fill, extract_for_accessibility, assemble, print_high_quality. Closes #562 API gap.
  • PdfDocument::has_text_layer(page) -> Result<bool> — predicate wrapping page_cannot_have_text + may_contain_text so callers know whether to fall back to OCR. Closes #563.
  • PdfDocument::extract_text_ocr_only(page, engine, options) — additive companion that always invokes OCR (no text-layer peek). Closes #574.
  • set_max_ops_per_stream(Option<usize>) (src/content/parser.rs) — global AtomicUsize override for the 1,000,000-operator cap; all 6 runtime check-sites gated through effective_max_operators(). Closes #559.
  • set_preserve_unmapped_glyphs(bool) (src/extractors/text.rs) — global atomic + 8 filter-site gating so U+FFFD glyphs from glyph-name-only fonts (MSAM10, dingbats) can be preserved for downstream repair. Closes #571.
  • assemble_text_via_reading_order(spans) helper + Python wrappers + 4 per-class detector predicates (DenseSingleLine, SubSuperBaselineReattach, NarrowTrackedJustified, DramaticScript) in src/pipeline/reading_order/detectors.rs. The detectors return a ReadingOrderClass callers can route on. Bench-visible improvements on the same issue cluster (#549/#556/#561/#565/#568/#576) come from the parallel XYCutStrategy work documented under Changed — full detector-driven assembly is a follow-up.
  • ExtractionProfile::TJ_HEAVY (src/config/extraction_profiles.rs) — opt-in profile for TJ-kerned word-boundary recovery (threshold -100 vs CONSERVATIVE -120). Closes #564.
  • Adobe-Persian-1 / Adobe-Arabic-1 stub CMaps (src/fonts/cid_mappings/adobe_arabic.rs) + DescendantFonts inline-dict parse path in src/fonts/font_dict.rs. Closes #566.
  • PyPageCount dual-shape — Python doc.page_count works as both attribute and method (__call__ + __index__) so range(doc.page_count) works again post-v0.3.54 regression. Closes #550.
  • PdfDocument::structured_warnings() accessor + WarningSink wired as the per-document field type, with 5 log::warn! migration sites pushing into the global sink. Per-document and global drains merge on each call. Closes #558 (second half).

Changed

  • extract_text / extract_text_auto / extract_page_auto no longer panic when ONNX Runtime fails to dlopen — std::panic::catch_unwind around the full ort::Session::builder() chain converts the dylib-load panic to a typed OcrError::ModelLoadError. Closes #569, #573.
  • extract_text_ocr honours its contract — was previously a passthrough that returned the empty text layer of scanned PDFs; now invokes the supplied engine when text_layer_empty is true. Closes #574.
  • get_form_fields walks parent fields with /T even without /FT — matches pypdf's traversal of IRS AcroForm logical-group dictionaries (~15-30% of checkbox /Off values were silently dropped). Closes #570.
  • Default Python logger no longer captures pdf_oxide.{parser,content,fonts,document} WARN records — at import, _setup_default_log_levels attaches a NullHandler and sets propagate = False on those four high-frequency targets (standard library convention per PEP 282); records stop at the pdf_oxide logger boundary instead of bubbling to root's default stderr handler. Callers re-enable bubbling via logger.propagate = True or read the typed feed via doc.structured_warnings(). Closes #558 (first half).
  • Span-boundary spacing tightened — bucket-1/2/3/4 fixes: span bbox.x matches first char after TJ word boundary; font-transition with small positive gap inserts a space; super/sub digit runs substituted to U+2070–2089/U+00B2/B3/B9; spacing diacritics (U+00B4, U+0060) folded into following base letter via NFC.
  • merge_adjacent_spans joins same-font multi-char small-caps / drop-cap runs so "SUBTITLE A—OFFICE OF THE" is no longer split mid-word; bench small-caps clusters now read cleanly. Closes #555 root-cause path; complements #560.
  • Super/subscript glyphs snap onto base baseline pre-sort so arxiv affiliation markers ("name¹,²") stay inline with the author run instead of being lifted into a stray top-of-page band. Improves bench by lifting 6 papers in the py-pdf 14-PDF set.
  • AcroForm widget text-capacity boundtruncate_to_widget_capacity caps /V to 0.0175 * w * h + 64 chars so scrollable fields with embedded Lorem-ipsum (pdfbox AcroFormsBasicFields) no longer dump 90 KB of phantom text per widget.
  • Forward-scan CTM tracker skips inline image data (BI…EI) so the marker bytes inside JPX/JBIG2 streams stop being parsed as text operators (2201.00151 p2: 758 of 789 garbage spans → 0).
  • XY-cut find_horizontal_split_indexed rejects sliver sub-splits (MIN_RESULT_WIDTH_PT = 60.0 pt) so 8 pt monospaced columns no longer fragment into 48 pt strips.
  • Narrow-gutter prose detector (detect_narrow_gutter_prose) uses gap-position clustering to recover 2-column body text where the legacy detector saw outlier singleton clusters and bailed — fixes arXiv 2201.00151-class column interleave. Bench: py-pdf 14-PDF mean 89.4% → 90.2%.

Fixed

  • #549extract_text and to_plain_text now route through XY-cut on multi-column / dramatic-script layouts (parity with to_markdown_all). Largest single bench improvement of the cycle.
  • #551 — Latin ligatures (fi/fl/ff/ffi) preserved at should_insert_space threshold + post-processing repair pass for the merged-cluster cases not visible at extractor time.
  • #552 — Combining diacritics (´, `, ^, etc.) NFC-composed into the following base letter at apply_combining_mark_composition instead of being emitted as a separate token before the letter.
  • #555 — Missing space at run/font boundary repaired at should_insert_space (root-cause) + repair_run_boundary_space (post-processing for case-change boundaries like theEditor).
  • #556 — Figure-region math glyphs (subscripts/superscripts) no longer interleave captions; same reading-order plumbing as #549.
  • #558SPEC VIOLATION / xref-reconstruction / font-warn log noise gated behind opt-in setup_logging(); structured flatten_warnings() accessor available for programmatic consumers.
  • #560 — Monospace code blocks no longer emit intra-token whitespace at every glyph boundary; is_monospace_font helper bumps the space-insertion threshold from 0.5× to 1.2× space-width.
  • #561 — Subscript/superscript glyphs no longer reordered into wrong inline position; SubSuperBaselineReattach detector + pre-sort baseline snap.
  • #562password_protected.pdf and copy_protected.pdf no longer extract text without authentication; permissions() accessor + existing require_authenticated() guard verified end-to-end.
  • #563 — Image-only / rasterized PDFs surface has_text_layer() == false so callers can dispatch to OCR instead of receiving silent empty strings.
  • #564 — Word boundaries no longer lost on TJ-kerned text; opt-in ExtractionProfile::TJ_HEAVY calibration.
  • #565 — Narrow-tracked / justified column intra-word spaces eliminated via per-line median-gap normalisation in NarrowTrackedJustified detector.
  • #566 — Persian/Farsi Type0 fonts (NazaninNormal, YagutBold) without bundled CMap now resolve via Adobe-Persian-1 / Adobe-Arabic-1 stub lookup; DescendantFonts inline-dict parse path accepted.
  • #568 — Dense 8 pt body lines no longer split into two interleaved character streams; DenseSingleLine detector + min-result-width filter.
  • #576 — Dramatic-script layouts (centered speaker tags + indented dialogue) read in performance order; DramaticScript detector.

Security

  • #562 — Verified no text extraction path bypasses authentication on encrypted PDFs. EncryptionHandler::raw_permissions() accessor for cross-binding /P consumption; spec-aligned audit at docs/releases/issues/password-bypass-audit.md.

Deprecated

  • PdfDocument.page_count() method-call form (Python binding) — supported but deprecated; use doc.page_count attribute form instead. Removal scheduled for v0.4.0 (#414). Closes #550.

Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform Architecture Archive
Linux x86_64 (glibc) pdf_oxide-linux-x86_64-*.tar.gz
Linux x86_64 (musl) pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux ARM64 pdf_oxide-linux-aarch64-*.tar.gz
macOS x86_64 (Intel) pdf_oxide-macos-x86_64-*.tar.gz
macOS ARM64 (Apple Silicon) pdf_oxide-macos-aarch64-*.tar.gz
Windows x86_64 pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.