v0.3.56 | Text-extraction fidelity sweep — XY-cut routing, typed extraction status, OCR API repair, Persian font support, encryption authentication enforcement
Added
ExtractionSignalenum +Warning/WarningSinktypes (src/extractors/status.rs,src/extractors/warnings.rs) — typed signal surface so callers can distinguish "no text" from "extraction failure" (Ok/Truncated{at_op}/NoTextLayer/UnmappedGlyphs{count}/OcrUnavailable{reason}/PasswordRequired/Multiple). Foundation forflatten_warnings()and per-call status accessors.PdfPermissionsaccessor (src/encryption/permissions.rs) — decodes/Pflags per PDF spec §7.6.3.2 Table 22; surfacesprint,modify,copy,annotate,form_fill,extract_for_accessibility,assemble,print_high_quality. Closes #562 API gap.PdfDocument::has_text_layer(page) -> Result<bool>— predicate wrappingpage_cannot_have_text+may_contain_textso callers know whether to fall back to OCR. Closes #563.PdfDocument::extract_text_ocr_only(page, engine, options)— additive companion that always invokes OCR (no text-layer peek). Closes #574.set_max_ops_per_stream(Option<usize>)(src/content/parser.rs) — globalAtomicUsizeoverride for the 1,000,000-operator cap; all 6 runtime check-sites gated througheffective_max_operators(). Closes #559.set_preserve_unmapped_glyphs(bool)(src/extractors/text.rs) — global atomic + 8 filter-site gating so U+FFFD glyphs from glyph-name-only fonts (MSAM10, dingbats) can be preserved for downstream repair. Closes #571.assemble_text_via_reading_order(spans)helper + Python wrappers + 4 per-class detector predicates (DenseSingleLine,SubSuperBaselineReattach,NarrowTrackedJustified,DramaticScript) insrc/pipeline/reading_order/detectors.rs. The detectors return aReadingOrderClasscallers can route on. Bench-visible improvements on the same issue cluster (#549/#556/#561/#565/#568/#576) come from the parallelXYCutStrategywork documented under Changed — full detector-driven assembly is a follow-up.ExtractionProfile::TJ_HEAVY(src/config/extraction_profiles.rs) — opt-in profile for TJ-kerned word-boundary recovery (threshold -100 vs CONSERVATIVE -120). Closes #564.- Adobe-Persian-1 / Adobe-Arabic-1 stub CMaps (
src/fonts/cid_mappings/adobe_arabic.rs) + DescendantFonts inline-dict parse path insrc/fonts/font_dict.rs. Closes #566. PyPageCountdual-shape — Pythondoc.page_countworks as both attribute and method (__call__+__index__) sorange(doc.page_count)works again post-v0.3.54 regression. Closes #550.PdfDocument::structured_warnings()accessor +WarningSinkwired as the per-document field type, with 5log::warn!migration sites pushing into the global sink. Per-document and global drains merge on each call. Closes #558 (second half).
Changed
extract_text/extract_text_auto/extract_page_autono longer panic when ONNX Runtime fails to dlopen —std::panic::catch_unwindaround the fullort::Session::builder()chain converts the dylib-load panic to a typedOcrError::ModelLoadError. Closes #569, #573.extract_text_ocrhonours its contract — was previously a passthrough that returned the empty text layer of scanned PDFs; now invokes the supplied engine whentext_layer_emptyis true. Closes #574.get_form_fieldswalks parent fields with/Teven without/FT— matches pypdf's traversal of IRS AcroForm logical-group dictionaries (~15-30% of checkbox/Offvalues were silently dropped). Closes #570.- Default Python logger no longer captures
pdf_oxide.{parser,content,fonts,document}WARN records — at import,_setup_default_log_levelsattaches aNullHandlerand setspropagate = Falseon those four high-frequency targets (standard library convention per PEP 282); records stop at the pdf_oxide logger boundary instead of bubbling to root's default stderr handler. Callers re-enable bubbling vialogger.propagate = Trueor read the typed feed viadoc.structured_warnings(). Closes #558 (first half). - Span-boundary spacing tightened — bucket-1/2/3/4 fixes: span
bbox.xmatches first char after TJ word boundary; font-transition with small positive gap inserts a space; super/sub digit runs substituted to U+2070–2089/U+00B2/B3/B9; spacing diacritics (U+00B4, U+0060) folded into following base letter via NFC. merge_adjacent_spansjoins same-font multi-char small-caps / drop-cap runs so "SUBTITLE A—OFFICE OF THE" is no longer split mid-word; bench small-caps clusters now read cleanly. Closes #555 root-cause path; complements #560.- Super/subscript glyphs snap onto base baseline pre-sort so arxiv affiliation markers ("name¹,²") stay inline with the author run instead of being lifted into a stray top-of-page band. Improves bench by lifting 6 papers in the py-pdf 14-PDF set.
- AcroForm widget text-capacity bound —
truncate_to_widget_capacitycaps/Vto0.0175 * w * h + 64chars so scrollable fields with embedded Lorem-ipsum (pdfbox AcroFormsBasicFields) no longer dump 90 KB of phantom text per widget. - Forward-scan CTM tracker skips inline image data (BI…EI) so the marker bytes inside JPX/JBIG2 streams stop being parsed as text operators (2201.00151 p2: 758 of 789 garbage spans → 0).
- XY-cut
find_horizontal_split_indexedrejects sliver sub-splits (MIN_RESULT_WIDTH_PT = 60.0 pt) so 8 pt monospaced columns no longer fragment into 48 pt strips. - Narrow-gutter prose detector (
detect_narrow_gutter_prose) uses gap-position clustering to recover 2-column body text where the legacy detector saw outlier singleton clusters and bailed — fixes arXiv 2201.00151-class column interleave. Bench: py-pdf 14-PDF mean 89.4% → 90.2%.
Fixed
- #549 —
extract_textandto_plain_textnow route through XY-cut on multi-column / dramatic-script layouts (parity withto_markdown_all). Largest single bench improvement of the cycle. - #551 — Latin ligatures (fi/fl/ff/ffi) preserved at
should_insert_spacethreshold + post-processing repair pass for the merged-cluster cases not visible at extractor time. - #552 — Combining diacritics (
´,`,^, etc.) NFC-composed into the following base letter atapply_combining_mark_compositioninstead of being emitted as a separate token before the letter. - #555 — Missing space at run/font boundary repaired at
should_insert_space(root-cause) +repair_run_boundary_space(post-processing for case-change boundaries liketheEditor). - #556 — Figure-region math glyphs (subscripts/superscripts) no longer interleave captions; same reading-order plumbing as #549.
- #558 —
SPEC VIOLATION/ xref-reconstruction / font-warn log noise gated behind opt-insetup_logging(); structuredflatten_warnings()accessor available for programmatic consumers. - #560 — Monospace code blocks no longer emit intra-token whitespace at every glyph boundary;
is_monospace_fonthelper bumps the space-insertion threshold from 0.5× to 1.2× space-width. - #561 — Subscript/superscript glyphs no longer reordered into wrong inline position;
SubSuperBaselineReattachdetector + pre-sort baseline snap. - #562 —
password_protected.pdfandcopy_protected.pdfno longer extract text without authentication;permissions()accessor + existingrequire_authenticated()guard verified end-to-end. - #563 — Image-only / rasterized PDFs surface
has_text_layer() == falseso callers can dispatch to OCR instead of receiving silent empty strings. - #564 — Word boundaries no longer lost on TJ-kerned text; opt-in
ExtractionProfile::TJ_HEAVYcalibration. - #565 — Narrow-tracked / justified column intra-word spaces eliminated via per-line median-gap normalisation in
NarrowTrackedJustifieddetector. - #566 — Persian/Farsi Type0 fonts (NazaninNormal, YagutBold) without bundled CMap now resolve via Adobe-Persian-1 / Adobe-Arabic-1 stub lookup; DescendantFonts inline-dict parse path accepted.
- #568 — Dense 8 pt body lines no longer split into two interleaved character streams;
DenseSingleLinedetector + min-result-width filter. - #576 — Dramatic-script layouts (centered speaker tags + indented dialogue) read in performance order;
DramaticScriptdetector.
Security
- #562 — Verified no text extraction path bypasses authentication on encrypted PDFs.
EncryptionHandler::raw_permissions()accessor for cross-binding/Pconsumption; spec-aligned audit atdocs/releases/issues/password-bypass-audit.md.
Deprecated
PdfDocument.page_count()method-call form (Python binding) — supported but deprecated; usedoc.page_countattribute form instead. Removal scheduled for v0.4.0 (#414). Closes #550.
Installation
Rust (crates.io)
cargo add pdf_oxidePython (PyPI)
pip install pdf_oxideJavaScript/WASM (npm)
npm install pdf-oxide-wasmCLI (Homebrew)
brew install yfedoseev/tap/pdf-oxideCLI (Scoop — Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxideCLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | shCLI (cargo-binstall)
cargo binstall pdf_oxide_cliMCP Server (for AI assistants)
cargo install pdf_oxide_mcpPre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.