v0.3.51 | Comprehensive auto extraction — per-page text-vs-OCR with typed reason codes, graceful native fallback, and image-table recovery — across all seven bindings plus the CLI and MCP server; a pre-merge release-pipeline dry-run; and five bundled fixes.
Added
- Comprehensive auto extraction (#517): a new, strictly additive
surface that returns recoverable text
decided per page/region with a machine-readable reason for every
degraded result, and a graceful warn-and-fall-back-to-native
policy (never a crash, never a silent empty). The classifier consumes
pdf_oxide internals (Tr render-mode-3, GlyphlessFont/no-embedded
ratio, notdef/U+FFFD, union of CTM-transformed image boxes, image
codec, structure tree, producer/XMP) — strictly more accurate than a
post-hoc heuristic on the flattened text. New: a configured-once
`AutoExtractor` (`new`/`text_only`/`with` + `fast`/`balanced`/
`high_fidelity` presets + builder), `extract_text`/`extract_markdown`/
`extract_html`/`extract_page`/`extract_document`, the cheap
`classify_page`/`classify_document` preflight (+ `pages_needing_ocr`),
a one-shot `PdfDocument::extract_text_auto`, an enriched T0.5
text-quality gate (U+FFFD ratio + critical-fragmentation hard-trigger +
a column-scramble/consecutive-repeat detector), an optional
`force_ocr_pages` per-page OCR override, and build-time
`AutoExtractor::prefetch_models()`/`model_manifest()` (the
`pdf-oxide models prefetch`/`manifest` Dockerfile contract). Exposed
across all seven bindings (Rust, C-ABI, Python, WASM, Node, C#,
Go cgo+purego; Go via idiomatic functional options) as a frozen
JSON envelope, plus CLI subcommands `classify`/`auto`/`models` and
MCP tools `classify`/`auto`. Existing `extract_text`/CLI/MCP
behaviour is byte-identical.
- `AutoExtractor` semantics are precisely specified: `TextOnly` returns
native text without classifying (the cheapest path); each
per-page result reports its actual source/reason, so a native
fallback after a failed/empty/absent OCR is `Fallback` +
`OcrRequestedButUnavailable`, never mislabelled `Ocr`;
`classify_page`/`classify_document` fail closed on
encrypted-unauthenticated PDFs (a security op) while non-security
per-page errors degrade gracefully; the "OCR unavailable" warning
is emitted only when the `ocr` feature is absent; and
`model_cache_dir()` resolves cross-platform (Windows
`%LOCALAPPDATA%`/`%USERPROFILE%`, else `$XDG_CACHE_HOME` or
`$HOME/.cache`; dependency-free).
- The local-CPU tier ships via the existing ONNX OCR engine + spatial
table detector; the SLANet + PP-DocLayout-S ONNX models are a
documented zero-API-change point-release follow-up
(`tier-model-strategy.md` §5); the API, prefetch and manifest
contracts are stable now.
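The cache-directory lookup described above can be illustrated with a minimal, self-contained sketch. It is not pdf_oxide's implementation: `resolve_model_cache_dir` is a hypothetical helper, the precedence of `$PDF_OXIDE_MODEL_DIR` over the platform defaults is an assumption, and so is the `pdf_oxide/models` subdirectory. The environment is injected as a closure so the logic can be exercised without touching the real process environment.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not the crate's API): resolves a model cache
/// directory from an injected environment lookup.
fn resolve_model_cache_dir<F>(is_windows: bool, env: F) -> Option<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty values as unset.
    let get = |key: &str| env(key).filter(|v| !v.is_empty());

    // Assumption: an explicit override wins on every platform.
    if let Some(dir) = get("PDF_OXIDE_MODEL_DIR") {
        return Some(PathBuf::from(dir));
    }

    let base: Option<PathBuf> = if is_windows {
        // %LOCALAPPDATA% first, then %USERPROFILE% as a last resort.
        get("LOCALAPPDATA")
            .or_else(|| get("USERPROFILE"))
            .map(PathBuf::from)
    } else {
        // $XDG_CACHE_HOME first, then $HOME/.cache.
        get("XDG_CACHE_HOME")
            .map(PathBuf::from)
            .or_else(|| get("HOME").map(|h| PathBuf::from(h).join(".cache")))
    };

    // Assumption: "pdf_oxide/models" is an illustrative subdirectory name.
    Some(base?.join("pdf_oxide").join("models"))
}

fn main() {
    // Wire the helper to the real process environment.
    match resolve_model_cache_dir(cfg!(windows), |k| std::env::var(k).ok()) {
        Some(dir) => println!("model cache dir: {}", dir.display()),
        None => println!("no cache dir could be resolved"),
    }
}
```

Returning `None` rather than panicking when no variable is set mirrors the warn-and-fall-back policy described in this section: a caller can skip OCR provisioning and continue with native text.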
Fixed
- CSS `background-color` ignored in HTML/CSS→PDF (#516): a
v0.3.50 regression where output was byte-identical with/without a
page/body `background-color`. Implemented CSS 2.1 §14.2 / CSS
Backgrounds 3 §3.11.2 canvas background propagation (root → else
body, painted over the whole page under content); guarded by a
core-level Rust test plus the existing Python/Go oracles.
- OCR-only reading-order parity (#460):
`detect_page_type`/`needs_ocr` now route through the unified
classifier so OCR detection matches `extract_page_auto` exactly;
`extract_text_ocr` is retained as the documented forced-OCR escape
hatch; `extract_text` is unchanged.
- Opaque OCR error on Windows (#513): the
bare `RuntimeError("OCR feature not enabled.")` is replaced with an
actionable message (which wheel/extra, how to supply models, and the
graceful `extract_text_auto` path); plus a cross-platform Python
feature-guard test (runs on `windows-latest`).
- Stale PAdES module rustdoc (#514):
`src/signatures/pades/mod.rs` no longer claims the B-T/B-LT/B-LTA
pieces are "deferred / must not be shipped" (they shipped in
v0.3.50).
- Per-glyph `Tm`+`Tj` jitter scrambled reading order (#518):
Microsoft Word emits broken-image placeholder text as one
`BT Tm Tj ET` block per glyph with ±2.5–5pt sinusoidal Y-jitter;
the `Tm`-run merge tolerated only ±0.5pt, splitting jittered
glyphs into separate Y-banded spans that the reading-order sort
then emitted top-to-bottom (e.g. "Hello" → "elH l o"). The
same-line tolerance is now scale-relative (0.5× the text-space
glyph height, ≥0.5pt floor) so typographic jitter merges while
genuine line breaks (leading ≳ 1.0× font size) still split.
Pinned by an end-to-end regression suite (the reported repro plus
a max-amplitude case and an anti-over-merge two-line case).
- Go `purego` backend panicked at runtime on the first call: the
`CGO_ENABLED=0` backend registers every FFI symbol in one
`sync.Once`; `pdf_sign_bytes_pades` has 18 scalar parameters, which
exceeds `purego`'s SysV/AMD64 argument limit, so
`purego.RegisterLibFunc` panicked (too many stack arguments) and
the entire pure-Go backend was unusable (any first call aborted).
This was a pre-existing v0.3.50 defect: `cgo` is unaffected and CI only
built (never ran) the `purego` backend, so it went unnoticed.
Fixed additively: a new C-ABI `pdf_sign_bytes_pades_opts` collapses
the parameters into one `#[repr(C)]` options struct (5-argument
call surface; delegates to `pdf_sign_bytes_pades`, byte-identical
behaviour; the 18-argument function is unchanged for existing
C/C++/C#/Node callers). The `purego` binding now uses it; a Go
regression test exercises the registration path (closing the
build-only CI gap). Surfaced by a cross-binding smoke pass of the
full v0.3.51 + v0.3.50 API.
- Auto-extract reported a complete native result as
`partial_success`/`ocr_requested_but_unavailable`: when the
classifier routed a page to OCR but OCR was unavailable, the native
fallback was unconditionally labelled degraded, even when that
native text was itself high quality. A downstream consumer trusting
`status`/`reason`/`pages_needing_ocr` would run needless OCR and
treat a perfect extraction as incomplete. Now `route` re-checks the
T0.5 quality gate on the fallback text: high-quality native text is
reported `Complete`/`NativeText`/`NativeTextHighConfidence`
(only genuinely poor fallback stays `partial`). Also: a short,
clean, image-free text page is classified `TextLayer` (not
`Scanned`) so it is no longer wrongly listed in
`pages_needing_ocr`; only garbled glyphs route to OCR. Pinned by
a semantic regression suite that additionally asserts
`AutoExtractor::extract_text` is byte-identical to the canonical
`extract_text` per page, plus a default-running fidelity suite
(known prose extracts verbatim, in reading order, ungarbled, and
`extract_markdown`/`extract_html` delegate faithfully and carry the
content). Surfaced by the cross-binding smoke pass.
- `AutoExtractor` never actually ran OCR:
`route()` invoked
`extract_text_with_ocr(.., None, ..)` with a `None` engine and there
was no default engine loader, so the function returned native text
without OCR. The Auto surface silently fell back to native for
every image page even with the `ocr` feature and models present;
text-from-images was non-functional, not merely untested (#519).
Fixed: `route()` now builds an `OcrEngine` from the documented
`model_cache_dir()` (`$PDF_OXIDE_MODEL_DIR` / the `prefetch_models`
layout: `det.onnx`/`rec.onnx`/`en_dict.txt`) and passes
`Some(&engine)`; unprovisioned → graceful native fallback (never
fail-loud; only security ops fail-closed). Pinned by a model-gated
`#[cfg(feature = "ocr")]` end-to-end test (real image-only PDF →
`AutoExtractor` recovers the text, `source = Ocr`) plus a new CI
`ocr` lane that provisions the models + ONNX Runtime and runs it,
so the path is genuinely exercised. Multi-script note: native
CJK/Arabic/Hebrew/Cyrillic extraction via the auto surface is
guaranteed by the byte-identical-to-canonical invariant over the
repo's running script suites (a direct CJK auto test is included);
OCR recognition of non-Latin images is bounded by which
PaddleOCR language models are provisioned (provisioning, not a code
defect).
- Multi-language OCR + a real model-provisioning API: the engine
loader honors `AutoExtractOptions.ocr_languages` and, when unset, a
cheap script heuristic (`detect_ocr_language`) reads the document's
own native text so a scanned Chinese/Arabic/Cyrillic/Devanagari PDF
is not OCR'd with the English model; it selects the per-language
recognition model + dictionary from the model cache dir (shared
script-agnostic detector), falling back English → native (never
fail-loud). `AutoExtractor::prefetch_models(&[OcrLanguage])` is no
longer a stub: it actually downloads (idempotent, atomic) the
detector + requested language packs into `model_cache_dir()`; new
`prefetch_models_default()`, instance `AutoExtractor::prefetch()`
(uses the configured `ocr_languages`), `prefetch_available()`,
`OcrLanguage` enum + `OcrLanguage::ALL`, and a real
`model_manifest()` (det + every language's files/URLs). The
provisioning trio (`prefetch_models`/`model_manifest`/
`prefetch_available`) is exposed across all bindings: C-ABI
(`pdf_oxide_prefetch_models`/`_model_manifest`/`_prefetch_available`),
Python, Node, Go (cgo+purego), C#. The Docker/CI build-time
predownload story therefore works from any consumer language, not just Rust;
WASM exposes `modelManifest()` only (the browser has no
filesystem/network-to-disk, so provisioning is host-side, stated
honestly). CLI: `pdf-oxide models prefetch [-l <lang>… | --all]`,
and (real fix) the CLI now warns instead of silently lying when
built without the `ocr` feature (the downloader is `ocr`-gated;
`pdf_oxide_cli` gained an `ocr` feature forwarding `pdf_oxide/ocr`).
Honest scope (empirically verified end-to-end through the auto
surface, 10/12): english · chinese (Simplified) · cyrillic ·
arabic · korean · latin · devanagari · tamil · telugu · kannada.
japanese & chinese-traditional: the loader/prefetch/detect
pipeline is correct and their packs download fine, but the specific
deepghs `japan_PP-OCRv3_rec`/`chinese_cht_PP-OCRv3_rec` models do
not produce output through the current recognizer (model/engine
compat, `source=Fallback`; the same pipeline works for the other
10 incl. Simplified Chinese); their tests are `#[ignore]` with that
reason: a tracked follow-up, not a code defect, not hidden.
Hebrew: a genuine hard limit. PaddleOCR publishes a Hebrew
dict but no recognition model anywhere, so it cannot be fetched
(the loader is ready the instant a pair is provided; upstream
limit, not our code). Pinned by a network-gated `prefetch_models`
download test (proves real fetch-to-disk), the model-gated
per-language auto-OCR matrix, the cross-binding manifest-parity
tests (C-ABI/Python/Node/Go/C#), and the new CI `ocr` lane
(provisions models + ONNX Runtime and runs them).
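The scale-relative same-line tolerance from the #518 fix above (0.5× the text-space glyph height, with a 0.5pt floor) can be sketched in a few lines. This is an illustration under stated assumptions, not the crate's code: `same_line_tolerance`, `on_same_line` and `group_into_lines` are hypothetical names, and comparing each glyph against the previous glyph's baseline is an assumed grouping strategy.

```rust
/// Hypothetical helper: largest vertical offset (in points) at which two
/// glyph baselines still count as the same line.
fn same_line_tolerance(glyph_height_pt: f64) -> f64 {
    // 0.5x the glyph height, never below the 0.5pt floor.
    (0.5 * glyph_height_pt).max(0.5)
}

/// True when two baselines differ by no more than the tolerance.
fn on_same_line(y_a: f64, y_b: f64, glyph_height_pt: f64) -> bool {
    (y_a - y_b).abs() <= same_line_tolerance(glyph_height_pt)
}

/// Groups (glyph, baseline y) pairs, given in content-stream order, into
/// lines. Assumption: each glyph is compared with the previous glyph.
fn group_into_lines(glyphs: &[(char, f64)], glyph_height_pt: f64) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut prev_y: Option<f64> = None;
    for &(ch, y) in glyphs {
        let continues_line = match prev_y {
            Some(py) => on_same_line(py, y, glyph_height_pt),
            None => false,
        };
        if continues_line {
            if let Some(last) = lines.last_mut() {
                last.push(ch);
            }
        } else {
            lines.push(ch.to_string());
        }
        prev_y = Some(y);
    }
    lines
}

fn main() {
    // "Hello" with per-glyph Y-jitter, then a second line 14pt lower.
    let glyphs = [
        ('H', 700.0), ('e', 703.5), ('l', 705.0), ('l', 703.5), ('o', 700.0),
        ('H', 686.0), ('i', 686.0),
    ];
    for line in group_into_lines(&glyphs, 12.0) {
        println!("{line}");
    }
}
```

With a 12pt glyph height the tolerance is 6pt, so the jittered glyphs merge into one line while the 14pt drop to the second line still splits. Under a fixed ±0.5pt tolerance, every jittered glyph in this input would have started its own span.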
CI / Release
- Release pipeline unverifiable pre-merge (#515):
`release.yml` now runs a no-publish dry-run on `release/*`
pull requests (parity with `release-fips.yml`) plus
`workflow_dispatch{publish}`; every mutating publish job is
hard-gated so a `pull_request` can never publish, while the
full build/validate/package matrix runs on the release PR. Scoped
to `release/*` PRs so ordinary feature PRs are unaffected.
Thanks
- @Suleman-Elahi for reporting #513.
- @kh3rld for reporting #518.
Installation
Rust (crates.io)
cargo add pdf_oxide
Python (PyPI)
pip install pdf_oxide
JavaScript/WASM (npm)
npm install pdf-oxide-wasm
CLI (Homebrew)
brew install yfedoseev/tap/pdf-oxide
CLI (Scoop, Windows)
scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide
CLI (Shell installer)
curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh
CLI (cargo-binstall)
cargo binstall pdf_oxide_cli
MCP Server (for AI assistants)
cargo install pdf_oxide_mcp
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.