v0.3.63 | CJK extraction-quality fixes — vertical-CJK (tategaki) reading order no longer mis-fires on horizontal Japanese text, CJK glyphs no longer surface as Kangxi-radical codepoints, and Korean number spacing is preserved — plus recovery of dropped inter-word spaces on tightly-typeset PDFs, and routine dependency updates.
Fixed
- Dropped inter-word spaces on tightly-typeset PDFs (#724, #725): text drawn one glyph per `Tj` with incremental `Td` offsets and no space glyph was concatenated: `"JOHN DOE"` → `"JOHNDOE"`, `"Master of Science"` → `"MasterofScience"`. In such files, common in résumé generators, inter-word gaps are just slightly larger `Td` moves, ~0.3–0.8 × the glyph advance. The characters were correct; only the word boundaries were lost, which breaks search/indexing (`"science"` can't be matched inside `"masterofscience"`). PyMuPDF and poppler `pdftotext` both infer these spaces, so oxide was the outlier against its own calibration reference. Two root causes, both fixed: (1) `FontInfo::get_space_glyph_width` returned a CID (Type0) font's `/DW` default width (often ≥ 0.5 em) as the "space advance" when code `0x20` has no explicit `/W` entry, which is the norm for Identity-H subsets, where the space glyph is rarely at `0x20`. This inflated the geometric word-gap threshold above real word gaps; the function now falls back to the 0.25 em typographic default unless `0x20` carries an explicit `/W` width (preserving the #656 Arabic-subset fallback). (2) The intra-word kerning guard in `should_insert_space` suppressed lowercase→lowercase boundaries up to 2.4 × the geometric threshold (≈ 0.33 em for Helvetica), far wider than any real inter-letter kerning, swallowing genuine tight word gaps; the ceiling is lowered to 1.5 × (≈ 0.2 em), which still clears worst-case ~0.19 em intra-word kerning while admitting 0.2-em-and-wider word gaps, the same ~0.18–0.2 em word-break point PyMuPDF / poppler use.
- Vertical-CJK reading order mis-fired on horizontal Japanese text: the v0.3.62 tategaki assembler counted vertical-vs-horizontal glyph adjacencies over all pairs, which false-positived on grid-aligned horizontal CJK (genkō-yōshi regulatory/academic layouts align glyphs in both rows and columns), scrambling the reading order and collapsing line breaks. Detection now requires single-glyph CJK spans to dominate the page (genuine tategaki positions each glyph on its own origin, while horizontal CJK is emitted as multi-character runs) and discriminates by each glyph's nearest-neighbour direction with a clear vertical majority. Ambiguous or horizontal pages fall back to the normal horizontal assembler (the pre-vertical-CJK behaviour), so a missed detection can never regress against that baseline.
- CJK glyphs surfaced as Kangxi/CJK radical codepoints — a glyph shared between a radical and its unified ideograph could reverse-resolve to the radical form (e.g. 欠→⽋, 立→⽴, 言→⾔, 金→⾦). The CID decode path now NFKC-normalizes the CJK Radicals Supplement (U+2E80–2EFF) and Kangxi Radicals (U+2F00–2FDF) blocks — which never appear in running text — to their unified ideograph; ordinary text and fullwidth forms are untouched.
- Korean number spacing was wrongly stripped: the CJK number-boundary cleanup that correctly removes stray spaces between Chinese/Japanese ideographs and embedded digits (`公元前 1000 年` → `公元前1000年`) over-fired on Korean. Korean is written with inter-word spaces, so a space between a Hangul syllable and a number is a real word boundary (`14 예` = "14 cases"); Hangul is now excluded from the cleanup.
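The threshold numbers in the inter-word spacing fix can be checked with a little arithmetic. The sketch below is illustrative Python, not the crate's Rust code: `should_insert_space_sketch`, its parameters, and the Helvetica geometric threshold of about 0.1375 em (back-derived from 2.4 × threshold ≈ 0.33 em) are hypothetical stand-ins, and the real guard may weigh more than letter case.

```python
# Hypothetical model of the lowercase-to-lowercase kerning guard; not pdf_oxide's API.
OLD_GUARD_FACTOR = 2.4  # ceiling before v0.3.63, per the notes above
NEW_GUARD_FACTOR = 1.5  # ceiling from v0.3.63 on

# Assumed geometric threshold for Helvetica, back-derived from 0.33 em / 2.4.
HELVETICA_THRESHOLD_EM = 0.33 / 2.4


def should_insert_space_sketch(gap_em, prev_char, next_char,
                               threshold_em=HELVETICA_THRESHOLD_EM,
                               guard_factor=NEW_GUARD_FACTOR):
    """Return True if a horizontal gap (in em) should become a word space."""
    if gap_em < threshold_em:
        return False  # below the geometric threshold: never a word gap
    if prev_char.islower() and next_char.islower():
        # Kerning guard: lowercase pairs need a wider gap to count as a boundary.
        return gap_em >= guard_factor * threshold_em
    return True


# A tight 0.25 em word gap between "Master" and "of":
print(should_insert_space_sketch(0.25, "r", "o", guard_factor=OLD_GUARD_FACTOR))  # False
print(should_insert_space_sketch(0.25, "r", "o"))  # True
# Worst-case 0.19 em intra-word kerning stays suppressed:
print(should_insert_space_sketch(0.19, "a", "v"))  # False
```

Under these assumptions the old ceiling sits at 0.33 em and the new one at roughly 0.206 em, which is why a 0.25 em word gap is now recovered while 0.19 em kerning is still ignored.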
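The two text-level fixes above (radical normalization and the Hangul exclusion) can be illustrated with Python's standard library. This is a sketch of the general technique only: the function names, the regular expression, and the use of the CJK Unified Ideographs block (U+4E00–9FFF) as the "ideograph" class are assumptions, and the crate's actual rules may differ.

```python
import re
import unicodedata


def normalize_radicals(text):
    """NFKC-normalize only the CJK Radicals Supplement and Kangxi Radicals blocks."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x2E80 <= cp <= 0x2EFF or 0x2F00 <= cp <= 0x2FDF:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)  # ordinary text and fullwidth forms pass through
    return "".join(out)


# Spaces between a CJK unified ideograph and an ASCII digit, in either order.
# Hangul (U+AC00-D7A3) is outside this class, so Korean spacing is kept.
_IDEO = "\u4e00-\u9fff"
_NUMBER_GAP = re.compile(
    rf"(?<=[{_IDEO}]) +(?=[0-9])|(?<=[0-9]) +(?=[{_IDEO}])"
)


def strip_cjk_number_spaces(text):
    """Remove stray spaces at ideograph/digit boundaries (assumed rule)."""
    return _NUMBER_GAP.sub("", text)


# Kangxi radicals U+2F4B and U+2F94 map to the unified ideographs U+6B20 and U+8A00.
print(normalize_radicals("\u2f4b\u2f94") == "\u6b20\u8a00")  # True
print(strip_cjk_number_spaces("公元前 1000 年"))  # 公元前1000年
print(strip_cjk_number_spaces("14 예"))  # 14 예
```

Restricting NFKC to the two radical blocks, rather than normalizing the whole string, is what keeps fullwidth forms and other compatibility characters untouched.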
Changed
- Dependencies: `p12-keystore` 0.2.1 → 0.3.0 (`from_pkcs12` adapted to its new `Pkcs12ImportPolicy` API), `uuid` 1.23.2 → 1.23.3, `chrono` 0.4.44 → 0.4.45; CI actions: `actions/checkout` → 6.0.3, `astral-sh/setup-uv` → 8.2.0, `ruby/setup-ruby` → 1.312.0, `codecov/codecov-action` → 7.0.0, `taiki-e/install-action` → 2.81.8.
Thanks
- @regularkevvv (Kevin Castro) — reported the tightly-typeset inter-word spacing bug and contributed the fix (#724, #725).
Installation
Rust (crates.io)
`cargo add pdf_oxide`
Python (PyPI)
`pip install pdf_oxide`
JavaScript/WASM (npm)
`npm install pdf-oxide-wasm`
CLI (Homebrew)
`brew install yfedoseev/tap/pdf-oxide`
CLI (Scoop, Windows)
`scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide`
`scoop install pdf-oxide`
CLI (Shell installer)
`curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh`
CLI (cargo-binstall)
`cargo binstall pdf_oxide_cli`
MCP Server (for AI assistants)
`cargo install pdf_oxide_mcp`
Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).
Platform Support
| Platform | Architecture | Archive |
|---|---|---|
| Linux | x86_64 (glibc) | pdf_oxide-linux-x86_64-*.tar.gz |
| Linux | x86_64 (musl) | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux | ARM64 | pdf_oxide-linux-aarch64-*.tar.gz |
| macOS | x86_64 (Intel) | pdf_oxide-macos-x86_64-*.tar.gz |
| macOS | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz |
| Windows | x86_64 | pdf_oxide-windows-x86_64-*.zip |
Changelog
See CHANGELOG.md for full details.