v0.3.19 | Text Extraction Accuracy, Column-Aware Reading Order, and Community Contributions

@github-actions released this 03 Apr 07:24

Features

  • extract_page_text() Single-Call DTO (#268) — New PageText struct returns spans, characters, and page dimensions from a single extraction pass, eliminating redundant content stream parsing. Available across Rust, Python, and WASM.
  • Column-Aware Reading Order (#270) — New extract_spans_with_reading_order() method accepts a ReadingOrder parameter. ReadingOrder::ColumnAware uses XY-Cut spatial partitioning to detect columns and read each column top-to-bottom, fixing garbled text for multi-column PDFs.
  • Per-Character Bounding Boxes from Font Metrics (#269) — TextSpan now carries per-glyph advance widths captured during extraction. to_chars() produces accurate per-character bounding boxes using font metrics instead of uniform width division. Available as span.char_widths in Python and span.charWidths in WASM (omitted when empty).
  • is_monospace Flag on TextSpan/TextChar (#271) — Exposes the PDF font descriptor FixedPitch bit, with a fallback font-name heuristic (Courier, Consolas, Mono, Fixed) when the bit is not set. Callers no longer need fragile font-name string matching of their own.
  • Pdf::from_bytes() Constructor (#252) — Opens existing PDFs from in-memory bytes without requiring a file path. Available across Rust, Python (Pdf.from_bytes(data)), and WASM (WasmPdf.fromBytes(data)).
  • Path Operations in Python (#261) — extract_paths() now includes an operations list with individual path commands (move_to, line_to, curve_to, rectangle, close_path) and their coordinates. WASM extractPaths() has been aligned to match.
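
As a rough illustration of the idea behind ReadingOrder::ColumnAware, the sketch below performs a single vertical cut in plain Python. It is not the library's implementation: Span, split_columns, column_aware_order, naive_order, and the min_gap threshold are hypothetical names chosen for this example, and a full XY-Cut recurses with alternating horizontal and vertical cuts rather than stopping after one pass.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline in PDF user space (larger y is higher on the page)


def split_columns(spans, min_gap=20.0):
    """Group spans into columns separated by empty vertical strips >= min_gap wide."""
    columns = []
    right_edge = None
    for span in sorted(spans, key=lambda s: s.x0):
        if right_edge is None or span.x0 - right_edge >= min_gap:
            # A wide empty strip lies to the left of this span: start a new column.
            columns.append([span])
            right_edge = span.x1
        else:
            columns[-1].append(span)
            right_edge = max(right_edge, span.x1)
    return columns


def column_aware_order(spans, min_gap=20.0):
    """Read columns left to right, and each column top to bottom."""
    ordered = []
    for column in split_columns(spans, min_gap):
        ordered.extend(sorted(column, key=lambda s: (-s.y, s.x0)))
    return ordered


def naive_order(spans):
    """Row-major order: interleaves columns on multi-column pages."""
    return sorted(spans, key=lambda s: (-s.y, s.x0))


spans = [
    Span("Left line 1", 50, 250, 700),
    Span("Right line 1", 320, 520, 700),
    Span("Left line 2", 50, 250, 680),
    Span("Right line 2", 320, 520, 680),
]

print([s.text for s in naive_order(spans)])
print([s.text for s in column_aware_order(spans)])
```

With the two-column sample above, the row-major order alternates between left and right columns, while the column-aware order finishes the left column before starting the right one.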

Bug Fixes

  • Fixed panic on multi-byte UTF-8 in debug log slicing (#251) — Replaced raw byte-offset string slices with char-boundary-safe helpers, preventing panics when extracting text from CJK/emoji PDFs with debug logging enabled.
  • Fixed markdown spacing around styled text (#273) — Markdown output no longer merges words across annotation/style span boundaries (e.g., "visitwww.example.comto" → "visit www.example.com to").
  • Fixed Form XObject /Matrix application (#266) — Text extraction now correctly applies Form XObject transformation matrices and wraps each Form XObject in an implicit q/Q save/restore per PDF spec Section 8.10.1.
  • Fixed text matrix advance for rotated text (#266) — Replaced incorrect total_width / text_matrix.d.abs() division (divide-by-zero for 90° rotation) with correct Tm_new = T(tx, 0) × Tm per ISO 32000-1 Section 9.4.4.
  • Fixed prescan CTM loss for deeply nested text (#267) — Replaced backward 4KB scan with forward CTM tracking across the full content stream, capturing outer scaling transforms for text in streams >256KB (e.g., chart axis labels).
  • Fixed prescan dropping marked content (BDC/BMC) for tagged PDFs — The forward CTM scan now includes preceding BDC/BMC operators and following EMC operators in region boundaries, preserving MCID, ActualText, and artifact tagging for tagged PDFs in large content streams.
  • Fixed deduplication dropping distinct characters (#253) — deduplicate_overlapping_chars now checks character identity, not just position. Distinct characters close together (e.g., space followed by 'r' at 1.5pt) are no longer incorrectly removed.
  • Fixed text dropped with font-size-as-Tm-scale pattern (#254) — Corrected TD/T* matrix multiplication order per ISO 32000-1 Section 9.4.2. PDFs using /F1 1 Tf + scaled Tm (common in InDesign, LaTeX) no longer silently lose lines. Also tightened containment filter to require text identity match.
  • Fixed markdown merging words in single-word BT/ET blocks (#260) — to_markdown() now detects horizontal gaps between consecutive same-line spans and inserts spaces, matching extract_text() behavior. Fixes PDFs generated by PDFKit.NET/DocuSign.
  • Fixed CLI merge creating blank documents (#262) — merge_from/merge_from_bytes now properly import page objects with a deep recursive copy of all dependent objects (content streams, fonts, images), remapping indirect references.
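
To make the rotated-text fix (#266) concrete, here is a small worked example of Tm_new = T(tx, 0) × Tm in plain Python. It is a sketch under assumptions, not the library's code: multiply and advance_text_matrix are hypothetical helper names, matrices are written as the six PDF values (a, b, c, d, e, f), and tx is assumed to be the horizontal displacement already computed in text space.

```python
def multiply(m1, m2):
    """Multiply two PDF matrices given as (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def advance_text_matrix(tm, tx):
    """Return T(tx, 0) x Tm: move the text origin by tx along the text's own x axis."""
    translation = (1.0, 0.0, 0.0, 1.0, tx, 0.0)
    return multiply(translation, tm)


# Text rotated 90 degrees counter-clockwise, origin at (100, 200).
# Note that d == 0 here, so any formula that divides by abs(d) cannot work.
rotated = (0.0, 1.0, -1.0, 0.0, 100.0, 200.0)
advanced = advance_text_matrix(rotated, 12.0)
print(advanced)  # the origin moves up the page: e stays 100.0, f becomes 212.0

# Unrotated text for comparison: the same advance moves the origin to the right.
upright = (1.0, 0.0, 0.0, 1.0, 100.0, 200.0)
print(advance_text_matrix(upright, 12.0))  # e becomes 112.0, f stays 200.0
```

Because the translation is multiplied on the left, the advance follows the rotation and scale already stored in Tm, so no per-component division is needed.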

Dependencies

  • pyo3 0.27.2 → 0.28.2 — Added skip_from_py_object / from_py_object annotations per new FromPyObject opt-in requirement.
  • clap 4.5.60 → 4.6.0
  • codecov/codecov-action 5 → 6

Breaking Changes (WASM only)

  • WASM JSON field names now use camelCase — Serialized fields of TextSpan, TextChar, PageText, TextBlock, and TextLine changed from snake_case to camelCase (e.g., font_name → fontName, font_size → fontSize, is_italic → isItalic, page_width → pageWidth). This aligns with JavaScript naming conventions. Rust JSON serialization via serde is affected only when the wasm feature is enabled. Python uses PyO3 getters and is unaffected.
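
For anyone updating stored fixtures or snapshot tests after this rename, the naming rule can be sketched in a few lines of Python. This is an illustration of the snake_case to camelCase mapping only: to_camel_case and rename_keys are hypothetical helpers, and the sample payload is an assumed shape that reuses the field names listed above, not actual library output.

```python
def to_camel_case(name):
    """Convert a snake_case identifier to camelCase (font_name -> fontName)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def rename_keys(value):
    """Recursively rename dictionary keys in a JSON-like structure."""
    if isinstance(value, dict):
        return {to_camel_case(key): rename_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value


# Assumed pre-0.3.19 payload shape, for illustration only.
old_page = {
    "page_width": 612.0,
    "spans": [
        {"text": "Hello", "font_name": "Helvetica", "font_size": 12.0, "is_italic": False},
    ],
}

new_page = rename_keys(old_page)
print(new_page)
```

Keys without underscores, such as text and spans in the sample, pass through unchanged.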

🏆 Community Contributors

🥇 @Goldziher — Thank you for the comprehensive feature requests (#252, #268, #269, #270, #271) that shaped the text extraction improvements in this release. Your detailed issue reports with code examples and spec references made implementation straightforward! 🚀

🥈 @bsickler — Thank you for the Form XObject matrix fix (#266) and prescan CTM rewrite (#267). These are critical correctness fixes for text extraction in rotated documents and large content streams! 🚀

🥉 @hansmrtn — Thank you for the UTF-8 panic fix (#251). This prevents crashes for any user processing non-ASCII PDFs with debug logging! 🚀

🏅 @jorlow — Thank you for the markdown spacing fix (#273). Clean, well-tested fix for a common user-facing issue! 🚀

🏅 @willywg — Thank you for exposing path operations in Python (#261), giving downstream tools access to individual vector path commands! 🚀

🏅 @titusz — Thank you for reporting the character deduplication (#253) and Tm-scale text dropping (#254) bugs with clear root cause analysis! 🚀

🏅 @oscmejia — Thank you for reporting the markdown word merging issue (#260) with a clear reproduction case! 🚀

🏅 @Inklikdevteam — Thank you for reporting the CLI merge blank pages bug (#262)! 🚀


Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries

Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

| Platform | Architecture          | Archive                              |
|----------|-----------------------|--------------------------------------|
| Linux    | x86_64 (glibc)        | pdf_oxide-linux-x86_64-*.tar.gz      |
| Linux    | x86_64 (musl)         | pdf_oxide-linux-x86_64-musl-*.tar.gz |
| Linux    | ARM64                 | pdf_oxide-linux-aarch64-*.tar.gz     |
| macOS    | x86_64 (Intel)        | pdf_oxide-macos-x86_64-*.tar.gz      |
| macOS    | ARM64 (Apple Silicon) | pdf_oxide-macos-aarch64-*.tar.gz     |
| Windows  | x86_64                | pdf_oxide-windows-x86_64-*.zip       |

Changelog

See CHANGELOG.md for full details.