Skip to content

v0.3.53 | Java is the 8th binding, plus a markdown-extraction quality pass and OCR parity across every prebuilt. Native Maven-Central artifact on jni-rs 0.22 (JDK 11+, five-arch fat JAR), full v0.3.52 surface parity across text / markdown / AutoExtractor / forms / render / PAdES B-B+B-T+B-LT / destructive redaction / split-by-bookmarks / compliance / crypto-policy. Free Kotlin interop via the same JAR. Published Python wheels and the Java JAR now ship OCR (parity with Node / Go / C#). Markdown extraction fixes: table-cell bold/italic preserved, CamelCase brand names no longer split, spatial cell words no longer fragment into columns, centered titles read in order. The May-2026 language promise ([README:3](README.md)) lands.

Choose a tag to compare

@github-actions github-actions released this 23 May 03:52
· 170 commits to main since this release
cc77438

Added

  • Java binding (fyi.oxide:pdf-oxide:0.3.53, #NNN)
    — native JNI binding to pdf_oxide via jni-rs 0.22 with the same
    Rust core the existing seven bindings sit on. Maven Central
    publish via central-publishing-maven-plugin 0.9.0 under groupId
    fyi.oxide (matching the pdf.oxide.fyi brand), Java package
    fyi.oxide.pdf.*. JDK 11 LTS floor — broadest enterprise
    reach, Polars/Lance/RocksDB precedent (not kreuzberg-style
    FFM+Java 25 which excludes the JDK 17/21 majority). Five native
    arches embedded in the published fat JAR (linux x86_64, linux
    aarch64, macOS x86_64, macOS aarch64, windows x86_64). 52 JNI
    symbols across 9 wired classes; 82 JUnit tests green.

  • PdfDocumentopen(Path/byte[]/InputStream/String),
    open(Path, String password) + bytes variant, authenticate,
    pageCount, extractText(int), extractTextAuto(int) (v0.3.51
    graceful auto-routing), render(int) + DPI overload (PNG bytes),
    producer/creator Info dict, formFields(),
    search(query, caseInsensitive, regex, maxResults),
    toMarkdown/toHtml convenience, page(int) /
    pages() / pagesStream(). AutoCloseable with idempotent
    close() (shared AtomicLong + Cleaner backstop — multi-class-
    loader safe).

  • PdfPagemediaBox / cropBox, width / height,
    rotation, text(), text(BBox region), words(), lines()
    (nested List<TextWord> per line), chars(), images()
    (ExtractedImage with bytes + format enum + bbox + dimensions),
    tables() (flat List<TableCell> with row/col indices + spans),
    annotations() (13-subtype enum + URI extraction for Link).

  • MarkdownConvertertoMarkdown(doc) /
    toMarkdown(doc, page) / toHtml(doc) / toHtml(doc, page).

  • PdffromMarkdown(String) / fromHtml(String) /
    fromImages(List<byte[]>) (auto-detects JPEG/PNG), save() /
    saveTo(Path), planSplitByBookmarksCount(byte[], int),
    splitByBookmarksFromBytes(byte[], int) -> byte[][] (v0.3.50
    #482 — round-trip proven: outlined PDF → segments → each
    reopenable).

  • DocumentEditoropen(Path/byte[]/String),
    setFormField(name, String/boolean), addRedaction(page, BBox),
    redactionCount(page), applyRedactionsDestructive() (v0.3.50
    #231 — full Phase 3 T11 pipeline; default RedactionOptions
    scrub metadata + strip JS + remove embedded files + hide OCG;
    fail-closed on composite/Type0/unknown fonts), scrubMetadata(),
    save() / saveTo(Path).

  • AutoExtractor (v0.3.51 #517) — of(doc) /
    fast(doc) / balanced(doc) / highFidelity(doc) presets,
    classifyPageKind(int) / classifyDocumentKinds() (returns
    per-page PageClass enum), extractText() /
    extractTextForPage(int) (graceful OCR fallback), extractAutoPage(int)
    / extractAutoDocument() (simplified AutoResult), and the
    rich-shape escape hatch extractPageJson(int) /
    extractDocumentJson()
    returning serde-JSON of the full
    v0.3.51 PageExtraction / DocumentExtraction (typed reasons +
    per-region bboxes + confidence + ocr_used + pages_needing_ocr).

  • PdfSigner (v0.3.50 #235) — fromPkcs12(Path/byte[], String),
    sign(byte[] pdf, SignOptions opts) supporting PAdES B-B
    (no TSA needed), B-T and B-LT (RFC 3161 TSA HTTP via the
    tsa-client Cargo feature; opts.tsaUrl() required for B-T/B-LT),
    verify(byte[]), classifyLevel(byte[]) (static — returns highest
    PAdES level present in a signed PDF without needing key material).

  • PdfValidatorisPdfA(doc, PdfALevel) /
    isPdfUa(doc, PdfUaLevel) (simplified boolean verdict);
    validatePdfA / validatePdfUa return ValidationResult. PDF/A
    levels 1a/1b/2a/2b/2u/3a/3b/3u supported; PDF/A-4 + PDF/UA-2
    surface as PdfUnsupportedException (pdf_oxide core gaps).

  • PdfPolicy (v0.3.50 #230) — current() / set(PolicyMode)

    • compat/strict/fipsStrict presets. Set-once enforced at
      process startup per the v0.3.50 design (second set throws with
      a clear "already set" message).
  • Exception taxonomyPdfException extends RuntimeException
    (unchecked, modern Java consensus per Effective Java Item 71) +
    8 typed subclasses (PdfParseException, PdfEncryptedException,
    PdfPermissionException, PdfIoException,
    PdfOcrUnavailableException, PdfSignatureException,
    PdfInvalidStateException, PdfUnsupportedException) +
    PdfErrorKind enum for switch-on-enum dispatch. Rust Error::*
    variants mapped 1:1 in pdf_oxide_jni/src/error.rs.

  • Value typesgeometry.{BBox, Point, Rect, Color},
    text.{TextStyle, TextWord, TextLine, TextChar, TextSpan},
    table.{Table, TableCell}, image.{ImageFormat, ExtractedImage},
    form.{FormField, FormFieldType},
    auto.{ExtractMode, ExtractReason, PageClass, RegionResult, AutoResult, ClassifyResult, AutoExtractConfig + Builder},
    compliance.{PdfALevel, PdfXLevel, PdfUaLevel, ValidationResult, ValidationViolation},
    signature.{SignatureLevel, SignOptions + Builder},
    policy.{PolicyMode, SecurityPolicy + Builder},
    render.PixelFormat, redaction.RedactResult,
    split.{SplitByBookmarksOptions + Builder, BookmarkSegment},
    metadata.{DocumentInfo, XmpMetadata},
    search.{SearchOptions + Builder, SearchMatch, SearchResult},
    annotation.{Annotation, AnnotationType}. JDK 11 floor → final
    classes with manual equals/hashCode/toString and
    record-shaped accessor names (drop-in record migration when
    floor moves to 17+). JSpecify @Nullable annotations throughout.

  • NativeLoader — multi-classloader-safe UUID-suffixed temp
    extraction (snappy-java pattern, avoids the Tomcat/OSGi
    UnsatisfiedLinkError trap from FLINK-5408). Honors
    -Dfyi.oxide.pdf.lib.path / -Dfyi.oxide.pdf.use.systemlib /
    -Dfyi.oxide.pdf.tempdir overrides for FIPS / locked-down
    /tmp / read-only-rootfs deployments.

Fixed

  • OCR now ships in the published Python wheels and Java JAR — CI
    test builds compiled OCR (--features python,ocr,barcodes) but the
    released wheels used --features python, so PyPI users got a wheel
    without OCR even though CI exercised it. Both glibc and musl Python
    wheels, and the Java JNI fat JAR, now build with OCR for parity with
    the Node / Go / C# prebuilts. FIPS variants deliberately exclude OCR
    (no ONNX in FIPS deployments).

  • Markdown table cells preserve bold/italic — the tagged-PDF table
    extractor built TableCells from joined text only, discarding the
    per-span font weight/style, so **bold** / *italic* inside table
    cells was lost on the way out. Cells now carry their span styles
    end-to-end (table_extractor populates cell.spans).

  • Words no longer split mid-word by phantom spacing — words whose
    glyph runs are positioned edge-to-edge (common in presentation
    exports) could be emitted with a spurious internal space when the
    source font lacked a /Widths array. Per ISO 32000-1 §9.4.4,
    inter-glyph spacing is the displacement between glyph origins; the
    fallback-width correction that compensates for missing width metrics
    now applies only when glyph boxes actually overlap, never to
    cleanly-adjacent glyphs. Legitimate word spacing — including after a
    token that ends in a capital letter — is preserved.

  • Spatially-positioned cell words no longer fragment into columns
    a single table cell whose words are laid out with wide gaps was split
    into one column per word. A row-coverage filter drops phantom columns
    present in too few rows, gated so it only refines an already-detected
    table and never fabricates one from prose.

  • Prose pages no longer mis-detected as tables — a single-column
    page whose wrapped paragraph lines' inter-word gaps coincidentally
    aligned could be emitted as a fragmented table. A prose gate rejects a
    spatially-detected (no-rulings) table when a row crosses a sentence
    boundary, a structure genuine data tables do not exhibit. Ruled and
    tagged tables are unaffected.

  • Centered titles read in document order — a centered multi-word
    title plus subtitle/byline was misread as multiple columns,
    scrambling the heading. A centered-block guard (scattered leftmost
    edges, small block) keeps such blocks as a single column.

  • Fewer fragmented headings — runs of same-level heading fragments
    (PowerPoint word-per-heading exports, wrapped headings) are merged
    when the run is unambiguous; KPI numeric-only heading runs collapse
    to a list.

  • Stray pipe characters escaped — a | outside a markdown table
    block is escaped so downstream renderers do not misread it as a
    malformed table row.

  • Content-preservation policy for markdown post-processing — the
    post-process pass never drops or rewrites legitimate text. Earlier
    band-aids that filtered "Page N" lines, rewrote bullet-glyph
    codepoints, flattened sparse-but-real tables, or deduped repeated
    content were removed after a 70-PDF baseline-vs-HEAD regression sweep
    proved they damaged real documents; the correct upstream fixes are
    tracked as follow-ups.

Known issues

  • Tight two-column prose bodies can still interleave row-by-row in
    reading order
    (#534). A safe
    fix needs a table-vs-prose classifier so it does not regress
    table-cell ordering; two threshold/structural attempts were reverted
    after the regression sweep caught table-data corruption.

  • Bullet and ligature glyphs in fonts with no usable /ToUnicode CMap
    can decode to an incorrect code point or be dropped
    (#535). The fix
    is a §9.10 decode fallback (glyph-name / encoding) in the font layer,
    not a markdown-layer code-point rewrite (which was removed as content
    corruption — see the content-preservation note above).

CI / Release

  • .github/workflows/ci.yml — new build-lib variant
    java-jni builds the JNI cdylib with --features rendering, signatures,tsa-client. New java job (matrix: ubuntu × JDK
    {11, 17, 21}) downloads the native, stages into the Maven
    resource path, runs mvn compile/test/package, validates JAR
    contents + manifest, uploads the JAR artifact. New java-lint
    job runs the Java code-quality gates — Spotless
    (palantir-java-format) formatting check and SpotBugs static
    analysis — bringing the Java binding to parity with the
    format+lint gates the other bindings already enforce (rustfmt +
    clippy / gofmt + golangci-lint / Biome / dotnet-format / ruff).

  • .github/workflows/ci-fips.yml — new fips-java job
    (ubuntu + macOS) builds pdf_oxide_jni with --no-default-features --features fips,signatures and runs the full JUnit suite against
    the FIPS-compiled cdylib. Validates the legacy-crypto exclusion
    holds end-to-end.

  • .github/workflows/release.yml — new build-java-native
    matrix (5 arches: linux x86_64/aarch64, macOS x86_64/aarch64,
    windows x86_64) cross-compiles the JNI cdylib per target with
    ocr,rendering,signatures,barcodes,tsa-client (OCR-enabled parity
    with the Node/Go/C# native cdylib; system-fonts arrives
    transitively via rendering). New
    package-java-jar job assembles the fat JAR (all 5 natives
    embedded). New publish-maven job uploads to Maven Central via
    central-publishing-maven-plugin with autoPublish=false per
    feedback_release_gate — the upload reaches VALIDATED state and
    the maintainer flips Publish from the Central Portal UI. Python
    wheel jobs (glibc + musl) build --features python,ocr,barcodes
    so the published wheels ship OCR. validate job extended to
    enforce java/pom.xml version matches Cargo workspace.

  • pdf_oxide_jni — new workspace member crate (crate-type = ["cdylib", "rlib"]; jni 0.22; feature-mirrored ocr /
    signatures / tsa-client / rendering / barcodes / full
    / fips / legacy-crypto; not published to crates.io — the
    consumable artifact is the Maven Central jar).

Thanks


Installation

Rust (crates.io)

cargo add pdf_oxide

Python (PyPI)

pip install pdf_oxide

JavaScript/WASM (npm)

npm install pdf-oxide-wasm

CLI (Homebrew)

brew install yfedoseev/tap/pdf-oxide

CLI (Scoop — Windows)

scoop bucket add pdf-oxide https://github.com/yfedoseev/scoop-pdf-oxide
scoop install pdf-oxide

CLI (Shell installer)

curl -fsSL https://raw.githubusercontent.com/yfedoseev/pdf_oxide/main/install.sh | sh

CLI (cargo-binstall)

cargo binstall pdf_oxide_cli

MCP Server (for AI assistants)

cargo install pdf_oxide_mcp

Pre-built Binaries
Download archives for Linux, macOS, and Windows from the assets below. Each archive includes both pdf-oxide (CLI) and pdf-oxide-mcp (MCP server).

Platform Support

Platform Architecture Archive
Linux x86_64 (glibc) pdf_oxide-linux-x86_64-*.tar.gz
Linux x86_64 (musl) pdf_oxide-linux-x86_64-musl-*.tar.gz
Linux ARM64 pdf_oxide-linux-aarch64-*.tar.gz
macOS x86_64 (Intel) pdf_oxide-macos-x86_64-*.tar.gz
macOS ARM64 (Apple Silicon) pdf_oxide-macos-aarch64-*.tar.gz
Windows x86_64 pdf_oxide-windows-x86_64-*.zip

Changelog

See CHANGELOG.md for full details.