GitHub - tanmaysheoran/ai.pdf

An AI-native PDF extension format.

One file. Two audiences. Zero compromise.

An .ai.pdf file is a fully valid PDF — it renders, prints, and opens in any PDF reader unchanged. It also embeds a Brotli-compressed semantic XML layer so AI-aware parsers can read structure directly, without OCR or heuristics.

Important

Core rule: PDF is the visual authority. XML is the machine-structure authority. Nothing is duplicated.

Why

PDF is the dominant format for structured documents — papers, contracts, reports, forms. But PDF was designed for print fidelity, not machine parsing. Today, AI systems either:

Run OCR on rendered pixels (slow, lossy, layout-dependent), or
Use brittle text extraction that loses tables, reading order, and structure.

.ai.pdf solves this by making the semantic layer a first-class citizen of the file itself.

Installation

Linux / macOS

curl -fsSL https://raw.githubusercontent.com/tanmaysheoran/ai.pdf/main/install.sh | sh

The script checks for Rust/Cargo (installing it via rustup if absent), then either downloads the pre-built binary or builds from source, placing aipdf in ~/.cargo/bin.

Windows (PowerShell)

irm https://raw.githubusercontent.com/tanmaysheoran/ai.pdf/main/install.ps1 | iex

The script downloads the pre-built aipdf-x86_64-windows.zip, extracts aipdf.exe to %USERPROFILE%\.cargo\bin, and adds that directory to your user PATH. Restart your terminal after installing.

From source

git clone https://github.com/tanmaysheoran/ai.pdf.git
cd ai.pdf
./install.sh          # detects local source automatically

# or directly with Cargo
cargo install --path crates/aipdf-cli

Verify

aipdf --help

Tip

If aipdf is not found after install, add ~/.cargo/bin to your PATH:

export PATH="$HOME/.cargo/bin:$PATH"

How It Works

Every .ai.pdf file is a standard PDF with an invisible second layer:

Layer	Object	Purpose
Visual	PDF page tree + content streams (embedded Unicode font)	Human rendering
Semantic	`aipdf-semantic.xml.br` (embedded associated file)	Machine parsing
Metadata	XMP packet (`/Metadata`)	Discovery, indexing

The embedded file is a Brotli-compressed XML document validated against the V1 schema. The PDF /AF (Associated Files) entry links the catalog to it. Legacy readers ignore the associated file entirely.

Detection works in two passes. The first is a fast literal byte-scan for /Subtype /application#2Faipdf+xml+br, which matches files written by this tool. The fallback is a structural lopdf lookup via the Filespec /EF reference, which is what makes PDFs re-saved by third-party tools detectable.
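The fast first pass can be sketched in a few lines of Python. This is a minimal illustration, assuming the marker appears in the file as the literal byte string quoted above. `has_aipdf_marker` is a hypothetical helper name, not part of any SDK, and the structural fallback through the Filespec /EF reference is not shown.

```python
# Literal marker quoted in the text above (written by the aipdf tool itself).
AIPDF_MARKER = b"/Subtype /application#2Faipdf+xml+br"


def has_aipdf_marker(pdf_bytes: bytes) -> bool:
    """Fast pass only: report whether the literal marker occurs in the raw bytes.

    A False result is not conclusive. A PDF re-saved by another tool may
    serialize the name differently, which is why a structural lookup is
    still needed as a fallback.
    """
    return AIPDF_MARKER in pdf_bytes
```

A caller would typically try this scan first and only parse the PDF object graph when it returns False.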

Format at a Glance

<document version="1.0">
  <metadata>
    <title>Research Summary</title>
    <authors><author>Jane Smith</author></authors>
  </metadata>

  <section id="s-intro" level="1">
    <title>Introduction</title>
    <paragraph id="p1" page="1" bbox="72,680,540,710">
      This paper proposes a new approach to...
    </paragraph>
    <figure id="fig1">
      <image src="architecture.png" alt="System architecture diagram"/>
      <caption>Figure 1. System overview.</caption>
    </figure>
  </section>

  <section id="s-results" level="1">
    <title>Results</title>
    <table id="t1">
      <caption>Table 1. Benchmark results.</caption>
      <thead><row><cell header="true">Method</cell><cell header="true">Score</cell></row></thead>
      <tbody><row><cell>Baseline</cell><cell>71.2</cell></row></tbody>
    </table>
  </section>
</document>

Block types: paragraph, title, codeBlock, equation, table, figure, list, definitionList, note, footnote, annotation, citation, reference

Each block can carry id, page, bbox (page-local coordinates), and role attributes.
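As a rough illustration of how a consumer might read these attributes, the sketch below parses a trimmed copy of the sample with Python's standard library. It assumes the payload has already been extracted and decompressed, that elements carry no XML namespace (as in the sample), and that bbox is always four comma-separated numbers. The helper names `parse_bbox` and `located_blocks` are hypothetical, and the meaning of each of the four numbers is left to docs/spec.md.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document shown above.
SAMPLE = """<document version="1.0">
  <section id="s-intro" level="1">
    <title>Introduction</title>
    <paragraph id="p1" page="1" bbox="72,680,540,710">
      This paper proposes a new approach to...
    </paragraph>
  </section>
</document>"""


def parse_bbox(value):
    """Split a comma-separated bbox attribute into a tuple of four floats."""
    parts = tuple(float(p) for p in value.split(","))
    if len(parts) != 4:
        raise ValueError(f"expected 4 numbers in bbox, got {len(parts)}")
    return parts


def located_blocks(xml_text):
    """Yield (id, page, bbox) for every element that carries a bbox attribute."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        bbox = elem.get("bbox")
        if bbox is None:
            continue
        page = elem.get("page")
        page_number = int(page) if page is not None else None
        yield elem.get("id"), page_number, parse_bbox(bbox)


blocks = list(located_blocks(SAMPLE))
print(blocks)
```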

Inputs

The CLI and all SDKs accept these source formats and convert them to semantic XML automatically:

Input	Notes
`.xml`	Direct semantic XML — must conform to the V1 schema
`.html`	HTML5 — headings, tables, lists, code, figures extracted
`.md`	Markdown via `pulldown-cmark` — headings, paragraphs, ordered/unordered lists, GFM tables, fenced code with language, blockquotes, images
`.typ`	Typst — headings, lists, fenced code, $…$ equations, `image()` figures
existing `.pdf`	`aipdf ingest` — text extraction with optional `tesseract` OCR fallback for scanned pages

Export Formats

Once a file has a semantic layer, export to any of these:

Format	Flag	Use case
XML	`--format xml`	Raw semantic payload
Markdown	`--format markdown`	Human-readable rendering
Markdown AST	`--format markdown-ast`	MDAST-compatible JSON tree
ONTO	`--format onto`	Token-efficient LLM ingestion

Note

ONTO is a columnar format (cells ^-separated, rows |-separated) that encodes the document as Document, Sections, Blocks, Tables, Figures, and References column families. It is designed for direct LLM context injection with minimal token overhead, and all three SDKs produce it byte-identically (verified by shared golden fixtures).
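To make the two separators concrete, here is a toy sketch. It is not the ONTO serializer: it only shows rows joined with | and cells joined with ^, and it assumes cells never contain either character. How real ONTO escapes separators and frames its column families is defined in docs/spec.md, not here; `join_rows` and `split_rows` are hypothetical names.

```python
CELL_SEP = "^"  # cell separator, as described in the note above
ROW_SEP = "|"   # row separator, as described in the note above


def join_rows(rows):
    """Join rows of string cells using the two separators.

    Assumption: no cell contains a separator. Escaping rules and the
    column-family framing of real ONTO output are not modeled here.
    """
    for row in rows:
        for cell in row:
            if CELL_SEP in cell or ROW_SEP in cell:
                raise ValueError(f"cell contains a separator: {cell!r}")
    return ROW_SEP.join(CELL_SEP.join(row) for row in rows)


def split_rows(text):
    """Inverse of join_rows under the same no-separator assumption."""
    return [row.split(CELL_SEP) for row in text.split(ROW_SEP)]


table = [["Method", "Score"], ["Baseline", "71.2"]]
encoded = join_rows(table)
print(encoded)
```

The appeal of such a layout for LLM input is that each value costs only one separator character, rather than the repeated keys or tags of JSON or XML.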

CLI

# Build from source files
aipdf build samples/minimal.xml
aipdf build page.html --render full
aipdf build paper.md -o paper.ai.pdf
aipdf build paper.md --render full --font /path/to/NotoSansCJK.ttf   # embed a CJK font
aipdf build page.html --render browser -o page.ai.pdf                 # full CSS via headless Chrome

# Ingest an existing PDF (attach a semantic layer)
aipdf ingest scanned.pdf -o scanned.ai.pdf   # OCR scanned pages if tesseract is installed
aipdf ingest report.pdf --ocr never

# Inspect and validate
aipdf inspect paper.ai.pdf
aipdf validate paper.ai.pdf

# Extract and export
aipdf extract paper.ai.pdf
aipdf export paper.ai.pdf --format markdown
aipdf export paper.ai.pdf --format onto
aipdf export paper.ai.pdf --format markdown-ast

Render modes: --render minimal (plain text, fast) · --render full (layout engine with headings, tables, images) · --render browser (HTML only, full CSS via headless Chrome)

Page sizes: --page-size letter (default) · --page-size a4
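For scripting, the export command can be wrapped with Python's subprocess module. The sketch below is a hedged example: the argument order and format names come from the commands and table above, while `export_argv` and `run_export` are hypothetical helpers. It assumes the export is written to standard output and that failures produce a non-zero exit code, neither of which is stated here.

```python
import subprocess

# Export formats listed in the Export Formats table.
EXPORT_FORMATS = ("xml", "markdown", "markdown-ast", "onto")


def export_argv(path, fmt):
    """Build the argument list for `aipdf export <path> --format <fmt>`."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    return ["aipdf", "export", path, "--format", fmt]


def run_export(path, fmt):
    """Run the export and return its standard output as text.

    Assumption: aipdf prints the export to stdout and exits non-zero
    on failure (check=True then raises CalledProcessError).
    """
    result = subprocess.run(
        export_argv(path, fmt), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.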

MCP server

The Python package ships an MCP stdio server for agent integration:

aipdf-mcp   # or: python -m aipdf.mcp_server

Tools exposed: aipdf_inspect, aipdf_extract, aipdf_reading_order, aipdf_validate, aipdf_build, aipdf_convert. See docs/mcp.md for client configuration.

SDKs

All three SDKs produce byte-identical output for read-side transforms (ONTO, Markdown, Markdown AST), enforced by shared golden fixtures in tests/conformance/. Write-side operations (build, ingest) delegate to the installed aipdf binary so the Rust core remains authoritative.

Rust

[dependencies]
aipdf = { path = "crates/aipdf" }

// Read side: open an .ai.pdf and export or walk its semantic layer
use aipdf::AipdfDocument;

let doc = AipdfDocument::open("paper.ai.pdf")?;

println!("{}", doc.to_markdown()?);
println!("{}", doc.to_onto()?);

for block in doc.get_reading_order()? {
    println!("[{}] page={:?} {}", block.kind, block.page, block.text);
}

// Write side: build an .ai.pdf from semantic XML
use aipdf::{build_aipdf, BuildOptions, RenderMode};

let xml = std::fs::read_to_string("paper.xml")?;
let bytes = build_aipdf(&xml, &BuildOptions {
    title: "My Paper".into(),
    render: RenderMode::Full,
    ..Default::default()
})?;
std::fs::write("paper.ai.pdf", bytes)?;

Python

pip install -e sdk/python

from aipdf import AIPDF

doc = AIPDF.open("paper.ai.pdf")

print(doc.to_xml())
print(doc.to_markdown())
print(doc.to_onto())

for block in doc.get_reading_order():
    print(f"[{block.kind}] page={block.page}  {block.text[:60]}")

The only runtime dependency is brotli >= 1.1.0. The package also ships aipdf-mcp; see docs/mcp.md.
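A common next step is grouping the reading order by page, for example to feed a model one page at a time. The sketch below relies only on the kind, page, and text attributes used in the loop above. `group_by_page` is a hypothetical helper, and stand-in objects replace the SDK so the example runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(blocks):
    """Map page -> list of (kind, text), preserving the order blocks arrive in.

    Blocks without a page attribute value end up under the key None.
    """
    pages = defaultdict(list)
    for block in blocks:
        pages[block.page].append((block.kind, block.text))
    return dict(pages)


# Stand-in objects so the sketch runs without the SDK installed.
# With the SDK, pass doc.get_reading_order() instead.
demo = [
    SimpleNamespace(kind="title", page=1, text="Introduction"),
    SimpleNamespace(kind="paragraph", page=1, text="This paper proposes..."),
    SimpleNamespace(kind="table", page=2, text="Benchmark results"),
]
by_page = group_by_page(demo)
print(by_page)
```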

TypeScript / Node

cd sdk/typescript && npm install && npm run build

import { AIPDF } from "./src/index.js";

const doc = AIPDF.open("paper.ai.pdf");

console.log(doc.toMarkdown());
console.log(doc.toOnto());

for (const block of doc.getReadingOrder()) {
  console.log(`[${block.kind}] page=${block.page}  ${block.text.slice(0, 60)}`);
}

No runtime dependencies. Uses Node's built-in zlib for Brotli and a small built-in XML parser.

Security

The semantic layer is data, not behavior.

Every XML path runs through sanitize_xml on both build and extract, with an identical rejected-marker list across Rust, Python, and TypeScript.
Rejected markers (active-content / structural only): <!DOCTYPE, <?xml-stylesheet, <?processing, <script, /JavaScript, /Launch. Body text is XML-escaped, so these only match real markup.
Natural-language phrases are not banned — XML text is data, and the visible PDF already carries the same words.
No external entity resolution. Decompressed payload capped at 16 MiB.
No embeddings, prompts, model output, or executable content in the V1 format.

See docs/security.md for the full threat model.
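The rejected-marker check can be pictured with a short Python sketch. The marker list is copied from the section above; everything else is an assumption. In particular, `find_rejected_markers` is a hypothetical name (not the real sanitize_xml), and plain case-sensitive substring matching is assumed, whereas the actual implementation may normalize its input first.

```python
# Marker list copied from the Security section above.
REJECTED_MARKERS = (
    "<!DOCTYPE",
    "<?xml-stylesheet",
    "<?processing",
    "<script",
    "/JavaScript",
    "/Launch",
)


def find_rejected_markers(xml_text):
    """Return the rejected markers that occur in xml_text, in list order.

    Assumption: case-sensitive substring matching on the serialized XML.
    """
    return [marker for marker in REJECTED_MARKERS if marker in xml_text]


# Escaped body text does not trip the check: "&lt;script" is not "<script".
safe = '<document version="1.0"><paragraph>Use &lt;script&gt; tags</paragraph></document>'
unsafe = '<!DOCTYPE x><document version="1.0"><script>1</script></document>'
print(find_rejected_markers(safe))
print(find_rejected_markers(unsafe))
```

The safe example shows why escaping matters: a document may discuss script tags in its text without being rejected, because only unescaped markup can match.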

Status

V1 prototype. The format spec, schema, and all three SDKs are functional and tested. Feedback on the schema, ONTO format, and PDF object layout is welcome via issues.

Working today:

Authoring from XML, HTML, Markdown, and Typst
Full render mode (--render full) with a real layout engine, an embedded Unicode (CID/Type0) font, and raster figure images
Per-block page + bbox coordinates written back into the embedded XML
Ingestion of existing PDFs (text extraction + optional tesseract OCR)
1.x version negotiation (forward-compatible: unknown elements ignored, other majors rejected)
MCP server for agent integration
Byte-identical ONTO/Markdown/AST across Rust, Python, and TypeScript (enforced by golden fixtures)
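The 1.x version negotiation rule listed above (accept any 1.x, reject other majors) can be sketched as follows. This is an illustration only: it assumes the version is a "major.minor" string as in the sample's version="1.0" attribute, and treating malformed strings as unsupported is a choice made here, not something the README specifies. `is_supported_version` is a hypothetical name.

```python
SUPPORTED_MAJOR = 1


def _is_ascii_number(text):
    """True for non-empty strings made only of ASCII digits."""
    return text.isascii() and text.isdigit()


def is_supported_version(version):
    """Accept any 1.x version string; reject other majors and malformed values."""
    major, dot, minor = version.partition(".")
    if not dot or not _is_ascii_number(major) or not _is_ascii_number(minor):
        return False
    return int(major) == SUPPORTED_MAJOR
```

Under this rule a reader built for 1.0 would still open a 1.3 file, skipping elements it does not recognize, but would refuse a 2.0 file outright.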

Not yet:

Rasterizing vector PDF pages for OCR (only embedded-image scans are covered)
Glyph subsetting for embedded fonts
Signed payloads, incremental-update authoring, PDF 2.0 conformance suite

Repository layout

crates/
  aipdf/              Rust core library
    src/pdf.rs          PDF assembly, detection, extract/inspect
    src/render.rs       Full-render layout engine (page/bbox, images)
    src/browser.rs      Browser render path (headless Chrome)
    src/font.rs         Embedded CID/Type0 Unicode font
    src/ingest.rs       Ingest existing PDFs (text + OCR fallback)
    src/{xml,markdown,onto,source,security}.rs
    assets/             Bundled DejaVuSans.ttf (+ FONT-LICENSE.md)
  aipdf-cli/          CLI (clap, thin wrapper over the library)
sdk/
  python/             Pure-Python SDK (+ aipdf.mcp_server)
  typescript/         ESM TypeScript SDK (Node), proper XML parser
schema/
  aipdf-1.0.xsd       Normative XSD schema for the semantic payload
docs/
  spec.md             Full format specification (incl. versioning contract)
  security.md         Threat model and security controls
  compatibility.md    Legacy PDF reader compatibility notes
  mcp.md              MCP server tools and client configuration
samples/
  minimal.xml         Minimal valid semantic XML
  maximal.xml         All element types exercised (v1.0)
  html/ xml/          Comprehensive HTML / XML samples
tests/
  conformance/        Golden ONTO/Markdown fixtures (single source of truth)
  *.py                Python integration, smoke, MCP, and conformance tests
benches/              Compression analysis

License

Apache 2.0 — see LICENSE.
