cli-pdf-extract is a fast Rust CLI for LLM-friendly PDF inspection. It wraps pdf_oxide and focuses on practical research workflows: quick triage, targeted extraction, and lightweight downstream parsing.
LLMs are often slow when they must open and synthesize full PDFs. This CLI gives a faster path:
- Extract only what you need (--abstract, --highlight, page/range)
- Avoid heavy image payloads by default
- Prefer plain text output for fast agent throughput (default --mode text)
- full text/pages: extract one page, a range, or all pages (default if no page flags)
- abstract: extract only the abstract block for paper triage
- highlight: extract only PDF highlights and their notes
- metadata: extract title, authors, year, page_count
- section: extract a specific section by name (e.g., --section conclusion)
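For lightweight downstream parsing of metadata, a minimal sketch follows. It assumes the JSON written by --metadata --format json is an array of objects keyed by the field names listed above (title, authors, year, page_count). The exact schema is not documented here, so the assumption that year is an integer (or null when unknown) is a guess, and the function name recent_papers is hypothetical.

```python
import json
from typing import Any, Dict, List


def recent_papers(json_text: str, min_year: int) -> List[Dict[str, Any]]:
    """Return metadata records from min_year onward, newest first.

    Assumes json_text is a JSON array of objects with a "year" field;
    records whose year is missing or not an integer are skipped.
    """
    records = json.loads(json_text)
    kept = [
        r for r in records
        if isinstance(r.get("year"), int) and r["year"] >= min_year
    ]
    return sorted(kept, key=lambda r: r["year"], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of papers.json.
    sample = json.dumps([
        {"title": "A", "authors": ["X"], "year": 2019, "page_count": 8},
        {"title": "B", "authors": ["Y"], "year": 2023, "page_count": 12},
    ])
    for record in recent_papers(sample, 2020):
        print(record["title"], record["year"])
```

Skipping records without a usable year keeps the filter robust when extraction cannot determine one.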
For non-highlight extraction, choose the output format with --mode:
- text (default): plain text, usually fastest for agents
- markdown: preserves heading/list structure when available
- auto: tries markdown first, then falls back to text if quality looks poor
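The fallback behaviour of auto can be pictured with the sketch below. It is an illustration only: this README does not say how quality is judged, so looks_poor is a hypothetical heuristic (too many very long tokens, which may indicate merged words), not the tool's actual check, and both function names are invented for the example.

```python
from typing import Tuple


def looks_poor(text: str, max_token_len: int = 30, max_bad_ratio: float = 0.05) -> bool:
    """Hypothetical quality check: flag empty output or too many overlong tokens."""
    tokens = text.split()
    if not tokens:
        return True
    bad = sum(1 for t in tokens if len(t) > max_token_len)
    return bad / len(tokens) > max_bad_ratio


def pick_auto_output(markdown_text: str, plain_text: str) -> Tuple[str, str]:
    """Prefer the markdown extraction; fall back to plain text if it looks poor."""
    if looks_poor(markdown_text):
        return "text", plain_text
    return "markdown", markdown_text


if __name__ == "__main__":
    mode, _ = pick_auto_output("# Title\n\nA short readable paragraph.", "Title A short readable paragraph.")
    print(mode)
```

The thresholds are arbitrary defaults chosen for the sketch; any real heuristic would need tuning against actual PDFs.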
Spacing normalization is enabled by default to reduce merged-word artifacts. Disable with --no-normalize-spacing.
Run from source:

    cargo run -- <PDF_PATH> --page 0

Install:

    cargo install --path .

Then:

    cli-pdf-extract <PDF_PATH> --page 0
    cli-pdf-extract --help

    cli-pdf-extract examples/main.pdf --page 0
    cli-pdf-extract examples/main.pdf --start-page 0 --end-page 5
    cli-pdf-extract examples/main.pdf --last-pages 3
    cli-pdf-extract examples/main.pdf
    cli-pdf-extract examples/main.pdf --abstract
    cli-pdf-extract examples/main.pdf --highlight
    cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/
    cli-pdf-extract papers/*.pdf --metadata --output-dir ./output/
    cli-pdf-extract papers/*.pdf --metadata --format json --output-dir ./output/
    cli-pdf-extract papers/*.pdf --metadata --format json --output papers.json
    cli-pdf-extract examples/main.pdf --mode markdown
    cli-pdf-extract examples/main.pdf --mode auto
    cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/ --parallel 4
    cli-pdf-extract examples/main.pdf --abstract --output abstract.txt

    cli-pdf-extract paper.pdf --abstract
    cli-pdf-extract paper.pdf --mode markdown --start-page 0 --end-page 2
    cli-pdf-extract paper.pdf --mode auto --start-page 0 --end-page 5
    cli-pdf-extract paper.pdf --highlight
    cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/ --parallel 4
    cli-pdf-extract papers/*.pdf --metadata --output-dir ./output/ --parallel 4
    cli-pdf-extract paper.pdf --section conclusion

- Page indices are zero-based.
- --start-page and --end-page must be used together.
- --last-pages N extracts the final N pages.
- --abstract cannot be combined with page/range/all or --highlight.
- --output is single-input for most modes; exception: with --metadata --format json, multiple inputs are consolidated into one JSON array file.
- Pro-tip: add standardized tags to annotation notes (e.g., <paper-idea>, <method>, <limitation>) for downstream clustering and trend discovery.
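The tagging tip lends itself to simple downstream counting. A minimal sketch, assuming the output of --highlight has been saved as plain text and that tags appear literally as <tag-name> inside the notes. The layout of the highlight output is not specified in this README, so the function only scans for tags and makes no assumption about the surrounding structure; count_tags and the sample notes are hypothetical.

```python
import re
from collections import Counter

# Matches lowercase tags such as <paper-idea>, <method>, <limitation>.
TAG_PATTERN = re.compile(r"<([a-z][a-z0-9-]*)>")


def count_tags(highlight_text: str) -> Counter:
    """Count how often each <tag> occurs in extracted highlight notes."""
    return Counter(TAG_PATTERN.findall(highlight_text))


if __name__ == "__main__":
    # Hypothetical notes; real --highlight output may be laid out differently.
    sample = (
        "Note: <method> contrastive pretraining on small corpora\n"
        "Note: <limitation> evaluated on English only\n"
        "Note: <method> curriculum over sequence length\n"
    )
    for tag, count in count_tags(sample).most_common():
        print(f"{tag}: {count}")
```

Running the same function over the highlights of many papers and summing the counters gives a rough view of which tags dominate a reading list.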
MIT. See LICENSE.
Edgar Torres (edgar.torres@ki.uni-stuttgart.de)