cli-pdf-extract

cli-pdf-extract is a fast Rust CLI for LLM-friendly PDF inspection. It wraps pdf_oxide and focuses on practical research workflows: quick triage, targeted extraction, and lightweight downstream parsing.

Why This Tool

LLMs are often slow when they must open and synthesize full PDFs. This CLI gives a faster path:

Extract only what you need (--abstract, --highlight, page/range)
Avoid heavy image payloads by default
Prefer plain text output for fast agent throughput (default --mode text)

Modalities

full text/pages: extract one page, a range, or all pages (default if no page flags)
abstract: extract only the abstract block for paper triage
highlight: extract only PDF highlights and their notes
metadata: extract title, authors, year, page_count
section: extract a specific section by name (e.g., --section conclusion)

Extraction Modes

For non-highlight extraction, choose with --mode:

text (default): plain text, usually fastest for agents
markdown: preserves heading/list structure when available
auto: tries markdown first, then falls back to text if quality looks poor

Spacing normalization is enabled by default to reduce merged-word artifacts. Disable with --no-normalize-spacing.

Install

Local build/run

cargo run -- <PDF_PATH> --page 0

Install binary locally

cargo install --path .

Then:

cli-pdf-extract <PDF_PATH> --page 0

Usage

Help

cli-pdf-extract --help

Single page

cli-pdf-extract examples/main.pdf --page 0

Page range (inclusive)

cli-pdf-extract examples/main.pdf --start-page 0 --end-page 5

Last N pages

cli-pdf-extract examples/main.pdf --last-pages 3

All pages (default behavior)

cli-pdf-extract examples/main.pdf

Abstract-only

cli-pdf-extract examples/main.pdf --abstract

Highlights + notes only

cli-pdf-extract examples/main.pdf --highlight

Batch abstract extraction to a directory

cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/

Metadata extraction

cli-pdf-extract papers/*.pdf --metadata --output-dir ./output/

Metadata as JSON

cli-pdf-extract papers/*.pdf --metadata --format json --output-dir ./output/

Consolidated metadata JSON (single file)

cli-pdf-extract papers/*.pdf --metadata --format json --output papers.json

Force markdown mode

cli-pdf-extract examples/main.pdf --mode markdown

Auto fallback mode

cli-pdf-extract examples/main.pdf --mode auto

Batch with parallel workers

cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/ --parallel 4

Write to file

cli-pdf-extract examples/main.pdf --abstract --output abstract.txt

Recommended Presets

Paper triage (fast)

cli-pdf-extract paper.pdf --abstract

Quick skim with structure

cli-pdf-extract paper.pdf --mode markdown --start-page 0 --end-page 2

Robust default for noisy PDFs

cli-pdf-extract paper.pdf --mode auto --start-page 0 --end-page 5

Annotation mining workflow

cli-pdf-extract paper.pdf --highlight

Collection triage (batch + parallel)

cli-pdf-extract papers/*.pdf --abstract --output-dir ./output/ --parallel 4

Metadata at scale (batch + parallel)

cli-pdf-extract papers/*.pdf --metadata --output-dir ./output/ --parallel 4

Conclusion extraction

cli-pdf-extract paper.pdf --section conclusion

Notes

Page indices are zero-based.
--start-page and --end-page must be used together.
--last-pages N extracts the final N pages.
--abstract cannot be combined with page/range/all or --highlight.
--output is single-input for most modes; exception: with --metadata --format json, multiple inputs are consolidated into one JSON array file.
Pro-tip: add standardized tags to annotation notes (e.g., <paper-idea>, <method>, <limitation>) for downstream clustering and trend discovery.

License

MIT. See LICENSE.

Author

Edgar Torres (edgar.torres@ki.uni-stuttgart.de)

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

cli-pdf-extract

Why This Tool

Modalities

Extraction Modes

Install

Local build/run

Install binary locally

Usage

Help

Single page

Page range (inclusive)

Last N pages

All pages (default behavior)

Abstract-only

Highlights + notes only

Batch abstract extraction to a directory

Metadata extraction

Metadata as JSON

Consolidated metadata JSON (single file)

Force markdown mode

Auto fallback mode

Batch with parallel workers

Write to file

Recommended Presets

Paper triage (fast)

Quick skim with structure

Robust default for noisy PDFs

Annotation mining workflow

Collection triage (batch + parallel)

Metadata at scale (batch + parallel)

Conclusion extraction

Notes

License

Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages