Natural PDF

A friendly library for working with PDFs, built on top of pdfplumber.

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

Installation

pip install natural-pdf

Need OCR engines, layout models, or other heavy add-ons? Install the core once, then use the helper CLI to pull in exactly what you need:

# add PaddleOCR (+paddlex) after the fact
npdf install paddle

# Surya OCR and the YOLO Doc-Layout detector in one go
npdf install surya yolo

# see what's already on your machine
npdf list

Lightweight extras such as deskew or search can still be added with the standard pip extras syntax if you prefer:

pip install "natural-pdf[deskew]"
pip install "natural-pdf[search]"

More details in the installation guide.
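
If you would rather check from Python which optional engines are present, a small standard-library sketch is shown here. It does not use `npdf` or Natural PDF itself; the function name `available_extras` is hypothetical, and the import names in `OPTIONAL_MODULES` are assumptions that may differ from the names passed to `npdf install`.

```python
from importlib.util import find_spec

# Import names are assumptions: the module a package installs can differ
# from the label used with `npdf install` or in a pip extra.
OPTIONAL_MODULES = {
    "paddle": "paddleocr",
    "surya": "surya",
    "easyocr": "easyocr",
}


def available_extras(modules=OPTIONAL_MODULES):
    """Map each label to True if its top-level module can be located."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, present in available_extras().items():
        print(f"{label}: {'installed' if present else 'missing'}")
```

For the authoritative list of what Natural PDF detects, `npdf list` remains the better source; this only reports whether a module can be found on the import path.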

Quick Start

from natural_pdf import PDF

# Open a PDF
pdf = PDF('document.pdf')
page = pdf.pages[0]

# Extract all of the text on the page
page.extract_text()

# Find elements using CSS-like selectors
heading = page.find('text:contains("Summary"):bold')

# Extract content below the heading
content = heading.below().extract_text()

# Examine all the bold text on the page
page.find_all('text:bold').show()

# Exclude parts of the page from selectors/extractors
header = page.find('text:contains("CONFIDENTIAL")').above()
footer = page.find_all('line')[-1].below()
page.add_exclusion(header)
page.add_exclusion(footer)

# Extract clean text from the page ignoring exclusions
clean_text = page.extract_text()

As a fun bonus, page.viewer() opens an interactive viewer for exploring the PDF.
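
The exclusion steps in the Quick Start assume that the "CONFIDENTIAL" text and at least one line element exist on the page. A more defensive sketch is shown here. The helper name `add_header_footer_exclusions` is hypothetical, and it assumes that `find()` returns `None` when nothing matches and that the result of `find_all()` supports `len()` as well as indexing.

```python
def add_header_footer_exclusions(page, header_text):
    """Exclude the area above `header_text` and below the last line element.

    Returns the number of exclusions that were actually added.
    """
    added = 0

    # Assumption: find() returns None when no element matches.
    header = page.find(f'text:contains("{header_text}")')
    if header is not None:
        page.add_exclusion(header.above())
        added += 1

    # Assumption: the find_all() result supports len() and negative indexing.
    lines = page.find_all('line')
    if len(lines) > 0:
        page.add_exclusion(lines[-1].below())
        added += 1

    return added
```

With a real page this would be called as `add_header_footer_exclusions(page, "CONFIDENTIAL")` before `page.extract_text()`.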

Key Features

Natural PDF offers a range of features for working with PDFs:

CSS-like Selectors: Find elements using intuitive query strings (page.find('text:bold')).
Spatial Navigation: Select content relative to other elements (heading.below(), element.select_until(...)).
Text & Table Extraction: Get clean text or structured table data, automatically handling exclusions.
OCR Integration: Extract text from scanned documents using engines like EasyOCR, PaddleOCR, or Surya.
Layout Analysis: Detect document structures (titles, paragraphs, tables) using various engines (e.g., YOLO, Paddle, LLM via API).
Document QA: Ask natural language questions about your document's content.
Semantic Search: Index PDFs and find relevant pages or documents based on semantic meaning using Haystack.
Visual Debugging: Highlight elements and use an interactive viewer or save images to understand your selections.
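
As a rough illustration of how the selector and spatial-navigation features combine, the sketch below wraps the Quick Start pattern in one function. The name `text_below_heading` and its `bold` parameter are hypothetical, the selector syntax is limited to what the Quick Start shows, and it again assumes `find()` returns `None` when there is no match.

```python
def text_below_heading(page, heading_text, bold=True):
    """Return the text below the first matching heading, or None if absent."""
    # Build a selector using only the syntax shown in the Quick Start.
    selector = f'text:contains("{heading_text}")'
    if bold:
        selector += ":bold"

    heading = page.find(selector)
    if heading is None:  # assumption: find() yields None when nothing matches
        return None

    # Spatial navigation: take the region below the heading and read its text.
    return heading.below().extract_text()
```

With a page from the Quick Start, `text_below_heading(page, "Summary")` would mirror the `heading.below().extract_text()` call while returning `None` when the heading is missing. The helper does not escape double quotes inside `heading_text`.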

Learn More

Dive deeper into the features and explore advanced usage in the Complete Documentation.

Best friends

Natural PDF sits on top of a lot of fantastic tools and models, including pdfplumber and the OCR and layout engines listed under Key Features.
