Open-source labelled-field extraction for PDFs — born-digital, AcroForm, or scanned. Built-in OCR. Citation-backed. Deterministic. MIT licensed.
A focused service that pulls structured fields out of any PDF and returns them as JSON, with a page reference and bounding box per value. Three OSS libraries glued into one opinionated pipeline; one HTTP endpoint, one reviewer UI, ~2k LOC end-to-end. Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0 — zero AGPL/GPL).
| Capability | How |
|---|---|
| Native (born-digital) PDFs | pdfplumber word extraction + labelled-regex matching. ~80 ms p50 per page. |
| AcroForm PDFs | pypdf reads named form fields directly. ~20 ms per doc. |
| Scanned / faxed / photographed PDFs | ocrmypdf (Tesseract LSTM) transparently adds a text layer. ~3 s/page at 200 DPI. One code path serves all three input modes — downstream strategies don't know or care which kind of PDF arrived. |
| Citation per field | Every value carries (page, snippet, bounding_box, method, confidence) — auditable, side-by-side reviewable, no hallucination. |
| Reviewer UI | Single page. Drop a PDF / pick a sample → click Extract → recognised fields are highlighted in situ on the thumbnail and listed inline for review → Copy JSON or Export CSV. |
| REST API | POST /api/extract returns JSON. OpenAPI 3.1 at /openapi.json, Swagger UI at /docs. |
| Deterministic | No LLMs in the pipeline. Same input → same output, every run. Reproducible for audit. |
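To illustrate the "Citation per field" row: a value that spans several words needs one bounding box covering all of them. The sketch below unions per-word rectangles. It assumes word dicts shaped like those returned by pdfplumber's `extract_words()` (keys `x0`, `x1`, `top`, `bottom`). The `union_bbox` helper and the mapping of `top`/`bottom` onto `y0`/`y1` are illustrative assumptions, not code taken from this project.

```python
from typing import Any, Iterable, Mapping


def union_bbox(words: Iterable[Mapping[str, Any]]) -> dict:
    """Return the smallest rectangle that covers every word box.

    Hypothetical helper: each word is assumed to carry x0/x1/top/bottom,
    as the dicts from pdfplumber's extract_words() do.
    """
    boxes = list(words)
    if not boxes:
        raise ValueError("need at least one word to build a bounding box")
    return {
        "x0": min(w["x0"] for w in boxes),
        "y0": min(w["top"] for w in boxes),
        "x1": max(w["x1"] for w in boxes),
        "y1": max(w["bottom"] for w in boxes),
    }


# Two adjacent words that together form one extracted value (made-up data).
value_words = [
    {"text": "Jane", "x0": 180.0, "x1": 210.0, "top": 158.4, "bottom": 172.1},
    {"text": "Doe", "x0": 214.0, "x1": 240.0, "top": 158.4, "bottom": 172.1},
]
print(union_bbox(value_words))
```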
- Insurance — claim-file intake (FNOL, loss runs, declaration pages, ACORD-style packets).
- Legal — case-file intake forms, retainer agreements, court intake.
- Finance / accounting — labelled invoices, receipts, expense reports.
- HR / compliance — onboarding paperwork, certification forms, KYC packets.
- Anywhere someone is manually typing `Label: value` pairs from PDFs into a spreadsheet.
| | This service | Cloud OCR (Textract / Document AI) | LLM extractor (GPT-4o / Claude) |
|---|---|---|---|
| Hallucination risk | None — deterministic | None | Possible |
| Citation per field | Yes, with (page, bbox) | Partial | Sometimes |
| Native-PDF latency | <100 ms | seconds | seconds |
| Air-gapped / on-prem | Yes (one container) | No | Rarely |
| API keys required | None | Yes | Yes |
| Cost per page | $0 | ~$0.0015 | ~$0.003–0.03 |
| License footprint | MIT, all transitive permissive | proprietary | proprietary |
| Single-binary deploy | Yes (Docker) | n/a | n/a |
git clone https://github.com/rohcode/pdf-field-extractor
cd pdf-field-extractor
cp .env.example .env
docker compose up
# Open http://localhost:3000/ui/

Pick a sample from the dropdown (or drop your own PDF). Click Extract. Review fields. Click Export CSV.
OpenAPI spec at /openapi.json; interactive docs at /docs.
POST /api/extract accepts a multipart upload with the PDF in the pdf form field and returns JSON.
curl -F 'pdf=@samples/fnol-text.pdf' http://localhost:3000/api/extract | jq

{
  "doc_id": "a3f1b8c2",
  "pages": 1,
  "ocr_applied": false,
  "elapsed_ms": 84,
  "fields": [
    {
      "name": "claim_number",
      "value": "CLM-2026-00481",
      "value_normalized": { "type": "policy_number", "raw": "CLM-2026-00481" },
      "method": "labeled_regex",
      "confidence": "high",
      "source": {
        "page": 1,
        "snippet": "Claim Number: CLM-2026-00481",
        "bbox": { "x0": 180.0, "y0": 158.4, "x1": 286.0, "y1": 172.1 }
      }
    }
  ]
}

- `GET /api/samples`: bundled-sample manifest.
- `GET /api/samples/{name}`: stream a bundled sample PDF.
- `GET /api/samples/all`: all bundled samples as a single ZIP.
- `GET /healthz`: `200 OK` health probe.
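For callers that want the extraction result in tabular form, the sketch below flattens the `fields` array of an `/api/extract` response into CSV text. The column choice and the `fields_to_csv` name are assumptions for illustration; they are not claimed to match the UI's Export CSV output. It also assumes every field carries a `source.bbox`, as in the sample response.

```python
import csv
import io

# Assumed column order; adjust to whatever the downstream spreadsheet expects.
COLUMNS = ["name", "value", "method", "confidence", "page", "x0", "y0", "x1", "y1"]


def fields_to_csv(result: dict) -> str:
    """Flatten an extract response (already parsed from JSON) into CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    for field in result["fields"]:
        source = field["source"]
        bbox = source["bbox"]
        writer.writerow([
            field["name"], field["value"], field["method"], field["confidence"],
            source["page"], bbox["x0"], bbox["y0"], bbox["x1"], bbox["y1"],
        ])
    return buf.getvalue()


# The sample response from this README, trimmed to the keys the helper reads.
sample = {
    "doc_id": "a3f1b8c2",
    "fields": [
        {
            "name": "claim_number",
            "value": "CLM-2026-00481",
            "method": "labeled_regex",
            "confidence": "high",
            "source": {
                "page": 1,
                "snippet": "Claim Number: CLM-2026-00481",
                "bbox": {"x0": 180.0, "y0": 158.4, "x1": 286.0, "y1": 172.1},
            },
        }
    ],
}
print(fields_to_csv(sample))
```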
PDF in
│
▼
normalize.py → ocrmypdf (Tesseract) if text-coverage < threshold,
else pass-through
│
▼ text-layer PDF
├── pypdf AcroForm strategy
└── pdfplumber labelled-regex strategy (with per-word bbox)
│
▼
merge → ExtractResult { doc_id, pages, ocr_applied, elapsed_ms, fields[…] }
See ARCHITECTURE.md for the longer narrative — coordinate
systems, why no LLM, threat model, scaling characteristics.
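The first branch in the diagram, OCR versus pass-through, comes down to a text-coverage check. A minimal sketch of that decision follows, assuming coverage means average characters per page (as the TEXT_COVERAGE_THRESHOLD setting describes) and that per-page text has already been extracted, for example with pdfplumber's `page.extract_text()`. The `needs_ocr` function is hypothetical and the real normalize.py may count characters differently.

```python
def needs_ocr(page_texts: list[str], threshold: float = 50.0) -> bool:
    """Return True when the average characters per page falls below threshold.

    Assumptions: surrounding whitespace is not counted, and a document
    with no pages is passed through rather than sent to OCR.
    """
    if not page_texts:
        return False
    avg_chars = sum(len(text.strip()) for text in page_texts) / len(page_texts)
    return avg_chars < threshold


# A scanned document typically yields empty page text, so OCR would fire.
print(needs_ocr(["", ""]))
# A born-digital page with plenty of text passes straight through.
print(needs_ocr(["Claim Number: CLM-2026-00481 " * 5]))
```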
Measured on an M1 Pro, single page in/out:
| Document | Strategy | p50 |
|---|---|---|
| Native text PDF | pdfplumber regex | ~80 ms |
| AcroForm PDF | pypdf field lookup | ~20 ms |
| Scanned PDF (200 DPI) | ocrmypdf + pdfplumber | ~3 s |
OCR dominates the scanned-PDF latency. Tesseract is single-threaded per page; multi-page batches can be parallelised at the worker layer.
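One way to parallelise a multi-page batch is to fan pages out over a worker pool and collect results in page order. The sketch below uses a thread pool, which is usually enough when each task mostly waits on an external OCR process. The `ocr_pages_parallel` helper, the `ocr_one` callable and the stand-in `fake_ocr` are all hypothetical; this project's worker layer is not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def ocr_pages_parallel(
    pages: Sequence[bytes],
    ocr_one: Callable[[bytes], str],
    max_workers: int = 4,
) -> list[str]:
    """Run ocr_one over every page concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(ocr_one, pages))


def fake_ocr(page: bytes) -> str:
    """Stand-in for a real per-page OCR call (assumption for the demo)."""
    return page.decode("ascii").upper()


print(ocr_pages_parallel([b"page one", b"page two", b"page three"], fake_ocr))
```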
All knobs live in .env:
| Var | Default | Description |
|---|---|---|
| PORT | 3000 | HTTP port |
| OCR_TIMEOUT_S | 60 | Max wall-time for an OCR run |
| MAX_UPLOAD_MB | 10 | Reject larger uploads with 413 |
| RATE_LIMIT_PER_MIN | 30 | Per-IP token-bucket size & refill rate |
| TEXT_COVERAGE_THRESHOLD | 50 | Avg chars/page below which OCR fires |
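As a rough picture of how these variables could be read at startup, the sketch below maps them onto a typed settings object with the defaults from the table. The `Settings` class is an assumption for illustration; the service may load its configuration through a different mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; defaults mirror the table above."""
    port: int = 3000
    ocr_timeout_s: int = 60
    max_upload_mb: int = 10
    rate_limit_per_min: int = 30
    text_coverage_threshold: int = 50

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        def read(name: str, default: int) -> int:
            # Fall back to the documented default when the variable is unset.
            return int(env.get(name, default))

        return cls(
            port=read("PORT", 3000),
            ocr_timeout_s=read("OCR_TIMEOUT_S", 60),
            max_upload_mb=read("MAX_UPLOAD_MB", 10),
            rate_limit_per_min=read("RATE_LIMIT_PER_MIN", 30),
            text_coverage_threshold=read("TEXT_COVERAGE_THRESHOLD", 50),
        )


# Override two knobs, leave the rest at their defaults.
print(Settings.from_env({"PORT": "8080", "MAX_UPLOAD_MB": "25"}))
```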
The demo ships with an insurance-flavoured field set, predefined in
patterns.py:
policy_number, claim_number, date_of_loss, loss_amount,
claimant_name, insured_name, policy_period_start,
policy_period_end, deductible, coverage_limit, loss_type.
Add or modify a field by appending one FieldDef entry to that file — no
other code change needed.
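To make the labelled-regex idea concrete, here is a minimal sketch of a field definition and a matcher that looks for `Label: value` on a single line. The `FieldDef` shape below is a guess based on the `{name, labels, value_type}` keys used by fields_extra; the real class in patterns.py may have different attributes, and `match_field`, the `adjuster_name` field and the `"string"` value type are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDef:
    """Hypothetical shape; the real FieldDef in patterns.py may differ."""
    name: str
    labels: tuple
    value_type: str = "string"  # assumed type name


def match_field(field: FieldDef, line: str) -> Optional[str]:
    """Return the text after the first 'Label:' found in line, else None."""
    for label in field.labels:
        # Escape the label so punctuation in it is matched literally.
        pattern = re.compile(rf"{re.escape(label)}\s*:\s*(\S.*)", re.IGNORECASE)
        found = pattern.search(line)
        if found:
            return found.group(1).strip()
    return None


adjuster = FieldDef(name="adjuster_name", labels=("Assigned Adjuster", "Adjuster"))
print(match_field(adjuster, "Adjuster: Jane Doe"))
```

Because every label goes through `re.escape`, adding a field in this sketch never requires writing a regex by hand.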
Need a field that isn't built in for a single request? POST fields_extra
alongside the PDF — a JSON array of {name, labels, value_type} objects
(max 10 per request). The UI exposes this via "+ Add custom field". Persistent
additions still go in patterns.py.
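A client can assemble the fields_extra value before sending it. The sketch below builds the JSON array and enforces the 10-field cap locally; the serialised string would then presumably be sent as a fields_extra form field next to the pdf part. The `build_fields_extra` helper and the `"string"` value type are assumptions, since the accepted `value_type` names are not listed here.

```python
import json

MAX_EXTRA_FIELDS = 10  # per-request cap stated above
REQUIRED_KEYS = {"name", "labels", "value_type"}


def build_fields_extra(fields: list) -> str:
    """Validate custom field definitions and serialise them as a JSON array."""
    if len(fields) > MAX_EXTRA_FIELDS:
        raise ValueError(f"at most {MAX_EXTRA_FIELDS} custom fields per request")
    for field in fields:
        missing = REQUIRED_KEYS - field.keys()
        if missing:
            raise ValueError(f"custom field is missing keys: {sorted(missing)}")
    return json.dumps(fields)


payload = build_fields_extra([
    # "string" is an assumed value_type for illustration.
    {"name": "adjuster_name", "labels": ["Adjuster"], "value_type": "string"},
])
print(payload)
```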
Is
- A small, focused extractor — three OSS libs and one opinionated pipeline.
- Production-safe license stack (MIT / BSD / Apache-2.0 / MPL-2.0).
- Type-checked end-to-end (`mypy --strict`), `ruff`-linted, pytest goldens cover all three input modes.
- Single `docker compose up` deploy.
Isn't
- An LLM extractor. No Anthropic, no OpenAI, no Mastra. Deterministic by construction.
- A multi-format ingester. DOCX / EML / MSG / image / archive support are clearly-scoped phase-2 adapters, not in v1.
- A table extractor (yet). Camelot is the natural add when a real prospect's case files need loss-run / SOV-style tables.
- A multi-tenant SaaS product. No auth, no DB, no audit log. Wrap it in whatever shell your product needs.
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest -v # 44 tests; goldens cover AcroForm, native, OCR + custom fields
ruff check src tests
mypy --strict src
uvicorn pdf_field_extractor.main:app --reload --port 3000

Inside Docker:

docker compose run --rm pdf-field-extractor pytest -v

See SECURITY.md. Vulnerabilities go through GitHub's private "Report a vulnerability" workflow on this repo, not public issues.
MIT. Every transitive dependency is permissively licensed
(MIT / BSD / Apache-2.0 / MPL-2.0). CI fails the build if a non-permissive
licence shows up in a transitive dependency; see
.github/workflows/ci.yml.
The reviewer UI loads pdfjs-dist@4.10.38 from jsdelivr's CDN on first paint.
For air-gapped deployments, vendor pdf.min.mjs and pdf.worker.min.mjs into
public/ and adjust the imports in app.js.
