This repository contains a Gemini-backed archive analysis tool for answering one research question over a directory tree of PDFs. The supported system is the pdf_analyzer package in src/pdf_analyzer/.
It is designed for repeatable journalism and archival work:
- recursively scan a PDF archive
- avoid re-uploading unchanged PDFs by SHA-256
- store per-document analyses and project synthesis in SQLite
- rerun idempotently
- track failures and Gemini costs
- generate HTML and XLSX reports
Install dependencies:
uv syncRun the analyzer with a YAML config:
uv run analyze pdf-analyzer.example.yamlRuntime environment:
GEMINI_API_KEYRequired for any run that makes Gemini API calls. Not required when using--no-geminito render reports from cached SQLite state only. Also used by the optional live Gemini test and the name-clustering comparison harness.
Config shape:
name: Example Archive
pdf_directory: ./pdfs
output_directory: ./example-output
question: What does this archive say about bridge safety?
name_clustering: local
ignore_dirs_containing: .pdfdata
report_html_filename: report.html
flatten_pdf: false
flatten_dpi: 300
schema_version: v1Optional overrides such as model, workers, oversize_strategy, name_clustering, ignore_dirs_containing, report_html_filename, flatten_pdf, flatten_dpi, prompt_version, schema_version, and synthesis_prompt_version are also supported. name_clustering defaults to local and also supports gemini.
ignore_dirs_containing controls recursive PDF discovery. It may be one marker filename or a list of marker filenames. Any scanned directory containing one of those files is skipped, including all of its children. The default is .pdfdata; each run writes that hidden JSON marker into output_directory, so an output directory inside the PDF archive is not rescanned as source input.
report_html_filename controls the HTML report filename inside output_directory. It defaults to report.html and must be a filename, not a path.
Report PDF copies are normalized to PDF/A when veraPDF reports non-compliance or policy checks detect JavaScript, encryption, signatures, embedded files, launch actions, multimedia, external file streams, interactive forms, or URI actions. flatten_pdf defaults to false; when set to true, normalization renders pages to bitmap PDF at flatten_dpi before producing PDF/A.
The config file uses schema_version to identify the structured extraction schema for cache identity. The default is v1. When the Pydantic schema in src/pdf_analyzer/models.py changes, the analyzer also adds an automatic schema fingerprint, so old cached Gemini responses are not reused with a structurally different schema.
The current config file does not accept a custom top-level schema: block. To extract a domain-specific activity, put that requirement in the question. For example:
name: Activity Archive
pdf_directory: ./pdfs
output_directory: ./activity-output
question: For each responsive document, identify the name, date, place, and activity. Include only activities that are supported by the PDF text.
schema_version: v1With the default schema, names, dates, and places are stored in typed lists. Activities are captured in the document and evidence summaries unless the application schema and report storage are extended to make activity a first-class field.
Default document-analysis schema:
DocumentAnalysisResult:
responsive: boolean
relevance_score: integer # 0-10
summary: string
people:
- string
places:
- string
dates:
- string
evidence_items:
- page_start: integer | null
page_end: integer | null
summary: string
people:
- string
places:
- string
dates:
- string
reasoning_notes: string | nullEach project run writes durable artifacts into output_directory:
pdf_analyzer.sqlite3- the configured HTML report, defaulting to
report.html report.xlsxpdfs/containing copied responsive PDFs only; copies that are not valid PDF/A or that fail cross-domain policy checks are normalized to PDF/A before being linked.pdfdatacontaining output metadata such as the run timestamp.gitignorecontaining*, so generated output stays untracked when the output directory is inside the repository- a copy of the YAML config used for the run
Intermediate prepared PDFs are written to a temporary directory for the duration of the run and are not kept afterward.
report.html includes these sections:
Run Summary: document count, scanned pages, responsive-PDF count, evidence-row count, error summary, and cached token/cost totals when pricing is available.Project Answer: project-level synthesis answer plus key findings and extracted people, places, and dates.Responsive Index: a concise chronological table of contents linking into the detailed evidence cards.People: authoritative clustered names, sorted by last name, with expandable mention rows that show the matched extracted name variant for each evidence row.Responsive Evidence Timeline: chronological evidence cards with date pill, key person, page or page-range links, canonical people, and links back to responsive documents.Responsive Documents: one row per responsive PDF, with expandable evidence rows showing pages and authoritative names.Errors: failures plus unanalyzed documents when any exist.
src/pdf_analyzer/Supported archive-analysis package.doc/theory-of-operation.mdOverall architecture, data flow, and report pipeline forpdf_analyzer.doc/fact-extraction.mdFocused note on document analysis, synthesis, and report data flow.doc/name-matching.mdFocused note on authoritative-name clustering, current options, and alternative approaches.
uv run analyze ...is the preferred CLI.uv run pdf-analyzer ...remains as a compatibility alias.uv run pdfa-fix file1.pdf file2.pdfchecks listed files, normalizes flagged PDFs to PDF/A in place, and prints only normalized paths and non-PDF input paths.- Ghostscript (
gs) and veraPDF (verapdf) must be installed and onPATH; startup andmake checkfail early when either is missing. - The only runtime environment variable currently used by the application code is
GEMINI_API_KEY. - A successful rerun with no new work reuses cached per-document analyses and cached project synthesis, then re-renders the reports.
- Cache reuse is keyed by the configured versions plus automatic prompt/schema fingerprints, so structural model-output changes invalidate old cached analyses even if you forget to bump YAML version strings.