The First Reproducible Benchmark for Patent Prosecution AI
PatentBench is the first open, reproducible benchmark for evaluating AI systems on patent prosecution tasks. Inspired by SWE-bench for software engineering, PatentBench measures whether AI can perform the real work of patent attorneys, from parsing USPTO Office Actions to drafting legally sound arguments under 35 U.S.C. sections 101, 102, 103, and 112.
Patent prosecution has remained one of the last major professional domains without a rigorous AI benchmark, despite representing a $15B+ annual market. Existing evaluations are ad hoc, non-reproducible, and disconnected from the economic reality of patent practice. PatentBench changes this.
- Real tasks, not toy problems. Every test case derives from actual USPTO proceedings
- Economically grounded. Tasks map to billable activities at patent law firms
- Anti-hallucination first. Poison-pill MPEP citations and fabricated-case-law detection built in
- Glass Box Standard. Full transparency on data provenance, evaluation methodology, and scoring
Tasks span five domains:

| Domain | Description | Example Tasks |
|---|---|---|
| Administration | Deadline computation, fee calculation, entity status | Calculate response deadline from OA mailing date |
| Drafting | Claim drafting, specification writing, amendment preparation | Draft independent claim from invention disclosure |
| Prosecution | Office Action response, rejection traversal, interviews | Argue against 103 obviousness rejection |
| Analytics | Portfolio analysis, prior art landscape, claim mapping | Identify claim overlap across patent family |
| Prior Art | Search strategy, reference analysis, relevance ranking | Evaluate novelty of claims against prior art set |
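For a flavor of the deterministic Administration tasks, here is a minimal sketch of the kind of deadline computation the example task above expects. The helper function is hypothetical and not part of the patentbench API; it assumes the usual three-month shortened statutory period for a non-final Office Action.

```python
import calendar
from datetime import date, timedelta

# Hypothetical helper, not part of the patentbench API: computes the
# response deadline for an Office Action mailed on `mailing_date`,
# assuming a three-month shortened statutory period.
def oa_response_deadline(mailing_date: date, months: int = 3,
                         federal_holidays: frozenset = frozenset()) -> date:
    # Add calendar months, clamping to the last day of short months.
    month0 = mailing_date.month - 1 + months
    year, month = mailing_date.year + month0 // 12, month0 % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    deadline = date(year, month, min(mailing_date.day, last_day))
    # Under 35 U.S.C. § 21(b), a deadline falling on a Saturday, Sunday,
    # or federal holiday rolls forward to the next business day.
    while deadline.weekday() >= 5 or deadline in federal_holidays:
        deadline += timedelta(days=1)
    return deadline

# Juneteenth 2026 lands exactly on the three-month date, so the
# deadline rolls forward to the following Monday.
print(oa_response_deadline(date(2026, 3, 19),
                           federal_holidays=frozenset({date(2026, 6, 19)})))
# -> 2026-06-22
```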
PatentBench contains 7,200 expert-curated test cases spanning all five domains. The initial release, PatentBench-Mini, includes 300 representative cases for rapid evaluation.
| Subset | Cases | Purpose |
|---|---|---|
| PatentBench-Full | 7,200 | Complete evaluation |
| PatentBench-Mini | 300 | Quick iteration and development |
| PatentBench-OA | 1,800 | Office Action response focus |
| PatentBench-Draft | 1,200 | Drafting focus |
Each case is assigned one of five difficulty tiers, calibrated to attorney experience:

| Tier | Level | Experience | Examples |
|---|---|---|---|
| 1 | Paralegal | 0-1 years | Deadline calculation, fee lookup, form filling |
| 2 | Junior Associate | 1-3 years | OA parsing, straightforward 112 responses |
| 3 | Senior Associate | 3-6 years | 103 arguments, claim amendments, interview prep |
| 4 | Junior Partner | 6-10 years | Complex multi-rejection OAs, continuation strategy |
| 5 | Senior Partner | 10+ years | Portfolio strategy, IPR defense, prosecution history estoppel |
PatentBench uses a rigorous 4-layer evaluation framework:
- Deterministic Evaluation. Binary correctness for objective tasks (deadlines, fees, format compliance)
- LLM-as-Judge. Calibrated rubric-based scoring for subjective quality (legal accuracy, argument strength, completeness)
- Comparative Evaluation. Blind side-by-side ranking of model outputs by domain experts
- Human Calibration. Expert attorney scores on a subset to anchor and validate automated metrics
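To make the contrast concrete, the sketch below pairs a Layer 1 exact-match check with a hypothetical Layer 2 rubric. The class name, rubric fields, and weights are illustrative only; the published rubrics are authoritative.

```python
from dataclasses import dataclass

def layer1_score(expected: str, actual: str) -> float:
    """Layer 1: binary correctness for objective tasks."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

@dataclass
class RubricScore:
    """Layer 2 (illustrative): calibrated LLM-judge scores on a rubric."""
    legal_accuracy: float      # 0-1: statutes and MPEP cited correctly?
    argument_strength: float   # 0-1: does the traversal actually rebut?
    completeness: float        # 0-1: every rejection ground addressed?

    def overall(self, weights=(0.5, 0.3, 0.2)) -> float:
        parts = (self.legal_accuracy, self.argument_strength, self.completeness)
        return sum(w * p for w, p in zip(weights, parts))

# A deterministic deadline answer is simply right or wrong...
assert layer1_score("2026-06-22", "2026-06-22") == 1.0
# ...while a 103 traversal gets a weighted rubric score.
print(RubricScore(0.9, 0.7, 1.0).overall())  # 0.86
```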
Install from PyPI:

```bash
pip install patentbench
```

Or from source:

```bash
git clone https://github.com/rhahn28/patentbench.git
cd patentbench
pip install -e ".[dev]"
```

```bash
# Run PatentBench-Mini with a specific model
patentbench run --model openai:gpt-4o --subset mini

# Run a specific domain and tier
patentbench run --model anthropic:claude-sonnet-4 --domain prosecution --tier 3

# Run with ABIGAIL
patentbench run --model abigail --subset mini --api-key YOUR_KEY
```

Or use the Python API directly:

```python
from patentbench import BenchmarkRunner, DataLoader
from patentbench.models import OpenAIAdapter
# Load test cases
loader = DataLoader("data/mini")
cases = loader.load(domain="prosecution", tier=3)
# Configure model
model = OpenAIAdapter(model_name="gpt-4o")
# Run benchmark
runner = BenchmarkRunner(model=model, cases=cases)
results = runner.run()
# Print results
print(results.summary())
```

Results on PatentBench-Mini (300 cases). Last updated: 2026-03-19.
| System | Classification | Timeline | Fees | Deadlines | Layer 1 Overall |
|---|---|---|---|---|---|
| ABIGAIL v3 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| ABIGAIL v3 (Variant B) | 92.7% | 94.2% | 100.0% | 99.0% | 95.9% |
Layer 1 (deterministic) results only; Layers 2-4 evaluation is in progress. To submit your system for evaluation, see METHODOLOGY.md. Underlying model architectures are not disclosed; results measure system-level output quality.
PatentBench adheres to the Glass Box Standard for AI evaluation transparency:
- Data Provenance. Every test case traces back to a specific USPTO application number, Office Action date, and MPEP section
- Evaluation Reproducibility. Deterministic scoring with pinned LLM-judge versions and published rubrics
- Contamination Prevention. Held-out test cases never published in training data; canary strings embedded
- Economic Validity. Tasks map to real billable activities with known market rates
- Human Calibration. Expert attorney scores anchor all automated metrics with published inter-rater reliability
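As an illustration of the canary-string check, the sketch below assumes each held-out case embeds a GUID-like token; the actual canary format used by PatentBench is not shown here.

```python
import re

# Hypothetical canary format: a unique token embedded in each held-out
# test case. If a model reproduces one verbatim, the case text was
# almost certainly present in its training data.
CANARY_RE = re.compile(r"PATENTBENCH-CANARY-[0-9a-f]{8}")

def contaminated(model_output: str, known_canaries: set) -> bool:
    return any(tok in known_canaries for tok in CANARY_RE.findall(model_output))

canaries = {"PATENTBENCH-CANARY-3f9a1c2e"}
print(contaminated("... PATENTBENCH-CANARY-3f9a1c2e ...", canaries))  # True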
PatentBench includes dedicated anti-hallucination checks:
- Poison Pill MPEP Citations. Fabricated MPEP section numbers inserted to detect confabulation
- Case Law Verification. All cited cases validated against published USPTO and Federal Circuit decisions
- Statute Accuracy. 35 U.S.C. section references verified against current patent law
- Examiner Name Verification. Cross-referenced against USPTO PEDS records
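For example, the poison-pill check can be approximated by scanning a response for MPEP citations that fall outside a whitelist of real sections. The regex and the (tiny) whitelist below are simplified stand-ins for the real validator.

```python
import re

# Simplified stand-in for the real validator: a tiny whitelist of real
# MPEP sections. A citation outside the whitelist is either a poison
# pill the model swallowed or a plain confabulation.
VALID_MPEP = {"2141", "2143", "2144.05", "706.02(j)"}
MPEP_CITE_RE = re.compile(r"MPEP\s*(?:§\s*)?(\d{3,4}(?:\.\d+)?(?:\([a-z]\))?)")

def fabricated_mpep_citations(response: str) -> list:
    return [s for s in MPEP_CITE_RE.findall(response) if s not in VALID_MPEP]

text = "Per MPEP § 2143 and MPEP 2199.99, the combination is improper."
print(fabricated_mpep_citations(text))  # ['2199.99']
```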
```bibtex
@article{patentbench2026,
  title={PatentBench: A Reproducible Benchmark for Patent Prosecution AI},
  author={{Salt Holdings LLC}},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026},
  url={https://github.com/rhahn28/patentbench}
}
```

- Paper: arXiv:XXXX.XXXXX
- Dataset: HuggingFace
- Leaderboard: abigail.app/patentbench
- ABIGAIL Patent AI: abigail.app
To benchmark a commercial tool (SOLV, PatSnap, etc.), a proprietary system, or a tool without an API, see INTEGRATION.md. It covers:
- Writing custom adapters for direct-API systems (see the sketch after this list)
- CSV round-trip workflow for tools with no API
- Browser automation for UI-only systems
- Schema translation (how the evaluator reads text output)
- Submitting results to the leaderboard
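As a rough sketch of the first item, a custom adapter might look like the following. The `ModelAdapter` base class and the `generate` signature are assumptions, so treat INTEGRATION.md as authoritative.

```python
import requests

from patentbench.models import ModelAdapter  # assumed base class; see INTEGRATION.md

class MyToolAdapter(ModelAdapter):
    """Wraps a proprietary tool behind whatever transport it offers."""

    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self.api_key = api_key

    def generate(self, prompt: str) -> str:
        # Call the tool however it is reachable (HTTP, SDK, subprocess)
        # and return its raw text output; the evaluator handles schema
        # translation from free text.
        resp = requests.post(
            self.endpoint,
            json={"prompt": prompt},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["text"]
```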
See CONTRIBUTING.md for guidelines on contributing test cases, rubrics, and model adapters.
Apache 2.0. See LICENSE.
Copyright 2026 Salt Holdings LLC.