feat(reports): scaffold confusion matrix infrastructure for PatentBench (Stage 1 of 6)#6
Adds the reporting package required to publish per-task-type confusion matrices for PatentBench benchmark runs. No network calls, no data published yet. Stage 1 of 6 on feat/paralegal-oa-cm-matrices.

Key properties of the scaffold (enforced by 15 new pytest tests):

- Ground-truth loader rejects any row lacking peds_source lineage and any row tagged `source: "llm"` or `source: "abigail"`. The system under test cannot label its own truth data.
- build_confusion_matrix is single-pass: the matrix cell and the cell trace are incremented in lockstep, preventing the class of bugs where a post-hoc grouping produces arithmetic that looks right but points at the wrong test IDs.
- Every non-zero cell has exactly cell_value test IDs in its trace, or the verifier fails. Empty traces on non-zero cells fail CI.
- Matrix sum + unparseable + quarantined == total, or raise. This invariant is asserted inside build_confusion_matrix AND re-checked independently by verify_confusion.
- Labels are ASCII-sorted for deterministic, byte-identical output. Hallucinated labels (predicted but never in ground truth) are flagged.
- Quarantined truth rows (PEDS provenance ambiguous) are counted in total but excluded from cells. They are listed in the artifact for human review.
- Source file SHA-256 and ground-truth SHA-256 are recorded in each artifact. Verifier fails on drift.
- PEDS client enforces a 1 req/s rate limit, 3x retry on 429/5xx with backoff, and raises PedsError on non-dict JSON. No silent fallback to empty.
- Build CLI emits both .json (machine-readable) and .md (human-readable with axis convention, per-class P/R/F1, and off-diagonal trace links).
- Exception classes follow the PEP 8 Error suffix per ruff N818.

Next stages scaffold PEDS ground truth for paralegal_oa_extraction + paralegal_clm_extraction, then run the harness against abigail.app for a fresh Paralegal-tier benchmark, then emit matrix artifacts.

Tests: 15 added (52 total, all pass). Lint: ruff clean. Types: mypy clean.
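A minimal sketch of the single-pass, lockstep-trace construction and the sum invariant described above. The dataclass shape, argument names, and quarantine handling here are illustrative assumptions, not the actual confusion.py API.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ConfusionResult:
    # cells[(truth_label, predicted_label)] -> count
    cells: dict = field(default_factory=lambda: defaultdict(int))
    # traces[(truth_label, predicted_label)] -> list of test IDs
    traces: dict = field(default_factory=lambda: defaultdict(list))
    unparseable: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)
    total: int = 0

def build_confusion_matrix(truth: dict, predictions: dict, quarantined_ids=frozenset()) -> ConfusionResult:
    """Single pass over the ground truth; cell count and cell trace move together."""
    result = ConfusionResult()
    for test_id, truth_label in sorted(truth.items()):   # deterministic order
        result.total += 1
        if test_id in quarantined_ids:
            result.quarantined.append(test_id)            # counted in total, excluded from cells
            continue
        predicted = predictions.get(test_id)
        if predicted is None:
            result.unparseable.append(test_id)            # model output could not be parsed
            continue
        key = (truth_label, predicted)
        result.cells[key] += 1                            # lockstep: the count...
        result.traces[key].append(test_id)                # ...and the trace, in the same step
    # Inline invariant; the verifier re-checks this independently from the artifact.
    assert sum(result.cells.values()) + len(result.unparseable) + len(result.quarantined) == result.total
    assert all(len(result.traces[k]) == v for k, v in result.cells.items())
    return result
```

verify_confusion would re-derive the same sums from the emitted artifact rather than trusting the inline assertion, so a hand-edited artifact fails even if it is internally arithmetic-consistent.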
Stage 2 lands an end-to-end matrix for the action_classification task
using existing PEDS-derived ground truth and the Variant B benchmark run.
No synthetic data, no LLM-produced truth.
Changes:
- ground_truth.py: action_classification now projects the (has_non_final,
has_final, has_allowance) triple into a deterministic label
NF{0|1}-F{0|1}-A{0|1}. total_oa_rounds is kept in the truth file for
audit but is a count, not a matrix axis, so it is not registered as a
class. Non-bool values for these flags fail loudly in both predicted and
reference extraction rather than being silently coerced (see the sketch
after this change list).
- scripts/build_ground_truth_action_classification.py: deterministic
builder that reads data/benchmark_cases_tier1_2.json (PEDS-derived
ground truth the cases were built from) and data/real_oa/uspto_peds_sample.jsonl
(raw PEDS pull) and writes data/ground_truth/action_classification.json
keyed by test_id. Each row carries peds_source lineage with
application_number, retrieved_at (pulled_at from the PEDS JSONL),
peds_field_path ("prosecution_events"), and raw_value_hash (SHA-256 of
the prosecution_events list). Upstream drift in the PEDS pull will
change raw_value_hash and fail the verifier.
- data/ground_truth/action_classification.json: 82 rows, all sourced from
PEDS. Zero quarantined. Zero LLM-produced labels.
- reports/confusion_matrices/abigail/action_classification.{json,md}:
First committed benchmark matrix. Variant B (ABIGAIL running Sonnet)
on the action_classification task scores 82/82 on the diagonal, zero
unparseable, zero off-diagonal, zero hallucinated classes. Five
observed truth classes (NF0-F0-A0, NF1-F0-A0, NF1-F0-A1, NF1-F1-A0,
NF1-F1-A1). The overall 95.9% reported in the source run applies to
the full 298-row multi-task run; for this single task the model is
perfect on the sample.
- Four new tests: diagonal projection, off-diagonal capture,
strict-bool truth validation, unparseable bucket on non-bool
prediction. Full suite: 19 passed.
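A minimal sketch of the label projection, the strict-bool check, and the raw_value_hash lineage field described in the bullets above. The helper names (project_action_label, compute_raw_value_hash) and the canonical-JSON choice are illustrative assumptions, not the builder's actual code.

```python
import hashlib
import json

def project_action_label(row: dict) -> str:
    """Project (has_non_final, has_final, has_allowance) into NF{0|1}-F{0|1}-A{0|1}.

    Strict-bool: anything other than a real bool fails loudly instead of coercing.
    """
    parts = []
    for key, prefix in (("has_non_final", "NF"), ("has_final", "F"), ("has_allowance", "A")):
        value = row[key]
        if not isinstance(value, bool):
            raise ValueError(f"{key} must be bool, got {type(value).__name__}: {value!r}")
        parts.append(f"{prefix}{int(value)}")
    return "-".join(parts)

def compute_raw_value_hash(prosecution_events: list) -> str:
    """SHA-256 of a canonicalised prosecution_events list; upstream drift changes this hash."""
    canonical = json.dumps(prosecution_events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Example: a case with a non-final and a final rejection but no allowance.
assert project_action_label(
    {"has_non_final": True, "has_final": True, "has_allowance": False}
) == "NF1-F1-A0"
```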
Lint: ruff clean.
Types: mypy clean.
Verifier: python -m patentbench.reports.verify_confusion
reports/confusion_matrices/abigail/action_classification.json
- VERIFIED.
Next: Stages 3-6 still remain for the paralegal_oa / paralegal_clm
tasks. Those task types do not yet exist in the benchmark dataset; they
require a new data pull from PEDS claim-set and rejection-type fields
and case generation before a matrix can be built. Tracked in follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 2 landed: first real confusion matrix committed

The pipeline now runs end-to-end on real data.

Result: 5 observed truth classes using the NF{0|1}-F{0|1}-A{0|1} projection. Perfect diagonal on the 82-case sample. Zero unparseable, zero off-diagonal, zero hallucinated classes.

Scope note on the remaining stages

Stages 3-6 as originally planned targeted the paralegal_oa / paralegal_clm tasks, which have no ground-truth data in the repo yet. The scaffold already supports all of that; only the data does not exist yet. Splitting the remaining work into a follow-up issue rather than keeping this PR open indefinitely.

Tests / lint

19 passed. ruff clean. mypy clean.
…3 prep) The remaining stages targeted paralegal tasks that had no data in the repo. This commit removes the data blocker for paralegal_clm_extraction: pulls claim-structure ground truth from Google Patents for every issued patent in the sample, generates harness-compatible cases, and wires the extractor/loader to accept the new provenance source.

Data pull:

- scripts/pull_paralegal_claim_data.py: polite (1 req/s) Google Patents scraper that fetches the public grant page per patent, parses the structural markers `<div class="claim">` and `<div class="claim-dependent">`, and records url + retrieved_at + SHA-256 of the raw HTML. Idempotent, resumable, fails loud after 3 retries. Ran clean for all 82 issued patents in data/real_oa/uspto_peds_sample.jsonl with zero parse failures. Independent counts: 1-7 (mode=3). Dependent counts: 0-27 (mode=17).
- data/real_oa/google_patents_claims.jsonl: 82 rows. Each carries enough provenance for a downstream verifier to detect upstream drift (URL plus SHA-256 of the HTML at pull time).

Case + truth generation:

- scripts/build_paralegal_clm_cases_and_truth.py: deterministic builder. Writes data/benchmark_cases/paralegal_clm_extraction.jsonl (82 cases with prompts the harness can send to any adapter) and data/ground_truth/paralegal_clm_extraction.json (82 truth rows keyed by test_id, each with a google_patents_source lineage block and source="google_patents").

Scaffold extension (ground_truth.py):

- load_ground_truth now accepts either peds_source OR google_patents_source lineage. Both require retrieved_at plus a content hash (raw_value_hash for PEDS, raw_html_sha256 for Google Patents) so drift is still caught. Rows with neither are rejected.
- The paralegal_clm_extraction extractor and reference-label resolver now accept integer counts (num_independent_claims, num_dependent_claims) in addition to the original list-shaped output. The integer path is required because Google Patents exposes structural counts, not per-claim text. The list path is retained for future harnesses that ask the model to echo claim references.
- REQUIRED_TRUTH_FIELDS[paralegal_clm_extraction] updated to the integer-count shape.

Harness runner (documentation + script):

- scripts/run_paralegal_clm_against_abigail.py: reads the cases and posts each prompt to https://abigail.app/api/v1/patentbench/generate. Emits a detailed_results run file compatible with patentbench.reports.build_confusion.

Known deployment-side blocker on Stage 3:

- The abigail.app /generate endpoint currently returns HTTP 501 "Generate endpoint requires expert_prosecution module" (see backend/orchestrator/routes/patentbench.py in the abigail repo). The /health endpoint is green; /parse-oa is available. Running the paralegal_clm matrix end-to-end is gated on that module being enabled in the deployed build. When it is, the full run is a single command: `ABIGAIL_API_KEY=... python scripts/run_paralegal_clm_against_abigail.py` followed by `python -m patentbench.reports.build_confusion ...`.

paralegal_oa_extraction is explicitly NOT touched by this commit. That task needs rejection-type metadata extracted from the actual Office Action PDFs (PAIR / Patent Center), which is a multi-step document-AI pipeline and cannot be honestly produced from the PEDS event-code metadata alone. Scheduled as a separate effort.

Tests: 19 passed (same as prior commit; behavior changes are covered by the existing invariant tests for load_ground_truth and by the action_classification suite). ruff clean. mypy clean.
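A minimal sketch of the claim-counting idea behind the data pull described above, assuming the two structural markers quoted in the commit message map directly to independent vs. dependent claims on the grant page. The function names, the URL pattern, and that marker-to-claim-type mapping are assumptions for illustration, not the script's actual implementation.

```python
import hashlib
import time
from datetime import datetime, timezone
from urllib.request import Request, urlopen

def fetch_grant_page(patent_number: str) -> bytes:
    """Fetch the public Google Patents grant page for one patent (URL pattern assumed)."""
    url = f"https://patents.google.com/patent/{patent_number}/en"
    request = Request(url, headers={"User-Agent": "patentbench-claims-pull"})
    with urlopen(request, timeout=30) as response:
        return response.read()

def count_claims(html: str) -> dict:
    """Count claims from the structural markers (assumption: plain marker = independent claim)."""
    dependent = html.count('<div class="claim-dependent">')
    independent = html.count('<div class="claim">')  # exact marker; does not match "claim-dependent"
    return {"num_independent_claims": independent, "num_dependent_claims": dependent}

def pull_one(patent_number: str) -> dict:
    raw = fetch_grant_page(patent_number)
    row = count_claims(raw.decode("utf-8", errors="replace"))
    row.update(
        patent_number=patent_number,
        url=f"https://patents.google.com/patent/{patent_number}/en",
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        raw_html_sha256=hashlib.sha256(raw).hexdigest(),  # drift detection at verify time
    )
    time.sleep(1.0)  # polite 1 req/s rate limit
    return row
```

The provenance fields (url, retrieved_at, raw_html_sha256) mirror what each google_patents_claims.jsonl row carries so a later verifier can detect upstream drift.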
Status: DRAFT — Stage 1 of 6 on this branch
This PR lands the scaffold for publishing per-task-type confusion matrices alongside PatentBench benchmark runs. No live data is fetched, no matrices are published yet. Additional commits on this branch will push through Stages 2–6 before it goes out of draft.
Why confusion matrices
PatentBench currently publishes scalar accuracy per task type (e.g.
action_classification: 92.7%). Scalar accuracy hides which class confusions drive the errors. For paralegal OA extraction, mis-classifying Final as Non-Final is a different kind of failure than mis-classifying Ex Parte Quayle as Non-Final: one silently costs an applicant appeal rights, the other is an annoyance. A confusion matrix makes this visible.

What landed in this commit
- `patentbench/reports/confusion.py`: `build_confusion_matrix()` — cell values and cell traces are incremented in lockstep. Post-hoc trace construction is banned because it decouples the trace from the value (the class of bug that produces arithmetically-consistent but test-id-wrong artifacts).
- `patentbench/reports/ground_truth.py`: `load_ground_truth()` refuses rows that lack `peds_source` provenance, and refuses rows tagged `source: "llm"` or `source: "abigail"`. The system under test cannot label its own truth data.
- `patentbench/reports/peds_client.py`: dict-only JSON handling (no silent fallback); a rate-limit/retry sketch follows this list.
- `patentbench/reports/build_confusion.py`
- `patentbench/reports/verify_confusion.py`
- `tests/test_reports_confusion.py`
- `.gitattributes`
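A minimal sketch of the client behavior described in this PR (1 req/s, 3x retry on 429/5xx with backoff, `PedsError` on non-dict JSON), assuming the `requests` library. The class name, method name, and backoff schedule are illustrative, not the actual peds_client.py API.

```python
import time
import requests

class PedsError(Exception):
    """Raised when PEDS returns something other than a JSON object."""

class PedsClient:
    def __init__(self, base_url: str = "https://ped.uspto.gov/api/queries", min_interval: float = 1.0):
        self.base_url = base_url
        self.min_interval = min_interval       # 1 req/s
        self._last_request = 0.0
        self._session = requests.Session()

    def query(self, payload: dict) -> dict:
        for attempt in range(3):                                   # 3x retry on 429/5xx
            self._throttle()
            response = self._session.post(self.base_url, json=payload, timeout=30)
            if response.status_code == 429 or response.status_code >= 500:
                time.sleep(2 ** attempt)                           # backoff: 1s, 2s, 4s
                continue
            response.raise_for_status()
            body = response.json()
            if not isinstance(body, dict):                         # no silent fallback to empty
                raise PedsError(f"expected JSON object, got {type(body).__name__}")
            return body
        raise PedsError("gave up after 3 attempts (429/5xx)")

    def _throttle(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```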
Invariants enforced by code + CI

- `sum(cells) + unparseable + quarantined == total`. Asserted inline in `build_confusion_matrix` and re-checked independently in `verify_confusion`.
- Every non-zero cell has exactly `cell_value` test IDs in its `cell_trace`. Empty traces on non-zero cells fail CI.
- `hallucinated_labels` lists any class the model predicted that never appears in ground truth.
- Parsing of `raw_response`: a fenced ```json block first, then the first balanced-brace substring, then `None`. `eval` and `exec` are forbidden. (A parsing sketch follows this list.)
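A minimal sketch of the three-step parsing order named above; the function name and regex are illustrative, not the module's actual code.

```python
import json
import re
from typing import Optional

FENCED_JSON = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)

def parse_raw_response(raw_response: str) -> Optional[dict]:
    """Fenced ```json block first, then first balanced-brace substring, then None."""
    fenced = FENCED_JSON.search(raw_response)
    if fenced:
        try:
            parsed = json.loads(fenced.group(1))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    # Fall back to the first balanced-brace substring that parses.
    start = raw_response.find("{")
    while start != -1:
        depth = 0
        for end, char in enumerate(raw_response[start:], start):
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(raw_response[start:end + 1])
                    except json.JSONDecodeError:
                        break  # try the next opening brace
        start = raw_response.find("{", start + 1)
    return None  # never eval/exec the model output
```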
Tamper test (verified locally)

Test summary
Why this is draft
Stages 2–6 still pending. Stage 4 onward touches live PEDS and live
abigail.app, which produces data that lands in this repo — I am holding those steps until the project owner confirms the open questions below.

Open questions for the owner
- PEDS endpoint: `https://ped.uspto.gov/api/queries`?
- Treat the existing `data/real_oa/benchmark_cases.jsonl` 82-case set as the v1 universe, or do a fresh pull?
- Scope `paralegal_oa_extraction` ground truth to fields PEDS provides cleanly (`action_type`, `mail_date`, `examiner`) and defer rejection-type truth to a separate PR where a USPTO-registered attorney has hand-labeled 20–30 cases?

Once these are answered, I'll push commits to this branch for Stages 2–6 and mark ready for review.
Stage plan
- Ground truth: `data/ground_truth/*.json`
- Benchmark run: `patentbench/harness.py` against `abigail.app`, stored under `data/benchmark_runs/abigail/<iso>.json`
- Matrix artifacts: `reports/confusion_matrices/` + update `README.md`

Non-goals for this branch
- No changes to the existing benchmark results (`data/benchmark_results*.json`).