feat(reports): scaffold confusion matrix infrastructure for PatentBench (Stage 1 of 6)#6
Adds the reporting package required to publish per-task-type confusion matrices for PatentBench benchmark runs. No network calls, no data published yet. Stage 1 of 6 on feat/paralegal-oa-cm-matrices.

Key properties of the scaffold (enforced by 15 new pytest tests):

- Ground-truth loader rejects any row lacking peds_source lineage and any row tagged `source: "llm"` or `source: "abigail"`. The system under test cannot label its own truth data.
- build_confusion_matrix is single-pass: the matrix cell and the cell trace are incremented in lockstep, preventing the class of bugs where a post-hoc grouping produces arithmetic that looks right but points at the wrong test IDs.
- Every non-zero cell has exactly cell_value test IDs in its trace, or the verifier fails. Empty traces on non-zero cells fail CI.
- Matrix sum + unparseable + quarantined == total, or raise. This invariant is asserted inside build_confusion_matrix AND re-checked independently by verify_confusion.
- Labels are ASCII-sorted for deterministic, byte-identical output. Hallucinated labels (predicted but never in ground truth) are flagged.
- Quarantined truth rows (PEDS provenance ambiguous) are counted in total but excluded from cells. They are listed in the artifact for human review.
- Source file SHA-256 and ground-truth SHA-256 are recorded in each artifact. Verifier fails on drift.
- PEDS client enforces a 1 req/s rate limit, 3x retry on 429/5xx with backoff, and raises PedsError on non-dict JSON. No silent fallback to empty.
- Build CLI emits both .json (machine-readable) and .md (human-readable with axis convention, per-class P/R/F1, and off-diagonal trace links).
- Exception classes follow the PEP 8 Error suffix per ruff N818.

Next stages scaffold PEDS ground truth for paralegal_oa_extraction + paralegal_clm_extraction, then run the harness against abigail.app for a fresh Paralegal-tier benchmark, then emit matrix artifacts.

Tests: 15 added (52 total, all pass). Lint: ruff clean. Types: mypy clean.
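A minimal sketch of the single-pass, lockstep-trace construction and the sum invariant described above. The dataclass shape, argument names, and quarantine handling here are illustrative assumptions, not the actual confusion.py API.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ConfusionResult:
    # cells[(truth_label, predicted_label)] -> count
    cells: dict = field(default_factory=lambda: defaultdict(int))
    # traces[(truth_label, predicted_label)] -> list of test IDs
    traces: dict = field(default_factory=lambda: defaultdict(list))
    unparseable: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)
    total: int = 0

def build_confusion_matrix(truth: dict, predictions: dict, quarantined_ids=frozenset()) -> ConfusionResult:
    """Single pass over the ground truth; cell count and cell trace move together."""
    result = ConfusionResult()
    for test_id, truth_label in sorted(truth.items()):   # deterministic order
        result.total += 1
        if test_id in quarantined_ids:
            result.quarantined.append(test_id)            # counted in total, excluded from cells
            continue
        predicted = predictions.get(test_id)
        if predicted is None:
            result.unparseable.append(test_id)            # model output could not be parsed
            continue
        key = (truth_label, predicted)
        result.cells[key] += 1                            # lockstep: the count...
        result.traces[key].append(test_id)                # ...and the trace, in the same step
    # Inline invariant; the verifier re-checks this independently from the artifact.
    assert sum(result.cells.values()) + len(result.unparseable) + len(result.quarantined) == result.total
    assert all(len(result.traces[k]) == v for k, v in result.cells.items())
    return result
```

verify_confusion would re-derive the same sums from the emitted artifact rather than trusting the inline assertion, so a hand-edited artifact fails even if it is internally arithmetic-consistent.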
Stage 2 lands an end-to-end matrix for the action_classification task
using existing PEDS-derived ground truth and the Variant B benchmark run.
No synthetic data, no LLM-produced truth.
Changes:
- ground_truth.py: action_classification now projects the (has_non_final,
has_final, has_allowance) triple into a deterministic label
NF{0|1}-F{0|1}-A{0|1}. total_oa_rounds is kept in the truth file for
audit but is a count, not a matrix axis, so it is not registered as a
class. Non-bool values for these flags fail loudly in both predicted and
reference extraction rather than being silently coerced (see the sketch
after this change list).
- scripts/build_ground_truth_action_classification.py: deterministic
builder that reads data/benchmark_cases_tier1_2.json (PEDS-derived
ground truth the cases were built from) and data/real_oa/uspto_peds_sample.jsonl
(raw PEDS pull) and writes data/ground_truth/action_classification.json
keyed by test_id. Each row carries peds_source lineage with
application_number, retrieved_at (pulled_at from the PEDS JSONL),
peds_field_path ("prosecution_events"), and raw_value_hash (SHA-256 of
the prosecution_events list). Upstream drift in the PEDS pull will
change raw_value_hash and fail the verifier.
- data/ground_truth/action_classification.json: 82 rows, all sourced from
PEDS. Zero quarantined. Zero LLM-produced labels.
- reports/confusion_matrices/abigail/action_classification.{json,md}:
First committed benchmark matrix. Variant B (ABIGAIL running Sonnet)
on the action_classification task scores 82/82 on the diagonal, zero
unparseable, zero off-diagonal, zero hallucinated classes. Five
observed truth classes (NF0-F0-A0, NF1-F0-A0, NF1-F0-A1, NF1-F1-A0,
NF1-F1-A1). The overall 95.9% reported in the source run applies to
the full 298-row multi-task run; for this single task the model is
perfect on the sample.
- Four new tests: diagonal projection, off-diagonal capture,
strict-bool truth validation, unparseable bucket on non-bool
prediction. Full suite: 19 passed.
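A minimal sketch of the label projection, the strict-bool check, and the raw_value_hash lineage field described in the bullets above. The helper names (project_action_label, compute_raw_value_hash) and the canonical-JSON choice are illustrative assumptions, not the builder's actual code.

```python
import hashlib
import json

def project_action_label(row: dict) -> str:
    """Project (has_non_final, has_final, has_allowance) into NF{0|1}-F{0|1}-A{0|1}.

    Strict-bool: anything other than a real bool fails loudly instead of coercing.
    """
    parts = []
    for key, prefix in (("has_non_final", "NF"), ("has_final", "F"), ("has_allowance", "A")):
        value = row[key]
        if not isinstance(value, bool):
            raise ValueError(f"{key} must be bool, got {type(value).__name__}: {value!r}")
        parts.append(f"{prefix}{int(value)}")
    return "-".join(parts)

def compute_raw_value_hash(prosecution_events: list) -> str:
    """SHA-256 of a canonicalised prosecution_events list; upstream drift changes this hash."""
    canonical = json.dumps(prosecution_events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Example: a case with a non-final and a final rejection but no allowance.
assert project_action_label(
    {"has_non_final": True, "has_final": True, "has_allowance": False}
) == "NF1-F1-A0"
```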
Lint: ruff clean.
Types: mypy clean.
Verifier: python -m patentbench.reports.verify_confusion
reports/confusion_matrices/abigail/action_classification.json
- VERIFIED.
Next: Stages 3-6 still remain for the paralegal_oa / paralegal_clm
tasks. Those task types do not yet exist in the benchmark dataset; they
require a new data pull from PEDS claim-set and rejection-type fields
and case generation before a matrix can be built. Tracked in follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 2 landed: first real confusion matrix committed

The pipeline now runs end-to-end on real data.

Result: 5 observed truth classes using the NF{0|1}-F{0|1}-A{0|1} projection. Perfect diagonal on the 82-case sample. Zero unparseable, zero off-diagonal, zero hallucinated classes.

Scope note on the remaining stages

Stages 3-6 as originally planned targeted the paralegal_oa / paralegal_clm tasks, which have no ground-truth data in the repo yet. The scaffold already supports all of that; only the data does not exist yet. Splitting the remaining work into a follow-up issue rather than keeping this PR open indefinitely.

Tests / lint

19 passed. ruff clean. mypy clean.
…3 prep) The remaining stages targeted paralegal tasks that had no data in the repo. This commit removes the data blocker for paralegal_clm_extraction: pulls claim-structure ground truth from Google Patents for every issued patent in the sample, generates harness-compatible cases, and wires the extractor/loader to accept the new provenance source.

Data pull:

- scripts/pull_paralegal_claim_data.py: polite (1 req/s) Google Patents scraper that fetches the public grant page per patent, parses the structural markers `<div class="claim">` and `<div class="claim-dependent">`, and records url + retrieved_at + SHA-256 of the raw HTML. Idempotent, resumable, fails loud after 3 retries. Ran clean for all 82 issued patents in data/real_oa/uspto_peds_sample.jsonl with zero parse failures. Independent counts: 1-7 (mode=3). Dependent counts: 0-27 (mode=17).
- data/real_oa/google_patents_claims.jsonl: 82 rows. Each carries enough provenance for a downstream verifier to detect upstream drift (URL plus SHA-256 of the HTML at pull time).

Case + truth generation:

- scripts/build_paralegal_clm_cases_and_truth.py: deterministic builder. Writes data/benchmark_cases/paralegal_clm_extraction.jsonl (82 cases with prompts the harness can send to any adapter) and data/ground_truth/paralegal_clm_extraction.json (82 truth rows keyed by test_id, each with a google_patents_source lineage block and source="google_patents").

Scaffold extension (ground_truth.py):

- load_ground_truth now accepts either peds_source OR google_patents_source lineage. Both require retrieved_at plus a content hash (raw_value_hash for PEDS, raw_html_sha256 for Google Patents) so drift is still caught. Rows with neither are rejected.
- The paralegal_clm_extraction extractor and reference-label resolver now accept integer counts (num_independent_claims, num_dependent_claims) in addition to the original list-shaped output. The integer path is required because Google Patents exposes structural counts, not per-claim text. The list path is retained for future harnesses that ask the model to echo claim references.
- REQUIRED_TRUTH_FIELDS[paralegal_clm_extraction] updated to the integer-count shape.

Harness runner (documentation + script):

- scripts/run_paralegal_clm_against_abigail.py: reads the cases and posts each prompt to https://abigail.app/api/v1/patentbench/generate. Emits a detailed_results run file compatible with patentbench.reports.build_confusion.

Known deployment-side blocker on Stage 3:

- The abigail.app /generate endpoint currently returns HTTP 501 "Generate endpoint requires expert_prosecution module" (see backend/orchestrator/routes/patentbench.py in the abigail repo). The /health endpoint is green; /parse-oa is available. Running the paralegal_clm matrix end-to-end is gated on that module being enabled in the deployed build. When it is, the full run is a single command: `ABIGAIL_API_KEY=... python scripts/run_paralegal_clm_against_abigail.py` followed by `python -m patentbench.reports.build_confusion ...`.

paralegal_oa_extraction is explicitly NOT touched by this commit. That task needs rejection-type metadata extracted from the actual Office Action PDFs (PAIR / Patent Center), which is a multi-step document-AI pipeline and cannot be honestly produced from the PEDS event-code metadata alone. Scheduled as a separate effort.

Tests: 19 passed (same as prior commit; behavior changes are covered by the existing invariant tests for load_ground_truth and by the action_classification suite). ruff clean. mypy clean.
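A minimal sketch of the claim-counting idea behind the data pull described above, assuming the two structural markers quoted in the commit message map directly to independent vs. dependent claims on the grant page. The function names, the URL pattern, and that marker-to-claim-type mapping are assumptions for illustration, not the script's actual implementation.

```python
import hashlib
import time
from datetime import datetime, timezone
from urllib.request import Request, urlopen

def fetch_grant_page(patent_number: str) -> bytes:
    """Fetch the public Google Patents grant page for one patent (URL pattern assumed)."""
    url = f"https://patents.google.com/patent/{patent_number}/en"
    request = Request(url, headers={"User-Agent": "patentbench-claims-pull"})
    with urlopen(request, timeout=30) as response:
        return response.read()

def count_claims(html: str) -> dict:
    """Count claims from the structural markers (assumption: plain marker = independent claim)."""
    dependent = html.count('<div class="claim-dependent">')
    independent = html.count('<div class="claim">')  # exact marker; does not match "claim-dependent"
    return {"num_independent_claims": independent, "num_dependent_claims": dependent}

def pull_one(patent_number: str) -> dict:
    raw = fetch_grant_page(patent_number)
    row = count_claims(raw.decode("utf-8", errors="replace"))
    row.update(
        patent_number=patent_number,
        url=f"https://patents.google.com/patent/{patent_number}/en",
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        raw_html_sha256=hashlib.sha256(raw).hexdigest(),  # drift detection at verify time
    )
    time.sleep(1.0)  # polite 1 req/s rate limit
    return row
```

The provenance fields (url, retrieved_at, raw_html_sha256) mirror what each google_patents_claims.jsonl row carries so a later verifier can detect upstream drift.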
Status: DRAFT — Stage 1 of 6 on this branch
This PR lands the scaffold for publishing per-task-type confusion matrices alongside PatentBench benchmark runs. No live data is fetched, no matrices are published yet. Additional commits on this branch will push through Stages 2–6 before it goes out of draft.
Why confusion matrices
PatentBench currently publishes scalar accuracy per task type (e.g.
action_classification: 92.7%). Scalar accuracy hides which class confusions drive the errors. For paralegal OA extraction, mis-classifying Final as Non-Final is a different kind of failure than mis-classifying Ex Parte Quayle as Non-Final: one silently costs an applicant appeal rights, the other is an annoyance. A confusion matrix makes this visible.

What landed in this commit
- `patentbench/reports/confusion.py`: `build_confusion_matrix()` — cell values and cell traces are incremented in lockstep. Post-hoc trace construction is banned because it decouples the trace from the value (the class of bug that produces arithmetically-consistent but test-id-wrong artifacts).
- `patentbench/reports/ground_truth.py`: `load_ground_truth()` refuses rows that lack `peds_source` provenance, and refuses rows tagged `source: "llm"` or `source: "abigail"`. The system under test cannot label its own truth data.
- `patentbench/reports/peds_client.py`: dict-only JSON handling (no silent fallback); a rate-limit/retry sketch follows this list.
- `patentbench/reports/build_confusion.py`
- `patentbench/reports/verify_confusion.py`
- `tests/test_reports_confusion.py`
- `.gitattributes`
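A minimal sketch of the client behavior described in this PR (1 req/s, 3x retry on 429/5xx with backoff, `PedsError` on non-dict JSON), assuming the `requests` library. The class name, method name, and backoff schedule are illustrative, not the actual peds_client.py API.

```python
import time
import requests

class PedsError(Exception):
    """Raised when PEDS returns something other than a JSON object."""

class PedsClient:
    def __init__(self, base_url: str = "https://ped.uspto.gov/api/queries", min_interval: float = 1.0):
        self.base_url = base_url
        self.min_interval = min_interval       # 1 req/s
        self._last_request = 0.0
        self._session = requests.Session()

    def query(self, payload: dict) -> dict:
        for attempt in range(3):                                   # 3x retry on 429/5xx
            self._throttle()
            response = self._session.post(self.base_url, json=payload, timeout=30)
            if response.status_code == 429 or response.status_code >= 500:
                time.sleep(2 ** attempt)                           # backoff: 1s, 2s, 4s
                continue
            response.raise_for_status()
            body = response.json()
            if not isinstance(body, dict):                         # no silent fallback to empty
                raise PedsError(f"expected JSON object, got {type(body).__name__}")
            return body
        raise PedsError("gave up after 3 attempts (429/5xx)")

    def _throttle(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```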
Invariants enforced by code + CI

- `sum(cells) + unparseable + quarantined == total`. Asserted inline in `build_confusion_matrix` and re-checked independently in `verify_confusion`.
- Every non-zero cell has exactly `cell_value` test IDs in its `cell_trace`. Empty traces on non-zero cells fail CI.
- `hallucinated_labels` lists any class the model predicted that never appears in ground truth.
- Parsing of `raw_response`: a fenced ```json block first, then the first balanced-brace substring, then `None`. `eval` and `exec` are forbidden. (A parsing sketch follows this list.)
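A minimal sketch of the three-step parsing order named above; the function name and regex are illustrative, not the module's actual code.

```python
import json
import re
from typing import Optional

FENCED_JSON = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)

def parse_raw_response(raw_response: str) -> Optional[dict]:
    """Fenced ```json block first, then first balanced-brace substring, then None."""
    fenced = FENCED_JSON.search(raw_response)
    if fenced:
        try:
            parsed = json.loads(fenced.group(1))
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
    # Fall back to the first balanced-brace substring that parses.
    start = raw_response.find("{")
    while start != -1:
        depth = 0
        for end, char in enumerate(raw_response[start:], start):
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(raw_response[start:end + 1])
                    except json.JSONDecodeError:
                        break  # try the next opening brace
        start = raw_response.find("{", start + 1)
    return None  # never eval/exec the model output
```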
Tamper test (verified locally)

Test summary
Why this is draft
Stages 2–6 still pending. Stage 4 onward touches live PEDS and live
abigail.app, which produces data that lands in this repo — I am holding those steps until the project owner confirms the open questions below.

Open questions for the owner
- PEDS endpoint: `https://ped.uspto.gov/api/queries`?
- Treat the existing `data/real_oa/benchmark_cases.jsonl` 82-case set as the v1 universe, or do a fresh pull?
- Scope `paralegal_oa_extraction` ground truth to fields PEDS provides cleanly (`action_type`, `mail_date`, `examiner`) and defer rejection-type truth to a separate PR where a USPTO-registered attorney has hand-labeled 20–30 cases?

Once these are answered, I'll push commits to this branch for Stages 2–6 and mark ready for review.
Stage plan
- Ground truth: `data/ground_truth/*.json`
- Benchmark run: `patentbench/harness.py` against `abigail.app`, stored under `data/benchmark_runs/abigail/<iso>.json`
- Matrix artifacts: `reports/confusion_matrices/` + update `README.md`

Non-goals for this branch
- No changes to the existing benchmark results (`data/benchmark_results*.json`).