
feat(reports): scaffold confusion matrix infrastructure for PatentBench (Stage 1 of 6)#6

Draft
rhahn28 wants to merge 3 commits into main from feat/paralegal-oa-cm-matrices

Conversation

rhahn28 (Owner) commented Apr 20, 2026

Status: DRAFT — Stage 1 of 6 on this branch

This PR lands the scaffold for publishing per-task-type confusion matrices alongside PatentBench benchmark runs. No live data is fetched, no matrices are published yet. Additional commits on this branch will push through Stages 2–6 before it goes out of draft.

Why confusion matrices

PatentBench currently publishes scalar accuracy per task type (e.g. action_classification: 92.7%). Scalar accuracy hides which class confusions drive the errors. For paralegal OA extraction, mis-classifying Final as Non-Final is a different kind of failure than mis-classifying Ex Parte Quayle as Non-Final — one silently costs an applicant appeal rights, the other is an annoyance. A confusion matrix makes this visible.

What landed in this commit

| Module | Responsibility |
| --- | --- |
| patentbench/reports/confusion.py | Single-pass `build_confusion_matrix()` — cell values and cell traces are incremented in lockstep (sketched below). Post-hoc trace construction is banned because it decouples the trace from the value (the class of bug that produces arithmetically consistent but test-id-wrong artifacts). |
| patentbench/reports/ground_truth.py | `load_ground_truth()` refuses rows that lack peds_source provenance, and refuses rows tagged `source: "llm"` or `source: "abigail"`. The system under test cannot label its own truth data. |
| patentbench/reports/peds_client.py | USPTO PEDS client with 1 req/s throttle, 3× retry on 429/5xx with backoff, and strict dict-only JSON handling (no silent fallback). |
| patentbench/reports/build_confusion.py | CLI: run.json + truth.json → deterministic matrix.json + human-readable matrix.md. |
| patentbench/reports/verify_confusion.py | Independent verifier. Reconstructs the matrix from source and fails loud on any drift in cells, traces, SHA-256, labels, or totals. |
| tests/test_reports_confusion.py | 15 invariant and negative-case tests. |
| .gitattributes | Pins LF on artifact files so cross-platform checkouts do not invalidate committed SHA-256 values. |
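
For orientation, here is a minimal sketch of the single-pass shape. The row keys (`test_id`, `truth_label`, `predicted_label`) and the output layout are illustrative, not the module's actual API, and the quarantine bucket is omitted for brevity:

```python
from collections import defaultdict


def build_confusion_matrix(rows):
    """Single-pass build: each row increments its cell value and appends to
    that cell's trace in the same step, so traces cannot drift from counts.
    predicted_label is None when the model response was unparseable."""
    cells = defaultdict(int)    # (truth, predicted) -> count
    traces = defaultdict(list)  # (truth, predicted) -> [test_id, ...]
    unparseable = 0
    total = 0
    truth_labels, predicted_labels = set(), set()

    for row in rows:
        total += 1
        truth, pred = row["truth_label"], row["predicted_label"]
        truth_labels.add(truth)
        if pred is None:
            unparseable += 1
            continue
        predicted_labels.add(pred)
        cells[(truth, pred)] += 1  # value and trace move in lockstep
        traces[(truth, pred)].append(row["test_id"])

    # Inline accounting invariant: every row lands in a cell or a bucket.
    assert sum(cells.values()) + unparseable == total

    return {
        "labels": sorted(truth_labels | predicted_labels),  # ASCII sort
        "cells": {f"{t}|{p}": n for (t, p), n in cells.items()},
        "traces": {f"{t}|{p}": ids for (t, p), ids in traces.items()},
        "hallucinated_labels": sorted(predicted_labels - truth_labels),
        "unparseable": unparseable,
        "total": total,
    }
```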

Invariants enforced by code + CI

  • sum(cells) + unparseable + quarantined == total. Asserted inline in build_confusion_matrix and re-checked independently in verify_confusion.
  • Every non-zero cell has exactly cell_value test IDs in its cell_trace. Empty traces on non-zero cells fail CI.
  • Labels are ASCII-sorted for byte-identical output across operating systems.
  • hallucinated_labels lists any class the model predicted that never appears in ground truth.
  • Source run file SHA-256 and ground-truth file SHA-256 are baked into every artifact. Verifier fails on drift.
  • Parse order for raw_response: fenced `json` code block first, then the first balanced-brace substring, then `None`. `eval` and `exec` are forbidden (parse ladder sketched after this list).
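
The parse ladder could look roughly like this. A sketch only, not the scaffold's actual function; the balanced-brace scan here is deliberately naive and ignores braces inside JSON strings:

```python
import json
import re

# Three backticks written as a quantifier ({3}) to keep this fence intact.
FENCED_JSON = re.compile(r"`{3}json\s*(.*?)`{3}", re.DOTALL)


def extract_prediction(raw_response):
    """Parse ladder: fenced json block, then first balanced-brace
    substring, then None. json.loads only -- never eval/exec."""
    fenced = FENCED_JSON.search(raw_response)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    start = raw_response.find("{")
    if start != -1:
        depth = 0
        for i, ch in enumerate(raw_response[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(raw_response[start : i + 1])
                    except json.JSONDecodeError:
                        break
    return None
```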

Tamper test (verified locally)

$ python -m patentbench.reports.verify_confusion /tmp/out.json
VERIFIED: /tmp/out.json

# mutate matrix[0][0] 1 -> 99
$ python -m patentbench.reports.verify_confusion /tmp/out.json
Confusion matrix verification FAILED:
  - matrix values differ
exit=1
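
For flavor, the cheap half of what the verifier checks (hash drift plus the accounting and trace-cardinality invariants) might look like this. Artifact keys (`source_hashes`, `cells`, `traces`, `quarantined`) are hypothetical, and the full rebuild-from-source pass is not shown:

```python
import hashlib
import json
import sys
from pathlib import Path


def verify(artifact_path):
    a = json.loads(Path(artifact_path).read_text())
    failures = []
    # Hash drift: the artifact bakes in the SHA-256 of each input file.
    for name, claimed in a["source_hashes"].items():
        actual = hashlib.sha256(Path(name).read_bytes()).hexdigest()
        if actual != claimed:
            failures.append(f"source hash drift: {name}")
    # Accounting: every row lands somewhere, exactly once.
    if sum(a["cells"].values()) + a["unparseable"] + a["quarantined"] != a["total"]:
        failures.append("cell/total accounting broken")
    # Trace cardinality: a non-zero cell carries exactly cell_value test IDs.
    for key, n in a["cells"].items():
        if n > 0 and len(a["traces"].get(key, [])) != n:
            failures.append(f"trace cardinality wrong for {key}")
    return failures


if __name__ == "__main__":
    problems = verify(sys.argv[1])
    if problems:
        print("Confusion matrix verification FAILED:")
        for p in problems:
            print(f"  - {p}")
        sys.exit(1)
    print(f"VERIFIED: {sys.argv[1]}")
```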

Test summary

  • 52 pass (37 pre-existing + 15 new)
  • ruff: clean
  • mypy: clean

Why this is draft

Stages 2–6 are still pending. Stage 4 onward touches live PEDS and live abigail.app and produces data that lands in this repo; I am holding those steps until the project owner confirms the open questions below.

Open questions for the owner

  1. Is the PEDS endpoint https://ped.uspto.gov/api/queries?
  2. Use the existing data/real_oa/benchmark_cases.jsonl 82-case set as the v1 universe, or do a fresh pull?
  3. Filter v1 to published applications only (PEDS-authoritative claims), or quarantine unpublished cases and document them?
  4. Scope v1 paralegal_oa_extraction ground truth to fields PEDS provides cleanly (action_type, mail_date, examiner) and defer rejection-type truth to a separate PR where a USPTO-registered attorney has hand-labeled 20–30 cases?

Once these are answered, I'll push commits to this branch for Stages 2–6 and mark ready for review.

Stage plan

  1. ✅ Scaffold + tests (this commit)
  2. PEDS client integration tests (mocked)
  3. Ground-truth fetcher script (one-shot) + seeded data/ground_truth/*.json
  4. Fresh ABIGAIL eval on Paralegal subset (run via patentbench/harness.py against abigail.app, stored under data/benchmark_runs/abigail/<iso>.json)
  5. Generate matrix artifacts under reports/confusion_matrices/ + update README.md
  6. Mark ready for review

Non-goals for this branch

  • No changes to pre-existing benchmark data (data/benchmark_results*.json).
  • No changes to pre-existing adapters or scripts.
  • No production DB migrations anywhere.
  • No Admin-tier matrices in v1 (deferred — Admin is deterministic and must be 100%).
  • No SME (human-attorney) matrices (deferred — SME annotation data does not yet exist).

rhahn28 and others added 2 commits April 20, 2026 01:32
Adds the reporting package required to publish per-task-type confusion
matrices for PatentBench benchmark runs. No network calls, no data
published yet. Stage 1 of 6 on feat/paralegal-oa-cm-matrices.

Key properties of the scaffold (enforced by 15 new pytest tests):

- Ground-truth loader rejects any row lacking peds_source lineage and any
  row tagged `source: "llm"` or `source: "abigail"`. System under test
  cannot label its own truth data.

- build_confusion_matrix is single-pass: the matrix cell and the cell
  trace are incremented in lockstep, preventing the class of bugs where a
  post-hoc grouping produces arithmetic that looks right but points at
  the wrong test IDs.

- Every non-zero cell has exactly cell_value test IDs in its trace, or
  the verifier fails. Empty traces on non-zero cells fail CI.

- Matrix sum + unparseable + quarantined == total, or raise. This
  invariant is asserted inside build_confusion_matrix AND re-checked
  independently by verify_confusion.

- Labels are ASCII-sorted for deterministic, byte-identical output.
  Hallucinated labels (predicted but never in ground truth) are flagged.

- Quarantined truth rows (PEDS provenance ambiguous) are counted in
  total but excluded from cells. They are listed in the artifact for
  human review.

- Source file SHA-256 and ground-truth SHA-256 are recorded in each
  artifact. Verifier fails on drift.

- PEDS client enforces 1 req/s rate limit, 3x retry on 429/5xx with
  backoff, and raises PedsError on non-dict JSON. No silent fallback
  to empty (sketched after this list).

- Build CLI emits both .json (machine-readable) and .md (human-readable
  with axis convention, per-class P/R/F1, and off-diagonal trace links;
  the P/R/F1 computation is sketched at the end of this message).

- Exception classes follow the PEP-8 Error suffix per ruff N818.
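
A sketch of the throttle/retry shape from the PEDS-client bullet above (requests-based for illustration; the real client's internals may differ):

```python
import time

import requests  # illustrative HTTP layer


class PedsError(Exception):
    """Persistent HTTP failure or non-dict JSON."""


class PedsClient:
    BASE = "https://ped.uspto.gov/api/queries"  # endpoint still an open question

    def __init__(self):
        self._last_request = float("-inf")

    def get(self, params):
        for attempt in range(3):  # 3x retry on 429/5xx
            # 1 req/s throttle: sleep off the remainder of the last second.
            wait = 1.0 - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
            self._last_request = time.monotonic()
            resp = requests.get(self.BASE, params=params, timeout=30)
            if resp.status_code == 429 or resp.status_code >= 500:
                time.sleep(2 ** attempt)  # exponential backoff
                continue
            resp.raise_for_status()
            payload = resp.json()
            if not isinstance(payload, dict):  # strict dict-only JSON
                raise PedsError(f"non-dict JSON: {type(payload).__name__}")
            return payload
        raise PedsError("gave up after 3 retries")
```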

Next stages scaffold PEDS ground truth for paralegal_oa_extraction +
paralegal_clm_extraction, then run the harness against abigail.app for a
fresh Paralegal-tier benchmark, then emit matrix artifacts.

Tests: 15 added (52 total, all pass).
Lint: ruff clean.
Types: mypy clean.
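
The per-class P/R/F1 mentioned in the .md-emitter bullet reduces to standard cell arithmetic (an illustrative helper, not the CLI's actual code):

```python
def per_class_prf(cells, labels):
    """Per-class precision/recall/F1 from (truth, predicted) -> count cells.
    Zero denominators yield 0.0 rather than raising."""
    out = {}
    for c in labels:
        tp = cells.get((c, c), 0)
        fp = sum(n for (t, p), n in cells.items() if p == c and t != c)
        fn = sum(n for (t, p), n in cells.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        out[c] = {"precision": precision, "recall": recall, "f1": f1}
    return out
```
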
Stage 2 lands an end-to-end matrix for the action_classification task
using existing PEDS-derived ground truth and the Variant B benchmark run.
No synthetic data, no LLM-produced truth.

Changes:

- ground_truth.py: action_classification now projects the (has_non_final,
  has_final, has_allowance) triple into a deterministic label
  NF{0|1}-F{0|1}-A{0|1}. total_oa_rounds is kept in the truth file for
  audit but is a count, not a matrix axis, so it is not registered as a
  class. Non-bool values in the boolean fields fail loud in both predicted
  and reference extraction rather than silently coercing.

- scripts/build_ground_truth_action_classification.py: deterministic
  builder that reads data/benchmark_cases_tier1_2.json (PEDS-derived
  ground truth the cases were built from) and data/real_oa/uspto_peds_sample.jsonl
  (raw PEDS pull) and writes data/ground_truth/action_classification.json
  keyed by test_id. Each row carries peds_source lineage with
  application_number, retrieved_at (pulled_at from the PEDS JSONL),
  peds_field_path ("prosecution_events"), and raw_value_hash (SHA-256 of
  the prosecution_events list). Upstream drift in the PEDS pull will
  change raw_value_hash and fail the verifier (hash construction sketched
  after this list).

- data/ground_truth/action_classification.json: 82 rows, all sourced from
  PEDS. Zero quarantined. Zero LLM-produced labels.

- reports/confusion_matrices/abigail/action_classification.{json,md}:
  First committed benchmark matrix. Variant B (ABIGAIL running Sonnet)
  on the action_classification task scores 82/82 on the diagonal, zero
  unparseable, zero off-diagonal, zero hallucinated classes. Five
  observed truth classes (NF0-F0-A0, NF1-F0-A0, NF1-F0-A1, NF1-F1-A0,
  NF1-F1-A1). The overall 95.9% reported in the source run applies to
  the full 298-row multi-task run; for this single task the model is
  perfect on the sample.

- Four new tests: diagonal projection, off-diagonal capture,
  strict-bool truth validation, unparseable bucket on non-bool
  prediction. Full suite: 19 passed.
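
The raw_value_hash guarding against upstream drift can be as simple as a SHA-256 over a canonical JSON encoding of the events list (a sketch; the builder's exact canonicalization may differ):

```python
import hashlib
import json


def raw_value_hash(prosecution_events):
    """Stable digest of the PEDS prosecution_events list: canonical JSON
    (sorted keys, no whitespace) so any upstream change flips the hash."""
    canonical = json.dumps(prosecution_events, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()
```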

Lint: ruff clean.
Types: mypy clean.

Verifier: python -m patentbench.reports.verify_confusion
  reports/confusion_matrices/abigail/action_classification.json
  - VERIFIED.

Next: Stages 3-6 still remain for the paralegal_oa / paralegal_clm
tasks. Those task types do not yet exist in the benchmark dataset; they
require a new data pull from PEDS claim-set and rejection-type fields
and case generation before a matrix can be built. Tracked in follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rhahn28 commented Apr 20, 2026

Stage 2 landed: first real confusion matrix committed

Commit: daf4e42 feat(reports): first real confusion matrix - action_classification

The pipeline now runs end-to-end on real data:

  • Ground truth: data/ground_truth/action_classification.json — 82 rows, all sourced from USPTO PEDS prosecution_events, each carrying verifiable peds_source lineage (application number, retrieved_at, field path, SHA-256 of the events list). Zero LLM-produced labels.
  • Benchmark run: data/benchmark_results_sonnet.json (ABIGAIL v3 Variant B, 2026-03-21).
  • Artifact: reports/confusion_matrices/abigail/action_classification.md + .json.
  • Verification: python -m patentbench.reports.verify_confusion reports/confusion_matrices/abigail/action_classification.json → VERIFIED.

Result

5 observed truth classes using the NF{0|1}-F{0|1}-A{0|1} projection:

| truth \ predicted | NF0-F0-A0 | NF1-F0-A0 | NF1-F0-A1 | NF1-F1-A0 | NF1-F1-A1 |
| --- | --- | --- | --- | --- | --- |
| NF0-F0-A0 | 1 | 0 | 0 | 0 | 0 |
| NF1-F0-A0 | 0 | 30 | 0 | 0 | 0 |
| NF1-F0-A1 | 0 | 0 | 7 | 0 | 0 |
| NF1-F1-A0 | 0 | 0 | 0 | 33 | 0 |
| NF1-F1-A1 | 0 | 0 | 0 | 0 | 11 |

Perfect diagonal on the 82-case sample. Zero unparseable, zero off-diagonal, zero hallucinated classes.
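
For reference, the projection that produces these class labels is essentially the following (a sketch; the real extractor lives in ground_truth.py and its exact signature is assumed):

```python
def project_action_label(row):
    """Project (has_non_final, has_final, has_allowance) into the
    deterministic label NF{0|1}-F{0|1}-A{0|1}. Strict bools only:
    ints, strings, or None fail loud instead of coercing."""
    flags = []
    for field, prefix in (("has_non_final", "NF"),
                          ("has_final", "F"),
                          ("has_allowance", "A")):
        value = row[field]
        if not isinstance(value, bool):
            raise ValueError(f"{field} must be a strict bool, got {value!r}")
        flags.append(f"{prefix}{int(value)}")
    return "-".join(flags)


# e.g. {"has_non_final": True, "has_final": False, "has_allowance": True}
# projects to "NF1-F0-A1".
```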

Scope note on the remaining stages

Stages 3-6 as originally planned targeted paralegal_oa_extraction and paralegal_clm_extraction. Those task types do not exist in the current benchmark dataset — the dataset covers admin-tier tasks (deadline_calculation, action_classification, fee_computation, timeline_analysis). To build matrices for paralegal tasks we need:

  1. A PEDS pull that extracts claim structures (independent/dependent counts) and rejection-type metadata (which claims, which references, per rejection).
  2. New benchmark cases generated against those extractions.
  3. A fresh harness run against abigail.app emitting the structured output the paralegal extractors expect.

The scaffold already supports all of that; only the data does not exist yet. I'm splitting the remaining work into a follow-up issue rather than keeping this PR open indefinitely.

Tests / lint

  • pytest tests/test_reports_confusion.py — 19 passed (15 prior + 4 new for action_classification).
  • ruff — clean.
  • mypy patentbench/reports/ — clean.

…3 prep)

The remaining stages targeted paralegal tasks that had no data in the
repo. This commit removes the data blocker for paralegal_clm_extraction:
pulls claim-structure ground truth from Google Patents for every issued
patent in the sample, generates harness-compatible cases, and wires the
extractor/loader to accept the new provenance source.

Data pull:

- scripts/pull_paralegal_claim_data.py: polite (1 req/s) Google Patents
  scraper that fetches the public grant page per patent, parses the
  structural markers `<div class="claim">` and `<div class="claim-dependent">`,
  records url + retrieved_at + SHA-256 of the raw HTML. Idempotent,
  resumable, fails loud after 3 retries. Ran clean for all 82 issued
  patents in data/real_oa/uspto_peds_sample.jsonl with zero parse
  failures (parsing sketched after this section). Independent counts:
  1-7 (mode=3). Dependent counts: 0-27 (mode=17).

- data/real_oa/google_patents_claims.jsonl: 82 rows. Each carries enough
  provenance for a downstream verifier to detect upstream drift (URL
  plus SHA-256 of the HTML at pull time).
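
Illustrative counting logic for the structural markers above (the script's actual parser may differ, and the assumption that every claim div without the claim-dependent class is independent is mine):

```python
import hashlib
import re

CLAIM_DIV = re.compile(r'<div class="claim(?:-dependent)?"')


def count_claims(html):
    """Count claim markers in a Google Patents grant page and record the
    provenance hash of the raw HTML."""
    dependent = html.count('<div class="claim-dependent"')
    total = len(CLAIM_DIV.findall(html))
    return {
        "num_independent_claims": total - dependent,
        "num_dependent_claims": dependent,
        "raw_html_sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
    }
```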

Case + truth generation:

- scripts/build_paralegal_clm_cases_and_truth.py: deterministic builder.
  Writes data/benchmark_cases/paralegal_clm_extraction.jsonl (82 cases
  with prompts the harness can send to any adapter) and
  data/ground_truth/paralegal_clm_extraction.json (82 truth rows keyed
  by test_id, each with a google_patents_source lineage block and
  source="google_patents").

Scaffold extension (ground_truth.py):

- load_ground_truth now accepts either peds_source OR google_patents_source
  lineage. Both require retrieved_at plus a content hash (raw_value_hash
  for PEDS, raw_html_sha256 for Google Patents) so drift is still
  caught. Rows with neither are rejected (check sketched after this
  section).

- paralegal_clm_extraction extractor and reference-label resolver now
  accept integer counts (num_independent_claims, num_dependent_claims)
  in addition to the original list-shaped output. The integer path is
  required because Google Patents exposes structural counts, not
  per-claim text. The list path is retained for future harnesses that
  ask the model to echo claim references.

- REQUIRED_TRUTH_FIELDS[paralegal_clm_extraction] updated to the
  integer-count shape.
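
The dual-lineage acceptance rule reduces to something like the following hypothetical helper (names mirror the text above):

```python
REQUIRED_BY_SOURCE = {
    "peds_source": ("retrieved_at", "raw_value_hash"),
    "google_patents_source": ("retrieved_at", "raw_html_sha256"),
}
BANNED_SOURCES = {"llm", "abigail"}  # system under test cannot label truth


def validate_lineage(row):
    if row.get("source") in BANNED_SOURCES:
        raise ValueError(f"banned truth source: {row['source']!r}")
    for block_name, required in REQUIRED_BY_SOURCE.items():
        block = row.get(block_name)
        if isinstance(block, dict):
            missing = [f for f in required if f not in block]
            if missing:
                raise ValueError(f"{block_name} missing {missing}")
            return  # one valid lineage block is enough
    raise ValueError("row has neither peds_source nor google_patents_source")
```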

Harness runner (documentation + script):

- scripts/run_paralegal_clm_against_abigail.py: reads the cases and
  posts each prompt to https://abigail.app/api/v1/patentbench/generate.
  Emits a detailed_results run file compatible with
  patentbench.reports.build_confusion.
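
In outline the runner is a polite POST loop; the request/response field names below are assumptions, not the deployed API contract:

```python
import json
import os

import requests  # illustrative HTTP layer

API = "https://abigail.app/api/v1/patentbench/generate"


def run_cases(cases_path, out_path):
    """Post each case prompt and collect a detailed_results-shaped run file."""
    headers = {"Authorization": f"Bearer {os.environ['ABIGAIL_API_KEY']}"}
    results = []
    with open(cases_path) as fh:
        for line in fh:
            case = json.loads(line)
            resp = requests.post(API, json={"prompt": case["prompt"]},
                                 headers=headers, timeout=120)
            resp.raise_for_status()
            results.append({"test_id": case["test_id"],
                            "raw_response": resp.json().get("output")})
    with open(out_path, "w") as fh:
        json.dump({"detailed_results": results}, fh, indent=2)
```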

Known deployment-side blocker on Stage 3:

- The abigail.app /generate endpoint currently returns HTTP 501
  "Generate endpoint requires expert_prosecution module" (see
  backend/orchestrator/routes/patentbench.py in the abigail repo). The
  /health endpoint is green; /parse-oa is available. Running the
  paralegal_clm matrix end-to-end is gated on that module being
  enabled in the deployed build. When it is, the full run is a single
  command: `ABIGAIL_API_KEY=... python
  scripts/run_paralegal_clm_against_abigail.py` followed by
  `python -m patentbench.reports.build_confusion ...`.

paralegal_oa_extraction is explicitly NOT touched by this commit. That
task needs rejection-type metadata extracted from the actual Office
Action PDFs (PAIR / Patent Center), which is a multi-step document-AI
pipeline and cannot be honestly produced from the PEDS event-code
metadata alone. Scheduled as a separate effort.

Tests: 19 passed (same as prior commit; behavior changes are covered
by the existing invariant tests for load_ground_truth and by the
action_classification suite). ruff clean. mypy clean.
