C1 step 1: per-detector finding-precision measurement (#83) by AndresL230 · Pull Request #106 · recost-dev/extension

AndresL230 · 2026-05-13T21:07:41Z

Summary

PR-1 of issue #83 (calibrate the local waste detector). Adds per-detector-type precision/recall to the D1 benchmark gate so that subsequent PRs can tighten the worst-offender waste detectors one at a time without regressions sneaking through under the global `findingPrecision` average.

- `benchmark/metrics.ts`: new `findingCountsByType` per fixture and `findingMetricsByType` aggregate.
- `benchmark/runner.ts`: baseline reader/writer extended; per-type regression gate (>1pp drop with sample size ≥ 3 on both sides) added.
- `benchmark/report.ts`: per-type table in console output + GITHUB_STEP_SUMMARY, sorted by precision ascending.
- `benchmark/baseline.json`: bumped with measured per-type numbers (global 5 numbers unchanged).
- `docs/accuracy/findings.md`: calibration table filled in.
- `docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md`: multi-PR plan for the rest of C1.

Pure measurement infrastructure — no detector behavior changes.
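The per-type gate rule (a precision drop of more than 1pp, checked only when TP+FP is at least 3 on both the baseline and the current run) can be sketched roughly as follows. This is an illustration, not the code in `benchmark/runner.ts`: the `TypeMetrics` shape, the `computeTypeDrops` name, and the choice to store precision as a fraction are all assumptions made for the example.

```typescript
// Hypothetical shape; the real types in benchmark/metrics.ts may differ.
interface TypeMetrics {
  tp: number;
  fp: number;
  fn: number;
  precision: number | null; // fraction in [0, 1]; null when TP + FP is 0
}

type MetricsByType = Record<string, TypeMetrics>;

interface TypeDrop {
  type: string;
  baselinePct: number;
  currentPct: number;
  deltaPp: number; // negative means precision got worse
}

const MIN_SAMPLE = 3; // minimum TP + FP required on both sides
const MAX_DROP_PP = 1; // allowed precision drop, in percentage points

// Returns the detector types whose precision regressed beyond the threshold.
function computeTypeDrops(
  baseline: MetricsByType | undefined,
  current: MetricsByType,
): TypeDrop[] {
  const base = baseline ?? {}; // a baseline without the per-type map gates nothing
  const drops: TypeDrop[] = [];
  for (const type of Object.keys(current).sort()) {
    const b = base[type];
    const c = current[type];
    if (!b || b.precision === null || c.precision === null) continue;
    // Skip tiny samples, where a single finding would swing precision wildly.
    if (b.tp + b.fp < MIN_SAMPLE || c.tp + c.fp < MIN_SAMPLE) continue;
    const deltaPp = (c.precision - b.precision) * 100;
    if (deltaPp < -MAX_DROP_PP) {
      drops.push({
        type,
        baselinePct: b.precision * 100,
        currentPct: c.precision * 100,
        deltaPp,
      });
    }
  }
  return drops;
}
```

A runner built this way would exit non-zero whenever the returned array is non-empty. The sample-size guard is why a one-finding detector such as `rate_limit` would not trip the gate on this corpus.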

D1 measurement

| Metric | Prior | This PR | Δ (pp) |
| --- | --- | --- | --- |
| Detection precision | 36.26% | 36.26% | +0.00 |
| Detection recall | 48.53% | 48.53% | +0.00 |
| Provider attribution | 82.14% | 82.14% | +0.00 |
| Finding precision | 6.25% | 6.25% | +0.00 |
| Finding recall | 33.33% | 33.33% | +0.00 |

(Global numbers unchanged by design — this PR only adds the per-type breakdown.)

Per-detector breakdown (new)

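| Detector (scanner `type`) | TP | FP | FN | FPR | Precision | Severity (current) |
| --- | --- | --- | --- | --- | --- | --- |
| `n_plus_one` | 1 | 0 | 0 | 0% | 100% | high |
| `cache` | 0 | 7 | 0 | 100% | 0% | medium |
| `batch` | 0 | 7 | 1 | 100% | 0% | medium |
| `rate_limit` | 0 | 1 | 0 | 100% | 0% | low |
| `unbatched_parallel` | 0 | 0 | 1 | — | — | low |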

`unbatched_parallel` is corpus terminology; scanner emits `concurrency_control` for the same pattern. Currently zero `concurrency_control` emissions on the corpus, so only the FN side appears. Tracked as a corpus label follow-up.
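To make the table and the label mismatch concrete, here is a minimal sketch of per-type counting and of the derived columns. It assumes, as a simplification, that a finding matches whenever its type string is identical; the real matcher presumably also considers where the finding sits in the fixture. The names `countByType` and `derive` are hypothetical, and the FPR formula FP / (TP + FP) is inferred from the table rows rather than confirmed by the source.

```typescript
interface Counts {
  tp: number;
  fp: number;
  fn: number;
}

// Count occurrences of each type string.
function tally(types: string[]): Record<string, number> {
  const out: Record<string, number> = {};
  for (const t of types) out[t] = (out[t] ?? 0) + 1;
  return out;
}

// Assumption: findings are matched on exact type string only.
function countByType(expected: string[], emitted: string[]): Record<string, Counts> {
  const exp = tally(expected);
  const emi = tally(emitted);
  const result: Record<string, Counts> = {};
  // Union of all types seen on either side.
  for (const type of Object.keys({ ...exp, ...emi })) {
    const e = exp[type] ?? 0;
    const s = emi[type] ?? 0;
    const tp = Math.min(e, s);
    result[type] = { tp, fp: s - tp, fn: e - tp };
  }
  return result;
}

// Derived metrics as fractions; null where the denominator is zero
// (rendered as a dash in the table).
function derive(c: Counts) {
  const flagged = c.tp + c.fp; // everything the scanner emitted for this type
  const labelled = c.tp + c.fn; // everything the corpus expected for this type
  return {
    precision: flagged === 0 ? null : c.tp / flagged,
    fpr: flagged === 0 ? null : c.fp / flagged,
    recall: labelled === 0 ? null : c.tp / labelled,
  };
}

// The mismatch: the corpus expects one label, the scanner would emit another.
const mismatch = countByType(["unbatched_parallel"], ["concurrency_control"]);
// mismatch.unbatched_parallel is { tp: 0, fp: 0, fn: 1 }
// mismatch.concurrency_control is { tp: 0, fp: 1, fn: 0 }
```

Under these definitions the per-type rows sum to TP = 1, FP = 15, FN = 2, which gives 1/16 = 6.25% precision and 1/3 = 33.33% recall, consistent with the global finding numbers in the D1 table.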

Acceptance criteria status (issue #83)

- FPR re-measured on every benchmark CI run
- Per-detector regressions fail the build
- Each detector has documented FPR in `docs/accuracy/findings.md`
- No detector has FPR > 30%: `cache`, `batch`, `rate_limit` currently at 100%; tightening lives in follow-up PRs

Test plan

- `npm test`: 344/344 PASS (4 new per-type tests + 340 existing)
- `npm run benchmark`: exit 0, all Δ +0.00pp
- Manual gate self-test: hand-bumped one type's precision in baseline → re-run → exit 1 with `findings[].precision: 40.0% → 0.0% (Δ -40.00pp)` message → baseline restored cleanly
- CI `benchmark` workflow green on this branch

Closes part 1 of #83. PR-2+ will tighten individual detectors (cache and batch first — each 7 FPs) using the per-type gate to prove each change is a real improvement.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added per-type detection accuracy metrics tracking (true positives, false positives, false negatives, precision, recall).
- Introduced regression gating to prevent per-type detection accuracy drops in CI.
- Added console and markdown reporting of per-type accuracy metrics.
Documentation
- Updated accuracy documentation with measured calibration metrics for each detector type.
Tests
- Enhanced test coverage for per-type accuracy metrics computation and aggregation.

#83) Adds per-finding-type TP/FP/FN tracking to the D1 benchmark gate. Baseline now records precision per detector type; CI fails on any per-type precision drop > 1pp where sample size (TP+FP) is at least 3 on both current and baseline. Console + GITHUB_STEP_SUMMARY reports show a per-type table sorted by precision ascending.

Measurement plumbing only; no detector behavior changes. Worst-offender detectors (cache and batch, each at 100% FPR on corpus v1) can now be tightened one at a time in follow-up PRs without regressions sneaking through under the global findingPrecision average. Per-detector FPRs documented in docs/accuracy/findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-13T21:07:55Z

📝 Walkthrough

Extends the benchmark metrics pipeline with per-finding-type precision/recall computation, baseline persistence, regression gating, and reporting. Adds test coverage for per-type metrics and stores an initial measured calibration baseline. Includes detailed execution plan and calibration documentation.

Changes

Per-Detector Precision Measurement

| Layer / File(s) | Summary |
| --- | --- |
| Per-type metric data structures and computation `benchmark/metrics.ts` | `FindingTypeCounts` and `FindingTypeMetrics` types store per-type TP/FP/FN and derived metrics; `PerFixtureMetrics` extended with `findingCountsByType`; `computeMetrics` unions all observed finding types and reruns matching logic per type; `aggregate` sums per-type counts across fixtures and derives precision/recall into `findingMetricsByType`. |
| Baseline persistence: storage and loading `benchmark/baseline.json`, `benchmark/runner.ts` | `baseline.json` now includes `findingMetricsByType` object with per-type precision/recall; baseline writer persists `findingMetricsByType` from report; baseline reader defaults the field to `{}` if missing. |
| Per-type precision drop detection `benchmark/runner.ts` | `computeDrops` extends to compute per-type precision deltas; skips detection when either baseline or current sample size is less than 3 (TP+FP), otherwise reports delta in percentage points and fails if threshold exceeded. |
| Console and markdown per-type reporting `benchmark/report.ts` | Adds "Per finding type" console section and "Finding precision by type" markdown table, both showing TP/FP/FN, precision/recall, and optional baseline delta; `sortedTypeEntries` helper sorts by precision (ascending) then type name. |
| Test coverage and measured baseline `src/test/benchmark-metrics.test.ts`, `benchmark/baseline.json` | Test suite extended with per-type TP/FP/FN counting, false-negative attribution, aggregate precision/recall validation, and empty per-type behavior; baseline.json populated with initial measured metrics for five detector types. |
| Calibration table and execution plan `docs/accuracy/findings.md`, `docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md` | Replaces placeholder calibration with measured per-detector table (TP/FP/FN/FPR/precision/severity); documents type-name mismatches and acceptance-criteria violations; detailed C1 plan specifies schema, gating rules, task breakdown, and execution steps. |
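The aggregation and ordering steps summarized in the table can be sketched as below. This is not the PR's `aggregate` or `sortedTypeEntries`: the function names, the row shape, and the decision to place types with no measurable precision (TP + FP = 0) at the end are assumptions made for illustration.

```typescript
interface TypeCounts {
  tp: number;
  fp: number;
  fn: number;
}

interface TypeRow extends TypeCounts {
  type: string;
  precision: number | null; // null when nothing was emitted for the type
}

// Sum per-type counts across fixtures.
function sumCounts(perFixture: Array<Record<string, TypeCounts>>): Record<string, TypeCounts> {
  const total: Record<string, TypeCounts> = {};
  for (const fixture of perFixture) {
    for (const type of Object.keys(fixture)) {
      const acc = total[type] ?? (total[type] = { tp: 0, fp: 0, fn: 0 });
      acc.tp += fixture[type].tp;
      acc.fp += fixture[type].fp;
      acc.fn += fixture[type].fn;
    }
  }
  return total;
}

// Derive precision and sort worst-first: precision ascending, then type name.
// Assumption: rows without a measurable precision sort last.
function sortedRows(total: Record<string, TypeCounts>): TypeRow[] {
  const byName = (a: TypeRow, b: TypeRow) =>
    a.type < b.type ? -1 : a.type > b.type ? 1 : 0;
  const rows: TypeRow[] = Object.keys(total).map((type) => {
    const c = total[type];
    const flagged = c.tp + c.fp;
    return { type, ...c, precision: flagged === 0 ? null : c.tp / flagged };
  });
  return rows.sort((a, b) => {
    if (a.precision === null || b.precision === null) {
      if (a.precision === b.precision) return byName(a, b);
      return a.precision === null ? 1 : -1;
    }
    return a.precision - b.precision || byName(a, b);
  });
}
```

Sorting worst-first puts the detectors that most need tightening at the top of the report, which fits the stated goal of working through the worst offenders one PR at a time.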

Sequence Diagram(s)

No sequence diagram generated for this change set. The modifications are primarily structural (data types, computation logic, formatting, and persistence) without introducing new sequential interactions between distinct components.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes span multiple files with mixed complexity: type definitions are straightforward, metric computation and aggregation require careful review of per-type logic, regression gating involves sample-size thresholds, and reporting formatting is repetitive. The plan and calibration documentation are informational but lengthy. No single change is dense, but the breadth across metrics, persistence, gating, and reporting, combined with the detailed planning document, demands moderate distributed attention.

Possibly related issues

recost-dev/extension#83: This PR directly implements the per-finding-type precision/recall measurement and CI regression gating for waste-detector calibration described in the issue.

Possibly related PRs

recost-dev/extension#101: This PR extends the benchmark pipeline infrastructure (metrics computation, aggregation, baseline persistence, and gating) introduced in #101 by adding per-finding-type precision/recall tracking and enforcement.

Poem

🐰 A benchmark blooms with broken dreams,
Type by type, precision schemes—
Calibrate, detect, regress with care,
Per-detector truth laid bare!
Five finding types now stand and shine, 🔍

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding per-detector finding-precision measurement infrastructure for the C1 calibration step. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/findings.md`:
- Around line 49-51: Update the wording that currently states the
`concurrency_control` detector is “not in the table” to unambiguous language
such as “no measured row with numeric counts” (or alternatively remove the
placeholder `concurrency_control` row entirely); ensure the sentence references
the label mismatch between `concurrency_control` and the corpus label
`unbatched_parallel` so readers understand the table row exists but contains no
numeric measurements.

In `@docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md`:
- Line 47: The spec uses two inconsistent baseline field names; replace the
incorrect occurrence of `findingPrecisionByType` with the canonical
`findingMetricsByType` and ensure the `benchmark/baseline.json` schema and any
references (e.g., tests, docs, code that reads/writes the baseline) use
`findingMetricsByType` consistently so the new map is added under that key and
all consumers are updated to the same name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8dae3bd9-b547-48c4-8a14-31b8db3cff2f

📥 Commits

Reviewing files that changed from the base of the PR and between 79bc3cf and bf2ac41.

📒 Files selected for processing (7)

benchmark/baseline.json
benchmark/metrics.ts
benchmark/report.ts
benchmark/runner.ts
docs/accuracy/findings.md
docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md
src/test/benchmark-metrics.test.ts

coderabbitai · 2026-05-13T21:12:19Z

+| `concurrency_control`  | — | — | — | —    | —    | low    | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. |
+
+The corpus labels one fan-out finding as `unbatched_parallel`; the scanner emits `concurrency_control` for the same pattern. The matcher compares type strings exactly, so the expected `unbatched_parallel` shows up as a recall miss (FN = 1) and the scanner's `concurrency_control` (if it were ever emitted on this corpus) would show up as a separate row of FPs. As of 2026-05-13 the scanner emits zero `concurrency_control` findings on the corpus, so only the FN side appears. The label gap is tracked as a corpus follow-up — either rename the expected type to `concurrency_control` or have the scanner emit `unbatched_parallel` for this specific pattern.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify table presence wording for concurrency_control.

Line 49 says this detector is “not in the table,” but there is a row for it. Please reword to “no measured row with numeric counts” (or remove the placeholder row) to avoid ambiguity.

Suggested wording tweak

-| `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. |
+| `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on this corpus, so there are no measured numeric counts. See "Type-name mismatch" below. |

coderabbitai · 2026-05-13T21:12:19Z

+- `benchmark/metrics.ts` — `computeMetrics`, `aggregate`, `MetricsReport`, `PerFixtureMetrics` types
+- `benchmark/runner.ts` — `--update-baseline` writer, baseline loader, `computeDrops` gate
+- `benchmark/report.ts` — `formatConsoleReport`, `formatMarkdownReport`
+- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the correct baseline field name consistently.

Line 47 says `findingPrecisionByType`, but the plan/spec elsewhere uses `findingMetricsByType`. Aligning this avoids implementation drift.

Suggested fix

-- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map
+- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingMetricsByType` map

coderabbitai Bot reviewed May 13, 2026

AndresL230 merged commit 6647746 into main May 13, 2026
3 checks passed

This was referenced May 13, 2026

fix(detection): C1 PR-2 — tighten cache detector (closes part 2 of #83) #108

Merged

fix(detection): C1 PR-3 — tighten batch detector (closes part 3 of #83) #109

Merged

C1 PR-4: tighten rate_limit and residual batch FPs (closes #83) #111

Merged

AndresL230 deleted the claude/c1-waste-calibration branch May 22, 2026 22:50

AndresL230 merged 1 commit into main from claude/c1-waste-calibration
