C1 step 1: per-detector finding-precision measurement (#83)#106
Conversation
#83) Adds per-finding-type TP/FP/FN tracking to the D1 benchmark gate. Baseline now records precision per detector type; CI fails on any per-type precision drop > 1pp where sample size (TP+FP) is at least 3 on both current and baseline. Console + GITHUB_STEP_SUMMARY reports show a per-type table sorted by precision ascending. Measurement plumbing only; no detector behavior changes. Worst-offender detectors (cache and batch, each at 100% FPR on corpus v1) can now be tightened one at a time in follow-up PRs without regressions sneaking through under the global findingPrecision average. Per-detector FPRs documented in docs/accuracy/findings.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughExtends the benchmark metrics pipeline with per-finding-type precision/recall computation, baseline persistence, regression gating, and reporting. Adds test coverage for per-type metrics and stores an initial measured calibration baseline. Includes detailed execution plan and calibration documentation. ChangesPer-Detector Precision Measurement
Sequence Diagram(s)No sequence diagram generated for this change set. The modifications are primarily structural (data types, computation logic, formatting, and persistence) without introducing new sequential interactions between distinct components. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The changes span multiple files with mixed complexity: type definitions are straightforward, metric computation and aggregation require careful review of per-type logic, regression gating involves sample-size thresholds, and reporting formatting is repetitive. The plan and calibration documentation are informational but lengthy. No single change is dense, but the breadth across metrics, persistence, gating, and reporting, combined with the detailed planning document, demands moderate distributed attention. Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/accuracy/findings.md`:
- Around line 49-51: Update the wording that currently states the
`concurrency_control` detector is “not in the table” to unambiguous language
such as “no measured row with numeric counts” (or alternatively remove the
placeholder `concurrency_control` row entirely); ensure the sentence references
the label mismatch between `concurrency_control` and the corpus label
`unbatched_parallel` so readers understand the table row exists but contains no
numeric measurements.
In `@docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md`:
- Line 47: The spec uses two inconsistent baseline field names; replace the
incorrect occurrence of `findingPrecisionByType` with the canonical
`findingMetricsByType` and ensure the `benchmark/baseline.json` schema and any
references (e.g., tests, docs, code that reads/writes the baseline) use
`findingMetricsByType` consistently so the new map is added under that key and
all consumers are updated to the same name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 8dae3bd9-b547-48c4-8a14-31b8db3cff2f
📒 Files selected for processing (7)
benchmark/baseline.jsonbenchmark/metrics.tsbenchmark/report.tsbenchmark/runner.tsdocs/accuracy/findings.mddocs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.mdsrc/test/benchmark-metrics.test.ts
| | `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. | | ||
|
|
||
| The corpus labels one fan-out finding as `unbatched_parallel`; the scanner emits `concurrency_control` for the same pattern. The matcher compares type strings exactly, so the expected `unbatched_parallel` shows up as a recall miss (FN = 1) and the scanner's `concurrency_control` (if it were ever emitted on this corpus) would show up as a separate row of FPs. As of 2026-05-13 the scanner emits zero `concurrency_control` findings on the corpus, so only the FN side appears. The label gap is tracked as a corpus follow-up — either rename the expected type to `concurrency_control` or have the scanner emit `unbatched_parallel` for this specific pattern. |
There was a problem hiding this comment.
Clarify table presence wording for concurrency_control.
Line 49 says this detector is “not in the table,” but there is a row for it. Please reword to “no measured row with numeric counts” (or remove the placeholder row) to avoid ambiguity.
Suggested wording tweak
-| `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. |
+| `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on this corpus, so there are no measured numeric counts. See "Type-name mismatch" below. |📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| | `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. | | |
| The corpus labels one fan-out finding as `unbatched_parallel`; the scanner emits `concurrency_control` for the same pattern. The matcher compares type strings exactly, so the expected `unbatched_parallel` shows up as a recall miss (FN = 1) and the scanner's `concurrency_control` (if it were ever emitted on this corpus) would show up as a separate row of FPs. As of 2026-05-13 the scanner emits zero `concurrency_control` findings on the corpus, so only the FN side appears. The label gap is tracked as a corpus follow-up — either rename the expected type to `concurrency_control` or have the scanner emit `unbatched_parallel` for this specific pattern. | |
| | `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on this corpus, so there are no measured numeric counts. See "Type-name mismatch" below. | | |
| The corpus labels one fan-out finding as `unbatched_parallel`; the scanner emits `concurrency_control` for the same pattern. The matcher compares type strings exactly, so the expected `unbatched_parallel` shows up as a recall miss (FN = 1) and the scanner's `concurrency_control` (if it were ever emitted on this corpus) would show up as a separate row of FPs. As of 2026-05-13 the scanner emits zero `concurrency_control` findings on the corpus, so only the FN side appears. The label gap is tracked as a corpus follow-up — either rename the expected type to `concurrency_control` or have the scanner emit `unbatched_parallel` for this specific pattern. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/accuracy/findings.md` around lines 49 - 51, Update the wording that
currently states the `concurrency_control` detector is “not in the table” to
unambiguous language such as “no measured row with numeric counts” (or
alternatively remove the placeholder `concurrency_control` row entirely); ensure
the sentence references the label mismatch between `concurrency_control` and the
corpus label `unbatched_parallel` so readers understand the table row exists but
contains no numeric measurements.
| - `benchmark/metrics.ts` — `computeMetrics`, `aggregate`, `MetricsReport`, `PerFixtureMetrics` types | ||
| - `benchmark/runner.ts` — `--update-baseline` writer, baseline loader, `computeDrops` gate | ||
| - `benchmark/report.ts` — `formatConsoleReport`, `formatMarkdownReport` | ||
| - `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map |
There was a problem hiding this comment.
Use the correct baseline field name consistently.
Line 47 says findingPrecisionByType, but the plan/spec elsewhere uses findingMetricsByType. Aligning this avoids implementation drift.
Suggested fix
-- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map
+- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingMetricsByType` map📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map | |
| - `benchmark/baseline.json` — current 5 top-level numbers, add new `findingMetricsByType` map |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md` at line
47, The spec uses two inconsistent baseline field names; replace the incorrect
occurrence of `findingPrecisionByType` with the canonical `findingMetricsByType`
and ensure the `benchmark/baseline.json` schema and any references (e.g., tests,
docs, code that reads/writes the baseline) use `findingMetricsByType`
consistently so the new map is added under that key and all consumers are
updated to the same name.
Summary
PR-1 of issue #83 (calibrate the local waste detector). Adds per-detector-type precision/recall to the D1 benchmark gate so that subsequent PRs can tighten the worst-offender waste detectors one at a time without regressions sneaking through under the global `findingPrecision` average.
Pure measurement infrastructure — no detector behavior changes.
D1 measurement
(Global numbers unchanged by design — this PR only adds the per-type breakdown.)
Per-detector breakdown (new)
`unbatched_parallel` is corpus terminology; scanner emits `concurrency_control` for the same pattern. Currently zero `concurrency_control` emissions on the corpus, so only the FN side appears. Tracked as a corpus label follow-up.
Acceptance criteria status (issue #83)
Test plan
Closes part 1 of #83. PR-2+ will tighten individual detectors (cache and batch first — each 7 FPs) using the per-type gate to prove each change is a real improvement.
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests