
C1 step 1: per-detector finding-precision measurement (#83) #106

Merged: AndresL230 merged 1 commit into `main` from `claude/c1-waste-calibration` on May 13, 2026
@AndresL230 AndresL230 commented May 13, 2026

Summary

PR-1 of issue #83 (calibrate the local waste detector). Adds per-detector-type precision/recall to the D1 benchmark gate so that subsequent PRs can tighten the worst-offender waste detectors one at a time without regressions sneaking through under the global `findingPrecision` average.

  • `benchmark/metrics.ts` — new `findingCountsByType` per fixture and `findingMetricsByType` aggregate.
  • `benchmark/runner.ts` — baseline reader/writer extended; per-type regression gate (>1pp drop with sample size ≥ 3 on both sides) added.
  • `benchmark/report.ts` — per-type table in console output + GITHUB_STEP_SUMMARY, sorted by precision ascending.
  • `benchmark/baseline.json` — bumped with measured per-type numbers (global 5 numbers unchanged).
  • `docs/accuracy/findings.md` — calibration table filled in.
  • `docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md` — multi-PR plan for the rest of C1.

Pure measurement infrastructure — no detector behavior changes.
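
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in `benchmark/metrics.ts`: only the names `findingCountsByType` and `findingMetricsByType` come from this PR, while the field names (`tp`, `fp`, `fn`), the function name `aggregateByType`, and the choice of `null` for an undefined ratio are assumptions.

```typescript
// Hypothetical shapes: the field names and the null convention are assumed,
// not taken from benchmark/metrics.ts.
interface FindingTypeCounts {
  tp: number;
  fp: number;
  fn: number;
}

interface FindingTypeMetrics extends FindingTypeCounts {
  precision: number | null; // null when TP + FP = 0 (assumed convention)
  recall: number | null; // null when TP + FN = 0 (assumed convention)
}

// Sum per-fixture counts by type, then derive precision and recall once
// from the totals (not by averaging per-fixture ratios).
function aggregateByType(
  fixtures: Array<Record<string, FindingTypeCounts>>,
): Record<string, FindingTypeMetrics> {
  const totals: Record<string, FindingTypeCounts> = {};
  for (const countsByType of fixtures) {
    for (const [type, c] of Object.entries(countsByType)) {
      const t = totals[type] ?? { tp: 0, fp: 0, fn: 0 };
      totals[type] = { tp: t.tp + c.tp, fp: t.fp + c.fp, fn: t.fn + c.fn };
    }
  }
  const result: Record<string, FindingTypeMetrics> = {};
  for (const [type, t] of Object.entries(totals)) {
    result[type] = {
      ...t,
      precision: t.tp + t.fp > 0 ? t.tp / (t.tp + t.fp) : null,
      recall: t.tp + t.fn > 0 ? t.tp / (t.tp + t.fn) : null,
    };
  }
  return result;
}

// Made-up fixtures; the split across fixtures is illustrative only.
const aggregatedByType = aggregateByType([
  { n_plus_one: { tp: 1, fp: 0, fn: 0 }, cache: { tp: 0, fp: 4, fn: 0 } },
  { cache: { tp: 0, fp: 3, fn: 0 }, unbatched_parallel: { tp: 0, fp: 0, fn: 1 } },
]);
console.log(aggregatedByType);
```

Summing counts before dividing keeps small fixtures from dominating a type's precision, which is presumably why the aggregate is derived from totals.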

D1 measurement

| Metric | Prior | This PR | Δ (pp) |
|---|---|---|---|
| Detection precision | 36.26% | 36.26% | +0.00 |
| Detection recall | 48.53% | 48.53% | +0.00 |
| Provider attribution | 82.14% | 82.14% | +0.00 |
| Finding precision | 6.25% | 6.25% | +0.00 |
| Finding recall | 33.33% | 33.33% | +0.00 |

(Global numbers unchanged by design — this PR only adds the per-type breakdown.)

Per-detector breakdown (new)

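| Detector (scanner `type`) | TP | FP | FN | FPR | Precision | Severity (current) |
|---|---|---|---|---|---|---|
| `n_plus_one` | 1 | 0 | 0 | 0% | 100% | high |
| `cache` | 0 | 7 | 0 | 100% | 0% | medium |
| `batch` | 0 | 7 | 1 | 100% | 0% | medium |
| `rate_limit` | 0 | 1 | 0 | 100% | 0% | low |
| `unbatched_parallel` | 0 | 0 | 1 | | | low |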

`unbatched_parallel` is corpus terminology; scanner emits `concurrency_control` for the same pattern. Currently zero `concurrency_control` emissions on the corpus, so only the FN side appears. Tracked as a corpus label follow-up.
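
To illustrate why a label mismatch surfaces this way, here is a simplified sketch of per-type counting with exact string comparison. It assumes matching on the type string alone; the real matcher likely also considers location or other fields, and `tallyByType` and `TypeTally` are hypothetical names.

```typescript
// Simplified sketch: matches findings on the type string only.
// The benchmark's real matcher may use more than the type (assumption).
interface TypeTally {
  tp: number;
  fp: number;
  fn: number;
}

function tallyByType(expected: string[], emitted: string[]): Record<string, TypeTally> {
  const count = (types: string[]): Record<string, number> => {
    const counts: Record<string, number> = {};
    for (const t of types) counts[t] = (counts[t] ?? 0) + 1;
    return counts;
  };
  const exp = count(expected);
  const emi = count(emitted);
  const out: Record<string, TypeTally> = {};
  // Union of every type seen on either side.
  for (const type of Object.keys({ ...exp, ...emi })) {
    const e = exp[type] ?? 0;
    const s = emi[type] ?? 0;
    const tp = Math.min(e, s); // exact-name matches only
    out[type] = { tp, fp: s - tp, fn: e - tp };
  }
  return out;
}

// Current situation: corpus expects unbatched_parallel, scanner emits nothing.
const tallyToday = tallyByType(["unbatched_parallel"], []);
// Hypothetical: scanner emits concurrency_control for the same pattern.
const tallyIfEmitted = tallyByType(["unbatched_parallel"], ["concurrency_control"]);
console.log(tallyToday, tallyIfEmitted);
```

In the hypothetical second call the same underlying pattern is counted twice, once as an FN under `unbatched_parallel` and once as an FP under `concurrency_control`, which is the gap the corpus label follow-up is meant to close.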

Acceptance criteria status (issue #83)

  • FPR re-measured on every benchmark CI run
  • Per-detector regressions fail the build
  • Each detector has documented FPR in `docs/accuracy/findings.md`
  • No detector has FPR > 30% — `cache`, `batch`, `rate_limit` currently at 100%; tightening lives in follow-up PRs

Test plan

  • `npm test` — 344/344 PASS (4 new per-type tests + 340 existing)
  • `npm run benchmark` — exit 0, all Δ +0.00pp
  • Manual gate self-test: hand-bumped one type's precision in baseline → re-run → exit 1 with `findings[].precision: 40.0% → 0.0% (Δ -40.00pp)` message → baseline restored cleanly
  • CI `benchmark` workflow green on this branch

Closes part 1 of #83. PR-2+ will tighten individual detectors (cache and batch first — each 7 FPs) using the per-type gate to prove each change is a real improvement.
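
For readers who want the gate rule in code form, the following is a rough sketch of a per-type check with the thresholds stated above (drop greater than 1pp, sample size TP+FP of at least 3 on both sides). It is not the `computeDrops` implementation in `benchmark/runner.ts`; the function name, the record shapes, the assumption that the baseline carries counts alongside precision, and the sample numbers are all hypothetical.

```typescript
// Hypothetical record shape: precision is a fraction in [0, 1], and the
// baseline is assumed to carry TP/FP so sample size can be checked.
interface TypeSample {
  tp: number;
  fp: number;
  precision: number;
}

interface PrecisionDrop {
  type: string;
  baselinePct: number;
  currentPct: number;
  deltaPp: number;
}

const MIN_SAMPLE = 3; // TP + FP required on both baseline and current
const MAX_DROP_PP = 1; // drops larger than this many percentage points fail

function perTypePrecisionDrops(
  baseline: Record<string, TypeSample>,
  current: Record<string, TypeSample>,
): PrecisionDrop[] {
  const drops: PrecisionDrop[] = [];
  for (const [type, base] of Object.entries(baseline)) {
    const cur = current[type];
    if (cur === undefined) continue; // assumption: a missing type is skipped, not failed
    // Too few samples on either side makes the percentage too noisy to gate on.
    if (base.tp + base.fp < MIN_SAMPLE || cur.tp + cur.fp < MIN_SAMPLE) continue;
    const deltaPp = (cur.precision - base.precision) * 100;
    if (deltaPp < -MAX_DROP_PP) {
      drops.push({
        type,
        baselinePct: base.precision * 100,
        currentPct: cur.precision * 100,
        deltaPp,
      });
    }
  }
  return drops;
}

// Mimics the manual self-test: a baseline hand-bumped to 40% for one type.
const gateDrops = perTypePrecisionDrops(
  { cache: { tp: 2, fp: 3, precision: 0.4 }, rate_limit: { tp: 1, fp: 0, precision: 1 } },
  { cache: { tp: 0, fp: 7, precision: 0 }, rate_limit: { tp: 0, fp: 1, precision: 0 } },
);
for (const d of gateDrops) {
  console.log(
    `${d.type}: ${d.baselinePct.toFixed(1)}% -> ${d.currentPct.toFixed(1)}% (${d.deltaPp.toFixed(2)}pp)`,
  );
}
```

In this made-up run `rate_limit` also drops, but its sample size of 1 keeps it out of the gate, so only `cache` is reported. A caller would exit non-zero whenever the returned list is non-empty.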

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added per-type detection accuracy metrics tracking (true positives, false positives, false negatives, precision, recall).
    • Introduced regression gating to prevent per-type detection accuracy drops in CI.
    • Added console and markdown reporting of per-type accuracy metrics.
  • Documentation

    • Updated accuracy documentation with measured calibration metrics for each detector type.
  • Tests

    • Enhanced test coverage for per-type accuracy metrics computation and aggregation.

#83)

Adds per-finding-type TP/FP/FN tracking to the D1 benchmark gate.
Baseline now records precision per detector type; CI fails on any
per-type precision drop > 1pp where sample size (TP+FP) is at least 3
on both current and baseline. Console + GITHUB_STEP_SUMMARY reports
show a per-type table sorted by precision ascending.

Measurement plumbing only; no detector behavior changes. Worst-offender
detectors (cache and batch, each at 100% FPR on corpus v1) can now be
tightened one at a time in follow-up PRs without regressions sneaking
through under the global findingPrecision average.

Per-detector FPRs documented in docs/accuracy/findings.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 13, 2026

Walkthrough

Extends the benchmark metrics pipeline with per-finding-type precision/recall computation, baseline persistence, regression gating, and reporting. Adds test coverage for per-type metrics and stores an initial measured calibration baseline. Includes detailed execution plan and calibration documentation.
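
As a small illustration of the backward-compatible baseline read mentioned in this walkthrough, a loader might default the new field when it is absent. This sketch assumes a JSON baseline with a `findingMetricsByType` map; `parseBaseline`, `BaselineFile`, and the sample value are hypothetical, and the real reader in `benchmark/runner.ts` may differ.

```typescript
type PerTypeBaseline = Record<string, { precision: number; recall: number }>;

// Hypothetical baseline shape: only findingMetricsByType is named in the PR;
// the global numbers are left untyped here.
interface BaselineFile {
  findingMetricsByType: PerTypeBaseline;
  [key: string]: unknown;
}

// Parse a baseline and default the per-type map to {} when an older
// baseline file does not contain it.
function parseBaseline(json: string): BaselineFile {
  const raw = JSON.parse(json) as Partial<BaselineFile>;
  return { ...raw, findingMetricsByType: raw.findingMetricsByType ?? {} };
}

// An older baseline without the per-type map (value is illustrative).
const legacyBaseline = parseBaseline('{"findingPrecision": 0.0625}');
console.log(legacyBaseline.findingMetricsByType);
```

With an empty map there is nothing to compare against, so a per-type gate has no baseline entries to fail on until the baseline is regenerated.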

Changes

Per-Detector Precision Measurement

| Layer / File(s) | Summary |
|---|---|
| Per-type metric data structures and computation (`benchmark/metrics.ts`) | `FindingTypeCounts` and `FindingTypeMetrics` types store per-type TP/FP/FN and derived metrics; `PerFixtureMetrics` extended with `findingCountsByType`; `computeMetrics` unions all observed finding types and reruns matching logic per type; `aggregate` sums per-type counts across fixtures and derives precision/recall into `findingMetricsByType`. |
| Baseline persistence: storage and loading (`benchmark/baseline.json`, `benchmark/runner.ts`) | `baseline.json` now includes `findingMetricsByType` object with per-type precision/recall; baseline writer persists `findingMetricsByType` from report; baseline reader defaults the field to `{}` if missing. |
| Per-type precision drop detection (`benchmark/runner.ts`) | `computeDrops` extends to compute per-type precision deltas; skips detection when either baseline or current sample size is less than 3 (TP+FP), otherwise reports delta in percentage points and fails if threshold exceeded. |
| Console and markdown per-type reporting (`benchmark/report.ts`) | Adds "Per finding type" console section and "Finding precision by type" markdown table, both showing TP/FP/FN, precision/recall, and optional baseline delta; `sortedTypeEntries` helper sorts by precision (ascending) then type name. |
| Test coverage and measured baseline (`src/test/benchmark-metrics.test.ts`, `benchmark/baseline.json`) | Test suite extended with per-type TP/FP/FN counting, false-negative attribution, aggregate precision/recall validation, and empty per-type behavior; `baseline.json` populated with initial measured metrics for five detector types. |
| Calibration table and execution plan (`docs/accuracy/findings.md`, `docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md`) | Replaces placeholder calibration with measured per-detector table (TP/FP/FN/FPR/precision/severity); documents type-name mismatches and acceptance-criteria violations; detailed C1 plan specifies schema, gating rules, task breakdown, and execution steps. |
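
The ordering attributed to `sortedTypeEntries` above (precision ascending, then type name) could look something like this sketch. The function name `sortByPrecisionThenName`, the row shape, and the decision to place types with no defined precision last are assumptions, not details confirmed by this PR.

```typescript
interface TypeRow {
  precision: number | null; // null when there were no emissions (assumed)
}

// Sort worst precision first so the detectors most in need of tightening
// lead the report; ties are broken alphabetically by type name.
function sortByPrecisionThenName(
  byType: Record<string, TypeRow>,
): Array<[string, TypeRow]> {
  return Object.entries(byType).sort(([aName, a], [bName, b]) => {
    const pa = a.precision ?? Infinity; // assumption: undefined precision sorts last
    const pb = b.precision ?? Infinity;
    if (pa !== pb) return pa - pb;
    return aName < bName ? -1 : aName > bName ? 1 : 0;
  });
}

// Precision values taken from the per-detector table in the PR description.
const sortedRows = sortByPrecisionThenName({
  n_plus_one: { precision: 1 },
  cache: { precision: 0 },
  batch: { precision: 0 },
  rate_limit: { precision: 0 },
  unbatched_parallel: { precision: null },
});
console.log(sortedRows.map(([name]) => name));
```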

Sequence Diagram(s)

No sequence diagram generated for this change set. The modifications are primarily structural (data types, computation logic, formatting, and persistence) without introducing new sequential interactions between distinct components.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes span multiple files with mixed complexity: type definitions are straightforward, metric computation and aggregation require careful review of per-type logic, regression gating involves sample-size thresholds, and reporting formatting is repetitive. The plan and calibration documentation are informational but lengthy. No single change is dense, but the breadth across metrics, persistence, gating, and reporting, combined with the detailed planning document, demands moderate distributed attention.

Possibly related issues

  • recost-dev/extension#83: This PR directly implements the per-finding-type precision/recall measurement and CI regression gating for waste-detector calibration described in the issue.

Possibly related PRs

  • recost-dev/extension#101: This PR extends the benchmark pipeline infrastructure (metrics computation, aggregation, baseline persistence, and gating) introduced in #101 by adding per-finding-type precision/recall tracking and enforcement.

Poem

🐰 A benchmark blooms with broken dreams,
Type by type, precision schemes—
Calibrate, detect, regress with care,
Per-detector truth laid bare!
Five finding types now stand and shine, 🔍

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding per-detector finding-precision measurement infrastructure for the C1 calibration step. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/accuracy/findings.md`:
- Around line 49-51: Update the wording that currently states the
`concurrency_control` detector is “not in the table” to unambiguous language
such as “no measured row with numeric counts” (or alternatively remove the
placeholder `concurrency_control` row entirely); ensure the sentence references
the label mismatch between `concurrency_control` and the corpus label
`unbatched_parallel` so readers understand the table row exists but contains no
numeric measurements.

In `@docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md`:
- Line 47: The spec uses two inconsistent baseline field names; replace the
incorrect occurrence of `findingPrecisionByType` with the canonical
`findingMetricsByType` and ensure the `benchmark/baseline.json` schema and any
references (e.g., tests, docs, code that reads/writes the baseline) use
`findingMetricsByType` consistently so the new map is added under that key and
all consumers are updated to the same name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 8dae3bd9-b547-48c4-8a14-31b8db3cff2f

📥 Commits

Reviewing files that changed from the base of the PR and between 79bc3cf and bf2ac41.

📒 Files selected for processing (7)
  • benchmark/baseline.json
  • benchmark/metrics.ts
  • benchmark/report.ts
  • benchmark/runner.ts
  • docs/accuracy/findings.md
  • docs/superpowers/plans/2026-05-13-c1-waste-detector-calibration.md
  • src/test/benchmark-metrics.test.ts

Comment thread docs/accuracy/findings.md
Comment on lines +49 to +51
| `concurrency_control` | — | — | — | — | — | low | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. |

The corpus labels one fan-out finding as `unbatched_parallel`; the scanner emits `concurrency_control` for the same pattern. The matcher compares type strings exactly, so the expected `unbatched_parallel` shows up as a recall miss (FN = 1) and the scanner's `concurrency_control` (if it were ever emitted on this corpus) would show up as a separate row of FPs. As of 2026-05-13 the scanner emits zero `concurrency_control` findings on the corpus, so only the FN side appears. The label gap is tracked as a corpus follow-up — either rename the expected type to `concurrency_control` or have the scanner emit `unbatched_parallel` for this specific pattern.
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify table presence wording for concurrency_control.

Line 49 says this detector is “not in the table,” but there is a row for it. Please reword to “no measured row with numeric counts” (or remove the placeholder row) to avoid ambiguity.

Suggested wording tweak
-| `concurrency_control`  | — | — | — | —    | —    | low    | Scanner emits nothing on the corpus; not in the table. See "Type-name mismatch" below. |
+| `concurrency_control`  | — | — | — | —    | —    | low    | Scanner emits nothing on this corpus, so there are no measured numeric counts. See "Type-name mismatch" below. |

- `benchmark/metrics.ts` — `computeMetrics`, `aggregate`, `MetricsReport`, `PerFixtureMetrics` types
- `benchmark/runner.ts` — `--update-baseline` writer, baseline loader, `computeDrops` gate
- `benchmark/report.ts` — `formatConsoleReport`, `formatMarkdownReport`
- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the correct baseline field name consistently.

Line 47 says findingPrecisionByType, but the plan/spec elsewhere uses findingMetricsByType. Aligning this avoids implementation drift.

Suggested fix
-- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingPrecisionByType` map
+- `benchmark/baseline.json` — current 5 top-level numbers, add new `findingMetricsByType` map

@AndresL230 AndresL230 merged commit 6647746 into main May 13, 2026
3 checks passed
@AndresL230 AndresL230 deleted the claude/c1-waste-calibration branch May 22, 2026 22:50