Summary
Build a labeled corpus of real repos. CI runs the scanner against it on every PR and fails if precision or recall drops by more than 1 percentage point (pp) below the committed baseline.
Why
"Did this PR improve accuracy?" is currently answered by eyeballing output. There is no number that says "the scanner is 87% accurate today." Every other item in this roadmap is unmeasurable without this.
What to do
Corpus (5–10 repos, small subsets each, ~10–50 files)
- langchain-ai/langchain subset — heavy OpenAI, wrappers
- vercel/ai examples — modern TS
- openai/openai-cookbook selections — canonical SDK usage
- stripe-samples/* (one or two) — Stripe patterns
- A Bedrock demo — AWS + raw fetch
- A Django/Flask app with mixed providers — Python coverage
- An Express app — generic-HTTP
- A barrel-heavy TS monorepo — re-export resolution
- A repo with dynamic URLs — constant-fold testing
- A repo with deep wrapper chains — multi-hop
Annotation
Per repo, expected.json lists expected endpoints and findings with must_detect: true / is_true_positive: true flags.
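One possible shape for expected.json, sketched as TypeScript types with a sample value. Only the must_detect and is_true_positive flags come from the description above. The other field names (repo, file, provider, url, rule), the sample file path, the stated meaning of the flags, and the validateAnnotations helper are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical schema for expected.json. Only `must_detect` and
// `is_true_positive` are named in the issue; all other fields are assumed.
interface ExpectedEndpoint {
  file: string;          // path inside the repo subset (assumed)
  provider: string;      // e.g. "openai" or "stripe" (assumed)
  url: string;           // endpoint the scanner should resolve (assumed)
  must_detect: boolean;  // assumed meaning: a miss counts against recall
}

interface ExpectedFinding {
  file: string;               // assumed
  rule: string;               // hypothetical rule identifier
  is_true_positive: boolean;  // assumed meaning: false marks a known false alarm
}

interface ExpectedAnnotations {
  repo: string;
  endpoints: ExpectedEndpoint[];
  findings: ExpectedFinding[];
}

// Sample annotation; the file path and rule name are made up.
const sample: ExpectedAnnotations = {
  repo: "openai/openai-cookbook",
  endpoints: [
    {
      file: "examples/chat.py",
      provider: "openai",
      url: "https://api.openai.com/v1/chat/completions",
      must_detect: true,
    },
  ],
  findings: [
    { file: "examples/chat.py", rule: "example-rule", is_true_positive: true },
  ],
};

// Returns a list of problems; an empty list means the file looks usable.
function validateAnnotations(a: ExpectedAnnotations): string[] {
  const problems: string[] = [];
  if (a.repo.trim() === "") problems.push("repo is empty");
  const seen = new Set<string>();
  for (const e of a.endpoints) {
    const key = `${e.file}::${e.url}`;
    if (seen.has(key)) problems.push(`duplicate endpoint: ${key}`);
    seen.add(key);
  }
  return problems;
}

console.log(validateAnnotations(sample));
```

A check like this could run before the benchmark itself, so that a hand-editing mistake in an annotation file is reported as an annotation error rather than as a metrics regression.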
Metrics
- Detection precision / recall
- Provider attribution accuracy
- Finding precision / recall
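The metrics listed above can be sketched as follows. This is a minimal illustration, assuming each endpoint or finding is identified by a string key such as "file::url"; the key format, the function names, and the convention of returning 0 for an empty denominator are assumptions rather than the scanner's actual behavior.

```typescript
// Assumed identity for an endpoint or finding, e.g. "file::url".
type Key = string;

interface Counts {
  tp: number; // detected and expected
  fp: number; // detected but not expected
  fn: number; // expected but not detected
}

// Compare the set of expected keys against the set of detected keys.
function countMatches(expected: Key[], detected: Key[]): Counts {
  const exp = new Set(expected);
  const det = new Set(detected);
  let tp = 0;
  det.forEach((k) => {
    if (exp.has(k)) tp += 1;
  });
  return { tp, fp: det.size - tp, fn: exp.size - tp };
}

// Precision = TP / (TP + FP); 0 when nothing was detected (assumed convention).
function precision(c: Counts): number {
  const denom = c.tp + c.fp;
  return denom === 0 ? 0 : c.tp / denom;
}

// Recall = TP / (TP + FN); 0 when nothing was expected (assumed convention).
function recall(c: Counts): number {
  const denom = c.tp + c.fn;
  return denom === 0 ? 0 : c.tp / denom;
}

// Provider attribution accuracy: among detected endpoints that are also
// expected, the share whose provider label matches the annotation.
function attributionAccuracy(
  expected: Map<Key, string>,
  detected: Map<Key, string>,
): number {
  let matched = 0;
  let correct = 0;
  detected.forEach((provider, key) => {
    const want = expected.get(key);
    if (want === undefined) return; // false positive: not part of this metric
    matched += 1;
    if (want === provider) correct += 1;
  });
  return matched === 0 ? 0 : correct / matched;
}

const counts = countMatches(["a", "b", "c", "d"], ["a", "b", "c", "x"]);
console.log(precision(counts), recall(counts)); // 0.75 0.75
```

The same countMatches helper would serve both the detection and the finding metrics, since each is a set comparison between annotated and reported items.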
CI integration
- npm run benchmark runs the corpus, emits a metrics report
- .github/workflows/benchmark.yml runs on every PR
- benchmark/baseline.json holds committed baseline
- Workflow fails if precision/recall drops > 1pp
- PRs that legitimately improve metrics update baseline.json
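The 1pp gate could be implemented along these lines. This is a sketch under stated assumptions: metrics are stored as fractions between 0 and 1, the metric names are hypothetical, and the layout of baseline.json is not specified in this issue. A small tolerance is added so that a drop of exactly 1pp does not fail because of floating-point noise.

```typescript
// Metrics as fractions in [0, 1], keyed by name. Key names are hypothetical.
type Metrics = Record<string, number>;

const MAX_DROP_PP = 1;   // threshold from the issue: a drop of more than 1pp fails
const EPSILON_PP = 1e-9; // tolerance for floating-point noise (assumed)

// Returns one message per metric that regressed beyond the allowed drop.
function findRegressions(
  baseline: Metrics,
  current: Metrics,
  maxDropPp: number = MAX_DROP_PP,
): string[] {
  const failures: string[] = [];
  for (const name of Object.keys(baseline)) {
    const now = current[name];
    if (now === undefined) {
      failures.push(`${name}: missing from current report`);
      continue;
    }
    // Convert the fractional difference to percentage points.
    const dropPp = (baseline[name] - now) * 100;
    if (dropPp > maxDropPp + EPSILON_PP) {
      failures.push(`${name}: dropped ${dropPp.toFixed(2)}pp (limit ${maxDropPp}pp)`);
    }
  }
  return failures;
}

// Example: precision slips 0.5pp (allowed), recall slips 2pp (fails).
const baselineMetrics: Metrics = { detection_precision: 0.9, detection_recall: 0.8 };
const currentMetrics: Metrics = { detection_precision: 0.895, detection_recall: 0.78 };
console.log(findRegressions(baselineMetrics, currentMetrics));
```

In the workflow, the benchmark script would presumably exit with a non-zero status whenever this list is non-empty, which is what makes the PR check fail. Improvements never trigger the gate, so updating baseline.json on a legitimate improvement remains a deliberate, reviewable change.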
Acceptance criteria
- npm run benchmark produces metrics report
- docs/accuracy/measurement.md
Reference
Full design: https://github.com/recost-dev/extension/blob/main/docs/accuracy/measurement.md#d1-labeled-benchmark-corpus--ci-precisionrecall-gate