[Measurement] Labeled benchmark corpus + CI precision/recall gate (D1) #86

## Summary
Build a labeled corpus of real repos. CI runs the scanner against it on every PR and fails if precision or recall drops by more than 1 percentage point (pp) relative to the committed baseline.

## Why
"Did this PR improve accuracy?" is currently answered by eyeballing output. There is no number that says "the scanner is 87% accurate today." Every other item in this roadmap is unmeasurable without this.

## What to do

### Corpus (5–10 repos, small subsets each, ~10–50 files)
- `langchain-ai/langchain` subset — heavy OpenAI, wrappers
- `vercel/ai` examples — modern TS
- `openai/openai-cookbook` selections — canonical SDK usage
- `stripe-samples/*` (one or two) — Stripe patterns
- A Bedrock demo — AWS + raw fetch
- A Django/Flask app with mixed providers — Python coverage
- An Express app — generic-HTTP
- A barrel-heavy TS monorepo — re-export resolution
- A repo with dynamic URLs — constant-fold testing
- A repo with deep wrapper chains — multi-hop

### Annotation
Per repo, `expected.json` lists expected endpoints and findings with `must_detect: true` / `is_true_positive: true` flags.
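
As a rough illustration of what an annotation file could look like, the sketch below defines a possible shape for `expected.json` and a small parser that rejects files missing the two flags. Only `must_detect` and `is_true_positive` come from this issue; the `file`, `line`, `provider`, and `rule` fields, the pairing of each flag with endpoints or findings, and the `parseExpected` helper are assumptions made for the example.

```typescript
// Hypothetical schema: only `must_detect` and `is_true_positive` are named in
// the issue. Every other field name here is an assumption for illustration.
interface ExpectedEndpoint {
  file: string;
  line: number;
  provider: string;
  must_detect: boolean;
}

interface ExpectedFinding {
  file: string;
  line: number;
  rule: string;
  is_true_positive: boolean;
}

interface ExpectedAnnotations {
  endpoints: ExpectedEndpoint[];
  findings: ExpectedFinding[];
}

// Parse the raw JSON text and check that the labeling flags are present.
function parseExpected(raw: string): ExpectedAnnotations {
  const data = JSON.parse(raw) as Partial<ExpectedAnnotations>;
  if (!Array.isArray(data.endpoints) || !Array.isArray(data.findings)) {
    throw new Error("expected.json needs `endpoints` and `findings` arrays");
  }
  for (const e of data.endpoints) {
    if (typeof e.must_detect !== "boolean") {
      throw new Error(`endpoint at ${e.file}:${e.line} lacks must_detect`);
    }
  }
  for (const f of data.findings) {
    if (typeof f.is_true_positive !== "boolean") {
      throw new Error(`finding at ${f.file}:${f.line} lacks is_true_positive`);
    }
  }
  return { endpoints: data.endpoints, findings: data.findings };
}

// Made-up sample content, not taken from any real repo in the corpus.
const sample = `{
  "endpoints": [
    { "file": "src/chat.ts", "line": 12, "provider": "openai", "must_detect": true }
  ],
  "findings": [
    { "file": "src/chat.ts", "line": 12, "rule": "example-rule", "is_true_positive": true }
  ]
}`;

const parsed = parseExpected(sample);
```

Validating the flags at load time keeps a mislabeled entry from silently skewing the metrics.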

### Metrics
- Detection precision / recall
- Provider attribution accuracy
- Finding precision / recall
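
One way these metrics could be computed is sketched below for detections. It assumes a detection matches an expected entry when `file` and `line` agree, and that provider attribution accuracy is measured only over matched detections. The `Located` shape, the `scoreDetections` name, and the convention that an empty denominator scores 1 are all assumptions, not part of the design.

```typescript
// Hypothetical record shape: the issue does not define how detections are keyed.
interface Located {
  file: string;
  line: number;
  provider: string;
}

interface DetectionScore {
  precision: number;
  recall: number;
  providerAccuracy: number;
}

// Assumption: an empty denominator scores 1 (nothing to get wrong).
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 1 : numerator / denominator;
}

function scoreDetections(expected: Located[], detected: Located[]): DetectionScore {
  const key = (d: Located): string => `${d.file}:${d.line}`;
  const expectedByKey = new Map<string, Located>();
  for (const e of expected) expectedByKey.set(key(e), e);

  const matched = new Set<string>();
  let falsePositives = 0;
  let providerCorrect = 0;
  for (const d of detected) {
    const k = key(d);
    const want = expectedByKey.get(k);
    if (want === undefined || matched.has(k)) {
      falsePositives += 1; // unexpected location, or a duplicate of a match
      continue;
    }
    matched.add(k);
    if (want.provider === d.provider) providerCorrect += 1;
  }

  const truePositives = matched.size;
  return {
    precision: ratio(truePositives, truePositives + falsePositives),
    recall: ratio(truePositives, expectedByKey.size),
    providerAccuracy: ratio(providerCorrect, truePositives),
  };
}

// Made-up data: two of three expected endpoints found, one with the wrong
// provider, plus one detection that is not in the labels.
const score = scoreDetections(
  [
    { file: "a.ts", line: 10, provider: "openai" },
    { file: "b.ts", line: 5, provider: "stripe" },
    { file: "c.ts", line: 7, provider: "aws" },
  ],
  [
    { file: "a.ts", line: 10, provider: "openai" },
    { file: "b.ts", line: 5, provider: "openai" },
    { file: "d.ts", line: 1, provider: "openai" },
  ],
);
```

Finding precision and recall could reuse the same matching logic with findings in place of endpoints.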

### CI integration
- `npm run benchmark` runs the corpus, emits a metrics report
- `.github/workflows/benchmark.yml` runs on every PR
- `benchmark/baseline.json` holds the committed baseline
- Workflow fails if precision/recall drops > 1pp
- PRs that legitimately improve metrics update `baseline.json`
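
The gate itself might look something like the sketch below, which compares a current metrics report against the baseline and lists every metric that dropped by more than the threshold. It assumes both files are flat maps of metric name to a fraction between 0 and 1. The metric names, the `findRegressions` helper, and the small floating-point tolerance are illustrative assumptions.

```typescript
// Assumption: baseline.json and the report are flat maps of metric name to a
// fraction in [0, 1]. The real file layout is not specified in the issue.
type Metrics = Record<string, number>;

// Return one message per metric that dropped by more than `maxDropPp`
// percentage points, or that is missing from the current report.
function findRegressions(baseline: Metrics, current: Metrics, maxDropPp = 1): string[] {
  const failures: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const now = current[name];
    if (now === undefined) {
      failures.push(`${name}: missing from current report`);
      continue;
    }
    const dropPp = (base - now) * 100;
    // Small tolerance so a drop of exactly 1pp is not failed by float rounding.
    if (dropPp > maxDropPp + 1e-9) {
      failures.push(`${name}: dropped ${dropPp.toFixed(2)}pp (${base} -> ${now})`);
    }
  }
  return failures;
}

// Made-up numbers for illustration only.
const baselineMetrics: Metrics = { detection_precision: 0.9, detection_recall: 0.8 };
const regressed = findRegressions(baselineMetrics, {
  detection_precision: 0.885, // 1.5pp drop, over the threshold
  detection_recall: 0.8,
});
const withinTolerance = findRegressions(baselineMetrics, {
  detection_precision: 0.89, // exactly 1pp drop, allowed
  detection_recall: 0.81,
});
```

In a CI script, a non-empty result would be printed and followed by a non-zero exit code so the workflow step fails.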

## Acceptance criteria
- [ ] At least 5 repos in corpus, hand-labeled
- [ ] `npm run benchmark` produces metrics report
- [ ] CI workflow runs on every PR
- [ ] Baseline committed; current metrics published in `docs/accuracy/measurement.md`
- [ ] Regression gate prevents merging PRs that drop precision/recall by more than 1pp

## Reference
Full design: https://github.com/recost-dev/extension/blob/main/docs/accuracy/measurement.md#d1-labeled-benchmark-corpus--ci-precisionrecall-gate
