[Measurement] Labeled benchmark corpus + CI precision/recall gate (D1) #86

@AndresL230

Summary

Build a hand-labeled corpus of real repos. CI runs the scanner against it on every PR and fails if precision or recall drops by more than 1 percentage point (pp) relative to the committed baseline.

Why

"Did this PR improve accuracy?" is currently answered by eyeballing output. There is no number that says "the scanner is 87% accurate today." Every other item in this roadmap is unmeasurable without this.

What to do

Corpus (5–10 repos, a small subset of each, ~10–50 files per repo)

  • langchain-ai/langchain subset — heavy OpenAI, wrappers
  • vercel/ai examples — modern TS
  • openai/openai-cookbook selections — canonical SDK usage
  • stripe-samples/* (one or two) — Stripe patterns
  • A Bedrock demo — AWS + raw fetch
  • A Django/Flask app with mixed providers — Python coverage
  • An Express app — generic-HTTP
  • A barrel-heavy TS monorepo — re-export resolution
  • A repo with dynamic URLs — constant-fold testing
  • A repo with deep wrapper chains — multi-hop

Annotation

Per repo, expected.json lists expected endpoints and findings with must_detect: true / is_true_positive: true flags.
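To make the annotation format concrete, here is one possible shape for expected.json, written as TypeScript types plus a small validator. Only the must_detect and is_true_positive flags come from this issue; the other field names (file, line, provider, rule), the split of flags between endpoints and findings, and the parseExpected helper are assumptions for illustration, not the agreed schema.

```typescript
// Hypothetical shape for a per-repo expected.json. Only must_detect and
// is_true_positive are named in the issue; every other field is assumed.
interface ExpectedEndpoint {
  file: string; // assumed: path relative to the corpus repo root
  line: number; // assumed: 1-based line of the call site
  provider: string; // assumed: e.g. "openai" or "stripe"
  must_detect: boolean;
}

interface ExpectedFinding {
  file: string;
  line: number;
  rule: string; // assumed: identifier of the finding type
  is_true_positive: boolean;
}

interface ExpectedFile {
  endpoints: ExpectedEndpoint[];
  findings: ExpectedFinding[];
}

// Minimal structural check so a malformed annotation fails loudly
// instead of silently skewing the metrics.
function parseExpected(json: string): ExpectedFile {
  const data = JSON.parse(json) as Partial<ExpectedFile>;
  if (!Array.isArray(data.endpoints) || !Array.isArray(data.findings)) {
    throw new Error("expected.json needs 'endpoints' and 'findings' arrays");
  }
  for (const e of data.endpoints) {
    if (typeof e.must_detect !== "boolean") {
      throw new Error("endpoint in " + e.file + " lacks must_detect");
    }
  }
  for (const f of data.findings) {
    if (typeof f.is_true_positive !== "boolean") {
      throw new Error("finding in " + f.file + " lacks is_true_positive");
    }
  }
  return data as ExpectedFile;
}

// Illustrative annotation; the path and rule name are made up.
const sampleExpected = `{
  "endpoints": [
    { "file": "src/chat.ts", "line": 12, "provider": "openai", "must_detect": true }
  ],
  "findings": [
    { "file": "src/chat.ts", "line": 12, "rule": "example-rule", "is_true_positive": true }
  ]
}`;

const parsedSample = parseExpected(sampleExpected);
```

Validating at load time keeps labeling mistakes from showing up later as unexplained metric changes.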

Metrics

  • Detection precision / recall
  • Provider attribution accuracy
  • Finding precision / recall
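The metrics above could be computed roughly as follows. This is a sketch that assumes each endpoint or finding can be reduced to a stable string key (for example "file:line"); that keying scheme, and the function names precisionRecall and attributionAccuracy, are assumptions rather than part of the design. Precision is TP / (TP + FP) and recall is TP / (TP + FN).

```typescript
interface PrecisionRecall {
  precision: number; // fraction of detected items that were expected
  recall: number; // fraction of expected items that were detected
}

// Compare expected and detected item keys. An empty denominator is
// treated as a perfect score (an assumed convention).
function precisionRecall(
  expected: Set<string>,
  detected: Set<string>
): PrecisionRecall {
  let truePositives = 0;
  detected.forEach((key) => {
    if (expected.has(key)) truePositives++;
  });
  const falsePositives = detected.size - truePositives;
  const falseNegatives = expected.size - truePositives;
  const precisionDenominator = truePositives + falsePositives;
  const recallDenominator = truePositives + falseNegatives;
  return {
    precision:
      precisionDenominator === 0 ? 1 : truePositives / precisionDenominator,
    recall: recallDenominator === 0 ? 1 : truePositives / recallDenominator,
  };
}

// Provider attribution accuracy: among items that were both expected and
// detected, the fraction whose detected provider matches the label.
function attributionAccuracy(
  expected: Map<string, string>,
  detected: Map<string, string>
): number {
  let matched = 0;
  let correct = 0;
  detected.forEach((provider, key) => {
    const labeled = expected.get(key);
    if (labeled === undefined) return; // false positive, not an attribution error
    matched++;
    if (labeled === provider) correct++;
  });
  return matched === 0 ? 1 : correct / matched;
}
```

Measuring attribution only over correctly detected items keeps it independent of detection recall, so a missed endpoint is not counted twice.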

CI integration

  • npm run benchmark runs the corpus, emits a metrics report
  • .github/workflows/benchmark.yml runs on every PR
  • benchmark/baseline.json holds committed baseline
  • Workflow fails if precision or recall drops by more than 1pp against the baseline
  • PRs that legitimately improve metrics update baseline.json
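The gate itself might look like the sketch below: compare each metric in the current report against the committed baseline and collect any that dropped by more than 1pp. The flat name-to-value layout for baseline.json, the metric names, and the findRegressions helper are all assumptions; the real report format is not specified here.

```typescript
// Assumed layout: metric name -> value in [0, 1],
// e.g. { "detection_precision": 0.9 }.
interface MetricReport {
  [name: string]: number;
}

const MAX_DROP_PP = 1; // threshold from the issue: fail on drops > 1pp

// Returns one message per regressed metric; an empty array means the gate passes.
function findRegressions(
  baseline: MetricReport,
  current: MetricReport,
  maxDropPp: number = MAX_DROP_PP
): string[] {
  const failures: string[] = [];
  for (const name of Object.keys(baseline)) {
    if (!(name in current)) {
      failures.push(name + ": missing from current report");
      continue;
    }
    const dropPp = (baseline[name] - current[name]) * 100;
    // Small epsilon so a drop of exactly 1pp is not failed by float rounding.
    if (dropPp > maxDropPp + 1e-9) {
      failures.push(
        name + ": dropped " + dropPp.toFixed(2) + "pp (limit " + maxDropPp + "pp)"
      );
    }
  }
  return failures;
}
```

A CI script could call findRegressions on the two parsed JSON files and exit non-zero when the returned array is non-empty. Improvements never fail the gate; they only require committing the new numbers to baseline.json.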

Acceptance criteria

  • At least 5 repos in corpus, hand-labeled
  • npm run benchmark produces metrics report
  • CI workflow runs on every PR
  • Baseline committed; current metrics published in docs/accuracy/measurement.md
  • Regression gate prevents merging PRs that drop precision or recall by more than 1pp

Reference

Full design: https://github.com/recost-dev/extension/blob/main/docs/accuracy/measurement.md#d1-labeled-benchmark-corpus--ci-precisionrecall-gate
