
feat(L6): A/B prompt testing — head-to-head outcome comparison #131

@ZaxShen

Background

L6 v2 has multiple scorers (outcome / trajectory_required / trajectory_forbidden / cost), but no head-to-head comparison between two prompt versions. Currently the only way to evaluate "is prompt revision X better than the previous one?" is to manually compare L6 runs across commits.

Proposed framework

bash tests/dogfood/run-l6-ab.sh \
  --flow 02-simple-task \
  --prompt-a "@bro implement Y" \
  --prompt-b "@bro plz implement Y, the gentle way" \
  --runs 10

For each run pair: same fixture, same flow, two prompt variants. Run both N times. Report:

  • outcome scorer pass-rate: A% vs B%
  • mean tokens per run (cost differential)
  • mean latency
  • statistical significance (chi-squared on pass/fail counts) — fail-loud when N is too small
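A rough sketch of the significance check from the last bullet, to make the "fail-loud" behavior concrete. `ab_chi2` is a hypothetical helper name, not something that exists in the repo. It assumes a plain Pearson chi-squared test on the 2x2 pass/fail table (no continuity correction), treats "N is too small" as "smallest expected cell count below 5" (a common rule of thumb, chosen here as an assumption), and reports significance against the df=1 critical value 3.841 (alpha = 0.05) rather than computing an exact p-value.

```shell
#!/usr/bin/env bash
# Sketch only: ab_chi2 is a hypothetical helper, not an existing repo function.
# Usage: ab_chi2 PASS_A FAIL_A PASS_B FAIL_B
ab_chi2() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    n  = a + b + c + d
    r1 = a + b; r2 = c + d        # runs per variant (A, B)
    c1 = a + c; c2 = b + d        # total passes, total fails
    if (r1 == 0 || r2 == 0) {
      print "error: a variant has zero runs" > "/dev/stderr"
      exit 2
    }
    # Smallest expected cell count under the null hypothesis of equal pass rates.
    min_row = (r1 < r2) ? r1 : r2
    min_col = (c1 < c2) ? c1 : c2
    min_exp = min_row * min_col / n
    # Assumed threshold: refuse to report when any expected count is below 5.
    if (min_exp < 5) {
      printf "error: smallest expected cell count %.2f < 5; increase --runs\n", min_exp > "/dev/stderr"
      exit 2
    }
    # Pearson chi-squared for a 2x2 table, 1 degree of freedom.
    chi2 = n * (a * d - b * c) ^ 2 / (r1 * r2 * c1 * c2)
    printf "A: %.1f%% (%d/%d)\n", 100 * a / r1, a, r1
    printf "B: %.1f%% (%d/%d)\n", 100 * c / r2, c, r2
    # 3.841 is the chi-squared critical value for df=1 at alpha=0.05.
    printf "chi2=%.3f df=1 significant_at_0.05=%s\n", chi2, (chi2 > 3.841) ? "yes" : "no"
  }'
}
```

With 18/20 passes for A and 10/20 for B, `ab_chi2 18 2 10 10` reports both pass rates plus the chi-squared line. With only 10 runs per variant and the same rates (`ab_chi2 9 1 5 5`), the smallest expected count is 3, so the helper exits with status 2 instead of printing a result. That suggests `--runs 10` may be too small for this check whenever the pooled pass rate is high.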

Where it slots into the L0-L6 pyramid

  • Not a gate (token-heavy, manual trigger only)
  • New tier above L6, e.g. "L7 A/B" or just "L6-AB"
  • Useful for CLAUDE.md doctrine changes, planning-skill rewrites, agent prompt revisions
  • Should run in workflow_dispatch only (matches existing token-heavy tier policy)

Acceptance

  • run-l6-ab.sh accepts --flow / --prompt-a / --prompt-b / --runs
  • Reuses existing flow infrastructure (l6_setup_scratch_project, l6_score_flow)
  • Reports per-scorer A vs B percentages + raw counts
  • Statistical-significance flag (chi-squared p-value)
  • Documented in CONTRIBUTING.md "three release gates" section as opt-in tier
  • One worked example in tests/dogfood/scenarios/ab-onboarding-greeting.md
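For the first acceptance item, a minimal sketch of how `run-l6-ab.sh` might parse its flags. Only the four flag names come from this issue. `parse_ab_args` and the `AB_*` variable names are hypothetical, and requiring all four flags plus a positive-integer `--runs` is an assumption, not settled behavior.

```shell
#!/usr/bin/env bash
# Sketch only: parse_ab_args and the AB_* variables are hypothetical names.
parse_ab_args() {
  AB_FLOW="" AB_PROMPT_A="" AB_PROMPT_B="" AB_RUNS=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --flow|--prompt-a|--prompt-b|--runs)
        # Every flag takes exactly one value.
        if [ "$#" -lt 2 ]; then
          echo "error: $1 requires a value" >&2
          return 2
        fi
        case "$1" in
          --flow)     AB_FLOW="$2" ;;
          --prompt-a) AB_PROMPT_A="$2" ;;
          --prompt-b) AB_PROMPT_B="$2" ;;
          --runs)     AB_RUNS="$2" ;;
        esac
        shift 2
        ;;
      *)
        echo "error: unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
  # Assumption: all four flags are mandatory.
  if [ -z "$AB_FLOW" ] || [ -z "$AB_PROMPT_A" ] || [ -z "$AB_PROMPT_B" ] || [ -z "$AB_RUNS" ]; then
    echo "error: --flow, --prompt-a, --prompt-b and --runs are all required" >&2
    return 2
  fi
  # Reject non-numeric or zero run counts before any tokens are spent.
  case "$AB_RUNS" in
    *[!0-9]*)
      echo "error: --runs must be a positive integer" >&2
      return 2
      ;;
  esac
  if [ "$AB_RUNS" -lt 1 ]; then
    echo "error: --runs must be a positive integer" >&2
    return 2
  fi
}
```

Validating up front keeps a typo in `--runs` from surfacing only after the first (token-heavy) run has already started.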

Why now

Came up during the CLAUDE.md slim-down (PRs #124#130). Without A/B, every CLAUDE.md trim is a guess at whether bro got better or worse — outcome scorers across separate runs aren't directly comparable due to LLM nondeterminism.

Labels

Feature: New feature or request
Tests: Test infrastructure (L0-L6)
