Background
L6 v2 has multi-scorer support (outcome / trajectory_required / trajectory_forbidden / cost), but no head-to-head comparison between two prompt versions. Currently the only way to evaluate "is prompt revision X better than the previous one?" is to compare L6 runs across commits by hand.
Proposed framework
    bash tests/dogfood/run-l6-ab.sh \
      --flow 02-simple-task \
      --prompt-a "@bro implement Y" \
      --prompt-b "@bro plz implement Y, the gentle way" \
      --runs 10
For each run pair: same fixture, same flow, two prompt variants. Run both N times. Report:
- outcome scorer pass-rate: A% vs B%
- mean tokens per run (cost differential)
- mean latency
- statistical significance (chi-squared test on pass/fail counts); fail loudly when N is too small
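A minimal sketch of how the report above could be computed, assuming each run yields a pass/fail verdict, a token count and a latency. `RunRecord`, `chi2_pass_fail`, `summarize` and the `min_expected=5.0` guard are hypothetical names and choices for illustration, not part of the existing L6 code. The threshold follows the common rule of thumb that every expected cell count should be at least 5.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One L6 run of one variant (hypothetical shape, not the real L6 output)."""
    passed: bool       # outcome scorer verdict
    tokens: int        # total tokens used by the run
    latency_s: float   # wall-clock seconds


def chi2_pass_fail(pass_a, n_a, pass_b, n_b, min_expected=5.0):
    """Pearson chi-squared test (1 df, no continuity correction) on a 2x2 table.

    Returns (chi2, p_value). Raises ValueError when any expected cell count
    falls below min_expected, i.e. when N is too small for the approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each variant needs at least one run")
    total = n_a + n_b
    col_totals = (pass_a + pass_b, total - pass_a - pass_b)  # (passes, fails)
    observed = ((pass_a, n_a - pass_a), (pass_b, n_b - pass_b))
    expected = tuple(tuple(n * c / total for c in col_totals) for n in (n_a, n_b))
    smallest = min(min(row) for row in expected)
    if smallest < min_expected:
        raise ValueError(
            f"N too small: smallest expected cell count is {smallest:.1f} "
            f"(< {min_expected}); increase --runs"
        )
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def summarize(runs_a, runs_b):
    """Build the A-vs-B report from two lists of RunRecord."""
    pass_a = sum(r.passed for r in runs_a)
    pass_b = sum(r.passed for r in runs_b)
    chi2, p_value = chi2_pass_fail(pass_a, len(runs_a), pass_b, len(runs_b))
    return {
        "pass_rate_a": pass_a / len(runs_a),
        "pass_rate_b": pass_b / len(runs_b),
        "mean_tokens_a": mean(r.tokens for r in runs_a),
        "mean_tokens_b": mean(r.tokens for r in runs_b),
        "mean_latency_a": mean(r.latency_s for r in runs_a),
        "mean_latency_b": mean(r.latency_s for r in runs_b),
        "chi2": chi2,
        "p_value": p_value,
    }
```

Under this assumed threshold, a result such as 7/10 vs 5/10 passes raises instead of reporting a p-value, which is the fail-loud behaviour described in the list. The threshold itself is a design choice and could be swapped for an exact test at small N.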
Where it slots into the L0-L6 pyramid
- Not a gate (token-heavy, manual trigger only)
- New tier above L6, e.g. "L7 A/B" or just "L6-AB"
- Useful for CLAUDE.md doctrine changes, planning-skill rewrites, agent prompt revisions
- Should run in workflow_dispatch only (matches existing token-heavy tier policy)
Acceptance
- run-l6-ab.sh accepts --flow / --prompt-a / --prompt-b / --runs
Why now
Came up during the CLAUDE.md slim-down (PRs #124–#130). Without A/B, every CLAUDE.md trim is a guess at whether bro got better or worse: LLM nondeterminism means outcome scores from separate runs aren't directly comparable.