feat(L6): A/B prompt testing — head-to-head outcome comparison #131

## Background

L6 v2 supports multiple scorers (outcome / trajectory_required / trajectory_forbidden / cost), but has no head-to-head comparison between two prompt versions. Currently the only way to answer "is prompt revision X better than the previous one?" is to manually compare L6 runs across commits.

## Proposed framework

```bash
bash tests/dogfood/run-l6-ab.sh \
  --flow 02-simple-task \
  --prompt-a "@bro implement Y" \
  --prompt-b "@bro plz implement Y, the gentle way" \
  --runs 10
```

Each comparison uses the same fixture and the same flow with two prompt variants, and runs each variant N times (`--runs`). Report:
- outcome scorer pass-rate: A% vs B%
- mean tokens per run (cost differential)
- mean latency
- statistical significance (chi-squared on pass/fail counts); fail loud when N is too small for the test to be meaningful
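
A minimal sketch of the significance check, assuming the pass/fail counts for each variant have already been tallied. The function name `ab_chi2` is hypothetical, the 5% threshold (critical value 3.841 for one degree of freedom) is an assumed default, and "N too small" is interpreted here via the common rule of thumb that every expected cell count should be at least 5. None of these choices come from the existing L6 code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: chi-squared test on a 2x2 pass/fail table.
# Args: a_pass a_fail b_pass b_fail
# Prints the statistic and a yes/no flag at the 5% level (df = 1).
# Exits non-zero (fail-loud) when the table is degenerate or N is too small.
ab_chi2() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    n  = a + b + c + d
    r1 = a + b; r2 = c + d      # row totals: runs per variant
    c1 = a + c; c2 = b + d      # column totals: passes, fails
    if (r1 == 0 || r2 == 0 || c1 == 0 || c2 == 0) {
      print "error: degenerate table (an all-zero row or column)" > "/dev/stderr"
      exit 2
    }
    # Smallest expected cell count under the null hypothesis.
    min_e = (r1 < r2 ? r1 : r2) * (c1 < c2 ? c1 : c2) / n
    if (min_e < 5) {
      print "error: N too small (min expected count " min_e " < 5)" > "/dev/stderr"
      exit 2
    }
    # Pearson chi-squared for a 2x2 table, no continuity correction.
    chi2 = n * (a * d - b * c) ^ 2 / (r1 * r2 * c1 * c2)
    printf "chi2=%.3f significant=%s\n", chi2, (chi2 > 3.841 ? "yes" : "no")
  }'
}
```

For example, `ab_chi2 18 2 10 10` (A passes 18 of 20, B passes 10 of 20) prints `chi2=7.619 significant=yes`, while `ab_chi2 9 1 3 7` exits with status 2 because the smallest expected count is 4. Under this rule of thumb, 10 runs per variant only clears the check when the combined pass count is exactly 10, so a larger `--runs` or an exact test may be worth considering.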

## Where it slots into the L0-L6 pyramid

- Not a gate (token-heavy, manual trigger only)
- New tier above L6, e.g. "L7 A/B" or just "L6-AB"
- Useful for CLAUDE.md doctrine changes, planning-skill rewrites, agent prompt revisions
- Should run in `workflow_dispatch` only (matches existing token-heavy tier policy)

## Acceptance

- [ ] `run-l6-ab.sh` accepts --flow / --prompt-a / --prompt-b / --runs
- [ ] Reuses existing flow infrastructure (l6_setup_scratch_project, l6_score_flow)
- [ ] Reports per-scorer A vs B percentages + raw counts
- [ ] Statistical-significance flag (chi-squared p-value)
- [ ] Documented in CONTRIBUTING.md "three release gates" section as opt-in tier
- [ ] One worked example in tests/dogfood/scenarios/ab-onboarding-greeting.md
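
As a rough sketch of the first acceptance item, the flag handling could look like the following. `parse_ab_args` and the `FLOW` / `PROMPT_A` / `PROMPT_B` / `RUNS` variable names are hypothetical; only the four flag names come from this issue, and treating all four as required is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical argument parser for run-l6-ab.sh.
# Sets FLOW, PROMPT_A, PROMPT_B, RUNS; returns 2 on any usage error.
parse_ab_args() {
  FLOW="" PROMPT_A="" PROMPT_B="" RUNS=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --flow|--prompt-a|--prompt-b|--runs)
        if [ "$#" -lt 2 ]; then
          echo "error: $1 requires a value" >&2
          return 2
        fi
        case "$1" in
          --flow)     FLOW="$2" ;;
          --prompt-a) PROMPT_A="$2" ;;
          --prompt-b) PROMPT_B="$2" ;;
          --runs)     RUNS="$2" ;;
        esac
        shift 2
        ;;
      *)
        echo "error: unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
  # Assumption: all four flags are mandatory.
  if [ -z "$FLOW" ] || [ -z "$PROMPT_A" ] || [ -z "$PROMPT_B" ] || [ -z "$RUNS" ]; then
    echo "error: --flow, --prompt-a, --prompt-b and --runs are all required" >&2
    return 2
  fi
  # --runs must consist of digits only and be greater than zero.
  case "$RUNS" in
    *[!0-9]*) echo "error: --runs must be a positive integer" >&2; return 2 ;;
  esac
  [ "$RUNS" -gt 0 ] || { echo "error: --runs must be a positive integer" >&2; return 2; }
}
```

The script would call `parse_ab_args "$@" || exit 2` before setting up the scratch project, so a malformed invocation fails before any tokens are spent.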

## Why now

Came up during the CLAUDE.md slim-down (PRs #124–#130). Without A/B, every CLAUDE.md trim is a guess at whether bro got better or worse — outcome scorers across separate runs aren't directly comparable due to LLM nondeterminism.
