📊 Backfill A/B measurement on shipped doctrine variants (depends on #131) #153

Once #131 ships the A/B framework, use it to retroactively measure already-shipped doctrine variants. Stop guessing about the LLM compliance ceiling; get data.

## Hypotheses to test (each ≥10 paired runs per arm)

1. **CLAUDE.md slim**: did cutting the file from 268 to 109 lines actually improve outcome pass-rate?
   - Arms: pre-slim (snapshot from before PR #126) vs current
   - Metric: outcome pass-rate per flow + total tokens per session
2. **Hybrid D' (cold-start AskUserQuestion + lazy default)**: better than pure-lazy or pure-eager?
   - 3 arms: Hybrid D', always-lazy, always-eager
   - Metric: outcome pass-rate + tokens + time-to-first-useful-action
3. **Direct Mode 4-step protocol**: did the strong NEVER-SKIP-THIS framing move compliance vs. prior 3-step + soft framing?
   - Arms: pre-PR #139 wording, current 4-step, hypothetical post-commit-hook enforcement
   - Metric: rate at which the `direct_mode_used` ledger event lands
4. **First-action chain MANDATORY tightening**: did PR #139 move greeting-flow compliance, or are we at the LLM ceiling?
   - Arms: pre-PR #139, current
   - Metric: rate at which `identity_get` and `issue_resume` both appear in the trajectory on `@bro hi`
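
To make the hypothesis 4 metric concrete, here is a minimal sketch of how the compliance rate could be computed. It assumes each run's trajectory is available as a list of tool-call names; that shape, and the names `compliance_rate` and `REQUIRED_CALLS`, are hypothetical and not taken from the #131 framework.

```python
from typing import Iterable, Sequence

# Calls that must both show up in a greeting-flow trajectory (from hypothesis 4).
REQUIRED_CALLS = frozenset({"identity_get", "issue_resume"})


def compliance_rate(
    trajectories: Sequence[Iterable[str]],
    required: frozenset = REQUIRED_CALLS,
) -> float:
    """Fraction of runs whose trajectory contains every required call.

    Assumption: a trajectory is an iterable of tool-call names, in any order.
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")
    # A run counts as compliant only if all required calls are present.
    hits = sum(1 for calls in trajectories if required <= set(calls))
    return hits / len(trajectories)
```

The same function would cover the hypothesis 3 metric by passing `required=frozenset({"direct_mode_used"})`, assuming ledger events can be flattened into the same list.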

## Why this matters

Several recent doctrine PRs were guesses. Without comparison runs we don't know which helped. If hypothesis 4 confirms the ceiling is real, we stop prompt-only fixes for compliance and consider programmatic enforcement.

## Acceptance

- [ ] Each hypothesis run with ≥10 paired runs per arm
- [ ] Results published as ADR under docs/trustmybot/architecture/manual/decisions/
- [ ] If a variant lost or was within noise, file a follow-up issue to revert or redesign it
- [ ] Findings inform v0.4.3+ prompt shipping
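
One possible way to decide "within noise" for paired pass/fail runs is an exact McNemar (sign) test on the discordant pairs. The sketch below assumes each arm yields one boolean outcome per paired scenario; the function name and the 0.05 cutoff mentioned afterwards are assumptions, not something this issue or #131 specifies.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], variant: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Only discordant pairs (one arm passed, the other failed) carry signal.
    """
    if len(baseline) != len(variant):
        raise ValueError("arms must have the same number of paired runs")
    # b: baseline passed, variant failed; c: variant passed, baseline failed.
    b = sum(1 for x, y in zip(baseline, variant) if x and not y)
    c = sum(1 for x, y in zip(baseline, variant) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    # Binomial(n, 0.5) tail up to the smaller count, doubled for two sides.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)
```

With 10 paired runs the test has little power: a 6-to-0 split among discordant pairs gives p ≈ 0.031, while 6-to-1 gives p = 0.125. Under an assumed 0.05 cutoff, most moderate differences at this sample size would land in the "within noise" bucket.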

## Sequencing

Blocked by #131.
