📊 Backfill A/B measurement on shipped doctrine variants (depends on #131) #153

@ZaxShen

Once #131 ships the A/B framework, use it to retroactively measure already-shipped doctrine variants. Stop guessing about the LLM compliance ceiling; get data.

Hypotheses to test (each ≥10 paired runs per arm)

  1. CLAUDE.md slim: did 268 → 109 lines actually improve outcome pass-rate?
  2. Hybrid D' (cold-start AskUserQuestion + lazy default): is it better than always-lazy or always-eager?
    • 3 arms: Hybrid D', always-lazy, always-eager
    • Metric: outcome pass-rate + tokens + time-to-first-useful-action
  3. Direct Mode 4-step protocol: did the strong NEVER-SKIP-THIS framing move compliance vs. prior 3-step + soft framing?
  4. First-action chain MANDATORY tightening: did PR #139 (🐛 fix(bro): tighten activation routine + Direct Mode protocol; fix Task→Agent scorer mismatch) move greeting-flow compliance, or are we at the LLM ceiling?
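
To make the metrics above concrete, here is a minimal sketch of how paired runs could be recorded and summarized per arm. It does not reflect the #131 framework, whose API is not described here: `RunResult`, `summarize`, and every field name are hypothetical, chosen only to mirror the metrics named in hypothesis 2 (outcome pass-rate, tokens, time-to-first-useful-action).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunResult:
    """One run of one arm. All field names are hypothetical."""
    arm: str                               # e.g. "hybrid-d", "always-lazy", "always-eager"
    pair_id: int                           # runs sharing a pair_id used the same scenario
    passed: bool                           # outcome pass/fail
    tokens: int                            # total tokens consumed by the run
    seconds_to_first_useful_action: float  # time-to-first-useful-action


def summarize(runs):
    """Group runs by arm and compute the per-arm metrics."""
    by_arm = {}
    for r in runs:
        by_arm.setdefault(r.arm, []).append(r)
    return {
        arm: {
            "n": len(rs),
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in rs),
            "mean_tokens": mean(r.tokens for r in rs),
            "mean_ttfua_s": mean(r.seconds_to_first_useful_action for r in rs),
        }
        for arm, rs in by_arm.items()
    }
```

Keeping `pair_id` on every record is the point of the sketch: it lets a later analysis compare arms scenario by scenario rather than only comparing per-arm averages.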

Why this matters

Several recent doctrine PRs were guesses. Without comparison runs we don't know which helped. If hypothesis 4 confirms the ceiling is real, we stop prompt-only fixes for compliance and consider programmatic enforcement.

Acceptance

  • Each hypothesis run with ≥10 paired runs per arm
  • Results published as ADR under docs/trustmybot/architecture/manual/decisions/
  • If a variant lost or was within noise, file a follow-up to revert or redesign it
  • Findings inform v0.4.3+ prompt shipping
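
One possible way to decide "lost", "won", or "within noise" for paired pass/fail outcomes is an exact McNemar test on the discordant pairs. The sketch below is an assumption, not something this issue or #131 specifies: the function names, the `(baseline_passed, variant_passed)` pair layout, and the `alpha=0.05` threshold are all hypothetical.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = pairs where only the variant passed,
    c = pairs where only the baseline passed.
    Pairs where both arms agree carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def verdict(pairs, alpha=0.05):
    """pairs: iterable of (baseline_passed, variant_passed) booleans.

    alpha is an assumed threshold, not one fixed by this issue.
    """
    pairs = list(pairs)
    b = sum(1 for base, var in pairs if var and not base)
    c = sum(1 for base, var in pairs if base and not var)
    if mcnemar_exact_p(b, c) >= alpha:
        return "within noise"
    return "variant won" if b > c else "variant lost"
```

Under these assumptions, 10 paired runs can only clear `alpha=0.05` when at least 6 pairs disagree and all of them point the same way (5 such pairs give p = 0.0625, 6 give p = 0.03125), so ≥10 is a floor and small effects will likely read as "within noise".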

Sequencing

Blocked by #131.

Labels

  • Discussion: Open design question, no decided action yet
  • Doctrine: Doctrine clarification or contract change
  • Tests: Test infrastructure (L0-L6)
