🐛 A/B framework: 3 measurement bugs surfaced by h4 (#153) run

Surfaced by [run 24976153108](https://github.com/trustmybot/plugin/actions/runs/24976153108) (h4-first-action-mandatory). Per-pair output was correct; the merged report under-counted and misled.

## 1. Merge step drops most rows (`UNIQUE constraint failed: eval_results.id`)

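`tests/dogfood/scripts/ab-report.sh` (or the merge-DB step in `.github/workflows/ab-scenario.yml`) does a naive `INSERT` from each per-pair DB into the merged DB and copies `id` along with every other column. Each per-pair DB starts row IDs at 1, so every DB after the first collides on `eval_results.id`, the insert fails the UNIQUE constraint, and its rows are silently dropped. Switching to `INSERT OR IGNORE` would not help: it hides the error but still skips the colliding rows.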

**Symptom**: h4 report showed only 1 row per scorer for B-pre-pr139, none for A-current-mandatory, despite 10 successful per-pair runs.

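**Fix**: either `INSERT INTO eval_results (run_id, flow, scorer, pass, value, explanation, arm, scenario, ts) SELECT ... FROM other.eval_results` (omit `id`, let the merged DB auto-assign), or rewrite as `INSERT OR REPLACE` keyed on `(run_id, scorer)`. The second option needs a UNIQUE index on `(run_id, scorer)` in the merged DB and must also omit `id`; otherwise `REPLACE` resolves the `id` collision by overwriting rows from earlier per-pair DBs.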
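A minimal sketch of the first option, written in Python with the standard `sqlite3` module so it can be run in isolation. The column list comes from the fix above; the column types, the helper names `merge_eval_results` and `make_pair_db`, and the fixture values are assumptions for illustration, not the real schema or the contents of `ab-report.sh`.

```python
import os
import sqlite3
import tempfile

# Column list taken from the proposed fix; `id` is deliberately left out.
COLUMNS = "run_id, flow, scorer, pass, value, explanation, arm, scenario, ts"

# Assumed schema: the column types are guesses, only the names come from the issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
    id INTEGER PRIMARY KEY,
    run_id TEXT, flow TEXT, scorer TEXT, pass INTEGER, value REAL,
    explanation TEXT, arm TEXT, scenario TEXT, ts TEXT
)
"""


def make_pair_db(path, arm, scorers):
    """Build a toy per-pair DB whose row ids start at 1 (hypothetical fixture)."""
    con = sqlite3.connect(path)
    con.execute(SCHEMA)
    con.executemany(
        f"INSERT INTO eval_results ({COLUMNS}) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [("run-" + arm, "flow", s, 1, 1.0, "", arm, "h4", "t") for s in scorers],
    )
    con.commit()
    con.close()


def merge_eval_results(merged_path, pair_paths):
    """Copy all rows from each per-pair DB; the merged DB assigns fresh ids."""
    con = sqlite3.connect(merged_path)
    try:
        con.execute(SCHEMA)
        con.commit()
        for path in pair_paths:
            con.execute("ATTACH DATABASE ? AS other", (path,))
            con.execute(
                f"INSERT INTO eval_results ({COLUMNS}) "
                f"SELECT {COLUMNS} FROM other.eval_results"
            )
            con.commit()  # DETACH is not allowed inside an open transaction
            con.execute("DETACH DATABASE other")
        return con.execute("SELECT COUNT(*) FROM eval_results").fetchone()[0]
    finally:
        con.close()
```

Because `id` never crosses from the per-pair DB into the merged one, two per-pair DBs that both use ids 1 and 2 merge into four rows instead of failing on the second attach.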

## 2. Cost scorer reports `tokens_total=0`

Every run logged `cost: within budget — tokens_total=0 (in=0 out=0) p99_latency_ms=0`. The pass is meaningless because the input is zero. Either the audit-row token field isn't being populated by the MCP server during arm runs, or the cost scorer's SQL is reading the wrong column.

**Where to look**: `tests/dogfood/lib/scorers.sh:l5_score_cost` and what it reads from `audit` / `ledger`. Cross-check whether `TMB_DEBUG_TRAJECTORY=1` (set in `l5_run_arm`) actually causes tokens to be persisted.
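Independently of where the zero comes from, the scorer could refuse to report a pass when it has no token data at all. The sketch below shows that guard in Python; `score_cost`, its parameters, and the `no_data` status are hypothetical and do not describe what `l5_score_cost` currently does.

```python
def score_cost(tokens_in, tokens_out, budget):
    """Classify a run's token cost, treating an all-zero reading as missing data.

    Hypothetical helper: names, statuses, and the budget parameter are assumptions.
    """
    total = tokens_in + tokens_out
    if total == 0:
        # A run that did real work cannot have consumed zero tokens, so a zero
        # reading means the counts were never recorded, not that the run was cheap.
        return ("no_data", "tokens_total=0: token counts were not recorded")
    if total <= budget:
        return ("pass", f"within budget: tokens_total={total}")
    return ("fail", f"over budget: tokens_total={total} > {budget}")
```

With a guard like this, the h4 run would have surfaced a distinct `no_data` status for every arm instead of ten meaningless passes.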

## 3. `tools-required.json` for 95-anonymous-cold-restart contradicts CLAUDE.md doctrine

`tests/dogfood/flows/95-anonymous-cold-restart/tools-required.json` requires `mcp__plugin_tmb_trajectory-server__config_get`. But the activation-routine section of CLAUDE.md says explicitly: *"never write them; fetch via `config_get` only when you need a specific value (don't add to the activation routine)"*.

The test will always fail this assertion in steady state because the doctrine keeps `config_get` out of the activation routine. Drop `config_get` from the required list and keep just `identity_get` + `issue_resume`.
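To keep this class of contradiction from recurring, a small lint could compare each flow's required tools against the tools the doctrine excludes from activation. The sketch below assumes `tools-required.json` is a flat JSON array of tool names, which may not match the real file format; `forbidden_required` is a hypothetical helper, and the short names `identity_get` and `issue_resume` stand in for their full tool identifiers.

```python
import json


def forbidden_required(required_json, forbidden_suffixes):
    """Return required tools whose names end with a doctrine-forbidden suffix.

    Assumes the JSON document is a flat list of tool-name strings.
    """
    required = json.loads(required_json)
    return [
        tool
        for tool in required
        if any(tool.endswith(suffix) for suffix in forbidden_suffixes)
    ]


# Illustrative input only: the first name is from this issue, the other two
# are the short names used above, not verified full identifiers.
example = json.dumps(
    [
        "mcp__plugin_tmb_trajectory-server__config_get",
        "identity_get",
        "issue_resume",
    ]
)
conflicts = forbidden_required(example, ["config_get"])
```

Here `conflicts` contains only the `config_get` entry, which is the one this issue proposes removing.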

---

These three together mean A/B reports for the next several scenarios will be misleading until fixed. Priority: high if we're about to gate releases on A/B data; medium otherwise.
