Surfaced by run 24976153108 (h4-first-action-mandatory). Per-pair output was correct; the merged report under-counted and misled.
1. Merge step drops most rows (UNIQUE constraint failed: eval_results.id)
tests/dogfood/scripts/ab-report.sh (or the merge-DB step in .github/workflows/ab-scenario.yml) does a naive INSERT from each per-pair DB into the merged DB without renumbering or INSERT OR IGNORE. Each per-pair DB starts row IDs at 1, so every DB after the first collides with IDs already in the merged DB, fails the UNIQUE constraint on eval_results.id, and has its rows dropped without the report flagging it.
Symptom: h4 report showed only 1 row per scorer for B-pre-pr139, none for A-current-mandatory, despite 10 successful per-pair runs.
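Fix: either INSERT INTO eval_results (run_id, flow, scorer, pass, value, explanation, arm, scenario, ts) SELECT ... FROM other.eval_results (omit id, let the merged DB auto-assign), or rewrite as INSERT OR REPLACE keyed on (run_id, scorer). The second option only works if id is still omitted and the merged table has a UNIQUE index on (run_id, scorer), because INSERT OR REPLACE resolves conflicts only against declared UNIQUE or PRIMARY KEY constraints.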
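A minimal sketch of the first option, written in Python with the standard sqlite3 module for illustration rather than as a patch to ab-report.sh. The column types in SCHEMA, the function name merge_eval_results, and the file paths are assumptions; only the column list comes from this report.

```python
import sqlite3

# Column list taken from the report; id is deliberately left out.
COLUMNS = "run_id, flow, scorer, pass, value, explanation, arm, scenario, ts"

# Assumed schema: the real column types may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    run_id TEXT, flow TEXT, scorer TEXT, pass INTEGER, value REAL,
    explanation TEXT, arm TEXT, scenario TEXT, ts TEXT
)
"""


def merge_eval_results(merged_path, pair_paths):
    """Copy eval_results rows from each per-pair DB into the merged DB.

    Returns the number of rows in the merged table afterwards.
    """
    conn = sqlite3.connect(merged_path)
    conn.execute(SCHEMA)
    conn.commit()
    for path in pair_paths:
        conn.execute("ATTACH DATABASE ? AS other", (path,))
        # Omitting id lets the merged DB assign fresh, non-colliding ids.
        conn.execute(
            f"INSERT INTO eval_results ({COLUMNS}) "
            f"SELECT {COLUMNS} FROM other.eval_results"
        )
        # Commit before DETACH: SQLite refuses to detach inside a transaction.
        conn.commit()
        conn.execute("DETACH DATABASE other")
    total = conn.execute("SELECT COUNT(*) FROM eval_results").fetchone()[0]
    conn.close()
    return total
```

Because id is not in the column list, rows from the second and later per-pair DBs no longer collide with rows already merged, so the row count of the merged table should equal the sum of the per-pair row counts.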
2. Cost scorer reports tokens_total=0
Every run logged cost: within budget — tokens_total=0 (in=0 out=0) p99_latency_ms=0. The pass is meaningless because the input is zero. Either the audit-row token field isn't being populated by the MCP server during arm runs, or the cost scorer's SQL is reading the wrong column.
Where to look: tests/dogfood/lib/scorers.sh:l5_score_cost and what it reads from audit / ledger. Cross-check whether TMB_DEBUG_TRAJECTORY=1 (set in l5_run_arm) actually causes tokens to be persisted.
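Independently of the root cause, the scorer could refuse to report a pass when it has no input. The sketch below is one possible guard in Python, not the logic of l5_score_cost; the function name score_cost, the "no_data" status, and the budget_tokens parameter are all hypothetical.

```python
def score_cost(tokens_in, tokens_out, budget_tokens):
    """Return (status, explanation) for one run's token usage."""
    total = tokens_in + tokens_out
    if total == 0:
        # Nothing was recorded, so "within budget" would be vacuous.
        return ("no_data", "cost: no token usage recorded (tokens_total=0)")
    detail = f"tokens_total={total} (in={tokens_in} out={tokens_out})"
    if total <= budget_tokens:
        return ("pass", f"cost: within budget, {detail}")
    return ("fail", f"cost: over budget, {detail}")
```

With a guard like this, a run whose token fields were never populated shows up as missing data in the report instead of as a passing cost score.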
3. tools-required.json for 95-anonymous-cold-restart contradicts CLAUDE.md doctrine
tests/dogfood/flows/95-anonymous-cold-restart/tools-required.json requires mcp__plugin_tmb_trajectory-server__config_get. But the activation-routine section of CLAUDE.md says explicitly: "never write them; fetch via config_get only when you need a specific value (don't add to the activation routine)".
The test will always fail this assertion in steady state, because an agent that follows the doctrine never makes the call. Drop config_get from required and keep just identity_get + issue_resume.
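The change amounts to removing one entry from the required list. The Python sketch below assumes tools-required.json is a flat JSON array of tool names, which this report does not confirm; the helper drop_tool is hypothetical, and the full names for identity_get and issue_resume assume the same prefix as config_get.

```python
import json


def drop_tool(required, suffix):
    """Return the required-tools list without names ending in suffix."""
    return [name for name in required if not name.endswith(suffix)]


# Assumed file contents: the real structure and prefixes may differ.
required = json.loads("""
[
  "mcp__plugin_tmb_trajectory-server__identity_get",
  "mcp__plugin_tmb_trajectory-server__issue_resume",
  "mcp__plugin_tmb_trajectory-server__config_get"
]
""")

updated = drop_tool(required, "__config_get")
print(json.dumps(updated, indent=2))
```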
These three together mean A/B reports for the next several scenarios will be misleading until fixed. Priority: high if we're about to gate releases on A/B data; medium otherwise.