
🐛 A/B framework: 3 measurement bugs surfaced by h4 (#153) run #164

@ZaxShen


Surfaced by run 24976153108 (h4-first-action-mandatory). Per-pair output was correct; the merged report under-counted and misled.

1. Merge step drops most rows (UNIQUE constraint failed: eval_results.id)

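tests/dogfood/scripts/ab-report.sh (or the merge-DB step in .github/workflows/ab-scenario.yml) does a naive INSERT from each per-pair DB into the merged DB, copying id verbatim without renumbering. Each per-pair DB starts row IDs at 1, so rows from the second DB onward collide with IDs already in the merged DB, fail the UNIQUE constraint, and are silently dropped. (INSERT OR IGNORE would suppress the error but still skip the colliding rows, so it is not a fix on its own.)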

Symptom: h4 report showed only 1 row per scorer for B-pre-pr139, none for A-current-mandatory, despite 10 successful per-pair runs.

Fix: either INSERT INTO eval_results (run_id, flow, scorer, pass, value, explanation, arm, scenario, ts) SELECT ... FROM other.eval_results (omit id, let the merged DB auto-assign), or rewrite as INSERT OR REPLACE keyed on (run_id, scorer), which requires a UNIQUE index on those columns.
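A minimal sketch of the first option, in Python for illustration only (the real merge step lives in a shell script or workflow). The column list is taken from the fix above. The function name `merge_eval_results`, the `other` alias, and the assumption that the merged DB already has an `eval_results` table with an auto-assigned integer `id` are hypothetical.

```python
import sqlite3

# Columns copied from the proposed fix; `id` is deliberately omitted so the
# merged DB assigns fresh row IDs instead of reusing per-pair ones.
COLUMNS = "run_id, flow, scorer, pass, value, explanation, arm, scenario, ts"


def merge_eval_results(merged_path, pair_paths):
    """Append eval_results rows from each per-pair DB into the merged DB.

    Assumes (hypothetically) that the merged DB already contains an
    eval_results table whose id column is INTEGER PRIMARY KEY.
    Returns the number of rows copied.
    """
    conn = sqlite3.connect(merged_path)
    copied = 0
    try:
        for path in pair_paths:
            conn.execute("ATTACH DATABASE ? AS other", (path,))
            cur = conn.execute(
                f"INSERT INTO eval_results ({COLUMNS}) "
                f"SELECT {COLUMNS} FROM other.eval_results"
            )
            copied += cur.rowcount
            # Commit before DETACH: SQLite refuses to detach inside a transaction.
            conn.commit()
            conn.execute("DETACH DATABASE other")
    finally:
        conn.close()
    return copied
```

Comparing the returned count against the sum of per-pair row counts gives a cheap check that no rows were lost in the merge.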

2. Cost scorer reports tokens_total=0

Every run logged cost: within budget — tokens_total=0 (in=0 out=0) p99_latency_ms=0. The pass is meaningless because the input is zero. Either the audit-row token field isn't being populated by the MCP server during arm runs, or the cost scorer's SQL is reading the wrong column.

Where to look: tests/dogfood/lib/scorers.sh:l5_score_cost and what it reads from audit / ledger. Cross-check whether TMB_DEBUG_TRAJECTORY=1 (set in l5_run_arm) actually causes tokens to be persisted.
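Independent of the root cause, the scorer could refuse to pass on empty input. The sketch below is an assumption-laden illustration in Python, not the logic of l5_score_cost: `CostResult`, `score_cost`, and the "inconclusive" status are hypothetical names, and the real scorer is a shell function whose inputs may differ.

```python
from dataclasses import dataclass


@dataclass
class CostResult:
    status: str        # "pass", "fail", or "inconclusive" (hypothetical statuses)
    explanation: str


def score_cost(tokens_in: int, tokens_out: int, budget: int) -> CostResult:
    """Judge token usage against a budget, treating zero usage as missing data."""
    total = tokens_in + tokens_out
    if total == 0:
        # Zero tokens most likely means nothing was recorded, not a free run.
        return CostResult(
            "inconclusive",
            "tokens_total=0: no usage recorded, cannot judge budget",
        )
    if total <= budget:
        return CostResult("pass", f"within budget: tokens_total={total}")
    return CostResult("fail", f"over budget: tokens_total={total} > {budget}")
```

With a guard like this, a run that recorded no tokens would show up in the report as missing data instead of a green result.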

3. tools-required.json for 95-anonymous-cold-restart contradicts CLAUDE.md doctrine

tests/dogfood/flows/95-anonymous-cold-restart/tools-required.json requires mcp__plugin_tmb_trajectory-server__config_get. But the activation-routine section of CLAUDE.md says explicitly: "never write them; fetch via config_get only when you need a specific value (don't add to the activation routine)".

The test will always fail this assertion in steady state because the doctrine forbids the call. Drop config_get from the required list and keep just identity_get + issue_resume.


These three together mean A/B reports for the next several scenarios will be misleading until fixed. Priority: high if we're about to gate releases on A/B data; medium otherwise.

Labels: Bug (Something isn't working), Tests (Test infrastructure (L0-L6))
