🧪 feat(tests): L6 evals v2 — outcome-first multi-scorer (#110) by ZaxShen · Pull Request #111 · trustmybot/plugin

ZaxShen · 2026-04-26T17:38:23Z

Closes #110.

Summary

L6 v1 (PR #109) used strict trajectory matching. Anthropic explicitly warns against this: "We've found this approach too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate." This PR replaces it with the industry-standard multi-scorer pattern.

Architecture (Inspect AI / AgentEvals primitives)

Per-flow directory with 4 scorer configs:

tests/dogfood/flows/<name>/
├── README.md             ← flow spec
├── outcome.sql           ← primary scorer: SQL assertions on final DB state
├── tools-required.json   ← required tools (superset, any order)
├── tools-forbidden.json  ← forbidden tools (subset/safety)
├── cost-budget.json      ← tokens + latency budget
└── run.sh                ← invokes l6_score_flow

Each scorer writes a row to the new eval_results table.
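As an illustration of the two trajectory scorers and the cost scorer, here is a minimal Python sketch. It is not the repository's `l6_score_flow` implementation: the `ScoreResult` type, the function names, and the budget keys `max_tokens` and `max_latency_ms` are all hypothetical, and the real JSON file formats are not shown in this PR.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    # Hypothetical result shape; the real scorer writes a row to eval_results.
    scorer_name: str
    passed: bool
    explanation: str


def score_required(called: list[str], required: list[str]) -> ScoreResult:
    """Superset check: every required tool was called, in any order; extras are fine."""
    missing = sorted(set(required) - set(called))
    explanation = f"missing: {missing}" if missing else "all required tools called"
    return ScoreResult("trajectory_required", not missing, explanation)


def score_forbidden(called: list[str], forbidden: list[str]) -> ScoreResult:
    """Safety check: no forbidden tool appears anywhere in the trajectory."""
    hits = sorted(set(called) & set(forbidden))
    explanation = f"forbidden tools called: {hits}" if hits else "no forbidden tools called"
    return ScoreResult("trajectory_forbidden", not hits, explanation)


def score_cost(tokens: int, latency_ms: int, budget: dict[str, int]) -> ScoreResult:
    """Budget check; the keys max_tokens and max_latency_ms are assumed names."""
    over = []
    if tokens > budget["max_tokens"]:
        over.append(f"tokens {tokens} > {budget['max_tokens']}")
    if latency_ms > budget["max_latency_ms"]:
        over.append(f"latency_ms {latency_ms} > {budget['max_latency_ms']}")
    return ScoreResult("cost", not over, "; ".join(over) or "within budget")
```

Because both trajectory checks use set operations, the order of tool calls and any extra calls never cause a failure. Only a missing required tool or a forbidden call does, which is the property that strict matching lacked.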

Schema additions

debug_trajectory += tokens_in, tokens_out, latency_ms columns
New eval_results table — (run_id, flow_name, scorer_name, pass, value, explanation, metadata_json)
schema_version stays at 1 (additive)
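The additive schema change and an outcome scorer that records its result could look roughly like the sketch below, which uses SQLite through Python's standard `sqlite3` module. The column types, the stand-in definition of `debug_trajectory`, the `tasks` table, and the convention that each assertion is a `SELECT` returning a single truthy value are assumptions; only the table and column names come from this PR.

```python
import json
import sqlite3

SCHEMA = """
-- Stand-in for the existing table; its real columns are not shown in this PR.
CREATE TABLE debug_trajectory (id INTEGER PRIMARY KEY, run_id TEXT, tool_name TEXT);

-- Additive changes: new nullable columns, so existing rows stay valid.
ALTER TABLE debug_trajectory ADD COLUMN tokens_in INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms INTEGER;

-- One row per (flow, scorer) per run. Column types are assumed.
CREATE TABLE eval_results (
    run_id TEXT NOT NULL,
    flow_name TEXT NOT NULL,
    scorer_name TEXT NOT NULL,
    pass INTEGER NOT NULL,
    value REAL,
    explanation TEXT,
    metadata_json TEXT
);
"""


def score_outcome(conn: sqlite3.Connection, run_id: str, flow_name: str,
                  assertions: list[str]) -> bool:
    """Run each assertion query against the final DB state and record the result."""
    if not assertions:
        raise ValueError("at least one assertion is required")
    failed = [sql for sql in assertions if not conn.execute(sql).fetchone()[0]]
    passed = not failed
    value = 1.0 - len(failed) / len(assertions)  # fraction of assertions that held
    explanation = "all assertions held" if passed else f"{len(failed)} assertion(s) failed"
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, 'outcome', ?, ?, ?, ?)",
        (run_id, flow_name, int(passed), value, explanation,
         json.dumps({"failed": failed})),
    )
    return passed
```

A usage example under the same assumptions, where the `tasks` table stands in for whatever state a flow leaves behind:

```python
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks (status) VALUES ('done')")
score_outcome(conn, "run-1", "02-simple-task",
              ["SELECT COUNT(*) = 1 FROM tasks WHERE status = 'done'"])
```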

What's converted

Flow	Status
01-onboarding	✅ v2 multi-scorer
02-simple-task	✅ v2 multi-scorer
D-direct-mode	✅ v2 multi-scorer (with hard `tools-forbidden` invariants)
95-anonymous-cold-restart	✅ v2 multi-scorer (with regression-locking forbidden list)
12 scaffolds	Auto-skip until `outcome.sql` authored — pattern is copy-paste

What's removed

tests/dogfood/expected/ — entire directory (replaced by per-flow outcome.sql + tools-*.json)
l6_assert_trajectory helper — replaced by l6_score_flow

Citations (research-backed, not from memory)

Anthropic: Demystifying evals for AI agents — outcome-first doctrine + warning against strict-step matching
LangSmith trajectory-evals — 4 trajectory match modes (strict/unordered/subset/superset)
Inspect AI (UK AISI) — Dataset/Solver/Scorer/Task primitives we adopt
LangChain agentevals — reference implementation
arxiv 2507.21504 — LLM Agent Evaluation Survey — taxonomy
AWS: Evaluating AI agents at Amazon — production lessons
Hamel Husain on Inspect AI — adoption (Anthropic, DeepMind, Grok)

Test plan

L1 lint passes (16 lints)
L2 unit (245 + 3 new schema tests = 248)
L3 hooks pass
L4 workflow-sim passes
L0 install-smoke (CI Docker)
L6 itself — author triggers it on this PR via L6 label OR runs locally with token

Follow-ups (separate issues)

#111 (to file): Combined L0+L6 in Docker. Simulates marketplace install (bun install --ignore-scripts), then runs L6 inside. Eliminates the L5 manual step entirely. Per user question 2026-04-26.
Phase 4 of #110 (Build TMB evals system v2): author the 12 scaffold flows
Phase 5 of #110 (Build TMB evals system v2): LLM-as-judge scorer + pass^k consistency runs
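For the pass^k consistency runs mentioned in Phase 5, one common definition (used, for example, in the τ-bench paper) is the probability that all k independent trials of a task succeed. With n recorded trials and c successes, the unbiased estimate is C(c, k) / C(n, k). The PR does not say which estimator Phase 5 will use, so the function below is only a sketch under that assumption, and its name is hypothetical.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n trials with c successes: C(c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(c, k) is 0 when c < k, so too few successes yields 0.0.
    return comb(c, k) / comb(n, k)
```

With k = 1 this reduces to the plain success rate c / n. Larger k penalizes flaky flows sharply: 2 successes out of 4 trials gives pass^2 = 1/6.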

🤖 Generated with Claude Code

Replaces L6 v1's brittle strict-trajectory matching with the industry-standard multi-scorer pattern (Inspect AI / AgentEvals / Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching: "There is a common instinct to check that agents followed very specific steps... too rigid and results in overly brittle tests." "Grade what the agent produced, not the path it took." L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic): SQL assertions on final DB state
- trajectory_required (secondary): superset, required tools called
- trajectory_forbidden (secondary): subset/safety, forbidden NOT called
- cost (observational): tokens + latency vs per-flow budget

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
  README.md
  outcome.sql           ← primary scorer config
  tools-required.json   ← superset list
  tools-forbidden.json  ← safety list
  cost-budget.json      ← soft/hard budget
  run.sh                ← invokes l6_score_flow

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task, D-direct-mode, 95-anonymous-cold-restart.
12 scaffold flows preserved; auto-skip until outcome.sql authored.
Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure). All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI (UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod agent eval lessons, Hamel Husain's Inspect endorsement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ZaxShen mentioned this pull request Apr 26, 2026

Combined L0+L6 in Docker — replace manual L5 marketplace dogfood entirely #112

Closed

5 tasks

ZaxShen merged commit b81854e into dev Apr 26, 2026
2 checks passed

ZaxShen deleted the feat/110-evals-v2-multi-scorer branch April 26, 2026 17:40

ZaxShen merged 1 commit into dev from feat/110-evals-v2-multi-scorer
