🧪 feat(tests): L6 evals v2 - outcome-first multi-scorer (#110) #111

Merged: ZaxShen merged 1 commit into dev from feat/110-evals-v2-multi-scorer on Apr 26, 2026
ZaxShen (Contributor) commented on Apr 26, 2026

Closes #110.

Summary

L6 v1 (PR #109) used strict trajectory matching. Anthropic explicitly warns against this: "too brittle and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate." This PR replaces it with the industry-standard multi-scorer pattern.

Architecture (Inspect AI / AgentEvals primitives)

Per-flow directory with 4 scorer configs:

tests/dogfood/flows/<name>/
├── README.md             ← flow spec
├── outcome.sql           ← primary scorer: SQL assertions on final DB state
├── tools-required.json   ← required tools (superset, any order)
├── tools-forbidden.json  ← forbidden tools (subset/safety)
├── cost-budget.json      ← tokens + latency budget
└── run.sh                ← invokes l6_score_flow

Each scorer writes a row to the new eval_results table.
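
As a rough illustration of how the primary scorer could record its verdict, the sketch below runs SQL assertions against a final-state database and writes one eval_results row. It is written under several assumptions that the PR does not confirm: the database is SQLite, each assertion is a single SELECT that returns a truthy value on success, and the column types are guesses. `score_outcome` and the `tasks` table are hypothetical names; the internals of the real `l6_score_flow` helper are not shown here.

```python
import json
import sqlite3


def score_outcome(conn, run_id, flow_name, assertions):
    """Run each (label, SELECT) pair; pass only if every query returns a truthy value."""
    failures = []
    for label, query in assertions:
        row = conn.execute(query).fetchone()
        if row is None or not row[0]:
            failures.append(label)
    passed = not failures
    explanation = "all assertions held" if passed else "failed: " + ", ".join(failures)
    # One row per (flow, scorer) per run, using the column list from the PR.
    conn.execute(
        "INSERT INTO eval_results "
        "(run_id, flow_name, scorer_name, pass, value, explanation, metadata_json) "
        "VALUES (?, ?, 'outcome', ?, ?, ?, ?)",
        (run_id, flow_name, int(passed), len(assertions) - len(failures),
         explanation, json.dumps({"assertions": len(assertions)})),
    )
    return passed


conn = sqlite3.connect(":memory:")
# Assumed column types; the PR lists only the column names.
conn.execute(
    "CREATE TABLE eval_results (run_id TEXT, flow_name TEXT, scorer_name TEXT, "
    "pass INTEGER, value REAL, explanation TEXT, metadata_json TEXT)"
)
# Hypothetical application table standing in for the real final DB state.
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks (status) VALUES ('done')")

ok = score_outcome(
    conn, "run-1", "02-simple-task",
    [("one task done", "SELECT COUNT(*) = 1 FROM tasks WHERE status = 'done'")],
)
```

Because the verdict depends only on the final state, an agent that reaches the same state through a different sequence of tool calls would still pass.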

Schema additions

  • debug_trajectory += tokens_in, tokens_out, latency_ms columns
  • New eval_results table: (run_id, flow_name, scorer_name, pass, value, explanation, metadata_json)
  • schema_version stays at 1 (additive)
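
To make the "additive" claim concrete, here is a minimal sketch of what such a migration might look like, assuming the store is SQLite. The column types and the stand-in definition of the existing debug_trajectory table are assumptions; the PR names the new columns but not their types or the table's other columns.

```python
import sqlite3

# Column types are assumptions; the PR lists only column names.
MIGRATION = """
ALTER TABLE debug_trajectory ADD COLUMN tokens_in INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms INTEGER;
CREATE TABLE IF NOT EXISTS eval_results (
    run_id        TEXT NOT NULL,
    flow_name     TEXT NOT NULL,
    scorer_name   TEXT NOT NULL,
    pass          INTEGER NOT NULL,
    value         REAL,
    explanation   TEXT,
    metadata_json TEXT
);
"""

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the existing v1 table; its real columns are not shown in the PR.
conn.execute("CREATE TABLE debug_trajectory (run_id TEXT, tool_name TEXT)")
conn.execute("INSERT INTO debug_trajectory VALUES ('run-0', 'example_tool')")
conn.executescript(MIGRATION)

# Inspect the result: new columns are appended, and the old row is still readable.
columns = [row[1] for row in conn.execute("PRAGMA table_info(debug_trajectory)")]
old_row = conn.execute("SELECT * FROM debug_trajectory").fetchone()
```

In this sketch, rows written before the migration simply read back NULL for the three new columns, which is consistent with leaving schema_version at 1.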

What's converted

| Flow | Status |
| --- | --- |
| 01-onboarding | ✅ v2 multi-scorer |
| 02-simple-task | ✅ v2 multi-scorer |
| D-direct-mode | ✅ v2 multi-scorer (with hard tools-forbidden invariants) |
| 95-anonymous-cold-restart | ✅ v2 multi-scorer (with regression-locking forbidden list) |
| 12 scaffolds | Auto-skip until outcome.sql is authored; the pattern is copy-paste |
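
The auto-skip rule for scaffolds could be approximated as follows. This is only a sketch of the idea (a flow directory without outcome.sql is skipped); `classify_flows` and the scaffold directory name are hypothetical, and the real harness may decide this differently.

```python
import tempfile
from pathlib import Path


def classify_flows(flows_root):
    """Split flow directories into runnable (has outcome.sql) and skipped (scaffold)."""
    runnable, skipped = [], []
    for flow_dir in sorted(p for p in Path(flows_root).iterdir() if p.is_dir()):
        target = runnable if (flow_dir / "outcome.sql").is_file() else skipped
        target.append(flow_dir.name)
    return runnable, skipped


with tempfile.TemporaryDirectory() as root:
    wired = Path(root) / "01-onboarding"
    wired.mkdir()
    (wired / "outcome.sql").write_text("SELECT 1;\n")
    (Path(root) / "03-scaffold-example").mkdir()  # hypothetical scaffold name
    runnable, skipped = classify_flows(root)
```

Under this rule, converting a scaffold amounts to adding its outcome.sql (and, optionally, the tool and budget files) to the flow directory.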

What's removed

  • tests/dogfood/expected/: entire directory (replaced by per-flow outcome.sql + tools-*.json)
  • l6_assert_trajectory helper: replaced by l6_score_flow

Citations (research-backed, not from memory)

Test plan

  • L1 lint passes (16 lints)
  • L2 unit (245 + 3 new schema tests = 248)
  • L3 hooks pass
  • L4 workflow-sim passes
  • L0 install-smoke (CI Docker)
  • L6 itself: the author triggers it on this PR via the L6 label, or runs it locally with a token

Follow-ups (separate issues)

🤖 Generated with Claude Code

Replaces L6 v1's brittle strict-trajectory matching with the
industry-standard multi-scorer pattern (Inspect AI / AgentEvals /
Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching:
"There is a common instinct to check that agents followed very
specific steps... too brittle and results in overly brittle tests."
"Grade what the agent produced, not the path it took."

L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic): SQL assertions on final DB state
- trajectory_required (secondary): superset, required tools called
- trajectory_forbidden (secondary): subset/safety, forbidden tools NOT called
- cost (observational): tokens + latency vs per-flow budget
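
The three secondary and observational scorers reduce to set and threshold checks. The sketch below shows one possible reading, not the actual implementation: the function names, the tool names, and the soft/hard JSON shape of the budget are all assumptions.

```python
def score_required(called, required):
    """Pass when every required tool was called; order and extra calls are ignored."""
    missing = sorted(set(required) - set(called))
    return (not missing, missing)


def score_forbidden(called, forbidden):
    """Pass when no forbidden tool appears anywhere in the trajectory."""
    hits = sorted(set(called) & set(forbidden))
    return (not hits, hits)


def score_cost(tokens, latency_ms, budget):
    """Observational: return 'ok', 'soft' (over a soft limit) or 'hard' (over a hard limit)."""
    usage = {"tokens": tokens, "latency_ms": latency_ms}
    level = "ok"
    for key, used in usage.items():
        if used > budget[key]["hard"]:
            return "hard"
        if used > budget[key]["soft"]:
            level = "soft"
    return level


# Hypothetical trajectory and budget, for illustration only.
called = ["read_state", "write_task", "read_state"]
budget = {
    "tokens": {"soft": 1000, "hard": 2000},
    "latency_ms": {"soft": 1000, "hard": 5000},
}

required_ok = score_required(called, ["write_task"])
required_bad = score_required(called, ["write_task", "notify"])
forbidden_ok = score_forbidden(called, ["delete_all"])
cost_level = score_cost(1500, 800, budget)
```

Neither trajectory check looks at call order or repetition, which is the property that distinguishes this from strict-step matching.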

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
  README.md
  outcome.sql              ← primary scorer config
  tools-required.json      ← superset list
  tools-forbidden.json     ← safety list
  cost-budget.json         ← soft/hard budget
  run.sh                   ← invokes l6_score_flow

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task,
D-direct-mode, 95-anonymous-cold-restart.
12 scaffold flows preserved; auto-skip until outcome.sql authored.

Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure).
All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI
(UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod
agent eval lessons, Hamel Husain's Inspect endorsement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen merged commit b81854e into dev on Apr 26, 2026
2 checks passed
ZaxShen deleted the feat/110-evals-v2-multi-scorer branch on April 26, 2026 17:40