🧪 feat(tests): L6 evals v2 - outcome-first multi-scorer (#110) #111
Merged
Replaces L6 v1's brittle strict-trajectory matching with the industry-standard multi-scorer pattern (Inspect AI / AgentEvals / Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching:

"There is a common instinct to check that agents followed very specific steps... too brittle and results in overly brittle tests."

"Grade what the agent produced, not the path it took."

L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic) - SQL assertions on final DB state
- trajectory_required (secondary) - superset: required tools called
- trajectory_forbidden (secondary) - subset/safety: forbidden NOT called
- cost (observational) - tokens + latency vs per-flow budget

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
  README.md
  outcome.sql - primary scorer config
  tools-required.json - superset list
  tools-forbidden.json - safety list
  cost-budget.json - soft/hard budget
  run.sh - invokes l6_score_flow

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task, D-direct-mode, 95-anonymous-cold-restart. 12 scaffold flows preserved; auto-skip until outcome.sql authored. Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure). All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI (UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod agent eval lessons, Hamel Husain's Inspect endorsement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #110.
Summary
L6 v1 (PR #109) used strict trajectory matching. Anthropic explicitly warns against this: "too brittle and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate." This PR replaces it with the industry-standard multi-scorer pattern.
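To make the brittleness concrete, here is a minimal sketch contrasting a strict sequence match with set-based required/forbidden checks. It is not the PR's `l6_score_flow` implementation; the function names and tool names are hypothetical and exist only for illustration.

```python
from typing import Iterable


def strict_match(expected: list[str], actual: list[str]) -> bool:
    """v1-style check: passes only when the exact tool sequence matches."""
    return expected == actual


def trajectory_required(required: Iterable[str], actual: Iterable[str]) -> tuple[bool, list[str]]:
    """Superset check: every required tool must appear somewhere in the run."""
    missing = sorted(set(required) - set(actual))
    return (not missing, missing)


def trajectory_forbidden(forbidden: Iterable[str], actual: Iterable[str]) -> tuple[bool, list[str]]:
    """Safety check: no forbidden tool may appear in the run."""
    violations = sorted(set(forbidden) & set(actual))
    return (not violations, violations)


# Hypothetical tool names: the agent took one extra, harmless step.
expected_v1 = ["read_profile", "create_task"]
actual = ["list_tasks", "read_profile", "create_task"]

print(strict_match(expected_v1, actual))                 # False: the extra step breaks v1
print(trajectory_required(["create_task"], actual))      # (True, [])
print(trajectory_forbidden(["delete_account"], actual))  # (True, [])
```

The strict check fails on a run that still did the right thing, while the set-based checks only fail when a required tool is missing or a forbidden one is called.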
Architecture (Inspect AI / AgentEvals primitives)
Per-flow directory with 4 scorer configs:
Each scorer writes a row to the new `eval_results` table.

Schema additions
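A rough sketch of the additive change in this section, assuming a SQLite database. The column names come from this PR; the column types, the primary key, the stand-in `debug_trajectory` definition, and the `record_result` helper are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the existing table; its real columns are not shown in this PR.
CREATE TABLE debug_trajectory (id INTEGER PRIMARY KEY, tool_name TEXT);

-- Additive cost columns.
ALTER TABLE debug_trajectory ADD COLUMN tokens_in INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out INTEGER;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms INTEGER;

-- One row per (flow, scorer) per run; types are assumed.
CREATE TABLE eval_results (
    run_id        TEXT NOT NULL,
    flow_name     TEXT NOT NULL,
    scorer_name   TEXT NOT NULL,
    pass          INTEGER NOT NULL,
    value         REAL,
    explanation   TEXT,
    metadata_json TEXT,
    PRIMARY KEY (run_id, flow_name, scorer_name)
);
""")


def record_result(conn, run_id, flow_name, scorer_name, passed,
                  value=None, explanation="", metadata=None):
    """Hypothetical helper: store one scorer verdict as one row."""
    conn.execute(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, flow_name, scorer_name, int(passed), value,
         explanation, json.dumps(metadata or {})),
    )


record_result(conn, "run-1", "01-onboarding", "outcome", True, 1.0, "all assertions held")
record_result(conn, "run-1", "01-onboarding", "cost", True, 1234.0,
              "within soft budget", {"tokens_in": 900, "tokens_out": 334})

rows = conn.execute(
    "SELECT scorer_name, pass FROM eval_results WHERE run_id = ? ORDER BY scorer_name",
    ("run-1",),
).fetchall()
print(rows)  # [('cost', 1), ('outcome', 1)]
```

Keying on `(run_id, flow_name, scorer_name)` is one way to enforce the "one row per (flow, scorer) per run" rule at the database level.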
- `debug_trajectory` += `tokens_in`, `tokens_out`, `latency_ms` columns
- `eval_results` table: `(run_id, flow_name, scorer_name, pass, value, explanation, metadata_json)`
- `schema_version` stays at 1 (additive)

What's converted
- […] `tools-forbidden` invariants)
- […] `outcome.sql` authored: pattern is copy-paste

What's removed
- `tests/dogfood/expected/`: entire directory (replaced by per-flow `outcome.sql` + `tools-*.json`)
- `l6_assert_trajectory` helper: replaced by `l6_score_flow`

Citations (research-backed, not from memory)
Test plan
- […] `L6` label OR runs locally with token

Follow-ups (separate issues)
- […] `bun install --ignore-scripts`) THEN runs L6 inside. Eliminates the L5 manual step entirely. Per user question 2026-04-26.

🤖 Generated with Claude Code