🧪 feat(tests): L6 deterministic-trajectory dogfood + opt-in capture (#108) #109

Merged: ZaxShen merged 2 commits into dev from feat/108-l6-deterministic-trajectory-tests on Apr 26, 2026

@ZaxShen ZaxShen commented Apr 26, 2026

Closes #108.

Summary

Manual L5 dogfood was the release bottleneck. L6 automates it:
pre-seed DB → run real claude -p → assert MCP/tool sequence matches FLOWS.md.

The trajectory capture is opt-in (TMB_DEBUG_TRAJECTORY=1), so production users see near-zero overhead: the hook exits immediately when the variable is unset.

What landed

Schema (additive; schema_version stays at 1)

New debug_trajectory table (15th in the DB), populated only when the TMB_DEBUG_TRAJECTORY env var is set. Indexed on (session_id, step_n) for fast L6 reads.

Capture wiring

| Source | Where |
|---|---|
| MCP tool calls | mcp/trajectory-server/src/index.ts (wraps the CallToolRequestSchema handler) |
| Non-MCP tool calls (Bash, Read, Write, Edit, Task, Skill) | New PreToolUse hook scripts/hooks/debug-trajectory.sh, registered in hooks/hooks.json with matcher: "*" |

Both are fail-soft: any capture error is swallowed, so the actual tool call never breaks.
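
The opt-in gate and the fail-soft behavior can be sketched as one small shell function. This is a minimal illustration under stated assumptions, not the contents of scripts/hooks/debug-trajectory.sh: `record_step` and `TRAJECTORY_LOG` are hypothetical names, and the sketch appends to a plain file where the real capture writes rows to the debug_trajectory table.

```shell
# Hypothetical sketch: record_step and TRAJECTORY_LOG are illustrative names.
# The real capture writes rows to debug_trajectory; a flat file stands in here.
record_step() {
  # Opt-in gate: return immediately unless capture is explicitly enabled.
  [ "${TMB_DEBUG_TRAJECTORY:-}" = "1" ] || return 0

  # Fail-soft: a failed write is silenced and ignored, so the tool call
  # that triggered the capture is never affected.
  { printf '%s\n' "$1" >> "${TRAJECTORY_LOG:-trajectory.log}"; } 2>/dev/null || true
  return 0
}
```

With the variable unset, `record_step Bash` writes nothing. With it set and an unwritable log path, the function still returns 0.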

Test infrastructure

tests/dogfood/
├── run-l6.sh                    ← runner (env + tools check, dispatch)
├── lib/flow-helpers.sh          ← l6_setup_scratch_project, l6_seed_db,
│                                  l6_run_claude, l6_assert_trajectory
├── flows/                       ← 16 flow scripts
├── fixtures/                    ← 3 pre-seed SQL files
└── expected/                    ← expected-trajectory files
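
A strict trajectory check of the kind this layout supports amounts to an ordered comparison between an expected file and the captured sequence. The sketch below is an assumption about the shape of such a check, not the body of l6_assert_trajectory: `assert_sequence` and the one-tool-name-per-line file format are hypothetical.

```shell
# Hypothetical sketch: assert_sequence is not the PR's l6_assert_trajectory.
# Assumes both files hold one tool name per line, in call order.
assert_sequence() {
  # $1 = expected trajectory file, $2 = captured trajectory file
  if diff "$1" "$2" >/dev/null 2>&1; then
    echo "PASS: trajectory matches"
  else
    echo "FAIL: trajectory differs from expected"
    return 1
  fi
}
```

Because the comparison is order-sensitive, a run that makes the same calls in a different order fails the check.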

4 fully wired flows

| Flow | Asserts |
|---|---|
| 01-onboarding | identity_get + config_get probes → identity_set + 3x config_set + ledger_log(tmb_onboarding_complete) |
| 02-simple-task | bro detects code-touching ask → triage simple → issue_create + task_create_batch + Task spawn + planning_complete |
| D-direct-mode | ≤3-line typo → Edit + commit + direct_mode_used. Hard invariants: NO task_create_batch, NO Task spawn, exactly one direct_mode_used event |
| 95-anonymous-cold-restart | Cold session with Anonymous identity must skip re-onboarding. Hard invariants: NO identity_set, NO config_set after the cold restart. Locks #95 regression. |
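
The hard invariants (a step that must never appear, an event that must appear exactly once) reduce to counting exact matches in the captured trajectory. A minimal sketch, assuming a simplified capture with one step name per line; `count_steps`, `assert_never` and `assert_exactly_once` are hypothetical helpers, not functions from lib/flow-helpers.sh.

```shell
# Hypothetical sketch: count_steps, assert_never and assert_exactly_once are
# illustrative names. Assumes one step name per line in the trajectory file.
count_steps() {
  # $1 = step name, $2 = trajectory file; prints the number of exact-line matches.
  # grep -c exits 1 on zero matches, so "|| true" keeps the helper's status clean.
  grep -c -x -F -e "$1" "$2" || true
}

assert_never() {
  [ "$(count_steps "$1" "$2")" -eq 0 ]
}

assert_exactly_once() {
  [ "$(count_steps "$1" "$2")" -eq 1 ]
}
```

For the D-direct-mode flow this would read as `assert_never task_create_batch "$file"` followed by `assert_exactly_once direct_mode_used "$file"`.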

12 scaffolded flows (auto-skip until expected-trajectory authored)

03-difficult-task, 04-agent-creator, 05-skill-creation, 06-push-gate, 07-architecture-regen, 08-swe-retry, 09-roundtable, C-consultant, 32-team-config, 92-base-branch, 94-arch-bootstrap, 96-halt-on-error.

Each is ~30 lines of pattern-match copy/paste once the expected-trajectory file is authored.
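
The auto-skip rule for scaffolded flows can be sketched as a file-existence check. This is an illustration only: `flow_status` is a hypothetical name, and the `<flow>.txt` naming inside the expected directory is assumed from the expected/ layout.

```shell
# Hypothetical sketch: flow_status is an illustrative name.
# $1 = flow name, $2 = directory holding expected-trajectory files
flow_status() {
  if [ -f "$2/$1.txt" ]; then
    echo "RUN $1"
  else
    echo "SKIP $1 (no expected trajectory authored)"
  fi
}
```

A scaffolded flow therefore starts running as soon as its expected file is added, with no change to the runner.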

CI

.github/workflows/l6-dogfood.yml β€” triggers:

  • Tag pushes (every release gets a green/red signal)
  • PRs labeled L6 (opt-in for risky doctrine changes)
  • Manual workflow_dispatch

Soft-fails when the CLAUDE_CODE_OAUTH_TOKEN secret is absent (forks / external PRs don't go red). Uploads trajectory dumps as artifacts on failure for debugging.
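
The soft-fail behavior can be sketched as a guard that a runner consults before dispatching any flow. This is a hedged illustration, not an excerpt from run-l6.sh or the workflow file; `l6_can_run` is a hypothetical name.

```shell
# Hypothetical sketch: l6_can_run is an illustrative name.
# Returns 0 when the token is available, 1 (with a notice) when it is not.
l6_can_run() {
  if [ -z "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    echo "CLAUDE_CODE_OAUTH_TOKEN not set; skipping L6 flows"
    return 1
  fi
  return 0
}

# In a runner, a missing token would end the job successfully instead of red:
#   l6_can_run || exit 0
```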

Stale doctrine cleanup (audit found 3 issues)

  • skills/tmb_first-run-onboarding/SKILL.md: fixed event_type from stale tmb_bootstrap_complete → tmb_onboarding_complete + dropped "file copies" reference (swe + pr-reviewer ship globally now)
  • skills/tmb_agent-creator/SKILL.md: dropped tmb_bootstrap reference (skill is gone in v0.3.0+)
  • CLAUDE.md: removed the "tmb_bootstrap is being retired" sentence, since it is already retired

Unverified assumption (flagged for follow-up)

claude -p mode behavior with AskUserQuestion is unknown. If the form auto-fails in headless mode, the 01-onboarding flow's trajectory will be shorter than expected, which is a real signal to address (could be a CC bug, could be a doctrine adjustment for headless mode).

The L6 design pre-seeds DB to skip past forms for every other flow. Only 01-onboarding actually exercises the form path.

Test plan

  • L1 lint passes (16 lints)
  • L2 unit passes (245 + 2 new schema tests = 247)
  • L3 hooks pass
  • L4 workflow-sim passes
  • Build green; dist/ committed up-to-date
  • L0 install-smoke (CI Docker)
  • L6 itself: runs in CI on this PR if labeled L6, OR run locally with bash tests/dogfood/run-l6.sh

Risk profile

  • Schema change is additive + opt-in: zero risk to production users.
  • MCP server wrapper is null-safe and try-wrapped, so it never breaks the real tool call.
  • New hook is matcher: "*" PreToolUse, so it fires on every tool call but exits 0 immediately when the env var is unset (~1ms overhead).
  • All L1-L4 still pass.

Follow-ups (after merge)

  1. Wake-up validation: assign label L6 on this PR (or trigger workflow_dispatch) → CI runs the 4 wired flows for the first time. If claude -p behavior surprises us, file a follow-up issue.
  2. Fill in 12 scaffold expected-trajectories: each is a one-line trajectory dump after a single test run with TMB_DEBUG_TRAJECTORY=1.
  3. Decide if L6 should run on every PR to dev. It is currently opt-in via the L6 label and could expand based on token cost data after a few runs.

🤖 Generated with Claude Code

ZaxShen and others added 2 commits April 26, 2026 02:12
…108)

Manual L5 dogfood was the release bottleneck. L6 automates it:
pre-seed DB → run real `claude -p` → assert MCP/tool sequence matches
the expected sequence from FLOWS.md.

## What's new

**Schema**: `debug_trajectory` table (15th table). Off by default;
populated only when env `TMB_DEBUG_TRAJECTORY=1`. Zero overhead in
production. Schema version stays at 1 (additive).

**Capture**:
- MCP server writes a row per MCP call when env is set (src/index.ts wrapper)
- New PreToolUse hook `scripts/hooks/debug-trajectory.sh` (matcher: "*")
  writes rows for non-MCP calls (Bash, Read, Write, Edit, Task, Skill)

**Test infra** at `tests/dogfood/`:
- `run-l6.sh` runner
- `lib/flow-helpers.sh` shared helpers
- 16 flow scripts in `flows/` (4 fully wired, 12 scaffolded)
- `fixtures/` pre-seed SQL (empty, onboarding-named, onboarding-anonymous)
- `expected/` expected-trajectory files

**4 wired flows**: 01-onboarding, 02-simple-task, D-direct-mode (with
hard invariants on no-task-spawn + direct_mode_used event),
95-anonymous-cold-restart (with assertions that no re-onboarding
writes happen; locks #95 regression).

**12 scaffolded flows** auto-skip until their expected-trajectory file
is authored. Pattern is copy/paste β€” each follow-up is ~30 lines.

**CI** at `.github/workflows/l6-dogfood.yml`: triggers on tag pushes,
PRs labeled `L6`, manual dispatch. Soft-fails when CLAUDE_CODE_OAUTH_TOKEN
secret is absent (forks won't break). Uploads trajectory dumps on failure.

**Stale doctrine cleanup** (audit):
- Onboarding skill: fixed `tmb_bootstrap_complete` → `tmb_onboarding_complete`
- Agent-creator skill: dropped tmb_bootstrap ref (skill is gone)
- Plugin CLAUDE.md: removed retirement-in-progress note for tmb_bootstrap

## Unverified assumption (flagged in #108)

`claude -p` mode behavior with AskUserQuestion is unknown. If the form
auto-fails in headless mode, the onboarding flow trajectory is shorter
than expected, which is a real signal to file as a follow-up.

## Tests

2 new schema unit tests (table presence + columns + index). All L1-L4
green. L0 will run in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When authoring an expected-trajectory file for a new flow, the workflow
is: run the flow once with TMB_DEBUG_TRAJECTORY=1, then read the
debug_trajectory table. This script does step 2 cleanly: it finds the
right DB (handles channel isolation) and prints in the L6-expected format,
ready to paste into tests/dogfood/expected/<flow>.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ZaxShen ZaxShen merged commit b270fd4 into dev Apr 26, 2026
2 checks passed
ZaxShen added a commit that referenced this pull request Apr 26, 2026
Replaces L6 v1's brittle strict-trajectory matching with the
industry-standard multi-scorer pattern (Inspect AI / AgentEvals /
Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching:
"There is a common instinct to check that agents followed very
specific steps... too brittle and results in overly brittle tests."
"Grade what the agent produced, not the path it took."

L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic): SQL assertions on final DB state
- trajectory_required (secondary): superset check, required tools called
- trajectory_forbidden (secondary): subset/safety check, forbidden NOT called
- cost (observational): tokens + latency vs per-flow budget
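
The difference from strict matching shows up in the trajectory_required scorer: every required tool must appear somewhere in the run, while extra calls and ordering are ignored. A minimal sketch under stated assumptions: `score_required` is a hypothetical name, and plain one-name-per-line lists stand in for the JSON files in the actual layout.

```shell
# Hypothetical sketch: score_required is an illustrative name. Assumes plain
# one-name-per-line lists; the real per-flow config is stored as JSON.
score_required() {
  # $1 = file of required tool names, $2 = file of captured tool names
  missing=0
  while IFS= read -r tool; do
    [ -n "$tool" ] || continue
    # Exact-line match anywhere in the capture; position does not matter.
    if ! grep -q -x -F -e "$tool" "$2"; then
      echo "missing: $tool"
      missing=1
    fi
  done < "$1"
  return "$missing"
}
```

A run that interleaves additional Read or Bash calls still passes, which is the brittleness that strict step matching could not tolerate.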

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
  README.md
  outcome.sql              ← primary scorer config
  tools-required.json      ← superset list
  tools-forbidden.json     ← safety list
  cost-budget.json         ← soft/hard budget
  run.sh                   ← invokes l6_score_flow
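
The soft/hard budget in cost-budget.json suggests a three-way classification. The sketch below is an assumption about how such a check could look, not the scorer shipped in the commit: `score_cost`, its output labels, and the reduction of the budget to two integer token limits are all hypothetical.

```shell
# Hypothetical sketch: score_cost and its labels are illustrative. Assumes the
# budget reduces to two integer token limits with soft <= hard.
score_cost() {
  # $1 = tokens used, $2 = soft limit, $3 = hard limit
  if [ "$1" -gt "$3" ]; then
    echo "over_hard"
  elif [ "$1" -gt "$2" ]; then
    echo "over_soft"
  else
    echo "within_budget"
  fi
  # Observational scorer: report only, never fail the flow.
  return 0
}
```

Returning 0 in every branch matches the "observational" role given to the cost scorer, leaving pass/fail to the outcome and trajectory scorers.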

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task,
D-direct-mode, 95-anonymous-cold-restart.
12 scaffold flows preserved; auto-skip until outcome.sql authored.

Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure).
All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI
(UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod
agent eval lessons, Hamel Husain's Inspect endorsement.

Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>