Automate L5 dogfood as L6: deterministic-trajectory tests in Docker with real Claude Code #108

@ZaxShen

Problem

Manual L5 dogfood is now the release bottleneck. Every v0.X.Y stable cut requires ~30-45 min of manual testing — install, walk 7-10 scenarios, eyeball trajectory, run release.sh. This stalls iteration speed and gates every release on human availability.

Existing L0-L4 cover MCP/hook correctness in isolation but don't exercise real Claude Code driving bro through real workflows. The gap is the single biggest reason L5 stays manual.

Proposal — L6 deterministic-trajectory tests

Core idea

Rather than mocking or trying to handle interactive AskUserQuestion, pre-seed the DB to put bro in a known state, then run claude -p "<prompt>" (non-interactive mode), then assert the resulting MCP/tool trajectory matches the expected sequence from FLOWS.md.

Each flow in FLOWS.md becomes a deterministic test:

| Test | Pre-seed | Run | Assert (expected trajectory) |
| --- | --- | --- | --- |
| Onboarding | Empty DB | claude -p "@bro hi" | identity_get, config_get, AskUserQuestion form rendered, identity_set, config_set x3, ledger_log(tmb_onboarding_complete) |
| 1st task (simple triage) | DB with completed onboarding | claude -p "@bro write a python cli todo" | tmb_project-prescan runs, tmb_lazy-regen-check runs, triage='simple', task_create_batch, Task(swe) spawn, ledger_log(planning_complete) |
| 2nd task (after 1st closed) | DB with 1st task closed | claude -p "@bro add a --limit flag" | Same shape; previous-task context picked up via issue_list / task_get |
| Bro verification (V1/V2/V3) | DB with SWE-completed task | claude -p "@bro check task 3" | task_get, git diff, verification commands, ledger_log(bro_verification_pass), task_update_status(closed) |
| Direct mode | DB with onboarding done | claude -p "@bro fix typo in README line 3" | Edit, Bash(git commit), ledger_log(direct_mode_used). NO task_create_batch, NO Task spawn. |
| Anonymous cold-restart | DB with anonymous identity | claude -p "@bro hi" | identity_get returns row → bro skips onboarding → greets |
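
The Direct mode row mixes positive assertions (calls that must happen) with negative ones (calls that must not). A minimal sketch of how one flow's expectation could be encoded and checked follows. `FlowExpectation` and `check_flow` are hypothetical names, not part of TMB, and the ordered-subsequence check is deliberately looser than an exact sequence match.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FlowExpectation:
    """Hypothetical shape for one flow's expected trajectory."""
    required: list[str]                                # names that must appear, in this order
    forbidden: set[str] = field(default_factory=set)   # names that must never appear


def check_flow(actual: list[str], exp: FlowExpectation) -> list[str]:
    """Return human-readable violations; an empty list means the flow passed."""
    problems = [f"forbidden call: {name}" for name in actual if name in exp.forbidden]
    # Ordered-subsequence check: each required name must occur after the previous one.
    remaining = iter(actual)
    for name in exp.required:
        if not any(seen == name for seen in remaining):
            problems.append(f"missing or out of order: {name}")
            break
    return problems


# Direct mode, as described in the table: edit + commit + ledger entry, no task machinery.
direct_mode = FlowExpectation(
    required=["Edit", "Bash", "ledger_log"],
    forbidden={"task_create_batch", "Task"},
)
```

Calling `check_flow(["Edit", "task_create_batch", "Bash", "ledger_log"], direct_mode)` reports the forbidden `task_create_batch` call even though every required call is present.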

Why this works (non-determinism is bounded)

User insight: "tmb's interaction MUST be deterministic in numbers of tools usage, mcp, as well as the tool name and mcp name matching" — meaning while Claude's prose varies, the MCP/tool call sequence for a given (prompt, DB state) pair is doctrine-fixed.

If bro's actual trajectory deviates from FLOWS.md, one of three things is true:

  • The doctrine is broken (real bug to fix)
  • FLOWS.md is stale (doc to update)
  • The skill/agent prompt drifted (regression to fix)

All three are valuable signals — exactly what L5 catches manually today.

Three pieces of infrastructure

1. Docker harness with real Claude Code

  • Base images: ubuntu:24.04, node:22-slim (debian-based), macos-latest (only on local; GH Actions doesn't support nested macOS), windows-server-2022 (later)
  • Install Claude Code in each: npm install -g @anthropic-ai/claude-code (or the currently documented install command, if it has changed)
  • Auth: the user has a Claude Code account and can provide CLAUDE_CODE_OAUTH_TOKEN as a repo secret in CI or via a .env file locally. Pass it to the container as an environment variable, or mount it.
  • Install TMB plugin via the --plugin-dir mode pointing at the checked-out source (so test runs against the PR's code, not the marketplace tag)

2. Debug trajectory table (gated by /debug_tmb or env var)

New table:

CREATE TABLE IF NOT EXISTS debug_trajectory (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id   TEXT NOT NULL,
  step_n       INTEGER NOT NULL,
  kind         TEXT NOT NULL,            -- 'tool_use' | 'mcp_call' | 'agent_thinking' | 'response'
  agent        TEXT,                     -- 'bro' | 'swe' | 'pr-reviewer' | etc
  tool_or_mcp_name TEXT,                 -- e.g. 'mcp__plugin_tmb_trajectory-server__identity_get' or 'Bash'
  args_json    TEXT,                     -- input args (truncated)
  result_json  TEXT,                     -- output summary (truncated)
  ts           TEXT NOT NULL
);
  • Only populated when env TMB_DEBUG_TRAJECTORY=1 (or via a /debug_tmb slash command for live debugging). Off by default — zero overhead in production.
  • Populated by a thin wrapper around the MCP server's tool-call dispatcher (writes one row per tool call) + a hook that records Bash/Read/Write/Edit calls.
  • L6 test runner reads from this table after claude -p exits to make assertions.
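
The gating and one-row-per-call behaviour described above could look roughly like this. It is only a sketch: `record_step` is a hypothetical helper, and the real MCP dispatcher wrapper may be written in a different language and structured differently. Truncation here is a plain string cut, so a truncated `args_json` value is not guaranteed to be valid JSON.

```python
import json
import os
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id       TEXT NOT NULL,
  step_n           INTEGER NOT NULL,
  kind             TEXT NOT NULL,
  agent            TEXT,
  tool_or_mcp_name TEXT,
  args_json        TEXT,
  result_json      TEXT,
  ts               TEXT NOT NULL
)
"""


def record_step(conn, session_id, kind, name, args, result, agent=None, max_len=2000):
    """Append one trajectory row; do nothing unless TMB_DEBUG_TRAJECTORY=1."""
    if os.environ.get("TMB_DEBUG_TRAJECTORY") != "1":
        return False  # off by default: no table, no writes
    conn.execute(SCHEMA)
    # Next step number within this session (1-based).
    (step_n,) = conn.execute(
        "SELECT COALESCE(MAX(step_n), 0) + 1 FROM debug_trajectory WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO debug_trajectory "
        "(session_id, step_n, kind, agent, tool_or_mcp_name, args_json, result_json, ts) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (
            session_id, step_n, kind, agent, name,
            json.dumps(args, default=str)[:max_len],
            json.dumps(result, default=str)[:max_len],
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return True
```

A dispatcher wrapper would call `record_step(conn, session_id, "mcp_call", tool_name, args, result)` after each tool call returns.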

3. Test runner

tests/dogfood/run-l6.sh
  flows/01-onboarding.test.sh
  flows/02-first-task.test.sh
  flows/03-second-task.test.sh
  flows/04-direct-mode.test.sh
  flows/05-bro-verification.test.sh
  flows/06-anonymous-cold-restart.test.sh
  flows/07-channel-isolation.test.sh
  ...

Each flow test:

  1. Spin up Docker container with TMB plugin installed
  2. Pre-seed .claude/<plugin>/trajectory.db with required state
  3. claude -p "<prompt>" with TMB_DEBUG_TRAJECTORY=1
  4. Read debug_trajectory table
  5. Compare the sequence of (kind, tool_or_mcp_name) against an expected JSON file
  6. Assert that the tool sequence matches exactly; prose variation in args is allowed
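
Steps 4 to 6 can be sketched as follows. The expected-file format (a JSON array of objects with `kind` and `name` keys) is an assumption made for illustration, since the issue does not define it, and `load_actual` / `diff_trajectory` are hypothetical helper names. A unified diff is returned on mismatch so a CI log shows exactly which step drifted.

```python
from __future__ import annotations

import difflib
import json
import sqlite3


def load_actual(conn: sqlite3.Connection, session_id: str) -> list[str]:
    """Read the recorded (kind, tool_or_mcp_name) pairs in step order."""
    rows = conn.execute(
        "SELECT kind, tool_or_mcp_name FROM debug_trajectory "
        "WHERE session_id = ? ORDER BY step_n",
        (session_id,),
    ).fetchall()
    return [f"{kind}:{name}" for kind, name in rows]


def diff_trajectory(expected_json: str, actual: list[str]) -> list[str]:
    """Return [] on an exact match, otherwise unified-diff lines."""
    expected = [f"{step['kind']}:{step['name']}" for step in json.loads(expected_json)]
    if expected == actual:
        return []
    return list(
        difflib.unified_diff(
            expected, actual, fromfile="expected", tofile="actual", lineterm=""
        )
    )
```

Only `kind` and the tool name take part in the comparison, so differences in `args_json` or `result_json` never fail a flow.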

Out of scope

  • Code quality — already covered by L1-L4. L6 only checks workflow correctness.
  • AskUserQuestion handling — pre-seed DB to skip the form entirely.
  • Token cost optimization — initially run only on release-prep PRs (not every PR).

Cost considerations

  • Each L6 invocation costs real Claude tokens (user's account). Estimate: 5-20K tokens per flow × 7 flows × 3 platforms = roughly 105K-420K tokens per L6 run, so ~300K is a reasonable planning figure.
  • Mitigation: run L6 only on release-prep/* branches and on tags, not every PR.
  • Or: run a single Linux platform on every PR, full matrix only on release.
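
Multiplying out the estimate above gives a range rather than a single number. The short calculation below uses only the per-flow, flow-count, and platform-count figures from this issue, and also shows the cost of the Linux-only option.

```python
def l6_token_range(per_flow=(5_000, 20_000), flows=7, platforms=3):
    """Return (low, high) token estimates for one full L6 run."""
    runs = flows * platforms
    return per_flow[0] * runs, per_flow[1] * runs


full_matrix = l6_token_range()             # (105000, 420000)
linux_only = l6_token_range(platforms=1)   # (35000, 140000)
print(full_matrix, linux_only)
```

Running a single platform on every PR therefore costs about a third of the full matrix per run.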

Open questions

  1. Does claude -p mode allow the plugin's AskUserQuestion skill to render? If not, we need to either (a) pre-seed past every form, or (b) use --allowedTools to skip them.
  2. How to inject CLAUDE_CODE_OAUTH_TOKEN into Docker safely. Repo secret + GH Actions env var → container env var is standard. Local runs would source .env.
  3. Does the debug trajectory table belong in the main schema or a separate file? Probably main — it's just one extra table, gated by the env var.

Acceptance criteria

  • Schema: debug_trajectory table added; populated only when TMB_DEBUG_TRAJECTORY=1
  • Wrapper around MCP dispatch writes one row per tool/mcp call when env is set
  • tests/dogfood/run-l6.sh runner with at least 3 flow tests (onboarding, first-task, direct-mode)
  • CI workflow l6-dogfood.yml runs on release-prep/* and tag pushes; uses CLAUDE_CODE_OAUTH_TOKEN repo secret
  • Documented in tests/README.md — when L6 runs, what it asserts, how to add a flow
  • At least one flow's expected-trajectory JSON committed and matched against actual

Why this matters

Today's release sequence: dev → manual L5 dogfood (30-45 min) → release.sh. With L6: dev → push tag → CI runs L6 → release.sh runs automatically. This removes the human-in-the-loop bottleneck for routine releases. Manual L5 stays as the safety net for major-version cuts and edge cases.

This is enabling tech for everything else — once L6 lands, every future release is faster and safer.

Labels

Feature (New feature or request), Priority: High (High priority, blocks meaningful workflows), Tests (Test infrastructure, L0-L6)
