72 changes: 72 additions & 0 deletions .github/workflows/l6-dogfood.yml
@@ -0,0 +1,72 @@
name: L6 dogfood (deterministic trajectory)

# L6 runs real Claude Code through pre-seeded TMB flows and asserts the
# resulting MCP/tool trajectory matches FLOWS.md. Issue #108.
#
# Triggers:
#   - manual via workflow_dispatch (always available)
#   - tag pushes (every release gets a green/red signal)
#   - PR labeled `L6` (opt-in for risky doctrine changes)
#
# Skipped on forks where the secret isn't available; the secret-presence
# check fails soft instead of breaking the run.
#
# Security note: untrusted PR-injected strings (titles, bodies, etc.) are
# never interpolated into shell commands. Only secrets and trusted runner
# inputs flow into `run:` blocks via env vars.

on:
  workflow_dispatch:
  push:
    tags:
      - 'v*'
  pull_request:
    types: [labeled]

jobs:
  l6-dogfood:
    if: ${{ github.event_name != 'pull_request' || github.event.label.name == 'L6' }}
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Verify secret is present
        env:
          TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        run: |
          if [ -z "$TOKEN" ]; then
            echo "::warning::CLAUDE_CODE_OAUTH_TOKEN repo secret not set - L6 cannot run."
            echo "Add the secret in Settings → Secrets and variables → Actions."
            exit 0
          fi
          echo "Secret present."

      - name: Setup Node 22
        uses: actions/setup-node@v4
        with:
          node-version: '22'

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2

      - name: Install plugin deps + build dist/
        run: bun install --frozen-lockfile

      - name: Install Claude Code CLI
        run: |
          npm install -g @anthropic-ai/claude-code
          claude --version

      - name: Run L6 dogfood flows
        env:
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        run: bash tests/dogfood/run-l6.sh

      - name: Upload trajectory dumps on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: l6-trajectory-dumps
          path: /tmp/tmb-l6-*/
          retention-days: 7
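
Reviewer sketch (not part of the PR): the job-level `if:` in the workflow above can be read as a small truth table. Because `pull_request` is restricted to `types: [labeled]`, the workflow event fires for any label added to a PR, and the `if:` expression is what narrows it to `L6`. The Python function below only mirrors that expression for illustration; `job_should_run` is a hypothetical name, not something in the repository.

```python
def job_should_run(event_name: str, label_name: str = "") -> bool:
    """Mirror of the workflow's job-level condition:
    github.event_name != 'pull_request' || github.event.label.name == 'L6'
    """
    return event_name != "pull_request" or label_name == "L6"


# Walk through the three documented triggers plus one non-matching label.
for event, label in [
    ("workflow_dispatch", ""),
    ("push", ""),
    ("pull_request", "L6"),
    ("pull_request", "bug"),
]:
    print(event, repr(label), job_should_run(event, label))
```

One thing worth keeping in mind when reading the steps: `exit 0` in the "Verify secret is present" step ends only that step, so the later steps still execute. Whether the run stays green without the secret therefore seems to depend on how `tests/dogfood/run-l6.sh` handles a missing token, which this diff does not show.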
42 changes: 42 additions & 0 deletions CHANGELOG.md
@@ -131,6 +131,48 @@ Net: 25 labels → 18. All open issues auto-relabeled in place.

This is the **second** label migration in the v0.4.1 pre-stable window. That is acceptable because no public consumers depend on the names yet: the rc channel has not been promoted to stable.

### Added - L6 deterministic-trajectory tests + opt-in debug_trajectory schema (issue #108)

Manual L5 dogfood was the release bottleneck. L6 automates it by pre-seeding DB state, running real `claude -p`, and asserting that the resulting MCP/tool trajectory matches the expected sequence from `FLOWS.md`. This is a new layer in the test pyramid; the existing L0-L5 layers are unchanged.

**New schema table** `debug_trajectory` (15th table):
- Columns: `session_id`, `step_n`, `kind` (`mcp_call`/`tool_use`), `agent`, `tool_or_mcp_name`, `args_json`, `result_json`, `is_error`, `created_at`
- **Off by default: populated only when env `TMB_DEBUG_TRAJECTORY=1`.** Zero overhead in production.
- Schema version stays at 1 (additive change).

**Capture wiring**:
- MCP server (`src/index.ts`) writes a row per MCP tool call when env is set
- New PreToolUse hook `scripts/hooks/debug-trajectory.sh` (`matcher: "*"`) writes a row per non-MCP tool call (Bash/Read/Write/Edit/Task/Skill)

**Test infrastructure**:
- `tests/dogfood/run-l6.sh` runner: checks env + tools, dispatches to flow scripts
- `tests/dogfood/lib/flow-helpers.sh`: shared helpers (`l6_setup_scratch_project`, `l6_seed_db`, `l6_run_claude`, `l6_assert_trajectory`)
- `tests/dogfood/flows/`: 16 flow scripts (4 fully wired, 12 scaffold)
- `tests/dogfood/fixtures/`: pre-seed SQL (empty, onboarding-named, onboarding-anonymous)
- `tests/dogfood/expected/`: expected-trajectory files (one MCP/tool call per line)
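
Reviewer sketch (not part of the changelog): the expected-trajectory format, one MCP/tool call per line, lends itself to a simple ordered comparison. The Python below is a minimal illustration under stated assumptions: the real `l6_assert_trajectory` is a bash helper whose implementation is not shown in this diff, exact-sequence matching is assumed, and `parse_expected`, `first_mismatch`, and the sample call names are hypothetical.

```python
from typing import List, Optional


def parse_expected(text: str) -> List[str]:
    """Parse an expected-trajectory file: one call name per line, blanks ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def first_mismatch(expected: List[str], actual: List[str]) -> Optional[str]:
    """Return None if the sequences are identical, else describe the first difference."""
    for i, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got:
            return f"step {i}: expected {want!r}, got {got!r}"
    if len(actual) < len(expected):
        missing = expected[len(actual)]
        return f"trajectory shorter than expected: next expected call was {missing!r}"
    if len(actual) > len(expected):
        extra = actual[len(expected)]
        return f"trajectory longer than expected: unexpected call {extra!r}"
    return None


# Made-up expected file contents, for illustration only.
expected = parse_expected(
    "mcp__plugin_tmb_trajectory-server__identity_get\nEdit\nBash\n"
)
print(first_mismatch(expected, expected))       # identical sequences
print(first_mismatch(expected, expected[:1]))   # run stopped early
```

A truncated run, such as a headless session that stalls on an interactive prompt, would show up here as the "shorter than expected" case.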

**4 fully wired flows** (have expected-trajectory files):
- `01-onboarding`: first-run identity + config writes
- `02-simple-task`: code-touching ask → triage simple → SWE spawn
- `D-direct-mode`: ≤3-line typo fix → Edit + commit, no SWE spawn (with hard invariant assertions)
- `95-anonymous-cold-restart`: regression for #95; cold session must skip re-onboarding

**12 scaffolded flows** (auto-skip until expected-trajectory authored): `03-difficult-task`, `04-agent-creator`, `05-skill-creation`, `06-push-gate`, `07-architecture-regen`, `08-swe-retry`, `09-roundtable`, `C-consultant`, `32-team-config`, `92-base-branch`, `94-arch-bootstrap`, `96-halt-on-error`.

**CI workflow** `.github/workflows/l6-dogfood.yml`:
- Triggers: tag pushes, PRs labeled `L6`, manual dispatch
- Soft-fails when the `CLAUDE_CODE_OAUTH_TOKEN` secret is absent (runs on forks won't turn red)
- Uploads trajectory dumps as artifacts on failure

**Stale doctrine cleanup** (per the migration audit):
- Onboarding skill: fixed event_type from stale `tmb_bootstrap_complete` → `tmb_onboarding_complete`; dropped reference to "file copies" (swe + pr-reviewer ship globally)
- Agent-creator skill: dropped `tmb_bootstrap` reference (skill is gone in v0.3.0+)
- Plugin CLAUDE.md: removed the "tmb_bootstrap is being retired" sentence (it's already retired)

**Unverified assumption flagged in the issue**: `claude -p` mode behavior with `AskUserQuestion`. If the form auto-fails in headless mode, that surfaces as a trajectory-shorter-than-expected failure on the onboarding flow, which is a real signal to address.

2 new schema tests (table presence + columns + index). All L1-L4 green.

---

## v0.3.2 - 2026-04-25
2 changes: 0 additions & 2 deletions CLAUDE.md
@@ -85,8 +85,6 @@ Other (non-policy) `plugin_config` keys may be written directly when the Human a
2. **Cache human_name**: use it when addressing the Human if set. Otherwise plain second-person; no honorifics.
3. **Resume check** β€” call `issue_resume(agent='bro')` to detect unfinished work.

There is no edge case for "swe.md missing" anymore β€” `swe` ships globally. The legacy `tmb_bootstrap` skill (recovery for hand-deleted local agents) is now unnecessary in v0.3.0+ and is being retired.

## Code-touching asks (in addition to first-action chain)

Default chain (most asks):
10 changes: 10 additions & 0 deletions hooks/hooks.json
@@ -37,6 +37,16 @@
"timeout": 5
}
]
},
{
"matcher": "*",
"hooks": [
{
"type": "command",
"command": "${CLAUDE_PLUGIN_ROOT}/scripts/hooks/debug-trajectory.sh",
"timeout": 3
}
]
}
]
}
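
Reviewer sketch (not part of the PR): with `"matcher": "*"` the new hook fires for every tool call, yet the changelog says it records only non-MCP calls, so the script presumably filters MCP tools itself. The Python below illustrates one way that filtering and row-building could look. It assumes the hook receives a JSON payload on stdin with `session_id`, `tool_name`, and `tool_input` fields, and that MCP tools are recognizable by the `mcp__` prefix seen in the schema comment; the actual `debug-trajectory.sh` is a shell script not shown in this diff, and `row_from_hook_payload` is a hypothetical name.

```python
import json
from typing import Optional


def row_from_hook_payload(raw: str) -> Optional[dict]:
    """Turn a PreToolUse payload into a debug_trajectory row, or None to skip it."""
    event = json.loads(raw)
    name = event.get("tool_name", "")
    # Assumption: MCP calls are already recorded by the MCP server, so skip them here.
    if name.startswith("mcp__"):
        return None
    return {
        "session_id": event.get("session_id", ""),
        "kind": "tool_use",
        "tool_or_mcp_name": name,
        # Serialize deterministically so repeated runs produce comparable rows.
        "args_json": json.dumps(event.get("tool_input", {}), sort_keys=True),
    }


bash_call = json.dumps(
    {"session_id": "s1", "tool_name": "Bash", "tool_input": {"command": "git status"}}
)
mcp_call = json.dumps(
    {"session_id": "s1", "tool_name": "mcp__plugin_tmb_trajectory-server__identity_get"}
)
print(row_from_hook_payload(bash_call))
print(row_from_hook_payload(mcp_call))
```

Given the 3-second timeout configured above, a real hook would also want to return immediately when `TMB_DEBUG_TRAJECTORY` is not set to `1`.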
34 changes: 33 additions & 1 deletion mcp/trajectory-server/dist/index.js

Generated file; diff not rendered.

2 changes: 1 addition & 1 deletion mcp/trajectory-server/dist/index.js.map

Generated file; diff not rendered.

21 changes: 21 additions & 0 deletions mcp/trajectory-server/dist/schema.sql
@@ -164,3 +164,24 @@ CREATE TABLE IF NOT EXISTS regen_state (
last_seen_sha TEXT,
notes TEXT NOT NULL DEFAULT ''
);

-- L6 deterministic-trajectory test infrastructure (issue #108).
-- Populated ONLY when env TMB_DEBUG_TRAJECTORY=1. Off by default, so zero
-- overhead in production. The L6 test runner pre-seeds DB state, runs
-- claude -p with the env set, then asserts the resulting trajectory
-- matches an expected sequence from FLOWS.md.
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id       TEXT NOT NULL,
  step_n           INTEGER NOT NULL,
  kind             TEXT NOT NULL, -- 'mcp_call' | 'tool_use' | 'agent_thinking'
  agent            TEXT, -- 'bro' | 'swe' | 'pr-reviewer' | NULL
  tool_or_mcp_name TEXT NOT NULL, -- e.g. 'mcp__plugin_tmb_trajectory-server__identity_get' or 'Bash'
  args_json        TEXT NOT NULL DEFAULT '{}',
  result_json      TEXT NOT NULL DEFAULT '{}',
  is_error         INTEGER NOT NULL DEFAULT 0,
  created_at       TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE INDEX IF NOT EXISTS idx_debug_trajectory_session
  ON debug_trajectory(session_id, step_n);
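
Reviewer sketch (not part of the PR): to make the table's intended use concrete, the Python below loads the DDL above into an in-memory SQLite database, writes rows only when `TMB_DEBUG_TRAJECTORY=1`, and reads a session back in `step_n` order (the access pattern the index supports). The real capture code lives in `src/index.ts` and a shell hook, neither shown here; `record_step`, `load_trajectory`, and the per-session step numbering are assumptions for illustration.

```python
import os
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id               INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id       TEXT NOT NULL,
  step_n           INTEGER NOT NULL,
  kind             TEXT NOT NULL,
  agent            TEXT,
  tool_or_mcp_name TEXT NOT NULL,
  args_json        TEXT NOT NULL DEFAULT '{}',
  result_json      TEXT NOT NULL DEFAULT '{}',
  is_error         INTEGER NOT NULL DEFAULT 0,
  created_at       TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_debug_trajectory_session
  ON debug_trajectory(session_id, step_n);
"""


def record_step(conn, session_id, kind, name, agent=None, env=None):
    """Hypothetical helper: append one row, but only when capture is enabled."""
    env = os.environ if env is None else env
    if env.get("TMB_DEBUG_TRAJECTORY") != "1":
        return False  # capture is off by default, so nothing is written
    # Assumed numbering: next 1-based step within this session.
    (next_n,) = conn.execute(
        "SELECT COALESCE(MAX(step_n), 0) + 1 FROM debug_trajectory WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO debug_trajectory (session_id, step_n, kind, agent, tool_or_mcp_name) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, next_n, kind, agent, name),
    )
    return True


def load_trajectory(conn, session_id):
    """Return the recorded call names for one session, in step order."""
    rows = conn.execute(
        "SELECT tool_or_mcp_name FROM debug_trajectory WHERE session_id = ? ORDER BY step_n",
        (session_id,),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
record_step(conn, "s1", "tool_use", "Bash", env={})  # ignored: env var not set
enabled = {"TMB_DEBUG_TRAJECTORY": "1"}
record_step(conn, "s1", "mcp_call", "mcp__plugin_tmb_trajectory-server__identity_get",
            agent="bro", env=enabled)
record_step(conn, "s1", "tool_use", "Bash", agent="bro", env=enabled)
print(load_trajectory(conn, "s1"))
```

The ordered list returned by `load_trajectory` is the kind of sequence an L6 flow would compare against its expected-trajectory file.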
3 changes: 2 additions & 1 deletion mcp/trajectory-server/dist/test/db.test.js

Generated file; diff not rendered.
