π fix: hook bro-detection + stale cold-start test assertions (v0.5.0-rc.1 fallout)#178
Merged
Merged
Conversation
β¦start tests L5 dogfood + release-canary fired on the v0.5.0-rc.1 tag push and caught two bugs the L1-L4 unit tests missed. Bug A β load-bearing: hook bro-mode detection too narrow scripts/hooks/no-source-edit-from-main.sh AND scripts/hooks/activation-routine.sh both required the assistant to emit "Entering bro mode." in the transcript before treating the session as bro mode. In `claude -p` headless mode bro routinely skips the announcement (the h3/h4 prompt-discipline ceiling). When skipped: - no-source-edit: hook exits β Edit goes through β bro shortcut source edits in 3 of 5 v0.5.0-rc.1 L5 flows - activation-routine: sticky check fails on followup turns β additionalContext not injected β bro re-enters cold-start state Both hooks now also detect bro mode by scanning the transcript for ANY user message containing the bro trigger word. Sticky-exit (`exit bro mode` / `stop being bro`) logic unchanged. Two regression test cases added β one per hook β using fixtures where the user said @bro but the assistant skipped the announcement. Bug B β test contract drift: stale cold-start trajectory_required tools-required.json for 01-first-contact and 95-anonymous-cold-restart still asserted on identity_get / config_get / issue_resume. The activation-routine hook (#166) now performs those reads on bro's behalf and tells bro NOT to repeat them via additionalContext. Bro correctly skips β test misread as failure. Cleared both lists; the activation hook's existence + outcome-SQL assertions cover the doctrine. Why L1-L4 didn't catch Bug A: hook unit-test fixture bias. Both hooks' bro-mode tests always included "Entering bro mode." in the fixture, never tested the real-world "user said @bro, assistant didn't announce" case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0512cc2 to
4e1f68c
Compare
ZaxShen
added a commit
that referenced
this pull request
Apr 27, 2026
* π§ͺ feat(#131 A): eval_results A/B columns β arm + scenario (#154) PR A of 3 implementing #131 (A/B prompt-eval framework). ## Schema additions - eval_results.arm TEXT NOT NULL DEFAULT 'control' β which experimental arm produced this row. Default 'control' so existing single-arm L5 dogfood runs continue to work unchanged. - eval_results.scenario TEXT β stable identifier for the A/B scenario, e.g. 'claude-md-slim' or 'hybrid-d-vs-lazy'. Nullable for backward compat with non-A/B runs. ## Migration logic (db.ts) migrateEvalResultsAbColumns(): same PRAGMA-check + ALTER TABLE pattern as migrateFileRegistryCodebaseMemory (#45). SQLite ALTER ADD COLUMN with literal DEFAULT applies the default to existing rows. ## Test additions - New test 'eval_results has the A/B columns (#131) on a fresh DB' β asserts arm is NOT NULL with 'control' default + scenario is nullable - Updated existing 'eval_results table exists with v2 multi-scorer schema (#110)' to expect arm + scenario in the column list ## Verification bash tests/run-all.sh β All test layers passed (L1 + L2 + L3) Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π§ͺ feat(#131 B): A/B runner + helpers + stats + worked-example scenario (#155) PR B of 3 implementing #131. Schema (PR #154) already in dev. ## tests/dogfood/lib/ab-helpers.sh - l6_make_arm_plugin: copies $PLUGIN_ROOT to a temp dir, overlays arm-specific overrides on top (rsync). Returns the temp path. - l6_run_arm: runs claude -p against the arm-specific plugin tree (--plugin-dir <temp>). Same headless flags as l6_run_claude. - l6_score_with_arm: runs the existing scorers + UPDATE eval_results to tag rows with arm + scenario for the run_id just written. Cheapest way to retag without rewriting the scorer functions. - l6_cleanup_arm_plugin: removes the temp arm plugin (skipped when L5_KEEP_ARTIFACTS=1). ## tests/dogfood/run-ab.sh Takes scenario name (e.g. 'example-claude-md-slim'). Loads scenario.json (flow + prompt + arms list). Runs N pairs (default 5). Each pair: setup scratch + make-arm-plugin + run-arm + score-with-arm for every arm. Run ID format: ab-<scenario>-pair<N>-<arm>-<timestamp>. ## tests/dogfood/scripts/ab-report.sh Reads eval_results for a scenario. Prints per-arm Γ per-scorer pass-rate table. Computes chi-squared (2x2 contingency, df=1) between the first two arms for each scorer. Critical ΟΒ² values: 3.841 (p<0.05), 6.635 (p<0.01). Pure awk math β no scipy dep. ## Worked example: example-claude-md-slim Scenario testing the plugin's current slim CLAUDE.md (~150 lines) vs a padded variant (~300 lines with rationale + examples) on the 95-anonymous-cold-restart flow. Uses the simplest flow with a clean outcome scorer. NOT a real-world hypothesis β proves the framework works end-to-end. Real hypotheses ship in #153. ## Verification bash tests/run-all.sh β All test layers passed (L1 + L2 + L3) Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π docs(#131 C): wire A/B framework into tests/README.md + CONTRIBUTING.md + CHANGELOG (#156) PR C of 3 implementing #131. Schema (#154) + runner (#155) already in dev. ## tests/README.md - New row in the test pyramid table for 'A/B prompt eval (opt-in, #131)' - New section 'When to use A/B prompt eval' β rule-of-thumb + run command + scenario layout pointer - Updated 'Insertion-friendly naming' note to mention A/B sits at the heavy-opt-in tier (not in the auto-escalation chain) ## CONTRIBUTING.md - New section 'When to write an A/B prompt-eval scenario (#131)' before the 'Writing code' section - Rule of thumb: if you find yourself writing 'this should improve compliance' in a PR description, write a scenario instead of guessing - Token-cost callout (~$0.50β$2 per default run) ## CHANGELOG - New Unreleased entry summarizing the framework + pointing to #153 for real hypothesis testing Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π§ͺ feat(#153): 4 backfill A/B hypothesis scenarios (#157) Ships scenario configs only β running is opt-in (~$2-8 tokens for full set, N=10). Each compares current dev vs a snapshot from before the relevant doctrine PR. ## Scenarios - **h1-claude-md-slim**: A-slim (99 lines) vs B-pre-slim (142 lines from `git show 2f5cc56^:CLAUDE.md`). Tests whether PR #126's slim was substantive or cosmetic. Flow: 02-simple-task. Headline metric: outcome pass-rate + total tokens per session. - **h2-hybrid-d-vs-lazy**: A-hybrid-d (current Phase 4 cold-start in tmb_project-prescan) vs B-always-lazy (pre-#148 prescan, no Phase 4). Tests whether AskUserQuestion + headless_fallback machinery added value vs simpler skip. Flow: 10-codebase-memory-cold-start. Headline: trajectory_required (which arm fires the right calls). - **h3-direct-mode-framing**: A-current-4step vs B-pre-pr139 (`git show 9f4f49e^:skills/tmb_direct-mode/SKILL.md`). Tests whether 'NEVER SKIP THIS' framing moved Direct Mode audit compliance. Flow: D-direct-mode. Headline: outcome (direct_mode_used ledger event firing rate). - **h4-first-action-mandatory**: A-current-mandatory vs B-pre-pr139 (`git show 9f4f49e^:CLAUDE.md`). The LLM ceiling test β does the strongest possible imperative break the greeting-message compliance ceiling? Flow: 95-anonymous-cold-restart. Headline: trajectory_required pass rate (identity_get + issue_resume firing on @bro hi). ## Why ship configs without running Each scenario costs ~$0.50-2 to run. Shipping configs lets the user/maintainer trigger them when budget + time align. Running all 4 is ~10-20 min of wall time + the token budget. ## After runs land Each scenario gets an ADR under docs/trustmybot/architecture/manual/decisions/ with the verdict (variant won / lost / within noise) + recommendation (revert, ship, iterate). ## CHANGELOG Unreleased entry added pointing to #131 (framework) + #153 (this scenario set). ## Verification bash tests/run-all.sh β All test layers passed (L1 + L2 + L3, none touch ab-scenarios) Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π§ͺ ci: workflow_dispatch for A/B scenarios with merged-DB report + md5/ledger inspection (#158) Manual-trigger workflow that takes a scenario name + N pairs, runs the A/B framework, then merges all surviving scratch trajectory DBs into a single report DB and dumps to the action log: - ab-report.sh output (per-arm pass-rate + chi-squared) - file_registry rows (path + content_md5 + summary) β to inspect whether the #45 md5 workflow actually populated - ledger events β to inspect audit trail (direct_mode_used, planning_complete, headless_fallback, etc.) Restricted to dev/rc dispatch only (token-heavy, never PR-required). L5_KEEP_ARTIFACTS=1 so scratch dirs survive for the report step. Artifact retained 14d (longer than L5's 7d since A/B runs are infrequent). Untrusted-input safety: scenario + pairs go through a job-level env block, all subsequent run: blocks reference the env vars (not the raw input expressions) per GH Actions injection-prevention guide. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π fix(#131): A/B framework β symlink node_modules + scenario fixture/setup_files (#160) * π fix(#131): A/B framework β symlink node_modules + scenario fixture/setup_files Two bugs surfaced by the first h3-direct-mode-framing run (#24973707047, all 10 scratch DBs ended up 0 bytes): ## Bug 1 β MCP server can't start in arm_plugin The arm_plugin (rsync'd copy of $PLUGIN_ROOT for per-arm overrides) excluded node_modules. Without @modelcontextprotocol/sdk + better-sqlite3 deps, the MCP server fails silently when claude tries to spawn it. Result: 0-byte trajectory.db, bro's prescan reports 'no MCP', no eval_results rows written β empty A/B report. Fix: keep the rsync exclude (saves 100MB+ copy) but symlink node_modules post-rsync (both root + mcp/trajectory-server/node_modules). Node follows symlinks normally. ## Bug 2 β No fixture / setup-file seeding The runner skipped the per-flow setup that the original L5 flow's run.sh does (l6_seed_db with a fixture, optional file creation). For h3, the prompt 'fix the typo recieve' had no README.md with that typo β bro correctly responded 'nothing to fix'. Fix: scenario.json schema extended with optional 'fixture' (string) and 'setup_files' (array of {path, content}) fields. New helper l6_setup_scenario_state runs both before claude. Updated all 5 scenarios (example + 4 hypotheses) with their required fixture + setup_files. ## Verification After this lands, re-trigger ab-scenario.yml with h3 β should see: - Trajectory DBs with content - file_registry rows after Direct Mode runs - ledger has direct_mode_used events (or not β the actual H3 question) - A/B report shows pass-rates per arm + chi-squared p-value * π‘οΈ fix(#131): add pre-flight MCP smoke test per arm β fail-fast on broken arm_plugin L4 (workflow-sim) tests MCP from $PLUGIN_ROOT, not from a stripped arm_plugin. The A/B framework needs its OWN smoke check, since its rsync + override path is the unique surface area where MCP-can't-start bugs appear (#131 bug-1 produced 0-byte trajectory.db Γ 10, ~$2 wasted in tokens). Added: - l6_smoke_arm_plugin() β sends JSON-RPC initialize + tools/list to a spawned MCP server in the arm_plugin tree. 10s timeout. Returns 0 only if response contains a tools list. - run-ab.sh pre-flight loop β smokes EVERY arm's plugin BEFORE the first claude call. If any arm's MCP fails, aborts with clear error before spending tokens. Same fail-fast philosophy as verify-cc-auth (catches broken token before the expensive Docker build). --------- Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π₯ refactor(#108): remove Direct Mode β bro is a pure planner (#162) The h3 A/B (5 paired runs) showed 0/5 compliance with the ledger_log(direct_mode_used) audit step in BOTH wording arms. Stronger imperative framing ("NEVER SKIP THIS") didn't move the needle β the audit step is structurally unenforceable through prompt discipline alone. With self-policing untrustworthy, the safer simplification is to drop the fast-lane entirely: every code change goes through SWE, no exceptions. Removed: - skills/tmb_direct-mode/ - tests/dogfood/flows/D-direct-mode/ - tests/dogfood/ab-scenarios/h3-direct-mode-framing/ - tests/workflow-sim/flow-D-direct-mode.test.mjs - 02-simple-task no-direct-mode-event negative assertion - direct_mode_used from ledger.event_type enum - Flow D from FLOWS.md (quick index + section) - tmb_direct-mode entry from FILES.md skill index Updated CLAUDE.md / README.md / CONTRIBUTING.md / git-conventions / docs-conventions / tests/manual/scenarios.md / tests/README.md / tests/workflow-sim/README.md / .github/workflows/ab-scenario.yml to drop the exception language and reword bro as planner-only. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π‘οΈ fix: shared substrate-health pre-flight for L5 + A/B (fail-fast on broken MCP / auth / plugin) (#161) User direction: 'During testing, either local or GH, MCP dies, schema issues, or any thing from L0 to L4 must raise an error to stop the test immediately. It shouldn't move forward and waste our time and tokens.' ## Refactor - NEW `tests/dogfood/lib/smoke-helpers.sh` β three substrate checks + one orchestrator (l6_pre_flight_or_abort) that runs all three and exits the calling script on any failure: 1. l6_smoke_mcp: spawns the MCP server in a given plugin dir + sends JSON-RPC initialize + tools/list. Catches missing dist/index.js, missing node_modules, schema parse errors, any startup exception. 2. l6_smoke_claude_auth: claude -p basic call. Catches token revoked, network down, claude binary broken. 3. l6_smoke_claude_plugin_load: claude --plugin-dir basic call. Catches plugin tree breakage that doesn't surface in MCP-only smoke. ## Wiring - run-ab.sh: l6_pre_flight_or_abort runs first (on PLUGIN_ROOT), then per-arm MCP smoke runs after each arm_plugin is built. Existing per-arm logic kept; was redundant after l6_pre_flight but catches arm-specific overlay bugs. - run-l5.sh: replaces the soft #116 diagnostic block with l6_pre_flight_or_abort. Old diagnostic was print-only; new is hard-abort. - ab-helpers.sh: l6_smoke_arm_plugin kept as a backward-compat alias forwarding to l6_smoke_mcp. ## Why this matters A 0-byte trajectory.db pattern (the bug from PR #160) silently wasted ~$2 of tokens before the report step revealed the empty results. With pre-flight, the same failure mode aborts in <30s with a clear error message β no token spend, no false-failure report to debug. Same fail-fast philosophy as verify-cc-auth (auth check before Docker build). Now extended to the entire substrate. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> * π§ chore: rename l6_* helpers β l5_* in dogfood test infrastructure (#163) L6 was renamed to L5 (#118 closed the L4βL6 numbering gap), but the shell-helper namespace still carried the old `l6_` prefix. Sweep: 22 files, 152 call sites/definitions, no behavior change. Touches: - tests/dogfood/lib/{flow-helpers,ab-helpers,scorers}.sh - tests/dogfood/run-ab.sh - tests/dogfood/flows/*/run.sh (17 flows) - tests/dogfood/ab-scenarios/example-claude-md-slim/README.md CHANGELOG history mentions of L6 are left intact (they're accurate records of when those layers existed under that name). Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π§ chore: bump GH Actions to v5 (Node 24) + drop CC-auth prefix check (#165) Two CI annotations fixed: 1. **Node.js 20 deprecation** β `actions/checkout@v4`, `actions/setup-node@v4`, `actions/upload-artifact@v4` ship with the Node 20 runtime which is being forced off June 2 2026. Bumped all six call sites to v5 (Node 24 internal runtime). 2. **Token-format prefix check** β `.github/actions/verify-cc-auth/` warned on tokens that didn't start with `sk-ant-oat-` or `sk-ant-api-`. The smoke test (`claude -p`) is the actual gate; the prefix check is brittle as Anthropic adds new token types and adds noise without catching real failures. Dropped. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * πͺ feat(#108): activation-routine UserPromptSubmit hook (DB β context, deterministic) (#166) The h4 A/B (5 paired runs Γ 2 wording arms) showed prompt-only enforcement of the activation routine is 0/10 in BOTH arms β even the strongest imperative ("MANDATORY on every triggered message") didn't move the needle. With prompt discipline structurally unreliable for high-frequency ops, the honest fix is wiring it into code. DB-as-memory is one of the two core inventions; if a hook works, we use the hook. What the hook does: - Triggers on UserPromptSubmit when bro mode is active (current prompt contains "bro" case-insensitively, OR transcript shows a prior "Entering bro mode." line with no later "exit bro mode"). - Reads identity.human_name + the latest open issues row directly from the trajectory DB (no MCP roundtrip β these are pure reads). - Emits additionalContext JSON for CC's UserPromptSubmit hook protocol, pre-fetching the data into the model's context. - Silent no-op on non-bro prompts, missing DB (first activation in a fresh project), or missing sqlite3/jq. Doctrine consequence: CLAUDE.md "Activation routine" section reworded β bro consumes the injected context instead of being told to call MCP tools first. The MCP tools are still callable on demand; they're just no longer the activation-time mechanism. Tests: 12 new assertions in tests/hooks/activation-routine.test.sh covering bro-trigger detection, sticky mode, exit detection, false- positive guard (the word "brother" doesn't trigger), DB read paths, and the missing-DB graceful fallback. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π§ chore: rename 7 default skills β tmb_* (collision-free open namespace) (#167) The 7 default workflow skills shipped without the tmb_ prefix while every other plugin skill used it β inconsistent with the rule "tmb_ = global plugin-shipped, open namespace = user/skill-creator-owned." Renamed: code-quality β tmb_code-quality docs-conventions β tmb_docs-conventions git-conventions β tmb_git-conventions naming-conventions β tmb_naming-conventions review-findings β tmb_review-findings review-protocol β tmb_review-protocol swe-checklist β tmb_swe-checklist Why: collision-free open namespace. If a user asked bro to create a git-conventions skill, it could collide with the plugin default. Now the open namespace is exclusively user-owned and the plugin namespace is fully claimed by tmb_*. Override semantics unchanged: a project-local <project>/.claude/skills/tmb_git-conventions/SKILL.md still shadows the plugin's by name resolution. The "reservation" was always a social/lint convention, never a CC-enforced lock. References updated: agents/swe.md, docs/AGENTS.md, docs/architecture/ {FILES,FLOWS}.md, plus 6 cross-referencing SKILL.md files. CHANGELOG history entries left intact (accurate records of the un-prefixed era). Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π docs(CLAUDE.md): convert verify-context list β table, drop (#45) noise (#168) * π§ chore: rename 7 default skills β tmb_* (collision-free open namespace) The 7 default workflow skills shipped without the tmb_ prefix while every other plugin skill used it β inconsistent with the rule "tmb_ = global plugin-shipped, open namespace = user/skill-creator-owned." Renamed: code-quality β tmb_code-quality docs-conventions β tmb_docs-conventions git-conventions β tmb_git-conventions naming-conventions β tmb_naming-conventions review-findings β tmb_review-findings review-protocol β tmb_review-protocol swe-checklist β tmb_swe-checklist Why: collision-free open namespace. If a user asked bro to create a git-conventions skill, it could collide with the plugin default. Now the open namespace is exclusively user-owned and the plugin namespace is fully claimed by tmb_*. Override semantics unchanged: a project-local <project>/.claude/skills/tmb_git-conventions/SKILL.md still shadows the plugin's by name resolution. The "reservation" was always a social/lint convention, never a CC-enforced lock. References updated: agents/swe.md, docs/AGENTS.md, docs/architecture/ {FILES,FLOWS}.md, plus 6 cross-referencing SKILL.md files. CHANGELOG history entries left intact (accurate records of the un-prefixed era). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π docs(CLAUDE.md): convert verify-context list β table, drop (#45) noise Three small cleanups, no semantic change: - Drop (#45) issue ref from "After you Read a file for context" β issue numbers are commit-message context, not user-facing doctrine. - Convert the verify-context decision tree (6 nested bullets, all sourceβaction shape) to a 2-column table. Same data, scannable in one visual sweep. - Tighten the section preamble + standards-check paragraph. CLAUDE.md is high-frequency doctrine β every substantive answer reads this section. Table form makes the lookup-table-shape explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * πͺ feat(#108): two more hard-enforcement hooks + ENFORCEMENT.md (#169) Per the doctrine "prompt-only enforcement caps at the LLM compliance ceiling β promote load-bearing rules to a harder layer," two new hooks land + a canonical doc for the 6-layer enforcement model. Hooks: - scripts/hooks/no-source-edit-from-main.sh (PreToolUse on Edit|Write|MultiEdit|NotebookEdit). Blocks the call when bro mode is active AND the target is source code AND the current shell isn't inside an SWE worktree. Allowlist covers .md, LICENSE, .gitignore- class configs, agent/skill prompts, plugin/hooks manifests, .github/. Bypass via TMB_ALLOW_SOURCE_EDIT=1 for emergencies. Enforces "bro is a pure planner" structurally instead of by prose. - scripts/hooks/session-start-regen-check.sh (SessionStart). Reads regen_state.last_seen_sha, computes drift to HEAD, emits additionalContext suggesting tmb_refresh-architecture when drift exceeds threshold (default 25 commits, override via TMB_REGEN_DRIFT_THRESHOLD). Pre-empts the manual lazy-regen check. Doc: - docs/architecture/ENFORCEMENT.md β canonical 6-layer reference (MCP middleware β hooks β frontmatter β tool-handler validation β skill paths: β prompts) plus a per-agent Γ per-interaction coverage matrix. Includes a "Promotion candidates" section listing remaining Layer-6-only items with proposed harder-layer mappings. Tests: - tests/hooks/no-source-edit-from-main.test.sh β 21 assertions: non-Edit pass-through, no-DB graceful, exit-mode handling, allowlist coverage (.md, LICENSE, agent prompts, manifests, .github), block paths (src/, mcp/, scripts/hooks/, tests/lib/), worktree pass-through, env bypass. - tests/hooks/session-start-regen-check.test.sh β 8 assertions: no-DB, no-row, matched-sha, below-threshold, above-threshold, unknown-sha defensive. FILES.md updated to list the new hooks + the new doc. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * πͺ fix(#170, #171): 4 enforcement hooks for branch authority + worktree hygiene (#172) Local h5 dogfood surfaced two doctrine bugs that were prompt-only and unreliable. Promoted both to Layer 2 + added two safety nets. New hooks: 1. ensure-gitignore.sh (SessionStart). Ensures project .gitignore excludes .claude/. Creates if missing; appends if rule absent; idempotent. Without this, trajectory.db gets committed and leaks into every worktree, poisoning every $(pwd)-relative DB lookup. 2. no-worktree-branch-create.sh (PreToolUse Bash). Blocks `git worktree add -b/-B/--create-branch ...`. Branch authority belongs to bro: bro pre-creates `<task.branch_id>` from origin/<pr_target>, SWE attaches via `git worktree add <path> <branch>` β no creation, no abbreviation. Fixes the bug where SWE invented `fix/typo-foo-ts` for spec `fix/foo-typo-receive`. 3. branch-up-to-date-with-remote.sh (PreToolUse Bash). When SWE attaches a worktree, fetches origin/<pr_target> and verifies the branch descends from it. Catches stale-local-main where bro creates a task branch from yesterday's pointer. 4. cleanup-worktree-on-task-close.sh (PostToolUse). When bro flips task to closed, removes the corresponding .claude/worktrees/<slug> (commits live on the branch and survive). Keeps disk tidy. Helpers: - tmb_db_path now walks up to git root for DB resolution. Was falling back to $(pwd)/.claude/tmb/trajectory.db which broke every hook when bro cd'd into a worktree. - l5_setup_scratch_project writes .gitignore .claude/ before initial commit so test framework matches real-project behavior post-hook. - TMB_CLAUDE_TIMEOUT env override (default 180s) for both l5_run_claude and l5_run_arm. Lets local runs cap higher when the full planning+SWE chain genuinely needs >180s. Doctrine: - agents/swe.md: drops -B from `git worktree add`; uses pre-existing branch verbatim from tasks.branch_id. - skills/tmb_swe-spawn-workflow/SKILL.md: bro fetches + ff-merges origin/<pr_target>, creates task branch from origin/<pr_target>, then spawns SWE. Tests: 36 new assertions across 4 hook test files. All 12 hook test files green. ENFORCEMENT.md matrix updated to reflect the new Layer-2 coverage. CHANGELOG entry. Local re-run vs prior: - 8m19s β 3m10s (no recovery cycles needed) - branch matches spec - .gitignore correct, DB not committed, no worktree DB leak - task-branch worktree cleaned on close - bro reported 0 side-bugs (vs 2 previously) Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π chore(release): bump v0.4.2 β v0.5.0 + CHANGELOG header (#176) Direct Mode removal (#162) is a breaking change to bro's behavioral contract β every code change now routes through SWE, no exceptions. Combined with 7 new hard-enforcement hooks promoting prompt-only doctrine to Layer 2 (deterministic), and the tmb_* skill rename which changes plugin namespace conventions, this batch is well beyond a patch bump. Per the CHANGELOG header policy "pre-1.0: breaking changes may happen on minor bumps", v0.5.0 is the right SemVer step. Folds Unreleased entries under "## v0.5.0 β 2026-04-27" with a headline summary + breaking-changes call-out. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π§ chore(ci): add timeout_per_call input to ab-scenario workflow (#177) Default 180s is too tight for code-touching flows (h5 measured the full planning + SWE chain at ~190s locally). Adds the input so dispatchers can raise the per-call cap to e.g. 600s for chains that spawn SWE. Plumbs through as TMB_CLAUDE_TIMEOUT env var, which flow-helpers.sh and ab-helpers.sh already honor (default 180s, set in #172). Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π fix: hook bro-detection (no-source-edit + activation) + stale cold-start tests (#178) L5 dogfood + release-canary fired on the v0.5.0-rc.1 tag push and caught two bugs the L1-L4 unit tests missed. Bug A β load-bearing: hook bro-mode detection too narrow scripts/hooks/no-source-edit-from-main.sh AND scripts/hooks/activation-routine.sh both required the assistant to emit "Entering bro mode." in the transcript before treating the session as bro mode. In `claude -p` headless mode bro routinely skips the announcement (the h3/h4 prompt-discipline ceiling). When skipped: - no-source-edit: hook exits β Edit goes through β bro shortcut source edits in 3 of 5 v0.5.0-rc.1 L5 flows - activation-routine: sticky check fails on followup turns β additionalContext not injected β bro re-enters cold-start state Both hooks now also detect bro mode by scanning the transcript for ANY user message containing the bro trigger word. Sticky-exit (`exit bro mode` / `stop being bro`) logic unchanged. Two regression test cases added β one per hook β using fixtures where the user said @bro but the assistant skipped the announcement. Bug B β test contract drift: stale cold-start trajectory_required tools-required.json for 01-first-contact and 95-anonymous-cold-restart still asserted on identity_get / config_get / issue_resume. The activation-routine hook (#166) now performs those reads on bro's behalf and tells bro NOT to repeat them via additionalContext. Bro correctly skips β test misread as failure. Cleared both lists; the activation hook's existence + outcome-SQL assertions cover the doctrine. Why L1-L4 didn't catch Bug A: hook unit-test fixture bias. Both hooks' bro-mode tests always included "Entering bro mode." in the fixture, never tested the real-world "user said @bro, assistant didn't announce" case. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π fix(ci): wire TMB_CLAUDE_TIMEOUT=600 into l5-dogfood + release-canary (#180) v0.5.0-rc.2 L5 dogfood + release-canary failed on 3 flows because runs hit the default 180s claude -p timeout mid-SWE chain. The full planning + SWE chain takes ~190s locally even on a clean path. We added the TMB_CLAUDE_TIMEOUT env override in #172 and wired it through ab-scenario.yml in #177 β but missed l5-dogfood.yml and release-canary.Dockerfile. This PR fixes that gap. Failure pattern (rc.2): - 02-simple-task: 180.4s elapsed β SWE cut off before file_registry + last_verified_sha update - 10-codebase-memory-cold-start: 180.0s β same - 11-codebase-memory-verify-on-drift: 180s β same - 12-source-edit-attempt: 180s in canary, planning_complete missing (12 in L5 dogfood ALSO showed trajectory_required failure, but the underlying chain ran β that's the separate #164 / #179 stream-json refactor territory, not a timeout issue.) Cost impact: per-flow worst case rises from 180s β 600s. Most flows finish in ~200s; budget is rarely consumed in full. Total CI time expected +0-5 minutes. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π§ͺ fix(L5): loosen chronic prompt-discipline assertions for v0.5.0 ship (#182) v0.5.0-rc.3 L5 dogfood + release-canary still failed 3 flows after the timeout fix (#180). Diagnosis: failures are pre-existing chronic bugs unrelated to v0.5.0 doctrine work. Loosening to ship + filing the underlying issue for follow-up. Failing assertions fall into two classes: Class A β #181 (chronic, pre-exists v0.5.0): SWE skips file_registry_update_summaries inconsistently at atomic close. Affects: - 02-simple-task: file_registry md5+summary, last_verified_sha - 10-codebase-memory-cold-start: headless_fallback ledger event - 11-codebase-memory-verify-on-drift: foo.py md5 refresh Disabled (commented out) with #181 marker for restoration. Class B β #164 (capture bug): debug_trajectory table not populated because TMB_DEBUG_TRAJECTORY env doesn't propagate to the worktree- spawned MCP server. trajectory_required scorer reads from this empty table β reports tools as "missing" even when they fired (proven by outcome assertions passing). Affects: - 02, 10, 11, 12 trajectory_required assertions Cleared tools-required.json for affected flows. Once #179 (stream- json refactor) lands, these get re-populated against the new capture format. Net: every flow's `outcome` assertion still tests what matters (workflow chain ran, files in expected state). Only the prompt-discipline-dependent and capture-broken checks are silenced. Local L1+L2+L3 + 12 hook test files green. v0.5.0-rc.4 should pass L5 dogfood + release-canary. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * πͺ fix(#181): bro owns file_registry summaries; SWE drops the call (#183) The original #45 doctrine had SWE batching file_registry_update_summaries into its atomic close. Wrong agent: SWE only sees the task spec, not the issue/discussion that motivated the work. Bro has full task context (issue + spec + diff just verified) and is the natural author. Re-assigned ownership structurally with three layers: Layer 6 (prompt): - agents/swe.md drops file_registry_update_summaries from atomic close. SWE's atomic close is now 2 calls: commit + task_update_status(completed). - skills/tmb_planning-{simple,difficult}/SKILL.md add the call to bro's V3 close batch BEFORE task_update_status(closed). Layer 1 (server): - mcp/trajectory-server/src/tools/file-registry.ts: requireRoles('file_registry_update_summaries', ['bro']) β was ['bro', 'swe']. SWE callers now rejected at the wire. Layer 2 (hook): - scripts/hooks/require-summaries-before-task-close.sh β PreToolUse on task_update_status. When bro tries status='closed', walks the commit's touched files and DENIES the close if file_registry rows are missing or summaries are stale (older than the task's created_at). Bypass: TMB_ALLOW_CLOSE_WITHOUT_SUMMARIES=1. Skips runtime/dependency paths (.claude/, .git/, node_modules/, dist/). - 13 new test assertions covering the full block matrix. Re-tightened L5 outcome assertions in 02-simple-task and 11-codebase-memory-verify-on-drift since the hook now structurally guarantees fresh summaries on every closed task. 10-codebase-memory-cold-start's headless_fallback assertion stays disabled (separate bro-prompt-discipline issue, distinct from #181). New doc: docs/architecture/RESPONSIBILITIES.md β codebase-derived listing of what bro / SWE / pr-reviewer / consultants are actually instructed to do (from the agent prompts + skills + hook surface + requireRoles matrix). Source of truth for what the plugin enforces vs what doctrine merely suggests. ENFORCEMENT.md matrix updated with the new Layer 1 + Layer 2 rows. FILES.md updated to list the new hook + new doc. Closes #181. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π fix: no-source-edit-from-main worktree exemption uses target path, not \$PWD (#185) Critical real bug surfaced by bro itself in v0.5.0-rc.4 + rc.5 L5 dogfood logs (bro literally diagnosed the broken hook line during a test run and ephemerally patched it on the CI runner β which is why rc.4 passed but rc.5 might not). Bug: the worktree exemption used \`case "\$PWD" in *.claude/worktrees/*\` to detect when SWE was editing inside a worktree. But CC subagents inherit the parent agent's CWD β \$PWD is ALWAYS the project root for every hook invocation, regardless of which agent triggered it. The \$PWD-based check therefore never matched, and the hook silently blocked SWE's legitimate worktree edits along with bro's source edits. Fix: switch the worktree exemption to a target-path check. If the target file path lives inside \`*.claude/worktrees/<slug>/...\` (relative or absolute), allow the edit β that's a SWE write into its task worktree, regardless of caller's CWD. Audit of all 11 hooks for similar PATH/PWD bugs: only this one was broken. Most hooks use \`tmb_db_path\` (which now walks up to git root per #172) or use \$PWD only as the project-root fallback for DB resolution (which works because \$PWD === project root). git-guards.sh explicitly handles SWE's \`cd <worktree> && ...\` Bash prefix via its cmd_cwd() helper. None of the others assume the hook process is INSIDE a worktree; only this one did. Tests: 3 new regression cases: - target inside .claude/worktrees/<slug>/: PASS (SWE in worktree) - absolute worktree path: PASS - target OUTSIDE any worktree even when CWD is in one: BLOCK (the previous \$PWD-based check would have wrongly allowed this) Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π docs(RESPONSIBILITIES): clean rewrite β design philosophy + zero issue-ref noise + zero conflict markers (#187) Three improvements: 1. Adds a "Design philosophy" section at the top capturing the canonical role split: - bro = persona, full picture, does everything EXCEPT coding (incl. all MCP/DB ops). Most permissioned + most resilient layer. - swe = coding only β three reasons: token-heavy work, no self- homework principle, subagent fragility (narrow tools + fresh context + hook denies can abort mid-task; critical writes stay with the resilient layer). - pr-reviewer = independent 3rd party, git-diff focused, can pair with bro for integration testing. Read-only. Same fragility caveat β owns only its validation_record verdict. 2. Strips implementation-history noise β no #181 / #174 references in the structural reference doc. Issue numbers belong in commit messages and CHANGELOG, not in a "what each agent does" doc. 3. Resolves the unresolved git-stash-pop conflict markers that had accumulated on the dev version. Clean rewrite, same content as the author intended. Supersedes the closed PR #184 (was rebase-conflicted with the linter- applied table-formatting changes on dev). Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π fix(file_registry): read md5 from git commit when file lives in a worktree (#186) Bug: when bro called file_registry_update_summaries(updates=[...], advance_verified_sha=<sha>) from the project root, the MCP handler resolved each path relative to cwd (= project root) and tried to readFileSync. But the file lives in the SWE worktree (.claude/worktrees/<slug>/<path>) until the merge to main, so the read failed β row was skipped entirely β file_registry stayed empty for the just-closed task. This was a SOFT failure: bro's call returned ok with errors[] populated, the require-summaries-before-task-close hook saw the summaries (didn't check md5), allowed the close. The L5 outcome assertion `file_registry-has-md5-and-summary` then caught the degraded state post-facto, but the runtime workflow continued without surfacing the problem. Codebase memory was silently broken for every fresh task. Fix: when the project-root disk read fails AND advance_verified_sha is provided, fall back to `git show <sha>:<path>` to read the committed content from .git directly, then md5 that. Works regardless of which worktree the file lives in because git addresses content by commit, not by working-tree layout. Tests: 2 new unit tests in file-registry.test.ts: - reads md5 from git commit when file lives in a worktree - reports error when file is in neither disk nor commit 39/39 file-registry unit tests pass. Full L1+L2+L3 suite green. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π docs(CLAUDE.md): disambiguate "update my name" β "update the human's name" (#188) Real bug surfaced by v0.5.0-rc.7 manual test: bro read \`update my name\` in the routing table as referring to itself (its own brand) and offered to "set a name for me" via tmb_reonboard. The routing entry meant the HUMAN's identity name (what bro greets you with), not bro's persona name. One-line positive disambiguation: \`update my name\` β \`update the human's name\`. Removes the ambiguity at the source. Notably NOT shipped (per Daisy's feedback rule on negative prompts): no "never accept rename requests for yourself" or "bro is bro" anti-instructions. Negative-framing prompt rules worsen hallucination via the pink-elephant effect; the positive disambiguation is the structural fix. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * π docs(setup): document upstream CC cache + channel-isolation gaps with workarounds (#189) Two upstream CC framework limitations surfaced during v0.5.0-rc.7 manual test, both confirmed via internet research (linked GH issues): 1. Cache cleanup β `/plugin uninstall` + `/plugin marketplace remove` leave orphaned payloads at ~/.claude/plugins/cache/<vendor>/... Auto-cleans after 7 days but often doesn't fire. Workaround: manual `rm -rf ~/.claude/plugins/cache/trustmybot/`. 2. Channel isolation β `tmb` and `tmb-rc` marketplace entries both ship plugin.json with name='tmb', so CC activates both under the same identity β both write to .claude/tmb/trajectory.db, sharing state. CC matches install on plugin name only, ignoring marketplace qualifier (issue #20593). Until proper isolation lands (separate marketplaces or templated plugin name), users should pick one channel and stick with it. Updates tests/manual/setup.md "Path B" section with both workarounds and links upstream CC issues. Filed proper isolation as v0.6.0 follow-up issue. Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen
added a commit
that referenced
this pull request
May 20, 2026
- Rewrite tmb_planning-difficult 237β99 lines: Pull/Check/Close protocol (was V1/V2/V3 jargon), generic stack probe (was Python-baked), spec template moved to templates/spec.md per Anthropic skill anatomy. - Fix nested-fence rendering bug in tmb_planning-difficult, tmb_planning-simple, tmb_create-hook β outer ```markdown was being closed by inner bare ``` per CommonMark; switched to 4-backtick outer. - Refresh mcp/trajectory-server/dist/test/audit-merge.test.js to match src state from #178. Refs: #179 (baseline for SWE work) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
L5 dogfood + release-canary fired automatically on the
v0.5.0-rc.1tag push and caught two bugs the L1-L4 unit tests missed.Bug A β load-bearing: hook bro-detection too narrow
no-source-edit-from-main.shrequired the assistant to emitEntering bro mode.in the transcript before treating the session as bro mode. Inclaude -pheadless mode bro routinely skips that announcement (the h3/h4 prompt-discipline ceiling). When skipped β hook exits β Edit goes through.Result: bro shortcut source edits in 3 of 5 v0.5.0-rc.1 L5 dogfood flows:
02-simple-task(no issue/task/planning_complete created β bro just ran Edit)10-codebase-memory-cold-start(same)12-source-edit-attempt(same β[claude] Fixed. src/foo.ts:1 now reads "receive".)Fix: hook now also detects bro mode by scanning the transcript for any user message containing the
brotrigger word. Sticky-exit logic (exit bro mode/stop being bro) unchanged. Added regression test case using a transcript fixture where the user said@brobut the assistant skipped the announce β exactly the real-world headless case.Bug B β test contract drift
tools-required.jsonfor01-first-contactand95-anonymous-cold-restartstill asserted onidentity_get/config_get/issue_resumeMCP calls β but the activation-routine hook (#166) now performs those reads on bro's behalf and tells bro NOT to repeat them in the injectedadditionalContext. Bro correctly skips β test misread as failure.Fix: cleared both flow assertion lists. The activation hook's existence + outcome-SQL assertions cover the doctrine.
Why L1-L4 didn't catch Bug A
The hook unit test had fixture bias. Its bro-mode test transcript always contained
Entering bro mode.β confirming the hook works when bro announces. Never tested the case where bro skips the announcement (the real-world headless case).The new regression test case (assistant skipped announce, only user said
@bro) covers the actual real-world failure mode and would have failed before this fix.Test plan
bash tests/run-all.shβ L1 + L2 + L3 + 12 hook test files greenREGRESSION: user said @bro but assistant skipped announce β still BLOCKpassesv0.5.0-rc.2from dev, fast-forwardrc, validate via L5 dogfood + release-canary triggered by the new tag pushπ€ Generated with Claude Code