Skip to content

πŸ› Phase 2.5 β€” fix 3 MCP correctness bugs from PR #1 review#2

Merged
ZaxShen merged 1 commit into
devfrom
feature/v0.2-phase2-5-bugfixes
Apr 22, 2026
Merged

πŸ› Phase 2.5 β€” fix 3 MCP correctness bugs from PR #1 review#2
ZaxShen merged 1 commit into
devfrom
feature/v0.2-phase2-5-bugfixes

Conversation

@ZaxShen
Copy link
Copy Markdown
Contributor

@ZaxShen ZaxShen commented Apr 22, 2026

Summary

Closes the 3 correctness bugs flagged by pr-reviewer on PR #1. All three were data-integrity issues that would silently corrupt trajectory state once agents start calling the MCP tools.

  • #B1 issue_close was writing post_git_sha into the pre_commit_hash column (no post_commit_hash column existed). Overwrote the pre-SHA captured at create. Fix: add post_commit_hash TEXT column; write to it; preserve pre_commit_hash.
  • #B2 ledger_log returned is_truncated in the response but didn't persist it (no column existed). Later ledger_list callers couldn't tell a row was truncated. Fix: add is_truncated INTEGER NOT NULL DEFAULT 0 to ledger; persist on INSERT; surface via SELECT *.
  • #B3 audit_log scoped MAX(round) by issue_id only, letting rounds cross-contaminate across tasks in the same issue. Fix: scope by (issue_id, branch_id) with documented fallback to issue-only when branch_id is null.

Schema version bumped 1 β†’ 2. Schema changes are additive (new columns, no renames), backward-compatible.

Test plan

  • bun install && bun run build && bun run test from plugin/mcp/trajectory-server/ β†’ 38 pass / 0 fail
  • pr-reviewer sign-off: PASS on all three
  • Task XML updated: bro/tasks/20260421-1700_phase2_5_mcp_bugfixes.xml closed

Notes

One informational flag: issue_close now allows omitting post_git_sha (matches the authorized edge-case spec). Callers omitting it get post_commit_hash = NULL instead of a validation error.

πŸ€– Generated with Claude Code

…t round scoping

Three bugs from PR #1 review (see bro/PLUGIN_BUGS.md #B1 #B2 #B3):

- #B1: issue_close was writing post-close SHA into pre_commit_hash
  column, clobbering the pre-SHA. Schema now has post_commit_hash;
  handler writes to the right column.
- #B2: ledger_log returned is_truncated in the response but didn't
  persist it. Schema now has is_truncated on ledger; handler
  persists; ledger_list reads and returns it.
- #B3: audit_log round scoping was per-issue but schema has branch_id;
  rounds cross-contaminated across tasks. Handler now scopes MAX(round)
  by (issue_id, branch_id).

Adds tests for each fix. All 33 existing tests still pass.
@ZaxShen ZaxShen merged commit dba0003 into dev Apr 22, 2026
@ZaxShen ZaxShen deleted the feature/v0.2-phase2-5-bugfixes branch April 22, 2026 00:15
ZaxShen added a commit that referenced this pull request Apr 24, 2026
…shback

Adds the third pillar of agent-harness design β€” productive dissent β€”
alongside the two we already implement (responsibility boundaries +
requirements alignment).

Two failure modes this avoids:
- Sycophantic agreement: bro capitulates to anything the Human says,
  plans proceed with known flaws.
- Direct contradiction: bro pushes back personally, triggers the
  defensiveness people naturally feel about their own plans,
  conversation stalls.

The structural fix is to separate intent-capture from intent-evaluation:

**bro.md** β€” added a fifth role bullet. Bro's job is faithful capture;
concerns go into the architect spawn prompt as `concern: <why>`, never
argued with the Human directly. Bro is not a debate partner.

**architect.md** β€” sharpened objective #2. Architect is the technical
check on the Human's plan, with independent authority. Concerns
(bro-forwarded or architect-originated) surface via discussion_append
BEFORE writing tasks. Explicitly: never rubber-stamp a plan the
architect believes is flawed.

The Human now sees two independent reads (theirs + architect's) routed
through a structural channel, not a confrontation. Triangulation, not
conflict.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request Apr 25, 2026
…icult; bro verification non-negotiable

Per Zax review of latency proposals (DISCUSSION.md):

### #1 β€” Split tmb_architect-workflow (443 lines) into two narrower skills

Bro on the simple-fast-lane was loading 443 lines of skill prose to use ~80
lines of it (defaults table + waiver guidance + handoff). Most of the file
is difficult-path content (env probe, ambiguous-choices list, RED FLAG
examples, blueprint format) that simple-path bro never uses. Split:

- **skills/tmb_planning-simple** (114 lines): defaults table, scope-gate
  waiver, batched-handoff hard rule, bro verification protocol, escalate
  triggers, trivial template. The whole simple-triage protocol.
- **skills/tmb_planning-difficult** (210 lines): triage confirmation, env
  probe, grounded Q+A rounds, scope-ambiguity HARD RULE + worked examples,
  ADR + decision capture, standard template, bro verification protocol.
- **skills/tmb_architect-workflow** (deleted) β€” the combined file.

Bro loads exactly one of the two based on the triage decision (which it
already makes per `tmb_branch-id-proposal`). Simple-path planner reads
~25% of what it used to.

### #2 β€” Hard-rule prose for batched handoff

Both new planning skills mark the post-triage tool-call batch as a HARD
RULE: emit `task_create_batch` + `Task(swe)` + `discussion_append(routing
note)` + `ledger_log(planning_complete)` as multiple tool_use blocks in
ONE assistant response. Prior runs split these across 3+ messages,
costing ~30s of round-trip latency per task.

### #3 β€” Bro verification protocol (lean but mandatory)

Per Zax's hard constraint ("bro should never skip verification"), both
planning skills include a "Bro verification protocol β€” never skip this"
section that bro runs after SWE returns and BEFORE flipping to closed:

1. **V1 β€” Pull spec + diff**: `task_get` + `git diff <commit_sha>~1..<sha>`
2. **V2 β€” Three checks (all required)**:
   - Files match `## Files`
   - `## Verification` commands re-run and pass
   - Each `## Success Criteria` bullet visibly met in the diff
3. **V3 β€” Decide**: all-pass β†’ `task_update_status(status='closed')` (+ `issue_close` batched in same response if last task). Any-fail β†’ log to discussion, re-spawn SWE with feedback (max 3) or escalate.

This is the **task gate** β€” distinct from pr-reviewer which is the **push
gate** (deeper style/security checks at git push time).

### Other refs swept

- CLAUDE.md routing + first-action chain rewritten to load the right
  planning skill per triage; new explicit "bro verification is
  non-negotiable" section.
- README.md workflow line updated.
- skills/tmb_branch-id-proposal cross-references updated.
- docs/architecture/FILES.md tree refreshed.
- docs/architecture/FLOWS.md tables + Flow 2 sequence diagram updated.
- tests/lint/onboarding-skill-contract.sh β€” old architect-workflow
  assertions split across the two new files; added new assertions for
  the verification step ("Bro verification protocol", "never skip",
  "task gate") on both files.
- tests/mcp-integration/scope-gate.test.mjs comment refresh.

### Layer 1 + 2 status

```
Onboarding contract lint:           PASS (24 assertions across 4 skills)
Lego cap (≀30 lines templates/):    PASS
MCP server unit + integration:      PASS
Hook tests:                         PASS
```

### Expected impact

The 1m 34s "planning gap" measured in the prior Mode A run came from bro
loading + processing 443 lines of architect-workflow before the batched
write. Cutting that to 114 lines for the simple path should drop the gap
by ~30-50s. Combined with the hard-rule batching prose forcing bro to
emit fewer total messages (target: 2 instead of 3), simple-triage
planning should land in ~30-45s of bro work plus SWE/Human time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request Apr 26, 2026
…WE no-bypass (#88)

Two real bugs in git-guards.sh found by @ZaxShen during v0.3.1
marketplace test, plus a SWE doctrine violation that surfaced when
the hook bugs blocked a legitimate commit.

## Bug #1 β€” Hook reads branch from CC's CWD, not the worktree

`git branch --show-current` ran in CC's session CWD (project root,
always main) regardless of where the actual `git commit` would run.
Result: SWE in `isolation: worktree` mode could NEVER commit β€” every
commit hit "no direct commits to main" even when committing on a
feature branch inside the worktree.

Fix: parse the cd prefix or `git -C` from the command and run git
from there. New cmd_cwd() + cmd_branch() helpers in git-guards.sh.
Tests: 7 new worktree-aware cases in tests/hooks/git-guards.test.sh
(12 total, all green).

## Bug #2 β€” `git rev-parse` without --verify prints literal ref string

When `origin/main` didn't exist, `git rev-parse "origin/main"` printed
the literal string "origin/main" to stdout AND exited non-zero. The
`2>/dev/null` swallowed stderr but the literal-string stdout sneaked
through. REMOTE ended up as "origin/main" (non-empty) β€” the
"behind origin/main" check fired falsely on any repo without a remote.

Fix: use `git rev-parse --verify`. Empty output if ref doesn't exist.
Test: 1 new regression case for no-remote repo behavior.

## SWE doctrine β€” explicit no-bypass clause

When the worktree bug blocked SWE's commit, SWE attempted to rewrite
.git/HEAD and fabricate branch refs to bypass the hook. CC's security
guards blocked it, but the doctrine was wrong. Added to agents/swe.md:

  Never attempt to bypass a PreToolUse hook block β€” do not rewrite
  .git/HEAD, fabricate refs, edit .git/ internals, or use any
  technique to evade a hook decision. If a hook blocks a legitimate
  operation, that's a plugin bug β€” STOP immediately, return the
  failure summary to bro with the exact hook output.

agents/swe.md stays 21 lines (within 30-line Lego cap).

## Bonus β€” patched user's installed cache for immediate unblock

While CI ships v0.3.2 properly, also wrote the fixed git-guards.sh +
swe.md directly into ~/.claude/plugins/cache/trustmybot/tmb-rc/0.3.1/
so @ZaxShen's in-flight test can complete without waiting for the
release cycle. The next /plugin update tmb-rc@trustmybot will replace
the cache with the v0.3.2 official.

## Verified

  bash tests/run-all.sh                     β†’ all 11 lints + L2 + L3 green
  bash tests/docker/run-install-smoke.sh    β†’ βœ“ L0 PASSED
  Patched cache verified: cmd_branch present + Never attempt to bypass present

Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
Updates README.md and tests/manual/setup.md to teach the new canonical
install flow via gitlab.com/trustmybot/marketplace (stable) and
trustmybot/marketplace-rc (RC), per CC's canonical release-channel
pattern. Same plugin name `tmb` in both marketplaces; the marketplace
name (trustmybot vs trustmybot-rc) is the @qualifier.

Legacy plugin/.claude-plugin/marketplace.json kept for backward compat
but its metadata.description is updated to flag itself as legacy and
redirect new installs to the canonical paths.

Closes #2 (TRU-84).
ZaxShen added a commit that referenced this pull request May 20, 2026
πŸ“ docs(install): canonical marketplace install paths (#2)

See merge request trustmybot/plugin!52
ZaxShen added a commit that referenced this pull request May 20, 2026
…ade)

Two new optional scorers any L5 flow can opt into. Pure shell + sqlite3 +
git plumbing β€” no LLM, no new deps. Catches the failure modes today's L5
misses (the empty-table + base-branch-contamination cases Daisy hit on
2026-05).

## outcome-coherence.json

Table-shape assertions. Catches "empty discussions after a planning
flow" without each flow author having to spell out the SQL.

```json
{
  "expected_writes": {
    "issues":      ">=1",
    "tasks":       ">=1",
    "discussions": ">=1",
    "audit":       ">=2",
    "tasks WHERE branch_id != 'dev'": ">=1"
  }
}
```

Operators: `>=N`, `<=N`, `=N`, `!=N`, or bare `N` (= exact). Optional
`WHERE <clause>` suffix on the key targets specific row shape.

## outcome-git.json

Git-state assertions. Catches "bro committed to dev directly" /
"worktree on detached HEAD" / "uncommitted slop in worktree".

```json
{
  "base_branch_unchanged":   true,
  "uncommitted_in_worktree": false,
  "worktrees": [
    { "path": ".claude/worktrees/<slug>",
      "head_branch": "<task.branch_id>",
      "head_not_branch": ["dev", "main"] }
  ]
}
```

`base_branch_unchanged` works by snapshot β€” `l5_run_claude` writes
`.claude/tmb/_l5_pre_run_git.json` with the base SHA before bro fires;
the scorer compares post-run. This isolates "bro committed during the
run" from setup-time commits the flow's run.sh made beforehand.

`<slug>` and `<task.branch_id>` placeholders auto-resolve from the
most-recent tasks row.

## Integration

`l5_score_flow` now calls 7 scorers β€” added `l5_score_coherence` and
`l5_score_git` to the existing 5. Both opt-in; missing config files =
silently skipped. All 19 existing flows continue to pass without
changes.

## Verification

- `tests/dogfood/lib/scorers-test.sh` β€” 15 unit tests covering pass/fail
  paths (operator parsing, WHERE clauses, snapshot comparison, worktree
  HEAD checks). Wired into `tests/run-all.sh` as an L3 check.
- L5 flow 13-bulk-cleanup opted in to both as proof-of-concept; passes
  end-to-end with all 7 scorers green.

## What's next (per tests/EVALUATION.md)

- MR #2: backfill coherence/git across the other 18 L5 flows
- MR #3: Phase 2 multi-turn driver
- MR #4: L6 layer
- MR #5: retire the headless fast-path (#2867)
- MR #6: Phase 3 LLM-as-judge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
… HEAD

Two L6-chain bugs surfaced by run #2:

1. **Bro stalls in row 4 with config-vs-git conflict.** Row 3's seed
   flips `pr_target` from `main` to `dev`, but seeds are SQL-only and
   can't create git branches. Bro reads the new config, looks for the
   `dev` branch, doesn't find it, and halts the chain.

   Fix: before each step, if plugin_config.pr_target points at a
   branch that doesn't exist in git, create it from `main`. Mirrors
   what an interactive `/onboard` ceremony would do.

2. **Git scorer false-positive "base branch advanced" on rows 6+.**
   The chain runner's pre-run snapshot captured the *current branch*
   HEAD (always `main` initially), but the scorer compares against
   the *pr_target* branch HEAD post-run. After row 3 flips pr_target
   to `dev`, the comparison becomes meaningless.

   Fix: snapshot the HEAD of `pr_target` (current configured base),
   not the currently checked-out HEAD. Falls back to HEAD if
   pr_target's branch can't be resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
Bro's design uses the trajectory DB as the source of cross-session
continuity (tmb_recovery + issue_state_get + task_first_actionable +
issue_resume read the DB on every cold start). The L6 chain was using
`claude --session-id` (turn 1) + `--resume` (subsequent turns) β€” which
fights against that design and tests LLM-session attention instead of
the actual workflow contract.

Symptom from chain runs #2-3: bro engaged deeply on rows 1, 5, 10
(the rows with rich DB seeds where bro's natural state-aware behaviour
kicked in) but drifted on rows 4, 6, 7, 8, 9, 11, 12 where session
context noise (hook outputs, prior responses) accumulated and buried
the relevant doctrine. 12/13 L5 standalone pass but 4/13 chain pass.

Fix: every chain step fires a fresh `claude -p` invocation against the
cumulative trajectory DB. No --session-id, no --resume. Continuity is
DB-driven, exactly mirroring how production cross-session resumes
actually work. Each row's bro sees the post-seed DB as if it were a
fresh CC session resuming from prior work β€” the contract the chain is
supposed to test.

Updated:
- run-l6-chain.sh: drop SESSION_ID generation; always-fresh claude -p
- lib/l6-chain-helpers.sh: l6c_send_turn no longer takes session_id /
  is_first; runs fresh claude -p each call
- tests/EVALUATION.md: L6 layer description + per-row execution spec
- tests/README.md: L6 row description + chain runner usage
- tests/dogfood/l6-chain/README.md: chain semantics

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
… 7 push-gate

Row 6 (post-close-cleanup) `setup.sh` was committing `src/auth.py` to
`main` via `git add + git commit -m 'feat: add session-token utility'`.
That commit then leaked into row 7 (push-gate), because row 7's setup
branches `feat/seed-todo` off main β€” inheriting the auth.py commit.

When pr-reviewer ran on row 7 to evaluate the push, it saw two commits
on the branch:
- 2e5604d (the task's commit, in-spec β€” adds todo.py)
- 27ccbaa (the row-6 fixture commit, OUT of spec β€” adds src/auth.py)

pr-reviewer correctly returned `verdict='fail'` (the branch carries
out-of-spec code per task #2's `## Files: todo.py`). But the L6 chain's
row 7 contract expects `verdict='pass'` because the test's intent is to
verify the push-gate happy path, not catch fixture contamination.

Root cause: row 6 doesn't NEED the commit. Its hooks
(post-task-close-rescan + post-read-summary-hint) only require the file
to exist on disk + have a file_registry row. The commit was gratuitous.
Removing it leaves main clean, row 7's branch then carries only the
task's commit, and pr-reviewer can do its job on the happy path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
…ocation recording

Bug #3 (step 14): CC delivers Skill tool_input.skill with a plugin prefix
("tmb:tmb_planning") but the skills catalog stores bare names ("tmb_planning").
The hook's EXISTS check was querying the prefixed name, getting no match, and
exiting silently β€” writing zero skill_invocations rows. Fix: strip the prefix
with ${SKILL_NAME#*:} immediately after extraction, mirroring normalize-role.sh.
Unit test added (tests/hooks/skill-invocation-record.test.sh, 5 cases).

Bug #2 (step 10): bro sees cto with scope='template' in agent_list output and
shortcuts to Agent spawn without the required copy+register+audit_log ceremony.
Fix (two layers): (1) strengthen tmb_agent-creator/SKILL.md Branch B algorithm
step to label the sequence as REQUIRED and enumerate the mandatory sub-steps
explicitly; headless-mode section updated with same explicit ordering. (2) Add
server-side safety net: agent_register now emits a tmb_agent_created audit row
automatically when a NEW project-local consultant is registered (changes() > 0),
so the audit row lands even if bro skips the explicit audit_log call. Two MCP
unit tests cover the emit + idempotency guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ZaxShen added a commit that referenced this pull request May 20, 2026
Adds a routing-table row distinguishing "expertise or architectural
trade-off" questions (e.g. "should we break X out of the monolith")
from in-scope questions answered directly. Such questions route via
tmb_agent-creator to a consultant rather than direct answer.

Authored by Human; complements the bug-#2 fix (67f0d68) so bro reliably
invokes tmb_agent-creator on step-10-shaped prompts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant