tmb_project-prescan + skills: parallel-batch failures cascade on empty/new repos #66

@ZaxShen

Symptom (observed in PR #64 Layer 3 dogfood)

Bro fires 4–5 parallel Bash exploratory calls during tmb_project-prescan on a fresh project. Several of them fail-by-design on an empty repo (git log --oneline -5 exits 128 when there are no commits; ls .claude/skills/ errors when the dir doesn't exist), and CC's parallel runtime cancels the entire batch when any sibling fails — even the calls that would have succeeded standalone.

Bro then retries serially. If the retry batch also contains a fragile call, it gets cancelled too. Two cascading cancellations were observed in a single prescan, wasting ~9 tool-uses and consuming subagent context.

Verbatim trace (from #64 Mode B run, fresh /tmp/tmb-smoke)

⏺ Bash(git status)              → Cancelled: parallel tool call git log errored
⏺ Bash(git log --oneline -5)    → Error: Exit code 128 (no commits)
⏺ Bash(ls -1)                   → Cancelled: parallel tool call git log errored
⏺ Bash(find ...)                → Cancelled
⏺ Bash(ls .claude/agents/ ...)  → Cancelled

[bro retries serially]

⏺ Bash(git status)              → succeeds
⏺ Bash(ls -1)                   → empty
⏺ Bash(find ...)                → empty
⏺ Bash(ls .claude/agents/ ...)  → Cancelled AGAIN (different transient failure)
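
The two fail-by-design calls can be reproduced outside CC in a throwaway empty repo. This is a minimal sketch, assuming git is on PATH; the temp directory stands in for /tmp/tmb-smoke, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Create a fresh repo with no commits (stand-in for a new project).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Fragile call 1: git log fails when the repo has no commits yet.
git log --oneline -5 >/dev/null 2>&1
log_rc=$?

# Fragile call 2: ls fails when the directory does not exist.
ls .claude/skills/ >/dev/null 2>&1
ls_rc=$?

# Safe call: git status succeeds on an empty repo.
git status >/dev/null 2>&1
status_rc=$?

echo "git log rc=$log_rc"
echo "ls rc=$ls_rc"
echo "git status rc=$status_rc"
```

Run standalone, only the first two calls report a non-zero code, which matches the trace: git status and ls -1 were cancelled as siblings, not because they failed themselves.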

Root cause

This is CC platform behavior (parallel-batch-cancellation on any sibling failure) interacting with prompt design that doesn't account for it. The fix is in our prompts.

Two fixes

Fix A — harden tmb_project-prescan per-call

In skills/tmb_project-prescan/SKILL.md, gate fragile calls behind a cheap probe:

# BEFORE (fragile — cancels whole batch on empty repo)
git status
git log --oneline -5
ls -1
ls .claude/agents/
find . -name "package.json"

# AFTER (probe first, branch on result, batch survivors)
# git rev-parse prints the commit hash on success, so discard its output;
# otherwise the captured value would be "<hash> yes" instead of "yes".
HAS_COMMITS=$(git rev-parse --verify -q HEAD >/dev/null 2>&1 && echo yes || echo no)
HAS_AGENTS_DIR=$([ -d .claude/agents ] && echo yes || echo no)

# Then batch only what's known-safe
if [ "$HAS_COMMITS" = yes ]; then git log --oneline -5; fi
if [ "$HAS_AGENTS_DIR" = yes ]; then ls .claude/agents/; fi

Or simpler: append || true to every prescan Bash call so non-zero exits don't cascade. Trade-off: the exit status is always 0, so a genuine failure shows up only in the command's stderr text and no longer as a tool error. Probe-first is cleaner.
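
A possible middle ground between the two options is a small wrapper that reports the real exit code as text while still returning 0. This is a sketch only: run_soft is a hypothetical helper name, not something that exists in the skill or in CC today.

```shell
# run_soft is a hypothetical helper, not an existing tmb_* or CC feature.
# It runs one command, prints the real exit code, and always returns 0,
# so the call cannot fail its batch but the error detail is preserved.
run_soft() {
  local out rc
  out=$("$@" 2>&1)   # capture stdout and stderr together
  rc=$?              # exit code of the wrapped command
  printf 'cmd=%s rc=%s\n' "$*" "$rc"
  if [ -n "$out" ]; then printf '%s\n' "$out"; fi
  return 0
}

# Example usage with two of the fragile prescan calls.
run_soft git log --oneline -5
run_soft ls .claude/agents/
```

Compared with a bare || true, the rc= line keeps the original exit code readable in the output, at the cost of one extra helper to define in each call.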

Fix B — global guidance for tmb_* skills

Add a new section to tmb_project-prescan, or a new top-level rule in CLAUDE.md:

Parallel-batching rule. Never batch a known-fragile call (anything that can exit non-zero on a valid project state — empty repo, missing dir, no remote, etc.) with other calls in the same assistant response. CC cancels the entire batch on any single failure. Either probe first, or run fragile things solo, or append `|| true` to absorb the exit code.

This applies to bro's prescan, lazy-regen-check, and any other skill that issues exploratory Bash batches.
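
As an illustration of the rule on a case not shown above, the "no remote" state can be handled in the same ways. This is a hedged sketch, assuming git is on PATH and using a throwaway repo; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Fresh repo with no remotes configured.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Fragile form: exits non-zero because no remote named origin exists.
git remote get-url origin >/dev/null 2>&1
fragile_rc=$?

# Probe-first form: query the URL only when origin is configured.
# An if with a false condition and no else branch exits 0.
if git remote 2>/dev/null | grep -qx origin; then git remote get-url origin; fi
probe_rc=$?

# Absorbing form: || true swallows the exit code; stderr is still printed.
git remote get-url origin || true
absorb_rc=$?

echo "fragile=$fragile_rc probe=$probe_rc absorb=$absorb_rc"
```

Only the fragile form reports a non-zero code, so only that form would be unsafe to batch with sibling calls.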

Acceptance

  • tmb_project-prescan updated with probe-first pattern (Fix A).
  • CLAUDE.md or tmb_project-prescan body includes the parallel-batching guidance (Fix B).
  • Layer 3 dogfood on a fresh empty project shows ≤ 1 cancellation event (ideally 0) during prescan.

Priority

Medium — doesn't break the chain (bro recovers serially), but bleeds context budget on every fresh-project run. Worth fixing before next dogfood retest.

Related

Surfaced during PR #64 Layer 3 dogfood verification. The same hypothesis applies to tmb_lazy-regen-check, which does similar exploration in fresh-DB scenarios: audit it and apply the same hardening.
