Skip to content

Fix managed terminal launch race (bare-zsh orphans) (#408)#409

Merged
dhilgaertner merged 2 commits into
mainfrom
feature/crow-408-managed-terminal-launch-race
Jun 3, 2026
Merged

Fix managed terminal launch race (bare-zsh orphans) (#408)#409
dhilgaertner merged 2 commits into
mainfrom
feature/crow-408-managed-terminal-launch-race

Conversation

@dhilgaertner
Copy link
Copy Markdown
Contributor

Closes #408

Problem

Managed Claude/Codex terminals created by /crow-workspace (and the batch variant) intermittently came up as a bare zsh shell with no agent running (~40% under load; 26/63 cockpit windows were orphaned on the reporting host). Two race modes, both because setup.sh launched the agent out-of-band:

  1. Blind-paste race. setup.sh did crow new-terminal --managedsleep 3crow send "<cd && claude …>". The paste landed before the shell's line editor (zle) was live, so the keystrokes were dropped → bare zsh. The naive --command fix was also racy: TmuxBackend.registerTerminal spawned the wrapper as the window root and then immediately sendText(command) with no readiness wait.
  2. No-window / swallowed failure. Under load TmuxController.newWindow's tmux subprocess exceeded the 2 s timeout, threw TmuxError.timedOut, and the new-terminal RPC only logged it — registering a window-less terminal while returning success.

Fix

The app already had a race-free launch (SessionService.launchAgent fired on the .shellReady sentinel), but only for restored terminals. This routes brand-new managed terminals through the same path:

  • Defer the launch. new-terminal holds the command in AppState.pendingLaunchCommands and registers the window with command: nil; the deferred paste happens in wireTerminalReadiness on .shellReady (when zle is live), consuming both the pending command and autoLaunchTerminals membership so launchAgent can't double-fire. The deferred paste replicates the send RPC's hook-config + OTEL env prep, so hook routing and telemetry are preserved.
  • Surface failures. Bounded retry of registerTerminal with a longer per-call new-window budget; on exhaustion the terminal is marked .failed and the RPC returns launch_failed instead of reporting success.
  • setup.sh launches via new-terminal --command (no sleep/send), keeps its own --rc/--name (the app can't inject for a cd && claude form), parses launch_failed, and polls a new readiness field on crow list-terminals to warn non-fatally if the agent never started.
  • Conservative reaper at launch kills only cockpit windows that no terminal references AND that sit at a bare login shell — never a window running an agent or one a terminal still references.

Note on the ticket's "window root command is the agent" criterion: the wrapper is intentionally the window root (it sources shell config + writes the readiness sentinel). This is reinterpreted as "each window reliably runs the agent, zero bare idle shells."

Tests

  • Pure helpers: resolveLaunch, registerWithRetry, shouldReapWindow.
  • SessionService readiness-branch test (deferred paste consumes pending + autoLaunch; .timedOut does not).
  • Concurrent N-managed-terminal real-tmux regression asserting every window runs the launched process with zero bare shells.
  • Serialized the TmuxBackend integration suite so its timing-sensitive tests don't contend.

All package + app suites pass (make test equivalent run per package).

🤖 Generated with Claude Code

Managed Claude/Codex terminals created by /crow-workspace intermittently
came up as a bare zsh with no agent. Two race modes, both because setup.sh
launched the agent out-of-band:

1. Blind-paste race: setup.sh did `new-terminal --managed` + `sleep 3` +
   `crow send`, pasting the launch command before the shell's line editor
   (zle) was live, so the keystrokes were dropped. Even `--command` was racy
   because registerTerminal pasted immediately without waiting for readiness.
2. No-window swallowed failure: under load `new-window` exceeded its 2s
   timeout, threw, and the new-terminal RPC only logged it — registering a
   window-less terminal while reporting success.

Fix — route brand-new managed terminals through the existing race-free
deferred-launch path (the one already used for restored terminals):

- new-terminal RPC holds the launch command in AppState.pendingLaunchCommands
  and registers the window with command:nil; SessionService pastes it on the
  `.shellReady` sentinel (zle is live then), consuming both the pending command
  and autoLaunch membership so launchAgent can't also fire. The deferred paste
  replicates the `send` RPC's hook-config + OTEL env prep.
- Bounded retry of registerTerminal with a longer per-call new-window timeout;
  on exhaustion mark the terminal .failed and return launch_failed instead of
  reporting success.
- setup.sh launches via `new-terminal --command` (no sleep/send), keeps its own
  --rc/--name (the app can't inject for a `cd && claude` form), parses
  launch_failed, and polls the new `readiness` field on `list-terminals` to
  warn (non-fatally) if the agent never started.
- Conservative reaper at launch: kill only cockpit windows that no terminal
  references AND that sit at a bare login shell (never a window running an
  agent, or one a terminal still references).

Tests: pure resolveLaunch/registerWithRetry/shouldReapWindow helpers, a
SessionService readiness-branch test, and a concurrent N-managed-terminal
real-tmux regression asserting every window runs the launched process with
zero bare shells. Serialized the TmuxBackend integration suite so its
timing-sensitive tests don't contend. All package + app suites green.

🐦‍⬛ Generated with Claude Code, orchestrated by Crow

Co-Authored-By: Claude <noreply@anthropic.com>
Crow-Session: 1B4F94D3-4045-4DB0-9442-BA35B6050D6A
Copy link
Copy Markdown
Collaborator

@dgershman dgershman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code & Security Review

The deferred-launch design is the right shape: routing brand-new managed terminals through the existing .shellReady sentinel that already serves restored terminals is a clean reuse of a proven path, and pendingLaunchCommands being in-memory-only correctly avoids re-pasting a stale command on relaunch. The added readiness field on list-terminals plus the launch_failed error code give setup.sh a real success/failure signal instead of the prior fire-and-forget. Test coverage is solid: pure unit tests over resolveLaunch/registerWithRetry/shouldReapWindow, an instance test over the .shellReady branch, and a concurrent N-managed-terminal real-tmux regression. Below are the issues that should land in the same round trip.

Code Quality

🟡 Yellow — hydrate-fresh path still races on the same launch command (Sources/Crow/App/SessionService.swift:309). Brand-new managed terminals now persist terminal.command = launch_cmd (the cd … && claude … form), where pre-PR they persisted nil because setup.sh deferred the launch out-of-band. The hydrate fallback at rebuildHydratedSessionWithTmux (when tmuxBinding is nil or adoptTerminal throws — tmux socket changed, window was killed, post-reboot) calls TmuxBackend.shared.registerTerminal(..., command: terminal.command, ...), and registerTerminal synchronously blasts that command via sendText(id, text: command + \"\\n\") before any sentinel fires (Packages/CrowTerminal/Sources/CrowTerminal/TmuxBackend.swift:234). That's exactly the dropped-keystrokes race this PR set out to remove — just shifted to the recovery path. Two reasonable shapes: (a) on the hydrate-fresh path, pass command: nil to registerTerminal and rely on the existing autoLaunchTerminals + launchAgent rebuild that hydrate already wires up for restored terminals, or (b) re-stage the same pendingLaunchCommands + deferred-paste machinery on this fallback. Either fix keeps the regression from sneaking back in via the recovery path.

🟡 Yellow — registerWithRetry blocks the main actor for up to 12 s (Sources/Crow/App/AppDelegate.swift:1349, Packages/CrowTerminal/Sources/CrowTerminal/TmuxController.swift:26). await MainActor.run { … } wraps the whole RPC body, and registerWithRetry(attempts: 3) { … newWindowTimeout: 4.0 } is invoked synchronously inside it. TmuxController.run blocks the calling thread until the subprocess returns or kill() fires at timeout, so a worst-case three timed-out attempts pins the main actor for ~12 s — long enough to beachball the UI and stall every other MainActor.run-dispatched RPC (e.g. concurrent setup.sh invocations from /crow-batch-workspace). The previous behavior was 2 s and broken; the new behavior is reliable but visibly stalls under the same load this PR is trying to harden. Hoist the registration off the main actor (Task.detached returning the TmuxBinding, then re-enter MainActor for the state mutations), or at minimum bound the total budget (e.g. attempts: 2 with backoff and a shorter per-call timeout) so the worst case stays inside the previous user-perceived window.

🟡 Yellow — prepareManagedLaunchText duplicates the send RPC handler (Sources/Crow/App/SessionService.swift:415 vs Sources/Crow/App/AppDelegate.swift:1478). The OTEL env-var list, the agent-token guard, and the hook-config write are copy-pasted; the PR even adds a // NOTE: keep the two in sync comment, which is a maintenance trap rather than a design. Extract a single helper (e.g. on the agent or a shared static on AppDelegate) that returns (textWithExports, didLaunch) and call it from both sites. Bonus: the send RPC sets terminalReadiness = .agentLaunched only when the launch token is present, whereas pasteDeferredLaunch sets it unconditionally — a shared helper makes that asymmetry visible and easy to align.

🟡 Yellow — DeferredLaunchTests mutates singleton TmuxBackend.shared.onReadinessChanged without .serialized (Tests/CrowTests/DeferredLaunchTests.swift:1). Each test calls service.wireTerminalReadiness(), which overwrites the shared closure, and then fires TmuxBackend.shared.onReadinessChanged?(terminalID, …). If Swift Testing runs the two @MainActor tests in parallel (the default), one test can stomp the other's closure before its sentinel fires — same flake mode you just hit on the TmuxBackend integration suite and serialized (Packages/CrowTerminal/Tests/CrowTerminalTests/TmuxBackendTests.swift:14). Add .serialized to the @Suite so this stays green under concurrent test execution.

Security Review

Strengths

  • pendingLaunchCommands is documented and implemented as in-memory-only (Packages/CrowCore/Sources/CrowCore/AppState.swift:226), so a launch command can't be replayed across app restarts.
  • close-terminal clears pendingLaunchCommands alongside terminalReadiness and autoLaunchTerminals (Sources/Crow/App/AppDelegate.swift:1428) — no orphan command can survive a terminal removal.
  • The reaper is conservative: shouldReapWindow only fires when the window is both unreferenced and sitting at a bare login shell, and reapUnboundCockpitWindows defensively unions bindings.values so a window adopted/registered this run is structurally unreapable (Packages/CrowTerminal/Sources/CrowTerminal/TmuxBackend.swift:386). The 8 s delay in AppDelegate is belt-and-suspenders given hydrate runs synchronously on the main actor.

Concerns

  • None new — no auth, secret-handling, or injection surface is touched. terminal_readiness in setup.sh parses crow-owned JSON it just produced, so the regex-on-JSON shape is fragile but not a security boundary.

Summary Table

Color Meaning Verdict effect
Red Must fix Request changes
Yellow Should fix Request changes
Green Consider Approve allowed

Recommendation: Request Changes — driven by [0 Red, 4 Yellow, several Green] findings. The deferred-launch shape is correct; the four Yellows are the kind that compound (a regression on a recovery path + a main-actor stall + duplicated launch prep + a flaky test) and worth catching in this round rather than as #408 follow-ups.


🐦‍⬛ Reviewed by Crow via Claude Code

@dhilgaertner dhilgaertner self-assigned this Jun 3, 2026
@dhilgaertner
Copy link
Copy Markdown
Contributor Author

Live finding while setting up this PR's own workspace — a window-index-reuse bug the reaper exposes. (Observed from ~/.local/share/crow/crash-logs/crow-20260603T034449.log.)

Sequence on the fix build (app started 22:44):

  1. ✅ Reaper worked perfectly — Reaped 26 orphaned bare-shell cockpit window(s) (#408) (cockpit windows 63 → 37, 0 orphans). Honest readiness surfacing also worked (setup.sh warned instead of false-success). Both are wins.
  2. ❌ But the newly-created managed terminal 387D645A never launched its agent — first_prompt_timeout budget_ms=30000, sentinel_exists=false, state=timedOut. The readiness diagnostics show why:
    --- tmux state ---  target=crow-cockpit:4   display_message=68270 2.1.159
    --- process tree --- 70384 … /Users/.../claude --permission-mode auto   ELAPSED 01-07:06:44
    --- pane capture --- <capture-pane failed>
    --- sentinel --- exists=false
    
    Terminal 387D645A was associated with crow-cockpit:4, a surviving window already running a different, long-lived claude --permission-mode auto (the Manager, PID 70384, up ~1d7h) — not a fresh window. So no wrapper ran for 387D645A, the readiness sentinel was never written, and it timed out.

Root cause hypothesis: after the reaper kills windows, tmux window-index reuse lets a new terminal's stored window index collide with a surviving window (here the Manager's). The deferred-launch-on-.shellReady path then waits forever for a sentinel that a wrapper in this terminal would have written — but the mapped window is someone else's agent.

Why it didn't corrupt the Manager: the launch is sentinel-gated, so the deferred paste never fired into window :4. Good — but if any code path did paste on a weaker signal, it would inject cd … && claude … into the Manager's live session. Worth an explicit guard: never bind a managed terminal to a window already running an agent, and key window identity on something stable (a per-terminal @crow_terminal_id tmux option) rather than the reusable window index — especially now that the reaper actively frees indexes for reuse.

Repro: with orphans present, open a session so the reaper runs, then create a new managed terminal — it can land on a recycled index pointing at a live window.

Also: the app stopped running shortly after (socket now refused). Couldn't tell from this log alone whether it crashed or was quit; flagging in case it's related.

… serialization

Four Yellow findings from the review:

- Hydrate-fresh recovery race: brand-new managed terminals no longer persist
  the launch command on the SessionTerminal row (persist `registerCommand`,
  nil when deferring). The launch lives only in in-memory pendingLaunchCommands,
  so the hydrate-fresh fallback can't blind-paste it into a not-yet-ready shell
  on relaunch. Restored managed terminals relaunch via autoLaunch/launchAgent as
  before — and this no longer relies on the fragile `command.contains("claude")`
  nil-ing for non-claude agents.
- registerWithRetry main-actor stall: tightened to 2 attempts × 3s (was 3 × 4s),
  capping worst-case main-actor stall at ~6s instead of ~12s; common case is one
  fast attempt.
- Duplicated launch prep: extracted AppDelegate.prepareAgentLaunchText (hook
  config + OTEL env + agent-token guard), returning (text, didLaunch). The `send`
  RPC and pasteDeferredLaunch now share it, removing the copy-paste and the
  "keep in sync" comment; the `.agentLaunched` gating is now consistent.
- DeferredLaunchTests: marked .serialized — the instance tests overwrite the
  singleton onReadinessChanged closure, which can stomp under parallel execution.

All package + app suites green.

🐦‍⬛ Generated with Claude Code, orchestrated by Crow

Co-Authored-By: Claude <noreply@anthropic.com>
Crow-Session: 1B4F94D3-4045-4DB0-9442-BA35B6050D6A
@dhilgaertner
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review — all four Yellows addressed in f1eee8d.

🟡 Hydrate-fresh recovery race (SessionService.swift:309). Fixed at the source: brand-new managed terminals no longer persist the launch command on the SessionTerminal row — the row now stores registerCommand (nil when deferring), and the launch lives only in in-memory pendingLaunchCommands. So the hydrate-fresh registerTerminal(command: terminal.command, …) fallback gets nil and can't blind-paste, for any agent (no longer relies on the command.contains("claude") nil-ing heuristic). Restored managed terminals still relaunch via the autoLaunchTerminals + launchAgent path.

🟡 registerWithRetry main-actor stall. Tightened to 2 attempts × 3s (was 3 × 4s), capping worst-case main-actor stall at ~6s instead of ~12s; the common case is a single fast attempt. A full off-main-actor hoist of registerTerminal is a larger refactor of a load-bearing path, so I took the "bound the budget" option you offered and noted the tradeoff in a comment.

🟡 prepareManagedLaunchText duplication. Extracted AppDelegate.prepareAgentLaunchText(command:agent:sessionID:worktreePath:crowPath:telemetryPort:) -> (text, didLaunch) as the single source of truth (hook config + OTEL env + agent-token guard). Both the send RPC and pasteDeferredLaunch call it; the copy-paste and the "keep in sync" comment are gone, and the .agentLaunched gating is now driven by the same didLaunch signal at both sites.

🟡 DeferredLaunchTests not serialized. Added .serialized to the suite, matching the TmuxBackend integration suite.

All package + app suites pass locally. Re-requesting review.

@dhilgaertner dhilgaertner requested a review from dgershman June 3, 2026 04:00
Copy link
Copy Markdown
Collaborator

@dgershman dgershman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code & Security Review

Critical Issues

None.

Security Review

Strengths:

  • Path-traversal guard on cwd (isPathWithinDevRoot) preserved on the new-terminal path.
  • No new trust boundary: the launch command is still authored by an authenticated local socket client; OTEL env vars carry no secrets.
  • In-memory only pendingLaunchCommands — never persisted, so a relaunch can't re-paste a stale command (explicitly noted in AppState.swift:226-230 and defensively cleared on adopt at SessionService.swift:258-260).
  • Persisted row deliberately stores registerCommand (nil for deferred launches) rather than the raw launch command, preventing the hydrate-fresh fallback in rehydrateTerminalSurface from blind-pasting it into a not-yet-ready shell (AppDelegate.swift:1323-1335).

Concerns: None.

Code Quality

Strengths:

  • Correctly routes brand-new managed terminals through the same race-free .shellReady path the restored path already used — eliminates the duplicate launch logic.
  • prepareAgentLaunchText is a clean extraction: both the send RPC and the deferred-launch paste now share hook-config + OTEL prep, so they can't drift.
  • State seeding (terminalReadiness, pendingLaunchCommands, autoLaunchTerminals) happens before registerTerminal, with a clear comment about why (AppDelegate.swift:1339-1351). Both producer (new-terminal RPC under MainActor.run) and consumer (onReadinessChanged closure in SessionService, both @MainActor-isolated) are serialized on the main actor, so the sentinel fire is guaranteed to see populated state and the terminals map.
  • pasteDeferredLaunch consumes both pendingLaunchCommands and autoLaunchTerminals atomically (SessionService.swift:386-387) — launchAgent's own membership check is now belt-and-suspenders against a double launch.
  • .timedOut does NOT consume the pending command (SessionService.swift:352-359 + DeferredLaunchTests.swift:114-126), preserving Retry/foreground re-arm recovery.
  • The orphan reaper is properly conservative: pure shouldReapWindow policy is unit-tested, real list unions keepWindowIndices with in-memory bindings.values, and the bare-shell whitelist excludes everything that could plausibly be an agent or session anchor.
  • The concurrent N-managed-terminal real-tmux test directly exercises the regression and asserts on zero bare shells — the right shape of test for this bug.
  • .serialized on TmuxBackendTests is well-justified in the comment (real-tmux integration tests contend for CPU during shell startup).
  • The CLI poll surface (launch_failed on new-terminal + readiness on list-terminals) gives setup.sh honest signal without breaking the existing JSON shape.

Consider (Green)

  • AppDelegate.swift:439-444 — the 8s post-launch sleep before the reaper runs is empirical. If app launch + initial main-thread work is very slow, appState.terminals[*].tmuxBinding?.windowIndex may not yet reflect a freshly re-registered window. The bindings union in reapUnboundCockpitWindows covers this in practice, so the risk is low; a follow-up could replace the timer with an explicit "rehydration settled" signal if you want to remove the magic constant.
  • setup.sh terminal_readiness parses the flat list-terminals JSON with grep -o '{[^{}]*}'. Fine for the current schema (no nested objects per terminal), but worth a note if future fields add nesting.
  • registerWithRetry retries immediately with no backoff. Intentional per the inline comment (2×3s budget), so this is preserved by design — but if a future change widens the budget it'd be worth revisiting.

Tests Run

PR description states all package + app suites pass. Spot-checked test structure — pure helpers (resolveLaunch, registerWithRetry, shouldReapWindow) are well-isolated and the readiness-branch test seeds state exactly the way the RPC does.

Summary Table

Color Meaning Verdict effect
Red Must fix Request changes
Yellow Should fix Request changes
Green Consider Approve allowed

Recommendation: Approve — driven by [0 Red, 0 Yellow, 3 Green] findings. The fix correctly identifies the two race modes (blind-paste vs window-less terminal), routes brand-new terminals through the existing race-free .shellReady path, surfaces failures honestly via launch_failed + readiness, and adds a conservative reaper for already-leaked windows. The Green items are stylistic/forward-looking, not blocking.


🐦‍⬛ Reviewed by Crow via Claude Code

@dhilgaertner dhilgaertner merged commit b7f6bb4 into main Jun 3, 2026
2 checks passed
@dhilgaertner dhilgaertner deleted the feature/crow-408-managed-terminal-launch-race branch June 3, 2026 04:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Managed terminals intermittently launch as bare zsh — agent never starts (sleep+send race; use new-terminal --command)

2 participants