Skip to content

fix: gr spawn down graceful shutdown + per-agent teardown#567

Merged
laynepenney merged 2 commits intomainfrom
fix/spawn-down-graceful
Apr 14, 2026
Merged

fix: gr spawn down graceful shutdown + per-agent teardown#567
laynepenney merged 2 commits intomainfrom
fix/spawn-down-graceful

Conversation

@laynepenney
Copy link
Copy Markdown
Collaborator

@laynepenney laynepenney commented Apr 14, 2026

Summary

  • Graceful shutdown: gr spawn down sends /exit to each agent's tmux pane (asking the agent process to clean up and terminate), then polls #{pane_dead} every 500ms up to a configurable timeout. Agents that don't exit in time get their tmux window force-killed via kill-window. Per-agent status is reported (graceful exit, force-killed, or state unknown if tmux queries failed).
  • Per-agent teardown: gr spawn down --agent <name> tears down a single agent instead of the whole team.
  • Configurable timeout: gr spawn down --timeout <secs> (default: 10s) controls how long to wait before force-killing.

Shutdown model

This is /exit + tmux lifecycle, not OS-level signal escalation (SIGTERM/SIGKILL). The /exit command asks the agent CLI to shut down cooperatively. If the agent process does not exit within the timeout, kill-window destroys the tmux pane (which sends SIGHUP to the child). This is appropriate because agent processes (Claude Code, Codex CLI) respond to /exit but may not handle bare SIGTERM gracefully.

Error handling (review feedback)

  • pane_exit_state() returns Option<bool> tri-state: Some(true) = exited, Some(false) = running, None = tmux query failed. Tmux errors are logged as warnings, not silently treated as graceful exit.
  • send-keys, kill-window, and kill-session failures are logged with warning output instead of being silently ignored.

Premium boundary: core OSS (CLI workspace orchestration).

Test plan

  • gr spawn up a team, then gr spawn down -- verify all agents get /exit and exit gracefully
  • gr spawn down --agent apollo -- verify only that agent is torn down, session stays alive
  • gr spawn down --timeout 2 with a hung agent -- verify it force-kills after 2s with status message
  • cargo test -- all tests pass
  • cargo clippy and cargo fmt --check clean

🤖 Generated with Claude Code

Copy link
Copy Markdown
Collaborator Author

@laynepenney laynepenney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opus review: grip#567 (spawn down graceful shutdown)

Clean, well-structured implementation. Three-phase approach (send /exit, poll pane_dead, force-kill) is the right pattern.

Strengths:

  • is_pane_dead() correctly handles both #{pane_dead} check and "window already gone" fallback
  • 500ms poll interval is reasonable; configurable --timeout (default 10s) gives operators control
  • Clear user-facing messaging (green check for graceful, red X for force-kill with timeout)
  • Per-agent teardown via --agent flag properly wired through dispatch
  • Spawn state written before shutdown starts (captures intent even if force-kill needed)

Minor notes:

  • The remain-on-exit tmux setting affects behavior: if not set, panes disappear on exit and is_pane_dead returns true via the fallback. Correct behavior, worth a comment in code.

Premium boundary: core OSS (workspace orchestration).

LGTM from Opus.

Copy link
Copy Markdown
Collaborator Author

@laynepenney laynepenney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review note: I can't submit an approval or request-changes state here because GitHub blocks formal self-reviews, so I'm leaving the findings as a comment.

  1. src/cli/commands/spawn.rs:657-669 treats any tmux list-panes failure as pane dead = true. That means a broken tmux socket, wrong session target, or any other tmux error is reported as a graceful exit. The command can print success even when it never observed the pane state. This needs to distinguish 'window already gone' from 'tmux failed' and surface the latter as an error.

  2. src/cli/commands/spawn.rs:585-647 ignores the status from every tmux send-keys, kill-window, and kill-session call. If teardown fails, gr spawn down still returns Ok(()) and prints that the agent exited or was force-killed. That is exactly the path where we need reliable error reporting.

Separately, the implementation is currently 'send /exit, wait, then kill the tmux window' rather than a true TERM/KILL escalation. If that is the intended model, the CLI/help text and PR framing should say that explicitly; otherwise the implementation should actually signal the child process before falling back to hard kill.

Copy link
Copy Markdown
Collaborator Author

@laynepenney laynepenney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review note: GitHub blocks formal self-approve/request-changes on this PR author, so I am leaving actionable review findings here.

Findings:

  1. src/cli/commands/spawn.rs:657-669 treats any tmux list-panes failure as pane dead = true. That means a broken tmux socket, wrong session target, or any other tmux error is reported as a graceful exit. The command can print success even when it never observed the pane state. This should distinguish "window already gone" from generic tmux failure and surface the latter as an error.

  2. src/cli/commands/spawn.rs:585-647 ignores the status from every tmux send-keys, kill-window, and kill-session call. If teardown fails, gr spawn down still returns Ok(()) and prints that the agent exited or was force-killed. This is exactly the path where we need reliable error reporting.

Separately, the implementation is currently send /exit, wait, then kill the tmux window rather than a true TERM/KILL escalation. If that is the intended model, the CLI/help text and PR framing should say that explicitly; otherwise the implementation should actually signal the child process before falling back to hard kill.

- is_pane_dead() -> pane_exit_state(): returns Option<bool> tri-state
  to distinguish "exited" / "still running" / "tmux error". No longer
  treats arbitrary tmux failures as graceful exit.
- Log warnings from send-keys, kill-window, kill-session failures
  instead of silently discarding return statuses.
- Report "state unknown" for agents where tmux queries failed, so the
  operator sees the ambiguity rather than a false "exited gracefully".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@laynepenney laynepenney merged commit 9530863 into main Apr 14, 2026
9 of 10 checks passed
@laynepenney laynepenney deleted the fix/spawn-down-graceful branch April 14, 2026 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant