Skip to content

Recover from failed Ghostty surface creation (#217)#218

Merged
dhilgaertner merged 1 commit intomainfrom
feature/crow-217-terminal-surface-recovery
Apr 29, 2026
Merged

Recover from failed Ghostty surface creation (#217)#218
dhilgaertner merged 1 commit intomainfrom
feature/crow-217-terminal-surface-recovery

Conversation

@dhilgaertner
Copy link
Copy Markdown
Contributor

Summary

Closes #217

Auto-review sessions could hang on "Waiting for terminal" forever when
ghostty_surface_new returned nil — the broken GhosttySurfaceView
stayed cached in TerminalManager, no retry was attempted, and the UI
had no failure path. This implements all three remediations called out
in the ticket:

  • Bounded retry inside GhosttySurfaceView.createSurface(): 4
    attempts at 0.5s / 1.0s / 2.0s / 4.0s before giving up. On exhaustion
    it fires a new onSurfaceCreationFailed callback.
  • TerminalManager.surface(for:) rejects broken cached views: if
    the cached entry's hasSurface is false, it's destroyed and replaced
    instead of returned.
  • Failure surfaces in the UI: a new TerminalReadiness.failed
    state (also SurfaceState.failed) flows from
    surfaceDidFailwireTerminalReadinessappState.terminalReadiness,
    and ReadinessAwareTerminal renders an error overlay with a Retry
    button. Retry routes through a new appState.onRetryTerminal
    callback to SessionService.retryTerminal(terminalID:), which resets
    readiness, re-arms auto-launch, and asks TerminalManager to destroy
    the broken view and re-preInitialize.

The session-list status dot also gets a red "failed" variant so the
problem is visible from the sidebar.

Out of scope: a session-level "failed" SessionStatus. The
terminal-level state is enough to recover; adding a new session status
would ripple through CLI, persistence, and validation.

Test plan

  • `make build` succeeds.
  • Force the failure path (e.g., temporarily set `self.surface = nil`
    after `ghostty_surface_new` in `GhosttySurfaceView.createSurface`)
    and confirm the retry log fires 4 times with backoff, then
    `[Ghostty] createSurface() exhausted retries` is logged.
  • After retries exhaust, the UI shows the error overlay with a
    Retry button instead of an indefinite "Waiting for terminal" spinner.
  • Clicking Retry (with the injection removed) creates a fresh
    `GhosttySurfaceView`, reaches `.shellReady`, and auto-launches Claude.
  • Navigating away from a tab with a broken cached surface and back
    logs `cached view has nil surface, discarding` and rebuilds the view.
  • Happy-path regression: ordinary work-session and review-session
    preinit still reach `.shellReady` within ~5s and auto-launch Claude.

🤖 Generated with Claude Code

Auto-review sessions could hang on "Waiting for terminal" forever when
ghostty_surface_new returned nil — the broken GhosttySurfaceView stayed
cached in TerminalManager, no retry was attempted, and the UI had no
failure path. This adds bounded retries inside GhosttySurfaceView (4
attempts with backoff up to ~7.5s), makes TerminalManager.surface(for:)
treat a cached view with a nil surface as a miss, propagates a new
.failed terminal-readiness state through SessionService, and surfaces an
error overlay with a Retry button in ReadinessAwareTerminal that
destroys the broken view and re-preInitializes a fresh one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dhilgaertner dhilgaertner requested a review from dgershman as a code owner April 29, 2026 15:01
Copy link
Copy Markdown
Collaborator

@dgershman dgershman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code & Security Review

Critical Issues

None found.

Security Review

Strengths:

  • All new state and callbacks are @MainActor-isolated, consistent with the existing concurrency model.
  • No new external inputs or command injection surfaces introduced — retry logic operates purely on internal state.
  • TerminalManager.retry() properly destroys the old surface before re-creating, preventing resource leaks.
  • The onSurfaceCreationFailed callback is cleared in destroy(), avoiding retain-cycle or use-after-free issues.

Concerns:

  • None. The changes are internal to the app's process and don't introduce new trust boundaries.

Code Quality

Well-structured changes:

  • Clean separation of concerns: GhosttySurfaceView owns retry mechanics, TerminalManager bridges to SessionService, and SessionService owns the recovery/re-arm logic.
  • The TerminalReadiness.failed enum case is inserted with sortOrder: -1, correctly ordering it below all other states. The Comparable conformance means the existing < .shellReady guard in the loading overlay naturally covers the new case without modification.
  • retryDelays as a static constant with exponential backoff (0.5s → 4.0s, ~7.5s total) is a reasonable budget — long enough for transient Ghostty init issues, short enough to not leave users hanging.
  • The surface(for:) cache-miss path correctly discards broken views (hasSurface == false), preventing callers from getting permanently stuck with a nil-surface view.
  • DispatchQueue.main.asyncAfter retry with [weak self] guard + self.surface == nil check is correct — prevents retrying after success or deallocation.

Minor observations (non-blocking):

  • createAttempts is reset to 0 both on success (line 93) and after exhausting retries (line 118). This means a view that fails and is then destroy()'d + re-created will correctly restart from attempt 0. Good.
  • The wireTerminalReadiness callback guard guard let currentState = self.appState.terminalReadiness[terminalID] would skip the .failed state update if the terminal isn't tracked. However, the .failed case inside the switch is only reachable after the guard passes, and all managed terminals are tracked — so this is correct.

Summary Table

Priority Issue
🟢 Retry backoff schedule is reasonable but not configurable — fine for now
🟢 The fixed 5s shell-readiness delay (existing code, not new) is still a TODO — not introduced by this PR

Recommendation: Approve — well-scoped, clean implementation with correct resource management and no security concerns.

@dhilgaertner dhilgaertner merged commit 163eadc into main Apr 29, 2026
2 checks passed
@dhilgaertner dhilgaertner deleted the feature/crow-217-terminal-surface-recovery branch April 29, 2026 16:47
dhilgaertner added a commit that referenced this pull request May 4, 2026
…xes (#238)

Closes #235

## Summary

- Adds `docs/automation.md` as the canonical guide to Settings →
Automation and the full auto-flow lifecycle.
- Updates `docs/architecture.md` with the dual Ghostty/tmux backend
(#229), the new `TerminalRouter` dispatch, the Settings tab split
(#228), and the Review Board surface (#188, #205, #207, #210, #212,
#220, #226, #231).
- Adds `crow rename-terminal` (#206) to `docs/cli-reference.md`.
- Adds troubleshooting rows for tmux backend missing, Ghostty surface
retry (#218), GitLab nested groups (#233), `GITLAB_HOST` silent skip
(#215), auto-respond not firing, and the silent no-op `hook-event`
behavior (#234).
- Adds `CROW_TMUX_BACKEND` and `CROW_HOOK_DEBUG` to the
`docs/configuration.md` env-var table.
- Backfills `CHANGELOG.md` `[Unreleased]` with every PR from #137
through #234, grouped by theme.
- Updates `README.md` features list and docs index for the new
automation suite, review board, terminal renaming, and tmux opt-in.
- Appends an "Implementation Status (2026-05)" footer to
`docs/terminal-runtime-research.md` noting #229 shipped the headless-PTY
backend recommended in the original research.

The audit checklist called out as deliverable #1 in the issue is posted
as a [comment on
#235](#235 (comment)).

## Test plan

- [ ] `git diff --stat main` shows only the listed `docs/`,
`CHANGELOG.md`, and `README.md` files
- [ ] Render each modified doc on GitHub and confirm anchors /
cross-links resolve
- [ ] Confirm `crow --help` matches the command list in
`docs/cli-reference.md` (only `rename-terminal` was missing pre-PR)
- [ ] Walk every PR number in the CHANGELOG against `git log
--since=2026-04-15 main --oneline` and confirm each one resolves
- [ ] Re-export `docs/crow-screenshot.jpeg` against the current
Settings/Review-Board UI — **deferred to a follow-up**, called out in
the audit comment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Auto-review session stuck on "Waiting for terminal" when Ghostty createSurface() fails

2 participants