Skip to content

fix: release shells mutex before spawn_subshell to prevent deadlock#342

Open
ibash-corpusant wants to merge 1 commit intoshell-pool:masterfrom
ibash-corpusant:fix/shells-mutex-deadlock
Open

fix: release shells mutex before spawn_subshell to prevent deadlock#342
ibash-corpusant wants to merge 1 commit intoshell-pool:masterfrom
ibash-corpusant:fix/shells-mutex-deadlock

Conversation

@ibash-corpusant
Copy link
Copy Markdown

Problem

select_shell_desc() holds the global shells mutex while calling spawn_subshell(), which calls wait_for_startup(). If wait_for_startup blocks (e.g. due to a version mismatch between client and daemon, or a shell that fails to start), the mutex is held forever, deadlocking the entire daemon. All operations (list, attach, detach, kill) that need the shells mutex hang indefinitely.

This was observed in production when a daemon that had been running for 2+ days encountered a client/daemon binary mismatch after a package update. The wait_for_startup sentinel handshake hung, and from that point on, every shpool list, shpool attach, etc. blocked forever.

Fix

Refactor select_shell_desc() into three phases:

Phase Lock held? What it does
Phase 1 ✅ Yes Determine whether to create a new session or reuse existing
Phase 2 ❌ No Call spawn_subshell() (which may block in wait_for_startup)
Phase 3 ✅ Yes (briefly) Insert session into table, return Arc references

The critical change: spawn_subshell() is now called after the shells mutex is released. If wait_for_startup blocks, only the attach thread is affected — list, kill, detach, and other attach requests proceed normally.

Also adds a 30-second timeout to wait_for_startup() as defense-in-depth. Previously it was an infinite loop.

Regression Test

New test (deadlock.rs) that:

  1. Starts a daemon with a fake shell that never produces the startup sentinel
  2. Spawns an attach (which hangs in wait_for_startup)
  3. Asserts that shpool list still completes within 5 seconds

Before fix: DEADLOCK DETECTED — list times out after 5s
After fix: list returns immediately, test passes in 3.2s

Test Results

All existing tests pass (51 passed, 1 pre-existing prompt_prefix_fish failure unrelated to this change).

select_shell_desc() previously held the global 'shells' mutex while
calling spawn_subshell(), which calls wait_for_startup(). If
wait_for_startup blocks (e.g. due to a version mismatch between client
and daemon, or a shell that fails to start), the mutex is held forever,
deadlocking the entire daemon. All operations (list, attach, detach,
kill) that need the shells mutex will hang indefinitely.

Fix: Refactor select_shell_desc() into three phases:
1. Hold the lock to determine whether to create/reuse a session
2. Release the lock, then call spawn_subshell (may block)
3. Re-acquire the lock briefly to insert the session and return refs

Also adds a 30-second timeout to wait_for_startup() as defense-in-depth,
and a regression test that proves list does not deadlock during a slow
shell spawn.
@google-cla
Copy link
Copy Markdown

google-cla Bot commented Apr 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ethanpailes
Copy link
Copy Markdown
Contributor

Please explain the extent to which you used AI to generate this PR. It will help me know how closely I have to read it while reviewing.

@ibash-corpusant
Copy link
Copy Markdown
Author

I frequently encountered error not being able to reconnect to shpool sessions on my cloudtop. I had smith grab the logs, investigate the code, and then write a test for this case. I also had smith write the fix. So everything is ai generated.

@ethanpailes
Copy link
Copy Markdown
Contributor

Thanks! It will probably take me a minute to review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants