joshuay03
Contributor

Motivation / Background

Closes #55513.

Detail

When parallel tests are running and a worker process dies abruptly (e.g., OOM-killed or kill -9), the test suite hangs forever waiting for the dead worker to call stop_worker, which it never can.

The fix detects dead workers during shutdown by using Process.waitpid with WNOHANG to check which processes have exited, then removes them from the server's active worker list before waiting for remaining workers to finish. The server now tracks PIDs alongside worker IDs, allowing it to map dead processes back to their worker entries for cleanup.
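
For readers unfamiliar with the non-blocking reap, here is a minimal sketch of the idea, not the code merged in this PR. The helper name reap_dead_workers and the { pid => worker_id } hash are hypothetical, chosen only for illustration; Process.waitpid and Process::WNOHANG are standard Ruby.

```ruby
# Hypothetical helper (not the Rails implementation): given a
# { pid => worker_id } map, return the worker IDs whose processes have exited.
def reap_dead_workers(pid_to_worker_id)
  dead = []
  pid_to_worker_id.each do |pid, worker_id|
    begin
      # With WNOHANG, waitpid returns nil right away if the child is still
      # running, or the pid if it has exited (reaping it in the process).
      dead << worker_id if Process.waitpid(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      # No such child: it was already reaped, so treat it as dead.
      dead << worker_id
    end
  end
  dead
end

# Usage: fork two sleeping children and kill one abruptly.
alive  = fork { sleep 30 }
doomed = fork { sleep 30 }
workers = { alive => 1, doomed => 2 }

Process.kill("KILL", doomed)

# Poll briefly instead of relying on a single fixed sleep.
dead_ids = []
50.times do
  dead_ids = reap_dead_workers(workers)
  break unless dead_ids.empty?
  sleep 0.1
end

# Clean up the surviving child so the example does not leak a process.
Process.kill("KILL", alive)
Process.wait(alive)
```

A server could then drop the returned worker IDs from its active list before waiting on the remaining workers, which is the behavior described above.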

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

@joshuay03 joshuay03 moved this to On Hold in Open Source Sep 28, 2025
@joshuay03 joshuay03 moved this from On Hold to In Progress / Pending Review in Open Source Sep 28, 2025
Process.kill("KILL", worker_pids.first)
sleep 0.25

Timeout.timeout(2.5, Minitest::Assertion, "Expected shutdown to not hang") { parallelization.shutdown }
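
As a side note on the pattern in the snippet above: Timeout.timeout accepts an exception class and a message, so a hang surfaces as a failure with a readable message instead of a stuck test run. A minimal standalone sketch follows, where HangError is a hypothetical stand-in for Minitest::Assertion and guard is an illustrative wrapper, neither taken from this PR.

```ruby
require "timeout"

# Hypothetical stand-in for Minitest::Assertion so the sketch runs without minitest.
class HangError < StandardError; end

# Illustrative wrapper: run the block, raising HangError with the given
# message if it takes longer than `seconds`.
def guard(seconds, message)
  Timeout.timeout(seconds, HangError, message) { yield }
end

# A block that finishes in time returns its value unchanged.
fast = guard(1, "should not hang") { :done }

# A block that overruns raises HangError carrying the supplied message.
hung_message =
  begin
    guard(0.2, "Expected shutdown to not hang") { sleep 5 }
    nil
  rescue HangError => e
    e.message
  end
```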
Contributor Author

@joshuay03 joshuay03 Sep 28, 2025

On main:

Image

Member

@byroot byroot left a comment

LGTM, I'll merge this in a bit, but need to backport etc.

@byroot byroot merged commit cf6b310 into main Sep 29, 2025
3 of 5 checks passed
@byroot byroot deleted the fix-55513 branch September 29, 2025 12:55
byroot added a commit that referenced this pull request Sep 29, 2025
[Fix #55513] parallel tests hanging when worker processes die abruptly
byroot added a commit that referenced this pull request Sep 29, 2025
[Fix #55513] parallel tests hanging when worker processes die abruptly
@joshuay03 joshuay03 moved this from In Progress / Pending Review to Done in Open Source Oct 4, 2025
Development

Successfully merging this pull request may close these issues.

Parallel tests hang if a worker dies abruptly
