-
Notifications
You must be signed in to change notification settings - Fork 6.6k
[core] Ungracefully exit if the agent dies unexpectedly #53847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[core] Ungracefully exit if the agent dies unexpectedly #53847
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the agent manager to distinguish between graceful (exit code 0) and hard (non-zero exit code) agent terminations, enforcing an immediate ungraceful raylet exit on unexpected failures.
- Split fate‐share logic: delay exit only for exit_code == 0, do a FATAL immediate exit for non-zero codes
- Adjust log messages to reflect success vs. failure scenarios
- Invoke
QuickExit()
immediately on non-zero agent exit
Comments suppressed due to low confidence (3)
src/ray/raylet/agent_manager.cc:88
- [nitpick] The wording ‘failed with exit code 0’ is misleading since exit code 0 usually indicates success. Consider changing it to ‘exited with exit code 0’ for clarity.
<< "The raylet exited immediately because one Ray agent failed with exit code 0"
src/ray/raylet/agent_manager.cc:110
- The new non-zero exit code branch introduces immediate exit logic—consider adding a unit or integration test to verify that the raylet ungracefully exits when an agent dies with a non-zero code.
} else {
src/ray/raylet/agent_manager.cc:116
- Calling
QuickExit()
right afterRAY_LOG(FATAL)
may be unreachable becauseLOG(FATAL)
typically aborts the process. Consider removing the redundant call or verifying the behavior ofRAY_LOG(FATAL)
.
QuickExit();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should make this even stronger: if the agent is killed out-of-band, the raylet should always treat that as an error condition, even if it exits with code zero.
The reasoning is that the raylet assumes it controls the lifecycle of the agent.
You'll also need to update the Python test to reflect this change.
+1. Unless agent is killed by raylet during node gracefully shutdown. All the other exits regardless of exit code should be treated as an error. At high level, during graceful node shutdown, raylet should kill all of its children and then exit itself. |
cf522ba
to
42bbc2e
Compare
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
42bbc2e
to
5f55fe2
Compare
// Agent died out-of-band (not during planned raylet shutdown). | ||
// The raylet assumes it controls the lifecycle of the agent, so any | ||
// unplanned exit is treated as an error condition, regardless of exit code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR. @edoakes, we're choosing to fail-fast here instead of trying to restart the agents? Is it because the agents are stateful or because a failure on the agent signals a hardware failure/bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime_env agent is stateful and recovery has never been implemented for it
// Immediately exit ungracefully - the GCS notification is asynchronous | ||
// and will be sent before the process actually terminates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this guaranteed? We don't explicitly wait for the IOService to be shutdown or drain it's pending work.
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Why are these changes needed?
This is a followup to #53762 (comment)
It forces raylet ungraceful exit if agen died unexpectedly (exit_code != 0).
Related issue number
#53739