Skip to content

[core] Ungracefully exit if the agent dies unexpectedly #53847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

codope
Copy link
Contributor

@codope codope commented Jun 16, 2025

Why are these changes needed?

This is a followup to #53762 (comment)

It forces raylet ungraceful exit if agen died unexpectedly (exit_code != 0).

Related issue number

#53739

@Copilot Copilot AI review requested due to automatic review settings June 16, 2025 08:35
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the agent manager to distinguish between graceful (exit code 0) and hard (non-zero exit code) agent terminations, enforcing an immediate ungraceful raylet exit on unexpected failures.

  • Split fate‐share logic: delay exit only for exit_code == 0, do a FATAL immediate exit for non-zero codes
  • Adjust log messages to reflect success vs. failure scenarios
  • Invoke QuickExit() immediately on non-zero agent exit
Comments suppressed due to low confidence (3)

src/ray/raylet/agent_manager.cc:88

  • [nitpick] The wording ‘failed with exit code 0’ is misleading since exit code 0 usually indicates success. Consider changing it to ‘exited with exit code 0’ for clarity.
<< "The raylet exited immediately because one Ray agent failed with exit code 0"

src/ray/raylet/agent_manager.cc:110

  • The new non-zero exit code branch introduces immediate exit logic—consider adding a unit or integration test to verify that the raylet ungracefully exits when an agent dies with a non-zero code.
} else {

src/ray/raylet/agent_manager.cc:116

  • Calling QuickExit() right after RAY_LOG(FATAL) may be unreachable because LOG(FATAL) typically aborts the process. Consider removing the redundant call or verifying the behavior of RAY_LOG(FATAL).
QuickExit();

@codope codope added the go add ONLY when ready to merge, run all tests label Jun 16, 2025
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should make this even stronger: if the agent is killed out-of-band, the raylet should always treat that as an error condition, even if it exits with code zero.

The reasoning is that the raylet assumes it controls the lifecycle of the agent.

You'll also need to update the Python test to reflect this change.

@jjyao
Copy link
Collaborator

jjyao commented Jun 16, 2025

IMO we should make this even stronger: if the agent is killed out-of-band, the raylet should always treat that as an error condition, even if it exits with code zero.

+1. Unless agent is killed by raylet during node gracefully shutdown. All the other exits regardless of exit code should be treated as an error. At high level, during graceful node shutdown, raylet should kill all of its children and then exit itself.

@codope codope force-pushed the core-53739-agent-ungraceful-exit branch 2 times, most recently from cf522ba to 42bbc2e Compare June 18, 2025 08:43
codope added 3 commits June 19, 2025 14:12
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@codope codope force-pushed the core-53739-agent-ungraceful-exit branch from 42bbc2e to 5f55fe2 Compare June 19, 2025 14:15
Comment on lines +86 to +88
// Agent died out-of-band (not during planned raylet shutdown).
// The raylet assumes it controls the lifecycle of the agent, so any
// unplanned exit is treated as an error condition, regardless of exit code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR. @edoakes, we're choosing to fail-fast here instead of trying to restart the agents? Is it because the agents are stateful or because a failure on the agent signals a hardware failure/bug?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime_env agent is stateful and recovery has never been implemented for it

Comment on lines +113 to +114
// Immediately exit ungracefully - the GCS notification is asynchronous
// and will be sent before the process actually terminates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this guaranteed? We don't explicitly wait for the IOService to be shutdown or drain it's pending work.

Copy link

github-actions bot commented Jul 8, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jul 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests stale The issue is stale. It will be closed within 7 days unless there are further conversation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants