[core] Ungracefully exit if the agent dies unexpectedly #53847

codope · 2025-06-16T08:35:25Z

Why are these changes needed?

It forces an ungraceful raylet exit if the agent died unexpectedly (exit_code != 0).

Related issue number

Copilot

Pull Request Overview

This PR updates the agent manager to distinguish between graceful (exit code 0) and hard (non-zero exit code) agent terminations, enforcing an immediate ungraceful raylet exit on unexpected failures.

Split fate‐share logic: delay exit only for exit_code == 0, do a FATAL immediate exit for non-zero codes
Adjust log messages to reflect success vs. failure scenarios
Invoke QuickExit() immediately on non-zero agent exit
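
The split described in this overview could be sketched as a small decision function. This is only an illustration under assumed names: `ExitAction`, `ExitDecision`, `DecideOnAgentExit`, and the `graceful_delay` parameter are hypothetical and are not taken from `agent_manager.cc`.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical outcome type for illustration; not part of Ray.
enum class ExitAction { kDelayedGracefulExit, kImmediateFatalExit };

struct ExitDecision {
  ExitAction action;
  std::chrono::milliseconds delay;  // Only meaningful for the delayed path.
  std::string log_message;
};

// Decide how the parent process reacts when its child agent exits.
// `graceful_delay` is an assumed grace period used only for exit code 0.
ExitDecision DecideOnAgentExit(int exit_code,
                               std::chrono::milliseconds graceful_delay) {
  assert(graceful_delay.count() >= 0);
  if (exit_code == 0) {
    return {ExitAction::kDelayedGracefulExit, graceful_delay,
            "agent exited with exit code 0"};
  }
  return {ExitAction::kImmediateFatalExit, std::chrono::milliseconds(0),
          "agent failed with exit code " + std::to_string(exit_code)};
}
```

Keeping the decision separate from the side effect (logging, exiting) is a design choice of this sketch; it makes the two branches easy to check in isolation.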

Comments suppressed due to low confidence (3)

src/ray/raylet/agent_manager.cc:88

[nitpick] The wording ‘failed with exit code 0’ is misleading since exit code 0 usually indicates success. Consider changing it to ‘exited with exit code 0’ for clarity.

<< "The raylet exited immediately because one Ray agent failed with exit code 0"

src/ray/raylet/agent_manager.cc:110

The new non-zero exit code branch introduces immediate exit logic—consider adding a unit or integration test to verify that the raylet ungracefully exits when an agent dies with a non-zero code.

} else {
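
One way to make the non-zero branch testable, as suggested above, is to inject the exit action so that a test can observe the call instead of losing the process. The sketch below assumes a hypothetical `AgentWatcher` class and `fatal_exit` callback; neither name comes from the Ray codebase.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for the component that watches the agent process.
class AgentWatcher {
 public:
  // `fatal_exit` is injected: production code would pass something that
  // terminates the process, while a test passes a recording callback.
  explicit AgentWatcher(std::function<void(int)> fatal_exit)
      : fatal_exit_(std::move(fatal_exit)) {
    assert(fatal_exit_ != nullptr);
  }

  // Called once the agent process has been reaped with `exit_code`.
  void OnAgentExit(int exit_code) {
    if (exit_code != 0) {
      fatal_exit_(exit_code);  // Unexpected failure: exit immediately.
    }
    // Exit code 0 would take the delayed path, omitted in this sketch.
  }

 private:
  std::function<void(int)> fatal_exit_;
};
```

A test can then construct `AgentWatcher` with a lambda that records the exit code and assert that it fires only for non-zero codes.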

src/ray/raylet/agent_manager.cc:116

Calling QuickExit() right after RAY_LOG(FATAL) may be unreachable because LOG(FATAL) typically aborts the process. Consider removing the redundant call or verifying the behavior of RAY_LOG(FATAL).

QuickExit();

edoakes

IMO we should make this even stronger: if the agent is killed out-of-band, the raylet should always treat that as an error condition, even if it exits with code zero.

The reasoning is that the raylet assumes it controls the lifecycle of the agent.

You'll also need to update the Python test to reflect this change.

jjyao · 2025-06-16T15:41:34Z

> IMO we should make this even stronger: if the agent is killed out-of-band, the raylet should always treat that as an error condition, even if it exits with code zero.

+1. Unless the agent is killed by the raylet during graceful node shutdown, all other exits should be treated as an error, regardless of exit code. At a high level, during graceful node shutdown, the raylet should kill all of its children and then exit itself.
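
The stricter policy proposed in this thread can be written down as a classifier in which the exit code plays no role. This is a minimal sketch with hypothetical names (`AgentExitKind`, `ClassifyAgentExit`, `shutdown_initiated_by_parent`); it assumes the parent tracks whether it killed the agent itself as part of its own graceful shutdown.

```cpp
// Hypothetical classification for illustration; not Ray's actual types.
enum class AgentExitKind { kExpected, kError };

// The exit code is deliberately ignored: only an exit that the parent itself
// requested during graceful shutdown counts as expected.
constexpr AgentExitKind ClassifyAgentExit(int /*exit_code*/,
                                          bool shutdown_initiated_by_parent) {
  return shutdown_initiated_by_parent ? AgentExitKind::kExpected
                                      : AgentExitKind::kError;
}

// An out-of-band exit with code 0 is still an error under this policy.
static_assert(ClassifyAgentExit(0, false) == AgentExitKind::kError,
              "out-of-band exit must be an error even with exit code 0");
```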

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

israbbani · 2025-06-19T17:46:05Z

src/ray/raylet/agent_manager.cc

+      // Agent died out-of-band (not during planned raylet shutdown).
+      // The raylet assumes it controls the lifecycle of the agent, so any
+      // unplanned exit is treated as an error condition, regardless of exit code.


Unrelated to this PR. @edoakes, we're choosing to fail-fast here instead of trying to restart the agents? Is it because the agents are stateful or because a failure on the agent signals a hardware failure/bug?

edoakes · 2025-06-23

The runtime_env agent is stateful, and recovery has never been implemented for it.

israbbani · 2025-06-19T17:49:15Z

src/ray/raylet/agent_manager.cc

+      // Immediately exit ungracefully - the GCS notification is asynchronous
+      // and will be sent before the process actually terminates


How is this guaranteed? We don't explicitly wait for the IOService to be shut down or drain its pending work.
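
One way to address this concern, sketched here under assumptions rather than taken from the PR, is to wait for the notification's completion callback with a bounded timeout before exiting. `AsyncNotify` and `NotifyThenWait` are hypothetical names; the sketch assumes the sender invokes `on_done` at most once, and from a thread other than the one that is blocked waiting.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>

// Hypothetical async sender: starts the notification and invokes `on_done`
// at most once when it has been delivered or has definitively failed.
using AsyncNotify = std::function<void(std::function<void()> on_done)>;

// Starts the notification, then blocks until it completes or `timeout`
// elapses. Returns true if it completed in time. The caller would terminate
// the process afterwards in either case.
bool NotifyThenWait(const AsyncNotify &notify,
                    std::chrono::milliseconds timeout) {
  assert(notify != nullptr);
  // shared_ptr keeps the promise alive even if the callback outlives us.
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();
  notify([done]() { done->set_value(); });
  return finished.wait_for(timeout) == std::future_status::ready;
}
```

If the completion callback is dispatched on the same event loop thread that calls `NotifyThenWait`, blocking like this would only ever time out; in that situation the loop itself would have to be run until the callback fires.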

github-actions · 2025-07-08T00:40:06Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

Copilot AI review requested due to automatic review settings June 16, 2025 08:35

Copilot AI reviewed Jun 16, 2025

codope added the go label Jun 16, 2025

edoakes reviewed Jun 16, 2025

codope force-pushed the core-53739-agent-ungraceful-exit branch 2 times, most recently from cf522ba to 42bbc2e Compare June 18, 2025 08:43

codope added 3 commits June 19, 2025 14:12

[core] Ungracefully exit if the agent dies unexpectedly

bee06c3

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

ungraceful raylet exit if unexpected agent exit

763a316

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

send rpc before quick exit

5f55fe2

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>

codope force-pushed the core-53739-agent-ungraceful-exit branch from 42bbc2e to 5f55fe2 Compare June 19, 2025 14:15

israbbani reviewed Jun 19, 2025

github-actions bot added the stale label Jul 8, 2025
