[train] Graceful abort catches all RayActorErrors #61375

Merged
justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/graceful-abort-catch on Mar 4, 2026

Conversation

@TimothySeah
Contributor

Summary

We sometimes observed errors like the following

2026-02-26 16:32:06,017sINFO data_parallel_trainer.py:289 -- Received SIGINT. Gracefully aborting the training run — this may take a few seconds. To forcefully abort immediately, you can send a different signal, such as SIGKILL.  | 727k/1.56M [10:13<1:16:08, 182 row/s]
(TrainController pid=344312) [State Transition] RUNNING -> ABORTED.       
Traceback (most recent call last):                                                                                                                                                                                                                                           
  File "python/ray/_raylet.pyx", line 2403, in ray._raylet.check_signals                                                                                                                                                                                                     
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/api/data_parallel_trainer.py", line 304, in sigint_handler                                                                                                                                             
    ray.get(controller.abort.remote())                                                                                                                                                                                                                                       
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper                                                                                                                                                      
    return fn(*args, **kwargs)                                                                                                                                                                                                                                               
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper                                                                                                                                                             
    return func(*args, **kwargs)                                                                                                                                                                                                                                             
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2954, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1012, in get_objects
    raise value
ray.exceptions.ActorUnavailableError: The actor 2fce016608959e16c67765a70e000000 is unavailable: The actor is temporarily unavailable: IntentionalSystemExit: Worker exits with an exit code 0. exit_actor() is called.. The task may or may not have been executed on the actor.

in the graceful abort path: the SIGINT handler only caught `ActorDiedError`, so other actor errors such as `ActorUnavailableError` escaped and crashed the driver mid-abort. Now we correctly catch all `RayActorError`s.
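The fix boils down to catching the `RayActorError` superclass instead of only `ActorDiedError`. A minimal pure-Python sketch of why that matters; the exception classes here are stand-ins mirroring the `ray.exceptions` hierarchy so the example runs without a Ray cluster:

```python
# Stand-in classes mirroring ray.exceptions: ActorDiedError and
# ActorUnavailableError are both subclasses of RayActorError.
class RayActorError(Exception): ...
class ActorDiedError(RayActorError): ...
class ActorUnavailableError(RayActorError): ...


def graceful_abort(error: Exception) -> str:
    """Simulate the abort path raising an actor error on the driver."""
    try:
        raise error  # stands in for ray.get(controller.abort.remote())
    except RayActorError:  # previously: except ActorDiedError
        # Swallow the actor error; the run is being torn down anyway.
        return "aborted gracefully"


# With the old narrow handler, this case would have escaped uncaught.
print(graceful_abort(ActorUnavailableError("actor temporarily unavailable")))
```

Catching the superclass keeps the handler robust to any way the controller actor can become unreachable during shutdown, not just outright death.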

Testing

I sanity-checked the change with a dummy script:

(ray4) tseah@tseah-LV3607J62K ray % python driver.py
2026-02-26 17:54:12,336	INFO worker.py:1984 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
/Users/tseah/ray/python/ray/_private/worker.py:2032: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
(TrainController pid=11774) Attempting to start training worker group of size 1 with the following resources: [{'CPU': 1}] * 1
(RayTrainWorker pid=11777) Setting up process group for: env:// [rank=0, world_size=1]
(TrainController pid=11774) Started training worker group of size 1: 
(TrainController pid=11774) - (ip=127.0.0.1, pid=11777) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=11777) sleep for 1 second
(RayTrainWorker pid=11777) sleep for 1 second
^C2026-02-26 17:54:27,866	INFO data_parallel_trainer.py:289 -- Received SIGINT. Gracefully aborting the training run — this may take a few seconds. To forcefully abort immediately, you can send a different signal, such as SIGKILL.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from a team as a code owner February 27, 2026 01:57
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue where graceful aborts could fail if the controller actor becomes unavailable, raising an ActorUnavailableError that wasn't being caught. The change broadens the exception handling from ActorDiedError to its superclass RayActorError, which correctly catches ActorUnavailableError and other related actor errors. This ensures a more robust graceful shutdown process. The change is correct and well-justified.

@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Feb 27, 2026
@justinvyu justinvyu enabled auto-merge (squash) March 2, 2026 19:14
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Mar 2, 2026
@github-actions github-actions bot disabled auto-merge March 3, 2026 20:07
@justinvyu justinvyu merged commit 91a4388 into ray-project:master Mar 4, 2026
6 checks passed
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Mar 13, 2026
Catch all `RayActorError`s raised to the driver process when aborting a Ray Train run.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
