
[train] Cleanup Zombie RayTrainWorker Actors#59872

Merged
justinvyu merged 6 commits into ray-project:master from JasonLi1909:slay-zombie-train-actors
Jan 12, 2026

Conversation

@JasonLi1909 (Contributor) commented Jan 6, 2026

Background

Leaking Ray Train actors have been observed occupying GPU memory after Train run termination, causing training failures and OOMs in subsequent Train runs. Despite the Train actors being marked DEAD by Ray Core, we find upon SSHing into the nodes that the actor processes are still alive and occupying valuable GPU memory:

[Screenshot: DEAD Train actor processes still running on the node and holding GPU memory]

Upon further investigation, we find that the actor process hangs on Python garbage collection after termination via `__ray_terminate__`. `__ray_terminate__` marks the actor as DEAD but does not guarantee that the actor process terminates. After `__ray_terminate__`, when circular references are cleaned up during Python interpreter teardown/garbage collection, object finalizers can hang if the worker state is no longer stable, preventing the actor process from ever exiting.
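To illustrate the failure mode, a purely hypothetical sketch (not the actual RayTrainWorker code; `comm` stands in for any handle into runtime state, e.g. a collective-communication group):

```python
class Holder:
    def __init__(self, comm):
        self.comm = comm   # hypothetical handle into runtime state
        self.cycle = self  # circular reference: only freed by the cyclic GC

    def __del__(self):
        # If this finalizer first runs during interpreter teardown, `comm`
        # may already be partially torn down and this call can block forever,
        # leaving the DEAD-marked actor process alive.
        self.comm.shutdown()
```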

Previous User Workaround

We find that if garbage collection is forced earlier via `gc.collect()`, while the worker state is still stable prior to shutdown (e.g. at the end of the user `train_func`), no zombie actors remain. Users leveraged this mitigation prior to this stable fix, along the lines of the sketch below.
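A minimal sketch of that workaround (the training loop body is elided; `train_func` here is the user's training function passed to the Trainer):

```python
import gc


def train_func(config):
    # ... user training loop ...

    # Pre-fix workaround: collect circular references now, while the worker
    # is still in a stable state, instead of leaving them to interpreter
    # teardown after __ray_terminate__.
    gc.collect()
```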

Changes

This PR:

  • Replaces `__ray_terminate__` with `ray.kill` in the Train run shutdown and abort paths to guarantee the termination of Train actors

With the changes, the same Train job as above no longer leaves zombie Train actors behind:

[Screenshot: the same Train job after the change, with no zombie Train actor processes remaining]

`ray.kill` preserves shutdown patience

Previously, `__ray_terminate__` was used for "graceful" termination of actors: the termination task was scheduled with a timeout of 5 seconds to allow pending tasks to complete, after which `ray.kill` was called on any actors that had not yet terminated. While this PR no longer schedules a termination task, it still gives the worker shutdown hooks a (default) timeout of 5 seconds, preserving the amount of time shutdown hooks and other pending tasks have to complete. The difference is that after those 5 seconds, `ray.kill` is called to terminate every actor.
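Roughly, the new flow looks like the following hypothetical sketch (not the actual `_shutdown_workers` implementation; `workers` and the `shutdown` remote method are assumed names):

```python
import ray


def _shutdown_workers(workers, patience_s: float = 5.0):
    # 1) Give worker shutdown hooks up to `patience_s` seconds to finish.
    shutdown_refs = [w.actor.shutdown.remote() for w in workers]
    ray.wait(shutdown_refs, num_returns=len(shutdown_refs), timeout=patience_s)

    # 2) Force-terminate every actor process out-of-band so that no actor
    #    can hang on garbage collection during interpreter teardown.
    for w in workers:
        ray.kill(w.actor)
```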

Reliability of `ray.kill` vs. `__ray_terminate__` shutdown

`ray.kill` performs a forceful out-of-band shutdown of an actor process, bypassing `atexit` handlers and Python shutdown hooks, similar in nature to a SIGKILL. In contrast, `__ray_terminate__` raises an exception inside the actor, which lets Python shutdown hooks run and gives Train actors the opportunity to hang. Because Train actors do not rely on those hooks or `atexit` handlers for functionality, `ray.kill` is the better fit for our use case.
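For context, the two termination paths look roughly like this (illustrative only; `__ray_terminate__` is a Ray-internal actor method):

```python
import ray


@ray.remote
class Worker:
    def ping(self):
        return "pong"


# Old path: submit the internal graceful-termination task. The actor is
# marked DEAD, but the process exits through normal interpreter shutdown,
# where garbage collection and finalizers run and can hang.
old = Worker.remote()
old.__ray_terminate__.remote()

# New path: force-terminate the actor process out-of-band (akin to SIGKILL);
# atexit handlers and shutdown hooks are skipped, so the process cannot hang.
new = Worker.remote()
ray.kill(new)
```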

@JasonLi1909 JasonLi1909 requested a review from a team as a code owner January 6, 2026 03:33
@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a critical issue of leaking Ray Train actors that could lead to out-of-memory errors in subsequent training runs. The change from `__ray_terminate__` to `ray.kill` is a solid approach to guarantee actor process termination. The refactored `_shutdown_workers` function is now cleaner and more reliable. It correctly allows a grace period for shutdown hooks to execute before ensuring all workers are killed. I have one suggestion to improve the robustness of the timeout handling.

@ray-gardener (bot) added the `train` (Ray Train Related Issue) label on Jan 6, 2026
@justinvyu (Contributor) left a comment


Great investigation! 🚀 Can we add your minimal repro as a test? E.g., we can do a PID liveness check assertion.

Also, can you expand on why `__ray_terminate__` fails vs. `ray.kill()`? E.g., letting a Python process exit on its own vs. externally force-killing the process. And can you also add some details about how the early `gc.collect()` fixes things?
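A PID liveness check along those lines could look roughly like the following sketch (hypothetical, not the test added by this PR; assumes `psutil` is available in the test environment):

```python
import os
import time

import psutil
import ray


@ray.remote
class DummyWorker:
    def get_pid(self) -> int:
        return os.getpid()


def test_actor_process_exits_after_kill():
    actor = DummyWorker.remote()
    pid = ray.get(actor.get_pid.remote())

    ray.kill(actor)

    # The worker process should disappear shortly after ray.kill.
    for _ in range(50):
        if not psutil.pid_exists(pid):
            break
        time.sleep(0.1)
    assert not psutil.pid_exists(pid)
```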

@JasonLi1909 added the `go` (add ONLY when ready to merge, run all tests) label and removed the `train` (Ray Train Related Issue) label on Jan 9, 2026
@ray-gardener (bot) added the `train` (Ray Train Related Issue) label on Jan 9, 2026
@justinvyu justinvyu merged commit 480a4de into ray-project:master Jan 12, 2026
7 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
rushikeshadhav pushed a commit to rushikeshadhav/ray that referenced this pull request Jan 14, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026