[train] Cleanup Zombie RayTrainWorker Actors #59872
justinvyu merged 6 commits into ray-project:master
Conversation
Code Review
This pull request addresses a critical issue of leaking Ray Train actors that could lead to out-of-memory errors in subsequent training runs. The change from __ray_terminate__ to ray.kill is a solid approach to guarantee actor process termination. The refactored _shutdown_workers function is now cleaner and more reliable. It correctly allows a grace period for shutdown hooks to execute before ensuring all workers are killed. I have one suggestion to improve the robustness of the timeout handling.
justinvyu left a comment
Great investigation! 🚀 Can we add your minimal repro as a test? For example, we could assert on PID liveness.
Also, can you expand on why `__ray_terminate__` fails vs. `ray.kill()`? E.g., letting a Python process exit itself vs. externally force-killing the process. And can you also add some details about how the early `gc.collect()` fixes things?
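A rough sketch of such a PID-liveness check, using a generic actor and `psutil` (illustrative only, not the test that eventually landed):

```python
# Sketch of a PID-liveness regression test; the actor here is a generic
# stand-in for the Train worker, and psutil is assumed to be available.
import os
import time

import psutil
import ray


@ray.remote
class Worker:
    def get_pid(self) -> int:
        return os.getpid()


def test_worker_process_is_killed():
    ray.init()
    worker = Worker.remote()
    pid = ray.get(worker.get_pid.remote())
    assert psutil.pid_exists(pid)

    ray.kill(worker)

    # Actor teardown is asynchronous, so poll briefly before asserting.
    for _ in range(50):
        if not psutil.pid_exists(pid):
            break
        time.sleep(0.1)
    assert not psutil.pid_exists(pid)

    ray.shutdown()
```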
Background
Leaking Ray Train actors have been observed occupying GPU memory following Train run termination, causing training failures/OOMs in subsequent train runs. Despite the train actors being marked DEAD by Ray Core, we find upon ssh-ing into nodes that the actor processes are still alive and occupying valuable GPU memory.
Upon further investigation, we find that the actor process hangs on Python garbage collection following termination via `__ray_terminate__`. `__ray_terminate__` marks the actor as DEAD but does not guarantee the termination of the actor process. Following `__ray_terminate__`, when circular references are cleaned up during Python interpreter teardown/garbage collection, the execution of object finalizers can hang if the worker state is unstable, preventing the actor from ever exiting.
Previous User Workaround
We find that if garbage collection is performed earlier via `gc.collect()`, while the worker state is more stable prior to shutdown (e.g., at the end of the user `train_func`), there are no zombie actors. This mitigation was leveraged by users prior to this stable fix.
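A minimal sketch of that workaround (the training function body is illustrative):

```python
# Illustrative sketch of the prior user-side workaround (not part of this PR).
# Forcing a collection at the end of the training function cleans up circular
# references while the worker is still in a stable state, so nothing is left
# for interpreter-teardown GC to finalize.
import gc


def train_func(config: dict) -> None:
    # ... user training loop, checkpointing, etc. ...

    # Workaround: collect garbage before the worker begins shutting down.
    gc.collect()
```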
Changes
This PR:
- Replaces `__ray_terminate__` with `ray.kill` in Train run shutdown and abort paths to guarantee the termination of train actors

With the changes, the same train job as above no longer has zombie train actors.
`ray.kill` preserves shutdown patience
Previously, `__ray_terminate__` was used for "graceful" termination of actors, scheduling the termination task with a timeout of 5 seconds to allow pending tasks to complete, after which `ray.kill` would be called for any non-terminated actors. While this PR does not schedule a termination task, it still gives the worker shutdown hooks a (default) timeout of 5 seconds, preserving the amount of time that shutdown hooks and other pending tasks have to complete. The difference is that after those 5 seconds, `ray.kill` is called to terminate every actor.
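A minimal sketch of this grace-period-then-kill pattern, assuming hypothetical names (`shutdown_workers`, `run_shutdown_hooks`); it is not the PR's exact implementation:

```python
# Hypothetical sketch of the pattern described above; names and structure are
# illustrative, not the PR's actual _shutdown_workers code.
import ray

DEFAULT_SHUTDOWN_TIMEOUT_S = 5.0  # default grace period for shutdown hooks


def shutdown_workers(worker_actors, timeout_s=DEFAULT_SHUTDOWN_TIMEOUT_S):
    # Give each worker's shutdown hooks a grace period to finish.
    # `run_shutdown_hooks` is a hypothetical actor method used for illustration.
    hook_refs = [actor.run_shutdown_hooks.remote() for actor in worker_actors]
    ray.wait(hook_refs, num_returns=len(hook_refs), timeout=timeout_s)

    # After the grace period, force-kill every actor so no process lingers.
    for actor in worker_actors:
        ray.kill(actor)
```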
Reliability of `ray.kill` vs `__ray_terminate__` Shutdown
`ray.kill` performs a forceful out-of-band shutdown of an actor process, bypassing atexit handlers and Python shutdown hooks, similar in nature to a SIGKILL, whereas `__ray_terminate__` raises an exception, allowing Python shutdown hooks to run and giving Train actors the opportunity to hang. Because Train actors do not rely on these hooks or atexit handlers for functionality, `ray.kill` is the better fit for our use case.
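For reference, a minimal illustration of the two termination paths, using a generic actor rather than the actual Train worker:

```python
# Illustration of graceful vs. forceful actor termination in Ray; TrainWorker
# here is a generic stand-in, not Ray Train's actual worker actor.
import ray

ray.init()


@ray.remote
class TrainWorker:
    def ping(self) -> str:
        return "ok"


graceful = TrainWorker.remote()
forceful = TrainWorker.remote()

# Graceful: queues a termination task; the actor process exits itself,
# running atexit handlers and Python shutdown hooks (which can hang).
graceful.__ray_terminate__.remote()

# Forceful: out-of-band kill of the actor process, similar to SIGKILL;
# atexit handlers and shutdown hooks are skipped.
ray.kill(forceful)
```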