[train] Cleanup Zombie RayTrainWorker Actors #59872
justinvyu merged 6 commits into ray-project:master
Conversation
Code Review
This pull request addresses a critical issue of leaking Ray Train actors that could lead to out-of-memory errors in subsequent training runs. The change from __ray_terminate__ to ray.kill is a solid approach to guarantee actor process termination. The refactored _shutdown_workers function is now cleaner and more reliable. It correctly allows a grace period for shutdown hooks to execute before ensuring all workers are killed. I have one suggestion to improve the robustness of the timeout handling.
justinvyu left a comment
Great investigation! 🚀 Can we add your minimal repro as a test? For example, we could assert on PID liveness.
Also, can you expand on why `__ray_terminate__` fails vs. `ray.kill()`? E.g., letting a Python process exit itself vs. externally force-killing the process. And can you also add some details about how the early `gc.collect()` fixes things?
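A rough sketch of such a PID-liveness check, using a generic actor and `psutil` (illustrative only, not the test that eventually landed):

```python
# Sketch of a PID-liveness regression test; the actor here is a generic
# stand-in for the Train worker, and psutil is assumed to be available.
import os
import time

import psutil
import ray


@ray.remote
class Worker:
    def get_pid(self) -> int:
        return os.getpid()


def test_worker_process_is_killed():
    ray.init()
    worker = Worker.remote()
    pid = ray.get(worker.get_pid.remote())
    assert psutil.pid_exists(pid)

    ray.kill(worker)

    # Actor teardown is asynchronous, so poll briefly before asserting.
    for _ in range(50):
        if not psutil.pid_exists(pid):
            break
        time.sleep(0.1)
    assert not psutil.pid_exists(pid)

    ray.shutdown()
```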
Background
Leaking Ray Train actors have been observed occupying GPU memory following Train run termination, causing training failures/OOMs in subsequent train runs. Despite the train actors being marked DEAD by Ray Core, we find upon ssh-ing into nodes that the actor processes are still alive and occupying valuable GPU memory.
Upon further investigation, we find that the actor process hangs on Python garbage collection following termination via `__ray_terminate__`. `__ray_terminate__` marks the actor as DEAD but does not guarantee the termination of the actor process. Following `__ray_terminate__`, when circular references are cleaned up during Python interpreter teardown/garbage collection, the execution of object finalizers can hang if the worker state is unstable, preventing the actor from ever exiting.
Previous User Workaround
We find that if garbage collection is performed earlier via `gc.collect()`, while the worker state is more stable prior to shutdown (e.g., at the end of the user `train_func`), there are no zombie actors. This mitigation was leveraged by users prior to this stable fix.
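A minimal sketch of that workaround (the training function body is illustrative):

```python
# Illustrative sketch of the prior user-side workaround (not part of this PR).
# Forcing a collection at the end of the training function cleans up circular
# references while the worker is still in a stable state, so nothing is left
# for interpreter-teardown GC to finalize.
import gc


def train_func(config: dict) -> None:
    # ... user training loop, checkpointing, etc. ...

    # Workaround: collect garbage before the worker begins shutting down.
    gc.collect()
```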
Changes
This PR:
- Replaces `__ray_terminate__` with `ray.kill` in Train run shutdown and abort paths to guarantee the termination of train actors

With the changes, the same train job as above no longer has zombie train actors.
`ray.kill` preserves shutdown patience
Previously, `__ray_terminate__` was used for "graceful" termination of actors, scheduling the termination task with a timeout of 5 seconds to allow pending tasks to complete, after which `ray.kill` would be called for any non-terminated actors. While this PR does not schedule a termination task, it still gives the worker shutdown hooks a (default) timeout of 5 seconds, preserving the amount of time that shutdown hooks and other pending tasks have to complete. The difference is that after those 5 seconds, `ray.kill` is called to terminate every actor.
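A minimal sketch of this grace-period-then-kill pattern, assuming hypothetical names (`shutdown_workers`, `run_shutdown_hooks`); it is not the PR's exact implementation:

```python
# Hypothetical sketch of the pattern described above; names and structure are
# illustrative, not the PR's actual _shutdown_workers code.
import ray

DEFAULT_SHUTDOWN_TIMEOUT_S = 5.0  # default grace period for shutdown hooks


def shutdown_workers(worker_actors, timeout_s=DEFAULT_SHUTDOWN_TIMEOUT_S):
    # Give each worker's shutdown hooks a grace period to finish.
    # `run_shutdown_hooks` is a hypothetical actor method used for illustration.
    hook_refs = [actor.run_shutdown_hooks.remote() for actor in worker_actors]
    ray.wait(hook_refs, num_returns=len(hook_refs), timeout=timeout_s)

    # After the grace period, force-kill every actor so no process lingers.
    for actor in worker_actors:
        ray.kill(actor)
```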
Reliability of `ray.kill` vs `__ray_terminate__` Shutdown
`ray.kill` performs a forceful out-of-band shutdown of an actor process, bypassing atexit handlers and Python shutdown hooks, similar in nature to a SIGKILL, whereas `__ray_terminate__` raises an exception, allowing Python shutdown hooks to run and giving Train actors the opportunity to hang. Because Train actors do not rely on these hooks or atexit handlers for functionality, `ray.kill` is the better fit for our use case.
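For reference, a minimal illustration of the two termination paths, using a generic actor rather than the actual Train worker:

```python
# Illustration of graceful vs. forceful actor termination in Ray; TrainWorker
# here is a generic stand-in, not Ray Train's actual worker actor.
import ray

ray.init()


@ray.remote
class TrainWorker:
    def ping(self) -> str:
        return "ok"


graceful = TrainWorker.remote()
forceful = TrainWorker.remote()

# Graceful: queues a termination task; the actor process exits itself,
# running atexit handlers and Python shutdown hooks (which can hang).
graceful.__ray_terminate__.remote()

# Forceful: out-of-band kill of the actor process, similar to SIGKILL;
# atexit handlers and shutdown hooks are skipped.
ray.kill(forceful)
```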