[Core] Ray workers are not killed by SIGTERM #40182

Closed
rkooo567 opened this issue Oct 6, 2023 · 5 comments · Fixed by #40210
Labels: bug, core, P0, ray 2.8, release-blocker

Comments

@rkooo567
Contributor

rkooo567 commented Oct 6, 2023

What happened + What you expected to happen

It looks like Ray workers are not killed by SIGTERM. Several code paths use SIGTERM to terminate worker processes, so those workers always end up being terminated ungracefully and never run their critical destructors (e.g., cleaning up child processes). This is important for ML workloads.
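To illustrate what "graceful" termination buys here, the following is a minimal sketch, not Ray's actual worker code: a process installs a SIGTERM handler that runs a cleanup callback and then exits normally. The names `install_graceful_sigterm` and `cleanup` are hypothetical, chosen only for this example.

```python
import signal
import sys


def install_graceful_sigterm(cleanup):
    """Install a SIGTERM handler that runs `cleanup` and then exits.

    `cleanup` is a hypothetical zero-argument callable, e.g. one that
    terminates child processes. Returns the previously installed handler.
    """

    def handler(signum, frame):
        # Run the critical cleanup first, then raise SystemExit so that
        # `finally` blocks and atexit hooks also get a chance to run.
        cleanup()
        sys.exit(0)

    # signal.signal must be called from the main thread.
    return signal.signal(signal.SIGTERM, handler)
```

If SIGTERM never takes effect and the process is later killed with SIGKILL instead, no handler like this can run, which is why child processes may be left behind.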

Versions / Dependencies

master

Reproduction script

Send SIGTERM to a Ray worker and observe what happens.
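A rough way to check this behavior without Ray is sketched below, assuming a POSIX system. It starts a stand-in child process, sends it SIGTERM, and reports whether it exits within a timeout. The helper `exits_on_sigterm` and the two child programs are hypothetical stand-ins; the second child ignores SIGTERM to mimic the symptom described in this issue.

```python
import signal
import subprocess
import sys

# Stand-in child that keeps the default SIGTERM disposition (terminate).
DEFAULT_CHILD = "import time; print('ready', flush=True); time.sleep(60)"

# Stand-in child that ignores SIGTERM, mimicking the reported symptom.
IGNORING_CHILD = (
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(60)"
)


def exits_on_sigterm(child_source, timeout_s=5.0):
    """Start a Python child, send SIGTERM, and report whether it exits in time."""
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        # Wait until the child has finished its signal setup.
        proc.stdout.readline()
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=timeout_s)
            return True
        except subprocess.TimeoutExpired:
            return False
    finally:
        # Always reap the child so the check leaves nothing behind.
        if proc.poll() is None:
            proc.kill()
        proc.wait()
        proc.stdout.close()
```

For an actual Ray worker, the same idea would apply by sending the signal to the worker's PID instead of a stand-in child.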

Issue Severity

None

rkooo567 added the bug, P0, core, and ray 2.8 labels on Oct 6, 2023
@rkooo567
Contributor Author

rkooo567 commented Oct 6, 2023

The original issue caused by this one: #40189

@YQ-Wang
Contributor

YQ-Wang commented Oct 10, 2023

Just want to add my observation. When I run a Ray job in a Ray cluster, the driver code on the Ray head node is not able to catch the exception after I terminate the worker node ungracefully.

@rkooo567
Contributor Author

Hmm, I think it may take some time until it can detect ungraceful failures. For example, detecting an ungraceful node failure would take 30 seconds to 1 minute.

@YQ-Wang
Contributor

YQ-Wang commented Oct 10, 2023

> Hmm, I think it may take some time until it can detect ungraceful failures. For example, detecting an ungraceful node failure would take 30 seconds to 1 minute.

Yeah, but in my case it never detects the ungraceful failure. The job always stays stuck in the running status even though the worker node is down.

@rkooo567
Contributor Author

rkooo567 commented Oct 10, 2023

Hmm, interesting. That seems orthogonal to this particular issue (probably related to keepalive). Is it possible to create a new issue with a reproducible script? We can start the conversation from there.

jjyao added the release-blocker label on Oct 20, 2023