[Core] Ray workers are not killed by SIGTERM #40182

rkooo567 · 2023-10-06T08:47:26Z

What happened + What you expected to happen

It looks like Ray workers are not killed by SIGTERM. We have several code paths that use SIGTERM to terminate worker processes, which means those workers are always terminated ungracefully and never run critical destructors (important for ML workloads), e.g., cleaning up child processes.
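For illustration, here is a minimal sketch of the expected behavior, using a plain Python subprocess as a stand-in for a worker (this is not Ray code; `CHILD_SOURCE` and `terminate_and_collect` are hypothetical names, and the sketch assumes a POSIX system). On SIGTERM the process should get a chance to run its cleanup, whereas a forced kill skips it.

```python
import signal
import subprocess
import sys
import textwrap

# Child program: a stand-in for a worker process (assumption: not Ray code).
CHILD_SOURCE = textwrap.dedent("""
    import signal, sys, time

    def on_sigterm(signum, frame):
        # Turn SIGTERM into a normal interpreter exit so that
        # finally blocks and atexit hooks get a chance to run.
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    try:
        print("ready", flush=True)
        time.sleep(60)
    finally:
        print("cleanup ran", flush=True)
""")


def terminate_and_collect(sig):
    """Start the child, send `sig`, and return (return code, remaining output)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child has installed its handler before signalling.
    assert proc.stdout.readline().strip() == "ready"
    proc.send_signal(sig)
    out, _ = proc.communicate(timeout=30)
    return proc.returncode, out


if __name__ == "__main__":
    # Graceful path: the handler runs and the cleanup message is printed.
    print(terminate_and_collect(signal.SIGTERM))
    # Forced path: SIGKILL cannot be handled, so the cleanup is skipped.
    print(terminate_and_collect(signal.SIGKILL))
```

If a worker ignores SIGTERM and is later force-killed, it ends up on the second path, which is why destructors and child-process cleanup never run.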

Versions / Dependencies

master

Reproduction script

Send SIGTERM to a Ray worker and see what happens.

Issue Severity

None

rkooo567 · 2023-10-06T21:45:47Z

The original issue caused by this one: #40189

YQ-Wang · 2023-10-10T22:01:33Z

Just want to add my observation. When I run a Ray job on a Ray cluster, the driver code on the head node is not able to catch the exception after I ungracefully terminate a worker node.

rkooo567 · 2023-10-10T22:21:03Z

Hmm, I think it may take some time until it can detect ungraceful failures. For example, detecting an ungraceful node failure would take 30 seconds to 1 minute.

YQ-Wang · 2023-10-10T22:23:00Z

> Hmm, I think it may take some time until it can detect ungraceful failures. For example, detecting an ungraceful node failure would take 30 seconds to 1 minute.

Yeah, but it never detects the ungraceful failure, no matter how long I wait. The job always stays stuck in running status even though the worker node is down.

rkooo567 · 2023-10-10T22:24:01Z

Hmm, interesting. That seems orthogonal to this particular issue (probably related to keepalive). Is it possible to create a new issue with a reproducible script? We can start the conversation from there.

rkooo567 added the bug, P0, core, and ray 2.8 labels Oct 6, 2023

rkooo567 mentioned this issue Oct 6, 2023

[core] Forked child process of a worker is not killed when parent exists due to pg removed #40189

Closed

rkooo567 self-assigned this Oct 7, 2023

rkooo567 mentioned this issue Oct 9, 2023

[Core] Fix a bug where SIGTERM is ignored to worker processes #40210

Merged

jjyao added the release-blocker label Oct 20, 2023

rickyyx closed this as completed in #40210 Oct 25, 2023
