
[GCS FT] Mark job as finished for dead node (#40431) #40742

Merged 1 commit on Oct 27, 2023

Conversation

@rkooo567 (Contributor) commented Oct 27, 2023

This blocks several customers, and many KubeRay users have complained about this issue.

The risk of the fix is pretty low, but we should still rerun the release tests.

The issue is that `ray list nodes` times out before it manages to return the list of all nodes. The timeout happens because GcsJobManager tries to fetch pending tasks from the drivers of nodes that are already dead (killed head nodes), and confirming that such a node is dead takes a 2-minute timeout. The proposed solution: when a node dies, mark every job submitted to that head node as "finished", so the pending-task RPCs are only sent to drivers on the current head node, which is most likely alive, and no timeout occurs.
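The fix described above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the actual Ray GCS implementation (which is C++ inside `GcsJobManager`); all class and method names here are hypothetical.

```python
# Hypothetical sketch: when a node dies, mark its jobs finished so that
# later "get pending tasks" RPCs skip their unreachable drivers.
# Names (GcsJobManagerSketch, on_node_dead, ...) are illustrative only.

class GcsJobManagerSketch:
    def __init__(self):
        # job_id -> {"node_id": driver's node, "finished": bool}
        self.jobs = {}

    def submit_job(self, job_id, node_id):
        self.jobs[job_id] = {"node_id": node_id, "finished": False}

    def on_node_dead(self, node_id):
        # The core of the fix: every job whose driver lived on the dead
        # node is marked finished immediately, instead of waiting for a
        # 2-minute dead-driver RPC timeout per job.
        for job in self.jobs.values():
            if job["node_id"] == node_id:
                job["finished"] = True

    def jobs_to_query(self):
        # Only unfinished jobs receive the pending-task RPC; their drivers
        # are on the current head node and are most likely alive.
        return [jid for jid, j in self.jobs.items() if not j["finished"]]
```

With this bookkeeping, a listing call that previously blocked on RPCs to dead drivers now only contacts drivers of unfinished jobs.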

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@zhe-thoughts (Collaborator) left a comment

We need this for the P0 issue #39947.

@vitsai vitsai merged commit 7dfb173 into ray-project:releases/2.8.0 Oct 27, 2023
55 of 59 checks passed

5 participants