
[GCS FT] Mark job as finished for dead node #40431

Merged
merged 11 commits into ray-project:master on Oct 27, 2023

Conversation

jonathan-anyscale
Contributor

@jonathan-anyscale jonathan-anyscale commented Oct 17, 2023

Why are these changes needed?

The issue is that ray list nodes times out before it manages to return the list of all nodes. The timeout happens because GcsJobManager tries to fetch pending tasks from the driver of a node that is already dead (a killed head node), and it takes a 2-minute timeout to confirm that the node is dead. The proposed solution is to mark every job submitted to a head node as "finished" when that node dies, so the pending-task RPCs are only sent to drivers on the current head node, which are most likely alive, and no timeout occurs.
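For illustration only (not code from this PR): a minimal sketch of how the fixed behavior can be observed from the Python side, assuming Ray's ray.util.state listing API and its usual node/job field names.

import ray
from ray.util.state import list_jobs, list_nodes

# Attach to the cluster behind the restarted head node.
ray.init(address="auto")

# With this change, listing nodes returns promptly instead of waiting
# ~2 minutes for pending-task RPCs against the dead head node's driver.
for node in list_nodes():
    print(node.node_id, node.state)

# Jobs whose driver lived on the dead head node are now reported as
# finished rather than left running.
for job in list_jobs():
    print(job.job_id, job.status)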

Related issue number

Closes #23963
Closes #39947

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@rkooo567 rkooo567 left a comment

Can you add e2e tests to test_gcs_fault_tolerance? 1. Start a long-running job. 2. Restart the head node. 3. Verify the previous job is dead with the correct error.

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale changed the title from "[GCS FT]: Fix Ray job submission hang" to "[GCS FT] Mark job as finished for dead node" on Oct 26, 2023
src/ray/gcs/gcs_server/gcs_job_manager.cc (resolved)
auto node_id = NodeID::FromBinary(address.raylet_id());
gcs_job_manager.OnNodeDead(node_id);

// Test get all jobs with limit larger than the number of jobs.
Contributor

What does this comment mean?

Contributor Author

It's not supposed to be there; I'll remove it.

Comment on lines 625 to 626
auto job_info1 = all_job_info_reply2.job_info_list().Get(0);
auto job_info2 = all_job_info_reply2.job_info_list().Get(1);
Contributor

These two variables are not used?

entrypoint="python -c 'import ray; ray.init(); print(ray.cluster_resources());'"
)
# restart the gcs server
ray._private.worker._global_node.kill_gcs_server()
Contributor

Killing the GCS won't mark the node as dead; is this what we want to test?

Contributor

@rkooo567 rkooo567 left a comment

LGTM. The tests need a bit more modification before merging.

client = JobSubmissionClient(gcs_address)

# submit job
job_id = client.submit_job(
Contributor

1. Set gcs_rpc_server_reconnect_timeout_s = ?
2. Submit a long-running job with 1 head node.
3. cluster.remove_node(head)
4. Wait until the driver pid is gone.
5. Restart the head node with cluster.add_node().
6. Make sure the driver is dead.
A rough sketch of this flow is shown below.
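A rough sketch of this flow, not the test as merged: it assumes the ray_start_cluster_head_with_external_redis fixture from python/ray/tests/conftest.py, generate_system_config_map and wait_for_condition from ray._private.test_utils, and that JobSubmissionClient accepts the GCS address as in the snippet above; the test name and the exact final status assertion are illustrative.

import pytest

from ray._private.test_utils import (
    generate_system_config_map,
    wait_for_condition,
)
from ray.job_submission import JobStatus, JobSubmissionClient


@pytest.mark.parametrize(
    "ray_start_cluster_head_with_external_redis",
    [generate_system_config_map(gcs_rpc_server_reconnect_timeout_s=60)],
    indirect=True,
)
def test_job_finished_after_head_node_restart(
    ray_start_cluster_head_with_external_redis,
):
    cluster = ray_start_cluster_head_with_external_redis
    head_node = cluster.head_node

    client = JobSubmissionClient(head_node.gcs_address)

    # 1. Submit a long-running job whose driver runs on the head node.
    submission_id = client.submit_job(
        entrypoint="python -c 'import ray, time; ray.init(); time.sleep(600)'"
    )
    wait_for_condition(
        lambda: client.get_job_status(submission_id) == JobStatus.RUNNING,
        timeout=60,
    )

    # 2. Kill the whole head node (driver included), then restart it.
    cluster.remove_node(head_node)
    new_head_node = cluster.add_node()

    # 3. The job submitted to the old head node should be reported as
    #    failed instead of triggering 2-minute pending-task timeouts.
    new_client = JobSubmissionClient(new_head_node.gcs_address)
    wait_for_condition(
        lambda: new_client.get_job_status(submission_id) == JobStatus.FAILED,
        timeout=60,
    )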

Contributor Author

I changed it to this flow, but it's still failing when checking that the job is marked as FAILED after the raylet is killed, so I might need help with that.

Contributor

I think instead of head_node.kill_raylet, we can kill the entire head node with cluster.remove_node(head_node) and then restart it with cluster.add_node().

src/ray/gcs/gcs_server/gcs_job_manager.cc (resolved)
RAY_LOG(DEBUG) << "Marking job: " << data.first << " as finished";
MarkJobAsFinished(data.second, [data](Status status) {
if (!status.ok()) {
RAY_LOG(WARNING) << "Failed to mark job as finished";
Contributor

Add status to logs

<< "Failed to mark job as finished. Status: " << status

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
client->NumPendingTasks(
std::move(request),
[reply, i, num_processed_jobs, try_send_reply](
[data, reply, i, num_processed_jobs, try_send_reply](
Contributor

Instead of capturing the entire data, let's just capture worker_id

@rkooo567
Contributor

Let me know when it is ready to be merged.

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
python/ray/tests/test_gcs_fault_tolerance.py (outdated, resolved)
python/ray/tests/test_gcs_fault_tolerance.py (outdated, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.cc (outdated, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.cc (outdated, resolved)
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@rkooo567 rkooo567 merged commit 5ac06e9 into ray-project:master Oct 27, 2023
39 of 44 checks passed
rkooo567 pushed a commit to rkooo567/ray that referenced this pull request Oct 27, 2023
vitsai pushed a commit that referenced this pull request Oct 27, 2023

Co-authored-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
3 participants