
[GCS FT] Mark job as finished for dead node #40431

Merged
merged 11 commits into ray-project:master on Oct 27, 2023

Conversation

jonathan-anyscale
Contributor

@jonathan-anyscale jonathan-anyscale commented Oct 17, 2023

Why are these changes needed?

The issue is that ray list nodes times out before it manages to return the list of all nodes. The timeout happens because GcsJobManager tries to fetch pending tasks from the driver of a node that is already dead (a killed head node), and it takes a 2-minute timeout to confirm that the node is dead. The proposed solution is to mark every job submitted to a head node as "finished" when that node dies, so the pending-task RPCs are only sent to drivers on the current head node, which are most likely alive, and no timeout occurs.
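For illustration only (not code from this PR): a minimal sketch of how the fixed behavior can be observed from the Python side, assuming Ray's ray.util.state listing API and its usual node/job field names.

import ray
from ray.util.state import list_jobs, list_nodes

# Attach to the cluster behind the restarted head node.
ray.init(address="auto")

# With this change, listing nodes returns promptly instead of waiting
# ~2 minutes for pending-task RPCs against the dead head node's driver.
for node in list_nodes():
    print(node.node_id, node.state)

# Jobs whose driver lived on the dead head node are now reported as
# finished rather than left running.
for job in list_jobs():
    print(job.job_id, job.status)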

Related issue number

Closes #23963
Closes #39947

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@rkooo567 rkooo567 left a comment

Can you add e2e tests to test_gcs_fault_tolerance? 1. Start a long-running job. 2. Restart the head node. 3. Verify the previous job is dead with the correct error.

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@jonathan-anyscale jonathan-anyscale changed the title from "[GCS FT]: Fix Ray job submission hang" to "[GCS FT] Mark job as finished for dead node" on Oct 26, 2023
src/ray/gcs/gcs_server/gcs_job_manager.cc (resolved)
auto node_id = NodeID::FromBinary(address.raylet_id());
gcs_job_manager.OnNodeDead(node_id);

// Test get all jobs with limit larger than the number of jobs.
Contributor

What does this comment mean?

Contributor Author

It's not supposed to be there; I'll remove it.

Comment on lines 625 to 626
auto job_info1 = all_job_info_reply2.job_info_list().Get(0);
auto job_info2 = all_job_info_reply2.job_info_list().Get(1);
Contributor

These two variables are not used?

entrypoint="python -c 'import ray; ray.init(); print(ray.cluster_resources());'"
)
# restart the gcs server
ray._private.worker._global_node.kill_gcs_server()
Contributor

Killing the GCS won't mark the node as dead; is this what we want to test?

Contributor

@rkooo567 rkooo567 left a comment

LGTM. The tests need a bit more modification before merging.

client = JobSubmissionClient(gcs_address)

# submit job
job_id = client.submit_job(
Contributor

1. Set gcs_rpc_server_reconnect_timeout_s = ?
2. Submit a long-running job with 1 head node.
3. cluster.remove_node(head)
4. Wait until the driver pid is gone.
5. Restart the head node with cluster.add_node().
6. Make sure the driver is dead.
A rough sketch of this flow is shown below.
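A rough sketch of this flow, not the test as merged: it assumes the ray_start_cluster_head_with_external_redis fixture from python/ray/tests/conftest.py, generate_system_config_map and wait_for_condition from ray._private.test_utils, and that JobSubmissionClient accepts the GCS address as in the snippet above; the test name and the exact final status assertion are illustrative.

import pytest

from ray._private.test_utils import (
    generate_system_config_map,
    wait_for_condition,
)
from ray.job_submission import JobStatus, JobSubmissionClient


@pytest.mark.parametrize(
    "ray_start_cluster_head_with_external_redis",
    [generate_system_config_map(gcs_rpc_server_reconnect_timeout_s=60)],
    indirect=True,
)
def test_job_finished_after_head_node_restart(
    ray_start_cluster_head_with_external_redis,
):
    cluster = ray_start_cluster_head_with_external_redis
    head_node = cluster.head_node

    client = JobSubmissionClient(head_node.gcs_address)

    # 1. Submit a long-running job whose driver runs on the head node.
    submission_id = client.submit_job(
        entrypoint="python -c 'import ray, time; ray.init(); time.sleep(600)'"
    )
    wait_for_condition(
        lambda: client.get_job_status(submission_id) == JobStatus.RUNNING,
        timeout=60,
    )

    # 2. Kill the whole head node (driver included), then restart it.
    cluster.remove_node(head_node)
    new_head_node = cluster.add_node()

    # 3. The job submitted to the old head node should be reported as
    #    failed instead of triggering 2-minute pending-task timeouts.
    new_client = JobSubmissionClient(new_head_node.gcs_address)
    wait_for_condition(
        lambda: new_client.get_job_status(submission_id) == JobStatus.FAILED,
        timeout=60,
    )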

Contributor Author

I changed it to this flow, but it's still failing when checking that the job is marked as FAILED after the raylet is killed, so I might need help with that.

Contributor

I think instead of head_node.kill_raylet, we can kill the entire head node with cluster.remove_node(head_node) and then restart it with cluster.add_node().

src/ray/gcs/gcs_server/gcs_job_manager.cc (resolved)
RAY_LOG(DEBUG) << "Marking job: " << data.first << " as finished";
MarkJobAsFinished(data.second, [data](Status status) {
if (!status.ok()) {
RAY_LOG(WARNING) << "Failed to mark job as finished";
Contributor

Add status to logs

<< "Failed to mark job as finished. Status: " << status

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
client->NumPendingTasks(
std::move(request),
[reply, i, num_processed_jobs, try_send_reply](
[data, reply, i, num_processed_jobs, try_send_reply](
Contributor

Instead of capturing the entire data, let's just capture worker_id

@rkooo567
Contributor

Let me know when it is ready to be merged.

Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
python/ray/tests/test_gcs_fault_tolerance.py (outdated, resolved)
python/ray/tests/test_gcs_fault_tolerance.py (outdated, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.cc (outdated, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.cc (outdated, resolved)
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
@rkooo567 rkooo567 merged commit 5ac06e9 into ray-project:master Oct 27, 2023
39 of 44 checks passed
rkooo567 pushed a commit to rkooo567/ray that referenced this pull request Oct 27, 2023
vitsai pushed a commit that referenced this pull request Oct 27, 2023

Co-authored-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
3 participants