[GCS FT] Mark job as finished for dead node #40431
Conversation
Can you add an e2e test with test_gcs_fault_tolerance? 1. Start a long-running job. 2. Restart the head node. 3. Verify the previous job is dead with the correct error.
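The e2e test described above has to poll for the job's terminal state rather than assert immediately after the restart. A minimal polling helper, sketched in plain Python (Ray's test utilities ship a similar `wait_for_condition`, so this only illustrates the pattern; the commented job-status call is a hypothetical usage, not this PR's code):

```python
import time

def wait_for_condition(predicate, timeout_s=30.0, interval_s=0.5):
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s:.1f}s")

# In the e2e test this would wrap a job-status check, e.g. (illustrative):
# wait_for_condition(lambda: client.get_job_status(job_id) == "FAILED")
```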
Signed-off-by: Jonathan Nitisastro <jonathancn@anyscale.com>
auto node_id = NodeID::FromBinary(address.raylet_id());
gcs_job_manager.OnNodeDead(node_id);

// Test get all jobs with limit larger than the number of jobs.
What does this comment mean?
It's not supposed to be there; I'll remove it.
auto job_info1 = all_job_info_reply2.job_info_list().Get(0);
auto job_info2 = all_job_info_reply2.job_info_list().Get(1);
These two variables are not used?
    entrypoint="python -c 'import ray; ray.init(); print(ray.cluster_resources());'"
)
# restart the gcs server
ray._private.worker._global_node.kill_gcs_server()
Killing the GCS won't mark the node as dead; is this what we want to test?
LGTM. The tests need a little more modification before merging.
client = JobSubmissionClient(gcs_address)

# submit job
job_id = client.submit_job(
Set gcs_rpc_server_reconnect_timeout_s = ?
1. Submit a long-running job with one head node.
2. cluster.remove_node(head)
3. Wait until the driver pid is gone.
4. Restart the head node with cluster.add_node().
5. Make sure the driver is dead.
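The "wait until the driver pid is gone" step can be sketched with a stdlib liveness probe. This is an illustration of the pattern, not Ray's test code, and assumes a POSIX system:

```python
import os
import subprocess
import time

def pid_alive(pid):
    """Return True if a process with this pid currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True

def wait_until_dead(pid, timeout_s=10.0, interval_s=0.1):
    """Poll until the pid disappears; True if it died within the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(interval_s)
    return False

# Demo: watch a short-lived child process die.
proc = subprocess.Popen(["sleep", "0.2"])
proc.wait()  # reap the child (a non-child driver pid would not need this)
print(wait_until_dead(proc.pid, timeout_s=2.0))  # → True
```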
Changed to this flow, but it is still failing when checking whether the job is marked as FAILED after the raylet is killed, so I might need help on that.
I think instead of head_node.kill_raylet, we can kill the entire head node with cluster.remove_node(head_node) and then restart it with cluster.add_node().
RAY_LOG(DEBUG) << "Marking job: " << data.first << " as finished";
MarkJobAsFinished(data.second, [data](Status status) {
  if (!status.ok()) {
    RAY_LOG(WARNING) << "Failed to mark job as finished";
Add the status to the log:
<< "Failed to mark job as finished. Status: " << status
client->NumPendingTasks(
    std::move(request),
    [data, reply, i, num_processed_jobs, try_send_reply](
Instead of capturing the entire data, let's just capture the worker_id.
Let me know when this is ready to be merged.
Co-authored-by: jonathan-anyscale <144177685+jonathan-anyscale@users.noreply.github.com>
Why are these changes needed?
ray list nodes times out before it manages to return the list of all nodes. The timeout is caused by GcsJobManager trying to get pending tasks from the driver of a node that is already dead (a killed head node); confirming that the node is dead takes a 2-minute timeout. The proposed solution is that when a node dies, we mark all jobs submitted to that head node as "finished", so the pending-task RPCs are only sent to drivers on the current head node, which are most likely alive, and no timeout occurs.
Related issue number
Closes #23963
Closes #39947
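The mechanism above can be sketched as a toy model: jobs are keyed by the node their driver runs on, and marking a node dead flags its jobs as finished so the pending-task fan-out skips their drivers entirely. All names here are illustrative and do not match Ray's actual GcsJobManager API:

```python
class ToyGcsJobManager:
    """Toy model of the fix: skip dead-node drivers when collecting pending tasks."""

    def __init__(self):
        # job_id -> {"node_id": ..., "finished": bool}
        self.jobs = {}

    def add_job(self, job_id, node_id):
        self.jobs[job_id] = {"node_id": node_id, "finished": False}

    def on_node_dead(self, node_id):
        # Mark every job whose driver ran on the dead node as finished,
        # so no pending-task RPC is issued that would hang ~2 minutes.
        for job in self.jobs.values():
            if job["node_id"] == node_id:
                job["finished"] = True

    def jobs_to_query(self):
        # Only still-running jobs would get a NumPendingTasks RPC.
        return [jid for jid, j in self.jobs.items() if not j["finished"]]

mgr = ToyGcsJobManager()
mgr.add_job("job_1", "old_head")
mgr.add_job("job_2", "new_head")
mgr.on_node_dead("old_head")
print(mgr.jobs_to_query())  # → ['job_2']
```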
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.