-
Notifications
You must be signed in to change notification settings - Fork 6.5k
[core] Fix "Check failed: it->second.num_retries_left == -1" #54116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: can <can@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a crash caused by resubmitting a task after it’s been canceled by adding a guard in ResubmitTask
and a corresponding test.
- Return
false
early inTaskManager::ResubmitTask
when a task is canceled. - Add
TestResubmitCanceledTask
to verify that resubmitting a canceled task fails gracefully.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
task_manager.cc | Added a check to bail out when num_retries_left is zero |
task_manager_test.cc | Added TestResubmitCanceledTask to cover the new cancellation path |
cc @dayshah |
src/ray/core_worker/task_manager.cc
Outdated
@@ -326,6 +326,11 @@ bool TaskManager::ResubmitTask(const TaskID &task_id, std::vector<ObjectID> *tas | |||
return false; | |||
} | |||
|
|||
if (it->second.num_retries_left == 0) { | |||
// This can happen when the task has been marked for cancellation. | |||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do more than just return false out of this because we set the object error, which is visible to the user, based on this. I think if we just directly return false the error will become OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED
but the status should be whatever we set it to when cancel happens, not this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, i'll add another error type
ASSERT_FALSE(manager_.ResubmitTask(spec.TaskId(), &task_deps)); | ||
|
||
// Final cleanup. | ||
reference_counter_->RemoveLocalReference(return_id, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to do this in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test suite requires every test case to clean themselves up (line 153)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😨
@dayshah's comments |
I'm missing something really obvious here, but what's the context for this? How does the the object_ref become available if the task was canceled? The possibilities I can think of are:
In all of those cases, should objects be reconstructed if the user intended for the task to be canceled? |
89dcc22
to
26d6a1d
Compare
src/ray/core_worker/task_manager.cc
Outdated
return ResubmitTaskResult::FAILED_MAX_ATTEMPT_EXCEEDED; | ||
} | ||
|
||
if (it->second.num_retries_left == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main change is this condition; the rest is boilerplate due to the addition of a new field in common.proto
src/ray/core_worker/task_manager.cc
Outdated
|
||
if (it->second.num_retries_left == 0) { | ||
// This can happen when the task has been marked for cancellation. | ||
return ResubmitTaskResult::FAILED_TASK_CANCELED; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was thinking we could just use the existing cancel error type don't need a new custom one just for this situation.
ray/src/ray/protobuf/common.proto
Lines 203 to 204 in 95d41f6
// Indicates that an object has been cancelled. | |
TASK_CANCELLED = 5; |
The user is trying to cancel, we just need to honor that cancel and not resubmit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol that's much better
I think the description needs to be rewritten a bit. The reason for this is that one thread, the thread the user actually calls ray.cancel on, can start the cancel, while another thread (the io_service of the reconstruction periodical runner) decides to resubmit the task. They both use the task manager. Cancelling doesn't atomically use the task manager to cancel, so you could do the first part of the cancel, set ray/src/ray/core_worker/transport/normal_task_submitter.cc Lines 710 to 726 in 95d41f6
@can-anyscale should probably leave a comment in the test too on how these things are called and how this interleaving can happen. Also might be worth thinking about any higher-level fixes, that could fix the submissible_tasks_ check failure issue too. I haven't put too much thought into what the implications of an atomic cancel might be. |
Signed-off-by: can <can@anyscale.com>
return TASK_CANCELED as the error type |
@israbbani - answer yours first, since I think it's simpler for me to explain. The crash was caused by the third case you mentioned—when "the task finished before it was canceled." My fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash. |
@dayshah - after closer investigation, I don't think this is caused by the race in the cancelling logic (basically not the race between https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L711 and https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L740). This crash only happens when the task is NOT pending, so the call to Another fix I considered was having But open for other suggestions. |
@can-anyscale thanks for the explanation! I think the PR description can be updated to state these points more clearly:
What happens in the other cases? |
@dayshah what's the submissible_tasks_ check failure issue? |
@dayshah , @israbbani : i added a python e2e test to reproduce the issue without the fix, and works with the fix; hopefully that makes the situation clearer without the fix, the ![]() |
a14f53d
to
90f5208
Compare
give me a chance to review this one before merging |
Signed-off-by: can <can@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nits but lgtm
recovery_failure_callback_(object_id, | ||
rpc::ErrorType::TASK_CANCELLED, | ||
/*pin_object=*/true); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, but can you use a switch case here with the two enums. compiler warnings will guarantee that you're actually covering all enumeration cases in the code for the future and you don't need else's / defaults
https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wswitch-enum
/// \return SUCCESS if the task was successfully resubmitted (task or actor being | ||
/// scheduled, but no guarantee on completion), or was already pending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// \return SUCCESS if the task was successfully resubmitted (task or actor being | |
/// scheduled, but no guarantee on completion), or was already pending. | |
/// \return SUCCESS if the task was successfully resubmitted or the task was already pending. |
we never resubmit actor creation tasks through this codepath, that has to go through the gcs.
And scheduled is usually a different term from submitted, means the resubmission actually got assigned to a node, so should just stick to resubmitted
This PR fixed the reported
Check failed: it->second.num_retries_left == -1
. This check fails when the returned object of a canceled task is reconstructed. Concretely the sequencing is:This happens to
This fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash.
I added a
that failed before the fix with the check failed, and pass afterwards.
Stack trace:
Test: