
[core] Fix "Check failed: it->second.num_retries_left == -1" #54116


Open

can-anyscale wants to merge 3 commits into master from can-p01

Conversation

@can-anyscale (Collaborator) commented Jun 25, 2025

This PR fixes the reported Check failed: it->second.num_retries_left == -1. The check fails when the return object of a canceled task is reconstructed. Concretely, the sequence is:

  1. A task is first marked as completed (i.e., no longer pending) but remains in the queue because its return object might need to be reconstructed later
  2. The task is canceled
  3. A retry of the task is triggered to reconstruct its lost return object

This bug:

  1. Affects both normal tasks and actor tasks
  2. Is a deterministic sequencing bug rather than a thread race, and can be reproduced reliably via a unit test

This fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash.

I added:

  1. a C++ unit test
  2. an end-to-end Python test

Both fail with this check failure before the fix and pass afterwards.
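
For illustration only (this is not the actual Ray code, and all names here are made up), the following self-contained C++ sketch models the sequence described above and the guard the fix adds. In this model, -1 stands for "unlimited retries" and a cancel zeroes the remaining budget, which is enough to reproduce the failing check:

#include <cassert>
#include <iostream>
#include <unordered_map>

// Illustrative stand-in for the per-task bookkeeping the task manager keeps.
struct TaskEntry {
  bool pending = true;
  // In this model, -1 stands for "unlimited retries"; a cancel zeroes the budget.
  int num_retries_left = -1;
};

enum class ResubmitResult { kSuccess, kNotFound, kCanceled };

std::unordered_map<int, TaskEntry> submissible_tasks;

// Models TaskManager::ResubmitTask. Without the num_retries_left == 0 guard,
// the assert below plays the role of the RAY_CHECK that used to crash.
ResubmitResult Resubmit(int task_id) {
  auto it = submissible_tasks.find(task_id);
  if (it == submissible_tasks.end()) {
    return ResubmitResult::kNotFound;
  }
  if (it->second.num_retries_left == 0) {
    // Task has been marked for cancellation; honor the cancel, do not resubmit.
    return ResubmitResult::kCanceled;
  }
  assert(it->second.num_retries_left == -1);  // the old failing check
  it->second.pending = true;
  return ResubmitResult::kSuccess;
}

int main() {
  const int task_id = 1;
  submissible_tasks[task_id] = TaskEntry{};

  // 1. The task completes but stays in the map in case its return object
  //    needs to be reconstructed later.
  submissible_tasks[task_id].pending = false;

  // 2. The task is canceled, which zeroes its retry budget.
  submissible_tasks[task_id].num_retries_left = 0;

  // 3. Lost-object recovery tries to resubmit the task. With the guard the
  //    cancel is honored; without it, the assert above would fire.
  assert(Resubmit(task_id) == ResubmitResult::kCanceled);
  std::cout << "resubmit refused: task was canceled\n";
  return 0;
}

Removing the num_retries_left == 0 guard from this toy model makes the assert fire, mirroring the original RAY_CHECK failure.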

Stack trace:

[2025-06-08 23:40:33,057 C 1553 1681] task_manager.cc:341:  Check failed: it->second.num_retries_left == -1 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.12/site-packages/ray/_raylet.so(+0x1484c2a) [0x79d215843c2a] ray::operator<<()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x79d2158466a9] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager12ResubmitTaskERKNS_6TaskIDEPSt6vectorINS_8ObjectIDESaIS6_EE+0x271) [0x79d214dc32b1] ray::core::TaskManager::ResubmitTask()
/home/ray/anaconda3/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core21ObjectRecoveryManager17ReconstructObjectERKNS_8ObjectIDE+0x1aa) [0x79d214db492a] ray::core::ObjectRecoveryManager::ReconstructObject()
...

Test:

  • CI

Signed-off-by: can <can@anyscale.com>
@can-anyscale can-anyscale marked this pull request as ready for review June 26, 2025 00:16
@can-anyscale can-anyscale requested review from Copilot and a team June 26, 2025 00:16
@Copilot (Contributor) left a comment


Pull Request Overview

This PR addresses a crash caused by resubmitting a task after it’s been canceled by adding a guard in ResubmitTask and a corresponding test.

  • Return false early in TaskManager::ResubmitTask when a task is canceled.
  • Add TestResubmitCanceledTask to verify that resubmitting a canceled task fails gracefully.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File                  Description
task_manager.cc       Added a check to bail out when num_retries_left is zero
task_manager_test.cc  Added TestResubmitCanceledTask to cover the new cancellation path

@can-anyscale (Collaborator Author)

cc @dayshah

@@ -326,6 +326,11 @@ bool TaskManager::ResubmitTask(const TaskID &task_id, std::vector<ObjectID> *tas
    return false;
  }

  if (it->second.num_retries_left == 0) {
    // This can happen when the task has been marked for cancellation.
    return false;
Contributor

I think we should do more than just return false out of this because we set the object error, which is visible to the user, based on this. I think if we just directly return false the error will become OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED but the status should be whatever we set it to when cancel happens, not this.
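
To make this concern concrete, here is a small standalone sketch (illustrative enums, not Ray's real types): a bare bool return from ResubmitTask forces the recovery path to assume a single failure reason, whereas a richer result lets it report the cancel:

#include <cassert>
#include <iostream>

// Illustrative stand-ins, not Ray's real enums.
enum class ResubmitResult { kSuccess, kFailedMaxAttemptsExceeded, kFailedTaskCanceled };
enum class ErrorType { kObjectUnreconstructableMaxAttemptsExceeded, kTaskCancelled };

// With a plain `return false` the recovery path could only ever pick the
// max-attempts error; a result enum lets it report the cancel instead.
ErrorType ErrorForFailedResubmit(ResubmitResult result) {
  assert(result != ResubmitResult::kSuccess);
  if (result == ResubmitResult::kFailedTaskCanceled) {
    return ErrorType::kTaskCancelled;
  }
  return ErrorType::kObjectUnreconstructableMaxAttemptsExceeded;
}

int main() {
  // A canceled task's failed resubmit should surface as a cancel, not as
  // "too many reconstruction attempts".
  assert(ErrorForFailedResubmit(ResubmitResult::kFailedTaskCanceled) ==
         ErrorType::kTaskCancelled);
  std::cout << "canceled resubmit reported as TASK_CANCELLED\n";
  return 0;
}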

Collaborator Author

makes sense, I'll add another error type

ASSERT_FALSE(manager_.ResubmitTask(spec.TaskId(), &task_deps));

// Final cleanup.
reference_counter_->RemoveLocalReference(return_id, nullptr);
Contributor

why do we need to do this in the test?

Collaborator Author

this test suite requires every test case to clean up after itself (line 153)

Contributor

😨

@israbbani israbbani added the go add ONLY when ready to merge, run all tests label Jun 26, 2025
@can-anyscale can-anyscale requested a review from dayshah June 26, 2025 02:53
@can-anyscale (Collaborator Author)

@dayshah's comments

@israbbani (Contributor)

This check fails when the returned object of a canceled task is reconstructed.

I'm missing something really obvious here, but what's the context for this? How does the object_ref become available if the task was canceled? The possibilities I can think of are:

  • num_returns > 1 from a task
  • the task was a generator task
  • the task finished before it was canceled

In all of those cases, should objects be reconstructed if the user intended for the task to be canceled?

@can-anyscale can-anyscale force-pushed the can-p01 branch 2 times, most recently from 89dcc22 to 26d6a1d Compare June 26, 2025 05:00
@can-anyscale can-anyscale requested a review from a team as a code owner June 26, 2025 05:00
    return ResubmitTaskResult::FAILED_MAX_ATTEMPT_EXCEEDED;
  }

  if (it->second.num_retries_left == 0) {
Collaborator Author

The main change is this condition; the rest is boilerplate due to the addition of a new field in common.proto


  if (it->second.num_retries_left == 0) {
    // This can happen when the task has been marked for cancellation.
    return ResubmitTaskResult::FAILED_TASK_CANCELED;
Contributor

I was thinking we could just use the existing cancel error type; we don't need a new custom one just for this situation.

// Indicates that an object has been cancelled.
TASK_CANCELLED = 5;

The user is trying to cancel, we just need to honor that cancel and not resubmit

Collaborator Author

lol that's much better

@dayshah (Contributor) commented Jun 26, 2025

This check fails when the returned object of a canceled task is reconstructed.

I'm missing something really obvious here, but what's the context for this? How does the object_ref become available if the task was canceled? The possibilities I can think of are:

  • num_returns > 1 from a task
  • the task was a generator task
  • the task finished before it was canceled

In all of those cases, should objects be reconstructed if the user intended for the task to be canceled?

I think the description needs to be rewritten a bit. The reason for this is that one thread, the thread the user actually calls ray.cancel on, can start the cancel, while another thread (the io_service of the reconstruction periodical runner) decides to resubmit the task. They both use the task manager. Cancelling doesn't atomically use the task manager to cancel, so you could do the first part of the cancel, set num_retries_left to 0, and then the resubmit happens, and then you use the task manager to try to fail the task and set the object to error.

  if (cancelled_tasks_.find(task_spec.TaskId()) != cancelled_tasks_.end() ||
      !task_finisher_.MarkTaskCanceled(task_spec.TaskId()) ||
      !task_finisher_.IsTaskPending(task_spec.TaskId())) {
    return Status::OK();
  }
  auto &scheduling_key_entry = scheduling_key_entries_[scheduling_key];
  auto &scheduling_tasks = scheduling_key_entry.task_queue;
  // This cancels tasks that have completed dependencies and are awaiting
  // a worker lease.
  if (!scheduling_tasks.empty()) {
    for (auto spec = scheduling_tasks.begin(); spec != scheduling_tasks.end(); spec++) {
      if (spec->TaskId() == task_spec.TaskId()) {
        scheduling_tasks.erase(spec);
        CancelWorkerLeaseIfNeeded(scheduling_key);
        task_finisher_.FailPendingTask(task_spec.TaskId(),
                                       rpc::ErrorType::TASK_CANCELLED);

@can-anyscale should probably leave a comment in the test too on how these things are called and how this interleaving can happen. Also might be worth thinking about any higher-level fixes, that could fix the submissible_tasks_ check failure issue too. I haven't put too much thought into what the implications of an atomic cancel might be.

@can-anyscale (Collaborator Author)

return TASK_CANCELED as the error type

@can-anyscale (Collaborator Author)

@israbbani - answering yours first, since I think it's simpler to explain. The crash was caused by the third case you mentioned, when "the task finished before it was canceled." My fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash.

@can-anyscale (Collaborator Author)

@dayshah - after closer investigation, I don't think this is caused by the race in the cancelling logic (basically not the race between https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L711 and https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L740).

This crash only happens when the task is NOT pending, so the call to FailPendingTask is effectively unreachable. If you look at my test case, it's a simple case of a task being resubmitted after being canceled and no longer pending.

Another fix I considered was having MarkTaskCanceled remove the task from the submissible_map if it's not pending, but that would introduce another function that mutates shared state, which seems to increase the potential for race conditions. So I ended up going with a "hey ResubmitTask, protect yourself from external state, don't assume anything" kind of approach.

But open for other suggestions.

@can-anyscale can-anyscale requested a review from dayshah June 26, 2025 16:31
@israbbani (Contributor) commented Jun 26, 2025

@can-anyscale thanks for the explanation! I think the PR description can be updated to state these points more clearly:

  1. Which threads is the race between?
  2. Which part of the task life cycle does this happen in?
  3. Which execution models does this happen in? (e.g. actor tasks, normal tasks, async actors, etc.)

The crash was caused by the third case you mentioned, when "the task finished before it was canceled."

What happens in the other cases?

@israbbani (Contributor)

Also might be worth thinking about any higher-level fixes, that could fix the submissible_tasks_ check failure issue too. I haven't put too much thought into what the implications of an atomic cancel might be.

@dayshah what's the submissible_tasks_ check failure issue?

@israbbani israbbani self-assigned this Jun 26, 2025
@can-anyscale (Collaborator Author)

found some relevant PRs lol #48661; I'll sync with @dayshah first to understand the different scenarios of task cancellation, then I'll update the description.

@can-anyscale (Collaborator Author)

@dayshah, @israbbani: I added a Python e2e test that reproduces the issue without the fix and passes with the fix; hopefully that makes the situation clearer.

Without the fix, test_ray_cancel crashes:

[screenshot of the test_ray_cancel crash output]

@can-anyscale can-anyscale force-pushed the can-p01 branch 3 times, most recently from a14f53d to 90f5208 Compare June 26, 2025 21:21
@edoakes (Collaborator) commented Jun 26, 2025

give me a chance to review this one before merging

Signed-off-by: can <can@anyscale.com>
@dayshah (Contributor) left a comment


minor nits but lgtm

      recovery_failure_callback_(object_id,
                                 rpc::ErrorType::TASK_CANCELLED,
                                 /*pin_object=*/true);
    }
Contributor

nit, but can you use a switch case here with the two enums? Compiler warnings will guarantee that you're actually covering all enumeration cases in the future, and you don't need elses / defaults.

https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wswitch-enum
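
A minimal standalone example of this pattern (illustrative enum, not the actual Ray code): with no default case, -Wswitch-enum (or plain -Wswitch, since there is no default) reports any enumerator the switch does not handle, so adding a value to the enum later forces this code to be updated:

// Compile with: g++ -std=c++17 -Wswitch-enum -Werror switch_enum_demo.cc
#include <iostream>

enum class RecoveryError { kMaxAttemptsExceeded, kTaskCancelled };

const char *Describe(RecoveryError err) {
  // No default case on purpose: if a new enumerator is added later and not
  // handled here, -Wswitch-enum (with -Werror) fails the build.
  switch (err) {
  case RecoveryError::kMaxAttemptsExceeded:
    return "object reconstruction exceeded the retry budget";
  case RecoveryError::kTaskCancelled:
    return "task was cancelled by the user";
  }
  return "unknown";  // unreachable; keeps -Wreturn-type quiet
}

int main() {
  std::cout << Describe(RecoveryError::kTaskCancelled) << "\n";
  return 0;
}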

Comment on lines +241 to +242
/// \return SUCCESS if the task was successfully resubmitted (task or actor being
/// scheduled, but no guarantee on completion), or was already pending.
Contributor

Suggested change:
- /// \return SUCCESS if the task was successfully resubmitted (task or actor being
- /// scheduled, but no guarantee on completion), or was already pending.
+ /// \return SUCCESS if the task was successfully resubmitted or the task was already pending.

We never resubmit actor creation tasks through this codepath; that has to go through the GCS. And "scheduled" is usually a different term from "submitted": it means the resubmission actually got assigned to a node, so we should just stick to "resubmitted".

Labels: go (add ONLY when ready to merge, run all tests)
4 participants