Conversation

@Sparks0219
Contributor

Description

Using the iptables script created in #58241, we found a bug in RequestWorkerLease where a RAY_CHECK was being triggered here:

RAY_CHECK(inserted.second) << "Lease depedencies can be requested only once per lease. "
                           << lease_id;

The issue is that transient network errors can happen at ANY time, including while the server-side logic is executing and has not yet replied to the client. Our original testing framework used an env variable to drop the request or reply only at the moment it was sent, so this case was missed. Specifically, RequestWorkerLease can be in the middle of pulling the lease dependencies into its local plasma store when the retry arrives and trips this check. I created a C++ unit test that triggers this RAY_CHECK without this change and passes with it. I decided to store the callbacks instead of replacing the older one with the new one because of possible message reordering, where the new request could arrive before the old one.
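The shape of the fix, as a simplified sketch (stand-in types, not the actual LocalLeaseManager/ClusterLeaseManager code):

#include <functional>
#include <utility>
#include <vector>

struct Reply {};  // stand-in for the rpc reply type
using SendReplyCallback = std::function<void()>;

struct Work {
  // Every arrival of the same lease request (original or retry) appends its
  // callback here instead of failing a RAY_CHECK on the duplicate.
  std::vector<std::pair<SendReplyCallback, Reply *>> reply_callbacks_;
};

// Called when a duplicate RequestWorkerLease arrives for a queued lease.
void AddReplyCallback(Work &work, SendReplyCallback cb, Reply *reply) {
  work.reply_callbacks_.emplace_back(std::move(cb), reply);
}

// Called once the dependencies are pulled and the lease is granted: reply to
// every stored request so no client-side retry is left hanging.
void ReplyToAll(Work &work) {
  for (auto &cb : work.reply_callbacks_) {
    cb.first();
  }
  work.reply_callbacks_.clear();
}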

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 29, 2025 01:01
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 29, 2025
@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 29, 2025 01:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses an idempotency issue in RequestWorkerLease by allowing multiple callbacks to be stored for a single lease request, which is a robust way to handle retries from transient network errors. The changes are well-integrated across the scheduling components, and the new unit test provides good validation for the fix. My review includes a couple of suggestions to refine the StoreReplyCallback implementations by using find() instead of operator[] on maps to prevent unintended side effects and improve efficiency. Overall, this is a solid improvement to the scheduler's reliability.

Comment on lines 513 to 524
for (const auto &work : leases_to_schedule_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
for (const auto &work : infeasible_leases_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Contributor

medium

Using operator[] on leases_to_schedule_ and infeasible_leases_ will create a new empty std::deque if the scheduling_class is not found. This is inefficient and can lead to the map being populated with empty entries. It's better to use find() to check for the key's existence before accessing the deque.

  auto it = leases_to_schedule_.find(scheduling_class);
  if (it != leases_to_schedule_.end()) {
    for (const auto &work : it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }
  auto infeasible_it = infeasible_leases_.find(scheduling_class);
  if (infeasible_it != infeasible_leases_.end()) {
    for (const auto &work : infeasible_it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }

Comment on lines 1283 to 1288
for (const auto &work : leases_to_grant_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Contributor

medium

Using operator[] on leases_to_grant_ will create a new empty std::deque if the scheduling_class is not found. This is inefficient and can lead to the map being populated with empty entries. It's better to use find() to check for the key's existence before accessing the deque.

  auto leases_to_grant_it = leases_to_grant_.find(scheduling_class);
  if (leases_to_grant_it != leases_to_grant_.end()) {
    for (const auto &work : leases_to_grant_it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }

return false;
}

void LocalLeaseManager::StoreReplyCallback(const SchedulingClass &scheduling_class,
Contributor Author

I was considering whether I should combine IsLeaseQueued and StoreReplyCallback, but felt it was clearer to keep them separate.

Collaborator

agree

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 29, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Collaborator

@edoakes edoakes left a comment

Summarizing to check my understanding:

  • Previously, the local lease manager assumed that it would only ever get a single request to pull dependencies for a lease request.
  • However, if the RPC is retried after we start pulling dependencies for the lease request, we would end up re-requesting to pull the dependencies again.
  • To address this, you are allowing duplicate requests and replying to all of them once the pull is complete. You are doing this instead of overwriting the ongoing callback because the retry could come in before the initial request, in which case overwriting would leave us replying only to the initial request, and the client would hang forever.

Did I miss anything?

return false;
}

void LocalLeaseManager::StoreReplyCallback(const SchedulingClass &scheduling_class,
Collaborator

agree

Comment on lines 1003 to 1005
for (const auto &reply_callback : reply_callbacks) {
  ::ray::rpc::ResourceMapEntry *resource;
  for (auto &resource_id : allocated_resources->ResourceIds()) {
Collaborator

Might be missing something, but it looks like these loops should be inverted -- nothing about the inner loop logic depends on which callback we are iterating through. So you can make a single pass through allocated_resources->ResourceIds() and populate all callbacks' resource mappings at once instead.

Contributor Author

Makes sense, inverted the loops
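For reference, the inverted shape in miniature (stand-in types, not Ray's protos or the PR's actual code): one pass over the allocated resources, fanning each entry out to every stored reply instead of rebuilding the mapping once per callback.

#include <string>
#include <vector>

struct ResourceEntry { std::string name; double quantity; };
struct Reply { std::vector<ResourceEntry> resource_mapping; };

// Single pass over the allocated resources; each entry is appended to every
// reply, since nothing about the entry depends on which callback we serve.
void PopulateReplies(const std::vector<ResourceEntry> &allocated,
                     std::vector<Reply *> &replies) {
  for (const auto &entry : allocated) {
    for (Reply *reply : replies) {
      reply->resource_mapping.push_back(entry);
    }
  }
}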

    (*it->second)->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Collaborator

real? ^

@Sparks0219
Contributor Author

> Did I miss anything?

Nope, that pretty much summarizes it! The idempotency guards we have in place only kick in once the lease is granted, but we're vulnerable in between the lease-arrived and lease-granted stages, which includes the dependency-pulling stage.

> real? ^

Yea... I called StoreReplyCallback under the assumption that it's only used after IsLeaseQueued, but that's not good; I'll do what the AI said.
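To make the reordering concern above concrete, a toy model (hypothetical, simplified callbacks; the real ones carry the rpc reply and status):

#include <functional>
#include <iostream>
#include <vector>

int main() {
  std::vector<std::function<void()>> reply_callbacks;

  // Reordered arrival: the retry's callback is stored first...
  reply_callbacks.push_back([] { std::cout << "reply to retry\n"; });
  // ...and the original request's callback arrives afterwards.
  reply_callbacks.push_back([] { std::cout << "reply to original\n"; });

  // Store-all policy: when the pull completes, every stored callback fires,
  // so whichever RPC the client is still waiting on gets its reply. An
  // overwrite policy would keep only the last-arriving (original) callback,
  // leaving the retry unanswered and the client hung.
  for (auto &cb : reply_callbacks) cb();
  return 0;
}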

Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from edoakes October 30, 2025 21:13
Collaborator

@edoakes edoakes left a comment

LGTM, only stylistic comments. Ping for merge when ready.

Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
    return true;
  }
  return false;
}
cursor[bot]

Bug: Lease Queue Race Condition

A race condition exists between IsLeaseQueued and AddReplyCallback due to their inconsistent search orders for leases. IsLeaseQueued checks waiting_leases_index_ then leases_to_grant_, while AddReplyCallback checks the reverse. This allows a lease to move between queues after IsLeaseQueued returns true, causing AddReplyCallback to fail and trigger a RAY_CHECK in HandleRequestWorkerLease, crashing the Raylet.

@edoakes edoakes enabled auto-merge (squash) October 30, 2025 23:57
@edoakes edoakes merged commit 168cdc6 into ray-project:master Oct 31, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…ses (ray-project#58265)
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ses (ray-project#58265)
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ses (ray-project#58265)