Conversation

@Sparks0219
Contributor

Description

Using the iptables script created in #58241, we found a bug in RequestWorkerLease where a RAY_CHECK was being triggered here:

RAY_CHECK(inserted.second) << "Lease depedencies can be requested only once per lease. "
                           << lease_id;

The issue is that transient network errors can happen at ANY time, including while the server-side logic is executing and has not yet replied to the client. Our original testing framework used an env variable to drop the request or reply only at the moment it was sent, so this case was missed. Specifically, RequestWorkerLease can be in the middle of pulling the lease dependencies into its local plasma store when the retry arrives and trips this check. I created a C++ unit test that triggers this RAY_CHECK without this change and passes with it. I decided to store the callbacks instead of replacing the older one with the new one because of possible message reordering, where the new request could arrive before the old one.
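The shape of the fix, as a simplified sketch (stand-in types, not the actual LocalLeaseManager/ClusterLeaseManager code):

#include <functional>
#include <utility>
#include <vector>

struct Reply {};  // stand-in for the rpc reply type
using SendReplyCallback = std::function<void()>;

struct Work {
  // Every arrival of the same lease request (original or retry) appends its
  // callback here instead of failing a RAY_CHECK on the duplicate.
  std::vector<std::pair<SendReplyCallback, Reply *>> reply_callbacks_;
};

// Called when a duplicate RequestWorkerLease arrives for a queued lease.
void AddReplyCallback(Work &work, SendReplyCallback cb, Reply *reply) {
  work.reply_callbacks_.emplace_back(std::move(cb), reply);
}

// Called once the dependencies are pulled and the lease is granted: reply to
// every stored request so no client-side retry is left hanging.
void ReplyToAll(Work &work) {
  for (auto &cb : work.reply_callbacks_) {
    cb.first();
  }
  work.reply_callbacks_.clear();
}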

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from a team as a code owner October 29, 2025 01:01
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 29, 2025
@Sparks0219 Sparks0219 requested review from dayshah and edoakes October 29, 2025 01:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses an idempotency issue in RequestWorkerLease by allowing multiple callbacks to be stored for a single lease request, which is a robust way to handle retries from transient network errors. The changes are well-integrated across the scheduling components, and the new unit test provides good validation for the fix. My review includes a couple of suggestions to refine the StoreReplyCallback implementations by using find() instead of operator[] on maps to prevent unintended side effects and improve efficiency. Overall, this is a solid improvement to the scheduler's reliability.

Comment on lines 513 to 524
for (const auto &work : leases_to_schedule_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
for (const auto &work : infeasible_leases_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Contributor

medium

Using operator[] on leases_to_schedule_ and infeasible_leases_ will create a new empty std::deque if the scheduling_class is not found. This is inefficient and can lead to the map being populated with empty entries. It's better to use find() to check for the key's existence before accessing the deque.

  auto it = leases_to_schedule_.find(scheduling_class);
  if (it != leases_to_schedule_.end()) {
    for (const auto &work : it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }
  auto infeasible_it = infeasible_leases_.find(scheduling_class);
  if (infeasible_it != infeasible_leases_.end()) {
    for (const auto &work : infeasible_it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }

Comment on lines 1283 to 1288
for (const auto &work : leases_to_grant_[scheduling_class]) {
  if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
    work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Contributor

medium

Using operator[] on leases_to_grant_ will create a new empty std::deque if the scheduling_class is not found. This is inefficient and can lead to the map being populated with empty entries. It's better to use find() to check for the key's existence before accessing the deque.

  auto leases_to_grant_it = leases_to_grant_.find(scheduling_class);
  if (leases_to_grant_it != leases_to_grant_.end()) {
    for (const auto &work : leases_to_grant_it->second) {
      if (work->lease_.GetLeaseSpecification().LeaseId() == lease_id) {
        work->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
        return;
      }
    }
  }

return false;
}

void LocalLeaseManager::StoreReplyCallback(const SchedulingClass &scheduling_class,
Contributor Author

I was considering whether I should combine IsLeaseQueued and StoreReplyCallback, but felt it was clearer to keep them separate.

Collaborator

agree

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 29, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Collaborator

@edoakes edoakes left a comment

Summarizing to check my understanding:

  • Previously, the local lease manager assumed that it would only ever get a single request to pull dependencies for a lease request.
  • However, if the RPC is retried after we start pulling dependencies for the lease request, we would end up re-requesting to pull the dependencies again.
  • To address this, you are allowing duplicate requests and replying to all of them once the pull is complete. You are doing this instead of overwriting the ongoing callback because the retry could come in before the initial request, in which case overwriting would leave us replying only to the initial request, and the client would hang forever.

Did I miss anything?

return false;
}

void LocalLeaseManager::StoreReplyCallback(const SchedulingClass &scheduling_class,
Collaborator

agree

Comment on lines 1003 to 1005
for (const auto &reply_callback : reply_callbacks) {
  ::ray::rpc::ResourceMapEntry *resource;
  for (auto &resource_id : allocated_resources->ResourceIds()) {
Collaborator

Might be missing something, but it looks like these loops should be inverted -- nothing about the inner loop logic depends on which callback we are iterating through. So you can make a single pass through allocated_resources->ResourceIds() and populate all callbacks' resource mappings at once instead.

Contributor Author

Makes sense, inverted the loops
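For reference, the inverted shape in miniature (stand-in types, not Ray's protos or the PR's actual code): one pass over the allocated resources, fanning each entry out to every stored reply instead of rebuilding the mapping once per callback.

#include <string>
#include <vector>

struct ResourceEntry { std::string name; double quantity; };
struct Reply { std::vector<ResourceEntry> resource_mapping; };

// Single pass over the allocated resources; each entry is appended to every
// reply, since nothing about the entry depends on which callback we serve.
void PopulateReplies(const std::vector<ResourceEntry> &allocated,
                     std::vector<Reply *> &replies) {
  for (const auto &entry : allocated) {
    for (Reply *reply : replies) {
      reply->resource_mapping.push_back(entry);
    }
  }
}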

    (*it->second)->reply_callbacks_.emplace_back(std::move(send_reply_callback), reply);
    return;
  }
}
Collaborator

real? ^

@Sparks0219
Contributor Author

> Did I miss anything?

Nope, that pretty much summarizes it! The idempotency guards we have in place only kick in once the lease is granted, but we're vulnerable in between the lease-arrived and lease-granted stages, which includes the dependency-pulling stage.

> real? ^

Yea... I called StoreReplyCallback under the assumption that it's only used after IsLeaseQueued, but that's not good; I'll do what the AI said.
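To make the reordering concern above concrete, a toy model (hypothetical, simplified callbacks; the real ones carry the rpc reply and status):

#include <functional>
#include <iostream>
#include <vector>

int main() {
  std::vector<std::function<void()>> reply_callbacks;

  // Reordered arrival: the retry's callback is stored first...
  reply_callbacks.push_back([] { std::cout << "reply to retry\n"; });
  // ...and the original request's callback arrives afterwards.
  reply_callbacks.push_back([] { std::cout << "reply to original\n"; });

  // Store-all policy: when the pull completes, every stored callback fires,
  // so whichever RPC the client is still waiting on gets its reply. An
  // overwrite policy would keep only the last-arriving (original) callback,
  // leaving the retry unanswered and the client hung.
  for (auto &cb : reply_callbacks) cb();
  return 0;
}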

Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 requested a review from edoakes October 30, 2025 21:13
Collaborator

@edoakes edoakes left a comment

LGTM, only stylistic comments. Ping for merge when ready.

Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
    return true;
  }
  return false;
}
cursor[bot]

Bug: Lease Queue Race Condition

A race condition exists between IsLeaseQueued and AddReplyCallback due to their inconsistent search orders for leases. IsLeaseQueued checks waiting_leases_index_ then leases_to_grant_, while AddReplyCallback checks the reverse. This allows a lease to move between queues after IsLeaseQueued returns true, causing AddReplyCallback to fail and trigger a RAY_CHECK in HandleRequestWorkerLease, crashing the Raylet.

@edoakes edoakes enabled auto-merge (squash) October 30, 2025 23:57
@edoakes edoakes merged commit 168cdc6 into ray-project:master Oct 31, 2025
7 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…ses (ray-project#58265)
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ses (ray-project#58265)
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ses (ray-project#58265)