Skip to content

[core] (FreeObjects 2/n) Added GCS passive owner death object freeing#63218

Open
aaronscalene wants to merge 13 commits into
ray-project:masterfrom
aaronscalene:aaron/passive-owner-callback
Open

[core] (FreeObjects 2/n) Added GCS passive owner death object freeing#63218
aaronscalene wants to merge 13 commits into
ray-project:masterfrom
aaronscalene:aaron/passive-owner-callback

Conversation

@aaronscalene
Copy link
Copy Markdown
Contributor

@aaronscalene aaronscalene commented May 8, 2026

Description

This PR covers the second part of the FreeObjects change. This second part amends the functionality of the current owner dead pubsub callback to be moved from an eager pubsub death callback to a more passive listener callback. Originally, the design was that the owner is thought to be dead when the pubsub connection fails / ends for some unexpected reason, which then tells all the primary copy raylet to free and queue up FreeObjects.

Currently, the GCS already publishes WorkerFailure and NodeRemoved announcements to all raylets. With this new design, the raylets will now just have additional functionality to clear any objects that had their owner on the failed worker / node.

Considerations

Pass Over All Objects

This new design introduces a pass over all objects within the current node every announcement. However, failed nodes and worker announcements should not be too common and if the object is not owned by the failed worker / node it would be a no-op; thus this should not lead to any latency issues.

Owner is Less Eager

This new design will be a less eager free, especially for the primary copy of the node. This is definitely an important part to consider, however since it is similar a pubsub for node information, the latency should not be that big of a deal. Furthermore, the pubsub interface already keeps track of network failures etc. to some extent, and if there are further failures / GCS failures there are probably worse problems to deal with.

LocalObjectManager local_objects_ vs. ObjectManager local_objects_

An important distinction is needed between both classes' local objects, where we need to aggregate and free any objects within both classes. LocalObjectManager in this case keeps track of primary copies as well as spilled copies. ObjectManager keeps track of anything in plasma such as primary copies / secondary copies. Thus, we need to get the union of both objects to get the objects that we want to free.

Testing

Ran python tests and C++ tests adjacent to these changes to ensure no regressions. Future e2e testing in #63181

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new RPC mechanism, FreeLocalObjects, to allow the CoreWorker to explicitly instruct NodeManager instances to free local objects. This replaces the previous eviction-pubsub callback approach. My review identified several areas for improvement: the CoreWorker lambda capture should use shared_from_this for lifetime safety, the RPC implementation should be batched to avoid performance degradation in large clusters, and the now-unused objects_pending_deletion_ logic in LocalObjectManager should be cleaned up.

Comment on lines +4642 to +4643
io_service_.post(
[this, object_id, locations]() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two potential improvements for this block:

  1. Lifetime Safety: Since CoreWorker inherits from std::enable_shared_from_this, it is safer to capture a shared_ptr (e.g., self = shared_from_this()) in the lambda. This ensures the CoreWorker instance remains alive while the task is pending in the io_service_.
  2. Efficiency: Checking if locations is empty before posting to the io_service_ avoids unnecessary task scheduling overhead.
  if (locations.empty()) {
    return;
  }
  io_service_.post(
      [self = shared_from_this(), object_id, locations]() {

io_service_.post(
[this, object_id, locations]() {
rpc::FreeLocalObjectsRequest request;
request.add_object_ids(object_id.Binary());

This comment was marked as spam.

Comment on lines 176 to 178
if (!objects_pending_deletion_.empty()) {
RAY_LOG(DEBUG) << "Freeing " << objects_pending_deletion_.size()
<< " out-of-scope objects";
// TODO(irabbani): CORE-1640 will modify as much as the plasma API as is
// reasonable to remove usage of vectors in favor of sets.
std::vector<ObjectID> objects_to_delete(objects_pending_deletion_.begin(),
objects_pending_deletion_.end());
on_objects_freed_(objects_to_delete);
objects_pending_deletion_.clear();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With the removal of the on_objects_freed_ callback, objects_pending_deletion_ appears to be dead code as the objects are now deleted directly in NodeManager::FreeLocalObjects. As noted in the TODO on line 334 of the header, this set and the logic in FlushFreeObjects should be removed to simplify the code and avoid confusion.

@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 483b867 to 9482fe8 Compare May 8, 2026 23:21
@aaronscalene aaronscalene changed the title [core] [DO NOT MERGE] (FreeObjects 2/n) [core] (FreeObjects 2/n) Added GCS passive owner death removal May 15, 2026
@aaronscalene aaronscalene changed the title [core] (FreeObjects 2/n) Added GCS passive owner death removal [core] (FreeObjects 2/n) Added GCS passive owner death object freeing May 15, 2026
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 3c5129a to 01dd4d9 Compare May 15, 2026 21:20
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
@aaronscalene aaronscalene force-pushed the aaron/passive-owner-callback branch from 01dd4d9 to a8c096f Compare May 20, 2026 22:22
@aaronscalene aaronscalene marked this pull request as ready for review May 21, 2026 03:14
@aaronscalene aaronscalene requested a review from a team as a code owner May 21, 2026 03:14
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit a8c096f. Configure here.

repeated bytes object_ids = 1;
}

message FreeLocalObjectsReply {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proto file modification requires fault-tolerance review

Low Severity

⚠️ This PR modifies one or more .proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by project rule: Bugbot Rules

Reviewed by Cursor Bugbot for commit a8c096f. Configure here.

return matched;
}

// TODO(aaronscalene): Deal with ProcessSpilledObjectsDeleteQueue within FlushFreeObjects
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: on_objects_freed_ member never invoked

Medium Severity

The on_objects_freed_ member variable (still accepted as a constructor parameter and stored) is never invoked anywhere after the FlushFreeObjects refactor removed its only call site. In production this stored a lambda calling object_manager->FreeObjects(object_ids, /*local_only=*/false), which also performed buffer_pool_.AbortCreate to cancel in-flight object transfers. That abort logic is now silently lost—neither NodeManager::FreeLocalObjects nor the new RPC path calls it.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a8c096f. Configure here.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant