[core] (FreeObjects 2/n) Added GCS passive owner death object freeing#63218
[core] (FreeObjects 2/n) Added GCS passive owner death object freeing#63218aaronscalene wants to merge 13 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new RPC mechanism, FreeLocalObjects, to allow the CoreWorker to explicitly instruct NodeManager instances to free local objects. This replaces the previous eviction-pubsub callback approach. My review identified several areas for improvement: the CoreWorker lambda capture should use shared_from_this for lifetime safety, the RPC implementation should be batched to avoid performance degradation in large clusters, and the now-unused objects_pending_deletion_ logic in LocalObjectManager should be cleaned up.
| io_service_.post( | ||
| [this, object_id, locations]() { |
There was a problem hiding this comment.
There are two potential improvements for this block:
- Lifetime Safety: Since
CoreWorkerinherits fromstd::enable_shared_from_this, it is safer to capture ashared_ptr(e.g.,self = shared_from_this()) in the lambda. This ensures theCoreWorkerinstance remains alive while the task is pending in theio_service_. - Efficiency: Checking if
locationsis empty before posting to theio_service_avoids unnecessary task scheduling overhead.
if (locations.empty()) {
return;
}
io_service_.post(
[self = shared_from_this(), object_id, locations]() {| io_service_.post( | ||
| [this, object_id, locations]() { | ||
| rpc::FreeLocalObjectsRequest request; | ||
| request.add_object_ids(object_id.Binary()); |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
| if (!objects_pending_deletion_.empty()) { | ||
| RAY_LOG(DEBUG) << "Freeing " << objects_pending_deletion_.size() | ||
| << " out-of-scope objects"; | ||
| // TODO(irabbani): CORE-1640 will modify as much as the plasma API as is | ||
| // reasonable to remove usage of vectors in favor of sets. | ||
| std::vector<ObjectID> objects_to_delete(objects_pending_deletion_.begin(), | ||
| objects_pending_deletion_.end()); | ||
| on_objects_freed_(objects_to_delete); | ||
| objects_pending_deletion_.clear(); | ||
| } |
There was a problem hiding this comment.
With the removal of the on_objects_freed_ callback, objects_pending_deletion_ appears to be dead code as the objects are now deleted directly in NodeManager::FreeLocalObjects. As noted in the TODO on line 334 of the header, this set and the logic in FlushFreeObjects should be removed to simplify the code and avoid confusion.
483b867 to
9482fe8
Compare
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
3c5129a to
01dd4d9
Compare
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaron.li <aaron.li@anyscale.com>
01dd4d9 to
a8c096f
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit a8c096f. Configure here.
| repeated bytes object_ids = 1; | ||
| } | ||
|
|
||
| message FreeLocalObjectsReply {} |
There was a problem hiding this comment.
Proto file modification requires fault-tolerance review
Low Severity
.proto files.
Please review the RPC fault-tolerance & idempotency standards guide here:
https://github.com/ray-project/ray/tree/master/doc/source/ray-core/internals/rpc-fault-tolerance.rst
Additional Locations (1)
Triggered by project rule: Bugbot Rules
Reviewed by Cursor Bugbot for commit a8c096f. Configure here.
| return matched; | ||
| } | ||
|
|
||
| // TODO(aaronscalene): Deal with ProcessSpilledObjectsDeleteQueue within FlushFreeObjects |
There was a problem hiding this comment.
Dead code: on_objects_freed_ member never invoked
Medium Severity
The on_objects_freed_ member variable (still accepted as a constructor parameter and stored) is never invoked anywhere after the FlushFreeObjects refactor removed its only call site. In production this stored a lambda calling object_manager->FreeObjects(object_ids, /*local_only=*/false), which also performed buffer_pool_.AbortCreate to cancel in-flight object transfers. That abort logic is now silently lost—neither NodeManager::FreeLocalObjects nor the new RPC path calls it.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a8c096f. Configure here.


Description
This PR covers the second part of the FreeObjects change. This second part amends the functionality of the current owner dead pubsub callback to be moved from an eager pubsub death callback to a more passive listener callback. Originally, the design was that the owner is thought to be dead when the pubsub connection fails / ends for some unexpected reason, which then tells all the primary copy raylet to free and queue up FreeObjects.
Currently, the GCS already publishes WorkerFailure and NodeRemoved announcements to all raylets. With this new design, the raylets will now just have additional functionality to clear any objects that had their owner on the failed worker / node.
Considerations
Pass Over All Objects
This new design introduces a pass over all objects within the current node every announcement. However, failed nodes and worker announcements should not be too common and if the object is not owned by the failed worker / node it would be a no-op; thus this should not lead to any latency issues.
Owner is Less Eager
This new design will be a less eager free, especially for the primary copy of the node. This is definitely an important part to consider, however since it is similar a pubsub for node information, the latency should not be that big of a deal. Furthermore, the pubsub interface already keeps track of network failures etc. to some extent, and if there are further failures / GCS failures there are probably worse problems to deal with.
LocalObjectManager local_objects_ vs. ObjectManager local_objects_
An important distinction is needed between both classes' local objects, where we need to aggregate and free any objects within both classes. LocalObjectManager in this case keeps track of primary copies as well as spilled copies. ObjectManager keeps track of anything in plasma such as primary copies / secondary copies. Thus, we need to get the union of both objects to get the objects that we want to free.
Testing
Ran python tests and C++ tests adjacent to these changes to ensure no regressions. Future e2e testing in #63181