[core] Clarify ref counting PublishFailure fallback paths #63560
Merged
edoakes merged 2 commits on May 21, 2026
Signed-off-by: yicheng <yicheng@anyscale.com>
f68d4fa to
0fa9d85
Contributor
Code Review
This pull request updates comments and log messages in src/ray/core_worker/reference_counter.cc to provide more context regarding object reference counting and subscriber handling, specifically referencing GitHub PR #63557. I have no feedback to provide as there were no review comments to evaluate.
Yicheng-Lu-llll
commented
May 20, 2026
@@ -1745,9 +1745,8 @@ void ReferenceCounter::PublishObjectLocationSnapshot(const ObjectID &object_id)
  auto it = object_id_refs_.find(object_id);
  if (it == object_id_refs_.end()) {
    RAY_LOG(WARNING).WithField(object_id)
Member
Author
Logging here so the user can trace the surfaced error back to this log line.
// NOTE(swang): We have to publish failure to subscribers in case they
// subscribe after the ref is already deleted.
// It is possible that when ref count reaches zero, there are still subscribers.
// See https://github.com/ray-project/ray/pull/63560 for details
Member
Author
Not adding a log here since this is a hot path.
Description
It is possible that:
1. The owner's ref count reaches zero while there are still active object location subscribers. The owner must `PublishFailure` to all current subscribers so they receive a clean error immediately, instead of hanging until the fetch timeout fires.
2. A new object location subscription arrives after the owner has already removed the ref. The owner must `PublishFailure` to that subscriber for the same reason: surface a clean error rather than letting them wait for a location that will never come.
The fact that a ref count of zero does not always mean that no one still holds the ref (and might subscribe to its location) is not really a bug in the ref counting protocol. This is by current design, and could (and maybe should) be improved in the future. A detailed example follows.
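The two fallback paths described above can be sketched with a toy, single-process model of the owner. This is not Ray's implementation: `ToyOwner`, `add_reference`, `subscribe`, `erase_reference`, and the `("failure", object_id)` message are hypothetical names chosen for illustration. Only the idea that the owner publishes a failure on both paths comes from this description.

```python
from collections import defaultdict


class ToyOwner:
    """Hypothetical stand-in for an object owner; not Ray's ReferenceCounter."""

    def __init__(self):
        self.refs = set()                     # object IDs the owner still tracks
        self.subscribers = defaultdict(list)  # object ID -> subscriber inboxes

    def add_reference(self, object_id):
        self.refs.add(object_id)

    def subscribe(self, object_id, inbox):
        # Late-subscription path: the ref is already gone, so fail right away.
        if object_id not in self.refs:
            inbox.append(("failure", object_id))
            return
        self.subscribers[object_id].append(inbox)

    def erase_reference(self, object_id):
        # Zero-count path: fail every subscriber that is still attached.
        self.refs.discard(object_id)
        for inbox in self.subscribers.pop(object_id, []):
            inbox.append(("failure", object_id))


owner = ToyOwner()
owner.add_reference("obj")

early_inbox, late_inbox = [], []
owner.subscribe("obj", early_inbox)  # subscribes while the ref still exists
owner.erase_reference("obj")         # owner's ref count reaches zero
owner.subscribe("obj", late_inbox)   # subscribes after the ref was removed

print(early_inbox)
print(late_inbox)
```

In this sketch both inboxes end up with a failure message, so neither subscriber is left waiting for a location update.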
First, create a root actor. The root actor then starts actor A and actor B. Based on the ownership model, killing actor A won't cause actor B to die.
Now an object is created via `ray.put` inside the root actor. The root actor passes the ref to actor A through an actor task argument (for example, `actor_a.task.remote([ref])`). Actor A passes the same ref to actor B in the same way, then returns immediately. Actor B stores the ref in `self.ref`.
At this point actor B is borrowing the ref, but the root actor doesn't yet know: actor A's task hasn't replied, so the borrower info hasn't propagated back. Suppose actor A then dies before that reply is sent to the root actor. From the root actor's perspective:
Meanwhile actor B is still alive (killing A does not cascade to B, since they are siblings, not in a parent-child cascade relationship), still holds the ref in `self.ref`, and may at any point call `ray.get(self.ref)`. This subscribes actor B to the root actor's object location channel for that object.
Two possible timings:
- Actor B subscribes before the root actor removes the ref: the `PublishFailure` call in `EraseReference` notifies actor B.
- Actor B subscribes after the root actor has removed the ref: `PublishObjectLocationSnapshot` finds no entry for the object and sends `PublishFailure` to actor B directly.
Either way, actor B receives a clean error from `ray.get` instead of hanging for a long time waiting for a location that will never arrive.
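To make the timeline concrete, here is a rough simulation of the owner-side bookkeeping in the example above. It is a simplification rather than Ray's actual protocol: `ToyRootOwner` and all of its methods are hypothetical, and the sketch additionally assumes that the root actor drops its own local handle at some point, which the description does not spell out.

```python
class ToyRootOwner:
    """Hypothetical owner bookkeeping; a simplification, not Ray's protocol."""

    def __init__(self):
        self.local_refs = 0
        self.pending_task_args = 0    # refs pinned by in-flight task arguments
        self.known_borrowers = set()

    def put(self):
        self.local_refs += 1

    def submit_task_with_ref(self):
        self.pending_task_args += 1

    def on_task_reply(self, borrowers):
        # Borrower info only reaches the owner together with the task reply.
        self.pending_task_args -= 1
        self.known_borrowers |= set(borrowers)

    def on_task_failed(self):
        # The callee died before replying, so no borrower info arrives.
        self.pending_task_args -= 1

    def drop_local_ref(self):
        self.local_refs -= 1

    def ref_count_is_zero(self):
        return (
            self.local_refs == 0
            and self.pending_task_args == 0
            and not self.known_borrowers
        )


def run(actor_a_replies):
    root = ToyRootOwner()
    root.put()                    # object created in the root actor
    root.submit_task_with_ref()   # ref passed to actor A as a task argument
    actor_b_holds_ref = True      # A forwarded the ref and B stored it
    if actor_a_replies:
        root.on_task_reply(borrowers={"actor_b"})
    else:
        root.on_task_failed()     # A died before its reply was sent
    root.drop_local_ref()         # assumption: root releases its own handle
    return root.ref_count_is_zero(), actor_b_holds_ref


normal = run(actor_a_replies=True)
race = run(actor_a_replies=False)
print("A replies:", normal)
print("A dies before replying:", race)
```

When A replies, the toy owner learns about actor B and keeps the ref alive. When A dies first, the toy owner's count reaches zero even though B still holds the ref, which is the situation the `PublishFailure` fallbacks are there to handle.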