Skip to content

[core] Clarify ref counting PublishFailure fallback paths#63560

Merged
edoakes merged 2 commits into
ray-project:masterfrom
Yicheng-Lu-llll:clarify-ref-counting-fallback-comments
May 21, 2026
Merged

[core] Clarify ref counting PublishFailure fallback paths#63560
edoakes merged 2 commits into
ray-project:masterfrom
Yicheng-Lu-llll:clarify-ref-counting-fallback-comments

Conversation

@Yicheng-Lu-llll
Copy link
Copy Markdown
Member

@Yicheng-Lu-llll Yicheng-Lu-llll commented May 20, 2026

Description

It is possible that:

  • The owner's ref count reaches zero while there are still active object location subscribers. The owner must PublishFailure to all current subscribers so they receive a clean error immediately, instead of hanging until the fetch timeout fires.

  • A new object location subscription arrives after the owner has already removed the ref. The owner must PublishFailure to that subscriber for the same reason: surface a clean error rather than letting them wait for a location that will never come.

The claim that "ref count reaching zero does not always mean no one is still holding the ref and might subscribe to its location" is not really a bug in the ref counting protocol. This is by current design, and could (and maybe should) be improved in the future. A detailed example follows.

First, create a root actor. The root actor then starts actor A and actor B. Based on the ownership model, killing actor A won't cause actor B to die.

root actor worker -> child actor A worker
                  -> child actor B worker      (cascade kill tree)

Now an object is created via ray.put inside the root actor. The root actor passes the ref to actor A through an actor task argument (for example, actor_a.task.remote([ref])). Actor A passes the same ref to actor B in the same way, then returns immediately. Actor B stores the ref in self.ref.

At this point actor B is borrowing the ref, but the root actor doesn't yet know: actor A's task hasn't replied, so the borrower info hasn't propagated back. Suppose actor A then dies before that reply sent to the root actor. From the root actor's perspective:

  • Actor A's task failed, so its submitted task ref count drops to zero.
  • The root actor never received the reply from A that would have told it actor B is borrowing.
  • If the root actor also drops its own local ref, the ref count reaches zero and the root actor erases the entry.

Meanwhile actor B is still alive (killing A does not cascade to B, since they are siblings, not in a parent-child cascade relationship), still holds the ref in self.ref, and may at any point call ray.get(self.ref). This subscribes actor B to the root actor's object location channel for that object.

Two possible timings:

  • Actor B subscribed before the root actor erased the entry: the PublishFailure call in EraseReference notifies actor B.
  • Actor B subscribes after the entry has been erased: PublishObjectLocationSnapshot finds no entry for the object and sends PublishFailure to actor B directly.

Either way, actor B receives a clean error from ray.get instead of hanging for a long time waiting for a location that will never arrive.

Signed-off-by: yicheng <yicheng@anyscale.com>
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the clarify-ref-counting-fallback-comments branch from f68d4fa to 0fa9d85 Compare May 20, 2026 23:18
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates comments and log messages in src/ray/core_worker/reference_counter.cc to provide more context regarding object reference counting and subscriber handling, specifically referencing GitHub PR #63557. I have no feedback to provide as there were no review comments to evaluate.

@Yicheng-Lu-llll Yicheng-Lu-llll added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels May 20, 2026
@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review May 20, 2026 23:42
@Yicheng-Lu-llll Yicheng-Lu-llll requested a review from a team as a code owner May 20, 2026 23:42
@@ -1745,9 +1745,8 @@ void ReferenceCounter::PublishObjectLocationSnapshot(const ObjectID &object_id)
auto it = object_id_refs_.find(object_id);
if (it == object_id_refs_.end()) {
RAY_LOG(WARNING).WithField(object_id)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log so the user can trace the popped up error back to this log.

// NOTE(swang): We have to publish failure to subscribers in case they
// subscribe after the ref is already deleted.
// It is possible that when ref count reaches zero, there are still subscribers.
// See https://github.com/ray-project/ray/pull/63560 for details
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not adding a log here since this is a hot path.

Copy link
Copy Markdown
Contributor

@Sparks0219 Sparks0219 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great description! 🚢

@edoakes edoakes merged commit fc328ad into ray-project:master May 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants