
[core] Actor reconstruction when a creation arg is in plasma #51653

Closed
dayshah wants to merge 3 commits into ray-project:master from dayshah:actor-reconstruction-plasma


Conversation

@dayshah
Contributor

@dayshah dayshah commented Mar 24, 2025

Problem

The current issue is that on success of the initial actor creation task, the lineage_ref_count for args in plasma is decremented, even if the actor could restart later. The objects therefore won't stick around and could get deleted by the time the actor needs to restart.

For context, the actor creation task is initially submitted by the owner, but all of the actor restart, restart scheduling, and actor death logic happens in the GCS actor manager. This necessitates a more complicated solution. An ideal solution would be to use the GCS actor manager only for detached actors and have normal actors managed by an actor manager on the owner worker.

Solution:

  1. Incrementing lineage_ref_count
    • Tell the task manager that the actor creation task has retries if task_spec.MaxActorRestarts() != 0 && !task_spec.IsDetachedActor(). Because of this, the task manager will not decrement the lineage_ref_count when the actor creation task finishes for the first time.
  2. Decrementing lineage_ref_count
    • Send a GCS → owner OwnedActorDead RPC on actor death (no restarts left + not detached) to decrement the lineage_ref_count for all the objects it was incremented for in step 1.

Idempotency of HandleOwnedActorDead: keep a new set of owned actor IDs in the core worker. Add to it when an actor creation task is first submitted, remove from it on the first execution of HandleOwnedActorDead, and never execute the handler if the actor ID is not in the set.

Note: This feature will still not work for detached actors because there is inherently no owner.
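
As a rough illustration of the scheme above, here is a minimal, standalone C++ sketch of the idempotent owner-side handling. The names (OwnedActorTracker, MarkActorSubmitted, the decrement callback) are hypothetical and this is not the actual Ray core worker code; it only demonstrates the set-based at-most-once behavior described in the last paragraph.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>

// Hypothetical sketch: track actors owned by this worker so that the
// "owned actor dead" notification takes effect at most once per actor.
class OwnedActorTracker {
 public:
  // Called when the actor creation task is first submitted and was marked
  // retryable (max_actor_restarts != 0 and the actor is not detached).
  void MarkActorSubmitted(const std::string &actor_id) {
    owned_actor_ids_.insert(actor_id);
  }

  // Called by the (hypothetical) OwnedActorDead RPC handler. Only the first
  // call for a given actor runs the lineage decrement; later calls are
  // no-ops, which makes the handler safe to retry.
  void HandleOwnedActorDead(const std::string &actor_id,
                            const std::function<void()> &decrement_lineage_refs) {
    if (owned_actor_ids_.erase(actor_id) == 0) {
      return;  // Unknown or already-handled actor: idempotent no-op.
    }
    decrement_lineage_refs();
  }

 private:
  std::unordered_set<std::string> owned_actor_ids_;
};

int main() {
  OwnedActorTracker tracker;
  tracker.MarkActorSubmitted("actor-1");

  int decrements = 0;
  auto decrement = [&decrements]() { ++decrements; };

  // The GCS may resend the notification; only the first delivery counts.
  tracker.HandleOwnedActorDead("actor-1", decrement);
  tracker.HandleOwnedActorDead("actor-1", decrement);
  std::cout << "lineage decrements: " << decrements << std::endl;  // prints 1
  return 0;
}
```

In the real change, the decrement callback would correspond to releasing the lineage refs that the task manager kept pinned in step 1.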

@dayshah dayshah force-pushed the actor-reconstruction-plasma branch 3 times, most recently from c48e6c1 to bb6f5c7 Compare March 25, 2025 16:55
@stale

stale bot commented Apr 26, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale label Apr 26, 2025
@dayshah dayshah removed the stale label Apr 28, 2025
@github-actions

github-actions bot commented Jun 7, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jun 7, 2025
@dayshah dayshah removed the stale label Jun 7, 2025
dayshah added 2 commits June 7, 2025 18:46
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah force-pushed the actor-reconstruction-plasma branch from bb6f5c7 to ba746ef Compare June 8, 2025 01:46
@dayshah dayshah changed the title [core] Fix actor reconstruction that depends on plasma object [core] Actor reconstruction when the creation arg is in plasma Jun 8, 2025
@dayshah dayshah changed the title [core] Actor reconstruction when the creation arg is in plasma [core] Actor reconstruction when a creation arg is in plasma Jun 8, 2025
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah marked this pull request as ready for review June 9, 2025 01:07
Copilot AI review requested due to automatic review settings June 9, 2025 01:07
@dayshah dayshah requested a review from a team as a code owner June 9, 2025 01:07
@dayshah
Contributor Author

dayshah commented Jun 9, 2025

@jjyao can you take an initial look at this solution to see if it makes sense

Contributor

Copilot AI left a comment

Pull Request Overview

This PR ensures that when an actor creation argument lives in plasma, its lineage is kept alive for potential restarts and then properly released once the actor dies.

  • Configure TaskManager to increment lineage refs for retryable, non-detached actor creation tasks.
  • Send a new OwnedActorDead RPC on actor death to decrement those lineage refs.
  • Implement ReferenceCounter::DecrementLineageRefCount, hook it into CoreWorker, and add an end-to-end test.
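
For orientation, here is a tiny standalone C++ sketch of the decision the GCS side would make on actor death according to this overview: the owner is notified only for non-detached actors that have no restarts left. The struct and function names are hypothetical, not the actual gcs_actor_manager.cc code.

```cpp
#include <iostream>

// Hypothetical sketch of the notification decision described above.
struct ActorDeathInfo {
  int remaining_restarts;  // restarts left for this actor after it died
  bool is_detached;        // detached actors have no owning worker
};

bool ShouldNotifyOwnerActorDead(const ActorDeathInfo &info) {
  // Detached actors have no owner to notify; actors with restarts left are
  // not permanently dead yet, so their creation-arg lineage must stay pinned.
  return info.remaining_restarts == 0 && !info.is_detached;
}

int main() {
  std::cout << ShouldNotifyOwnerActorDead({0, false}) << "\n";  // 1: notify owner
  std::cout << ShouldNotifyOwnerActorDead({2, false}) << "\n";  // 0: may still restart
  std::cout << ShouldNotifyOwnerActorDead({0, true}) << "\n";   // 0: no owner exists
  return 0;
}
```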

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Summary of changes per file:

  • src/ray/rpc/worker/core_worker_server.h: Register new OwnedActorDead RPC handler
  • src/ray/rpc/worker/core_worker_client.h: Add OwnedActorDead stub and retryable client method
  • src/ray/protobuf/core_worker.proto: Define OwnedActorDeadRequest/Reply and add to CoreWorkerService
  • src/ray/gcs/gcs_server/gcs_server.cc: Provide GCS with a placeholder callback when constructing CoreWorkerClient
  • src/ray/gcs/gcs_server/gcs_actor_manager.cc: Send OwnedActorDead RPC on non-detached actor death
  • src/ray/core_worker/task_manager.cc: Skip lineage release for actor creation tasks retried by GCS
  • src/ray/core_worker/reference_count.h: Declare DecrementLineageRefCount
  • src/ray/core_worker/reference_count.cc: Implement DecrementLineageRefCount
  • src/ray/core_worker/core_worker.h: Track owned_actor_ids_ and declare handler
  • src/ray/core_worker/core_worker.cc: Handle OwnedActorDead, update CreateActor for retry logic
  • src/mock/ray/rpc/worker/core_worker_client.h: Mock OwnedActorDead client method
  • python/ray/tests/test_actor_lineage_reconstruction.py: New test for plasma-based actor reconstruction
Comments suppressed due to low confidence (2)

src/ray/gcs/gcs_server/gcs_server.cc:483

  • The empty-capture lambda passed as the CoreWorkerClient callback likely has the wrong signature. It should match the expected ClientCallback<...> signature (e.g. (const Status&, const rpc::OwnedActorDeadReply&)) to compile and behave correctly.
address, client_call_manager_, []() {
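
As a generic illustration of this point (not the actual Ray ClientCallback or CoreWorkerClient types), the following self-contained sketch shows why a zero-argument lambda cannot stand in for a callback that takes the status and the reply; the type names are hypothetical stand-ins.

```cpp
#include <functional>
#include <iostream>

// Hypothetical stand-ins for the RPC status and reply types.
struct Status { bool ok = true; };
struct OwnedActorDeadReply {};

// A client callback in this style receives the RPC status and the reply.
using OwnedActorDeadCallback =
    std::function<void(const Status &, const OwnedActorDeadReply &)>;

void SendOwnedActorDead(const OwnedActorDeadCallback &callback) {
  // Pretend the RPC completed successfully and invoke the callback.
  callback(Status{}, OwnedActorDeadReply{});
}

int main() {
  // A zero-argument lambda such as []() {} would not convert to
  // OwnedActorDeadCallback; the lambda has to accept both parameters.
  SendOwnedActorDead([](const Status &status, const OwnedActorDeadReply &) {
    std::cout << "OwnedActorDead reply received, ok=" << status.ok << std::endl;
  });
  return 0;
}
```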

src/ray/core_worker/task_manager.cc:917

  • After erasing it with submissible_tasks_.erase(it), the function should return early to avoid dereferencing an invalid iterator in subsequent code.
} else if (it->second.num_retries_left != 0 && spec.IsActorCreationTask()) {
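
The following is a generic, self-contained illustration of the erase-then-return-early pattern this comment asks for, with hypothetical names rather than the actual task_manager.cc code: after unordered_map::erase(it), the iterator is invalid and must not be touched again.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

struct TaskEntry {
  int num_retries_left = 0;
};

std::unordered_map<std::string, TaskEntry> submissible_tasks;

// After submissible_tasks.erase(it), `it` is invalidated, so the function
// must not use it again; returning immediately is the simplest safe pattern.
void CompleteTask(const std::string &task_id) {
  auto it = submissible_tasks.find(task_id);
  if (it == submissible_tasks.end()) {
    return;
  }
  if (it->second.num_retries_left == 0) {
    submissible_tasks.erase(it);
    return;  // Early return: `it` is no longer valid past this point.
  }
  // Safe to keep using `it` here because the entry was not erased.
  --it->second.num_retries_left;
}

int main() {
  submissible_tasks["t1"] = TaskEntry{0};
  submissible_tasks["t2"] = TaskEntry{2};
  CompleteTask("t1");  // erased, handler returns early
  CompleteTask("t2");  // kept, retries decremented
  std::cout << "t2 retries left: " << submissible_tasks["t2"].num_retries_left
            << ", tasks tracked: " << submissible_tasks.size() << std::endl;
  return 0;
}
```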

@edoakes
Collaborator

edoakes commented Jun 9, 2025

Note: This feature will still not work for detached actors because there is inherently no owner.

One solution here would be to ban spawning detached actors that have args in plasma (always inline all args into the task spec).

We can't leave it in a completely broken state; at a minimum, we should loudly warn the user if this happens.

Collaborator

@edoakes edoakes left a comment

The implementation feels very brittle to me given that we are bolting on some special cases in multiple places. Specifically, passing hard-coded max_retries=1 is misusing the TaskManager interface.

Let's explore if there is a way we can make this behavior more explicit, either by:

  • Adding a separate argument to TaskManager.
  • Not relying on the TaskManager to increment the lineage ref count for us in this case, and instead doing it fully manually.

wait_for_condition(lambda: verify3())


def test_actor_reconstruction_relies_on_plasma_object(ray_start_cluster):
Collaborator

Needs comments. At a minimum, a header docstring comment that describes the high-level goal of the test.

Comment on lines +2747 to +2749
// For named actor, we still go through the sync way because for
// functions like list actors these actors need to be there, especially
// for local driver. But the current code all go through the gcs right now.
Collaborator

can you clean up the wording of this comment while you're touching it

Collaborator

This branch in behavior is really quite problematic. Surely there's a better way to handle this consistency requirement.

return Status::OK();
}

task_manager_->AddPendingTask(
Collaborator

The header comment for AddPendingTask needs to be updated:

/// The local ref count for all return refs (excluding actor creation tasks)

rpc_address_,
task_spec,
CurrentCallSite(),
// Actor creation task retry happens through the gcs, so the task manager only
Collaborator

this should also describe why we have the !task_spec.IsDetachedActor() condition

Comment on lines +919 to +920
// spec here and also don't need to count this against
// total_lineage_footprint_bytes_. GCS will directly release lineage for the
Collaborator

hm why doesn't it count against total_lineage_footprint_bytes_?

mutable utils::container::ThreadSafeSharedLruCache<std::string, rpc::RuntimeEnvInfo>
runtime_env_json_serialization_cache_;

// Reconstructable actors owned by this worker.
Collaborator

Suggested change:
- // Reconstructable actors owned by this worker.
+ // Restartable actors owned by this worker.

also indicate the lifecycle of entries in the map

@dayshah
Contributor Author

dayshah commented Jun 9, 2025

The implementation feels very brittle to me given that we are bolting on some special cases in multiple places. Specifically, passing hard-coded max_retries=1 is misusing the TaskManager interface.

Let's explore if there is a way we can make this behavior more explicit, either by:

  • Adding a separate argument to TaskManager.
  • Not relying on the TaskManager to increment the lineage ref count for us in this case, and instead doing it fully manually.

Ya, agreed. I updated the initial description; the issue isn't the increment, it's that TaskManager's CompletePendingTask will decrement the lineage ref count based on the num_retries_left in the submissible_tasks entry, so we still have to go through TaskManager. I can make it a little more explicit, though, by adding a parameter and a member in TaskEntry like restartable_actor_creation_task.
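
A minimal sketch of the more explicit variant proposed here, assuming a hypothetical restartable_actor_creation_task member on the task entry; the real TaskManager, TaskEntry, and CompletePendingTask differ in many details, so this only illustrates the idea of an explicit flag instead of a hard-coded max_retries=1.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical, simplified task entry: lineage for an actor creation task is
// pinned via an explicit flag rather than by faking a retry count.
struct TaskEntry {
  int num_retries_left = 0;
  bool restartable_actor_creation_task = false;
};

class TaskManagerSketch {
 public:
  void AddPendingTask(const std::string &task_id, TaskEntry entry) {
    submissible_tasks_[task_id] = entry;
  }

  // On completion, lineage is released only when neither normal retries nor a
  // possible GCS-driven actor restart can still need the task's args.
  void CompletePendingTask(const std::string &task_id) {
    auto it = submissible_tasks_.find(task_id);
    if (it == submissible_tasks_.end()) {
      return;
    }
    const TaskEntry &entry = it->second;
    if (entry.num_retries_left == 0 && !entry.restartable_actor_creation_task) {
      std::cout << task_id << ": release lineage refs" << std::endl;
    } else {
      std::cout << task_id << ": keep lineage refs for a potential restart"
                << std::endl;
    }
    submissible_tasks_.erase(it);
  }

 private:
  std::unordered_map<std::string, TaskEntry> submissible_tasks_;
};

int main() {
  TaskManagerSketch manager;
  manager.AddPendingTask("normal_task", TaskEntry{0, false});
  manager.AddPendingTask("actor_creation_task", TaskEntry{0, true});
  manager.CompletePendingTask("normal_task");          // releases lineage
  manager.CompletePendingTask("actor_creation_task");  // keeps lineage
  return 0;
}
```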

@dayshah
Contributor Author

dayshah commented Jun 10, 2025

@edoakes

I talked to @jjyao about this, and basically there's still an issue here if the argument of the actor creation task is the output of a retryable task. The arg could get evicted even if lineage_ref_count > 0, so on restart we'd also have to tell the owner that the arg is needed and that it should lineage-reconstruct it. This becomes a mess because of the owner worker / GCS actor manager split brain. There are also various race conditions that can arise from the split brain, and the code to handle them could get messy (like the RegisterActor logic, and the timing of when the owner should act relative to when the GCS KV storage persists things).

The solution that simplifies everything is to have an actor manager on the owner worker for the actors it owns, and have the GCS actor manager only worry about detached actors. We can explicitly document the restart limitations for detached actors, while non-detached actors get all the lineage reconstruction benefits that come with objects and can follow the same codepath.

@jjyao
Collaborator

jjyao commented Jun 10, 2025

If it's not blocking anything, I highly suggest we design and do it properly. Adding more RPCs between the core worker and the GCS due to the split brain will just make things more and more complicated.

@dayshah
Contributor Author

dayshah commented Jun 10, 2025

We made the decision to do this by killing the split brain.

@dayshah dayshah closed this Jun 10, 2025
edoakes added a commit that referenced this pull request Jun 11, 2025
…sma (#53713)

Currently actor restarts won't work because lineage ref counting doesn't work
for actors with restarts. See description here:
#51653 (comment).

Minimal repro
```
import numpy as np

import ray

# `ray_start_cluster` is the pytest fixture that provides a local multi-node cluster.
cluster = ray_start_cluster
cluster.add_node(num_cpus=0)  # head
ray.init(address=cluster.address)
worker1 = cluster.add_node(num_cpus=1)

@ray.remote(num_cpus=1, max_restarts=1)
class Actor:
    def __init__(self, config):
        self.config = config

    def ping(self):
        return self.config

# Arg is >100kb so will go in the object store
actor = Actor.remote(np.zeros(100 * 1024 * 1024, dtype=np.uint8))
ray.get(actor.ping.remote())

worker2 = cluster.add_node(num_cpus=1)
cluster.remove_node(worker1, allow_graceful=True)

# This line will break
ray.get(actor.ping.remote())
```

---------

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
…sma (#53713)
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
…sma (#53713)