Skip to content

[Core]Fix actor creation race condition of #59642#62994

Merged
MengjinYan merged 1 commit into
ray-project:masterfrom
YoyinZyc:gcs_actor_option2_clean
May 15, 2026
Merged

[Core]Fix actor creation race condition of #59642#62994
MengjinYan merged 1 commit into
ray-project:masterfrom
YoyinZyc:gcs_actor_option2_clean

Conversation

@YoyinZyc
Copy link
Copy Markdown
Contributor

@YoyinZyc YoyinZyc commented Apr 28, 2026

Description

Run and clear the actor creation callback before actor recreation to avoid race condition.

A race condition occurs when an actor is successfully created on a worker, but the worker dies before GCS completes the asynchronous write of the actor's ALIVE state to storage. While waiting for the storage write, GCS processes the worker death and clears the actor's address in memory. When the storage write finally completes, its callback reads the cleared (Nil) address and sends it to the client, causing a crash.

With the suggestion from @MengjinYan, in RestartActor, before clearing the actor's address in memory, we check if the actor was already successfully created (ALIVE) and has pending creation callbacks. If so, we invoke the callbacks early with the valid address still in memory and the stored borrowed_refs. This ensures the client receives a valid address and avoids the crash.

Related issues

Fixes #59642

Tested

  • added a new unit test
  • reproduce the issue and verify the fix

@YoyinZyc YoyinZyc requested a review from a team as a code owner April 28, 2026 18:03
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from 97750c5 to 2685458 Compare April 28, 2026 18:05
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces functionality to store and retrieve borrowed references for actors within the GCS. It also updates the RestartActor logic to trigger pending creation callbacks early using these saved references. A critical review comment points out that these callbacks should only be invoked with a success status if the actor had previously reached the ALIVE state; otherwise, it might incorrectly signal to the creator that the actor is ready while it is still restarting from a pending state.

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch 3 times, most recently from 250e89b to 6e59e6a Compare April 28, 2026 18:17
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Copy link
Copy Markdown
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @YoyinZyc, can you try to run the reproduction script shared in the original issue (#59642) with your changes?

@YoyinZyc
Copy link
Copy Markdown
Contributor Author

Thanks for the PR @YoyinZyc, can you try to run the reproduction script shared in the original issue (#59642) with your changes?

Yes I reproduced the issue like what suggested in the issue and verified the change works.

  • Injected the failure by delaying GCS storage write
  • kill the worker after actor recreation

Verified

  • No nil address crash
  • "Invoking creation callbacks early in RestartActor" in gcs log.

@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from 6e59e6a to b989d6f Compare April 28, 2026 19:15
@ray-gardener ray-gardener Bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Apr 28, 2026
@MengjinYan
Copy link
Copy Markdown
Contributor

Thanks for the PR, @YoyinZyc!

I have 2 general comments:

  • In the PR description, can you add a bit more context about the issue itself and the changes you made so that people can get more context of the change when they read the PR?
  • Wondering is it possible to add a test to verify the situation of the issue?

@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from b989d6f to b6142d1 Compare April 28, 2026 20:25
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from b6142d1 to d032034 Compare April 28, 2026 21:42
@YoyinZyc
Copy link
Copy Markdown
Contributor Author

@MengjinYan thanks for the suggestion.

Updated the pr description and added a new unit test.

@YoyinZyc
Copy link
Copy Markdown
Contributor Author

YoyinZyc commented May 4, 2026

Friendly ping on this issue:) @MengjinYan @andrewsykim

@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from bb43f59 to f844338 Compare May 7, 2026 19:07
Copy link
Copy Markdown
Contributor

@MengjinYan MengjinYan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!! All are nit comments.

Also, it seems that the lint is failing. Probably also need to fix that

Comment thread src/ray/gcs/actor/gcs_actor.h Outdated
Comment thread src/ray/gcs/actor/gcs_actor.h
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc Outdated
Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc
Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc
@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label May 9, 2026
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from f844338 to 1e3df78 Compare May 11, 2026 19:07
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 1e3df78. Configure here.

Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc Outdated
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch 4 times, most recently from ffb117d to ba3645f Compare May 11, 2026 23:00
…eation

Signed-off-by: Yuchen Zhou <yczhou@google.com>
@YoyinZyc YoyinZyc force-pushed the gcs_actor_option2_clean branch from ba3645f to d77b469 Compare May 12, 2026 18:09
Copy link
Copy Markdown
Contributor

@MengjinYan MengjinYan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@MengjinYan MengjinYan merged commit 18ac26f into ray-project:master May 15, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Actor Creation got error: Check failed: !WorkerID::FromBinary(worker_addr.worker_id()).IsNil()

3 participants