[Core]Fix actor creation race condition of #59642 #62994
Code Review
This pull request introduces functionality to store and retrieve borrowed references for actors within the GCS. It also updates the RestartActor logic to trigger pending creation callbacks early using these saved references. A critical review comment points out that these callbacks should only be invoked with a success status if the actor had previously reached the ALIVE state; otherwise, it might incorrectly signal to the creator that the actor is ready while it is still restarting from a pending state.
Yes, I reproduced the issue as suggested in the issue and verified that the change works.
Thanks for the PR, @YoyinZyc! I have 2 general comments:
@MengjinYan thanks for the suggestion. I updated the PR description and added a new unit test.
Friendly ping on this issue :) @MengjinYan @andrewsykim
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 1e3df78.
…eation Signed-off-by: Yuchen Zhou <yczhou@google.com>

Description
Run and clear the actor creation callbacks before actor recreation to avoid a race condition.
A race condition occurs when an actor is successfully created on a worker, but the worker dies before GCS completes the asynchronous write of the actor's ALIVE state to storage. While waiting for the storage write, GCS processes the worker death and clears the actor's address in memory. When the storage write finally completes, its callback reads the cleared (Nil) address and sends it to the client, causing a crash.
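The interleaving above can be replayed in a few lines. This is only a minimal sketch, not Ray's GCS code: `ActorRecord`, `ReplayRace`, the callback queue, and the `"worker-1"` address are hypothetical stand-ins, and an empty `std::optional` models the Nil address.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <queue>
#include <string>

// Hypothetical, simplified stand-in for a GCS actor record (not Ray's real type).
struct ActorRecord {
  std::optional<std::string> address;  // std::nullopt models a Nil address
};

// Replays the interleaving from the description and returns the address that
// the storage-write callback would hand to the client.
std::optional<std::string> ReplayRace() {
  ActorRecord actor;
  std::queue<std::function<void()>> pending_storage_callbacks;
  std::optional<std::string> address_sent_to_client;

  // 1. The actor is created on a worker. GCS records its address in memory
  //    and starts an asynchronous write of the ALIVE state.
  actor.address = "worker-1";
  pending_storage_callbacks.push([&] {
    // Runs only once the storage write completes, and reads whatever
    // address is in memory at that moment.
    address_sent_to_client = actor.address;
  });

  // 2. The worker dies before the write completes. GCS clears the address.
  actor.address.reset();

  // 3. The storage write completes and its callback finally runs.
  while (!pending_storage_callbacks.empty()) {
    pending_storage_callbacks.front()();
    pending_storage_callbacks.pop();
  }
  return address_sent_to_client;
}
```

In this model the callback observes the cleared address, which corresponds to the Nil address that reaches the client in the reported crash.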
Following the suggestion from @MengjinYan, RestartActor now checks, before clearing the actor's address in memory, whether the actor was already successfully created (ALIVE) and has pending creation callbacks. If so, it invokes the callbacks early with the valid address still in memory and the stored borrowed_refs. This ensures the client receives a valid address and avoids the crash.
Related issues
Fixes #59642
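The RestartActor change from the Description could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `ActorState`, `ActorRecord`, `CreationCallback`, and `RestartActorSketch` are hypothetical, simplified names, and plain strings stand in for the address and the borrowed references.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced set of actor states (the real state machine has more).
enum class ActorState { PENDING_CREATION, ALIVE, RESTARTING };

// Assumed callback shape: receives the actor address and the borrowed refs.
using CreationCallback = std::function<void(
    const std::string &address, const std::vector<std::string> &borrowed_refs)>;

// Hypothetical, simplified stand-in for a GCS actor record.
struct ActorRecord {
  ActorState state = ActorState::PENDING_CREATION;
  std::string address;                     // empty string models a Nil address
  std::vector<std::string> borrowed_refs;  // saved when creation succeeded
  std::vector<CreationCallback> pending_creation_callbacks;
};

void RestartActorSketch(ActorRecord &actor) {
  // Flush callbacks early only if the actor actually reached ALIVE.
  // Otherwise the creator would be told the actor is ready while it is
  // still restarting from a pending state.
  if (actor.state == ActorState::ALIVE &&
      !actor.pending_creation_callbacks.empty()) {
    // Take ownership of the list so each callback runs at most once.
    std::vector<CreationCallback> callbacks =
        std::move(actor.pending_creation_callbacks);
    actor.pending_creation_callbacks.clear();
    for (const auto &callback : callbacks) {
      callback(actor.address, actor.borrowed_refs);  // address is still valid
    }
  }
  // Only now is it safe to clear the address and move to RESTARTING.
  actor.address.clear();
  actor.state = ActorState::RESTARTING;
}
```

Draining the callback list before clearing the address means a late storage-write completion finds nothing left to invoke, so it cannot send a Nil address to the client.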
Tested