feat: support state recovery when meta reboot #1702
Codecov Report
@@            Coverage Diff             @@
##             main    #1702      +/-   ##
==========================================
+ Coverage   71.15%   71.34%   +0.19%
==========================================
  Files         598      599       +1
  Lines       77556    77645      +89
==========================================
+ Hits        55182    55399     +217
+ Misses      22374    22246     -128
Rest LGTM. Great job!
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Yingjun Wu <yingjunwu@singularity-data.com>
LGTM!
What's changed and what's your intention?
As the title says, this PR makes several changes to support state recovery when meta reboots:
force_stop_actors. When a failover is detected on some compute nodes, the other compute nodes that contain actors in the related DAGs panic and exit. Furthermore, the original implementation of force_stop_actors causes a panic in the compute node, which is not acceptable when reusing it for a meta reboot. Here we instead inject a stop barrier for all existing actors on the living compute nodes, which stops every existing actor.
Checklist
Refer to a related PR or issue link (optional)
Resolve #1277
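For readers unfamiliar with the stop-barrier approach mentioned in the description, here is a minimal sketch in Rust. It is not the code from this PR: ComputeNode, StopBarrier, and build_stop_barriers are hypothetical names introduced only for illustration, and the sketch assumes meta already knows which actors exist on each compute node. It only shows the selection logic (skip dead nodes, cover every actor on living ones), not how barriers are actually sent.

```rust
use std::collections::HashMap;

/// Hypothetical view of a compute node as seen by meta (an assumption,
/// not the project's real type).
pub struct ComputeNode {
    pub id: u32,
    pub alive: bool,
    pub actors: Vec<u32>,
}

/// Hypothetical stop barrier: tells a node which actors to stop at `epoch`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StopBarrier {
    pub epoch: u64,
    pub actors_to_stop: Vec<u32>,
}

/// Build one stop barrier per living compute node, covering every actor
/// that still exists on that node. Dead nodes and nodes without actors
/// get no barrier.
pub fn build_stop_barriers(nodes: &[ComputeNode], epoch: u64) -> HashMap<u32, StopBarrier> {
    nodes
        .iter()
        .filter(|node| node.alive && !node.actors.is_empty())
        .map(|node| {
            let barrier = StopBarrier {
                epoch,
                actors_to_stop: node.actors.clone(),
            };
            (node.id, barrier)
        })
        .collect()
}
```

In this sketch the actors stop by handling a barrier rather than by being killed, so the compute node process itself is never asked to panic. That is the property the description relies on when reusing the stop path for a meta reboot.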