-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: handle state store sync in local barrier manager #14377
Conversation
Not reviewed detailedly...
Does this decrease the overall complexity? Previously local barrier manager is only responsible for collecting and sending barriers, while RPC handlers call
Is it possible to split the changes above and other minor refactors? |
I plan to replace the current two rpc Previously, the rpc handler gets notified from barrier manager on barrier collected, and then call sync. In this PR, the process of notifying barrier collected and calling sync is replaced with creating a future that calls sync and notifying on future completion. Hopes this explanation can help review this PR.
The greatest refactor in this PR is to change the key of |
…g/local-barrier-manager-handle-sync
…g/local-barrier-manager-handle-sync
Some refactor logic is split to #14436 and gets merged. Now we can focus the logic of handling state store sync of this PR. |
Why do we introduce this PR, is it a prerequisite of some feature? |
We are going to deprecate the |
Any comments? @BugenZhao @yezizp2012 @kwannoel @tabVersion @hzxa21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a rough look, basically LGTM. Wait for @yezizp2012 review
pin!(self.state.next_completed_epoch()), | ||
pin!(event_rx.recv()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a preferred branch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either branch is fine. The handler for each case is not async and blocking, so each case handler is expected to finish a very short time.
@@ -46,12 +48,11 @@ pub const ENABLE_BARRIER_AGGREGATION: bool = false; | |||
|
|||
/// Collect result of some barrier on current compute node. Will be reported to the meta service. | |||
#[derive(Debug)] | |||
pub struct CollectResult { | |||
pub struct BarrierCompleteResult { | |||
pub sync_result: Option<SyncResult>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a docstring for this field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
warn!(err=?e.as_ref().map(|_|()), "fail to send collect epoch result"); | ||
}); | ||
loop { | ||
let item = drop_either_future( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more on the use of drop_either_future
here?
Why we just drop lhs / rhs future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we use tokio::select!
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed to tokio::select!
. Original use of drop_either_future
is because the future will hold a mutable reference to self, and if not dropped we cannot modify the states in self.
} | ||
} | ||
|
||
/// Notify if we have collected barriers from all actor ids. The state must be `Issued`. | ||
fn may_notify(&mut self, prev_epoch: u64) { | ||
fn may_have_collected_all(&mut self, prev_epoch: u64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the docstring and naming of this function? may_have_collected_all
seems kind of vague, I didn't really get the use of it.
From the code it seems to call sync_epoch
and update the mview progress mainly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added docstring on the method. The naming may_xxx
is to be consistent with the original method to help better compare with the original logic for code review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, btw do we have any detailed docs about the implementation of partial checkpoint?
.get_mut(&prev_epoch) | ||
.expect("should exist"); | ||
// sanity check on barrier state | ||
match &state.inner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits: can use assert_matches
instead.
Detailed design about barrier collection in partial checkpoint can be found in partial-checkpoint.md, which is in risingwavelabs/rfcs#84. |
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Previously, we cannot call async method in local barrier manager, and therefore the
sync
of state store should be called in the rpc handler. After we change the local barrier manager to a event loop worker, we can now call and poll the future returned from async method. In this PR, we change to call sync of state store inside the local barrier manager worker loop to simplify the logic outside the local barrier manager.The managed barrier state has two more enum variant,
AllCollected
andCompleted
. When all actors have collected a barrier, the barrier state will transform toAllCollected
, and a future that callsync
will be created. The future will be polled when calling thenext_completed_epoch
method.Some other refactors are done accordingly.
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.