
feat: handle state store sync in local barrier manager #14377

Merged: 16 commits into main on Jan 16, 2024

Conversation

wenym1 (Contributor) commented Jan 5, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously, we could not call async methods in the local barrier manager, so state store sync had to be called in the RPC handler. Now that the local barrier manager has become an event-loop worker, we can call async methods and poll the futures they return. In this PR, we move the call to state store sync into the local barrier manager worker loop to simplify the logic outside the local barrier manager.

The managed barrier state has two more enum variants, AllCollected and Completed. When all actors have collected a barrier, the barrier state transitions to AllCollected, and a future that calls sync is created. The future is polled when calling the next_completed_epoch method.
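
For illustration, here is a minimal sketch of that state machine. All names and shapes here (ManagedBarrierStateInner, epoch_states, the sync future type) are assumptions for the sketch, not the PR's literal code:

```rust
use std::collections::BTreeMap;
use std::future::{poll_fn, Future};
use std::task::Poll;

use futures::future::BoxFuture;

type SyncResult = (); // placeholder for the real state-store sync result

// Illustrative stand-in for the per-barrier state described above.
enum ManagedBarrierStateInner {
    // Issued to actors; waiting for the remaining ones to collect it.
    Issued { remaining_actors: usize },
    // All actors collected the barrier; a state-store `sync` future was created.
    AllCollected(BoxFuture<'static, SyncResult>),
    // The `sync` future has resolved.
    Completed(SyncResult),
}

struct ManagedBarrierState {
    // Keyed by `epoch.prev` (discussed later in this thread).
    epoch_states: BTreeMap<u64, ManagedBarrierStateInner>,
}

impl ManagedBarrierState {
    /// Poll the pending `AllCollected` sync futures; resolve with an epoch
    /// once its state-store sync finishes. If nothing is `AllCollected`,
    /// this stays pending until the worker's event loop re-creates and
    /// re-polls it on the next iteration.
    async fn next_completed_epoch(&mut self) -> u64 {
        poll_fn(|cx| {
            for (epoch, state) in self.epoch_states.iter_mut() {
                if let ManagedBarrierStateInner::AllCollected(sync_fut) = state {
                    if let Poll::Ready(result) = sync_fut.as_mut().poll(cx) {
                        *state = ManagedBarrierStateInner::Completed(result);
                        return Poll::Ready(*epoch);
                    }
                }
            }
            Poll::Pending
        })
        .await
    }
}
```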

Some other refactors are done accordingly.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features. See Sqlsmith: SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@BugenZhao (Member) commented:

Not reviewed in detail yet...

we move the call to state store sync into the local barrier manager worker loop to simplify the logic outside the local barrier manager

Does this decrease the overall complexity? Previously, the local barrier manager was only responsible for collecting and sending barriers, while RPC handlers called sync for their own epochs. Now it seems mixed up, and there are complicated concurrent control flows. 😕

Some other refactors are done accordingly.

Is it possible to split the changes above and other minor refactors?

wenym1 (Contributor, Author) commented Jan 5, 2024

Does this decrease the overall complexity? Previously, the local barrier manager was only responsible for collecting and sending barriers, while RPC handlers called sync for their own epochs. Now it seems mixed up, and there are complicated concurrent control flows. 😕

I plan to replace the current two RPCs, inject_barrier and collect_barrier, with a long-lived bidirectional streaming RPC between the CN and meta. This PR is therefore more of a refactor that moves the work of inject_barrier and collect_barrier into the local barrier manager. Previously we had to call sync in the RPC handlers because the local barrier manager was not in an async context.

Previously, the RPC handler was notified by the barrier manager when a barrier was collected, and then called sync itself. In this PR, that notify-then-sync flow is replaced by creating a future that calls sync and notifying on the future's completion. Hope this explanation helps with reviewing this PR.
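
As a rough sketch of that before/after difference (state_store_sync, the channel shape, and on_all_collected are illustrative assumptions, not the PR's actual items):

```rust
use futures::future::BoxFuture;
use futures::FutureExt;
use tokio::sync::oneshot;

type SyncResult = (); // placeholder for the real sync result

// Hypothetical stand-in for syncing the state store up to `prev_epoch`.
async fn state_store_sync(_prev_epoch: u64) -> SyncResult {}

// Before (roughly): notify "collected", and let the RPC handler call sync.
//     collected_tx.send(prev_epoch);          // barrier manager
//     let result = sync(prev_epoch).await;    // RPC handler

// After (roughly): the barrier manager creates the sync future itself and
// notifies only once the sync has completed.
fn on_all_collected(
    prev_epoch: u64,
    completion_tx: oneshot::Sender<(u64, SyncResult)>,
) -> BoxFuture<'static, ()> {
    async move {
        let result = state_store_sync(prev_epoch).await;
        // The receiver may have gone away (e.g. during recovery); ignore the error.
        let _ = completion_tx.send((prev_epoch, result));
    }
    .boxed()
}
```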

Is it possible to split the changes above and other minor refactors?

The biggest refactor in this PR changes the key of barrier_state_map from epoch.curr to epoch.prev. This is necessary for this PR because we need to find a specific barrier by its epoch.prev.
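
A tiny sketch of why the epoch.prev key helps (BarrierState and on_collect_report are illustrative placeholders):

```rust
use std::collections::BTreeMap;

struct BarrierState; // placeholder for the real per-barrier state

// Collection reports from actors carry the *previous* epoch of the barrier,
// so keying the map by `epoch.prev` makes the lookup direct instead of
// requiring a scan over entries keyed by `epoch.curr`.
fn on_collect_report(
    barrier_state_map: &mut BTreeMap<u64, BarrierState>,
    prev_epoch: u64,
) {
    let _state = barrier_state_map
        .get_mut(&prev_epoch)
        .expect("should exist");
    // ... mark this actor as collected, maybe transition to AllCollected ...
}
```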

@tabVersion tabVersion self-requested a review January 5, 2024 10:00
wenym1 (Contributor, Author) commented Jan 10, 2024

Is it possible to split the changes above and other minor refactors?

Some of the refactoring has been split into #14436 and merged. Now we can focus on this PR's logic for handling state store sync.

kwannoel (Contributor) commented:

Why do we introduce this PR? Is it a prerequisite of some feature?

wenym1 (Contributor, Author) commented Jan 11, 2024

Why do we introduce this PR? Is it a prerequisite of some feature?

We are going to deprecate the send_barrier and complete_barrier RPCs and replace them with a bidirectional gRPC stream. This will be useful for the partial checkpoint implementation. It also helps the current code better handle stale requests during recovery.

wenym1 (Contributor, Author) commented Jan 12, 2024

Any comments? @BugenZhao @yezizp2012 @kwannoel @tabVersion @hzxa21

tabVersion (Contributor) left a comment:

Took a rough look; basically LGTM. Waiting for @yezizp2012's review.

src/compute/src/rpc/service/stream_service.rs (review thread resolved)
Comment on lines 121 to 122
pin!(self.state.next_completed_epoch()),
pin!(event_rx.recv()),
Contributor:

Is there a preferred branch?

wenym1 (Contributor, Author):

Either branch is fine. The handler for each case is synchronous and non-blocking, so each case handler is expected to finish in a very short time.

@@ -46,12 +48,11 @@ pub const ENABLE_BARRIER_AGGREGATION: bool = false;

/// Collect result of some barrier on current compute node. Will be reported to the meta service.
#[derive(Debug)]
pub struct CollectResult {
pub struct BarrierCompleteResult {
pub sync_result: Option<SyncResult>,
Contributor:

Let's add a docstring for this field.

wenym1 (Contributor, Author):

Added.

warn!(err=?e.as_ref().map(|_|()), "fail to send collect epoch result");
});
loop {
let item = drop_either_future(
Contributor:

Can you elaborate on the use of drop_either_future here?

Why do we just drop the lhs/rhs future?

Collaborator:

Why don't we use tokio::select! instead?

wenym1 (Contributor, Author):

I've changed it to tokio::select!. The original use of drop_either_future was because the future holds a mutable reference to self, and if it is not dropped we cannot modify the state in self.
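
For context, a minimal sketch of the resulting worker loop. Everything here (LocalBarrierWorker, WorkerState, LocalBarrierEvent) is a stand-in for the PR's actual items, not its literal code:

```rust
use tokio::sync::mpsc::UnboundedReceiver;

struct WorkerState;
impl WorkerState {
    async fn next_completed_epoch(&mut self) -> u64 {
        // ... poll the pending sync futures, as in the earlier sketch ...
        std::future::pending().await
    }
}

enum LocalBarrierEvent {} // event variants elided

struct LocalBarrierWorker {
    state: WorkerState,
}

impl LocalBarrierWorker {
    fn on_epoch_completed(&mut self, _epoch: u64) { /* report to meta */ }
    fn handle_event(&mut self, _event: LocalBarrierEvent) { /* ... */ }

    async fn run(mut self, mut event_rx: UnboundedReceiver<LocalBarrierEvent>) {
        loop {
            tokio::select! {
                // `select!` drops the other branches' futures before running a
                // handler, so each handler may mutably borrow `self` again,
                // which is what `drop_either_future` was emulating by hand.
                epoch = self.state.next_completed_epoch() => {
                    self.on_epoch_completed(epoch);
                }
                Some(event) = event_rx.recv() => {
                    self.handle_event(event);
                }
            }
        }
    }
}
```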

}
}

/// Notify if we have collected barriers from all actor ids. The state must be `Issued`.
fn may_notify(&mut self, prev_epoch: u64) {
fn may_have_collected_all(&mut self, prev_epoch: u64) {
Contributor:

Can we update the docstring and naming of this function? may_have_collected_all seems kind of vague; I didn't really get its purpose.

From the code, it seems mainly to call sync_epoch and update the mview progress.

wenym1 (Contributor, Author):

I've added a docstring on the method. The may_xxx naming is kept consistent with the original method so that the new logic is easier to compare with the original during code review.

yezizp2012 (Contributor) left a comment:

LGTM. BTW, do we have any detailed docs about the implementation of partial checkpoint?

.get_mut(&prev_epoch)
.expect("should exist");
// sanity check on barrier state
match &state.inner {
Contributor:

Nit: we can use assert_matches! instead.
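
For reference, a sketch of the suggested change using the assert_matches crate (variant names follow the earlier sketch; payloads elided):

```rust
use assert_matches::assert_matches;

enum ManagedBarrierStateInner {
    Issued { remaining_actors: usize },
    AllCollected,
    Completed,
}

fn sanity_check(inner: &ManagedBarrierStateInner) {
    // Replaces a `match` whose only purpose is to panic on unexpected variants.
    assert_matches!(inner, ManagedBarrierStateInner::Issued { .. });
}
```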

wenym1 (Contributor, Author) commented Jan 16, 2024

LGTM. BTW, do we have any detailed docs about the implementation of partial checkpoint?

The detailed design of barrier collection in partial checkpoint can be found in partial-checkpoint.md, which is part of risingwavelabs/rfcs#84.

@wenym1 wenym1 enabled auto-merge January 16, 2024 10:19
@wenym1 wenym1 added this pull request to the merge queue Jan 16, 2024
Merged via the queue into main with commit 50fd512 Jan 16, 2024
26 of 27 checks passed
@wenym1 wenym1 deleted the yiming/local-barrier-manager-handle-sync branch January 16, 2024 11:07