raftstore-v2: introduce apply trace #13939

BusyJay · 2022-12-14T09:37:42Z

What is changed and how it works?

Issue Number: Ref #12842

What's Changed:

raftstore v2 disables WAL for all tablets and stores all states in raft
engine. To be able to recover after a restart, we need to build some
relations between raft engine and tablet flushes. In the previous PR,
flush indexes are stored in raft engine by `PersistenceListener`.

In this PR, `ApplyTrace` is introduced to analyze the apply index after
restart. It will also trigger persistence for more apply progress, such
as split.

Check List

Tests

Unit test
Integration test

Release note

None

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

ti-chi-bot · 2022-12-14T09:37:44Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

tabokie
tonyxuqqi

tonyxuqqi · 2022-12-15T05:47:34Z

components/engine_rocks/src/misc.rs

+            for cf in self.cf_names() {
+                handles.push(util::get_cf_handle(self.as_inner(), cf)?);
+            }
+        }


minor: maybe it's cleaner to use a temporary `cfs` to cover both cases,
something like:
let cfs = cfs.is_empty()? &self.cf_names() : cfs;
for cf in cfs .....

`cf_names` will allocate.

> cf_names will allocate.
If `!cfs.is_empty()`, `self.cf_names()` won't be called? It's a minor issue anyway.

`cf_names` returns a `Vec`, so you can't put it inside the `if` branch. Instead, it needs to allocate first, and then choose what to borrow.
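
The borrowing constraint discussed in this thread can be illustrated with a small standalone sketch. It is not the code from `misc.rs`: `all_cf_names` and `flush_targets` are hypothetical names, and the CF list is assumed. The point it shows is that the `Vec` must be owned by a binding that outlives the branch; with Rust's deferred initialization that binding can be declared before the `if` and filled only when the caller passed no CFs.

```rust
/// Hypothetical stand-in for `cf_names()`: returns an owned, freshly allocated Vec.
fn all_cf_names() -> Vec<&'static str> {
    vec!["default", "lock", "write"]
}

/// Returns the CFs that would be flushed: `cfs` itself, or every CF when `cfs` is empty.
fn flush_targets(cfs: &[&str]) -> Vec<String> {
    // Declared before the branch so the Vec outlives the borrow taken inside it.
    // It is only initialized (and only allocates) when `cfs` is empty.
    let owned: Vec<&'static str>;
    let chosen: &[&str] = if cfs.is_empty() {
        owned = all_cf_names();
        &owned
    } else {
        cfs
    };
    chosen.iter().map(|cf| cf.to_string()).collect()
}
```

Calling `flush_targets(&[])` walks every CF, while `flush_targets(&["write"])` borrows the caller's slice and never calls `all_cf_names`.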

components/engine_test/src/lib.rs

components/engine_traits/src/flush.rs

components/raftstore-v2/src/router/message.rs

src/server/engine_factory.rs

components/raftstore-v2/src/operation/ready/apply_trace.rs

components/raftstore-v2/src/raft/peer.rs

components/raftstore-v2/src/operation/ready/apply_trace.rs

tabokie · 2022-12-15T07:47:21Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+    data_cfs: Box<[Progress; DATA_CFS_LEN]>,
+    admin: Progress,
+    // Index that is issued to be written. It may not be truely persisted.
+    persisted_applied: u64,


Where is this used?

It's for persisting admin.flushed every 100 indexes or every 5 minutes, which is not implemented yet.

tonyxuqqi

One general question: When we redo the split, what to do with the existing tablets? They should be deleted. But I did not see the code.

And the more complicated part is for merge, because once merge is committed, the source regions data may be gone. And there's no way to redo the merge.

To get rid of all these problems, I think we should force the flush in both split and merge. Then we don't need to redo any of them.

tonyxuqqi · 2022-12-15T07:59:28Z

components/engine_traits/src/flush.rs

+                    continue;
+                }
+                // Note flushed largest_seqno equals to earliest_seqno of next memtable.
+                if pr.earliest_seqno < largest_seqno {


So if `pr.earliest_seqno >= largest_seqno`, it will panic later. Should we add an assert here to make that clearer?

It will break instead of panicking. If any pr is found, it will not panic.

> It will break instead of panicking. If any pr is found, it will not panic.

Got it.
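
A rough sketch of the loop being discussed, under stated assumptions: `PendingFlush` and `take_flushed` are hypothetical names, the record fields are guessed from the diff hunk, and the sketch returns `None` where the real code may panic when no record matches. It only illustrates why hitting an unflushed record ends the scan with `break` rather than a panic.

```rust
use std::collections::VecDeque;

/// Hypothetical record of a memtable that is waiting to be flushed.
struct PendingFlush {
    cf: &'static str,
    apply_index: u64,
    earliest_seqno: u64,
}

/// Removes every record of `cf` covered by the flush and returns the apply
/// index of the last removed record, or `None` if nothing matched.
fn take_flushed(prs: &mut VecDeque<PendingFlush>, cf: &str, largest_seqno: u64) -> Option<u64> {
    let mut flushed = None;
    let mut i = 0;
    while i < prs.len() {
        if prs[i].cf != cf {
            i += 1;
            continue;
        }
        // The flushed largest_seqno equals the earliest_seqno of the next
        // memtable, so a strictly smaller earliest_seqno means "flushed".
        if prs[i].earliest_seqno < largest_seqno {
            // Do not advance `i`: removal shifts the next record into slot `i`.
            flushed = prs.remove(i).map(|pr| pr.apply_index);
            continue;
        }
        // First unflushed record of this CF: stop scanning.
        break;
    }
    flushed
}
```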

tonyxuqqi · 2022-12-15T08:05:51Z

components/raftstore-v2/src/operation/command/write/mod.rs

+        let off = cf_offset(cf);
+        if self.should_skip(off, index) {
+            return Ok(());
+        }


Why not add a `pre_apply` to wrap this code? Otherwise it's duplicated in all `apply_` methods.

let off = cf_offset(cf); if self.should_skip(off, index) { return Ok(()); }

Also add a `post_apply` for
self.modifications_mut()[off] = index;

These `pre_apply`/`post_apply` hooks can be called from outside.
Then `apply_put` and `apply_delete` won't need to change at all.

It's clearer to call it in place. The semantics of `modifications` is whether the command changes data, not whether the command succeeds.
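
The in-place pattern defended here can be sketched as follows. This is a simplified model, not the PR's `Apply` type: the CF list, the `flushed` field, the `writes` buffer, and the rule inside `should_skip` (skip when the log index is already covered by the CF's flushed index) are all assumptions made for illustration.

```rust
/// Assumed data CFs; the real list lives in TiKV's `engine_traits`.
const DATA_CFS_LEN: usize = 3;
const DATA_CFS: [&str; DATA_CFS_LEN] = ["default", "lock", "write"];
type DataTrace = [u64; DATA_CFS_LEN];

fn cf_offset(cf: &str) -> usize {
    DATA_CFS.iter().position(|c| *c == cf).expect("unknown data cf")
}

#[derive(Default)]
struct Apply {
    /// Per CF: highest log index already flushed to disk (assumed field).
    flushed: DataTrace,
    /// Per CF: last log index that actually changed data.
    modifications: DataTrace,
    /// Stand-in for the write batch.
    writes: Vec<(String, Vec<u8>)>,
}

impl Apply {
    /// Assumed rule: a replayed log at or below the flushed index is skipped.
    fn should_skip(&self, off: usize, index: u64) -> bool {
        index <= self.flushed[off]
    }

    fn apply_put(&mut self, cf: &str, index: u64, key: &[u8]) {
        let off = cf_offset(cf);
        if self.should_skip(off, index) {
            return;
        }
        self.writes.push((cf.to_string(), key.to_vec()));
        // Recorded only because data was really changed for this CF.
        self.modifications[off] = index;
    }
}
```

Keeping the check and the bookkeeping next to the write makes it visible that `modifications` moves only when a CF's data changes, which is the semantics the author describes.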

tonyxuqqi · 2022-12-15T18:19:32Z

components/engine_traits/src/flush.rs

            state_changes: changes,
        });
    }

    /// Called a memtable finished flushing.
-    pub fn on_flush_completed(&self, cf: &str, id: u64) {
+    pub fn on_flush_completed(&self, cf: &str, largest_seqno: u64) {


I remember that in v1 this API was called in a separate thread rather than in the raft threads; otherwise it would cause a performance regression.

In v2 this API is called by rocksdb flush thread.

tonyxuqqi · 2022-12-15T18:26:10Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+
+#[derive(Clone, Copy, Default)]
+struct Progress {
+    flushed: u64,


Suggested change

-    flushed: u64,
+    flushed_index: u64,

tonyxuqqi · 2022-12-15T18:51:17Z

components/raftstore-v2/src/raft/apply.rs

    applied_term: u64,
+    modifications: DataTrace,


please add comments for these fields.

tonyxuqqi · 2022-12-15T18:53:10Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+}
+
+/// An alias of frequent use type that each data cf has a u64.
+pub type DataTrace = [u64; DATA_CFS_LEN];


Suggested change

-pub type DataTrace = [u64; DATA_CFS_LEN];
+pub type CfsFlushIndexes = [u64; DATA_CFS_LEN];

`DataTrace` is too general; it's hard to understand what exactly it means.

Actually `DataTrace` is a better name to reflect the idea. We don't trace the indexes of all CFs; it only traces the data-related CFs.

tonyxuqqi · 2022-12-15T18:56:08Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+        }
+    }
+
+    fn record_modify(&mut self, cf: &str, index: u64) {


Suggested change

-    fn record_modify(&mut self, cf: &str, index: u64) {
+    fn set_last_modified_index(&mut self, cf: &str, index: u64) {

Same for the other `record_xxx` methods. Let's use `set_` for consistency. Also, please add the suffix `_index` to be more specific.

Avoiding the `_index` suffix is on purpose; otherwise almost every method and field in this file would have the suffix. As long as we all know the file is about indexes, the suffix is not needed.

tonyxuqqi · 2022-12-15T18:56:38Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+        self.data_cfs.iter().map(|p| p.flushed).min().unwrap()
+    }
+
+    fn record_flush(&mut self, cf: &str, index: u64) {


Suggested change

-    fn record_flush(&mut self, cf: &str, index: u64) {
+    fn set_cf_flushed_index(&mut self, cf: &str, index: u64) {

record_flush here means record a flush event.

tonyxuqqi · 2022-12-15T18:57:53Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+    }
+
+    #[inline]
+    fn data_index(&self) -> u64 {


Suggested change

-    fn data_index(&self) -> u64 {
+    fn min_flushed_index(&self) -> u64 {

They are not the same. `data_index` is the index of the data CFs, not an arbitrary flushed index.
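
To make the distinction concrete, here is a minimal sketch built from the fields and the one-line body quoted in the hunks above. The value of `DATA_CFS_LEN` and the reduced `Progress` struct (only `flushed`) are assumptions; the real types carry more state.

```rust
/// Assumed number of data CFs.
const DATA_CFS_LEN: usize = 3;

#[derive(Clone, Copy, Default)]
struct Progress {
    flushed: u64,
}

#[derive(Default)]
struct ApplyTrace {
    /// One progress per data CF.
    data_cfs: Box<[Progress; DATA_CFS_LEN]>,
    /// Progress of admin (non-data) state; tracked separately.
    admin: Progress,
}

impl ApplyTrace {
    /// Smallest flushed index across the data CFs only; `admin` is not part of it.
    fn data_index(&self) -> u64 {
        self.data_cfs.iter().map(|p| p.flushed).min().unwrap()
    }
}
```

With data CFs flushed at 12, 7 and 30 and `admin.flushed` at 5, `data_index()` is 7, not 5, which is why a name like `min_flushed_index` would suggest something broader than what the method computes.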

tabokie · 2022-12-16T10:20:05Z

components/raftstore-v2/src/operation/ready/snapshot.rs

+        lb.put_region_state(region_id, last_index, self.region_state())
+            .unwrap();
+        for cf in ALL_CFS {
+            lb.put_flushed_index(region_id, cf, last_index, last_index)


Why isn't it enough to just record CF_RAFT?

tabokie · 2022-12-16T10:25:33Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+            .get_region_state(region_id, trace.admin.flushed)?
+            .unwrap();
+        let data_index = trace.data_index();
+        // If index is not larger than applied_index, it means some CF doesn't have any


I don't know what "index" and "applied_index" refer to here. Should they be "data_index" and "admin_flushed"?

tabokie · 2022-12-16T10:26:02Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+        apply_trace.maybe_advance_admin_flushed(apply_index);
+    }
+
+    pub fn on_manual_flush(&mut self, cfs: Vec<&'static str>, ch: CmdResChannel) {


This isn't needed anymore after post write callback?

No, as far as I know.

Remove it now?

tabokie · 2022-12-16T11:26:23Z

components/raftstore-v2/src/raft/peer.rs

@@ -78,6 +82,13 @@ pub struct Peer<EK: KvEngine, ER: RaftEngine> {

    // Trace which peers have not finished split.
    split_trace: Vec<(u64, HashSet<u64>)>,
+
+    /// Apply ralated State changes that needs to be persisted to raft engine.


Suggested change

-    /// Apply ralated State changes that needs to be persisted to raft engine.
+    /// Apply related state changes that needs to be persisted to raft engine.

tabokie · 2022-12-16T11:30:21Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+    /// Logs may be replayed from the some apply index, but those data may have
+    /// been flushed in the past, so we need the flushed indexes to decide what
+    /// logs can be skipped for certain CFs. If all CFs are flushed before the
+    /// apply index, `None` is returned.


apply index -> admin flushed / persisted apply index ?

tabokie · 2022-12-16T11:36:00Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+
+    // All events before `mem_index` must be consumed before calling this function.
+    fn maybe_advance_admin_flushed(&mut self, mem_index: u64) {
+        if self.admin.flushed < self.admin.last_modified {


I think it can be confusing for `admin` to reuse the `Progress` struct. A normal progress should hold the invariant `flushed <= last_modified`.

No, the relation between flushed and last_modified can be arbitrary.

tabokie · 2022-12-16T11:38:27Z

components/raftstore-v2/src/operation/ready/apply_trace.rs

+        }
+        // At best effort, we can only advance the index to `mem_index`.
+        let mut candidate = mem_index;
+        for pr in self.data_cfs.iter() {


maybe filter().map().min() is more readable.
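
The two styles can be compared side by side in a sketch. The loop body is not shown in the hunk, so the rule used here is an assumption: a data CF whose `last_modified` differs from `flushed` still has unflushed changes and caps the candidate at its `flushed` index. `candidate_loop` and `candidate_iter` are hypothetical names.

```rust
#[derive(Clone, Copy, Default)]
struct Progress {
    flushed: u64,
    last_modified: u64,
}

/// Loop form, in the shape of the hunk above.
fn candidate_loop(data_cfs: &[Progress], mem_index: u64) -> u64 {
    // At best effort, the index can only advance to `mem_index`.
    let mut candidate = mem_index;
    for pr in data_cfs.iter() {
        // ASSUMPTION: only CFs with unflushed changes hold the candidate back.
        if pr.last_modified != pr.flushed && pr.flushed < candidate {
            candidate = pr.flushed;
        }
    }
    candidate
}

/// Iterator form suggested in the review: filter, map, then min.
fn candidate_iter(data_cfs: &[Progress], mem_index: u64) -> u64 {
    data_cfs
        .iter()
        .filter(|pr| pr.last_modified != pr.flushed)
        .map(|pr| pr.flushed)
        .min()
        .map_or(mem_index, |m| m.min(mem_index))
}
```

Both functions return the same value for any input; the iterator form states the selection rule in one expression, while the loop keeps the running `candidate` explicit.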

BusyJay · 2022-12-17T03:35:55Z

/merge

ti-chi-bot · 2022-12-17T03:35:57Z

@BusyJay: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot · 2022-12-17T03:35:59Z

This pull request has been accepted and is ready to merge.

Commit hash: 261c162

ti-chi-bot · 2022-12-17T03:36:12Z

@BusyJay: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

BusyJay · 2022-12-17T05:07:36Z

/merge

ti-chi-bot · 2022-12-17T05:07:37Z

@BusyJay: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

ti-chi-bot · 2022-12-17T05:07:39Z

This pull request has been accepted and is ready to merge.

Commit hash: 42fa3c9

ti-chi-bot added release-note-none size/XXL labels Dec 14, 2022

BusyJay added 2 commits December 14, 2022 20:00

staging

2b38fa6

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

ignore id check

332d4d5

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

BusyJay force-pushed the introduce-apply-trace branch from 48dc36c to 332d4d5 Compare December 14, 2022 12:44

fix test

8005e7f

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

BusyJay requested review from tabokie and tonyxuqqi December 14, 2022 13:34

add unit test case for persistence listener

48e22ce

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

BusyJay force-pushed the introduce-apply-trace branch from 6e9fcdc to 48e22ce Compare December 14, 2022 17:14

tonyxuqqi reviewed Dec 15, 2022

View reviewed changes

add more test case for snapshot

96a3e47

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

tabokie reviewed Dec 15, 2022

View reviewed changes

tonyxuqqi reviewed Dec 15, 2022

View reviewed changes

address comment

43e0396

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

tabokie reviewed Dec 16, 2022

View reviewed changes

BusyJay added 3 commits December 16, 2022 18:37

let raftstore to store apply related states and address comment

7acbcce

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

make it compile

5dcc343

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

update comment

a8a407f

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

tabokie approved these changes Dec 16, 2022

View reviewed changes

ti-chi-bot added the status/LGT1 Status: PR - There is already 1 approval label Dec 16, 2022

address comment

261c162

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

tonyxuqqi approved these changes Dec 16, 2022

View reviewed changes

ti-chi-bot added status/LGT2 Status: PR - There are already 2 approvals and removed status/LGT1 Status: PR - There is already 1 approval labels Dec 16, 2022

ti-chi-bot added the status/can-merge Status: Can merge to base branch label Dec 17, 2022

Merge branch 'master' into introduce-apply-trace

561b5ef

fix test

42fa3c9

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

ti-chi-bot removed the status/can-merge Status: Can merge to base branch label Dec 17, 2022

ti-chi-bot added the status/can-merge Status: Can merge to base branch label Dec 17, 2022

ti-chi-bot merged commit 416f7b7 into tikv:master Dec 17, 2022

ti-chi-bot added this to the Pool milestone Dec 17, 2022

BusyJay deleted the introduce-apply-trace branch December 17, 2022 05:10
