raftstore: restrict the total write size of each apply round #13594
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
Signed-off-by: glorv <glorvs@163.com>
force-pushed from 7de46ed to d337d1c
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
/test
@Connor1996 @BusyJay PTAL
match normal.receiver.try_recv() {
-    Ok(msg) => self.msg_buf.push(msg),
+    Ok(msg) => {
+        total_size += msg.entries_size();
Why not check the batch size directly by moving the handle-task logic into this loop?
Sorry, I don't get your point. Is there any benefit to doing so?
The msg's size doesn't reflect the real size written to the db, but the batch size does.
How about yielding at a size threshold?
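In other words, the suggestion is roughly to decide when to yield from the bytes actually staged in the write batch rather than from the incoming message size. A hedged sketch of that idea (the names are illustrative, not TiKV API; `wb_data_size` stands in for whatever the engine write batch reports as its staged size, and 256 is the existing key-count flush threshold mentioned later in this thread):

```rust
// Hedged sketch: yield based on the write batch's staged bytes, not the
// message's entry size. `wb_data_size` is a stand-in for the engine write
// batch's reported size; all names here are illustrative.
fn should_yield(wb_data_size: usize, keys_in_batch: usize, yield_size: usize) -> bool {
    // Yield when either the staged bytes cross the configured threshold
    // or the existing 256-key flush threshold is reached.
    wb_data_size >= yield_size || keys_in_batch >= 256
}
```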
@@ -116,6 +116,7 @@ pub struct Config {
     #[online_config(skip)]
     pub notify_capacity: usize,
     pub messages_per_tick: usize,
+    pub messages_size_per_tick: usize,
Size configurations should be defined as ReadableSize.
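For context, a minimal sketch of what the suggestion amounts to, assuming TiKV's usual serde-based config style (the field name mirrors the diff above; the trimmed-down struct and the default value are illustrative, not the merged code):

```rust
use serde::{Deserialize, Serialize};
use tikv_util::config::ReadableSize;

// Trimmed-down sketch: typing the size option as ReadableSize lets values
// like "32KB" be parsed from the config file instead of raw byte counts.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(default, rename_all = "kebab-case")]
pub struct Config {
    pub messages_size_per_tick: ReadableSize,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            // Illustrative default only; the PR later settles on 32 KiB for
            // the renamed apply_yield_msg_size option.
            messages_size_per_tick: ReadableSize::kb(32),
        }
    }
}
```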
Signed-off-by: glorv <glorvs@163.com>
…nto entry-size-per-tick
@BusyJay @Connor1996 PTAL again, thanks
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Any test cases?
@@ -386,6 +387,7 @@ impl Default for Config {
     hibernate_regions: true,
     dev_assert: false,
     apply_yield_duration: ReadableDuration::millis(500),
+    apply_yield_msg_size: ReadableSize::kb(32),
Note that the raft batch max size is 2 MiB by default.
In my test, 32 KiB is a good choice because the max message threshold is 16 KiB and the apply thread flushes the write batch at 256 keys. In the general case, a single batch entry won't be too large when there is no bottleneck; but when the process reaches its bottleneck, bigger raft entries should result in better overall throughput.
Then you need to explain in the docs and in a code comment why this configuration is smaller than the entry size by default.
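A sketch of the kind of code comment being asked for here (the wording is illustrative and not the merged text; the numbers come from the discussion above):

```rust
use tikv_util::config::ReadableSize;

pub struct Config {
    /// Maximum total byte size of committed entries handled in one apply
    /// round before the fsm yields. The default (32 KiB) is deliberately much
    /// smaller than the 2 MiB raft batch max size: the TiDB kv-client already
    /// splits big write requests at roughly 16 KiB and the apply thread
    /// flushes its write batch every 256 keys, so a small per-round cap trims
    /// tail latency from oversized batches without hurting throughput.
    pub apply_yield_msg_size: ReadableSize,
}
```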
for (i, entry) in committed_entries.into_iter().enumerate() {
    batch_size += entry.get_data().len();
    entry_batch.push(entry);
    if batch_size >= ctx.cfg.apply_yield_msg_size.0 as usize
Why do we need to split the tasks?
Can you verify it still performs the same as before?
I tested the new commit with the same case "oltp_read_write" and batch insert. When set
LGTM
Signed-off-by: glorv <glorvs@163.com>
force-pushed from 9d79c52 to 11f392e
Signed-off-by: glorv <glorvs@163.com>
/cc @CalvinNeo
@BusyJay: GitHub didn't allow me to request PR reviews from the following users: CalvinNeo. Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: glorv <glorvs@163.com>
…nto entry-size-per-tick
/run-build-release
I did some benchmarks with write-heavy workloads; the results are as follows:

Test 1: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900. In this test, the CPU usage of all 3 tikv nodes is around 650~700%, and the performance of the new commit is slightly better.

Test 2: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900, where the table schema only contains an auto_increment primary key and no secondary key, so all writes land in the same single region. In this test, one tikv's CPU usage is around 650% while the other two are around 100%, and the performance is almost the same.

Test 3: sysbench oltp_insert --auto_inc --tables=32 --threads=200 --time=900, where all the table schemas only contain the auto_increment primary key and no secondary key. In this test, all three tikv nodes' CPU usage is about 650%, and the performance regresses slightly, by 0.8%.

@BusyJay PTAL
LGTM
/merge
@BusyJay: It seems you want to merge this PR, so I will help you trigger all the tests: /run-all-tests. If you have any questions about the PR merge process, please refer to the pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: a56e49c
/test
What is changed and how it works?
Issue Number: ref #13313
What's Changed:
Currently, there is a config `raftstore.messages-per-tick` that restricts the number of ApplyMsg handled in one round of apply poll. Because one ApplyMsg contains all the committed raft entries of that round of raft poll, a single ApplyMsg can contain many raft entries (and therefore many more kvs) and may take tens of milliseconds for the apply poller to handle. This kind of huge apply can cause latency spikes. While the apply poller can trigger a flush when it meets 256 kvs, this flush is at the ApplyMsg level, so a big ApplyMsg can lead to a long flush time. Furthermore, due to the RocksDB WriteBatch commit mechanism, a slow huge WriteBatch can also slow down other small WriteBatches committed after it, so adding more apply threads helps little in this scenario.

On the TiDB side, the kv-client by default splits a big write request at 16KB, so a `raftstore.messages-per-tick` config bigger than this value can ensure the handled message size of one poll won't be too large. Because most of the apply time is spent on the WriteBatch commit, and tikv has a hard threshold of 256 kvs for triggering a commit, setting a proper message-size threshold (ideally 256 kvs) won't hurt the overall throughput but can reduce tail latency when there are big write requests.
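To make the mechanism concrete, here is a hedged sketch (not the merged TiKV code; `ApplyMsg` is reduced to a stand-in type and the loop is simplified from the diff snippets in the review threads above) of how one apply round can stop pulling messages once their accumulated entry size crosses the threshold:

```rust
use std::sync::mpsc::Receiver;

// Minimal stand-in for TiKV's ApplyMsg; it only carries the total byte size
// of the committed raft entries it holds.
struct ApplyMsg {
    entries_bytes: usize,
}

impl ApplyMsg {
    fn entries_size(&self) -> usize {
        self.entries_bytes
    }
}

// Pull messages for one apply round, bounded both by message count
// (messages-per-tick) and by accumulated entry size (the new threshold).
fn fetch_one_round(
    receiver: &Receiver<ApplyMsg>,
    msg_buf: &mut Vec<ApplyMsg>,
    messages_per_tick: usize,
    yield_msg_size: usize, // e.g. 32 KiB
) {
    let mut total_size = 0;
    while msg_buf.len() < messages_per_tick {
        match receiver.try_recv() {
            Ok(msg) => {
                total_size += msg.entries_size();
                msg_buf.push(msg);
                // Stop early once enough bytes are buffered; the remaining
                // messages stay in the channel for the next poll round, which
                // keeps a single huge batch from monopolizing the apply thread.
                if total_size >= yield_msg_size {
                    break;
                }
            }
            Err(_) => break,
        }
    }
}
```

The same idea also shows up inside the committed-entries loop, where the entry data length is accumulated and compared against `apply_yield_msg_size` so that a single oversized batch is split across rounds (see the diff snippets in the review threads above).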
Benchmark result:
I used sysbench oltp_read_write/oltp_write_only and a single-threaded batch insert program to test the performance impact. The environment contains 2 * tidb, 1 * pd and 3 * tikv (8c16g). The sysbench data is 32 tables with 50M rows per table; the workload runs with 10 threads so the tikv CPU is not fully used. The batch insert runs against a table with 3 secondary keys, at 25k rows per batch.
Each of these two benchmarks runs for 3 rounds, 30 minutes per round. The 1st round only runs the oltp workload, which gives the ideal performance. The 2nd round runs the oltp workload and the bulk insert workload with the old code; the 3rd round runs the oltp workload and the bulk insert workload with the new code.
Related changes
pingcap/docs / pingcap/docs-cn:
Check List
Tests
Side effects
Release note