raftstore: restrict the total write size of each apply round #13594
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
Signed-off-by: glorv <glorvs@163.com>
force-pushed from 7de46ed to d337d1c
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
/test
@Connor1996 @BusyJay PTAL
match normal.receiver.try_recv() {
-    Ok(msg) => self.msg_buf.push(msg),
+    Ok(msg) => {
+        total_size += msg.entries_size();
Why not check the batch size directly by moving the handle-task logic into this loop?
Sorry, I don't get your point. Is there any benefit to doing so?
The msg's size doesn't reflect the real size written to the db, but the batch size does.
How about yielding at a size threshold?
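In other words, the suggestion is roughly to decide when to yield from the bytes actually staged in the write batch rather than from the incoming message size. A hedged sketch of that idea (the names are illustrative, not TiKV API; `wb_data_size` stands in for whatever the engine write batch reports as its staged size, and 256 is the existing key-count flush threshold mentioned later in this thread):

```rust
// Hedged sketch: yield based on the write batch's staged bytes, not the
// message's entry size. `wb_data_size` is a stand-in for the engine write
// batch's reported size; all names here are illustrative.
fn should_yield(wb_data_size: usize, keys_in_batch: usize, yield_size: usize) -> bool {
    // Yield when either the staged bytes cross the configured threshold
    // or the existing 256-key flush threshold is reached.
    wb_data_size >= yield_size || keys_in_batch >= 256
}
```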
@@ -116,6 +116,7 @@ pub struct Config {
     #[online_config(skip)]
     pub notify_capacity: usize,
     pub messages_per_tick: usize,
+    pub messages_size_per_tick: usize,
Size configurations should be defined as ReadableSize.
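For context, a minimal sketch of what the suggestion amounts to, assuming TiKV's usual serde-based config style (the field name mirrors the diff above; the trimmed-down struct and the default value are illustrative, not the merged code):

```rust
use serde::{Deserialize, Serialize};
use tikv_util::config::ReadableSize;

// Trimmed-down sketch: typing the size option as ReadableSize lets values
// like "32KB" be parsed from the config file instead of raw byte counts.
#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(default, rename_all = "kebab-case")]
pub struct Config {
    pub messages_size_per_tick: ReadableSize,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            // Illustrative default only; the PR later settles on 32 KiB for
            // the renamed apply_yield_msg_size option.
            messages_size_per_tick: ReadableSize::kb(32),
        }
    }
}
```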
Signed-off-by: glorv <glorvs@163.com>
…nto entry-size-per-tick
@BusyJay @Connor1996 PTAL again, thanks
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Any test cases?
@@ -386,6 +387,7 @@ impl Default for Config {
     hibernate_regions: true,
     dev_assert: false,
     apply_yield_duration: ReadableDuration::millis(500),
+    apply_yield_msg_size: ReadableSize::kb(32),
Note that the raft batch max size is 2 MiB by default.
In my test, 32 KiB is a good choice because the max message threshold is 16 KiB and the apply thread flushes the write batch at 256 keys. In the general case, a single batch entry won't be too large when there is no bottleneck; but when the process reaches its bottleneck, bigger raft entries should result in better overall throughput.
Then you need to explain in the docs and in a code comment why this configuration is smaller than the entry size by default.
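A sketch of the kind of code comment being asked for here (the wording is illustrative and not the merged text; the numbers come from the discussion above):

```rust
use tikv_util::config::ReadableSize;

pub struct Config {
    /// Maximum total byte size of committed entries handled in one apply
    /// round before the fsm yields. The default (32 KiB) is deliberately much
    /// smaller than the 2 MiB raft batch max size: the TiDB kv-client already
    /// splits big write requests at roughly 16 KiB and the apply thread
    /// flushes its write batch every 256 keys, so a small per-round cap trims
    /// tail latency from oversized batches without hurting throughput.
    pub apply_yield_msg_size: ReadableSize,
}
```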
for (i, entry) in committed_entries.into_iter().enumerate() {
    batch_size += entry.get_data().len();
    entry_batch.push(entry);
    if batch_size >= ctx.cfg.apply_yield_msg_size.0 as usize
Why do we need to split the tasks?
Can you verify it still performs the same as before?
I tested the new commit with the same case "oltp_read_write" and batch insert. When set
LGTM
Signed-off-by: glorv <glorvs@163.com>
force-pushed from 9d79c52 to 11f392e
Signed-off-by: glorv <glorvs@163.com>
/cc @CalvinNeo
@BusyJay: GitHub didn't allow me to request PR reviews from the following users: CalvinNeo. Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: glorv <glorvs@163.com>
…nto entry-size-per-tick
/run-build-release
I did some benchmarks with write-heavy workloads; the results are as follows:

Test 1: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900. In this test, the CPU usage of all 3 tikv nodes is around 650~700%, and the performance of the new commit is slightly better.

Test 2: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900, where the table schema only contains an auto_increment primary key and no secondary key, so all writes land in the same single region. In this test, one tikv's CPU usage is around 650% while the other two are around 100%, and the performance is almost the same.

Test 3: sysbench oltp_insert --auto_inc --tables=32 --threads=200 --time=900, where all the table schemas only contain the auto_increment primary key and no secondary key. In this test, all three tikv nodes' CPU usage is about 650%, and the performance regresses slightly, by 0.8%.

@BusyJay PTAL
LGTM
/merge
@BusyJay: It seems you want to merge this PR, so I will help you trigger all the tests: /run-all-tests. If you have any questions about the PR merge process, please refer to the pr process. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: a56e49c
/test
What is changed and how it works?
Issue Number: ref #13313
What's Changed:
Currently, there is a config `raftstore.messages-per-tick` that restricts the number of ApplyMsg handled in one round of apply poll. Because one ApplyMsg contains all the committed raft entries of that round of raft poll, a single ApplyMsg can contain many raft entries (and therefore many more kvs) and may take tens of milliseconds for the apply poller to handle. This kind of huge apply can cause latency spikes. While the apply poller can trigger a flush when it meets 256 kvs, this flush is at the ApplyMsg level, so a big ApplyMsg can lead to a long flush time. Furthermore, due to the RocksDB WriteBatch commit mechanism, a slow huge WriteBatch can also slow down other small WriteBatches committed after it, so adding more apply threads helps little in this scenario.

On the TiDB side, the kv-client by default splits a big write request at 16KB, so a `raftstore.messages-per-tick` config bigger than this value can ensure the handled message size of one poll won't be too large. Because most of the apply time is spent on the WriteBatch commit, and tikv has a hard threshold of 256 kvs for triggering a commit, setting a proper message-size threshold (ideally 256 kvs) won't hurt the overall throughput but can reduce tail latency when there are big write requests.
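To make the mechanism concrete, here is a hedged sketch (not the merged TiKV code; `ApplyMsg` is reduced to a stand-in type and the loop is simplified from the diff snippets in the review threads above) of how one apply round can stop pulling messages once their accumulated entry size crosses the threshold:

```rust
use std::sync::mpsc::Receiver;

// Minimal stand-in for TiKV's ApplyMsg; it only carries the total byte size
// of the committed raft entries it holds.
struct ApplyMsg {
    entries_bytes: usize,
}

impl ApplyMsg {
    fn entries_size(&self) -> usize {
        self.entries_bytes
    }
}

// Pull messages for one apply round, bounded both by message count
// (messages-per-tick) and by accumulated entry size (the new threshold).
fn fetch_one_round(
    receiver: &Receiver<ApplyMsg>,
    msg_buf: &mut Vec<ApplyMsg>,
    messages_per_tick: usize,
    yield_msg_size: usize, // e.g. 32 KiB
) {
    let mut total_size = 0;
    while msg_buf.len() < messages_per_tick {
        match receiver.try_recv() {
            Ok(msg) => {
                total_size += msg.entries_size();
                msg_buf.push(msg);
                // Stop early once enough bytes are buffered; the remaining
                // messages stay in the channel for the next poll round, which
                // keeps a single huge batch from monopolizing the apply thread.
                if total_size >= yield_msg_size {
                    break;
                }
            }
            Err(_) => break,
        }
    }
}
```

The same idea also shows up inside the committed-entries loop, where the entry data length is accumulated and compared against `apply_yield_msg_size` so that a single oversized batch is split across rounds (see the diff snippets in the review threads above).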
Benchmark result:
I used sysbench oltp_read_write/oltp_write_only and a single-threaded batch insert program to test the performance impact. The environment contains 2 * tidb, 1 * pd and 3 * tikv (8c16g). The sysbench data is 32 tables with 50M rows per table; the workload runs with 10 threads so the tikv CPU is not fully used. The batch insert runs against a table with 3 secondary keys, at 25k rows per batch.
Each of these two benchmarks runs for 3 rounds, 30 minutes per round. The 1st round only runs the oltp workload, which gives the ideal performance. The 2nd round runs the oltp workload and the bulk insert workload with the old code; the 3rd round runs the oltp workload and the bulk insert workload with the new code.
Related changes
pingcap/docs / pingcap/docs-cn:
Check List
Tests
Side effects
Release note