
raftstore: restrict the total write size of each apply round #13594

Merged
merged 21 commits into tikv:master on Nov 2, 2022

Conversation

glorv
Contributor

@glorv glorv commented Oct 13, 2022

What is changed and how it works?

Issue Number: ref #13313

What's Changed:

Add a new config `raftstore.apply-yield-write-size` that yields the processing of the current fsm after a flush once the flushed data size reaches this threshold.

Currently, there is a config `raftstore.messages-per-tick` that restricts the number of ApplyMsgs handled in one round of apply poll. Because one ApplyMsg contains all the committed raft entries from that round of raft poll, a single ApplyMsg can contain many raft entries (and therefore many more kvs) and can take the apply poller tens of milliseconds to handle. This kind of huge apply can cause a latency spike. Although the apply poller can trigger a flush when it meets 256 kvs, this flush is at the ApplyMsg level, so a big ApplyMsg can still lead to a long flush time. Furthermore, due to the RocksDB WriteBatch commit mechanism, a slow huge WriteBatch also slows down the small WriteBatches committed after it, so adding more apply threads helps little in this scenario.
On the TiDB side, the kv-client by default splits a big write request into pieces of 16KB, so a size threshold bigger than this value ensures the data handled in one poll won't be too large.
Because most of the apply time is spent in the WriteBatch commit and TiKV has a hard threshold of 256 kvs for triggering a commit, setting a proper size threshold (ideally around 256 kvs' worth of data) won't hurt overall throughput but can reduce tail latency when there are big write requests.
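
To make the yield mechanism easier to picture, here is a minimal, self-contained sketch. It is not the actual TiKV code: `ApplyLoop`, `PollResult`, and `poll_one_fsm` are made-up names, and the entries are plain byte buffers. It only shows the idea of stopping a peer's apply round once the bytes written reach a threshold.

```rust
/// Minimal sketch of yield-on-write-size; all names are illustrative.
struct ApplyLoop {
    /// Conceptually plays the role of `raftstore.apply-yield-write-size`.
    yield_threshold: usize,
}

enum PollResult {
    /// Every pending entry of this fsm was applied in this round.
    Done,
    /// Stop early and reschedule the fsm so other peers get a turn.
    Yield,
}

impl ApplyLoop {
    fn poll_one_fsm(&mut self, pending: &mut Vec<Vec<u8>>) -> PollResult {
        let mut written = 0usize;
        while let Some(entry) = pending.pop() {
            written += entry.len();
            // ... apply `entry` into the write batch here ...
            if written >= self.yield_threshold {
                // Flush what has been written so far and yield, instead of
                // draining all committed entries of this fsm in one round.
                return PollResult::Yield;
            }
        }
        PollResult::Done
    }
}

fn main() {
    let mut poller = ApplyLoop { yield_threshold: 32 * 1024 };
    // Three 16 KB "entries": the first round applies ~32 KB and then yields,
    // and a later round finishes the rest.
    let mut pending = vec![vec![0u8; 16 * 1024]; 3];
    assert!(matches!(poller.poll_one_fsm(&mut pending), PollResult::Yield));
    assert!(matches!(poller.poll_one_fsm(&mut pending), PollResult::Done));
}
```

In the real apply poller the yield happens after a write-batch flush, so the data accumulated so far is persisted before the fsm is rescheduled.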

Benchmark result:
I used sysbench oltp_read_write/oltp_write_only and a single-thread batch insert program to test the performance impact. The environment contains 2 * tidb, 1 * pd and 3 * tikv (8c16g). The sysbench data set is 32 tables with 50M rows per table, and the workload runs with 10 threads so TiKV's CPU is not fully used. The batch insert runs against a table with 3 secondary keys and inserts 25k rows per batch.

Each of these two benchmarks runs for 3 rounds, with 30 minutes per round. The 1st round runs only the oltp workload; this is the ideal performance. The 2nd round runs the oltp workload and the bulk insert workload with the old code; the 3rd round runs the oltp workload and the bulk insert workload with the new code.

  • oltp_write_only:

| metric | oltp_write_only | oltp_write_only + batch insert (old) | oltp_write_only + batch insert (new) |
| --- | --- | --- | --- |
| tps | 2110.76 | 1734.54 | 1762.99 |
| qps | 12664.55 | 10407.25 | 10577.97 |
| lat min (ms) | 3.09 | 3.08 | 3.13 |
| lat avg (ms) | 4.74 | 5.76 | 5.67 |
| lat 99% (ms) | 10.27 | 32.53 | 19.65 |
| lat max (ms) | 184.06 | 325.08 | 209.06 |
| batch insert count | - | 564 | 553 |
| batch insert avg duration | - | 3.20s | 3.26s |
  • oltp_read_write:

| metric | oltp_read_write | oltp_read_write + batch insert (old) | oltp_read_write + batch insert (new) |
| --- | --- | --- | --- |
| tps | 531.54 | 446.15 | 504.90 |
| qps | 10630.85 | 8923.06 | 10097.99 |
| lat min (ms) | 13.15 | 14.21 | 13.42 |
| lat avg (ms) | 18.81 | 22.41 | 19.80 |
| lat 99% (ms) | 33.12 | 77.19 | 44.17 |
| lat max (ms) | 318.01 | 361.58 | 209.91 |
| batch insert count | - | 529 | 553 |
| batch insert avg duration | - | 3.41s | 3.26s |

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression
  • Breaking backward compatibility

Release note

Add a new config `raftstore.apply-yield-write-size` to yield the fsm when the written data size exceeds this threshold.
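
For illustration, a hypothetical `tikv.toml` snippet showing how this option could be set under the standard `[raftstore]` section; the key name is taken from this PR's release note, the value is the 32KB default discussed below, and the exact syntax should be confirmed against the released docs:

```toml
# Hypothetical snippet; yields the current fsm after roughly this much data
# has been written in one apply round.
[raftstore]
apply-yield-write-size = "32KB"
```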

@ti-chi-bot
Member

ti-chi-bot commented Oct 13, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • BusyJay
  • Connor1996

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
@glorv
Contributor Author

glorv commented Oct 13, 2022

/test

@glorv
Contributor Author

glorv commented Oct 21, 2022

@Connor1996 @BusyJay PTAL

```diff
 match normal.receiver.try_recv() {
-    Ok(msg) => self.msg_buf.push(msg),
+    Ok(msg) => {
+        total_size += msg.entries_size();
```
Member

Why not check batch size directly by moving the handle task logic into this loop

Contributor Author

Sorry, I don't get your point. Is there any benefit to doing so?

Member

msg's size doesn't mean the real size written to db, but batch size does

@glorv glorv requested a review from BusyJay October 25, 2022 10:05
Member

@BusyJay BusyJay left a comment

How about yield at size?

```diff
@@ -116,6 +116,7 @@ pub struct Config {
     #[online_config(skip)]
     pub notify_capacity: usize,
     pub messages_per_tick: usize,
+    pub messages_size_per_tick: usize,
```
Member

Size configuration should be defined as ReadableSize.

Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
@glorv
Contributor Author

glorv commented Oct 27, 2022

@BusyJay @Connor1996 PTAL again, thanks

Member

@BusyJay BusyJay left a comment

Any test cases?

```diff
@@ -386,6 +387,7 @@ impl Default for Config {
             hibernate_regions: true,
             dev_assert: false,
             apply_yield_duration: ReadableDuration::millis(500),
+            apply_yield_msg_size: ReadableSize::kb(32),
```
Member

Note raft batch max size is 2MiB by default.

Contributor Author

In my test, 32KB is a good choice because the max message threshold is 16KB and the apply thread flushes the write batch at 256 keys. In the general case, one batch entry won't be too large when there is no bottleneck; but when the process reaches its bottleneck, bigger raft entries should result in better overall throughput.

Member

Then you need to explain in the docs and a code comment why this configuration is smaller than the entry size by default.

```rust
for (i, entry) in committed_entries.into_iter().enumerate() {
    batch_size += entry.get_data().len();
    entry_batch.push(entry);
    if batch_size >= ctx.cfg.apply_yield_msg_size.0 as usize
```
Member

Why need to split tasks?

@BusyJay
Member

BusyJay commented Oct 27, 2022

Can you verify it still performs the same as before?

Signed-off-by: glorv <glorvs@163.com>
@ti-chi-bot ti-chi-bot added size/S and removed size/L labels Oct 27, 2022
@glorv
Contributor Author

glorv commented Oct 27, 2022

Can you verify it still performs the same as before?

I tested the new commit with the same oltp_read_write + batch insert case. With apply_yield_msg_size set to a large number, the oltp_read_write p99 latency is about 80ms; with it set to 32KiB, the p99 latency is 37ms. So I think the performance is the same as with the old change.

Member

@Connor1996 Connor1996 left a comment

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Status: PR - There is already 1 approval label Oct 28, 2022
@ti-chi-bot ti-chi-bot added size/L and removed size/S labels Oct 28, 2022
Signed-off-by: glorv <glorvs@163.com>
@BusyJay BusyJay added the component/configuration Component: Configuration label Oct 28, 2022
Member

@BusyJay BusyJay left a comment

@ti-chi-bot
Member

@BusyJay: GitHub didn't allow me to request PR reviews from the following users: CalvinNeo.

Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @CalvinNeo

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot added status/LGT2 Status: PR - There are already 2 approvals and removed status/LGT1 Status: PR - There is already 1 approval labels Oct 28, 2022
@glorv
Contributor Author

glorv commented Oct 31, 2022

/run-build-release

@BusyJay BusyJay changed the title raftstore: restrict the total message size of each apply round raftstore: restrict the total write size of each apply round Nov 1, 2022
@glorv
Contributor Author

glorv commented Nov 1, 2022

I did some benchmarks with write-heavy workloads; the results are as follows:

test 1: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900

| branch | qps | avg lat (ms) | p99 lat (ms) | p999 lat (ms) |
| --- | --- | --- | --- | --- |
| master | 24105 | 8.08 | 18.1 | 30.8 |
| this PR | 25404 | 7.66 | 17.3 | 30.7 |

In this test, the CPU usage of all 3 TiKVs is around 650~700%, and the performance of the new commit is slightly better.

test 2: sysbench oltp_insert --auto_inc --tables=1 --threads=200 --time=900. In this test, the table schema only contains an auto_increment primary key and no secondary index, so all writes land in the same single region.

| branch | qps | avg lat (ms) | p99 lat (ms) | p999 lat (ms) |
| --- | --- | --- | --- | --- |
| master | 47348 | 4.00 | 8.80 | 15.6 |
| this PR | 47779 | 3.96 | 9.02 | 15.6 |

In this test, one TiKV's CPU usage is around 650% while the other two are around 100%, and the performance is almost the same.

test 3: sysbench oltp_insert --auto_inc --tables=32 --threads=200 --time=900. In this test, each table schema only contains the auto_increment primary key and no secondary index.

| branch | qps | avg lat (ms) | p99 lat (ms) | p999 lat (ms) |
| --- | --- | --- | --- | --- |
| master | 52153 | 3.58 | 9.41 | 17.5 |
| this PR | 51741 (-0.8%) | 3.61 | 10.3 | 18.2 |

In this test, all three TiKVs' CPU usage is about 650%; the performance regresses slightly, by 0.8%.

@BusyJay PTAL

@BusyJay
Member

BusyJay commented Nov 2, 2022

LGTM

@BusyJay
Member

BusyJay commented Nov 2, 2022

/merge

@ti-chi-bot
Member

@BusyJay: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: a56e49c

@ti-chi-bot ti-chi-bot added the status/can-merge Status: Can merge to base branch label Nov 2, 2022
@glorv
Contributor Author

glorv commented Nov 2, 2022

/test

@ti-chi-bot ti-chi-bot merged commit de73806 into tikv:master Nov 2, 2022
@ti-chi-bot ti-chi-bot added this to the Pool milestone Nov 2, 2022
@glorv glorv deleted the entry-size-per-tick branch November 4, 2022 02:48