Performance Diagnosis Enhancements #12362

sticnarf · 2022-04-14T09:13:57Z

TiKV users and developers often care about performance, but it is never easy to diagnose performance issues or find performance bottlenecks.

Latency of some parts in the critical path is not recorded, sometimes making the cause of performance issues only inferred indirectly.
Metrics displayed in the Grafana dashboard are not very well organized. It's not easy to find the cause quickly.
It's hard to investigate the cause of an individual slow request if the overall latency is normal.

To improve the situation, we need to fill up the missing parts of the metrics, add more details to the metrics and try to improve the way we display the metrics. And, it's better to collect performance data in the request level and send back to the client if possible, enabling analyzing tail latency causes.

Data to be collected

Below are the missing data we are going to collect. "(R)" means the data may be collected for each request, and "(M)" means they are collected as metrics.

Read

Write

Background jobs

(M) Backup I/O usage
(M) Backup network bandwidth usage
(M) Generating snapshot I/O usage #12838
(M) Sending snapshot network bandwidth usage

Internal jobs from TiDB (ref pingcap/tidb#33963)

(M) TiDB auto-analyze CPU and I/O usage
(M) TiDB DDL reorganization (add index/change column) CPU and I/O usage, raftstore commands ratio
(M) Checksum CPU and I/O usage

Diagnosis helper

The performance overview dashboard displays performance critical metrics. Some representitive newly added metrics can be put on the same dashboard as well.

Smart diagnosis service may use the metrics as input and generate reports about possible cause of the problem. But this is not in the TiKV project plan AFAIK.

Existing related issues

The following issues were created before, they may be some sub-tasks of this meta-issue. We hope this issue will cover most of the problems raised by the following issues.

Scheduler: There's no conclusive way to determine the wait time of a request for scheduler worker #12359 (Missing scheduler wait duration)
Roadmap for stage latency statistics #11525
Output slow log for KvGet and KvBatchGet #8944 (We need to furthur consider how and where to output TiKV slow log)
Trace client-side and server-side latency to diagnose RPC latency #11378 (Mentioning something about network latency)
Implement tracing for TiKV #11872 (Tracing provides more detailed per-request time data, but is more complex and expensive)
metrics: Accurate duration tracing of storage/scheduler message handling #8403 (The PR description lists some inaccuracy and missing parts of the metrics)

The text was updated successfully, but these errors were encountered:

ref #12362 PerfFlags provides more fine-grained PerfContext collecting. In this way, we only collect the duration we really care about. Theoretically, it reduces the overhead of collecting PerfContext compared to using PerfLevel::EnableTime. This commit only changes the raftstore part. The read pool part will be done later. The default value of raftstore.perf-level is changed to Uninitialized. The config is not documented before, so I think the change is not so important to change the default value. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

…2432) ref #12362 This commit enables some more perf flags about reading, such as block read time, DB mutex wait time. So, it enables us to find out the root cause of more performance issues. It also includes some refactorings of unifying the functions of capturing PerfContext in tikv_kv into engine_rocks because engine specific functions should be hidden in engine_* as much as possible. This is the first step that removes statistics from tikv_kv, and later code referencing directly to the engine_rocks should be also removed. Signed-off-by: Yilin Chen <sticnarf@gmail.com>

…kv#12432) ref tikv#12362 This commit enables some more perf flags about reading, such as block read time, DB mutex wait time. So, it enables us to find out the root cause of more performance issues. It also includes some refactorings of unifying the functions of capturing PerfContext in tikv_kv into engine_rocks because engine specific functions should be hidden in engine_* as much as possible. This is the first step that removes statistics from tikv_kv, and later code referencing directly to the engine_rocks should be also removed. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Signed-off-by: 3AceShowHand <jinl1037@hotmail.com>

ref #12362 This commit measures the read duration spent on reading (e.g. write conflict checks). This fixes the missing part of scheduler commands and may help diagnosis when scheduler reading takes a long time. This commit also changes some now_coarse to now. now_coarse has a precision of 10ms on many systems, so it may not meet our precision requirement. Instant::now is fast enough to be called in these cases. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref tikv#12362 This commit measures the read duration spent on reading (e.g. write conflict checks). This fixes the missing part of scheduler commands and may help diagnosis when scheduler reading takes a long time. This commit also changes some now_coarse to now. now_coarse has a precision of 10ms on many systems, so it may not meet our precision requirement. Instant::now is fast enough to be called in these cases. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #12362 This commit is a large refactoring that moves collecting engine PerfContext from storage and coprocessor to engine_rocks and the tracker. Now, the storage and coprocessor are mostly decoupled with a specific engine (engine_rocks). And it introduces a general trakcer mechanism to collect the metrics of a request during its whole lifetime. It will help us collect more performance critical data of a single request more easily. Signed-off-by: Yilin Chen <sticnarf@gmail.com>

ref tikv#12362 This commit measures the read duration spent on reading (e.g. write conflict checks). This fixes the missing part of scheduler commands and may help diagnosis when scheduler reading takes a long time. This commit also changes some now_coarse to now. now_coarse has a precision of 10ms on many systems, so it may not meet our precision requirement. Instant::now is fast enough to be called in these cases. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #12362 This commit records perf contexts around executing txn scheduler commands. This helps us know the detail performance data of the underlying engine. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #12362 Our raft implementation has its flow control mechanism to limit the inflight message number. But we're not able to know whether and when we are throttled. Then, it's hard for us to know whether we should adjust the max-inflight-msgs config. This commit tries to improve this case. It's complex to add hooks to raft-rs to know how long a message is throttled, but we can do it in the raftstore. We record the propose time, and consume it and record it in the histogram when the messages is send through the transport. If flow control takes effect, the ready will be smaller than all under-replicated logs. So, we can know if flow control takes effect. Signed-off-by: Yilin Chen <sticnarf@gmail.com>

ref #12362 This commit replaces the request_times in the raftstore callback with a tracker token. Then, the waterfall metrics of a raft command will be recorded into the tracker. Signed-off-by: Yilin Chen <sticnarf@gmail.com>

ref tikv/raft-engine#227, ref #12362 We used to record perf contexts for Raft RocksDB and KV RocksDB with the same PerfContext. But we also have raft-engine now. So, we will miss perf contexts if we still use RocksDB perf contexts. This commit adds PerfContext support to RaftEngine and distinguish it from the perf context used for applying. Then, we'll record correct perf statistics for both raft engine and KV DB. Updated raft-engine to include tikv/raft-engine#227 Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref tikv#12362 This commit replaces the request_times in the raftstore callback with a tracker token. Then, the waterfall metrics of a raft command will be recorded into the tracker. Signed-off-by: Yilin Chen <sticnarf@gmail.com>

ref tikv/raft-engine#227, ref tikv#12362 We used to record perf contexts for Raft RocksDB and KV RocksDB with the same PerfContext. But we also have raft-engine now. So, we will miss perf contexts if we still use RocksDB perf contexts. This commit adds PerfContext support to RaftEngine and distinguish it from the perf context used for applying. Then, we'll record correct perf statistics for both raft engine and KV DB. Updated raft-engine to include tikv/raft-engine#227 Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #12362 TiKV client can pass request_source through Context. It is useful for us to know how many requests there are from each source. So, this commit collects the count and the total duration by request source. The source label is not added to the command type in order to avoid creating too many label combinations. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #931, ref #12362 kvproto#931 adds a few more details about the time used by an RPC request. We are filling these additional information to the responses in this commit. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref tikv#12362 TiKV client can pass request_source through Context. It is useful for us to know how many requests there are from each source. So, this commit collects the count and the total duration by request source. The source label is not added to the command type in order to avoid creating too many label combinations. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref tikv#931, ref tikv#12362 kvproto#931 adds a few more details about the time used by an RPC request. We are filling these additional information to the responses in this commit. Signed-off-by: Yilin Chen <sticnarf@gmail.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

sticnarf · 2022-07-28T10:28:15Z

@OneSizeFitsQuorum Are you interested in the time breakdown inside async_snapshot? It may be useful to collect whether lease is expired and record the raft store loop wait duration if lease is expired.

OneSizeFitsQuorum · 2022-07-28T12:39:38Z

Yes, I can practice on it first. But I’m curious about what is Apply loop Queue Length and Apply loop wait rounds, will them be helpful for us to monitor the the applying status? If so it maybe useful for parallel apply demo too.

sticnarf · 2022-07-28T15:00:49Z

Yes, I can practice on it first. But I’m curious about what is Apply loop Queue Length and Apply loop wait rounds, will them be helpful for us to monitor the the applying status? If so it maybe useful for parallel apply demo too.

I was thinking Apply loop Queue Length as the total number of commands that is pending to be applied at the time when a new command is committed and sent to the apply threads. And meanwhile, we can record the value of the current "apply round" counter. The counter increases for each apply batch. So, we can know how many rounds of batches have passed before we are able to handle the command in the apply thread.

Probably they are helpful to account for the long apply wait time while the apply threads are vacant. But I am also a bit worried that these metrics are too tedious in most cases. (Maybe we can define levels for metrics too.)

ref pingcap/kvproto#965, ref #12362 This commit record the pessimistic_lock_wait time for pessimistic transactions in the waitManager. Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org>

ref pingcap/kvproto#974, ref pingcap/kvproto#978, ref #12362 This commit records read_index_propose_wait_nanos, read_index_confirm_wait_nanos and read_pool_schedule_wait_nanos Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org>

ref pingcap/kvproto#974, ref pingcap/kvproto#978, ref tikv#12362 This commit records read_index_propose_wait_nanos, read_index_confirm_wait_nanos and read_pool_schedule_wait_nanos Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org> Signed-off-by: TonsnakeLin <lpbgytong@163.com>

ref pingcap/kvproto#965, ref tikv#12362 This commit record the pessimistic_lock_wait time for pessimistic transactions in the waitManager. Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org> Signed-off-by: fengou1 <feng.ou@pingcap.com>

ref pingcap/kvproto#974, ref pingcap/kvproto#978, ref tikv#12362 This commit records read_index_propose_wait_nanos, read_index_confirm_wait_nanos and read_pool_schedule_wait_nanos Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org> Signed-off-by: fengou1 <feng.ou@pingcap.com>

) ref #12362 This commit adds `perform_read_local` fail_point so we can force the lease read to be triggered, which allows the test `test_read_execution_tracker` to pass stably. Signed-off-by: OneSizeFitQuorum <tanxinyu@apache.org>

ref #12362 Fix the returned apply write wal tracking time. Signed-off-by: cfzjywxk <lsswxrxr@163.com>

…heduler process duration (#14476) ref #12362 Signed-off-by: cfzjywxk <lsswxrxr@163.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>

ref #12362 Signed-off-by: cfzjywxk <lsswxrxr@163.com>

ref tikv#12362 Signed-off-by: cfzjywxk <lsswxrxr@163.com> Signed-off-by: tonyxuqqi <tonyxuqi@outlook.com>

ref #12362 add log and metric in raft_client. Signed-off-by: crazycs520 <crazycs520@gmail.com>

sticnarf added the type/enhancement Type: Issue - Enhancement label Apr 14, 2022

cfzjywxk mentioned this issue Apr 15, 2022

diagnose: try to differentiate between internal and external requests pingcap/tidb#33963

Closed

3 tasks

sticnarf mentioned this issue Apr 19, 2022

engine_rocks: replace PerfLevel with PerfFlags in raftstore #12373

Merged

cfzjywxk mentioned this issue Apr 19, 2022

Performance Diagnosis Enhancements pingcap/tidb#34106

Open

65 tasks

sticnarf mentioned this issue Apr 26, 2022

*: apply perf flags to record RocksDB perf context in coprocessor #12432

Merged

cfzjywxk mentioned this issue May 9, 2022

Slow Log Enhancement Tracking Issue pingcap/tidb#34487

Open

17 tasks

This was referenced May 12, 2022

*: introduce general request tracker #12460

Merged

storage: record perf statistics for scheduler commands #12500

Merged

sticnarf mentioned this issue Jun 1, 2022

storage: measure the read duration for scheduler commands #12716

Merged

This was referenced Jun 16, 2022

raftstore: record metrics of proposal send wait duration #12837

Merged

raftstore: record write durations into tracker #12783

Merged

sticnarf mentioned this issue Jun 24, 2022

*: set write and time details in RPC responses #12900

Merged

sticnarf mentioned this issue Jun 27, 2022

raftstore: separate raft and kv perf contexts #12915

Merged

sticnarf mentioned this issue Jul 4, 2022

server: collect count and duration by request source #12954

Merged

sticnarf mentioned this issue Jul 11, 2022

raftstore: change send proposal time to waterfall metrics #12993

Merged

cfzjywxk mentioned this issue Jul 27, 2022

txn: add lock wait duration and read exeution duration track for pessimistic lock #13138

Closed

sticnarf mentioned this issue Jul 29, 2022

storage: record and return latch wait and process time #13159

Open

sticnarf mentioned this issue Aug 16, 2022

configs: add switch for tracking requests #13290

Closed

cfzjywxk mentioned this issue Aug 16, 2022

txn: seperate prewrite and commit in detail information and add clone methods tikv/client-go#567

Merged

OneSizeFitsQuorum mentioned this issue Aug 18, 2022

storage: record and return pessimistic_lock_wait time #13309

Merged

OneSizeFitsQuorum mentioned this issue Aug 26, 2022

storage: record and return asycn snapshot metric #13358

Merged

OneSizeFitsQuorum mentioned this issue Sep 7, 2022

storage: add perform_read_local fail_point to stabilize the test #13427

Merged

cfzjywxk mentioned this issue Mar 23, 2023

storage: fix the apply write wal tracking time #14444

Merged

ti-chi-bot pushed a commit that referenced this issue Mar 23, 2023

storage: fix the apply write wal tracking time (#14444)

4b2dda4

ref #12362 Fix the returned apply write wal tracking time. Signed-off-by: cfzjywxk <lsswxrxr@163.com>

cfzjywxk mentioned this issue Mar 28, 2023

txn: record the latch wait, flow control throttle, quota delay and scheduler process duration #14476

Merged

cfzjywxk mentioned this issue Jun 7, 2023

scheduler: record process and write durtaion details #14895

Merged

ti-chi-bot bot pushed a commit that referenced this issue Jun 8, 2023

scheduler: record process and write durtaion details (#14895)

f91dd4e

ref #12362 Signed-off-by: cfzjywxk <lsswxrxr@163.com>

cfzjywxk mentioned this issue Jun 21, 2023

Add v7.2.0 release notes pingcap/docs-cn#14154

Merged

16 tasks

tonyxuqqi pushed a commit to tonyxuqqi/tikv that referenced this issue Jun 22, 2023

scheduler: record process and write durtaion details (tikv#14895)

42c9423

ref tikv#12362 Signed-off-by: cfzjywxk <lsswxrxr@163.com> Signed-off-by: tonyxuqqi <tonyxuqi@outlook.com>

This was referenced Dec 5, 2023

*: add log and metric in raft_client #16142

Closed

*: add log and metric in raft_client #16143

Merged

ti-chi-bot bot pushed a commit that referenced this issue Dec 19, 2023

*: add log and metric in raft_client (#16143)

37b1dce

ref #12362 add log and metric in raft_client. Signed-off-by: crazycs520 <crazycs520@gmail.com>

crazycs520 mentioned this issue Dec 22, 2023

*: add metrics about raft message wait flush duration #16214

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Diagnosis Enhancements #12362

Performance Diagnosis Enhancements #12362

sticnarf commented Apr 14, 2022 •

edited

sticnarf commented Jul 28, 2022

OneSizeFitsQuorum commented Jul 28, 2022

sticnarf commented Jul 28, 2022

Performance Diagnosis Enhancements #12362

Performance Diagnosis Enhancements #12362

Comments

sticnarf commented Apr 14, 2022 • edited

Data to be collected

Read

Write

Background jobs

Internal jobs from TiDB (ref pingcap/tidb#33963)

Diagnosis helper

Existing related issues

sticnarf commented Jul 28, 2022

OneSizeFitsQuorum commented Jul 28, 2022

sticnarf commented Jul 28, 2022

sticnarf commented Apr 14, 2022 •

edited