
improve region request log for diagnose #1300

Merged
merged 21 commits into tikv:master on Apr 29, 2024

Conversation

crazycs520
Contributor

@crazycs520 crazycs520 commented Apr 18, 2024

Close #1315.

What changed and how does it work?

1. Reduce region_request logging

  • Add a 100ms slow-log threshold to avoid printing too many log lines (a minimal sketch of this gating follows the before/after comparison below).

Before This PR

After 1 TiKV server goes down during a benchmark test, TiDB outputs tens of thousands of region_request log lines in a matter of minutes:

> rg 'region_request.go' tidb.log | wc
  26917 1839370 26737091

This PR

> rg 'region_request.go' tidb.log | wc
   2675  285852 3440758
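
Roughly, the gating compares the accumulated request duration against the threshold before emitting a log line. Below is a minimal, self-contained Go sketch of that idea; the names `slowLogThreshold` and `logRegionRequestOnRetry` are illustrative only, not the actual identifiers in internal/locate/region_request.go.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// slowLogThreshold mirrors the 100ms threshold described above (name is illustrative).
const slowLogThreshold = 100 * time.Millisecond

// logRegionRequestOnRetry emits a log line only when the accumulated request
// duration exceeds the threshold, so fast retries stay quiet.
func logRegionRequestOnRetry(total time.Duration, attempts int, lastErr error) {
	if total < slowLogThreshold {
		return // below threshold: suppress the log to avoid flooding
	}
	log.Printf("slow region request: duration=%s attempts=%d err=%v", total, attempts, lastErr)
}

func main() {
	logRegionRequestOnRetry(30*time.Millisecond, 1, nil)                                      // suppressed
	logRegionRequestOnRetry(250*time.Millisecond, 4, errors.New("context deadline exceeded")) // logged
}
```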

2. Make the log more friendly for diagnosing slow queries.

For example, grep the TiDB log by connection ID and txn timestamp to get the following log lines:

> rg 449392618766336000 tidb.log  | rg 1023410562
[2024/04/28 17:33:19.926 +08:00] [INFO] [region_request.go:1890] ["throwing pseudo region error due to no replica available"] [conn=1023410562] [session_alias=] [req-ts=449392618766336000] [req-type=Cop] [region="{ region id: 21079, ver: 497, confVer: 317 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.3:23361](1) wait recvLoop timeout,timeout:100ms, wait_duration:3.746µs:: context deadline exceeded, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 4, cacheRegionIsValid: true, replicaStatus: [peer: 330789, store: 1, isEpochStale: true, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable peer: 214307066, store: 4, isEpochStale: false, attempts: 2, attempts_time: 100.3ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 2117752236, store: 9, isEpochStale: false, attempts: 1, attempts_time: 100.1ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [max-exec-timeout-ms=100] [total-round-stats="{total-backoff: 2ms, total-backoff-times: 1, total-rpc: {Cop:{num_rpc:3, total_time:200.4ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:2}}}"] [current-round-stats="{time: 202.5ms, backoff: 2ms, retry-times: 3, replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context deadline exceeded}, {replica_read, peer:2117752236, store:9, err:context deadline exceeded}}}"]
[2024/04/28 17:33:20.223 +08:00] [INFO] [region_request.go:1890] ["send request failed, err: context canceled"] [conn=1023410562] [session_alias=] [req-ts=449392618766336000] [req-type=Cop] [region="{ region id: 21079, ver: 497, confVer: 317 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.2:23361](1) context canceled, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 2, cacheRegionIsValid: true, replicaStatus: [peer: 330789, store: 1, isEpochStale: true, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable peer: 214307066, store: 4, isEpochStale: false, attempts: 2, attempts_time: 291.1ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 2117752236, store: 9, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [max-exec-timeout-ms=60000] [total-round-stats="{total-backoff: 8ms, total-backoff-times: 3, total-rpc: {Cop:{num_rpc:5, total_time:491.5ms}, rpc_errors:{context deadline exceeded:2, context canceled:1, data_is_not_ready:2}}}"] [current-round-stats="{time: 295.2ms, backoff: 4ms, retry-times: 1, replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context canceled}}}"]
[2024/04/28 17:33:20.223 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=1023410562] [session_alias=] [connInfo="id:1023410562, addr:192.168.80.1:48832 status:10, collation:latin1_swedish_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select pk, sk, ts, v from t1 as of timestamp now() - interval 5 second where pk = '000000000000000000000000131914' and sk = '0000-0001-0003-0007' order by ts desc limit 5"] [txn_mode=PESSIMISTIC] [timestamp=449392618766336000] [err="[executor:3024]Query execution was interrupted, maximum statement execution time exceeded\ngithub.com/pingcap/errors.AddStack\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/util/sqlkiller.(*SQLKiller).HandleSignal\n\t/data/cs/tidb/pkg/util/sqlkiller/sqlkiller.go:68\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*executorKillerHandler).HandleSQLKillerSignal\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:251\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:409\ngithub.com/pingcap/tidb/pkg/executor.(*LimitExec).Next\n\t/data/cs/tidb/pkg/executor/executor.go:1374\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).next\n\t/data/cs/tidb/pkg/executor/adapter.go:1211\ngithub.com/pingcap/tidb/pkg/executor.(*recordSet).Next\n\t/data/cs/tidb/pkg/executor/adapter.go:160\ngithub.com/pingcap/tidb/pkg/server/internal/resultset.(*tidbResultSet).Next\n\t/data/cs/tidb/pkg/server/internal/resultset/resultset.go:64\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeChunks\n\t/data/cs/tidb/pkg/server/conn.go:2316\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeResultSet\n\t/data/cs/tidb/pkg/server/conn.go:2259\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/data/cs/tidb/pkg/server/conn.go:2052\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\n\t/data/cs/tidb/pkg/server/conn.go:1774\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/data/cs/tidb/pkg/server/conn.go:1348\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/data/cs/tidb/pkg/server/conn.go:1114\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/data/cs/tidb/pkg/server/server.go:739\nruntime.goexit\n\t/data/cs/go/src/runtime/asm_amd64.s:1650"]

From the above log lines, we can learn the following about this query (a sketch of how the two stats blocks might be modeled follows this list):

  • In the first round:
    • max-exec-timeout-ms=100
    • total-round-stats="{total-backoff: 2ms, total-backoff-times: 1, total-rpc: {Cop:{num_rpc:3, total_time:200.4ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:2}}}"
    • current-round-stats="{time: 202.5ms, backoff: 2ms, retry-times: 3}
      • replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context deadline exceeded}, {replica_read, peer:2117752236, store:9, err:context deadline exceeded}}
      • time: 202.5ms
  • In the second retry round:
    • max-exec-timeout-ms=60000
    • total-round-stats="{total-backoff: 8ms, total-backoff-times: 3, total-rpc: {Cop:{num_rpc:5, total_time:491.5ms}, rpc_errors:{context deadline exceeded:2, context canceled:1, data_is_not_ready:2}}}"
    • current-round-stats={time: 295.2ms, backoff: 4ms, retry-times: 1,
      • replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context canceled}}}
      • time: 295.2ms
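
To make these fields easier to read, here is one way the two stats blocks could be modeled in Go. This is an illustrative sketch only; the type and field names below (`RoundStats`, `TotalStats`) are assumptions, not the actual types used by client-go.

```go
package stats

import "time"

// RoundStats models the current-round-stats block: what happened within one
// retry round of this request.
type RoundStats struct {
	Time          time.Duration // wall time spent in this round
	Backoff       time.Duration // backoff time spent in this round
	RetryTimes    int           // number of retries within this round
	ReplicaAccess []string      // per-replica access records, e.g. "{stale_read, peer:214307066, store:4, err:data_is_not_ready}"
}

// TotalStats models the total-round-stats block: accumulated numbers across
// all rounds of the request so far.
type TotalStats struct {
	TotalBackoff      time.Duration  // total backoff across all rounds
	TotalBackoffTimes int            // how many times backoff was triggered
	NumRPC            int            // total number of RPCs sent
	TotalRPCTime      time.Duration  // total time spent in RPCs
	RPCErrors         map[string]int // error name -> occurrence count
}
```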

3. Add RPC Error stats.

Before this PR, the plan with execution info looked like the following:

The execution info of TableReader_17 is:

  • time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}
  • Why did 1 cop task send 6 RPC requests, and what RPC errors were encountered?
| id                            | estRows | estCost | actRows | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                              | operator info                                                                                                                      | memory  | disk     |
| Sort_10                       | 0.00    | 818.87  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | test.t1.sk                                                                                                                         | 0 Bytes | 0 Bytes  |
| └─Selection_13                | 0.00    | 506.07  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | le(Column#6, 4)                                                                                                                    | 2.61 KB | N/A      |
|   └─Window_14                 | 0.00    | 506.05  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | row_number()->Column#6 over(partition by test.t1.pk, test.t1.sk order by test.t1.ts desc rows between current row and current row) | N/A     | N/A      |
|     └─Sort_18                 | 0.00    | 506.05  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | test.t1.pk, test.t1.sk, test.t1.ts:desc                                                                                            | 0 Bytes | 0 Bytes  |
|       └─TableReader_17        | 0.00    | 439.47  | 0       | root      |               | time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}                                                                                                  | data:Selection_16                                                                                                                  | 0 Bytes | N/A      |
|         └─Selection_16        | 0.00    | 6591.46 | 0       | cop[tikv] |               |                                                                                                                                                                                                                                                                                                                                                             | ge(test.t1.ts, 1713771327676811017), lt(test.t1.ts, 1713771377676812586)                                                           | N/A     | N/A      |
|           └─TableRangeScan_15 | 14.95   | 5099.20 | 0       | cop[tikv] | table:t1      |                                                                                                                                                                                                                                                                                                                                                             | range:["000000000000000000000000013839" "0000-0001-0002","000000000000000000000000013839" "0000-0001-0003"), keep order:false      | N/A     | N/A      |

This PR

Now the execution info of TableReader_17 is:

  • time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , rpc_errors:{data_is_not_ready:2, not_leader:2, context deadline exceeded:1, context canceled:1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}
  • With the rpc_errors info, we know the cop task hit the data_is_not_ready error 2 times, the not_leader error 2 times, context deadline exceeded 1 time, and context canceled 1 time (a sketch of such an error counter follows below).
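
A possible shape for such a counter is a small map from error name to count, formatted into the rpc_errors:{...} fragment. The sketch below is illustrative; the real counters are attached to cop-task runtime stats, and the type name `rpcErrorStats` here is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rpcErrorStats is a hypothetical aggregator: error name -> occurrence count.
type rpcErrorStats map[string]int

func (s rpcErrorStats) record(errName string) { s[errName]++ }

// String renders the counters like the rpc_errors:{...} fragment in the
// execution info shown above.
func (s rpcErrorStats) String() string {
	names := make([]string, 0, len(s))
	for name := range s {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s:%d", name, s[name]))
	}
	return "rpc_errors:{" + strings.Join(parts, ", ") + "}"
}

func main() {
	stats := rpcErrorStats{}
	for _, e := range []string{"data_is_not_ready", "data_is_not_ready", "not_leader", "not_leader",
		"context deadline exceeded", "context canceled"} {
		stats.record(e)
	}
	fmt.Println(stats) // rpc_errors:{context canceled:1, context deadline exceeded:1, data_is_not_ready:2, not_leader:2}
}
```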

4. Refine the Region Error metric and add an RPC error metric

[screenshot: refined region-error metric and new RPC-error metric panels]
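
For reference, a region-error metric refined with a store-id label and a new RPC-error counter could be declared roughly as below using the Prometheus Go client. The metric names, namespace, and labels here are assumptions for illustration; the actual definitions live in client-go's metrics package.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// regionErrorCounter counts region errors by error type and store.
	regionErrorCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "tikv_client_go",
			Name:      "region_err_total",
			Help:      "Counter of region errors, labeled by error type and store.",
		},
		[]string{"type", "store"},
	)
	// rpcErrorCounter counts RPC errors by error type and store.
	rpcErrorCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "tikv_client_go",
			Name:      "rpc_err_total",
			Help:      "Counter of RPC errors, labeled by error type and store.",
		},
		[]string{"type", "store"},
	)
)

func init() {
	prometheus.MustRegister(regionErrorCounter, rpcErrorCounter)
}

// RecordRegionError and RecordRPCError are hypothetical helpers showing how
// the counters would be incremented.
func RecordRegionError(errType, storeID string) {
	regionErrorCounter.WithLabelValues(errType, storeID).Inc()
}

func RecordRPCError(errType, storeID string) {
	rpcErrorCounter.WithLabelValues(errType, storeID).Inc()
}
```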

@crazycs520 crazycs520 changed the title [WIP]improve region request log for diagnose improve region request log for diagnose Apr 22, 2024
@crazycs520 crazycs520 marked this pull request as ready for review April 22, 2024 08:14
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Contributor

@ekexium ekexium left a comment


LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
@crazycs520
Contributor Author

LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Nice catch. Since replica access paths are only collected when an error is encountered, normal cases should not see any performance overhead. Anyway, I have triggered a sysbench performance test; let's check the results later.

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
@crazycs520
Contributor Author

LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Nice catch. Since replica access paths are only collected when an error is encountered, normal cases should not see any performance overhead. Anyway, I have triggered a sysbench performance test; let's check the results later.

The sysbench performance test result is below; there is little effect once test jitter is excluded.

[screenshot: sysbench performance comparison]

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
// ReplicaAccessStats records the first 20 access infos; accesses beyond 20 are only counted via the `OverflowCount` field.
type ReplicaAccessStats struct {
	AccessInfos   []ReplicaAccessInfo
	OverflowCount int
Contributor


How about merging the access info by peer when there are many accesses? OverflowCount alone may not provide useful information.

Contributor Author


ReplicaAccessStats currently only records accesses for this round. In the normal case each replica is accessed only once, and the leader replica can be accessed at most 10 times, so OverflowCount should normally always be 0.

Contributor


What about the follower-read or stale-read cases? The OverflowCount still does not seem very helpful.
Consider such a case: the NotLeader error is encountered several times because of a region leader election; after the new leader information is returned to the kv-client, some new errors caused by disk I/O problems are encountered, but that information is only recorded as OverflowCount.

Contributor Author


great, done. PTAL
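
For context, the suggestion above amounts to aggregating access records per peer instead of keeping only the first N raw entries plus an overflow counter. A rough, hypothetical Go sketch of that merging (not the actual implementation adopted in the PR):

```go
package main

import "fmt"

// peerAccessSummary aggregates all accesses to one peer in a retry loop,
// so the summary stays bounded no matter how many retries happen.
type peerAccessSummary struct {
	Peer     uint64
	Store    uint64
	Attempts int
	Errors   map[string]int // error name -> count for this peer
}

// mergeAccess records one access attempt against a peer, creating the
// summary entry on first use.
func mergeAccess(summaries map[uint64]*peerAccessSummary, peer, store uint64, errName string) {
	s, ok := summaries[peer]
	if !ok {
		s = &peerAccessSummary{Peer: peer, Store: store, Errors: map[string]int{}}
		summaries[peer] = s
	}
	s.Attempts++
	if errName != "" {
		s.Errors[errName]++
	}
}

func main() {
	summaries := map[uint64]*peerAccessSummary{}
	mergeAccess(summaries, 214307066, 4, "data_is_not_ready")
	mergeAccess(summaries, 214307066, 4, "context deadline exceeded")
	mergeAccess(summaries, 2117752236, 9, "context deadline exceeded")
	for _, s := range summaries {
		fmt.Printf("peer:%d store:%d attempts:%d errors:%v\n", s.Peer, s.Store, s.Attempts, s.Errors)
	}
}
```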

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Comment on lines +297 to +299
if replicaRead {
	tp = ReqReplicaRead
} else if staleRead {
Contributor


Should replicaRead take priority over staleRead? Or is it guaranteed that replicaRead and staleRead are mutually exclusive?

Contributor Author


According to the current tests, the replicaRead and staleRead flags are normally mutually exclusive.

@cfzjywxk cfzjywxk merged commit 31a502b into tikv:master Apr 29, 2024
9 of 10 checks passed
crazycs520 added a commit to crazycs520/client-go that referenced this pull request Apr 29, 2024
* improve region request log for diagnose

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* rename struct

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine region error metric with store id label and add rpc error metric

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine comment

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* restrict log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* fix test

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* address comment

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

---------

Signed-off-by: crazycs520 <crazycs520@gmail.com>
cfzjywxk pushed a commit that referenced this pull request Apr 29, 2024
crazycs520 added a commit to crazycs520/client-go that referenced this pull request Aug 27, 2024
Signed-off-by: crazycs520 <crazycs520@gmail.com>

---------

Signed-off-by: crazycs520 <crazycs520@gmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Aug 28, 2024
 

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Successfully merging this pull request may close these issues.

improve region_request log for observability