
improve region request log for diagnose #1300

Merged
merged 21 commits into tikv:master on Apr 29, 2024

Conversation

crazycs520
Contributor

@crazycs520 crazycs520 commented Apr 18, 2024

Close #1315.

What changed and how does it work?

1. Reduce region_request logging

  • Add a 100ms slow-log threshold to avoid printing too many log lines (a minimal sketch of this gating follows the before/after comparison below).

Before This PR

After 1 TiKV server goes down during a benchmark test, TiDB outputs tens of thousands of region_request log lines in a matter of minutes:

> rg 'region_request.go' tidb.log | wc
  26917 1839370 26737091

This PR

> rg 'region_request.go' tidb.log | wc
   2675  285852 3440758
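
Roughly, the gating compares the accumulated request duration against the threshold before emitting a log line. Below is a minimal, self-contained Go sketch of that idea; the names `slowLogThreshold` and `logRegionRequestOnRetry` are illustrative only, not the actual identifiers in internal/locate/region_request.go.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// slowLogThreshold mirrors the 100ms threshold described above (name is illustrative).
const slowLogThreshold = 100 * time.Millisecond

// logRegionRequestOnRetry emits a log line only when the accumulated request
// duration exceeds the threshold, so fast retries stay quiet.
func logRegionRequestOnRetry(total time.Duration, attempts int, lastErr error) {
	if total < slowLogThreshold {
		return // below threshold: suppress the log to avoid flooding
	}
	log.Printf("slow region request: duration=%s attempts=%d err=%v", total, attempts, lastErr)
}

func main() {
	logRegionRequestOnRetry(30*time.Millisecond, 1, nil)                                      // suppressed
	logRegionRequestOnRetry(250*time.Millisecond, 4, errors.New("context deadline exceeded")) // logged
}
```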

2. Make the log more friendly for diagnosing slow queries.

For example, grep the TiDB log by connection ID and txn timestamp to get the following log lines:

> rg 449392618766336000 tidb.log  | rg 1023410562
[2024/04/28 17:33:19.926 +08:00] [INFO] [region_request.go:1890] ["throwing pseudo region error due to no replica available"] [conn=1023410562] [session_alias=] [req-ts=449392618766336000] [req-type=Cop] [region="{ region id: 21079, ver: 497, confVer: 317 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.3:23361](1) wait recvLoop timeout,timeout:100ms, wait_duration:3.746µs:: context deadline exceeded, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 4, cacheRegionIsValid: true, replicaStatus: [peer: 330789, store: 1, isEpochStale: true, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable peer: 214307066, store: 4, isEpochStale: false, attempts: 2, attempts_time: 100.3ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 2117752236, store: 9, isEpochStale: false, attempts: 1, attempts_time: 100.1ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [max-exec-timeout-ms=100] [total-round-stats="{total-backoff: 2ms, total-backoff-times: 1, total-rpc: {Cop:{num_rpc:3, total_time:200.4ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:2}}}"] [current-round-stats="{time: 202.5ms, backoff: 2ms, retry-times: 3, replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context deadline exceeded}, {replica_read, peer:2117752236, store:9, err:context deadline exceeded}}}"]
[2024/04/28 17:33:20.223 +08:00] [INFO] [region_request.go:1890] ["send request failed, err: context canceled"] [conn=1023410562] [session_alias=] [req-ts=449392618766336000] [req-type=Cop] [region="{ region id: 21079, ver: 497, confVer: 317 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:[192.168.80.2:23361](1) context canceled, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 2, cacheRegionIsValid: true, replicaStatus: [peer: 330789, store: 1, isEpochStale: true, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable peer: 214307066, store: 4, isEpochStale: false, attempts: 2, attempts_time: 291.1ms, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 2117752236, store: 9, isEpochStale: false, attempts: 0, attempts_time: 0s, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable]}}"] [max-exec-timeout-ms=60000] [total-round-stats="{total-backoff: 8ms, total-backoff-times: 3, total-rpc: {Cop:{num_rpc:5, total_time:491.5ms}, rpc_errors:{context deadline exceeded:2, context canceled:1, data_is_not_ready:2}}}"] [current-round-stats="{time: 295.2ms, backoff: 4ms, retry-times: 1, replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context canceled}}}"]
[2024/04/28 17:33:20.223 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=1023410562] [session_alias=] [connInfo="id:1023410562, addr:192.168.80.1:48832 status:10, collation:latin1_swedish_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select pk, sk, ts, v from t1 as of timestamp now() - interval 5 second where pk = '000000000000000000000000131914' and sk = '0000-0001-0003-0007' order by ts desc limit 5"] [txn_mode=PESSIMISTIC] [timestamp=449392618766336000] [err="[executor:3024]Query execution was interrupted, maximum statement execution time exceeded\ngithub.com/pingcap/errors.AddStack\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/util/sqlkiller.(*SQLKiller).HandleSignal\n\t/data/cs/tidb/pkg/util/sqlkiller/sqlkiller.go:68\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*executorKillerHandler).HandleSQLKillerSignal\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:251\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:409\ngithub.com/pingcap/tidb/pkg/executor.(*LimitExec).Next\n\t/data/cs/tidb/pkg/executor/executor.go:1374\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).next\n\t/data/cs/tidb/pkg/executor/adapter.go:1211\ngithub.com/pingcap/tidb/pkg/executor.(*recordSet).Next\n\t/data/cs/tidb/pkg/executor/adapter.go:160\ngithub.com/pingcap/tidb/pkg/server/internal/resultset.(*tidbResultSet).Next\n\t/data/cs/tidb/pkg/server/internal/resultset/resultset.go:64\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeChunks\n\t/data/cs/tidb/pkg/server/conn.go:2316\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeResultSet\n\t/data/cs/tidb/pkg/server/conn.go:2259\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/data/cs/tidb/pkg/server/conn.go:2052\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\n\t/data/cs/tidb/pkg/server/conn.go:1774\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/data/cs/tidb/pkg/server/conn.go:1348\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/data/cs/tidb/pkg/server/conn.go:1114\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/data/cs/tidb/pkg/server/server.go:739\nruntime.goexit\n\t/data/cs/go/src/runtime/asm_amd64.s:1650"]

From the above log lines, we can learn the following about this query (a sketch of how the two stats blocks might be modeled follows this list):

  • In the first round:
    • max-exec-timeout-ms=100
    • total-round-stats="{total-backoff: 2ms, total-backoff-times: 1, total-rpc: {Cop:{num_rpc:3, total_time:200.4ms}, rpc_errors:{data_is_not_ready:1, context deadline exceeded:2}}}"
    • current-round-stats="{time: 202.5ms, backoff: 2ms, retry-times: 3}
      • replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context deadline exceeded}, {replica_read, peer:2117752236, store:9, err:context deadline exceeded}}
      • time: 202.5ms
  • In the second retry round:
    • max-exec-timeout-ms=60000
    • total-round-stats="{total-backoff: 8ms, total-backoff-times: 3, total-rpc: {Cop:{num_rpc:5, total_time:491.5ms}, rpc_errors:{context deadline exceeded:2, context canceled:1, data_is_not_ready:2}}}"
    • current-round-stats={time: 295.2ms, backoff: 4ms, retry-times: 1,
      • replica-access: {{stale_read, peer:214307066, store:4, err:data_is_not_ready}, {replica_read, peer:214307066, store:4, err:context canceled}}}
      • time: 295.2ms
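
To make these fields easier to read, here is one way the two stats blocks could be modeled in Go. This is an illustrative sketch only; the type and field names below (`RoundStats`, `TotalStats`) are assumptions, not the actual types used by client-go.

```go
package stats

import "time"

// RoundStats models the current-round-stats block: what happened within one
// retry round of this request.
type RoundStats struct {
	Time          time.Duration // wall time spent in this round
	Backoff       time.Duration // backoff time spent in this round
	RetryTimes    int           // number of retries within this round
	ReplicaAccess []string      // per-replica access records, e.g. "{stale_read, peer:214307066, store:4, err:data_is_not_ready}"
}

// TotalStats models the total-round-stats block: accumulated numbers across
// all rounds of the request so far.
type TotalStats struct {
	TotalBackoff      time.Duration  // total backoff across all rounds
	TotalBackoffTimes int            // how many times backoff was triggered
	NumRPC            int            // total number of RPCs sent
	TotalRPCTime      time.Duration  // total time spent in RPCs
	RPCErrors         map[string]int // error name -> occurrence count
}
```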

3. Add RPC Error stats.

Before this PR, the plan with execution info looked like the following:

The execution info of TableReader_17 is:

  • time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}
  • Why did 1 cop task send 6 RPC requests, and what RPC errors were encountered?
| id                            | estRows | estCost | actRows | task      | access object | execution info                                                                                                                                                                                                                                                                                                                                              | operator info                                                                                                                      | memory  | disk     |
| Sort_10                       | 0.00    | 818.87  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | test.t1.sk                                                                                                                         | 0 Bytes | 0 Bytes  |
| └─Selection_13                | 0.00    | 506.07  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | le(Column#6, 4)                                                                                                                    | 2.61 KB | N/A      |
|   └─Window_14                 | 0.00    | 506.05  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | row_number()->Column#6 over(partition by test.t1.pk, test.t1.sk order by test.t1.ts desc rows between current row and current row) | N/A     | N/A      |
|     └─Sort_18                 | 0.00    | 506.05  | 0       | root      |               | time:502.4ms, loops:1                                                                                                                                                                                                                                                                                                                                       | test.t1.pk, test.t1.sk, test.t1.ts:desc                                                                                            | 0 Bytes | 0 Bytes  |
|       └─TableReader_17        | 0.00    | 439.47  | 0       | root      |               | time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}                                                                                                  | data:Selection_16                                                                                                                  | 0 Bytes | N/A      |
|         └─Selection_16        | 0.00    | 6591.46 | 0       | cop[tikv] |               |                                                                                                                                                                                                                                                                                                                                                             | ge(test.t1.ts, 1713771327676811017), lt(test.t1.ts, 1713771377676812586)                                                           | N/A     | N/A      |
|           └─TableRangeScan_15 | 14.95   | 5099.20 | 0       | cop[tikv] | table:t1      |                                                                                                                                                                                                                                                                                                                                                             | range:["000000000000000000000000013839" "0000-0001-0002","000000000000000000000000013839" "0000-0001-0003"), keep order:false      | N/A     | N/A      |

This PR

Now the execution info of TableReader_17 is:

  • time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , rpc_errors:{data_is_not_ready:2, not_leader:2, context deadline exceeded:1, context canceled:1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}
  • With the rpc_errors info, we know the cop task hit the data_is_not_ready error 2 times, the not_leader error 2 times, context deadline exceeded 1 time, and context canceled 1 time (a sketch of such an error counter follows below).
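
A possible shape for such a counter is a small map from error name to count, formatted into the rpc_errors:{...} fragment. The sketch below is illustrative; the real counters are attached to cop-task runtime stats, and the type name `rpcErrorStats` here is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rpcErrorStats is a hypothetical aggregator: error name -> occurrence count.
type rpcErrorStats map[string]int

func (s rpcErrorStats) record(errName string) { s[errName]++ }

// String renders the counters like the rpc_errors:{...} fragment in the
// execution info shown above.
func (s rpcErrorStats) String() string {
	names := make([]string, 0, len(s))
	for name := range s {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s:%d", name, s[name]))
	}
	return "rpc_errors:{" + strings.Join(parts, ", ") + "}"
}

func main() {
	stats := rpcErrorStats{}
	for _, e := range []string{"data_is_not_ready", "data_is_not_ready", "not_leader", "not_leader",
		"context deadline exceeded", "context canceled"} {
		stats.record(e)
	}
	fmt.Println(stats) // rpc_errors:{context canceled:1, context deadline exceeded:1, data_is_not_ready:2, not_leader:2}
}
```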

4. Refine the Region Error metric and add an RPC error metric

[screenshot: refined region-error metric and new RPC-error metric panels]
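
For reference, a region-error metric refined with a store-id label and a new RPC-error counter could be declared roughly as below using the Prometheus Go client. The metric names, namespace, and labels here are assumptions for illustration; the actual definitions live in client-go's metrics package.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// regionErrorCounter counts region errors by error type and store.
	regionErrorCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "tikv_client_go",
			Name:      "region_err_total",
			Help:      "Counter of region errors, labeled by error type and store.",
		},
		[]string{"type", "store"},
	)
	// rpcErrorCounter counts RPC errors by error type and store.
	rpcErrorCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "tikv_client_go",
			Name:      "rpc_err_total",
			Help:      "Counter of RPC errors, labeled by error type and store.",
		},
		[]string{"type", "store"},
	)
)

func init() {
	prometheus.MustRegister(regionErrorCounter, rpcErrorCounter)
}

// RecordRegionError and RecordRPCError are hypothetical helpers showing how
// the counters would be incremented.
func RecordRegionError(errType, storeID string) {
	regionErrorCounter.WithLabelValues(errType, storeID).Inc()
}

func RecordRPCError(errType, storeID string) {
	rpcErrorCounter.WithLabelValues(errType, storeID).Inc()
}
```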

@crazycs520 crazycs520 changed the title [WIP]improve region request log for diagnose improve region request log for diagnose Apr 22, 2024
@crazycs520 crazycs520 marked this pull request as ready for review April 22, 2024 08:14
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Contributor

@ekexium ekexium left a comment


LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
@crazycs520
Contributor Author

LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Nice catch. Since replica access paths are only collected when an error is encountered, normal cases should not see any performance overhead. Anyway, I have triggered a sysbench performance test; let's check the results later.

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
@crazycs520
Contributor Author

LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?

Nice catch. Since replica access paths are only collected when an error is encountered, normal cases should not see any performance overhead. Anyway, I have triggered a sysbench performance test; let's check the results later.

The sysbench performance test result is below; there is little effect once test jitter is excluded.

[screenshot: sysbench performance comparison]

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
// ReplicaAccessStats records the first 20 access infos; accesses beyond 20 are only counted via the `OverflowCount` field.
type ReplicaAccessStats struct {
	AccessInfos   []ReplicaAccessInfo
	OverflowCount int
Contributor


How about merging the access info by peer when there are many accesses? OverflowCount alone may not provide useful information.

Contributor Author


ReplicaAccessStats currently only records accesses for this round. In the normal case each replica is accessed only once, and the leader replica can be accessed at most 10 times, so OverflowCount should normally always be 0.

Contributor


What about the follower-read or stale-read cases? The OverflowCount still does not seem very helpful.
Consider such a case: the NotLeader error is encountered several times because of a region leader election; after the new leader information is returned to the kv-client, some new errors caused by disk I/O problems are encountered, but that information is only recorded as OverflowCount.

Contributor Author


great, done. PTAL
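
For context, the suggestion above amounts to aggregating access records per peer instead of keeping only the first N raw entries plus an overflow counter. A rough, hypothetical Go sketch of that merging (not the actual implementation adopted in the PR):

```go
package main

import "fmt"

// peerAccessSummary aggregates all accesses to one peer in a retry loop,
// so the summary stays bounded no matter how many retries happen.
type peerAccessSummary struct {
	Peer     uint64
	Store    uint64
	Attempts int
	Errors   map[string]int // error name -> count for this peer
}

// mergeAccess records one access attempt against a peer, creating the
// summary entry on first use.
func mergeAccess(summaries map[uint64]*peerAccessSummary, peer, store uint64, errName string) {
	s, ok := summaries[peer]
	if !ok {
		s = &peerAccessSummary{Peer: peer, Store: store, Errors: map[string]int{}}
		summaries[peer] = s
	}
	s.Attempts++
	if errName != "" {
		s.Errors[errName]++
	}
}

func main() {
	summaries := map[uint64]*peerAccessSummary{}
	mergeAccess(summaries, 214307066, 4, "data_is_not_ready")
	mergeAccess(summaries, 214307066, 4, "context deadline exceeded")
	mergeAccess(summaries, 2117752236, 9, "context deadline exceeded")
	for _, s := range summaries {
		fmt.Printf("peer:%d store:%d attempts:%d errors:%v\n", s.Peer, s.Store, s.Attempts, s.Errors)
	}
}
```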

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Signed-off-by: crazycs520 <crazycs520@gmail.com>
Comment on lines +297 to +299
if replicaRead {
	tp = ReqReplicaRead
} else if staleRead {
Contributor


Should replicaRead take priority over staleRead? Or is it guaranteed that replicaRead and staleRead are mutually exclusive?

Contributor Author


According to the current tests, the replicaRead and staleRead flags are normally mutually exclusive.

@cfzjywxk cfzjywxk merged commit 31a502b into tikv:master Apr 29, 2024
9 of 10 checks passed
crazycs520 added a commit to crazycs520/client-go that referenced this pull request Apr 29, 2024
* improve region request log for diagnose

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* rename struct

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine region error metric with store id label and add rpc error metric

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine comment

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* restrict log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine code

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* fix test

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* address comment

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine

Signed-off-by: crazycs520 <crazycs520@gmail.com>

* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>

---------

Signed-off-by: crazycs520 <crazycs520@gmail.com>
cfzjywxk pushed a commit that referenced this pull request Apr 29, 2024
crazycs520 added a commit to crazycs520/client-go that referenced this pull request Aug 27, 2024
Signed-off-by: crazycs520 <crazycs520@gmail.com>

---------

Signed-off-by: crazycs520 <crazycs520@gmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Aug 28, 2024
 

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Successfully merging this pull request may close these issues.

improve region_request log for observability