Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve region_request log for observability #1315

Closed
crazycs520 opened this issue Apr 22, 2024 · 0 comments · Fixed by #1300
Closed

improve region_request log for observability #1315

crazycs520 opened this issue Apr 22, 2024 · 0 comments · Fixed by #1300

Comments

@crazycs520
Copy link
Contributor

crazycs520 commented Apr 22, 2024

Issue-1: Too much log when 1 tikv node down

After 1 tikv server down when benchmark test, tidb will output tens of thousands log about region_request in a matter of minutes, follwing is an example in my test:

  • Too much log about throwing pseudo region error due to no replica available
  • Too much log about DataIsNotReady, this should be fix by reduce unnecessary log when DataIsNotReady #1304
  • Too much log about send request failed, err: context canceled
  • Too much log about send request on region error failed
> rg 'region_request.go' tidb.log | wc
  26917 1839370 26737091
> rg 'region_request.go' tidb.log | rg 'throwing pseudo region error due to no replica available' | wc
  11221 1033975 11960379
> rg 'region_request.go' tidb.log | rg 'tikv reports `DataIsNotReady` retry later' | wc
  14489  695519 13496704
> rg 'region_request.go' tidb.log | rg 'send request failed, err: context canceled' | wc
    884   79561  938332
> rg 'region_request.go' tidb.log | rg 'send request on region error failed' | wc
    320   30098  339321

Issue-2: log is not friendly to diagnostic slow queries

After 1 tikv server down when benchmark test, we will find some slow queries in TiDB dashboard:

image image image

Then we may use Connection ID and txn timestamp to grep tidb log, then got following log:

# rg 449253584666624000  tidb.log | rg 3573547264
[2024/04/22 14:13:46.410 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=2] [total-backoff-times=1] [max-exec-timeout-ms=100] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.424 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=false] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 3, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 2, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=3] [total-backoff-ms=14] [total-backoff-times=5] [max-exec-timeout-ms=60000] [total-region-errors="310072-data_is_not_ready:1, 310072-not_leader-nil:1, 310073-not_leader-nil:1"]
[2024/04/22 14:13:46.437 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=26] [total-backoff-times=7] [max-exec-timeout-ms=60000] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.462 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=50] [total-backoff-times=9] [max-exec-timeout-ms=60000] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.511 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=98] [total-backoff-times=11] [max-exec-timeout-ms=60000] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.608 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=194] [total-backoff-times=13] [max-exec-timeout-ms=60000] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.800 +08:00] [INFO] [region_request.go:1688] ["throwing pseudo region error due to no replica available"] [conn=3573547264] [session_alias=] [req-ts=449253584666624000] [req-type=Cop] [region="{ region id: 310071, ver: 504, confVer: 277 }"] [replica-read-type=mixed] [stale-read=true] [request-sender="{rpcError:<nil>, replicaSelector: replicaSelectorV2{replicaReadType: mixed, attempts: 1, cacheRegionIsValid: false, replicaStatus: [peer: 310072, store: 4, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310073, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable peer: 310074, store: 2, isEpochStale: false, attempts: 0, replica-epoch: 1, store-epoch: 1, store-state: resolved, store-liveness-state: unreachable]}}"] [retry-times=1] [total-backoff-ms=386] [total-backoff-times=15] [max-exec-timeout-ms=60000] [total-region-errors=310072-data_is_not_ready:1]
[2024/04/22 14:13:46.891 +08:00] [INFO] [conn.go:1151] ["command dispatched failed"] [conn=3573547264] [session_alias=] [connInfo="id:3573547264, addr:192.168.80.1:44872 status:10, collation:latin1_swedish_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="select pk, sk, ts, v, n from (             select pk, sk, ts, v, row_number() over (partition by pk, sk order by ts desc) as n             from t1 as of timestamp now() - interval 5 second where pk = '000000000000000000000000854967' and sk >= '0000-0001-0002-0005' and sk < '0000-0001-0003-0006' and ts >= 1713766366407575348 and ts < 1713766416407576961         ) as tt where n <= 6 order by sk asc"] [txn_mode=PESSIMISTIC] [timestamp=449253584666624000] [err="[executor:3024]Query execution was interrupted, maximum statement execution time exceeded\ngithub.com/pingcap/errors.AddStack\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/data/cs/code/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/util/sqlkiller.(*SQLKiller).HandleSignal\n\t/data/cs/tidb/pkg/util/sqlkiller/sqlkiller.go:68\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*executorKillerHandler).HandleSQLKillerSignal\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:251\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:409\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunksUnparallel\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:615\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunks\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:528\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).Next\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:277\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).fetchChild\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:210\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).getRowsInPartition\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:181\ngithub.com/pingcap/tidb/pkg/executor.(*PipelinedWindowExec).Next\n\t/data/cs/tidb/pkg/executor/pipelined_window.go:126\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*SelectionExec).Next\n\t/data/cs/tidb/pkg/executor/executor.go:1609\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunksUnparallel\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:615\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).fetchChunks\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:528\ngithub.com/pingcap/tidb/pkg/executor/sortexec.(*SortExec).Next\n\t/data/cs/tidb/pkg/executor/sortexec/sort.go:277\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Next\n\t/data/cs/tidb/pkg/executor/internal/exec/executor.go:403\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).next\n\t/data/cs/tidb/pkg/executor/adapter.go:1210\ngithub.com/pingcap/tidb/pkg/executor.(*recordSet).Next\n\t/data/cs/tidb/pkg/executor/adapter.go:159\ngithub.com/pingcap/tidb/pkg/server/internal/resultset.(*tidbResultSet).Next\n\t/data/cs/tidb/pkg/server/internal/resultset/resultset.go:64\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeChunks\n\t/data/cs/tidb/pkg/server/conn.go:2316\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).writeResultSet\n\t/data/cs/tidb/pkg/server/conn.go:2259\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/data/cs/tidb/pkg/server/conn.go:2052\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\n\t/data/cs/tidb/pkg/server/conn.go:1774\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/data/cs/tidb/pkg/server/conn.go:1348\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/data/cs/tidb/pkg/server/conn.go:1114\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/data/cs/tidb/pkg/server/server.go:739\nruntime.goexit\n\t/data/cs/go/src/runtime/asm_amd64.s:1650"]

The problem is:

  • Too many throwing pseudo region error due to no replica available.
  • total-region-errors only contain 1 error, but the retry-times=1, what is the rest of the rpc error.
  • It looks like only one peer is accessed, so why not access the other peers.
  • Don't have rpc stats, why it is so slow, Is the RPC slow or something else, like load region?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant