improve region request log for diagnose #1300
Conversation
LGTM except one question: does it introduce observable performance overhead? Do we have experiments for that?
Nice catch. Since I only collect replica access paths when an error is met, in most normal cases this won't introduce performance overhead. Anyway, I triggered a sysbench performance test; let's check out the results later.
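For illustration, here is a minimal sketch of the idea described above (collect replica access info only once an error has been observed, so the error-free path stays cheap); the type and method names are hypothetical stand-ins, not the actual client-go implementation:

```go
package example

// Hypothetical, simplified stand-ins for illustration; the real client-go
// types and fields differ.
type ReplicaAccessInfo struct {
	StoreID uint64
	PeerID  uint64
	ErrStr  string
}

type replicaSelector struct {
	sawError    bool                // set once any region/RPC error is observed
	accessStats []ReplicaAccessInfo // only populated after an error was seen
}

// recordAccess only keeps access info once an error has been observed, so the
// normal (error-free) path pays no extra allocation or bookkeeping cost.
func (s *replicaSelector) recordAccess(info ReplicaAccessInfo) {
	if !s.sawError {
		return
	}
	s.accessStats = append(s.accessStats, info)
}

// onError marks the selector as having seen an error and records the access.
func (s *replicaSelector) onError(info ReplicaAccessInfo) {
	s.sawError = true
	s.accessStats = append(s.accessStats, info)
}
```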
internal/locate/region_request.go
// ReplicaAccessStats records the first 20 access infos; accesses beyond 20 are counted in the `OverflowCount` field.
type ReplicaAccessStats struct {
	AccessInfos   []ReplicaAccessInfo
	OverflowCount int
How about merging the access info by peer when there are many accesses? OverflowCount alone cannot provide useful information.
Since ReplicaAccessStats currently only records accesses for this round, and in the normal case each replica should be accessed only once (the leader replica can be accessed up to 10 times), OverflowCount should normally always be 0.
What about the follower read or stale read cases? The OverflowCount still does not seem very helpful. Consider such a case: the NotLeader error is encountered several times because of a region leader election; after the new leader information is returned to the kv-client, some new errors caused by disk I/O problems are encountered, but the related information is only recorded as OverflowCount.
great, done. PTAL
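For context, the suggestion above (aggregate accesses per peer rather than relying on an overflow counter) could look roughly like the sketch below; all names are hypothetical and this is not the code that was merged:

```go
package example

// peerAccessStats is a hypothetical per-peer aggregation of replica accesses;
// field and type names are illustrative, not the ones used in client-go.
type peerAccessStats struct {
	Count    int            // how many times this peer was accessed in the round
	ErrCount map[string]int // error name -> occurrences for this peer
}

type replicaAccessStatsByPeer struct {
	ByPeer map[uint64]*peerAccessStats // keyed by peer ID
}

// record aggregates one access (and an optional error) under its peer, so a
// long retry sequence stays readable instead of overflowing into a counter.
func (s *replicaAccessStatsByPeer) record(peerID uint64, errName string) {
	if s.ByPeer == nil {
		s.ByPeer = make(map[uint64]*peerAccessStats)
	}
	st, ok := s.ByPeer[peerID]
	if !ok {
		st = &peerAccessStats{ErrCount: make(map[string]int)}
		s.ByPeer[peerID] = st
	}
	st.Count++
	if errName != "" {
		st.ErrCount[errName]++
	}
}
```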
if replicaRead {
	tp = ReqReplicaRead
} else if staleRead {
Should `replicaRead` take priority over `staleRead`? Or is it guaranteed that `replicaRead` and `staleRead` must be exclusive?
According to the current tests, the `replicaRead` and `staleRead` flags are normally mutually exclusive.
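Given that answer, here is a defensive sketch of how the exclusivity assumption could be made explicit; `classifyRead`, `ReqLeader`, and `ReqStaleRead` are hypothetical names, and only `ReqReplicaRead` appears in the snippet above:

```go
package example

import "fmt"

// RequestType and the constants below are illustrative stand-ins.
type RequestType int

const (
	ReqLeader RequestType = iota
	ReqReplicaRead
	ReqStaleRead
)

// classifyRead maps the two read flags to a request type and makes the
// assumed invariant (the flags are never both set) explicit.
func classifyRead(replicaRead, staleRead bool) (RequestType, error) {
	switch {
	case replicaRead && staleRead:
		return ReqLeader, fmt.Errorf("replicaRead and staleRead must be exclusive")
	case replicaRead:
		return ReqReplicaRead, nil
	case staleRead:
		return ReqStaleRead, nil
	default:
		return ReqLeader, nil
	}
}
```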
Squashed commit messages:
* improve region request log for diagnose
* rename struct
* refine region error metric with store id label and add rpc error metric
* refine comment
* refine code
* restrict log
* refine code
* refine
* refine
* refine log
* refine code
* fix test
* address comment
* refine
* refine
* refine log

Signed-off-by: crazycs520 <crazycs520@gmail.com>
Closes #1315.
What changed and how does it work?
1. Reduce `region_request` log

Before this PR, when one TiKV server went down during a benchmark test, TiDB would output tens of thousands of `region_request` log entries in a matter of minutes. This PR restricts that log output.
2. Make the log more friendly for diagnosing slow queries

For example, grep the TiDB log with the `Connection ID` and `txn timestamp` to get the related log entries. With that log information, we can see what happened to this query during execution.
3. Add RPC error stats

Before this PR, the `TableReader_17` execution info in the plan looked like:

time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}

With this PR, the `TableReader_17` execution info becomes:

time:502.4ms, loops:1, cop_task: {num: 1, max: 0s, proc_keys: 0, rpc_num: 6, rpc_time: 485.7ms, copr_cache_hit_ratio: 0.00, build_task_duration: 7.34µs, max_distsql_concurrency: 1}, , rpc_errors:{data_is_not_ready:2, not_leader:2, context deadline exceeded:1, context canceled:1}, backoff{regionMiss: 2ms, dataNotReady: 6ms, regionScheduling: 6ms}
From the `rpc_errors` info, we can see that the cop task met the `data_is_not_ready` error 2 times, the `not_leader` error 2 times, `context deadline exceeded` 1 time, and `context canceled` 1 time.

4. Refine Region Error metric and add RPC error metric
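To illustrate the flavor of the rpc_errors accounting and a per-store RPC error metric, a rough sketch follows; the metric name, labels, and helper types are assumptions for illustration, not the actual definitions added by this PR:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

// rpcErrorCounter is an assumed Prometheus metric labeled by error type and
// store ID, in the spirit of "refine region error metric with store id label
// and add rpc error metric"; the real metric name and labels may differ.
var rpcErrorCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "tikv_client",
		Name:      "rpc_error_total",
		Help:      "Counter of RPC errors, labeled by error type and store.",
	},
	[]string{"type", "store"},
)

func init() {
	prometheus.MustRegister(rpcErrorCounter)
}

// rpcErrorStats accumulates per-error-type counts for one request, which could
// then be rendered as rpc_errors:{data_is_not_ready:2, not_leader:2, ...}.
type rpcErrorStats struct {
	counts map[string]int
}

func (s *rpcErrorStats) record(errName, storeID string) {
	if s.counts == nil {
		s.counts = make(map[string]int)
	}
	s.counts[errName]++
	rpcErrorCounter.WithLabelValues(errName, storeID).Inc()
}
```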