-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
retry short read reqs immediately on server busy #990
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
/cc @crazycs520 |
|
return bo.Backoff(args.cfg, args.err) | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the purpose of these code? I didn't understand it 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used to maintain delayed backoffs.
- "
pendingBackoffs[addr]
is not nil" means there is a delayed backoff which should be applied on retrying the corresponding store. delayBoTiKVServerBusy
is the callback for recording delayed backoffs.backoffOnRetry
is used to apply pending backoffs on retry.- when there is no candidate, the kv client just return a fake error and let the caller do retry. to avoid potential frequent rpcs, we call
backoffOnFakeErr
before returning a fake error, which chooses a pending backoff with the largest base duration and applies it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be done by adding "busy" flag to the stores, instead of maintaining maps here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean the store struct in region cache, probably no I think. They have different scope/lifetime, that is, each SendReqCtx
call should has its own pendingBackoffs
, and has no effect to each others. Maybe we can add a field to the replica struct, however, this PR is just a quick-and-dirty fix to the customer issue, I just try to make the least change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic looks fine to me. I wonder if the code can be better structured...
Wait for @crazycs520 's new solution
If a read-only workload running with a short max-execution-time (eg. 500ms), injecting IO delay faults will cause the QPS amost drop to zero, because the queries will be interrupted during server-busy backoff and have no chance to try other available peers.
To address the issue, let the sender retry those requests immediately and do backoff lazily. Here is the test results of injecting IO delay (1s) on one tikv for 3 minutes.