retry short read reqs immediately on server busy #990

zyguan · 2023-09-22T05:42:32Z

If a read-only workload runs with a short max-execution-time (e.g. 500ms), injecting IO delay faults causes the QPS to drop almost to zero, because the queries are interrupted during the server-busy backoff and never get a chance to try other available peers.

To address the issue, let the sender retry those requests immediately and apply the backoff lazily. Here are the test results of injecting an IO delay (1s) on one TiKV for 3 minutes.

Signed-off-by: zyguan <zhongyangguan@gmail.com>

cfzjywxk · 2023-09-29T08:10:24Z

/cc @crazycs520

ekexium · 2023-10-08T05:45:01Z

The change makes backoff more complex. It only takes effect when the max execution time is smaller than the read timeout, which may itself be considered a misuse, I suppose? I'm not sure it's worth covering this case.
I see, it's not checking the SQL max_execution_time

MyonKeminta · 2023-12-28T08:57:18Z

internal/locate/region_request.go

+			return bo.Backoff(args.cfg, args.err)
+		}
+		return nil
+	}


Could you explain the purpose of this code? I didn't understand it 🤔

It's used to maintain delayed backoffs.

"pendingBackoffs[addr] is not nil" means there is a delayed backoff which should be applied on retrying the corresponding store.

delayBoTiKVServerBusy is the callback for recording delayed backoffs.

backoffOnRetry is used to apply pending backoffs on retry.

When there is no candidate, the kv client just returns a fake error and lets the caller retry. To avoid potentially frequent RPCs, we call backoffOnFakeErr before returning a fake error; it chooses the pending backoff with the largest base duration and applies it.

Can this be done by adding a "busy" flag to the stores, instead of maintaining maps here?

If you mean the store struct in the region cache, probably not, I think. They have different scopes and lifetimes: each SendReqCtx call should have its own pendingBackoffs and should have no effect on the others. Maybe we can add a field to the replica struct; however, this PR is just a quick-and-dirty fix for the customer issue, so I tried to make the smallest change.

MyonKeminta

The logic looks fine to me. I wonder if the code can be better structured...

@crazycs520

Wait for @crazycs520's new solution

zyguan added 5 commits September 22, 2023 13:04

retry short read reqs immediately on server busy

90887db

Signed-off-by: zyguan <zhongyangguan@gmail.com>

be compatible with load-based replica read

999d08a

Signed-off-by: zyguan <zhongyangguan@gmail.com>

do not retry same peer after fast retry

71902c9

Signed-off-by: zyguan <zhongyangguan@gmail.com>

Merge remote-tracking branch 'origin/master' into retry-fast

49d3070

add unit tests

1c22121

Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan marked this pull request as ready for review September 25, 2023 08:25

cfzjywxk requested review from you06, cfzjywxk, MyonKeminta and ekexium September 29, 2023 08:10

MyonKeminta reviewed Dec 28, 2023

zyguan added 2 commits January 3, 2024 10:21

Merge remote-tracking branch 'origin/master' into retry-fast

8dee151

fix ut

d2b78dc

Signed-off-by: zyguan <zhongyangguan@gmail.com>

cfzjywxk requested a review from MyonKeminta January 3, 2024 03:53

MyonKeminta mentioned this pull request Feb 27, 2024

reduce unnecessary tikvServerBusy backoff when able to try next replica #1184

Merged

MyonKeminta previously approved these changes Feb 28, 2024

MyonKeminta Dec 28, 2023

zyguan Dec 28, 2023 (edited)

MyonKeminta Jan 2, 2024

zyguan Jan 2, 2024
