schedule: add split-scatter for release-8.5-20251204-v8.5.4 #10678
(cherry picked from commit 38a0f9a) Signed-off-by: lhy1024 <admin@liudos.us>
Cherry-pick conflict / adaptation notes
Source PR: #10621, merge commit 38a0f9ab0a3e373c2da23fa7c556423b4a67a6db.
I rechecked the actual cherry-pick conflicts by applying the source merge commit on the release branch with:
git cherry-pick -m 1 --no-commit 38a0f9ab0a3e373c2da23fa7c556423b4a67a6db
The actual content-conflict files are:
File-level parity was checked against source PR #10621. The business file set matches the source PR semantically. The only extra files in this backport are Go module dependency files required by the release branch kvproto update:
Risk classification for review
High / should review carefully:
Medium / worth checking but mostly mechanical:
Low / should not need deep review:
Conflict resolution details
Dependency handling
Unrelated master prerequisites intentionally not included
Verification after conflict resolution
rg -n '<<<<<<<|=======|>>>>>>>' $(gh pr diff 10678 --repo tikv/pd --name-only)
PATH="$PWD/.tools/bin:$PATH" GOTOOLCHAIN=go1.23.12 make check
.tools/bin/failpoint-ctl enable . && go test ./pkg/schedule/checker -run SplitScatter -count=1; rc=$?; .tools/bin/failpoint-ctl disable .; exit $rc
.tools/bin/failpoint-ctl enable . && (cd tests/integrations && GOTOOLCHAIN=go1.23.12 go test ./mcs/resourcemanager -run 'TestResourceManagerClientTestSuite/TestSwitchBurst' -count=1); rc=$?; .tools/bin/failpoint-ctl disable .; exit $rc
git diff --check
	opController *operator.Controller,
	addPendingProcessedRegions func(needCheckLen bool, ids ...uint64),
) *splitScatterController {
	return &splitScatterController{
This gauge is process-global. Could we reset it when the split-scatter controller is created so a previous coordinator/leader term does not leave a stale nonzero pending count exposed?
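A minimal sketch of the reset being asked for here, using only the standard library. The `gauge` type, `pendingGauge`, `controller`, and `newController` are hypothetical stand-ins for illustration; they are not the metric or constructor from this PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge is a stdlib stand-in for a process-global metrics gauge
// (hypothetical; the real metric in the PR is not shown here).
type gauge struct{ v atomic.Int64 }

func (g *gauge) Set(n int64) { g.v.Store(n) }
func (g *gauge) Get() int64  { return g.v.Load() }

// pendingGauge lives for the whole process, so it outlives any controller.
var pendingGauge gauge

// controller is a hypothetical, stripped-down split-scatter controller.
type controller struct {
	pending map[uint64]struct{}
}

// newController resets the global gauge so that a value left behind by a
// previous coordinator/leader term is not exposed by the new controller.
func newController() *controller {
	pendingGauge.Set(0)
	return &controller{pending: make(map[uint64]struct{})}
}

// add records a pending region and keeps the gauge in sync.
func (c *controller) add(regionID uint64) {
	c.pending[regionID] = struct{}{}
	pendingGauge.Set(int64(len(c.pending)))
}

func main() {
	old := newController()
	old.add(1)
	old.add(2)
	fmt.Println(pendingGauge.Get()) // 2

	_ = newController()             // new term: the gauge starts from zero again
	fmt.Println(pendingGauge.Get()) // 0
}
```

Without the `Set(0)` in the constructor, the second controller would start with an empty pending map while the gauge still reported 2.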
// Keep pendingMu short: cluster reads below can be slower and do not need
// to block pending updates. Stale snapshots are safe because candidates
// are rechecked before dispatch and delay/delete recheck the pending identity.
pendingSnapshot := make([]splitScatterPendingItem, 0, len(c.pending))
This snapshots all pending entries and later sorts candidates on every dispatch pass. With the 4096 pending cap, can we avoid the full scan/sort by tracking retry/expire order or throttling blocked batches?
Handled the throttling side in 4ab32c3 by adding a nextDispatchAt gate when dispatch is blocked or no candidate is ready. I kept the pending scan/sort structure unchanged because replacing it with retry/expire ordering is a larger source/master-level optimization, not specific to this release backport.
ordinaryPeer := make(map[uint64]uint64)
ordinaryLeader := make(map[uint64]uint64)
specialPeer := make(map[string]map[uint64]uint64)
for _, store := range r.cluster.GetStores() {
This rebuilds the range baseline by iterating all stores for each internal scatter. For regions in the same table/index group, can we reuse the scatter state within one dispatch pass to avoid repeated range count queries?
This is inherited from the source PR/current master, not introduced by this backport. Reusing scatter state across regions in one dispatch pass would require broader RegionScatterer API/state changes, so I prefer leaving it to a master follow-up rather than expanding this release backport.
// Check pending processed regions first.
c.checkPendingProcessedRegions()

c.splitScatter.dispatchSplitScatterRegions()
This runs on every patrol tick (10ms by default). Could we add a cheap backoff/next-retry gate when split-scatter is blocked so it does not add steady overhead to patrol?
Handled in 4ab32c3: split-scatter now has a cheap nextDispatchAt gate. When schedule limit blocks dispatch or no pending item is ready, patrol returns before another full collect pass until the retry backoff expires.
for _, pending := range pendingSnapshot {
	regionID := pending.regionID
	region := c.cluster.GetRegion(regionID)
	if region == nil {
This skip makes most missing-region cases invisible to splitScatterDispatchRegionMissingCounter below; the pending item will just wait until TTL. Could we count or clean these missing entries here so the metric reflects the actual drop/delay reason?
Missing pending target/source regions are now counted with splitScatterDispatchRegionMissingCounter and delayed with retryAt, so they no longer stay invisible until TTL, and they are not counted repeatedly during the backoff.
	return
}
limit := c.cluster.GetCheckerConfig().GetSplitScatterScheduleLimit()
if limit == 0 {
When the limit is set to 0, split-scatter is documented as disabled, but new split batches are still recorded and kept until TTL. Could we skip recording pending entries as well, or otherwise avoid accumulating pending work and counter noise while the feature is disabled?
RecordSplitScatterBatch skips recording when split-scatter is disabled, and dispatch clears any existing pending entries if the limit is later set to 0, updating the pending gauge.
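The disabled-state handling described here might look roughly like the sketch below. The `controller` type and its `limit`, `pending`, and `gauge` fields, as well as the `record` and `dispatch` methods, are simplified assumptions for illustration; they are not the signatures of RecordSplitScatterBatch or the dispatch path in this PR.

```go
package main

import "fmt"

// controller is a hypothetical reduction of the split-scatter controller.
type controller struct {
	limit   uint64              // stand-in for the split-scatter schedule limit
	pending map[uint64]struct{} // pending region IDs
	gauge   int                 // stand-in for the pending gauge
}

// record skips recording entirely while the feature is disabled (limit == 0),
// so no pending work or counter noise accumulates.
func (c *controller) record(ids ...uint64) {
	if c.limit == 0 {
		return
	}
	for _, id := range ids {
		c.pending[id] = struct{}{}
	}
	c.gauge = len(c.pending)
}

// dispatch drops any entries recorded before the limit was set to 0 and
// updates the gauge. It returns the number of entries still pending.
func (c *controller) dispatch() int {
	if c.limit == 0 {
		c.pending = make(map[uint64]struct{})
		c.gauge = 0
		return 0
	}
	return len(c.pending)
}

func main() {
	c := &controller{limit: 4, pending: make(map[uint64]struct{})}
	c.record(1, 2, 3)
	fmt.Println(c.dispatch(), c.gauge) // 3 3

	c.limit = 0                        // feature disabled at runtime
	fmt.Println(c.dispatch(), c.gauge) // 0 0

	c.record(4)                 // ignored while disabled
	fmt.Println(len(c.pending)) // 0
}
```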
Reset the process-global pending gauge when creating a split-scatter controller. Avoid recording or retaining pending entries while split-scatter is disabled, add a cheap dispatch backoff for blocked pending work, and count missing pending regions when they are delayed. Signed-off-by: lhy1024 <admin@liudos.us>
Force-pushed from 681c716 to 4ab32c3.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bufferflies, niubell, rleungx
Merged commit 399f778 into tikv:release-8.5-20251204-v8.5.4
Codecov Report
@@ Coverage Diff @@
## release-8.5-20251204-v8.5.4 #10678 +/- ##
==============================================================
Coverage ? 77.82%
==============================================================
Files ? 469
Lines ? 63081
Branches ? 0
==============================================================
Hits ? 49090
Misses ? 10391
Partials ? 3600
What problem does this PR solve?
Issue Number: ref #10621
What is changed and how does it work?
Cherry-pick / dependency handling
Source PR: #10621, merge commit 38a0f9ab0a3e373c2da23fa7c556423b4a67a6db.
Target branch: release-8.5-20251204-v8.5.4.
Current head: 44949721a3483de54eec810979bbc5cba1532e38.
tools/, and tests/integrations/ modules to github.com/pingcap/kvproto v0.0.0-20260518035033-ca8835cfa721.
github.com/lhy1024/kvproto@07f3062a replace and stale checksums after "pd: support split reason for release-8.5-20251208-v8.5.4" (pingcap/kvproto#1464) was merged.
prometheus/testutil style; root github.com/prometheus/client_model v0.6.1 remains indirect after go mod tidy.
Check List
Tests
Manual test commands:
Code changes
Side effects
Related changes
Release note