cherrypick data affinity feature to release-8.5#10061
Conversation
|
Skipping CI for Draft Pull Request. |
|
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## release-8.5 #10061 +/- ##
===============================================
- Coverage 77.95% 77.86% -0.09%
===============================================
Files 471 482 +11
Lines 63502 65343 +1841
===============================================
+ Hits 49505 50882 +1377
- Misses 10370 10725 +355
- Partials 3627 3736 +109
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
7f1d976 to
b3248da
Compare
…tikv#9993) ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us> Co-authored-by: 混沌DM <hundundm@gmail.com> Signed-off-by: lhy1024 <admin@liudos.us>
…anager (tikv#10038) ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
close tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
b3248da to
ef62a65
Compare
0467ae4 to
cecd033
Compare
| // UpdateAffinityGroupPeers updates the leader and voter stores of an affinity group. | ||
| UpdateAffinityGroupPeers(ctx context.Context, groupID string, leaderStoreID uint64, voterStoreIDs []uint64) (*AffinityGroupState, error) | ||
| // DeleteAffinityGroup deletes an affinity group by group ID. | ||
| DeleteAffinityGroup(ctx context.Context, groupID string, force bool) error |
There was a problem hiding this comment.
pls note when we should force or not.
There was a problem hiding this comment.
we should force when the groups contain key range. I think we can add it in another pr.
| var state AffinityGroupState | ||
| err = c.request(ctx, newRequestInfo(). | ||
| WithName("UpdateAffinityGroupPeers"). | ||
| WithURI(fmt.Sprintf(AffinityGroupByID, groupID)). |
There was a problem hiding this comment.
How about extracting a new function, such as WithURI(RegionByID(regionID))?
There was a problem hiding this comment.
Ok, But it doesn't affect correctness. I've documented it, and I'll complete them in another PR.
| return errors.Trace(err) | ||
| } | ||
| // Use POST with ?delete query parameter for batch deletion | ||
| url := AffinityGroups + "?delete" |
| } | ||
|
|
||
| // Check if region is in an affinity group that doesn't allow regular scheduling | ||
| if !r.affinityFilter.Select(region).IsOK() { |
There was a problem hiding this comment.
If it returns an error, the client will retry the scatter, but it always fails. So we should return nil for the affinity region
| Subsystem: "affinity", | ||
| Name: "status", | ||
| Help: "Status of the affinity manager.", | ||
| }, []string{"type"}) |
There was a problem hiding this comment.
We need to add a group label to check which group has problems.
| defaultReplicaScheduleLimit = 64 | ||
| defaultMergeScheduleLimit = 8 | ||
| defaultHotRegionScheduleLimit = 4 | ||
| defaultAffinityScheduleLimit = 0 // default to disable |
There was a problem hiding this comment.
During normal use, what value will this be set to? In addition, what is the specific function of this parameter? Controlling the rate?
There was a problem hiding this comment.
It controls affinity-move-peer and affinity-merge-region. We can use the value from leader-schedule-limit and merge-schedule-limit.
So I think 4 or 8 is ok. It also performed well in local testing.
There was a problem hiding this comment.
When defaultAffinityScheduleLimit is 0, it means affinity scheduling is disabled.
There was a problem hiding this comment.
When there is already a switch to control the feature, do we still need to set this default to 0?
There was a problem hiding this comment.
I think it is safe to set defaultAffinityScheduleLimit to 0. Because we need to cherry pick it to 8.5.5.
And we now use it more in poc scenario. We should avoid to affect other users.
There was a problem hiding this comment.
there is already a switch to control the feature
We use affinity-schedule-limit as a switch, and there is no other config to control it.
In checker PR, we replace affinity-schedule-enable with affinity-schedule-limit
ref tikv#9764 Signed-off-by: lhy1024 <admin@liudos.us>
| // AffinityGroupsPathPrefix returns the path prefix for all affinity groups. | ||
| // Its format is: "/pd/{cluster_id}/affinity_groups/" | ||
| // This is used for loading all groups via a range read. | ||
| func AffinityGroupsPathPrefix() string { |
There was a problem hiding this comment.
In the furture, we will pick mcs for it. We will use AffinityGroupsPathPrefix in mcs.
I think we could keep it. And it has no any side effects
| } | ||
|
|
||
| // SetNodeState sets the node state for the store. | ||
| func SetNodeState(nodeState metapb.NodeState) StoreCreateOption { |
There was a problem hiding this comment.
Can we directly use SetStoreState?
There was a problem hiding this comment.
SetStoreState cannot set NodeState_Preparing, and it only be used in test file.
[LGTM Timeline notifier]Timeline:
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bufferflies, HunDunDM, niubell, okJiang, rleungx The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |


Conflict Resolution Details
Commit1: PR #9993 - Affinity Storage & Config
Conflict 1:
pkg/slice/slice.goandpkg/slice/slice_test.goResolution: Accept both
Conflict 2:
pkg/storage/storage.goResolution: Remove MaintenanceStorage
Conflict 3: Keypath Architecture⚠️ Critical
Resolution: Refactor keypath implementation
AffinityGroupIDPathreturns relative path rather than full path (consistent withRegionLabelKeyPath)LoadAllAffinityGroupsuseskeypath.AffinityGroupPath+"/"becauseloadRangeByPrefixwill addPDRootPath()prefixMaster branch implementation:
Release-8.5 implementation:
I checked it in etcd-ctl. They are the same.


8.5
master
Commit2-4: PR #9997, #9998, #9999
✅ No conflicts
Commit5: PR #10038 - Affinity Filter & kvproto
Conflict 1:
go.sumandgo.modResolution: Update kvproto version
Conflict 2:
pkg/schedule/core/cluster_informer.goResolution:
AllocID() (uint64, error)GetAffinityManager() *affinity.ManagerConflict 3:
pkg/mock/mockcluster/mockcluster.go,pkg/schedule/schedulers/balance_leader_test.go,pkg/schedule/schedulers/balance_region_test.goResolution: Kept private field
keyRangeManagerwith getter methodConflict 4:⚠️ File Migration (Critical)
hot_region_solver.goResolution: Migrated affinity filter logic to
hot_region.go:622-640Conflict 5:
server/server.go:2155-2172Resolution: Kept both methods
GetGlobalTSOAllocator()GetAffinityManager()Conflict 6:
pkg/schedule/labeler/labeler.go:436-445andpkg/core/store_option.go:101-108Resolution: Copied function implementation from master
Commit6: PR #10050 - Refactor
✅ No conflicts
Commit7: PR #10041 - API & Client Support
Conflict 1:
client/http/api.goandserver/apiv2/router.goResolution:
GetKeyspaceMetaByIDandRegisterMaintenanceConflict 2:
tests/integrations/client/http_client_test.goResolution: Remove unrelated
TestGetSiblingsRegionsConflict 3:
tests/server/apiv2/handlers/affinity_test.goResolution: Replace test framework functions
RunTest()RunTestBasedOnMode()RunTestInNonMicroserviceEnv()RunTestInPDMode()RunTestInMicroserviceEnv()RunTestInAPIMode()Conflict 4: Naming Convention
Resolution: Use
MicroService(capital S) rather thanMicroserviceCommit8: PR #10040 - Affinity Checker
Conflict 1:
pkg/schedule/checker/metrics.goIncoming changes:
Resolution: Only keep affinity-related metrics, discard patrol phases and controller metrics
Conflict 2:
pkg/schedule/checker/checker_controller.goIncoming changes:
Resolution: Accept the ruleManager initialization
Conflict 3: Checker Integration - measureChecker wrapper⚠️ Critical
Incoming changes:
Resolution: Remove
measureCheckerwrapper, keep release-8.5 styleFinal implementation:
Conflict 4: API Method
Resolution: Replace
AllowLeaderTransferIn()withAllowLeaderTransfer()Commit9: PR #10043 - pd-ctl Support
✅ No conflicts
Commit9: PR #10081 - merge fix
replace suite.newRegionInfor with RegionInfo
Commit10: PR #10080 - other fix
Conflict 1: Function Name
Resolution: Replace
AllowLeaderTransferIn()withAllowLeaderTransfer()Conflict 2: Other PR
Resolution: Keep old
Conflict 3: OpAdmin⚠️ Critical
Keep OpAdmin|OpMerge, it changed to only OpAdmin in master.
But it still be OpAdmin|OpMerge in release 8.5.
Conflict 4: Rename
replace
PauseLeaderTransferInwithPauseLeaderTransferWhat problem does this PR solve?
Issue Number: Close #9764
What is changed and how does it work?
Check List
Tests
Release note