mcs: fix the prepare checker is directly skipped #7678
Conversation
Signed-off-by: Ryan Leung <rleungx@gmail.com>
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped
Is minute enough?
The heartbeat interval is 1 minute.
do we need to use a constant variable?
pkg/mcs/scheduling/server/cluster.go (outdated)
@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped
+	c.coordinator.RunUntilStop(time.Minute)
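For context on the change above, here is a minimal, self-contained sketch of how a forced collect wait could be threaded through the coordinator, with the one-minute value kept as a named constant as the review suggests. This is illustrative only, not the actual pd code: `collectWait`, `prepareChecker`, and the shape of `RunUntilStop` are assumptions for the sake of the example.

```go
package main

import (
	"fmt"
	"time"
)

// collectWait is the minimum time the prepare checker keeps collecting
// region heartbeats before it may report the cluster as prepared. One
// minute matches the heartbeat interval mentioned in the review thread.
const collectWait = time.Minute

type prepareChecker struct {
	start    time.Time
	regions  int
	expected int
}

// check returns true only after the forced wait has elapsed AND enough
// regions have been collected, so an empty region tree at startup cannot
// make the checker pass immediately.
func (p *prepareChecker) check(wait time.Duration) bool {
	if time.Since(p.start) < wait {
		return false
	}
	return p.regions >= p.expected
}

type coordinator struct {
	checker *prepareChecker
	stop    chan struct{}
}

// RunUntilStop blocks in the prepare phase until the checker is satisfied
// or the coordinator is stopped, then lets schedulers run.
func (c *coordinator) RunUntilStop(collectWaitTime time.Duration) {
	for !c.checker.check(collectWaitTime) {
		select {
		case <-c.stop:
			return
		case <-time.After(100 * time.Millisecond):
		}
	}
	fmt.Println("prepare finished, schedulers can start")
	<-c.stop
}

func main() {
	c := &coordinator{
		checker: &prepareChecker{start: time.Now()},
		stop:    make(chan struct{}),
	}
	go c.RunUntilStop(collectWait)
	time.Sleep(300 * time.Millisecond) // simulate the service running briefly
	close(c.stop)
}
```

With this shape, replacing the `time.Minute` literal at the call site with a named constant, as asked above, is a one-line change.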
Do we need to manually test for unnecessary balance-leader scheduling?
I think we can test it in dev env?
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Force-pushed from a5f5ad5 to 5292107 (compare)
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7678      +/-   ##
==========================================
- Coverage   73.98%   73.57%   -0.42%
==========================================
  Files         429      429
  Lines       47385    47389       +4
==========================================
- Hits        35059    34866     -193
- Misses       9352     9543     +191
- Partials     2974     2980       +6

Flags with carried forward coverage won't be shown.
/merge
@rleungx: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests You only need to trigger
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
This pull request has been accepted and is ready to merge. Commit hash: ca36936
close tikv#7671

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: pingandb <songge102@pingan.com.cn>
What problem does this PR solve?
Issue Number: Close #7671.
What is changed and how does it work?
This happens because the scheduling service does not sync regions from local storage or from another PD. When it starts, the region tree is empty, so the prepare checker is skipped directly. The same phenomenon occurs when using pd-recover to recover the PD cluster: many operators, especially balance-leader operators, are wrongly sent to TiKV. This PR does not handle the pd-recover case.
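As a rough illustration of the failure mode described above (hypothetical names and values, not the actual prepare checker code): if the expected region count loaded at startup is zero because nothing was synced, a ratio-based readiness check passes trivially and the prepare phase is effectively skipped.

```go
package main

import "fmt"

// collectFactor is the fraction of expected regions that must have sent
// heartbeats before scheduling is allowed to start (illustrative value).
const collectFactor = 0.9

// prepared reports whether enough regions have been collected compared to
// the number of regions the service expects to exist.
func prepared(collected, expected int) bool {
	return float64(collected) >= collectFactor*float64(expected)
}

func main() {
	// Scheduling service restarted without syncing regions: expected == 0,
	// so the check passes with zero heartbeats and schedulers start too
	// early, producing unnecessary balance-leader operators.
	fmt.Println(prepared(0, 0)) // true: prepare phase effectively skipped

	// Once region information has been collected for a while, expected
	// reflects the real cluster and the checker gates scheduling properly.
	fmt.Println(prepared(10, 100)) // false: keep waiting
}
```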
Check List
Tests
Used tiup playground to create a cluster and restarted the scheduling service.
Release note