mcs: fix the prepare checker is directly skipped #7678

rleungx · 2024-01-08T05:03:19Z

What problem does this PR solve?

Issue Number: Close #7671.

What is changed and how does it work?

The scheduling service does not sync regions from local storage or from another PD. When it starts, the region tree is empty, so the prepare checker is skipped immediately. The same thing happens when pd-recover is used to recover a PD cluster. As a result, many operators, especially balance leader operators, are wrongly sent to TiKV. This PR does not handle the pd-recover case.

Check List

Tests

Manual test

Using tiup playground to create a cluster and restart the scheduling service.

[2024/01/08 12:59:08.094 +08:00] [INFO] [prepare_checker.go:68] ["not loaded from storage region number is satisfied, finish prepare checker"] [not-from-storage-region=61] [total-region=61]
[2024/01/08 12:59:08.094 +08:00] [INFO] [coordinator.go:390] ["coordinator has finished cluster information preparation"]
[2024/01/08 12:59:08.094 +08:00] [INFO] [coordinator.go:400] ["coordinator starts to run schedulers"]

Release note

None.

Signed-off-by: Ryan Leung <rleungx@gmail.com>

ti-chi-bot · 2024-01-08T05:03:21Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

lhy1024
nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

lhy1024 · 2024-01-08T05:40:07Z

pkg/mcs/scheduling/server/cluster.go

@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped


Is one minute enough?

The heartbeat interval is 1 minute.

Do we need to use a constant?

lhy1024 · 2024-01-08T05:41:00Z

pkg/mcs/scheduling/server/cluster.go

@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped
+	c.coordinator.RunUntilStop(time.Minute)


Do we need to manually test for unnecessary balance leader scheduling?

I think we can test it in the dev environment?

Signed-off-by: Ryan Leung <rleungx@gmail.com>

codecov · 2024-01-08T09:32:20Z

Codecov Report

Merging #7678 (ca36936) into master (6d94c83) will decrease coverage by 0.42%.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7678      +/-   ##
==========================================
- Coverage   73.98%   73.57%   -0.42%     
==========================================
  Files         429      429              
  Lines       47385    47389       +4     
==========================================
- Hits        35059    34866     -193     
- Misses       9352     9543     +191     
- Partials     2974     2980       +6

Flag	Coverage Δ
unittests	`73.57% <100.00%> (-0.42%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

rleungx · 2024-01-09T04:12:46Z

/merge

ti-chi-bot · 2024-01-09T04:12:48Z

@rleungx: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot · 2024-01-09T04:12:50Z

This pull request has been accepted and is ready to merge.

Commit hash: ca36936

fix the prepare checker is directly skipped

666aa3a

Signed-off-by: Ryan Leung <rleungx@gmail.com>

ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none labels Jan 8, 2024

ti-chi-bot bot requested review from disksing and lhy1024 January 8, 2024 05:03

ti-chi-bot bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 8, 2024

rleungx requested review from HuSharp and removed request for disksing January 8, 2024 05:07

lhy1024 reviewed Jan 8, 2024

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Jan 8, 2024

fix the test

5292107

Signed-off-by: Ryan Leung <rleungx@gmail.com>

rleungx force-pushed the fix-prepare-checker branch from a5f5ad5 to 5292107 Compare January 8, 2024 09:19

address the comment

0ca41af

Signed-off-by: Ryan Leung <rleungx@gmail.com>

Merge branch 'master' into fix-prepare-checker

ca36936

rleungx requested a review from lhy1024 January 8, 2024 09:32

lhy1024 approved these changes Jan 8, 2024

ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 8, 2024

nolouch approved these changes Jan 8, 2024

ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jan 8, 2024

ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 9, 2024

ti-chi-bot bot merged commit 562945e into tikv:master Jan 9, 2024
25 of 26 checks passed

rleungx deleted the fix-prepare-checker branch January 9, 2024 04:13

pingandb pushed a commit to pingandb/pd that referenced this pull request Jan 18, 2024

mcs: fix the prepare checker is directly skipped (tikv#7678)

9412057

close tikv#7671 Signed-off-by: Ryan Leung <rleungx@gmail.com> Signed-off-by: pingandb <songge102@pingan.com.cn>

rleungx added a commit to rleungx/pd that referenced this pull request Jan 25, 2024

mcs: fix the prepare checker is directly skipped (tikv#7678) (tikv#249)

0b36d41

HuSharp mentioned this pull request Jan 29, 2024

ci-subtask: exit ci-subtask.sh when execute ci failed #7766

Merged
