mcs: fix the prepare checker is directly skipped #7678

Merged
merged 4 commits into tikv:master from fix-prepare-checker on Jan 9, 2024

Conversation

Member

rleungx commented Jan 8, 2024

What problem does this PR solve?

Issue Number: Close #7671.

What is changed and how does it work?

The scheduling service does not sync regions from local storage or from another PD. When it starts, its region tree is empty, so the prepare checker is skipped immediately. The same thing happens when pd-recover is used to recover a PD cluster. As a result, many operators, especially balance-leader operators, are wrongly sent to TiKV. This PR does not handle the pd-recover case.

Check List

Tests

  • Manual test

Created a cluster with tiup playground and restarted the scheduling service. Resulting log:

[2024/01/08 12:59:08.094 +08:00] [INFO] [prepare_checker.go:68] ["not loaded from storage region number is satisfied, finish prepare checker"] [not-from-storage-region=61] [total-region=61]
[2024/01/08 12:59:08.094 +08:00] [INFO] [coordinator.go:390] ["coordinator has finished cluster information preparation"]
[2024/01/08 12:59:08.094 +08:00] [INFO] [coordinator.go:400] ["coordinator starts to run schedulers"]

Release note

None.

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Contributor

ti-chi-bot bot commented Jan 8, 2024

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lhy1024
  • nolouch

@ti-chi-bot ti-chi-bot bot requested review from disksing and lhy1024 January 8, 2024 05:03
@ti-chi-bot ti-chi-bot bot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Jan 8, 2024
@rleungx rleungx requested review from HuSharp and removed request for disksing January 8, 2024 05:07
@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped
Contributor

Is one minute enough?

Member Author

The heartbeat interval is 1 minute.

Member

Do we need to use a named constant here?

@@ -452,7 +452,8 @@ func (c *Cluster) runUpdateStoreStats() {
 func (c *Cluster) runCoordinator() {
 	defer logutil.LogPanic()
 	defer c.wg.Done()
-	c.coordinator.RunUntilStop()
+	// force wait for 1 minute to make prepare checker won't be directly skipped
+	c.coordinator.RunUntilStop(time.Minute)
Contributor

Do we need to manually test for unnecessary balance-leader scheduling?

Member Author

I think we can test it in the dev environment?

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>

codecov bot commented Jan 8, 2024

Codecov Report

Merging #7678 (ca36936) into master (6d94c83) will decrease coverage by 0.42%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7678      +/-   ##
==========================================
- Coverage   73.98%   73.57%   -0.42%     
==========================================
  Files         429      429              
  Lines       47385    47389       +4     
==========================================
- Hits        35059    34866     -193     
- Misses       9352     9543     +191     
- Partials     2974     2980       +6     
Flag Coverage Δ
unittests 73.57% <100.00%> (-0.42%) ⬇️


@rleungx rleungx requested a review from lhy1024 January 8, 2024 09:32
@ti-chi-bot ti-chi-bot bot added the status/LGT1 label (indicates that a PR has LGTM 1) Jan 8, 2024
@ti-chi-bot ti-chi-bot bot added the status/LGT2 label (indicates that a PR has LGTM 2) and removed the status/LGT1 label Jan 8, 2024
Member Author

rleungx commented Jan 9, 2024

/merge

Contributor

ti-chi-bot bot commented Jan 9, 2024

@rleungx: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Contributor

ti-chi-bot bot commented Jan 9, 2024

This pull request has been accepted and is ready to merge.

Commit hash: ca36936

@ti-chi-bot ti-chi-bot bot added the status/can-merge label (indicates a PR has been approved by a committer) Jan 9, 2024
@ti-chi-bot ti-chi-bot bot merged commit 562945e into tikv:master Jan 9, 2024
25 of 26 checks passed
@rleungx rleungx deleted the fix-prepare-checker branch January 9, 2024 04:13
pingandb pushed a commit to pingandb/pd that referenced this pull request Jan 18, 2024
close tikv#7671

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: pingandb <songge102@pingan.com.cn>
Labels
release-note-none, size/S, status/can-merge, status/LGT2
Development

Successfully merging this pull request may close these issues.

There will be lots of balance leader operators after enabling scheduling service
4 participants