Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: Auto fix gc_worker's service safepoint for upgraded clusters #3371

Merged

Conversation

MyonKeminta
Copy link
Contributor

Signed-off-by: MyonKeminta MyonKeminta@users.noreply.github.com

What problem does this PR solve?

Fixes #3366

Though we did a fix in PR #3146 , there's still a problem for clusters upgraded from older version, where the gc_worker's service safepoint may be invalid or missing and there are other service safepoints in the cluster.

What is changed and how it works?

Checks if "gc_worker"'s service safepoint exist and has a infinite TTL every time loading safepoints, and tries to fix it if possible.

Check List

Tests

  • Unit test

Code changes

Side effects

  • Increased code complexity

Related changes

  • Need to cherry-pick to the release branch
    • release-4.0

Release note

  • No release note

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Copy link
Contributor Author

@lichunzhu @disksing @rleungx PTAL

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@codecov
Copy link

codecov bot commented Jan 20, 2021

Codecov Report

Merging #3371 (5fb63fc) into master (cd62c50) will decrease coverage by 0.02%.
The diff coverage is 68.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3371      +/-   ##
==========================================
- Coverage   74.95%   74.93%   -0.03%     
==========================================
  Files         243      243              
  Lines       23291    23304      +13     
==========================================
+ Hits        17457    17462       +5     
- Misses       4267     4276       +9     
+ Partials     1567     1566       -1     
Flag Coverage Δ
unittests 74.93% <68.75%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/core/storage.go 68.44% <66.66%> (+0.44%) ⬆️
server/grpc_service.go 57.42% <100.00%> (ø)
pkg/errs/errs.go 75.00% <0.00%> (-25.00%) ⬇️
server/tso/global_allocator.go 66.14% <0.00%> (-7.09%) ⬇️
server/region_syncer/server.go 83.20% <0.00%> (-6.11%) ⬇️
server/election/leadership.go 82.92% <0.00%> (-3.66%) ⬇️
pkg/etcdutil/etcdutil.go 84.70% <0.00%> (-3.53%) ⬇️
server/server.go 72.94% <0.00%> (-0.80%) ⬇️
server/config/persist_options.go 91.33% <0.00%> (-0.79%) ⬇️
server/cluster/coordinator.go 74.28% <0.00%> (-0.72%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd62c50...5fb63fc. Read the comment docs.

@MyonKeminta
Copy link
Contributor Author

Ping @lichunzhu @disksing @rleungx PTAL

c.Assert(err, IsNil)

// Force set invalid ttl to gc_worker
//err = s.srv.GetStorage().SaveServiceGCSafePoint(&core.ServiceSafePoint{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why comment it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry. I forgot to remove it.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
// It's a new cluster, or everything is lost so we have no way to recover it. Initialize gc_worker's service
// safepoint to zero.
_, err = s.initServiceGCSafePointForGCWorker(0)
return true, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the initialization is wrong, we still return true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I thought err will be not nil here so it doesn't matter 🤔 maybe it doesn't look good. I'll change it.


// gc_worker is missing.
_, err = s.initServiceGCSafePointForGCWorker(min)
return true, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do 1. check whether do we have gc_worker key 2. compute the min safepoint in one loop. After this loop, if 1 is false, we can insert the gc_worker key.

// in the older version may have invalid TTL for gc_worker's safepoint, and it also might be missing. gc_worker's
// safepoint may also be missing when the cluster is just bootstrapped. Detect these cases and fix gc_worker's safepoint
// if necessary.
func (s *Storage) fixGCWorkerServiceSafePpoint(allServiceSafePoints []*ServiceSafePoint) (modified bool, err error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (s *Storage) fixGCWorkerServiceSafePpoint(allServiceSafePoints []*ServiceSafePoint) (modified bool, err error) {
func (s *Storage) fixGCWorkerServiceSafePoint(allServiceSafePoints []*ServiceSafePoint) (modified bool, err error) {

if err := json.Unmarshal([]byte(values[i]), ssp); err != nil {
if modified {
// Reload the safepoints
keys, allServiceSafePoints, err = loadAll()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is just modifying the gcWorkerServiceSafePointID key okay?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's more, it sames that loadAll() will not affect the result of min... I think we can remove this logic and modified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user needs the minimum value among all non-expired safepoints, but I want to fix it with the minimum value including expired-but-not-deleted ones. So the final min value may decrease. It's possible to avoid reloading and do all these things in one loop, but trying to make the code more readable, I choose the less-effective way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Copy link
Contributor Author

@lichunzhu @rleungx PTAL again thx

Copy link

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

if !hasGCWorker {
// If there exists some service safepoints but gc_worker is missing, init it with the min value among all
// safepoints (including expired ones)
return s.initServiceGCSafePointForGCWorker(min)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return s.initServiceGCSafePointForGCWorker(min)
return s.initServiceGCSafePointForGCWorker(validMin.SafePoint)

}
}

if ssp.SafePoint < min {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ssp is expired, we may get a wrong min here. I suggest deleting min and keeping only validMin.

@ti-chi-bot
Copy link
Member

@lichunzhu: /lgtm is only allowed for the reviewers in list.

In response to this:

Rest LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Copy link
Contributor Author

@lichunzhu updated, PTAL again

Copy link

@lichunzhu lichunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot
Copy link
Member

@lichunzhu: /lgtm is only allowed for the reviewers in list.

In response to this:

LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 25, 2021
@MyonKeminta
Copy link
Contributor Author

This PR needs to be cherry-picked to release 4.0. I don't have permission to edit the label.

@rleungx rleungx added needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. require-LGT1 Indicates that the PR requires an LGTM. labels Jan 26, 2021
@rleungx
Copy link
Member

rleungx commented Jan 26, 2021

/merge

@ti-chi-bot
Copy link
Member

@rleungx: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 1903929

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 26, 2021
@ti-chi-bot
Copy link
Member

@MyonKeminta: Your PR has out-of-dated, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot merged commit cbb89ec into tikv:master Jan 26, 2021
ti-srebot pushed a commit to ti-srebot/pd that referenced this pull request Jan 26, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #3391

ti-srebot pushed a commit to ti-srebot/pd that referenced this pull request Jan 26, 2021
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Copy link
Contributor

cherry pick to release-5.0-rc in PR #3392

nolouch pushed a commit that referenced this pull request Jan 28, 2021
…3371) (#3391)

* cherry pick #3371 to release-4.0

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-cherry-pick-release-4.0 The PR needs to cherry pick to release-4.0 branch. require-LGT1 Indicates that the PR requires an LGTM. status/can-merge Indicates a PR has been approved by a committer. status/LGT1 Indicates that a PR has LGTM 1.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Service Safepoint with infinite TTL may keep disappearing after upgrading PD
5 participants