Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revocation phase using durability policies #9659

Merged
merged 28 commits into from
Mar 1, 2022

Conversation

GuptaManan100
Copy link
Member

Description

This PR adds the ability to check for revocation using the durability policies in EmergencyReparentShard. What that means is that instead of exiting incase of more than 1 failure, ERS will first try and find whether it is safe to proceed even when we have multiple failures!
So, if the durability_policies allow for more than 1 failure, ERS will succeed!

Consider an example for the same. Let us say we are using semi_sync as our durability policy and we have four servers, A, B, C and D. A is our current primary, B and C are replicas and D is a rdonly tablet.
Lets say that A and D, both go down.

In the older version, ERS would have failed since there are more than 1 failures
But now, ERS will see that even with A and D down, we can proceed safely. Safety, here means that we can guarantee that the unreachable (equivalent to down) tablets cannot make forward progress. In this case, only A is eligible to be a primary and D cannot provide Semi_Sync ACKs, which means that A and D cannot make progress, so it is safe to continue with the ERS.

Related Issue(s)

Checklist

  • Should this PR be backported? No
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…emiSyncAcks for a given primary

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…event a tablet from making forward progress

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…o know that no primary capable tablet can accept any new write

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…ds additional tests

Signed-off-by: Manan Gupta <manan@planetscale.com>
…ds when 2 tablets are failing

Signed-off-by: Manan Gupta <manan@planetscale.com>
…lures

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…rwise they start overriding each others durability policies configuration

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from a first pass. I still need to understand the new Revoke funcs.

go/test/endtoend/reparent/utils/utils.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability_policies.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability_policies.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability_policies.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Show resolved Hide resolved
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100
Copy link
Member Author

@deepthi I have renamed the files back so it is easier to review and added the requested comments in the durabler interface.

go/vt/vtctl/reparentutil/durability.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/durability_funcs.go Outdated Show resolved Hide resolved
go/vt/vtctl/reparentutil/replication.go Show resolved Hide resolved
Signed-off-by: Manan Gupta <manan@planetscale.com>
Copy link
Contributor

@ajm188 ajm188 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some naming/testing suggestions! i think the rest is looking good :)

go/vt/topo/topoproto/tablet.go Outdated Show resolved Hide resolved
go/vt/topo/topoproto/tablet_test.go Show resolved Hide resolved
go/vt/topo/topoproto/tablet_test.go Outdated Show resolved Hide resolved
}

// RevokeForTablet checks whether we have reached enough tablets such that the given primary eligible tablet cannot accept any new writes
func RevokeForTablet(primaryEligible *topodatapb.Tablet, tabletsReached []*topodatapb.Tablet, allTablets []*topodatapb.Tablet) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for these functions that check and return bools, i'd suggest renaming to from <Thing> to Is<Thing> to make them read a little nicer where they are used

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about, haveRevokedForTablet? instead of isRevokedForTablet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the change for haveRevokedForTablet. If you feel the other name is better, I'll rename it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine by me! as long as it implies the boolean/question nature, and we're consistent with what we use for both, i'm happy :)

}

// Revoked checks whether we have reached enough tablets to guarantee that no tablet eligible to become a primary can accept any write
func Revoked(tabletsReached []*topodatapb.Tablet, allTablets []*topodatapb.Tablet) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same naming suggestion here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to haveRevoked

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100
Copy link
Member Author

Thankyou @deepthi and @ajm188 for the reviews! I have addressed all the review comments, could you take a look once, again?

@ajm188
Copy link
Contributor

ajm188 commented Feb 25, 2022

lgtm, but let's get deepthi's approval (and clean up the commit history length) before merging

@deepthi
Copy link
Member

deepthi commented Feb 25, 2022

lgtm, but let's get deepthi's approval (and clean up the commit history length) before merging

Do you mean squash some of the commits together?
If so, I recommend squashing only the last 5 (since we started review) either into one, or into whatever is the most relevant older commit. Typically we don't squash at all (though I'm well aware that this makes certain things like reverts and bisects that much harder).

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 merged commit 2024c03 into vitessio:main Mar 1, 2022
@GuptaManan100 GuptaManan100 deleted the durability-policy branch March 1, 2022 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cluster management Component: VTorc Vitess Orchestrator integration Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants