
[reparentutil / ERS] confirm at least one replica succeeded to SetMaster, or fail #7486

Merged
merged 6 commits into vitessio:master from am_ers_setmaster_errors on Feb 16, 2021

Conversation

@ajm188 (Contributor) commented Feb 11, 2021

Description

This introduces two background goroutines and contexts derived from
`context.Background()` to poll and signal for (1) at least one replica
finished successfully and (2) all replicas finished, regardless of
status, respectively.

The overall `promotePrimary` function returns error conditions based on
which of those contexts gets signaled first. We still fail if
`PromoteReplica` fails, which gets checked first, but this covers the
case where we're not running with semi-sync and none of the replicas
are able to reparent even if the new primary is able to populate its
reparent journal. (In the semi-sync case, if 0 replicas succeed to
`SetMaster`, then the primary will fail to `PromoteReplica`, so this was
already covered there.)
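
A minimal sketch of that pattern, using placeholder names (`promoteSketch`, `setMaster`) rather than the actual reparentutil code:

import (
	"context"
	"fmt"
	"sync"
)

// promoteSketch illustrates the two-context signaling: one context is canceled
// as soon as any replica succeeds, the other once every replica has finished,
// and the caller selects on whichever fires first.
func promoteSketch(replicas []string, setMaster func(string) error) error {
	// Both contexts derive from context.Background(), so they are only ever
	// canceled by the signaling goroutines below.
	replSuccessCtx, replSuccessCancel := context.WithCancel(context.Background())
	allReplicasDoneCtx, allReplicasDoneCancel := context.WithCancel(context.Background())
	defer replSuccessCancel()
	defer allReplicasDoneCancel()

	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)

	for _, replica := range replicas {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			if err := setMaster(r); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
				return
			}
			// Signal: at least one replica reparented successfully.
			replSuccessCancel()
		}(replica)
	}

	go func() {
		wg.Wait()
		// Signal: every replica finished, regardless of status.
		allReplicasDoneCancel()
	}()

	select {
	case <-replSuccessCtx.Done():
		// At least one replica was able to SetMaster successfully.
		return nil
	case <-allReplicasDoneCtx.Done():
		mu.Lock()
		defer mu.Unlock()
		if len(errs) >= len(replicas) {
			return fmt.Errorf("%d of %d replicas failed to SetMaster", len(errs), len(replicas))
		}
		// Both contexts can be canceled almost simultaneously; if any replica
		// succeeded, this is still a success.
		return nil
	}
}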

Fixes #7480

Signed-off-by: Andrew Mason <amason@slack-corp.com>

Related Issue(s)

Checklist

  • Should this PR be backported? no
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@ajm188 (Contributor, Author) commented Feb 11, 2021

Okay I seem to have made those tests flaky .... taking a look. Hilariously they passed the few times I ran them locally, but if I run them enough times in a row I can get them to fail in a few ways.

@ajm188 force-pushed the am_ers_setmaster_errors branch 3 times, most recently from 93fb928 to 35d1db6 on February 12, 2021 00:57
…l case

Signed-off-by: Andrew Mason <amason@slack-corp.com>
…r` and context cancelling

Also add a duplicated check on whether at least one replica succeeded,
to guard against racing goroutines between the `replSuccessCtx` and
`allReplicasDoneCtx` cancellations.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@doeg (Contributor) left a comment

This makes sense to me. The comments + tests are great, as usual. ✨

I'm not yet confident enough in my Vitess or Go concurrency skillz to give you an official approval on this, though, since ERS is... rather important! 😎

@deepthi or @setassociative would you mind reviewing this one, too, when you get a chance? 🙇‍♀️

@PrismaPhonic (Contributor) commented Feb 12, 2021

I'm a little concerned with encouraging use of EmergencyReparent in cases where a user isn't using semi-sync, because you would very likely experience data loss. Certainly I think it's questionable that we could select an appropriate replica on behalf of the user in those cases. We could have a --force mode that takes the tablet supplied by the user, and forces reparenting to it, with the acknowledgement that very bad things will probably happen in that case.

I realize this isn't actually related to what this PR does; it's more a reply to what you wrote in your description and the related issue. We should think about restricting the mode where we find the best candidate to only semi-sync use cases.

	// At least one replica was able to SetMaster successfully
	return nil
case <-allReplicasDoneCtx.Done():
	if len(rec.Errors) >= numReplicas {
Contributor

Why would the summed-up errors exceed the number of replicas? If it should never be greater, then this is an odd thing to write in code.

Contributor Author

My thinking was that being more permissive here means the code still works if handleReplica is ever updated to record multiple errors per replica.

Contributor

I think if that's the case, it should get updated when that happens. This code path implies this is a valid state of existence. (As a reader, I have to read into the comment to understand that this is simply defensive programming for a reality we don't live in yet.)

Member

I guess `if len(rec.Errors) == numReplicas` feels weirdly specific, but I agree with @PrismaPhonic that if we ever record more than one error from each goroutine, this will need to change anyway. So it is probably best to check for the specific condition.

Contributor Author

> This code path implies this is a valid state of existence.

The code directly states that this is an error state and not a valid state.

If I change this to `len(rec.Errors) == numReplicas`, then we can return success in places where we definitely shouldn't. Open to suggestions on how to make the comments clearer, though!

Contributor Author

How about:

switch {
case len(rec.Errors) == numReplicas:
    // original error case in question
case len(rec.Errors) > numReplicas:
    return vterrors.Wrapf(rec.Error(), "received more errors (= %d) than replicas (= %d), which should be impossible", len(rec.Errors), numReplicas)
default:
    return nil
}

Member

This lgtm.

Contributor

I'm still not seeing how we would return success in cases where we shouldn't, if we truly think that errors will never exceed the number of replicas. This does seem overly defensive to me, but I do think this newer version is better.

@@ -142,6 +142,10 @@ func FindValidEmergencyReparentCandidates(
// ReplicaWasRunning returns true if a StopReplicationStatus indicates that the
// replica had running replication threads before being stopped.
func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) bool {
	if stopStatus.Before == nil {
Contributor

I'm not sure false is the right thing to return here. If Before is nil, we can't know whether it was running or not. Not getting Before at all, IMO, should be a panic case.

Contributor Author

I think we shouldn't have code in the critical path of an ERS that can panic, because we'd either need to add recover defers and figure out how to recover from ... any point where the process can fail, or risk leaving the cluster in a worse state than when we started.

How about:

func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) (bool, error) {
	if stopStatus.Before == nil {
		return false, errors.New("...")
	}

	return stopStatus.Before.IoThreadRunning || stopStatus.Before.SqlThreadRunning, nil
}
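
For illustration, a hypothetical caller of this two-value form (`runningCandidates`, `statusMap`, and `candidates` are made-up names, and the snippet assumes the usual reparentutil imports of `vterrors` and `replicationdatapb`):

// Hypothetical caller sketch: the returned error lets ERS abort cleanly on a
// missing Before status instead of panicking partway through a reparent.
func runningCandidates(statusMap map[string]*replicationdatapb.StopReplicationStatus) ([]string, error) {
	var candidates []string
	for alias, stopStatus := range statusMap {
		wasRunning, err := ReplicaWasRunning(stopStatus)
		if err != nil {
			return nil, vterrors.Wrapf(err, "could not determine replication status for %v", alias)
		}
		if wasRunning {
			candidates = append(candidates, alias)
		}
	}
	return candidates, nil
}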

Member

Better error than panic :) This lgtm.

Contributor

This looks better

@deepthi (Member) left a comment

LGTM except for the comment on the comment.

// to signal when all replica goroutines have finished. In the case where at
// least one replica succeeds, replSuccessCtx will be canceled first, while
// allReplicasDoneCtx is guaranteed to be canceled within
// opts.WaitReplicasTimeout plus some jitter.
Member

Very nice. IIUC, the calls to SetMaster are bounded by WaitReplicasTimeout, which guarantees that allReplicasDoneCancel is eventually called.

Contributor Author

Precisely!
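
Roughly, the bound being discussed looks like this (an illustrative fragment, not the actual reparentutil code; setMasterOnReplica stands in for the real tmclient call):

// Every SetMaster call shares a context bounded by opts.WaitReplicasTimeout,
// so each goroutine returns within that window and the waiter below always
// reaches allReplicasDoneCancel.
replCtx, replCancel := context.WithTimeout(context.Background(), opts.WaitReplicasTimeout)
defer replCancel()

var wg sync.WaitGroup
for _, replica := range replicas {
	wg.Add(1)
	go func(r string) {
		defer wg.Done()
		_ = setMasterOnReplica(replCtx, r) // returns no later than WaitReplicasTimeout
	}(replica)
}

go func() {
	wg.Wait()
	allReplicasDoneCancel() // hence canceled within WaitReplicasTimeout, plus some jitter
}()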

Comment on lines 211 to 213
// finished. If one replica is slow, but another finishes quickly, the main
// thread of execution in this function while this goroutine will run until
// the parent context times out, without slowing down the flow of ERS.
Member

Can you rephrase this sentence? I had difficulty parsing it.

Contributor Author

Yeah, let me take a shot at editing that.

Contributor Author

Let me know if the newest version makes more sense, or if you have other suggestions!

Member

Much better.

@deepthi (Member) commented Feb 13, 2021

> I'm a little concerned with encouraging use of EmergencyReparent in cases where a user isn't using semi-sync, because you would very likely experience data loss. Certainly I think it's questionable that we could select an appropriate replica on behalf of the user in those cases. We could have a --force mode that takes the tablet supplied by the user, and forces reparenting to it, with the acknowledgement that very bad things will probably happen in that case.

I don't see that we are encouraging it. Can you elaborate?

@PrismaPhonic (Contributor)

@deepthi Just saw your reply. I edited my comment to clarify that I was referring to the author's comments in their description and the related issue.

@deepthi (Member) commented Feb 13, 2021

@PrismaPhonic Re semi-sync, I think we should also fix #7441
Right now semi-sync gets set only on REPLICA tablets.

@ajm188 (Contributor, Author) commented Feb 13, 2021

Re: semi-sync, happy to talk more about that outside the PR, but I don't think it has any bearing here. This doesn't change the behavior in that case; it only fixes the case where, if you were running semi-sync and all your replicas failed to SetMaster, you would have a bunch of goroutines (one per replica, per ERS where this condition happens) blocked forever on the vtctld until you restarted the service.

For the actual safety/integrity of your vttablet components (not running semi-sync), that is unchanged as a result of this PR.
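
As a generic illustration of that failure mode (not the actual pre-fix vitess code), a result channel with no remaining receiver leaks one goroutine per replica:

// Generic goroutine-leak illustration: each sender blocks forever once the
// receiver stops reading.
results := make(chan error) // unbuffered
for _, replica := range replicas {
	go func(r string) {
		results <- setMaster(r) // blocks until someone receives
	}(replica)
}

if err := <-results; err != nil {
	// Returning here abandons the channel: every goroutine still waiting to
	// send above stays blocked until the process restarts.
	return err
}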

Signed-off-by: Andrew Mason <amason@slack-corp.com>
…and callers

Signed-off-by: Andrew Mason <amason@slack-corp.com>
… cases

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@deepthi deepthi merged commit a520804 into vitessio:master Feb 16, 2021
@askdba askdba added this to the v10.0 milestone Feb 22, 2021
@ajm188 ajm188 deleted the am_ers_setmaster_errors branch March 4, 2021 16:32
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 8, 2021
This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative pushed a commit to tinyspeck/vitess that referenced this pull request Mar 8, 2021
[reparentutil / ERS] confirm at least one replica succeeded to `SetMaster`, or fail

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
Backport some panic protection during ERS

This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 17, 2021
Backport some panic protection during ERS

This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
@ajm188 ajm188 added this to In progress in Vtctld Service via automation May 23, 2021
@ajm188 ajm188 moved this from In progress to Done in Vtctld Service May 23, 2021

Successfully merging this pull request may close these issues.

[vtctl/reparentutil] During ERS, any errors from SetMaster on replica are lost