
[reparentutil / ERS] confirm at least one replica succeeded to SetMaster, or fail #7486

Merged
merged 6 commits into vitessio:master from am_ers_setmaster_errors on Feb 16, 2021

Conversation

@ajm188 (Contributor) commented Feb 11, 2021

Description

This introduces two background goroutines and contexts derived from
`context.Background()` to poll and signal for (1) at least one replica
finished successfully and (2) all replicas finished, regardless of
status, respectively.

The overall `promotePrimary` function returns error conditions based on
which of those contexts gets signaled first. We still fail if
`PromoteReplica` fails, which gets checked first, but this covers the
case where we're not running with semi-sync and none of the replicas
are able to reparent even if the new primary is able to populate its
reparent journal. (In the semi-sync case, if 0 replicas succeed to
`SetMaster`, then the primary will fail to `PromoteReplica`, so this was
already covered there.)
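
A minimal sketch of that pattern, using placeholder names (`promoteSketch`, `setMaster`) rather than the actual reparentutil code:

import (
	"context"
	"fmt"
	"sync"
)

// promoteSketch illustrates the two-context signaling: one context is canceled
// as soon as any replica succeeds, the other once every replica has finished,
// and the caller selects on whichever fires first.
func promoteSketch(replicas []string, setMaster func(string) error) error {
	// Both contexts derive from context.Background(), so they are only ever
	// canceled by the signaling goroutines below.
	replSuccessCtx, replSuccessCancel := context.WithCancel(context.Background())
	allReplicasDoneCtx, allReplicasDoneCancel := context.WithCancel(context.Background())
	defer replSuccessCancel()
	defer allReplicasDoneCancel()

	var (
		mu   sync.Mutex
		errs []error
		wg   sync.WaitGroup
	)

	for _, replica := range replicas {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			if err := setMaster(r); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
				return
			}
			// Signal: at least one replica reparented successfully.
			replSuccessCancel()
		}(replica)
	}

	go func() {
		wg.Wait()
		// Signal: every replica finished, regardless of status.
		allReplicasDoneCancel()
	}()

	select {
	case <-replSuccessCtx.Done():
		// At least one replica was able to SetMaster successfully.
		return nil
	case <-allReplicasDoneCtx.Done():
		mu.Lock()
		defer mu.Unlock()
		if len(errs) >= len(replicas) {
			return fmt.Errorf("%d of %d replicas failed to SetMaster", len(errs), len(replicas))
		}
		// Both contexts can be canceled almost simultaneously; if any replica
		// succeeded, this is still a success.
		return nil
	}
}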

Fixes #7480

Signed-off-by: Andrew Mason <amason@slack-corp.com>

Related Issue(s)

Checklist

  • Should this PR be backported? no
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@ajm188 (Contributor, Author) commented Feb 11, 2021

Okay I seem to have made those tests flaky .... taking a look. Hilariously they passed the few times I ran them locally, but if I run them enough times in a row I can get them to fail in a few ways.

@ajm188 force-pushed the am_ers_setmaster_errors branch 3 times, most recently from 93fb928 to 35d1db6 on February 12, 2021 00:57
…l case

Signed-off-by: Andrew Mason <amason@slack-corp.com>
…r` and context cancelling

Also add a duplicated check on whether at least one replica succeeded,
to guard against racing goroutines between the `replSuccessCtx` and
`allReplicasDoneCtx` cancellations.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@doeg (Contributor) left a comment

This makes sense to me. The comments + tests are great, as usual. ✨

I'm not yet confident enough in my Vitess or Go concurrency skillz to give you an official approval on this, though, since ERS is... rather important! 😎

@deepthi or @setassociative would you mind reviewing this one, too, when you get a chance? 🙇‍♀️

@PrismaPhonic (Contributor) commented Feb 12, 2021

I'm a little concerned with encouraging use of EmergencyReparent in cases where a user isn't using semi-sync, because you would very likely experience data loss. Certainly I think it's questionable that we could select an appropriate replica on behalf of the user in those cases. We could have a --force mode that takes the tablet supplied by the user, and forces reparenting to it, with the acknowledgement that very bad things will probably happen in that case.

I realize this isn't actually related to what this PR does; it's more a reply to what you wrote in your description and the related issue. We should think about restricting the mode where we find the best candidate to only semi-sync use cases.

	// At least one replica was able to SetMaster successfully
	return nil
case <-allReplicasDoneCtx.Done():
	if len(rec.Errors) >= numReplicas {
Contributor

Why would the summed-up errors exceed the number of replicas? If it should never be greater, then this is an odd thing to write in code.

Contributor Author

My thinking was that being more permissive here means the code still works if handleReplica is ever updated to record multiple errors per replica.

Contributor

I think if that's the case, it should get updated when that happens. This code path implies this is a valid state of existence. (As a reader, I have to read into the comment to understand that this is simply defensive programming for a reality we don't live in yet.)

Member

I guess `if len(rec.Errors) == numReplicas` feels weirdly specific, but I agree with @PrismaPhonic that if we ever record more than one error from each goroutine, this will need to change anyway. So it is probably best to check for the specific condition.

Contributor Author

> This code path implies this is a valid state of existence.

The code directly states that this is an error state and not a valid state.

If I change this to `len(rec.Errors) == numReplicas`, then we can return success in places where we definitely shouldn't. Open to suggestions on how to make the comments clearer, though!

Contributor Author

How about:

switch {
case len(rec.Errors) == numReplicas:
    // original error case in question
case len(rec.Errors) > numReplicas:
    return vterrors.Wrapf(rec.Error(), "received more errors (= %d) than replicas (= %d), which should be impossible", len(rec.Errors), numReplicas)
default:
    return nil
}

Member

This lgtm.

Contributor

I'm still not seeing how we would return success in cases where we shouldn't, if we truly think that errors will never exceed the number of replicas. This does seem overly defensive to me, but I do think this newer version is better.

@@ -142,6 +142,10 @@ func FindValidEmergencyReparentCandidates(
// ReplicaWasRunning returns true if a StopReplicationStatus indicates that the
// replica had running replication threads before being stopped.
func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) bool {
	if stopStatus.Before == nil {
Contributor

I'm not sure false is the right thing to return here. If Before is nil, we can't know whether it was running or not. Not getting Before at all, IMO, should be a panic case.

Contributor Author

I think we shouldn't have code in the critical path of an ERS that can panic, because we'd either need to add recover defers and figure out how to recover from ... any point where the process can fail, or risk leaving the cluster in a worse state than when we started.

How about:

func ReplicaWasRunning(stopStatus *replicationdatapb.StopReplicationStatus) (bool, error) {
	if stopStatus.Before == nil {
		return false, errors.New("...")
	}

	return stopStatus.Before.IoThreadRunning || stopStatus.Before.SqlThreadRunning, nil
}
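
For illustration, a hypothetical caller of this two-value form (`runningCandidates`, `statusMap`, and `candidates` are made-up names, and the snippet assumes the usual reparentutil imports of `vterrors` and `replicationdatapb`):

// Hypothetical caller sketch: the returned error lets ERS abort cleanly on a
// missing Before status instead of panicking partway through a reparent.
func runningCandidates(statusMap map[string]*replicationdatapb.StopReplicationStatus) ([]string, error) {
	var candidates []string
	for alias, stopStatus := range statusMap {
		wasRunning, err := ReplicaWasRunning(stopStatus)
		if err != nil {
			return nil, vterrors.Wrapf(err, "could not determine replication status for %v", alias)
		}
		if wasRunning {
			candidates = append(candidates, alias)
		}
	}
	return candidates, nil
}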

Member

Better error than panic :) This lgtm.

Contributor

This looks better

@deepthi (Member) left a comment

LGTM except for the comment on the comment.

// to signal when all replica goroutines have finished. In the case where at
// least one replica succeeds, replSuccessCtx will be canceled first, while
// allReplicasDoneCtx is guaranteed to be canceled within
// opts.WaitReplicasTimeout plus some jitter.
Member

Very nice. IIUC, the calls to SetMaster are bounded by WaitReplicasTimeout, which guarantees that allReplicasDoneCancel is eventually called.

Contributor Author

Precisely!
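
Roughly, the bound being discussed looks like this (an illustrative fragment, not the actual reparentutil code; setMasterOnReplica stands in for the real tmclient call):

// Every SetMaster call shares a context bounded by opts.WaitReplicasTimeout,
// so each goroutine returns within that window and the waiter below always
// reaches allReplicasDoneCancel.
replCtx, replCancel := context.WithTimeout(context.Background(), opts.WaitReplicasTimeout)
defer replCancel()

var wg sync.WaitGroup
for _, replica := range replicas {
	wg.Add(1)
	go func(r string) {
		defer wg.Done()
		_ = setMasterOnReplica(replCtx, r) // returns no later than WaitReplicasTimeout
	}(replica)
}

go func() {
	wg.Wait()
	allReplicasDoneCancel() // hence canceled within WaitReplicasTimeout, plus some jitter
}()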

Comment on lines 211 to 213
// finished. If one replica is slow, but another finishes quickly, the main
// thread of execution in this function while this goroutine will run until
// the parent context times out, without slowing down the flow of ERS.
Member

Can you rephrase this sentence? I had difficulty parsing it.

Contributor Author

Yeah, let me take a shot at editing that.

Contributor Author

Let me know if the newest version makes more sense, or if you have other suggestions!

Member

Much better.

@deepthi (Member) commented Feb 13, 2021

> I'm a little concerned with encouraging use of EmergencyReparent in cases where a user isn't using semi-sync, because you would very likely experience data loss. Certainly I think it's questionable that we could select an appropriate replica on behalf of the user in those cases. We could have a --force mode that takes the tablet supplied by the user, and forces reparenting to it, with the acknowledgement that very bad things will probably happen in that case.

I don't see that we are encouraging it. Can you elaborate?

@PrismaPhonic (Contributor)

@deepthi Just saw your reply. I edited my comment to clarify that I was referring to the author's comments in their description and the related issue.

@deepthi (Member) commented Feb 13, 2021

@PrismaPhonic Re semi-sync, I think we should also fix #7441
Right now semi-sync gets set only on REPLICA tablets.

@ajm188 (Contributor, Author) commented Feb 13, 2021

Re: semi-sync, happy to talk more about that outside the PR, but I don't think it has any bearing here. This doesn't change the behavior in that case; it only fixes the case where, if you were running semi-sync and all your replicas failed to SetMaster, you would have a bunch of goroutines (one per replica, per ERS where this condition happens) blocked forever on the vtctld until you restarted the service.

For the actual safety/integrity of your vttablet components (not running semi-sync), that is unchanged as a result of this PR.
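
As a generic illustration of that failure mode (not the actual pre-fix vitess code), a result channel with no remaining receiver leaks one goroutine per replica:

// Generic goroutine-leak illustration: each sender blocks forever once the
// receiver stops reading.
results := make(chan error) // unbuffered
for _, replica := range replicas {
	go func(r string) {
		results <- setMaster(r) // blocks until someone receives
	}(replica)
}

if err := <-results; err != nil {
	// Returning here abandons the channel: every goroutine still waiting to
	// send above stays blocked until the process restarts.
	return err
}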

Signed-off-by: Andrew Mason <amason@slack-corp.com>
…and callers

Signed-off-by: Andrew Mason <amason@slack-corp.com>
… cases

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@deepthi deepthi merged commit a520804 into vitessio:master Feb 16, 2021
@askdba askdba added this to the v10.0 milestone Feb 22, 2021
@ajm188 ajm188 deleted the am_ers_setmaster_errors branch March 4, 2021 16:32
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 8, 2021
This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative pushed a commit to tinyspeck/vitess that referenced this pull request Mar 8, 2021
[reparentutil / ERS] confirm at least one replica succeeded to `SetMaster`, or fail

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
Backport some panic protection during ERS

This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 17, 2021
Backport some panic protection during ERS

This takes the core of the change from vitessio#7486 and backports it into 8.0.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
@ajm188 ajm188 added this to In progress in Vtctld Service via automation May 23, 2021
@ajm188 ajm188 moved this from In progress to Done in Vtctld Service May 23, 2021

Successfully merging this pull request may close these issues.

[vtctl/reparentutil] During ERS, any errors from SetMaster on replica are lost