
[reparentutil] ERS should not attempt to WaitForRelayLogsToApply on primary tablets that were not running replication #7523

Merged
merged 4 commits into vitessio:master from ajm188:am_ers_delete_current_primary on Feb 24, 2021

Conversation

ajm188
Contributor

@ajm188 ajm188 commented Feb 19, 2021

Description

This was either in the implementation before last, or not there at all; I haven't gone back far enough in the history to check.
I've added tests as penance.

The root issue is that a MASTER tablet will return mysql.ErrNotReplica from tmc.StopReplicationAndGetStatus, which will cause us to add an entry for that tablet in the primaryStatusMap and not the statusMap, in StopReplicationAndBuildStatusMap. Then, when we attempt to wait for all valid candidates to apply their relay logs (which can include MASTER tablets), we end up passing a nil Status to WaitForRelayLogsToApply, causing the segfault.
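
To make the failure mode and the shape of the fix concrete, here is a minimal sketch; the type and function names below are simplified stand-ins for illustration, not the actual reparentutil signatures. The idea is that any candidate with no entry in statusMap (i.e. a tablet that was not running replication) is skipped rather than having its nil status passed to the wait:

// Sketch only: simplified stand-ins for the real reparentutil types.
type StopReplicationStatus struct {
	// Placeholder for the position data the real status carries.
	RelayLogPosition string
}

// waitForRelayLogs skips candidates that have no replication status instead
// of dereferencing a nil map entry and panicking.
func waitForRelayLogs(validCandidates []string, statusMap map[string]*StopReplicationStatus) {
	for _, alias := range validCandidates {
		status, ok := statusMap[alias]
		if !ok || status == nil {
			// This tablet was not replicating (it ended up in
			// primaryStatusMap instead), so there are no relay logs
			// to wait for; skip it.
			continue
		}
		// ... wait for this tablet's relay logs to apply using status ...
	}
}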

Related Issue(s)

Checklist

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

This was in the old implementation, and was overlooked in the port to an
encapsulated struct. I've added tests as penance.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@setassociative
Contributor

setassociative commented Feb 19, 2021

@ajm188 before digging into this: are the unit test failures concerning? They feel topical, given they're testing ERS, though it's not obvious whether all of the ERS tests are related to this change, since it's a different impl.

@ajm188
Contributor Author

ajm188 commented Feb 19, 2021

It's definitely worth looking into those, but they may be red herrings:

I0219 21:54:03.947626   16521 vtctlclient_process.go:149] Executing vtctlclient with command: vtctlclient -server localhost:16806 EmergencyReparentShard -keyspace_shard ks/0 -wait_replicas_timeout 30s
    reparent_test.go:137: 
        	Error Trace:	reparent_test.go:137
        	Error:      	Expected value not to be nil.
        	Test:       	TestReparentIgnoreReplicas
        	Messages:   	W0219 21:54:03.965804   28276 main.go:67] W0219 21:54:03.965656 replication.go:205] failed to get replication status from zone1-0000000103: rpc error: code = Canceled desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp :0: connect: connection refused"
--- FAIL: TestReparentIgnoreReplicas (50.45s)

@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

Hmmm, @setassociative, looking again at #7464 (in particular here), it seems this block I've added actually didn't exist anymore, and was removed in 6449 (I was looking at tinyspeck/master and not vitessio just prior to my first refactor).

Anyway, the reason I noticed this was that, while writing tests for adding ERS to the new VtctldServer API, if I tried to ERS with a reachable shard primary, I got a segfault on a nil pointer. The reason is that we first construct two maps: statusMap, a mapping of tablet alias -> StopReplicationStatus for all the replicas in the shard, and primaryStatusMap, a mapping of tablet alias -> MasterStatus for any tablets that claim to be MASTER. We then pass only the statusMap through to waitForAllRelayLogsToApply, which results in a nil Status when we eventually call WaitForRelayLogsToApply. As best as I can tell, this was true prior to my refactor as well.

I'm confident this is the "issue" I saw, but I'm not confident whether it's an actual issue or me accidentally creating an impossible situation in my test setup (either way, we should add a guard against nil input so we don't segfault during ERS), nor how best to handle it. I think what we actually want is to just skip the MASTERs, because they aren't applying relay logs since they are not replicating, so something like:

func (erp *EmergencyReparenter) waitForAllRelayLogsToApply(/* blah blah blah */) (/* blah blah*/) {
	errCh := make(chan error)
	defer close(errCh)

	groupCtx, groupCancel := context.WithTimeout(ctx, opts.WaitReplicasTimeout)
	defer groupCancel()

	replicaCount := 0

	for candidate := range validCandidates {
		tablet, ok := tabletMap[candidate]
		// handle !ok and return error
		if tablet.Type == topodatapb.TabletType_MASTER {
			// log
			continue
		}

		go func(alias string) {
			var err error
			defer func() { errCh <- err }()
			err = WaitForRelayLogsToApply(groupCtx, erp.tmc, tabletMap[alias], statusMap[alias])
		}(candidate)

		replicaCount++
	}

	errgroup := concurrency.ErrorGroup{
		NumGoroutines:        replicaCount,
		NumRequiredSuccesses: replicaCount,
		NumAllowedErrors:     0,
	}
	rec := errgroup.Wait(groupCancel, errCh)

	// rest of the function unchanged
}
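
As for the nil-input guard mentioned above, a minimal sketch of that idea (again with a simplified stand-in type, the standard errors package, and imports elided; this is not the real WaitForRelayLogsToApply signature) could be:

// Sketch only: a stand-in status type and a defensive nil check, so a
// missing status surfaces as an error instead of a panic during ERS.
type stopReplicationStatus struct{ relayLogPosition string }

func waitForRelayLogsGuarded(status *stopReplicationStatus) error {
	if status == nil {
		return errors.New("no replication status for tablet; cannot wait for relay logs to apply")
	}
	// ... existing wait logic using status ...
	return nil
}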

cc @deepthi @PrismaPhonic to double-check my read here

@PrismaPhonic
Contributor

The design is VERY intentionally built around not assuming who the masters are. This allows us to handle a host of edge cases where a tablet thinks it's master when it's really not.

Contributor

@PrismaPhonic PrismaPhonic left a comment


I'm blocking on this until we can discuss it further, because it pushes against the design by trying to know who the master is. That was a very important part of our re-design (intentionally not trying to assume who the master was).

@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

I am well aware of that design consideration. In my most recent comment, I point out that the change currently in this PR is not the change we want:

it seems this block I've added actually didn't exist anymore, and was removed in 6449 (I was looking at tinyspeck/master and not vitessio just prior to my first refactor).

The actual issue, as I attempted to describe in my most recent comment, is that a MASTER tablet will return mysql.ErrNotReplica from tmc.StopReplicationAndGetStatus, which will cause us to add an entry for that tablet in the primaryStatusMap and not the statusMap, in StopReplicationAndBuildStatusMap. Then, when we attempt to wait for all valid candidates to apply their relay logs (which can include MASTER tablets), we end up passing a nil Status to WaitForRelayLogsToApply, causing the segfault.

I agree that not making assumptions about who the primary is remains an important aspect of the design, but this needs to be fixed, as crashing during an ERS is also quite dangerous to the health of the cluster the ERS is trying to salvage.

…eration"

This reverts commit c5bbcc7.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
…ng StopReplication phase

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

I've pushed, in order:

  1. A revert of the original change
  2. A minimal test case that caused the panic, to demonstrate the problem we are trying to solve here.
  3. A fix for that test case, with commentary.

Assuming we're all on board with this, I'll update the PR description and dig into any other tests that broke as a result; and, again, assuming everything's good, I'll rebase when merging the final version of this to remove the original commit and its revert, to clean up the git branch.
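
Purely as illustration of the panic in item 2 (this is not the actual test added in this PR, and the tablet alias below is made up), the failure mode boils down to dereferencing a map entry that was never populated for the non-replicating primary:

// Toy reproduction of the failure mode, using a simplified stand-in type
// and the standard testing package.
func TestNilStatusDereferencePanics(t *testing.T) {
	type stopReplicationStatus struct{ relayLogPosition string }

	// The primary never gets an entry in statusMap; it lands in
	// primaryStatusMap instead.
	statusMap := map[string]*stopReplicationStatus{}

	defer func() {
		if r := recover(); r == nil {
			t.Fatal("expected a panic from dereferencing a nil status")
		}
	}()

	status := statusMap["zone1-0000000100"] // nil for the non-replicating primary
	_ = status.relayLogPosition             // panics, mirroring the ERS segfault
}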

@deepthi
Member

deepthi commented Feb 22, 2021

LGTM
@ajm188 can you update the title and description of the PR to better reflect the changes? Thanks!

@ajm188 ajm188 changed the title [reparentutil] ERS should remove currently-serving shard primary from consideration [reparentutil] ERS should not attempt to WaitForRelayLogsToApply on primary tablets that were not running replication Feb 22, 2021
Contributor

@PrismaPhonic PrismaPhonic left a comment


This looks like a much better approach. Thank you!

@deepthi deepthi merged commit 6403489 into vitessio:master Feb 24, 2021
@askdba askdba added this to the v10.0 milestone Feb 26, 2021
@ajm188 ajm188 deleted the am_ers_delete_current_primary branch February 27, 2021 00:55
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 17, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
@ajm188 ajm188 added this to In progress in Vtctld Service via automation May 23, 2021
@ajm188 ajm188 moved this from In progress to Done in Vtctld Service May 23, 2021