
[reparentutil] ERS should not attempt to WaitForRelayLogsToApply on primary tablets that were not running replication #7523

Merged
merged 4 commits into vitessio:master from ajm188:am_ers_delete_current_primary on Feb 24, 2021

Conversation

ajm188
Contributor

@ajm188 ajm188 commented Feb 19, 2021

Description

This was either in the implementation before last, or not there at all; I haven't gone back far enough in the history to check.
I've added tests as penance.

The root issue is that a MASTER tablet will return mysql.ErrNotReplica from tmc.StopReplicationAndGetStatus, which will cause us to add an entry for that tablet in the primaryStatusMap and not the statusMap, in StopReplicationAndBuildStatusMap. Then, when we attempt to wait for all valid candidates to apply their relay logs (which can include MASTER tablets), we end up passing a nil Status to WaitForRelayLogsToApply, causing the segfault.
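
To make the failure mode and the shape of the fix concrete, here is a minimal sketch; the type and function names below are simplified stand-ins for illustration, not the actual reparentutil signatures. The idea is that any candidate with no entry in statusMap (i.e. a tablet that was not running replication) is skipped rather than having its nil status passed to the wait:

// Sketch only: simplified stand-ins for the real reparentutil types.
type StopReplicationStatus struct {
	// Placeholder for the position data the real status carries.
	RelayLogPosition string
}

// waitForRelayLogs skips candidates that have no replication status instead
// of dereferencing a nil map entry and panicking.
func waitForRelayLogs(validCandidates []string, statusMap map[string]*StopReplicationStatus) {
	for _, alias := range validCandidates {
		status, ok := statusMap[alias]
		if !ok || status == nil {
			// This tablet was not replicating (it ended up in
			// primaryStatusMap instead), so there are no relay logs
			// to wait for; skip it.
			continue
		}
		// ... wait for this tablet's relay logs to apply using status ...
	}
}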

Related Issue(s)

Checklist

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

This was in the old implementation, and was overlooked in the port to an
encapsulated struct. I've added tests as penance.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@setassociative
Contributor

setassociative commented Feb 19, 2021

@ajm188 before digging into this: are the unit test failures concerning? They feel topical, given they're testing ERS, though it's not obvious whether all of the ERS tests are related to this change, since it's a different impl.

@ajm188
Contributor Author

ajm188 commented Feb 19, 2021

It's definitely worth looking into those, but they may be red herrings:

I0219 21:54:03.947626   16521 vtctlclient_process.go:149] Executing vtctlclient with command: vtctlclient -server localhost:16806 EmergencyReparentShard -keyspace_shard ks/0 -wait_replicas_timeout 30s
    reparent_test.go:137: 
        	Error Trace:	reparent_test.go:137
        	Error:      	Expected value not to be nil.
        	Test:       	TestReparentIgnoreReplicas
        	Messages:   	W0219 21:54:03.965804   28276 main.go:67] W0219 21:54:03.965656 replication.go:205] failed to get replication status from zone1-0000000103: rpc error: code = Canceled desc = latest connection error: connection error: desc = "transport: Error while dialing dial tcp :0: connect: connection refused"
--- FAIL: TestReparentIgnoreReplicas (50.45s)

@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

Hmmm, @setassociative, looking again at #7464 (in particular here), it seems this block I've added actually didn't exist anymore, and was removed in 6449 (I was looking at tinyspeck/master and not vitessio just prior to my first refactor).

Anyway, the reason I noticed this was that, while writing tests for adding ERS to the new VtctldServer API, if I tried to ERS with a reachable shard primary, I got a segfault on a nil pointer. The reason is that we first construct two maps: statusMap, a mapping of tablet alias -> StopReplicationStatus for all the replicas in the shard, and primaryStatusMap, a mapping of tablet alias -> MasterStatus for any tablets that claim to be MASTER. We then pass only the statusMap through to waitForAllRelayLogsToApply, which results in a nil Status when we eventually call WaitForRelayLogsToApply. As best as I can tell, this was true prior to my refactor as well.

I'm confident this is the "issue" I saw, but I'm not confident whether it's an actual issue or me accidentally creating an impossible situation in my test setup (either way, we should add a guard against nil input so we don't segfault during ERS), nor how best to handle it. I think what we actually want is to just skip the MASTERs, because they aren't applying relay logs since they are not replicating, so something like:

func (erp *EmergencyReparenter) waitForAllRelayLogsToApply(/* blah blah blah */) (/* blah blah*/) {
	errCh := make(chan error)
	defer close(errCh)

	groupCtx, groupCancel := context.WithTimeout(ctx, opts.WaitReplicasTimeout)
	defer groupCancel()

	replicaCount := 0

	for candidate := range validCandidates {
		tablet, ok := tabletMap[candidate]
		// handle !ok and return error
		if tablet.Type == topodatapb.TabletType_MASTER {
			// log
			continue
		}

		go func(alias string) {
			var err error
			defer func() { errCh <- err }()
			err = WaitForRelayLogsToApply(groupCtx, erp.tmc, tabletMap[alias], statusMap[alias])
		}(candidate)

		replicaCount++
	}

	errgroup := concurrency.ErrorGroup{
		NumGoroutines:        replicaCount,
		NumRequiredSuccesses: replicaCount,
		NumAllowedErrors:     0,
	}
	rec := errgroup.Wait(groupCancel, errCh)

	// rest of the function unchanged
}
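
As for the nil-input guard mentioned above, a minimal sketch of that idea (again with a simplified stand-in type, the standard errors package, and imports elided; this is not the real WaitForRelayLogsToApply signature) could be:

// Sketch only: a stand-in status type and a defensive nil check, so a
// missing status surfaces as an error instead of a panic during ERS.
type stopReplicationStatus struct{ relayLogPosition string }

func waitForRelayLogsGuarded(status *stopReplicationStatus) error {
	if status == nil {
		return errors.New("no replication status for tablet; cannot wait for relay logs to apply")
	}
	// ... existing wait logic using status ...
	return nil
}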

cc @deepthi @PrismaPhonic to double-check my read here

@PrismaPhonic
Contributor

The design is VERY intentionally built around not assuming who the masters are. This allows us to handle a host of edge cases where a tablet thinks it's master when it's really not.

Contributor

@PrismaPhonic PrismaPhonic left a comment


I'm blocking on this until we can discuss it further, because it pushes against the design by trying to know who the master is. That was a very important part of our re-design (intentionally not trying to assume who the master was).

@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

I am well aware of that design consideration. In my most recent comment, I point out that the change currently in this PR is not the change we want:

it seems this block I've added actually didn't exist anymore, and was removed in 6449 (I was looking at tinyspeck/master and not vitessio just prior to my first refactor).

The actual issue, as I attempted to describe in my most recent comment, is that a MASTER tablet will return mysql.ErrNotReplica from tmc.StopReplicationAndGetStatus, which will cause us to add an entry for that tablet in the primaryStatusMap and not the statusMap, in StopReplicationAndBuildStatusMap. Then, when we attempt to wait for all valid candidates to apply their relay logs (which can include MASTER tablets), we end up passing a nil Status to WaitForRelayLogsToApply, causing the segfault.

I agree that not making assumptions about who the primary is remains an important aspect of the design, but this needs to be fixed, as crashing during an ERS is also quite dangerous to the health of the cluster the ERS is trying to salvage.

…eration"

This reverts commit c5bbcc7.

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Signed-off-by: Andrew Mason <amason@slack-corp.com>
…ng StopReplication phase

Signed-off-by: Andrew Mason <amason@slack-corp.com>
@ajm188
Contributor Author

ajm188 commented Feb 20, 2021

I've pushed, in order:

  1. A revert of the original change
  2. A minimal test case that caused the panic, to demonstrate the problem we are trying to solve here.
  3. A fix for that test case, with commentary.

Assuming we're all on board with this, I'll update the PR description and dig into any other tests that broke as a result; and, again, assuming everything's good, I'll rebase when merging the final version of this to remove the original commit and its revert, to clean up the git branch.
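
Purely as illustration of the panic in item 2 (this is not the actual test added in this PR, and the tablet alias below is made up), the failure mode boils down to dereferencing a map entry that was never populated for the non-replicating primary:

// Toy reproduction of the failure mode, using a simplified stand-in type
// and the standard testing package.
func TestNilStatusDereferencePanics(t *testing.T) {
	type stopReplicationStatus struct{ relayLogPosition string }

	// The primary never gets an entry in statusMap; it lands in
	// primaryStatusMap instead.
	statusMap := map[string]*stopReplicationStatus{}

	defer func() {
		if r := recover(); r == nil {
			t.Fatal("expected a panic from dereferencing a nil status")
		}
	}()

	status := statusMap["zone1-0000000100"] // nil for the non-replicating primary
	_ = status.relayLogPosition             // panics, mirroring the ERS segfault
}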

@deepthi
Member

deepthi commented Feb 22, 2021

LGTM
@ajm188 can you update the title and description of the PR to better reflect the changes? Thanks!

@ajm188 ajm188 changed the title [reparentutil] ERS should remove currently-serving shard primary from consideration [reparentutil] ERS should not attempt to WaitForRelayLogsToApply on primary tablets that were not running replication Feb 22, 2021
Contributor

@PrismaPhonic PrismaPhonic left a comment


This looks like a much better approach. Thank you!

@deepthi deepthi merged commit 6403489 into vitessio:master Feb 24, 2021
@askdba askdba added this to the v10.0 milestone Feb 26, 2021
@ajm188 ajm188 deleted the am_ers_delete_current_primary branch February 27, 2021 00:55
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 9, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
setassociative added a commit to tinyspeck/vitess that referenced this pull request Mar 17, 2021
OP vitessio#7523

This basically protects from trying to catch up on replication on hosts
that are likely not replicating.

Signed-off-by: Richard Bailey <rbailey@slack-corp.com>
@ajm188 ajm188 added this to In progress in Vtctld Service via automation May 23, 2021
@ajm188 ajm188 moved this from In progress to Done in Vtctld Service May 23, 2021