Fix initialization code to also stop replication to prevent crash #12534

GuptaManan100 · 2023-03-01T19:01:20Z

Description

This PR fixes the bug described in #12533 by making the initialization code stop replication before it runs RESET SLAVE ALL. The old code assumed that replication was already stopped when a new vttablet came up, but that assumption is incorrect. We should always stop replication before we try to change the primary hostname and port.
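
As a rough illustration of the ordering described above, here is a minimal, self-contained sketch. It is not the Vitess implementation: `buildSourceCommands` is a hypothetical helper, and the SQL strings are simplified placeholders for whatever statements the real connection generates.

```go
package main

import "fmt"

// buildSourceCommands is a hypothetical helper (not a Vitess function) that
// assembles the statements needed to point a replica at a new source.
// The SQL strings are simplified placeholders used only for illustration.
func buildSourceCommands(host string, port int, stopBefore, startAfter bool) []string {
	cmds := []string{}
	if stopBefore {
		// Stop replication first: resetting fails while replication threads run.
		cmds = append(cmds, "STOP SLAVE")
	}
	cmds = append(cmds, "RESET SLAVE ALL")
	cmds = append(cmds, fmt.Sprintf("CHANGE MASTER TO MASTER_HOST = '%s', MASTER_PORT = %d", host, port))
	if startAfter {
		cmds = append(cmds, "START SLAVE")
	}
	return cmds
}

func main() {
	// With stopBefore set, the stop statement always precedes the reset.
	for _, c := range buildSourceCommands("primary.example", 3306, true, true) {
		fmt.Println(c)
	}
}
```

Calling the helper with `stopBefore` set to false yields a list that begins with the reset statement, which is the ordering that fails with errno 3081 when replication threads are still running.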

This PR is a follow-up to: #10881 which was originally created to resolve #10880.

Related Issue(s)

Fixes Bug Report: Vttablet restart when MySQL replication is running crashes #12533

Checklist

"Backport to:" labels have been added if this change should be back-ported
Tests were added or are not required
Did the new or modified tests pass consistently locally and on the CI
Documentation was added or is not required

Deployment Notes

Signed-off-by: Manan Gupta <manan@planetscale.com>

vitess-bot · 2023-03-01T19:01:24Z

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

Ensure that the Pull Request has a descriptive title.
If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
If a test is added or modified, there should be documentation at the top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

Is it really necessary to add this flag?
Flag names should be clear and intuitive (as far as possible)
Help text should be descriptive.
Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

Each item in Jobs should be named in order to mark it as required.
If the workflow should be required, the maintainer team should be notified.

Bug fixes

There should be at least one unit or end-to-end test.
The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

There should be some code comments as to why things are implemented the way they are.

New/Existing features

Should be documented, either by modifying the existing documentation or creating new documentation.
New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

Protobuf changes should be wire-compatible.
Changes to _vt tables and RPCs need to be backward compatible.
vtctl command output order should be stable and awk-able.
RPC changes should be compatible with vitess-operator
If a flag is removed, then it should also be removed from VTop, if used there.

mattlord

Approving as it's definitely an improvement. I'm not certain, however, that this eliminates the general race condition, as a human, tablet repair, vtorc, etc. could still start replication in between, right? I feel like perhaps we should keep retrying to achieve the desired config and error out after a number of attempts (always checking first whether the desired config is already in place).
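
A retry loop of the kind suggested here might look roughly like the sketch below. Everything in it is an assumption made for illustration: `replicationConfigurer`, `ensureConfigured`, and `flakyConfigurer` are hypothetical names and do not exist in Vitess.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// replicationConfigurer is a hypothetical abstraction for this sketch;
// it is not an interface that exists in Vitess.
type replicationConfigurer interface {
	// IsConfigured reports whether the desired source is already set.
	IsConfigured() (bool, error)
	// Configure makes one attempt to apply the desired configuration.
	Configure() error
}

// ensureConfigured retries Configure up to maxAttempts times, checking before
// each attempt whether the desired state has already been reached.
func ensureConfigured(c replicationConfigurer, maxAttempts int, delay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		done, err := c.IsConfigured()
		if err == nil && done {
			return nil // desired config already in place, nothing to do
		}
		if err == nil {
			if err = c.Configure(); err == nil {
				return nil
			}
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("replication not configured after %d attempts: %w", maxAttempts, lastErr)
}

// flakyConfigurer is a test double that fails a fixed number of times.
type flakyConfigurer struct {
	failuresLeft int
	calls        int
	configured   bool
}

func (f *flakyConfigurer) IsConfigured() (bool, error) { return f.configured, nil }

func (f *flakyConfigurer) Configure() error {
	f.calls++
	if f.failuresLeft > 0 {
		f.failuresLeft--
		return errors.New("replication threads are running")
	}
	f.configured = true
	return nil
}

func main() {
	f := &flakyConfigurer{failuresLeft: 2}
	err := ensureConfigured(f, 5, 10*time.Millisecond)
	fmt.Println("error:", err, "attempts:", f.calls)
}
```

The check before each attempt is what keeps the loop from undoing a configuration that another actor has already applied in the meantime.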

mattlord · 2023-03-01T19:40:25Z

go/test/endtoend/reparent/newfeaturetest/reparent_test.go

+// TestTabletRestart tests that a running tablet can be restarted and everything is still fine.
+func TestTabletRestart(t *testing.T) {
+	defer cluster.PanicHandler(t)
+	clusterInstance := utils.SetupReparentCluster(t, "semi_sync")
+	defer utils.TeardownCluster(clusterInstance)
+	tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets
+
+	utils.StopTablet(t, tablets[1], false)
+	tablets[1].VttabletProcess.ServingStatus = "SERVING"
+	err := tablets[1].VttabletProcess.Setup()
+	require.NoError(t, err)
+}


This fails on main?

Yes, it does.

go/vt/vttablet/tabletmanager/tm_init.go

mattlord · 2023-03-01T20:01:59Z

From the issue, does the process end here?

F0301 22:52:17.853669   63118 vttablet.go:128] failed to parse --tablet-path or initialize DB credentials: ExecuteFetch(RESET SLAVE ALL) failed: This operation cannot be performed with running replication threads; run STOP SLAVE FOR CHANNEL '' first (errno 3081) (sqlstate HY000) during query: RESET SLAVE ALL
MysqlDaemon.SetReplicationSource failed

If so, then I would guess we have a nil pointer dereference or something that flows from this. That would seem like a separate issue to address, as ideally we should retry to reach the desired state during init and then shut down gracefully if we fail to. Otherwise I think we may still be prone to the process simply “disappearing” w/o a trace or clue because of race conditions around this.

GuptaManan100 · 2023-03-02T07:23:49Z

@mattlord It's not really a race. It's just that this part of the code didn't stop replication before it started changing the primary information. It assumed that replication would be stopped already, which is an incorrect assumption.
As far as the question for a race with someone fixing the replication (vtorc, or manual) is concerned, I don't think that is an issue because of how SetReplicationSource is written. It accumulates all the commands it has to run and then runs them all one after the other.

	cmds := []string{}
	if replicationStopBefore {
		cmds = append(cmds, conn.StopReplicationCommand())
	}
	// Reset replication parameters commands makes the instance forget the source host port
	// This is required because sometimes MySQL gets stuck due to improper initialization of
	// master info structure or related failures and throws errors like
	// ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
	// These errors can only be resolved by resetting the replication parameters, otherwise START SLAVE fails.
	// Therefore, we have elected to always reset the replication parameters whenever we try to set the source host port
	// Since there is no real overhead, but it makes this function robust enough to also handle failures like these.
	cmds = append(cmds, conn.ResetReplicationParametersCommands()...)
	smc := conn.SetReplicationSourceCommand(params, host, port, int(replicationConnectRetry.Seconds()))
	cmds = append(cmds, smc)
	if replicationStartAfter {
		cmds = append(cmds, conn.StartReplicationCommand())
	}
	return mysqld.executeSuperQueryListConn(ctx, conn, cmds)

I think that the idea of having a retry is still a good one. We could also just ignore the error, instead of exiting on it and let VTOrc repair the replication later.

This is what we have in Start -

	_, err = tm.initializeReplication(ctx, tm.Tablet().Type)
	tm.tmState.Open()
	return err

Maybe we should ignore that error, because it gets propagated up the stack and eventually triggers log.Exitf. An error in initializing replication shouldn't really shut down the tablet. I am still in two minds about that though. WDYT?

Signed-off-by: Manan Gupta <manan@planetscale.com>

GuptaManan100 · 2023-03-03T11:53:41Z

@mattlord @deepthi and I had a chat about this PR this morning. She feels that we shouldn't be ignoring an error during start-up. Also, we both think that a retry for initializing replication isn't required, especially since VTOrc has already become compulsory.

Other than that, I have pushed fixes to the tests so this PR should be good to go. If something does turn up, please feel free to resolve it as you both see fit since I'll be off for about 12 days.

vitess-bot · 2023-03-04T02:11:51Z

I was unable to backport this Pull Request to the following branches: release-14.0, release-15.0, release-16.0.

…tessio#12534) * feat: fix initialization code to also stop replication Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: fix tests expectations Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: fix wrangler tests Signed-off-by: Manan Gupta <manan@planetscale.com> --------- Signed-off-by: Manan Gupta <manan@planetscale.com>

…2534) (#12692) * feat: fix initialization code to also stop replication * feat: fix tests expectations * feat: fix wrangler tests --------- Signed-off-by: Manan Gupta <manan@planetscale.com>

…2534) (#12691) * feat: fix initialization code to also stop replication * feat: fix tests expectations * feat: fix wrangler tests --------- Signed-off-by: Manan Gupta <manan@planetscale.com>

feat: fix initialization code to also stop replication

5cd22e7

Signed-off-by: Manan Gupta <manan@planetscale.com>

GuptaManan100 added Type: Bug Component: Cluster management Backport to: release-14.0 labels Mar 1, 2023

GuptaManan100 requested review from deepthi, rohit-nayak-ps, rsajwani and shlomi-noach as code owners March 1, 2023 19:01

vitess-bot bot added NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Mar 1, 2023

GuptaManan100 removed NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels Mar 1, 2023

mattlord approved these changes Mar 1, 2023

feat: fix tests expectations

e4b5978

Signed-off-by: Manan Gupta <manan@planetscale.com>

GuptaManan100 requested review from ajm188 and notfelineit as code owners March 3, 2023 02:51

feat: fix wrangler tests

86f8f7c

Signed-off-by: Manan Gupta <manan@planetscale.com>

deepthi approved these changes Mar 4, 2023

deepthi merged commit ba115b3 into vitessio:main Mar 4, 2023

deepthi deleted the vttablet-startup-fix branch March 4, 2023 02:09

GuptaManan100 mentioned this pull request Mar 22, 2023

[release-16.0] Fix initialization code to also stop replication to prevent crash #12534 #12691

Merged

GuptaManan100 mentioned this pull request Mar 22, 2023

[release-15.0] Fix initialization code to also stop replication to prevent crash #12534 #12692

Merged

austenLacy mentioned this pull request Apr 26, 2023

Fixes the SwitchTraffic bug that wasn't respecting --dry_run for readonly and replica tablets during a resharding event Shopify/vitess#91

Merged

austenLacy mentioned this pull request Aug 1, 2023

Custom jwt http ACL policy Shopify/vitess#114

Closed

hmaurer mentioned this pull request Mar 21, 2024

oops #15542

Closed
