Fix initialization code to also stop replication to prevent crash #12534

Merged: 3 commits, Mar 4, 2023

Conversation

GuptaManan100
Member

@GuptaManan100 GuptaManan100 commented Mar 1, 2023

Description

This PR fixes the bug described in #12533 by making the initialization code stop replication before it runs RESET SLAVE ALL. The old code assumed that replication was already stopped when a new vttablet came up, but that assumption is incorrect. We should always stop replication before we try to change the primary hostname and port.
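
As a rough illustration of why the ordering matters, here is a minimal sketch in Go. It does not use the Vitess code: `fakeMySQL`, `setSourceCommands`, and `run` are hypothetical names, the `CHANGE MASTER` statement is abbreviated, and the fake connection only models the one MySQL behavior reported in the issue (RESET SLAVE ALL is rejected while replication threads are running).

```go
package main

import (
	"errors"
	"fmt"
)

// fakeMySQL is a hypothetical stand-in for a MySQL connection.
// It only tracks whether the replication threads are running.
type fakeMySQL struct {
	replicationRunning bool
}

// exec models a single behavior: RESET SLAVE ALL fails while
// replication threads are running (MySQL errno 3081).
func (m *fakeMySQL) exec(cmd string) error {
	switch cmd {
	case "STOP SLAVE":
		m.replicationRunning = false
	case "START SLAVE":
		m.replicationRunning = true
	case "RESET SLAVE ALL":
		if m.replicationRunning {
			return errors.New("cannot be performed with running replication threads (errno 3081)")
		}
	}
	return nil
}

// setSourceCommands builds the command list in order. stopBefore is the
// switch this PR is about; the CHANGE MASTER statement is a placeholder.
func setSourceCommands(stopBefore bool) []string {
	cmds := []string{}
	if stopBefore {
		cmds = append(cmds, "STOP SLAVE")
	}
	return append(cmds, "RESET SLAVE ALL", "CHANGE MASTER TO <source settings>")
}

// run executes the commands one after the other and stops at the first error.
func run(m *fakeMySQL, cmds []string) error {
	for _, c := range cmds {
		if err := m.exec(c); err != nil {
			return fmt.Errorf("%s failed: %w", c, err)
		}
	}
	return nil
}

func main() {
	// Old assumption: replication is already stopped, so no STOP SLAVE is issued.
	fmt.Println(run(&fakeMySQL{replicationRunning: true}, setSourceCommands(false)))
	// With the fix: stop replication first, then reset and reconfigure.
	fmt.Println(run(&fakeMySQL{replicationRunning: true}, setSourceCommands(true)))
}
```

In this sketch the first call returns an error and the second succeeds, which mirrors the failure mode in #12533 when a tablet restarts while replication is still running.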

This PR is a follow-up to: #10881 which was originally created to resolve #10880.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Manan Gupta <manan@planetscale.com>
@vitess-bot
Contributor

vitess-bot bot commented Mar 1, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@GuptaManan100 removed the NeedsDescriptionUpdate and NeedsWebsiteDocsUpdate labels on Mar 1, 2023
Contributor

@mattlord mattlord left a comment

Approving as it's definitely an improvement. I'm not certain, however, that this eliminates the general race condition, since a human, tablet repair, VTOrc, etc. could still start replication in between, right? Perhaps we should keep retrying until the desired config is achieved and error out after a number of attempts (always checking first whether the desired config is already in place).

Comment on lines +100 to +111
// TestTabletRestart tests that a running tablet can be restarted and everything is still fine
func TestTabletRestart(t *testing.T) {
	defer cluster.PanicHandler(t)
	clusterInstance := utils.SetupReparentCluster(t, "semi_sync")
	defer utils.TeardownCluster(clusterInstance)
	tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets

	utils.StopTablet(t, tablets[1], false)
	tablets[1].VttabletProcess.ServingStatus = "SERVING"
	err := tablets[1].VttabletProcess.Setup()
	require.NoError(t, err)
}
Contributor

This fails on main?

Member Author

Yes, it does.

go/vt/vttablet/tabletmanager/tm_init.go (review thread resolved)
@mattlord
Contributor

mattlord commented Mar 1, 2023

From the issue, does the process end here?

F0301 22:52:17.853669   63118 vttablet.go:128] failed to parse --tablet-path or initialize DB credentials: ExecuteFetch(RESET SLAVE ALL) failed: This operation cannot be performed with running replication threads; run STOP SLAVE FOR CHANNEL '' first (errno 3081) (sqlstate HY000) during query: RESET SLAVE ALL
MysqlDaemon.SetReplicationSource failed

If so, then I would guess we have a nil pointer dereference or something that flows from this. That would seem like a separate issue to address, as ideally we should retry to reach the desired state during init and then shut down gracefully if we fail to. Otherwise I think we may still be prone to the process simply “disappearing” without a trace or clue because of race conditions around this.

@GuptaManan100
Member Author

@mattlord It's not really a race. It's just that this part of the code didn't stop replication before it started changing the primary information. It assumed that replication would already be stopped, which is an incorrect assumption.
As for a race with someone fixing replication (VTOrc, or manually), I don't think that is an issue because of how SetReplicationSource is written. It accumulates all the commands it has to run and then runs them one after the other.

	cmds := []string{}
	if replicationStopBefore {
		cmds = append(cmds, conn.StopReplicationCommand())
	}
	// Reset replication parameters commands makes the instance forget the source host port
	// This is required because sometimes MySQL gets stuck due to improper initialization of
	// master info structure or related failures and throws errors like
	// ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
	// These errors can only be resolved by resetting the replication parameters, otherwise START SLAVE fails.
	// Therefore, we have elected to always reset the replication parameters whenever we try to set the source host port
	// Since there is no real overhead, but it makes this function robust enough to also handle failures like these.
	cmds = append(cmds, conn.ResetReplicationParametersCommands()...)
	smc := conn.SetReplicationSourceCommand(params, host, port, int(replicationConnectRetry.Seconds()))
	cmds = append(cmds, smc)
	if replicationStartAfter {
		cmds = append(cmds, conn.StartReplicationCommand())
	}
	return mysqld.executeSuperQueryListConn(ctx, conn, cmds)

I think that the idea of having a retry is still a good one. We could also just ignore the error, instead of exiting on it and let VTOrc repair the replication later.

This is what we have in Start:

	_, err = tm.initializeReplication(ctx, tm.Tablet().Type)
	tm.tmState.Open()
	return err

Maybe we should ignore that error, because it gets propagated up the stack and eventually reaches log.Exitf. An error in initializing replication shouldn't really shut down the tablet. I am still in two minds about that though. WDYT?

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100
Member Author

GuptaManan100 commented Mar 3, 2023

@mattlord @deepthi and I had a chat about this PR this morning. She feels that we shouldn't ignore an error during start-up. Also, we both think that a retry for initializing replication isn't required, especially since VTOrc has already become compulsory.

Other than that, I have pushed fixes to the tests so this PR should be good to go. If something does turn up, please feel free to resolve it as you both see fit since I'll be off for about 12 days.

@deepthi deepthi merged commit ba115b3 into vitessio:main Mar 4, 2023
@deepthi deepthi deleted the vttablet-startup-fix branch March 4, 2023 02:09
@vitess-bot
Contributor

vitess-bot bot commented Mar 4, 2023

I was unable to backport this Pull Request to the following branches: release-14.0, release-15.0, release-16.0.

GuptaManan100 added a commit to planetscale/vitess that referenced this pull request Mar 22, 2023
…tessio#12534)

* feat: fix initialization code to also stop replication

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix tests expectations

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix wrangler tests

Signed-off-by: Manan Gupta <manan@planetscale.com>

---------

Signed-off-by: Manan Gupta <manan@planetscale.com>
frouioui pushed a commit that referenced this pull request Mar 23, 2023
…2534) (#12692)

* feat: fix initialization code to also stop replication



* feat: fix tests expectations



* feat: fix wrangler tests



---------

Signed-off-by: Manan Gupta <manan@planetscale.com>
frouioui pushed a commit that referenced this pull request Mar 23, 2023
…2534) (#12691)

* feat: fix initialization code to also stop replication



* feat: fix tests expectations



* feat: fix wrangler tests



---------

Signed-off-by: Manan Gupta <manan@planetscale.com>
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
3 participants