Fix initialization code to also stop replication to prevent crash #12534
Conversation
Signed-off-by: Manan Gupta <manan@planetscale.com>
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
Approving as it's definitely an improvement. I'm not certain, however, that this eliminates the general race condition, as a human, tablet repair, vtorc, etc. could still start replication in between, right? I feel like perhaps we should keep retrying to achieve the desired config and error out after a number of attempts (always checking first whether the desired config is already in place).
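For concreteness, here is a minimal Go sketch of that retry-with-recheck idea. The helper functions alreadyConfigured and apply are hypothetical stand-ins, not anything in this PR.

package example

import (
	"context"
	"fmt"
	"time"
)

// ensureReplicationConfig sketches the suggested retry loop: before every
// attempt it re-checks whether the desired config is already in place (a
// human, vtorc, or tablet repair may have set it in the meantime), then tries
// to apply it, and only returns an error after maxAttempts failures.
func ensureReplicationConfig(
	ctx context.Context,
	alreadyConfigured func(context.Context) bool, // hypothetical check for the desired state
	apply func(context.Context) error, // hypothetical "stop, reset, set source" step
	maxAttempts int,
) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if alreadyConfigured(ctx) {
			return nil
		}
		if lastErr = apply(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(attempt) * time.Second): // simple linear backoff
		}
	}
	return fmt.Errorf("desired replication config not reached after %d attempts: %w", maxAttempts, lastErr)
}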
// TestTabletRestart tests that a running tablet can be restarted and everything is still fine
func TestTabletRestart(t *testing.T) {
	defer cluster.PanicHandler(t)
	clusterInstance := utils.SetupReparentCluster(t, "semi_sync")
	defer utils.TeardownCluster(clusterInstance)
	tablets := clusterInstance.Keyspaces[0].Shards[0].Vttablets

	utils.StopTablet(t, tablets[1], false)
	tablets[1].VttabletProcess.ServingStatus = "SERVING"
	err := tablets[1].VttabletProcess.Setup()
	require.NoError(t, err)
}
This fails on main?
Yes, it does.
From the issue, does the process end here?
If so, then I would guess we have a nil pointer dereference or something that flows from this. That would seem like a separate issue to address, as ideally we should retry to reach the desired state during init and then shut down gracefully if we fail to. Otherwise I think we may still be prone to the process simply “disappearing” without a trace or clue because of race conditions around this.
@mattlord It's not really a race. It's just that this part of the code didn't stop replication before it started changing the primary information. It assumed that replication would already be stopped, which is an incorrect assumption.

cmds := []string{}
if replicationStopBefore {
	cmds = append(cmds, conn.StopReplicationCommand())
}
// Reset replication parameters commands makes the instance forget the source host port
// This is required because sometimes MySQL gets stuck due to improper initialization of
// master info structure or related failures and throws errors like
// ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log
// These errors can only be resolved by resetting the replication parameters, otherwise START SLAVE fails.
// Therefore, we have elected to always reset the replication parameters whenever we try to set the source host port
// Since there is no real overhead, but it makes this function robust enough to also handle failures like these.
cmds = append(cmds, conn.ResetReplicationParametersCommands()...)
smc := conn.SetReplicationSourceCommand(params, host, port, int(replicationConnectRetry.Seconds()))
cmds = append(cmds, smc)
if replicationStartAfter {
	cmds = append(cmds, conn.StartReplicationCommand())
}
return mysqld.executeSuperQueryListConn(ctx, conn, cmds)

I think that the idea of having a retry is still a good one. We could also just ignore the error, instead of exiting on it, and let VTOrc repair the replication later. This is what we have in
Maybe we should ignore that error, because that error gets propagated up the stack eventually.
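As a sketch of the "ignore and let VTOrc repair" option being discussed here (not what was ultimately merged, and not the actual vttablet code — setSource is a hypothetical stand-in for the real call):

package example

import (
	"context"
	"log"
)

// initReplicationBestEffort illustrates the discussed option: if pointing
// replication at the primary fails during tablet init, log the error and keep
// starting up, relying on VTOrc to repair replication later.
func initReplicationBestEffort(ctx context.Context, setSource func(context.Context) error) {
	if err := setSource(ctx); err != nil {
		log.Printf("failed to configure replication during init, leaving repair to VTOrc: %v", err)
		// Deliberately no return or exit: startup continues despite the error.
	}
}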
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
@mattlord @deepthi and I had a chat about this PR this morning. She feels that we shouldn't be ignoring an error during start-up. Also, we both think that a retry for initializing replication isn't required, especially since VTOrc has already become compulsory. Other than that, I have pushed fixes to the tests, so this PR should be good to go. If something does turn up, please feel free to resolve it as you both see fit, since I'll be off for about 12 days.
I was unable to backport this Pull Request to the following branches:
…tessio#12534)

* feat: fix initialization code to also stop replication
Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix tests expectations
Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: fix wrangler tests
Signed-off-by: Manan Gupta <manan@planetscale.com>

---------

Signed-off-by: Manan Gupta <manan@planetscale.com>
Description
This PR fixes the bug described in #12533 by fixing the initialization code to also stop replication before it tries to run reset slave all. The old code assumed that replication is already stopped when a new vttablet comes up, but that assumption is incorrect. We should stop replication regardless before we try to change the primary hostname and port.

This PR is a follow-up to #10881, which was originally created to resolve #10880.
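As a rough sketch of the statement ordering this fix enforces (illustrative Go, not the actual vttablet code; the SQL strings are placeholders for the flavor-specific statements vttablet builds through its connection helpers):

package example

// replicationInitStatements sketches the ordering after this PR: stop
// replication first, reset the old parameters, point at the new primary,
// then optionally start replication again.
func replicationInitStatements(startAfter bool) []string {
	cmds := []string{
		"STOP SLAVE",           // never assume replication is already stopped
		"RESET SLAVE ALL",      // forget the old source host/port
		"CHANGE MASTER TO ...", // placeholder for the flavor-specific set-source statement
	}
	if startAfter {
		cmds = append(cmds, "START SLAVE")
	}
	return cmds
}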
Related Issue(s)
Checklist
Deployment Notes