Fix ERS to work when the primary candidate's replication is stopped #9512

GuptaManan100 · 2022-01-14T12:45:47Z

Description

While investigating flakiness in one of the runs of cluster 14, we noticed that ERS failed when replication on the most advanced tablet was stopped. ERS waited for that tablet to apply its relay logs and timed out. This PR fixes the issue and adds an end-to-end test for it.
The proposed solution is to start the SQL_Thread of the MySQL server when we wait for it to apply the relay logs.

Related Issue(s)

Checklist

Should this PR be backported?
Tests were added or are not required
Documentation was added or is not required

Deployment Notes

After #9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem with this, however, is that if the SQL_Thread is running but the IO_Thread is not then the tablet repair does not try and start replication on a replica tablet. So in certain states such as when initializing a shard, replication may end up in a non-healthy state and never be repaired.

This changes the behavior so that:
1. We only attempt to start the SQL_Thread(s) if it's not already running
2. If we explicitly start the SQL_Thread(s) then we also explicitly reset it to what it was (stopped) as we exit the call

Because the caller should be/have a TabletManager which has a mutex, this should ensure that the replication manager calls are serialized and because we are resetting the replication state after mutating it, everything should work as it did before #9512 with the exception being that when waiting we ensure that the replica at least has the possibility of catching up.

Signed-off-by: Matt Lord <mattalord@gmail.com>
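
The two behaviors described in this commit message (start the SQL thread only if it is stopped, then restore the previous state on exit) can be sketched in Go as follows. This is an illustration under stated assumptions, not the code from #10104: `replica`, `sqlThread`, and `waitWithTemporarySQLThread` are hypothetical names, and the `sync.Mutex` merely stands in for the TabletManager mutex mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a hypothetical model of a tablet's replication state (not a Vitess type).
type replica struct {
	mu        sync.Mutex // stands in for the TabletManager mutex that serializes calls
	sqlThread bool       // true while the SQL thread is running
}

// waitWithTemporarySQLThread starts the SQL thread only if it is stopped,
// runs wait, and puts the thread back into its stopped state on exit.
func (r *replica) waitWithTemporarySQLThread(wait func() error) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	if !r.sqlThread {
		r.sqlThread = true
		// Deferred calls run last-in, first-out, so the state is restored
		// before the mutex is released.
		defer func() { r.sqlThread = false }()
	}
	return wait()
}

func main() {
	stopped := &replica{}
	_ = stopped.waitWithTemporarySQLThread(func() error {
		fmt.Println("during wait, SQL thread running:", stopped.sqlThread)
		return nil
	})
	fmt.Println("after wait, SQL thread running:", stopped.sqlThread)
}
```

A replica whose SQL thread was already running is left untouched, so the call does not change the state that the tablet repair logic observes afterwards.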

…ded (#10123)

* Only start SQL thread temporarily to WaitForPosition if needed (#10104), with the same commit message as above. Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use older replication status interface, as release-13.0 does not have this: #9853. Signed-off-by: Matt Lord <mattalord@gmail.com>

…sio#561)

* Only start SQL thread temporarily to WaitForPosition if needed (vitessio#10104), with the same commit message as above. Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use older replication status interface, as vitess-private does not have this: vitessio#9853. Signed-off-by: Matt Lord <mattalord@gmail.com>

GuptaManan100 added 4 commits January 14, 2022 17:48

test: added a function to check for replication status in a tablet

49ded01

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: Augment an e2e test to also verify that ERS succeeds when the primary-elect has replication stopped

2ca375f

Signed-off-by: Manan Gupta <manan@planetscale.com>

feat: fix waitUntilPosition to also start the sql thread before waiting for the replica

898ceb2

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: add comment explaining why active reparents are disabled on the tablets

ff4d956

Signed-off-by: Manan Gupta <manan@planetscale.com>

GuptaManan100 added the "Type: Enhancement" (logical improvement, somewhere between a bug and feature), "Component: Cluster management", and "release notes" labels Jan 14, 2022

GuptaManan100 requested review from deepthi, harshit-gangal and systay as code owners January 14, 2022 12:45

GuptaManan100 requested review from frouioui and sougou January 14, 2022 12:46

GuptaManan100 added 3 commits January 14, 2022 19:53

test: fix a test after disabling active reparents on tablets

95ce159

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: restrict passing disable_active_reparents to only latest components and add logging for vtctl commands

0dd616b

Signed-off-by: Manan Gupta <manan@planetscale.com>

Merge remote-tracking branch 'upstream/main' into ers-stopped-replication-candidate

4d4fbac

Signed-off-by: Manan Gupta <manan@planetscale.com>

deepthi approved these changes Jan 19, 2022

GuptaManan100 mentioned this pull request Jan 19, 2022

Bug in ERS with vtctl stopped replication on a tablet #9529

Closed

GuptaManan100 merged commit c11b97a into vitessio:main Jan 19, 2022

GuptaManan100 deleted the ers-stopped-replication-candidate branch January 19, 2022 09:24

This was referenced Apr 15, 2022

Only start SQL thread temporarily to WaitForPosition if needed #10103

Closed

Only start SQL thread temporarily to WaitForPosition if needed #10104

Merged

mattlord mentioned this pull request Apr 21, 2022

Backport: Only start SQL thread temporarily to WaitForPosition if needed #10123

Merged

3 tasks

mattlord mentioned this pull request Apr 27, 2022

Do not mutate replication state in WaitSourcePos and ignore tablets with SQL_Thread stopped in ERS #10148

Merged

3 tasks
