Don't wait for all remote replicas when starting a replica set instance #2958
Comments
I chatted with Mons; there are several possible ways to fix this. We'll meet Monday 7pm to discuss.
OK, thanks for the fast response! I'd like to be invited to this meeting too.
Discussed with @kostja => wait for 50%+1 of remote appliers.
Do not use the generic applier_on_state infrastructure, which is used to advance a single applier to a particular state as it doesn't really fit the case when we need to wait for multiple appliers and it utterly fails when it comes to waiting for an arbitrary subset of all appliers. Needed for #2958
The logic of applier_connect_all() does not really belong to applier.cc, because it operates on a set of appliers while applier.cc is all about an individual applier implementation. Let's move it to replication.cc and rename it to replicaset_connect(). Needed for #2958
- Fold the call to replicaset_update() into replicaset_connect(). These two functions are always called one after another.
- Factor out replicaset_follow() to resume all appliers registered with the replica set.

Needed for #2958
If one cluster node is down permanently for some reason, no other node can restart - they will stall in box.cfg{} until all other nodes are up and running. This complicates a tarantool cluster deployment in real world scenarios. To address this issue, let's complete the configuration as soon as connections have been established to more than a half of all configured replicas, assuming that the rest will connect asynchronously. Closes #2958
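The "more than a half" quorum mentioned above can be sketched as follows (an illustrative sketch of the arithmetic only, not the actual server code):

```lua
-- Majority quorum: connecting succeeds once more than half of the
-- configured replicas are reachable, i.e. 50% + 1.
local function connect_quorum(n_replicas)
    return math.floor(n_replicas / 2) + 1
end
-- For 3 configured replicas the quorum is 2; for 5 it is 3, so a
-- single permanently dead node no longer blocks box.cfg{}.
```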
I assume that there is no point in waiting until all replicas are inserted in the case of the initial cluster load, because none of the replicas should contain any data (for correct configurations, of course). So, if a 'bootstrap is not finished' status was received during the initial bootstrap/join, then box.cfg may finish without all the other masters.
If one cluster node is down permanently for some reason, no other node can restart - they will stall in box.cfg{} until all other nodes are up and running. This complicates a tarantool cluster deployment in real world scenarios. To address this issue, let's complete the configuration as soon as connections have been established to the number of hosts specified by the new configuration option, box.cfg.replication_quorum, assuming the rest will connect asynchronously. If the option is unset, it defaults to the number of entries in box.cfg.replication, so this patch shouldn't affect the behavior of existing setups. Closes #2958
Added replication_quorum option to box.cfg():
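A minimal configuration sketch of the option described in this thread (the URIs are hypothetical; later revisions of the patch rename the option to replication_connect_quorum):

```lua
-- Hypothetical 3-node replica set: box.cfg{} completes once 2 of
-- the 3 configured masters are connected; the third one is allowed
-- to connect asynchronously.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:pass@192.168.0.101:3301',  -- hypothetical URIs
        'replicator:pass@192.168.0.102:3301',
        'replicator:pass@192.168.0.103:3301',
    },
    replication_quorum = 2,
}
```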
Proposed logic with replication_lag + read_only on box.cfg() doesn't solve the problem that a replica can fall behind the master AFTER box.cfg(). A comprehensive solution is needed to forbid read/write queries on stale replicas. Currently this logic can be implemented in Lua without patching the server. box.cfg { replication_quorum = X } - 1.7.6-140-gf1167c429 (1.7-next)
Currently, we have three variables related to replication timeouts: applier_timeout, relay_timeout, and replication_cfg_timeout. They are all set to the value of box.cfg.replication_timeout. We use these variables in four different cases:
- Sending heartbeat messages periodically from master to slave and back (relay_timeout, applier_timeout).
- Reconnecting the applier after a disconnect (applier_timeout).
- Disconnecting a replica if no heartbeat message has been received within the specified timeout (TIMEOUT_PERIODS * relay_timeout).
- Waiting for box.cfg() to succeed (replication_cfg_timeout).

This is confusing. Let's keep just one variable, replication_timeout, that would determine the heartbeat interval, and introduce the following helpers for the three other cases:
- replication_reconnect_timeout()
- replication_disconnect_timeout()
- replication_cfg_timeout()

Also, let's make replication_cfg_timeout() return 3 times the configured timeout in the scope of this patch, because, as pointed out by @kostja:
> We need another replication_timeout variable, using the same variable for everything doesn't work. Please try setting a broken box.cfg.replication a second time, and you'll see that it doesn't try to reconnect, because reconnect timeout = replication timeout. This is broken, reconnect_timeout should be < replication_timeout, to allow for at least a few reconnects.

Suggested by @kostja
Follow-up #2958
First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to drop it.

Now, to the new 'orphan' mode. Tarantool is switched to 'orphan' mode if it failed to form a quorum within the specified timeout during initial configuration of an instance that has already been bootstrapped. Basically, 'orphan' status forces the instance to switch to read-only mode. The instance will switch back to 'running' (and enable write access) in any of the following three cases:
- replication is successfully reconfigured
- box.cfg.replication_connect_quorum is changed
- enough replicas are connected to form a quorum

Note, any subsequent attempt to change box.cfg.replication won't change the server status - as before, it will fail outright if it can't form a quorum within the configured timeout period. Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
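Per the rules above, an instance stuck in 'orphan' mode can be returned to 'running' without a restart, for example by lowering the quorum (a hedged sketch; the option value is illustrative):

```lua
-- If the instance failed to form a quorum at startup and is now
-- read-only, lowering replication_connect_quorum (or reconfiguring
-- box.cfg.replication) lifts the 'orphan' status.
if box.info.status == 'orphan' then
    box.cfg{ replication_connect_quorum = 1 }
end
```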
Currently, we have three variables related to replication timeouts: applier_timeout, relay_timeout, and replication_cfg_timeout. They are all set to the value of box.cfg.replication_timeout. We use these variables in four different cases:
- Sending heartbeat messages periodically from master to slave and back (relay_timeout, applier_timeout).
- Reconnecting the applier after a disconnect (applier_timeout).
- Disconnecting a replica if no heartbeat message has been received within the specified timeout (TIMEOUT_PERIODS * replication_timeout).
- Waiting for box.cfg() to succeed (replication_connect_quorum_timeout).

This is confusing. Let's keep just one variable, replication_timeout, that would determine the heartbeat interval, and introduce the following helpers for the three other cases:
- replication_reconnect_timeout()
- replication_disconnect_timeout()
- replication_connect_quorum_timeout()

Also, let's make replication_connect_quorum_timeout() return 4 times the configured timeout in the scope of this patch, because, as pointed out by @kostja:
> We need another replication_timeout variable, using the same variable for everything doesn't work. Please try setting a broken box.cfg.replication a second time, and you'll see that it doesn't try to reconnect, because reconnect timeout = replication timeout. This is broken, reconnect_timeout should be < replication_timeout, to allow for at least a few reconnects.

Suggested by @kostja
Follow-up #2958
replicaset_connect() leaves appliers that failed to connect within the specified time period running. To prevent them from entering 'subscribe' stage prematurely (i.e. before replicaset_follow() is called), we set replica->pause_on_connect flag, which will force them to freeze upon successful connection. We clear this flag in replicaset_follow(). This juggling with flags looks ugly. Instead, let's stop failed appliers in replicaset_connect() and restart them in replicaset_follow(). Follow-up #2958
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to 0, then a replica skips the "sync" state and switches to "follow" immediately. For the sake of backward compatibility, the new option is unset by default.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort.
3. Then it synchronizes with all configured replicas. It doesn't leave box.cfg() until all connected replicas are synchronized. If it managed to synchronize with enough replicas to form a quorum, it switches to 'running' state; otherwise it leaves the server in 'orphan' state, to be switched to 'running' asynchronously as soon as a quorum is formed.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
The replication/catch test's intention is to check that a replica doesn't enter the write mode until it syncs up with the master. Actually, this feature was introduced only recently, in the scope of #2958, so the test is obviously bogus. Let's just drop it as the functionality it tries and fails to test is already covered by the replication/quorum test.
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then a replica skips the "sync" state and switches to "follow" immediately. For the sake of backward compatibility, the new option is unset by default.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort, but leaves box.cfg() in 'orphan' mode. The state will switch to 'running' asynchronously as soon as the instance has synced with 'replication_connect_quorum' replicas.
3. If it managed to connect to enough replicas to form a quorum at step 2, it synchronizes with them: box.cfg() doesn't return until at least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then a replica skips the "sync" state and switches to "follow" immediately. The default value of replication_sync_lag is 10 seconds, but it is ignored (assumed to be inf) in case the master is running a tarantool older than 1.7.7, which does not send heartbeat messages.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort, but leaves box.cfg() in 'orphan' mode. The state will switch to 'running' asynchronously as soon as the instance has synced with 'replication_connect_quorum' replicas.
3. If it managed to connect to enough replicas to form a quorum at step 2, it synchronizes with them: box.cfg() doesn't return until at least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
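Putting the final variant of the procedure together, a restart of an already-bootstrapped instance might be configured like this (URIs and values are hypothetical):

```lua
-- box.cfg{} returns once 2 of the 3 peers are connected (the
-- connect stage waits ~4 * replication_timeout) and the connected
-- peers are synced to within replication_sync_lag seconds;
-- otherwise the instance is left in 'orphan' (read-only) mode and
-- finishes syncing asynchronously.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:pass@host1:3301',  -- hypothetical URIs
        'replicator:pass@host2:3301',
        'replicator:pass@host3:3301',
    },
    replication_connect_quorum = 2,
    replication_timeout = 1,
    replication_sync_lag = 0.1,
}
```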
Preamble
Tarantool 1.7 waits until all remote upstreams from box.cfg.replication are connected before proceeding with boot, whereas 1.6 didn't. This breaks high availability in a few important cases:
The change
The current behavior implements the motto "better safe than sorry". In all cases when an instance fails to boot with its previously defined configuration, this is clearly visible to the DBA, who can then intervene and change the configuration according to the circumstances. For example, Tarantool 1.6 could start with wrong slave URIs in the configuration file and never finish initializing replication - and this used to be missed by ops.
Therefore, the spirit of the new changes is to preserve and improve the level of safety introduced in 1.7, while giving the power users flexibility to automate restart as they wish.
First boot
On first boot, when the replica set is not yet initialized, the procedure will not change. We will wait for all replicas specified in the configuration to be up and running before finishing the bootstrap.
Subsequent starts
On a subsequent start, the replica will continue to try to connect to all peers specified in the box.cfg.replication configuration variable, making at least 3 attempts to connect to each peer during the period specified in replication_timeout. If these attempts succeed before the timeout is reached, the replica catches up with all the changes accumulated at its peers, as specified by a new replication_lag variable (in seconds). When the replica is behind its peers by fewer seconds than specified in replication_lag, it finishes booting by leaving box.cfg{} in read-write or read-only mode, as defined by box.cfg.read_only.

In case the replica was not able to reach some of its peers within replication_timeout seconds, it syncs up with the peers it was able to connect to and then leaves box.cfg{} in a degraded mode:
- If box.cfg.read_only is true, the mode is read-only.
- If box.cfg.read_only is false, the replica checks a new variable, box.cfg.replication_connect_quorum, which defaults to the number of entries in box.cfg.replication and specifies the number of peers the replica should be in sync with to enter read-write mode. If the quorum is reached, the mode is read-write. Otherwise, the mode is read-only.

The new, forced read-only mode can be changed only by changing box.cfg.replication or box.cfg.replication_connect_quorum.
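The degraded-mode decision described above can be illustrated with a config fragment (hypothetical values): with three entries in box.cfg.replication the quorum defaults to 3, so a replica that reaches only two peers would boot read-only; explicitly setting the quorum to 2 lets it boot read-write:

```lua
-- With read_only = false and a quorum of 2, reaching any 2 of the
-- 3 peers is enough to leave box.cfg{} in read-write mode; reaching
-- fewer forces read-only until replication or the quorum is changed.
box.cfg{
    read_only = false,
    replication = {
        'host1:3301', 'host2:3301', 'host3:3301',  -- hypothetical
    },
    replication_connect_quorum = 2,
}
```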