Don't wait for all remote replicas when starting a replica set instance #2958
Comments
I chatted with Mons; there are several possible ways to fix this. We'll meet Monday 7pm to discuss.
OK, thanks for the fast response! I'd like to be invited to this meeting too.
Discussed with @kostja => wait for 50%+1 of remote appliers.
Do not use the generic applier_on_state infrastructure, which is used to advance a single applier to a particular state as it doesn't really fit the case when we need to wait for multiple appliers and it utterly fails when it comes to waiting for an arbitrary subset of all appliers. Needed for #2958
The logic of applier_connect_all() does not really belong to applier.cc, because it operates on a set of appliers while applier.cc is all about an individual applier implementation. Let's move it to replication.cc and rename it to replicaset_connect(). Needed for #2958
- Fold the call to replicaset_update() into replicaset_connect(). These two functions are always called one after another.
- Factor out replicaset_follow() to resume all appliers registered with the replica set.

Needed for #2958
If one cluster node is down permanently for some reason, no other node can restart - they will stall in box.cfg{} until all other nodes are up and running. This complicates a tarantool cluster deployment in real world scenarios. To address this issue, let's complete the configuration as soon as connections have been established to more than a half of all configured replicas, assuming that the rest will connect asynchronously. Closes #2958
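The "more than a half" quorum mentioned above can be sketched as follows (an illustrative sketch of the arithmetic only, not the actual server code):

```lua
-- Majority quorum: connecting succeeds once more than half of the
-- configured replicas are reachable, i.e. 50% + 1.
local function connect_quorum(n_replicas)
    return math.floor(n_replicas / 2) + 1
end
-- For 3 configured replicas the quorum is 2; for 5 it is 3, so a
-- single permanently dead node no longer blocks box.cfg{}.
```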
I assume that there is no point in waiting until all replicas are inserted in the case of the initial cluster load, because none of the replicas should contain any data (for correct configurations, of course). So, if a 'bootstrap is not finished' status was received during the initial bootstrap/join, then box.cfg may finish without all the other masters.
If one cluster node is down permanently for some reason, no other node can restart - they will stall in box.cfg{} until all other nodes are up and running. This complicates a tarantool cluster deployment in real world scenarios. To address this issue, let's complete the configuration as soon as connections have been established to the number of hosts specified by the new configuration option, box.cfg.replication_quorum, assuming the rest will connect asynchronously. If the option is unset, it defaults to the number of entries in box.cfg.replication, so this patch shouldn't affect the behavior of existing setups. Closes #2958
Added replication_quorum option to box.cfg():
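A minimal configuration sketch of the option described in this thread (the URIs are hypothetical; later revisions of the patch rename the option to replication_connect_quorum):

```lua
-- Hypothetical 3-node replica set: box.cfg{} completes once 2 of
-- the 3 configured masters are connected; the third one is allowed
-- to connect asynchronously.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:pass@192.168.0.101:3301',  -- hypothetical URIs
        'replicator:pass@192.168.0.102:3301',
        'replicator:pass@192.168.0.103:3301',
    },
    replication_quorum = 2,
}
```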
Proposed logic with replication_lag + read_only on box.cfg() doesn't solve the problem that a replica can fall behind the master AFTER box.cfg(). A comprehensive solution is needed to forbid read/write queries on stale replicas. Currently this logic can be implemented in Lua without patching the server. box.cfg { replication_quorum = X } - 1.7.6-140-gf1167c429 (1.7-next)
Currently, we have three variables related to replication timeouts: applier_timeout, relay_timeout, and replication_cfg_timeout. They are all set to the value of box.cfg.replication_timeout. We use these variables in four different cases:
- Sending heartbeat messages periodically from master to slave and back (relay_timeout, applier_timeout).
- Reconnecting the applier after a disconnect (applier_timeout).
- Disconnecting a replica if no heartbeat message has been received within the specified timeout (TIMEOUT_PERIODS * relay_timeout).
- Waiting for box.cfg() to succeed (replication_cfg_timeout).

This is confusing. Let's keep just one variable, replication_timeout, that would determine the heartbeat interval, and introduce the following helpers for the three other cases:
- replication_reconnect_timeout()
- replication_disconnect_timeout()
- replication_cfg_timeout()

Also, let's make replication_cfg_timeout() return 3 times the configured timeout in the scope of this patch, because, as pointed out by @kostja:
> We need another replication_timeout variable, using the same variable for everything doesn't work. Please try setting a broken box.cfg.replication a second time, and you'll see that it doesn't try to reconnect, because reconnect timeout = replication timeout. This is broken, reconnect_timeout should be < replication_timeout, to allow for at least a few reconnects.

Suggested by @kostja
Follow-up #2958
First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to drop it.

Now, to the new 'orphan' mode. Tarantool is switched to 'orphan' mode if it failed to form a quorum within the specified timeout during initial configuration of an instance that has already been bootstrapped. Basically, 'orphan' status forces the instance to switch to read-only mode. The instance will switch back to 'running' (and enable write access) in any of the following three cases:
- replication is successfully reconfigured
- box.cfg.replication_connect_quorum is changed
- enough replicas are connected to form a quorum

Note, any subsequent attempt to change box.cfg.replication won't change the server status - as before, it will fail outright if it can't form a quorum within the configured timeout period. Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
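Per the rules above, an instance stuck in 'orphan' mode can be returned to 'running' without a restart, for example by lowering the quorum (a hedged sketch; the option value is illustrative):

```lua
-- If the instance failed to form a quorum at startup and is now
-- read-only, lowering replication_connect_quorum (or reconfiguring
-- box.cfg.replication) lifts the 'orphan' status.
if box.info.status == 'orphan' then
    box.cfg{ replication_connect_quorum = 1 }
end
```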
Currently, we have three variables related to replication timeouts: applier_timeout, relay_timeout, and replication_cfg_timeout. They are all set to the value of box.cfg.replication_timeout. We use these variables in four different cases:
- Sending heartbeat messages periodically from master to slave and back (relay_timeout, applier_timeout).
- Reconnecting the applier after a disconnect (applier_timeout).
- Disconnecting a replica if no heartbeat message has been received within the specified timeout (TIMEOUT_PERIODS * replication_timeout).
- Waiting for box.cfg() to succeed (replication_connect_quorum_timeout).

This is confusing. Let's keep just one variable, replication_timeout, that would determine the heartbeat interval, and introduce the following helpers for the three other cases:
- replication_reconnect_timeout()
- replication_disconnect_timeout()
- replication_connect_quorum_timeout()

Also, let's make replication_connect_quorum_timeout() return 4 times the configured timeout in the scope of this patch, because, as pointed out by @kostja:
> We need another replication_timeout variable, using the same variable for everything doesn't work. Please try setting a broken box.cfg.replication a second time, and you'll see that it doesn't try to reconnect, because reconnect timeout = replication timeout. This is broken, reconnect_timeout should be < replication_timeout, to allow for at least a few reconnects.

Suggested by @kostja
Follow-up #2958
replicaset_connect() leaves appliers that failed to connect within the specified time period running. To prevent them from entering 'subscribe' stage prematurely (i.e. before replicaset_follow() is called), we set replica->pause_on_connect flag, which will force them to freeze upon successful connection. We clear this flag in replicaset_follow(). This juggling with flags looks ugly. Instead, let's stop failed appliers in replicaset_connect() and restart them in replicaset_follow(). Follow-up #2958
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to 0, then a replica skips the "sync" state and switches to "follow" immediately. For the sake of backward compatibility, the new option is unset by default.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort.
3. Then it synchronizes with all configured replicas. It doesn't leave box.cfg() until all connected replicas are synchronized. If it managed to synchronize with enough replicas to form a quorum, it switches to 'running' state; otherwise it leaves the server in 'orphan' state, to be switched to 'running' asynchronously as soon as a quorum is formed.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
The replication/catch test's intention is to check that a replica doesn't enter the write mode until it syncs up with the master. Actually, this feature was introduced only recently, in the scope of #2958, so the test is obviously bogus. Let's just drop it as the functionality it tries and fails to test is already covered by the replication/quorum test.
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then a replica skips the "sync" state and switches to "follow" immediately. For the sake of backward compatibility, the new option is unset by default.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort, but leaves box.cfg() in 'orphan' mode. The state will switch to 'running' asynchronously as soon as the instance has synced with 'replication_connect_quorum' replicas.
3. If it managed to connect to enough replicas to form a quorum at step 2, it synchronizes with them: box.cfg() doesn't return until at least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
This patch modifies the replication configuration procedure so as to fully conform to the specification presented in #2958. In a nutshell, box.cfg() now tries to synchronize all connected replicas before returning. If it fails to connect enough replicas to form a quorum, it leaves the server in a degraded 'orphan' mode, which is basically read-only. More details below.

First of all, it's worth mentioning that we already have an 'orphan' status in Tarantool (between 'loading' and 'hot_standby'), but it has nothing to do with replication. Actually, it's unclear why it was introduced in the first place, so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than the value of the new configuration option box.cfg.replication_sync_lag. Otherwise a replica is considered to be syncing and has "sync" status. If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then a replica skips the "sync" state and switches to "follow" immediately. The default value of replication_sync_lag is 10 seconds, but it is ignored (assumed to be inf) in case the master is running a tarantool older than 1.7.7, which does not send heartbeat messages.

If box.cfg() is called for the very first time (bootstrap) for a given instance, then:
1. It tries to connect to all configured replicas for as long as it takes (replication_timeout isn't taken into account). If it fails to connect to at least one replica, bootstrap is aborted.
2. If this is a cluster bootstrap and the current instance turns out to be the new cluster leader, then it performs local bootstrap, switches to 'running' state, and leaves box.cfg() immediately.
3. Otherwise (i.e. if this is bootstrap of a slave replica), it bootstraps from a remote master and then stays in 'orphan' state until it synchronizes with all replicas, before switching to 'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the local storage, then:
1. It recovers the last snapshot and xlogs stored in the local directory.
2. Then it switches to 'orphan' mode and tries to connect to at least as many replicas as specified by box.cfg.replication_connect_quorum for a time period which is a multiple of box.cfg.replication_timeout (4x). If it fails, it doesn't abort, but leaves box.cfg() in 'orphan' mode. The state will switch to 'running' asynchronously as soon as the instance has synced with 'replication_connect_quorum' replicas.
3. If it managed to connect to enough replicas to form a quorum at step 2, it synchronizes with them: box.cfg() doesn't return until at least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then it tries to connect to all specified replicas within a time period which is a multiple of box.cfg.replication_timeout (4x). The value of box.cfg.replication_connect_quorum isn't taken into account, and neither is the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja
Follow-up #2958
Closes #999
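Putting the final variant of the procedure together, a restart of an already-bootstrapped instance might be configured like this (URIs and values are hypothetical):

```lua
-- box.cfg{} returns once 2 of the 3 peers are connected (the
-- connect stage waits ~4 * replication_timeout) and the connected
-- peers are synced to within replication_sync_lag seconds;
-- otherwise the instance is left in 'orphan' (read-only) mode and
-- finishes syncing asynchronously.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:pass@host1:3301',  -- hypothetical URIs
        'replicator:pass@host2:3301',
        'replicator:pass@host3:3301',
    },
    replication_connect_quorum = 2,
    replication_timeout = 1,
    replication_sync_lag = 0.1,
}
```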
Preamble
Tarantool 1.7 waits until all remote upstreams from box.cfg.replication are connected before proceeding with boot, whereas 1.6 didn't. This breaks high availability in a few important cases:
The change
The current behavior implements the motto "better safe than sorry". In all cases when an instance fails to boot with its previously defined configuration, this is clearly visible to the DBA, who can then intervene and change the configuration according to the circumstances. For example, Tarantool 1.6 could start with wrong slave URIs in the configuration file and never finish initializing replication - and this used to be missed by ops.
Therefore, the spirit of the new changes is to preserve and improve the level of safety introduced in 1.7, while giving the power users flexibility to automate restart as they wish.
First boot
On first boot, when the replica set is not yet initialized, the procedure will not change. We will wait for all replicas specified in the configuration to be up and running before finishing the bootstrap.
Subsequent starts
On a subsequent start, the replica will continue to try to connect to all peers specified in the box.cfg.replication configuration variable, making at least 3 attempts to connect to each peer during the period specified in replication_timeout. If these attempts succeed before the timeout is reached, the replica catches up with all the changes accumulated at its peers, as specified by a new replication_lag variable (in seconds). When the replica is behind its peers by fewer seconds than specified in replication_lag, it finishes booting by leaving box.cfg{} in read-write or read-only mode, as defined by box.cfg.read_only.

In case the replica was not able to reach some of its peers within replication_timeout seconds, it syncs up with the peers it was able to connect to and then leaves box.cfg{} in a degraded mode:
- If box.cfg.read_only is true, the mode is read-only.
- If box.cfg.read_only is false, the replica checks a new variable, box.cfg.replication_connect_quorum, which defaults to the number of entries in box.cfg.replication and specifies the number of peers the replica should be in sync with to enter read-write mode. If the quorum is reached, the mode is read-write. Otherwise, the mode is read-only.

The new, forced read-only mode can be changed only by changing box.cfg.replication or box.cfg.replication_connect_quorum.
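The degraded-mode decision described above can be illustrated with a config fragment (hypothetical values): with three entries in box.cfg.replication the quorum defaults to 3, so a replica that reaches only two peers would boot read-only; explicitly setting the quorum to 2 lets it boot read-write:

```lua
-- With read_only = false and a quorum of 2, reaching any 2 of the
-- 3 peers is enough to leave box.cfg{} in read-write mode; reaching
-- fewer forces read-only until replication or the quorum is changed.
box.cfg{
    read_only = false,
    replication = {
        'host1:3301', 'host2:3301', 'host3:3301',  -- hypothetical
    },
    replication_connect_quorum = 2,
}
```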