Don't wait for all remote replicas when starting a replica set instance #2958

Closed
rtsisyk opened this issue Nov 25, 2017 · 7 comments
Labels: customer, feature (A new functionality)
@rtsisyk
Contributor

rtsisyk commented Nov 25, 2017

Preamble

Tarantool 1.7 waits until all remote upstreams from box.cfg.replication are connected before proceeding with boot, whereas 1.6 didn't. This breaks high availability in a few important cases:

  • there is a master-master setup operating in eventual-consistency mode. One of the masters is down. When the second master restarts for whatever reason, it does not boot without DBA intervention.
  • there is a master-slave setup, and the master is down. The slave restarts, again, for whatever reason. Instead of continuing in read-only mode, which is entirely safe, it sits in box.cfg{} waiting for the master to respond to the slave's replication handshake request.

The change

The current behavior implements the motto "better to be safe than sorry": whenever an instance fails to boot in its previously defined configuration, the failure is clearly visible to the DBA, who can then intervene and change the configuration according to the circumstances. For example, Tarantool 1.6 could start with wrong slave URIs in the configuration file and never finish initializing replication, and this used to be missed by ops.

Therefore, the spirit of the new changes is to preserve and improve the level of safety introduced in 1.7, while giving the power users flexibility to automate restart as they wish.

First boot

On first boot, when the replica set is not yet initialized, the procedure will not change. We will wait for all replicas specified in the configuration to be up and running before finishing the bootstrap.

Subsequent starts

On a subsequent start, the replica will continue to try to connect to all peers specified in the box.cfg.replication configuration variable, making at least 3 attempts to connect to each peer within the period specified in replication_timeout. If these attempts succeed before the timeout is reached, the replica catches up with the changes accumulated at its peers, to within a new replication_lag variable (in seconds). Once the replica is behind its peers by fewer seconds than specified in replication_lag, it finishes booting by leaving box.cfg{} in read-write or read-only mode, as defined by box.cfg.read_only.
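For illustration, a minimal configuration sketch of the start-up behaviour described above, using the option names from this proposal (replication_lag is the name used here; the commits below implement it as replication_sync_lag) and placeholder URIs:

    box.cfg{
        listen              = 3301,
        replication         = {'replicator:pass@host1:3301',
                               'replicator:pass@host2:3301'},
        replication_timeout = 1,    -- seconds; bounds the connect attempts on restart
        replication_lag     = 0.1,  -- proposed option: acceptable lag before leaving box.cfg{}
        read_only           = false -- desired mode once the instance is in sync
    }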

In case the replica was not able to reach some of its peers within replication_timeout seconds, it syncs up with the peers it was able to connect to and then leaves box.cfg{} in a degraded mode:

  • if box.cfg.read_only is true, the mode is "read only".
  • if box.cfg.read_only is false, the replica checks a new variable, box.cfg.replication_connect_quorum, which defaults to the number of entries in box.cfg.replication, and specifies the number of peers the replica should be in sync with to enter read-write mode. If the quorum is reached, the mode is read-write. Otherwise the mode is read-only. The new, forced read-only mode can be changed only by changing box.cfg.replication or box.cfg.replication_connect_quorum.
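As a rough sketch of the read-write case described in the last bullet (placeholder URIs; the quorum value is chosen just for the example):

    box.cfg{
        replication = {'host1:3301', 'host2:3301', 'host3:3301'},
        read_only   = false,
        -- defaults to #box.cfg.replication (3 here); lowering it to 2 lets the
        -- instance leave box.cfg{} in read-write mode with one peer still down
        replication_connect_quorum = 2,
    }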
@rtsisyk rtsisyk added this to the 1.7.7 milestone Nov 25, 2017
@kostja kostja removed this from the 1.7.7 milestone Nov 25, 2017
@kostja
Contributor

kostja commented Nov 27, 2017

I chatted with Mons; there are several possible ways to fix this. We'll meet on Monday at 7pm to discuss.

@rtsisyk
Contributor Author

rtsisyk commented Nov 27, 2017

OK, thanks for the fast response! I'm invited to this meeting too.

@rtsisyk
Contributor Author

rtsisyk commented Nov 28, 2017

Discussed with @kostja => wait for 50%+1 of remote appliers.

@rtsisyk rtsisyk added this to the 1.7.7 milestone Nov 28, 2017
locker added a commit that referenced this issue Dec 5, 2017
Do not use the generic applier_on_state infrastructure, which is used to
advance a single applier to a particular state as it doesn't really fit
the case when we need to wait for multiple appliers and it utterly fails
when it comes to waiting for an arbitrary subset of all appliers.

Needed for #2958
locker added a commit that referenced this issue Dec 5, 2017
The logic of applier_connect_all() does not really belong to applier.cc,
because it operates on a set of appliers while applier.cc is all about
an individual applier implementation. Let's move it to replication.cc
and rename it to replicaset_connect().

Needed for #2958
locker added a commit that referenced this issue Dec 5, 2017
 - Fold call to replicaset_update() in replicaset_connect().
   These two functions are always called one after another.

 - Factor out replicaset_follow() to resume all appliers
   registered with the replica set.

Needed for #2958
locker added a commit that referenced this issue Dec 5, 2017
If one cluster node is down permanently for some reason, no other node
can restart - they will stall in box.cfg{} until all other nodes are up
and running. This complicates a tarantool cluster deployment in real
world scenarios.

To address this issue, let's complete the configuration as soon as
connections have been established to more than a half of all configured
replicas, assuming that the rest will connect asynchronously.

Closes #2958
@rtsisyk rtsisyk added the blocked Not ready to be implemented label Dec 11, 2017
@rtsisyk rtsisyk assigned kostja and unassigned locker Dec 11, 2017
@rtsisyk rtsisyk added in design Requires design document and removed blocked Not ready to be implemented labels Dec 11, 2017
@kostja kostja removed in design Requires design document prio1 labels Dec 12, 2017
@kostja kostja changed the title Don't wait remote replica on box.cfg() when cluster is boostrapped Don't wait remote replica on box.cfg() when at replica set instance start Dec 12, 2017
@kostja kostja changed the title Don't wait remote replica on box.cfg() when at replica set instance start Don't wait remote replica on box.cfg() when starting a replica set instance Dec 12, 2017
@kostja kostja changed the title Don't wait remote replica on box.cfg() when starting a replica set instance Don't wait for all remote replicas when starting a replica set instance Dec 12, 2017
@tarantool tarantool deleted a comment from rtsisyk Dec 12, 2017
@tarantool tarantool deleted a comment from rtsisyk Dec 12, 2017
@GeorgyKirichenko
Contributor

I assume that there is no point in waiting until all replicas are inserted in the case of the initial cluster bootstrap, because none of the replicas should contain any data yet (for correct configurations, of course). So, if a 'bootstrap is not finished' status was received during the initial bootstrap/join, then box.cfg may finish without all the other masters.

locker added a commit that referenced this issue Dec 29, 2017
Do not use the generic applier_on_state infrastructure, which is used to
advance a single applier to a particular state as it doesn't really fit
the case when we need to wait for multiple appliers and it utterly fails
when it comes to waiting for an arbitrary subset of all appliers.

Needed for #2958
locker added a commit that referenced this issue Dec 29, 2017
The logic of applier_connect_all() does not really belong to applier.cc,
because it operates on a set of appliers while applier.cc is all about
an individual applier implementation. Let's move it to replication.cc
and rename it to replicaset_connect().

Needed for #2958
locker added a commit that referenced this issue Jan 9, 2018
 - Fold call to replicaset_update() in replicaset_connect().
   These two functions are always called one after another.

 - Factor out replicaset_follow() to resume all appliers
   registered with the replica set.

Needed for #2958
locker added a commit that referenced this issue Jan 9, 2018
If one cluster node is down permanently for some reason, no other node
can restart - they will stall in box.cfg{} until all other nodes are up
and running. This complicates a tarantool cluster deployment in real
world scenarios.

To address this issue, let's complete the configuration as soon as
connections have been established to the number of hosts
specified by the new configuration option, box.cfg.replication_quorum,
assuming the rest will connect asynchronously. If the option is unset,
it defaults to the number of entries in box.cfg.replication so this
patch shouldn't affect the behavior of existing setups.

Closes #2958
rtsisyk pushed a commit that referenced this issue Jan 10, 2018
Do not use the generic applier_on_state infrastructure, which is used to
advance a single applier to a particular state as it doesn't really fit
the case when we need to wait for multiple appliers and it utterly fails
when it comes to waiting for an arbitrary subset of all appliers.

Needed for #2958
rtsisyk pushed a commit that referenced this issue Jan 10, 2018
The logic of applier_connect_all() does not really belong to applier.cc,
because it operates on a set of appliers while applier.cc is all about
an individual applier implementation. Let's move it to replication.cc
and rename it to replicaset_connect().

Needed for #2958
rtsisyk pushed a commit that referenced this issue Jan 10, 2018
 - Fold call to replicaset_update() in replicaset_connect().
   These two functions are always called one after another.

 - Factor out replicaset_follow() to resume all appliers
   registered with the replica set.

Needed for #2958
@rtsisyk
Contributor Author

rtsisyk commented Jan 10, 2018

Added replication_quorum option to box.cfg():

  • box.cfg { replication = {xx, yy, zz}, replication_quorum = 0 } - don't wait for replicas (old 1.6 behaviour)
  • box.cfg { replication = {xx, yy, zz}, replication_quorum = 100500 } - wait for all replicas (new 1.7 behaviour)
  • box.cfg { replication = {xx, yy, zz}, replication_quorum = N } - wait for N replicas (custom logic)

The proposed logic with replication_lag + read_only on box.cfg() doesn't solve the problem that a replica can fall behind the master AFTER box.cfg(). A comprehensive solution is needed to forbid read/write queries on stale replicas. Currently this logic can be implemented in Lua without patching the server (a sketch follows below).

box.cfg { replication_quorum = X } - 1.7.6-140-gf1167c429 (1.7-next)
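As a rough illustration of the last point, a background fiber could watch upstream lag and force read-only mode when the replica falls too far behind. This is only a sketch: the threshold and polling interval are arbitrary, and it assumes box.info.replication[...].upstream.lag is available (as in recent 1.7 builds):

    local fiber = require('fiber')

    local MAX_LAG = 1.0  -- seconds; arbitrary threshold for this example

    fiber.create(function()
        while true do
            local stale = false
            for _, r in pairs(box.info.replication) do
                -- 'upstream' is present only for peers this instance replicates from
                if r.upstream and r.upstream.lag and r.upstream.lag > MAX_LAG then
                    stale = true
                end
            end
            -- a lagging replica stops accepting writes; writes are re-enabled once it catches up
            if stale and not box.cfg.read_only then
                box.cfg{read_only = true}
            elseif not stale and box.cfg.read_only then
                box.cfg{read_only = false}
            end
            fiber.sleep(0.5)
        end
    end)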

@tarantool tarantool deleted a comment from unera Jan 10, 2018
locker added a commit that referenced this issue Jan 10, 2018
'replication_connect_quorum' seems to be a better name for this option,
as it is ignored once connections to remote masters have been established.

Suggested by @kostja

Follow-up #2958
locker added a commit that referenced this issue Jan 10, 2018
Currently, we have three variables related to replication timeouts:
applier_timeout, relay_timeout, and replication_cfg_timeout. They are
all set to the value of box.cfg.replication_timeout. We use these
variables in four different cases:

 - Sending heartbeat messages periodically from master to slave and back
   (relay_timeout, applier_timeout).

 - Reconnecting applier after a disconnect (applier_timeout).

 - Disconnecting a replica if no heartbeat message has been received
   within the specified timeout (TIMEOUT_PERIODS * relay_timeout).

 - Waiting for box.cfg() to succeed (replication_cfg_timeout).

This is confusing. Let's keep just one variable, replication_timeout,
that would determine the heartbeat interval and introduce the following
helpers for the three other cases:

 - replication_reconnect_timeout()
 - replication_disconnect_timeout()
 - replication_cfg_timeout()

Also, let's make replication_cfg_timeout() return 3 times the configured
timeout in the scope of this patch, because, as pointed out by @kostja,

> We need another replication_timeout variable, using the same variable
> for everything doesn't work.  Please try setting a broken
> box.cfg.replication second time, and you'll see that it doesn't try to
> reconnect, because reconnect timeout = replication timeout. This is
> broken, reconnect_timeout should be < replication_timeout, to allow for
> at least a few reconnects.

Suggested by @kostja

Follow-up #2958
locker added a commit that referenced this issue Jan 11, 2018
First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to drop it. Now, to the new 'orphan' mode.

Tarantool is switched to 'orphan' mode if it failed to form a quorum
within the specified timeout during initial configuration of an instance
that has already been bootstrapped. Basically, 'orphan' status forces
the instance to switch to read-only mode. The instance will switch back
to 'running' (and enable write access) in either of the following three
cases:

 - replication is successfully reconfigured
 - box.cfg.replication_connect_quorum is changed
 - enough replicas are connected to form a quorum

Note, any subsequent attempt to change box.cfg.replication won't change
the server status - as before, it will fail downright if it can't form a
quorum within the configured timeout period.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
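For reference, a hedged sketch of how a DBA might clear the forced 'orphan' state described in the commit above, using the option and status names it introduces (the peer URI is a placeholder):

    box.info.status
    -- 'orphan': the quorum was not reached during startup

    -- option 1: reconfigure replication to the peers that are actually reachable
    box.cfg{replication = {'host1:3301'}}

    -- option 2: lower the quorum so the already-connected peers are enough
    box.cfg{replication_connect_quorum = 1}

    box.info.status
    -- 'running' once enough replicas are connected and synced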
kostja pushed a commit that referenced this issue Jan 11, 2018
'replication_connect_quorum' seems to be a better name for this option,
as it is ignored once connections to remote masters have been established.

Suggested by @kostja

Follow-up #2958
kostja pushed a commit that referenced this issue Jan 12, 2018
Currently, we have three variables related to replication timeouts:
applier_timeout, relay_timeout, and replication_cfg_timeout. They are
all set to the value of box.cfg.replication_timeout. We use these
variables in four different cases:

 - Sending heartbeat messages periodically from master to slave and back
   (relay_timeout, applier_timeout).

 - Reconnecting applier after a disconnect (applier_timeout).

 - Disconnecting a replica if no heartbeat message has been received
   within the specified timeout (TIMEOUT_PERIODS * replication_timeout).

 - Waiting for box.cfg() to succeed (replication_connect_quorum_timeout).

This is confusing. Let's keep just one variable, replication_timeout,
that would determine the heartbeat interval and introduce the following
helpers for the three other cases:

 - replication_reconnect_timeout()
 - replication_disconnect_timeout()
 - replication_connect_quorum_timeout()

Also, let's make replication_connect_quorum_timeout() return 4 times the
configured timeout in the scope of this patch, because, as pointed out by
@kostja,

> We need another replication_timeout variable, using the same variable
> for everything doesn't work.  Please try setting a broken
> box.cfg.replication second time, and you'll see that it doesn't try to
> reconnect, because reconnect timeout = replication timeout. This is
> broken, reconnect_timeout should be < replication_timeout, to allow for
> at least a few reconnects.

Suggested by @kostja

Follow-up #2958
locker added a commit that referenced this issue Jan 14, 2018
replicaset_connect() leaves appliers that failed to connect within the
specified time period running. To prevent them from entering 'subscribe'
stage prematurely (i.e. before replicaset_follow() is called), we set
replica->pause_on_connect flag, which will force them to freeze upon
successful connection. We clear this flag in replicaset_follow(). This
juggling with flags looks ugly. Instead, let's stop failed appliers in
replicaset_connect() and restart them in replicaset_follow().

Follow-up #2958
locker added a commit that referenced this issue Jan 14, 2018
This patch modifies the replication configuration procedure so as to
fully conform to the specification presented in #2958. In a nutshell,
now box.cfg() tries to synchronize all connected replicas before
returning. If it fails to connect enough replicas to form a quorum, it
leaves the server in a degraded 'orphan' mode, which is basically
read-only. More details below.

First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than
the value of new configuration option box.cfg.replication_sync_lag.
Otherwise a replica is considered to be syncing and has "sync" status.
If replication_sync_lag is unset (nil) or set to 0, then a replica skips
the "sync" state and switches to "follow" immediately. For the sake of
backward compatibility, the new option is unset by default.

If box.cfg() is called for the very first time (bootstrap) for a given
instance, then

 1. It tries to connect to all configured replicas for as long as it
    takes (replication_timeout isn't taken into account). If it fails to
    connect to at least one replica, bootstrap is aborted.

 2. If this is a cluster bootstrap and the current instance turns out to
    be the new cluster leader, then it performs local bootstrap and
    switches to 'running' state and leaves box.cfg() immediately.

 3. Otherwise (i.e. if this is bootstrap of a slave replica), then it
    bootstraps from a remote master and then stays in 'orphan' state
    until it synchronizes with all replicas before switching to
    'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the
local storage, then

 1. It recovers the last snapshot and xlogs stored in the local
    directory.

 2. Then it switches to 'orphan' mode and tries to connect to at least
    as many replicas as specified by box.cfg.replication_connect_quorum
    for a time period which is a multiple of box.cfg.replication_timeout
    (4x). If it fails, it doesn't abort.

 3. Then it synchronizes with all configured replicas. It doesn't leave
    box.cfg() until all connected replicas are synchronized. If it
    managed to synchronize to enough replicas to form a quorum, it
    switches to 'running' state, otherwise it leaves the server in
    'orphan' state to be switched to 'running' asynchronously, as soon
    as a quorum is formed.

If box.cfg() is called after recovery to reconfigure replication, then
it tries to connect to all specified replicas within a time period which
is a multiple of box.cfg.replication_timeout (4x). The value of
box.cfg.replication_connect_quorum isn't taken into account, neither is
the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as
all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
Closes #999
locker added a commit that referenced this issue Jan 14, 2018
The replication/catch test's intention is to check that a replica
doesn't enter the write mode until it syncs up with the master.
Actually, this feature was introduced only recently, in the scope
of #2958, so the test is obviously bogus. Let's just drop it as the
functionality it tries and fails to test is already covered by the
replication/quorum test.
@kostja kostja added the feature A new functionality label Jan 19, 2018
locker added a commit that referenced this issue Jan 24, 2018
replicaset_connect() leaves appliers that failed to connect within the
specified time period running. To prevent them from entering 'subscribe'
stage prematurely (i.e. before replicaset_follow() is called), we set
replica->pause_on_connect flag, which will force them to freeze upon
successful connection. We clear this flag in replicaset_follow(). This
juggling with flags looks ugly. Instead, let's stop failed appliers in
replicaset_connect() and restart them in replicaset_follow().

Follow-up #2958
locker added a commit that referenced this issue Jan 24, 2018
This patch modifies the replication configuration procedure so as to
fully conform to the specification presented in #2958. In a nutshell,
now box.cfg() tries to synchronize all connected replicas before
returning. If it fails to connect enough replicas to form a quorum, it
leaves the server in a degraded 'orphan' mode, which is basically
read-only. More details below.

First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than
the value of new configuration option box.cfg.replication_sync_lag.
Otherwise a replica is considered to be syncing and has "sync" status.
If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then
a replica skips the "sync" state and switches to "follow" immediately.
For the sake of backward compatibility, the new option is unset by
default.

If box.cfg() is called for the very first time (bootstrap) for a given
instance, then

 1. It tries to connect to all configured replicas for as long as it
    takes (replication_timeout isn't taken into account). If it fails to
    connect to at least one replica, bootstrap is aborted.

 2. If this is a cluster bootstrap and the current instance turns out to
    be the new cluster leader, then it performs local bootstrap and
    switches to 'running' state and leaves box.cfg() immediately.

 3. Otherwise (i.e. if this is bootstrap of a slave replica), then it
    bootstraps from a remote master and then stays in 'orphan' state
    until it synchronizes with all replicas before switching to
    'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the
local storage, then

 1. It recovers the last snapshot and xlogs stored in the local
    directory.

 2. Then it switches to 'orphan' mode and tries to connect to at least
    as many replicas as specified by box.cfg.replication_connect_quorum
    for a time period which is a multiple of box.cfg.replication_timeout
    (4x). If it fails, it doesn't abort, but leaves box.cfg() in
    'orphan' mode. The state will switch to 'running' asynchronously as
    soon as the instance has synced with 'replication_connect_quorum'
    replicas.

 3. If it managed to connect to enough replicas to form a quorum at step
    2, it synchronizes with them: box.cfg() doesn't return until at
    least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then
it tries to connect to all specified replicas within a time period which
is a multiple of box.cfg.replication_timeout (4x). The value of
box.cfg.replication_connect_quorum isn't taken into account, neither is
the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as
all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
Closes #999
@kostja kostja closed this as completed in f1167c4 Jan 25, 2018
locker added a commit that referenced this issue Jan 29, 2018
This patch modifies the replication configuration procedure so as to
fully conform to the specification presented in #2958. In a nutshell,
now box.cfg() tries to synchronize all connected replicas before
returning. If it fails to connect enough replicas to form a quorum, it
leaves the server in a degraded 'orphan' mode, which is basically
read-only. More details below.

First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than
the value of new configuration option box.cfg.replication_sync_lag.
Otherwise a replica is considered to be syncing and has "sync" status.
If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then
a replica skips the "sync" state and switches to "follow" immediately.
For the sake of backward compatibility, the new option is unset by
default.

If box.cfg() is called for the very first time (bootstrap) for a given
instance, then

 1. It tries to connect to all configured replicas for as long as it
    takes (replication_timeout isn't taken into account). If it fails to
    connect to at least one replica, bootstrap is aborted.

 2. If this is a cluster bootstrap and the current instance turns out to
    be the new cluster leader, then it performs local bootstrap and
    switches to 'running' state and leaves box.cfg() immediately.

 3. Otherwise (i.e. if this is bootstrap of a slave replica), then it
    bootstraps from a remote master and then stays in 'orphan' state
    until it synchronizes with all replicas before switching to
    'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the
local storage, then

 1. It recovers the last snapshot and xlogs stored in the local
    directory.

 2. Then it switches to 'orphan' mode and tries to connect to at least
    as many replicas as specified by box.cfg.replication_connect_quorum
    for a time period which is a multiple of box.cfg.replication_timeout
    (4x). If it fails, it doesn't abort, but leaves box.cfg() in
    'orphan' mode. The state will switch to 'running' asynchronously as
    soon as the instance has synced with 'replication_connect_quorum'
    replicas.

 3. If it managed to connect to enough replicas to form a quorum at step
    2, it synchronizes with them: box.cfg() doesn't return until at
    least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then
it tries to connect to all specified replicas within a time period which
is a multiple of box.cfg.replication_timeout (4x). The value of
box.cfg.replication_connect_quorum isn't taken into account, neither is
the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as
all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
Closes #999
kostja pushed a commit that referenced this issue Jan 30, 2018
replicaset_connect() leaves appliers that failed to connect within the
specified time period running. To prevent them from entering 'subscribe'
stage prematurely (i.e. before replicaset_follow() is called), we set
replica->pause_on_connect flag, which will force them to freeze upon
successful connection. We clear this flag in replicaset_follow(). This
juggling with flags looks ugly. Instead, let's stop failed appliers in
replicaset_connect() and restart them in replicaset_follow().

Follow-up #2958
locker added a commit that referenced this issue Jan 30, 2018
This patch modifies the replication configuration procedure so as to
fully conform to the specification presented in #2958. In a nutshell,
now box.cfg() tries to synchronize all connected replicas before
returning. If it fails to connect enough replicas to form a quorum, it
leaves the server in a degraded 'orphan' mode, which is basically
read-only. More details below.

First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than
the value of new configuration option box.cfg.replication_sync_lag.
Otherwise a replica is considered to be syncing and has "sync" status.
If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then
a replica skips the "sync" state and switches to "follow" immediately.
The default value of replication_sync_lag is 10 seconds, but it is
ignored (assumed to be inf) in case the master is running tarantool
older than 1.7.7, which does not send heartbeat messages.

If box.cfg() is called for the very first time (bootstrap) for a given
instance, then

 1. It tries to connect to all configured replicas for as long as it
    takes (replication_timeout isn't taken into account). If it fails to
    connect to at least one replica, bootstrap is aborted.

 2. If this is a cluster bootstrap and the current instance turns out to
    be the new cluster leader, then it performs local bootstrap and
    switches to 'running' state and leaves box.cfg() immediately.

 3. Otherwise (i.e. if this is bootstrap of a slave replica), then it
    bootstraps from a remote master and then stays in 'orphan' state
    until it synchronizes with all replicas before switching to
    'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the
local storage, then

 1. It recovers the last snapshot and xlogs stored in the local
    directory.

 2. Then it switches to 'orphan' mode and tries to connect to at least
    as many replicas as specified by box.cfg.replication_connect_quorum
    for a time period which is a multiple of box.cfg.replication_timeout
    (4x). If it fails, it doesn't abort, but leaves box.cfg() in
    'orphan' mode. The state will switch to 'running' asynchronously as
    soon as the instance has synced with 'replication_connect_quorum'
    replicas.

 3. If it managed to connect to enough replicas to form a quorum at step
    2, it synchronizes with them: box.cfg() doesn't return until at
    least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then
it tries to connect to all specified replicas within a time period which
is a multiple of box.cfg.replication_timeout (4x). The value of
box.cfg.replication_connect_quorum isn't taken into account, neither is
the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as
all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
Closes #999
kostja pushed a commit that referenced this issue Jan 31, 2018
This patch modifies the replication configuration procedure so as to
fully conform to the specification presented in #2958. In a nutshell,
now box.cfg() tries to synchronize all connected replicas before
returning. If it fails to connect enough replicas to form a quorum, it
leaves the server in a degraded 'orphan' mode, which is basically
read-only. More details below.

First of all, it's worth mentioning that we already have 'orphan' status
in Tarantool (between 'loading' and 'hot_standby'), but it has nothing
to do with replication. Actually, it's unclear why it was introduced in
the first place so we agreed to silently drop it.

We assume that a replica is synchronized if its lag is not greater than
the value of new configuration option box.cfg.replication_sync_lag.
Otherwise a replica is considered to be syncing and has "sync" status.
If replication_sync_lag is unset (nil) or set to TIMEOUT_INFINITY, then
a replica skips the "sync" state and switches to "follow" immediately.
The default value of replication_sync_lag is 10 seconds, but it is
ignored (assumed to be inf) in case the master is running tarantool
older than 1.7.7, which does not send heartbeat messages.

If box.cfg() is called for the very first time (bootstrap) for a given
instance, then

 1. It tries to connect to all configured replicas for as long as it
    takes (replication_timeout isn't taken into account). If it fails to
    connect to at least one replica, bootstrap is aborted.

 2. If this is a cluster bootstrap and the current instance turns out to
    be the new cluster leader, then it performs local bootstrap and
    switches to 'running' state and leaves box.cfg() immediately.

 3. Otherwise (i.e. if this is bootstrap of a slave replica), then it
    bootstraps from a remote master and then stays in 'orphan' state
    until it synchronizes with all replicas before switching to
    'running' state and leaving box.cfg().

If box.cfg() is called after bootstrap, in order to recover from the
local storage, then

 1. It recovers the last snapshot and xlogs stored in the local
    directory.

 2. Then it switches to 'orphan' mode and tries to connect to at least
    as many replicas as specified by box.cfg.replication_connect_quorum
    for a time period which is a multiple of box.cfg.replication_timeout
    (4x). If it fails, it doesn't abort, but leaves box.cfg() in
    'orphan' mode. The state will switch to 'running' asynchronously as
    soon as the instance has synced with 'replication_connect_quorum'
    replicas.

 3. If it managed to connect to enough replicas to form a quorum at step
    2, it synchronizes with them: box.cfg() doesn't return until at
    least 'replication_connect_quorum' replicas have been synchronized.

If box.cfg() is called after recovery to reconfigure replication, then
it tries to connect to all specified replicas within a time period which
is a multiple of box.cfg.replication_timeout (4x). The value of
box.cfg.replication_connect_quorum isn't taken into account, neither is
the value of box.cfg.replication_sync_lag - box.cfg() returns as soon as
all configured replicas have been connected.

Just like any other status, the new one is reflected by box.info.status.

Suggested by @kostja

Follow-up #2958
Closes #999