
box.ctl.promote() #3055

Closed
laserjump opened this issue Jan 18, 2018 · 7 comments


laserjump commented Jan 18, 2018

Implement a built-in call which promotes a replica to a master in a replica set.

What it should do:

  1. Check the current master-slave configuration. If the replica is already in read-write mode, do nothing. If there is another master in the configuration, return a warning that there is more than one master.
  2. If the current instance is in read-only mode, and thus can indeed be promoted to master, find the current master. If no master is found, continue to step 5.
  3. Set the current master to read-only mode.
  4. Wait until all replicas in the replica set synchronize their vclock with the current master.
  5. Set itself to read-write mode.
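
The five steps above can be sketched in Lua. This is a hypothetical sketch, not an existing API: find_master() and wait_for_vclock_sync() are assumed helpers that do not exist in Tarantool.

```lua
-- Hypothetical sketch of the five promotion steps. find_master() and
-- wait_for_vclock_sync() are assumed helpers, not a real Tarantool API.
local function promote()
    -- 1. Already read-write: nothing to do.
    if not box.cfg.read_only then
        return
    end
    -- 2. Find the current master, if any.
    local master = find_master()
    if master ~= nil then
        -- 3. Put the current master into read-only mode.
        master:eval("box.cfg{read_only = true}")
        -- 4. Wait until the replicas catch up with the master's vclock.
        wait_for_vclock_sync(master)
    end
    -- 5. Promote self to read-write mode.
    box.cfg{read_only = false}
end
```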

Open issues:

  • what should be the name of the call? Options: box.ctl.promote(), box.ctl.set_master(), box.ctl.change_master_to().
  • Perhaps we should simply improve how read_only changes are handled in box.cfg so that they are race-condition free? Then no special API is needed.
  • prevention of race conditions and persisting the role change. The call should work correctly in case of a race condition (multiple change-master calls on multiple instances in a replica set) and persist over server restarts (the problem is that some instances may restart while others do not; for example, the old master can be restarted). Ideally we should persist the state of this procedure in a system space on all nodes, so that we can resume it at any time. Perhaps we should store the current mode/role in the WAL; this needs to be investigated.

Now that we have before_replace triggers, we essentially have logical replication available as a vehicle for message passing between master and slave. Perhaps we should make IPROTO_NOP possible in read-only mode, so that message passing can work in both directions: from the read-write master to the read-only one, and vice versa.

How we can achieve correctness of the algorithm without persisting its state:

  • each promotion should begin with a start message and end with a commit message
  • each message in a promotion is identified by a unique promotion round id (GUID or timestamp)
  • the start message should have a timeout associated with the promotion
  • if the promotion doesn't happen before the timeout expires, all nodes forget about this promotion round and revert to the original state
  • the node which initiates the promotion grows the timeout exponentially from round to round, to deal with network lags; but if it restarts, it reverts to the original low default for the promotion round timeout
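
The round bookkeeping above could be kept in a plain Lua table. A minimal sketch, assuming a fiber-clock-based deadline; the names rounds, start_round, commit_round, and expire_rounds are made up for illustration:

```lua
local fiber = require('fiber')

-- After a restart the node reverts to this low default timeout.
local DEFAULT_TIMEOUT = 1
local timeout = DEFAULT_TIMEOUT
local rounds = {}

-- Begin a promotion round: remember its deadline, then grow the
-- timeout exponentially for the next round (network-lag tolerance).
local function start_round(round_id)
    rounds[round_id] = {deadline = fiber.clock() + timeout}
    timeout = timeout * 2
end

-- A commit message ends the round before its deadline.
local function commit_round(round_id)
    rounds[round_id] = nil
end

-- Forget every round whose timeout expired; the node reverts to its
-- original state for those rounds.
local function expire_rounds()
    for id, round in pairs(rounds) do
        if fiber.clock() > round.deadline then
            rounds[id] = nil
        end
    end
end
```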

More issues to consider:

  • what should we do if the replication link is not bi-directional?
  • what should we do in a degraded state, when we can't reliably establish the state of peers? Perhaps we should have an option force=true which proceeds even in a degraded state. Perhaps some other interface is necessary.
@kostja kostja added the feature A new functionality label Jan 19, 2018
@kostja kostja added this to the 1.7.7 milestone Jan 19, 2018
@laserjump laserjump changed the title Need a set of accompanying scripts for operating Tarantool instances in the master(read_only=true) <-> master(read_only=false) scheme a script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme Jan 19, 2018
@laserjump laserjump changed the title a script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme A script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme Jan 19, 2018

rtsisyk commented Jan 22, 2018

box.cfg { read_only = true } ?

laserjump (author) commented:

And what does this have to do with minimizing read-only lag?
I rather meant something like https://gist.github.com/Mons/9dc4e2e3097c231551d4f0d130986149, but implemented in Lua in Tarantool 1.7 and supported by the developers


Mons commented Jan 24, 2018

I have one for 1.6 as well


alyapunov commented Jan 24, 2018 via email

@kostja kostja assigned locker and unassigned kostja Feb 11, 2018
@kostja kostja modified the milestones: 1.8.0, 1.9.0 Feb 11, 2018
@kostja kostja added prio1 replication in design Requires design document labels Feb 11, 2018
@kostja kostja changed the title A script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme box.ctl.promote() Feb 15, 2018
@kostja kostja assigned Gerold103 and unassigned locker Feb 15, 2018
@kostja kostja removed the in design Requires design document label Feb 15, 2018
@Gerold103 Gerold103 added the in design Requires design document label Feb 16, 2018

Gerold103 commented Feb 19, 2018

Basic algorithm

box.ctl.promote() - a function to make the current replica the master of a
full-mesh replica set. A master is a replica in read-write mode. A slave is a
replica in read-only mode.

  1. Check the mode of the replica: if it is already read-write, then follow
    through on the remaining steps of the algorithm to ensure all other replicas
    are read-only.
  2. Otherwise, the current master must be found by map-reduce via the
    replication connections (details of replication connection usage are
    described below).
  3. If the found master is already serving another box.ctl.promote, then the
    new box.ctl.promote is aborted.
  4. If a master is found, set it to read-only mode for a timeout of T seconds.
    The found master remembers that its state is neither master nor replica;
    this state is used in (3) to abort new box.ctl.promote calls.
  5. Wait for a full data sync on the just-demoted master. If the T-second
    timeout expires, then abort: the old master automatically re-enters
    read-write mode, and the current replica finishes the sync. If the full sync
    succeeds, then make the old master a replica and enter read-write mode.

Various problems

Obviously, there are no problems if:

  • the replica goes down before sending any requests to the master - the
    algorithm did not manage to start;
  • the replica goes down after the master responds with ok and box.ctl.promote
    returns OK - the algorithm has already finished.

Assume the old master goes down while it is syncing in read-only mode. In such a
case, when it is restarted it comes back in read-write mode, because the
configuration is not persisted. The replica that called box.ctl.promote returns
an error after T seconds, used as the timeout.

Assume the replica that called box.ctl.promote goes down when the old master has
already finished syncing and entered read-only mode. Then the cluster becomes
read-only. This case is indistinguishable from the one where the new master
fails right after box.ctl.promote.

Implementation

A host on which box.ctl.promote is called must communicate with the replica set
members. Creating net.box connections to all replicas on each call is too
expensive and slow. Reusing the existing replication connections seems to be a
much better alternative.

Consider how communication can be done via replication connections. First, a
new IPROTO command family must be introduced: IPROTO_CTL (the name can be
discussed). IPROTO_CTL is an array of maps. Each map is one command, like
"enter read-write mode" or "enter read-only mode for T seconds".
IPROTO_CTL is a sequence of such commands, which is applied in one
"transaction": either all commands are applied, or none.
A new structure for operations appears in the code:

struct iproto_ctl_op {
	enum iproto_ctl_type type;
	/* Named do_op because "do" is a reserved keyword in C. */
	int (*do_op)(struct iproto_ctl_op *op);
	void (*rollback)(struct iproto_ctl_op *op);
	void (*destroy)(struct iproto_ctl_op *op);
};

A sequence of these operations is applied one by one via do_op() calls. If any
do_op() returns non-zero, then rollback() is called for all already applied
operations (like AlterSpaceOp).
When box.ctl.promote is called on a replica, it sends IPROTO_CTL commands from
the relay thread. The applier processes the requests and sends responses via
the same replication connection.
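
The apply-or-rollback semantics can be illustrated in Lua (a sketch of the pattern only, not the proposed C implementation; the method is called do_op here because "do" is a reserved word):

```lua
-- Apply a sequence of operations as one "transaction": if any
-- operation fails, roll back the already-applied ones in reverse
-- order, like AlterSpaceOp does.
local function apply_ops(ops)
    for i, op in ipairs(ops) do
        if op:do_op() ~= 0 then
            for j = i - 1, 1, -1 do
                ops[j]:rollback()
            end
            return false
        end
    end
    return true
end
```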

IPROTO_CTL constants:
IPROTO_CTL_MODE_RW,
IPROTO_CTL_MODE_RO.

IPROTO_CTL commands needed for box.ctl.promote:

  • IPROTO_CTL_MODE: <nil> - get the replica mode: IPROTO_CTL_MODE_RW/RO;
  • IPROTO_CTL_MODE: IPROTO_CTL_MODE_RW/RO - set the replica mode;
  • IPROTO_CTL_SYNC: <seconds> - an instance that receives the message waits
    for a full sync during <seconds>. If there is no full sync, an error is
    returned.

Final protocol of promotion:

      Replica                                           Master
               IPROTO_CTL: [
                 {IPROTO_CTL_MODE: IPROTO_CTL_MODE_RO};
                 {IPROTO_CTL_SYNC: <timeout>;}
               ] ------------------------------------------->
      
               <-----------------IPROTO_OK-------------------
 set read-write mode;

Final box.ctl.promote API:
box.ctl.promote({timeout = <seconds>}).
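
Usage of the proposed call would then look like this (hypothetical, per the API above):

```lua
-- Promote this instance, giving the promotion protocol up to
-- 10 seconds before the round is aborted.
box.ctl.promote({timeout = 10})
```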

@Gerold103 Gerold103 added design review and removed in design Requires design document labels Feb 20, 2018
@Gerold103

The ticket is mine.

avtikhon added a commit that referenced this issue Jun 24, 2021
Updated:

  box/net.box_reconnect_after_gh-3164.test.lua gh-5081
  replication/errinj.test.lua                  gh-3870
  replication/qsync_basic.test.lua             gh-5355
  replication/anon.test.lua                    gh-5381
  replication/status.test.lua                  gh-5409
  replication/election_qsync.test.lua          gh-5430

Added new:

  box-py/iproto.test.py                             gh-qa-132
  replication/gh-5435-qsync-clear-synchro-queue-co> gh-qa-129
  replication/gh-5445-leader-inconsistency.test.lua gh-qa-129
  replication/gh-3055-election-promote.test.lua     gh-qa-127
  replication/election_basic.test.lua               gh-qa-133
@kyukhin kyukhin removed this from the wishlist milestone Jul 13, 2021
sergepetrenko added a commit that referenced this issue Jul 23, 2021
The failure itself was fixed in 68de875
(raft: replace raft_start_candidate with _promote), let's add a
regression test now.

Follow-up #3055
sergepetrenko added a commit that referenced this issue Aug 9, 2021
Found the following error in our CI:

[001] Test failed! Result content mismatch:
[001] --- replication/gh-3055-election-promote.result	Mon Aug  2 17:52:55 2021
[001] +++ var/rejects/replication/gh-3055-election-promote.reject	Mon Aug  9 10:29:34 2021
[001] @@ -88,7 +88,7 @@
[001]   | ...
[001]  assert(not box.info.ro)
[001]   | ---
[001] - | - true
[001] + | - error: assertion failed!
[001]   | ...
[001]  assert(box.info.election.term > term)
[001]   | ---
[001]

The problem was the same as in recently fixed election_qsync.test
(commit 096a0a7): PROMOTE is written to
WAL asynchronously, and box.ctl.promote() returns earlier than this
happens.

Fix the issue by waiting for the instance to become writeable.

Follow-up #6034