
box.ctl.promote() #3055

Closed
laserjump opened this issue Jan 18, 2018 · 7 comments


laserjump commented Jan 18, 2018

Implement a built-in call which promotes a replica to a master in a replica set.

What it should do:

  1. Check the current master-slave configuration. If the replica is already in read-write mode, do nothing. If there is another master in the configuration, return a warning that there is more than one master.
  2. If the current instance is in read-only mode, and thus can indeed be promoted to master, find the current master. If no master is found, continue to step 5.
  3. Set the current master to read-only mode.
  4. Wait until all replicas in the replica set synchronize their vclock with the current master.
  5. Set itself to read-write mode.
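
The five steps above can be sketched in Lua. This is a hypothetical sketch, not an existing API: find_master() and wait_for_vclock_sync() are assumed helpers that do not exist in Tarantool.

```lua
-- Hypothetical sketch of the five promotion steps. find_master() and
-- wait_for_vclock_sync() are assumed helpers, not a real Tarantool API.
local function promote()
    -- 1. Already read-write: nothing to do.
    if not box.cfg.read_only then
        return
    end
    -- 2. Find the current master, if any.
    local master = find_master()
    if master ~= nil then
        -- 3. Put the current master into read-only mode.
        master:eval("box.cfg{read_only = true}")
        -- 4. Wait until the replicas catch up with the master's vclock.
        wait_for_vclock_sync(master)
    end
    -- 5. Promote self to read-write mode.
    box.cfg{read_only = false}
end
```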

Open issues:

  • what should be the name of the call? Options: box.ctl.promote(), box.ctl.set_master(), box.ctl.change_master_to().
  • Perhaps we should simply improve how read_only changes are handled in box.cfg so that they are race-condition free? Then no special API is needed.
  • prevention of race conditions and persisting the role change. The call should work correctly in case of a race condition (multiple change-master calls on multiple instances in a replica set) and persist over server restarts (the problem is that some instances may restart while others do not; for example, the old master can be restarted). Ideally we should persist the state of this procedure in a system space on all nodes, so that we can resume it at any time. Perhaps we should store the current mode/role in the WAL; this needs to be investigated.

Now that we have before_replace triggers, we essentially have logical replication available as a vehicle for message passing between master and slave. Perhaps we should make IPROTO_NOP possible in read-only mode, so that message passing can work in both directions: from the read-write master to the read-only one, and vice versa.

How we can achieve correctness of the algorithm without persisting its state:

  • each promotion should begin with a start message and end with a commit message
  • each message in a promotion is identified by a unique promotion round id (GUID or timestamp)
  • the start message should have a timeout associated with the promotion
  • if the promotion doesn't happen before the timeout expires, all nodes forget about this promotion round and revert to the original state
  • the node which initiates the promotion grows the timeout exponentially from round to round, to deal with network lags; but if it restarts, it reverts to the original low default for the promotion round timeout
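
The round bookkeeping above could be kept in a plain Lua table. A minimal sketch, assuming a fiber-clock-based deadline; the names rounds, start_round, commit_round, and expire_rounds are made up for illustration:

```lua
local fiber = require('fiber')

-- After a restart the node reverts to this low default timeout.
local DEFAULT_TIMEOUT = 1
local timeout = DEFAULT_TIMEOUT
local rounds = {}

-- Begin a promotion round: remember its deadline, then grow the
-- timeout exponentially for the next round (network-lag tolerance).
local function start_round(round_id)
    rounds[round_id] = {deadline = fiber.clock() + timeout}
    timeout = timeout * 2
end

-- A commit message ends the round before its deadline.
local function commit_round(round_id)
    rounds[round_id] = nil
end

-- Forget every round whose timeout expired; the node reverts to its
-- original state for those rounds.
local function expire_rounds()
    for id, round in pairs(rounds) do
        if fiber.clock() > round.deadline then
            rounds[id] = nil
        end
    end
end
```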

More issues to consider:

  • what should we do if the replication link is not bi-directional?
  • what should we do in a degraded state, when we can't reliably establish the state of peers? Perhaps we should have an option force=true which proceeds even in a degraded state. Perhaps some other interface is necessary.
@kostja kostja added the feature A new functionality label Jan 19, 2018
@kostja kostja added this to the 1.7.7 milestone Jan 19, 2018
@laserjump laserjump changed the title Need a set of accompanying scripts for operating Tarantool instances in the master(read_only=true) <-> master(read_only=false) scheme a script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme Jan 19, 2018
@laserjump laserjump changed the title a script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme A script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme Jan 19, 2018

rtsisyk commented Jan 22, 2018

box.cfg { read_only = true } ?

laserjump (author) commented:

And what does this have to do with minimizing read-only lag?
I rather meant something like https://gist.github.com/Mons/9dc4e2e3097c231551d4f0d130986149, but implemented in Lua in Tarantool 1.7 and supported by the developers


Mons commented Jan 24, 2018

I have one for 1.6 as well


alyapunov commented Jan 24, 2018 via email

@kostja kostja assigned locker and unassigned kostja Feb 11, 2018
@kostja kostja modified the milestones: 1.8.0, 1.9.0 Feb 11, 2018
@kostja kostja added prio1 replication in design Requires design document labels Feb 11, 2018
@kostja kostja changed the title A script for switching load between replicas in the master(read_only=true) <-> master(read_only=false) scheme box.ctl.promote() Feb 15, 2018
@kostja kostja assigned Gerold103 and unassigned locker Feb 15, 2018
@kostja kostja removed the in design Requires design document label Feb 15, 2018
@Gerold103 Gerold103 added the in design Requires design document label Feb 16, 2018

Gerold103 commented Feb 19, 2018

Basic algorithm

box.ctl.promote() - a function to make the current replica the master of a
full-mesh replica set. A master is a replica in read-write mode. A slave is a
replica in read-only mode.

  1. Check the mode of the replica: if it is already read-write, then follow
    through on the remaining steps of the algorithm to ensure all other replicas
    are read-only.
  2. Otherwise, the current master must be found by map-reduce via the
    replication connections (details of replication connection usage are
    described below).
  3. If the found master is already serving another box.ctl.promote, then the
    new box.ctl.promote is aborted.
  4. If a master is found, set it to read-only mode for a timeout of T seconds.
    The found master remembers that its state is neither master nor replica;
    this state is used in (3) to abort new box.ctl.promote calls.
  5. Wait for a full data sync on the just-demoted master. If the T-second
    timeout expires, then abort: the old master automatically re-enters
    read-write mode, and the current replica finishes the sync. If the full sync
    succeeds, then make the old master a replica and enter read-write mode.

Various problems

Obviously, there are no problems if:

  • the replica goes down before sending any requests to the master - the
    algorithm did not manage to start;
  • the replica goes down after the master responds with ok and box.ctl.promote
    returns OK - the algorithm has already finished.

Assume the old master goes down while it is syncing in read-only mode. In such a
case, when it is restarted it comes back in read-write mode, because the
configuration is not persisted. The replica that called box.ctl.promote returns
an error after T seconds, used as the timeout.

Assume the replica that called box.ctl.promote goes down when the old master has
already finished syncing and entered read-only mode. Then the cluster becomes
read-only. This case is indistinguishable from the one where the new master
fails right after box.ctl.promote.

Implementation

A host on which box.ctl.promote is called must communicate with the replica set
members. Creating net.box connections to all replicas on each call is too
expensive and slow. Reusing the existing replication connections seems to be a
much better alternative.

Consider how communication can be done via replication connections. First, a
new IPROTO command family must be introduced: IPROTO_CTL (the name can be
discussed). IPROTO_CTL is an array of maps. Each map is one command, like
"enter read-write mode" or "enter read-only mode for T seconds".
IPROTO_CTL is a sequence of such commands, which is applied in one
"transaction": either all commands are applied, or none.
A new structure for operations appears in the code:

struct iproto_ctl_op {
	enum iproto_ctl_type type;
	/* Named do_op because "do" is a reserved keyword in C. */
	int (*do_op)(struct iproto_ctl_op *op);
	void (*rollback)(struct iproto_ctl_op *op);
	void (*destroy)(struct iproto_ctl_op *op);
};

A sequence of these operations is applied one by one via do_op() calls. If any
do_op() returns non-zero, then rollback() is called for all already applied
operations (like AlterSpaceOp).
When box.ctl.promote is called on a replica, it sends IPROTO_CTL commands from
the relay thread. The applier processes the requests and sends responses via
the same replication connection.
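
The apply-or-rollback semantics can be illustrated in Lua (a sketch of the pattern only, not the proposed C implementation; the method is called do_op here because "do" is a reserved word):

```lua
-- Apply a sequence of operations as one "transaction": if any
-- operation fails, roll back the already-applied ones in reverse
-- order, like AlterSpaceOp does.
local function apply_ops(ops)
    for i, op in ipairs(ops) do
        if op:do_op() ~= 0 then
            for j = i - 1, 1, -1 do
                ops[j]:rollback()
            end
            return false
        end
    end
    return true
end
```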

IPROTO_CTL constants:
IPROTO_CTL_MODE_RW,
IPROTO_CTL_MODE_RO.

IPROTO_CTL commands needed for box.ctl.promote:

  • IPROTO_CTL_MODE: <nil> - get the replica mode: IPROTO_CTL_MODE_RW/RO;
  • IPROTO_CTL_MODE: IPROTO_CTL_MODE_RW/RO - set the replica mode;
  • IPROTO_CTL_SYNC: <seconds> - an instance that receives the message waits
    for a full sync during <seconds>. If there is no full sync, an error is
    returned.

Final protocol of promotion:

      Replica                                           Master
               IPROTO_CTL: [
                 {IPROTO_CTL_MODE: IPROTO_CTL_MODE_RO};
                 {IPROTO_CTL_SYNC: <timeout>;}
               ] ------------------------------------------->
      
               <-----------------IPROTO_OK-------------------
 set read-write mode;

Final box.ctl.promote API:
box.ctl.promote({timeout = <seconds>}).
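
Usage of the proposed call would then look like this (hypothetical, per the API above):

```lua
-- Promote this instance, giving the promotion protocol up to
-- 10 seconds before the round is aborted.
box.ctl.promote({timeout = 10})
```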

@Gerold103 Gerold103 added design review and removed in design Requires design document labels Feb 20, 2018
@Gerold103

The ticket is mine.

avtikhon added a commit that referenced this issue Jun 24, 2021
Updated:

  box/net.box_reconnect_after_gh-3164.test.lua gh-5081
  replication/errinj.test.lua                  gh-3870
  replication/qsync_basic.test.lua             gh-5355
  replication/anon.test.lua                    gh-5381
  replication/status.test.lua                  gh-5409
  replication/election_qsync.test.lua          gh-5430

Added new:

  box-py/iproto.test.py                             gh-qa-132
  replication/gh-5435-qsync-clear-synchro-queue-co> gh-qa-129
  replication/gh-5445-leader-inconsistency.test.lua gh-qa-129
  replication/gh-3055-election-promote.test.lua     gh-qa-127
  replication/election_basic.test.lua               gh-qa-133
@kyukhin kyukhin removed this from the wishlist milestone Jul 13, 2021
sergepetrenko added a commit that referenced this issue Jul 23, 2021
The failure itself was fixed in 68de875
(raft: replace raft_start_candidate with _promote), let's add a
regression test now.

Follow-up #3055
sergepetrenko added a commit that referenced this issue Aug 9, 2021
Found the following error in our CI:

[001] Test failed! Result content mismatch:
[001] --- replication/gh-3055-election-promote.result	Mon Aug  2 17:52:55 2021
[001] +++ var/rejects/replication/gh-3055-election-promote.reject	Mon Aug  9 10:29:34 2021
[001] @@ -88,7 +88,7 @@
[001]   | ...
[001]  assert(not box.info.ro)
[001]   | ---
[001] - | - true
[001] + | - error: assertion failed!
[001]   | ...
[001]  assert(box.info.election.term > term)
[001]   | ---
[001]

The problem was the same as in recently fixed election_qsync.test
(commit 096a0a7): PROMOTE is written to
WAL asynchronously, and box.ctl.promote() returns earlier than this
happens.

Fix the issue by waiting for the instance to become writeable.

Follow-up #6034