
Replica bootstraps own cluster with established replication cfg when all nodes are in ro state #5613

Closed
ochaton opened this issue Dec 8, 2020 · 1 comment

@ochaton
Member

ochaton commented Dec 8, 2020

Tarantool version: 1.10.8-25-g3f8ed054e, 2.6.1-11-g37077a48e

Bug description:

Steps to reproduce:

  1. Bootstrap 2 nodes (bound to 4001 and 4002).
  2. Switch them to ro using box.cfg{ read_only = true }
  3. Bootstrap 3rd node with box.cfg{ replication = {"127.0.0.1:4001", "127.0.0.1:4002", "127.0.0.1:4003"}, listen = 4003 }

Expectation:
The 3rd node must fail to bootstrap. Instead, it initiates its own cluster; a repro sketch and the actual log follow below.
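
For reference, a minimal repro sketch of the steps above (the exact replication list used for the first two nodes is an assumption; any two-node bootstrap works):

```lua
-- Instances 1 and 2, each in its own work directory (illustrative config).
box.cfg{ listen = 4001, replication = { '127.0.0.1:4001', '127.0.0.1:4002' } }
box.cfg{ listen = 4002, replication = { '127.0.0.1:4001', '127.0.0.1:4002' } }

-- Step 2: on both bootstrapped nodes.
box.cfg{ read_only = true }

-- Step 3: a third, empty instance.
box.cfg{
    listen = 4003,
    replication = { '127.0.0.1:4001', '127.0.0.1:4002', '127.0.0.1:4003' },
}
-- Expected: the bootstrap fails because there is no writable join master.
-- Actual (before the fix): the node boots its own replicaset, as in the log below.
```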

2020-12-08 20:16:33.707 [95090] main/101/init.lua C> Tarantool 1.10.8-25-g3f8ed054e
2020-12-08 20:16:33.707 [95090] main/101/init.lua C> log level 5
2020-12-08 20:16:33.707 [95090] main/101/init.lua I> mapping 268435456 bytes for memtx tuple arena...
2020-12-08 20:16:33.708 [95090] main/101/init.lua I> mapping 0 bytes for vinyl tuple arena...
2020-12-08 20:16:33.714 [95090] main/101/init.lua I> instance uuid 70cfd369-43d7-42c7-9b0a-6ca30e2f55b5
2020-12-08 20:16:33.714 [95090] iproto/101/main I> binary: bound to [::]:4003
2020-12-08 20:16:33.715 [95090] main/101/init.lua I> connecting to 3 replicas
2020-12-08 20:16:33.715 [95090] main/105/applier/127.0.0.1:4001 I> remote master f7d4b9d0-e0f1-4d25-a5ab-b2eb1b77de6b at 127.0.0.1:4001 running Tarantool 1.10.8
2020-12-08 20:16:33.716 [95090] main/106/applier/127.0.0.1:4002 I> remote master 12c02516-138b-474a-9a72-9817905870f9 at 127.0.0.1:4002 running Tarantool 1.10.8
2020-12-08 20:16:33.719 [95090] main/107/applier/127.0.0.1:4003 I> remote master 70cfd369-43d7-42c7-9b0a-6ca30e2f55b5 at 127.0.0.1:4003 running Tarantool 1.10.8
2020-12-08 20:16:33.719 [95090] main/101/init.lua I> connected to 3 replicas
2020-12-08 20:16:33.719 [95090] main/101/init.lua I> initializing an empty data directory
2020-12-08 20:16:33.752 [95090] main/101/init.lua I> assigned id 1 to replica 70cfd369-43d7-42c7-9b0a-6ca30e2f55b5
2020-12-08 20:16:33.752 [95090] main/101/init.lua I> assigned id 2 to replica 12c02516-138b-474a-9a72-9817905870f9
2020-12-08 20:16:33.752 [95090] main/101/init.lua I> assigned id 3 to replica f7d4b9d0-e0f1-4d25-a5ab-b2eb1b77de6b
2020-12-08 20:16:33.752 [95090] main/101/init.lua I> cluster uuid 1efd7cb3-18fe-4874-88b9-ea44215ecece
2020-12-08 20:16:33.762 [95090] snapshot/101/main I> saving snapshot `tnt3/00000000000000000000.snap.inprogress'
2020-12-08 20:16:33.765 [95090] snapshot/101/main I> done
2020-12-08 20:16:33.766 [95090] main/101/init.lua I> ready to accept requests
2020-12-08 20:16:33.766 [95090] main/110/checkpoint_daemon I> started
2020-12-08 20:16:33.767 [95090] main/110/checkpoint_daemon I> scheduled the next snapshot at Tue Dec  8 22:04:28 2020
2020-12-08 20:16:33.768 [95090] main/105/applier/127.0.0.1:4001 I> can't join/subscribe
2020-12-08 20:16:33.768 [95090] main/105/applier/127.0.0.1:4001 xrow.c:962 E> ER_REPLICASET_UUID_MISMATCH: Replica set UUID mismatch: expected 09141bb3-4d4c-4fd1-9f89-887f7df55588, got 1efd7cb3-18fe-4874-88b9-ea44215ecece
2020-12-08 20:16:33.768 [95090] main/106/applier/127.0.0.1:4002 I> can't join/subscribe
2020-12-08 20:16:33.768 [95090] main/106/applier/127.0.0.1:4002 xrow.c:962 E> ER_REPLICASET_UUID_MISMATCH: Replica set UUID mismatch: expected 09141bb3-4d4c-4fd1-9f89-887f7df55588, got 1efd7cb3-18fe-4874-88b9-ea44215ecece
@ochaton ochaton changed the title Replica bootstraps own cluster with established replication cfg when all nodes in ro state Replica bootstraps own cluster with established replication cfg when all nodes are in ro state Dec 8, 2020
@Gerold103 Gerold103 added bug Something isn't working replication labels Dec 8, 2020
@sergepetrenko sergepetrenko self-assigned this Dec 9, 2020
@kyukhin kyukhin added the teamR label Jan 13, 2021
@kyukhin kyukhin added this to the 2.6.3 milestone Jan 13, 2021
@kyukhin kyukhin added the tmp label Mar 23, 2021
@kyukhin kyukhin modified the milestones: 2.6.3, 2.7.3 Apr 12, 2021
@kyukhin kyukhin removed the tmp label Apr 30, 2021
@Gerold103
Collaborator

Gerold103 commented May 28, 2021

The error is different on the master branch due to #6094.

Gerold103 added a commit that referenced this issue May 28, 2021
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately, without
retries.

Closes #6094
Part of #5613
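
An illustrative model of that check, written as Lua pseudocode for brevity (the real change is in the server's C code; the function and variable names here are hypothetical):

```lua
-- Hypothetical model of the master-side check performed on SUBSCRIBE.
local NIL_UUID = '00000000-0000-0000-0000-000000000000'

local function check_subscribe(self_replicaset_uuid, peer_replicaset_uuid)
    -- A subscribing peer has already joined somewhere, so it must report a
    -- valid replicaset UUID, anonymous or not.
    if peer_replicaset_uuid == nil or peer_replicaset_uuid == NIL_UUID then
        return false, 'peer did not report a replicaset UUID'
    end
    if peer_replicaset_uuid ~= self_replicaset_uuid then
        -- Hard cut-off, no retries: the peer belongs to another replicaset.
        return false, 'ER_REPLICASET_UUID_MISMATCH'
    end
    return true
end
```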
Gerold103 added a commit that referenced this issue Jun 1, 2021
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately, without
retries.

Closes #6094
Part of #5613
Gerold103 added a commit that referenced this issue Jun 2, 2021
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately, without
retries.

Closes #6094
Part of #5613
Gerold103 added a commit that referenced this issue Jun 2, 2021
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately, without
retries.

Closes #6094
Part of #5613

(cherry picked from commit ea0b126)
Gerold103 added a commit that referenced this issue Jun 2, 2021
Remote node doing the subscribe might be from a different
replicaset.

Before this patch the subscribe would be retried infinitely
because the node couldn't be found in _cluster, and the master
assumed it must have joined to another node, and its ID should
arrive shortly (ER_TOO_EARLY_SUBSCRIBE).

The ID would never arrive, because the node belongs to another
replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to cut this peer off immediately, without
retries.

Closes #6094
Part of #5613

(cherry picked from commit ea0b126)
Gerold103 added a commit that referenced this issue Jun 3, 2021
When an instance was being bootstrapped, it didn't check if its
replication sources had the same replicaset UUID.

As a result, if they didn't match, it would boot from one of them
almost randomly (choosing among the non-read-only nodes, then the one
with the minimal UUID) and raise an error about the mismatching ones
later.

Obviously, something is wrong here: such a replication config is not
valid, and the node should not boot at all.

The patch prevents such an instance from bootstrapping if it sees at
least 2 replicas with different replicaset UUIDs.

It does not solve the problem in all cases, because one of the
replicas could simply be unavailable. Then the new node would give
up, boot from the other node successfully, and notice the replicaset
UUID mismatch only when the connection to the first node is restored.

But it should help for most boot attempts.

Closes #5613

@TarantoolBot document
Title: New field in IPROTO_BALLOT

`IPROTO_BALLOT (0x29)` is a response to `IPROTO_VOTE (0x44)`. It
used to contain 5 fields (is_ro, vclock, gc_vclock, etc.). Now it
also contains `IPROTO_BALLOT_REPLICASET_UUID (0x06)`: a UUID string
with the replicaset UUID of the sender. It can be a nil UUID (all
digits are 0) when not known. The field is optional; when omitted,
it is assumed to be not known.
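
A Lua sketch of the idea (illustrative names, not the actual C implementation): compare the replicaset UUIDs reported in the peers' ballots before choosing a bootstrap master, and refuse to boot on the first mismatch.

```lua
-- Hypothetical check over ballots collected from box.cfg.replication peers.
local function check_peer_replicaset_uuids(ballots)
    local seen
    for _, ballot in ipairs(ballots) do
        -- A nil/zero UUID means the peer does not know its replicaset yet.
        local uuid = ballot.replicaset_uuid
        if uuid ~= nil then
            if seen == nil then
                seen = uuid
            elseif uuid ~= seen then
                error(('replicaset UUID mismatch: %s vs %s'):format(seen, uuid))
            end
        end
    end
end
```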
Gerold103 added a commit that referenced this issue Jun 4, 2021
Firstly, rename it to replicaset_find_join_master(). Now that there
is Raft with a concept of an actual leader, the old name becomes
confusing.

Secondly, do not access the ballot member of struct applier via such
a long path - save the ballot pointer on the stack. This will become
useful when the ballot is used more in one of the next patches.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 4, 2021
Rename the member to show its actual meaning. It is not the
real RO state of the instance. Only how it is configured.

It can happen that the instance is read_only = false, but still is
in RO state due to other reasons.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 4, 2021
Is_loading in the ballot used to mean the following: "the instance
did not finish its box.cfg() or has read_only = true", which is quite
a strange property.

For instance, it was 'true' even if the instance was not really
loading anymore but had read_only = true.

The patch renames it to 'is_ro' (which existed here before, but also
with a wrong meaning).

Its behaviour is slightly changed to report the RO state of the
instance, not its read_only setting. This way it incorporates all the
possible RO conditions, such as an unfinished bootstrap, having
read_only = true, being a Raft follower, and so on.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613
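
For comparison, the same distinction already exists on the Lua level: `box.cfg.read_only` is only the configured value, while `box.info.ro` reports whether the instance is actually writable, whatever the reason; the renamed ballot field follows roughly the latter semantics.

```lua
-- Configured value: only what was passed to box.cfg.
box.cfg.read_only   -- e.g. false

-- Actual state: may still be true because of an unfinished bootstrap,
-- orphan mode, a Raft follower role, etc.
box.info.ro         -- e.g. true
box.info.status     -- e.g. 'orphan' or 'running'
```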
Gerold103 added a commit that referenced this issue Jun 4, 2021
The new field reports whether the instance has finished its
bootstrap/recovery, or IOW has finished box.cfg().

The new field will help fix #5613 by not trying to join a replicaset
via non-bootstrapped instances when there are others.

The problem is that otherwise, if all nodes are booted but
are read-only, new instances bootstrap their own independent
replicaset. It would be better to just fail and terminate the
process than do such a bizarre action.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 4, 2021
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.

The original code was relatively fine as long as it had only one
parameter to change - whether it should skip
`box.cfg{read_only = true}` nodes.

It was already "on the edge" of acceptable complexity, though, due to
a second, non-configurable parameter: whether a replica is in
read-only state regardless of its config.

It is going to get more complicated when the algorithm takes into
account a third parameter: whether an instance is bootstrapped.

Then it has to make decisions like "among bootstrapped nodes, prefer
instances that do not have read_only=true and are not in read-only
state". The easiest way to do so is to use scores/weights incremented
for every "good" property an instance has.

Part of #5613
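
A Lua sketch of a scoring scheme of that kind (the weights and field names are illustrative, not the actual implementation):

```lua
-- Hypothetical scoring: a higher score means a better join-master candidate.
local function candidate_score(ballot)
    local score = 0
    if ballot.is_booted then
        score = score + 4   -- bootstrapped instances come first
    end
    if not ballot.is_ro_cfg then
        score = score + 2   -- then nodes without box.cfg{read_only = true}
    end
    if not ballot.is_ro then
        score = score + 1   -- then nodes that are actually writable
    end
    return score
end

local function find_join_master(ballots)
    local best, best_score = nil, -1
    for _, ballot in ipairs(ballots) do
        local score = candidate_score(ballot)
        if score > best_score then
            best, best_score = ballot, score
        end
    end
    return best
end
```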
Gerold103 added a commit that referenced this issue Jun 4, 2021
The algorithm of looking for an instance to join the replicaset from
didn't take into account that some of the instances might not be
bootstrapped yet but still be perfectly available.

As a result, a ridiculous situation could happen: an instance could
connect to a cluster consisting only of read-only instances while
itself having box.cfg{read_only = false}. Then, instead of failing or
waiting, it just booted a brand new cluster, and after that the node
started complaining about the others having a different replicaset
UUID.

The patch makes a new instance always prefer a bootstrapped join
source to a non-bootstrapped one, including itself. In the situation
above the new instance now terminates with an error.

In the future it should hopefully start a retry loop instead.

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching with their names. They are changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
  the value is the same. Only the name has changed, and in the doc
  should be too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  finished `box.cfg()` and it has `read_only = true`". The name
  was wrong therefore, because even if the instance finished
  loading, the flag still was false for `read_only = true` nodes.
  Also such a value is not very suitable for any sane usage.
  The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
  the same, and the value now is "the instance is not writable".
  The reason for being not writable can be any: the node is an
  orphan; or it has `read_only = true`; or it is a Raft follower;
  or anything else.

And there is a new field.

`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.
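
To illustrate the documented fields, a decoded ballot could look roughly like this (Lua-table notation; on the wire it is MsgPack keyed by the numeric codes listed above):

```lua
-- Illustrative decoded IPROTO_BALLOT (0x29) payload.
local ballot = {
    [0x01] = false,  -- IPROTO_BALLOT_IS_RO_CFG: the box.cfg{read_only = ...} value
    [0x04] = true,   -- IPROTO_BALLOT_IS_RO: the instance is not writable right now
    [0x06] = true,   -- IPROTO_BALLOT_IS_BOOTED: bootstrap/recovery has finished
    -- plus vclock, gc_vclock and the other pre-existing fields
}
```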
Gerold103 added a commit that referenced this issue Jun 6, 2021
The algorithm of looking for an instance to join the replicaset from
didn't take into account that some of the instances might not be
bootstrapped yet but still be perfectly available.

As a result, a ridiculous situation could happen: an instance could
connect to a cluster consisting only of read-only instances while
itself having box.cfg{read_only = false}. Then, instead of failing or
waiting, it just booted a brand new cluster, and after that the node
started complaining about the others having a different replicaset
UUID.

The patch makes a new instance always prefer a bootstrapped join
source to a non-bootstrapped one, including itself. In the situation
above the new instance now terminates with an error.

In the future it should hopefully start a retry loop instead.

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching with their names. They are changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
  the value is the same. Only the name has changed, and in the doc
  should be too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  finished `box.cfg()` and it has `read_only = true`". The name
  was wrong therefore, because even if the instance finished
  loading, the flag still was false for `read_only = true` nodes.
  Also such a value is not very suitable for any sane usage.
  The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
  the same, and the value now is "the instance is not writable".
  The reason for being not writable can be any: the node is an
  orphan; or it has `read_only = true`; or it is a Raft follower;
  or anything else.

And there is a new field.

`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.
Gerold103 added a commit that referenced this issue Jun 6, 2021
There was a bug where a new replica joining a Raft cluster sometimes
tried to register on a non-leader node, which couldn't write to
_cluster, so the join failed with an ER_READONLY error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127
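
This falls out of the writable-member preference: with elections enabled only the leader is writable, which can be observed from Lua:

```lua
-- On members of an election-enabled cluster (box.cfg{election_mode = ...}):
box.info.election.state   -- 'leader' on the leader, 'follower' elsewhere
box.info.ro               -- false only on the leader, so only it can
                          -- register a joining replica in _cluster
```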
Gerold103 added a commit that referenced this issue Jun 6, 2021
There was a bug where a new replica joining a Raft cluster sometimes
tried to register on a non-leader node, which couldn't write to
_cluster, so the join failed with an ER_READONLY error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127
@Gerold103 Gerold103 assigned Gerold103 and unassigned sergepetrenko Jun 7, 2021
Gerold103 added a commit that referenced this issue Jun 10, 2021
Firstly, rename it to replicaset_find_join_master(). Now that there
is Raft with a concept of an actual leader, the old name becomes
confusing.

Secondly, do not access the ballot member of struct applier via such
a long path - save the ballot pointer on the stack. This will become
useful when the ballot is used more in one of the next patches.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 10, 2021
Rename the member to show its actual meaning. It is not the
real RO state of the instance. Only how it is configured.

It can happen that the instance is read_only = false, but still is
in RO state due to other reasons.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 10, 2021
Is_loading in the ballot used to mean the following: "the instance
did not finish its box.cfg() or has read_only = true", which is quite
a strange property.

For instance, it was 'true' even if the instance was not really
loading anymore but had read_only = true.

The patch renames it to 'is_ro' (which existed here before, but also
with a wrong meaning).

Its behaviour is slightly changed to report the RO state of the
instance, not its read_only setting. This way it incorporates all the
possible RO conditions, such as an unfinished bootstrap, having
read_only = true, being a Raft follower, and so on.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 10, 2021
The new field reports whether the instance has finished its
bootstrap/recovery, or IOW has finished box.cfg().

The new field will help fix #5613 by not trying to join a replicaset
via non-bootstrapped instances when there are others.

The problem is that otherwise, if all nodes are booted but
are read-only, new instances bootstrap their own independent
replicaset. It would be better to just fail and terminate the
process than do such a bizarre action.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 10, 2021
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.

The original code was relatively fine as long as it had only one
parameter to change - whether it should skip
`box.cfg{read_only = true}` nodes.

It was already "on the edge" of acceptable complexity, though, due to
a second, non-configurable parameter: whether a replica is in
read-only state regardless of its config.

It is going to get more complicated when the algorithm takes into
account a third parameter: whether an instance is bootstrapped.

Then it has to make decisions like "among bootstrapped nodes, prefer
instances that do not have read_only=true and are not in read-only
state". The easiest way to do so is to use scores/weights incremented
for every "good" property an instance has.

Part of #5613
Gerold103 added a commit that referenced this issue Jun 10, 2021
There was a bug where a new replica joining a Raft cluster sometimes
tried to register on a non-leader node, which couldn't write to
_cluster, so the join failed with an ER_READONLY error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127
Gerold103 added a commit that referenced this issue Jun 11, 2021
There was a bug where a new replica joining an election-enabled
cluster sometimes tried to register on a non-leader node, which
couldn't write to _cluster, so the join failed with an ER_READONLY
error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127
Gerold103 added a commit that referenced this issue Jun 11, 2021
Firstly, rename it to replicaset_find_join_master(). Now that there
is Raft with a concept of an actual leader, the old name becomes
confusing.

Secondly, do not access the ballot member of struct applier via such
a long path - save the ballot pointer on the stack. This will become
useful when the ballot is used more in one of the next patches.

Part of #5613

(cherry picked from commit 2daec52)
Gerold103 added a commit that referenced this issue Jun 11, 2021
Rename the member to show its actual meaning. It is not the
real RO state of the instance. Only how it is configured.

It can happen that the instance is read_only = false, but still is
in RO state due to other reasons.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613

(cherry picked from commit e4488f4)
Gerold103 added a commit that referenced this issue Jun 11, 2021
Is_loading in the ballot used to mean the following: "the instance
did not finish its box.cfg() or has read_only = true", which is quite
a strange property.

For instance, it was 'true' even if the instance was not really
loading anymore but had read_only = true.

The patch renames it to 'is_ro' (which existed here before, but also
with a wrong meaning).

Its behaviour is slightly changed to report the RO state of the
instance, not its read_only setting. This way it incorporates all the
possible RO conditions, such as an unfinished bootstrap, having
read_only = true, being a Raft follower, and so on.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613

(cherry picked from commit 71d2a56)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The new field reports whether the instance has finished its
bootstrap/recovery, or IOW has finished box.cfg().

The new field will help fix #5613 by not trying to join a replicaset
via non-bootstrapped instances when there are others.

The problem is that otherwise, if all nodes are booted but
are read-only, new instances bootstrap their own independent
replicaset. It would be better to just fail and terminate the
process than do such a bizarre action.

Part of #5613

(cherry picked from commit f8a150c)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.

The original code was relatively fine as long as it had only one
parameter to change - whether it should skip
`box.cfg{read_only = true}` nodes.

It was already "on the edge" of acceptable complexity, though, due to
a second, non-configurable parameter: whether a replica is in
read-only state regardless of its config.

It is going to get more complicated when the algorithm takes into
account a third parameter: whether an instance is bootstrapped.

Then it has to make decisions like "among bootstrapped nodes, prefer
instances that do not have read_only=true and are not in read-only
state". The easiest way to do so is to use scores/weights incremented
for every "good" property an instance has.

Part of #5613

(cherry picked from commit 89f1f63)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The algorithm of looking for an instance to join the replicaset from
didn't take into account that some of the instances might not be
bootstrapped yet but still be perfectly available.

As a result, a ridiculous situation could happen: an instance could
connect to a cluster consisting only of read-only instances while
itself having box.cfg{read_only = false}. Then, instead of failing or
waiting, it just booted a brand new cluster, and after that the node
started complaining about the others having a different replicaset
UUID.

The patch makes a new instance always prefer a bootstrapped join
source to a non-bootstrapped one, including itself. In the situation
above the new instance now terminates with an error.

In the future it should hopefully start a retry loop instead.

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching with their names. They are changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
  the value is the same. Only the name has changed, and in the doc
  should be too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  finished `box.cfg()` and it has `read_only = true`". The name
  was wrong therefore, because even if the instance finished
  loading, the flag still was false for `read_only = true` nodes.
  Also such a value is not very suitable for any sane usage.
  The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
  the same, and the value now is "the instance is not writable".
  The reason for being not writable can be any: the node is an
  orphan; or it has `read_only = true`; or it is a Raft follower;
  or anything else.

And there is a new field.

`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.

(cherry picked from commit 6a14c8d)
Gerold103 added a commit that referenced this issue Jun 11, 2021
There was a bug where a new replica joining an election-enabled
cluster sometimes tried to register on a non-leader node, which
couldn't write to _cluster, so the join failed with an ER_READONLY
error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127

(cherry picked from commit 5fe44e2)
Gerold103 added a commit that referenced this issue Jun 11, 2021
Firstly, rename it to replicaset_find_join_master(). Now that there
is Raft with a concept of an actual leader, the old name becomes
confusing.

Secondly, do not access the ballot member of struct applier via such
a long path - save the ballot pointer on the stack. This will become
useful when the ballot is used more in one of the next patches.

Part of #5613

(cherry picked from commit 2daec52)
Gerold103 added a commit that referenced this issue Jun 11, 2021
Rename the member to show its actual meaning. It is not the
real RO state of the instance. Only how it is configured.

It can happen that the instance is read_only = false, but still is
in RO state due to other reasons.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613

(cherry picked from commit e4488f4)
Gerold103 added a commit that referenced this issue Jun 11, 2021
Is_loading in the ballot used to mean the following: "the instance
did not finish its box.cfg() or has read_only = true", which is quite
a strange property.

For instance, it was 'true' even if the instance was not really
loading anymore but had read_only = true.

The patch renames it to 'is_ro' (which existed here before, but also
with a wrong meaning).

Its behaviour is slightly changed to report the RO state of the
instance, not its read_only setting. This way it incorporates all the
possible RO conditions, such as an unfinished bootstrap, having
read_only = true, being a Raft follower, and so on.

The patch is done in scope of #5613 where the ballot is going to
be extended and used a bit differently in the join-master search
algorithm.

Part of #5613

(cherry picked from commit 71d2a56)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The new field reports whether the instance has finished its
bootstrap/recovery, or IOW has finished box.cfg().

The new field will help fix #5613 by not trying to join a replicaset
via non-bootstrapped instances when there are others.

The problem is that otherwise, if all nodes are booted but
are read-only, new instances bootstrap their own independent
replicaset. It would be better to just fail and terminate the
process than do such a bizarre action.

Part of #5613

(cherry picked from commit f8a150c)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The patch refactors the algorithm of finding a join-master (in
replicaset_find_join_master()) to use scores instead of multiple
iterations with different criteria.

The original code was relatively fine as long as it had only one
parameter to change - whether it should skip
`box.cfg{read_only = true}` nodes.

It was already "on the edge" of acceptable complexity, though, due to
a second, non-configurable parameter: whether a replica is in
read-only state regardless of its config.

It is going to get more complicated when the algorithm takes into
account a third parameter: whether an instance is bootstrapped.

Then it has to make decisions like "among bootstrapped nodes, prefer
instances that do not have read_only=true and are not in read-only
state". The easiest way to do so is to use scores/weights incremented
for every "good" property an instance has.

Part of #5613

(cherry picked from commit 89f1f63)
Gerold103 added a commit that referenced this issue Jun 11, 2021
The algorithm of looking for an instance to join the replicaset from
didn't take into account that some of the instances might not be
bootstrapped yet but still be perfectly available.

As a result, a ridiculous situation could happen: an instance could
connect to a cluster consisting only of read-only instances while
itself having box.cfg{read_only = false}. Then, instead of failing or
waiting, it just booted a brand new cluster, and after that the node
started complaining about the others having a different replicaset
UUID.

The patch makes a new instance always prefer a bootstrapped join
source to a non-bootstrapped one, including itself. In the situation
above the new instance now terminates with an error.

In the future it should hopefully start a retry loop instead.

Closes #5613

@TarantoolBot document
Title: IPROTO_BALLOT rework and a new field

A couple of fields in `IPROTO_BALLOT 0x29` used to have values not
matching with their names. They are changed.

* `IPROTO_BALLOT_IS_RO 0x01` used to mean "the instance has
  `box.cfg{read_only = true}`". It was renamed in the source code
  to `IPROTO_BALLOT_IS_RO_CFG`. It has the same code `0x01`, and
  the value is the same. Only the name has changed, and in the doc
  should be too.

* `IPROTO_BALLOT_IS_LOADING 0x04` used to mean "the instance has
  finished `box.cfg()` and it has `read_only = true`". The name
  was wrong therefore, because even if the instance finished
  loading, the flag still was false for `read_only = true` nodes.
  Also such a value is not very suitable for any sane usage.
  The name was changed to `IPROTO_BALLOT_IS_RO`, the code stayed
  the same, and the value now is "the instance is not writable".
  The reason for being not writable can be any: the node is an
  orphan; or it has `read_only = true`; or it is a Raft follower;
  or anything else.

And there is a new field.

`IPROTO_BALLOT_IS_BOOTED 0x06` means the instance has finished its
bootstrap or recovery.

(cherry picked from commit 6a14c8d)
Gerold103 added a commit that referenced this issue Jun 11, 2021
There was a bug where a new replica joining an election-enabled
cluster sometimes tried to register on a non-leader node, which
couldn't write to _cluster, so the join failed with an ER_READONLY
error.

Now, in the scope of #5613, the join-master selection algorithm is
changed. A new node looks for writable members of the cluster to use
as a join master. It will not choose a follower if there is a
leader.

Closes #6127

(cherry picked from commit 5fe44e2)