Different reaction if failed to connect to replicas #4424

Closed
rtokarev opened this issue Aug 12, 2019 · 2 comments

rtokarev commented Aug 12, 2019

Tarantool version: 1.10.3-106-g4faa103

OS version: CentOS 6

Bug description:

I've noticed different reactions when a bootstrapped Tarantool instance fails to connect to its replicas. I believe the reaction should be the same in all of the following cases.

№ 1: initial boot with 'replication' defined and any value of 'replication_connect_quorum' - boots normally

Tarantool 1.10.3-106-g4faa103
type 'help' for interactive help
tarantool> box.cfg{ read_only = true, replication = { '0:12345' }, replication_connect_timeout = 0.1, replication_timeout = 0.1 }
2019-08-12 19:18:19.487 [24309] main/101/interactive C> Tarantool 1.10.3-106-g4faa103
2019-08-12 19:18:19.487 [24309] main/101/interactive C> log level 5
2019-08-12 19:18:19.488 [24309] main/101/interactive I> mapping 268435456 bytes for memtx tuple arena...
2019-08-12 19:18:19.488 [24309] main/101/interactive I> mapping 134217728 bytes for vinyl tuple arena...
2019-08-12 19:18:19.498 [24309] main/101/interactive I> instance uuid c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:18:19.499 [24309] main/101/interactive I> instance vclock {1: 3}
2019-08-12 19:18:19.499 [24309] main/101/interactive I> connecting to 1 replicas
2019-08-12 19:18:19.500 [24309] main/105/applier/0:12345 I> can't connect to master
2019-08-12 19:18:19.500 [24309] main/105/applier/0:12345 coio.cc:106 !> SystemError connect, called on fd 11, aka 127.0.0.1:46977: Connection refused
2019-08-12 19:18:19.500 [24309] main/105/applier/0:12345 I> will retry every 0.10 second
2019-08-12 19:18:19.600 [24309] main/101/interactive C> failed to connect to 1 out of 1 replicas
2019-08-12 19:18:19.600 [24309] main/101/interactive I> recovery start
2019-08-12 19:18:19.600 [24309] main/101/interactive I> recovering from `./00000000000000000003.snap'
2019-08-12 19:18:19.635 [24309] main/101/interactive I> cluster uuid f90f9748-7d87-443a-b5d9-3202082bfe6a
2019-08-12 19:18:19.640 [24309] main/101/interactive I> assigned id 1 to replica 4068eb51-9af6-448b-ba5f-4c72b9214d2e
2019-08-12 19:18:19.640 [24309] main/101/interactive I> assigned id 2 to replica 2a38db2a-dec1-4c03-bfcb-11b943f14b52
2019-08-12 19:18:19.640 [24309] main/101/interactive I> assigned id 3 to replica c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:18:19.641 [24309] main/101/interactive I> recover from `./00000000000000000003.xlog'
2019-08-12 19:18:19.641 [24309] main/101/interactive I> done `./00000000000000000003.xlog'
2019-08-12 19:18:19.642 [24309] main/101/interactive I> ready to accept requests
2019-08-12 19:18:19.642 [24309] main/101/interactive I> synchronizing with 1 replicas
2019-08-12 19:18:19.642 [24309] main/101/interactive C> failed to synchronize with 1 out of 1 replicas
2019-08-12 19:18:19.642 [24309] main/101/interactive C> entering orphan mode
2019-08-12 19:18:19.646 [24309] main/109/checkpoint_daemon I> started
2019-08-12 19:18:19.646 [24309] main/109/checkpoint_daemon I> scheduled the next snapshot at Mon Aug 12 20:55:51 2019
2019-08-12 19:18:19.647 [24309] main/101/interactive I> set 'read_only' configuration option to true
---
...

tarantool>

№ 2: 'replication' set after the initial boot, 'replication_connect_quorum' is zero - operates normally

Tarantool 1.10.3-106-g4faa103
type 'help' for interactive help
tarantool> box.cfg{ read_only = true }
2019-08-12 19:20:36.792 [24801] main/101/interactive C> Tarantool 1.10.3-106-g4faa103
2019-08-12 19:20:36.792 [24801] main/101/interactive C> log level 5
2019-08-12 19:20:36.793 [24801] main/101/interactive I> mapping 268435456 bytes for memtx tuple arena...
2019-08-12 19:20:36.793 [24801] main/101/interactive I> mapping 134217728 bytes for vinyl tuple arena...
2019-08-12 19:20:36.804 [24801] main/101/interactive I> instance uuid c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:20:36.804 [24801] main/101/interactive I> instance vclock {1: 3}
2019-08-12 19:20:36.805 [24801] main/101/interactive I> recovery start
2019-08-12 19:20:36.805 [24801] main/101/interactive I> recovering from `./00000000000000000003.snap'
2019-08-12 19:20:36.840 [24801] main/101/interactive I> cluster uuid f90f9748-7d87-443a-b5d9-3202082bfe6a
2019-08-12 19:20:36.845 [24801] main/101/interactive I> assigned id 1 to replica 4068eb51-9af6-448b-ba5f-4c72b9214d2e
2019-08-12 19:20:36.845 [24801] main/101/interactive I> assigned id 2 to replica 2a38db2a-dec1-4c03-bfcb-11b943f14b52
2019-08-12 19:20:36.845 [24801] main/101/interactive I> assigned id 3 to replica c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:20:36.846 [24801] main/101/interactive I> recover from `./00000000000000000003.xlog'
2019-08-12 19:20:36.846 [24801] main/101/interactive I> done `./00000000000000000003.xlog'
2019-08-12 19:20:36.846 [24801] main/101/interactive I> ready to accept requests
2019-08-12 19:20:36.847 [24801] main/107/checkpoint_daemon I> started
2019-08-12 19:20:36.847 [24801] main/107/checkpoint_daemon I> scheduled the next snapshot at Mon Aug 12 20:44:13 2019
2019-08-12 19:20:36.847 [24801] main/101/interactive I> set 'read_only' configuration option to true
---
...

tarantool> box.cfg{ replication = { '0:12345' }, replication_connect_timeout = 0.1, replication_timeout = 0.1, replication_connect_quorum = 0 }
2019-08-12 19:20:45.250 [24801] main/101/interactive I> set 'replication_connect_quorum' configuration option to 0
2019-08-12 19:20:45.250 [24801] main/101/interactive I> set 'replication_connect_timeout' configuration option to 0.1
2019-08-12 19:20:45.250 [24801] main/101/interactive I> connecting to 1 replicas
2019-08-12 19:20:45.251 [24801] main/112/applier/0:12345 I> can't connect to master
2019-08-12 19:20:45.251 [24801] main/112/applier/0:12345 coio.cc:106 !> SystemError connect, called on fd 12, aka 127.0.0.1:48249: Connection refused
2019-08-12 19:20:45.251 [24801] main/112/applier/0:12345 I> will retry every 1.00 second
2019-08-12 19:20:45.348 [24801] main/101/interactive C> failed to connect to 1 out of 1 replicas
2019-08-12 19:20:45.348 [24801] main/101/interactive I> set 'replication' configuration option to ["0:12345"]
2019-08-12 19:20:45.348 [24801] main/101/interactive I> set 'replication_timeout' configuration option to 0.1
---
...

tarantool>

№ 3: 'replication' set after the initial boot, 'replication_connect_quorum' is the non-zero default - raises an error

Tarantool 1.10.3-106-g4faa103
type 'help' for interactive help
tarantool> box.cfg{ read_only = true }
2019-08-12 19:22:03.203 [25075] main/101/interactive C> Tarantool 1.10.3-106-g4faa103
2019-08-12 19:22:03.204 [25075] main/101/interactive C> log level 5
2019-08-12 19:22:03.204 [25075] main/101/interactive I> mapping 268435456 bytes for memtx tuple arena...
2019-08-12 19:22:03.205 [25075] main/101/interactive I> mapping 134217728 bytes for vinyl tuple arena...
2019-08-12 19:22:03.215 [25075] main/101/interactive I> instance uuid c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:22:03.216 [25075] main/101/interactive I> instance vclock {1: 3}
2019-08-12 19:22:03.216 [25075] main/101/interactive I> recovery start
2019-08-12 19:22:03.216 [25075] main/101/interactive I> recovering from `./00000000000000000003.snap'
2019-08-12 19:22:03.251 [25075] main/101/interactive I> cluster uuid f90f9748-7d87-443a-b5d9-3202082bfe6a
2019-08-12 19:22:03.256 [25075] main/101/interactive I> assigned id 1 to replica 4068eb51-9af6-448b-ba5f-4c72b9214d2e
2019-08-12 19:22:03.256 [25075] main/101/interactive I> assigned id 2 to replica 2a38db2a-dec1-4c03-bfcb-11b943f14b52
2019-08-12 19:22:03.256 [25075] main/101/interactive I> assigned id 3 to replica c58d85a7-ee40-470f-b64d-1af4f1b46b44
2019-08-12 19:22:03.257 [25075] main/101/interactive I> recover from `./00000000000000000003.xlog'
2019-08-12 19:22:03.257 [25075] main/101/interactive I> done `./00000000000000000003.xlog'
2019-08-12 19:22:03.257 [25075] main/101/interactive I> ready to accept requests
2019-08-12 19:22:03.258 [25075] main/107/checkpoint_daemon I> started
2019-08-12 19:22:03.258 [25075] main/107/checkpoint_daemon I> scheduled the next snapshot at Mon Aug 12 20:37:39 2019
2019-08-12 19:22:03.258 [25075] main/101/interactive I> set 'read_only' configuration option to true
---
...

tarantool> box.cfg{ replication = { '0:12345' }, replication_connect_timeout = 0.1, replication_timeout = 0.1 }
2019-08-12 19:22:06.193 [25075] main/101/interactive I> set 'replication_connect_timeout' configuration option to 0.1
2019-08-12 19:22:06.193 [25075] main/101/interactive I> connecting to 1 replicas
2019-08-12 19:22:06.194 [25075] main/112/applier/0:12345 I> can't connect to master
2019-08-12 19:22:06.194 [25075] main/112/applier/0:12345 coio.cc:106 !> SystemError connect, called on fd 12, aka 127.0.0.1:49255: Connection refused
2019-08-12 19:22:06.194 [25075] main/112/applier/0:12345 I> will retry every 1.00 second
2019-08-12 19:22:06.291 [25075] main/101/interactive C> failed to connect to 1 out of 1 replicas
---
- error: 'Incorrect value for option ''replication'': failed to connect to one or
    more replicas'
...

tarantool>

locker commented Aug 14, 2019

Well, failing the initial configuration is not an option, because the instance simply wouldn't start in this case and there would be no way to fix the issue manually apart from patching the configuration. So we put it in so-called "orphan" mode, which basically means that the instance is read-only, and then proceed. Once the instance has managed to connect to the configured replicas, it will leave the "orphan" mode (become read-write).

OTOH an error during a subsequent call to box.cfg can be handled and processed properly so we don't really need to enter the "orphan" mode in that case.
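For reference, the orphan state described here is visible through box.info; a minimal Lua sketch (mine, not from the issue), assuming an already bootstrapped instance as in case № 1:

-- An initial box.cfg{} that cannot reach its replicas still brings the
-- instance up, but leaves it read-only in orphan mode until it syncs.
box.cfg{ replication = { '0:12345' },
         replication_connect_timeout = 0.1, replication_timeout = 0.1 }
assert(box.info.status == 'orphan')  -- until the configured quorum is reached
assert(box.info.ro == true)          -- writes are rejected while orphan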


locker commented Aug 14, 2019

Talked to @kostja. Agreed that it's okay to enter "orphan" mode even if box.cfg is called after initial configuration.

@kyukhin kyukhin added bug Something isn't working replication labels Aug 15, 2019
@kyukhin kyukhin added this to the 1.10.4 milestone Aug 15, 2019
sergepetrenko added a commit that referenced this issue Aug 23, 2019
We only entered orphan mode on bootstrap and local recovery, but threw an
error when the replication config was changed on the fly.
For consistency, in this case we should also enter orphan mode when
an instance fails to connect to a quorum of remote instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.

Now calling `box.cfg{replication={uri1, uri2, ...}}` never throws an
error when the replication quorum cannot be reached; it just switches
the instance to the orphan state.
(Previously the instance switched to orphan mode on an error during the
initial configuration, while an error was thrown if the quorum couldn't
be reached on a subsequent box.cfg call. Now the instance always switches
to orphan if the quorum cannot be reached.)
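A minimal Lua sketch of what this means in practice (mine, not part of the patch), assuming an already running instance as in cases № 2 and № 3 above:

-- With the change, reconfiguring replication towards an unreachable peer
-- no longer raises from box.cfg; the instance goes orphan instead.
local ok = pcall(function()
    box.cfg{ replication = { '0:12345' },
             replication_connect_timeout = 0.1, replication_timeout = 0.1 }
end)
assert(ok)                           -- box.cfg returned without an error
assert(box.info.status == 'orphan')  -- read-only until the instance syncs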
sergepetrenko added a commit that referenced this issue Aug 28, 2019
Currently we only enter orphan mode when an instance fails to connect to
replication_connect_quorum remote instances during local recovery.
On bootstrap and on a manual replication configuration change an error is
thrown instead. We should also enter orphan mode on a manual config change,
and leave it only once we have managed to sync with
replication_connect_quorum instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.

Now when `box.cfg{replication={uri1, uri2, ...}}` is issued and the server
fails to sync with replication_connect_quorum remote instances, it throws
an error if it is bootstrapping and just sets its state to orphan in all
other cases (recovering from existing xlog/snap files or changing
box.cfg.replication on the fly). To leave orphan mode, the server must
sync with replication_connect_quorum instances; you may simply wait for
that to happen, or help it along in one of the following ways (see the
sketch after this list):
1) set replication_connect_quorum to a lower value
2) reset box.cfg.replication to exclude instances that cannot
   be reached or synced with
3) just set box.cfg.replication to "" (empty string)
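A minimal Lua sketch of those three options (mine; 'reachable_uri' is a hypothetical placeholder, not a real address):

-- Any one of the following reconfigurations lets the server leave orphan
-- mode once it manages to sync with enough instances:
local reachable_uri = 'replica_host:3301'     -- hypothetical reachable peer

box.cfg{ replication_connect_quorum = 1 }     -- 1) lower the quorum
box.cfg{ replication = { reachable_uri } }    -- 2) keep only reachable peers
box.cfg{ replication = "" }                   -- 3) drop replication entirely

-- box.info.status returns to 'running' after the sync succeeds.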
@locker locker closed this as completed in 5a0cfe0 Aug 28, 2019
avtikhon added a commit that referenced this issue Sep 4, 2020
Fixed flaky status check:

  [016] @@ -73,11 +73,11 @@
  [016]  ...
  [016]  box.info.status
  [016]  ---
  [016] -- running
  [016] +- orphan
  [016]  ...
  [016]  box.info.ro
  [016]  ---
  [016] -- false
  [016] +- true
  [016]  ...
  [016]  box.cfg{                                                        \
  [016]      replication = {},                                           \
  [016]

The test was changed to use a wait condition for the status check, waiting
for the status to change from 'orphan' to 'running'. On heavily loaded
hosts this transition may take some extra time, and the wait-condition
routine fixes the resulting flakiness.

Closes #5271
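A minimal sketch of the wait-condition idea (mine, not the actual test code), using plain fiber polling rather than the test harness helper:

local fiber = require('fiber')

-- Checking box.info.status right after reconfiguration is racy on slow
-- hosts; poll until the instance has left orphan mode instead.
local deadline = fiber.clock() + 10
while box.info.status ~= 'running' and fiber.clock() < deadline do
    fiber.sleep(0.01)
end
assert(box.info.status == 'running')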
avtikhon added a commit that referenced this issue Sep 6, 2020
To fix flaky issues in replication/misc.test.lua, the test had to be
divided into smaller tests so that the flaky results can be localized:

  gh-2991-misc-assert-on-server-die.test.lua
  gh-3111-misc-rebootstrap-from-ro-master.test.lua
  gh-3160-misc-heartbeats-on-master-changes.test.lua
  gh-3247-misc-value-not-replicated-on-iproto-request.test.lua
  gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
  gh-3606-misc-crash-on-box-concurrent-update.test.lua
  gh-3610-misc-assert-connecting-master-twice.test.lua
  gh-3637-misc-no-panic-on-connected.test.lua
  gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
  gh-3704-misc-replica-checks-cluster-id.test.lua
  gh-3711-misc-no-restart-on-same-configuration.test.lua
  gh-3760-misc-return-on-quorum-0.test.lua
  gh-4399-misc-no-failure-on-error-reading-wal.test.lua
  gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940