Different reaction if failed to connect to replicas #4424
Comments
Well, failing the initial configuration is not an option, because the instance simply wouldn't start in this case and there would be no way to fix the issue manually apart from patching the configuration. So we put it in the so-called "orphan" mode, which basically means that the instance is read-only, and then proceed. Once the instance has managed to connect to the configured replicas, it leaves the "orphan" mode (becomes read-write). On the other hand, an error during a subsequent call to box.cfg can be handled and processed properly, so we don't really need to enter the "orphan" mode in that case.
Talked to @kostja. Agreed that it's okay to enter "orphan" mode even if box.cfg is called after initial configuration.
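For context, the "handled and processed properly" path mentioned above looks roughly like this on the operator side. A minimal sketch in Lua (the replica URI is a placeholder), showing how a failed reconfiguration could be trapped before this change and how the orphan state is observed:

```lua
local log = require('log')

-- Before the change: a failed reconfiguration raised an error that the
-- caller could trap and handle (the replica URI below is a placeholder).
local ok, err = pcall(box.cfg, {replication = {'replica1.example:3301'}})
if not ok then
    log.error('replication reconfiguration failed: ' .. tostring(err))
end

-- The orphan state is visible through box.info: the instance reports the
-- 'orphan' status and stays read-only until it syncs with a quorum.
log.info('status: ' .. box.info.status .. ', read-only: ' .. tostring(box.info.ro))
```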
We only entered orphan mode on bootstrap and local recovery, but threw an error when the replication config was changed on the fly. For consistency, in this case we should also enter orphan mode when an instance fails to connect to quorum remote instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.
Calling `box.cfg{replication={uri1, uri2, ...}}` will now never throw an error in case the replication quorum cannot be reached; it will just switch the instance to the orphan state. (Previously the instance switched to orphan mode on an error during initial configuration, while an error was thrown if quorum couldn't be reached on a subsequent box.cfg call. Now the instance always switches to orphan if quorum cannot be reached.)
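A minimal sketch of the behaviour described in the commit message above, assuming an already-bootstrapped instance and a placeholder URI for an unreachable replica:

```lua
-- On an already configured instance this reconfiguration no longer raises
-- an error when the quorum cannot be reached; it returns after the connect
-- timeout and the instance switches to the orphan state instead.
box.cfg{
    replication = {'unreachable.example:3301'},  -- placeholder URI
    replication_connect_quorum = 1,
    replication_connect_timeout = 5,
}
print(box.info.status, box.info.ro)  -- expected: orphan  true
```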
Currently we only enter orphan mode when an instance fails to connect to replication_connect_quorum remote instances during local recovery. On bootstrap and on a manual replication configuration change an error is thrown instead. We had better enter orphan mode on a manual config change as well, and leave it only once we have managed to sync with replication_connect_quorum instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.
Now, when `box.cfg{replication={uri1, uri2, ...}}` fails to sync with replication_connect_quorum remote instances, the server throws an error if it is bootstrapping, and just sets its state to orphan in all other cases (recovering from existing xlog/snap files or changing box.cfg.replication on the fly). To leave orphan mode, you may wait until the server manages to sync with replication_connect_quorum instances, or make it sync with enough instances yourself. To do so, you may either:
1) set replication_connect_quorum to a lower value;
2) reset box.cfg.replication to exclude instances that cannot be reached or synced with;
3) just set box.cfg.replication to "" (empty string).
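The recovery options listed in the commit message above could look like this in practice; a sketch with placeholder URIs, where any one of the calls should let the instance sync and leave the orphan state:

```lua
-- 1) Lower the quorum so that the replicas that are reachable are enough.
box.cfg{replication_connect_quorum = 1}

-- 2) Drop the unreachable peers from the replication list
--    (placeholder URI for a reachable replica).
box.cfg{replication = {'reachable.example:3301'}}

-- 3) Or disable replication entirely.
box.cfg{replication = ""}
```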
Fixed flaky status check:

[016] @@ -73,11 +73,11 @@
[016] ...
[016] box.info.status
[016] ---
[016] -- running
[016] +- orphan
[016] ...
[016] box.info.ro
[016] ---
[016] -- false
[016] +- true
[016] ...
[016] box.cfg{ \
[016] replication = {}, \
[016]

The test was changed to use a wait condition for the status check, which should change from 'orphan' to 'running'. On heavily loaded hosts this transition may take some additional time; the wait-condition routine fixed the flakiness.

Closes #5271
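A sketch of the wait-condition approach mentioned above, assuming the standard test-run helper used in Tarantool's test suite:

```lua
test_run = require('test_run').new()

-- Instead of checking box.info.status right away (it may still be 'orphan'
-- on a heavily loaded host), wait until the instance actually reaches 'running'.
test_run:wait_cond(function() return box.info.status == 'running' end)
assert(box.info.ro == false)  -- writable once the status is 'running'
```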
To fix flaky issues of replication/misc.test.lua, the test had to be divided into smaller tests to be able to localize the flaky results:

gh-2991-misc-assert-on-server-die.test.lua
gh-3111-misc-rebootstrap-from-ro-master.test.lua
gh-3160-misc-heartbeats-on-master-changes.test.lua
gh-3247-misc-value-not-replicated-on-iproto-request.test.lua
gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
gh-3606-misc-crash-on-box-concurrent-update.test.lua
gh-3610-misc-assert-connecting-master-twice.test.lua
gh-3637-misc-no-panic-on-connected.test.lua
gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
gh-3704-misc-replica-checks-cluster-id.test.lua
gh-3711-misc-no-restart-on-same-configuration.test.lua
gh-3760-misc-return-on-quorum-0.test.lua
gh-4399-misc-no-failure-on-error-reading-wal.test.lua
gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940
To fix flaky issues of replication/misc.test.lua, the test had to be divided into smaller tests to be able to localize the flaky results:

gh-2991-misc-asserts-on-update.test.lua
gh-3111-misc-rebootstrap-from-ro-master.test.lua
gh-3160-misc-heartbeats-on-master-changes.test.lua
gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
gh-3606-misc-crash-on-box-concurrent-update.test.lua
gh-3610-misc-assert-connecting-master-twice.test.lua
gh-3637-misc-error-on-replica-auth-fail.test.lua
gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
gh-3704-misc-replica-checks-cluster-id.test.lua
gh-3711-misc-no-restart-on-same-configuration.test.lua
gh-3760-misc-return-on-quorum-0.test.lua
gh-4399-misc-no-failure-on-error-reading-wal.test.lua
gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940
Tarantool version: 1.10.3-106-g4faa103
OS version: CentOS 6
Bug description:
I've noticed different reactions when a bootstrapped Tarantool instance fails to connect to replicas. I believe the reaction should be the same in all of the following cases:
№ 1: initial boot with 'replication' defined and any value of 'replication_connect_quorum' - boots normally
№ 2: 'replication' set after the initial boot, 'replication_connect_quorum' is zero - operates normally
№ 3: 'replication' set after the initial boot, 'replication_connect_quorum' is left at its default or set to any non-zero value - raises an error (see the sketch below)
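A minimal sketch of cases 2 and 3 on an already bootstrapped instance (the replica URI is a placeholder):

```lua
-- Case 2: with a zero quorum, pointing replication at an unreachable
-- replica succeeds and the instance keeps operating normally.
box.cfg{replication_connect_quorum = 0}
box.cfg{replication = {'replica1.example:3301'}}

-- Case 3 (starting from a fresh, bootstrapped instance): with a non-zero
-- quorum the same reconfiguration raised an error on 1.10.3; after the fix
-- discussed above the instance enters the orphan state instead.
box.cfg{replication_connect_quorum = 1}
box.cfg{replication = {'replica1.example:3301'}}
```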