Different reaction if failed to connect to replicas #4424
Comments
Well, failing the initial configuration is not an option, because the instance simply wouldn't start in this case and there would be no way to fix the issue manually apart from patching the configuration. So we put it in the so-called "orphan" mode, which basically means that the instance is read-only, and then proceed. Once the instance has managed to connect to the configured replicas, it leaves the "orphan" mode (becomes read-write). On the other hand, an error during a subsequent call to box.cfg can be handled and processed properly, so we don't really need to enter the "orphan" mode in that case.
Talked to @kostja. Agreed that it's okay to enter "orphan" mode even if box.cfg is called after initial configuration.
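For context, the "handled and processed properly" path mentioned above looks roughly like this on the operator side. A minimal sketch in Lua (the replica URI is a placeholder), showing how a failed reconfiguration could be trapped before this change and how the orphan state is observed:

```lua
local log = require('log')

-- Before the change: a failed reconfiguration raised an error that the
-- caller could trap and handle (the replica URI below is a placeholder).
local ok, err = pcall(box.cfg, {replication = {'replica1.example:3301'}})
if not ok then
    log.error('replication reconfiguration failed: ' .. tostring(err))
end

-- The orphan state is visible through box.info: the instance reports the
-- 'orphan' status and stays read-only until it syncs with a quorum.
log.info('status: ' .. box.info.status .. ', read-only: ' .. tostring(box.info.ro))
```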
We only entered orphan mode on bootstrap and local recovery, but threw an error when the replication config was changed on the fly. For consistency, in this case we should also enter orphan mode when an instance fails to connect to quorum remote instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.
Calling `box.cfg{replication={uri1, uri2, ...}}` will now never throw an error in case the replication quorum cannot be reached; it will just switch the instance to the orphan state. (Previously the instance switched to orphan mode on an error during initial configuration, while an error was thrown if quorum couldn't be reached on a subsequent box.cfg call. Now the instance always switches to orphan if quorum cannot be reached.)
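A minimal sketch of the behaviour described in the commit message above, assuming an already-bootstrapped instance and a placeholder URI for an unreachable replica:

```lua
-- On an already configured instance this reconfiguration no longer raises
-- an error when the quorum cannot be reached; it returns after the connect
-- timeout and the instance switches to the orphan state instead.
box.cfg{
    replication = {'unreachable.example:3301'},  -- placeholder URI
    replication_connect_quorum = 1,
    replication_connect_timeout = 5,
}
print(box.info.status, box.info.ro)  -- expected: orphan  true
```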
Currently we only enter orphan mode when an instance fails to connect to replication_connect_quorum remote instances during local recovery. On bootstrap and on a manual replication configuration change an error is thrown instead. We had better enter orphan mode on a manual config change as well, and leave it only once we have managed to sync with replication_connect_quorum instances.

Closes #4424

@TarantoolBot document
Title: document reaction on error in replication configuration change.
Now, when `box.cfg{replication={uri1, uri2, ...}}` fails to sync with replication_connect_quorum remote instances, the server throws an error if it is bootstrapping, and just sets its state to orphan in all other cases (recovering from existing xlog/snap files or changing box.cfg.replication on the fly). To leave orphan mode, you may wait until the server manages to sync with replication_connect_quorum instances, or make it sync with enough instances yourself. To do so, you may either:
1) set replication_connect_quorum to a lower value;
2) reset box.cfg.replication to exclude instances that cannot be reached or synced with;
3) just set box.cfg.replication to "" (empty string).
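The recovery options listed in the commit message above could look like this in practice; a sketch with placeholder URIs, where any one of the calls should let the instance sync and leave the orphan state:

```lua
-- 1) Lower the quorum so that the replicas that are reachable are enough.
box.cfg{replication_connect_quorum = 1}

-- 2) Drop the unreachable peers from the replication list
--    (placeholder URI for a reachable replica).
box.cfg{replication = {'reachable.example:3301'}}

-- 3) Or disable replication entirely.
box.cfg{replication = ""}
```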
Fixed flaky status check:

[016] @@ -73,11 +73,11 @@
[016] ...
[016] box.info.status
[016] ---
[016] -- running
[016] +- orphan
[016] ...
[016] box.info.ro
[016] ---
[016] -- false
[016] +- true
[016] ...
[016] box.cfg{ \
[016] replication = {}, \
[016]

The test was changed to use a wait condition for the status check, which should change from 'orphan' to 'running'. On heavily loaded hosts this transition may take some additional time; the wait-condition routine fixed the flakiness.

Closes #5271
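A sketch of the wait-condition approach mentioned above, assuming the standard test-run helper used in Tarantool's test suite:

```lua
test_run = require('test_run').new()

-- Instead of checking box.info.status right away (it may still be 'orphan'
-- on a heavily loaded host), wait until the instance actually reaches 'running'.
test_run:wait_cond(function() return box.info.status == 'running' end)
assert(box.info.ro == false)  -- writable once the status is 'running'
```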
To fix flaky issues of replication/misc.test.lua, the test had to be divided into smaller tests to be able to localize the flaky results:

gh-2991-misc-assert-on-server-die.test.lua
gh-3111-misc-rebootstrap-from-ro-master.test.lua
gh-3160-misc-heartbeats-on-master-changes.test.lua
gh-3247-misc-value-not-replicated-on-iproto-request.test.lua
gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
gh-3606-misc-crash-on-box-concurrent-update.test.lua
gh-3610-misc-assert-connecting-master-twice.test.lua
gh-3637-misc-no-panic-on-connected.test.lua
gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
gh-3704-misc-replica-checks-cluster-id.test.lua
gh-3711-misc-no-restart-on-same-configuration.test.lua
gh-3760-misc-return-on-quorum-0.test.lua
gh-4399-misc-no-failure-on-error-reading-wal.test.lua
gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940
To fix flaky issues of replication/misc.test.lua, the test had to be divided into smaller tests to be able to localize the flaky results:

gh-2991-misc-asserts-on-update.test.lua
gh-3111-misc-rebootstrap-from-ro-master.test.lua
gh-3160-misc-heartbeats-on-master-changes.test.lua
gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
gh-3606-misc-crash-on-box-concurrent-update.test.lua
gh-3610-misc-assert-connecting-master-twice.test.lua
gh-3637-misc-error-on-replica-auth-fail.test.lua
gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
gh-3704-misc-replica-checks-cluster-id.test.lua
gh-3711-misc-no-restart-on-same-configuration.test.lua
gh-3760-misc-return-on-quorum-0.test.lua
gh-4399-misc-no-failure-on-error-reading-wal.test.lua
gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940
Tarantool version: 1.10.3-106-g4faa103
OS version: CentOS 6
Bug description:
I've noticed different reactions when a bootstrapped Tarantool instance fails to connect to replicas. I believe the reaction should be the same in all of the following cases:
№ 1: initial boot with 'replication' defined and any value of 'replication_connect_quorum' - boots normally
№ 2: 'replication' set after the initial boot, 'replication_connect_quorum' is zero - operates normally
№ 3: 'replication' set after the initial boot, 'replication_connect_quorum' is left at its default or set to any non-zero value - raises an error (see the sketch below)
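A minimal sketch of cases 2 and 3 on an already bootstrapped instance (the replica URI is a placeholder):

```lua
-- Case 2: with a zero quorum, pointing replication at an unreachable
-- replica succeeds and the instance keeps operating normally.
box.cfg{replication_connect_quorum = 0}
box.cfg{replication = {'replica1.example:3301'}}

-- Case 3 (starting from a fresh, bootstrapped instance): with a non-zero
-- quorum the same reconfiguration raised an error on 1.10.3; after the fix
-- discussed above the instance enters the orphan state instead.
box.cfg{replication_connect_quorum = 1}
box.cfg{replication = {'replica1.example:3301'}}
```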