
qsync testing plan


RFC

Bugs

1st iteration

  • [RFC, summary] switch async replicas into sync ones and vice versa, expected success and data consistency on a leader and replicas (see the configuration sketch after this list)
  • [RFC, summary] switch from leader to replica and vice versa, expected success and data consistency on a leader and replicas
  • [RFC, quorum commit] happy path: write/read data to a leader in a sync cluster, expected data consistency on a leader and replicas
  • happy path: read/write data to a sync cluster with the maximum allowed number of replicas, expected success and data consistency on a leader and replicas
  • [RFC, quorum commit] no quorum achieved, expected transaction rollback and data consistency on a leader and replicas
  • [RFC, quorum commit] check behaviour with no answer from a replica during a write, expected a failure answer to be set
  • [RFC, quorum commit] check behaviour with a failure answer from a replica during a write, expected disconnect from replication
  • [RFC, quorum commit] attempt to write multiple transactions, expected the same order as on the client in case of an achieved quorum
  • [RFC, quorum commit] attempt to write multiple transactions, expected that the latest transaction that collects the quorum is considered complete, as well as all transactions prior to it
  • [RFC, quorum commit] failure on the leader's transaction confirm, expected rollback and data consistency on a leader and replicas
  • (no sense to test it in regression) a leader got a quorum, but one replica that participated in the quorum leaves the cluster right after answering the leader, expected nothing
  • [RFC, quorum commit] check the situation when a transaction was written to the WAL and SUCCESS was answered, but the WAL was lost afterwards
  • read the rollback code ("guarantee of rollback on leader and sync replicas")
  • consistency on replicas on enabling and disabling sync replication
  • (checked manually) [RFC, connection liveness] replication_connect_timeout works as expected with sync cluster (see documentation)
  • (checked manually) [RFC, connection liveness] replication_sync_lag works as expected with sync cluster (see documentation)
  • (checked manually) [RFC, connection liveness] replication_sync_timeout works as expected with sync cluster (see documentation)
  • (checked manually) [RFC, connection liveness] replication_timeout works as expected with sync cluster (see documentation)
  • [RFC, connection liveness] replication_synchro_timeout
  • [RFC, connection liveness] replication_synchro_quorum
  • [RFC, connection liveness] when the leader gets no response for another heartbeat interval, it should consider the replica lost
  • ??? [RFC, connection liveness] when a leader finds itself without enough replicas to achieve a quorum, it should stop accepting write requests
  • ??? [RFC, connection liveness] a leader that stopped accepting write requests can be switched back to write mode when the cluster configuration is updated
  • [RFC, connection liveness] some replicas become unavailable during quorum collection, expected: a leader should wait at most replication_synchro_quorum_timeout, after which it issues a rollback pointing to the oldest TXN in the waiting list
  • test with a leader and a single replica in a cluster, expected ??? (TBD)
  • ??? [RFC, Recovery and failover] "In case there's not enough replicas to set up a quorum the cluster can be switched into a read-only mode."
  • [RFC, Recovery and failover] a Tarantool instance, while reading the WAL, should postpone undo log deletion until the 'confirm' is read. In case WAL EOF is reached, the instance should keep the undo log for all transactions that are waiting for a confirm entry until the role of the instance is set.
  • [RFC, Snapshot generation] all txns confirmed, then snapshot, expected success (check both master and replica)
  • [RFC, Snapshot generation] snapshot started, then confirm arrived, expected success (check both master and replica)
  • [RFC, Snapshot generation] snapshot started, then rollback arrived, expected snapshot abort (check both master and replica)
  • [RFC, Snapshot generation] successful snapshot contains all txns created before LSN that was latest when snapshot creation started (check both master and replica)
  • [RFC, Asynchronous replication] successful transaction applied on async replica
  • [RFC, Asynchronous replication] failed transaction rolled back on async replica
  • [RFC, Synchronous replication enabling] "As soon as last operation of synchronous transaction appeared in leader's WAL, it will cause all following transactions - no matter if they are synchronous or not - wait for the quorum. In case quorum is not achieved the 'rollback' operation will cause rollback of all transactions after the synchronous one."
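
Most of the 1st-iteration cases above exercise the same few knobs. Below is a minimal sketch of how a leader could be set up for them, assuming the box.cfg options named in this plan and the is_sync space option; the URIs and concrete values are placeholders, not recommendations.

```lua
-- Leader side; the values are illustrative only.
box.cfg{
    listen = 3301,
    replication = {'replica1:3301', 'replica2:3301'},  -- hypothetical peer URIs
    replication_synchro_quorum  = 2,  -- instances (the leader counts) that must confirm a sync txn
    replication_synchro_timeout = 5,  -- seconds to wait for the quorum before a rollback
}

-- A synchronous space: every transaction touching it waits for the quorum.
local s = box.schema.space.create('sync_test', {is_sync = true, if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

-- Happy path: the insert returns only after the quorum has confirmed it.
s:insert{1, 'confirmed by the quorum'}

-- Switching a space between async and sync mode (first cases in the list).
s:alter{is_sync = false}
s:alter{is_sync = true}
```

On a replica, the same row is expected to appear in box.space.sync_test only after the quorum is collected; selects there cover the data-consistency checks above.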

2nd iteration

  • ??? [RFC, Synchronous replication enabling] "Cluster description should contain explicit attribute for each replica to denote it participates in synchronous activities. Also the description should contain criterion on how many replicas responses are needed to achieve the quorum."
  • [RFC, Leader role assignment] promote a leader manually using a special command (see the promotion sketch after this list)
  • [RFC, Leader role assignment] demote a leader manually using a special command
  • fault injections at different steps to fail the "WAL Ok" answer from a replica: network, disk, etc. (TBD)
  • test with a time difference between the leader and replicas, expected success
  • proper quorum number calculation: the quorum is more than half of the number of nodes in the cluster, i.e. N/2 + 1 where N is the total number of nodes in the cluster (see the quorum sketch after this list):
    • In a 5-node cluster, quorum is 3
    • In a 4-node cluster, quorum is 3
    • In a 3-node cluster, quorum is 2
    • In a 2-node cluster, quorum is 2
  • [RFC, Leader role assignment] TODO: automated leader promotion with Raft
  • test the Raft implementation itself with random state generation and the invariants described in the Raft paper
  • test the new cluster CLI
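
A sketch of the quorum arithmetic from the calculation item above; quorum_size is a hypothetical helper, used only to make the expected values explicit.

```lua
-- Quorum is more than half of the cluster: floor(N / 2) + 1.
local function quorum_size(n_nodes)
    return math.floor(n_nodes / 2) + 1
end

assert(quorum_size(5) == 3)
assert(quorum_size(4) == 3)
assert(quorum_size(3) == 2)
assert(quorum_size(2) == 2)

-- The same value would be fed into the leader's configuration, e.g.:
-- box.cfg{replication_synchro_quorum = quorum_size(3)}
```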
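
For the manual promotion/demotion and Raft items above, a hedged sketch assuming the election API that later Tarantool releases expose (the election_* box.cfg options and box.ctl.promote()/box.ctl.demote()); the exact commands the RFC settles on may differ.

```lua
-- Raft-based leader election (assumed API, see the note above).
box.cfg{
    election_mode    = 'candidate',  -- this instance is allowed to become the leader
    election_timeout = 5,            -- seconds before a new election round is started
    replication_synchro_quorum = 2,
}

-- Manual promotion/demotion of the current instance.
box.ctl.promote()   -- claim leadership and ownership of the synchro queue
box.ctl.demote()    -- step down again
```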

Raft Tests

Notes

  • Testing should be done with both engines: memtx and vinyl
  • How many nodes should be in a cluster?
    • "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems" by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm:

Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes. 84% will manifest on no more than 2 nodes…. It is not necessary to have a large cluster to test for and reproduce failures.

