Harness raft 'learner' feature #6281

Gerold103 · 2021-08-01T16:29:48Z

With the current Raft implementation, adding a new instance to the cluster can block all synchronous transactions until that instance is fully joined and up to date.

The simplest example: there is 1 node with the canonical quorum setting (replication_synchro_quorum = 'N/2 + 1'). It works fine, but holds lots of data. If a second node is now added, the first node can no longer commit any synchronous data: the quorum has become 2, but the second node is busy with the long initial and then final join stages.

The same happens in a cluster that already has multiple nodes and gets multiple new instances. For example, a cluster of 3 nodes gets 3 new nodes: the quorum becomes 4, and synchronous replication is dead until at least one of the new nodes finishes its join stages.
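The arithmetic behind both examples can be sketched in a few lines. This is only an illustration, not Tarantool code: it assumes the 'N/2 + 1' formula rounds N/2 down, that a node counts towards N as soon as it is registered, and that a node still in its join stages cannot acknowledge synchronous transactions. The function names are made up for the example.

```python
def synchro_quorum(registered_nodes: int) -> int:
    """Quorum from the 'N/2 + 1' formula (assumption: N/2 is rounded down)."""
    return registered_nodes // 2 + 1


def can_commit_sync(registered_nodes: int, joined_nodes: int) -> bool:
    """True if enough fully joined nodes (leader included) exist to ack a sync tx.

    Assumption: nodes that are still joining count towards N but cannot ack.
    """
    return joined_nodes >= synchro_quorum(registered_nodes)


# 1 node, then a second one starts joining: quorum goes from 1 to 2.
single_node_ok = can_commit_sync(registered_nodes=1, joined_nodes=1)
two_nodes_one_joining_ok = can_commit_sync(registered_nodes=2, joined_nodes=1)

# 3 nodes get 3 new ones: quorum becomes 4, only 3 nodes can ack.
six_nodes_three_joining_ok = can_commit_sync(registered_nodes=6, joined_nodes=3)
# Once one of the new nodes finishes joining, commits are possible again.
six_nodes_two_joining_ok = can_commit_sync(registered_nodes=6, joined_nodes=4)
```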

In etcd the problem is solved by having a special role for such instances, the learner: https://etcd.io/docs/v3.3/learning/learner/. Learners do not affect the quorum, can't vote, and can't be leaders. But they can join and get up to date with the leader before entering the quorum.

Need to check if this issue actually exists in Tarantool (make a test, perhaps with an error injection). Then try to add the new role.

In the etcd approach, if I understood it correctly, I do not like that a user needs to set the role manually. In Tarantool it might be done automatically, I hope: make all new nodes learners until they finish at least both join stages, and then turn on the actual election_mode specified in the config.
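A rough model of that automatic behaviour, to make the proposal concrete. Everything here is hypothetical: the `JoinStage` and `Node` types, the "learner" mode value, and the rule that learners are simply left out of N are assumptions for illustration, not an existing Tarantool API or the implemented fix.

```python
from dataclasses import dataclass
from enum import Enum, auto


class JoinStage(Enum):
    INITIAL_JOIN = auto()
    FINAL_JOIN = auto()
    JOINED = auto()


@dataclass
class Node:
    name: str
    configured_election_mode: str  # what the user put in the config
    stage: JoinStage = JoinStage.INITIAL_JOIN

    @property
    def is_learner(self) -> bool:
        # Hypothetical rule: a node is a learner until both join stages are done.
        return self.stage is not JoinStage.JOINED

    @property
    def effective_election_mode(self) -> str:
        # "learner" is a made-up value; the configured mode applies after the join.
        return "learner" if self.is_learner else self.configured_election_mode


def quorum(nodes: list) -> int:
    """'N/2 + 1' where N counts only non-learner nodes (assumption)."""
    voters = sum(1 for node in nodes if not node.is_learner)
    return voters // 2 + 1


leader = Node("leader", "candidate", JoinStage.JOINED)
newcomer = Node("newcomer", "candidate")  # starts in INITIAL_JOIN
cluster = [leader, newcomer]

quorum_while_joining = quorum(cluster)          # newcomer is not counted yet
mode_while_joining = newcomer.effective_election_mode

newcomer.stage = JoinStage.JOINED               # both join stages finished
quorum_after_join = quorum(cluster)
mode_after_join = newcomer.effective_election_mode
```

In this model the user never sets the role by hand: the configured election_mode is kept as is and only takes effect once the node has caught up.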


Serpentian · 2023-02-22T10:13:40Z

The issue indeed exists in Tarantool. Here's the reproducer: https://github.com/Serpentian/tarantool/tree/gh-6281-learner-role.
It can be run with: ./test-run.py --builddir ../build replication-luatest/gh_6281_raft_learner_feature_test.lua


Gerold103 added the feature (A new functionality), replication, synchro, and raft (RAFT protocol) labels Aug 1, 2021

kyukhin added incoming and removed incoming labels Aug 5, 2021

kyukhin added this to the wishlist milestone Aug 19, 2021

sergepetrenko mentioned this issue Sep 27, 2021

Raft improvements #6472

Open

6 tasks

kyukhin removed the synchro label Mar 11, 2022

sergepetrenko mentioned this issue Nov 3, 2022

Disallow lowering quorum below N / 2 + 1 without turning some nodes into learners #7896

Open

Serpentian self-assigned this Jan 9, 2023

Serpentian added a commit to Serpentian/tarantool that referenced this issue Feb 27, 2023

replication: fix sync tx unavailability during join

319cf73

This commit introduces test and initial implementation of the fix. POC. Closes tarantool#6281 NO_DOC=bugfix NO_CHANGELOG=later

coveralls mentioned this issue Feb 27, 2023

replication: fix sync tx unavailability during join #8375

Closed

kyukhin removed this from the wishlist milestone May 23, 2023

TarantoolBot removed the teamS label Jun 7, 2023


Gerold103 commented Aug 1, 2021 • edited by kyukhin
