Harness raft 'learner' feature #6281

Open
Tracked by #6472
Gerold103 opened this issue Aug 1, 2021 · 1 comment
Assignees
Labels
feature A new functionality raft RAFT protocol replication

Comments

@Gerold103
Collaborator

Gerold103 commented Aug 1, 2021

With the current Raft implementation, adding a new instance to the cluster can block all synchronous transactions until that instance is fully joined and up to date.

The simplest example: there is 1 node with the canonical replication_synchro_quorum = 'N/2 + 1'. It works fine, but holds lots of data. If a second node is now added, the first node can no longer commit any synchronous data: the quorum has become 2, but the second node is busy with the long initial and then final join stages.

The same happens in a cluster of multiple nodes that gets multiple new instances. For example, a cluster of 3 nodes gets 3 new nodes: the quorum becomes 4, and synchronous replication is dead until at least one of the new nodes finishes its join stages.
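The arithmetic behind both examples can be checked with a minimal sketch. It assumes that 'N/2 + 1' uses integer (floor) division and that N counts every registered instance, including ones still joining; the function names are hypothetical and are not part of Tarantool.

```python
def canonical_quorum(registered: int) -> int:
    """Quorum for the canonical formula N/2 + 1 (floor division assumed)."""
    return registered // 2 + 1


def can_commit_sync(registered: int, caught_up: int) -> bool:
    """True if enough caught-up instances exist to ack a synchronous txn.

    `registered` counts every instance in the cluster, including ones
    that are still in the join stages; `caught_up` counts only the
    instances able to acknowledge new transactions (leader included).
    """
    return caught_up >= canonical_quorum(registered)


# (registered, caught_up) pairs taken from the examples in the text.
for registered, caught_up in [(1, 1), (2, 1), (6, 3), (6, 4)]:
    print(
        f"N={registered} caught_up={caught_up} "
        f"quorum={canonical_quorum(registered)} "
        f"commit_possible={can_commit_sync(registered, caught_up)}"
    )
```

With 6 registered instances and only the 3 original ones caught up, the quorum of 4 is unreachable; it becomes reachable once one new node finishes joining.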

In etcd the problem is solved by a special role for such instances, the learner: https://etcd.io/docs/v3.3/learning/learner/. Learners do not affect the quorum, can't vote, and can't be leaders. But they can join and get up to date with the leader before entering the quorum.

Need to check if this issue actually exists in Tarantool (make a test, perhaps with an error injection). Then try to add the new role.

In the etcd approach, if I understood it correctly, I do not like that a user needs to set the role manually. In Tarantool it might be done automatically, I hope: make all new nodes learners until they finish at least both join stages, and then turn on the actual election_mode specified in the config.
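A rough model of that automatic behaviour, as a sketch only: every name here (`Node`, `Cluster`, `finish_join`, the `"learner"` mode string) is hypothetical and does not reflect an existing Tarantool API. The assumption is that a node is excluded from the quorum count while it is a learner and switches to its configured election_mode once both join stages are done.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    name: str
    election_mode: str    # value from the node's config, e.g. "candidate"
    joined: bool = False  # True once initial and final join are finished

    def effective_mode(self) -> str:
        # Until the join stages finish, the node acts as a learner
        # regardless of what its config says.
        return self.election_mode if self.joined else "learner"


@dataclass
class Cluster:
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, name: str, election_mode: str = "candidate",
            joined: bool = False) -> None:
        self.nodes[name] = Node(name, election_mode, joined)

    def finish_join(self, name: str) -> None:
        # Automatic promotion: no manual role change by the user.
        self.nodes[name].joined = True

    def quorum(self) -> int:
        # Learners are not counted, so N covers joined nodes only.
        counted = sum(1 for n in self.nodes.values() if n.joined)
        return counted // 2 + 1


cluster = Cluster()
for name in ("a", "b", "c"):
    cluster.add(name, joined=True)
print("3 joined nodes:", cluster.quorum())

for name in ("d", "e", "f"):
    cluster.add(name)  # new nodes start as learners
print("after adding 3 learners:", cluster.quorum())

cluster.finish_join("d")
print("after d finishes joining:", cluster.quorum())
```

In this model the quorum stays at 2 while the three new nodes are joining, so the original nodes keep committing, and it grows only as nodes finish joining.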

@Serpentian
Contributor

The issue indeed exists in Tarantool. Here's the reproducer: https://github.com/Serpentian/tarantool/tree/gh-6281-learner-role.
It can be run with: ./test-run.py --builddir ../build replication-luatest/gh_6281_raft_learner_feature_test.lua

Serpentian added a commit to Serpentian/tarantool that referenced this issue Feb 27, 2023
This commit introduces test and initial implementation
of the fix. POC.

Closes tarantool#6281

NO_DOC=bugfix
NO_CHANGELOG=later
@kyukhin kyukhin removed this from the wishlist milestone May 23, 2023
@TarantoolBot TarantoolBot removed the teamS label Jun 7, 2023

4 participants