CONFIRM without quorum leads to split-brain error #9138

lowitea · 2023-09-13T17:16:06Z

My case:

There is a cartridge cluster with a storage replicaset with three nodes.

All spaces are in sync mode.

Stateful failover with etcd is enabled.

Election mode is manual. Function box.ctl.promote() is called inside apply_config if the node is the master.

Under high load, the CPU was exhausted and the master in the replicaset got stuck. The failover decided to switch the master. The new master also got stuck, and several such switches followed. At some point, the error "Split-Brain discovered: got a request from a foreign synchro queue owner" appeared in the replicaset.

Possible cause:

While a node was the master, it wrote a transaction to its WAL and began collecting a quorum. At that moment the master was switched, but the node still managed to collect the quorum and write CONFIRM to its WAL. It then sent that WAL entry to the other nodes in the replicaset, where the limbo owner had already changed.
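The suspected sequence can be modeled with a toy replica that rejects any CONFIRM coming from a node that no longer owns the synchro queue. This is a minimal sketch and not Tarantool code: `ToyReplica`, `apply_promote`, and `apply_confirm` are hypothetical names, the new owner id (1) is an assumption, and the old owner id and lsn are borrowed from the WAL dump in this report purely for illustration.

```python
class SplitBrainError(Exception):
    """Raised when a synchro request comes from a foreign queue owner."""


class ToyReplica:
    """Toy model of a replica applying synchro rows (strict, pre-fix rule)."""

    def __init__(self, owner_id):
        self.owner_id = owner_id  # current synchro queue (limbo) owner

    def apply_promote(self, new_owner_id):
        # A PROMOTE row transfers queue ownership to another node.
        self.owner_id = new_owner_id

    def apply_confirm(self, origin_id, lsn):
        # Strict rule: any CONFIRM from a non-owner counts as split-brain.
        if origin_id != self.owner_id:
            raise SplitBrainError(
                "got a request from a foreign synchro queue owner")
        return lsn


replica = ToyReplica(owner_id=3)
replica.apply_promote(new_owner_id=1)  # failover switched the master
try:
    # Late CONFIRM written by the old master after it lost ownership.
    replica.apply_confirm(origin_id=3, lsn=85935)
    outcome = "applied"
except SplitBrainError:
    outcome = "split-brain"
```

In this model the late CONFIRM is rejected even though it may confirm a transaction the new owner already knows about, which is the behavior the report questions.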

Screenshots (images not preserved): errors in the Cartridge UI for storage-9-1, storage-9-2, and storage-9-3.

Last WAL entries

---
HEADER:
  lsn: 85936
  replica_id: 3
  type: CONFIRM
  timestamp: 1694197746.2396
BODY:
- null
- 3
- 85930
---
HEADER:
  lsn: 85937
  replica_id: 3
  type: CONFIRM
  timestamp: 1694197747.5981
BODY:
- null
- 3
- 85935
...

OS: RockyLinux
OS Version: 8.5
Architecture: amd64

Tarantool 2.11.0


R-omk · 2023-09-14T14:42:30Z

In my experience, only election mode 'raft' with 'strict' fencing can provide at least some kind of guarantee.

An external failover by definition cannot provide any guarantees, since it only elects a new leader, while the old one can continue writing.

The current split-brain detector implementation raises an error each time a CONFIRM or ROLLBACK entry is received from the previous synchronous transaction queue owner. It is assumed that the new queue owner must have witnessed all the previous CONFIRMs. Besides, according to Raft, ROLLBACK should never happen.

Actually, there is a case when a CONFIRM from an old term is legal: during a leader transition the old leader may write a CONFIRM for the same transaction that is confirmed by the new leader's PROMOTE. If the PROMOTE and CONFIRM lsns match, there is nothing wrong with such a situation. Symmetrically, when an old leader issues a ROLLBACK with the lsn right after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous transaction queue owner and silently nopifying CONFIRMs with an lsn less than the one recorded and ROLLBACKs with an lsn greater than that.

Closes tarantool#9138
NO_DOC=bugfix
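The filtering rule described in this commit message can be sketched as a small classifier. This is an illustration under stated assumptions, not the actual Tarantool implementation: `filter_synchro`, `Verdict`, and the `confirmed_lsn` dictionary are hypothetical names. The message says CONFIRMs with a lower lsn are nopified and also that matching lsns are harmless, so the sketch treats "less than or equal" as the harmless case; that boundary is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    APPLY = "apply"
    NOP = "nop"
    SPLIT_BRAIN = "split-brain"


def filter_synchro(req_type, origin_id, lsn, owner_id, confirmed_lsn):
    """Classify a CONFIRM or ROLLBACK row received by a replica.

    confirmed_lsn maps a queue owner's replica id to the last lsn
    known to be confirmed for that owner.
    """
    if origin_id == owner_id:
        # Requests from the current queue owner are applied as usual.
        return Verdict.APPLY
    known = confirmed_lsn.get(origin_id, 0)
    if req_type == "CONFIRM" and lsn <= known:
        # The old owner confirms something that is already confirmed.
        return Verdict.NOP
    if req_type == "ROLLBACK" and lsn > known:
        # The old owner rolls back only what was never confirmed.
        return Verdict.NOP
    return Verdict.SPLIT_BRAIN


# Hypothetical state: node 1 owns the queue, node 3 is the previous owner.
confirmed = {3: 85935}
late_confirm = filter_synchro("CONFIRM", 3, 85935, 1, confirmed)
new_confirm = filter_synchro("CONFIRM", 3, 85940, 1, confirmed)
late_rollback = filter_synchro("ROLLBACK", 3, 85936, 1, confirmed)
```

Here `late_confirm` and `late_rollback` are turned into no-ops, while `new_confirm` still signals split-brain because it confirms data the new owner never acknowledged.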

There was an error in xrow_decode_synchro: it compared the expected type of the value to the type of the key (MP_UINT) instead of the type of the actual value. This went unnoticed because all values in synchro requests were integers. This is going to change soon, when PROMOTE requests start holding a vclock, so fix the wrong type check.

In-scope-of tarantool#9138
NO_DOC=bugfix
NO_CHANGELOG=not user-visible
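The bug pattern described here can be illustrated with a toy decoder. This is a model of the mistake, not the real `xrow_decode_synchro` code: the key ids, `EXPECTED_TYPES`, `decode_buggy`, and `decode_fixed` are all hypothetical, and Python types stand in for MsgPack types.

```python
# Hypothetical key ids and expected value types; not real IPROTO constants.
KEY_LSN = 1
KEY_VCLOCK = 2
EXPECTED_TYPES = {KEY_LSN: int, KEY_VCLOCK: dict}


def decode_buggy(body):
    """Validates the type of the key, which is always an integer."""
    for key, value in body.items():
        if type(key) is not EXPECTED_TYPES[key]:  # bug: inspects the key
            raise TypeError(f"bad type for key {key}")
    return body


def decode_fixed(body):
    """Validates the type of the value, as intended."""
    for key, value in body.items():
        if type(value) is not EXPECTED_TYPES[key]:
            raise TypeError(f"bad type for key {key}")
    return body


# While every expected type is an integer, the buggy check always passes.
decode_buggy({KEY_LSN: 85935})

# Once a map-typed value appears, the buggy check rejects a valid body.
try:
    decode_buggy({KEY_VCLOCK: {3: 85935}})
    buggy_accepts_vclock = True
except TypeError:
    buggy_accepts_vclock = False

fixed_body = decode_fixed({KEY_LSN: 85935, KEY_VCLOCK: {3: 85935}})
```

In this model the buggy check stays invisible as long as all values are integers, which matches the explanation given in the commit message.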

The current split-brain detector implementation raises an error each time a CONFIRM or ROLLBACK entry is received from the previous synchronous transaction queue owner. It is assumed that the new queue owner must have witnessed all the previous CONFIRMs. Besides, according to Raft, ROLLBACK should never happen.

Actually, there is a case when a CONFIRM from an old term is legal: during a leader transition the old leader may write a CONFIRM for the same transaction that is confirmed by the new leader's PROMOTE. If the PROMOTE and CONFIRM lsns match, there is nothing wrong with such a situation. Symmetrically, when an old leader issues a ROLLBACK with the lsn right after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous transaction queue owner and silently nopifying CONFIRMs with an lsn less than the one recorded and ROLLBACKs with an lsn greater than that. Additionally, persist the confirmed lsns for all previous synchronous transaction queue owners for the sake of correct filtering after restarts.

Closes tarantool#9138
NO_DOC=bugfix


Previously the replicas only persisted the confirmed lsn of the current synchronous transaction queue owner. As soon as the owner changed, the information about which lsn had been confirmed by the previous owner was lost. This information is needed to correctly filter synchro requests coming from the old term, so start tracking the confirmed vclock instead of the confirmed lsn on replicas.

In-scope of tarantool#9138
NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field
IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key-value pair:
IPROTO_VCLOCK : MP_MAP
The vclock holds a confirmed vclock of the node sending the request.


Synchronous requests will receive a new field encoding a full vclock soon. Theoretically a vclock may take up to ~300-400 bytes (3 bytes for a map header + 32 components, each taking up 1 byte for the replica id and up to 9 bytes for the lsn). So it makes no sense to increase SYNCHRO_BODY_LEN_MAX from 32 to 400-500. It would become almost the same as plain BODY_LEN_MAX. Simply reuse the latter everywhere.

In-scope-of tarantool#9138
NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
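The size estimate can be checked with a few lines of arithmetic. The figures are the ones quoted in the commit message; the constant names are made up for this sketch.

```python
# Figures quoted in the commit message; the names are illustrative only.
MAP_HEADER_BYTES = 3   # map header
MAX_COMPONENTS = 32    # vclock components
REPLICA_ID_BYTES = 1   # per-component replica id
MAX_LSN_BYTES = 9      # per-component lsn, worst case

max_vclock_bytes = MAP_HEADER_BYTES + MAX_COMPONENTS * (
    REPLICA_ID_BYTES + MAX_LSN_BYTES)
print(max_vclock_bytes)  # 3 + 32 * 10 = 323
```

The result, 323 bytes, falls inside the quoted 300-400 byte range and is roughly ten times the old 32-byte SYNCHRO_BODY_LEN_MAX.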


Cherry-picked copies of the commits above reference the following original commits:
- xrow_decode_synchro type check fix: cherry picked from commit c18410f
- SYNCHRO_BODY_LEN_MAX replaced by BODY_LEN_MAX: cherry picked from commit 5360577
- confirmed vclock tracking on replicas: cherry picked from commit c4415d4
- split-brain detector fix: cherry picked from commit ffa6ac1


lowitea added the bug label Sep 13, 2023

sergepetrenko added the qsync replication label Sep 14, 2023

Thorwinz added the customer label Nov 7, 2023

sergepetrenko self-assigned this Nov 8, 2023

sergepetrenko mentioned this issue Nov 10, 2023

replication: fix extraneous split-brain alerting #9359

Merged

sergepetrenko added the 2.10 label (target is 2.10 and all newer release/master branches) Nov 10, 2023

ylobankov closed this as completed in #9359 Nov 30, 2023
