Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CONFIRM without quorum leads to split-brain error #9138

Closed
lowitea opened this issue Sep 13, 2023 · 1 comment · Fixed by #9359
Closed

CONFIRM without quorum leads to split-brain error #9138

lowitea opened this issue Sep 13, 2023 · 1 comment · Fixed by #9359
Assignees
Labels
2.10 Target is 2.10 and all newer release/master branches bug Something isn't working customer qsync replication

Comments

@lowitea
Copy link

lowitea commented Sep 13, 2023

My case:

There is a cartridge cluster with a storage replicaset with three nodes.

All spaces are in sync mode.

Stateful failover with etcd is enabled.

Election mode is manual. Function box.ctl.promote() is called inside apply_config if the node is the master.

During high load, the cpu ran out and a master in the replicaset stuck. The failover decided to switch the master. The new master also got stuck. And there were several such switches. At some point, the error "Split-Brain discovered: got a request from a foreign synchro queue owner" appeared in the replicaset.

Possible cause:

A node while it was a master wrote a transaction in WAL and began obtaining quorum. At that moment master switching occurred, but the node managed to obtain quorum and write CONFIRM in its WAL. After writing, the node sent the WAL entry to other nodes in replicaset in which the limbo owner had already been changed.

screenshots Errors in Cartridge UI

storage-9-1

Pasted image 20230913132200

storage-9-2

Pasted image 20230913132209

storage-9-3

Pasted image 20230913132221

Last WAL entries

---
HEADER:
  lsn: 85936
  replica_id: 3
  type: CONFIRM
  timestamp: 1694197746.2396
BODY:
- null
- 3
- 85930
---
HEADER:
  lsn: 85937
  replica_id: 3
  type: CONFIRM
  timestamp: 1694197747.5981
BODY:
- null
- 3
- 85935
...
  • OS: RockyLinux
  • OS Version: 8.5
  • Architecture: amd64

Tarantool 2.11.0

@lowitea lowitea added the bug Something isn't working label Sep 13, 2023
@R-omk
Copy link

R-omk commented Sep 14, 2023

In my experience, only when election mode set to 'raft' with 'strict' fencing can provide at least some kind of guarantee.

An external failover by definition cannot provide any guarantees, since it only elects a new leader, while the old one can continue writing.

@sergepetrenko sergepetrenko self-assigned this Nov 8, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 10, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes tarantool#9138

NO_DOC=bugfix
@sergepetrenko sergepetrenko added the 2.10 Target is 2.10 and all newer release/master branches label Nov 10, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 14, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of tarantool#9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 14, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Additionally, persist the confirmed lsns for all previous synchronous
transaction queue owner for the sake of correct filtering after
restarts.

Closes tarantool#9138

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 15, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of tarantool#9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 15, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Additionally, persist the confirmed lsns for all previous synchronous
transaction queue owner for the sake of correct filtering after
restarts.

Closes tarantool#9138

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 15, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Additionally, persist the confirmed lsns for all previous synchronous
transaction queue owner for the sake of correct filtering after
restarts.

Closes tarantool#9138

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 21, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of tarantool#9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 21, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of tarantool#9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 21, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes tarantool#9138

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of tarantool#9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
Synchronous requests will receive a new field encoding a full vclock
soon. Theoretically a vclock may take up to ~ 300-400 bytes (3 bytes for
a map header + 32 components each taking up 1 byte for replica id and up
to 9 bytes for lsn). So it makes no sense to increase
SYNCHRO_BODY_LEN_MAX from 32 to 400-500. It would become almost the same
as plain BODY_LEN_MAX. Simply reuse the latter everywhere.

In-scope-of tarantool#9138

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of tarantool#9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes tarantool#9138

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of tarantool#9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue Nov 22, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes tarantool#9138

NO_DOC=bugfix
ylobankov pushed a commit that referenced this issue Nov 30, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of #9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible
ylobankov pushed a commit that referenced this issue Nov 30, 2023
Synchronous requests will receive a new field encoding a full vclock
soon. Theoretically a vclock may take up to ~ 300-400 bytes (3 bytes for
a map header + 32 components each taking up 1 byte for replica id and up
to 9 bytes for lsn). So it makes no sense to increase
SYNCHRO_BODY_LEN_MAX from 32 to 400-500. It would become almost the same
as plain BODY_LEN_MAX. Simply reuse the latter everywhere.

In-scope-of #9138

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
ylobankov pushed a commit that referenced this issue Nov 30, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of #9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.
ylobankov pushed a commit that referenced this issue Nov 30, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes #9138

NO_DOC=bugfix
sergepetrenko added a commit that referenced this issue Dec 2, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of #9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible

(cherry picked from commit c18410f)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Synchronous requests will receive a new field encoding a full vclock
soon. Theoretically a vclock may take up to ~ 300-400 bytes (3 bytes for
a map header + 32 components each taking up 1 byte for replica id and up
to 9 bytes for lsn). So it makes no sense to increase
SYNCHRO_BODY_LEN_MAX from 32 to 400-500. It would become almost the same
as plain BODY_LEN_MAX. Simply reuse the latter everywhere.

In-scope-of #9138

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 5360577)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of #9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.

(cherry picked from commit c4415d4)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes #9138

NO_DOC=bugfix

(cherry picked from commit ffa6ac1)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
There was an error in xrow_decode_synchro: it compared the expected type
of the value to the type of the key (MP_UINT) instead of the type of the
actual value. This went unnoticed because all values in synchro requests
were integers.

This is going to change soon, when PROMOTE requests will start holding a
vclock, so fix the wrong type check.

In-scope-of #9138

NO_DOC=bugfix
NO_CHANGELOG=not user-visible

(cherry picked from commit c18410f)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Synchronous requests will receive a new field encoding a full vclock
soon. Theoretically a vclock may take up to ~ 300-400 bytes (3 bytes for
a map header + 32 components each taking up 1 byte for replica id and up
to 9 bytes for lsn). So it makes no sense to increase
SYNCHRO_BODY_LEN_MAX from 32 to 400-500. It would become almost the same
as plain BODY_LEN_MAX. Simply reuse the latter everywhere.

In-scope-of #9138

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 5360577)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Previously the replicas only persisted the confirmed lsn of the current
synchronous transaction queue owner. As soon as the onwer changed, the
info about which lsn was confirmed by the previous owner was lost.

Actually, this info is needed to correctly filter synchro requests
coming from the old term, so start tracking confirmed vclock instead of
the confirmed lsn on replicas.

In-scope of #9138

NO_TEST=covered by the next commit
NO_CHANGELOG=internal change

@TarantoolBot document
Title: Document new IPROTO_RAFT_PROMOTE request field

IPROTO_RAFT_PROMOTE and IPROTO_RAFT_DEMOTE requests receive a new key
value pair:

IPROTO_VCLOCK : MP_MAP

The vclock holds a confirmed vclock of the node sending the request.

(cherry picked from commit c4415d4)
sergepetrenko added a commit that referenced this issue Dec 2, 2023
Current split-brain detector implementation raises an error each time a
CONFIRM or ROLLBACK entry is received from the previous synchronous
transaction queue owner. It is assumed that the new queue owner must
have witnessed all the previous CONFIRMS. Besides, according to Raft,
ROLLBACK should never happen.

Actually there is a case when a CONFIRM from an old term is legal: it's
possible that during leader transition old leader writes a CONFIRM for
the same transaction that is confirmed by the new leader's PROMOTE. If
PROMOTE and CONFIRM lsns match there is nothing bad about such
situation.

Symmetrically, when an old leader issues a ROLLBACK with the lsn right
after the new leader's PROMOTE lsn, it is not a split-brain.

Allow such cases by tracking the last confirmed lsn for each synchronous
transaction queue owner and silently nopifying CONFIRMs with an lsn less
than the one recorded and ROLLBACKs with lsn greater than that.

Closes #9138

NO_DOC=bugfix

(cherry picked from commit ffa6ac1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.10 Target is 2.10 and all newer release/master branches bug Something isn't working customer qsync replication
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants