-
Remediation for what issue? I don't follow you.
Split brain is not possible, due to the nature of Raft.
If the other side of the partition has a quorum, it will maintain a leader, and that leader will become the leader of the smaller side of the partition when the partition is healed. If the cluster has split into enough pieces that no portion has a quorum, that will need to be fixed -- either by manual intervention, or by the system recovering on its own (for example, the network heals itself). More information:
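To make the quorum argument concrete, here is a minimal sketch of the majority arithmetic. It is not rqlite code; the function names `quorum` and `sides_with_quorum` are made up for illustration, and it assumes every node in the cluster is a voting member.

```python
from __future__ import annotations


def quorum(cluster_size: int) -> int:
    """Smallest number of voting nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def sides_with_quorum(cluster_size: int, partition_sizes: list[int]) -> list[int]:
    """Return the indexes of the partition sides that still hold a majority.

    At most one side can qualify, which is why two leaders cannot both
    commit writes at the same time.
    """
    if sum(partition_sizes) > cluster_size:
        raise ValueError("partition sides contain more nodes than the cluster")
    needed = quorum(cluster_size)
    return [i for i, size in enumerate(partition_sizes) if size >= needed]


# A 5-node cluster split 3/2: only the 3-node side can keep a leader.
print(sides_with_quorum(5, [3, 2]))
# A 5-node cluster split 2/2/1: no side has a quorum, so nobody can lead.
print(sides_with_quorum(5, [2, 2, 1]))
```

Because a majority is strictly more than half of the nodes, two disjoint sides can never both reach it, which is the property the reply above relies on.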
-
As for question 1, see wildarch/jepsen.rqlite#9. I need more information from the testers to understand what they actually mean.

As for question 2, there is an API to detect whether a node can't contact the leader (which is caused either by a partition, or because a quorum of nodes is down -- it's impossible for a distributed system like Raft to be 100% sure which is the case, because the effects are basically the same). Simply call the readiness endpoint. When everything is OK, it'll work like this (for example):

$ curl localhost:4001/readyz
[+]node ok
[+]leader ok

When there is a partition, and that node is on the side without the leader, it will look like this:

$ curl localhost:4001/readyz
[+]node ok
[+]leader does not exist

More information here: https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS.md#readiness-checks

You can also check
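For anyone who wants to automate this check, here is a rough sketch that parses the `/readyz` output shown above. The helper names (`parse_readyz`, `can_reach_leader`, `fetch_readyz`) are hypothetical, the parsing is based only on the two sample outputs in this comment, and the handling of non-2xx responses is an assumption rather than documented behaviour.

```python
from __future__ import annotations

import urllib.error
import urllib.request


def parse_readyz(body: str) -> dict[str, str]:
    """Turn lines such as '[+]leader ok' into {'leader': 'ok'}."""
    checks: dict[str, str] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip blank or unexpected lines
        # Drop the leading '[+]' style marker, then split name from status.
        _, _, rest = line.partition("]")
        name, _, status = rest.strip().partition(" ")
        checks[name] = status
    return checks


def can_reach_leader(body: str) -> bool:
    """True only when the 'leader' check reports 'ok'."""
    return parse_readyz(body).get("leader") == "ok"


def fetch_readyz(base_url: str = "http://localhost:4001", timeout: float = 2.0) -> str:
    """Fetch the readiness report from a node (address is an assumption)."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        # Assumption: a node that is not ready may answer with a non-2xx
        # status while still listing its checks in the response body.
        return err.read().decode()


healthy = "[+]node ok\n[+]leader ok"
partitioned = "[+]node ok\n[+]leader does not exist"
print(can_reach_leader(healthy))
print(can_reach_leader(partitioned))
```

Against a live node, `can_reach_leader(fetch_readyz())` would give the same yes/no answer. A node that is down entirely raises `urllib.error.URLError`, which the caller would need to handle separately.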
-
You had it there in the appropriately named markdown docs all along. Thanks for the awesome docs. I guess I missed it.
-
I am glad to see https://github.com/wildarch/jepsen.rqlite/blob/main/doc/blog.md#network-partitioning discuss this.
I think the series had a meme name of "Call me Maybe" which was appropriate :)
I presume that this fault case is not handled? Based on https://github.com/wildarch/jepsen.rqlite/blob/main/doc/blog.md#results it seems it is not. But it's very good in general, and I am really happy to see linearizability being tested.
I wonder if remediation can be employed. I know these are hairy race conditions, and I am wondering whether HIL (human in the loop) helps, so that Ops can fix it via the CLI.
Is there a way to detect that "split brain" has occurred?
When it occurs, is there a way to tell the small quorum of nodes to get a new leader once the partition recovers?