
Leader Elections

Simon Massey edited this page Sep 4, 2017 · 10 revisions


TRex implements the algorithm presented in the paper Paxos Made Simple with each node playing the parts of Proposer, Acceptor and Learner. This means that a cluster contains nodes running identical processes. TRex demonstrates that Paxos itself automatically chooses a stable leader when the leader timeouts are long enough to prevent two potential leaders fighting and when the network allows sufficient messages to get through.

Here is a sketch of the parts of the message exchange during a cold startup of a three node cluster, where the node which happens to time out first has been labeled leader (any of the three nodes could have timed out first, as all nodes are equal upon startup):

[Figure: leader election cold start message exchange]

Note that the sketch doesn't cover low-ball polling or the leader heartbeat commit messages that are discussed below.

Interruptions And Low-Ball Polling

A naive version of events would be that when a node times out on a leader it runs Phase 1 and issues a Prepare(id: Identifier) message. This message uses an id containing a ballot number higher than both the last promise number and the last highest commit number. The node can instantaneously promise to its own new ballot number ("self-promise"). When it receives a majority of PrepareAck messages (positive acknowledgments) it works out the highest numbered value (if any) and transmits that value in a Phase 2 Accept message (else a no-op value). Upon a majority of positive Phase 2 responses the leader sends a Commit message. This is the learning message by which the other nodes learn that a value has been fixed and can be processed. The Commit message also contains a leader heartbeat number. The node that has assumed the leadership will periodically retransmit the Commit message with a higher heartbeat value to stop other nodes from timing out. The periodic retransmission also helps correct for network partitions where the original commit message was lost.
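The ballot selection and Phase 2 value choice described above can be sketched as follows. This is a minimal illustration, not TRex's actual API: the names `BallotNumber`, `PrepareAck`, `nextBallot` and `chooseValue` are all hypothetical.

```scala
// Hypothetical sketch of the naive takeover logic. A ballot number is ordered
// first by counter, then by node id to break ties between nodes.
final case class BallotNumber(counter: Int, nodeId: Int)

// A positive Phase 1 response may carry the highest value the acceptor has accepted.
final case class PrepareAck(from: Int, highestAccepted: Option[(BallotNumber, String)])

// A new ballot must be higher than both the last promise and the last commit number.
def nextBallot(lastPromise: BallotNumber, lastCommit: BallotNumber, self: Int): BallotNumber =
  BallotNumber(math.max(lastPromise.counter, lastCommit.counter) + 1, self)

// Phase 2 value selection: the highest-numbered accepted value wins, else a no-op.
def chooseValue(acks: Seq[PrepareAck], noop: String): String =
  acks.flatMap(_.highestAccepted)
    .sortBy { case (b, _) => (b.counter, b.nodeId) }
    .lastOption.map(_._2).getOrElse(noop)
```

Selecting the highest-numbered accepted value is what makes the takeover safe: any value fixed by the previous leader is carried forward rather than overwritten.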

The problem with this naive version of events is that if we have a three node cluster A, B and C, and the link between the leader and one node goes down, we will disrupt the stable leader. Why? Let's say that A is the leader and the link A-B goes down. Node B will time out and issue a Prepare with a number higher than the promise it made to A. Node C will make a promise and reply to B. Node C will then reject any further Phase 2 messages from the leader A. Node A has lost its leadership and we may enter an infinite loop of leadership changes.

To deal with this, a timed out node will first issue a low-ball Prepare message using a zero ballot number. This won't interrupt a stable leader, as nodes will reply with PrepareNack (negative acknowledgment). The PrepareNack messages may contain either strong or weak evidence of a stable leader. If the timed out node deduces from this evidence that there is a leader it will reset its timeout and won't issue a high Prepare. There is also an additional check that it has enough responses to form a majority before issuing a high Prepare. Effectively a timed out node will poll the rest of the cluster with a low-ball message to check the status of the network. Only if a majority of nodes are contactable, and they present no compelling evidence of a working leader, will a timed out node attempt to take over the leadership.
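The escalation decision can be sketched as below. This is an assumed simplification, not TRex's code: `PrepareNack` here carries only the responder's highest committed index, and `shouldIssueHighPrepare` collapses the two checks (majority reachable, no evidence of leader progress) into one predicate.

```scala
// Hypothetical negative acknowledgment to a low-ball Prepare, carrying the
// responder's highest committed log index as potential leader evidence.
final case class PrepareNack(from: Int, highestCommitted: Long)

// Escalate to a high Prepare only when a majority responded and no response
// shows a leader committing beyond what this node has already seen.
def shouldIssueHighPrepare(nacks: Seq[PrepareNack],
                           clusterSize: Int,
                           myHighestCommitted: Long): Boolean = {
  val majority = clusterSize / 2 + 1
  val haveMajority = nacks.size + 1 >= majority // our own node counts toward the majority
  val leaderEvidence = nacks.exists(_.highestCommitted > myHighestCommitted)
  haveMajority && !leaderEvidence
}
```

Counting the polling node itself toward the majority reflects that it can self-promise; if it cannot reach a majority it could not win a high Prepare round anyway, so there is no point disrupting anyone.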

Strong leader evidence is in the form of the highest committed log index seen by each node. If a timed out node sees that other nodes are learning of higher committed log indexes it knows a leader must have run successful Phase 2 rounds. It only needs one response showing that the committed slot number has increased to be certain that a leader existed beyond the last one it saw. It will reset its timeout and request retransmission of the committed value(s) from the other node. If the cluster isn't busy we might not see the commit number increasing within the timeout window. You can choose to script a synthetic client to send no-op values to keep the cluster busy, which trades IO and a modest storage overhead against unnecessary leader failovers.
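As a sketch, the strong evidence check reduces to a single comparison per response. The function name and shape are illustrative assumptions; returning the highest index seen also tells the node how far to request retransmission.

```scala
// Strong evidence sketch: Some(index) means a leader has committed up to `index`
// beyond what we have seen, so reset the timeout and request retransmission of
// the missing slots; None means no response proved leader progress.
def strongLeaderEvidence(committedIndexes: Seq[Long], myHighestCommitted: Long): Option[Long] =
  committedIndexes.filter(_ > myHighestCommitted).maxOption
```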

Weak leader evidence is in the form of comparing the last leader heartbeat values taken from the responses. If the timed out node sees quorum - 1 heartbeat values higher than the last heartbeat it saw directly, it will reset its timeout. The heartbeat value is currently implemented as the local leader clock value, which is sent in the heartbeat Commit message. Clocks are not synchronized, and clock skew means that when the stable leader changes the heartbeat can go back in time. We then get a false negative on the weak leader evidence check, as the new leader's clock is behind the old leader's. Such a false negative scenario can only happen in clusters with more than three nodes. You can choose to script a synthetic client to send no-op values to generate strong evidence and so prevent unnecessary interruptions due to false negative weak evidence. In the case of a three node cluster, reliance on weak evidence should be sufficient.
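The weak evidence threshold can be sketched as follows, with hypothetical names. The quorum - 1 threshold works because in a three node cluster it means a single fresher heartbeat from either peer is enough to back off.

```scala
// Weak evidence sketch: the node backs off if at least quorum - 1 responses
// carry a heartbeat newer than the last one it observed directly from a leader.
def weakLeaderEvidence(heartbeats: Seq[Long], lastDirectHeartbeat: Long, clusterSize: Int): Boolean = {
  val quorum = clusterSize / 2 + 1
  heartbeats.count(_ > lastDirectHeartbeat) >= quorum - 1
}
```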

Issue 28 tracks that the low-ball state is not yet explicitly modeled in the PaxosRoles sealed trait of named states.


Note that if the timeouts are set too low a stable leader won't emerge quickly. The whole takeover cycle takes about 3.5x the message round trip time (RTT). Let's say we have three nodes X, Y, Z with leader Z dying. Then say the two nodes X and Y time out at random in the range 0 to 10 RTT. The probability of them timing out at least 3.5 RTT apart is only about 0.42, which suggests you might get a few failed attempts before one of the two nodes gets enough time to lead. The recommendation is that the minimum timeout is set to greater than 3.5 RTT plus the disk flush time. The maximum timeout for a three node cluster should be set such that max - min > 10 RTT.
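The 0.42 figure comes from the standard result that two independent points drawn uniformly from a window of width w land at least g apart with probability (1 - g/w)^2, which for g = 3.5 RTT and w = 10 RTT gives 0.65^2 = 0.4225. A one-line check (illustrative helper name):

```scala
// Probability that two independent uniform timeouts in a window of `windowRtt`
// fire at least `gapRtt` apart: (1 - gap/window)^2.
def gapProbability(windowRtt: Double, gapRtt: Double): Double = {
  val p = 1.0 - gapRtt / windowRtt
  p * p
}
```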

Note that you are using a JVM, so you should count potential GC pauses in your estimated RTT in addition to the disk flush time of large values. This means that ping time isn't a good estimate of real RTT under load. You should both load test and soak test your system with realistic payloads and measure the actual RTT (e.g. a dropwizard gauge histogram 95th percentile). You should also look at your GC statistics in live and consider swapping the GC strategy and/or tuning the GC parameters to reduce the occurrence of long GC stalls.
