Papers We Love (Raft)
- In Search of an Understandable Consensus Algorithm
- Analysis of Raft Consensus
- The Secret Lives of Data
- Consistency
- Availability
- Partition Tolerance
- Google (Chubby in BigTable)
- Amazon AWS
- Apache Mesos
- Microsoft (Bing, Azure)
- VMware
- Oracle
- Neo4j
- single-decree Paxos: can reach agreement on a single decision (i.e., one log entry)
- multi-Paxos: multiple instances of single-decree Paxos to achieve a series of decisions
- correctness has been proven, and it is efficient in the normal case
- exceptionally difficult to understand
- "we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year"
- no widely agreed-upon algorithm for multi-Paxos
- symmetric peer-to-peer approach
- real implementations are so different from Paxos that the proofs have little value
- "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the final system will be based on an unproven protocol." -Chubby implementers
- etcd
- RethinkDB
- Consul
- problem decomposition
- simplify the state space by reducing the number of states to consider
- "randomized approaches introduce nondeterminism, but they tend to reduce the state space... (choose any; it doesn't matter)"
- Strong leader (logs only go from leader -> followers)
- Leader election (use randomized timers to elect leaders)
- Membership changes (parallel changes pattern)
- note: clients are external here
- Election Safety: only one leader per term
- Leader Append-Only: leader never changes existing log entries
- Log Matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
- Leader Completeness: if a log entry is committed in a term, then that entry will be present in logs of future leaders
- State Machine Safety: if a server has applied an entry at a given index, no other server will apply a different entry for the same index
- linearizable semantics (each operation appears to execute instantaneously, exactly once, at some point between its invocation and response)
- leader
- follower
- candidate
- numbered with consecutive integers
- each term starts with an election
- act as a logical clock in Raft
- each server stores current term number
- terms exchanged each time servers communicate, updating if necessary
- if a candidate/leader gets a newer term, it immediately reverts to a follower
- requests with stale terms are rejected
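
A minimal sketch in Go of these term rules (type and field names are my own, not from any particular implementation): a newer term forces a step-down to follower, while a stale term means the request should be rejected.

```go
package raft

// ServerState is the role a server currently plays.
type ServerState int

const (
	Follower ServerState = iota
	Candidate
	Leader
)

// Server holds just the state needed for term handling.
type Server struct {
	currentTerm int
	state       ServerState
}

// observeTerm applies the term rules: a newer term makes the server
// adopt it and revert to follower; a stale term means the request
// should be rejected (return value false).
func (s *Server) observeTerm(rpcTerm int) bool {
	switch {
	case rpcTerm > s.currentTerm:
		s.currentTerm = rpcTerm
		s.state = Follower // candidates and leaders step down immediately
		return true
	case rpcTerm < s.currentTerm:
		return false // requests with stale terms are rejected
	default:
		return true
	}
}
```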
- increment current term
- transitions to candidate state
- votes for itself and issues RequestVote to other servers
- a candidate wins if it gets votes from a majority of servers in the full cluster
- on winning, candidate becomes leader and sends heartbeats to other servers to establish authority
- if a candidate receives AppendEntries from a server claiming to be leader with a term >= the candidate's term, the candidate recognizes that leader as legitimate and returns to being a follower
- else, rejects the request and continues in candidate state
- could happen with split vote, etc.
- each candidate times out and starts a new election (incrementing its term and sending RequestVote)
- election timeouts are chosen randomly from a fixed interval to avoid this
- randomized retry approach more obvious and understandable
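
A rough Go sketch of the election mechanics above, in two pieces: picking a randomized timeout from a fixed interval, and the vote counting a candidate does once that timeout fires. All names are hypothetical; requestVote stands in for a blocking RequestVote RPC to one peer.

```go
package raft

import (
	"math/rand"
	"time"
)

// randomizedElectionTimeout picks a timeout uniformly from a fixed
// interval (e.g. 150-300ms) so split votes are rare and resolve quickly.
func randomizedElectionTimeout(min, max time.Duration) time.Duration {
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

// startElection outlines the candidate side once the timeout fires:
// increment the term, vote for itself, ask each peer for a vote, and
// win only with votes from a majority of the full cluster.
func startElection(currentTerm *int, numPeers int, requestVote func(peer, term int) bool) bool {
	*currentTerm++ // increment current term
	votes := 1     // votes for itself
	for peer := 0; peer < numPeers; peer++ {
		if requestVote(peer, *currentTerm) {
			votes++
		}
	}
	clusterSize := numPeers + 1
	return votes > clusterSize/2
}
```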
- leader retries AppendEntries indefinitely until all followers eventually store all log entries
- log entries store commands with the term number (from leader) and an integer index
- once committed, leader returns the result to the client
- followers reject the request if the consistency check fails (their log has no entry matching the index and term of the entry immediately preceding the new ones)
- on becoming leader, /nextIndex/ for each follower is initialized to the index just after the last entry in the leader's log
- after rejection due to inconsistency, /nextIndex/ is decremented for the follower and leader retries AppendEntries
- optimization can be done here to reduce the number of rejected AppendEntries; in practice, this should be infrequent
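
A sketch of the consistency check and the nextIndex backoff described above, with hypothetical names and 1-based log indexes as in the paper.

```go
package raft

// Entry is a single log entry: a command tagged with the leader's term.
type Entry struct {
	Term    int
	Command []byte
}

// consistencyCheck is the follower side of AppendEntries: the log must
// contain an entry at prevIndex whose term matches prevTerm, or the
// request is rejected. prevIndex 0 means the leader is sending from the
// very start of the log.
func consistencyCheck(log []Entry, prevIndex, prevTerm int) bool {
	if prevIndex == 0 {
		return true
	}
	return prevIndex <= len(log) && log[prevIndex-1].Term == prevTerm
}

// onAppendRejected is the leader side: after a rejection, decrement
// nextIndex for that follower and retry AppendEntries from earlier in
// the log until the consistency check passes.
func onAppendRejected(nextIndex map[string]int, follower string) {
	if nextIndex[follower] > 1 {
		nextIndex[follower]--
	}
}
```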
- thus, logs automatically converge (satisfying Log Matching) and the leader never overwrites or deletes entries in its own log (satisfying Leader Append-Only)
- ensures Leader Completeness
- prevents a candidate from winning unless its log has all committed entries
- "up-to-date" is determined by index and term of last entries in the logs
- leader cannot immediately conclude that an entry from a previous term is committed once it is stored on a majority of servers!
- therefore, only commits log entries from current term
- more conservative approach for simplicity
- easier to reason about entries since they maintain the same term number across logs
- proof by contradiction shows that, given Leader Completeness, State Machine Safety holds
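
The "up-to-date" comparison from the election restriction above, sketched in Go (function and parameter names are hypothetical): compare the terms of the last entries first, then break ties by log length.

```go
package raft

// logUpToDate reports whether a candidate's log, identified by the term
// and index of its last entry, is at least as up-to-date as the voter's.
// A voter only grants its vote when this holds, which is what keeps a
// candidate missing committed entries from winning an election.
func logUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}
```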
- handled by retrying indefinitely, since Raft RPCs are idempotent
- safety must not depend on timing
- /broadcastTime/ = average time for leader to send parallel requests to all servers and receive responses
- usually between 0.5ms and 20ms depending on the storage technology
- /electionTimeout/ (described above)
- /MTBF/ = mean time between failures (for a single server)
- /broadcastTime/ << /electionTimeout/ so leaders can reliably send heartbeats to prevent followers from starting elections
- /electionTimeout/ << /MTBF/ so system makes steady progress
- /electionTimeout/ should be as small as possible since the system is unavailable for roughly this period when a leader crashes
- usually between 10ms and 500ms
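
A toy illustration of the timing requirement with made-up example values; the only point is the ordering broadcastTime << electionTimeout << MTBF.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Example values only: broadcastTime is typically 0.5ms-20ms,
	// electionTimeout 10ms-500ms, and MTBF is months for one server.
	broadcastTime := 10 * time.Millisecond
	electionTimeout := 300 * time.Millisecond
	mtbf := 30 * 24 * time.Hour

	// An order of magnitude (or more) between each pair keeps leaders
	// able to heartbeat reliably and keeps elections rare.
	fmt.Println("broadcastTime << electionTimeout:", broadcastTime*10 <= electionTimeout)
	fmt.Println("electionTimeout << MTBF:", electionTimeout*10 <= mtbf)
}
```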
- use the parallel changes pattern (expand/contract) to handle gracefully
- /joint consensus/ for old and new configuration
- cluster configuration stored using special entries in the log
- new servers join as non-voting members, so they receive AppendEntries but are not counted toward majorities
- leader steps down (becomes a follower) once it commits the C_new configuration entry
- removed servers won't get heartbeats, so they will time out and start elections
- servers ignore RequestVote requests when they believe a current leader exists (if they're within the /electionTimeout/ of hearing from a current leader)
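
During joint consensus, agreement requires separate majorities from both the old and the new configuration; a minimal sketch (hypothetical names):

```go
package raft

// jointMajority reports whether the acknowledging servers form a
// majority of C_old AND a majority of C_new, which is what commitment
// and elections require while the joint configuration is in effect.
func jointMajority(acks map[string]bool, cOld, cNew []string) bool {
	return majority(acks, cOld) && majority(acks, cNew)
}

func majority(acks map[string]bool, config []string) bool {
	count := 0
	for _, server := range config {
		if acks[server] {
			count++
		}
	}
	return count > len(config)/2
}
```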
- snapshotting
- followers can also snapshot on their own
- InstallSnapshot from leader to followers that are too far behind
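
A sketch of the snapshot metadata as hypothetical Go structs: the last included index and term are retained so the AppendEntries consistency check on the entry following the snapshot still works.

```go
package raft

// Snapshot is the compacted log prefix plus the metadata a follower
// needs to keep participating in the protocol afterwards.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry the snapshot replaces
	LastIncludedTerm  int    // its term, kept for the AppendEntries consistency check
	StateMachine      []byte // serialized state machine as of LastIncludedIndex
}

// InstallSnapshotArgs is roughly what a leader sends to a follower that
// has fallen too far behind for ordinary log replication to catch up.
type InstallSnapshotArgs struct {
	Term     int    // leader's term
	LeaderID string // so the follower can redirect clients
	Data     Snapshot
}
```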
- clients pick a random server to connect to
- if the server is not the leader, it rejects the request and tells the client the current leader
- read-only requests can be handled without writing to the log, but linearizable reads must not return stale data
- each leader commits a no-op entry at the start of its term to guarantee it knows which of its entries are committed
- each leader exchanges heartbeats with a majority of the cluster before responding to a read-only request, to verify it has not been deposed
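
A client-side sketch of the redirect behavior above (hypothetical types; a real client would pick its first server at random and re-randomize after repeated failures):

```go
package raft

import "errors"

// Response is a hypothetical reply from a server: a non-leader rejects
// the request and names the leader it most recently heard from.
type Response struct {
	OK         bool
	LeaderHint string
}

var errNoLeader = errors.New("gave up: no reachable leader")

// submit picks a starting server and follows leader hints until the
// request is accepted. rpc stands in for a single client request to one
// named server.
func submit(servers []string, cmd []byte, rpc func(server string, cmd []byte) (Response, error)) error {
	target := servers[0] // assumes at least one known server
	for attempt := 0; attempt < 2*len(servers); attempt++ {
		resp, err := rpc(target, cmd)
		if err == nil && resp.OK {
			return nil
		}
		if resp.LeaderHint != "" {
			target = resp.LeaderHint // redirected to the current leader
		}
	}
	return errNoLeader
}
```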
- user study comparing students learning Paxos vs. Raft
- stuff I didn't understand
- without randomness, leader elections consistently took >10s due to split votes
- 5ms of randomness -> median downtime of 287ms
- 50ms of randomness -> worst case of 513ms (over 1000 trials)
- /electionTimeout/ of 12-24ms resulted in 35ms average (152ms max) to elect leader, but 150-300ms recommended