Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hello! I found some catastrophic consequences mainly caused by snapshot. Maybe they are bugs or misuses of the library. If some of them are bugs, it can cause data loss, log inconsistency and progress problems.
1-6 are related to snapshot. And 7-9 are other issues. The modifications of this PR may be too much. I can divide them into small fixes in different branches.
raft_recv_appendentries#L432 when ae->prev_log_idx==0, entries will be appended unconditionally. However, log may have been compacted and corresponding compacted AE entries will be appended to log again (maybe caused by a network packet duplication or a leader heartbeat). See testcase TestCluster_follower_log_no_more_than_leader_after_ae_success. This can further lead to log inconsistency, such as committed log not replicated to majority servers (see testcase TestCluster_trace_0_81007_inv_2_committed_log_replicated_majority)
raft_send_appendentries#L901 AE with empty entries can be sent while it has log entries to send, causing the Follower to update the commit idx incorrectly. See testcase TestCluster_ae_should_contain_entries_if_not_synchronized. This can further lead to log inconsistency, such as committed log being rolled back and triggering code assertion. See testcase TestCluster_trace_4_63114_inv_3_committed_log_is_durable and TestCluster_trace_3_99878_assertion_set_commit_idx_le_current_idx.
raft_begin_load_snapshot#L1377 log may mismatch. It is necessary to load the snapshot in this case. It causes the server lagged behind until receiving next snapshot. See testcase TestCluster_handle_snapshot_make_progress.
raft_begin_load_snapshot#L1383 changing current_term directly is dangerous. It causes current term is not monotonic. See testcase TestCluster_current_term_is_monotonic. Can lead to two leaders in one term.
raft_recv_appendentries#L452 log at prev_log_idx may have been compacted. In this case, compacted logs should be treated as committed logs and remaining matching logs should be appended to log. This causes performance degradation. See testcase TestCluster_ae_retry_once
raft_begin_load_snapshot#L1407 memory leak found by valgrind. The pointer reference of me->nodes[0] is definitely lost if me->nodes[0] is not the me node. I noticed that @yossigo fixed this bug (PR Fix memory leak on snapshot loading. #98 ).
raft_recv_appendentries_response#L317 next_idx can be decreased to equal match idx, making matched idx retransmitted. It does not seem harmful. However, it is better not to retransmit already matched logs. See testcase TestCluster_next_idx_greater_than_match_idx. I also noticed that it is introduced by a potential bug fix AE error response not properly handled. #97 . The case "a node that is not monotonic" may be caused by network failure and another issue (described in 6.)
raft_send_appendentries_all#L952 it returns immediately if raft_send_appendentries returns a non-zero value. However, the return value maybe RAFT_ERR_NEEDS_SNAPSHOT. It seems that sending AE to different servers should be irrelevant. Plus 3. mentioned above, it can cause the entire cluster to fail to progress. See testcase TestCluster_handle_snapshot_make_progress. I noticed that issue raft_send_appendentries_all logic #79 also mentioned it.
raft_get_last_log_term#L216 current_idx maybe a snapshot, in this condition, returning 0 causing other servers do not grant vote. If all server in the cluster's log are compacted, then no Leader can be elected. See testcase TestCluster_will_exist_leader
After fixing above issues, some tests are violated.
I find a testcase's name is TestRaft_leader_sends_appendentries_when_node_next_index_was_compacted. It seems that sending AE when next index is compacted is a required behavior? If so, it may be a misuse of the library.
I find the testcase TestRaft_leader_sends_appendentries_with_correct_prev_log_idx_when_snapshotted gets node by array index directly (Seems that nodes should not be freed). Maybe I misuses the raft_begin_load_snapshot API.
I look forward to your feedback! Thank you.