Fix bugs mainly caused by snapshot #118

tangruize · 2021-08-27T17:57:03Z

Hello! I found some catastrophic consequences mainly caused by snapshot. Maybe they are bugs or misuses of the library. If some of them are bugs, it can cause data loss, log inconsistency and progress problems.

1-6 are related to snapshot. And 7-9 are other issues. The modifications of this PR may be too much. I can divide them into small fixes in different branches.

raft_recv_appendentries#L432 when ae->prev_log_idx==0, entries will be appended unconditionally. However, log may have been compacted and corresponding compacted AE entries will be appended to log again (maybe caused by a network packet duplication or a leader heartbeat). See testcase TestCluster_follower_log_no_more_than_leader_after_ae_success. This can further lead to log inconsistency, such as committed log not replicated to majority servers (see testcase TestCluster_trace_0_81007_inv_2_committed_log_replicated_majority)
raft_send_appendentries#L901 AE with empty entries can be sent while it has log entries to send, causing the Follower to update the commit idx incorrectly. See testcase TestCluster_ae_should_contain_entries_if_not_synchronized. This can further lead to log inconsistency, such as committed log being rolled back and triggering code assertion. See testcase TestCluster_trace_4_63114_inv_3_committed_log_is_durable and TestCluster_trace_3_99878_assertion_set_commit_idx_le_current_idx.
raft_begin_load_snapshot#L1377 log may mismatch. It is necessary to load the snapshot in this case. It causes the server lagged behind until receiving next snapshot. See testcase TestCluster_handle_snapshot_make_progress.
raft_begin_load_snapshot#L1383 changing current_term directly is dangerous. It causes current term is not monotonic. See testcase TestCluster_current_term_is_monotonic. Can lead to two leaders in one term.
raft_recv_appendentries#L452 log at prev_log_idx may have been compacted. In this case, compacted logs should be treated as committed logs and remaining matching logs should be appended to log. This causes performance degradation. See testcase TestCluster_ae_retry_once
raft_begin_load_snapshot#L1407 memory leak found by valgrind. The pointer reference of me->nodes[0] is definitely lost if me->nodes[0] is not the me node. I noticed that @yossigo fixed this bug (PR Fix memory leak on snapshot loading. #98 ).
raft_recv_appendentries_response#L317 next_idx can be decreased to equal match idx, making matched idx retransmitted. It does not seem harmful. However, it is better not to retransmit already matched logs. See testcase TestCluster_next_idx_greater_than_match_idx. I also noticed that it is introduced by a potential bug fix AE error response not properly handled. #97 . The case "a node that is not monotonic" may be caused by network failure and another issue (described in 6.)
raft_send_appendentries_all#L952 it returns immediately if raft_send_appendentries returns a non-zero value. However, the return value maybe RAFT_ERR_NEEDS_SNAPSHOT. It seems that sending AE to different servers should be irrelevant. Plus 3. mentioned above, it can cause the entire cluster to fail to progress. See testcase TestCluster_handle_snapshot_make_progress. I noticed that issue raft_send_appendentries_all logic #79 also mentioned it.
raft_get_last_log_term#L216 current_idx maybe a snapshot, in this condition, returning 0 causing other servers do not grant vote. If all server in the cluster's log are compacted, then no Leader can be elected. See testcase TestCluster_will_exist_leader

After fixing above issues, some tests are violated.

I find a testcase's name is TestRaft_leader_sends_appendentries_when_node_next_index_was_compacted. It seems that sending AE when next index is compacted is a required behavior? If so, it may be a misuse of the library.

I find the testcase TestRaft_leader_sends_appendentries_with_correct_prev_log_idx_when_snapshotted gets node by array index directly (Seems that nodes should not be freed). Maybe I misuses the raft_begin_load_snapshot API.

I look forward to your feedback! Thank you.

.gitignore: *.a *.so *.gcda .hypothesis *.gcov CLinkedListQueue/ tests/main_test.c tests_main python make clean: GCOV_OUTPUT tests.c CLinkedListQueue .hypothesis Makefile CFLAGS: add override keyword to override cmd assignment virtraft2.py: remove namedtuple verbose argument (removed since python3.7)

test_cluster.h: virtual network, raft controller, fault injection, and invariant checking functions test_cluster.c: testcases that violate invariants (constructed from TLA+ traces) test_cluster_more.c: generated testcases

1. (critical) raft_recv_appendentries: in the case ae->prev_log_idx==0 and log has been compacted, causing logs that have been compacted appended to log again (further leading to log inconsistency, such as committed log not replicated to majority servers). 2. raft_recv_appendentries_response: next_idx can be decreased to equal match idx, making matched idx retransmitted 3. (critical) raft_send_appendentries: empty AE can be sent while it has log entries to send, causing the Follower to update the commit idx incorrectly (further leading to log inconsistency, such as committed log being rolled back) 4. (critical) raft_begin_load_snapshot: log may mismatch. It is necessary to load the snapshot in this case. (causing the server lagged behind until receiving next snapshot) 5. (critical) raft_begin_load_snapshot: changes current_term directly, which is dangerous. And can lead to two leaders in one term. 6. (critical) raft_recv_appendentries: log at prev_log_idx may have been compacted. In this case, compacted log should be treated as committed logs and remaining matching logs should be appended to log. 7. (critical) raft_begin_load_snapshot memory leak. The pointer reference of me->nodes[0] is definitely lost if me->nodes[0] is not the me node 8. raft_send_appendentries_all returns immediately if raft_send_appendentries returns a non-zero value. However, the return value maybe RAFT_ERR_NEEDS_SNAPSHOT. It seems that sending AE to different servers should be irrelevant. (plus bug 4, it can cause the entire cluster to fail to progress) 9. (critical) raft_get_last_log_term current_idx maybe a snapshot, in this condition, returning 0 causing other servers not grant vote. If all server in the cluster's log are compacted, then no Leader can be elected.

test_server.c: ae.prev_log_term should be 0 if ae.prev_log_idx is 0. TestRaft_follower_load_from_snapshot committed entry confict TestRaft_follower_load_from_snapshot_does_not_break_cluster_safety log mismatch TestRaft_leader_sends_snapshot_when_node_next_index_was_compacted should send snapshot if next idx was compacted TestRaft_leader_not_send_appendentries_when_snapshotted should send snapshot TestRaft_leader_sends_snapshot_if_log_was_compacted it is ok to send snapshot when request timeout

tangruize added 4 commits August 27, 2021 22:25

add testcases

fed1fa6

test_cluster.h: virtual network, raft controller, fault injection, and invariant checking functions test_cluster.c: testcases that violate invariants (constructed from TLA+ traces) test_cluster_more.c: generated testcases

tangruize mentioned this pull request Aug 28, 2021

Found bugs mainly caused by snapshot RedisLabs/raft#49

Closed

tangruize changed the title ~~Issues found by TLA+~~ Fix bugs mainly caused by snapshot Sep 2, 2021

tangruize mentioned this pull request Oct 11, 2021

Bad format of Json trace printing tlaplus/CommunityModules#47

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bugs mainly caused by snapshot #118

Fix bugs mainly caused by snapshot #118

tangruize commented Aug 27, 2021 •

edited

Loading

Fix bugs mainly caused by snapshot #118

Are you sure you want to change the base?

Fix bugs mainly caused by snapshot #118

Conversation

tangruize commented Aug 27, 2021 • edited Loading

tangruize commented Aug 27, 2021 •

edited

Loading