AE error response not properly handled. #97

yossigo · 2018-09-27T20:07:00Z

Errors were ignored if the node's match_idx == next_idx - 1, I assume (??) as a
precaution against silently feeding a node that is not monotonic.

However, this does not consider the case where send_appendentries()
relies on snapshot_last_idx, which may be greater than the node's index.

@willemt can you think of unwanted side effects for this?

To clarify, the desired behavior is: 1) Avoid false assertions. 2) Silently ignore AE responses which are stale or not monotonic (i.e. current index < match index).
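The two rules above can be sketched roughly as follows. This is a simplified illustration, not the library's code: the `sketch_*` types, the field names, and the way `next_idx` is reset on a rejection are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical, simplified types for illustration; not the library's structs. */
typedef struct {
    long match_idx; /* highest log index known to be replicated on the node */
    long next_idx;  /* next log index the leader intends to send */
} sketch_node_t;

typedef struct {
    long term;        /* responder's current term */
    bool success;     /* whether the follower accepted the entries */
    long current_idx; /* responder's reported log index */
} sketch_ae_response_t;

typedef enum {
    SKETCH_IGNORE,    /* stale or non-monotonic: drop silently */
    SKETCH_STEP_DOWN, /* responder has a newer term */
    SKETCH_RETRY,     /* rejected: reset next_idx and resend */
    SKETCH_ADVANCE    /* accepted: move match_idx forward */
} sketch_action_t;

sketch_action_t sketch_handle_ae_response(sketch_node_t *node,
                                          long leader_term,
                                          const sketch_ae_response_t *r)
{
    /* A newer term means this leader is out of date. */
    if (r->term > leader_term)
        return SKETCH_STEP_DOWN;

    /* Stale: the response belongs to an earlier term. */
    if (r->term < leader_term)
        return SKETCH_IGNORE;

    /* Not monotonic: the reported index is behind what is already matched.
     * Ignore instead of asserting. */
    if (r->current_idx < node->match_idx)
        return SKETCH_IGNORE;

    if (!r->success) {
        /* Assumed retry policy: resume right after the reported index. */
        node->next_idx = r->current_idx + 1;
        return SKETCH_RETRY;
    }

    node->match_idx = r->current_idx;
    node->next_idx = r->current_idx + 1;
    return SKETCH_ADVANCE;
}
```

With a node at match_idx 5, a failed response reporting current_idx 2 is dropped without touching the node's state, while a failed response reporting current_idx 6 leads to a retry from index 7.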

willemt · 2018-10-06T02:23:01Z

This is a good fix. It matches the semantics of the raft paper better.
Confirmed in virtraft that nothing breaks.

tangruize · 2021-07-15T14:21:07Z

Hello! I found that this fix may cause raft_recv_appendentries_response() to send an empty retry appendentries message. The main reason is that not all stale msgs are filtered out.

Consider this trace in a 3-server cluster:
1. S1 becomes Leader (term 1).
2. S1 sends heartbeat msg m1 to S2.
3. S1 restarts and becomes Follower.
4. S1 times out and starts a new election (term 2).
5. S2 votes for S1 and updates its term to 2.
6. S1 becomes Leader again (term 2).
7. S2 receives msg m1 and replies with msg m2: [term: 2, success: false, current_index: 0], because m1 carries a smaller term.
8. S1 receives msg m2, but m2 is not filtered out as a stale msg, so S1 replies with an empty retry appendentries.

However, this is not even a bug. I just think it is inconsistent with the semantics that the program wants to express.
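To make the trace concrete, here is a minimal model of why m2 slips through. It assumes the stale filter amounts to a term comparison plus the current index < match index check, and that S1 tracks match index 0 for S2 right after being re-elected. The `trace_*` names are hypothetical and not taken from the library.

```c
#include <stdbool.h>

/* Hypothetical model of an AE response; field names are illustrative. */
typedef struct {
    long term;
    bool success;
    long current_idx;
} trace_ae_response_t;

/* Returns true if the leader would drop the response as stale,
 * under the assumed two-part filter (older term, or index behind match_idx). */
bool trace_is_filtered(long leader_term, long match_idx,
                       const trace_ae_response_t *r)
{
    return r->term < leader_term || r->current_idx < match_idx;
}

/* Replays the last step of the trace: S1 (leader, term 2) receives m2. */
bool trace_m2_triggers_retry(void)
{
    /* m2 as sent by S2, which had already moved to term 2 when it rejected m1. */
    const trace_ae_response_t m2 = { .term = 2, .success = false, .current_idx = 0 };
    const long leader_term = 2;
    const long match_idx = 0; /* assumed value for S2 after S1's re-election */

    /* Same term and index not behind match_idx, so the filter lets m2 through;
     * since success is false, the leader goes on to send a retry. */
    return !trace_is_filtered(leader_term, match_idx, &m2) && !m2.success;
}
```

In this model m2 carries the leader's current term even though it answers a term-1 message, so a term comparison alone cannot tell it apart from a genuine rejection.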

yossigo added 2 commits September 27, 2018 23:04

Better fix to previous AE response issue.

f8236e0

willemt merged commit 4ddc273 into willemt:master Oct 6, 2018

tangruize mentioned this pull request Aug 27, 2021

Fix bugs mainly caused by snapshot #118

Open

tangruize mentioned this pull request Dec 6, 2021

Found bugs mainly caused by snapshot RedisLabs/raft#49

Closed

tezc deleted the fix/ae-failure-handling branch October 28, 2022 19:09
