Resend our prevote for the locked round when we receive a propose #43
Conversation
To provide some extra clarity on the problem of partitioning: it is possible, due to practical networking conditions, that at height H and round R+K (for K>0) the nodes end up in two groups: (A) those that are locked on a block at round R, and (B) those that are not locked on any block. It is unlikely, but possible, that this partition happens in such a way that neither (A) nor (B) has more than 2F nodes. We have seen this once across all of our deployed networks over months of testing.

In this situation, the nodes in (A) will never accept blocks proposed by nodes in (B): the nodes in (A) are locked on a block (so they will only propose/accept that block), while the nodes in (B) are not (so they will propose different blocks). According to the Tendermint algorithm, the only way to resolve this is for the nodes in (B) to accept and prevote the locked block. However, this can only happen if 2F+1 prevotes are seen for the locked block from the round that caused nodes to become locked. This is where we were having issues: network failures could cause these prevotes to be lost.

The solution: whenever you see a propose with a non-negative valid round R (i.e. the proposal is from a node in (A) that is locked on some block from some valid round R), you re-broadcast your prevote from the current height and the proposal's valid round. This ensures everyone can catch up on any potentially missing prevotes, and under normal operational circumstances it does not introduce any new messages into the system (so performance is not sacrificed).
LGTM. This fix was deployed to a live-locked network, and it resolved the live-lock without further intervention.
This PR fixes an issue with consensus that may arise from messages being dropped at a certain height or round. Specifically, the network can end up in a partitioned state if a subset of the validators (fewer than `2f+1`) receive sufficient non-nil prevotes for a given propose while the remaining nodes do not, meaning they precommit nil. The validators then progress to the next round, and do so indefinitely, as the two groups will never agree on a proposal from the opposing group. To fix this, upon receiving a proposal with a valid round, we simply broadcast our prevote for that round to others.

Additionally, this PR fixes the logic for scheduling timeouts on boot. Previously, validators would only schedule the precommit timeout if they were at the precommit step, whereas according to the specification this timer should be started regardless of which step they are in, so long as the number of precommits exceeds `2f`.

Finally, the logic for calling `startRound` inside the message handlers was incorrect, as it used the number of unique messages for the given message type to calculate `n`. This should instead be the total number of unique messages at a given height and round, irrespective of type.