Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resend our prevote for the locked round when we receive a propose #43

Merged
merged 6 commits into from
Mar 19, 2020

Conversation

jazg
Copy link
Member

@jazg jazg commented Mar 19, 2020

This PR fixes an issue with consensus that may arise from messages being dropped at a certain height or round. Specifically, the network can end up in a partitioned state if a subset of the validators (less than 2f+1) receive sufficient non-nil prevotes for a given propose, while the remaining nodes do not meaning they precommit nil. The validators then progress to the next round and do so indefinitely as the two groups will never agree on a proposal from an opposing group. To fix this, upon receiving a proposal with a valid round we simply broadcast our prevote for that given round to others.

Additionally, this PR fixes the logic for scheduling timeouts on boot. Previously validators would only schedule the precommit timeout if they were at the precommit step, whereas based on the specification this timer should be started regardless of what step they are in, so long as the number of precommits exceeds 2f.

Finally, the logic for calling startRound inside the message handlers was incorrect as it used the number of unique messages for the given message type to calculate n. This should instead be the total number of unique messages at a given height and round irrespective of type.

@jazg jazg added bug Something isn't working enhancement New feature or request labels Mar 19, 2020
@jazg jazg requested a review from loongy March 19, 2020 00:32
@jazg jazg self-assigned this Mar 19, 2020
@loongy
Copy link
Contributor

loongy commented Mar 19, 2020

To provide some extra clarity on the problem of partitioning. It is possible that, due to practical networking conditions, there end up being two groups of nodes at height H and round R+K for K>0: (A) those that are locked on a block at round R, and (B) those that are not locked on any block.

It is unlikely, but possible, that this partition happens in such a way that neither (A) nor (B) has more than 2F nodes. We have seen this once across all of our deployed networks over months of testing. In this situation, the nodes from (A) will never accept blocks proposed by nodes from (B), because the nodes in (A) are locked on a block (so will only propose/accept that block) and the nodes in (B) are not (so will propose different blocks).

According to the Tendermint algorithm, the only way to resolve this problem is for the nodes in (B) to accept and prevote the locked block. However, this can only happen if 2F+1 prevotes are seen for the locked block from the round that cause nodes to become locked. This is where we were having issues. Network failures could cause these prevotes to be lost.

The solution: whenever you see a propose with a valid round R that is non-negative (ie the proposal is from a node in (A) that is locked on some block from some valid round R), you re-broadcast your prevote from the current height and the proposal’s valid round. This ensures everyone can catch up on any potentially missing prevotes, and under normal operational circumstances it does not introduce any new messages into the system (so performance is not sacrificed).

Copy link
Contributor

@loongy loongy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This fix was deployed to a live-locked network, and it resolved the live-lock without further intervention.

@jazg jazg changed the base branch from master to release/1.1.1 March 19, 2020 03:33
@jazg jazg merged commit cb1a4a0 into release/1.1.1 Mar 19, 2020
@jazg jazg deleted the fix/missed-prevotes branch March 19, 2020 03:34
@jazg jazg mentioned this pull request Mar 19, 2020
@jazg jazg changed the title Resend our prevote for the locked round inside a propose Resend our prevote for the locked round when we receive a propose Mar 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants