Resend our prevote for the locked round when we receive a propose #43
Conversation
To provide some extra clarity on the problem of partitioning: it is possible, due to practical networking conditions, that at height H and round R+K (for K>0) the nodes end up in two groups: (A) those that are locked on a block at round R, and (B) those that are not locked on any block. It is unlikely, but possible, that this partition happens in such a way that neither (A) nor (B) has more than 2F nodes. We have seen this once across all of our deployed networks over months of testing.

In this situation, the nodes in (A) will never accept blocks proposed by nodes in (B): the nodes in (A) are locked on a block (so they will only propose/accept that block), while the nodes in (B) are not (so they will propose different blocks). According to the Tendermint algorithm, the only way to resolve this is for the nodes in (B) to accept and prevote the locked block. However, this can only happen if 2F+1 prevotes are seen for the locked block from the round that caused nodes to become locked. This is where we were having issues: network failures could cause these prevotes to be lost.

The solution: whenever you see a propose with a non-negative valid round R (i.e. the proposal is from a node in (A) that is locked on some block from some valid round R), you re-broadcast your prevote from the current height and the proposal's valid round. This ensures everyone can catch up on any potentially missing prevotes, and under normal operational circumstances it does not introduce any new messages into the system (so performance is not sacrificed).
LGTM. This fix was deployed to a live-locked network, and it resolved the live-lock without further intervention.
This PR fixes an issue with consensus that may arise from messages being dropped at a certain height or round. Specifically, the network can end up in a partitioned state if a subset of the validators (fewer than `2f+1`) receive sufficient non-nil prevotes for a given propose while the remaining nodes do not, meaning they precommit nil. The validators then progress to the next round, and do so indefinitely, as the two groups will never agree on a proposal from the opposing group. To fix this, upon receiving a proposal with a valid round, we simply broadcast our prevote for that round to others.

Additionally, this PR fixes the logic for scheduling timeouts on boot. Previously, validators would only schedule the precommit timeout if they were at the precommit step, whereas according to the specification this timer should be started regardless of which step they are in, so long as the number of precommits exceeds `2f`.

Finally, the logic for calling `startRound` inside the message handlers was incorrect, as it used the number of unique messages for the given message type to calculate `n`. This should instead be the total number of unique messages at a given height and round, irrespective of type.