Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Commit Timeout by splitting into a Quorum and Remainder Commit #7945

Closed
4 tasks
cmwaters opened this issue Nov 10, 2021 · 16 comments
Closed
4 tasks
Labels
S:proposal Status: Proposal stale for use by stalebot

Comments

@cmwaters
Copy link
Contributor

Protocol Change Proposal

Summary

TimeoutCommit is a local consensus parameter which indicates the duration a node waits to receive additional precommit votes after having received the necessary 2/3 required to commit the block. The reason for this timeout is that having more precommits votes decreases the likelihood that a fork was produced in this block by the fact that more voting power would have been required to generate the fork (i.e. seeing 2/3+ votes means that 1/3+ could have double voted to generate a fork whilst 90% votes means that it would take at least 57% (67% - (100%-90%)) of voting power to double sign). I propose an alternative method which removes TimeoutCommit, speeding up consensus, whilst still allowing for greater confidence in network integrity. The cost of this proposal is greater technical complexity.

Proposal

Proposers of Tendermint currently append the precommit votes they saw in the previous height to the block they propose at the current height. This cannonicalizes the commit once the proposed block is agreed upon in consensus. Rather than having a single LastCommit, I propose separating these into two values: a QuorumCommit and a RemainderCommit.

The QuorumCommit constitutes the necessary 2/3+ precommits for height h - 1, whilst the RemainderCommit constitutes the remaining precommits for height h - 2. Thus validators have the duration of an entire height with which to still collect votes for the RemainderCommit. Note, that it's also feasible to extend this to 3, 10 or even 100 heights in the past but this adds further complexity which I don't think is necessary.

Persistence of Commits

Once a block gets committed, the node should add the RemainderCommit to the QuorumCommit and store this at the height of the block it is associated with to make it easier to retrieve the SignedHeader for light client verification (which is also used for block sync and state sync).

Verification

Nodes using sequential verification (i.e. for block sync) will receive the SignedHeader and only need to check the signatures of the quorum commit to verify the Header (and thus the rest of the Block). Nodes using skipping verification (i.e. light clients) will figure out the overlay and may require signatures from both commits to verify the SignedHeader.

Signature Aggregation

It is likely we will adopt some form of signature aggregation in the near future. In this case the signatures in the quorum commit will be aggregated and the signatures in the remainder commit will be aggregated. Perhaps there might need some more thought into what is the best structure that supports both consensus and light client verification.

Just a final clarification: this is just an idea, none of this has been formally verified in any capacity and their might be holes in my thinking. I welcome anyone to challenge or build on top of this :)


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@cmwaters cmwaters added the S:proposal Status: Proposal label Nov 10, 2021
@josef-widder
Copy link
Contributor

I disagree with the summary. As far as I understand TimeoutCommit is there for liveness. If we have received 2/3 of commits, up to 1/3 might be from Byzantine nodes, for a bad blocks. This means that we need not have 2/3 precommits for the same block yet. The timeout is there to wait for the remaining correct validators to help us to get above the 2/3 threshold. Without this timeout, faulty validators could always be faster and prevent a decision.

Therefore, removing the timeout will remove liveness guarantees.

@cmwaters
Copy link
Contributor Author

I don't quite understand this explanation about affecting liveness. TimeoutCommit is currently a local parameter that can be set to 0. We even have a bool flag that skips TimeoutCommit. There have also been cases where validators on the CosmosHub have actually skipped TimeoutCommit which although prevents votes from making it in the commit, gives me the impression that TimeoutCommit isn't an integral component to the protocol's liveness.

This means that we need not have 2/3 precommits for the same block yet

I was under the impression that TimeoutCommit only started when a node received 2/3+ precommits for the same block.

@cmwaters
Copy link
Contributor Author

Just looking at the code now, I can't even see when/where TimeoutCommit gets used

@josef-widder
Copy link
Contributor

I don't quite understand this explanation about affecting liveness. TimeoutCommit is currently a local parameter that can be set to 0. We even have a bool flag that skips TimeoutCommit. There have also been cases where validators on the CosmosHub have actually skipped TimeoutCommit which although prevents votes from making it in the commit, gives me the impression that TimeoutCommit isn't an integral component to the protocol's liveness.

If in practice the first 2/3 precommits agree on the block it does not prevent liveness. If they disagree it should. My point about liveness was about faulty executions. I guess they never happen on the hub.

This means that we need not have 2/3 precommits for the same block yet

I was under the impression that TimeoutCommit only started when a node received 2/3+ precommits for the same block.

In the arXiv paper the algorithm counts all precommits.

@cason could you let us know what do you think?

@cason
Copy link
Contributor

cason commented Nov 11, 2021

I think that TimeoutCommit is there in order to collect as many Precommits message as possible to include them in the next block. One reason for that is rewarding all the validators that contributed with the validation and commit of a block. This situation would be covered by @cmwaters proposal, as additional Precommits will still be gathered during the next height, and somehow also considered in the future for rewarding and block validation.

The timeout that is required for ensuring liveness is TimeoutPrecommit. It is triggered when a validator receives 2/3+ disagreeing Precommits messages. In this case, no block will be committed in that round, and waiting before entering a new round ensures that locked and valid values will be updated by correct validators. Locked and valid values are updated upon receiving Prevote messages, not Precommit messages.

In summary, it is common to confuse this two timeouts with similar names, but I think that only the TimeoutPrecommit is a requirement for liveness. TimeoutCommit does not even appears in the paper, and for me it was always something related to proof-of-stake and not to consensus itself. But checking this with @milosevic will be nice.

@cason
Copy link
Contributor

cason commented Nov 11, 2021

If we have received 2/3 of commits, up to 1/3 might be from Byzantine nodes, for a bad blocks. This means that we need not have 2/3 precommits for the same block yet. The timeout is there to wait for the remaining correct validators to help us to get above the 2/3 threshold. Without this timeout, faulty validators could always be faster and prevent a decision.

I didn't fully get the point here either. If a validator receives 2/3+ voting-power-weighted commits for a block, the block is decided. While it is true that 1/3 of the votes can come from Byzantine validators, we still have 1/3+ votes from correct validators. This becomes clearer when we ignore different voting powers: 2f+1 votes for a block makes it decided; f of them may come from faulty validators, but f+1 of them are from correct validators. Since a majority of the votes in the quorum are from correct validators, no distinct value can be decided in future rounds.

In addition, the timeoutCommit is scheduled after committing a block. It is a time validators wait before proceeding to the next height. There is no possibility to revert a commit during this interval, the only thing that a validator can do is to receive additional endorsements supporting the commit of the block, making it more reliable.

@cason
Copy link
Contributor

cason commented Nov 11, 2021

Just looking at the code now, I can't even see when/where TimeoutCommit gets used

This timeout is scheduled in the NewHeight round step, by the updateToState method. This method is called after the commit of the block in the previous height, which occur after all transactions included in the committed block being executed.

@cason
Copy link
Contributor

cason commented Nov 11, 2021

TimeoutCommit is currently a local parameter that can be set to 0. We even have a bool flag that skips TimeoutCommit.

Well, this flag is meant to skip the timeout when all the Precommits for the previous blocks are received:
if cs.config.SkipTimeoutCommit && precommits.HasAll()

In other words, since the point of this timeout is to retrieve as many Precommits as possible, when I got all of them I should not need to wait any longer.

@cason
Copy link
Contributor

cason commented Nov 11, 2021

Regarding the proposal, I am not sure if and how you can rewrite a block, in order to include additional Precommits received after the first 2/3+.

This would allow, for instance, two validators having different sets of Precommits for a given block stored? Would this make the two blocks actually not equal?

@cmwaters
Copy link
Contributor Author

Regarding the proposal, I am not sure if and how you can rewrite a block, in order to include additional Precommits received after the first 2/3+.

You wouldn't be rewriting a block. The additional precommits would be added in the next block (h + 2) by the next proposer. It just means that nodes need to keep votes around for two heights instead of one.

@Alex-duzhichao
Copy link

I think that TimeoutCommit is there in order to collect as many Precommits message as possible to include them in the next block. One reason for that is rewarding all the validators that contributed with the validation and commit of a block. This situation would be covered by @cmwaters proposal, as additional Precommits will still be gathered during the next height, and somehow also considered in the future for rewarding and block validation.

The timeout that is required for ensuring liveness is TimeoutPrecommit. It is triggered when a validator receives 2/3+ disagreeing Precommits messages. In this case, no block will be committed in that round, and waiting before entering a new round ensures that locked and valid values will be updated by correct validators. Locked and valid values are updated upon receiving Prevote messages, not Precommit messages.

In summary, it is common to confuse this two timeouts with similar names, but I think that only the TimeoutPrecommit is a requirement for liveness. TimeoutCommit does not even appears in the paper, and for me it was always something related to proof-of-stake and not to consensus itself. But checking this with @milosevic will be nice.

@cason Do you have read the Hotstuff paper?
In the section 4.4 of Hotstuff paper, the author says "The synchronous delay introduced by Tendermint abandones (Optimistic) Responsiveness". I'm wondering whether the synchronous delay means TimeoutCommit ?
What's the meaning of Responsiveness? Is Responsiveness and Liveness the same thing?
And why Tendermint abandones (Optimistic) Responsiveness?

@cason
Copy link
Contributor

cason commented Dec 3, 2021

Hi Alex, I've read HotStuff paper and I'm quite positive they refer to the TimeoutPrecommit.

The main idea of responsiveness is that once the systems starts to behave synchronous, you decide within a fixed amount of rounds. You cannot ensure that a decision happens (liveness) while the system is asynchronous, because messages may not arrive or may be arbitrarily delayed. But when the system becomes synchronous, you should be able to decide, but in case of Byzantine protocols this is not that simple. First, because the proposer of the first round in the synchronous period can be Byzantine, which can prevent progress in any protocol until it is replaced by a correct proposer. Second, and this is specific of Tendermint, two correct processes may start a round in the synchronous period having different views of the consensus state. More specifically, a correct process might not know that a value was locked by other correct processes. In this case, even in a synchronous period and it being a correct process, it can propose a value that other correct processes will reject because they are looked in another value.

This problem is solved in different ways by Tendermint and HotStuff. In one of the several versions of HotStuff, an additional communication step is added to ensure that if a correct process locked a value, all correct processes will learn about that before the end of the round. This additional round has this only purpose, as it does not is required for safety, nor for liveness, but only for responsiveness. But it is an additional communication step that is executed in all rounds, so it adds some latency also for the regular (failure-free) operation of HotStuff.

Tendermint deals with the same problem by adopting the TimeoutPrecommit approach. Differently from HotStuff, it is not a regular communication step, so you cannot guarantee that all correct processes will learn about locked values, but once the system becomes synchronous it is enough for that. Also, it is a delay that is only introduced when a round fails, so it does not delay the regular (fail-free) execution of the protocol.

@williambanfield
Copy link
Contributor

@cason Thanks for posting this explanation. To help me understand, do you mean that the waiting that Tendermint does in the precommit step is in place of the extra communication step in the hotstuff algorithm? Hotstuff does the extra consensus step to make sure that if a value is locked in a round, everyone will see it with this extra step. Tendermint does not ensure that the locked value will be seen by everyone in a round, but it does ensure that if a process missed the locked value, it will still make progress to the next round upon seeing the quorum of votes from other processes. Are they analogous because they both ensure that processes will make progress to the next round and not lock an incorrect value?

@cason
Copy link
Contributor

cason commented Dec 6, 2021

Tendermint achieves this property of everyone being able to learn about a value locked by some correct processes due to gossip, as the votes that lead to a lock are eventually received by all processes. This does not happens with HotStuff, as Byzantine processes may send votes that enable a lock to only some processes, instead of all of them.

The votes of Byzantine processes usually don't have impact on BFT protocols, as they voting power is limited to f and 2f+1 votes are required to accept, lock, and decide a value. This means that we can lock a value regardless Byzantine votes, and that two values cannot be locked in the same round, even with Byzantine processes voting for both.

Votes from Byzantine processes become an issue in a scenario when correct processes are partitioned among two distinct values. In this scenario, Byzantine processes may use their votes to lead some, few (<f) correct processes to lock a value, while the remaining correct processes consider that no progress was achieved in the round. In the next rounds, the correct processes that locked a value will refuse other values, so they will not cooperate to reach a decision. The Byzantine processes may them play with their votes to have some other set of correct processes to lock the proposed value, while hiding this lock from most of the correct processes, in particular the ones already locked in another value. From this point, with two small sets of correct processes locked into distinct values in different rounds, we may have a liveness issue.

The above scenario is known as hidden lock attack, and might, in the worst case, put the system in a bivalent state that prevents progress for good. That is, Byzantine processes, by playing very well with their votes and relying on some substantial amount of asynchrony, can make to system to jump between locked values, so that to never decide a value.

HotStuff prevents the hidden lock attack by adding a communication step when correct processes inform other about values they locked in a round. Tendermint deals with the hidden lock attack by adding a new variable, validValue that stores values that might have been locked by processes in a round. The extra delay in Tendermint, via the timeout Precommit, is intended to give enough time to all processes update their validValue accordingly.

The difference between the approaches is that in Tendermint the validValue is only guaranteed to be set to a value that was locked in the round when the system behaves synchronously, so after GST. Before that, Tendermint may proceed to a new round of consensus in the presence of hidden locks, which as above mentioned, may prevent progress. While in HotStuff, the corresponding to their validValue is guaranteed to be set to a value that was locked in a round due to the added communication step, that is, even when the system is yet in asynchronous operation.

@Alex-duzhichao
Copy link

The main idea of responsiveness is that once the systems starts to behave synchronous, you decide within a fixed amount of rounds.

Hi @cason, Thanks for your explaination.
After some dive into the responsiveness, in my opinion, it means that after GST, a non-faulty leader can reach consensus in time depending only on the auctual network delays, independent of the predefined timeout.
Does what you said "... decide within a fixed amount of rounds" mean the same thing of my understanding?

@cason
Copy link
Contributor

cason commented Dec 9, 2021

Yes, you cannot ensure decision in less than f rounds, because they can be coordinated by Byzantine processes. But once a round started after GST is coordinated by a correct processes, decision is guaranteed.

In Tendermint, the possibility of having hidden locks may prevent such from happening, but due to the PrecommitTimeout hidden locks should not remain "hidden" for more than one round after GST.

@cmwaters cmwaters transferred this issue from tendermint/spec Feb 22, 2022
@github-actions github-actions bot added the stale for use by stalebot label Jun 3, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S:proposal Status: Proposal stale for use by stalebot
Projects
None yet
Development

No branches or pull requests

5 participants