spec: blocksync #8586
spec/blocksync/verification.md
### Trusted state
The light client additionally relies on the notion of a **trusting period**. The trusting period is the time during which we assume we can trust validators: if we detect misbehaviour, we can slash them, because they are still bonded. Beyond this period, validators need not hold any bonded assets and cannot be held accountable for their misbehaviour. Blocksynced blocks will most often lie outside the trusting period for a particular block. Therefore, the trusting-period assumptions as they are in the light client cannot be applied here.
This is a bit unclear to me. In my view, the trusting period is nothing specific to the light client, but a consequence of the unbonding period. If we cannot apply the trusting period, this means that the whole protocol is in general unsafe, or more precisely, the problem we want to solve is unsolvable.
For instance, if we want to blocksync from genesis, and genesis is 2 months old, all the initial validators might have unbonded. They can produce an alternative history, and make our new node always sync to a bad state. If we don't make additional assumptions the problem is unsolvable. (Or the protocol is just best-effort. But even then we should explain in what cases it works).
This means that to say something meaningful about blocksync, we might need to add assumptions (or make already existing implicit assumptions explicit), e.g., every new node can talk to at least one correct peer who has the correct history of blocks (headers) stored.
The main point I agree with you on is the weak guarantees this initial trust in the validator set provides to us in blocksync. And I think we need to find a way to improve the guarantees on this by either starting a light client service to verify headers against or some other, safer solution.
I think there is a sentence about the node always having a correct peer, but "correct" is not fully specified as you stated it here. Indeed, for this to work we do need a peer who will send us a correct block (header), but I feel this assumption is weak. I will add this assumption explicitly for now, but I do think we should come up with a safer solution.
One thing that may inform the assumptions is the use case. Perhaps we can differentiate:
- If you have a trusted state within the trusting period (w.r.t. now): strong guarantees
- Otherwise: weak guarantees. We still provide automation, but you should double check the outcome.
For me the idea with witnesses is in the second option. "We cannot really give guarantees, but we want to make the attack more complicated."
We should make the different use cases/assumptions and the following guarantees more clear.
If we launched the light client as a service (as state sync does), do you think we could always provide strong guarantees? If so, would this be a better option compared to witness verification? If we are satisfied with the witness model, then I can certainly specify these things better.
The issue is the input data rather than the protocol we are using. If you start from genesis, and the genesis is older than the trusting period, then the light client would just terminate without doing anything.
If we expect blocksync to still do something, we need to understand that whatever protocol it uses (witnesses, or the light client or any other new protocol we might come up with), it will have weaker guarantees.
Perhaps it is better to jump on a quick call to discuss this synchronously ;-)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Sorry for the long list of comments. Most are minor and shouldn't take a lot of effort.
If all these checks pass, the reactor verifies that other peers have the same block at the particular height.
To reduce traffic, we do not ask the peers to provide the whole block to us, only the header. If crosschecking the headers fails, the node requests the remainder of the block to decide whether this peer is faulty or not.
To further reduce traffic, we could just request the block's hash (smaller than the header).
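This suggested optimisation could look roughly as follows; the `Peer` interface and all types here are hypothetical, for illustration only, not the actual reactor API:

```go
package main

import (
	"bytes"
	"fmt"
)

// Peer is a hypothetical sketch of the peer interface: requesting only a
// block's hash is cheaper than requesting the header, let alone the block.
type Peer interface {
	BlockHash(height int64) []byte
	FullBlock(height int64) []byte
}

// crosscheck compares the hash of our candidate block at a height against the
// hash reported by another peer. Only on a mismatch do we fall back to
// fetching the peer's full block to decide which side is faulty.
func crosscheck(ourHash []byte, height int64, peer Peer) (match bool, evidence []byte) {
	if bytes.Equal(ourHash, peer.BlockHash(height)) {
		return true, nil
	}
	// Mismatch: request the full block to determine whether the peer is faulty.
	return false, peer.FullBlock(height)
}

// stubPeer is a test double for the sketch above.
type stubPeer struct{ hash, block []byte }

func (s stubPeer) BlockHash(int64) []byte { return s.hash }
func (s stubPeer) FullBlock(int64) []byte { return s.block }

func main() {
	match, _ := crosscheck([]byte("h1"), 5, stubPeer{hash: []byte("h1")})
	fmt.Println(match) // true: hashes agree, no full block requested
	match, evidence := crosscheck([]byte("h1"), 5, stubPeer{hash: []byte("h2"), block: []byte("blk")})
	fmt.Println(match, len(evidence) > 0) // false true: full block fetched as evidence
}
```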
spec/blocksync/verification.md
- If we connect to a subset of peers, they could feed the node faulty data. Eventually, when the node switches to consensus, it would realize there is something wrong, but then the node itself might be blacklisted.
- Alternatively, a node that is fed faulty data could, upon switching to consensus, become part of a light client attack by serving as a faulty witness.
- There is no check whether the maximum height reported by peers is true or not. A slow node could report a very distant height to the node - for example 2000, when the blockchain is in fact at height 1000. This would lead to one part of the condition to switch to consensus never becoming true. To prevent the node from switching due to not advancing, the malicious node sends a new block very slowly. Thus the node progresses but can never participate in consensus. This issue could potentially be mitigated if, instead of taking the maximum height reported by peers, we took the lowest of their maximums. The idea is that peers should be close enough to the top of the chain in any case.
- A blocksyncing node can flood peers with requests - constantly reporting that it has not synced up. At the moment, the maximum number of requests received is limited, which protects peers to some extent against this attack.
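The mitigation sketched in the third point (taking the lowest of the peers' reported maximum heights) could look like this; the function is illustrative, not the reactor's actual code:

```go
package main

import "fmt"

// targetHeight takes the minimum of the peers' reported maximum heights
// instead of trusting the single highest report. A lone peer lying upwards
// (e.g. reporting 2000 on a chain at 1000) can then no longer keep the node
// syncing forever. Illustrative sketch; it assumes peers are close enough to
// the top of the chain, and does not defend against a peer lying downwards.
func targetHeight(peerMaxHeights []int64) int64 {
	if len(peerMaxHeights) == 0 {
		return 0
	}
	min := peerMaxHeights[0]
	for _, h := range peerMaxHeights[1:] {
		if h < min {
			min = h
		}
	}
	return min
}

func main() {
	// One malicious peer reports 2000 while honest peers are around 1000.
	fmt.Println(targetHeight([]int64{1000, 1001, 2000})) // 1000
}
```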
I could be wrong on this, but another problem is that the blocks sent/received via the blocksync reactor are not chunked like in consensus. So a busy chain containing big blocks might cause performance problems. Also, a malicious node could feed big blocks to try to "choke" the receiver.
I am aware this affects performance rather than correctness
That is why I suggested that maybe we send around headers and blocks only once verification passes. However, this just delays the problem.
I think this specification presents the implementation of the blocksync reactor very well.
But I focused my comments, in this first review, on the introduction of the spec, the readme file. I would suggest extending it with:
- General definitions, maybe even a form of glossary of terms used several times
- A high-level overview of what blocksync actually does, independently of how it is implemented
I think that once these high-level concepts and goals are properly defined, we can present the existing implementation as a way of obtaining the desired goals, pointing out its limitations and potential improvements.
spec/blocksync/readme.md
In a proof of work blockchain, syncing with the chain is the same process as staying up-to-date with the consensus: download blocks, and look for the one with the most total work. In proof-of-stake, the consensus process is more complex, as it involves rounds of communication between the nodes to determine what block should be committed next. Using this process to sync up with the blockchain from scratch can take a very long time. It's much faster to just download blocks and check the merkle tree of validators than to run the real-time consensus gossip protocol.
This paragraph comes from here: https://github.com/tendermint/tendermint/blob/c398bc55e6e5f3fd01f5524aaa7a847ae7f35938/docs/tendermint-core/block-sync/README.md
But more generally, I don't agree with its content. In both classes of blockchains, syncing blocks is the same as deciding the block in the usual (consensus) way. In PoW blockchains, we extend a block that we "trust" with downloaded blocks, verify the downloaded block, and append to a local candidate blockchain. If we have multiple local candidate blockchains, we opt for the longest one, which represents the greatest accumulated work. In PoS blockchains, we do the same: we extend a block that we "trust" with downloaded blocks, verify the downloaded block, and append to a local candidate blockchain. There are two differences: (i) we verify a block by verifying the signatures of messages voting for the block, instead of verifying brute-force-generated hashes, and (ii) we should not have multiple candidate blockchains.
That said, one thing that should be clearer in the specification of this module/protocol/spec is that blocksync represents a very condensed execution of the consensus protocol. The consensus protocol involves (potentially) multiple rounds, each round composed of three communication steps. A node, however, can decide a block by only receiving the messages from the first (propose) and last (precommit) step of any round, provided they match. In other words, it is enough for a node to receive a proposed block and 2/3+ `Precommit` messages from the same round for that block to complete an instance of consensus.
In my view, and please correct me if I am wrong, this is exactly the information used in blocksync. There are, of course, differences: (i) the proposed block is not split into multiple messages, as in consensus, and (ii) a set of 2/3+ identical `Precommit` messages for the decided block is condensed into a `Commit`. If this parallel between consensus and blocksync is indeed valid, I think that this is the way of introducing blocksync.
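If this parallel holds, the condensed verification could be sketched as follows; the types and the voting-power check are illustrative assumptions, and actual signature verification is elided:

```go
package main

import "fmt"

// Vote is a hypothetical, condensed representation of a precommit:
// signature checking is elided, only voting power is tracked.
type Vote struct {
	Validator string
	Power     int64
	BlockID   string
}

// verifyCommit sketches the "condensed consensus" view: a block is accepted
// if the Commit carries precommits from validators holding strictly more
// than 2/3 of the total voting power. Illustrative, not Tendermint's code.
func verifyCommit(blockID string, votes []Vote, totalPower int64) bool {
	var signed int64
	for _, v := range votes {
		if v.BlockID == blockID {
			signed += v.Power
		}
	}
	return 3*signed > 2*totalPower // strictly more than 2/3
}

func main() {
	votes := []Vote{{"a", 40, "B1"}, {"b", 35, "B1"}, {"c", 25, "B2"}}
	fmt.Println(verifyCommit("B1", votes, 100)) // true: 75 of 100 power signed B1
}
```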
spec/blocksync/readme.md
The Blocksync Reactor's high level responsibility is to enable peers who are
Are we defining a reactor here, which is an implementation name, or a service? For the sake of specification, I would prefer to define a service, which is more abstract, and then describe an implementation of the service (e.g., as a `Reactor`).
spec/blocksync/readme.md
many blocks (that have already been decided) in parallel, verifying their commits, and executing them against the ABCI application.
Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
Suggested change:
- Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
+ Tendermint nodes run the Blocksync Reactor as a service to provide blocks
I think here you are describing the two "modes" of operation of blocksync: server, which ships decided blocks to whoever wants them, and client, which asks peers for blocks while (or when) they are needed.
I don't know whether this server/client nomenclature is good (I've used it in this paper, Section V.D), but I clearly see two different roles here.
spec/blocksync/readme.md
The reactor is activated after state sync, where the pool and request processing routines are launched.

However, receiving messages via the p2p channel and sending status updates to other nodes is enabled regardless of whether the blocksync reactor is started. This makes sense, as a node should be able to send updates to other peers regardless of whether it itself is blocksyncing.
So, the service starts with the node, meaning that it is able to send and receive status messages, but its actual operation only starts when the block pool is ready. Is that correct?
spec/blocksync/readme.md
**Note**. In the current version, if we start from state sync and block sync was not previously launched as a service, the internal channels used by the reactor will not be created. We need to be careful to launch the blocksync *service* before we call the function to switch from statesync to blocksync.
Maybe too implementation-specific?
spec/blocksync/readme.md
### Switching from blocksync to consensus
Ideally, the switch to consensus is done once the node considers itself caught up, or when we have not advanced our height for more than 60s.
Ideally, we move to consensus when the node has caught up with its peers. This is what is expected after a "run" of blocksync in "client mode".
In practice, we need to define what "caught up" means here. This includes the mentioned 60s (which could be another duration, don't you agree?) and the condition mentioned in the next paragraph (1 height away etc.).
spec/blocksync/readme.md
The former is checked by calling `isCaughtUp` inside `poolRoutine` periodically. This period is set with `switchToConsensusTicker`. We consider a node to be caught up if it is 1 height away from the maximum height reported by its peers. The reason we **do not catch up until the maximum height** (`pool.maxPeerHeight`) is that we cannot verify the block at `pool.maxPeerHeight` without the `lastCommit` of the block at `pool.maxPeerHeight + 1`.
The spec would benefit from having this fact explained earlier: we need the block at height `H+1` to validate the block at height `H`. This happens because the block at height `H+1` includes a `Commit` of height `H`, i.e., 2/3+ `Precommit` signatures for the block decided at height `H`.
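The caught-up condition described here can be sketched as follows; `isCaughtUp` is the name used in the implementation, but this body is an illustrative simplification:

```go
package main

import "fmt"

// isCaughtUp mirrors the condition described above: the node is considered
// caught up when it is within one height of the maximum height reported by
// peers, because verifying the block at maxPeerHeight would require the
// lastCommit carried by the (not yet existing) block at maxPeerHeight+1.
// Illustrative simplification of the real check.
func isCaughtUp(ourHeight, maxPeerHeight int64) bool {
	return ourHeight >= maxPeerHeight-1
}

func main() {
	fmt.Println(isCaughtUp(999, 1000)) // true: one height behind counts as caught up
	fmt.Println(isCaughtUp(998, 1000)) // false: keep syncing
}
```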
spec/blocksync/readme.md
If the node is not starting from genesis, blocksync **does not** switch to consensus until we have synced at least one block. We need vote extensions in order to participate in consensus, and they are not provided to the blocksync reactor after state sync. We therefore need to receive them from one of our peers.
I think a subsection with the additional requirement imposed by vote extensions would be a good idea. Briefly defining what they mean, with a summary and/or links, is also good. The point here is that a `Commit` does not include vote extensions, which are needed for the proper operation of the consensus protocol.
Resolves #8219
First draft of blocksync spec. The verification is based on the expected changes adding the light client verification.