
spec: blocksync #8586

Closed · wants to merge 31 commits

Conversation

jmalicevic (Contributor):

Resolves #8219

First draft of the blocksync spec. The verification is based on the expected changes that add light client verification.

@jmalicevic jmalicevic changed the title Jasmina/8219 blocksync spec (WIP) spec/blocksync May 20, 2022

### Trusted state

The light client additionally relies on the notion of a **trusting period**. A trusting period is the time during which we assume we can trust validators because, if we detect misbehaviour, we can still slash them - they are still bonded. Beyond this period, validators no longer need to have any bonded assets and cannot be held accountable for their misbehaviour. Blocks fetched during blocksync will most often lie outside the trusting period for that block. Therefore, the trusting period assumptions, as they are used in the light client, cannot be applied here.
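As a rough illustration, the trusting-period check that the light client relies on could be sketched as follows (a minimal sketch with hypothetical names; `withinTrustingPeriod` is not the actual Tendermint API):

```go
package main

import (
	"fmt"
	"time"
)

// withinTrustingPeriod reports whether a header's time is still inside the
// trusting period relative to "now". Hypothetical helper for illustration.
func withinTrustingPeriod(headerTime time.Time, trustingPeriod time.Duration, now time.Time) bool {
	return now.Before(headerTime.Add(trustingPeriod))
}

func main() {
	trustingPeriod := 14 * 24 * time.Hour             // e.g. a two-week unbonding period
	oldHeader := time.Now().Add(-60 * 24 * time.Hour) // a header from two months ago
	// Prints "false": the light client's trust assumptions no longer hold for
	// a block this old, which is the common case when blocksyncing history.
	fmt.Println(withinTrustingPeriod(oldHeader, trustingPeriod, time.Now()))
}
```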
Contributor:

This is a bit unclear to me. In my view, the trusting period is nothing specific to the light client, but a consequence of the unbonding period. If we cannot apply the trusting period, this means that the whole protocol is in general unsafe, or more precisely, the problem we want to solve is unsolvable.

For instance, if we want to blocksync from genesis, and genesis is 2 months old, all the initial validators might have unbonded. They can produce an alternative history, and make our new node always sync to a bad state. If we don't make additional assumptions the problem is unsolvable. (Or the protocol is just best-effort. But even then we should explain in what cases it works).

This means to say something meaningful about blocksync, we might need to add (or make explicit already existing implicit) assumptions, e.g., every new node can talk to at least one correct peer who has the correct history of blocks (headers) stored.

Contributor Author:

The main point I agree with you on is the weak guarantees this initial trust in the validator set provides to us in blocksync. And I think we need to find a way to improve the guarantees on this by either starting a light client service to verify headers against or some other, safer solution.

I think there is a sentence about the node always having a correct peer, but "correct" is not specified as fully as you stated it here. Indeed, for this to be correct we do need a peer who will send us a correct block (header), but I feel this assumption is weak. I will add this assumption explicitly for now, but I do think we should come up with a safer solution.

Contributor:

One thing that may inform the assumptions is the use case. Perhaps we can differentiate:

  • If you have a trusted state within the trusting period (w.r.t. now): strong guarantees
  • Otherwise: weak guarantees. We still provide automation, but you should double check the outcome.

For me the idea with witnesses is in the second option. "We cannot really give guarantees, but we want to make the attack more complicated."

We should make the different use cases/assumptions and the resulting guarantees clearer.

Contributor Author:

If we launched the light client as a service (as state sync does), do you think we could always provide strong guarantees? If so, would this be a better option compared to witness verification? If we are satisfied with the witness model then I can certainly specify these things better.

Contributor:

The issue is the input data rather than the protocol we are using. If you start from genesis, and the genesis is older than the trusting period, then the light client would just terminate without doing anything.

If we expect blocksync to still do something, we need to understand that whatever protocol it uses (witnesses, or the light client or any other new protocol we might come up with), it will have weaker guarantees.

Perhaps it is better to jump on a quick call to discuss this synchronously ;-)

@jmalicevic jmalicevic changed the title (WIP) spec/blocksync spec: blocksync Jun 2, 2022
@jmalicevic jmalicevic marked this pull request as ready for review June 2, 2022 14:53
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 13, 2022
@creachadair creachadair removed the stale label Jun 13, 2022
sergio-mena (Contributor) left a comment:

Sorry for the long list of comments. Most are minor and shouldn't take a lot of effort.


If all these checks pass, the reactor verifies that other peers have the same block at the particular height.

In order to reduce traffic, we do not ask the peers to provide the whole block, only the header. If crosschecking the headers fails, the node requests the remainder of the block to decide whether this peer is faulty or not.
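A minimal sketch of this crosschecking step, under the assumption that peers can serve individual headers (`Peer`, `Header`, and `crossCheckHeader` are illustrative names, not the reactor's actual types):

```go
package blocksync

import "bytes"

// Header is a minimal stand-in for a block header.
type Header interface {
	Hash() []byte
}

// Peer is a minimal stand-in for a blocksync peer that can serve headers.
type Peer interface {
	Header(height int64) (Header, error)
}

// crossCheckHeader asks peers for the header at a given height and compares
// its hash with the hash of the block we verified locally. Peers whose header
// differs are returned, so the full block can be requested from them before
// deciding whether they are faulty.
func crossCheckHeader(verifiedHash []byte, height int64, peers []Peer) []Peer {
	var suspects []Peer
	for _, p := range peers {
		h, err := p.Header(height)
		if err != nil {
			continue // slow or unresponsive peers are handled elsewhere
		}
		if !bytes.Equal(h.Hash(), verifiedHash) {
			suspects = append(suspects, p)
		}
	}
	return suspects
}
```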
Contributor:

To further reduce traffic, we could just request the block's hash (smaller than the header)

- If we connect to a subset of peers, they could feed the node faulty data. Eventually, when the node switches to consensus, it would realize there is something wrong, but then the node itself might be blacklisted.
- Alternatively, a node that is fed faulty data could, upon switching to consensus, become part of a light client attack by serving as a faulty witness.
- There is no check on whether the maximum height reported by peers is true. A malicious peer could report a very distant height to the node - for example 2000, when the blockchain is in fact at height 1000. This would make one part of the condition to switch to consensus never true. To prevent the node from switching due to not advancing, the malicious peer then sends new blocks very slowly. Thus the node progresses but can never participate in consensus. This issue could potentially be mitigated if, instead of taking the maximum height reported by peers, we take the lowest of their maximums (see the sketch after this list). The idea is that peers should be close enough to the top of the chain in any case.
- A blocksyncing node can flood peers with requests, constantly reporting that it has not synced up. At the moment the maximum number of requests received is limited, which protects peers to some extent against this attack.
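The mitigation mentioned in the third item could look roughly like this (an illustrative sketch; `targetHeight` is a hypothetical helper, not existing reactor code):

```go
package blocksync

// targetHeight returns a conservative sync target: the lowest of the maximum
// heights reported by peers, instead of the overall maximum. A single peer
// reporting an inflated height then cannot keep the node from ever
// considering itself caught up.
func targetHeight(peerMaxHeights []int64) int64 {
	if len(peerMaxHeights) == 0 {
		return 0
	}
	min := peerMaxHeights[0]
	for _, h := range peerMaxHeights[1:] {
		if h < min {
			min = h
		}
	}
	return min
}
```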
Contributor:

I could be wrong on this, but another problem is that the blocks sent/received via the blocksync reactor are not chunked as they are in consensus. So a busy chain containing big blocks might cause performance problems. Also, a malicious node could try to feed big blocks to "choke" the receiver.

I am aware this affects performance rather than correctness.

Contributor Author:

That is why I suggested that maybe we send around headers and blocks only once verification passes. However, this just delays the problem.

cason (Contributor) left a comment:

I think this specification presents the implementation of the blocksync reactor very well.

But I focused my comments, in this first review, on the introduction of the spec, the readme file. I would suggest extending it with:

  • General definitions, maybe even a form of glossary of terms used several times
  • A high-level overview of what blocksync actually does, independently of how it is implemented

I think that once these high-level concepts and goals are properly defined, we can present the existing implementation as a way of achieving the desired goals, pointing out its limitations and potential improvements.

---


In a proof of work blockchain, syncing with the chain is the same process as staying up-to-date with the consensus: download blocks, and look for the one with the most total work. In proof-of-stake, the consensus process is more complex, as it involves rounds of communication between the nodes to determine what block should be committed next. Using this process to sync up with the blockchain from scratch can take a very long time. It's much faster to just download blocks and check the merkle tree of validators than to run the real-time consensus gossip protocol.
Contributor:

But more generally, I don't agree with its content. In both classes of blockchains, syncing blocks is the same as deciding blocks in the usual (consensus) way. In PoW blockchains, we extend a block that we "trust" with downloaded blocks, verify each downloaded block, and append it to a local candidate blockchain. If we have multiple local candidate blockchains, we opt for the longest one, which represents the greatest accumulated work. In PoS blockchains, we do the same: we extend a block that we "trust" with downloaded blocks, verify each downloaded block, and append it to a local candidate blockchain. The differences are two: (i) we verify a block by verifying the signatures of messages voting for the block, instead of verifying brute-force-generated hashes, and (ii) we should not have multiple candidate blockchains.

That said, one thing that should be clearer in the specification of this module/protocol/spec is that blocksync represents a very condensed execution of the consensus protocol. The consensus protocol involves (potentially) multiple rounds, and each round is composed of three communication steps. A node, however, can decide a block by only receiving the messages from the first (propose) and last (precommit) steps of any round, provided they match. In other words, it is enough for a node to receive a proposed block and 2/3+ Precommit messages from the same round for that block to complete an instance of consensus.

In my view, and please correct me if I am wrong, this is exactly the information used in blocksync. There are, of course, differences: (i) the proposed block is not split into multiple messages, as in consensus, and (ii) a set of 2/3+ identical Precommit messages for the decided block is condensed into a Commit. If this parallel between consensus and blocksync is indeed valid, I think that this is the way to introduce blocksync.
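The core rule blocksync reuses from consensus can be sketched as follows (a simplified illustration with stand-in types; signature verification and block-ID matching are omitted, and the real logic lives in Tendermint's `types` package):

```go
package blocksync

import "errors"

// Validator and CommitSig are minimal stand-ins for the real types.
type Validator struct {
	VotingPower int64
}

type CommitSig struct {
	ValidatorIndex int
	Signed         bool // true if this validator precommitted the block
}

// verifyCommitPower accepts a block only if validators holding more than 2/3
// of the total voting power precommitted it, i.e. the condensed form of the
// 2/3+ Precommit messages a consensus node would have received.
func verifyCommitPower(vals []Validator, sigs []CommitSig) error {
	var total, signed int64
	for _, v := range vals {
		total += v.VotingPower
	}
	for _, s := range sigs {
		if s.Signed && s.ValidatorIndex >= 0 && s.ValidatorIndex < len(vals) {
			signed += vals[s.ValidatorIndex].VotingPower
		}
	}
	if 3*signed <= 2*total {
		return errors.New("commit carries less than 2/3+ of the voting power")
	}
	return nil
}
```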



The Blocksync Reactor's high level responsibility is to enable peers who are
Contributor:

Are we defining a reactor here, which is an implementation name, or a service? For the sake of the specification, I would prefer to define a service, which is more abstract, and then describe an implementation of the service (e.g., as a reactor).

many blocks (that have already been decided) in parallel, verifying their commits, and executing them against the
ABCI application.

Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
Contributor:

Suggested change
Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
Tendermint nodes run the Blocksync Reactor as a service to provide blocks

Contributor:

I think here you are describing the two "modes" of operation of blocksync: server, which ships decided blocks to whoever wants them, and client, which asks peers for blocks while (or when) they are needed.

I don't know whether this server/client nomenclature is good (I've used it in this paper, Section V.D), but I clearly see two different roles here.


The reactor is activated after state sync, where the pool and request processing routines are launched.

However, receiving messages via the p2p channel and sending status updates to other nodes is enabled regardless of whether the blocksync reactor is started. This makes sense as a node should be able to send updates to other peers regardless of whether it itself is blocksyncing.
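The two roles this implies might be sketched as follows (illustrative names only; the actual reactor fields and handlers differ):

```go
package blocksync

// BlockStore is a minimal stand-in for the node's local block store.
type BlockStore interface {
	Height() int64
	LoadBlock(height int64) ([]byte, bool)
}

// Reactor sketches the split between the always-on "server" side and the
// "client" side that only runs while the node itself is blocksyncing.
type Reactor struct {
	blockSyncing bool       // true once the pool/request routines are started
	store        BlockStore // always available, filled by consensus or blocksync
}

// handleStatusRequest is served regardless of whether we are blocksyncing.
func (r *Reactor) handleStatusRequest() int64 {
	return r.store.Height()
}

// handleBlockRequest is also served unconditionally, so peers can sync from us.
func (r *Reactor) handleBlockRequest(height int64) ([]byte, bool) {
	return r.store.LoadBlock(height)
}
```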
Contributor:

So, the service starts with the node, meaning that it is able to send and receive status messages. But its actual operation only starts when the block pool is ready, is that correct?



**Note**. In the current version, if we start from state sync and block sync is not launched before as a service, the internal channels used by the reactor will not be created. We need to be careful to launch the blocksync *service* before we call the function to switch from statesync to blocksync.
Contributor:

Maybe this is too implementation-specific?


### Switching from blocksync to consensus
Ideally, the switch to consensus is done once the node considers itself caught up or we have not advanced our height for more than 60s.
Contributor:

Ideally, we move to consensus when the node has caught up with its peers. This is what we expect after a "run" of blocksync in "client mode".

In practice, we need to define what "caught up" means here. This includes the mentioned 60s (which could be another duration, don't you agree?) and the condition mentioned in the next paragraph (1 height away etc.).


The former is checked by calling `isCaughtUp` inside `poolRoutine` periodically. This period is set with `switchToConsensusTicker`. We consider a node to be caught up if it is 1 height away from the maximum height reported by its peers. The reason we **do not catch up until the maximum height** (`pool.maxPeerHeight`) is that we cannot verify the block at `pool.maxPeerHeight` without the `lastCommit` of the block at `pool.maxPeerHeight + 1`.
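Putting the two switch conditions together, a rough sketch of the check (illustrative names; the actual code uses `isCaughtUp` inside `poolRoutine`):

```go
package blocksync

import "time"

// shouldSwitchToConsensus combines the two conditions described above: the
// node is within one height of the maximum height reported by peers, or its
// height has not advanced for longer than the stall timeout (60s in the text).
func shouldSwitchToConsensus(ourHeight, maxPeerHeight int64, lastAdvance time.Time, stallTimeout time.Duration) bool {
	// We stop one block short of maxPeerHeight because the block at
	// maxPeerHeight can only be verified with the lastCommit carried by the
	// block at maxPeerHeight+1, which peers cannot provide yet.
	if ourHeight >= maxPeerHeight-1 {
		return true
	}
	return time.Since(lastAdvance) > stallTimeout
}
```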
Contributor:

The spec would benefit from having this fact explained earlier: we need the block at height H+1 to validate the block at height H. This happens because the block at height H+1 includes a Commit of height H, i.e., 2/3+ Precommit signatures for the block decided at height H.



If the node is not starting from genesis, blocksync **does not** switch to consensus until we have synced at least one block. We need to have vote extensions in order to participate in consensus and they are not provided to the blocksync reactor after state sync. We therefore need to receive them from one of our peers.
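This additional requirement could be folded into the switch condition roughly as follows (an illustrative sketch; the function and parameter names are hypothetical):

```go
package blocksync

// canSwitchToConsensus adds the vote-extension requirement: when not starting
// from genesis, at least one block must have been blocksynced so that the node
// holds an extended commit (with vote extensions) received from a peer.
func canSwitchToConsensus(caughtUp, startedFromGenesis bool, blocksSynced int64) bool {
	if !caughtUp {
		return false
	}
	if startedFromGenesis {
		return true
	}
	return blocksSynced >= 1
}
```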
Contributor:

I think a subsection with the additional requirement imposed by vote extensions would be a good idea. Briefly defining what they mean, with a summary and/or links, would also be good. The point here is that a Commit does not include vote extensions, which are needed for the proper operation of the consensus protocol.


@github-actions github-actions bot added the stale label Jul 15, 2022
@cmwaters cmwaters removed the stale label Jul 15, 2022

@github-actions github-actions bot added the stale label Jul 26, 2022
@github-actions github-actions bot closed this Jul 31, 2022
@thanethomson thanethomson mentioned this pull request Oct 17, 2022