
spec: blocksync #8586

Closed · wants to merge 31 commits

Conversation

jmalicevic (Contributor):

Resolves #8219

First draft of the blocksync spec. The verification is based on the expected changes that add light client verification.

@jmalicevic jmalicevic changed the title Jasmina/8219 blocksync spec (WIP) spec/blocksync May 20, 2022

### Trusted state

The light client additionally relies on the notion of a **trusting period**. A trusting period is the time during which we assume we can trust validators because, if we detect misbehaviour, we can still slash them - they are still bonded. Beyond this period, validators no longer need to have any bonded assets and cannot be held accountable for their misbehaviour. Blocks fetched during blocksync will most often lie outside the trusting period for that block. Therefore, the trusting period assumptions, as they are used in the light client, cannot be applied here.
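As a rough illustration, the trusting-period check that the light client relies on could be sketched as follows (a minimal sketch with hypothetical names; `withinTrustingPeriod` is not the actual Tendermint API):

```go
package main

import (
	"fmt"
	"time"
)

// withinTrustingPeriod reports whether a header's time is still inside the
// trusting period relative to "now". Hypothetical helper for illustration.
func withinTrustingPeriod(headerTime time.Time, trustingPeriod time.Duration, now time.Time) bool {
	return now.Before(headerTime.Add(trustingPeriod))
}

func main() {
	trustingPeriod := 14 * 24 * time.Hour             // e.g. a two-week unbonding period
	oldHeader := time.Now().Add(-60 * 24 * time.Hour) // a header from two months ago
	// Prints "false": the light client's trust assumptions no longer hold for
	// a block this old, which is the common case when blocksyncing history.
	fmt.Println(withinTrustingPeriod(oldHeader, trustingPeriod, time.Now()))
}
```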
Contributor:

This is a bit unclear to me. In my view, the trusting period is nothing specific to the light client, but a consequence of the unbonding period. If we cannot apply the trusting period, this means that the whole protocol is in general unsafe, or more precisely, the problem we want to solve is unsolvable.

For instance, if we want to blocksync from genesis, and genesis is 2 months old, all the initial validators might have unbonded. They can produce an alternative history, and make our new node always sync to a bad state. If we don't make additional assumptions the problem is unsolvable. (Or the protocol is just best-effort. But even then we should explain in what cases it works).

This means to say something meaningful about blocksync, we might need to add (or make explicit already existing implicit) assumptions, e.g., every new node can talk to at least one correct peer who has the correct history of blocks (headers) stored.

Contributor Author:

The main point I agree with you on is the weak guarantees this initial trust in the validator set provides to us in blocksync. And I think we need to find a way to improve the guarantees on this by either starting a light client service to verify headers against or some other, safer solution.

I think there is a sentence about the node always having a correct peer, but "correct" is not specified as fully as you stated it here. Indeed, for this to be correct we do need a peer who will send us a correct block (header), but I feel this assumption is weak. I will add this assumption explicitly for now, but I do think we should come up with a safer solution.

Contributor:

One thing that may inform the assumptions is the use case. Perhaps we can differentiate:

  • If you have a trusted state within the trusting period (w.r.t. now): strong guarantees
  • Otherwise: weak guarantees. We still provide automation, but you should double check the outcome.

For me the idea with witnesses is in the second option. "We cannot really give guarantees, but we want to make the attack more complicated."

We should make the different use cases/assumptions and the resulting guarantees clearer.

Contributor Author:

If we launched the light client as a service (as state sync does), do you think we could always provide strong guarantees? If so, would this be a better option compared to witness verification? If we are satisfied with the witness model then I can certainly specify these things better.

Contributor:

The issue is the input data rather than the protocol we are using. If you start from genesis, and the genesis is older than the trusting period, then the light client would just terminate without doing anything.

If we expect blocksync to still do something, we need to understand that whatever protocol it uses (witnesses, or the light client or any other new protocol we might come up with), it will have weaker guarantees.

Perhaps it is better to jump on a quick call to discuss this synchronously ;-)

@jmalicevic jmalicevic changed the title (WIP) spec/blocksync spec: blocksync Jun 2, 2022
@jmalicevic jmalicevic marked this pull request as ready for review June 2, 2022 14:53
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jun 13, 2022
@creachadair creachadair removed the stale label Jun 13, 2022
sergio-mena (Contributor) left a comment:

Sorry for the long list of comments. Most are minor and shouldn't take a lot of effort.


If all these checks pass, the reactor verifies that other peers have the same block at the particular height.

In order to reduce traffic, we do not ask the peers to provide the whole block, only the header. If crosschecking the headers fails, the node requests the remainder of the block to decide whether this peer is faulty or not.
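A minimal sketch of this crosschecking step, under the assumption that peers can serve individual headers (`Peer`, `Header`, and `crossCheckHeader` are illustrative names, not the reactor's actual types):

```go
package blocksync

import "bytes"

// Header is a minimal stand-in for a block header.
type Header interface {
	Hash() []byte
}

// Peer is a minimal stand-in for a blocksync peer that can serve headers.
type Peer interface {
	Header(height int64) (Header, error)
}

// crossCheckHeader asks peers for the header at a given height and compares
// its hash with the hash of the block we verified locally. Peers whose header
// differs are returned, so the full block can be requested from them before
// deciding whether they are faulty.
func crossCheckHeader(verifiedHash []byte, height int64, peers []Peer) []Peer {
	var suspects []Peer
	for _, p := range peers {
		h, err := p.Header(height)
		if err != nil {
			continue // slow or unresponsive peers are handled elsewhere
		}
		if !bytes.Equal(h.Hash(), verifiedHash) {
			suspects = append(suspects, p)
		}
	}
	return suspects
}
```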
Contributor:

To further reduce traffic, we could just request the block's hash (smaller than the header)

- If we connect to a subset of peers, they could feed the node faulty data. Eventually, when the node switches to consensus, it would realize there is something wrong, but then the node itself might be blacklisted.
- Alternatively, a node that is fed faulty data could, upon switching to consensus, become part of a light client attack by serving as a faulty witness.
- There is no check on whether the maximum height reported by peers is true. A malicious peer could report a very distant height to the node - for example 2000, when the blockchain is in fact at height 1000. This would make one part of the condition to switch to consensus never true. To prevent the node from switching due to not advancing, the malicious peer then sends new blocks very slowly. Thus the node progresses but can never participate in consensus. This issue could potentially be mitigated if, instead of taking the maximum height reported by peers, we take the lowest of their maximums (see the sketch after this list). The idea is that peers should be close enough to the top of the chain in any case.
- A blocksyncing node can flood peers with requests, constantly reporting that it has not synced up. At the moment the maximum number of requests received is limited, which protects peers to some extent against this attack.
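The mitigation mentioned in the third item could look roughly like this (an illustrative sketch; `targetHeight` is a hypothetical helper, not existing reactor code):

```go
package blocksync

// targetHeight returns a conservative sync target: the lowest of the maximum
// heights reported by peers, instead of the overall maximum. A single peer
// reporting an inflated height then cannot keep the node from ever
// considering itself caught up.
func targetHeight(peerMaxHeights []int64) int64 {
	if len(peerMaxHeights) == 0 {
		return 0
	}
	min := peerMaxHeights[0]
	for _, h := range peerMaxHeights[1:] {
		if h < min {
			min = h
		}
	}
	return min
}
```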
Contributor:

I could be wrong on this, but another problem is that the blocks sent/received via the blocksync reactor are not chunked as they are in consensus. So a busy chain containing big blocks might cause performance problems. Also, a malicious node could try to feed big blocks to "choke" the receiver.

I am aware this affects performance rather than correctness.

Contributor Author:

That is why I suggested that maybe we send around headers and blocks only once verification passes. However, this just delays the problem.

cason (Contributor) left a comment:

I think this specification presents the implementation of the blocksync reactor very well.

But I focused my comments, in this first review, on the introduction of the spec, the readme file. I would suggest extending it with:

  • General definitions, maybe even a form of glossary of terms used several times
  • A high-level overview of what blocksync actually does, independently of how it is implemented

I think that once these high-level concepts and goals are properly defined, we can present the existing implementation as a way of achieving the desired goals, pointing out its limitations and potential improvements.

---


In a proof of work blockchain, syncing with the chain is the same process as staying up-to-date with the consensus: download blocks, and look for the one with the most total work. In proof-of-stake, the consensus process is more complex, as it involves rounds of communication between the nodes to determine what block should be committed next. Using this process to sync up with the blockchain from scratch can take a very long time. It's much faster to just download blocks and check the merkle tree of validators than to run the real-time consensus gossip protocol.
Contributor:

But more generally, I don't agree with its content. In both classes of blockchains, syncing blocks is the same as deciding blocks in the usual (consensus) way. In PoW blockchains, we extend a block that we "trust" with downloaded blocks, verify each downloaded block, and append it to a local candidate blockchain. If we have multiple local candidate blockchains, we opt for the longest one, which represents the greatest accumulated work. In PoS blockchains, we do the same: we extend a block that we "trust" with downloaded blocks, verify each downloaded block, and append it to a local candidate blockchain. The differences are two: (i) we verify a block by verifying the signatures of messages voting for the block, instead of verifying brute-force-generated hashes, and (ii) we should not have multiple candidate blockchains.

That said, one thing that should be clearer in the specification of this module/protocol/spec is that blocksync represents a very condensed execution of the consensus protocol. The consensus protocol involves (potentially) multiple rounds, and each round is composed of three communication steps. A node, however, can decide a block by only receiving the messages from the first (propose) and last (precommit) steps of any round, provided they match. In other words, it is enough for a node to receive a proposed block and 2/3+ Precommit messages from the same round for that block to complete an instance of consensus.

In my view, and please correct me if I am wrong, this is exactly the information used in blocksync. There are, of course, differences: (i) the proposed block is not split into multiple messages, as in consensus, and (ii) a set of 2/3+ identical Precommit messages for the decided block is condensed into a Commit. If this parallel between consensus and blocksync is indeed valid, I think that this is the way to introduce blocksync.
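The core rule blocksync reuses from consensus can be sketched as follows (a simplified illustration with stand-in types; signature verification and block-ID matching are omitted, and the real logic lives in Tendermint's `types` package):

```go
package blocksync

import "errors"

// Validator and CommitSig are minimal stand-ins for the real types.
type Validator struct {
	VotingPower int64
}

type CommitSig struct {
	ValidatorIndex int
	Signed         bool // true if this validator precommitted the block
}

// verifyCommitPower accepts a block only if validators holding more than 2/3
// of the total voting power precommitted it, i.e. the condensed form of the
// 2/3+ Precommit messages a consensus node would have received.
func verifyCommitPower(vals []Validator, sigs []CommitSig) error {
	var total, signed int64
	for _, v := range vals {
		total += v.VotingPower
	}
	for _, s := range sigs {
		if s.Signed && s.ValidatorIndex >= 0 && s.ValidatorIndex < len(vals) {
			signed += vals[s.ValidatorIndex].VotingPower
		}
	}
	if 3*signed <= 2*total {
		return errors.New("commit carries less than 2/3+ of the voting power")
	}
	return nil
}
```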



The Blocksync Reactor's high level responsibility is to enable peers who are
Contributor:

Are we defining a reactor here, which is an implementation name, or a service? For the sake of the specification, I would prefer to define a service, which is more abstract, and then describe an implementation of the service (e.g., as a reactor).

many blocks (that have already been decided) in parallel, verifying their commits, and executing them against the
ABCI application.

Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
Contributor:

Suggested change
Tendermint full nodes run the Blocksync Reactor as a service to provide blocks
Tendermint nodes run the Blocksync Reactor as a service to provide blocks

Contributor:

I think here you are describing the two "modes" of operation of blocksync: server, which ships decided blocks to whoever wants them, and client, which asks peers for blocks while (or when) they are needed.

I don't know whether this server/client nomenclature is good (I've used it in this paper, Section V.D), but I clearly see two different roles here.


The reactor is activated after state sync, where the pool and request processing routines are launched.

However, receiving messages via the p2p channel and sending status updates to other nodes is enabled regardless of whether the blocksync reactor is started. This makes sense as a node should be able to send updates to other peers regardless of whether it itself is blocksyncing.
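The two roles this implies might be sketched as follows (illustrative names only; the actual reactor fields and handlers differ):

```go
package blocksync

// BlockStore is a minimal stand-in for the node's local block store.
type BlockStore interface {
	Height() int64
	LoadBlock(height int64) ([]byte, bool)
}

// Reactor sketches the split between the always-on "server" side and the
// "client" side that only runs while the node itself is blocksyncing.
type Reactor struct {
	blockSyncing bool       // true once the pool/request routines are started
	store        BlockStore // always available, filled by consensus or blocksync
}

// handleStatusRequest is served regardless of whether we are blocksyncing.
func (r *Reactor) handleStatusRequest() int64 {
	return r.store.Height()
}

// handleBlockRequest is also served unconditionally, so peers can sync from us.
func (r *Reactor) handleBlockRequest(height int64) ([]byte, bool) {
	return r.store.LoadBlock(height)
}
```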
Contributor:

So, the service starts with the node, meaning that it is able to send and receive status messages. But its actual operation only starts when the block pool is ready, is that correct?



**Note**. In the current version, if we start from state sync and block sync is not launched before as a service, the internal channels used by the reactor will not be created. We need to be careful to launch the blocksync *service* before we call the function to switch from statesync to blocksync.
Contributor:

Maybe this is too implementation-specific?


### Switching from blocksync to consensus
Ideally, the switch to consensus is done once the node considers itself caught up or we have not advanced our height for more than 60s.
Contributor:

Ideally, we move to consensus when the node has caught up with its peers. This is what we expect after a "run" of blocksync in "client mode".

In practice, we need to define what "caught up" means here. This includes the mentioned 60s (which could be another duration, don't you agree?) and the condition mentioned in the next paragraph (1 height away etc.).


The former is checked by calling `isCaughtUp` inside `poolRoutine` periodically. This period is set with `switchToConsensusTicker`. We consider a node to be caught up if it is 1 height away from the maximum height reported by its peers. The reason we **do not catch up until the maximum height** (`pool.maxPeerHeight`) is that we cannot verify the block at `pool.maxPeerHeight` without the `lastCommit` of the block at `pool.maxPeerHeight + 1`.
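Putting the two switch conditions together, a rough sketch of the check (illustrative names; the actual code uses `isCaughtUp` inside `poolRoutine`):

```go
package blocksync

import "time"

// shouldSwitchToConsensus combines the two conditions described above: the
// node is within one height of the maximum height reported by peers, or its
// height has not advanced for longer than the stall timeout (60s in the text).
func shouldSwitchToConsensus(ourHeight, maxPeerHeight int64, lastAdvance time.Time, stallTimeout time.Duration) bool {
	// We stop one block short of maxPeerHeight because the block at
	// maxPeerHeight can only be verified with the lastCommit carried by the
	// block at maxPeerHeight+1, which peers cannot provide yet.
	if ourHeight >= maxPeerHeight-1 {
		return true
	}
	return time.Since(lastAdvance) > stallTimeout
}
```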
Contributor:

The spec would benefit from having this fact explained earlier: we need the block at height H+1 to validate the block at height H. This happens because the block at height H+1 includes a Commit of height H, i.e., 2/3+ Precommit signatures for the block decided at height H.



If the node is not starting from genesis, blocksync **does not** switch to consensus until we have synced at least one block. We need to have vote extensions in order to participate in consensus and they are not provided to the blocksync reactor after state sync. We therefore need to receive them from one of our peers.
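This additional requirement could be folded into the switch condition roughly as follows (an illustrative sketch; the function and parameter names are hypothetical):

```go
package blocksync

// canSwitchToConsensus adds the vote-extension requirement: when not starting
// from genesis, at least one block must have been blocksynced so that the node
// holds an extended commit (with vote extensions) received from a peer.
func canSwitchToConsensus(caughtUp, startedFromGenesis bool, blocksSynced int64) bool {
	if !caughtUp {
		return false
	}
	if startedFromGenesis {
		return true
	}
	return blocksSynced >= 1
}
```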
Contributor:

I think a subsection with the additional requirement imposed by vote extensions would be a good idea. Briefly defining what they mean, with a summary and/or links, would also be good. The point here is that a Commit does not include vote extensions, which are needed for the proper operation of the consensus protocol.


@github-actions github-actions bot added the stale label Jul 15, 2022
@cmwaters cmwaters removed the stale label Jul 15, 2022

@github-actions github-actions bot added the stale label Jul 26, 2022
@github-actions github-actions bot closed this Jul 31, 2022
@thanethomson thanethomson mentioned this pull request Oct 17, 2022