
rfc 27: report on bandwidth usage within Tendermint #9706

Merged
merged 23 commits · Dec 8, 2022

Conversation

@williambanfield (Contributor) commented Nov 15, 2022

This pull request adds a report on the major bandwidth usage within Tendermint.

Rendered

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed


Therefore, block gossip can be updated to transmit a representation of the data contained in the block that assumes the peers will already have most of this data. Namely, the block gossip can be updated to only send 1) a list of transaction hashes and 2) a bit array of the votes selected for the block, along with the header and other required block metadata.

This newly proposed method for gossiping block data would require a slight update to the mempool transaction gossip and consensus vote gossip. Since the full contents of each block will no longer be gossiped together, it's possible that some nodes are missing a proposed transaction or the vote of a validator indicated in the new block gossip format. The mempool and consensus reactors would need to be updated to provide `NeedTxs` and `NeedVotes` messages. Each of these messages would allow a node to request a set of data from its peers. When a node receives one of these, it would then transmit the Txs/Votes indicated in the associated message, regardless of whether it believes it has transmitted them to the peer before.
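To make the shape of this proposal more concrete, here is a minimal Go sketch of what such messages might look like. The type and field names (`CompactBlock`, `NeedTxs`, `NeedVotes`, etc.) are illustrative assumptions for this discussion, not a concrete wire format from the RFC; in Tendermint these would ultimately be defined as protobuf messages.

```go
package p2psketch

// Illustrative sketch only: these type and field names are assumptions made
// for this discussion, not the RFC's concrete wire format.

// Header stands in for the existing block header type (types.Header).
type Header struct{ /* existing header fields elided */ }

// CompactBlock carries the header and block metadata but replaces the full
// transactions and commit signatures with compact references.
type CompactBlock struct {
	Header       Header
	TxHashes     [][]byte // hashes of the block's txs, in order
	VoteBitArray []bool   // bit i set => validator i's precommit is in the commit
}

// NeedTxs lets a node request the full bytes of transactions it is missing
// after receiving a CompactBlock.
type NeedTxs struct {
	Height   int64
	TxHashes [][]byte
}

// NeedVotes lets a node request precommits it is missing, referenced by
// validator index as in the VoteBitArray.
type NeedVotes struct {
	Height           int64
	Round            int32
	ValidatorIndexes []int32
}
```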
Contributor

GossipSub works a little bit in this way with Want and Have messages. I think you'd want to have more than just Want/Need messages, because you don't know who actually has the data you're requesting and so you're just blindly requesting it. What would be better is that, upon receiving a block, a node communicates all the txs it has. This can be quite compact in the same way VoteSetBits is, because you reference by index in the block as opposed to by hash.
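A rough sketch of how compact such an index-based "have" announcement can be (the `HasTxs` name and shape below are assumptions for this discussion, not an existing Tendermint message):

```go
package p2psketch

// HasTxs is a hypothetical message a node could send after seeing a block's
// tx list: one bit per transaction, indexed by the tx's position in the block,
// in the same spirit as VoteSetBits. For a block with N txs this costs roughly
// N bits, versus N 32-byte hashes for a hash-based request.
type HasTxs struct {
	Height int64
	Round  int32
	HasTx  []bool // HasTx[i] == true when the sender already has the i-th tx
}
```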

Contributor

This is a good approach to start thinking about, but I am afraid that the required changes are more complex. For instance, can a node forward a block when it does not have the full content (votes and txs) it references? Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

Contributor

Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

[I mentioned this in another comment] Why do we need to guarantee that? How would that be different from forwarding a block part of a block for which we haven't received all block parts yet (in the current logic)?

I'm only referring to TXs here. I understand votes are slightly different because consensus will stop propagating them upon decision.

Contributor Author

I think you'd want to have more than just Want/Need messages because you don't know who actually has the data you're requesting and so you're just blindly requesting it.

This is true; however, the proposed mechanism would still result in the node receiving all of the data from its peers, since its peers would have to receive that data as well in order to actually commit the block at all.

Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

I'm a bit unclear on how this would be any different from what we have today. If a validator has already committed the block, then all of the txs and votes it references are available as part of that block. If the block has been pruned by all validators, then the txs and votes may also be missing, but the block parts would also be gone, so we are, as I see it, no worse off than we were before.


Given that Tendermint informs all peers of _each_ vote message it receives, all nodes should be well informed of which votes their peers have. Since vote messages were nevertheless the third largest consumer of bandwidth in the Osmosis observation, it's possible that this system is not currently working correctly. Further analysis should examine where votes may be getting retransmitted.

### Suggested Improvements to Lower Message Transmission Bandwidth
Contributor

Erasure-coded block propagation is another approach that people (especially Dev) have been keen on. It decreases the chances that you receive the same part multiple times, because you only need, for instance, 5 of 10 parts to reproduce the entire block. See #7932 (comment)

Contributor

Erasure coding will only help if we are more structured about how we spread information, since each node still needs to receive the same amount of data for each proposal.

For example, if the proposal data amounts to 40k, or 10 "original" block parts, and you add 5 "parity" parts, then you have to spread 15 parts from the proposer's point of view, and each node needs to receive 10 parts, original or parity. If some of the received parts are parity, they will be used to reconstruct the missing original ones, adding some processing overhead, before the node can reconstruct the proposal and run ProcessProposal.
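As a rough sketch of the mechanics being discussed (using the github.com/klauspost/reedsolomon library purely for illustration; Tendermint does not currently do any of this), splitting a block into 10 data shards plus 5 parity shards lets any 10 of the 15 shards reconstruct the original bytes:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// Hypothetical proposal bytes; in Tendermint this would be the serialized block.
	block := bytes.Repeat([]byte("proposal-bytes "), 4096)

	// 10 "original" shards + 5 "parity" shards, as in the example above.
	enc, err := reedsolomon.New(10, 5)
	if err != nil {
		log.Fatal(err)
	}

	shards, err := enc.Split(block) // 15 equally sized shards
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil { // fill in the 5 parity shards
		log.Fatal(err)
	}

	// Simulate a node that received only 10 of the 15 shards (any 10 suffice).
	shards[0], shards[3], shards[7], shards[12], shards[14] = nil, nil, nil, nil, nil

	// Reconstruct the missing shards; this is the extra processing overhead
	// mentioned above, paid before the block can be reassembled.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(block)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("reconstructed bytes:", out.Len())
}
```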

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@sergio-mena (Contributor) left a comment

Excellent read!

@cason (Contributor) left a comment

Nice document.

I think it involves some discussions that are already present elsewhere in the repository, and those should be referenced. Also, I think some of the proposed alternatives should be discussed in more detail in future versions or in derived RFCs/ADRs.

#### BlockPart Transmission

Sending `BlockPart` messages consumes the most bandwidth out of all p2p message types observed on the Blockpane Osmosis validator.
In the almost 3-hour observation, the validator sent about 20 gigabytes of `BlockPart` messages.
Contributor

I probably already mentioned this, but it is useful here to know how many bytes were effectively added to the blockchain in this period. This gives an idea of the communication overhead: consumed bandwidth / payload.
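For illustration only, since the committed-payload figure was not collected for this report: if the chain had hypothetically committed 1 GB of block data over the same window, the overhead factor for `BlockPart` traffic alone would be roughly 20 GB / 1 GB = 20x.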


#### Mempool Tx Transmission

The Tendermint mempool stages transactions that have yet to be committed to the blockchain and communicates these transactions to its peers. Each message contains one transaction. Data collected from the Blockpane node running on Osmosis indicates that the validator sent about 12 gigabytes of `Txs` messages during the nearly 3-hour observation period.
Contributor

Again, it would be interesting to know the cumulative size of all transactions committed in the same time frame, so that we have an idea of the overhead.

Contributor Author

I'll take a look at collecting this. I have the height offsets, so we should be able to figure this out.


github-actions bot commented Dec 3, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Dec 3, 2022
@thanethomson thanethomson added S:wip Work in progress (prevents stalebot from automatically closing) and removed stale for use by stalebot labels Dec 3, 2022

The Tendermint mempool starts a new [broadcastTxRoutine][broadcast-tx-routine] for each peer that it is informed of. The routine sends all transactions that the mempool is aware of to each peer, with one exception: if the mempool received a transaction from a peer, it marks it as such and won't resend it to that peer. Otherwise, it retains no information about which transactions it has already sent to a peer. In some cases it may therefore resend transactions the peer already has. This can occur if the mempool removes a transaction from the `CList` data structure used to store the list of transactions while it is about to be sent, and the transaction was the tail of the `CList` at the time of removal. This is more likely to occur if a large number of transactions from the end of the list are removed during `RecheckTx`, since multiple transactions will successively become the tail and then be deleted. It is unclear at the moment how frequently this occurs on production chains.

Beyond ensuring that transactions are rebroadcast to peers less frequently, there is not a simple scheme to communicate fewer transactions to peers. Peers cannot communicate what transactions they need since they do not know which transactions exist on the network.
@lasarojc (Contributor) commented Dec 6, 2022

Here they could communicate what they have, for example in a Bloom filter, and let the other nodes figure out what they don't have and must be sent. Bloom filters could be reset at each new height or use some aging scheme.
The Bloom filter size could be a consensus parameter, adjusted whenever the filter is deemed too full or too empty. A bad size choice would either use more bandwidth to transmit an almost empty filter (too big) or slow the propagation of transactions due to false positives (too small), until a reconfiguration happens.
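For illustration only (nothing like this exists in Tendermint today, and the sizes and hash construction here are arbitrary), a node could summarize the tx hashes it already holds in a small Bloom filter and ship that to peers:

```go
package p2psketch

import (
	"crypto/sha256"
	"encoding/binary"
)

// TxBloom is a hypothetical fixed-size Bloom filter over transaction hashes.
// A real design would size it (and pick k) from the expected mempool size and
// the acceptable false-positive rate, possibly via a consensus parameter.
type TxBloom struct {
	bits []byte // m = len(bits)*8 bits
	k    int    // number of hash functions
}

func NewTxBloom(mBytes, k int) *TxBloom {
	return &TxBloom{bits: make([]byte, mBytes), k: k}
}

// positions derives k bit positions from the tx hash by re-hashing it with a
// counter; good enough for a sketch, not tuned for production.
func (b *TxBloom) positions(txHash []byte) []uint64 {
	out := make([]uint64, b.k)
	m := uint64(len(b.bits)) * 8
	for i := 0; i < b.k; i++ {
		h := sha256.Sum256(append([]byte{byte(i)}, txHash...))
		out[i] = binary.BigEndian.Uint64(h[:8]) % m
	}
	return out
}

// Add records that this node has the transaction with the given hash.
func (b *TxBloom) Add(txHash []byte) {
	for _, p := range b.positions(txHash) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MightHave reports whether the peer that sent this filter may already have
// the transaction; false positives are possible, false negatives are not.
func (b *TxBloom) MightHave(txHash []byte) bool {
	for _, p := range b.positions(txHash) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}
```

A sender would then skip gossiping any tx for which `MightHave` returns true, accepting that a false positive merely delays that tx until another peer delivers it.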


Block, vote, and mempool gossiping transmit much of the same data. The mempool reactor gossips candidate transactions to each peer. The consensus reactor, when gossiping votes, sends vote metadata and the digital signature that signs over that metadata. Finally, when a block is proposed, the proposing node combines the received votes with a set of transactions and adds a header to produce the block. This block is then serialized and gossiped as a list of bytes. However, the data that the block contains, namely the votes and the transactions, was most likely _already transmitted to the nodes on the network_ via mempool transaction gossip and consensus vote gossip.

Therefore, block gossip can be updated to transmit a representation of the data contained in the block that assumes the peers will already have most of this data. Namely, the block gossip can be updated to only send 1) a list of transaction hashes and 2) a bit array of the votes selected for the block, along with the header and other required block metadata.
@lasarojc (Contributor) commented Dec 6, 2022

We could also have block parts include transactions that were inserted during PrepareProposal, as these will not be in the mempool, or have the proposer add such transactions to the mempool. These cannot be enforced, though, and byzantine nodes could use this to stall rounds, but that is no different from their current ability to withhold block parts.

williambanfield and others added 3 commits December 8, 2022 10:47
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Sergio Mena <sergio@informal.systems>
@williambanfield (Contributor Author) commented

There is still an open question here regarding the total size of all transactions added to the chain during this experiment. I have not had a chance to retrieve that data. I am planning to still merge this despite the open question so it can remain in the repo for future reference.

@williambanfield williambanfield added the S:automerge Automatically merge PR when requirements pass label Dec 8, 2022