
rfc 27: report on bandwidth usage within Tendermint #9706

Merged
merged 23 commits · Dec 8, 2022

Conversation

@williambanfield (Contributor) commented Nov 15, 2022

This pull request adds a report on the major bandwidth usage within Tendermint.

Rendered

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed


Therefore, block gossip can be updated to transmit a representation of the data contained in the block that assumes the peers will already have most of this data. Namely, the block gossip can be updated to only send 1) a list of transaction hashes and 2) a bit array of the votes selected for the block, along with the header and other required block metadata.

This newly proposed method for gossiping block data would require a slight update to the mempool transaction gossip and consensus vote gossip. Since the full contents of each block will no longer be gossiped together, it's possible that some nodes are missing a proposed transaction or the vote of a validator indicated in the new block gossip format. The mempool and consensus reactors would need to be updated to provide `NeedTxs` and `NeedVotes` messages. Each of these messages would allow a node to request a set of data from its peers. When a node receives one of these, it would then transmit the Txs/Votes indicated in the associated message, regardless of whether it believes it has transmitted them to the peer before.
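To make the shape of this proposal more concrete, here is a minimal Go sketch of what such messages might look like. The type and field names (`CompactBlock`, `NeedTxs`, `NeedVotes`, etc.) are illustrative assumptions for this discussion, not a concrete wire format from the RFC; in Tendermint these would ultimately be defined as protobuf messages.

```go
package p2psketch

// Illustrative sketch only: these type and field names are assumptions made
// for this discussion, not the RFC's concrete wire format.

// Header stands in for the existing block header type (types.Header).
type Header struct{ /* existing header fields elided */ }

// CompactBlock carries the header and block metadata but replaces the full
// transactions and commit signatures with compact references.
type CompactBlock struct {
	Header       Header
	TxHashes     [][]byte // hashes of the block's txs, in order
	VoteBitArray []bool   // bit i set => validator i's precommit is in the commit
}

// NeedTxs lets a node request the full bytes of transactions it is missing
// after receiving a CompactBlock.
type NeedTxs struct {
	Height   int64
	TxHashes [][]byte
}

// NeedVotes lets a node request precommits it is missing, referenced by
// validator index as in the VoteBitArray.
type NeedVotes struct {
	Height           int64
	Round            int32
	ValidatorIndexes []int32
}
```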
Contributor

GossipSub works a little bit in this way with Want and Have messages. I think you'd want to have more than just Want/Need messages, because you don't know who actually has the data you're requesting and so you're just blindly requesting it. What would be better is that, upon receiving a block, a node communicates all the txs it has. This can be quite compact in the same way VoteSetBits is, because you reference by index in the block as opposed to by hash.
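A rough sketch of how compact such an index-based "have" announcement can be (the `HasTxs` name and shape below are assumptions for this discussion, not an existing Tendermint message):

```go
package p2psketch

// HasTxs is a hypothetical message a node could send after seeing a block's
// tx list: one bit per transaction, indexed by the tx's position in the block,
// in the same spirit as VoteSetBits. For a block with N txs this costs roughly
// N bits, versus N 32-byte hashes for a hash-based request.
type HasTxs struct {
	Height int64
	Round  int32
	HasTx  []bool // HasTx[i] == true when the sender already has the i-th tx
}
```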

Contributor

This is a good approach to start thinking about, but I am afraid that the required changes are more complex. For instance, can a node forward a block when it does not have the full content (votes and txs) it references? Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

Contributor

Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

[I mentioned this in another comment] Why do we need to guarantee that? How would that be different from forwarding a block part of a block for which we haven't received all block parts yet (in the current logic)?

I'm only referring to TXs here. I understand votes are slightly different because consensus will stop propagating them upon decision.

Contributor Author

I think you'd want to have more than just Want/Need messages because you don't know who actually has the data you're requesting and so you're just blindly requesting it.

This is true; however, the proposed mechanism would still result in the node receiving all of the data from its peers, since its peers would have to receive that data as well in order to actually commit the block at all.

Putting it another way, how do we guarantee that the referenced votes and txs are always available in the network?

I'm a bit unclear on how this would be any different from what we have today. If a validator has already committed the block, then all of the txs and votes it references are available as part of that block. If the block has been pruned by all validators, then the txs and votes may also be missing, but the block parts would also be gone, so we are, as I see it, no worse off than we were before.


Given that Tendermint informs all peers of _each_ vote message it receives, all nodes should be well informed of which votes their peers have. Since vote messages were nevertheless the third largest consumer of bandwidth in the Osmosis observation, it's possible that this system is not currently working correctly. Further analysis should examine where votes may be getting retransmitted.

### Suggested Improvements to Lower Message Transmission Bandwidth
Contributor

Erasure-coded block propagation is another approach that people (especially Dev) have been keen on. It decreases the chances that you receive the same part multiple times, because you only need, for instance, 5 of 10 parts to reproduce the entire block. See #7932 (comment)

Contributor

Erasure coding will only help if we are more structured about how we spread information, since each node still needs to receive the same amount of data for each proposal.

For example, if the proposal data amounts to 40k, or 10 "original" block parts, and you add 5 "parity" parts, then you have to spread 15 parts from the proposer's point of view, and each node needs to receive 10 parts, original or parity. If some of the received parts are parity, they will be used to reconstruct the missing original ones, adding some processing overhead, before the node can reconstruct the proposal and run ProcessProposal.
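As a rough sketch of the mechanics being discussed (using the github.com/klauspost/reedsolomon library purely for illustration; Tendermint does not currently do any of this), splitting a block into 10 data shards plus 5 parity shards lets any 10 of the 15 shards reconstruct the original bytes:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// Hypothetical proposal bytes; in Tendermint this would be the serialized block.
	block := bytes.Repeat([]byte("proposal-bytes "), 4096)

	// 10 "original" shards + 5 "parity" shards, as in the example above.
	enc, err := reedsolomon.New(10, 5)
	if err != nil {
		log.Fatal(err)
	}

	shards, err := enc.Split(block) // 15 equally sized shards
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil { // fill in the 5 parity shards
		log.Fatal(err)
	}

	// Simulate a node that received only 10 of the 15 shards (any 10 suffice).
	shards[0], shards[3], shards[7], shards[12], shards[14] = nil, nil, nil, nil, nil

	// Reconstruct the missing shards; this is the extra processing overhead
	// mentioned above, paid before the block can be reassembled.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(block)); err != nil {
		log.Fatal(err)
	}
	fmt.Println("reconstructed bytes:", out.Len())
}
```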

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@sergio-mena (Contributor) left a comment

Excellent read!

@cason (Contributor) left a comment

Nice document.

I think it involves some discussions that are already present elsewhere in the repository, and those should be referenced. Also, I think some of the proposed alternatives should be discussed in more detail in future versions or in derived RFCs/ADRs.

#### BlockPart Transmission

Sending `BlockPart` messages consumes the most bandwidth out of all p2p message types observed on the Blockpane Osmosis validator.
In the almost 3-hour observation, the validator sent about 20 gigabytes of `BlockPart` messages.
Contributor

I probably already mentioned this, but it is useful here to know how many bytes were effectively added to the blockchain in this period. This gives an idea of the communication overhead: consumed bandwidth / payload.
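For illustration only, since the committed-payload figure was not collected for this report: if the chain had hypothetically committed 1 GB of block data over the same window, the overhead factor for `BlockPart` traffic alone would be roughly 20 GB / 1 GB = 20x.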


#### Mempool Tx Transmission

The Tendermint mempool stages transactions that have yet to be committed to the blockchain and communicates these transactions to its peers. Each message contains one transaction. Data collected from the Blockpane node running on Osmosis indicates that the validator sent about 12 gigabytes of `Txs` messages during the nearly 3-hour observation period.
Contributor

Again, it would be interesting to know the cumulative size of all transactions committed in the same time frame, so that we have an idea of the overhead.

Contributor Author

I'll take a look at collecting this. I have the height offsets, so we should be able to figure this out.


github-actions bot commented Dec 3, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale for use by stalebot label Dec 3, 2022
@thanethomson thanethomson added S:wip Work in progress (prevents stalebot from automatically closing) and removed stale for use by stalebot labels Dec 3, 2022

The Tendermint mempool starts a new [broadcastTxRoutine][broadcast-tx-routine] for each peer that it is informed of. The routine sends all transactions that the mempool is aware of to each peer, with one exception: if the mempool received a transaction from a peer, it marks it as such and won't resend it to that peer. Otherwise, it retains no information about which transactions it has already sent to a peer. In some cases it may therefore resend transactions the peer already has. This can occur if the mempool removes a transaction from the `CList` data structure used to store the list of transactions while it is about to be sent, and the transaction was the tail of the `CList` at the time of removal. This is more likely to occur if a large number of transactions from the end of the list are removed during `RecheckTx`, since multiple transactions will successively become the tail and then be deleted. It is unclear at the moment how frequently this occurs on production chains.

Beyond ensuring that transactions are rebroadcast to peers less frequently, there is not a simple scheme to communicate fewer transactions to peers. Peers cannot communicate what transactions they need since they do not know which transactions exist on the network.
@lasarojc (Contributor) commented Dec 6, 2022

Here they could communicate what they have, for example in a Bloom filter, and let the other nodes figure out what they don't have and must be sent. Bloom filters could be reset at each new height or use some aging scheme.
The Bloom filter size could be a consensus parameter, adjusted whenever the filter is deemed too full or too empty. A bad size choice would either use more bandwidth to transmit an almost empty filter (too big) or slow the propagation of transactions due to false positives (too small), until a reconfiguration happens.
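For illustration only (nothing like this exists in Tendermint today, and the sizes and hash construction here are arbitrary), a node could summarize the tx hashes it already holds in a small Bloom filter and ship that to peers:

```go
package p2psketch

import (
	"crypto/sha256"
	"encoding/binary"
)

// TxBloom is a hypothetical fixed-size Bloom filter over transaction hashes.
// A real design would size it (and pick k) from the expected mempool size and
// the acceptable false-positive rate, possibly via a consensus parameter.
type TxBloom struct {
	bits []byte // m = len(bits)*8 bits
	k    int    // number of hash functions
}

func NewTxBloom(mBytes, k int) *TxBloom {
	return &TxBloom{bits: make([]byte, mBytes), k: k}
}

// positions derives k bit positions from the tx hash by re-hashing it with a
// counter; good enough for a sketch, not tuned for production.
func (b *TxBloom) positions(txHash []byte) []uint64 {
	out := make([]uint64, b.k)
	m := uint64(len(b.bits)) * 8
	for i := 0; i < b.k; i++ {
		h := sha256.Sum256(append([]byte{byte(i)}, txHash...))
		out[i] = binary.BigEndian.Uint64(h[:8]) % m
	}
	return out
}

// Add records that this node has the transaction with the given hash.
func (b *TxBloom) Add(txHash []byte) {
	for _, p := range b.positions(txHash) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

// MightHave reports whether the peer that sent this filter may already have
// the transaction; false positives are possible, false negatives are not.
func (b *TxBloom) MightHave(txHash []byte) bool {
	for _, p := range b.positions(txHash) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false
		}
	}
	return true
}
```

A sender would then skip gossiping any tx for which `MightHave` returns true, accepting that a false positive merely delays that tx until another peer delivers it.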


Block, vote, and mempool gossiping transmit much of the same data. The mempool reactor gossips candidate transactions to each peer. The consensus reactor, when gossiping votes, sends vote metadata and the digital signature that signs over that metadata. Finally, when a block is proposed, the proposing node combines the received votes with a set of transactions and adds a header to produce the block. This block is then serialized and gossiped as a list of bytes. However, the data that the block contains, namely the votes and the transactions, was most likely _already transmitted to the nodes on the network_ via mempool transaction gossip and consensus vote gossip.

Therefore, block gossip can be updated to transmit a representation of the data contained in the block that assumes the peers will already have most of this data. Namely, the block gossip can be updated to only send 1) a list of transaction hashes and 2) a bit array of the votes selected for the block, along with the header and other required block metadata.
@lasarojc (Contributor) commented Dec 6, 2022

We could also have block parts include transactions that were inserted during PrepareProposal, as these will not be in the mempool, or have the proposer add such transactions to the mempool. These cannot be enforced, though, and byzantine nodes could use this to stall rounds, but that is no different from their current ability to withhold block parts.

williambanfield and others added 3 commits December 8, 2022 10:47
Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Sergio Mena <sergio@informal.systems>
@williambanfield (Contributor Author) commented

There is still an open question here regarding the total size of all transactions added to the chain during this experiment. I have not had a chance to retrieve that data. I am planning to still merge this despite the open question so it can remain in the repo for future reference.

@williambanfield williambanfield added the S:automerge Automatically merge PR when requirements pass label Dec 8, 2022