add tree-chunking.md #3799

Closed · wants to merge 3 commits

Conversation

ebuchman (Contributor)

Wasn't sure where to put this but opening for review/discussion.

I think I convinced myself that if we're going to use some form of chunking, we'll want to sync the chunks in order.

@ancazamfir (Contributor) left a comment

I agree this is the best way to start. I assume there will be an ADR for it and that this PR is just a justification of why it needs to be done.
Also, what is the plan for the Binance warp sync?

docs/architecture/tree-chunking.md (outdated):
front, and could then receive the other chunks in any order. However to
generalize, we may be required to receive multiple chunks in order first, before
we get to a point where we can receive them in any order.

@cwgoes (Contributor), Jul 15, 2019

For a depth-first traversal, consider the following:

  • Chunkee & chunker negotiate a manifest m mapping chunk indices to contiguous subsets (ranges) of keyspace. They agree on n chunks, and they now both can map the integer range 0 to n - 1 to a subset of keyspace (start key, end key), where the end keys and start keys of contiguous chunks are identical (end is exclusive).

  • Chunkee requests chunk c_k from chunker. Chunker sends some c_l where k might or might not be equal to l.

  • Chunkee checks keys in c_l against known mapping from k. If keys are not within the range, chunkee rejects chunk & bans peer. Chunkee now knows that the keys do belong to chunk c_k, although it does not know if the values are correct or if all keys that should be in c_k were included.

  • Chunkee now chooses random key in the chunk and requests Merkle proof of key, value pair (against known state root from light client) from chunker, or chooses random key range not in the included keys but in the chunk range and requests range proof of non-inclusion from chunker. If Merkle proof is not provided, chunkee rejects chunk and bans peer. If Merkle proof does not validate, chunkee rejects chunk and bans peer.

  • Repeat random requests some r times with r a security parameter chosen according to expected peer behaviour, total number of chunks, etc.

  • Chunkee now knows (in expectation) that this chunk is valid - that it has the correct keys, the correct values, and all the keys within this range. Chunkee applies chunk.

  • Repeat for all chunks in range 0 to n - 1, in any order & in parallel if supported by the underlying tree.

  • At the end, verify whole tree. If the state root does not match, binary search by requesting Merkle proofs of subtrees from peers and fetch any incorrect chunks. This step could be taken earlier for partial subtrees if we're concerned about malicious peers.

For trees which can handle out-of-order insertion, that should work in parallel as far as I can tell. It does require a tree which supports range exclusion proofs, and it includes randomness to minimize the number of proofs required, which makes the security a bit more complex to reason about but I think will be more efficient in practice. This method should support trees and chunks of any size, with appropriate choices of the security parameter, and can translate into sequential reads if the underlying store stores contiguous keyspace sequentially.

Maybe I'm missing something though.
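
To make the flow above concrete, here is a rough Go sketch of the per-chunk verification described in the list (range check plus r random existence spot-checks against the trusted root). All of the types and the peer interface are hypothetical placeholders for illustration, not actual Tendermint or IAVL APIs; non-existence range proofs and the final whole-tree check are omitted.

```go
package chunksync

import (
	"bytes"
	"errors"
	"math/rand"
)

// KeyRange is a contiguous subset of keyspace; End is exclusive.
type KeyRange struct{ Start, End []byte }

// Manifest maps chunk index i to m.Ranges[i].
type Manifest struct{ Ranges []KeyRange }

// Chunk is the raw key/value payload for one manifest range.
type Chunk struct {
	Index int
	Keys  [][]byte
	Vals  [][]byte
}

// Proof stands in for a Merkle existence proof against the state root.
type Proof interface {
	Verify(root, key, value []byte) bool
}

// Chunker is the peer serving chunks and proofs.
type Chunker interface {
	RequestProof(key []byte) (Proof, error)
}

func inRange(key []byte, r KeyRange) bool {
	return bytes.Compare(key, r.Start) >= 0 && bytes.Compare(key, r.End) < 0
}

// verifyChunk checks that every key lies in the negotiated range, then
// spot-checks r random key/value pairs with Merkle proofs against the
// trusted state root obtained from the light client.
func verifyChunk(c Chunk, m Manifest, root []byte, peer Chunker, r int) error {
	rng := m.Ranges[c.Index]
	for _, k := range c.Keys {
		if !inRange(k, rng) {
			return errors.New("key outside negotiated range: reject chunk, ban peer")
		}
	}
	for i := 0; i < r && len(c.Keys) > 0; i++ {
		j := rand.Intn(len(c.Keys))
		proof, err := peer.RequestProof(c.Keys[j])
		if err != nil || !proof.Verify(root, c.Keys[j], c.Vals[j]) {
			return errors.New("missing or invalid Merkle proof: reject chunk, ban peer")
		}
	}
	return nil // in expectation, the chunk has the right keys and values
}
```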

Contributor

It looks like this still doesn't solve the case where all connected chunkers are malicious (but provide consistent chunks).

If, at first, the chunkee negotiates with 3 malicious peers, then it has no chance to connect to the correct network.
If, at first, 2 peers send the chunkee one chunk-to-keyspace mapping while the other 2 peers send a different mapping, how should the chunkee behave (i.e. decide which 2 peers are malicious)?

If this is not the case the Merkle proof is meant to resolve, I think negotiating a hash of each chunk (which provides more information than indices; the begin and end keys can still be kept) would be easier.
I cannot see what additional benefit a Merkle proof provides over the hash of each chunk. Hash verification is:

  1. much more efficient (no proof request-response round trip)
  2. able to reject unexpected chunks earlier
  3. not dependent on top-layer chunks arriving in order to achieve a verifiable tree; each chunk can be processed as soon as it arrives
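
For comparison, a minimal sketch of the hash-per-chunk check being suggested here, assuming each manifest entry carries a pre-negotiated SHA-256 of the serialized chunk (the types and serialization are illustrative, not the Binance wire format). Note this only moves the trust question to how the manifest and its hashes are obtained, which is the point debated below.

```go
package chunksync

import (
	"bytes"
	"crypto/sha256"
)

// ManifestEntry pairs a key range with the expected hash of its chunk.
type ManifestEntry struct {
	StartKey, EndKey []byte
	ChunkHash        [sha256.Size]byte // negotiated before chunks are fetched
}

// checkChunkHash accepts a chunk as soon as it arrives if its serialized
// bytes hash to the value in the manifest; no proof round trip is needed.
func checkChunkHash(entry ManifestEntry, serializedChunk []byte) bool {
	sum := sha256.Sum256(serializedChunk)
	return bytes.Equal(sum[:], entry.ChunkHash[:])
}
```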

Contributor

  • Chunkee now chooses random key in the chunk and requests Merkle proof of key, value pair (against known state root from light client) from chunker

Why choose a random key for the existence check? Wouldn't a single whole-range request be better (avoiding the case where c_l misses or adds a key that is still within the range)?

@cwgoes (Contributor), Jul 16, 2019

If, at first, 2 peers send the chunkee one chunk-to-keyspace mapping while the other 2 peers send a different mapping, how should the chunkee behave (i.e. decide which 2 peers are malicious)?

Different keyspace mappings could be valid; there doesn't need to be a single canonical one - if the chunkee and chunker can't agree on the manifest (maybe the chunkee wants smaller or larger chunks), the chunkee can disconnect.

If all connected chunkers are malicious, no strategy works: the chunkee can never obtain chunks if the chunkers don't want to send them, since only the chunkers have the data. The best the chunkee can do is ban the peers.

If this is not the case the Merkle proof is meant to resolve, I think negotiating a hash of each chunk (which provides more information than indices; the begin and end keys can still be kept) would be easier.
I cannot see what additional benefit a Merkle proof provides over the hash of each chunk.

The Merkle proof provides existence / non-existence proofs against a trusted state root - how would the chunkee know that a chunk with a particular hash was in fact the correct chunk for the keyspace?

Why choose a random key for the existence check? Wouldn't a single whole-range request be better (avoiding the case where c_l misses or adds a key that is still within the range)?

Just so the proof construction, bandwidth, and verification costs are lower. You could request proofs for the whole chunk; it would just be more expensive.
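
As a companion to the existence spot-check sketched earlier, here is a rough sketch of the cheaper probabilistic non-inclusion check: pick a candidate key inside the chunk's negotiated range but not among the received keys, and ask for an absence proof against the trusted root instead of a proof covering the whole range. The interfaces are hypothetical placeholders; receivedKeys is assumed to be sorted.

```go
package chunksync

import (
	"bytes"
	"errors"
	"sort"
)

// AbsenceProof stands in for a Merkle proof of non-inclusion.
type AbsenceProof interface {
	VerifyAbsence(root, key []byte) bool
}

// ProofPeer is the peer asked to prove that a key is absent.
type ProofPeer interface {
	RequestAbsenceProof(key []byte) (AbsenceProof, error)
}

// spotCheckAbsence probes a single candidate key. Probing one key (or a few)
// keeps proof size and verification cost low compared to a whole-range
// proof, at the price of probabilistic rather than absolute assurance.
func spotCheckAbsence(root []byte, receivedKeys [][]byte, candidate []byte, peer ProofPeer) error {
	i := sort.Search(len(receivedKeys), func(i int) bool {
		return bytes.Compare(receivedKeys[i], candidate) >= 0
	})
	if i < len(receivedKeys) && bytes.Equal(receivedKeys[i], candidate) {
		return nil // candidate is actually present in the chunk; nothing to prove
	}
	proof, err := peer.RequestAbsenceProof(candidate)
	if err != nil || !proof.VerifyAbsence(root, candidate) {
		return errors.New("absence proof missing or invalid: reject chunk, ban peer")
	}
	return nil
}
```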

Contributor

how would the chunkee know that a chunk with a particular hash was in fact the correct chunk for the keyspace?

Later, when the chunkee compares the root hash with the one in the block's header?

Contributor

How hard is it to construct a chunk with a particular hash but with invalid data inside?

@ackratos (Contributor), Jul 19, 2019

Different keyspace mappings could be valid; there doesn't need to be a single canonical one - if the chunkee and chunker can't agree on the manifest (maybe the chunkee wants smaller or larger chunks), the chunkee can disconnect.

Then how does the chunkee know the keyspace at first? Without fixed-size chunks, you cannot support eager state sync, right?

The Merkle proof provides existence / non-existence proofs against a trusted state root - how would the chunkee know that a chunk with a particular hash was in fact the correct chunk for the keyspace?

In the Binance implementation, the manifest is a single canonical thing.

Contributor

Then how does the chunkee know the keyspace at first? Without fixed-size chunks, you cannot support eager state sync, right?

Chunker and chunkee negotiate on a manifest mapping integer chunk indices to bounded subsets of (the entire) keyspace. What do fixed-size chunks have to do with eager state sync? The chunker and chunkee would also need to pick a block height of the state to sync, if that's what you mean.

@ackratos (Contributor), Jul 20, 2019

Chunker and chunkee negotiate on a manifest mapping integer chunk indices to bounded subsets of (the entire) keyspace.

I mean the chunkee is a brand new node; it doesn't have any information about how many keys there are in total. By "entire keyspace" did you mean the whole space of possible 32-byte SHA-256 values (hashes of IAVL tree nodes), i.e. 0x10..00 - 0xFF..FF?

What do fixed-size chunks have to do with eager state sync? The chunker and chunkee would also need to pick a block height of the state to sync, if that's what you mean.

I mean, think about a situation with 100 chunkee peers and 1 chunker. All 100 chunkees want to negotiate different chunk sizes with the chunker, i.e. the first chunkee wants the chunk size to be 1M, the second one wants 2M, the third one wants 3M, ..., and the 100th chunkee wants 100M. To serve 100 kinds of clients, if the chunker wants to eagerly prepare the chunks (without traversing the IAVL tree each time a request comes), does it need to prepare 100 kinds of chunks on disk?

Contributor

I mean the chunkee is a brand new node; it doesn't have any information about how many keys there are in total. By "entire keyspace" did you mean the whole space of possible 32-byte SHA-256 values (hashes of IAVL tree nodes), i.e. 0x10..00 - 0xFF..FF?

Yes; the manifest maps integer indices to contiguous subsets of the entire keyspace.

I mean, think about a situation with 100 chunkee peers and 1 chunker. All 100 chunkees want to negotiate different chunk sizes with the chunker, i.e. the first chunkee wants the chunk size to be 1M, the second one wants 2M, the third one wants 3M, ..., and the 100th chunkee wants 100M. To serve 100 kinds of clients, if the chunker wants to eagerly prepare the chunks (without traversing the IAVL tree each time a request comes), does it need to prepare 100 kinds of chunks on disk?

That's right, but there could be suggested chunk sizes (which could themselves change over time); it's all in the peer-to-peer protocol and altruism is required anyway.
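
One way a brand-new chunkee could compute the manifest's ranges without knowing how many keys exist is to split the full 32-byte keyspace into n contiguous slices of (almost) equal width. A sketch under that assumption, with illustrative names; it says nothing about how many actual keys land in each slice:

```go
package chunksync

import "math/big"

// keyRange256 bounds a slice of the 2^256 keyspace; keys compare as
// big-endian 256-bit integers. End is exclusive.
type keyRange256 struct {
	Start, End *big.Int
}

// chunkRanges256 partitions the entire 32-byte keyspace (0x00...00 through
// 0xFF...FF) into n contiguous ranges, computable from n alone.
func chunkRanges256(n int) []keyRange256 {
	total := new(big.Int).Lsh(big.NewInt(1), 256) // 2^256 possible keys
	step := new(big.Int).Div(total, big.NewInt(int64(n)))

	ranges := make([]keyRange256, n)
	start := big.NewInt(0)
	for i := 0; i < n; i++ {
		end := new(big.Int).Add(start, step)
		if i == n-1 {
			end.Set(total) // last range absorbs the division remainder
		}
		ranges[i] = keyRange256{Start: start, End: end}
		start = end
	}
	return ranges
}
```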


In each case, it appears that liveness is made much more difficult by the need
to figure out what went wrong with the applied chunks. However, these problems
can be eliminated by requiring chunks to be applied in-order.

Contributor

Applying chunks in order is not necessary, and it harms the performance of state sync.

What we do is calculate a mapping from the IAVL nodes in a chunk to their multistore (with the help of recording a map[storeKey]numOfNodesInThisStore and, in each chunk, a field giving the index within the whole store of the chunk's first node).
https://github.com/binance-chain/BEPs/blob/master/BEP18.md#541-app-state-chunk

Once we receive a complete chunk, we write all the nodes within it directly to the corresponding multistore.
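
A rough sketch of one plausible reading of that bookkeeping: with per-store node counts (in a canonical store order) and the global index of a chunk's first node, each node of an out-of-order chunk can be routed to its substore. The field names and layout here are illustrative, not the actual BEP-18 format.

```go
package chunksync

// storeCount records how many IAVL nodes a substore contributes, listed in
// the canonical multistore order.
type storeCount struct {
	StoreKey string
	NumNodes int64
}

// chunkHeader carries the bookkeeping needed to place a chunk's nodes.
type chunkHeader struct {
	StoreNodeCounts []storeCount // totals per substore, canonical order
	StartIndex      int64        // global index of this chunk's first node
}

// storeForNode returns the substore key for the i-th node of the chunk
// (0-based within the chunk), or "" if the header is inconsistent.
func storeForNode(h chunkHeader, i int64) string {
	global := h.StartIndex + i
	var seen int64
	for _, s := range h.StoreNodeCounts {
		seen += s.NumNodes
		if global < seen {
			return s.StoreKey
		}
	}
	return ""
}
```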

ebuchman (Contributor, Author)

What we do is calculate a mapping from the IAVL nodes in a chunk to their multistore.

Right, but isn't your mapping only verifiable according to the manifest, which depends on a majority of honest peers? This write-up is specifically about chunking in the context of full light client security.

docs/architecture/tree-chunking.md:
with its index. For instance, if we receive chunk 5, the root hash of the
sub-tree contained therein should correspond to the 5th node in layer 10
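
A minimal sketch of the check described in the quoted excerpt, assuming the hashes of some upper layer of the tree have already been obtained and verified; computeSubtreeRoot stands in for whatever the underlying tree implementation provides.

```go
package chunksync

import (
	"bytes"
	"errors"
)

// checkChunkAgainstLayer accepts chunk chunkIndex only if the root of the
// subtree it contains matches the chunkIndex-th hash in the verified layer.
func checkChunkAgainstLayer(
	layerHashes [][]byte,
	chunkIndex int,
	chunkNodes [][]byte,
	computeSubtreeRoot func(nodes [][]byte) []byte,
) error {
	if chunkIndex < 0 || chunkIndex >= len(layerHashes) {
		return errors.New("chunk index outside the verified layer")
	}
	if !bytes.Equal(computeSubtreeRoot(chunkNodes), layerHashes[chunkIndex]) {
		return errors.New("subtree root mismatch: reject chunk")
	}
	return nil
}
```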

While this example provides the intuition for how the design might work,

Contributor

I would like to propose https://github.com/binance-chain/BEPs/blob/master/BEP18.md#541-app-state-chunk again.

It handles arbitrary node sizes and has effective chunk space utilization. :P

@ackratos (Contributor)

Also, what is the plan for the Binance warp sync?

It was upgraded in production yesterday; you can try https://github.com/binance-chain/node-binary/tree/master/fullnode/prod/0.6.0

Just set state_sync_reactor to true and state_sync_height to 0: https://github.com/binance-chain/node-binary/blob/master/fullnode/prod/0.6.0/config/config.toml#L19-L27

@codecov-io

Codecov Report

Merging #3799 into master will decrease coverage by 0.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3799      +/-   ##
==========================================
- Coverage   64.97%   64.83%   -0.15%     
==========================================
  Files         216      216              
  Lines       17565    17565              
==========================================
- Hits        11413    11388      -25     
- Misses       5203     5222      +19     
- Partials      949      955       +6
| Impacted Files | Coverage Δ |
|---|---|
| privval/signer_validator_endpoint.go | 75.55% <0%> (-10%) ⬇️ |
| privval/signer_service_endpoint.go | 83.63% <0%> (-5.46%) ⬇️ |
| privval/socket_listeners.go | 86.2% <0%> (-3.45%) ⬇️ |
| p2p/pex/pex_reactor.go | 83.13% <0%> (-1.17%) ⬇️ |
| blockchain/reactor.go | 70.56% <0%> (-0.94%) ⬇️ |
| consensus/reactor.go | 70.8% <0%> (-0.47%) ⬇️ |
| blockchain/pool.go | 80.26% <0%> (-0.33%) ⬇️ |

@brapse mentioned this pull request, Jul 22, 2019
@melekes added the WIP label, Sep 3, 2019

@melekes (Contributor) commented Sep 3, 2019

If I am not mistaken, @ebuchman has promised to expand on this.

@ebuchman (Contributor, Author) commented Sep 7, 2019

If I am not mistaken, @ebuchman has promised to expand on this.

I think we probably want to consolidate with ADR 042. Maybe ADR-042 should actually have its own folder and we can put multiple files in there as we work this all out.

We still need to actually write up an ADR for the initial design - let's chat about it at the next team meeting.

@melekes closed this, Jan 22, 2020
@erikgrinaker mentioned this pull request, Jan 29, 2020