WIP General/Lazy State-Sync pseudo-spec #3639
Comments
Link with the tracking issue.
How does this proposal deal with the insertion-order dependence of the IAVL tree? Also, should we include some description of how it might work for a different kind of tree, like a Patricia trie? I guess it would work the same way: the fact that in Ethereum's trie you can't iterate over the key space directly shouldn't matter, because you can still iterate over the final keys (i.e. the hashes of the original keys), right?

My other major concern is that we just did a lot of work refactoring the blockchain reactor with the intention that the design would largely apply to state sync as well, insofar as both are a series of indexed chunks being downloaded in parallel from many peers. This design would be quite different from that.
How does the new peer verify the state the old peer gave it? I am assuming the new peer has a root hash, which it has obtained somewhere, and can use it to verify the simple tree that it got from the old peer, correct?
@ebuchman I think some of the points proposed here help the original proposal, but most of the concerns in the first section of my proposal still hold:
I assume jaekwon proposes syncing full IAVL nodes (rather than leaf key/value pairs), just like I did in our previous version of state sync: #3243. If not, this is also a concern.
I'm pretty sure the proposal here is to use the key/value pairs, not the internal IAVL nodes. From one perspective, using key/value pairs is more efficient, because it's a smaller amount of data to sync (i.e. just the leaves, not the entire tree). On the other hand, it might turn out to be less efficient in practice, since it requires all the lookups. Also, I'm not sure how the insertion-order dependence of the IAVL tree would be handled here.
IAVLRangeProof expresses internal IAVL nodes for a range of key/values, including the key/value leaves. It's a depth-first search into a range of values. Reconstructing this tree would require extra code in the IAVL tree, but it could be done... it needs to be done anyway for the IAVL tree, so I would make it work for range proofs, since they hold all the info you need.
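To illustrate why the inner-node structure matters, here is a minimal Go sketch of folding leaf hashes into a root the way a simple (non-IAVL) Merkle tree would. The `leafHash`/`innerHash` rules and domain-separation prefixes here are illustrative assumptions, not the actual IAVL hashing format (which also commits to height, size, and version); the point is that an IAVL root cannot be recomputed from leaves alone with a fixed fold like this, which is why the range proof must carry the inner nodes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash stands in for the real IAVL leaf hashing rule.
// The 0x00 prefix is a common domain-separation convention, assumed here.
func leafHash(key, value []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0}) // leaf prefix
	h.Write(key)
	h.Write(value)
	return h.Sum(nil)
}

// innerHash stands in for the real IAVL inner-node hashing rule.
func innerHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{1}) // inner prefix
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

// rootFromLeaves folds leaf hashes left-to-right into a single root,
// as a simple balanced Merkle tree would. A real IAVL root instead
// depends on the tree's rotation history, so the fold order must come
// from the proof's inner nodes.
func rootFromLeaves(hashes [][]byte) []byte {
	if len(hashes) == 0 {
		return nil
	}
	for len(hashes) > 1 {
		var next [][]byte
		for i := 0; i < len(hashes); i += 2 {
			if i+1 < len(hashes) {
				next = append(next, innerHash(hashes[i], hashes[i+1]))
			} else {
				next = append(next, hashes[i])
			}
		}
		hashes = next
	}
	return hashes[0]
}

func main() {
	var hashes [][]byte
	for _, kv := range [][2]string{{"a", "1"}, {"b", "2"}, {"c", "3"}} {
		hashes = append(hashes, leafHash([]byte(kv[0]), []byte(kv[1])))
	}
	fmt.Printf("root: %x\n", rootFromLeaves(hashes))
}
```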
I've always thought that the blockchain reactor would differ from the state-sync reactor because the blockchain must be verified in sequence, whereas state doesn't need to be: it can be verified via Merkle proofs in parallel. If you tried to use Merkle proofs to verify the whole range of past blocks, you would end up with somewhat weaker safety guarantees. So that's my concern about creating a unified system for both. It could be done, but it doesn't seem natural to do it that way.
Yes, it starts from the root and pulls data while verifying it. Breadth-first is achieved via ref links ("more"), which requires support from the store implementation (the IAVL tree implementation wouldn't support this initially, as RangeProof provides depth-first ranges). With subtrees, the requester gets to choose breadth-first or depth-first.
OldPeer gives recommended segmentation via More links, but NewPeer could also brute-force it by bisecting the entire possible key space.
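The fetch loop described above can be sketched as a work queue of path ranges, where each response contributes leaves and may recommend further segmentation via More refs. The type and field names below (`StatePathRange`, `StateResponse`, `Pairs`, `More`) are assumptions modeled loosely on the message names in the pseudo-spec, not its actual definitions.

```go
package main

import "fmt"

// StatePathRange is a hypothetical half-open key range [Start, End).
type StatePathRange struct{ Start, End string }

// StateResponse is a hypothetical stand-in for MsgStateResponse:
// leaf pairs served for the range, plus recommended follow-up ranges.
type StateResponse struct {
	Pairs map[string]string // leaf key/values in the requested range
	More  []StatePathRange  // recommended further segmentation
}

// fetcher abstracts "ask some peer for this range".
type fetcher func(StatePathRange) StateResponse

// syncState drains a work queue of ranges, following More refs
// breadth-first until no further segmentation is recommended.
// In a real reactor each dequeued range could go to a different peer.
func syncState(root StatePathRange, fetch fetcher) map[string]string {
	state := map[string]string{}
	queue := []StatePathRange{root}
	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]
		resp := fetch(r)
		for k, v := range resp.Pairs {
			state[k] = v
		}
		queue = append(queue, resp.More...)
	}
	return state
}

func main() {
	// Toy fetcher: the root range splits once, then returns leaves.
	fetch := func(r StatePathRange) StateResponse {
		switch r.Start {
		case "":
			return StateResponse{More: []StatePathRange{{"a", "m"}, {"m", "z"}}}
		case "a":
			return StateResponse{Pairs: map[string]string{"apple": "1"}}
		default:
			return StateResponse{Pairs: map[string]string{"melon": "2"}}
		}
	}
	fmt.Println(syncState(StatePathRange{}, fetch))
}
```

Because each More ref is an independent range, a bad peer that withholds good refs only slows things down; any honest peer can re-split the same range by bisection.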
Yes, 500 elements. Yeah, if a single value is that large then I don't see any other way. I think the solution is to always chunk any large binaries at the app level, which can be done in a general way.
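App-level chunking of a large binary can be sketched as below: split one oversized value into fixed-size pieces so no single leaf exceeds a response's MaxBytes budget. The derived-key scheme (`key/0`, `key/1`, ...) is an illustrative assumption, not part of the proposal.

```go
package main

import "fmt"

// chunkValue splits one large value into pieces of at most maxBytes,
// stored under derived keys ("key/0", "key/1", ...). The key scheme
// is hypothetical; an app would pick its own convention.
func chunkValue(key string, value []byte, maxBytes int) map[string][]byte {
	chunks := map[string][]byte{}
	for i, off := 0, 0; off < len(value); i, off = i+1, off+maxBytes {
		end := off + maxBytes
		if end > len(value) {
			end = len(value)
		}
		chunks[fmt.Sprintf("%s/%d", key, i)] = value[off:end]
	}
	return chunks
}

func main() {
	big := make([]byte, 25)
	for k, c := range chunkValue("blob", big, 10) {
		fmt.Println(k, len(c))
	}
}
```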
The point of this proposal is to make state sync easy to implement, without requiring some long-running background preparation step to split the data and persist it into chunks, which may take a long time if there is a LOT of data (which will be true for many chains very soon). Instead, in this proposal the chunks are figured out at request time (which can be cheap, and doesn't have to be perfect), so there's no need to precompute them. It's interesting to consider how robustness works against griefing... if OldPeer Bob doesn't give me good MsgStateResponse.More refs, it's still OK as long as I have other peers that are good, because they can help split the workload recursively.
So with this proposal you'd still have to sync the internal tree nodes over the network?
Yes, this is more about the
The concern here, based on @ackratos's experiments, is that this real-time querying is actually quite slow, and it's much preferable to do some work up front so that requests can be served much more quickly.
If we can propose a way to depth-first traverse the IAVL tree fast, then chunking will benefit from it (be fast) as well :) I admit it's not acceptable if we want to take a snapshot of the most recent 100 blocks, but once you solve the bottleneck of traversing the IAVL tree, I believe the chunking solution will be acceptable. What do you think?
Depends on the tree, but for IAVL you have to anyway, in order to replicate the tree structure. You don't have to query for it breadth-first; it's included in each range proof.
Hmm, it sounds to me like the pools will be quite dissimilar! Anyway, copy/pasting sounds like a fine way to start, and we can see how it evolves.
Interesting....
There are two components to the "each time" state request in this proposal: (1) the initial query itself, including inner nodes and the leaf node of that query, and (2) providing More refs, calculated intelligently. Here, (1) is easy to amortize by having larger chunks (e.g. a MaxBytes of, say, 10MB, or generally returning many leaves per request), and (2) can be precomputed. Precomputing (2) is basically just a table of paths, without any of the data, so the size of the precomputed data would be tiny... it doesn't need more than, say, 10k paths (which would provide enough parallelization), which would only be about 1MB of data. It sounds like there's a hybrid solution that does some precomputation for splitting the workload, while keeping the precomputation efficient by storing only short paths, and leaving the fetching of values lazy.
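The precomputed split table described above can be sketched as picking evenly spaced boundary keys out of the sorted key space; only the boundary paths are stored, never the values. The function name and shape are illustrative, not from the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// splitPaths picks n-1 boundary keys out of a key list, defining n
// roughly equal ranges. This is the cheap precomputation: a small
// table of paths with none of the underlying data.
func splitPaths(keys []string, n int) []string {
	sort.Strings(keys)
	var bounds []string
	for i := 1; i < n; i++ {
		bounds = append(bounds, keys[i*len(keys)/n])
	}
	return bounds
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
	// 4 ranges need 3 boundaries.
	fmt.Println(splitPaths(keys, 4)) // [c e g]
}
```

At 10k boundaries of ~100 bytes each, the whole table stays around 1MB, matching the estimate above.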
I did some benchmarks with RocksDB, and I believe my design is viable for random tree-chunk accesses. With 10M nodes, each with a random 30-byte key and 200-byte value, I was able to read random ranges of nodes at ~750MB/s (e.g. 100 chunks of 32,768 nodes / 7.5MB each per second) on an i9-9900K processor. At this rate, it's possible to read the entire 2.3GB database in 3 seconds, and this is only with one core. RocksDB lets us do concurrent reads, so throughput can be multiplied by the number of cores available. If we use a tree where all inner nodes have a key/value rather than only leaf nodes, then each range inside the database key space represents an entire subtree with no inner nodes to compute, so there is only minimal processing needed beyond reading from the db. This can be done lazily, with some ongoing CPU cost scaled to the demand for state syncing, but it could also easily be cached to disk to free up the CPU (storage is cheap). My code is here: https://gist.github.com/mappum/3323a386203e506aa893a5f3d7622a72 (sorry for the strange style, I was optimizing for usage of Rust features I've never used before)
@mappum thank you.
Superseded by ADR-053. |
TODO: link with other state-sync specs/work and compare/contrast.
General/Lazy State syncing pseudo-spec
The SDK has two layers of trees -- the Simple Tree and, underneath it, IAVL trees.
This could change in the future too -- e.g. IAVL tree of trees, and other trees besides IAVL trees (e.g. perhaps even N-dimensional spatial trees).
Design goals:
Msgs and Types
StatePathRange{}
MsgStateRequest{} // a peer requesting state from another
// NOTE: MaxBytes is the max bytes the receiver is willing to tolerate; it doesn't mean the sender will send that much.
// The sender can clip at its discretion, and it's up to the receiver to ask for more.
ABCIStateRequest{} // corresponding ABCI message
// NOTE: same as MsgStateRequest... Tendermint passes it through to the app via ABCI.
MsgStateResponse{} // a peer responding with state to requester
ABCIStateResponse{} // corresponding ABCI message
// NOTE: same as MsgStateResponse... Tendermint passes it through to the app via ABCI.
ABCIStatePersistRequest{} // Tendermint on receiving side asking the app to save
// NOTE: same as MsgStateResponse...
ABCIStatePersistResponse{} // App responding to Tendermint
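The pseudo-spec leaves the message bodies empty; as a reading aid, here is one possible Go sketch of the types listed above. Every field name here is an assumption for illustration, except MaxBytes, whose clipping semantics come from the NOTE above.

```go
package main

import "fmt"

// StatePathRange is a hypothetical key range; the spec does not
// define its fields.
type StatePathRange struct {
	Start []byte // first path in the range, inclusive (assumed)
	End   []byte // last path in the range, exclusive (assumed)
}

// MsgStateRequest: a peer requesting state from another.
type MsgStateRequest struct {
	Range    StatePathRange
	MaxBytes int64 // receiver's tolerance; the sender may clip below this
}

// MsgStateResponse: a peer responding with state to the requester.
type MsgStateResponse struct {
	Pairs [][2][]byte      // key/value leaves served for the range (assumed)
	More  []StatePathRange // recommended follow-up segmentation (assumed)
}

// The ABCI variants mirror the peer messages; per the NOTEs above,
// Tendermint passes them through to the app unchanged.
type ABCIStateRequest MsgStateRequest
type ABCIStateResponse MsgStateResponse
type ABCIStatePersistRequest MsgStateResponse

// ABCIStatePersistResponse: the app acknowledging the save (assumed shape).
type ABCIStatePersistResponse struct {
	OK bool
}

func main() {
	req := MsgStateRequest{MaxBytes: 10 << 20} // e.g. a 10MB budget
	fmt.Println("requesting up to", req.MaxBytes, "bytes")
}
```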
Example
Other Notes:
Requesting the contents of {"1,1","1,2"} (e.g. the geoquad labeled "1,1") may yield items, or further quads. Hashing happens recursively for each subspace.