
WIP General/Lazy State-Sync pseudo-spec #3639

Closed
jaekwon opened this issue May 8, 2019 · 14 comments
Labels
C:sync Component: Fast Sync, State Sync

Comments

jaekwon commented May 8, 2019

TODO: link with other state-sync specs/work and compare/contrast.

General/Lazy State syncing pseudo-spec

The SDK has two layers of trees -- a Simple Tree on top and, underneath, IAVL trees.
This could change in the future too -- e.g. an IAVL tree of trees, or other trees besides IAVL (perhaps even N-dimensional spatial trees).

Design goals:

  • Parallelizable -- should be able to fetch from multiple peers at a time
  • Works well with layers (of trees)
  • Make it easy for new tree/substore types to implement the protocol
  • Extensible
  • Big-O complexity sufficiently optimal
  • Shouldn't require e.g. fetching a large first item, or long precomputation.
  • Each item fetched should be (almost) immediately verifiable, e.g. can be checked against the root hash

Msgs and Types

  • StatePathRange{}

    • Start string
    • End string // empty string is the last theoretical key
  • MsgStateRequest{} // a peer requesting state from another

    • Height num
    • Range StatePathRange
    • MaxBytes num // 0 means unlimited
      // NOTE: MaxBytes is the max bytes the receiver is willing to tolerate; it doesn't mean the sender will send that much.
      // The sender can clip at its own discretion, and it's up to the receiver to ask for more.
  • ABCIStateRequest{} // corresponding ABCI message
    // NOTE: same as MsgStateRequest... Tendermint passes it through to the app via ABCI.

  • MsgStateResponse{} // a peer responding with state to requester

    • Height num
    • StartPath string // the requested start path
    • More []StatePathRange // sender helps receiver parallelize fetches w/ More refs
    • Type string // e.g. "iavl:range", "simple:tree"
    • Bytes []byte // self-proving structure
  • ABCIStateResponse{} // corresponding ABCI message
    // NOTE: same as MsgStateResponse... Tendermint passes it through to the app via ABCI.

  • ABCIStatePersistRequest{} // Tendermint on receiving side asking the app to save
    // NOTE: same as MsgStateResponse...

  • ABCIStatePersistResponse{} // App responding to Tendermint

    • Height num
    • Error (optional)
    • StartPath string
    • More []StatePathRange
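
For clarity, here is a minimal, non-authoritative Go sketch of the message and ABCI types listed above. The concrete field types (int64 heights, string paths, raw []byte payloads, type aliases for the "same as" messages) are my assumptions, not part of the spec.

```go
// A rough transcription of the message/ABCI types above into Go, for illustration only.
package statesync

// StatePathRange is a range of store paths; "" as End means "up to the last theoretical key".
type StatePathRange struct {
	Start string
	End   string
}

// MsgStateRequest: a peer requesting state from another peer.
// MaxBytes == 0 means unlimited; the sender may still clip at its own discretion.
type MsgStateRequest struct {
	Height   int64
	Range    StatePathRange
	MaxBytes int64
}

// ABCIStateRequest mirrors MsgStateRequest; Tendermint passes it through to the app via ABCI.
type ABCIStateRequest = MsgStateRequest

// MsgStateResponse: a peer responding with state to the requester.
type MsgStateResponse struct {
	Height    int64
	StartPath string           // the requested start path
	More      []StatePathRange // sender helps receiver parallelize fetches
	Type      string           // e.g. "iavl:range", "simple:tree"
	Bytes     []byte           // self-proving structure
}

// ABCIStateResponse and ABCIStatePersistRequest carry the same fields as MsgStateResponse.
type (
	ABCIStateResponse       = MsgStateResponse
	ABCIStatePersistRequest = MsgStateResponse
)

// ABCIStatePersistResponse: the app responding to Tendermint after persisting.
type ABCIStatePersistResponse struct {
	Height    int64
	Error     error // optional
	StartPath string
	More      []StatePathRange
}
```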

Example

Actors: NewPeer, OldPeer
1:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"",""},0}
2:  OldPeer->OldPeer: ABCIStateRequest{1000,Range{"",""},0}
3:  OldPeer->OldPeer: ABCIStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
4:  OldPeer->NewPeer: MsgStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
5:  NewPeer->NewPeer: ABCIStatePersistRequest{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
6:  NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"",[
     {"substore1", "substore2"},
     {"substore5", "substore6"},
     {"substoreABC", "substoreABD"}
   ]}
// NOTE: By this point, NewPeer has the root hash for height 1000 and also knows the persistent IAVL substore hashes for all substores ("substore1", "substore5", and "substoreABC").  At step 6, the app parses the MULTISTORE bytes, discovers all the substores, and is thus able to tell Tendermint which further paths to request.
7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore1","substore2"},0}
// The below could happen in parallel:
// 7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore5","substore6"},0}
// 7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substoreABC","substoreABD"},0}
...
8:  OldPeer->NewPeer: MsgStateResponse{1000,"substore1","iavl:range",<IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES>}
9:  NewPeer->NewPeer: ABCIStatePersistRequest{1000,"substore1","iavl:range",<IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES>}
10: NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"substore1",More:[{"substore1/501",""}]} // next key range is {"substore1/501", ""}
11: NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore1/501","substore2"},0}
12: OldPeer->NewPeer: MsgStateResponse{1000,"substore1/501","iavl:range",<IAVL_RANGE_501_to_END_IN_SUBSTORE1_BYTES>}
13: NewPeer->NewPeer: ABCIStatePersistRequest{1000,"substore1","iavl:range",<IAVL_RANGE_501_to_END_IN_SUBSTORE1_BYTES>}
14: NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"substore1/501",More:[{"substore2",""}]}

ENGLISH:
1: New peer asks old peer to start state syncing at height 1000.
2: Old peer asks ABCI app.
3: Old peer's ABCI app responds with an entire simple tree (the simple tree is always complete) of the top-level Cosmos SDK store.  It is a list of commit references to all substores at that height.
4: Old peer responds to new peer with the same tree.
5: New peer tells its ABCI app.
6: New peer's ABCI app asks to fetch more things, namely the substore contents.
7: New peer asks for "substore1" content.
8: Old peer responds with the first 500 elements.
9: New peer asks ABCI app to save them.
10: New peer's ABCI app suggests fetching everything else starting from "substore1/501".
11: New peer asks old peer for more items within the "substore1" substore after key "501".
12: Old peer responds with more starting from key "substore1/501".
13: New peer asks ABCI app to save them.
14: New peer's ABCI app tells Tendermint that the next key is "substore2", implying that "substore1" keys have been fully received.

NOTES:
10: The key in the IAVL substore is "501", but "substore1/" is prepended by the multistore impl.
11: Tendermint takes the intersection of {"substore1/501",""} from the previous step and {"substore1","substore2"} from step 6, yielding the range {"substore1/501","substore2"} requested here.
11: The other two substores are not queried since their requests were commented out in step 7.
14: {"substore1", "substore2"} from step 6 intersected with {"substore2",""} is the empty set, which means Tendermint knows to stop making requests and not request for anything in {"substore2",""}.

Other Notes:

  • MsgStateResponse and ABCIStatePersistResponse each have a More field. The sender (e.g. OldPeer) may split the query space into subspaces and return them in MsgStateResponse.More refs to help the receiver parallelize requests. The sender should use this if the size of the response bytes exceeds request.MaxBytes.
  • Tendermint should be persisting and updating "More:..." references it receives from ABCIStatePersistResponse into some priority queue (after intersection with request scope and filtering out references outside the current scope) and making requests to many peers in parallel.
  • 2-d or N-d data can be saved too, as long as the keys can be linearized. How a key is linearized into a string depends on the Merkle-izing scheme (and thus the state-chunk proving scheme), which determines how the keys are formatted and how the value/domain/space parts are ordered -- and how a key is mapped to subspaces depends on the spatial algorithm. For example, here's a dumb geoquad example (one possible key linearization is sketched after it):
+---+---+---+---+
|1,1|2,1|3,1|4,1|
+---+---+---+---+
|1,2|2,2|3,2|4,2|
+---+---+---+---+
|1,3|2,3|3,3|4,3|
+---+---+---+---+
|1,4|2,4|3,4|4,4|
+---+---+---+---+

Requesting the contents of {"1,1","1,2"} (e.g. the geoquad labeled "1,1") may yield items, or further quads. Hashing happens recursively for each subspace.
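
As a toy illustration of "linearizing" 2-d keys, here is one hypothetical quadkey-style encoding in Go, where each character of the key narrows the space by one quadrant; the scheme and names are assumptions for this sketch, not part of the proposal.

```go
// A toy sketch of one possible key linearization for the geoquad example above.
package main

import "fmt"

// quadKey encodes an (x, y) cell in a 2^depth x 2^depth grid as a string of
// quadrant digits ('0'..'3'), so that every prefix identifies a subquad.
func quadKey(x, y, depth int) string {
	key := make([]byte, depth)
	for i := depth - 1; i >= 0; i-- {
		d := byte('0')
		if x&1 == 1 {
			d += 1
		}
		if y&1 == 1 {
			d += 2
		}
		key[i] = d
		x >>= 1
		y >>= 1
	}
	return string(key)
}

func main() {
	// The 4x4 grid above needs depth 2; the cell labeled "1,1" in the drawing is (0,0) here.
	fmt.Println(quadKey(0, 0, 2)) // "00" -- top-left quad, top-left cell
	fmt.Println(quadKey(3, 3, 2)) // "33" -- bottom-right quad, bottom-right cell
	// All keys with prefix "0" fall inside the top-left quad, so a range like
	// {"0", "1"} covers exactly that subspace, matching the recursive hashing idea.
}
```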

@jaekwon jaekwon changed the title WIP State-Sync pseudo-spec WIP General State-Sync pseudo-spec May 8, 2019
zmanian commented May 8, 2019

Link with the tracking issue.
#828

ebuchman commented May 8, 2019

How does this proposal deal with insertion order dependence of the IAVL tree?

Also, should we include some description of how it might work for a different kind of tree, like the Patricia Trie? I guess it would work the same way - the fact that in Ethereum's trie you can't iterate over the key space directly shouldn't matter, because you can still iterate over the final keys (i.e. the hashes of the original keys), right?

My other major concern is we just did a lot of work to refactor the blockchain reactor with the intention that the design would largely apply to state sync as well in so far as it's a series of indexed chunks being downloaded in parallel from many peers. This design would be quite different from that.

melekes commented May 8, 2019

How does the new peer verify the state the old peer gave it? I am assuming the new peer has a root hash, which it has obtained somewhere and can use to verify the simple tree that it got from the old peer, correct?

ackratos commented May 8, 2019

@ebuchman I think some of the points proposed here help the original proposal,
i.e. the MsgStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>} copes with Tendermint not being aware of the cosmos/app store structure (multistore).

But most of my concerns in (the first section of) my proposal still hold:

  1. I cannot see an efficient way for OldPeer to traverse a large tree and serve arbitrary range requests.
  2. A single key might be large. Here you specified IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES, but I presume you mean elements "000 - 500" (IAVL tree nodes for iavlStore, I presume) rather than the first 500 bytes (as in the English version you say "8: Old peer responds with the first 500 elements."). An app developer might put a large value (*00 MB) under a key and only update it periodically (rather than every block) to save IO resources.

How does this proposal deal with insertion order dependence of the IAVL tree?

I assume jaekwon proposes full IAVL nodes (rather than leaf key/value pairs), just like I have done in our previous version of state sync: #3243. If not, this is also a concern.

ebuchman commented May 8, 2019

I assume jaekwon proposes full IAVL nodes (rather than leaf key/value pairs), just like I have done in our previous version of state sync: #3243. If not, this is also a concern.

I'm pretty sure the proposal here is to use the key/value pairs, not the internal iavl nodes.

From one perspective, it's more efficient to use the key/value pairs, because it's a smaller amount of data to sync (ie. just the leaves, not the entire tree). But on the other hand, it might turn out to be less efficient in practice since it requires all the lookups. Also, I'm not sure how the insertion-order dependence of the IAVL tree would be handled here.

jaekwon commented May 8, 2019

How does this proposal deal with insertion order dependence of the IAVL tree?

IAVLRangeProof expresses internal IAVL nodes for a range of key/values, including the key/value leaves. It's a depth-first search into a range of values.

Reconstruction of this tree would require extra code in the IAVL tree, but could be done... it needs to be done anyway for the IAVL tree, so I would make it work from range proofs, since they hold all the info you need.

jaekwon commented May 8, 2019

My other major concern is we just did a lot of work to refactor the blockchain reactor with the intention that the design would largely apply to state sync as well in so far as it's a series of indexed chunks being downloaded in parallel from many peers. This design would be quite different from that.

I've always thought that the blockchain reactor would differ from the state sync reactor, because the blockchain should be verified in sequence, whereas state doesn't need to be; it can be verified via Merkle proofs in parallel. If you tried to use Merkle proofs for verifying the whole range of past blocks, you would end up with somewhat weaker safety guarantees. So that's my concern about creating a unified system for both. It could be done, but it doesn't seem natural to do it that way.

How does the new peer verify the state the old peer gave it? I am assuming the new peer has a root hash, which it has obtained somewhere and can use to verify the simple tree that it got from the old peer, correct?

Yes, it starts from the root and pulls data while verifying it. Breadth-first is achieved via ref links ("more"), requiring support from the store implementation (but the IAVL tree impl wouldn't support this initially, as RangeProof provides depth-first ranges). With subtrees, the requester gets to choose breadth-first or depth-first. (A receiver-side loop along these lines is sketched below.)
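
Here is a minimal Go sketch, with illustrative names only, of the receiver-side loop implied above: keep a worklist of pending ranges, request each range from a peer, hand the verified response to the app via ABCI, and enqueue the returned More refs after intersecting them with the scope just requested. `Peer`, `RequestState`, and `persist` are hypothetical stand-ins for the real p2p and ABCI plumbing, and a real implementation would run requests against many peers in parallel.

```go
// Sketch of a worklist-driven receiver loop for lazy state sync.
package statesync

type StatePathRange struct{ Start, End string }

type MsgStateResponse struct {
	Height    int64
	StartPath string
	More      []StatePathRange
	Type      string
	Bytes     []byte
}

// Peer is a hypothetical interface for asking a remote peer for a state range.
type Peer interface {
	RequestState(height int64, r StatePathRange, maxBytes int64) (MsgStateResponse, error)
}

func syncState(height int64, peers []Peer, persist func(MsgStateResponse) ([]StatePathRange, error)) error {
	worklist := []StatePathRange{{Start: "", End: ""}} // start with the whole key space
	for i := 0; len(worklist) > 0; i++ {
		r := worklist[0]
		worklist = worklist[1:]

		resp, err := peers[i%len(peers)].RequestState(height, r, 0)
		if err != nil {
			worklist = append(worklist, r) // retry later, ideally with another peer
			continue
		}

		more, err := persist(resp) // app verifies + saves, then suggests further ranges
		if err != nil {
			return err
		}
		for _, m := range more {
			if sub, ok := intersect(m, r); ok { // stay inside the scope just requested
				worklist = append(worklist, sub)
			}
		}
	}
	return nil
}

// intersect is the same helper sketched under the NOTES above.
func intersect(a, b StatePathRange) (StatePathRange, bool) {
	s, e := a.Start, a.End
	if b.Start > s {
		s = b.Start
	}
	if e == "" || (b.End != "" && b.End < e) {
		e = b.End
	}
	if e != "" && s >= e {
		return StatePathRange{}, false
	}
	return StatePathRange{s, e}, true
}
```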

I cannot see an efficient way OldPeer traverse a large tree and to provide arbitrary request.

NewPeer gives recommended segmentation via More links, but OldPeer could also brute-force it by bisecting the entire possible key list.

A single key might be large. Here you specified IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES, but I presume you mean elements "000 - 500" (IAVL tree nodes for iavlStore, I presume) rather than the first 500 bytes (as in the English version you say "8: Old peer responds with the first 500 elements."). An app developer might put a large value (*00 MB) under a key and only update it periodically (rather than every block) to save IO resources.

Yes, 500 elements. If a single value is that large, then I don't see any other way. I think the solution is to always chunk large binaries at the app level, which can be done in a general way (a toy sketch below).
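
For illustration, here is a minimal Go sketch of the app-level chunking idea, under assumed naming: store a large value under "&lt;key&gt;/0", "&lt;key&gt;/1", ... so every individual store entry stays small enough to fit in a single state-sync response. The key scheme and chunk size are hypothetical.

```go
// Toy sketch of chunking a large value into many small store entries.
package main

import "fmt"

const chunkSize = 1 << 20 // 1 MiB per chunk; an arbitrary illustrative limit

// chunkValue splits value into store entries keyed by "<key>/<index>".
// A real scheme would zero-pad the index so the chunk keys sort in order.
func chunkValue(key string, value []byte) map[string][]byte {
	chunks := make(map[string][]byte)
	for i := 0; len(value) > 0; i++ {
		n := chunkSize
		if len(value) < n {
			n = len(value)
		}
		chunks[fmt.Sprintf("%s/%d", key, i)] = value[:n]
		value = value[n:]
	}
	return chunks
}

func main() {
	big := make([]byte, 3*chunkSize+10) // a "large" value: 3 full chunks plus a tail
	for k, v := range chunkValue("mydoc", big) {
		fmt.Println(k, len(v))
	}
}
```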

jaekwon commented May 8, 2019

The point of this proposal is to make it so that it's easy to implement, and it doesn't require some background long-running preparation step to split the data and persist into chunks, which may take a long time if there is a LOT of data (which will be true for many chains very soon). Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

It's interesting to consider how robustness works against griefing... if OldPeer Bob doesn't give me good MsgStateResponse.More refs, it's still OK as long as I have other peers that are good, because they can help split the workload recursively.

@jaekwon jaekwon changed the title WIP General State-Sync pseudo-spec WIP General/Lazy State-Sync pseudo-spec May 8, 2019
ebuchman commented May 8, 2019

IAVLRangeProof expresses internal iavl nodes for a range of key/values, including the key/value leaves. It's depth-first search into a range of values.

So with this proposal you'd still have to sync the internal tree nodes over the network?

I've always thought that the blockchain reactor would differ from the state sync reactor because the blockchain should be verified in sequence, whereas state doesn't need to, it can be verified via merkle proofs in parallel.

Yes, this is more about the pool component of the reactor and the concurrent requesting of pieces from peers, which should be quite similar for both. Note we're not proposing generalizing the code to be used for both (at least not yet!), just copying the code and modifying. Of course the "boundary" conditions are quite different between the two - the blockchain reactor uses ApplyBlock on the ABCI consensus connection to execute blocks in sequence, it uses a heuristic to find out when to switch to consensus, and it has to update the "highest height", etc., while state sync will apply pieces on a different ABCI connection, doesn't have to order them, and should know up front exactly how many pieces are needed. So this is mostly about the pool component, which is where a lot of complexity/concurrency in these reactors actually lies.

Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

The concern here, based on @ackratos experiments, is that this real-time querying is actually quite slow, and it's much preferable to do some work up front so that the requests can be served much more quickly.

ackratos commented May 9, 2019

The point of this proposal is to make it so that it's easy to implement, and it doesn't require some background long-running preparation step to split the data and persist into chunks, which may take a long time if there is a LOT of data (which will be true for many chains very soon). Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

If we can propose a way to depth-first traverse the IAVL tree fast, then chunking will benefit from it (be fast) as well :)
In practice, IMHO we have to cache (in memory or on disk) for peers' range requests, rather than iterate over (or locate via an improved way) the tree each time (think about frequently locating a leaf and all non-leaf nodes on the path and loading them from leveldb into memory, while new key/values are being persisted into the same db). If we support flexible ranges, the cache would be larger. Chunking and saving it to disk saves such lookup time.

I think "a long time" depends on how frequently we take snapshots. For our 24-hour-per-snapshot (and Ethereum's ~13 hour) use case, it's fine to take ~3 min to snapshot a version, isn't it? The data should have an estimable upper bound. The acc store might reach 5M accounts in the long-term future? That's ~100M nodes, which is roughly 2500 4MB chunks (before compression); that's ~30 min of snapshot time, still acceptable, and I should say it's a lower bound on the time, as I'm currently experimenting with sequential disk writes. The stake store might grow, but extremely slowly; the gov store might get cleaned when proposals expire, etc.

I admit it's not acceptable if we want to take snapshots of the most recent 100 blocks, but once you solve the bottleneck of traversing the IAVL tree, I believe the chunking solution will be acceptable as well.

What do you think of it?

jaekwon commented May 10, 2019

So with this proposal you'd still have to sync the internal tree nodes over the network?

Depends on the tree, but for IAVL you have to anyway, in order to replicate the tree structure. You don't have to query for it breadth-first; it's included in each range proof.

Yes, this is more about the pool component of the reactor and the concurrent requesting of pieces from peers, which should be quite similar for both. Note we're not proposing generalizing the code to be used for both (at least not yet!), just copying the code and modifying. Of course the "boundary" conditions are quite different between the two - the blockchain reactor uses ApplyBlock on the ABCI consensus connection to execute blocks in sequence, it uses a heuristic to find out when to switch to consensus, and it has to update the "highest height", etc., while state sync will apply pieces on a different ABCI connection, doesn't have to order them, and should know up front exactly how many pieces are needed. So this is mostly about the pool component, which is where a lot of complexity/concurrency in these reactors actually lies.

Hmm, it sounds to me that the pools will be quite dissimilar! Anyways, copy/pasting sounds like a fine way to create it to start, and we can see how it evolves.

The concern here, based on @ackratos experiments, is that this real-time querying is actually quite slow, and it's much preferable to do some work up front so that the requests can be served much more quickly.

Interesting....

@ackratos If we can propose a way to depth-first traverse the IAVL tree fast, then chunking will benefit from it (be fast) as well :) In practice, IMHO we have to cache (in memory or on disk) for peers' range requests, rather than iterate over (or locate via an improved way) the tree each time (think about frequently locating a leaf and all non-leaf nodes on the path and loading them from leveldb into memory, while new key/values are being persisted into the same db). If we support flexible ranges, the cache would be larger. Chunking and saving it to disk saves such lookup time.

There are two components to the "each time" state request in this proposal... (1) the initial query itself, including inner nodes and the leaf nodes of that query, and (2) providing More refs, calculated intelligently. Here, (1) is easy to amortize by having larger chunks (e.g. a MaxBytes of 10MB, say, or generally returning many leaves per request), and (2) can be pre-computed. And precomputing (2) is basically just a table of paths, without any of the data, so the size of the precomputed data would be tiny... it doesn't need to have more than, say, 10k paths (that would provide enough parallelization), which would only be about 1MB of data.


It sounds like there's a hybrid solution that does some pre-computation for splitting the workload, while keeping the pre-computation efficient by keeping it small, and leaving the fetching of values lazy (sketched below).
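
As an illustration of that hybrid, here is a minimal Go sketch of the tiny precomputed "split table" idea: walk the keys once per snapshot height and keep every Nth key as a split point, so More refs can be served without touching the values at request time. `keysAt` and the type names are hypothetical stand-ins, not existing APIs.

```go
// Sketch of precomputing a small table of split paths for serving More refs.
package statesync

type splitTable struct {
	height int64
	paths  []string // ~10k ordered split points, a few bytes each
}

// buildSplitTable samples every stride-th key from an ordered key iterator.
// keysAt is a hypothetical ordered iteration over the store's keys at a height.
func buildSplitTable(height int64, keysAt func(int64) <-chan string, stride int) splitTable {
	t := splitTable{height: height}
	i := 0
	for k := range keysAt(height) {
		if i%stride == 0 {
			t.paths = append(t.paths, k) // keep only the path, none of the data
		}
		i++
	}
	return t
}
```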

mappum commented May 11, 2019

I did some benchmarks with rocksdb, and I believe my design is viable for random tree chunk accesses.

With 10M nodes, each with random 30-byte keys and 200-byte values, I was able to read random ranges of nodes at ~750MB/s (e.g. I could read 100 32,768-node/7.5MB chunks per second) with an i9-9900K processor. At this rate, it's possible to read the entire 2.3GB database in 3 seconds, and this is only with one core. Rocksdb lets us do concurrent reads, so our throughput can be multiplied by the number of cores available.

If we use a tree where all inner nodes have a key/value rather than only leaf nodes, then each range inside the database key space represents an entire subtree with no inner nodes to compute so there is only minimal processing needed beyond reading from the db.

This can be done lazily with some ongoing CPU cost scaled with the demand for state syncing, but could also be easily cached to disk to free up the CPU (storage is cheap).

My code is here: https://gist.github.com/mappum/3323a386203e506aa893a5f3d7622a72 (sorry for the strange style, I was optimizing for usage of rust features I've never used before)

@ackratos

@mappum thank you.
Let me run a golang & leveldb experiment.

@liamsi liamsi added the C:sync Component: Fast Sync, State Sync label May 13, 2019
@tac0turtle tac0turtle added this to the State-Sync milestone Jul 2, 2019
@ebuchman ebuchman mentioned this issue Jul 4, 2019
@erikgrinaker

Superseded by ADR-053.
