
WIP General/Lazy State-Sync pseudo-spec #3639

Closed
jaekwon opened this issue May 8, 2019 · 14 comments
Labels
C:sync Component: Fast Sync, State Sync

Comments

jaekwon commented May 8, 2019

TODO: link with other state-sync specs/work and compare/contrast.

General/Lazy State syncing pseudo-spec

The SDK has two layers of trees -- a Simple Tree on top and, underneath, IAVL trees.
This could change in the future too -- e.g. an IAVL tree of trees, or other trees besides IAVL (perhaps even N-dimensional spatial trees).

Design goals:

  • Parallelizable -- should be able to fetch from multiple peers at a time
  • Works well with layers (of trees)
  • Make it easy for new tree/substore types to implement the protocol
  • Extensible
  • Big-O complexity sufficiently optimal
  • Shouldn't require e.g. fetching a large first item, or long precomputation.
  • Each item fetched should be (almost) immediately verifiable, e.g. can be checked against the root hash

Msgs and Types

  • StatePathRange{}

    • Start string
    • End string // empty string is the last theoretical key
  • MsgStateRequest{} // a peer requesting state from another

    • Height num
    • Range StatePathRange
    • MaxBytes num // 0 means unlimited
      // NOTE: MaxBytes is the max bytes the receiver is willing to tolerate; it doesn't mean the sender will send that much.
      // The sender can clip at its own discretion, and it's up to the receiver to ask for more.
  • ABCIStateRequest{} // corresponding ABCI message
    // NOTE: same as MsgStateRequest... Tendermint passes it through to the app via ABCI.

  • MsgStateResponse{} // a peer responding with state to requester

    • Height num
    • StartPath string // the requested start path
    • More []StatePathRange // sender helps receiver parallelize fetches w/ More refs
    • Type string // e.g. "iavl:range", "simple:tree"
    • Bytes []byte // self-proving structure
  • ABCIStateResponse{} // corresponding ABCI message
    // NOTE: same as MsgStateResponse... Tendermint passes it through to the app via ABCI.

  • ABCIStatePersistRequest{} // Tendermint on receiving side asking the app to save
    // NOTE: same as MsgStateResponse...

  • ABCIStatePersistResponse{} // App responding to Tendermint

    • Height num
    • Error (optional)
    • StartPath string
    • More []StatePathRange
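
For clarity, here is a minimal, non-authoritative Go sketch of the message and ABCI types listed above. The concrete field types (int64 heights, string paths, raw []byte payloads, type aliases for the "same as" messages) are my assumptions, not part of the spec.

```go
// A rough transcription of the message/ABCI types above into Go, for illustration only.
package statesync

// StatePathRange is a range of store paths; "" as End means "up to the last theoretical key".
type StatePathRange struct {
	Start string
	End   string
}

// MsgStateRequest: a peer requesting state from another peer.
// MaxBytes == 0 means unlimited; the sender may still clip at its own discretion.
type MsgStateRequest struct {
	Height   int64
	Range    StatePathRange
	MaxBytes int64
}

// ABCIStateRequest mirrors MsgStateRequest; Tendermint passes it through to the app via ABCI.
type ABCIStateRequest = MsgStateRequest

// MsgStateResponse: a peer responding with state to the requester.
type MsgStateResponse struct {
	Height    int64
	StartPath string           // the requested start path
	More      []StatePathRange // sender helps receiver parallelize fetches
	Type      string           // e.g. "iavl:range", "simple:tree"
	Bytes     []byte           // self-proving structure
}

// ABCIStateResponse and ABCIStatePersistRequest carry the same fields as MsgStateResponse.
type (
	ABCIStateResponse       = MsgStateResponse
	ABCIStatePersistRequest = MsgStateResponse
)

// ABCIStatePersistResponse: the app responding to Tendermint after persisting.
type ABCIStatePersistResponse struct {
	Height    int64
	Error     error // optional
	StartPath string
	More      []StatePathRange
}
```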

Example

Actors: NewPeer, OldPeer
1:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"",""},0}
2:  OldPeer->OldPeer: ABCIStateRequest{1000,Range{"",""},0}
3:  OldPeer->OldPeer: ABCIStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
4:  OldPeer->NewPeer: MsgStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
5:  NewPeer->NewPeer: ABCIStatePersistRequest{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>}
6:  NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"",[
     {"substore1", "substore2"},
     {"substore5", "substore6"},
     {"substoreABC", "substoreABD"}
   ]}
// NOTE: By this point, NewPeer has the root hash for height 1000 and also knows the persistent IAVL substore hashes for all substores ("substore1", "substore5", and "substoreABC").  At step 6, the app parses the MULTISTORE bytes, discovers all the substores, and is thus able to tell Tendermint which further paths to request.
7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore1","substore2"},0}
// The below could happen in parallel:
// 7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore5","substore6"},0}
// 7:  NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substoreABC","substoreABD"},0}
...
8:  OldPeer->NewPeer: MsgStateResponse{1000,"substore1","iavl:range",<IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES>}
9:  NewPeer->NewPeer: ABCIStatePersistRequest{1000,"substore1","iavl:range",<IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES>}
10: NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"substore1",More:[{"substore1/501",""}]} // next key range is {"substore1/501", ""}
11: NewPeer->OldPeer: MsgStateRequest{1000,Range:{"substore1/501","substore2"},0}
12: OldPeer->NewPeer: MsgStateResponse{1000,"substore1/501","iavl:range",<IAVL_RANGE_501_to_END_IN_SUBSTORE1_BYTES>}
13: NewPeer->NewPeer: ABCIStatePersistRequest{1000,"substore1","iavl:range",<IAVL_RANGE_501_to_END_IN_SUBSTORE1_BYTES>}
14: NewPeer->NewPeer: ABCIStatePersistResponse{1000,nil,"substore1/501",More:[{"substore2",""}]}

ENGLISH:
1: New peer asks old peer to start state syncing at height 1000.
2: Old peer asks ABCI app.
3: Old peer's ABCI app responds with an entire simple tree (the simple tree is always complete) of the top-level Cosmos SDK store.  It is a list of commit references to all substores at that height.
4: Old peer responds to new peer with the same tree.
5: New peer tells its ABCI app.
6: New peer's ABCI app asks to fetch more things, namely the substore contents.
7: New peer asks for "substore1" content.
8: Old peer responds with the first 500 elements.
9: New peer asks ABCI app to save them.
10: New peer's ABCI app suggests fetching everything else starting from "substore1/501".
11: New peer asks old peer for more items within the "substore1" substore after key "501".
12: Old peer responds with more starting from key "substore1/501".
13: New peer asks ABCI app to save them.
14: New peer's ABCI app tells Tendermint that the next key is "substore2", implying that "substore1" keys have been fully received.

NOTES:
10: The key in the IAVL substore is "501", but "substore1/" is prepended by the multistore impl.
11: Tendermint takes the intersection of {"substore1/501",""} from the previous step and {"substore1","substore2"} from step 6, yielding the range {"substore1/501","substore2"} requested here.
11: The other two substores are not queried since their requests were commented out in step 7.
14: {"substore1", "substore2"} from step 6 intersected with {"substore2",""} is the empty set, which means Tendermint knows to stop making requests and not request for anything in {"substore2",""}.

Other Notes:

  • MsgStateResponse and ABCIStatePersistResponse each have a More field. The sender (e.g. OldPeer) may split the query space into subspaces and return them in MsgStateResponse.More refs to help the receiver parallelize requests. The sender should use this if the size of the response bytes exceeds request.MaxBytes.
  • Tendermint should be persisting and updating "More:..." references it receives from ABCIStatePersistResponse into some priority queue (after intersection with request scope and filtering out references outside the current scope) and making requests to many peers in parallel.
  • 2-d or N-d data can be saved too, as long as the keys can be linearized. How a key is linearized into a string depends on the Merkle-izing scheme (and thus the state-chunk proving scheme), which determines how the keys are formatted and how the value/domain/space parts are ordered -- and how a key is mapped to subspaces depends on the spatial algorithm. For example, here's a dumb geoquad example (one possible key linearization is sketched after it):
+---+---+---+---+
|1,1|2,1|3,1|4,1|
+---+---+---+---+
|1,2|2,2|3,2|4,2|
+---+---+---+---+
|1,3|2,3|3,3|4,3|
+---+---+---+---+
|1,4|2,4|3,4|4,4|
+---+---+---+---+

Requesting the contents of {"1,1","1,2"} (e.g. the geoquad labeled "1,1") may yield items, or further quads. Hashing happens recursively for each subspace.
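
As a toy illustration of "linearizing" 2-d keys, here is one hypothetical quadkey-style encoding in Go, where each character of the key narrows the space by one quadrant; the scheme and names are assumptions for this sketch, not part of the proposal.

```go
// A toy sketch of one possible key linearization for the geoquad example above.
package main

import "fmt"

// quadKey encodes an (x, y) cell in a 2^depth x 2^depth grid as a string of
// quadrant digits ('0'..'3'), so that every prefix identifies a subquad.
func quadKey(x, y, depth int) string {
	key := make([]byte, depth)
	for i := depth - 1; i >= 0; i-- {
		d := byte('0')
		if x&1 == 1 {
			d += 1
		}
		if y&1 == 1 {
			d += 2
		}
		key[i] = d
		x >>= 1
		y >>= 1
	}
	return string(key)
}

func main() {
	// The 4x4 grid above needs depth 2; the cell labeled "1,1" in the drawing is (0,0) here.
	fmt.Println(quadKey(0, 0, 2)) // "00" -- top-left quad, top-left cell
	fmt.Println(quadKey(3, 3, 2)) // "33" -- bottom-right quad, bottom-right cell
	// All keys with prefix "0" fall inside the top-left quad, so a range like
	// {"0", "1"} covers exactly that subspace, matching the recursive hashing idea.
}
```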

@jaekwon jaekwon changed the title WIP State-Sync pseudo-spec WIP General State-Sync pseudo-spec May 8, 2019
zmanian commented May 8, 2019

Link with the tracking issue.
#828

ebuchman commented May 8, 2019

How does this proposal deal with insertion order dependence of the IAVL tree?

Also, should we include some description of how it might work for a different kind of tree, like the Patricia Trie? I guess it would work the same way - the fact that in Ethereum's trie you can't iterate over the key space directly shouldn't matter, because you can still iterate over the final keys (i.e. the hashes of the original keys), right?

My other major concern is we just did a lot of work to refactor the blockchain reactor with the intention that the design would largely apply to state sync as well in so far as it's a series of indexed chunks being downloaded in parallel from many peers. This design would be quite different from that.

melekes commented May 8, 2019

How does the new peer verify the state the old peer gave it? I am assuming the new peer has a root hash, which it has obtained somewhere and can use to verify the simple tree that it got from the old peer, correct?

ackratos commented May 8, 2019

@ebuchman I think some of the points proposed here help the original proposal,
i.e. the MsgStateResponse{1000,"","simple:tree",<SIMPLE_TREE_LIST_OF_ALL_IAVL_SUBSTORE_INFOS_BYTES>} copes with Tendermint not being aware of the cosmos/app store structure (multistore).

But most of my concerns in (the first section of) my proposal still hold:

  1. I cannot see an efficient way for OldPeer to traverse a large tree and serve arbitrary range requests.
  2. A single key might be large. Here you specified IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES, but I presume you mean elements "000 - 500" (IAVL tree nodes for iavlStore, I presume) rather than the first 500 bytes (as in the English version you say "8: Old peer responds with the first 500 elements."). An app developer might put a large value (*00 MB) under a key and only update it periodically (rather than every block) to save IO resources.

How does this proposal deal with insertion order dependence of the IAVL tree?

I assume jaekwon proposes full IAVL nodes (rather than leaf key/value pairs), just like I have done in our previous version of state sync: #3243. If not, this is also a concern.

ebuchman commented May 8, 2019

I assume jaekwon proposes full IAVL nodes (rather than leaf key/value pairs), just like I have done in our previous version of state sync: #3243. If not, this is also a concern.

I'm pretty sure the proposal here is to use the key/value pairs, not the internal iavl nodes.

From one perspective, it's more efficient to use the key/value pairs, because it's a smaller amount of data to sync (ie. just the leaves, not the entire tree). But on the other hand, it might turn out to be less efficient in practice since it requires all the lookups. Also, I'm not sure how the insertion-order dependence of the IAVL tree would be handled here.

jaekwon commented May 8, 2019

How does this proposal deal with insertion order dependence of the IAVL tree?

IAVLRangeProof expresses internal IAVL nodes for a range of key/values, including the key/value leaves. It's a depth-first search into a range of values.

Reconstruction of this tree would require extra code in the IAVL tree, but could be done... it needs to be done anyway for the IAVL tree, so I would make it work from range proofs, since they hold all the info you need.

jaekwon commented May 8, 2019

My other major concern is we just did a lot of work to refactor the blockchain reactor with the intention that the design would largely apply to state sync as well in so far as it's a series of indexed chunks being downloaded in parallel from many peers. This design would be quite different from that.

I've always thought that the blockchain reactor would differ from the state sync reactor, because the blockchain should be verified in sequence, whereas state doesn't need to be; it can be verified via Merkle proofs in parallel. If you tried to use Merkle proofs for verifying the whole range of past blocks, you would end up with somewhat weaker safety guarantees. So that's my concern about creating a unified system for both. It could be done, but it doesn't seem natural to do it that way.

How does the new peer verify the state the old peer gave it? I am assuming the new peer has a root hash, which it has obtained somewhere and can use to verify the simple tree that it got from the old peer, correct?

Yes, it starts from the root and pulls data while verifying it. Breadth-first is achieved via ref links ("more"), requiring support from the store implementation (but the IAVL tree impl wouldn't support this initially, as RangeProof provides depth-first ranges). With subtrees, the requester gets to choose breadth-first or depth-first. (A receiver-side loop along these lines is sketched below.)
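
Here is a minimal Go sketch, with illustrative names only, of the receiver-side loop implied above: keep a worklist of pending ranges, request each range from a peer, hand the verified response to the app via ABCI, and enqueue the returned More refs after intersecting them with the scope just requested. `Peer`, `RequestState`, and `persist` are hypothetical stand-ins for the real p2p and ABCI plumbing, and a real implementation would run requests against many peers in parallel.

```go
// Sketch of a worklist-driven receiver loop for lazy state sync.
package statesync

type StatePathRange struct{ Start, End string }

type MsgStateResponse struct {
	Height    int64
	StartPath string
	More      []StatePathRange
	Type      string
	Bytes     []byte
}

// Peer is a hypothetical interface for asking a remote peer for a state range.
type Peer interface {
	RequestState(height int64, r StatePathRange, maxBytes int64) (MsgStateResponse, error)
}

func syncState(height int64, peers []Peer, persist func(MsgStateResponse) ([]StatePathRange, error)) error {
	worklist := []StatePathRange{{Start: "", End: ""}} // start with the whole key space
	for i := 0; len(worklist) > 0; i++ {
		r := worklist[0]
		worklist = worklist[1:]

		resp, err := peers[i%len(peers)].RequestState(height, r, 0)
		if err != nil {
			worklist = append(worklist, r) // retry later, ideally with another peer
			continue
		}

		more, err := persist(resp) // app verifies + saves, then suggests further ranges
		if err != nil {
			return err
		}
		for _, m := range more {
			if sub, ok := intersect(m, r); ok { // stay inside the scope just requested
				worklist = append(worklist, sub)
			}
		}
	}
	return nil
}

// intersect is the same helper sketched under the NOTES above.
func intersect(a, b StatePathRange) (StatePathRange, bool) {
	s, e := a.Start, a.End
	if b.Start > s {
		s = b.Start
	}
	if e == "" || (b.End != "" && b.End < e) {
		e = b.End
	}
	if e != "" && s >= e {
		return StatePathRange{}, false
	}
	return StatePathRange{s, e}, true
}
```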

I cannot see an efficient way OldPeer traverse a large tree and to provide arbitrary request.

NewPeer gives recommended segmentation via More links, but OldPeer could also brute-force it by bisecting the entire possible key list.

A single key might be large. Here you specified IAVL_RANGE_000_to_500_IN_SUBSTORE1_BYTES, but I presume you mean elements "000 - 500" (IAVL tree nodes for iavlStore, I presume) rather than the first 500 bytes (as in the English version you say "8: Old peer responds with the first 500 elements."). An app developer might put a large value (*00 MB) under a key and only update it periodically (rather than every block) to save IO resources.

Yes, 500 elements. If a single value is that large, then I don't see any other way. I think the solution is to always chunk large binaries at the app level, which can be done in a general way (a toy sketch below).
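
For illustration, here is a minimal Go sketch of the app-level chunking idea, under assumed naming: store a large value under "&lt;key&gt;/0", "&lt;key&gt;/1", ... so every individual store entry stays small enough to fit in a single state-sync response. The key scheme and chunk size are hypothetical.

```go
// Toy sketch of chunking a large value into many small store entries.
package main

import "fmt"

const chunkSize = 1 << 20 // 1 MiB per chunk; an arbitrary illustrative limit

// chunkValue splits value into store entries keyed by "<key>/<index>".
// A real scheme would zero-pad the index so the chunk keys sort in order.
func chunkValue(key string, value []byte) map[string][]byte {
	chunks := make(map[string][]byte)
	for i := 0; len(value) > 0; i++ {
		n := chunkSize
		if len(value) < n {
			n = len(value)
		}
		chunks[fmt.Sprintf("%s/%d", key, i)] = value[:n]
		value = value[n:]
	}
	return chunks
}

func main() {
	big := make([]byte, 3*chunkSize+10) // a "large" value: 3 full chunks plus a tail
	for k, v := range chunkValue("mydoc", big) {
		fmt.Println(k, len(v))
	}
}
```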

jaekwon commented May 8, 2019

The point of this proposal is to make it so that it's easy to implement, and it doesn't require some background long-running preparation step to split the data and persist into chunks, which may take a long time if there is a LOT of data (which will be true for many chains very soon). Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

It's interesting to consider how robustness works against griefing... if OldPeer Bob doesn't give me good MsgStateResponse.More refs, it's still OK as long as I have other peers that are good, because they can help split the workload recursively.

@jaekwon jaekwon changed the title WIP General State-Sync pseudo-spec WIP General/Lazy State-Sync pseudo-spec May 8, 2019
ebuchman commented May 8, 2019

IAVLRangeProof expresses internal iavl nodes for a range of key/values, including the key/value leaves. It's depth-first search into a range of values.

So with this proposal you'd still have to sync the internal tree nodes over the network?

I've always thought that the blockchain reactor would differ from the state sync reactor because the blockchain should be verified in sequence, whereas state doesn't need to, it can be verified via merkle proofs in parallel.

Yes, this is more about the pool component of the reactor and the concurrent requesting of pieces from peers, which should be quite similar for both. Note we're not proposing generalizing the code to be used for both (at least not yet!), just copying the code and modifying. Of course the "boundary" conditions are quite different between the two - the blockchain reactor uses ApplyBlock on the ABCI consensus connection to execute blocks in sequence, it uses a heuristic to find out when to switch to consensus, and it has to update the "highest height", etc., while state sync will apply pieces on a different ABCI connection, doesn't have to order them, and should know up front exactly how many pieces are needed. So this is mostly about the pool component, which is where a lot of complexity/concurrency in these reactors actually lies.

Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

The concern here, based on @ackratos experiments, is that this real-time querying is actually quite slow, and it's much preferable to do some work up front so that the requests can be served much more quickly.

ackratos commented May 9, 2019

The point of this proposal is to make it so that it's easy to implement, and it doesn't require some background long-running preparation step to split the data and persist into chunks, which may take a long time if there is a LOT of data (which will be true for many chains very soon). Instead, in this proposal the chunks must be figured out at request time (which can be cheap, and doesn't have to be perfect) but there's no need to precompute them.

If we can propose a way to depth-first traverse the IAVL tree fast, then chunking will benefit from it (be fast) as well :)
In practice, IMHO we have to cache (in memory or on disk) for peers' range requests, rather than iterate over (or locate via an improved way) the tree each time (think about frequently locating a leaf and all non-leaf nodes on the path and loading them from leveldb into memory, while new key/values are being persisted into the same db). If we support flexible ranges, the cache would be larger. Chunking and saving it to disk saves such lookup time.

I think "a long time" depends on how frequently we take snapshots. For our 24-hour-per-snapshot (and Ethereum's ~13 hour) use case, it's fine to take ~3 min to snapshot a version, isn't it? The data should have an estimable upper bound. The acc store might reach 5M accounts in the long-term future? That's ~100M nodes, which is roughly 2500 4MB chunks (before compression); that's ~30 min of snapshot time, still acceptable, and I should say it's a lower bound on the time, as I'm currently experimenting with sequential disk writes. The stake store might grow, but extremely slowly; the gov store might get cleaned when proposals expire, etc.

I admit it's not acceptable if we want to take snapshots of the most recent 100 blocks, but once you solve the bottleneck of traversing the IAVL tree, I believe the chunking solution will be acceptable as well.

What do you think of it?

jaekwon commented May 10, 2019

So with this proposal you'd still have to sync the internal tree nodes over the network?

Depends on the tree, but for IAVL you have to anyway, in order to replicate the tree structure. You don't have to query for it breadth-first; it's included in each range proof.

Yes, this is more about the pool component of the reactor and the concurrent requesting of pieces from peers, which should be quite similar for both. Note we're not proposing generalizing the code to be used for both (at least not yet!), just copying the code and modifying. Of course the "boundary" conditions are quite different between the two - the blockchain reactor uses ApplyBlock on the ABCI consensus connection to execute blocks in sequence, it uses a heuristic to find out when to switch to consensus, and it has to update the "highest height", etc., while state sync will apply pieces on a different ABCI connection, doesn't have to order them, and should know up front exactly how many pieces are needed. So this is mostly about the pool component, which is where a lot of complexity/concurrency in these reactors actually lies.

Hmm, it sounds to me that the pools will be quite dissimilar! Anyways, copy/pasting sounds like a fine way to create it to start, and we can see how it evolves.

The concern here, based on @ackratos experiments, is that this real-time querying is actually quite slow, and it's much preferable to do some work up front so that the requests can be served much more quickly.

Interesting....

@ackratos If we can propose a way to depth-first traverse the IAVL tree fast, then chunking will benefit from it (be fast) as well :) In practice, IMHO we have to cache (in memory or on disk) for peers' range requests, rather than iterate over (or locate via an improved way) the tree each time (think about frequently locating a leaf and all non-leaf nodes on the path and loading them from leveldb into memory, while new key/values are being persisted into the same db). If we support flexible ranges, the cache would be larger. Chunking and saving it to disk saves such lookup time.

There are two components to the "each time" state request in this proposal... (1) the initial query itself, including inner nodes and the leaf nodes of that query, and (2) providing More refs, calculated intelligently. Here, (1) is easy to amortize by having larger chunks (e.g. a MaxBytes of 10MB, say, or generally returning many leaves per request), and (2) can be pre-computed. And precomputing (2) is basically just a table of paths, without any of the data, so the size of the precomputed data would be tiny... it doesn't need to have more than, say, 10k paths (that would provide enough parallelization), which would only be about 1MB of data.


It sounds like there's a hybrid solution that does some pre-computation for splitting the workload, while keeping the pre-computation efficient by keeping it small, and leaving the fetching of values lazy (sketched below).
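
As an illustration of that hybrid, here is a minimal Go sketch of the tiny precomputed "split table" idea: walk the keys once per snapshot height and keep every Nth key as a split point, so More refs can be served without touching the values at request time. `keysAt` and the type names are hypothetical stand-ins, not existing APIs.

```go
// Sketch of precomputing a small table of split paths for serving More refs.
package statesync

type splitTable struct {
	height int64
	paths  []string // ~10k ordered split points, a few bytes each
}

// buildSplitTable samples every stride-th key from an ordered key iterator.
// keysAt is a hypothetical ordered iteration over the store's keys at a height.
func buildSplitTable(height int64, keysAt func(int64) <-chan string, stride int) splitTable {
	t := splitTable{height: height}
	i := 0
	for k := range keysAt(height) {
		if i%stride == 0 {
			t.paths = append(t.paths, k) // keep only the path, none of the data
		}
		i++
	}
	return t
}
```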

mappum commented May 11, 2019

I did some benchmarks with rocksdb, and I believe my design is viable for random tree chunk accesses.

With 10M nodes, each with random 30-byte keys and 200-byte values, I was able to read random ranges of nodes at ~750MB/s (e.g. I could read 100 32,768-node/7.5MB chunks per second) with an i9-9900K processor. At this rate, it's possible to read the entire 2.3GB database in 3 seconds, and this is only with one core. Rocksdb lets us do concurrent reads, so our throughput can be multiplied by the number of cores available.

If we use a tree where all inner nodes have a key/value rather than only leaf nodes, then each range inside the database key space represents an entire subtree with no inner nodes to compute so there is only minimal processing needed beyond reading from the db.

This can be done lazily with some ongoing CPU cost scaled with the demand for state syncing, but could also be easily cached to disk to free up the CPU (storage is cheap).

My code is here: https://gist.github.com/mappum/3323a386203e506aa893a5f3d7622a72 (sorry for the strange style, I was optimizing for usage of rust features I've never used before)

@ackratos

@mappum thank you.
Let me run a golang & leveldb experiment.

@liamsi liamsi added the C:sync Component: Fast Sync, State Sync label May 13, 2019
@tac0turtle tac0turtle added this to the State-Sync milestone Jul 2, 2019
@ebuchman ebuchman mentioned this issue Jul 4, 2019
@erikgrinaker

Superseded by ADR-053.
