Stateless mode (not working yet) #1311

Closed
wants to merge 10 commits into from

Conversation

AdamSpitz (Contributor):

NOT WORKING YET: New "stateless" mode.

The basic idea is that just before we're about to need some account, slot, or block header, we prefetch it asynchronously. (For example, accounts and slots are fetched via eth_getProof, and the fetched nodes are then put into the database.)
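
For illustration, here is a minimal, self-contained sketch (not the PR's actual data-source code, which lives in its own module) of what a single eth_getProof prefetch looks like over plain JSON-RPC, using std/httpclient and std/json; the fetchProof name and its parameters are made up for this example:

    import std/[httpclient, json]

    proc fetchProof(url, address: string; slots: seq[string]; blockId: string): JsonNode =
      # Ask the data source (e.g. the --stateless-data-source-url endpoint) for an
      # EIP-1186 proof of the account and the given storage slots at blockId.
      let body = %*{
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getProof",
        "params": [address, slots, blockId]
      }
      let client = newHttpClient()
      defer: client.close()
      client.headers = newHttpHeaders({"Content-Type": "application/json"})
      let resp = client.postContent(url, body = $body)
      # "accountProof" and "storageProof" in the result hold the RLP-encoded trie
      # nodes that stateless mode then inserts into its local database.
      parseJson(resp)["result"]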

You can start in this mode by running:

nimbus --sync-mode=stateless --stateless-data-source-url=https://mainnet.infura.io/whatever

(At which point nimbus-eth1 will wait for an eth2 client to feed it blocks to run via newPayload in the Engine API.)

Alternatively, you can run the block with a particular hash by running:

nimbus statelesslyRun --stateless-data-source-url=https://mainnet.infura.io/whatever --stateless-block-hash=1234ABCetc.

(This will simply fetch the block header corresponding to that hash from the data source, then run it, without bothering to wait for an eth2 client.)

I say "not working yet" because the state roots are still coming out wrong.

Also, block processing times are still much too slow to be practical: on the order of five minutes per block. We'll need to do precalculated witnesses or something like that in order to make this viable.

One more note: I've removed the concurrentAssemblers test that I implemented when I did the basic async EVM stuff; it was useful as a temporary way to exercise the async code, but the test was based on a very hacky idea and it was more trouble than it was worth.

AdamSpitz and others added 9 commits October 13, 2022 15:39
This is a whole new Git branch, not the same one as last time (#1250); there wasn't much worth salvaging. Main differences:

I didn't do the "each opcode has to specify an async handler" junk
that I put in last time. Instead, in oph_memory.nim you can see
sloadOp calling asyncChainTo and passing in an async operation.
That async operation is then run by the execCallOrCreate (or
asyncExecCallOrCreate) code in interpreter_dispatch.nim.
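
(To make that concrete, here is a rough, self-contained sketch of the asyncChainTo pattern with toy types; the real Vm2Ctx/Computation and the hypothetical ifNecessaryGetSlot fetcher mentioned below are only stand-ins, not the PR's exact code:)

    import std/asyncdispatch

    type
      Computation = ref object
        pendingAsyncOperation: Future[void]   # set by asyncChainTo
        continuation: proc ()                 # run after the Future completes

    template asyncChainTo(c: Computation, fut: Future[void], andThen: untyped) =
      c.pendingAsyncOperation = fut
      c.continuation = proc () =
        andThen

    # An op handler such as sloadOp would then do roughly:
    #   cpt.asyncChainTo(ifNecessaryGetSlot(cpt, slot)):    # hypothetical fetcher
    #     cpt.stack.push(cpt.getStorage(slot))
    # and the dispatch loop awaits pendingAsyncOperation before calling continuation.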

In the test code, the (previously existing) macro called "assembler"
now allows you to add a section called "initialStorage", specifying
fake data to be used by the EVM computation run by that test. (In
the long run we'll obviously want to write tests that for-real use
the JSON-RPC API to asynchronously fetch data; for now, this was
just an expedient way to write a basic unit test that exercises the
async-EVM code pathway.)

There's also a new macro called "concurrentAssemblers" that allows
you to write a test that runs multiple assemblers concurrently (and
then waits for them all to finish). There's one example test using
this, in test_op_memory_lazy.nim, though you can't actually see it
doing so unless you uncomment some echo statements in
async_operations.nim (in which case you can see the two concurrently
running EVM computations each printing out what they're doing, and
you'll see that they interleave).
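
(If it helps, the interleaving itself can be seen with a completely standalone toy using std/asyncdispatch, which has nothing to do with the real assembler macro:)

    import std/asyncdispatch

    proc fakeEvmRun(name: string, steps: int) {.async.} =
      for i in 1 .. steps:
        echo name, " step ", i
        await sleepAsync(10)   # yield, so the two runs interleave

    # Prints the A and B steps interleaved, analogous to the two concurrently
    # running EVM computations described above.
    waitFor all(fakeEvmRun("A", 3), fakeEvmRun("B", 3))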

A question: is it possible to make EVMC work asynchronously? (For
now, this code compiles and "make test" passes even if ENABLE_EVMC
is turned on, but it doesn't actually work asynchronously, it just
falls back on doing the usual synchronous EVMC thing. See
FIXME-asyncAndEvmc.)
Also ditched the plain-data Vm2AsyncOperation type; it wasn't
really serving much purpose. Instead, the pendingAsyncOperation
field directly contains the Future.
It's not the right solution to the "how do we know whether we
still need to fetch the storage value or not?" problem. I
haven't implemented the right solution yet, but at least
we're better off not putting in a wrong one.
(Based on feedback on the PR.)
There was some back-and-forth in the PR regarding whether nested
waitFor calls are acceptable:

#1260 (comment)

The eventual decision was to just change the waitFor to a doAssert
(since we probably won't want this extra functionality when running
synchronously anyway) to make sure that the Future is already
finished.
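
In other words, something like this minimal sketch (using std/asyncdispatch here rather than the project's own helpers):

    import std/asyncdispatch

    proc readAlreadyCompleted[T](fut: Future[T]): T =
      # Instead of a nested waitFor, insist that the future has already
      # finished when running synchronously, and just read its value.
      doAssert fut.finished, "future must already be completed in synchronous mode"
      fut.read
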
AdamSpitz marked this pull request as draft on November 22, 2022, 10:54.
  discard result.beginSavepoint

proc init*(x: typedesc[AccountsCache], db: TrieDatabaseRef, pruneTrie: bool = true): AccountsCache =
  init(x, db, emptyRlpHash, pruneTrie)

proc statelessInit*(x: typedesc[AccountsCache], db: TrieDatabaseRef,
Member:

What's the difference between init and statelessInit here?

    fork: Fork): Result[GasInt,void]
    # wildcard exception, wrapped below
    {.gcsafe, raises: [Exception].} =
  return waitFor(asyncProcessTransactionImpl(vmState, tx, sender, header, fork))
Member:

That's OK as long as this is still work in progress, but you'll have to restore the proper sync version before merging this PR.

@@ -0,0 +1,30 @@
import
Member:

This should probably be called data_sources/json_rpc or data_sources/web3. Fluffy will add another on-demand data source which will probably be called portal.

  let h = chainDB.getBlockHash(blockNumber)
  doAssert(h == header.blockHash, "stored the block header for block " & $(blockNumber))

template raiseExceptionIfError[E](whatAreWeVerifying: untyped, r: Result[void, E]) =
Member:

This should probably be a generic proc to avoid the multiple evaluation of the r parameter.
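
For instance, a hedged sketch of that suggestion (the results import and the choice of CatchableError are assumptions, not necessarily what the repo uses):

    import results   # assumed: the nim-results package available in the repo

    proc raiseExceptionIfError[E](whatAreWeVerifying: string, r: Result[void, E]) =
      # As a proc, r is evaluated exactly once at the call site; a template would
      # re-substitute the argument expression at every place r appears.
      if r.isErr:
        raise newException(CatchableError,
          "error while " & whatAreWeVerifying & ": " & $r.error)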

# from ../../lc_proxy/validate_proof import getAccountFromProof


var durationSpentDoingFetches*: times.Duration
Member:

Consider using the metrics library for this. This would allow visualizing the data in Grafana once we deploy long-running instances of Nimbus-eth1 on our "fleet" servers. As an example, here are some metrics from our Nimbus eth2 fleet:

https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&refresh=5m

Take a look at the "Block & Attestation Delay" panels at the bottom of the screen which are currently using histograms:

# at the module top-level scope
declareHistogram data_fetching_duration,
  "Data fetch duration", buckets = [0.25, 0.5, 1, 2, 4, 8, Inf]

# inside functions
data_fetching_duration.observe(durationInSecondsAsFloat) # Usually based on `Moment.now()`


proc fetchBlockHeaderWithHash*(rpcClient: RpcClient, h: Hash256): Future[BlockHeader] {.async.} =
  let t0 = now()
  let r = request("eth_getBlockByHash", %[%h.prefixHex, %false], some(rpcClient))
Member:

request uses waitFor internally, which is not quite appropriate. Furthermore, all HTTP requests may hang forever unless they are guarded with a timeout. Study the usages of awaitWithTimeout from the nimbus-eth2 codebase.
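
For example, a self-contained sketch of the timeout idea using std/asyncdispatch's withTimeout (awaitWithTimeout itself lives in nimbus-eth2 and looks a bit different; the fetchWithTimeout helper here is made up for illustration):

    import std/[asyncdispatch, httpclient]

    proc fetchWithTimeout(url: string, timeoutMs: int): Future[string] {.async.} =
      # Guard the HTTP request so it cannot hang forever.
      let client = newAsyncHttpClient()
      try:
        let fut = client.getContent(url)
        if not await withTimeout(fut, timeoutMs):
          raise newException(IOError, "request to " & url & " timed out")
        return fut.read
      finally:
        client.close()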

  let (blockNumber) = k.cpt.stack.popInt(1)
  k.cpt.stack.push:
    k.cpt.getBlockHash(blockNumber)
  let cpt = k.cpt # so it can safely be captured by the asyncChainTo closure below
Member:

Isn't this quite expensive to copy?
Can you explain the problem being solved in more detail? Perhaps we need to get back to the drawing board.

AdamSpitz (Contributor, author):

I'm a bit confused about what you're asking.

If you're asking why I'm saying let cpt = k.cpt, the problem I'm solving is that if I instead refer directly to k.cpt inside the asyncChainTo closure, I get error messages like this:

Error: 'k' is of type <var Vm2Ctx> which cannot be captured as it would violate memory safety, declared here: /home/adam/Projects/nimbus-eth1/nimbus/vm2/interpreter/op_handlers/oph_blockdata.nim(32, 32)

By doing what I did, I avoid the need for the closure to capture k. I don't actually need k (the Vm2Ctx) inside the asyncChainTo closure; I only need cpt (the Computation). (And this turns out to be true in all of the places where I've used this pattern.)

I expected this to be cheap; the Computation type is defined as a ref object, so I assumed that all that was being copied into the closure was a single reference. (But I didn't actually look at the C code to verify that.) Am I wrong?
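
Here's a stripped-down illustration of the two situations, with toy types standing in for the real Vm2Ctx and Computation:

    type
      Computation = ref object
        name: string
      Vm2Ctx = object
        cpt: Computation

    proc makeContinuation(k: var Vm2Ctx): proc () =
      # result = proc () = echo k.cpt.name   # error: 'k' is of type <var Vm2Ctx>
      #                                      # which cannot be captured
      let cpt = k.cpt                        # copies a single ref (one pointer)
      result = proc () = echo cpt.name       # capturing the ref is fine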
