[Merged by Bors] - Optimise tree hash caching for block production #2106

michaelsproul · 2020-12-19T01:17:15Z

Proposed Changes

@potuz on the Eth R&D Discord observed that Lighthouse blocks on Pyrmont were always arriving at other nodes after at least 1 second. Part of this could be due to processing and slow propagation, but metrics also revealed that the Lighthouse nodes were usually taking 400-600ms to even just produce a block before broadcasting it.

I tracked the slowness down to the lack of a pre-built tree hash cache (THC) on the states being used for block production. This was due to using the head state for block production, which lacks a THC in order to keep fork choice fast (cloning a THC takes at least 30ms for 100k validators). This PR modifies block production to clone a state from the snapshot cache rather than the head, which speeds things up by 200-400ms by avoiding the tree hash cache rebuild. In practice this seems to have cut block production time down to 300ms or less. Ideally we could remove the snapshot from the cache (and save the 30ms), but it is required for when we re-process the block after signing it with the validator client.
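A rough sketch of the idea, not the actual Lighthouse code: `State`, `SnapshotCache`, `get_cloned` and `state_for_block_production` below are hypothetical stand-ins invented for illustration. It shows block production preferring a cloned snapshot that already carries a tree hash cache, and falling back to the cache-less head state on a miss.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a beacon state. `tree_hash_cache` is `Some`
/// only when the cache has already been built.
#[derive(Clone, Debug, PartialEq)]
struct State {
    slot: u64,
    tree_hash_cache: Option<Vec<u8>>,
}

/// Hypothetical snapshot cache keyed by block root.
struct SnapshotCache {
    snapshots: HashMap<[u8; 32], State>,
}

impl SnapshotCache {
    /// Clone rather than remove: the snapshot must stay in the cache so the
    /// signed block can be re-processed against it later.
    fn get_cloned(&self, block_root: &[u8; 32]) -> Option<State> {
        self.snapshots.get(block_root).cloned()
    }
}

/// Prefer the cached snapshot (pre-built tree hash cache); on a miss, fall
/// back to the head state, which carries no cache and needs a full rebuild.
fn state_for_block_production(
    cache: &SnapshotCache,
    head_root: &[u8; 32],
    head_state: &State,
) -> State {
    cache
        .get_cloned(head_root)
        .unwrap_or_else(|| head_state.clone())
}
```

The clone still pays the roughly 30ms copy mentioned above, but it avoids the far more expensive rebuild.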

Alternatives

I experimented with 2 alternatives to this approach, before deciding on it:

Alternative 1: ensure the head has a tree hash cache. This is too slow, as it imposes a +30ms hit on fork choice, which currently takes ~5ms (with occasional spikes).
Alternative 2: use Arc<BeaconSnapshot> in the snapshot cache and share snapshots between the cache and the head. This made fork choice blazing fast (1ms), and block production the same as in this PR, but had a negative impact on block processing which I don't think is worth it. It ended up being necessary to clone the full state from the snapshot cache during block processing, imposing the +30ms penalty there as well as in block production.

In contrast, the approach in this PR should only impact block production, and it improves it! Yay for pareto improvements 🎉
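To illustrate the trade-off in Alternative 2, here is a minimal sketch using only `std::sync::Arc`. The `Snapshot` type and both helper functions are hypothetical and much simpler than `BeaconSnapshot`. Sharing an `Arc` is a pointer copy, but any code that needs an owned, mutable state still has to deep-copy it.

```rust
use std::sync::Arc;

/// Hypothetical, heavily simplified snapshot.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    slot: u64,
    balances: Vec<u64>,
}

/// Sharing between the cache and the head: only the reference count
/// changes, so the cost does not depend on the size of the state.
fn share(snapshot: &Arc<Snapshot>) -> Arc<Snapshot> {
    Arc::clone(snapshot)
}

/// Obtaining an owned state that can be mutated requires a full deep copy
/// of the shared value.
fn owned_copy(snapshot: &Arc<Snapshot>) -> Snapshot {
    (**snapshot).clone()
}
```

Under this assumption, the cheap path helps fork choice, while every path that mutates the state pays for the deep copy.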

Additional Info

This commit (ac59dfa) is currently running on all the Lighthouse Pyrmont nodes, and I've added a dashboard to the Pyrmont grafana instance with the metrics.

In future work we should optimise the attestation packing, which consumes around 30-60ms and is now a substantial contributor to the total.

michaelsproul · 2020-12-19T02:03:33Z

will fix up the tests on Monday

michaelsproul · 2020-12-20T23:49:24Z

This updated graph from @potuz shows this PR (ac59dfa) clearly outperforming v1.0.4 (1abc70e). The other commits were intermediate versions that I ran while testing.

michaelsproul · 2020-12-21T01:18:41Z

Ready.

paulhauner · 2020-12-21T02:46:21Z

beacon_node/beacon_chain/src/beacon_chain.rs

        validator_graffiti: Option<Graffiti>,
    ) -> Result<BeaconBlockAndState<T::EthSpec>, BlockProductionError> {
-        let state = self
-            .state_at_slot(slot - 1, StateSkipConfig::WithStateRoots)


Are you aware that the slot - 1 has been removed and we will no longer be able to produce blocks from slots earlier than the head block?

Yeah I did that intentionally, I'll message you

Good catch! I've restored the slot - 1 with a warning, as we discussed.

I think this will be particularly relevant on the first slot of an epoch when there might be two seemingly legitimate proposers because of propagation delay of the last block of the previous epoch.
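For readers following the thread, the restored fallback might look roughly like the sketch below. The condition `head_slot < slot`, the `StateSource` enum and `choose_state_source` are assumptions made for illustration; only the `slot - 1` lookup and the warning message appear in the diff.

```rust
/// Hypothetical description of where the block production state comes from.
#[derive(Debug, PartialEq)]
enum StateSource {
    /// Clone the head's snapshot (with its tree hash cache) from the cache.
    SnapshotCache,
    /// Load the state at the given slot, as the removed code did.
    StateAtSlot(u64),
}

/// Assumed rule: use the fast path when building on top of the head,
/// otherwise warn and fall back to the state at `slot - 1`.
fn choose_state_source(head_slot: u64, slot: u64) -> (StateSource, Option<&'static str>) {
    if head_slot < slot {
        (StateSource::SnapshotCache, None)
    } else {
        // Producing a block at or before the head slot conflicts with the head.
        let warning = "this block is more likely to be orphaned";
        // `saturating_sub` avoids underflow at slot 0 in this sketch.
        (StateSource::StateAtSlot(slot.saturating_sub(1)), Some(warning))
    }
}
```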

paulhauner

Nice, it's good to be doing less hashing!

I'm happy to merge this into unstable, regardless of the nit.

paulhauner · 2020-12-21T06:24:13Z

beacon_node/beacon_chain/src/beacon_chain.rs

+                    "message" => "this block is more likely to be orphaned",
+                    "slot" => slot,
+                );
+                self.state_at_slot(slot - 1, StateSkipConfig::WithStateRoots)


It does feel like this and L1783-1874 should be de-duped, but I won't block on it.

michaelsproul · 2020-12-21T06:29:22Z

bors r+

bors · 2020-12-21T07:42:56Z

Pull request successfully merged into unstable.

Build succeeded:

paulhauner · 2021-05-14T04:50:45Z

This branch can potentially be deleted :)

michaelsproul · 2021-05-14T04:59:11Z

done 😇

Block production optimisation + metrics

ac59dfa

michaelsproul added the ready-for-review, t Consensus & Verification, and A0 labels Dec 19, 2020

michaelsproul requested a review from paulhauner December 19, 2020 01:17

Update snapshot cache tests

e0181e1

paulhauner reviewed Dec 21, 2020

Re-instate slot - 1 head, with warning

cf78c00

paulhauner approved these changes Dec 21, 2020

bors bot changed the title ~~Optimise tree hash caching for block production~~ [Merged by Bors] - Optimise tree hash caching for block production Dec 21, 2020

bors bot closed this Dec 21, 2020

michaelsproul deleted the optimise-block-proposal branch May 14, 2021 04:59

michaelsproul added the consensus label Nov 9, 2022
