
[Merged by Bors] - Optimise snapshot cache for late blocks #2832


Conversation

macladson
Member

Proposed Changes

In the event of a late block, keep the block in the snapshot cache by cloning it. This helps us process new blocks quickly in the event the late block was re-org'd.
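
For context, here is a minimal sketch of the clone-rather-than-move behaviour described above, using simplified stand-in types (`Snapshot`, `SnapshotCache`) and an assumed `LATE_BLOCK_THRESHOLD`; it is not the actual Lighthouse implementation.

```rust
use std::time::Duration;

// Hypothetical threshold: a block arriving later than this into its slot is "late".
const LATE_BLOCK_THRESHOLD: Duration = Duration::from_secs(4);

/// Simplified stand-in for a cached snapshot (block + pre-state).
#[derive(Clone)]
struct Snapshot {
    slot: u64,
    // ... beacon block, beacon state, etc.
}

struct SnapshotCache {
    snapshots: Vec<Snapshot>,
}

impl SnapshotCache {
    /// Fetch the snapshot for `parent_slot`. If the child block arrived late,
    /// clone the snapshot and leave the original in place so that a re-org
    /// back to the parent can still be processed quickly; otherwise move it
    /// out of the cache as before.
    fn get_state_for_block_processing(
        &mut self,
        parent_slot: u64,
        block_delay: Option<Duration>,
    ) -> Option<Snapshot> {
        let index = self.snapshots.iter().position(|s| s.slot == parent_slot)?;
        let block_is_late = block_delay.map_or(false, |d| d > LATE_BLOCK_THRESHOLD);
        if block_is_late {
            // Late block: keep the snapshot in the cache by cloning it.
            Some(self.snapshots[index].clone())
        } else {
            // Timely block: remove (move) the snapshot as usual.
            Some(self.snapshots.remove(index))
        }
    }
}
```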

@macladson macladson added the work-in-progress PR is a work-in-progress label Nov 29, 2021
Member

@michaelsproul michaelsproul left a comment


Looks great!

I think we could roll this out on Prater pretty soon!

It might be cool to add a metric (and a debug log) for the cache miss? I've added some rough INFO logs on my testing branch 18c29af.
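
Something along these lines, perhaps; this is only a rough sketch using the `prometheus` and `slog` crates directly, with an assumed metric name, whereas the real wiring would go through Lighthouse's own metrics helpers.

```rust
use prometheus::IntCounter;
use slog::{debug, o, Discard, Logger};

fn main() {
    // Hypothetical counter name; the real metric registration may differ.
    let snapshot_cache_misses = IntCounter::new(
        "beacon_snapshot_cache_misses_total",
        "Count of snapshot cache misses during block processing",
    )
    .unwrap();

    // A discard logger just for this sketch; Lighthouse passes its own logger around.
    let log = Logger::root(Discard, o!());

    // On a cache miss: bump the counter and emit a debug log with the relevant slot.
    let slot: u64 = 2_623_844;
    snapshot_cache_misses.inc();
    debug!(log, "Missed snapshot cache"; "slot" => slot);
}
```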

Review comment on beacon_node/beacon_chain/src/snapshot_cache.rs (outdated, resolved)
@macladson macladson changed the title [WIP] Keep late blocks in snapshot cache Keep late blocks in snapshot cache Nov 30, 2021
@macladson macladson marked this pull request as ready for review November 30, 2021 01:23
@macladson macladson added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Nov 30, 2021
@michaelsproul
Member

Running this overnight on a mainnet node yielded really promising results!

  • 59 late blocks that cloned from the cache, with low import delays for each (~250-500ms), e.g.
    Nov 30 22:09:18.374 INFO Cloned snapshot for late block          block_delay: Some(7.21757108s), parent_root: 0xc108acc15f5ab6b0b74f1899b5b7214533deaa03ed10a209495b65d1137983da, parent_slot: 2623843, slot: 2623844, service: beacon
    Nov 30 22:09:18.461 DEBG Delayed head block                      set_as_head_delay: Some(27.887671ms), imported_delay: Some(216.372575ms), observed_delay: Some(7.21757108s), block_delay: 7.461831326s, slot: 2623844, proposer_index: 52841, block_root: 0xdd8c0a8caaf416fdf4917acdc301bf9848bc102741dcd5d82646e4c4e7f8bfca, service: beacon
    
  • 12 snapshot cache misses

Of the 12 snapshot cache misses, 9 blocks were orphaned and 3 became part of the canonical chain.

2617853: canonical
2618180: orphaned
2619655: orphaned
2619747: orphaned
2620058: canonical
2620396: orphaned
2620737: orphaned
2620781: orphaned
2622490: orphaned
2623547: orphaned
2623659: canonical
2623830: orphaned

We don't need to worry too much about the 9 orphaned blocks because validators don't stand to lose rewards by processing them slowly.

Here are the 3 canonical blocks that missed the cache:

Missed snapshot cache                   block_delay: Some(13.205655529s), parent_root: 0x502ce95fd60f6021f807ea4df31d4f24f002950fabbad68c2774a48782dd75ad, parent_slot: 2617851, slot: 2617853, service: beacon

Missed snapshot cache                   block_delay: Some(14.32521942s), parent_root: 0x164250a1dbffec2e1b8bd4f36f7c2745e5ae0c43bebdbb690b5bc27884fc3132, parent_slot: 2620057, slot: 2620058, service: beacon

Missed snapshot cache                   block_delay: Some(14.378901006s), parent_root: 0xcf95400e04a22ba0dab383b22af1793e51029bede01b4d1b483b6003046fdd66, parent_slot: 2623658, slot: 2623659, service: beacon

In all cases they arrived more than 12 seconds after the start of their assigned slot, and in all cases they re-orged the block at the slot after (an ex-ante re-org). This is essentially an instance of the 1-slot re-org attack described by Schwarz-Schilling et al. (which will be mitigated by proposer boosting).

The reason we miss the cache in this case is that there's a race between the (n + 1)th and nth blocks to access the cached state. If the (n + 1)th block arrives first, it will consume the state because it is legitimately on time. This motivates a new heuristic: if the previous slot is a skipped slot, clone rather than move, so that a super-late block for the previous slot still has access to its cached state. We could check this by passing the slot of the new block to `get_state_for_block_processing` and cloning if `block_slot > cache_item.beacon_block.slot() + 1`.
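
A sketch of that check, with a simplified `CacheItem` standing in for the real cache entry (the actual code would compare slot types on the cached `beacon_block`):

```rust
/// Simplified stand-in for a snapshot cache entry; in the real code the slot
/// would come from `cache_item.beacon_block.slot()`.
struct CacheItem {
    block_slot: u64,
}

/// Returns true if the state should be cloned rather than moved: the new block
/// is more than one slot ahead of the cached block, so the intervening slot was
/// skipped and a super-late block for that slot may still want this state.
fn clone_for_skipped_slot(cache_item: &CacheItem, block_slot: u64) -> bool {
    block_slot > cache_item.block_slot + 1
}

fn main() {
    let cached = CacheItem { block_slot: 2623843 };
    assert!(clone_for_skipped_slot(&cached, 2623845)); // slot 2623844 was skipped
    assert!(!clone_for_skipped_slot(&cached, 2623844)); // direct child: move as usual
}
```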

@michaelsproul
Member

I thought import delay was wholly dependent on our node's processing speed, but I just discovered an interesting case where we set the observation timestamp for a block while rejecting it:

Dec 01 01:39:49.189 DEBG Rejected gossip block                   slot: 2624897, graffiti: , error: ParentUnknown(parent_root:0x6d33…2cd8), service: beacon
Dec 01 01:39:49.205 DEBG Unknown block received. Starting a parent lookup, block_hash: 0x9c21…9952, block_slot: 2624897, service: sync
Dec 01 01:39:52.006 DEBG Head beacon block                       slot: 2624897, root: 0x9c21…9952, finalized_epoch: 82026, finalized_root: 0xc465…4bd2, justified_epoch: 82027, justified_root: 0xaaa3…a065, service: beacon
Dec 01 01:39:52.007 DEBG Delayed head block                      set_as_head_delay: Some(29.86472ms), imported_delay: Some(2.808787566s), observed_delay: Some(2.169209863s), block_delay: 5.007862149s, slot: 2624897, proposer_index: 48065, block_root: 0x9c21881256414f38c9b4da6613992fe7fdfc29a20c004619143d7b96f6a99952, service: beacon

The timestamp of the first message at 01:39:49.189 approximately matches the `observed_delay` of 2.17s (slot start is 01:39:47). In this case the import delay includes the time to fetch the block from the network and to process it. This should only happen in rare cases where we aren't receiving blocks via gossip (e.g. due to clock sync issues or a low peer count), so I don't think we should do anything about it right now. In my node's case this occurred immediately after start-up when the node had only found 3 peers, so nothing to worry about.

@macladson
Member Author

... In this case the import delay includes the time to fetch the block from the network and to process it.

Ah interesting find.

Whenever we set the observed timestamp, we don't set it to the current time; rather, we set it to the time we originally observed the block (I think this explains why the time is so short even though the block was initially rejected).
(Timestamp Recorded -> Cache Written)

Off the top of my head I can think of two solutions:

  1. Introduce a new timestamp to separate these events from block processing (a rough sketch follows this list).
  2. Overwrite the observed timestamp when needing to fetch these kinds of blocks, e.g. set the observed timestamp again (this time to the current time) after the parent lookup is complete. This is probably simpler, but it makes the definition of observed slightly ambiguous. (Does it refer to the first time the block was seen, or to when all the network work is done and the block is ready to be imported?)
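
A rough sketch of option 1, with illustrative field names (the real per-block timestamps live in Lighthouse's block-times tracking and may be named differently):

```rust
use std::time::Duration;

/// Illustrative per-block timestamps, measured as offsets from the slot start.
/// Field names here are assumptions, not the real Lighthouse definitions.
#[derive(Default)]
struct BlockTimes {
    /// When the block was first seen (e.g. on gossip); set once and never overwritten.
    observed: Option<Duration>,
    /// Option 1 above: a separate timestamp for when all network work (such as
    /// a parent lookup) finished and the block was handed to block processing.
    ready_for_import: Option<Duration>,
    imported: Option<Duration>,
    set_as_head: Option<Duration>,
}

impl BlockTimes {
    /// Keep the earliest observation, matching the current behaviour where the
    /// original observation time is preserved even if the block is re-fetched.
    fn set_observed(&mut self, timestamp: Duration) {
        self.observed.get_or_insert(timestamp);
    }

    /// Record the point at which the block became ready to import.
    fn set_ready_for_import(&mut self, timestamp: Duration) {
        self.ready_for_import.get_or_insert(timestamp);
    }
}
```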

@michaelsproul
Member

Only 2 cache misses overnight, and they were for bad blocks that got orphaned:

Dec 01 12:26:18.490 DEBG Missed snapshot cache                   block_delay: Some(6.915789455s), parent_root: 0x43bc9b73a202d6db26f3decd7242d83d61fa79088a78cec1249c061c30121e34, parent_slot: 2628127, slot: 2628129, service: beacon
Dec 01 16:32:46.795 DEBG Missed snapshot cache                   block_delay: Some(8.95573682s), parent_root: 0xf0212e39770861a913ee82955142322176003b323cc232e67e51b9114a9b7b0f, parent_slot: 2629345, slot: 2629361, service: beacon

Both blocks were built on a stale parent that the snapshot cache had advanced past (16 slots past, in the case of the second block). I'm ok with processing such blocks slowly for now, until we have a more generic framework for deep forking built atop tree states.

@michaelsproul
Member

Introduce a new timestamp to separate these events from block processing.

I think I prefer this option, as it's the most thorough, but we needn't implement it yet.


Although there were no significant cache misses overnight, there were still 5 blocks from the canonical chain with import delays greater than 1 second. I've collated them in this spreadsheet: https://docs.google.com/spreadsheets/d/1mWBX9a8muC2s78Ejg8hqS4JoQ67zXhXk99ExOvPqp38/

One of them is due to a skipped slot on an epoch boundary, which forced 500ms of slot processing, but the others seem to be due to slowness in ForkChoice::on_block. The fact that it's taking several seconds (up to 5?) is concerning and definitely cause for investigation!

Co-authored-by: Michael Sproul <micsproul@gmail.com>
@michaelsproul
Member

michaelsproul commented Dec 2, 2021

A PR that I think should fix the fork choice issue is here: #2849

I'm going to cherry-pick it into the mashup of optimisations that I've been running.

Member

@michaelsproul michaelsproul left a comment


I'm happy to merge this, pending no show-stopping issues on Prater (running now).

I've edited the title slightly to reflect the inclusion of the two heuristics.

@michaelsproul michaelsproul changed the title Keep late blocks in snapshot cache Optimise snapshot cache for late blocks Dec 2, 2021
@michaelsproul
Member

michaelsproul commented Dec 6, 2021

These changes don't seem to have helped much with block imports on Prater, because Prater is a lot more chaotic than mainnet and has many more blocks that fork more than 1 slot. However, they also don't seem to have had any detrimental effect, and seeing as the PR offers real improvements on mainnet, I'm going to merge it.

bors r+

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Dec 6, 2021
bors bot pushed a commit that referenced this pull request Dec 6, 2021
## Proposed Changes

In the event of a late block, keep the block in the snapshot cache by cloning it. This helps us process new blocks quickly in the event the late block was re-org'd.


Co-authored-by: Michael Sproul <michael@sigmaprime.io>
@bors bors bot changed the title Optimise snapshot cache for late blocks [Merged by Bors] - Optimise snapshot cache for late blocks Dec 6, 2021
@bors bors bot closed this Dec 6, 2021
@macladson macladson deleted the late-block-snapshot branch December 6, 2021 07:06
bors bot pushed a commit that referenced this pull request Jan 14, 2022
## Issue Addressed

NA

## Proposed Changes

In #2832 we made some changes to the `SnapshotCache` to help deal with the one-block reorgs seen on mainnet (and testnets).

I believe the change in #2832 is good and we should keep it, but I think that in its present form it is causing the `SnapshotCache` to hold onto states that it doesn't need anymore. For example, a skip slot will result in one more `BeaconSnapshot` being stored in the cache.

This PR adds a new type of pruning that happens after a block is inserted to the cache. We will remove any snapshot from the cache that is a *grandparent* of the block being imported. Since we know the grandparent has two valid blocks built atop it, it is not at risk from a one-block re-org. 
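
A minimal sketch of that pruning rule, using simplified stand-in types and a made-up `prune_grandparent` helper rather than the real cache methods:

```rust
type Root = [u8; 32];

/// Simplified stand-in for a cached snapshot.
struct Snapshot {
    block_root: Root,
    parent_root: Root,
}

struct SnapshotCache {
    snapshots: Vec<Snapshot>,
}

impl SnapshotCache {
    /// Called after inserting a block whose parent is `parent_root`: the
    /// parent's parent (the grandparent) now has two valid blocks built atop
    /// it, so it is safe from a one-block re-org and its snapshot can be dropped.
    fn prune_grandparent(&mut self, parent_root: Root) {
        // Look up the parent snapshot to learn the grandparent's root...
        let grandparent_root = self
            .snapshots
            .iter()
            .find(|s| s.block_root == parent_root)
            .map(|parent| parent.parent_root);

        // ...then remove the grandparent's snapshot from the cache, if present.
        if let Some(grandparent_root) = grandparent_root {
            self.snapshots.retain(|s| s.block_root != grandparent_root);
        }
    }
}
```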

## Additional Info

NA