[Merged by Bors] - Optimise snapshot cache for late blocks #2832
Conversation
Looks great!
I think we could roll this out on Prater pretty soon!
It might be cool to add a metric (and a debug log) for cache misses? I've added some rough INFO logs on my testing branch: 18c29af
Running this overnight on a mainnet node yielded really promising results!
Of the 12 snapshot cache misses, 9 blocks were orphaned and 3 became part of the canonical chain.
We don't need to worry too much about the 9 orphaned blocks because validators don't stand to lose rewards by processing them slowly. Here are the 3 canonical blocks that missed the cache:
In all cases they arrived more than 12 seconds after the start of their assigned slot, and in all cases they re-orged the block at the slot after (an ex-ante re-org). This is essentially an instance of the 1-slot re-org attack described by Schwarz-Schilling et al. (which will be mitigated by proposer boosting). The reason we miss the cache in this case is that there's a race between the |
I thought import delay was wholly dependent on our node's processing speed, but I just discovered an interesting case where we set the observation timestamp for a block while rejecting it:
The timestamp of the first message at 01:39:49.189 approximately matches the |
Ah, interesting find. Whenever we set the observed timestamp, we don't set it to the current time; rather, we set it to the time we originally observed the block (I think this explains why the delay looks so short even when the block was initially rejected). Off the top of my head I can think of two solutions:
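The "first observation wins" behaviour described above can be sketched as follows. This is a hedged illustration with hypothetical names (`ObservedBlocks`, `observe`, `u64` roots), not Lighthouse's actual API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker: records the instant a block was *first* seen on
/// gossip, and returns that same instant on any later sighting, even if the
/// block was rejected the first time around.
struct ObservedBlocks {
    first_seen: HashMap<u64, Instant>,
}

impl ObservedBlocks {
    /// Return the first-seen time for `block_root`, inserting `now` only if
    /// this is the first observation.
    fn observe(&mut self, block_root: u64, now: Instant) -> Instant {
        *self.first_seen.entry(block_root).or_insert(now)
    }
}

fn main() {
    let t0 = Instant::now();
    let t1 = t0 + Duration::from_secs(30);
    let mut observed = ObservedBlocks { first_seen: HashMap::new() };
    assert_eq!(observed.observe(0xabc, t0), t0);
    // Re-observed 30s later, yet the recorded time is still t0 -- which is
    // why an initially rejected block can show an implausibly small delay.
    assert_eq!(observed.observe(0xabc, t1), t0);
    println!("first-seen timestamp preserved across re-observations");
}
```

Under this model, any "import delay" computed from the stored timestamp measures time since first sighting, not time since the block was last (re-)processed.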
|
Only 2 cache misses overnight and they were for bad blocks that got orphaned:
Both blocks were based on a stale parent which the snapshot cache had advanced past (16 slots past, in the case of the second block). I'm OK with processing such blocks slowly for now, until we have a more generic framework for deep forking built atop tree states.
I think I prefer this option, as it's the most thorough, but we needn't implement it yet. Although there were no significant cache misses overnight, there were still 5 blocks from the canonical chain with import delays greater than 1 second. I've collated them in this spreadsheet: https://docs.google.com/spreadsheets/d/1mWBX9a8muC2s78Ejg8hqS4JoQ67zXhXk99ExOvPqp38/ One of them is due to a skipped slot on an epoch boundary, which forced 500ms of slot processing, but the others seem to be due to slowness in |
Co-authored-by: Michael Sproul <micsproul@gmail.com>
A PR that I think should fix the fork choice issue is here: #2849. I'm going to cherry-pick it into the mashup of optimisations that I've been running.
I'm happy to merge this, pending no show-stopping issues on Prater (running now).
I've edited the title slightly to reflect the inclusion of the two heuristics.
These changes don't seem to have helped much with block imports on Prater, because Prater is a lot more chaotic than mainnet and has many more blocks that fork more than 1 slot. However, it also doesn't seem to have had any detrimental effect, and seeing as it offers real improvements on mainnet, I'm going to merge it. bors r+
## Proposed Changes

In the event of a late block, keep the block in the snapshot cache by cloning it. This helps us process new blocks quickly in the event the late block was re-org'd.

Co-authored-by: Michael Sproul <michael@sigmaprime.io>
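The late-block heuristic above can be sketched roughly as follows. This is an illustrative model with stand-in types (`u64` roots, a `Vec`-backed cache) and a hypothetical `get_for_block_processing` method, not Lighthouse's actual `SnapshotCache` API:

```rust
/// Minimal stand-in for the pre-state needed to process a child block.
#[derive(Clone)]
struct BeaconSnapshot {
    block_root: u64,
}

struct SnapshotCache {
    snapshots: Vec<BeaconSnapshot>,
}

impl SnapshotCache {
    /// Fetch the parent snapshot for block processing. If the parent arrived
    /// late (so it may yet be re-org'd), clone the snapshot and leave it in
    /// the cache so a competing block building on the same parent still gets
    /// a cache hit; otherwise move it out as usual.
    fn get_for_block_processing(
        &mut self,
        parent_root: u64,
        parent_arrived_late: bool,
    ) -> Option<BeaconSnapshot> {
        let idx = self
            .snapshots
            .iter()
            .position(|s| s.block_root == parent_root)?;
        if parent_arrived_late {
            Some(self.snapshots[idx].clone())
        } else {
            Some(self.snapshots.swap_remove(idx))
        }
    }
}

fn main() {
    let mut cache = SnapshotCache {
        snapshots: vec![BeaconSnapshot { block_root: 1 }],
    };
    // A late parent's snapshot survives repeated lookups...
    assert!(cache.get_for_block_processing(1, true).is_some());
    assert!(cache.get_for_block_processing(1, true).is_some());
    // ...while a timely parent's snapshot is moved out on first use.
    assert!(cache.get_for_block_processing(1, false).is_some());
    assert!(cache.get_for_block_processing(1, false).is_none());
    println!("late-block snapshot retained via clone");
}
```

The clone trades some memory for the ability to process either the late block's child or a re-orging sibling straight from the cache.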
Pull request successfully merged into unstable. Build succeeded: |
## Issue Addressed

NA

## Proposed Changes

In #2832 we made some changes to the `SnapshotCache` to help deal with the one-block re-orgs seen on mainnet (and testnets). I believe the change in #2832 is good and we should keep it, but I think that in its present form it is causing the `SnapshotCache` to hold onto states that it doesn't need anymore. For example, a skip slot will result in one more `BeaconSnapshot` being stored in the cache.

This PR adds a new type of pruning that happens after a block is inserted to the cache. We will remove any snapshot from the cache that is a *grandparent* of the block being imported. Since we know the grandparent has two valid blocks built atop it, it is not at risk from a one-block re-org.

## Additional Info

NA
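The grandparent-pruning rule can be sketched as follows, again with stand-in types and a hypothetical `prune_after_insert` helper rather than Lighthouse's real `SnapshotCache` internals. Once a block is imported, its grandparent has two blocks built atop it, so its snapshot cannot be needed for a one-block re-org and may be dropped:

```rust
/// Minimal stand-in for a cached pre-state, keyed by block root.
struct BeaconSnapshot {
    block_root: u64,
}

/// After importing a block, drop the snapshot of the block's *grandparent*:
/// with two valid descendants (the parent and the new block), it is no
/// longer at risk from a one-block re-org.
fn prune_after_insert(snapshots: &mut Vec<BeaconSnapshot>, grandparent_root: u64) {
    snapshots.retain(|s| s.block_root != grandparent_root);
}

fn main() {
    let mut snapshots = vec![
        BeaconSnapshot { block_root: 1 }, // grandparent of the new block
        BeaconSnapshot { block_root: 2 }, // parent of the new block
    ];
    // Import a block whose parent is root 2 and grandparent is root 1.
    prune_after_insert(&mut snapshots, 1);
    assert_eq!(snapshots.len(), 1);
    assert_eq!(snapshots[0].block_root, 2);
    println!("grandparent snapshot pruned");
}
```

This keeps the cache from growing by one entry per skip slot while preserving the snapshots that the late-block heuristic still needs.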