
[Merged by Bors] - Use async code when interacting with EL #3244

Closed

Conversation

paulhauner
Member

@paulhauner paulhauner commented Jun 6, 2022

Overview

This rather extensive PR achieves two primary goals:

  1. Uses the finalized/justified checkpoints of fork choice (FC), rather than that of the head state.
  2. Refactors fork choice, block production and block processing to async functions.

Additionally, it achieves:

  • Concurrent forkchoice updates to the EL and cache pruning after a new head is selected.
  • Concurrent "block packing" (attestations, etc) and execution payload retrieval during block production.
  • Concurrent per-block-processing and execution payload verification during block processing.
  • The Arc-ification of SignedBeaconBlock during block processing (it's never mutated, so why not?):
    • I had to do this to deal with sending blocks into spawned tasks.
    • Previously we were cloning the beacon block at least twice during each block processing; these clones are either removed or turned into cheaper Arc clones.
    • We were also Box-ing and un-Box-ing beacon blocks as they moved throughout the networking crate. This is not a big deal, but it's nice to avoid shifting things between the stack and heap.
    • Avoids cloning all the blocks in every chain segment during sync.
    • It also has the potential to clean up our code where we need to pass an owned block around so we can send it back in the case of an error (I didn't do much of this, my PR is already big enough 😅)
  • The BeaconChain::HeadSafetyStatus struct was removed. It was an old relic from prior merge specs.
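
As flagged in the list above, here is a minimal, hedged sketch of the concurrent block-production pattern, assuming a tokio runtime. The function names and bodies are illustrative stand-ins, not Lighthouse code:

```
// Run "block packing" and payload retrieval at the same time, await both.
async fn pack_attestations() -> Vec<u64> {
    vec![1, 2, 3] // stand-in for gathering attestations, slashings, etc.
}

async fn get_execution_payload() -> u64 {
    42 // stand-in for requesting the payload from the EL
}

async fn produce_block() -> (Vec<u64>, u64) {
    // Both futures make progress concurrently; neither blocks the other.
    tokio::join!(pack_attestations(), get_execution_payload())
}
```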

For motivation for this change, see #3244 (comment)

Changes to canonical_head and fork_choice

Previously, the BeaconChain had two separate fields:

```
canonical_head: RwLock<Snapshot>,
fork_choice: RwLock<BeaconForkChoice>
```

Now, we have grouped these values under a single struct:

```
canonical_head: CanonicalHead {
  cached_head: RwLock<Arc<Snapshot>>,
  fork_choice: RwLock<BeaconForkChoice>
}
```

Apart from ergonomics, the only actual change here is wrapping the canonical head snapshot in an Arc. This means that we no longer need to hold the cached_head (canonical_head, in old terms) lock when we want to pull some values from it. This was done to avoid deadlock risks by preventing functions from acquiring (and holding) the cached_head and fork_choice locks simultaneously.
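
As a rough sketch of the read pattern this enables (minimal stand-in types, assuming parking_lot-style locks; not the actual Lighthouse code):

```
use std::sync::Arc;
use parking_lot::RwLock;

struct Snapshot {
    head_slot: u64,
}

struct CanonicalHead {
    cached_head: RwLock<Arc<Snapshot>>,
}

impl CanonicalHead {
    /// Clone the Arc under a short-lived read lock, then drop the lock.
    /// The caller reads the snapshot without holding `cached_head`, so it
    /// can later take `fork_choice` without interleaving the two locks.
    fn cached_head(&self) -> Arc<Snapshot> {
        self.cached_head.read().clone()
    }
}
```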

Breaking Changes

The state (root) field in the finalized_checkpoint SSE event

Consider the scenario where epoch n is just finalized, but start_slot(n) is skipped. There are two state roots we might use in the finalized_checkpoint SSE event:

  1. The state root of the finalized block, which is get_block(finalized_checkpoint.root).state_root.
  2. The state root at slot of start_slot(n), which would be the state from (1), but "skipped forward" through any skip slots.

Previously, Lighthouse would choose (2). However, we can see that when Teku generates that event it uses getStateRootFromBlockRoot which uses (1).

I have switched Lighthouse from (2) to (1). I think it's a somewhat arbitrary choice between the two, where (1) is easier to compute and is consistent with Teku.
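
To make the difference concrete, a hedged sketch with stand-in types and functions (not Lighthouse's actual API): option (1) is a single block lookup, while option (2) additionally replays any skip slots up to start_slot(n).

```
// Stand-in types for illustration only.
type Root = u64;
type Slot = u64;

struct Block { state_root: Root }
struct State { root: Root }

fn get_block(_block_root: Root) -> Block { Block { state_root: 1 } }
fn load_state(root: Root) -> State { State { root } }
fn process_skip_slots(state: State, _target: Slot) -> State {
    state // stand-in for advancing the state through empty slots
}

// Option (1): the state root recorded in the finalized block itself.
fn option_one(finalized_block_root: Root) -> Root {
    get_block(finalized_block_root).state_root
}

// Option (2): the state root at start_slot(n), after skip-slot processing.
fn option_two(finalized_block_root: Root, epoch_start_slot: Slot) -> Root {
    let state = load_state(get_block(finalized_block_root).state_root);
    process_skip_slots(state, epoch_start_slot).root
}
```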

Notes for Reviewers

I've renamed BeaconChain::fork_choice to BeaconChain::recompute_head. Doing this helped ensure I broke all previous uses of fork choice and I also find it more descriptive. It describes an action and can't be confused with trying to get a reference to the ForkChoice struct.

I've changed the ordering of SSE events when a block is received. It used to be [block, finalized, head] and now it's [block, head, finalized]. It was easier this way and I don't think we were making any promises about SSE event ordering so it's not "breaking".

I've made it so fork choice will run when it's first constructed. I did this because I wanted to have a cached version of the last call to get_head. Ensuring get_head has been run at least once means that the cached value doesn't need to be wrapped in an Option. This was fairly simple, it just involved passing a slot to the constructor so it knows when it's being run. When loading a fork choice from the store and a slot clock isn't handy, I've just used the slot that was saved in the fork_choice_store. That seems like it would be a faithful representation of the slot when we saved it.
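
A tiny sketch of the "run once at construction" pattern with generic stand-ins (not the real constructor or types):

```
type Slot = u64;

struct ForkChoice {
    // A plain value rather than Option<_>, because `new` always populates it.
    cached_head: Slot,
}

impl ForkChoice {
    fn new(current_slot: Slot) -> Self {
        // Run the head computation once so the cache is never empty.
        let cached_head = Self::compute_head(current_slot);
        Self { cached_head }
    }

    fn cached_head(&self) -> Slot {
        self.cached_head
    }

    fn compute_head(slot: Slot) -> Slot {
        slot // stand-in for the real get_head logic
    }
}
```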

I added the genesis_time: u64 to the BeaconChain. It's small, constant and nice to have around.

Since we're using FC for the fin/just checkpoints, we no longer get the 0x00..00 roots at genesis. You can see I had to remove a work-around in ef-tests here: b56be3b. I can't find any reason why this would be an issue; if anything, I think it'll be better, since the genesis alias has caught us out a few times (0x00..00 isn't actually a real root). Edit: I did find a case where the network expected the 0x00..00 alias and patched it here: 3f26ac3.

You'll notice a lot of changes in tests. Generally, tests should be functionally equivalent. Here are the things creating the most diff-noise in tests:

  • Changing tests to be tokio::async tests.
  • Adding .await to fork choice, block processing and block production functions.
  • Refactor of the canonical_head "API" provided by the BeaconChain. E.g., chain.canonical_head.cached_head() instead of chain.canonical_head.read().
  • Wrapping SignedBeaconBlock in an Arc.
  • In the beacon_chain/tests/block_verification, we can't use the lazy_static CHAIN_SEGMENT variable anymore since it's generated with an async function. We just generate it in each test, not so efficient but hopefully insignificant.

I had to disable rayon concurrent tests in the fork_choice tests. This is because the use of rayon and block_on was causing a panic.
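
For context, the panic class looks roughly like this; a minimal reproduction assuming tokio, not code from this PR:

```
// `Handle::block_on` panics if called from within an async execution
// context, which is exactly what happens when sync code that calls
// `block_on` (e.g. via a rayon task) runs inside an async test.
#[tokio::test]
#[should_panic]
async fn blocking_inside_the_runtime_panics() {
    let handle = tokio::runtime::Handle::current();
    // Panics: cannot block the current thread from within a runtime.
    let _ = handle.block_on(async { 1 + 1 });
}
```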

@paulhauner paulhauner added work-in-progress PR is a work-in-progress bellatrix Required to support the Bellatrix Upgrade labels Jun 6, 2022
@paulhauner paulhauner added the backwards-incompat Backwards-incompatible API change label Jun 7, 2022
```
let head_finalized = head_snapshot.beacon_state.finalized_checkpoint();
if fc_finalized.epoch < head_finalized.epoch {
    return Err(format!(
        "Database corrupt: fork choice is finalized at {:?} whilst head is finalized at \
```
paulhauner (Member Author)

Reviewers note: the uncoupling of just/fin from the canonical head means that we can't be as strict here.

@winksaville
Contributor

Such a large change at this time makes me very nervous. It currently modifies 95 files in 26 commits; in my opinion those numbers are way too high for a single PR. At a minimum there should be two PRs, one for each of the two "primary goals", but I'd contend there should probably be more. I would say it's somewhere between 2 and 26, and obviously all tests should succeed on each PR.

Another suggestion: before making such a massive change, I believe it would be prudent to determine what the current code coverage metric is and make it a goal to not decrease the code coverage with each PR.

Finally, have there been discussions on what alternative solutions are available and are both primary goals absolutely necessary now?

@paulhauner
Member Author

Such a large change at this time makes me very nervous.

I can see why, I also don't like big changes before the merge. However, the code without this PR also makes me nervous. I won't go into detail here, but this PR addresses the two largest concerns I have with Lighthouse and the merge.

Hopefully I can address your concerns and make you feel less nervous. We're talking about the subjective topic of how best to manage code changes, so even if you don't end up agreeing with me I hope that I can at least show that we chose this path with much deliberation and walk it with caution.

It currently modifies 95 files in 26 commits; in my opinion those numbers are way too high for a single PR.

I generally agree. I don't like the "mega-PRs" and would much prefer smaller, segmented PRs. This is usually how we work, but like any good rule it has exceptions.

There sometimes arises a "choose one" problem between:

  1. Having an unstable branch with all tests passing.
  2. Breaking a PR up into small chunks.

As an example, consider turning the BeaconChain::fork_choice function into an async function by simply adding async in front of fn fork_choice. Firstly, that's going to affect all tests which call fork choice, which is a lot of tests. Just from that we're going to have to touch ~80-90 files to change #[test] to #[tokio::test] and fn my_test to async fn my_test.

Given that those 80-90 changes are trivial to review, I would say it's perfectly fine (and necessary) to create a PR that touches 80-90 files. In fact, if you look through this PR you'll find a significant number of changes are just that, making sync tests async.
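
Concretely, each of those trivial changes looks like this (illustrative test bodies):

```
// Before: a synchronous test.
#[test]
fn processes_block() {
    assert!(true);
}

// After: the same test as an async tokio test, so the body can `.await`
// the now-async fork choice / block production functions.
#[tokio::test]
async fn processes_block_async() {
    assert!(true);
}
```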

Now, even if we do that, the tests are going to start failing because the block processing and block production functions use block_on and will be called from an async context and therefore panic. Now, at this point we must choose between:

  1. Allowing the tests to fail.
  2. Hacking the tests so that all calls to BeaconChain::fork_choice are wrapped in a way that avoids panics.
  3. Modifying block production and processing to be async as well.

(1) means that we can't have tests passing on unstable. This means that all other PRs after that PR aren't getting the full test-suite and might be quietly introducing bugs. (2) is "throw away" work: it needs to be reviewed thoroughly to ensure it's not introducing bugs, but then needs to be thrown away once we make block production and processing async (as we intend to do). Something else to note about (2) is that it significantly increases the total amount of lines of code to be reviewed. So, given that (1) and (2) have distinct downsides, I've gone ahead and attempted to do (3) in a minimal way that produces a diff that is as small as possible.

I think it's important to understand that the vast majority of line changes and files touched are just non-functional changes to tests (e.g., adding .await to functions or getting info about the head from a different API). The changes to fork choice are substantial, but the changes to block processing and production should be non-substantial and I've tried to make them in a way that produces the cleanest and easiest-to-reason-about diff.

At a minimum there should be two PRs, one for each of the two "primary goals"

The problem here is that in order to refactor BeaconChain::fork_choice to make it ideal (in my view) I believe it needs to be async. So, I could refactor it in a less-than-ideal way so that it's not async, then come back for another refactor to make it async (which, as I've discussed earlier, is either going to be very large or break tests in unstable). Once again, I'd have broken the change into smaller segments, but increased the total amount of code that needs to be reviewed.

I believe it would be prudent to determine what the current code coverage metric is and make it a goal to not decrease the code coverage with each PR.

Good idea, I'll run some coverage metrics. I don't take coverage as gospel, though. Let's consider this PR; it consists of three components:

  1. Refactoring fork choice to use the justified/finalized checkpoints of fork choice rather than the head state.
  2. Wrapping block processing and block production in async blocks, without changing their core functionality.
  3. Doing tedious one-line changes to all the tests to make them async and to read the justified/finalized checkpoints from a different source.

The vast majority of these changes, per "files touched" and "lines changed", comes from (3). So, a huge amount of these changes are trivial changes to tests that have nothing to do with code coverage (tests don't cover tests).

For (2), we're adding non-functional wrapping code. The underlying code (under the "wrapping") is already extensively tested. I can't write tests for the wrapping code because it's supposed to be transparent (i.e., have no observable effect on the output of the function), and this can only be confirmed by manual review and the existing test suite. These non-functional wrappers create lines of code that cannot be tested, so code coverage must decrease.

Finally, have there been discussions on what alternative solutions are available and are both primary goals absolutely necessary now?

Yep, absolutely. I've had a one-on-one meeting with another senior developer regarding this. I've also raised it with the rest of the team several times at different internal meetings, and I believe the entire core team is across this and my reasoning for it.

As always, everyone is cautious about large changes but I'm not aware of anyone on the core team that opposes this.

bors bot pushed a commit that referenced this pull request Jun 10, 2022
## Description

Add a new lint to CI that attempts to detect calls to functions like `block_on` from async execution contexts. This lint was written from scratch exactly for this purpose, on my fork of Clippy: https://github.com/michaelsproul/rust-clippy/tree/disallow-from-async

## Additional Info

- I've successfully detected the previous two issues we had with `block_on` by running the linter on the commits prior to each of these PRs: #3165, #3199.
- The lint runs on CI with `continue-on-error: true` so that if it fails spuriously it won't block CI.
- I think it would be good to merge this PR before #3244 so that we can lint the extensive executor-related changes in that PR.
- I aim to upstream the lint to Clippy, at which point building a custom version of Clippy from my fork will no longer be necessary. I imagine this will take several weeks or months though, because the code is currently a bit hacky and will need some renovations to pass review.
@michaelsproul michaelsproul changed the title Use async code when interfacting with EL Use async code when interacting with EL Jun 14, 2022
@paulhauner
Member Author

I'm moving some text from the description to this comment. I feel it was crowding the description, but I wanted to keep it for reference. Text is below.


Why have you done this to us?

One might wonder if we need a huge refactor just before the merge. That's for you to decide, but here are my motivations.

Firstly, I think it is well understood that it's necessary to perform the just/fin refactor (1) before the merge. This change required an extensive refactor of the BeaconChain::fork_choice function, both internally and externally. The function had to be structurally changed in order for finalization to advance without the head changing. Furthermore, all downstream functions had to be changed to read the fin/just information from fork choice, rather than from the canonical_head. This posed an immediate problem where the beacon_chain.canonical_head and beacon_chain.fork_choice were behind separate RwLocks, so atomic reads were not possible without interleaving locks and risking deadlocks. Given that the optimistic status of blocks is stored in fork_choice, it became difficult to atomically ask the question "what is the head and is it optimistic?". My solution is to put the "canonical head" and "fork choice" behind the same RwLock, which is implemented here.
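
For illustration, a hedged, minimal sketch (not Lighthouse code) of the interleaved-lock hazard with two separate locks: if one thread takes canonical_head then fork_choice while another takes them in the opposite order, each can end up waiting on the lock the other holds.

```
use std::sync::{Arc, RwLock};
use std::thread;

struct Chain {
    canonical_head: RwLock<u64>,
    fork_choice: RwLock<u64>,
}

fn main() {
    let chain = Arc::new(Chain {
        canonical_head: RwLock::new(0),
        fork_choice: RwLock::new(0),
    });

    let a = Arc::clone(&chain);
    let t1 = thread::spawn(move || {
        let _head = a.canonical_head.write().unwrap();
        // May block forever if the other thread already holds `fork_choice`
        // and is waiting on `canonical_head`.
        let _fc = a.fork_choice.write().unwrap();
    });

    let b = Arc::clone(&chain);
    let t2 = thread::spawn(move || {
        let _fc = b.fork_choice.write().unwrap();
        let _head = b.canonical_head.write().unwrap();
    });

    // Note: this program can genuinely deadlock; that is the hazard shown.
    let _ = t1.join();
    let _ = t2.join();
}
```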

So, I have made the point for an extensive refactor of fork choice. Now, if I'm refactoring fork choice and I need to touch everything that calls fork choice, why not also switch fork choice to be async? We've been having difficulties with panics (e.g. #3165) arising from our use of block_on in an async context. The type system does not protect us from this (yet) and it clearly prefers that functions which do async things be officially async. So, why not make our fork choice function async whilst I'm at it?

Now, I hope I have a case for refactoring fork choice and making it async. This required a horrific refactor of all tests that produce blocks to make them async. Once this was done, I found that any test that processed or produced a block would panic because those functions use block_on. So, the solutions are to (a) give up and cede ground, (b) wrap all those functions in spawn_blocking tasks or (c) refactor block production/processing to be async too. I chose (c). It was becoming clear that the use of block_on is an anti-pattern; the compiler does not protect us against the panics arising from it and it is practically impossible to program defensively against. It's also worth noting that #3070 was running into these same issues with mixing async/sync code (the http_api uses async tests), and this PR will avoid a whole lot of ugly code in that PR.

All that said, I hope the reader can see why I have chosen to undertake such a big refactor at this point in time. The calls to an external EL are what put us in this mess, which is a merge-specific thing. I think we're in the scenario where we can make this change and still see it run on several testnets before it's time to merge. I also think that after this PR we're ultimately safer, since we're back to programming within the protective fortress of the compiler.

@realbigsean realbigsean left a comment


Started reviewing today; only really got through canonical_head.rs so far. The new structure is making sense to me, but I need to think about it a bit more in the context of all the changes. Looking really good so far!

beacon_node/beacon_chain/src/canonical_head.rs
@paulhauner paulhauner marked this pull request as ready for review June 22, 2022 22:59
@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jun 23, 2022
@realbigsean realbigsean left a comment


Made another dent in review here; think I've gotten through the meat of the changes at this point. I'll need another day or so to get through the rest.

beacon_node/beacon_chain/src/beacon_chain.rs
@paulhauner
Member Author

I've just rebased onto unstable.

@paulhauner
Member Author

Latest set of comments addressed, thanks!

@realbigsean realbigsean left a comment


Ok, finished the initial review!

beacon_node/client/src/builder.rs
common/task_executor/src/lib.rs
@paulhauner
Member Author

All comments addressed! Thanks @michaelsproul and @ethDreamer for this round!

I've had a few tests fail after rebasing onto unstable. I'll let the tests run and address any failures.

I think once those tests pass and I have ✔️ from @realbigsean and @ethDreamer then we're in a good place to merge. I'm currently running this on Prater and Ropsten.

@realbigsean realbigsean left a comment


🎉

@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jul 2, 2022
@paulhauner
Member Author

Thanks again everyone 🙏 Let's get this done!

bors r+

bors bot pushed a commit that referenced this pull request Jul 2, 2022
@bors

bors bot commented Jul 2, 2022

Build failed:

@paulhauner
Member Author

Hmm, let's try again. It might have been a race condition.

bors r+

bors bot pushed a commit that referenced this pull request Jul 2, 2022
@bors

bors bot commented Jul 2, 2022

Build failed:

@michaelsproul
Member

The integration test breakage is fixed by #3303, which I'll batch with this PR now

bors r+

bors bot pushed a commit that referenced this pull request Jul 3, 2022
@bors

bors bot commented Jul 3, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@michaelsproul
Member

bors r-

(for now)

@bors

bors bot commented Jul 3, 2022

Canceled.

@divagant-martian
Collaborator

bors r+

bors bot pushed a commit that referenced this pull request Jul 3, 2022
@michaelsproul
Member

bors r-

@bors

bors bot commented Jul 3, 2022

Canceled.

@michaelsproul
Member

bors r+

bors bot pushed a commit that referenced this pull request Jul 3, 2022
@bors bors bot changed the title Use async code when interacting with EL [Merged by Bors] - Use async code when interacting with EL Jul 3, 2022
@bors bors bot closed this Jul 3, 2022
```
    head_state: &BeaconState<T::EthSpec>,
    new_finalized_state_root: Hash256,
) -> Result<(), Error> {
    self.fork_choice.write().prune()?;
```
Member

here

bors bot pushed a commit that referenced this pull request Aug 2, 2022
## Issue Addressed

NA

## Proposed Changes

There was a regression in #3244 (released in v2.4.0) which stopped pruning fork choice (see [here](#3244 (comment))).

This would form a very slow memory leak, using ~100 MB per month. The release has been out for ~11 days, so users should not be seeing a dangerous increase in memory, *yet*.

Credits to @michaelsproul for noticing this 🎉 

## Additional Info

NA