SSV use lighthouse v5.0.0, proposal block miss. #5365

Closed

hwhe opened this issue Mar 7, 2024 · 7 comments
Labels: dvt (Distributed validator technology, e.g. SSV, Obol), HTTP-API, optimization (Something to make Lighthouse run more efficiently)

Comments

hwhe commented Mar 7, 2024

Description

My usage scenario is as follows:
Four SSV operators use the same beacon node (a Lighthouse client). After I upgraded Lighthouse from v4.4.1 to v5.0.0, the block miss rate increased.

I checked the code and found that a lock was added in #4925. The four `validator/blinded_blocks/{slot}` requests for creating blocks are serialized, so the time required for the four SSV operators to reach consensus increases and block production becomes slow.

```rust
        );
            (re_org_state.pre_state, re_org_state.state_root)
        }
        // Normal case: proposing a block atop the current head using the cache.
        else if let Some((_, cached_state)) = self
            .block_production_state
            .lock()
            .take()
            .filter(|(cached_block_root, _)| *cached_block_root == head_block_root)
        {
            (cached_state.pre_state, cached_state.state_root)
        }
        // Fall back to a direct read of the snapshot cache.
        else if let Some(pre_state) = self
            .snapshot_cache
            .try_read_for(BLOCK_PROCESSING_CACHE_LOCK_TIMEOUT)
            .and_then(|snapshot_cache| {
                snapshot_cache.get_state_for_block_production(head_block_root)
            })
        {
            warn!(
                self.log,
                "Block production cache miss";
                "message" => "falling back to snapshot cache clone",
                "slot" => slot
            );
            (pre_state.pre_state, pre_state.state_root)
        } else {
```
```rust
    /// State with complete tree hash cache, ready for block production.
    ///
    /// NB: We can delete this once we have tree-states.
    #[allow(clippy::type_complexity)]
    pub block_production_state: Arc<Mutex<Option<(Hash256, BlockProductionPreState<T::EthSpec>)>>>,
```
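
To illustrate the effect, here is a minimal standalone sketch (my own simplification, not Lighthouse code): a `Mutex<Option<...>>` consumed with `take()` only serves the first concurrent caller; every later request finds `None` and falls back to the slower snapshot-cache path, which matches the repeated "Block production cache miss" warnings in the v5.0.0 log below.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for BlockProductionPreState.
struct State;

fn main() {
    // Cache primed with one state, as after block import.
    let cache: Arc<Mutex<Option<State>>> = Arc::new(Mutex::new(Some(State)));
    let start = Instant::now();

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                // take() empties the cache for every request after the first.
                let cached = cache.lock().unwrap().take();
                match cached {
                    Some(_state) => println!("request {i}: cache hit at {:?}", start.elapsed()),
                    None => {
                        // Fallback path, standing in for the snapshot cache clone.
                        thread::sleep(Duration::from_millis(500));
                        println!("request {i}: cache miss at {:?}", start.elapsed());
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```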

The v4.4.1 log is shown below. Looking at the "Requesting blinded header from connected builder" lines, we can see that the four SSV operators' requests happen at basically the same time.

```
27:48.356 INFO Requesting blinded header from connected builder,
27:48.399 INFO Requesting blinded header from connected builder,
27:48.415 WARN Duplicate payload cached, this might indicate red
27:48.440 INFO Requesting blinded header from connected builder,
27:48.456 WARN Duplicate payload cached, this might indicate red
27:48.464 INFO Requesting blinded header from connected builder,
27:48.479 WARN Duplicate payload cached, this might indicate red
27:49.326 INFO Requested blinded execution payload parent_ha
27:49.326 INFO Received local and builder payloads parent_ha
27:49.326 INFO Relay block is more profitable than local block,
27:49.352 INFO Requested blinded execution payload parent_ha
27:49.352 INFO Received local and builder payloads parent_ha
27:49.352 INFO Relay block is more profitable than local block,
27:49.402 INFO Requested blinded execution payload parent_ha
27:49.402 INFO Received local and builder payloads parent_ha
27:49.402 INFO Relay block is more profitable than local block,
27:49.416 INFO Requested blinded execution payload parent_ha
27:49.416 INFO Received local and builder payloads parent_ha
27:49.416 INFO Relay block is more profitable than local block,
27:49.719 ERRO Block broadcast was delayed root: ***
27:49.719 ERRO Block broadcast was delayed root: ***
27:49.720 ERRO Block broadcast was delayed root: ***
27:49.822 ERRO Block broadcast was delayed root: ***
27:50.629 INFO New block received root: ***
27:52.583 INFO Builder successfully revealed payload parent_ha
27:52.583 INFO Successfully published a block to the builder net
27:52.603 WARN Error processing HTTP API request method: POST, path: /eth/v1/beacon/blinded_blocks, status: 202 Accepted, elapsed: 2.894625054s
```

The v5.0.0 log is shown below. We can see that the time differences between the four requests are large.

```
45:35.232 INFO Requesting blinded header from connected
45:35.940 WARN Block production cache miss
45:35.960 INFO Requesting blinded header from connected
45:35.964 WARN Duplicate payload cached, this might ind
45:36.191 INFO Requested blinded execution payload
45:36.192 INFO Received local and builder payloads
45:36.193 INFO Relay block is more profitable than loca
45:36.672 WARN Block production cache miss
45:36.692 INFO Requesting blinded header from connected
45:36.696 WARN Duplicate payload cached, this might ind
45:36.912 INFO Requested blinded execution payload
45:36.912 INFO Received local and builder payloads
45:36.915 INFO Relay block is more profitable than loca
45:37.514 WARN Block production cache miss
45:37.534 INFO Requesting blinded header from connected
45:37.538 WARN Duplicate payload cached, this might ind
45:37.663 INFO Requested blinded execution payload
45:37.663 INFO Received local and builder payloads
45:37.670 INFO Relay block is more profitable than loca
45:38.485 INFO Requested blinded execution payload
45:38.485 INFO Received local and builder payloads
45:38.495 INFO Relay block is more profitable than loca
45:39.877 ERRO Block was broadcast too late
45:39.878 ERRO Block was broadcast too late
45:39.878 ERRO Block was broadcast too late
45:40.860 WARN Block production cache miss
45:41.003 INFO Synced
45:41.474 ERRO Block was broadcast too late
45:45.109 WARN Builder failed to reveal payload
45:45.109 WARN Error processing HTTP API request
45:45.131 WARN Builder failed to reveal payload
45:45.131 WARN Error processing HTTP API request
45:45.168 WARN Builder failed to reveal payload
45:45.168 WARN Error processing HTTP API request
45:46.671 WARN Builder failed to reveal payload
45:46.671 WARN Error processing HTTP API request
45:48.388 INFO New block received
45:53.001 INFO Synced
```

The SSV block proposal flow is as follows:
```go
// executeDuty steps:
// 1) sign a partial randao sig and wait for 2f+1 partial sigs from peers
// 2) reconstruct randao and send GetBeaconBlock to BN
// 3) start consensus on duty + block data
// 4) Once consensus decides, sign partial block and broadcast
// 5) collect 2f+1 partial sigs, reconstruct and broadcast valid block sig to the BN
```

Because step 2 is slow, consensus is slow, and the block is ultimately missed.
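
For reference, the 2f+1 threshold in steps 1 and 5 works out as follows for a 4-operator committee (a minimal sketch of the counting rule only, not SSV code):

```rust
// Quorum rule for a BFT committee of n = 3f + 1 operators.
fn quorum(committee_size: usize) -> usize {
    let f = (committee_size - 1) / 3; // max tolerated faulty operators
    2 * f + 1
}

fn main() {
    // With 4 operators, f = 1, so 3 of the 4 must respond before
    // consensus can complete; a slow GetBeaconBlock on any of the
    // first 3 delays the whole duty.
    assert_eq!(quorum(4), 3);
}
```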

Version

version: Lighthouse/v5.0.0

Present Behaviour

The previous version, v4.4.1, was fine.

Expected Behaviour

Concurrent block production requests should not be slower than in v4.4.1. Please consider the SSV scenario.

Steps to resolve

I'm not sure; perhaps consider a read-write lock or another approach.

michaelsproul (Member) commented:

Huh, I didn't anticipate this case. It seems strange that:

  1. There are multiple SSV VCs connected to a single BN. This kind of defeats the point of SSV, doesn't it? i.e. independent operators.
  2. I'm surprised that this is how SSV comes to consensus on blocks. All of the blocks produced by the different key shards have the potential to be quite different, especially if using different clients. I suspect it must only actually use 1 of the produced blocks (the "best" by some metric)?
  3. I'm surprised that the .lock() call jams things up for so long. The first request to hit the lock should move the state out of the lock (a few ms max) and then yield it to the next waiting thread. Perhaps we are inadvertently holding the lock while copying from the snapshot cache here:

```rust
let pre_state = self
    .block_production_state
    .lock()
    .take()
    .and_then(|(cached_block_root, state)| {
        (cached_block_root == re_org_parent_block).then_some(state)
    })
    .or_else(|| {
        warn!(
            self.log,
            "Block production cache miss";
            "message" => "falling back to snapshot cache during re-org",
            "slot" => slot,
            "block_root" => ?re_org_parent_block
        );
        self.snapshot_cache
            .try_read_for(BLOCK_PROCESSING_CACHE_LOCK_TIMEOUT)
            .and_then(|snapshot_cache| {
                snapshot_cache.get_state_for_block_production(re_org_parent_block)
            })
    })
```

We can try tweaking that locking behaviour to be more explicit about when it drops the lock. Do you have an SSV testnet setup where we could test this change? (I can test locally, but would like to make sure it actually works holistically with SSV)
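
Roughly, the hazard looks like this (a simplified sketch using std::sync::Mutex rather than our parking_lot types, not the actual fix): the temporary MutexGuard returned by .lock() inside a method chain lives until the end of the whole statement, so the .or_else(...) fallback runs while the lock is still held. Taking the cached value out in its own statement drops the guard before the fallback runs.

```rust
use std::sync::Mutex;

struct Chain {
    block_production_state: Mutex<Option<u64>>,
}

impl Chain {
    fn slow_snapshot_clone(&self) -> Option<u64> {
        Some(42) // stand-in for the expensive snapshot cache clone
    }

    // Problematic pattern: the temporary guard from lock() lives until the
    // end of the whole return expression, so the fallback runs under the lock.
    fn pre_state_chained(&self) -> Option<u64> {
        self.block_production_state
            .lock()
            .unwrap()
            .take()
            .or_else(|| self.slow_snapshot_clone()) // lock still held here
    }

    // Explicit pattern: take the cached value out in its own statement so
    // the guard is dropped before the slow fallback runs.
    fn pre_state_scoped(&self) -> Option<u64> {
        let cached = self.block_production_state.lock().unwrap().take();
        cached.or_else(|| self.slow_snapshot_clone())
    }
}

fn main() {
    let chain = Chain { block_production_state: Mutex::new(Some(1)) };
    assert_eq!(chain.pre_state_chained(), Some(1));
    assert_eq!(chain.pre_state_scoped(), Some(42)); // cache now empty
}
```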

This optimisation will also be going away soon when we merge tree-states, which keeps more states in memory and can clone them in a matter of milliseconds (see #3206).

michaelsproul (Member) commented:

@hwhe I think I fixed it. Can you try this PR on your test setup?

hwhe (Author) commented Mar 7, 2024

Hi michaelsproul, thanks for your reply.
1. Actually, some of my test business is more about security than reliability, so my 4 operators run independently on 4 ECS instances. I only use SSV operators for their private-key sharding capability. In addition, some test scenarios in our business require this.
2. Yes. Although every SSV operator obtains a block, only one operator is selected as leader by a random algorithm, and they actually use only the leader's block for partial signing. So SSV can reach consensus once at least 3 of the requests have finished.
3. Yes. SSV supports the Ethereum Holesky network, so we can test it. Meanwhile, I also hope you can test it on the Holesky network.
Thanks


lilkk-jerry commented Mar 7, 2024

@michaelsproul Hey, I found the same issue (when 2 SSV operators are temporarily connected to 1 node). Are you going to release a hotfix version before Dencun?

michaelsproul (Member) commented:

@lilkk-jerry Yeah, there will be a v5.1.0 next Monday with some fixes. We can probably include this.

Can you test it with SSV for me?

lilkk-jerry commented:

> @lilkk-jerry Yeah, there will be a v5.1.0 next Monday with some fixes. We can probably include this.
>
> Can you test it with SSV for me?

I would love to, and I do have an SSV operator for Goerli, but it's hard to get a block proposal duty because I only run 1 validator on SSV.

Based on past experience, the SSV operator works well with Lighthouse v5.0.0 for other kinds of duties (attestation, aggregation, and so on), so I think we can simply make concurrent getBlindedBlock requests and see whether the behaviour is the same as v4.4.1.
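
For example, a minimal sketch of such a test (the BN address, slot, and randao_reveal below are placeholders; uses the tokio and reqwest crates):

```rust
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Placeholders: adjust the BN address, slot, and randao_reveal.
    let slot = 1_000_000u64;
    let randao = "0x...";
    let url = format!(
        "http://localhost:5052/eth/v1/validator/blinded_blocks/{slot}?randao_reveal={randao}"
    );

    let start = Instant::now();
    // Fire 4 requests concurrently, mimicking 4 SSV operators sharing one BN,
    // and compare the response spread against the v4.4.1 timings.
    let tasks: Vec<_> = (0..4)
        .map(|i| {
            let url = url.clone();
            tokio::spawn(async move {
                let status = reqwest::get(url).await.map(|r| r.status());
                println!("request {i}: {status:?} after {:?}", start.elapsed());
            })
        })
        .collect();

    for t in tasks {
        t.await.expect("request task panicked");
    }
}
```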

jimmygchen (Member) commented:

Fixed in #5368 and will be in v5.1.0
