SSV use lighthouse v5.0.0, proposal block miss. #5365

Closed

hwhe opened this issue Mar 7, 2024 · 7 comments
Labels: dvt (Distributed validator technology, e.g. SSV, Obol), HTTP-API, optimization (Something to make Lighthouse run more efficiently)

Comments

hwhe commented Mar 7, 2024

Description

My usage scenario is as follows:
Four SSV operators use the same beacon node (a Lighthouse client). After I upgraded Lighthouse from v4.4.1 to v5.0.0, the block miss rate increased.

I checked the code and found that a lock was added in #4925. The four `validator/blinded_blocks/{slot}` requests for creating blocks are serialized, so the time required for the four SSV operators to reach consensus increases and block production becomes slow.

```rust
        );
            (re_org_state.pre_state, re_org_state.state_root)
        }
        // Normal case: proposing a block atop the current head using the cache.
        else if let Some((_, cached_state)) = self
            .block_production_state
            .lock()
            .take()
            .filter(|(cached_block_root, _)| *cached_block_root == head_block_root)
        {
            (cached_state.pre_state, cached_state.state_root)
        }
        // Fall back to a direct read of the snapshot cache.
        else if let Some(pre_state) = self
            .snapshot_cache
            .try_read_for(BLOCK_PROCESSING_CACHE_LOCK_TIMEOUT)
            .and_then(|snapshot_cache| {
                snapshot_cache.get_state_for_block_production(head_block_root)
            })
        {
            warn!(
                self.log,
                "Block production cache miss";
                "message" => "falling back to snapshot cache clone",
                "slot" => slot
            );
            (pre_state.pre_state, pre_state.state_root)
        } else {
```
```rust
    /// State with complete tree hash cache, ready for block production.
    ///
    /// NB: We can delete this once we have tree-states.
    #[allow(clippy::type_complexity)]
    pub block_production_state: Arc<Mutex<Option<(Hash256, BlockProductionPreState<T::EthSpec>)>>>,
```
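
To illustrate the effect, here is a minimal standalone sketch (my own simplification, not Lighthouse code): a `Mutex<Option<...>>` consumed with `take()` only serves the first concurrent caller; every later request finds `None` and falls back to the slower snapshot-cache path, which matches the repeated "Block production cache miss" warnings in the v5.0.0 log below.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for BlockProductionPreState.
struct State;

fn main() {
    // Cache primed with one state, as after block import.
    let cache: Arc<Mutex<Option<State>>> = Arc::new(Mutex::new(Some(State)));
    let start = Instant::now();

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                // take() empties the cache for every request after the first.
                let cached = cache.lock().unwrap().take();
                match cached {
                    Some(_state) => println!("request {i}: cache hit at {:?}", start.elapsed()),
                    None => {
                        // Fallback path, standing in for the snapshot cache clone.
                        thread::sleep(Duration::from_millis(500));
                        println!("request {i}: cache miss at {:?}", start.elapsed());
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```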

The v4.4.1 log is shown below. Looking at the "Requesting blinded header from connected builder" lines, we can see that the four SSV operators' requests happen at basically the same time.

```
27:48.356 INFO Requesting blinded header from connected builder,
27:48.399 INFO Requesting blinded header from connected builder,
27:48.415 WARN Duplicate payload cached, this might indicate red
27:48.440 INFO Requesting blinded header from connected builder,
27:48.456 WARN Duplicate payload cached, this might indicate red
27:48.464 INFO Requesting blinded header from connected builder,
27:48.479 WARN Duplicate payload cached, this might indicate red
27:49.326 INFO Requested blinded execution payload parent_ha
27:49.326 INFO Received local and builder payloads parent_ha
27:49.326 INFO Relay block is more profitable than local block,
27:49.352 INFO Requested blinded execution payload parent_ha
27:49.352 INFO Received local and builder payloads parent_ha
27:49.352 INFO Relay block is more profitable than local block,
27:49.402 INFO Requested blinded execution payload parent_ha
27:49.402 INFO Received local and builder payloads parent_ha
27:49.402 INFO Relay block is more profitable than local block,
27:49.416 INFO Requested blinded execution payload parent_ha
27:49.416 INFO Received local and builder payloads parent_ha
27:49.416 INFO Relay block is more profitable than local block,
27:49.719 ERRO Block broadcast was delayed root: ***
27:49.719 ERRO Block broadcast was delayed root: ***
27:49.720 ERRO Block broadcast was delayed root: ***
27:49.822 ERRO Block broadcast was delayed root: ***
27:50.629 INFO New block received root: ***
27:52.583 INFO Builder successfully revealed payload parent_ha
27:52.583 INFO Successfully published a block to the builder net
27:52.603 WARN Error processing HTTP API request method: POST, path: /eth/v1/beacon/blinded_blocks, status: 202 Accepted, elapsed: 2.894625054s
```

The v5.0.0 log is shown below. We can see that the time differences between the four requests are large.

```
45:35.232 INFO Requesting blinded header from connected
45:35.940 WARN Block production cache miss
45:35.960 INFO Requesting blinded header from connected
45:35.964 WARN Duplicate payload cached, this might ind
45:36.191 INFO Requested blinded execution payload
45:36.192 INFO Received local and builder payloads
45:36.193 INFO Relay block is more profitable than loca
45:36.672 WARN Block production cache miss
45:36.692 INFO Requesting blinded header from connected
45:36.696 WARN Duplicate payload cached, this might ind
45:36.912 INFO Requested blinded execution payload
45:36.912 INFO Received local and builder payloads
45:36.915 INFO Relay block is more profitable than loca
45:37.514 WARN Block production cache miss
45:37.534 INFO Requesting blinded header from connected
45:37.538 WARN Duplicate payload cached, this might ind
45:37.663 INFO Requested blinded execution payload
45:37.663 INFO Received local and builder payloads
45:37.670 INFO Relay block is more profitable than loca
45:38.485 INFO Requested blinded execution payload
45:38.485 INFO Received local and builder payloads
45:38.495 INFO Relay block is more profitable than loca
45:39.877 ERRO Block was broadcast too late
45:39.878 ERRO Block was broadcast too late
45:39.878 ERRO Block was broadcast too late
45:40.860 WARN Block production cache miss
45:41.003 INFO Synced
45:41.474 ERRO Block was broadcast too late
45:45.109 WARN Builder failed to reveal payload
45:45.109 WARN Error processing HTTP API request
45:45.131 WARN Builder failed to reveal payload
45:45.131 WARN Error processing HTTP API request
45:45.168 WARN Builder failed to reveal payload
45:45.168 WARN Error processing HTTP API request
45:46.671 WARN Builder failed to reveal payload
45:46.671 WARN Error processing HTTP API request
45:48.388 INFO New block received
45:53.001 INFO Synced
```

The SSV block proposal flow is as follows:
```go
// executeDuty steps:
// 1) sign a partial randao sig and wait for 2f+1 partial sigs from peers
// 2) reconstruct randao and send GetBeaconBlock to BN
// 3) start consensus on duty + block data
// 4) Once consensus decides, sign partial block and broadcast
// 5) collect 2f+1 partial sigs, reconstruct and broadcast valid block sig to the BN
```

Because step 2 is slow, consensus is slow, and the block is ultimately missed.
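
For reference, the 2f+1 threshold in steps 1 and 5 works out as follows for a 4-operator committee (a minimal sketch of the counting rule only, not SSV code):

```rust
// Quorum rule for a BFT committee of n = 3f + 1 operators.
fn quorum(committee_size: usize) -> usize {
    let f = (committee_size - 1) / 3; // max tolerated faulty operators
    2 * f + 1
}

fn main() {
    // With 4 operators, f = 1, so 3 of the 4 must respond before
    // consensus can complete; a slow GetBeaconBlock on any of the
    // first 3 delays the whole duty.
    assert_eq!(quorum(4), 3);
}
```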

Version

version: Lighthouse/v5.0.0

Present Behaviour

The previous version, v4.4.1, was fine.

Expected Behaviour

Concurrent block production requests should not be slower than in v4.4.1. Please consider the SSV scenario.

Steps to resolve

I'm not sure; perhaps consider a read-write lock or another approach.

michaelsproul (Member) commented:

Huh, I didn't anticipate this case. It seems strange that:

  1. There are multiple SSV VCs connected to a single BN. This kind of defeats the point of SSV, doesn't it? i.e. independent operators.
  2. I'm surprised that this is how SSV comes to consensus on blocks. All of the blocks produced by the different key shards have the potential to be quite different, especially if using different clients. I suspect it must only actually use 1 of the produced blocks (the "best" by some metric)?
  3. I'm surprised that the .lock() call jams things up for so long. The first request to hit the lock should move the state out of the lock (a few ms max) and then yield it to the next waiting thread. Perhaps we are inadvertently holding the lock while copying from the snapshot cache here:

```rust
let pre_state = self
    .block_production_state
    .lock()
    .take()
    .and_then(|(cached_block_root, state)| {
        (cached_block_root == re_org_parent_block).then_some(state)
    })
    .or_else(|| {
        warn!(
            self.log,
            "Block production cache miss";
            "message" => "falling back to snapshot cache during re-org",
            "slot" => slot,
            "block_root" => ?re_org_parent_block
        );
        self.snapshot_cache
            .try_read_for(BLOCK_PROCESSING_CACHE_LOCK_TIMEOUT)
            .and_then(|snapshot_cache| {
                snapshot_cache.get_state_for_block_production(re_org_parent_block)
            })
    })
```

We can try tweaking that locking behaviour to be more explicit about when it drops the lock. Do you have an SSV testnet setup where we could test this change? (I can test locally, but would like to make sure it actually works holistically with SSV)
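
Roughly, the hazard looks like this (a simplified sketch using std::sync::Mutex rather than our parking_lot types, not the actual fix): the temporary MutexGuard returned by .lock() inside a method chain lives until the end of the whole statement, so the .or_else(...) fallback runs while the lock is still held. Taking the cached value out in its own statement drops the guard before the fallback runs.

```rust
use std::sync::Mutex;

struct Chain {
    block_production_state: Mutex<Option<u64>>,
}

impl Chain {
    fn slow_snapshot_clone(&self) -> Option<u64> {
        Some(42) // stand-in for the expensive snapshot cache clone
    }

    // Problematic pattern: the temporary guard from lock() lives until the
    // end of the whole return expression, so the fallback runs under the lock.
    fn pre_state_chained(&self) -> Option<u64> {
        self.block_production_state
            .lock()
            .unwrap()
            .take()
            .or_else(|| self.slow_snapshot_clone()) // lock still held here
    }

    // Explicit pattern: take the cached value out in its own statement so
    // the guard is dropped before the slow fallback runs.
    fn pre_state_scoped(&self) -> Option<u64> {
        let cached = self.block_production_state.lock().unwrap().take();
        cached.or_else(|| self.slow_snapshot_clone())
    }
}

fn main() {
    let chain = Chain { block_production_state: Mutex::new(Some(1)) };
    assert_eq!(chain.pre_state_chained(), Some(1));
    assert_eq!(chain.pre_state_scoped(), Some(42)); // cache now empty
}
```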

This optimisation will also be going away soon when we merge tree-states, which keeps more states in memory and can clone them in a matter of milliseconds (see #3206).

michaelsproul (Member) commented:

@hwhe I think I fixed it. Can you try this PR on your test setup?

hwhe (Author) commented Mar 7, 2024

Hi michaelsproul, thanks for your reply.
1. Actually, some of my test business is more about security than reliability, so my 4 operators run independently on 4 ECS instances. I only use SSV operators for their private-key sharding capability. In addition, some test scenarios in our business require this.
2. Yes. Although every SSV operator obtains a block, only one operator is selected as leader by a random algorithm, and they actually use only the leader's block for partial signing. So SSV can reach consensus once at least 3 of the requests have finished.
3. Yes. SSV supports the Ethereum Holesky network, so we can test it. Meanwhile, I also hope you can test it on the Holesky network.
Thanks


lilkk-jerry commented Mar 7, 2024

@michaelsproul Hey, I found the same issue (when 2 SSV operators are temporarily connected to 1 node). Are you going to release a hotfix version before Dencun?

michaelsproul (Member) commented:

@lilkk-jerry Yeah, there will be a v5.1.0 next Monday with some fixes. We can probably include this.

Can you test it with SSV for me?

lilkk-jerry commented:

> @lilkk-jerry Yeah, there will be a v5.1.0 next Monday with some fixes. We can probably include this.
>
> Can you test it with SSV for me?

I would love to, and I do have an SSV operator for Goerli, but it's hard to get a block proposal duty because I only run 1 validator on SSV.

Based on past experience, the SSV operator works well with Lighthouse v5.0.0 for other kinds of duties (attestation, aggregation, and so on), so I think we can simply make concurrent getBlindedBlock requests and see whether the behaviour is the same as v4.4.1.
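
For example, a minimal sketch of such a test (the BN address, slot, and randao_reveal below are placeholders; uses the tokio and reqwest crates):

```rust
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Placeholders: adjust the BN address, slot, and randao_reveal.
    let slot = 1_000_000u64;
    let randao = "0x...";
    let url = format!(
        "http://localhost:5052/eth/v1/validator/blinded_blocks/{slot}?randao_reveal={randao}"
    );

    let start = Instant::now();
    // Fire 4 requests concurrently, mimicking 4 SSV operators sharing one BN,
    // and compare the response spread against the v4.4.1 timings.
    let tasks: Vec<_> = (0..4)
        .map(|i| {
            let url = url.clone();
            tokio::spawn(async move {
                let status = reqwest::get(url).await.map(|r| r.status());
                println!("request {i}: {status:?} after {:?}", start.elapsed());
            })
        })
        .collect();

    for t in tasks {
        t.await.expect("request task panicked");
    }
}
```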

jimmygchen (Member) commented:

Fixed in #5368 and will be in v5.1.0
