fix obtaining deposits after connection loss #3943

Merged: 4 commits into unstable from dev/etan/el-reconnect on Aug 9, 2022

Conversation

etan-status (Contributor)

When an error occurs during Eth1 deposit import, the already imported
blocks are kept while the connection to the EL is re-established.
However, the corresponding merkleizer is not persisted, so any future
deposits are no longer imported properly. This is quite common when
syncing a fresh Nimbus instance against an already-synced Geth EL.
Fixed by persisting the head merkleizer together with the blocks.
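Below is a minimal sketch of the failure mode, using toy types rather than the actual Nimbus `Eth1Chain` or deposit merkleizer: the merkleizer is incremental state built from every imported deposit, so if the imported blocks survive a reconnect while the merkleizer does not, the two diverge and every later deposit continues from the wrong position.

```nim
# Toy illustration only: `ToyMerkleizer`/`ToyBlock` are made-up stand-ins,
# with a plain chunk counter playing the role of the merkleizer's state.
type
  ToyMerkleizer = object
    chunkCount: int          # stands in for the incremental merkleizer state

  ToyBlock = object
    number: int
    depositCount: int

proc addBlock(m: var ToyMerkleizer, blk: ToyBlock) =
  # Fold the block's deposits into the running accumulator.
  m.chunkCount += blk.depositCount

var
  blocks: seq[ToyBlock]      # the already imported blocks; these are kept
  headMerkleizer: ToyMerkleizer

for n in 1 .. 3:
  let blk = ToyBlock(number: n, depositCount: 2)
  blocks.add(blk)
  headMerkleizer.addBlock(blk)

# Connection loss: before this PR, the blocks were kept but the head
# merkleizer was not persisted, so deposit import continued from a stale one.
let staleMerkleizer = ToyMerkleizer()
doAssert staleMerkleizer.chunkCount != headMerkleizer.chunkCount

# The fix: persist the head merkleizer together with the blocks, so the next
# imported deposit continues from chunk 6 rather than chunk 0.
doAssert headMerkleizer.chunkCount == 6
```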

zah (Member) commented Aug 8, 2022

The head merkleizer is not kept around on purpose, because the web3 providers are known to miss certain deposit events from time to time, which causes the merkleizer to get infected with invalid values. Such invalid values are eventually discovered, the monitor is restarted, and we continue from the known-safe finalized state.

etan-status (Contributor, Author)

Yes, that case still works. But in the case where there are no invalid values, there are still regular disconnects of the WebSocket connection every couple of minutes, and because there is nothing invalid in the chain, the sync starts from the previous head (instead of from the finalized head). In this case, the old head merkleizer still needs to be available.

Memory consumption should be similar, whether it is living in a newClone of a persistent background task, or whether it is retained across follow-up invocations of the same task as part of Eth1Chain.

etan-status (Contributor, Author) commented Aug 8, 2022

The scenario you are describing goes through this flow:

  if m.latestEth1Block.isSome and m.depositsChain.blocks.len > 0:
    let needsReset = m.depositsChain.hasConsensusViolation or (block:
      let
        lastKnownBlock = m.depositsChain.blocks.peekLast
        matchingBlockAtNewProvider = awaitWithRetries(
          m.dataProvider.getBlockByNumber lastKnownBlock.number)

      lastKnownBlock.voteData.block_hash.asBlockHash != matchingBlockAtNewProvider.hash)

    if needsReset:
      m.depositsChain.clear()
      m.latestEth1Block = none(FullBlockId)

and then, because m.depositsChain.blocks.len == 0 due to the reset:

  if shouldProcessDeposits and m.depositsChain.blocks.len == 0:
    let startBlock = awaitWithRetries(
      m.dataProvider.getBlockByHash(
        m.depositsChain.finalizedBlockHash.asBlockHash))

    m.depositsChain.addBlock Eth1Block(
      number: Eth1BlockNumber startBlock.number,
      timestamp: Eth1BlockTimestamp startBlock.timestamp,
      voteData: eth1DataFromMerkleizer(
        m.depositsChain.finalizedBlockHash,
        m.depositsChain.finalizedDepositsMerkleizer))

    eth1SyncedTo = Eth1BlockNumber startBlock.number

    eth1_synced_head.set eth1SyncedTo.toGaugeValue
    eth1_finalized_head.set eth1SyncedTo.toGaugeValue
    eth1_finalized_deposits.set(
      m.depositsChain.finalizedDepositsMerkleizer.getChunkCount.toGaugeValue)

    m.depositsChain.headMerkleizer = copy m.finalizedDepositsMerkleizer

    debug "Starting Eth1 syncing", `from` = shortLog(m.depositsChain.blocks[0])

The fix is for the case where needsReset == false and blocks.len > 0: there, the head merkleizer was lost, so sync got stuck.
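Here is a hedged sketch of the two paths, using toy types and a hypothetical `reconnect` proc rather than the real eth1_monitor code: when `needsReset` is true everything is rebuilt from the finalized snapshot, as in the quoted code; when it is false the retained blocks must keep their matching head merkleizer, which is what this PR adds.

```nim
# Toy illustration only; `ToyChain`, `ToyMerkleizer` and `reconnect` are
# made up for this sketch and are not the Nimbus types or procs.
type
  ToyMerkleizer = object
    chunkCount: int

  ToyChain = object
    blocks: seq[int]               # imported block numbers
    headMerkleizer: ToyMerkleizer  # must always match `blocks`

proc reconnect(chain: var ToyChain, needsReset: bool,
               finalizedMerkleizer: ToyMerkleizer) =
  if needsReset:
    # Consensus violation or a mismatching block at the new provider:
    # drop everything and restart from the finalized snapshot.
    chain.blocks.setLen(0)
    chain.headMerkleizer = finalizedMerkleizer
  else:
    # Plain disconnect: keep the blocks *and* the head merkleizer.
    # Before the fix, the merkleizer was effectively lost on this path,
    # leaving it out of sync with `chain.blocks`.
    discard

when isMainModule:
  var chain = ToyChain(blocks: @[10, 11, 12],
                       headMerkleizer: ToyMerkleizer(chunkCount: 7))
  chain.reconnect(needsReset = false,
                  finalizedMerkleizer = ToyMerkleizer(chunkCount: 3))
  doAssert chain.blocks.len == 3 and chain.headMerkleizer.chunkCount == 7
```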

github-actions bot commented Aug 8, 2022

Unit Test Results

12 files ±0, 860 suites ±0, 1h 2m 48s ⏱️ +1m 23s
1 911 tests ±0: 1 764 passed ✔️ ±0, 147 skipped 💤 ±0, 0 failed ±0
10 341 runs ±0: 10 151 passed ✔️ ±0, 190 skipped 💤 ±0, 0 failed ±0

Results for commit 5ee18eb (comparison against base commit 250f7b4).

♻️ This comment has been updated with latest results.

zah (Member) commented Aug 9, 2022

Alright, thank you for the explanation.

It's also curious why your WebSocket connection was dying so frequently. Have you tried to look into the root cause behind this?

zah merged commit 4ef621f into unstable on Aug 9, 2022
zah deleted the dev/etan/el-reconnect branch on August 9, 2022 at 21:32
zah added a commit that referenced this pull request on Aug 9, 2022
#3944

The use of nested `awaitWithRetries` calls would have
resulted in an unexpected number of retries (3x3).
We now use a regular `await` in the outer layer to avoid the problem.

#3943

The new code has an invariant that the `headMerkleizer` field in
the `Eth1Chain` is always kept in sync with the blocks stored in
the chain.

This invariant is now better enforced by doing the necessary merkleizer
updates in the `Eth1Chain.addBlock` function.
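To illustrate the 3x3 retry multiplication mentioned in the commit message above, here is a hedged sketch with a hypothetical `withRetries` template standing in for `awaitWithRetries` (the real helper is async; retries and error handling are simplified here):

```nim
# Hypothetical helper for illustration; not the Nimbus `awaitWithRetries`.
var attempts = 0

template withRetries(maxRetries: int, body: untyped) =
  var lastError: ref CatchableError
  var succeeded = false
  for _ in 1 .. maxRetries:
    try:
      body
      succeeded = true
      break
    except CatchableError as e:
      lastError = e
  if not succeeded:
    raise lastError

proc flakyCall() =
  # A call that keeps failing, e.g. while the EL connection is down.
  inc attempts
  raise newException(IOError, "connection lost")

try:
  withRetries(3):      # outer layer (replaced by a plain `await` in #3944)
    withRetries(3):    # inner layer
      flakyCall()
except CatchableError:
  discard

# The outer wrapper retries a body that already retries internally,
# so the persistently failing call is attempted 3 x 3 = 9 times.
doAssert attempts == 9
```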
zah mentioned this pull request on Aug 9, 2022
zah added a commit that referenced this pull request on Aug 10, 2022
#3944

The use of nested `awaitWithRetries` calls would have
resulted in an unexpected number of retries (3x3).
We now use a regular `await` in the outer layer to avoid the problem.

#3943

The new code has an invariant that the `headMerkleizer` field in
the `Eth1Chain` is always kept in sync with the blocks stored in
the chain.

This invariant is now better enforced by doing the necessary merkleizer updates
in the `Eth1Chain.addBlock`, `Eth1Chain.init` and `Eth1Chain.reset` functions.
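As a hedged sketch of that invariant (toy types again, not the real `Eth1Chain`): every operation that changes the stored blocks, whether adding a block, initialising the chain or resetting it, updates the head-merkleizer stand-in in the same place, so the two cannot drift apart.

```nim
# Toy illustration only; the real Eth1Chain keeps an actual deposit merkleizer.
import std/deques

type
  ToyBlock = object
    number: uint64
    depositCount: int

  ToyChain = object
    blocks: Deque[ToyBlock]
    headChunkCount: int      # stands in for the head merkleizer's state

proc init(T: type ToyChain, finalizedChunkCount: int): T =
  # Start from the finalized snapshot: no blocks, merkleizer at the snapshot.
  T(blocks: initDeque[ToyBlock](), headChunkCount: finalizedChunkCount)

proc addBlock(chain: var ToyChain, blk: ToyBlock) =
  # Store the block and advance the merkleizer in the same step.
  chain.blocks.addLast(blk)
  chain.headChunkCount += blk.depositCount

proc reset(chain: var ToyChain, finalizedChunkCount: int) =
  # Dropping the blocks also rewinds the merkleizer to the finalized snapshot.
  chain.blocks.clear()
  chain.headChunkCount = finalizedChunkCount

when isMainModule:
  var chain = ToyChain.init(finalizedChunkCount = 5)
  chain.addBlock(ToyBlock(number: 100, depositCount: 2))
  doAssert chain.headChunkCount == 7
  chain.reset(finalizedChunkCount = 5)
  doAssert chain.blocks.len == 0 and chain.headChunkCount == 5
```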