fix obtaining deposits after connection loss #3943

Merged: 4 commits into unstable from dev/etan/el-reconnect on Aug 9, 2022

Conversation

etan-status (Contributor)

When an error occurs during Eth1 deposit import, the already imported
blocks are kept while the connection to the EL is re-established.
However, the corresponding merkleizer is not persisted, so any future
deposits are no longer imported properly. This is quite common when
syncing a fresh Nimbus instance against an already-synced Geth EL.
Fixed by persisting the head merkleizer together with the blocks.
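Below is a minimal sketch of the failure mode, using toy types rather than the actual Nimbus `Eth1Chain` or deposit merkleizer: the merkleizer is incremental state built from every imported deposit, so if the imported blocks survive a reconnect while the merkleizer does not, the two diverge and every later deposit continues from the wrong position.

```nim
# Toy illustration only: `ToyMerkleizer`/`ToyBlock` are made-up stand-ins,
# with a plain chunk counter playing the role of the merkleizer's state.
type
  ToyMerkleizer = object
    chunkCount: int          # stands in for the incremental merkleizer state

  ToyBlock = object
    number: int
    depositCount: int

proc addBlock(m: var ToyMerkleizer, blk: ToyBlock) =
  # Fold the block's deposits into the running accumulator.
  m.chunkCount += blk.depositCount

var
  blocks: seq[ToyBlock]      # the already imported blocks; these are kept
  headMerkleizer: ToyMerkleizer

for n in 1 .. 3:
  let blk = ToyBlock(number: n, depositCount: 2)
  blocks.add(blk)
  headMerkleizer.addBlock(blk)

# Connection loss: before this PR, the blocks were kept but the head
# merkleizer was not persisted, so deposit import continued from a stale one.
let staleMerkleizer = ToyMerkleizer()
doAssert staleMerkleizer.chunkCount != headMerkleizer.chunkCount

# The fix: persist the head merkleizer together with the blocks, so the next
# imported deposit continues from chunk 6 rather than chunk 0.
doAssert headMerkleizer.chunkCount == 6
```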

zah (Member) commented Aug 8, 2022

The head merkleizer is not kept around on purpose, because the web3 providers are known to miss certain deposit events from time to time, which causes the merkleizer to get infected with invalid values. Such invalid values are eventually discovered, the monitor is restarted, and we continue from the known-safe finalized state.

etan-status (Contributor, Author)

Yes, that case still works. But in the case where there are no invalid values, there are still regular disconnects of the WebSocket connection every couple of minutes, and because there is nothing invalid in the chain, the sync starts from the previous head (instead of from the finalized head). In this case, the old head merkleizer still needs to be available.

Memory consumption should be similar, whether it is living in a newClone of a persistent background task, or whether it is retained across follow-up invocations of the same task as part of Eth1Chain.

etan-status (Contributor, Author) commented Aug 8, 2022

The scenario you are describing goes through this flow:

  if m.latestEth1Block.isSome and m.depositsChain.blocks.len > 0:
    let needsReset = m.depositsChain.hasConsensusViolation or (block:
      let
        lastKnownBlock = m.depositsChain.blocks.peekLast
        matchingBlockAtNewProvider = awaitWithRetries(
          m.dataProvider.getBlockByNumber lastKnownBlock.number)

      lastKnownBlock.voteData.block_hash.asBlockHash != matchingBlockAtNewProvider.hash)

    if needsReset:
      m.depositsChain.clear()
      m.latestEth1Block = none(FullBlockId)

and then, because m.depositsChain.blocks.len == 0 due to the reset:

  if shouldProcessDeposits and m.depositsChain.blocks.len == 0:
    let startBlock = awaitWithRetries(
      m.dataProvider.getBlockByHash(
        m.depositsChain.finalizedBlockHash.asBlockHash))

    m.depositsChain.addBlock Eth1Block(
      number: Eth1BlockNumber startBlock.number,
      timestamp: Eth1BlockTimestamp startBlock.timestamp,
      voteData: eth1DataFromMerkleizer(
        m.depositsChain.finalizedBlockHash,
        m.depositsChain.finalizedDepositsMerkleizer))

    eth1SyncedTo = Eth1BlockNumber startBlock.number

    eth1_synced_head.set eth1SyncedTo.toGaugeValue
    eth1_finalized_head.set eth1SyncedTo.toGaugeValue
    eth1_finalized_deposits.set(
      m.depositsChain.finalizedDepositsMerkleizer.getChunkCount.toGaugeValue)

    m.depositsChain.headMerkleizer = copy m.finalizedDepositsMerkleizer

    debug "Starting Eth1 syncing", `from` = shortLog(m.depositsChain.blocks[0])

The fix is for the case where needsReset == false and blocks.len > 0: there, the head merkleizer was lost, so sync got stuck.
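Here is a hedged sketch of the two paths, using toy types and a hypothetical `reconnect` proc rather than the real eth1_monitor code: when `needsReset` is true everything is rebuilt from the finalized snapshot, as in the quoted code; when it is false the retained blocks must keep their matching head merkleizer, which is what this PR adds.

```nim
# Toy illustration only; `ToyChain`, `ToyMerkleizer` and `reconnect` are
# made up for this sketch and are not the Nimbus types or procs.
type
  ToyMerkleizer = object
    chunkCount: int

  ToyChain = object
    blocks: seq[int]               # imported block numbers
    headMerkleizer: ToyMerkleizer  # must always match `blocks`

proc reconnect(chain: var ToyChain, needsReset: bool,
               finalizedMerkleizer: ToyMerkleizer) =
  if needsReset:
    # Consensus violation or a mismatching block at the new provider:
    # drop everything and restart from the finalized snapshot.
    chain.blocks.setLen(0)
    chain.headMerkleizer = finalizedMerkleizer
  else:
    # Plain disconnect: keep the blocks *and* the head merkleizer.
    # Before the fix, the merkleizer was effectively lost on this path,
    # leaving it out of sync with `chain.blocks`.
    discard

when isMainModule:
  var chain = ToyChain(blocks: @[10, 11, 12],
                       headMerkleizer: ToyMerkleizer(chunkCount: 7))
  chain.reconnect(needsReset = false,
                  finalizedMerkleizer = ToyMerkleizer(chunkCount: 3))
  doAssert chain.blocks.len == 3 and chain.headMerkleizer.chunkCount == 7
```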

github-actions bot commented Aug 8, 2022

Unit Test Results

12 files ±0, 860 suites ±0, 1h 2m 48s ⏱️ +1m 23s
1 911 tests ±0: 1 764 passed ✔️ ±0, 147 skipped 💤 ±0, 0 failed ±0
10 341 runs ±0: 10 151 passed ✔️ ±0, 190 skipped 💤 ±0, 0 failed ±0

Results for commit 5ee18eb (comparison against base commit 250f7b4).

♻️ This comment has been updated with latest results.

zah (Member) commented Aug 9, 2022

Alright, thank you for the explanation.

It's also curious why your WebSocket connection was dying so frequently. Have you tried to look into the root cause behind this?

zah merged commit 4ef621f into unstable on Aug 9, 2022
zah deleted the dev/etan/el-reconnect branch on August 9, 2022 at 21:32
zah added a commit that referenced this pull request on Aug 9, 2022
#3944

The use of nested `awaitWithRetries` calls would have
resulted in an unexpected number of retries (3x3).
We now use a regular `await` in the outer layer to avoid the problem.

#3943

The new code has an invariant that the `headMerkleizer` field in
the `Eth1Chain` is always kept in sync with the blocks stored in
the chain.

This invariant is now better enforced by doing the necessary merkleizer
updates in the `Eth1Chain.addBlock` function.
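To illustrate the 3x3 retry multiplication mentioned in the commit message above, here is a hedged sketch with a hypothetical `withRetries` template standing in for `awaitWithRetries` (the real helper is async; retries and error handling are simplified here):

```nim
# Hypothetical helper for illustration; not the Nimbus `awaitWithRetries`.
var attempts = 0

template withRetries(maxRetries: int, body: untyped) =
  var lastError: ref CatchableError
  var succeeded = false
  for _ in 1 .. maxRetries:
    try:
      body
      succeeded = true
      break
    except CatchableError as e:
      lastError = e
  if not succeeded:
    raise lastError

proc flakyCall() =
  # A call that keeps failing, e.g. while the EL connection is down.
  inc attempts
  raise newException(IOError, "connection lost")

try:
  withRetries(3):      # outer layer (replaced by a plain `await` in #3944)
    withRetries(3):    # inner layer
      flakyCall()
except CatchableError:
  discard

# The outer wrapper retries a body that already retries internally,
# so the persistently failing call is attempted 3 x 3 = 9 times.
doAssert attempts == 9
```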
zah mentioned this pull request on Aug 9, 2022
zah added a commit that referenced this pull request on Aug 10, 2022
#3944

The use of nested `awaitWithRetries` calls would have
resulted in an unexpected number of retries (3x3).
We now use a regular `await` in the outer layer to avoid the problem.

#3943

The new code has an invariant that the `headMerkleizer` field in
the `Eth1Chain` is always kept in sync with the blocks stored in
the chain.

This invariant is now better enforced by doing the necessary merkleizer updates
in the `Eth1Chain.addBlock`, `Eth1Chain.init` and `Eth1Chain.reset` functions.
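As a hedged sketch of that invariant (toy types again, not the real `Eth1Chain`): every operation that changes the stored blocks, whether adding a block, initialising the chain or resetting it, updates the head-merkleizer stand-in in the same place, so the two cannot drift apart.

```nim
# Toy illustration only; the real Eth1Chain keeps an actual deposit merkleizer.
import std/deques

type
  ToyBlock = object
    number: uint64
    depositCount: int

  ToyChain = object
    blocks: Deque[ToyBlock]
    headChunkCount: int      # stands in for the head merkleizer's state

proc init(T: type ToyChain, finalizedChunkCount: int): T =
  # Start from the finalized snapshot: no blocks, merkleizer at the snapshot.
  T(blocks: initDeque[ToyBlock](), headChunkCount: finalizedChunkCount)

proc addBlock(chain: var ToyChain, blk: ToyBlock) =
  # Store the block and advance the merkleizer in the same step.
  chain.blocks.addLast(blk)
  chain.headChunkCount += blk.depositCount

proc reset(chain: var ToyChain, finalizedChunkCount: int) =
  # Dropping the blocks also rewinds the merkleizer to the finalized snapshot.
  chain.blocks.clear()
  chain.headChunkCount = finalizedChunkCount

when isMainModule:
  var chain = ToyChain.init(finalizedChunkCount = 5)
  chain.addBlock(ToyBlock(number: 100, depositCount: 2))
  doAssert chain.headChunkCount == 7
  chain.reset(finalizedChunkCount = 5)
  doAssert chain.blocks.len == 0 and chain.headChunkCount == 5
```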