
Fix backfill stalling #5192

Merged 2 commits into sigp:unstable on Feb 8, 2024

Conversation

pawanjay176 (Member)

Issue Addressed

N/A

Proposed Changes

Hopefully fixes all of the backfill stalling issues we have been seeing.

The root cause of the issue is that the peer_disconnected function returns early if its call to retry_batch_download returns an error, which can happen when we have no synced peers to retry from.
This early return fails to move the BatchState for multiple batches from BatchState::Downloading back to BatchState::AwaitingDownload.
If the processing_target batch is one of the batches that failed to change state, backfill effectively stalls.
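
For illustration, here is a minimal, self-contained sketch of that failure mode. All type, field, and error names below are simplified stand-ins, not the actual Lighthouse implementation:

    use std::collections::BTreeMap;

    #[derive(Debug)]
    enum BatchState {
        AwaitingDownload,
        Downloading,
    }

    struct BackFill {
        batches: BTreeMap<u64, BatchState>,
        have_synced_peers: bool,
    }

    impl BackFill {
        // Stand-in for retry_batch_download: fails when there are no synced peers.
        fn retry_batch_download(&mut self, batch_id: u64) -> Result<(), &'static str> {
            if !self.have_synced_peers {
                return Err("no synced peers to retry from");
            }
            self.batches.insert(batch_id, BatchState::Downloading);
            Ok(())
        }

        // Stand-in for the pre-fix peer_disconnected: the `?` propagates the first
        // error, so the remaining batches never move back to AwaitingDownload.
        fn peer_disconnected(&mut self, batch_ids: &[u64]) -> Result<(), &'static str> {
            for &id in batch_ids {
                self.batches.insert(id, BatchState::AwaitingDownload);
                self.retry_batch_download(id)?; // early return on error
            }
            Ok(())
        }
    }

    fn main() {
        let mut sync = BackFill {
            batches: BTreeMap::from([(1, BatchState::Downloading), (2, BatchState::Downloading)]),
            have_synced_peers: false,
        };
        // The first retry fails, so batch 2 is left stuck in Downloading.
        let _ = sync.peer_disconnected(&[1, 2]);
        println!("{:?}", sync.batches); // {1: AwaitingDownload, 2: Downloading}
    }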
When we resume backfill after getting new peers, we do the following, either of which could potentially re-request the target:

  1. Call to resume_batches:

    fn resume_batches(&mut self, network: &mut SyncNetworkContext<T>) -> Result<(), BackFillError> {
        let batch_ids_to_retry = self
            .batches
            .iter()
            .filter_map(|(batch_id, batch)| {
                // In principle there should only ever be one of these, and we could terminate the
                // loop early, however the processing is negligible and we continue the search
                // for robustness to handle potential future modification
                if matches!(batch.state(), BatchState::AwaitingDownload) {
                    Some(*batch_id)
                } else {
                    None
                }
            })
            .collect::<Vec<_>>();

    Since the processing_target is stuck in BatchState::Downloading, it is not requested in this call.
  2. Call to request_batches, which does nothing because of this condition:

    if self
        .batches
        .iter()
        .filter(|&(_epoch, batch)| in_buffer(batch))
        .count()
        > BACKFILL_BATCH_BUFFER_SIZE as usize
    {
        return None;
    }

    In most of the logs I have seen of this issue, batches.len() is usually greater than BACKFILL_BATCH_BUFFER_SIZE, so this early return is hit and nothing is requested.

Hence, the target batch can never be requested again and its state can never change until a restart.
This PR handles the error in peer_disconnected instead of short-circuiting, so the remaining batches still have their states moved back to BatchState::AwaitingDownload.
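
Continuing the simplified sketch from above (still with stand-in names rather than the actual Lighthouse code), the fix amounts to handling the per-batch error instead of propagating it with `?`:

    impl BackFill {
        // Post-fix stand-in for peer_disconnected: a failed retry is logged and
        // the loop continues, so every affected batch still ends up in
        // BatchState::AwaitingDownload and can be re-requested by resume_batches
        // once synced peers are available again.
        fn peer_disconnected_fixed(&mut self, batch_ids: &[u64]) {
            for &id in batch_ids {
                self.batches.insert(id, BatchState::AwaitingDownload);
                if let Err(reason) = self.retry_batch_download(id) {
                    // No peers to retry from right now; leave the batch awaiting
                    // download rather than aborting the rest of the loop.
                    eprintln!("retry failed for batch {id}: {reason}");
                }
            }
        }
    }

With this shape, the processing_target can no longer be left stranded in Downloading after a disconnect, so the resume_batches path described in step 1 picks it up once peers are available again.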

@AgeManning (Member) left a comment:


Yeah nice find.

I also notice that, in the case where we short-circuit, we don't remove the peer from the participating_peers() mapping either, which looks like it could have dormant issues.

@AgeManning (Member) commented:

@Mergifyio queue

mergify bot commented on Feb 8, 2024:

queue

✅ The pull request has been merged automatically at 0b59d10

@AgeManning added the ready-for-merge (This PR is ready to merge.) and v5.0.0 (Q1 2024) labels and removed the ready-for-review (The code is ready for review) label on Feb 8, 2024
mergify bot added a commit that referenced this pull request Feb 8, 2024
mergify bot merged commit 0b59d10 into sigp:unstable on Feb 8, 2024
29 checks passed
danielramirezch pushed a commit to danielramirezch/lighthouse that referenced this pull request Feb 14, 2024
* Prevent early short circuit in `peer_disconnected`

* lint
@chong-he mentioned this pull request on Mar 29, 2024