blocksync: retry requests after timeout #9518
Conversation
blocksync/pool.go
Outdated
select {
case <-bpr.pool.Quit():
	if err := bpr.Stop(); err != nil {
		bpr.Logger.Error("Error stopping requester", "err", err)
	}
	return
case <-bpr.Quit():
	return
Factoring this out is not really necessary for this change. However, the intermingling here seemed strange: once the block was received, the for-loop would restart so that the requester could wait to be shut down by receiving the quit signal. However, this also allowed a successful requester to receive a 'redo' signal, prompting it to run the block request again even though it had already received the block. This change moves the shutdown logic into a separate select clause that runs after the block is received, to prevent this odd case.
Note that redo can be triggered if validation fails on a received block. The intent there is to redo the request by removing the peer from which you requested the block and doing the continue OUTER_LOOP thing within this case statement. But I think your logic is still correct, in that we should make sure redo is not run for a block that has already been validated and applied.
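To illustrate the redo flow described above, here is a minimal, self-contained sketch (the names `fetchBlock` and `validate` are illustrative, not the actual pool API): when validation fails, the faulty peer is dropped and the labeled OUTER_LOOP picks the next peer.

```go
package main

import "fmt"

// fetchBlock sketches the redo-on-validation-failure pattern: try peers
// in order; on a validation failure, drop the peer and let OUTER_LOOP
// retry the request with the next one.
func fetchBlock(peers []string, validate func(peer string) bool) (string, bool) {
OUTER_LOOP:
	for _, peer := range peers {
		fmt.Println("requesting block from", peer)
		if !validate(peer) {
			fmt.Println("validation failed, redo: dropping peer", peer)
			continue OUTER_LOOP // retry with another peer
		}
		return peer, true // block validated: done
	}
	return "", false // no peer could serve a valid block
}

func main() {
	// First peer serves an invalid block; the second one is honest.
	peer, ok := fetchBlock([]string{"badPeer", "goodPeer"}, func(p string) bool {
		return p == "goodPeer"
	})
	fmt.Println(peer, ok) // prints "goodPeer true"
}
```

The point of the discussion above is that this redo path should fire only for blocks that fail validation, not for blocks that were already validated and applied.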
From @jmalicevic's comment, I think the changes here could be minimized by not touching the other conditions and having the newly added condition (timeout) finish with continue OUTER_LOOP ... effectively turning the timeout case into a block validation failure.
HOWEVER, as I am taking over from William, I will merge this as is (@jmalicevic says the logic is still correct), backport it to v0.37.x, and run the rotating-nodes test with this change in.
Then, when William's back, we can rediscuss whether we want to simplify it in main.
EDIT: I will discuss this with Jasmina and Callum tomorrow morning
LGTM
blocksync/pool.go
Outdated
case <-to.C:
	bpr.pool.Logger.Debug("Retrying block pool request after timeout", "height", bpr.height, "peer", bpr.peerID)
	to.Reset(requestRetrySeconds * time.Second)
	continue WAIT_LOOP
Would it not be better if we were to retry with a different peer?
Yeah, that matches my comment on "turning this case into a block validation failure".
I'll leave it as is for the moment, though.
blocksync/pool.go
Outdated
// Continue the for-loop and wait til Quit.
continue WAIT_LOOP
// Exit the for loop and wait until we receive a quit signal.
break WAIT_LOOP
Will this cause the node to try getting a new peer for the same height in OUTER_LOOP?
I've run quite a few tests this afternoon (on several variants of this patch on top of
@cmwaters, @jmalicevic, please take a look at the latest version, which seems to work very well on the rotating testnet.
This seems good to me.
What would happen, hypothetically, if a peer did respond with a block after the 30-second timeout? I.e., do we safely handle that case?
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
This line here should protect us from accepting the same block from a different peer. However, this will cause an error to be propagated and might lead to a disconnect from this particular peer, which might be acceptable, but we should just be aware of it.
Ok, so perhaps a bit confusing but not detrimental.
Also, shouldn't we be backporting this?
Yes... I've already tested the backport.
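The guard discussed above can be sketched as follows (a minimal illustration, not the actual pool code; the `requester` type and `setBlock` name here are hypothetical stand-ins): a late block is only accepted from the peer the request is currently assigned to, and anything else is rejected with an error.

```go
package main

import "fmt"

// requester tracks which peer a block request is currently assigned to.
type requester struct {
	peerID string // the peer the request was (re-)sent to
	block  *int   // the received block, once accepted
}

// setBlock accepts a block only from the currently assigned peer; a
// late response from a previously timed-out peer is rejected.
func (r *requester) setBlock(block int, peerID string) error {
	if r.peerID != peerID {
		return fmt.Errorf("block from wrong peer %s, expected %s", peerID, r.peerID)
	}
	r.block = &block
	return nil
}

func main() {
	r := &requester{peerID: "peerB"} // request was retried via peerB
	// peerA answers the original, timed-out request too late:
	if err := r.setBlock(42, "peerA"); err != nil {
		fmt.Println("rejected:", err)
	}
	// the currently assigned peer's response is accepted:
	fmt.Println(r.setBlock(42, "peerB") == nil) // prints "true"
}
```

As noted above, the rejected late response surfaces as an error, which may lead to disconnecting from the slow peer.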
* blocksync: retry requests after timeout
* Minimize changes to re-send block request after timeout
* TO REVERT: reduce queue capacity
* Add reset
* Revert "TO REVERT: reduce queue capacity" (reverts commit dd0fee5)
* 30 seconds
* don't reset the timer
* Update blocksync/pool.go
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
(cherry picked from commit a371b1e)
* blocksync: retry requests after timeout (#9518)
  * blocksync: retry requests after timeout
  * Minimize changes to re-send block request after timeout
  * TO REVERT: reduce queue capacity
  * Add reset
  * Revert "TO REVERT: reduce queue capacity" (reverts commit dd0fee5)
  * 30 seconds
  * don't reset the timer
  * Update blocksync/pool.go
  Co-authored-by: Callum Waters <cmwaters19@gmail.com>
  Co-authored-by: Sergio Mena <sergio@informal.systems>
  (cherry picked from commit a371b1e)
* Add changelog entry
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
…x-rc1
* release/v0.37.0-rc1:
  QA Process report for v0.37.x (and baseline for v0.34.x) (tendermint#9499) (tendermint#9577)
  Fix TX payload for DO testnets (tendermint#9540) (tendermint#9542)
  blocksync: retry requests after timeout (backport tendermint#9518) (tendermint#9533)
  Extend the load report tool to include transactions' hashes (tendermint#9509) (tendermint#9513)
  build(deps): Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 (tendermint#9502)
  build(deps): Bump actions/stale from 5 to 6 (tendermint#9494)
  loadtime: add block time to the data point (tendermint#9484) (tendermint#9489)
  config: Add missing storage section when generating config (tendermint#9483) (tendermint#9487)
  Sync Vote.Verify() in spec with implementation (tendermint#9466) (tendermint#9476)
  fix spec (tendermint#9467) (tendermint#9469)
  metrics: fix panic because of absent prometheus label (tendermint#9455) (tendermint#9474)
  Ensure Dockerfile stages use consistent Go version (backport tendermint#9462) (tendermint#9472)
  build(deps): Bump slackapi/slack-github-action from 1.21.0 to 1.22.0 (tendermint#9432)
  build(deps): Bump bufbuild/buf-setup-action from 1.7.0 to 1.8.0 (tendermint#9453)
  state: restore previous error message (tendermint#9435) (tendermint#9440)
  build(deps): Bump gonum.org/v1/gonum from 0.11.0 to 0.12.0 (tendermint#9411)
  docs: Update ADRs for v0.37 (tendermint#9399) (tendermint#9418)
  build(deps): Bump github.com/spf13/viper from 1.12.0 to 1.13.0 (tendermint#9410)
  build(deps): Bump github.com/lib/pq from 1.10.6 to 1.10.7 (tendermint#9409)
  feat: support HTTPS inside websocket (tendermint#9416) (tendermint#9422)
  Removed unused param (tendermint#9394)
  test: generate uuid on startup for load tool (tendermint#9383) (tendermint#9392)
  add redirect links (tendermint#9385) (tendermint#9389)
  refactor: migrate to cosmos/gogoproto (backport tendermint#9356) (tendermint#9381)
  cmd: print all versions of tendermint and its sub protocols (tendermint#9329) (tendermint#9386)
  Add missing changes changelog files (backport tendermint#9376) (tendermint#9382)
  add separated runs by UUID (backport tendermint#9367) (tendermint#9379)
  spec: abci++ cleanup for v0.37 (backport tendermint#9288) (tendermint#9374)
  ci: Remove "(WARNING: BETA SOFTWARE)" tagline from all upcoming releases (tendermint#9371) (tendermint#9372)
  Update rpc client header (tendermint#9276) (tendermint#9349)
  ci: Pre-release workflows (backport tendermint#9366) (tendermint#9368)
  test: add the loadtime report tool (tendermint#9351) (tendermint#9364)
  Update Tendermint version to v0.37.0 (tendermint#9354)
  test: add the loadtime tool (tendermint#9342) (tendermint#9357)
# Conflicts:
#	version/version.go
What does this pull request do?
This pull request updates the blocksync logic to include a request retry timeout. The timeout chosen was 15 seconds; this seems long enough to clearly be more than just a slow network sending a large block, although perhaps other values should be considered.
The changed logic starts a timer before sending the blocksync request to a peer; then, if the timeout fires before the block arrives, it resets the timer and resends the request to the same peer.
Result from testnet
This change was run in a small six-node testnet to experiment and ensure the behavior was correct. A stall was first observed, whereby the blocksyncing node was unable to make progress for a long period; eventually the node recovered and completed syncing.
Below is a snippet of logs from the blocksyncing node that show the dropped request followed by the request being retried.
PR checklist
- CHANGELOG_PENDING.md updated, or no changelog entry needed
- Updated relevant documentation (docs/) and code comments, or no documentation updates needed