
blocksync: retry requests after timeout #9518

Merged: 9 commits, Oct 10, 2022
Conversation

williambanfield
Contributor

What does this pull request do?

This pull request updates the blocksync logic to include a request retry timeout. The timeout chosen was 15 seconds; this seems long enough to clearly indicate something more than a slow network sending a large block, although other values could be considered.

The changed logic starts a timer before sending the blocksync request to a peer; if the timer fires before a response arrives, it resets the timer and resends the request to the peer.

Result from testnet

This change was run in a small six-node testnet to experiment and ensure the behavior was correct. A stall was first observed in which the blocksyncing node was unable to make progress for a long period; the node then eventually recovered and completed syncing.

Below is a snippet of logs from the blocksyncing node that show the dropped request followed by the request being retried.

Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=110
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=111
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=11
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=14
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=15
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=78
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=82
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=86
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=116
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.264] Send queue is full, drop block request       module=blockchain peer=2551a13ed720101b271a5df4816d1e4b3d3bd133 height=121
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=8773a83e2f9fa4bc7a3e94303eb4df33af294288 height=315
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=8773a83e2f9fa4bc7a3e94303eb4df33af294288 height=316
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=326
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=327
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=328
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=329
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=330
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=331
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=332
Oct 06 21:54:53 ephemeral001 node[33453]: D[2022-10-06|21:54:53.268] Send queue is full, drop block request       module=blockchain peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555 height=333
....
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.270] Retrying block pool request after timeout    module=blockchain height=333 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=315 peer=8773a83e2f9fa4bc7a3e94303eb4df33af294288
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=86 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=11 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=15 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=116 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=328 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=111 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=327 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=326 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=78 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=82 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=121 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=331 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=14 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.270] Retrying block pool request after timeout    module=blockchain height=110 peer=2551a13ed720101b271a5df4816d1e4b3d3bd133
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=330 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=332 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=316 peer=8773a83e2f9fa4bc7a3e94303eb4df33af294288
Oct 06 21:55:08 ephemeral001 node[33453]: D[2022-10-06|21:55:08.271] Retrying block pool request after timeout    module=blockchain height=329 peer=0a9c99096b50a6d72a24d5aa5286d3f7022b3555

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed

@williambanfield williambanfield marked this pull request as ready for review October 6, 2022 22:12
@williambanfield williambanfield requested a review from a team October 6, 2022 22:12
Comment on lines 647 to 654
select {
case <-bpr.pool.Quit():
	if err := bpr.Stop(); err != nil {
		bpr.Logger.Error("Error stopped requester", "err", err)
	}
	return
case <-bpr.Quit():
	return
}
Contributor Author

Factoring this out is not strictly necessary for this change, but the intermingling here seemed strange: once the block was received, the for loop would restart so that the requester could wait for the quit signal to shut down. However, this also allowed a successful requester to receive a 'redo' signal, prompting it to run the block request again even though it had already received the block. This change moves the shutdown logic into a separate select clause that runs after the block is received, preventing that odd case.
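The separation described above can be modeled as follows (a toy sketch, not the actual requester code): once the block is in hand, the requester waits only on quit signals, so a redo can no longer reach it.

```go
package main

// waitForShutdown models the select clause that runs only after the block
// has been received: it has no redo case, so a requester that already
// delivered its block can never be asked to re-run the request.
func waitForShutdown(poolQuit, requesterQuit <-chan struct{}) string {
	select {
	case <-poolQuit:
		return "stopped by pool"
	case <-requesterQuit:
		return "stopped by requester"
	}
}
```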

Contributor

Note that redo can be triggered if validation fails on a received block. The intent there is to redo the request by removing the peer that the block was requested from and doing the continue OUTER_LOOP thing within that case statement. But I think your logic is still correct in that we should make sure redo is not run for a block that has been validated and applied.
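The redo semantics described here amount to peer replacement, which can be sketched as follows (a simplified model; the real block pool tracks peers and heights with considerably more state):

```go
package main

// redoPeer models what a redo does: drop the peer whose block failed
// validation and pick another peer to re-request the same height from.
// It reports false when no other peer is available.
func redoPeer(peers []string, bad string) (string, bool) {
	for _, p := range peers {
		if p != bad {
			return p, true // re-request from the first remaining peer
		}
	}
	return "", false // no peer left to retry with
}
```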

Contributor

From @jmalicevic's comment, I think the changes here could be minimized by not touching the other conditions and having the newly added condition (timeout) finish with continue OUTER_LOOP, effectively turning the timeout case into a block validation failure.

HOWEVER, as I am taking over from William, I will merge this as is (@jmalicevic says the logic is still correct); will backport to v0.37.x; and run the rotating nodes test with this change in.

Then, when William's back, we can rediscuss if we want to simplify it in main.

EDIT: I will discuss this with Jasmina and Callum tomorrow morning

@cmwaters (Contributor) left a comment

LGTM

Comment on lines 622 to 625
case <-to.C:
	bpr.pool.Logger.Debug("Retrying block pool request after timeout", "height", bpr.height, "peer", bpr.peerID)
	to.Reset(requestRetrySeconds * time.Second)
	continue WAIT_LOOP
Contributor

Would it not be better if we were to retry with a different peer?

Contributor

Yeah, that matches my comment on "turning this case into a block validation failure".

I'll leave it as is for the moment, though.

// Continue the for-loop and wait til Quit.
continue WAIT_LOOP
// Exit the for loop and wait until we receive a quit signal.
break WAIT_LOOP
Contributor

Will this cause the node to try getting a new peer for the same height in OUTER_LOOP?

@sergio-mena
Contributor

sergio-mena commented Oct 9, 2022

I've run quite a few tests this afternoon (on several variants of this patch on top of v0.37.x).

  • the fix as of 7cf4d7f still gets stuck when the peer that was queried initially is lost
  • I have tried several ways of addressing @cmwaters' and @jmalicevic's comments, and I am pushing now a version of the fix that didn't get stuck after many iterations of rotate.sh

@sergio-mena
Contributor

@cmwaters, @jmalicevic, please take a look at the latest version, which seems to work very well on the rotating testnet

@cmwaters (Contributor) left a comment

This seems good to me.

What would happen hypothetically if a peer did respond with a block after the 30 second timeout? i.e. do we safely handle that case?

blocksync/pool.go
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@jmalicevic
Contributor

What would happen hypothetically if a peer did respond with a block after the 30 second timeout? i.e. do we safely handle that case?

This line here should protect us from accepting the same block from a different peer. However, this will cause an error to be propagated and might lead to a disconnect from this particular peer. Which might be acceptable, but we should just be aware.
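A sketch of the guard being referred to (illustrative only; the actual check lives in the block pool): a block is accepted only from the peer that has the outstanding request for that height, so a late response from a timed-out peer is rejected with an error.

```go
package main

import "fmt"

// acceptBlock models the protection discussed above: the pool accepts a
// block for a height only from the peer it currently has a request out to.
// A late or duplicate response from any other source yields an error, which
// may propagate and cause that peer to be disconnected.
func acceptBlock(outstanding map[int64]string, height int64, fromPeer string) error {
	want, ok := outstanding[height]
	if !ok || want != fromPeer {
		return fmt.Errorf("unexpected block at height %d from peer %q", height, fromPeer)
	}
	delete(outstanding, height) // request fulfilled
	return nil
}
```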

@cmwaters
Contributor

This line here should protect us from accepting the same block from a different peer. However, this will cause an error to be propagated and might lead to a disconnect from this particular peer. Which might be acceptable, but we should just be aware.

Ok so perhaps a bit confusing but not detrimental.

@cmwaters
Contributor

Also, shouldn't we be backporting this?

@sergio-mena
Contributor

Also, shouldn't we be backporting this?

Yes... I've already tested the backport

@sergio-mena sergio-mena added S:backport-to-v0.34.x Tell mergify to backport the PR to v0.34.x S:backport-to-v0.37.x Tell mergify to backport the PR to v0.37.x labels Oct 10, 2022
@sergio-mena sergio-mena merged commit a371b1e into main Oct 10, 2022
@sergio-mena sergio-mena deleted the wb/retry-blocksync branch October 10, 2022 12:58
mergify bot pushed a commit that referenced this pull request Oct 10, 2022
* blocksync: retry requests after timeout

* Minimize changes to re-send block request after timeout

* TO REVERT: reduce queue capacity

* Add reset

* Revert "TO REVERT: reduce queue capacity"

This reverts commit dd0fee5.

* 30 seconds

* don't reset the timer

* Update blocksync/pool.go

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
(cherry picked from commit a371b1e)
mergify bot pushed a commit that referenced this pull request Oct 10, 2022
@sergio-mena sergio-mena mentioned this pull request Oct 10, 2022
sergio-mena added a commit that referenced this pull request Oct 10, 2022
* blocksync: retry requests after timeout (#9518)

* blocksync: retry requests after timeout

* Minimize changes to re-send block request after timeout

* TO REVERT: reduce queue capacity

* Add reset

* Revert "TO REVERT: reduce queue capacity"

This reverts commit dd0fee5.

* 30 seconds

* don't reset the timer

* Update blocksync/pool.go

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Co-authored-by: Sergio Mena <sergio@informal.systems>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
(cherry picked from commit a371b1e)

* Add changelog entry

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request Oct 10, 2022
@thanethomson thanethomson mentioned this pull request Oct 18, 2022
james-chf added a commit to heliaxdev/tendermint that referenced this pull request Nov 25, 2022
…x-rc1

* release/v0.37.0-rc1:
  QA Process report for v0.37.x (and baseline for v0.34.x) (tendermint#9499) (tendermint#9577)
  Fix TX payload for DO testnets (tendermint#9540) (tendermint#9542)
  blocksync: retry requests after timeout (backport tendermint#9518) (tendermint#9533)
  Extend the load report tool to include transactions' hashes (tendermint#9509) (tendermint#9513)
  build(deps): Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 (tendermint#9502)
  build(deps): Bump actions/stale from 5 to 6 (tendermint#9494)
  loadtime: add block time to the data point (tendermint#9484) (tendermint#9489)
  config: Add missing storage section when generating config (tendermint#9483) (tendermint#9487)
  Sync Vote.Verify() in spec with implementation (tendermint#9466) (tendermint#9476)
  fix spec (tendermint#9467) (tendermint#9469)
  metrics: fix panic because of absent prometheus label (tendermint#9455) (tendermint#9474)
  Ensure Dockerfile stages use consistent Go version (backport tendermint#9462) (tendermint#9472)
  build(deps): Bump slackapi/slack-github-action from 1.21.0 to 1.22.0 (tendermint#9432)
  build(deps): Bump bufbuild/buf-setup-action from 1.7.0 to 1.8.0 (tendermint#9453)
  state: restore previous error message (tendermint#9435) (tendermint#9440)
  build(deps): Bump gonum.org/v1/gonum from 0.11.0 to 0.12.0 (tendermint#9411)
  docs: Update ADRs for v0.37 (tendermint#9399) (tendermint#9418)
  build(deps): Bump github.com/spf13/viper from 1.12.0 to 1.13.0 (tendermint#9410)
  build(deps): Bump github.com/lib/pq from 1.10.6 to 1.10.7 (tendermint#9409)
  feat: support HTTPS inside websocket (tendermint#9416) (tendermint#9422)
  Removed unused param (tendermint#9394)
  test: generate uuid on startup for load tool (tendermint#9383) (tendermint#9392)
  add redirect links (tendermint#9385) (tendermint#9389)
  refactor: migrate to cosmos/gogoproto (backport tendermint#9356) (tendermint#9381)
  cmd: print all versions of tendermint and its sub protocols  (tendermint#9329) (tendermint#9386)
  Add missing changes changelog files (backport tendermint#9376) (tendermint#9382)
  add separated runs by UUID (backport tendermint#9367) (tendermint#9379)
  spec: abci++ cleanup for v0.37 (backport tendermint#9288) (tendermint#9374)
  ci: Remove "(WARNING: BETA SOFTWARE)" tagline from all upcoming releases (tendermint#9371) (tendermint#9372)
  Update rpc client header (tendermint#9276) (tendermint#9349)
  ci: Pre-release workflows (backport tendermint#9366) (tendermint#9368)
  test: add the loadtime report tool (tendermint#9351) (tendermint#9364)
  Update Tendermint version to v0.37.0 (tendermint#9354)
  test: add the loadtime tool (tendermint#9342) (tendermint#9357)

# Conflicts:
#	version/version.go