blocksync: retry requests after timeout #9518
Conversation
blocksync/pool.go
Outdated
select {
case <-bpr.pool.Quit():
	if err := bpr.Stop(); err != nil {
		bpr.Logger.Error("Error stopping requester", "err", err)
	}
	return
case <-bpr.Quit():
	return
Factoring this out is not really necessary for this change. However, the intermingling here seemed strange: once the block was received, the for-loop would restart so that the requester could wait to be shut down by receiving the quit signal. However, this also allowed a successful requester to receive a 'redo' signal, prompting it to run the block request again even though it had already received the block. This change moves the shutdown logic into a separate select clause that runs after the block is received, to prevent this odd case.
Note that redo can be triggered if validation fails on a received block. The intent there is to redo the request by removing the peer from which you requested the block and doing the continue OUTER_LOOP thing within this case statement. But I think your logic is still correct, in that we should make sure redo is not run for a block that has already been validated and applied.
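To illustrate the redo flow described above, here is a minimal, self-contained sketch (the names `fetchBlock` and `validate` are illustrative, not the actual pool API): when validation fails, the faulty peer is dropped and the labeled OUTER_LOOP picks the next peer.

```go
package main

import "fmt"

// fetchBlock sketches the redo-on-validation-failure pattern: try peers
// in order; on a validation failure, drop the peer and let OUTER_LOOP
// retry the request with the next one.
func fetchBlock(peers []string, validate func(peer string) bool) (string, bool) {
OUTER_LOOP:
	for _, peer := range peers {
		fmt.Println("requesting block from", peer)
		if !validate(peer) {
			fmt.Println("validation failed, redo: dropping peer", peer)
			continue OUTER_LOOP // retry with another peer
		}
		return peer, true // block validated: done
	}
	return "", false // no peer could serve a valid block
}

func main() {
	// First peer serves an invalid block; the second one is honest.
	peer, ok := fetchBlock([]string{"badPeer", "goodPeer"}, func(p string) bool {
		return p == "goodPeer"
	})
	fmt.Println(peer, ok) // prints "goodPeer true"
}
```

The point of the discussion above is that this redo path should fire only for blocks that fail validation, not for blocks that were already validated and applied.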
From @jmalicevic's comment, I think the changes here could be minimized by not touching the other conditions and having the newly added condition (timeout) finish with continue OUTER_LOOP ... effectively turning the timeout case into a block validation failure.
HOWEVER, as I am taking over from William, I will merge this as is (@jmalicevic says the logic is still correct), backport it to v0.37.x, and run the rotating-nodes test with this change in.
Then, when William's back, we can rediscuss whether we want to simplify it in main.
EDIT: I will discuss this with Jasmina and Callum tomorrow morning
LGTM
blocksync/pool.go
Outdated
case <-to.C:
	bpr.pool.Logger.Debug("Retrying block pool request after timeout", "height", bpr.height, "peer", bpr.peerID)
	to.Reset(requestRetrySeconds * time.Second)
	continue WAIT_LOOP
Would it not be better if we were to retry with a different peer?
Yeah, that matches my comment on "turning this case into a block validation failure".
I'll leave it as is for the moment, though.
blocksync/pool.go
Outdated
// Continue the for-loop and wait til Quit.
continue WAIT_LOOP
// Exit the for loop and wait until we receive a quit signal.
break WAIT_LOOP
Will this cause the node to try getting a new peer for the same height in OUTER_LOOP?
I've run quite a few tests this afternoon (on several variants of this patch on top of
@cmwaters, @jmalicevic, please take a look at the latest version, which seems to work very well on the rotating testnet.
This seems good to me.
What would happen, hypothetically, if a peer did respond with a block after the 30-second timeout? I.e., do we safely handle that case?
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
This line here should protect us from accepting the same block from a different peer. However, this will cause an error to be propagated and might lead to a disconnect from this particular peer, which might be acceptable, but we should just be aware of it.
Ok, so perhaps a bit confusing but not detrimental.
Also, shouldn't we be backporting this?
Yes... I've already tested the backport.
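The guard discussed above can be sketched as follows (a minimal illustration, not the actual pool code; the `requester` type and `setBlock` name here are hypothetical stand-ins): a late block is only accepted from the peer the request is currently assigned to, and anything else is rejected with an error.

```go
package main

import "fmt"

// requester tracks which peer a block request is currently assigned to.
type requester struct {
	peerID string // the peer the request was (re-)sent to
	block  *int   // the received block, once accepted
}

// setBlock accepts a block only from the currently assigned peer; a
// late response from a previously timed-out peer is rejected.
func (r *requester) setBlock(block int, peerID string) error {
	if r.peerID != peerID {
		return fmt.Errorf("block from wrong peer %s, expected %s", peerID, r.peerID)
	}
	r.block = &block
	return nil
}

func main() {
	r := &requester{peerID: "peerB"} // request was retried via peerB
	// peerA answers the original, timed-out request too late:
	if err := r.setBlock(42, "peerA"); err != nil {
		fmt.Println("rejected:", err)
	}
	// the currently assigned peer's response is accepted:
	fmt.Println(r.setBlock(42, "peerB") == nil) // prints "true"
}
```

As noted above, the rejected late response surfaces as an error, which may lead to disconnecting from the slow peer.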
* blocksync: retry requests after timeout
* Minimize changes to re-send block request after timeout
* TO REVERT: reduce queue capacity
* Add reset
* Revert "TO REVERT: reduce queue capacity" (reverts commit dd0fee5)
* 30 seconds
* don't reset the timer
* Update blocksync/pool.go
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
(cherry picked from commit a371b1e)
* blocksync: retry requests after timeout (#9518)
  * blocksync: retry requests after timeout
  * Minimize changes to re-send block request after timeout
  * TO REVERT: reduce queue capacity
  * Add reset
  * Revert "TO REVERT: reduce queue capacity" (reverts commit dd0fee5)
  * 30 seconds
  * don't reset the timer
  * Update blocksync/pool.go
  Co-authored-by: Callum Waters <cmwaters19@gmail.com>
  Co-authored-by: Sergio Mena <sergio@informal.systems>
  (cherry picked from commit a371b1e)
* Add changelog entry
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>
…x-rc1
* release/v0.37.0-rc1:
  QA Process report for v0.37.x (and baseline for v0.34.x) (tendermint#9499) (tendermint#9577)
  Fix TX payload for DO testnets (tendermint#9540) (tendermint#9542)
  blocksync: retry requests after timeout (backport tendermint#9518) (tendermint#9533)
  Extend the load report tool to include transactions' hashes (tendermint#9509) (tendermint#9513)
  build(deps): Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 (tendermint#9502)
  build(deps): Bump actions/stale from 5 to 6 (tendermint#9494)
  loadtime: add block time to the data point (tendermint#9484) (tendermint#9489)
  config: Add missing storage section when generating config (tendermint#9483) (tendermint#9487)
  Sync Vote.Verify() in spec with implementation (tendermint#9466) (tendermint#9476)
  fix spec (tendermint#9467) (tendermint#9469)
  metrics: fix panic because of absent prometheus label (tendermint#9455) (tendermint#9474)
  Ensure Dockerfile stages use consistent Go version (backport tendermint#9462) (tendermint#9472)
  build(deps): Bump slackapi/slack-github-action from 1.21.0 to 1.22.0 (tendermint#9432)
  build(deps): Bump bufbuild/buf-setup-action from 1.7.0 to 1.8.0 (tendermint#9453)
  state: restore previous error message (tendermint#9435) (tendermint#9440)
  build(deps): Bump gonum.org/v1/gonum from 0.11.0 to 0.12.0 (tendermint#9411)
  docs: Update ADRs for v0.37 (tendermint#9399) (tendermint#9418)
  build(deps): Bump github.com/spf13/viper from 1.12.0 to 1.13.0 (tendermint#9410)
  build(deps): Bump github.com/lib/pq from 1.10.6 to 1.10.7 (tendermint#9409)
  feat: support HTTPS inside websocket (tendermint#9416) (tendermint#9422)
  Removed unused param (tendermint#9394)
  test: generate uuid on startup for load tool (tendermint#9383) (tendermint#9392)
  add redirect links (tendermint#9385) (tendermint#9389)
  refactor: migrate to cosmos/gogoproto (backport tendermint#9356) (tendermint#9381)
  cmd: print all versions of tendermint and its sub protocols (tendermint#9329) (tendermint#9386)
  Add missing changes changelog files (backport tendermint#9376) (tendermint#9382)
  add separated runs by UUID (backport tendermint#9367) (tendermint#9379)
  spec: abci++ cleanup for v0.37 (backport tendermint#9288) (tendermint#9374)
  ci: Remove "(WARNING: BETA SOFTWARE)" tagline from all upcoming releases (tendermint#9371) (tendermint#9372)
  Update rpc client header (tendermint#9276) (tendermint#9349)
  ci: Pre-release workflows (backport tendermint#9366) (tendermint#9368)
  test: add the loadtime report tool (tendermint#9351) (tendermint#9364)
  Update Tendermint version to v0.37.0 (tendermint#9354)
  test: add the loadtime tool (tendermint#9342) (tendermint#9357)
# Conflicts:
#	version/version.go
What does this pull request do?
This pull request updates the blocksync logic to include a request retry timeout. The timeout chosen was 15 seconds; this seems long enough to clearly be more than just a slow network sending a large block, although perhaps other values should be considered.
The changed logic starts a timer before sending the blocksync request to a peer; then, if the timeout fires before the block arrives, it resets the timer and resends the request to the same peer.
Result from testnet
This change was run in a small six-node testnet to experiment and ensure the behavior was correct. A stall was first observed, whereby the blocksyncing node was unable to make progress for a long period; eventually the node recovered and completed syncing.
Below is a snippet of logs from the blocksyncing node that show the dropped request followed by the request being retried.
PR checklist
- CHANGELOG_PENDING.md updated, or no changelog entry needed
- Updated relevant documentation (docs/) and code comments, or no documentation updates needed