
feat(networking): add backoff period after failed dial #1462

Merged (7 commits) on Jan 23, 2023

Conversation

@alrevuelta (Contributor) commented Dec 13, 2022

Closes #1414

Summary:
After a failed dial to a peer, a backoff period is applied so that we don't attempt to dial the same peer again for some time. The wait time depends on the number of consecutive failures and is calculated with the following formula:

initialBackoffInSec*(backoffFactor^(failedAttempts-1))

Note that initialBackoffInSec and backoffFactor are configurable values that control how aggressive the backoffs are. With initialBackoffInSec=120 and backoffFactor=4, the wait times would be:

120s, 480s, 1920s, 7680s

This PR helps nodes increase their number of connections quickly, since less time is wasted trying to connect to nodes that fail. The improvement is even more noticeable in networks with a high ratio of unreachable peers. Note that this backoff only applies to relay peers.
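
For illustration only, the backoff computation described above could be sketched in Nim roughly as follows; the proc and parameter names are examples, not necessarily the exact ones used in the PR:

import std/math

# Illustrative sketch of the formula above; names are hypothetical.
proc calculateBackoffSec(initialBackoffInSec, backoffFactor, failedAttempts: int): int =
  if failedAttempts <= 0:
    return 0
  return initialBackoffInSec * (backoffFactor ^ (failedAttempts - 1))

# With initialBackoffInSec = 120 and backoffFactor = 4:
#   calculateBackoffSec(120, 4, 1) == 120   # first failure
#   calculateBackoffSec(120, 4, 2) == 480
#   calculateBackoffSec(120, 4, 3) == 1920
#   calculateBackoffSec(120, 4, 4) == 7680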

Changes:

  • Add exponential backoff after a failed dial.
  • Some minor test refactoring.
  • Some minor logging improvements.
  • Directly use switch.dial in ping (keep-alive).

@alrevuelta changed the title from "feat(p2p): add backoff period after failed dial" to "feat(networking): add backoff period after failed dial" on Dec 13, 2022
@status-im-auto (Collaborator) commented Dec 13, 2022

Jenkins Builds

Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 41f72ac #1 2022-12-13 23:00:26 ~13 min macos 📦bin
✔️ c2058e0 #2 2022-12-14 23:00:49 ~13 min macos 📦bin
✔️ 8af035c #3 2022-12-15 23:17:09 ~29 min macos 📦bin
7c5aa5a #4 2023-01-04 23:03:53 ~16 min macos 📄log
b28405a #5 2023-01-13 22:49:32 ~2 min macos 📄log
99b3327 #6 2023-01-16 22:49:28 ~2 min macos 📄log
✔️ faac57b #7 2023-01-18 23:02:16 ~14 min macos 📦bin
✔️ 7cf6b15 #8 2023-01-19 23:01:45 ~14 min macos 📦bin
✔️ 75c67bf #9 2023-01-20 23:01:21 ~13 min macos 📦bin

@alrevuelta marked this pull request as ready for review on December 15, 2022 09:36
@jm-clius (Contributor) left a comment:

Although I agree with the idea of exponential backoffs, I'm not sure I agree that this should be part of (every) dialPeer. In my mind, if an "application" (or just a protocol) calls dialPeer, it should attempt to connect to that peer immediately without considering delays/backoffs (except maybe a very recent failure). The protocol/application may "know better" than the peer manager about when a peer has become available again - especially when we're in the 14h backoff period. If a protocol attempts to continuously dial an unreachable peer, there is a problem with that protocol that should be addressed.

The only protocol that currently requires continuous connection attempts, however, is Relay. In my mind then there should be a connectivity loop that continuously attempts to connect to some peers from the peer store and respects the backoff period for these peers within this connectivity loop.

In this approach, dialPeer will be available to protocols that want to make an ad-hoc connection to a specific peer and should remain a simple way to attempt to dial the peer (there may be scope for some sanity checks here, but I don't think a protocol should attempt to dial a peer and then wait ~14 hours for the dial to be attempted).

@jm-clius (Contributor):

Of course, similar connectivity guarantees could be useful for service protocols in future too. I just think these connectivity attempts and backoffs should be done in parallel and not serially/ad hoc whenever attempting to dial.
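
For illustration only, the kind of connectivity loop described above could look roughly like the sketch below. The loop body, interval, and the simplified canBeConnected signature are assumptions; getNotConnectedPeers and canBeConnected do appear in this PR, and the actual loop landed later in #1477:

import chronos, std/sequtils

# Rough sketch of the proposed loop, not this PR's code: periodically dial
# relay peers from the peer store, skipping peers that are still backing off.
proc relayConnectivityLoop(pm: PeerManager) {.async.} =
  while true:
    let notConnected = pm.peerStore.getNotConnectedPeers()
    # Keep only peers whose backoff window has already elapsed
    let outsideBackoff = notConnected.filterIt(pm.peerStore.canBeConnected(it.peerId))
    for storedInfo in outsideBackoff:
      let peer = RemotePeerInfo.init(storedInfo.peerId, storedInfo.addrs)
      # dialPeer stays a plain dial; the backoff decision lives in this loop.
      discard await pm.dialPeer(peer, WakuRelayCodec)
    await sleepAsync(chronos.seconds(30))  # example interval, not a real default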

@alrevuelta (Contributor, Author):

Thanks @jm-clius, fair comments. How about only respecting the backoff for the relay protocol, and having a direct path for the service protocols? Also related to "service protocol slots" #1461.

Some comments:

> The protocol/application may "know better" than the peer manager about when a peer has become available again

Mm, not sure I follow. Can you provide an example? I'm not sure how the application layer can have knowledge of this.

> The only protocol that currently requires continuous connection attempts, however, is Relay. In my mind then there should be a connectivity loop that continuously attempts to connect to some peers from the peer store and respects the backoff period for these peers within this connectivity loop

Nice, as stated I will implement that, and have a direct path for service protocols.

> it should attempt to connect to that peer immediately without considering delays/backoffs (except maybe a very recent failure)

Sure. Linking this to the "service protocol slots" issue #1461. I can implement n retries for "slotted" peers. If that's fine, I will leave it for the PR fixing that issue.

@alrevuelta (Contributor, Author) commented Dec 16, 2022

Will leave this PR on hold, since with the existing code I can't differentiate between service (store, lp, ...) and relay peers. Meaning that a store peer configured with setStorePeer can be overridden by any other peer that also supports store.

Once I implement feature #1461, I will be able to know which peers are "slotted" (i.e. set as the preferred service/lp peer) and avoid applying the backoff to them.

Edit: I actually can, with proto (e.g. proto=StoreCodec), but that peer might be a different one than the one provided as store-peer.

@alrevuelta (Contributor, Author):

@jm-clius Fixed your comment by adding a flag to make respecting the backoff optional. Let me know what you think :)

This feature is currently unused; it will be used in #1477, in the connectivity loop that you also refer to, and only for relay peers.

@jm-clius (Contributor) left a comment:

Thanks! I will review in more detail after the weekend, in case I've missed some intricacies. :)

Adding the respectBackoff flag is indeed better, but I'm still not convinced that this logic should be part of the dialer.

As I see it, we have two distinct needs here:

  1. A dialer interface that must allow any protocol/application to attempt to dial peers, maintain connectivity related peer books, etc.
  2. At least one "application" of the peer manager/store (the "connect loop") that wants to use this dialer to continuously attempt to connect to all relevant peers in the peer store. This application should only attempt connection to peers that are not being backed off from, remove peers that it has the authority to do (i.e. not static peers), etc.

Mixing logic from (2) into (1) seems to me to create some confusion. What if an application wants to respectBackoff? Currently it will simply receive a none(Connection) in return if the peer is being backed off from. In my mind it has no reason not to continue attempting to connect to this peer and doesn't gain any information to help it make better decisions in future. Furthermore, the dialer will now make decisions such as removing peers from the peer store after max failed attempts. It seems to me to be doing that because it "knows" that this is what the connect loop would expect of it, since the connect loop is selecting peers from the peer store for attempted connection. This has no bearing, however, on other applications/protocols that may be managing their own peers.

waku/v2/node/peer_manager/peer_manager.nim (outdated review thread, resolved)
Comment on lines 88 to 92
var deadline = sleepAsync(dialTimeout)
let dialFut = pm.switch.dial(peerId, addrs, proto)

var reasonFailed = ""
try:
# Attempt to dial remote peer
if (await dialFut.withTimeout(DefaultDialTimeout)):
await dialFut or deadline
if dialFut.finished():
if not deadline.finished():
deadline.cancel()
Contributor:

Any reason not to use dialFut.withTimeout()?

Contributor (Author):

Just reverted my changes, using withTimeout again.

I thought there was a possible race condition and also wanted to cancel the dial if the timer timed out, but noticed it didn't make much sense.

waku_peers_dials.inc(labelValues = [reasonFailed])

# If failed too many times, remove peer from peer store
if respectBackoff and pm.peerStore[NumberFailedConnBook][peerId] >= pm.maxFailedAttempts:
Contributor:

This seems a bit weird to me - the fact that me dialing some peer could result in it being removed from the peer store (and doing it again would result in it being added again, presumably?). Of course, we know that for the application of "continuously attempt to connect to all available relay peers" this would make sense, i.e. to eventually stop attempting to dial peers that continue to fail. But this highlights my concern that this logic should not be part of the dialer, even if behind a respectBackoff flag.

Contributor (Author):

fixed

@alrevuelta (Contributor, Author):

@jm-clius I agree with your needs 1. and 2. The main reason I added respectBackoff to dialPeer and mixed the logic is that it's easier to unit test. If I respect the backoff directly in the loop, that makes it more difficult to unit test.

Plan B is to have a separate function, but that involves duplicating lots of logic: updating metrics, dial + timeouts, etc.

Will convert the PR to draft and implement a vanilla connectivity loop in a separate PR. Then I will add the backoff to that connectivity loop, which should comply with your "not mixing logic" requirement.

@alrevuelta force-pushed the add-exponential-backoff branch 2 times, most recently from 83c78b5 to fc238bf, on January 13, 2023 08:42
@alrevuelta marked this pull request as draft on January 13, 2023 09:11
@alrevuelta changed the base branch from master to add-connectivity-loop on January 13, 2023 09:15
@alrevuelta force-pushed the add-exponential-backoff branch 4 times, most recently from 8bfedcb to 99b3327, on January 16, 2023 10:54
Base automatically changed from add-connectivity-loop to master on January 18, 2023 14:17
@alrevuelta marked this pull request as ready for review on January 20, 2023 08:14
@rymnc (Contributor) left a comment:

LGTM, just a couple questions :)


trace "Discovered peers", count=discoveredPeers.get().len()
if discoveredPeersRes.isOk:
Contributor:

Suggested change:
- if discoveredPeersRes.isOk:
+ if discoveredPeersRes.isOk():

Nit, but I believe this is the style guide we're adopting.

Contributor:

Maybe we should handle the error if findRandomPeers fails, with at least an error log.

Contributor (Author):

Sure, thanks!
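
For reference, the handling suggested above could look something like this (a sketch; the log level, message, and field names are assumptions, not code from this PR):

# One possible shape for the suggested error handling.
if discoveredPeersRes.isOk():
  trace "Discovered peers", count=discoveredPeersRes.get().len()
else:
  warn "Failed to discover random peers", error=discoveredPeersRes.error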

# If it errored we wait an exponential backoff from last connection
# the more failed attempts, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
let lastFailed = peerStore[LastFailedConnBook][peerId]
Contributor:

Suggested change:
- let lastFailed = peerStore[LastFailedConnBook][peerId]
+ let lastFailed = peerStore.getLastFailedPeer(peerId)

or similar, which would allow us to change the underlying data structure if required in the future.

Contributor (Author):

I agree with this, but I'm just trying to follow the nim-libp2p peerstore pattern of using custom books: https://github.com/status-im/nim-libp2p/blob/unstable/libp2p/peerstore.nim#L148

Doing this would require 6-7 new getter functions with just one line of code each, and I'm not sure I see the benefit right now.

But I'm totally open to suggestions.

Contributor:

Sounds good! Just something to keep in mind if we decide to have more complex functionality later.
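
For completeness, the accessor being discussed would be a thin wrapper along these lines (the name getLastFailedConn is hypothetical; this is a sketch, not code from the PR):

# Hypothetical one-line getter hiding the custom book lookup, so the
# underlying data structure could change later without touching callers.
proc getLastFailedConn*(peerStore: PeerStore, peerId: PeerId): Moment =
  peerStore[LastFailedConnBook][peerId]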

Comment on lines +40 to +41
InitialBackoffInSec = 120
BackoffFactor = 4
Contributor:

Just wondering what people's views are on making this configurable by the operator. To me, it allows for more aggressive dialing behaviour. WDYT?

Contributor (Author):

Mm, these are safe default values that I've tested. What do you mean by configurable? New CLI flags? I'm not sure whether that would be too low-level for an operator. Note, though, that they can be changed when creating the peer manager.

Contributor:

Yeah, I mean CLI flags :)

Contributor:

In general I'm in favour of using (hard-coded) defaults until it becomes clear that making them configurable is useful to an operator. These default values could be part of a BCP RFC, for example, so that other client implementations can follow suit and agreement can be reached on what the most reasonable default is.

@jm-clius (Contributor) left a comment:

Thanks! Makes sense to me now that the dialPeer mechanism does not make any decisions on whether to dial a peer or not. Some minor comments below. My biggest concern is prioritising some mechanism to manage the number of peers kept in the store, to avoid leaks. It would be good to monitor peer management behaviour closely once this is merged (and auto-deployed to wakuv2.test).


let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
Contributor:

Any reason this is a var?

Contributor (Author):

Ouch, leftover. Will fix.


let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
Contributor:

Extremely nitpicky: shouldn't these be something like outsideBackoffPeers? :D To me this sounds like these peers are still within the period of backing off.

Contributor (Author):

Ah, right! Will fix.

try:
let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
let pingDelay = await node.libp2pPing.ping(conn)
except CatchableError as exc:
Contributor:

Not sure I understand the move away from Result (or Option) based error handling here - is it because CatchableErrors are possible here that we can't enumerate and deal with explicitly? Note that we prefer explicit error handling, see e.g. https://status-im.github.io/nim-style-guide/errors.exceptions.html

Contributor (Author):

The main change here is to use switch.dial instead of peerManager.dialPeer for the ping. The main reason is that dialPeer updates metrics on ok/nok connections, failed attempts, last failure time, can/cannot-be-connected status, etc.

And since here we are just pinging the peer, I don't think using that function makes sense. For example, we would be updating the metrics with ok connections every time we ping a peer.

And actually this should be more like getConnection, because we are not dialing/connecting to any peer, but getting an already existing connection and sending a ping.

I agree with explicit error handling, but here I had to add the try/except because switch.dial doesn't return Result[xx].
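
If Result-based handling were preferred here, one option would be to wrap the raising calls once, as in the sketch below (the helper name tryPing and the string error type are illustrative assumptions, not what this PR does):

import chronos, stew/results

# Sketch only: wraps the raising switch.dial/ping calls into a Result so
# callers can handle errors explicitly instead of catching exceptions.
proc tryPing(node: WakuNode, peer: RemotePeerInfo): Future[Result[Duration, string]] {.async.} =
  try:
    let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
    let pingDelay = await node.libp2pPing.ping(conn)
    return ok(pingDelay)
  except CatchableError as exc:
    return err("ping failed: " & exc.msg)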


# If it errored we wait an exponential backoff from last connection
# the more failed attempts, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
Contributor:

May be worth extracting this in future as another argument for canBeConnected, so that canBeConnected becomes an isolated utility-type function with predictable unit testing outputs and so that you only have to read the current system time once when checking the canBeConnected() status for multiple peers (as is the most common use case, I think)
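
A sketch of what that refactor could look like (the signature, defaults, and use of the PR's constants are illustrative, not the code in this PR):

import std/math, chronos

# Pure-ish utility: the caller supplies `now`, so unit tests can pin the clock
# and the system time is read only once when checking many peers.
proc canBeConnected*(peerStore: PeerStore, peerId: PeerId, now: Moment,
                     initialBackoffInSec = InitialBackoffInSec,
                     backoffFactor = BackoffFactor): bool =
  let failedAttempts = peerStore[NumberFailedConnBook][peerId]
  if failedAttempts == 0:
    return true
  # Exponential backoff: initialBackoffInSec * backoffFactor^(failedAttempts - 1)
  let backoff = chronos.seconds(initialBackoffInSec * (backoffFactor ^ (failedAttempts - 1)))
  let lastFailed = peerStore[LastFailedConnBook][peerId]
  return now >= lastFailed + backoff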

return true
return false

proc delete*(peerStore: PeerStore,
Contributor:

Afaict, this is not yet used anywhere? Given the fact that in existing deployments peer IDs are cycled very often (I think), we should add a mechanism to manage the size of the peer store fairly urgently - unsure of the implication if this memory essentially leaks in the meantime.

Contributor (Author):

Yep, not used in this PR, but tracked in "Prune peers from peerstore". It shouldn't leak, since it's limited by .withPeerStore(capacity=xxx), but yes, we have to handle it more gracefully.

@LNSD (Contributor) left a comment:

LGTM ✅

@alrevuelta merged commit 028efc8 into master on Jan 23, 2023
@alrevuelta deleted the add-exponential-backoff branch on January 23, 2023 20:24
Successfully merging this pull request may close: chore(networking): too many failed dials, improve strategy