feat(networking): add backoff period after failed dial #1462
Conversation
Although I agree with the idea of exponential backoffs, I'm not sure I agree that this should be part of (every) `dialPeer`. In my mind, if an "application" (or just a protocol) calls `dialPeer`, it should attempt to connect to that peer immediately without considering delays/backoffs (except maybe a very recent failure). The protocol/application may "know better" than the peer manager about when a peer has become available again - especially when we're in the 14h backoff period. If a protocol attempts to continuously dial an unreachable peer, there is a problem with that protocol that should be addressed. The only protocol that currently requires continuous connection attempts, however, is Relay. In my mind, then, there should be a connectivity loop that continuously attempts to connect to some peers from the peer store and respects the backoff period for these peers within this connectivity loop. In this approach, `dialPeer` will be available to protocols that want to make an ad-hoc connection to a specific peer and should remain a simple way to attempt to dial the peer (there may be scope for some sanity checks here, but I don't think a protocol should attempt to dial a peer and then wait ~14 hours for the dial to be attempted).
Of course, similar connectivity guarantees could be useful for service protocols in future too. I just think these connectivity attempts and backoffs should be done in parallel and not serially/ad-hoc whenever attempting to dial.
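For illustration, a minimal sketch of the connectivity-loop idea described above, using stand-in types and placeholder helpers (`PeerEntry`, `canBeConnected` and `dialPeer` here are illustrative, not the actual nwaku API):

```nim
import chronos, sequtils

# Stand-in peer record; the real peer manager keeps this state in the
# peer store's custom "books" (LastFailedConnBook, NumberFailedConnBook).
type PeerEntry = object
  id: string
  lastFailed: Moment
  failedAttempts: int

proc canBeConnected(p: PeerEntry, now: Moment): bool =
  # Placeholder backoff check: allow a new attempt only after a fixed delay.
  p.failedAttempts == 0 or now - p.lastFailed > 2.minutes

proc dialPeer(p: PeerEntry): Future[bool] {.async.} =
  # Placeholder dial; the real implementation would call switch.dial.
  return false

proc connectivityLoop(peers: seq[PeerEntry]) {.async.} =
  # Periodically try to connect to peers that are outside their backoff
  # period, keeping backoff bookkeeping in this loop rather than in dialPeer.
  while true:
    let now = Moment.now()
    for p in peers.filterIt(it.canBeConnected(now)):
      discard await dialPeer(p)
    await sleepAsync(30.seconds)
```

The point of the sketch is that backoff policy lives in the loop, while the plain dial path stays immediate for ad-hoc callers.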
Thanks @jm-clius, fair comments. How about only respecting the backoff for the relay protocol, and having a direct path for the service protocols? Also related to "service protocol slots" #1461. Some comments:
- Mm, not sure I follow. Can you provide an example? Not sure how the application layer can have knowledge of this.
- Nice, as stated I will implement that, and have a direct path for service protocols.
- Sure. Linking this to the "service protocol slots" #1461. I can implement n retries for "slotted" peers. If it's fine, I'll leave that for the PR fixing that issue.
Will leave this PR on hold since with the existing code, I can't differentiate between service (store, lp, ...) and relay peers. Meaning that a store peer configured with … Once I implement this feature #1461 I will be able to know which peers are "slotted" (aka set as preferred service/lp peer) and not apply the backoff to them. Edit: I actually can with …
Thanks! I will review in more detail after the weekend, in case I've missed some intricacies. :)
Adding the `respectBackoff` flag is indeed better, but I'm still not convinced that this logic should be part of the dialer.
As I see it, we have two distinct needs here:
1. A dialer interface that must allow any protocol/application to attempt to dial peers, maintain connectivity-related peer books, etc.
2. At least one "application" of the peer manager/store (the "connect loop") that wants to use this dialer to continuously attempt to connect to all relevant peers in the peer store. This application should only attempt connection to peers that are not being backed off from, remove peers that it has the authority to remove (i.e. not static peers), etc.
Mixing logic from (2) into (1) seems to me to create some confusion. What if an application wants to `respectBackoff`? Currently it will simply receive a `none(Connection)` in return if the peer is being backed off from. In my mind it has no reason not to continue attempting to connect to this peer and doesn't gain any information to help it make better decisions in future. Furthermore, the dialer will now make decisions such as removing peers from the peer store after max failed attempts. It seems to me to be doing that because it "knows" that this is what the connect loop would expect of it, since the connect loop is selecting peers from the peer store for attempted connection. This has no bearing, however, on other applications/protocols that may be managing their own peers.
```diff
 var deadline = sleepAsync(dialTimeout)
 let dialFut = pm.switch.dial(peerId, addrs, proto)

 var reasonFailed = ""
 try:
   # Attempt to dial remote peer
-  if (await dialFut.withTimeout(DefaultDialTimeout)):
+  await dialFut or deadline
+  if dialFut.finished():
+    if not deadline.finished():
+      deadline.cancel()
```
Any reason not to use `dialFut.withTimeout()`?
Just reverted my changes, using `withTimeout` again.
I thought there was a possible race condition and also wanted to cancel the dial if the timer timed out, but noticed it didn't make much sense.
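For reference, a small self-contained chronos example of the `withTimeout` pattern being reverted to here (the `slowDial` proc is just a stand-in for the actual dial):

```nim
import chronos

proc slowDial(): Future[bool] {.async.} =
  # Stand-in for a dial that may take a long time.
  await sleepAsync(2.seconds)
  return true

proc demo() {.async.} =
  let dialFut = slowDial()
  # withTimeout races the future against an internal timer and returns
  # false if the timeout fires first, so no manual deadline/cancel is needed.
  if await dialFut.withTimeout(1.seconds):
    echo "dial finished in time: ", dialFut.read()
  else:
    echo "dial timed out"

waitFor demo()
```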
```nim
waku_peers_dials.inc(labelValues = [reasonFailed])

# If failed too many times, remove peer from peer store
if respectBackoff and pm.peerStore[NumberFailedConnBook][peerId] >= pm.maxFailedAttempts:
```
This seems a bit weird to me - the fact that me dialing some peer could result in it being removed from the peer store (and doing it again would result in it being added again, presumably?). Of course, we know that for the application of "continuously attempt to connect to all available `relay` peers" this would make sense, i.e. to eventually stop attempting to dial peers that continue to fail. But this highlights my concern that this logic should not be part of the dialer, even if behind a `respectBackoff` flag.
fixed
@jm-clius I agree with your 1. and 2. needs. The main reason I added the … Plan B is to have a separate function, but that involves duplicating lots of logic: updating metrics, dial + timeouts, etc. Will convert the PR to draft and implement a vanilla connectivity loop in a separate PR. Then I will add the backoff to that connectivity loop, which should comply with your "not mixing logic" requirement.
LGTM, just a couple questions :)
```nim
trace "Discovered peers", count=discoveredPeers.get().len()
if discoveredPeersRes.isOk:
```
```diff
-if discoveredPeersRes.isOk:
+if discoveredPeersRes.isOk():
```
Nit, but I believe this is the style guide we're adopting.
Maybe we should handle an error if `findRandomPeers` fails. An error log at least.
sure thanks!
```nim
# If it errored we wait an exponential backoff from last connection
# the more failed attempts, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
let lastFailed = peerStore[LastFailedConnBook][peerId]
```
```diff
-let lastFailed = peerStore[LastFailedConnBook][peerId]
+let lastFailed = peerStore.getLastFailedPeer(peerId)
```
or similar, which would allow us to change the underlying data structure if required in the future.
I agree with this, but I'm just trying to follow the nim-libp2p peerstore pattern of using custom books: https://github.com/status-im/nim-libp2p/blob/unstable/libp2p/peerstore.nim#L148
Doing this would require 6-7 new getter functions, each with just one line of code, and I'm not sure I see the benefit right now.
But I'm totally open to suggestions.
Sounds good! just something to keep in mind if we decide to have more complex functionality later.
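As a toy illustration of the accessor idea (using a plain `Table` instead of nim-libp2p's custom books; names here are hypothetical):

```nim
import tables

# Toy stand-in for the peer store; the real one uses generic "books".
type ToyPeerStore = object
  lastFailedConn: Table[string, int64]   # peerId -> unix time of last failure

proc getLastFailedConn*(ps: ToyPeerStore, peerId: string): int64 =
  # One-line getter hiding the underlying data structure, so it could be
  # swapped out later without touching call sites.
  ps.lastFailedConn.getOrDefault(peerId, 0)
```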
```nim
InitialBackoffInSec = 120
BackoffFactor = 4
```
Just wondering what people's views are on making this configurable by the operator. To me, it allows for more aggressive dialing behaviour. wdyt?
Mm, these are some safe default values that I've tested. What do you mean by configurable? New CLI flags? Not sure if that would be too low-level for an operator. Note, though, that they can be changed when creating the peer manager.
yeah, I mean cli flags :)
In general I'm in favour of using (hard-coded) defaults until it becomes clear that making them configurable is useful to an operator. These default values could be part of a BCP RFC, for example, so that other client implementations can follow suit and agreement can be reached on what the most reasonable default is.
Thanks! Makes sense to me now that the `dialPeer` mechanism does not make any decisions on whether to dial a peer or not. Some minor comments below. My biggest concern is prioritising some mechanism to manage the number of peers kept in the store to avoid leaks. Would be good to monitor peer management behaviour closely once this is merged (and auto-deployed to `wakuv2.test`).
```nim
let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
```
Any reason this is a `var`?
ouch, leftover. will fix.
```nim
let numPeersToConnect = min(min(maxConnections - numConPeers, disconnectedPeers.len), MaxParalelDials)
var notConnectedPeers = pm.peerStore.getNotConnectedPeers().mapIt(RemotePeerInfo.init(it.peerId, it.addrs))
var withinBackoffPeers = notConnectedPeers.filterIt(pm.peerStore.canBeConnected(it.peerId,
```
Extremely nitpicky: shouldn't these be something like `outsideBackoffPeers`? :D To me this sounds like these peers are still within the period of backing off.
ah right! will fix.
```nim
try:
  let conn = await node.switch.dial(peer.peerId, peer.addrs, PingCodec)
  let pingDelay = await node.libp2pPing.ping(conn)
except CatchableError as exc:
```
Not sure I understand the move away from `Result` (or `Option`) based error handling here - is it because there are `CatchableError`s possible here that we can't enumerate and deal with explicitly? Note that we prefer explicit error handling, see e.g. https://status-im.github.io/nim-style-guide/errors.exceptions.html
The main change here is to use `switch.dial` instead of `peerManager.dialPeer` for the ping. The main reason is that `dialPeer` updates metrics on ok/nok connections, failed attempts, last time failed, can/cannot be connected, etc.
And since here we are just pinging this peer, I don't think using that function makes sense. For example, we would be updating the metrics with ok connections every time we ping a peer.
Actually, this should be more like `getConnection`, because we are not dialing/connecting to any peer, but getting an already existing connection and sending a ping.
I agree with explicit error handling, but here I had to add the try/except because `switch.dial` doesn't return `Result[xx]`.
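As a generic illustration of the trade-off being discussed (not the actual nwaku code), wrapping an exception-raising call so callers get `Option`-based results might look like:

```nim
import options

proc mightRaise(x: int): int =
  # Stand-in for an API (like switch.dial) that signals errors via exceptions.
  if x < 0:
    raise newException(ValueError, "negative input")
  x * 2

proc tryCall(x: int): Option[int] =
  # Convert the exception-based API into an Option at the boundary,
  # so callers can handle errors explicitly without try/except.
  try:
    some(mightRaise(x))
  except CatchableError:
    none(int)

when isMainModule:
  assert tryCall(3) == some(6)
  assert tryCall(-1).isNone()
```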
```nim
# If it errored we wait an exponential backoff from last connection
# the more failed attempts, the greater the backoff since last attempt
let now = Moment.init(getTime().toUnix, Second)
```
May be worth extracting this in future as another argument for `canBeConnected`, so that `canBeConnected` becomes an isolated utility-type function with predictable unit-testing outputs, and so that you only have to read the current system time once when checking the `canBeConnected()` status for multiple peers (as is the most common use case, I think).
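A rough sketch of that refactor, assuming hypothetical field and parameter names: the clock is read once by the caller and passed in, so the check itself is a pure function.

```nim
import chronos

# `lastFailed`, `failedAttempts` and `backoff` are illustrative parameters,
# not the actual nwaku signature.
proc canBeConnected(lastFailed: Moment, failedAttempts: int,
                    now: Moment, backoff: Duration): bool =
  # No hidden clock reads: the caller supplies `now` once for all peers.
  failedAttempts == 0 or now - lastFailed >= backoff

when isMainModule:
  let t0 = Moment.now()
  # Deterministic unit tests become trivial.
  doAssert canBeConnected(t0, 1, t0 + 5.minutes, 2.minutes)
  doAssert not canBeConnected(t0, 1, t0 + 1.minutes, 2.minutes)
```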
```nim
    return true
  return false

proc delete*(peerStore: PeerStore,
```
Afaict, this is not yet used anywhere? Given the fact that in existing deployments peer IDs are cycled very often (I think), we should add a mechanism to manage the size of the peer store fairly urgently - unsure of the implication if this memory essentially leaks in the meantime.
Yep, not used in this PR but tracked here in "Prune peers from peerstore".
It shouldn't leak since it's limited by `.withPeerStore(capacity=xxx)`, but yes, we have to handle it more gracefully.
LGTM ✅
Closes #1414
Summary:
After failing to dial a peer, it adds a backoff to it so that we don't attempt to dial the same peer again for some time. The time to wait depends on the number of consecutive failures, calculated with the following formula:
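A minimal sketch of that calculation, assuming the usual exponential form (the exact expression in the PR may differ slightly):

```nim
import math

# Assumed shape of the exponential backoff:
#   backoff = initialBackoffInSec * backoffFactor ^ (failedAttempts - 1)
proc backoffSeconds(initialBackoffInSec, backoffFactor, failedAttempts: int): int =
  if failedAttempts == 0:
    return 0
  initialBackoffInSec * backoffFactor ^ (failedAttempts - 1)

when isMainModule:
  # With initialBackoffInSec=120 and backoffFactor=4 this gives:
  # 1 failure -> 120 s (2 min), 2 -> 480 s (8 min),
  # 3 -> 1920 s (32 min), 4 -> 7680 s (~2 h)
  for attempts in 1 .. 4:
    echo attempts, " failed attempt(s) -> ", backoffSeconds(120, 4, attempts), " s"
```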
Note that `initialBackoffInSec` and `backoffFactor` are configurable values that control how aggressive the backoffs are. Using `initialBackoffInSec=120` and `backoffFactor=4`, the times to wait grow as illustrated in the sketch above.
This PR helps nodes to rapidly increase their number of connections, since less time is wasted in trying to connect to nodes that fail. This improvement is even more noticeable in networks with a high ratio of non-reachable peers. Note that this backoff only applies to `relay` peers.
Changes:
- Use `switch.dial` in ping (keep alive)