
fix: peer_manager - extend the number of connection requests to known peers #2534

Merged

Ivansete-status merged 3 commits into master from dial-issue on Mar 19, 2024

Conversation

Ivansete-status
Collaborator

Description

@danisharora099 reported a very low number of in connections on his node:

[screenshot: node metrics showing very few in connections]

We've also noticed this behavior in other nodes.


This is how peer_manager establishes new out connections in a nwaku node:
Every node tries to establish connections to its known peers, every minute.
On every iteration, await pm.connectToNodes(outsideBackoffPeers[0..<numPeersToConnect]) is run, but the number of parallel connections is bounded by just 10 (MaxParallelDials). That means a node will only attempt to connect to 10 peers, every minute.
Given the number of known peers each node has (754 on @danisharora099's node), that makes it more difficult, or at least slower, to connect to other nodes.
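
For illustration, here is a minimal sketch of that per-minute loop. PeerId, PeerManager, getOutsideBackoffPeers, and the connectToNodes body are made-up stubs to keep the example self-contained; only the batching shape mirrors the description above, not the actual nwaku implementation.

```nim
import chronos

const MaxParallelDials = 10

type
  PeerId = string              # stand-in for the real libp2p peer identity
  PeerManager = ref object
    knownPeers: seq[PeerId]

# Hypothetical helpers; the real procs live in peer_manager.nim.
proc getOutsideBackoffPeers(pm: PeerManager): seq[PeerId] =
  pm.knownPeers                # real code filters out peers in backoff

proc connectToNodes(pm: PeerManager, peers: seq[PeerId]) {.async.} =
  discard                      # real code dials the given peers in parallel

# Pre-change behavior: a single batch of at most MaxParallelDials peers
# per iteration, i.e. at most 10 out-connection attempts per minute,
# however many known peers (e.g. 754) are available.
proc relayConnectivityLoop(pm: PeerManager) {.async.} =
  while true:
    let outsideBackoffPeers = pm.getOutsideBackoffPeers()
    let numPeersToConnect = min(outsideBackoffPeers.len, MaxParallelDials)
    await pm.connectToNodes(outsideBackoffPeers[0 ..< numPeersToConnect])
    await sleepAsync(1.minutes)
```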


With this change, we make every node establish as many connections as possible to other nodes on each relayConnectivityLoop iteration. That enlarges the chances of other nodes receiving in connections.

Changes

  • Try to connect to all known peers on each relayConnectivityLoop iteration (see the sketch below).
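
Reusing the stub types and helpers from the sketch in the description, the changed loop would look roughly like this; it is a sketch of the intent, not the verbatim patch. Instead of dialing a single batch, it walks through all outside-backoff peers, MaxParallelDials at a time.

```nim
# Post-change sketch: iterate over *all* outside-backoff peers in slices
# of MaxParallelDials, rather than stopping after the first batch.
proc relayConnectivityLoop(pm: PeerManager) {.async.} =
  while true:
    let outsideBackoffPeers = pm.getOutsideBackoffPeers()
    var i = 0
    while i < outsideBackoffPeers.len:
      let batchEnd = min(i + MaxParallelDials, outsideBackoffPeers.len)
      await pm.connectToNodes(outsideBackoffPeers[i ..< batchEnd])
      i = batchEnd
    await sleepAsync(1.minutes)
```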


github-actions bot commented Mar 15, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2534-rln-v2-false

Built from f595bf6

@Ivansete-status Ivansete-status marked this pull request as ready for review March 15, 2024 11:55
@Ivansete-status Ivansete-status self-assigned this Mar 15, 2024
Contributor

@SionoiS SionoiS left a comment


LGTM!

N.B.
The experimental peer manager tries to connect to all possible nodes here -> https://github.com/waku-org/nwaku/blob/0894f0cfea9fac17348607cb50768e17bd788660/waku/node/peer_manager/peer_manager.nim#L750C1-L753C51

We could try to enable it in waku.test?

@Ivansete-status
Collaborator Author

LGTM!

N.B. The experimental peer manager tries to connect to all possible nodes here -> https://github.com/waku-org/nwaku/blob/0894f0cfea9fac17348607cb50768e17bd788660/waku/node/peer_manager/peer_manager.nim#L750C1-L753C51

We could try to enable it in waku.test?

Good point, thanks!
I think we should properly separate the two networks so that we can see the impact of enabling such an experimental feature.

Contributor

@jm-clius jm-clius left a comment


Mmm. I think we'd need a more comprehensive solution, unless I'm missing something. This ignores the fact that there is a maximum number of connections in general (we can't just connect to everyone). In L670 this proc is exited if we have already reached our max out conn target. Previously we could exceed this, but by no more than 9 (10 minus 1), as we wouldn't attempt more than 10 connections if under this threshold. Now we will continue attempting connections whether we've reached the underlying libp2p conn limit or not.

Furthermore, it's almost more important to ask why the peer store seemingly has so many stored peers that are almost certainly unreachable (given that there are > 700 peers in the store but fewer than 50 good connections). We should also consider having a larger inRelayPeers threshold and maintaining that, something that has been disabled (probably with good reason) higher up in the proc.

@Ivansete-status
Collaborator Author

Thanks for the comments, @jm-clius!

Mmm. I think we'd need a more comprehensive solution, unless I'm missing something. This ignores the fact that there is a maximum number of connections in general (we can't just connect to everyone). In L670 this proc is exited if we have already reached our max out conn target. Previously we could exceed this, but by no more than 9 (10 minus 1), as we wouldn't attempt more than 10 connections if under this threshold. Now we will continue attempting connections whether we've reached the underlying libp2p conn limit or not.

I completely agree. I've replicated the L670 condition within the connection loop in order to keep the same "out" limit (roughly as sketched below).
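
Sketched with the same stub types as in the description's example, and with numOutRelayConns and outRelayPeersTarget as assumed names for the current out-connection count and its target, the guard would sit inside the batch loop like this:

```nim
# Assumed additions to the earlier PeerManager stub:
#   outRelayPeersTarget: int          (target number of out connections)
#   proc numOutRelayConns(pm): int    (current number of out connections)

# Sketch: mirror the L670 exit condition inside the batch loop, so we
# stop dialing as soon as the "out" target is reached.
proc connectToRelayPeers(pm: PeerManager) {.async.} =
  let outsideBackoffPeers = pm.getOutsideBackoffPeers()
  var i = 0
  while i < outsideBackoffPeers.len:
    if pm.numOutRelayConns() >= pm.outRelayPeersTarget:
      return                  # same condition that exits the proc at L670
    let batchEnd = min(i + MaxParallelDials, outsideBackoffPeers.len)
    await pm.connectToNodes(outsideBackoffPeers[i ..< batchEnd])
    i = batchEnd
```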


Furthermore, it's almost more important to ask why the peer store seemingly has so many stored peers that are almost certainly unreachable (given that there are > 700 peers in the store but fewer than 50 good connections).

All those peers in the PeerStore come from Discv5. I don't know whether all of them are reachable or not.


We should also consider having a larger inRelayPeers threshold and maintaining that, something that has been disabled (probably with good reason) higher up in the proc.

Excuse me, but I don't quite get this comment :)

@jm-clius
Contributor

I don't know whether all of them are reachable or not.

Indeed. My point is that we may need a separate investigation into why the discv5 DHT would have so many stale peers. Perhaps it should be tuned to ping peers more regularly, but it could also be an indication that many peers are somehow reachable for discv5 exchanges without being useful on the libp2p layer. In either case, having a very large proportion of "known" peers be essentially useless would cause fragile connectivity, loads of bandwidth/resource overhead, etc. in the network. This will also affect the number of successful "in" connections.

Excuse me, but I don't quite get this comment :)

Similar to how we have a target number of out connections, we also have a target number of in connections; however, the logic that would keep us at this target seems to be disabled higher up in the proc.

Contributor

@jm-clius jm-clius left a comment


With the out target now included, I can approve. However, I'm not convinced that this will necessarily improve the situation (since we'll be dialing in batches of 10 in any case, always waiting for the last connection/dial timeout before attempting to dial a new batch). We may need to monitor the impact on resource usage here as well.
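
To illustrate the batching concern: assuming connectToNodes awaits all of its per-peer dial futures before returning (a simplified guess at its shape, with dialPeer and the 10-second timeout as made-up stand-ins), each batch of 10 lasts as long as its slowest dial.

```nim
import chronos

type
  PeerId = string              # stand-in, as in the earlier sketches
  PeerManager = ref object

proc dialPeer(pm: PeerManager, p: PeerId): Future[void] {.async.} =
  discard                      # hypothetical per-peer dial

# All per-peer futures are awaited together, so one peer hitting the
# dial timeout delays the start of the next batch of 10.
proc connectToNodes(pm: PeerManager, peers: seq[PeerId]) {.async.} =
  var futs: seq[Future[void]]
  for p in peers:
    futs.add pm.dialPeer(p)
  discard await allFutures(futs).withTimeout(10.seconds)
```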

@Ivansete-status Ivansete-status merged commit 2173fe2 into master Mar 19, 2024
14 of 15 checks passed
@Ivansete-status Ivansete-status deleted the dial-issue branch March 19, 2024 18:07