
Peer dialing/kicking system overhaul #3346

Merged (18 commits) on Mar 11, 2022
Conversation

@Menduist (Contributor) commented on Jan 31, 2022

This PR is an ongoing effort to improve peering

The most complex part of this PR is the new force dial mechanism.
Previously, we tried to dial a peer, and if we didn't have room for it, we would kick some peers to make room for the new one. The major downside of this system is that it kicks peers before actually dialing, so it may kick peers and then fail to dial the new ones afterward.

Instead, this PR relies on a small modification in libp2p that allows going over maxPeers when dialing. A regular trimmer then kicks the excess connections.
The tricky part is that we need to balance health against incoming starvation: by sitting above maxPeers most of the time, incoming peers would never be allowed, so over time we'd end up with outgoing connections only, which is not good for the network as a whole.

To compensate, this PR lets a few incoming peers in while we are force dialing.
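As a rough illustration of the idea, a minimal Nim sketch (hypothetical names and thresholds, not the actual nimbus-eth2/nim-libp2p code; the only figure taken from this PR is the max-peers * 2 hard limit mentioned further down):

type
  PeerTracker = object
    connected: int      # current connection count
    maxPeers: int       # soft limit (--max-peers)
    forceDialing: bool  # true while a discovery round is force dialing

proc canForceDial(t: PeerTracker): bool =
  # Outgoing dials may exceed maxPeers, up to the hard limit of maxPeers * 2;
  # the trimmer is responsible for bringing the count back down.
  t.connected < t.maxPeers * 2

proc acceptIncoming(t: PeerTracker): bool =
  # Normally incoming peers are only accepted below maxPeers, but while we
  # are force dialing we let a few in anyway to avoid incoming starvation
  # (the 10% allowance here is purely illustrative).
  if t.connected < t.maxPeers:
    true
  elif t.forceDialing:
    t.connected < t.maxPeers + t.maxPeers div 10
  else:
    false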

A few metrics of it running in a few scenarios, with all subnets:

Mainnet, max-peers=300

[metrics screenshot]
It's really hard to get a healthy set of peers for every subnet on mainnet, even with 300 peers. This node is never able to stabilize; it hovers around 310 peers and spends all its time kicking and dialing new peers.

Prater, max-peers=60

[metrics screenshot]
On Prater, getting healthy peers is easier (bigger nodes on average), so we're stable, hovering around 60 peers, with roughly half incoming and half outgoing, which is good.

Mainnet, max-peers=60

[metrics screenshot]
If getting good health with 300 peers is hard, with 60 it's impossible. Still, the incoming/outgoing balance is good.

Astute readers will have noticed the weird spikes in the last two screenshots. The hard limit of connections is now max-conn * 2, and sometimes a discovery round finds a lot of peers which we dial all at once, until we hit that limit. The kicker then comes in and trims the excess fairly quickly.
EDIT: the spikes were actually caused by the trimmer getting stuck.

@github-actions bot commented on Jan 31, 2022

Unit Test Results

12 files, 829 suites, 30m 6s
1 674 tests: 1 626 passed, 48 skipped, 0 failed
9 775 runs: 9 663 passed, 112 skipped, 0 failed

Results for commit 97a5977. (This comment has been updated with the latest results.)

@arnetheduck (Member) commented:

> The kicker then comes in and trims the excess fairly quickly.

I don't see the corresponding spike in the kicked peers metric under discovery and dialling - is that a display issue in grafana or something else?

# between kicks, so we must not kick too many peers at
# once
for _ in 0..<excessPeers div 2:
  await node.trimConnections(2)

Member:

intuitively, this feels like it's on the aggressive end: consider startup - we just connected to a bunch of peers and they have not necessarily had time to even establish a score / connection / handshake to meaningfully differentiate them, and here we kick them out very aggressively -

Contributor Author:

The metadata pinger is used as a grace period, i.e. we won't kick a peer if we don't have its metadata yet.
This may seem a bit weird, but the grace period is a tricky subject: if it's too short, we'll kick perfectly good peers that haven't had time to get established correctly; if it's too long, we may kick a useful peer because a useless peer is still in its grace period.

The idea is that if we've had time to get its metadata, the connection should be nicely set up, and if the peer doesn't respond to metadata requests, it will eventually get kicked anyway.
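A minimal sketch of that grace-period rule, with hypothetical names and a plain table standing in for the real metadata store (not the actual nimbus-eth2 code):

import std/tables

type
  PeerId = string
  MetaData = object
    seqNumber: uint64

var metadataStore: Table[PeerId, MetaData]

proc inGracePeriod(peerId: PeerId): bool =
  # No metadata yet means the metadata pinger hasn't succeeded for this peer,
  # so it is still in its grace period and must not be kicked.
  peerId notin metadataStore

proc kickCandidates(connected: seq[PeerId]): seq[PeerId] =
  # Only peers whose metadata we already have are eligible for trimming.
  for p in connected:
    if not inGracePeriod(p):
      result.add p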

I did try a smarter trimmer that would only kick peers when we have to, in this spirit:
ed3b2fd#diff-84fc3d73d852761d07624e61621b0784b6f144dbeeb57f001990bd8f7bc5d7a7R1650-R1659

But we ended up stabilizing at 75% of the hard capacity (so stabilized at max-peers * 1.5).
I later tried a softer limit, e.g. max-score = lerp(minScore=300, maxScore=10000, a=0, b=maxPeers, value=excessPeers), but then we just stabilized at a random point between max-peers and max-peers * 2, which I find more confusing than anything. Plus it relies on magic values (300 & 10000).

So anyway, the smart approach proved difficult; I switched back to a dumb one, which is pretty effective, imo.
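For reference, that rejected soft limit could be sketched roughly like this (illustrative only; the 300 and 10000 magic values are the ones quoted above, everything else is made up):

proc lerp(a, b, t: float): float =
  a + (b - a) * t

proc maxScoreToKick(excessPeers, maxPeers: int): float =
  # The further we are above max-peers, the higher a peer's score may be
  # while still being eligible for kicking.
  let t = clamp(excessPeers.float / maxPeers.float, 0.0, 1.0)
  lerp(300.0, 10_000.0, t)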

Member:

I'm thinking this loop could be: while len(node.peerPool) > node.wantedPeers: kick(worst peer); sleep(1) or something like this - basically, we fetch new peers from discovery at a certain rate - we should not kick peers at a higher rate than that (but perhaps there's logic somewhere that I'm missing that keeps the two in balance?)
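A self-contained toy version of that suggestion, with a simulated peer pool instead of the real node state (purely illustrative, not nimbus-eth2 code):

import std/[algorithm, os]

var peerScores = @[10.0, 3.5, 7.2, 1.1, 9.9]   # pretend pool: one score per peer
let wantedPeers = 3

while peerScores.len > wantedPeers:
  peerScores.sort()                  # ascending, so the worst peer comes first
  peerScores = peerScores[1 .. ^1]   # "kick" the worst peer
  sleep(1000)                        # rate limit: at most one kick per second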

Contributor Author:

Actually we don't know the discovery rate; for instance on mainnet we discover a lot more peers than on Prater (because the DHT is shared and there are more nodes on mainnet).
But the result is the same: since only dialing can go over max-peers, if we kick everything over max-peers, we'll kick as many peers as we discover (and successfully dial).

I've thought about moving this to the connect worker, eg:

if canConnect:
  let success = await connect
  if success and len(node.peerPool) > node.wantedPeers:
    kick(worst peer)

But I think the current approach is more "bullet-proof"

Member:

agree, kicking should stay in this separate task - what would the downsides of "rate-limiting" kicking be?

Contributor Author:

rate-limiting being your snippet?

Since we don't know how many peers we discover in each round, it may be too slow. To be safe, it would have to sleep very little, which would be equivalent to the current code.

Some numbers on mainnet:

2022-02-01 11:32:45.011+00:00 discovered_nodes:56
2022-02-01 11:33:06.336+00:00 discovered_nodes:8
2022-02-01 11:33:13.220+00:00 discovered_nodes:30
2022-02-01 11:33:22.452+00:00 discovered_nodes:24

As you can see, both the number of discovered nodes and the duration of the discovery loop vary wildly.

Member:

> To be safe, it would have to sleep very little, which would be equivalent to the current code.

except that it allows the chronos loop to run between each kick as well, which is a good thing in general - how about my loop with some cap on the number of peers kicked in each round - maybe something like max(1, sqrt(peers_in_excess)), which will kick more aggressively when there's lots of overshoot and slow down later.
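That per-round cap might look something like this (a sketch of the suggestion only, not code from the PR):

import std/math

proc peersToKickThisRound(peersInExcess: int): int =
  # Kick aggressively when the overshoot is large, gently when it is small.
  max(1, int(sqrt(float(peersInExcess))))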

Contributor Author:

Last commit switches to

while true:
  let excessPeers = connectedPeers - node.wantedPeers
  if excessPeers > 0: kick(2 worst peers)
  sleep(1.seconds div max(1, excessPeers))

which works quite well.
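To illustrate how that adapts (assuming, as in the snippet above, a sleep of 1 second divided by the excess and 2 peers kicked per iteration):

proc sleepMsBetweenKicks(excessPeers: int): int =
  1000 div max(1, excessPeers)

when isMainModule:
  for excess in [1, 5, 20, 100]:
    # e.g. 20 excess peers -> 50 ms between iterations -> roughly 40 kicks/s
    echo excess, " excess peers -> ", sleepMsBetweenKicks(excess), " ms between kick rounds"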

except CancelledError as exc:
  raise exc
except CatchableError as exc:
  debug "failed to kick peer", peerId, err=exc.msg

Contributor:

Is this something that can really occur? Is the dec toKick then ok on the line below?

Contributor Author (@Menduist), Feb 1, 2022:

Not sure if it can actually happen, but worst case, the next iteration (5 seconds later) will pick up the slack.

Contributor Author:

And I would rather stay safe and overestimate the number of kicked peers than underestimate it.


scores[peer.peerId] = scores[peer.peerId] + thisPeersScore
for peer in node.pubsub.gossipsub.getOrDefault(topic):
  if peer.peerId notin scores: continue

Member:

why do we skip non-stability peers here?

Member:

ok, I read the code wrong - we're looking at peers that nimbus knows about, and filtering out peers that are in gossipsub but not in nimbus - if this happens, it smells of a bug, but moving on:

we don't give score to syncnet peers in the initial loop - this needs addressing as well - also, we sort of prefer "large" peers with lots of subnets over smaller peers with fewer subnets - I suspect this ends up being wrong over time because it gives an unfair advantage to large peers, which are in no way guaranteed to be "better" - it would be good to strive for a random mix of both kinds - perhaps "has_some_interesting_attnet" and "has_some_interesting_syncnet" would be better conditions?

Contributor Author:

Another idea I had: instead of scoring on SUM(topicsPoints), we could score on MAX(topicPoints), wdyt?

At this point the first loop is only here to score peers during sync; ideally I would like to get rid of it, but I'm not sure how atm.

Contributor Author:

> and filtering out peers that are in gossipsub but not in nimbus - if this happens, it smells of a bug, but moving on

It can happen if we don't have metadata for them yet (i.e., they are still in their grace period).

Contributor Author:

Last commit switches to AVG(nonGlobalTopicsPoints).
Global topics are not counted because that would be unfair towards small nodes: those topics always have a low value and would bring the average down.
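A sketch of that scoring rule, with made-up names and data layout (the real implementation lives in this PR's peer scoring code):

import std/tables

proc peerScore(topicPoints: Table[string, float], globalTopics: seq[string]): float =
  # Average the per-topic points over non-global (subnet) topics only, so
  # small nodes aren't dragged down by the always-low global-topic values.
  var sum = 0.0
  var count = 0
  for topic, points in topicPoints:
    if topic in globalTopics:
      continue
    sum += points
    inc count
  if count > 0:
    result = sum / count.float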

@Menduist mentioned this pull request on Feb 1, 2022
@Menduist changed the base branch from unstable to tunepeer1 on February 1, 2022 12:28
@Menduist changed the base branch from tunepeer1 to unstable on February 1, 2022 12:30
@Menduist changed the base branch from unstable to tunepeer1 on February 1, 2022 12:49

@Menduist (Contributor Author) commented on Feb 1, 2022

> I don't see the corresponding spike in the kicked peers metric under discovery and dialling - is that a display issue in grafana or something else?

Indeed, I was wrong; the issue was that the peer trimmer got stuck while disconnecting a peer. I switched to an asyncSpawn to avoid it happening again, and it seems good so far.

Base automatically changed from tunepeer1 to unstable February 1, 2022 17:20

@arnetheduck (Member) commented:

> Indeed, I was wrong; the issue was that the peer trimmer got stuck while disconnecting a peer,

when does it get .. unstuck?

@Menduist (Contributor Author) commented on Feb 4, 2022

It goes back to the discussion we had a while ago: should await switch.disconnect(peer) wait until the peer is actually disconnected, or not?

Currently, it does wait, so if you are disconnecting someone and don't want to wait, you have to asyncSpawn it. I fell into the trap of not doing so, which caused the trimmer to be stuck for a few minutes whenever a slow peer was being kicked.

And I can confirm it has now been running without spikes for 3 days, so that's fixed.
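A minimal chronos-style illustration of the difference (toy code; disconnectPeer here just stands in for switch.disconnect and is not the real API):

import chronos

proc disconnectPeer(peerId: string) {.async.} =
  # Stand-in for switch.disconnect: may take a long time for a slow peer.
  await sleepAsync(2.seconds)
  echo "disconnected ", peerId

proc trimOnce() {.async.} =
  # Before: `await disconnectPeer("peerA")` blocked the trimmer until the
  # slow peer was fully torn down.
  # After: fire and forget, so the trim loop keeps making progress.
  asyncSpawn disconnectPeer("peerA")
  echo "trimmer continues immediately"

waitFor trimOnce()
waitFor sleepAsync(3.seconds)  # keep the program alive so the spawned task finishes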

Base automatically changed from unstable to stable February 15, 2022 20:57
@zah changed the base branch from stable to unstable on February 15, 2022 22:03
@Menduist mentioned this pull request on Feb 25, 2022
@Menduist marked this pull request as ready for review on February 25, 2022 12:54
@Menduist changed the title from "Tune peering" to "Peer dialing/kicking system overhaul" on Feb 25, 2022
Co-authored-by: Jacek Sieka <jacek@status.im>