
Add routing table metrics + tweaks + fixes #261

Merged
merged 1 commit into from Jun 30, 2020

Conversation

@kdeme kdeme commented Jun 29, 2020

  • routing table metrics + option in dcli
  • only forward "seen" nodes on a findNode request
  • setJustSeen & replace on ping AND findnode
  • self lookup only at start
  • revalidate 10x more
  • use bitsPerHop (b) of 5
  • small fix in resolve
  • small fix in bucket split

kdeme commented Jun 30, 2020

This PR basically tackles #183 and #206.

Regarding liveness, peers that are not "seen" yet are no longer forwarded on a FindNode request, as it should be.
They are still returned by the randomNodes call though, as otherwise providing peers at start-up would be much slower (a custom randomNodes call could still filter them out).
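
For illustration, a minimal sketch of this filtering rule (Python, with hypothetical names; the actual implementation is Nim code in this repo):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    seen: bool = False  # set once a request/response with this node has succeeded

@dataclass
class RoutingTable:
    nodes: list = field(default_factory=list)

    def neighbours(self, target: int, k: int = 16, seen_only: bool = True):
        """Nodes closest to target, used to answer a FindNode request.
        With seen_only=True, unverified nodes are never handed out to others."""
        candidates = [n for n in self.nodes if n.seen or not seen_only]
        return sorted(candidates, key=lambda n: n.node_id ^ target)[:k]

    def random_nodes(self, count: int):
        """Used locally, e.g. to provide peers at start-up; deliberately does
        NOT filter on `seen` (a custom call could still do so)."""
        return random.sample(self.nodes, min(count, len(self.nodes)))
```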

Regarding staleness, I've done some quick testing with allowing nodes that failed a request/response to stay in the routing table until n failures have accumulated. With the somewhat naive implementation of a simple counter on top of the same revalidation code, this actually has a negative effect.
Because of the extra pings required, nodes that are actually alive take much longer to be "seen"; a lot of time is wasted on stale nodes. This is made worse because the network still seems to have a lot of stale nodes in the DHT (probably a combination of some clients not cleaning up stale nodes quickly, and a testnet naturally accumulating many stale nodes; this was tested on witti).
So an alternative solution (e.g. a separate exponential back-off ping that does not obstruct the current revalidation) would be better here; I'll create a ticket for that. As a result, nodes are currently still replaced (or removed) immediately on the first failure.
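
To make the comparison concrete, here is a sketch of one revalidation round with a failure counter (Python, hypothetical structure; not the nim-eth code). With max_failures = 1 it matches the merged behaviour of replacing/removing on the first failure; with a higher value it becomes the naive counter variant that was tested and found to slow down marking live nodes as seen:

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    failures: int = 0

def revalidate_round(bucket: list, ping, max_failures: int = 1) -> None:
    """One revalidation round: ping the least recently seen node of the bucket
    (kept at index 0 in this sketch)."""
    if not bucket:
        return
    node = bucket[0]                      # least recently seen
    if ping(node):
        node.failures = 0
        bucket.append(bucket.pop(0))      # setJustSeen: move to most recently seen end
    else:
        node.failures += 1
        if node.failures >= max_failures:
            bucket.pop(0)                 # replace/remove immediately

# Example: a flaky network where roughly half the pings fail
bucket = [Node(i) for i in range(16)]
revalidate_round(bucket, ping=lambda n: random.random() < 0.5)
```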

Here is the state on witti with the current settings: nodes in the routing table versus "seen" nodes in the routing table.
[Screenshot: Screenshot_2020-06-29-current_state]
The current revalidation frequency is needed to keep the routing table from holding too many unseen nodes.
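
For reference, the two series in the graph above boil down to a pair of gauges like the following (Python with prometheus_client; the metric names here are made up for illustration, not the ones added by this PR):

```python
from prometheus_client import Gauge

# Hypothetical metric names, for illustration only.
routing_table_nodes = Gauge(
    "routing_table_nodes", "Total number of nodes in the routing table")
routing_table_seen_nodes = Gauge(
    "routing_table_seen_nodes", "Nodes in the routing table that have been seen")

def update_routing_table_metrics(table):
    nodes = [n for bucket in table.buckets for n in bucket]
    routing_table_nodes.set(len(nodes))
    routing_table_seen_nodes.set(sum(1 for n in nodes if n.seen))
```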

The setting of bitsPerHop (b) is chosen mostly based on how many nodes are in the DHT, see the following screenshot:
[Screenshot: Screenshot_2020-06-29_withnoselflookup-max1000ms-60s-nostale-2-fixed-2-4-3-b]
This is with settings of b=2, b=4 and b=3. b=4 would probably still be fine. Either way, these can be lowered; it will just give fewer nodes in the routing table. But note that we currently use randomNodes in nbc to get our nodes, so nodes newly discovered through lookups that are not added to the table (because it is full, for example) will not be used.
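
For context on why b matters for table size, here is a sketch of a typical bitsPerHop split rule (an assumption for illustration; routing_table.nim is the authoritative source). A full bucket may split if it covers our own id, or if its depth is not a multiple of bitsPerHop, so a larger b allows more buckets and hence more nodes in the table:

```python
def can_split(bucket_depth: int, covers_own_id: bool,
              bits_per_hop: int = 5, max_depth: int = 256) -> bool:
    """Sketch of a kademlia-style split rule parameterised by bitsPerHop
    (assumed for illustration, not copied from routing_table.nim)."""
    if bucket_depth >= max_depth:
        return False
    return covers_own_id or bucket_depth % bits_per_hop != 0
```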

The self lookup is now only done at startup, to get an initial list of "close" nodes. Running it continuously has not proven to increase the number of seen nodes in the routing table.

kdeme commented Jun 30, 2020

And here is the same when running on altona (these are tests where discovery is launched with only 1 bootstrap node).

[Screenshot: Screenshot_2020-06-30 New dashboard - Grafana]
You can see small drops in the seen nodes happening continuously. Some (most?) of these could probably be avoided by better re-pinging of stale nodes, see the comment above. But as we continuously do lookups, it doesn't matter much currently.

@kdeme kdeme marked this pull request as ready for review June 30, 2020 08:55
zah commented Jun 30, 2020

This looks like quite a complicated way to draw a dinosaur :P

LGTM

@kdeme kdeme merged commit 9a46722 into master Jun 30, 2020
@arnetheduck
Member

What's the cost of a) keeping a node in the table and b) pinging it?

@arnetheduck
Member

also, an option might be to score nodes in the table - ie those that have "recently" been validated get a better score, so stuff like randomnodes returns high-score peers first and low-score peers only if there aren't that many

kdeme commented Jun 30, 2020

What's the cost of a) keeping a node in the table and b) pinging it?

The direct cost of a) is just memory: the ENR size (max 300 bytes) plus the data (duplicated from the ENR) stored in the Node object. Of course there is a hidden cost in that an offline node possibly takes up a spot that a more useful node could occupy. As mentioned before, this could be changed by not using randomNodes but instead calling lookupRandom directly in the nbc networking code. I eventually want to test that more thoroughly.
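
As a rough, purely illustrative estimate of that memory cost (all numbers below are assumptions, not measurements from the implementation):

```python
# Back-of-the-envelope memory estimate for (a).
MAX_ENR_SIZE = 300      # spec maximum size of an ENR, in bytes (EIP-778)
NODE_OVERHEAD = 200     # rough guess for the decoded/duplicated fields in a Node object
BUCKET_SIZE = 16        # assumed nodes per k-bucket
NUM_BUCKETS = 40        # depends on bitsPerHop and on how large the network is

worst_case_bytes = NUM_BUCKETS * BUCKET_SIZE * (MAX_ENR_SIZE + NODE_OVERHEAD)
print(f"~{worst_case_bytes / 1024:.0f} KiB")  # ~312 KiB with these assumed numbers
```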

b) is just a request/response. And if it is the first ping, a handshake plus a request/response (if we move to an LRU cache for the secrets, it is possibly not just the first ping that does a handshake). I can't give timing numbers as I haven't benchmarked this, but it should be small. I think the Go implementation benchmarked the handshake part.

kdeme commented Jun 30, 2020

also, an option might be to score nodes in the table - ie those that have "recently" been validated get a better score, so stuff like randomnodes returns high-score peers first and low-score peers only if there aren't that many

Nodes are ordered by least recently seen within a bucket, so you could easily get the least recently validated nodes first if you want. And the order would change all the time as long as revalidation runs at a short interval, so that is perhaps enough randomisation; not sure though. I guess it also depends on the application: whether you want to spend more trial-and-error connections in exchange for having the maximum number of nodes, or connect to as few as possible but be more certain they will work (e.g. on mobile).
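
A sketch of what that ordering gives for free (Python, hypothetical names, not the nim-eth KBucket): setJustSeen moves a responsive node to the "most recently seen" end, so handing out nodes from that end first approximates the scoring idea:

```python
from collections import deque

class KBucket:
    """Nodes kept ordered from least recently seen (front) to most recently
    seen (back); an illustrative sketch only."""
    def __init__(self):
        self.nodes = deque()

    def set_just_seen(self, node):
        # Called after a successful request/response (ping or findNode):
        # move the node to the "most recently seen" end.
        if node in self.nodes:
            self.nodes.remove(node)
        self.nodes.append(node)

    def by_freshness(self):
        # Most recently validated first: roughly what a score-aware
        # randomNodes could hand out before falling back to staler entries.
        return list(reversed(self.nodes))
```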

@kdeme kdeme deleted the discv5-metrics branch July 9, 2020 07:17