
Add routing table metrics + tweaks + fixes #261

Merged
merged 1 commit into from Jun 30, 2020

Conversation

@kdeme kdeme commented Jun 29, 2020

  • routing table metrics + option in dcli
  • only forward "seen" nodes on a findNode request
  • setJustSeen & replace on ping AND findnode
  • self lookup only at start
  • revalidate 10x more
  • use bitsPerHop (b) of 5
  • small fix in resolve
  • small fix in bucket split

kdeme commented Jun 30, 2020

This PR basically tackles #183 and #206.

Regarding liveness, peers that are not "seen" yet are no longer forwarded on a FindNode request, as it should be.
They are still returned by the randomNodes call though, as otherwise providing peers at start-up would be much slower (a custom randomNodes call could still filter them out).
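
For illustration, a minimal sketch of this filtering rule (Python, with hypothetical names; the actual implementation is Nim code in this repo):

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    seen: bool = False  # set once a request/response with this node has succeeded

@dataclass
class RoutingTable:
    nodes: list = field(default_factory=list)

    def neighbours(self, target: int, k: int = 16, seen_only: bool = True):
        """Nodes closest to target, used to answer a FindNode request.
        With seen_only=True, unverified nodes are never handed out to others."""
        candidates = [n for n in self.nodes if n.seen or not seen_only]
        return sorted(candidates, key=lambda n: n.node_id ^ target)[:k]

    def random_nodes(self, count: int):
        """Used locally, e.g. to provide peers at start-up; deliberately does
        NOT filter on `seen` (a custom call could still do so)."""
        return random.sample(self.nodes, min(count, len(self.nodes)))
```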

Regarding staleness, I've done some quick testing with allowing nodes that failed a request/response to stay in the routing table until n failures have accumulated. With the somewhat naive implementation of a simple counter on top of the same revalidation code, this actually has a negative effect.
Because of the extra pings required, nodes that are actually alive take much longer to be "seen"; a lot of time is wasted on stale nodes. This is made worse because the network still seems to have a lot of stale nodes in the DHT (probably a combination of some clients not cleaning up stale nodes quickly, and a testnet naturally accumulating many stale nodes; this was tested on witti).
So an alternative solution (e.g. a separate exponential back-off ping that does not obstruct the current revalidation) would be better here; I'll create a ticket for that. As a result, nodes are currently still replaced (or removed) immediately on the first failure.
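
To make the comparison concrete, here is a sketch of one revalidation round with a failure counter (Python, hypothetical structure; not the nim-eth code). With max_failures = 1 it matches the merged behaviour of replacing/removing on the first failure; with a higher value it becomes the naive counter variant that was tested and found to slow down marking live nodes as seen:

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    failures: int = 0

def revalidate_round(bucket: list, ping, max_failures: int = 1) -> None:
    """One revalidation round: ping the least recently seen node of the bucket
    (kept at index 0 in this sketch)."""
    if not bucket:
        return
    node = bucket[0]                      # least recently seen
    if ping(node):
        node.failures = 0
        bucket.append(bucket.pop(0))      # setJustSeen: move to most recently seen end
    else:
        node.failures += 1
        if node.failures >= max_failures:
            bucket.pop(0)                 # replace/remove immediately

# Example: a flaky network where roughly half the pings fail
bucket = [Node(i) for i in range(16)]
revalidate_round(bucket, ping=lambda n: random.random() < 0.5)
```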

Here is the state on witti with the current settings: nodes in the routing table versus "seen" nodes in the routing table.
[Screenshot: Screenshot_2020-06-29-current_state]
The current revalidation frequency is needed to keep the routing table from holding too many unseen nodes.
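
For reference, the two series in the graph above boil down to a pair of gauges like the following (Python with prometheus_client; the metric names here are made up for illustration, not the ones added by this PR):

```python
from prometheus_client import Gauge

# Hypothetical metric names, for illustration only.
routing_table_nodes = Gauge(
    "routing_table_nodes", "Total number of nodes in the routing table")
routing_table_seen_nodes = Gauge(
    "routing_table_seen_nodes", "Nodes in the routing table that have been seen")

def update_routing_table_metrics(table):
    nodes = [n for bucket in table.buckets for n in bucket]
    routing_table_nodes.set(len(nodes))
    routing_table_seen_nodes.set(sum(1 for n in nodes if n.seen))
```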

The setting of bitsPerHop (b) is chosen mostly based on how many nodes are in the DHT, see the following screenshot:
[Screenshot: Screenshot_2020-06-29_withnoselflookup-max1000ms-60s-nostale-2-fixed-2-4-3-b]
This is with settings of b=2, b=4 and b=3. b=4 would probably still be fine. Either way, these can be lowered; it will just give fewer nodes in the routing table. But note that we currently use randomNodes in nbc to get our nodes, so nodes newly discovered through lookups that are not added to the table (because it is full, for example) will not be used.
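
For context on why b matters for table size, here is a sketch of a typical bitsPerHop split rule (an assumption for illustration; routing_table.nim is the authoritative source). A full bucket may split if it covers our own id, or if its depth is not a multiple of bitsPerHop, so a larger b allows more buckets and hence more nodes in the table:

```python
def can_split(bucket_depth: int, covers_own_id: bool,
              bits_per_hop: int = 5, max_depth: int = 256) -> bool:
    """Sketch of a kademlia-style split rule parameterised by bitsPerHop
    (assumed for illustration, not copied from routing_table.nim)."""
    if bucket_depth >= max_depth:
        return False
    return covers_own_id or bucket_depth % bits_per_hop != 0
```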

The self lookup is now only done at startup, to get an initial list of "close" nodes. Running it continuously has not proven to increase the number of seen nodes in the routing table.

kdeme commented Jun 30, 2020

And here is the same when running on altona (these are tests where discovery is launched with only 1 bootstrap node).

[Screenshot: Screenshot_2020-06-30 New dashboard - Grafana]
You can see small drops in the seen nodes happening continuously. Some (most?) of these could probably be avoided by better re-pinging of stale nodes, see the comment above. But as we continuously do lookups, it doesn't matter much currently.

@kdeme kdeme marked this pull request as ready for review June 30, 2020 08:55
zah commented Jun 30, 2020

This looks like quite a complicated way to draw a dinosaur :P

LGTM

@kdeme kdeme merged commit 9a46722 into master Jun 30, 2020
@arnetheduck
Member

What's the cost of a) keeping a node in the table and b) pinging it?

@arnetheduck
Member

also, an option might be to score nodes in the table - ie those that have "recently" been validated get a better score, so stuff like randomnodes returns high-score peers first and low-score peers only if there aren't that many

kdeme commented Jun 30, 2020

What's the cost of a) keeping a node in the table and b) pinging it?

The direct cost of a) is just memory: the ENR size (max 300 bytes) plus the data (duplicated from the ENR) stored in the Node object. Of course there is a hidden cost in that an offline node possibly takes up a spot that a more useful node could occupy. As mentioned before, this could be changed by not using randomNodes but instead calling lookupRandom directly in the nbc networking code. I eventually want to test that more thoroughly.
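
As a rough, purely illustrative estimate of that memory cost (all numbers below are assumptions, not measurements from the implementation):

```python
# Back-of-the-envelope memory estimate for (a).
MAX_ENR_SIZE = 300      # spec maximum size of an ENR, in bytes (EIP-778)
NODE_OVERHEAD = 200     # rough guess for the decoded/duplicated fields in a Node object
BUCKET_SIZE = 16        # assumed nodes per k-bucket
NUM_BUCKETS = 40        # depends on bitsPerHop and on how large the network is

worst_case_bytes = NUM_BUCKETS * BUCKET_SIZE * (MAX_ENR_SIZE + NODE_OVERHEAD)
print(f"~{worst_case_bytes / 1024:.0f} KiB")  # ~312 KiB with these assumed numbers
```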

b) is just a request/response. And if it is the first ping, a handshake plus a request/response (if we move to an LRU cache for the secrets, it is possibly not just the first ping that does a handshake). I can't give timing numbers as I haven't benchmarked this, but it should be small. I think the Go implementation benchmarked the handshake part.

kdeme commented Jun 30, 2020

also, an option might be to score nodes in the table - ie those that have "recently" been validated get a better score, so stuff like randomnodes returns high-score peers first and low-score peers only if there aren't that many

Nodes are ordered by least recently seen within a bucket, so you could easily get the least recently validated nodes first if you want. And the order would change all the time as long as revalidation runs at a short interval, so that is perhaps enough randomisation; not sure though. I guess it also depends on the application: whether you want to spend more trial-and-error connections in exchange for having the maximum number of nodes, or connect to as few as possible but be more certain they will work (e.g. on mobile).
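
A sketch of what that ordering gives for free (Python, hypothetical names, not the nim-eth KBucket): setJustSeen moves a responsive node to the "most recently seen" end, so handing out nodes from that end first approximates the scoring idea:

```python
from collections import deque

class KBucket:
    """Nodes kept ordered from least recently seen (front) to most recently
    seen (back); an illustrative sketch only."""
    def __init__(self):
        self.nodes = deque()

    def set_just_seen(self, node):
        # Called after a successful request/response (ping or findNode):
        # move the node to the "most recently seen" end.
        if node in self.nodes:
            self.nodes.remove(node)
        self.nodes.append(node)

    def by_freshness(self):
        # Most recently validated first: roughly what a score-aware
        # randomNodes could hand out before falling back to staler entries.
        return list(reversed(self.nodes))
```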

@kdeme kdeme deleted the discv5-metrics branch July 9, 2020 07:17