Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase ZT_MAX_PEER_NETWORK_PATHS to 128 #1891

Closed
wants to merge 1 commit into from
Closed

Conversation

joseph-henry
Copy link
Contributor

@joseph-henry joseph-henry commented Mar 7, 2023

Too many paths

Historically if ZeroTier received a packet on a path that had the same IP but a different port it would simply overwrite that entry and learn the new path (to a max of 16). This worked well enough for simple cases. In fact, the vast majority of cases. However if we look at the logs of those early versions you could see ZeroTier constantly learning, forgetting, and re-learning the same paths. Even in this state it wasn't too big of a problem since it didn't prevent communication (the path would be re-learnt for each incoming packet just in time) and didn't eat up much CPU, but it was still an undesirable state.

In response to this I changed how we learn paths and bumped the max allowed paths to 64 and now ZeroTier will accumulate more paths (this isn't the dupe path issue), and it will only overwrite an entry if there are no slots left, even when it overwrites a path it will prioritize based on similarity to the new path. This seems to work a lot better but the side effect is that we end up with huge numbers of paths when people use multiple interfaces with multiple addresses. In most cases this is fine. However we're seeing people reporting high CPU usage when instances of ZeroTier are running into the 64 max cap since people are using multiple interfaces with multiple addresses on both ends. (See: https://discuss.zerotier.com/t/constant-high-cpu-usage-on-raspberry-pi-4/11533/32)

Here are the options as I see it:

  1. Ignore reports and let the high CPU bugs persist.

  2. Once again bump the max path count to 128 or whatever. This doesn't solve the issue but does kick the can down the road. Even though a linear search happens for each incoming/outgoing packet I don't think it's a significant performance bottleneck and even if it is, it would only affect a small number of people filling up the array. One additional side effect would be an increased memory footprint that wouldn't be severe but possibly an issue on memory constrained environments like mobile where memory per process is absurdly low.

  3. introduce a learning backoff timer but if we aren't learning all the paths possible then there might be missed traffic.

  4. Forget paths more quickly

My inclination would be to bump it to 128 and confirm that it doesn't cause too much additional memory usage and possibly tune ZeroTier to forget paths more quickly.

Thoughts?

@laduke
Copy link
Contributor

laduke commented Mar 7, 2023

No to idea 1 please :)


If you're on wifi and ethernet, and have ipv6, you have:

2 ipv4 addresses
4 ipv6 addresses (at least?) ULA and Public

  • 3 ports
    6 * 3 = 18 listening sockets
    18 * 18 = 324 paths. 😬

is 128 enough?

@joseph-henry
Copy link
Contributor Author

is 128 enough?

I wanted to be conservative in the increase and let people re-complain if needed. I'm not opposed to a higher number if we can prove it doesn't degrade performance or consume too much memory.

@Crest
Copy link

Crest commented Mar 7, 2023

If doubling the number of tracked paths increases the (almost) idle network CPU base load you're closing in on getting CPU bound on an Apple M1 Max running macOS Ventura. Please spend the resources to understand the problem. The 1.10.3 release is unusable in production networks for me because of the drastic performance regressions going as far as effectively cutting the connection between certain nodes on my network all while burning through the battery runtime faster than Chrome with dozens of tabs of social media sides (auto-)playing noise.

I have serval questions and don't know the best way to interact with a source available project as both an open source developer as well as a paying customer. These are the questions that I asked myself looking at this problem:

  • Have you understood why ZeroTier nodes learns more than 64 (incoming) paths from a node?
  • What are the chances a tracked path is useful for reaching a node?
  • How much state/history associated with a path is worth tracking to rank them?
  • Have you tried different eviction strategies rather than increasing the upper limit?
  • What do you expect to improve by tracking more than 64 (incoming) networks paths per node?
  • What are the additional up to 64 learned network paths useful for and how expensive would it be to try all up to 128 of paths?
  • How many paths can reasonably be expected to exist over what time frame between pairs of nodes? What fraction of them needs to be tracked and how costly would it be in different categories (CPU time, system calls to classify them, memory consumption, ...)?
  • How much and why did the system call and CPU time per received (and transmitted) packet change between recent releases (on all supported platforms)?

Please don't just bump the limit hoping the problem will go away and not come back.

@laduke
Copy link
Contributor

laduke commented Mar 7, 2023

Would it be hard to make it runtime configurable in local.conf? So embedded people can set it to (low number).
I'm biased towards it just working for everyday users and letting engineers fiddle with fiddly bits.

@erikh
Copy link
Contributor

erikh commented Mar 7, 2023 via email

@joseph-henry
Copy link
Contributor Author

If doubling the number of tracked paths increases the (almost) idle network CPU base load you're closing in on getting CPU bound on an Apple M1 Max running macOS Ventura. Please spend the resources to understand the problem.

Of course we are. This is why I've asked you for more details in the linked discussion. Thanks again btw.

Have you understood why ZeroTier nodes learns more than 64 (incoming) paths from a node?

Yes. This was explained in the linked discussion.

What are the chances a tracked path is useful for reaching a node?

Depends entirely on your unique network conditions. Impossible for us to know.

How much state/history associated with a path is worth tracking to rank them?

I'm not sure what you mean. Can you rephrase this if it wasn't answered elsewhere?

Have you tried different eviction strategies rather than increasing the upper limit?

Probably a more restrictive aliveness timer based on the last packet received. Mentioned in the PR but not yet included.

What do you expect to improve by tracking more than 64 (incoming) networks paths per node?

This is described in the PR

What are the additional up to 64 learned network paths useful for?

That is unique to each case. We cannot know.

How expensive would it be to try all up to 128 of paths?

Unknown right now which is why this is a PR on the dev branch to receive testing

How many paths can reasonably be expected to exist over what time frame between pairs of nodes? What fraction of them needs to be tracked and how costly would it be in different categories (CPU time, system calls to classify them, memory consumption, ...)?

The overhead should be insignificant and nearly undetectable. The spamming of gitifaddrs is a separate problem that will require a separate solution but will be introduced in tandem with this in the next release ideally.

How much and why did the system call and CPU time per received (and transmitted) packet change between recent releases (on all supported platforms)?

This performance degradation is unrelated to path learning, it is likely related to the fixing of another bug.

Please don't just bump the limit hoping the problem will go away and not come back.

This isn't what is suggested. This PR is a work in progress and more changes will eventually be included before it is merged.

@joseph-henry
Copy link
Contributor Author

Would it be hard to make it runtime configurable in local.conf? So embedded people can set it to (low number).
I'm biased towards it just working for everyday users and letting engineers fiddle with fiddly bits.

Very possible. But the advantage of a statically allocated array would be less performance overhead per packet. However this could be minimal if we find the right tricks. Worth testing.

@Boilerplate4u
Copy link

Just pure speculative but it seem the same problem is hitting android as well. A root cause analysis with a potential fix other than max paths would be welcome.

https://discuss.zerotier.com/t/android-zerotier-keep-stop-working-after-1-10-3-update

@joseph-henry
Copy link
Contributor Author

Closing this and continuing in a different PR: #1914

@joseph-henry joseph-henry deleted the dev-maxpaths-fix branch March 29, 2023 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants