High memory/CPU utilization for moderately sized cluster #108

Closed
jacksontj opened this issue Aug 21, 2019 · 5 comments

@jacksontj

I have a cluster of ~300 nodes and mesh is consuming ~1-2 GB of RAM. I dug into it and the memory is all being consumed by the topology gossip messages. Upon further inspection I found that each gossip message includes every known peer, which means the message sizes (and therefore the memory and CPU needed to generate them) scale with the cluster size.
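For illustration, here is a minimal sketch of why a gossip payload that enumerates every known peer grows with the number of peers. The type names and encoding are hypothetical, not mesh's actual wire format:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Hypothetical topology update: every gossip round carries an entry for
// every peer the sender knows about, so the payload grows with the cluster.
type peerEntry struct {
	Name    string
	UID     uint64
	Version uint64
}

type topologyUpdate struct {
	Peers []peerEntry
}

func main() {
	for _, n := range []int{50, 150, 300} {
		update := topologyUpdate{Peers: make([]peerEntry, n)}
		for i := range update.Peers {
			update.Peers[i] = peerEntry{Name: fmt.Sprintf("node-%03d", i), UID: uint64(i), Version: 1}
		}
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(update); err != nil {
			panic(err)
		}
		fmt.Printf("%d peers -> %d bytes per update\n", n, buf.Len())
	}
}
```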

Are there any plans to implement a more scalable topology gossip?

@jacksontj changed the title from "High memory utilization for moderately sized cluster" to "High memory/CPU utilization for moderately sized cluster" on Aug 21, 2019
@bboreham
Contributor

The idea is that gossip messages are not sent very often - Weave Mesh selects log(number of connections) peers to send to - so the scalability should be good.
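As a sketch of that idea (not Weave Mesh's actual selection code), picking roughly log2(N) of the current connections keeps the per-round fanout small even for large clusters:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// chooseFanout picks roughly log2(len(conns)) random connections to gossip
// to, so the number of messages sent per round grows slowly with cluster
// size. This is an illustration of the idea, not mesh's implementation.
func chooseFanout(conns []string) []string {
	n := len(conns)
	if n == 0 {
		return nil
	}
	k := int(math.Ceil(math.Log2(float64(n + 1))))
	perm := rand.Perm(n)
	out := make([]string, 0, k)
	for _, i := range perm[:k] {
		out = append(out, conns[i])
	}
	return out
}

func main() {
	conns := make([]string, 300)
	for i := range conns {
		conns[i] = fmt.Sprintf("peer-%03d", i)
	}
	// With 300 connections this selects only ~9 gossip targets per round.
	fmt.Println("gossip targets this round:", chooseFanout(conns))
}
```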

However, various other issues in the code mean that messages are sent far more often than this ideal. #101, #106 and #107 are attempts to improve matters, though work is ongoing to understand the full set of causes.

@jacksontj
Author

In my use case of protokube (kubernetes/kops#7427) I'm seeing ~2 cores of CPU usage and ~3 GB of RAM usage with a fully connected mesh of ~300 nodes. This seems to highlight some serious scale limitations of weaveworks/mesh -- that isn't even a very large cluster. More importantly, the utilization ramp-up was more or less exponential as more nodes were added.

Even after I made a custom build with the #107 fix, CPU usage only dropped to 1.6 cores -- which is still far too much (all the CPU time was being spent marshaling/unmarshaling the peer list being gossiped around).

There seem to be quite a few issues; to name a couple: (1) there is no concept of a "suspect" state, and (2) peer messages include the list of all peers it has connected to -- which scales with cluster size. There are likely more, but TBH I have decided to instead spend my time switching protokube to a more robust/reliable gossip library.
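For reference, here is a minimal sketch of the kind of SWIM-style "suspect" state machine being referred to. This is a generic illustration of the technique (not anything in weaveworks/mesh, and not concurrency-safe as written): a failed probe marks a peer suspect, and it is only declared dead if no refutation arrives within a timeout.

```go
package main

import (
	"fmt"
	"time"
)

type memberState int

const (
	alive memberState = iota
	suspect
	dead
)

type member struct {
	state   memberState
	suspTim *time.Timer
}

// probeFailed moves an alive peer into the suspect state and schedules it to
// be declared dead unless a refutation arrives before the timeout.
func (m *member) probeFailed(timeout time.Duration, onDead func()) {
	if m.state != alive {
		return
	}
	m.state = suspect
	m.suspTim = time.AfterFunc(timeout, func() {
		m.state = dead
		onDead()
	})
}

// refute is called when the suspected peer proves it is still alive.
func (m *member) refute() {
	if m.state == suspect {
		m.suspTim.Stop()
		m.state = alive
	}
}

func main() {
	m := &member{state: alive}
	m.probeFailed(500*time.Millisecond, func() { fmt.Println("peer declared dead") })
	time.Sleep(time.Second) // no refutation arrives, so the peer is declared dead
	fmt.Println("final state:", m.state)
}
```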

@bboreham
Contributor

We're seeing quite positive results from deferring gossip updates - #117 and #118.
Would you like to try those in your build?
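The general idea behind deferring gossip updates can be sketched as a small coalescing helper (a hypothetical illustration, not the actual change in #117/#118): rapid topology changes set a pending flag and collapse into a single delayed send instead of one message per change.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gossipDeferrer coalesces topology changes: instead of sending one gossip
// message per change, each change sets a pending flag and a single send
// fires after a short delay.
type gossipDeferrer struct {
	mu      sync.Mutex
	pending bool
	delay   time.Duration
	send    func()
}

func (d *gossipDeferrer) noteChange() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending {
		return // a send is already scheduled; fold this change into it
	}
	d.pending = true
	time.AfterFunc(d.delay, func() {
		d.mu.Lock()
		d.pending = false
		d.mu.Unlock()
		d.send()
	})
}

func main() {
	d := &gossipDeferrer{delay: 100 * time.Millisecond, send: func() { fmt.Println("gossip sent") }}
	for i := 0; i < 50; i++ { // 50 rapid changes collapse into one send
		d.noteChange()
	}
	time.Sleep(300 * time.Millisecond)
}
```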

@bboreham
Contributor

bboreham commented Nov 4, 2019

> peer messages include the list of all peers it has connected

It's worse than that - the topology message lists all the connections of all peers. In other words, in a fully connected cluster the message size is O(N^2).

However, for 300 nodes that might be ~8MB per message, so something else is needed to explain 1-2GB.
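Back-of-envelope for that figure, assuming on the order of ~90 bytes per encoded connection record (an assumption for illustration, not a measured number):

```go
package main

import "fmt"

func main() {
	const (
		nodes        = 300
		bytesPerConn = 90 // assumed size of one encoded connection record
	)
	// Fully connected: each of the N peers reports ~N-1 connections,
	// and the topology message carries all of them.
	records := nodes * (nodes - 1)
	fmt.Printf("~%d connection records, ~%.1f MB per topology message\n",
		records, float64(records*bytesPerConn)/(1<<20))
}
```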

We found that each connection would read a full message and then block on the Peers lock to apply the update, holding the decoded message in memory while waiting. So with 200 connections that's roughly 200 × 8MB ≈ 1.6GB.
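A sketch of that memory amplification pattern (illustrative only, not the actual mesh code): each connection's reader decodes a full update and then queues on a shared lock, so every queued reader keeps its own copy of the message alive at the same time. Note that running this really does allocate ~1.6GB.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const (
		connections = 200
		messageSize = 8 << 20 // ~8MB per topology message, as estimated above
	)
	var (
		peersLock sync.Mutex // stands in for the Peers lock
		wg        sync.WaitGroup
	)
	for i := 0; i < connections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			msg := make([]byte, messageSize) // decoded update held by this reader
			peersLock.Lock()                 // all readers queue up here
			msg[0] = 1                       // "apply" the update; msg stays live while queued
			time.Sleep(5 * time.Millisecond)
			peersLock.Unlock()
		}()
	}
	wg.Wait()
	fmt.Printf("peak ≈ %d MB while updates are queued\n", connections*messageSize>>20)
}
```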

Changing the "everyone sends everything" behaviour is quite a big change, so ahead of that I felt that just slowing down the initial connections would help - #124. After the initial connection, updates only go to log(N) peers, so we shouldn't get the massive spikes.

@bboreham
Contributor

bboreham commented Nov 5, 2019

I'm going to close this issue now that 0.4 is released - if you want to come back to the discussion, please do.

@bboreham bboreham closed this as completed Nov 5, 2019