RFC: Add additional ZMQ tuning parameters necessary for 1k+ minions per master [WIP] #27606

Merged: 1 commit merged into saltstack:develop on Oct 16, 2015

7 participants
@plastikos (Contributor) commented Oct 1, 2015

Do not merge

Start collecting tuning parameters together in the master config file.

Commit (Thayne Harbaugh): Add additional ZMQ tuning parameters necessary for 1,000+ minions per server.
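
For concreteness, a minimal sketch of the kind of tuning block being collected, written as the Python opts dict that Salt builds from the master's YAML config. Only event_publisher_pub_hwm is confirmed by this thread (it appears in the traceback further down); the other keys and all of the values are assumptions for illustration.

    # Illustrative only: ZMQ high-water-mark (HWM) tuning values as they would
    # appear in the master's opts dict once the YAML config is loaded.
    opts = {
        'event_publisher_pub_hwm': 10000,  # event publisher PUB socket send HWM
        'pub_hwm': 1000,                   # publish-channel HWM toward minions (assumed key)
        'salt_event_pub_hwm': 20000,       # master event bus PUB socket HWM (assumed key)
    }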
@plastikos (Contributor) commented on conf/master in 2fe799d, Oct 1, 2015

Something similar should be added to the conf/minion since these parameters are also configurable in the minion.

@plastikos (Contributor) commented on salt/utils/event.py in 2fe799d, Oct 1, 2015

Likely need this same change in AsyncEventPublisher - line 755.
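
As a rough illustration of what the "same change" would look like in an event publisher, a hedged pyzmq sketch; the helper name, the 'event_publisher_pull_hwm' key, and the default values are assumptions, not the actual Salt code.

    import zmq

    # Sketch: apply configured high-water marks when the event publisher
    # creates its sockets instead of leaving them at the ZMQ defaults.
    def _setup_event_sockets(context, opts):
        pub_sock = context.socket(zmq.PUB)
        pull_sock = context.socket(zmq.PULL)
        # Bound outbound queueing toward event subscribers.
        pub_sock.setsockopt(zmq.SNDHWM, opts.get('event_publisher_pub_hwm', 1000))
        # Bound inbound queueing of events pushed to the publisher (assumed key).
        pull_sock.setsockopt(zmq.RCVHWM, opts.get('event_publisher_pull_hwm', 1000))
        return pub_sock, pull_sock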

@plastikos plastikos changed the title from RFC: Add additional ZMQ tuning parameters necessary for 1,000+ minions per server to RFC: Add additional ZMQ tuning parameters necessary for 1k+ minions per master Oct 1, 2015

@plastikos (Contributor) commented Oct 2, 2015

These specific tuning parameters were used with the following configuration:

  • 32 GiB RAM
  • 24 Cores
  • 2.2 GHz
  • 8,400 Minions

It seems that the most significant factors for servicing large numbers of minions are CPU core count and clock speed. If the ZMQ queues get backed up too far into memory, the machine will get swamped, fall behind, and the OOM killer will start whacking processes. Additionally, all of this can be compromised by a slow job cache.
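
To make the memory concern concrete, a back-of-envelope sketch (all numbers are assumed examples, not measurements from this deployment): the worst case a PUB socket can buffer is roughly HWM × subscribers × message size.

    hwm = 1000            # messages ZMQ may queue per subscriber before dropping
    minions = 8400        # subscribers on the publish socket
    avg_msg_bytes = 4096  # assumed average payload size

    worst_case = hwm * minions * avg_msg_bytes
    print('%.1f GiB' % (worst_case / 2.0 ** 30))  # ~32.0 GiB, i.e. all of the RAM above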

@basepi basepi changed the title from RFC: Add additional ZMQ tuning parameters necessary for 1k+ minions per master to RFC: Add additional ZMQ tuning parameters necessary for 1k+ minions per master [WIP] Oct 5, 2015

@basepi (Collaborator) commented Oct 5, 2015

Updated the title in order to make it more obvious this should not be merged without more review.

@jacksontj (Contributor) commented Oct 6, 2015

We should include HWM configuration for the REQ socket as well, since (at least in my experience) that's where most of the HWM issues are (it's the busiest socket).

@plastikos (Contributor) commented Oct 6, 2015

@jacksontj, my instrumented ZMQ has never triggered the HWM for the REQ socket - but maybe I don't have the correct use pattern to cause this. Can you describe a use pattern where the REQ socket will trigger the HWM?

@jacksontj (Contributor) commented Oct 9, 2015

The vast majority of cases end up being that REQ socket (at least in my experience). When you do a publish, it initially goes through the REQ socket (locally). In addition, if a lot of minions return data at close to the same time, that all goes through the REQ socket.

What might be a bit confusing is that I don't mean the zmq.REQ socket on the minion; rather, I mean the router/dealer/req sockets on the master.
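
A rough pyzmq sketch of the master-side path being described, with explicit HWMs on both ends. The TCP port is Salt's usual ret port, but the IPC path, the HWM values, and the overall wiring are simplified assumptions, not the actual master code.

    import zmq

    ctx = zmq.Context()
    router = ctx.socket(zmq.ROUTER)   # receives requests/returns from minions
    dealer = ctx.socket(zmq.DEALER)   # fans work out to the MWorker processes

    router.setsockopt(zmq.RCVHWM, 5000)  # cap the inbound backlog from minions
    router.setsockopt(zmq.SNDHWM, 5000)  # cap replies queued back to minions
    dealer.setsockopt(zmq.SNDHWM, 5000)  # cap work queued toward busy workers

    router.bind('tcp://0.0.0.0:4506')               # minion-facing ret port
    dealer.bind('ipc:///var/run/salt/workers.ipc')  # worker-facing socket (illustrative path)
    zmq.proxy(router, dealer)                       # shuttle messages between the two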

@cachedout (Contributor) commented Oct 12, 2015

@plastikos Are we ready to go on this one?

@plastikos (Contributor) commented Oct 16, 2015

Sure, merge it if you want. This is useful, but not complete. I haven't pushed this because I'm not convinced the problem has been characterized well enough yet for a complete solution.

@cachedout (Contributor) commented Oct 16, 2015

It's been in the queue for a while so I'm just going to merge it as-is and of course follow-ups are more than welcome. :]

Thanks for your good work here, @plastikos

cachedout added a commit that referenced this pull request Oct 16, 2015

Merge pull request #27606 from plastikos/bug-scaling_zmq_hwm
RFC: Add additional ZMQ tuning parameters necessary for 1k+ minions per master [WIP]

@cachedout cachedout merged commit 1f8d65c into saltstack:develop Oct 16, 2015

2 of 5 checks passed

  • jenkins/salt-pr-linode-ubuntu14.04-n: Salt PR - Linode Ubuntu 14.04 #1017 (ABORTED)
  • jenkins/salt-pr-rs-cent7-n: Salt PR - RS CentOS 7 #8491 (ABORTED)
  • default: Merged build finished.
  • jenkins/salt-pr-clone: Salt PR - Clone Repository #9944 (SUCCESS)
  • jenkins/salt-pr-lint-n: Salt PR - Code Lint #9662 (SUCCESS)
@justinta commented Oct 16, 2015

@cachedout @plastikos I have a feeling that this may have broken develop. Most platforms are stacktracing in our tests until I stop them.

Stacktrace:

20:47:15 Process EventPublisher-18:
20:47:15 Traceback (most recent call last):
20:47:15   File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
20:47:15     self.run()
20:47:15   File "/testing/salt/utils/event.py", line 882, in run
20:47:15     self.epub_sock.setsockopt(zmq.SNDHWM, self.opts.get('event_publisher_pub_hwm'))
20:47:15   File "zmq/backend/cython/socket.pyx", line 390, in zmq.backend.cython.socket.Socket.set (zmq/backend/cython/socket.c:4138)
20:47:15 TypeError: expected int, got: None

@plastikos plastikos deleted the plastikos:bug-scaling_zmq_hwm branch Oct 20, 2015

@plastikos plastikos restored the plastikos:bug-scaling_zmq_hwm branch Oct 20, 2015

@plastikos (Contributor) commented Oct 20, 2015

@jtand, your stack trace is due to some objects incorrectly passing opts. I thought all of those locations had been fixed. I'll look into it.

What has happened in the past is that default option values were put into config.py as well as at the location where the option was used, because opts wasn't correctly passed into object initialization. The object initialization should be fixed rather than scattering default values across multiple files.
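
A minimal sketch of the anti-pattern being described versus the fix; the class and option names are illustrative, not actual Salt code.

    # Anti-pattern: the object never receives opts, so a second copy of the
    # default has to live at the point of use as well as in config.py.
    class EventPublisherBroken(object):
        def __init__(self):
            self.hwm = 10000  # duplicated default; drifts out of sync with config.py

    # Fix: pass the loaded opts into object initialization so config.py stays
    # the single source of default values.
    class EventPublisher(object):
        def __init__(self, opts):
            self.opts = opts
            self.hwm = opts['event_publisher_pub_hwm']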

@plastikos (Contributor) commented Oct 20, 2015

@jtand - some of the opts parameter passing was even transposed with other variables. In the past some of it was so bad I'm not sure how anything worked. I'm digging around on this right now.

@cachedout (Contributor) commented Oct 20, 2015

I already fixed this issue earlier today.

plastikos referenced this pull request Oct 20, 2015

Set a fallback HWM
This mostly happens in the test suite when opts are set up by hand, but a safe default is a good idea regardless.
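
A minimal sketch of that kind of fallback, assuming the failure is exactly what the traceback above shows (opts built by hand with no HWM key); the helper name and the default of 1000 are assumptions, not necessarily the committed values.

    import zmq

    def configure_pub_hwm(sock, opts, default=1000):
        # opts.get('event_publisher_pub_hwm') alone returns None when the key
        # is absent, and pyzmq rejects None for SNDHWM (the TypeError above),
        # so fall back to a safe integer instead.
        sock.setsockopt(zmq.SNDHWM, opts.get('event_publisher_pub_hwm', default))

    # Usage sketch: hand-built opts with no HWM key no longer crash.
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    configure_pub_hwm(pub, {})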
@plastikos (Contributor) commented Oct 21, 2015

See #28189

@cachedout (Contributor) commented Nov 20, 2015

@plastikos Would you feel comfortable if we backported this to the 2015.8 branch?

@cachedout (Contributor) commented Jan 6, 2016

@rallytime Can you backport this please?

@rallytime (Contributor) commented Jan 6, 2016

Certainly!

cachedout added a commit that referenced this pull request Jan 13, 2016
