[2016.3.0] salt-call vs salt '<minion_id>' state.highstate regression #33561
Comments
Here are the results from salt version 2015.8.10:

time salt-call state.highstate

Summary for local
Succeeded: 212
Failed:     0
Total states run: 212

real 0m7.354s

time salt '10.73.0.13' state.highstate

Summary for 10.73.0.13
Succeeded: 212
Failed:     0
Total states run: 212

real 0m7.454s

I'll see about getting some further debug logs from the master and submit them here.
I also tried the "Live Salt-Master Profiling" from https://docs.saltstack.com/en/latest/topics/troubleshooting/master.html, but no logs are being generated. Possibly a separate issue, which I can open if necessary. BTW, killall won't work on salt-master since the parent process is python, so I had to improvise, get all the PIDs for the salt-master, and issue the signal to each one individually.
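For reference, one workaround for the killall limitation is to signal the salt-master PIDs directly; this is a minimal sketch only, assuming SIGUSR2 is the profiling signal described in the linked troubleshooting doc:

# killall matches the process name (python here), so signal each salt-master PID directly.
# Using SIGUSR2 as the profiling trigger is an assumption based on the linked docs.
for pid in $(pgrep -f 'salt-master'); do
    kill -SIGUSR2 "$pid"
done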
#33575 appears to be a duplicate of this.
@seanjnkns I'm not seeing too much of a performance difference between salt-call and salt '<minion_id>' state.highstate. Note: I tested this with the file.managed state I used in issue #33575.
Primarily file.managed, service.running, and pkg.installed checks. I'm also confused that you would mark this as "Info Needed" when you tested issue #33575, which is technically a duplicate, and marked it differently. In my case I have 212 states being run; in your test you had 20 and noticed a 1s timing difference. I'm pretty sure that if you increased that to 200+ states and ran the same test, you'd notice a much larger disparity. Additionally, I apologize if I don't know how to analyze the callgrind file I attached, but did you analyze it, and did it prove useful if you did? BTW, I'm using the same configuration files for both salt-master and salt-minion on 2016.3.0 that I was using on 2015.8.10. If you'd like, I'd be happy to attach those two configurations.
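To make the suggested larger test concrete, here is one hypothetical way to generate a couple hundred trivial states; the SLS path and file targets below are made up for illustration, not taken from this report:

# Hypothetical: write an SLS that expands to ~200 trivial file.managed states
# via a Jinja for loop, then compare the two invocations against it.
mkdir -p /tmp/bench
cat > /srv/salt/manyfiles.sls <<'EOF'
{% for i in range(1, 201) %}
bench_file_{{ i }}:
  file.managed:
    - name: /tmp/bench/file_{{ i }}
    - contents: "benchmark file {{ i }}"
{% endfor %}
EOF

time salt-call state.apply manyfiles
time salt '<minion_id>' state.apply manyfiles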
I'll do a git bisect on this then and see what it turns up. It's obviously a regression that was introduced. |
@seanjnkns A git bisect would be perfect! Thank you for taking the time.
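For context, a bisect between the two releases could be driven roughly as follows; this is only a sketch, with the tag names and the timing comparison assumed from the versions discussed above:

# Sketch of bisecting the salt repo between the known-good and known-bad releases.
git clone https://github.com/saltstack/salt.git && cd salt
git bisect start
git bisect bad v2016.3.0      # slow: salt '<minion_id>' state.highstate takes 2x+ longer
git bisect good v2015.8.10    # fast baseline
# At each step: install this checkout on the master, restart the daemons, then compare
#   time salt-call state.highstate
#   time salt '<minion_id>' state.highstate
# and mark the result with "git bisect good" or "git bisect bad" until the first bad commit is reported.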
OK, here are the results of the bisect. However, I'm going to re-run it: when it got down to the last step there was still a minor disparity between salt-call vs. salt 'minion_id' state.highstate, so I'm going to re-bisect and track that one down too just in case. This one seems to be having the biggest impact thus far:

df97ef8 is the first bad commit
:040000 040000 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 4628850e1245cfd87529b0f699fc434ea5e627c3 M salt
Most interesting. cc: @jacksontj
The other one, causing a ~2s disparity in my case, is similarly related:

6838a95 is the first bad commit
:040000 040000 fec886b46db5460b55eab95d411322ae7f5efc65 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 M salt
What I'm confused about is that these commits seem to point to the Tornado transport, but I'm using the default zeromq transport. Is there some kind of bleed-over between transports?
Although I'm in no way convinced it's a kernel issue, I tested this on the latest 3.10 CentOS 7 kernel and on the 4.4.12 kernel, both with the same results.
Until this regression is resolved, I've reverted all our servers back to 2015.8.10.
@seanjnkns thanks for all of the additional investigation work toward finding where the issue exists. To answer your question: my understanding is that Tornado has been added into parts of salt, but zeromq is still the default transport. For example, the 2015.8.0 release notes describe where Tornado became the default backend.
FYI, I took a test server, updated it to 2016.3.1, applied #33896, and retested. No change: salt-call vs. salt 'minion_id' state.highstate still showed a significant difference in processing time.
Retested this on 2016.3.2, and although there's roughly a 1-1.5s overhead increase compared to 2015.8.11, salt '*' state.highstate and salt-call state.highstate now take about the same time. Posting my results for a single salt master/minion combination with the two versions:

2015.8.11:

2016.3.2:

Given this, I believe we can mark this resolved. Not exactly sure what commit(s) fixed it, but glad it is.
Description of Issue/Question
Timing salt-call state.highstate vs. salt '<minion_id>' state.highstate, the latter can take 2x or more as long. The same results can be seen even when running salt-call vs. salt 'minion_id' state.highstate on the salt-master itself.
Setup
(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
Steps to Reproduce Issue
(Include debug logs if possible and relevant.)
Set up a salt master with a series of states to apply to itself and make the salt-master a minion of itself (the easiest way, though you could just as easily have any number of minions attached to the salt-master). Then just run:

time salt-call state.highstate

vs.

time salt '<salt-master minion_id>' state.highstate
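If it helps, the self-mastered setup can be sketched as follows; the config path and service commands are assumptions for a stock CentOS 7 install, and any state tree with a couple hundred states will do:

# Point the master's own minion at itself and accept its key (sketch; adjust the ID and paths).
echo 'master: 127.0.0.1' >> /etc/salt/minion
systemctl restart salt-minion
salt-key -A -y    # accept the pending local minion key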
You'll notice a significant time difference. Running in debug mode, it appears in my case that there's a long delay in the following:
2016-05-26 15:31:53,200 [salt.utils.event ][DEBUG ][26599] Sending event - data = {'fun_args': ['20160526153147994499'], 'jid': '20160526153153163941', 'return': {'tgt_type': 'glob', 'jid': '20160526153147994499', 'tgt': '10.73.0.13', 'pid': 735, 'ret': '', 'user': 'root', 'arg': [], 'fun': 'state.highstate'}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2016-05-26T21:31:53.200121', 'fun': 'saltutil.find_job', 'id': '10.73.0.13'}
2016-05-26 15:31:53,201 [salt.utils.reactor][DEBUG ][26386] Gathering reactors for tag salt/job/20160526153153163941/ret/10.73.0.13
2016-05-26 15:32:03,308 [salt.client ][DEBUG ][718] Checking whether jid 20160526153147994499 is still running
2016-05-26 15:32:03,308 [salt.transport.zeromq][DEBUG ][718] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
2016-05-26 15:32:03,363 [salt.utils.lazy ][DEBUG ][26600] LazyLoaded local_cache.prep_jid
Note there's 10 seconds between the Gathering reactors for tag... and Checking whether jid...
and, in performing additional tests, it "appears" to be isolated to this section, as shown from the CLI when running the commands mentioned below:
[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Checking whether jid 20160526162529376753 is still running
[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] jid 20160526162529376753 return from 10.73.0.13
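The dead time above sits between the saltutil.find_job poll cycles. One thing worth checking, purely as an assumption on my part rather than something confirmed in this thread, is whether shortening the poll interval via the salt CLI's --gather-job-timeout option changes the gap:

# Hypothesis only: if the idle time is find_job polling, a shorter
# gather-job-timeout should shrink the gap between poll cycles.
time salt --gather-job-timeout=1 '10.73.0.13' state.highstate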
My time test results:
time salt '10.73.0.13' state.highstate
Summary for 10.73.0.13
Succeeded: 212
Failed: 0
Total states run: 212
real 0m20.154s
user 0m0.551s
sys 0m0.074s
time salt-call state.highstate
local:
Summary for local
Succeeded: 212
Failed: 0
Total states run: 212
real 0m9.179s
user 0m6.443s
sys 0m0.996s
Versions Report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Salt Version:
Salt: 2016.3.0
Dependency Versions:
cffi: 0.8.6
cherrypy: 3.2.2
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.3
libgit2: Not Installed
libnacl: 1.4.3
M2Crypto: 0.21.1
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 4.4.4.bs.ufd
system: Linux
version: CentOS Linux 7.2.1511 Core