
[2016.3.0] salt-call vs salt '<minion_id>' state.highstate regression #33561

Closed
seanjnkns opened this issue May 26, 2016 · 17 comments
Labels
Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), P1 (Priority 1), severity-high (2nd top severity, seen by most users, causes major problems)
Milestone
Approved

Comments

@seanjnkns
Contributor

Description of Issue/Question

Timing salt-call state.highstate against salt '<minion_id>' state.highstate, the latter can take more than twice as long. The same result is seen even when running both salt-call and salt '<minion_id>' state.highstate on the salt-master itself.

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)
Set up a salt master with a series of states to apply to itself and make the salt-master a minion of itself (the easiest way, though any number of minions attached to the salt-master works just as well). Then run time salt-call state.highstate vs. time salt '<salt-master minion_id>' state.highstate. You'll notice a significant time difference.
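For reference, here is a minimal sketch of such a setup (assuming a single CentOS 7 host with the SaltStack repository already configured and default config paths; the generated file.managed states are illustrative stand-ins, not the actual highstate used here):

yum install -y salt-master salt-minion
echo 'master: 127.0.0.1' > /etc/salt/minion.d/local-master.conf
systemctl start salt-master salt-minion
salt-key -A -y                       # accept the local minion's key

# Generate a couple hundred trivial file.managed states so the run is long
# enough to expose the disparity.
mkdir -p /srv/salt
for i in $(seq 1 212); do
    printf '/tmp/issue33561-%s:\n  file.managed:\n    - contents: state %s\n' "$i" "$i"
done > /srv/salt/many_files.sls
printf 'base:\n  "*":\n    - many_files\n' > /srv/salt/top.sls

# Compare the two invocations.
time salt-call state.highstate
time salt '<salt-master minion_id>' state.highstate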

Running in debug mode, it appears that in my case there is a long delay at the following point:

2016-05-26 15:31:53,200 [salt.utils.event ][DEBUG ][26599] Sending event - data = {'fun_args': ['20160526153147994499'], 'jid': '20160526153153163941', 'return': {'tgt_type': 'glob', 'jid': '20160526153147994499', 'tgt': '10.73.0.13', 'pid': 735, 'ret': '', 'user': 'root', 'arg': [], 'fun': 'state.highstate'}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2016-05-26T21:31:53.200121', 'fun': 'saltutil.find_job', 'id': '10.73.0.13'}
2016-05-26 15:31:53,201 [salt.utils.reactor][DEBUG ][26386] Gathering reactors for tag salt/job/20160526153153163941/ret/10.73.0.13
2016-05-26 15:32:03,308 [salt.client ][DEBUG ][718] Checking whether jid 20160526153147994499 is still running
2016-05-26 15:32:03,308 [salt.transport.zeromq][DEBUG ][718] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
2016-05-26 15:32:03,363 [salt.utils.lazy ][DEBUG ][26600] LazyLoaded local_cache.prep_jid

Note the 10-second gap between the "Gathering reactors for tag..." line and the "Checking whether jid..." line.

In additional tests, the delay appears to be isolated to this section, as shown in the CLI debug output when running the commands mentioned later:

[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Checking whether jid 20160526162529376753 is still running
[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] jid 20160526162529376753 return from 10.73.0.13
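One way to check whether that 10 second gap tracks the CLI's job-polling settings rather than the state run itself (a sketch; -t/--timeout and --gather-job-timeout are standard salt CLI options, and the values below are arbitrary):

# If the delay shrinks and grows with these values, the time is being spent
# in the CLI's saltutil.find_job polling loop (assumption), not in the minion.
time salt -t 2 --gather-job-timeout=2 '10.73.0.13' state.highstate
time salt -t 20 --gather-job-timeout=20 '10.73.0.13' state.highstate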

My time test results:

time salt '10.73.0.13' state.highstate

Summary for 10.73.0.13
Succeeded: 212
Failed:    0
Total states run: 212

real 0m20.154s
user 0m0.551s
sys  0m0.074s

time salt-call state.highstate
local:

Summary for local
Succeeded: 212
Failed:    0
Total states run: 212

real 0m9.179s
user 0m6.443s
sys  0m0.996s

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: 0.8.6
cherrypy: 3.2.2
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.3
libgit2: Not Installed
libnacl: 1.4.3
M2Crypto: 0.21.1
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 4.4.4.bs.ufd
system: Linux
version: CentOS Linux 7.2.1511 Core

@seanjnkns
Contributor Author

seanjnkns commented May 27, 2016

Here's the results from salt version 2015.8.10:

Salt Version:
Salt: 2015.8.10

Dependency Versions:
Jinja2: 2.7.3
M2Crypto: 0.21.1
Mako: 0.8.1
PyYAML: 3.11
PyZMQ: 14.7.0
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
cffi: 0.8.6
cherrypy: 3.2.2
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
libgit2: Not Installed
libnacl: 1.4.3
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: Not Installed
smmap: Not Installed
timelib: Not Installed

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 4.4.4.bs.ufd
system: CentOS Linux 7.2.1511 Core

time salt-call state.highstate
local:

Summary for local
Succeeded: 212
Failed:    0
Total states run: 212

real 0m7.354s
user 0m5.430s
sys  0m0.876s

time salt '10.73.0.13' state.highstate
10.73.0.13:

Summary for 10.73.0.13
Succeeded: 212
Failed:    0
Total states run: 212

real 0m7.454s
user 0m5.430s
sys  0m0.876s

I'll see about getting some further debug logs from the master and submit them here.

@seanjnkns
Contributor Author

I also tried the "Live Salt-Master Profiling" from https://docs.saltstack.com/en/latest/topics/troubleshooting/master.html, but no logs were generated; possibly a separate issue, which I can open if necessary. BTW, killall won't work on salt-master because the parent process shows up as python, so I had to improvise: I collected all of the salt-master PIDs and issued kill -SIGUSR2 <long list of pids> while running the salt-master in debug mode. Let me know if there's any further information I can provide to help determine the cause.
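For anyone else trying the same thing, the improvised signalling amounted to something like this (a sketch; assumes procps pgrep/pkill and a master started with salt-master -l debug):

# killall salt-master matches nothing because the processes show up as python,
# so signal every salt-master process by PID instead:
kill -SIGUSR2 $(pgrep -f 'salt-master')
# or, equivalently, match on the command line directly:
pkill -USR2 -f 'salt-master'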

@seanjnkns
Contributor Author

#33575 appears to be a duplicate of this.

@Ch3LL
Contributor

Ch3LL commented May 27, 2016

@seanjnkns I'm not seeing much of a performance difference between salt-call state.highstate and salt 'minion' state.highstate, and nothing different between 2015.8 and 2015.5. What kind of states are you calling in your highstate?

Note: I tested this with the file.managed state I used in issue #33575.

@Ch3LL added the info-needed (waiting for more info) label May 27, 2016
@Ch3LL added this to the Blocked milestone May 27, 2016
@seanjnkns
Contributor Author

Primarily file.managed, service.running, and pkg.installed checks. I'm also confused that you would mark this as "info-needed" when you tested issue #33575, which is technically a duplicate of this one, and labeled it differently. In my case I have 212 states being applied; in your test you had 20 and noticed a 1 s timing difference. I'm fairly sure that if you increased that to 200+ states and ran the same test, you'd see a much larger disparity.

Additionally (and I apologize, since I don't know how to analyze the callgrind file I attached), did you look at it, and did it prove useful?

By the way, I'm using the same salt-master and salt-minion configuration files for 2016.3.0 that I was using for 2015.8.10. I'd be happy to attach both configurations if you'd like.

@seanjnkns
Contributor Author

I'll do a git bisect on this, then, and see what it turns up; it's clearly a regression introduced somewhere between 2015.8.10 and 2016.3.0.
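Roughly the plan (a sketch; assumes a git checkout of the salt repository, and that each bisect step reinstalls salt and restarts both daemons before re-timing the two commands):

git clone https://github.com/saltstack/salt.git && cd salt
git bisect start
git bisect bad v2016.3.0        # slow release
git bisect good v2015.8.10      # fast release
# at each step the bisect offers:
pip install -e . && systemctl restart salt-master salt-minion
time salt-call state.highstate
time salt '<minion_id>' state.highstate
git bisect good                 # or: git bisect bad, depending on the timings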

@cachedout
Contributor

@seanjnkns A git bisect would be perfect! Thank you for taking the time.

@seanjnkns
Contributor Author

Ok, here are the results of the bisect. I'm going to re-run it, though: when it got down to the last bisect step there was still a minor disparity between salt-call and salt 'minion_id' state.highstate, so I'll re-bisect and track that one down too, just in case. This commit, however, seems to have the biggest impact so far:

df97ef8 is the first bad commit
commit df97ef8
Author: Thomas Jackson jacksontj.89@gmail.com
Date: Sat Oct 17 10:57:35 2015 -0700

Clone all contexts before executing the minion methods

:040000 040000 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 4628850e1245cfd87529b0f699fc434ea5e627c3 M salt

@cachedout
Contributor

Most interesting.

cc: @jacksontj


@seanjnkns
Contributor Author

The other commit, which in my case causes a ~2 s disparity, is similarly related:

6838a95 is the first bad commit
commit 6838a95
Author: Thomas Jackson jacksontj.89@gmail.com
Date: Fri Oct 9 16:11:39 2015 -0700

Initial addition of ContextDict

Inspired by #23373

The basic issue we ran into is that the loader is injecting globals directly into the global namespace. This means that these injected values are not thread or coroutine safe-- meaning we can never do more than one thing at a time. Instead of multiprocessing everything to death-- we can simply use a stack_context to nicely handle this down in the core. As far as the module authors/users are concerned nothing has changed-- but the storage behind the scenes is now per-JID. This same set of classes can easily be used to store additional data (next candidates are reactors, master MWorker tasks, etc.).

:040000 040000 fec886b46db5460b55eab95d411322ae7f5efc65 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 M salt
:040000 040000 7bf3dfa8fd44f50ea79b85d6dbb5068e40464edb 362bd6835694e0e08813032223810e1cbcd84160 M test

@seanjnkns
Contributor Author

What I'm confused about is that these commits seem to point to the Tornado transport, but I'm using the default zeromq transport. Is there some kind of bleed-over between the transports?
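One way to confirm which wire transport is actually in effect (a sketch; config.get is the standard way to read the effective option):

salt-call config.get transport        # expected: zeromq (the default)
grep -i '^transport:' /etc/salt/master /etc/salt/minion 2>/dev/null   # empty if unset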

@seanjnkns
Contributor Author

Although I'm in no way convinced it's a kernel issue, I tested this on the latest 3.10 CentOS 7 kernel and on the 4.4.12 kernel, with the same results on both.

@seanjnkns
Contributor Author

Until this regression is resolved, I've reverted all of our servers back to 2015.8.10.

@Ch3LL added the Bug (broken, incorrect, or confusing behavior), severity-high (2nd top severity, seen by most users, causes major problems), P1 (Priority 1), and Core (relates to code central or existential to Salt) labels, and removed the info-needed (waiting for more info) label Jun 14, 2016
@Ch3LL modified the milestones: Approved, Blocked Jun 14, 2016
@Ch3LL
Contributor

Ch3LL commented Jun 14, 2016

@seanjnkns thanks for all of the additional investigation work toward finding where the issue exists. To answer your question: my understanding is that Tornado has been added to parts of Salt, but zeromq is still the default transport. For example, per the 2015.8.0 release notes, Tornado became the default backend for http.request:

"Tornado is now the default backend for http.request"

@seanjnkns
Contributor Author

FYI, I took a test server, updated it to 2016.3.1, applied #33896, and retested. No change: salt-call vs. salt 'minion_id' state.highstate still showed a significant difference in processing time.

@seanjnkns
Contributor Author

I retested this on 2016.3.2, and although there's roughly a 1-1.5 s overhead increase compared to 2015.8.11, the time difference between salt \* state.highstate and salt-call state.highstate is now the same as on 2015.8.11. Posting my results for a single salt master/minion combination on the two versions:

2015.8.11:
echo 3 > /proc/sys/vm/drop_caches ; time salt-call state.highstate:
real 0m9.643s
user 0m5.477s
sys  0m1.028s

echo 3 > /proc/sys/vm/drop_caches ; time salt \* state.highstate:
real 0m9.117s
user 0m0.457s
sys  0m0.101s

2016.3.2:
echo 3 > /proc/sys/vm/drop_caches ; time salt-call state.highstate:
real 0m10.832s
user 0m5.784s
sys  0m1.312s

echo 3 > /proc/sys/vm/drop_caches ; time salt \* state.highstate:
real 0m10.580s
user 0m0.519s
sys  0m0.102s

Given this, I believe we can mark this resolved. I'm not exactly sure which commit(s) fixed it, but I'm glad it's fixed.
