
[2016.3.0] salt-call vs salt '<minion_id>' state.highstate regression #33561

Closed
seanjnkns opened this issue May 26, 2016 · 17 comments
Labels
Bug (broken, incorrect, or confusing behavior), Core (relates to code central or existential to Salt), P1 (Priority 1), severity-high (2nd top severity, seen by most users, causes major problems)
Milestone
Approved

Comments

@seanjnkns
Contributor

Description of Issue/Question

Timing salt-call state.highstate against salt '<minion_id>' state.highstate, the latter can take more than twice as long. The same result is seen even when running both salt-call and salt '<minion_id>' state.highstate on the salt-master itself.

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)
Set up a salt master with a series of states to apply to itself and make the salt-master a minion of itself (the easiest way, though any number of minions attached to the salt-master works just as well). Then run time salt-call state.highstate vs. time salt '<salt-master minion_id>' state.highstate. You'll notice a significant time difference.
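For reference, here is a minimal sketch of such a setup (assuming a single CentOS 7 host with the SaltStack repository already configured and default config paths; the generated file.managed states are illustrative stand-ins, not the actual highstate used here):

yum install -y salt-master salt-minion
echo 'master: 127.0.0.1' > /etc/salt/minion.d/local-master.conf
systemctl start salt-master salt-minion
salt-key -A -y                       # accept the local minion's key

# Generate a couple hundred trivial file.managed states so the run is long
# enough to expose the disparity.
mkdir -p /srv/salt
for i in $(seq 1 212); do
    printf '/tmp/issue33561-%s:\n  file.managed:\n    - contents: state %s\n' "$i" "$i"
done > /srv/salt/many_files.sls
printf 'base:\n  "*":\n    - many_files\n' > /srv/salt/top.sls

# Compare the two invocations.
time salt-call state.highstate
time salt '<salt-master minion_id>' state.highstate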

Running in debug mode, it appears that in my case there is a long delay at the following point:

2016-05-26 15:31:53,200 [salt.utils.event ][DEBUG ][26599] Sending event - data = {'fun_args': ['20160526153147994499'], 'jid': '20160526153153163941', 'return': {'tgt_type': 'glob', 'jid': '20160526153147994499', 'tgt': '10.73.0.13', 'pid': 735, 'ret': '', 'user': 'root', 'arg': [], 'fun': 'state.highstate'}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2016-05-26T21:31:53.200121', 'fun': 'saltutil.find_job', 'id': '10.73.0.13'}
2016-05-26 15:31:53,201 [salt.utils.reactor][DEBUG ][26386] Gathering reactors for tag salt/job/20160526153153163941/ret/10.73.0.13
2016-05-26 15:32:03,308 [salt.client ][DEBUG ][718] Checking whether jid 20160526153147994499 is still running
2016-05-26 15:32:03,308 [salt.transport.zeromq][DEBUG ][718] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
2016-05-26 15:32:03,363 [salt.utils.lazy ][DEBUG ][26600] LazyLoaded local_cache.prep_jid

Note the 10-second gap between the "Gathering reactors for tag..." line and the "Checking whether jid..." line.

In additional tests, the delay appears to be isolated to this section, as shown in the CLI debug output when running the commands mentioned later:

[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] Checking whether jid 20160526162529376753 is still running
[DEBUG ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/master', '10.73.0.13_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG ] jid 20160526162529376753 return from 10.73.0.13
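One way to check whether that 10 second gap tracks the CLI's job-polling settings rather than the state run itself (a sketch; -t/--timeout and --gather-job-timeout are standard salt CLI options, and the values below are arbitrary):

# If the delay shrinks and grows with these values, the time is being spent
# in the CLI's saltutil.find_job polling loop (assumption), not in the minion.
time salt -t 2 --gather-job-timeout=2 '10.73.0.13' state.highstate
time salt -t 20 --gather-job-timeout=20 '10.73.0.13' state.highstate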

My time test results:

time salt '10.73.0.13' state.highstate

Summary for 10.73.0.13
Succeeded: 212
Failed:    0
Total states run: 212

real 0m20.154s
user 0m0.551s
sys  0m0.074s

time salt-call state.highstate
local:

Summary for local
Succeeded: 212
Failed:    0
Total states run: 212

real 0m9.179s
user 0m6.443s
sys  0m0.996s

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: 0.8.6
cherrypy: 3.2.2
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.3
libgit2: Not Installed
libnacl: 1.4.3
M2Crypto: 0.21.1
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 4.4.4.bs.ufd
system: Linux
version: CentOS Linux 7.2.1511 Core

@seanjnkns
Contributor Author

seanjnkns commented May 27, 2016

Here's the results from salt version 2015.8.10:

Salt Version:
Salt: 2015.8.10

Dependency Versions:
Jinja2: 2.7.3
M2Crypto: 0.21.1
Mako: 0.8.1
PyYAML: 3.11
PyZMQ: 14.7.0
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
RAET: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5
cffi: 0.8.6
cherrypy: 3.2.2
dateutil: 1.5
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
libgit2: Not Installed
libnacl: 1.4.3
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: 1.2.3
pycparser: 2.14
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: Not Installed
smmap: Not Installed
timelib: Not Installed

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 4.4.4.bs.ufd
system: CentOS Linux 7.2.1511 Core

time salt-call state.highstate
local:

Summary for local
Succeeded: 212
Failed:    0
Total states run: 212

real 0m7.354s
user 0m5.430s
sys  0m0.876s

time salt '10.73.0.13' state.highstate
10.73.0.13:

Summary for 10.73.0.13
Succeeded: 212
Failed:    0
Total states run: 212

real 0m7.454s
user 0m5.430s
sys  0m0.876s

I'll see about getting some further debug logs from the master and submit them here.

@seanjnkns
Contributor Author

I also tried the "Live Salt-Master Profiling" from https://docs.saltstack.com/en/latest/topics/troubleshooting/master.html, but no logs were generated; possibly a separate issue, which I can open if necessary. BTW, killall won't work on salt-master because the parent process shows up as python, so I had to improvise: I collected all of the salt-master PIDs and issued kill -SIGUSR2 <long list of pids> while running the salt-master in debug mode. Let me know if there's any further information I can provide to help determine the cause.
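For anyone else trying the same thing, the improvised signalling amounted to something like this (a sketch; assumes procps pgrep/pkill and a master started with salt-master -l debug):

# killall salt-master matches nothing because the processes show up as python,
# so signal every salt-master process by PID instead:
kill -SIGUSR2 $(pgrep -f 'salt-master')
# or, equivalently, match on the command line directly:
pkill -USR2 -f 'salt-master'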

@seanjnkns
Contributor Author

#33575 appears to be a duplicate of this.

@Ch3LL
Contributor

Ch3LL commented May 27, 2016

@seanjnkns I'm not seeing much of a performance difference between salt-call state.highstate and salt 'minion' state.highstate, and nothing different between 2015.8 and 2015.5. What kind of states are you calling in your highstate?

Note: I tested this with the file.managed state I used in issue #33575.

@Ch3LL added the info-needed (waiting for more info) label May 27, 2016
@Ch3LL added this to the Blocked milestone May 27, 2016
@seanjnkns
Contributor Author

Primarily file.managed, service.running, and pkg.installed checks. I'm also confused that you would mark this as "info-needed" when you tested issue #33575, which is technically a duplicate of this one, and labeled it differently. In my case I have 212 states being applied; in your test you had 20 and noticed a 1 s timing difference. I'm fairly sure that if you increased that to 200+ states and ran the same test, you'd see a much larger disparity.

Additionally (and I apologize, since I don't know how to analyze the callgrind file I attached), did you look at it, and did it prove useful?

By the way, I'm using the same salt-master and salt-minion configuration files for 2016.3.0 that I was using for 2015.8.10. I'd be happy to attach both configurations if you'd like.

@seanjnkns
Contributor Author

I'll do a git bisect on this, then, and see what it turns up; it's clearly a regression introduced somewhere between 2015.8.10 and 2016.3.0.
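Roughly the plan (a sketch; assumes a git checkout of the salt repository, and that each bisect step reinstalls salt and restarts both daemons before re-timing the two commands):

git clone https://github.com/saltstack/salt.git && cd salt
git bisect start
git bisect bad v2016.3.0        # slow release
git bisect good v2015.8.10      # fast release
# at each step the bisect offers:
pip install -e . && systemctl restart salt-master salt-minion
time salt-call state.highstate
time salt '<minion_id>' state.highstate
git bisect good                 # or: git bisect bad, depending on the timings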

@cachedout
Contributor

@seanjnkns A git bisect would be perfect! Thank you for taking the time.

@seanjnkns
Contributor Author

Ok, here are the results of the bisect. I'm going to re-run it, though: when it got down to the last bisect step there was still a minor disparity between salt-call and salt 'minion_id' state.highstate, so I'll re-bisect and track that one down too, just in case. This commit, however, seems to have the biggest impact so far:

df97ef8 is the first bad commit
commit df97ef8
Author: Thomas Jackson jacksontj.89@gmail.com
Date: Sat Oct 17 10:57:35 2015 -0700

Clone all contexts before executing the minion methods

:040000 040000 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 4628850e1245cfd87529b0f699fc434ea5e627c3 M salt

@cachedout
Contributor

Most interesting.

cc: @jacksontj


@seanjnkns
Contributor Author

The other commit, which in my case causes a ~2 s disparity, is similarly related:

6838a95 is the first bad commit
commit 6838a95
Author: Thomas Jackson jacksontj.89@gmail.com
Date: Fri Oct 9 16:11:39 2015 -0700

Initial addition of ContextDict

Inspired by #23373

The basic issue we ran into is that the loader is injecting globals directly into the global namespace. This means that these injected values are not thread or coroutine safe-- meaning we can never do more than one thing at a time. Instead of multiprocessing everything to death-- we can simply use a stack_context to nicely handle this down in the core. As far as the module authors/users are concerned nothing has changed-- but the storage behind the scenes is now per-JID. This same set of classes can easily be used to store additional data (next candidates are reactors, master MWorker tasks, etc.).

:040000 040000 fec886b46db5460b55eab95d411322ae7f5efc65 db1c3c5f9083859d3826a545c2c6ff5a8f2d3597 M salt
:040000 040000 7bf3dfa8fd44f50ea79b85d6dbb5068e40464edb 362bd6835694e0e08813032223810e1cbcd84160 M test

@seanjnkns
Contributor Author

What I'm confused about is that these commits seem to point to the Tornado transport, but I'm using the default zeromq transport. Is there some kind of bleed-over between the transports?
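One way to confirm which wire transport is actually in effect (a sketch; config.get is the standard way to read the effective option):

salt-call config.get transport        # expected: zeromq (the default)
grep -i '^transport:' /etc/salt/master /etc/salt/minion 2>/dev/null   # empty if unset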

@seanjnkns
Contributor Author

Although I'm in no way convinced it's a kernel issue, I tested this on the latest 3.10 CentOS 7 kernel and on the 4.4.12 kernel, with the same results on both.

@seanjnkns
Contributor Author

Until this regression is resolved, I've reverted all of our servers back to 2015.8.10.

@Ch3LL added the Bug (broken, incorrect, or confusing behavior), severity-high (2nd top severity, seen by most users, causes major problems), P1 (Priority 1), and Core (relates to code central or existential to Salt) labels, and removed the info-needed (waiting for more info) label Jun 14, 2016
@Ch3LL modified the milestones: Approved, Blocked Jun 14, 2016
@Ch3LL
Contributor

Ch3LL commented Jun 14, 2016

@seanjnkns thanks for all of the additional investigation work toward finding where the issue exists. To answer your question: my understanding is that Tornado has been added to parts of Salt, but zeromq is still the default transport. For example, per the 2015.8.0 release notes, Tornado became the default backend for http.request:

"Tornado is now the default backend for http.request"

@seanjnkns
Contributor Author

FYI, I took a test server, updated it to 2016.3.1, applied #33896, and retested. No change: salt-call vs. salt 'minion_id' state.highstate still showed a significant difference in processing time.

@seanjnkns
Contributor Author

I retested this on 2016.3.2, and although there's roughly a 1-1.5 s overhead increase compared to 2015.8.11, the time difference between salt \* state.highstate and salt-call state.highstate is now the same as on 2015.8.11. Posting my results for a single salt master/minion combination on the two versions:

2015.8.11:
echo 3 > /proc/sys/vm/drop_caches ; time salt-call state.highstate:
real 0m9.643s
user 0m5.477s
sys  0m1.028s

echo 3 > /proc/sys/vm/drop_caches ; time salt \* state.highstate:
real 0m9.117s
user 0m0.457s
sys  0m0.101s

2016.3.2:
echo 3 > /proc/sys/vm/drop_caches ; time salt-call state.highstate:
real 0m10.832s
user 0m5.784s
sys  0m1.312s

echo 3 > /proc/sys/vm/drop_caches ; time salt \* state.highstate:
real 0m10.580s
user 0m0.519s
sys  0m0.102s

Given this, I believe we can mark this resolved. I'm not exactly sure which commit(s) fixed it, but I'm glad it's fixed.
