
File states seem slower in 2016.3, especially on first cache retrieval #33575

Closed
anlutro opened this issue May 27, 2016 · 19 comments
Labels: Bug (broken, incorrect, or confusing behavior), Confirmed (Salt engineer has confirmed bug/feature), Core (relates to code central or existential to Salt), fixed-pls-verify (fix is linked, bug author to confirm fix), P1 (Priority 1), severity-high (2nd top severity, seen by most users, causes major problems), ZRELEASED - 2016.3.2

Comments

@anlutro
Contributor

anlutro commented May 27, 2016

I upgraded to 2016.3 in my VMs and file states seem significantly slower on the first pass (10-20 times slower). However, on subsequent state.apply calls, things seem fine again.

For every file that the minion has to fetch from the master, I see about 30-40 occurrences of these log lines in the minion log:

2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.transport.zeromq] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.crypt] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-05-27 12:52:34,319 [   DEBUG] [17955] [salt.crypt] Re-using SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')

Followed by:

2016-05-27 12:52:35,930 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', looking at rel_path '|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** attempting ** 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] No dest file found 

Then some more of the zeromq-related log lines, then this:

2016-05-27 12:52:37,632 [    INFO] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** done ** '|salt/minion/configs/master.conf'
2016-05-27 12:52:37,633 [   DEBUG] [17955] [salt.utils.jinja] Jinja search path: '['/var/cache/salt/minion/files/base']'

I had a look at logs from before the upgrade and there were nowhere near the same number of log lines regarding re-auth.
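
A quick way to quantify that difference is to count the channel initializations per run; a rough sketch, assuming the default minion log location at /var/log/salt/minion:

# how many request channels were created during the run
grep -cF "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion
# compared to how many files were actually fetched
grep -cF "Fetching file from saltenv 'base', ** done **" /var/log/salt/minion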

vagrant@orch:~$ salt-call --versions
Salt Version:
           Salt: 2016.3.0

Dependency Versions:
           cffi: 0.8.6
       cherrypy: 3.2.3
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.4 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.4
@genx7up

genx7up commented May 27, 2016

Similar experience after upgrading to 2016.3.0. The first highstate run on a clean install is significantly slower. Subsequent runs are also slower compared to 2015.8.10.

Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.13.1.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

@Ch3LL
Contributor

Ch3LL commented May 27, 2016

@anlutro I am able to replicate this and am seeing a performance difference with a small test case. Here's the test case:

[root@ch3ll-master salt]# cat issues/33575.sls 
{% for file in range(1,20) %}
add-file-{{ file }}:
  file.managed:
    - name: /tmp/test
    - source: salt://test{{ file }}
{% endfor %}
[root@ch3ll-master salt]# cat /srv/salt/top.sls 
base:
  '*':
    - issues.33575

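(For the test case above to actually fetch anything, the salt://test{{ file }} sources have to exist on the master. A minimal way to create them, assuming the default /srv/salt file root; note that range(1,20) yields 1 through 19:)

for i in $(seq 1 19); do echo "test file $i" > /srv/salt/test$i; done
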
When running the following: time sudo salt 'ch3ll-mas*' state.highstate -ldebug

on 2015.8.10:

real    0m2.728s
user    0m0.888s
sys     0m0.157s

on 2016.3.0

real    0m3.795s
user    0m0.931s
sys     0m0.151s

@Ch3LL Ch3LL added P1 Priority 1 severity-high 2nd top severity, seen by most users, causes major problems ZRELEASED - Boron severity-critical top severity, seen by most users, serious issues Core relates to code central or existential to Salt Confirmed Salt engineer has confirmed bug/feature - often including a MCVE labels May 27, 2016
@Ch3LL Ch3LL added this to the Approved milestone May 27, 2016
@Ch3LL Ch3LL added the Bug broken, incorrect, or confusing behavior label May 27, 2016
@thatch45
Member

@cachedout do you think that this might be part of the multimaster caching file client fix?

@cachedout
Contributor

Hmmm, that's a good thought. Could be. Trying to find the culprit with a git bisect could be a good approach here.
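
A rough sketch of such a bisect, assuming a checkout of the saltstack/salt repository and the v2015.8.10 / v2016.3.0 release tags as the good/bad endpoints:

git clone https://github.com/saltstack/salt.git && cd salt
pip install -e .                 # editable install so each bisect step is picked up
git bisect start
git bisect bad v2016.3.0         # known slow
git bisect good v2015.8.10       # known fast
# at each step: clear the minion file cache, time a first-pass run, mark the result
rm -rf /var/cache/salt/minion/files
time salt-call -l debug state.apply
git bisect good                  # or: git bisect bad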

@hgfischer
Contributor

Looks like I have the same problem. salt.fileclient is super slow. Some examples from the log:

2016-06-09 09:14:00,996 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/taskhandle.py'
2016-06-09 09:14:04,187 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/utils.py'

It took ~4 seconds to copy a file with 2891 bytes!!

@inty

inty commented Jun 9, 2016

Yeah, we're experiencing this issue too... a job that used to take about 10 minutes now takes 3 hours. What takes the time is copying about 8000 images from the master.

@danielmotaleite
Contributor

We also have this problem: a highstate now takes about 15m on each server, so for now I reverted back to 2015.8.
I would say that this is a good reason for a bugfix release.

@thatch45
Member

Yes, we are still hunting this one down.

@cachedout
Contributor

cachedout commented Jun 17, 2016

This should be resolved here and available in 2016.3.1:

#33896

@thatch45
Member

Phew! Thanks @cachedout ! I thought this had been missed for a minute!

@thatch45
Member

so can this be closed?

@meggiebot meggiebot added the fixed-pls-verify fix is linked, bug author to confirm fix label Jun 17, 2016
@thatch45
Member

Fixed by #33896

@danielmotaleite
Contributor

Just upgraded again to 2016.3.1 and it still takes a long time to process...

The salt-minion keeps using 100% of the CPU until it finishes the highstate.

Running salt-minion -l garbage, the log constantly shows AsyncZeroMQReqChannel initializations, and only every 1-2 seconds do I finally get a normal state applied:

[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156465323904 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 12483
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183639104 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185678768 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] In saltenv 'base', looking at rel_path 'templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[INFO    ] Fetching file from saltenv 'base', ** skipped ** latest already in cache 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[INFO    ] File /etc/nsswitch.conf is in the correct state
[INFO    ] Completed state [/etc/nsswitch.conf] at time 20:25:29.541349 duration_in_ms=5715.604
[INFO    ] Running state [/etc/rsyslog.conf] at time 20:25:29.541611
[INFO    ] Executing state file.managed for /etc/rsyslog.conf
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182034624 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[TRACE   ] ret_val = None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182063512 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185516080 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183189480 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156157189872 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156181971744 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156184754224 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183288000 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')

Doing an strace -enetwork -f shows this in a loop:

[pid  7894] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 48
[pid  7894] connect(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  7894] getsockopt(48, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
[pid  7894] getpeername(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, [16]) = 0
[pid  7894] recvfrom(48, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
[pid  7894] recvfrom(48, 0x7f5340000ef2, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\377\0\0\0\0\0\0\0\1\177\3", 11, 0, NULL, 0) = 11
[pid  7894] recvfrom(48, "\3\0", 2, 0, NULL, NULL) = 2
[pid  7894] recvfrom(48, "NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 52, 0, NULL, NULL) = 52
[pid  7894] recvfrom(48, 0x7f5340003200, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\0NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 53, 0, NULL, 0) = 53
[pid  7894] sendto(48, "\4&\5READY\vSocket-Type\0\0\0\3REQ\10Identity\0\0\0\0", 40, 0, NULL, 0) = 40
[pid  7894] recvfrom(48, "\4)\5READY\vSocket-Type\0\0\0\6ROUTER\10Identity\0\0\0\0", 8192, 0, NULL, NULL) = 43
[pid  7893] +++ exited with 0 +++
[pid  7894] +++ exited with 0 +++

So this exchange is being performed several times per second, and with the encryption it eats all the CPU of the machine.
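
One way to put a number on it while reproducing is to trace just the connect() calls during a single highstate; a sketch, assuming salt-call is run directly on the affected minion and the default ret_port of 4506:

strace -f -e trace=connect -o /tmp/minion.trace salt-call state.apply
# new TCP connections opened to the master's ret port during that run
grep -c 'sin_port=htons(4506)' /tmp/minion.trace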

@danielmotaleite
Contributor

danielmotaleite commented Jun 17, 2016

After reverting to 2015.8, the "Initializing new AsyncZeroMQReqChannel" message shows up ONLY 5 times in the "garbage" output during the machine's highstate, vs. the thousands of times with 2016.3.1.

@anlutro
Contributor Author

anlutro commented Jun 20, 2016

I can't reproduce this problem in 2016.3.1; I'll leave another comment if it pops up again.

@ahammond
Contributor

@markahopper this (specifically what @danielmotaleite pointed out) is huge for us.

@DmitryKuzmenko
Contributor

@danielmotaleite, @anlutro, @ahammond
As @thatch45 pointed out above, this bug is fixed by #33896. Currently only the 2016.3 and develop branches contain the fix.
You should either wait for the 2016.3.2 release or try to apply #33896 yourself.
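
For anyone who would rather not wait, GitHub exposes each pull request as a patch file, so one way to try the fix on an installed package might be (a sketch; the site-packages path varies by distro and Python version):

curl -L https://github.com/saltstack/salt/pull/33896.patch -o /tmp/33896.patch
cd /usr/lib/python2.7/dist-packages      # wherever the installed salt package lives
patch -p1 --dry-run < /tmp/33896.patch   # check that it applies cleanly first
patch -p1 < /tmp/33896.patch
service salt-minion restart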

@anlutro
Contributor Author

anlutro commented Jun 22, 2016

Oh, odd. I was sure it was merged before 2016.3.1. I guess it's just a coincidence that I can't reproduce it.

@danielmotaleite
Contributor

Ohh, I see! As this ticket was marked as fixed and closed before 2016.3.1 was released, we all assumed that version had the fix.

I tested with 2016.3.1 and applied the fix patch... and it worked!
It still seems slower than 2015.8, but it is now a lot faster than it was without the fix.

Thanks, I will wait for the 2016.3.2 release.
