
File states seem slower in 2016.3, especially on first cache retrieval #33575

Closed
anlutro opened this issue May 27, 2016 · 19 comments
Labels: Bug (broken, incorrect, or confusing behavior), Confirmed (Salt engineer has confirmed bug/feature), Core (relates to code central or existential to Salt), fixed-pls-verify (fix is linked, bug author to confirm fix), P1 (Priority 1), severity-high (2nd top severity, seen by most users, causes major problems), ZRELEASED - 2016.3.2

Comments

@anlutro
Contributor

anlutro commented May 27, 2016

I upgraded to 2016.3 in my VMs and file states seem significantly slower on the first pass (10-20 times slower). However, on subsequent state.apply calls, things seem fine again.

For every file that the minion has to fetch from the master, I see about 30-40 occurrences of these log lines in the minion log:

2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.transport.zeromq] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-05-27 12:52:34,313 [   DEBUG] [17955] [salt.crypt] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-05-27 12:52:34,319 [   DEBUG] [17955] [salt.crypt] Re-using SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')

Followed by:

2016-05-27 12:52:35,930 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', looking at rel_path '|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/|salt/minion/configs/master.conf' to resolve 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** attempting ** 'salt://|salt/minion/configs/master.conf'
2016-05-27 12:52:35,933 [   DEBUG] [17955] [salt.fileclient] No dest file found 

Then some more of the zeromq-related log lines, then this:

2016-05-27 12:52:37,632 [    INFO] [17955] [salt.fileclient] Fetching file from saltenv 'base', ** done ** '|salt/minion/configs/master.conf'
2016-05-27 12:52:37,633 [   DEBUG] [17955] [salt.utils.jinja] Jinja search path: '['/var/cache/salt/minion/files/base']'

I had a look at logs from before the upgrade and there were nowhere near the same number of log lines regarding re-auth.
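
A quick way to quantify that difference is to count the channel initializations per run; a rough sketch, assuming the default minion log location at /var/log/salt/minion:

# how many request channels were created during the run
grep -cF "Initializing new AsyncZeroMQReqChannel" /var/log/salt/minion
# compared to how many files were actually fetched
grep -cF "Fetching file from saltenv 'base', ** done **" /var/log/salt/minion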

vagrant@orch:~$ salt-call --versions
Salt Version:
           Salt: 2016.3.0

Dependency Versions:
           cffi: 0.8.6
       cherrypy: 3.2.3
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.4 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.4
@genx7up

genx7up commented May 27, 2016

Similar experience after upgrading to 2016.3.0. The first highstate run on a clean install is significantly slower. Subsequent runs are also slower compared to 2015.8.10.

Salt Version:
Salt: 2016.3.0

Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
Jinja2: 2.7.2
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.6
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
Python: 2.7.5 (default, Nov 20 2015, 02:00:19)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 14.7.0
RAET: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.2.1
ZMQ: 4.0.5

System Versions:
dist: centos 7.2.1511 Core
machine: x86_64
release: 3.10.0-327.13.1.el7.x86_64
system: Linux
version: CentOS Linux 7.2.1511 Core

@Ch3LL
Contributor

Ch3LL commented May 27, 2016

@anlutro I am able to replicate this and am seeing a performance difference with a small test case. Here's the test case:

[root@ch3ll-master salt]# cat issues/33575.sls 
{% for file in range(1,20) %}
add-file-{{ file }}:
  file.managed:
    - name: /tmp/test
    - source: salt://test{{ file }}
{% endfor %}
[root@ch3ll-master salt]# cat /srv/salt/top.sls 
base:
  '*':
    - issues.33575

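(For the test case above to actually fetch anything, the salt://test{{ file }} sources have to exist on the master. A minimal way to create them, assuming the default /srv/salt file root; note that range(1,20) yields 1 through 19:)

for i in $(seq 1 19); do echo "test file $i" > /srv/salt/test$i; done
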
When running the following: time sudo salt 'ch3ll-mas*' state.highstate -ldebug

on 2015.8.10:

real    0m2.728s
user    0m0.888s
sys     0m0.157s

on 2016.3.0

real    0m3.795s
user    0m0.931s
sys     0m0.151s

@Ch3LL Ch3LL added P1 Priority 1 severity-high 2nd top severity, seen by most users, causes major problems ZRELEASED - Boron severity-critical top severity, seen by most users, serious issues Core relates to code central or existential to Salt Confirmed Salt engineer has confirmed bug/feature - often including a MCVE labels May 27, 2016
@Ch3LL Ch3LL added this to the Approved milestone May 27, 2016
@Ch3LL Ch3LL added the Bug broken, incorrect, or confusing behavior label May 27, 2016
@thatch45
Member

@cachedout do you think that this might be part of the multimaster caching file client fix?

@cachedout
Contributor

Hmmm, that's a good thought. Could be. Trying to find the culprit with a git bisect could be a good approach here.
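
A rough sketch of such a bisect, assuming a checkout of the saltstack/salt repository and the v2015.8.10 / v2016.3.0 release tags as the good/bad endpoints:

git clone https://github.com/saltstack/salt.git && cd salt
pip install -e .                 # editable install so each bisect step is picked up
git bisect start
git bisect bad v2016.3.0         # known slow
git bisect good v2015.8.10       # known fast
# at each step: clear the minion file cache, time a first-pass run, mark the result
rm -rf /var/cache/salt/minion/files
time salt-call -l debug state.apply
git bisect good                  # or: git bisect bad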

@hgfischer
Contributor

Looks like I have the same problem. salt.fileclient is super slow. Some examples from the log:

2016-06-09 09:14:00,996 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/taskhandle.py'
2016-06-09 09:14:04,187 [salt.fileclient  ][INFO    ][17677] Fetching file from saltenv 'base', ** done ** '|vim/bundles/python-mode/pymode/libs2/rope/base/utils.py'

It took ~4 seconds to copy a file with 2891 bytes!!

@inty

inty commented Jun 9, 2016

Yeah, we're experiencing this issue too... a job that used to take about 10 minutes now takes 3 hours. What takes the time is copying about 8000 images from the master.

@danielmotaleite
Contributor

We also have this problem: a highstate now takes about 15m on each server, so for now I reverted back to 2015.8.
I would say that this is a good reason for a bugfix release.

@thatch45
Member

Yes, we are still hunting this one down.

@cachedout
Contributor

cachedout commented Jun 17, 2016

This should be resolved here and available in 2016.3.1:

#33896

@thatch45
Member

Phew! Thanks @cachedout ! I thought this had been missed for a minute!

@thatch45
Member

so can this be closed?

@meggiebot meggiebot added the fixed-pls-verify fix is linked, bug author to confirm fix label Jun 17, 2016
@thatch45
Member

Fixed by #33896

@danielmotaleite
Contributor

Just upgraded again to 2016.3.1 and it still takes a long time to process...

The salt-minion keeps using 100% of the CPU until it finishes the highstate.

Running salt-minion -l garbage, the log constantly shows AsyncZeroMQReqChannel initializations, and only every 1-2 seconds do I finally get a normal state applied:

[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156465323904 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 12483
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183639104 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185678768 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] In saltenv 'base', looking at rel_path 'templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/templates/CONFIGURATION/nsswitch.conf' to resolve 'salt://templates/CONFIGURATION/nsswitch.conf'
[INFO    ] Fetching file from saltenv 'base', ** skipped ** latest already in cache 'salt://templates/CONFIGURATION/nsswitch.conf'
[DEBUG   ] Jinja search path: ['/var/cache/salt/minion/files/base']
[INFO    ] File /etc/nsswitch.conf is in the correct state
[INFO    ] Completed state [/etc/nsswitch.conf] at time 20:25:29.541349 duration_in_ms=5715.604
[INFO    ] Running state [/etc/rsyslog.conf] at time 20:25:29.541611
[INFO    ] Executing state file.managed for /etc/rsyslog.conf
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182034624 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[TRACE   ] ret_val = None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156182063512 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156185516080 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183189480 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156157189872 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156181971744 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156184754224 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')
[TRACE   ] Inserted key into loop_instance_map id 140156183288000 for key ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes') and process 10959
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'jumiaA13', 'tcp://192.168.106.1:4506')

Doing an strace -enetwork -f shows this in a loop:

[pid  7894] socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 48
[pid  7894] connect(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid  7894] getsockopt(48, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid  7894] setsockopt(48, SOL_TCP, TCP_KEEPIDLE, [300], 4) = 0
[pid  7894] getpeername(48, {sa_family=AF_INET, sin_port=htons(4506), sin_addr=inet_addr("192.168.106.1")}, [16]) = 0
[pid  7894] recvfrom(48, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
[pid  7894] recvfrom(48, 0x7f5340000ef2, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\377\0\0\0\0\0\0\0\1\177\3", 11, 0, NULL, 0) = 11
[pid  7894] recvfrom(48, "\3\0", 2, 0, NULL, NULL) = 2
[pid  7894] recvfrom(48, "NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 52, 0, NULL, NULL) = 52
[pid  7894] recvfrom(48, 0x7f5340003200, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[pid  7894] sendto(48, "\0NULL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 53, 0, NULL, 0) = 53
[pid  7894] sendto(48, "\4&\5READY\vSocket-Type\0\0\0\3REQ\10Identity\0\0\0\0", 40, 0, NULL, 0) = 40
[pid  7894] recvfrom(48, "\4)\5READY\vSocket-Type\0\0\0\6ROUTER\10Identity\0\0\0\0", 8192, 0, NULL, NULL) = 43
[pid  7893] +++ exited with 0 +++
[pid  7894] +++ exited with 0 +++

So this exchange is being performed several times per second, and with the encryption it eats all the CPU of the machine.
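
One way to put a number on it while reproducing is to trace just the connect() calls during a single highstate; a sketch, assuming salt-call is run directly on the affected minion and the default ret_port of 4506:

strace -f -e trace=connect -o /tmp/minion.trace salt-call state.apply
# new TCP connections opened to the master's ret port during that run
grep -c 'sin_port=htons(4506)' /tmp/minion.trace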

@danielmotaleite
Contributor

danielmotaleite commented Jun 17, 2016

After reverting to 2015.8, the "Initializing new AsyncZeroMQReqChannel" message shows up ONLY 5 times in the "garbage" output during the machine's highstate, vs. the thousands of times with 2016.3.1.

@anlutro
Contributor Author

anlutro commented Jun 20, 2016

I can't reproduce this problem in 2016.3.1; I'll leave another comment if it pops up again.

@ahammond
Contributor

@markahopper this (specifically what @danielmotaleite pointed out) is huge for us.

@DmitryKuzmenko
Contributor

@danielmotaleite, @anlutro, @ahammond
As @thatch45 pointed out above, this bug is fixed by #33896. Currently only the 2016.3 and develop branches contain the fix.
You should either wait for the 2016.3.2 release or try to apply #33896 yourself.
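
For anyone who would rather not wait, GitHub exposes each pull request as a patch file, so one way to try the fix on an installed package might be (a sketch; the site-packages path varies by distro and Python version):

curl -L https://github.com/saltstack/salt/pull/33896.patch -o /tmp/33896.patch
cd /usr/lib/python2.7/dist-packages      # wherever the installed salt package lives
patch -p1 --dry-run < /tmp/33896.patch   # check that it applies cleanly first
patch -p1 < /tmp/33896.patch
service salt-minion restart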

@anlutro
Contributor Author

anlutro commented Jun 22, 2016

Oh, odd. I was sure it was merged before 2016.3.1. I guess it's just a coincidence that I can't reproduce it.

@danielmotaleite
Contributor

Ohh, I see! As this ticket was marked as fixed and closed before 2016.3.1 was released, we all assumed that version had the fix.

I tested with 2016.3.1 and applied the fix patch... and it worked!
It still seems slower than 2015.8, but it is now a lot faster than it was without the fix.

Thanks, I will wait for the 2016.3.2 release.
