Salt master cache handling drains cpu resources and fails communication with minion #17193

Closed
span opened this issue Nov 5, 2014 · 2 comments

span commented Nov 5, 2014

I have an issue where the master fails with a timeout when trying to apply a state to a minion using 2014.1.13 on both master and minion. Both are running Ubuntu 14.04.

root@SALT_MASTER:~# salt --versions
           Salt: 2014.1.13
         Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
            ZMQ: 4.0.4
root@MINION_IDr:~# salt-minion --versions
           Salt: 2014.1.13
         Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
            ZMQ: 4.0.4

The master fails with the message: "Failed to authenticate, is this user permitted to execute commands?"

This is a new problem that I did not have before upgrading to 2014.1.13. I ran 2014.1.10 (or 11) before upgrading. I have had earlier problems with salt not printing any result data although the exit code was 0, but trying again and restarting services usually solved that. This time, I just can't get the state to apply. As you can see in the logs, the ping works sometimes.

Checking "top" after salt prints the error message shows salt-master at 100% cpu. The salt-master is running as root.

Here are two links to master and minion debug output, where the issue should be clearer:
https://www.refheap.com/92763
https://www.refheap.com/92765

This issue seems related: #15719

span commented Nov 6, 2014

Since "top" showed the salt-master process at 100% CPU, we ran strace to see what the process was doing. The master appeared to be stuck in a loop trying to create a directory in the salt master cache. strace produced about 15MB of output for this behaviour in a couple of seconds.

top - 07:26:56 up 17:25,  4 users,  load average: 1.03, 1.01, 1.05
Tasks: 141 total,   2 running, 139 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.7 us,  9.5 sy,  0.0 ni, 74.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   6112392 total,  3533748 used,  2578644 free,   231336 buffers
KiB Swap:   524284 total,        0 used,   524284 free.  2768180 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
31141 root      20   0  597780  41396   5188 R 100.1  0.7 207:56.64 salt-master
root@SALT_MASTER:~# strace -p 31141
Process 31141 attached
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
mkdir("/var/cache/salt/master/file_lists/roots/.base.w", 0777) = -1 EEXIST (File exists)
stat("/var/cache/salt/master/file_lists/roots/base.p", {st_mode=S_IFREG|0644, st_size=5154122, ...}) = 0
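
The trace above looks like a retry loop that keeps calling mkdir on a directory that already exists, with no pause and no give-up condition. As a minimal sketch of that pattern with a deadline and a sleep added (this is not Salt's actual code; the function name `acquire_dir_lock` and its parameters are hypothetical, and it is written for Python 3):

```python
import errno
import os
import time


def acquire_dir_lock(lock_dir, timeout=5.0, poll_interval=0.1):
    """Try to create lock_dir as a lock; return True on success, False on timeout.

    Hypothetical helper for illustration only, not part of Salt.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # mkdir is atomic: exactly one caller can create the directory.
            os.mkdir(lock_dir)
            return True
        except OSError as exc:
            # EEXIST means someone else holds (or left behind) the lock.
            if exc.errno != errno.EEXIST:
                raise
        if time.monotonic() >= deadline:
            # Give up instead of spinning forever on a stale lock.
            return False
        # Sleep between attempts so the loop does not burn a full core.
        time.sleep(poll_interval)
```

With a leftover directory such as `.base.w` in place, a loop like this would return False after the timeout rather than pinning the CPU.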

We solved it by stopping the service, moving the /var/cache/salt directory out of the way, and then starting the salt-master again. The directory was recreated and we no longer had the issues described.
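
Before moving the whole cache, it may help to check whether a leftover directory like the `.base.w` seen in the strace output is what is blocking the master. The following is a rough sketch only: the `.<name>.w` pattern is taken from the trace, while the function name `find_stale_locks` and the age threshold are assumptions of mine. It only lists candidates and removes nothing.

```python
import os
import time


def find_stale_locks(root, max_age_seconds=300.0, now=None):
    """Return sorted paths of '.<name>.w' directories under root that are
    older than max_age_seconds.

    Hypothetical helper for illustration only, not part of Salt.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            # Match hidden directories ending in '.w', e.g. '.base.w'.
            if name.startswith(".") and name.endswith(".w"):
                path = os.path.join(dirpath, name)
                age = now - os.stat(path).st_mtime
                if age > max_age_seconds:
                    stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Path taken from the strace output above.
    for path in find_stale_locks("/var/cache/salt/master/file_lists"):
        print(path)
```

If this prints a path while no state run is in progress, that directory is a likely suspect. Stopping the salt-master before touching anything in the cache is still the safer route.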

span changed the title from 'Salt master fails with with message: "Failed to authenticate, is this user permitted to execute commands?"' to 'Salt master cache handling drains cpu resources and fails communication with minion' Nov 6, 2014

span commented Nov 6, 2014

Duplicate of #15719

span closed this as completed Nov 6, 2014