Syndics fail after a period of time and log "Authentication failure of type "user" occurred." #4319

Closed
dgarstang opened this Issue Mar 29, 2013 · 11 comments

@dgarstang

I have a master master with 4 syndics. Everything was running fine. I could run a test.ping on the master master and get results for all minions behind all syndics. I left it overnight, came back the next day, and three of the syndics had disappeared.

Now, when running a test.ping on the master master no results are displayed for minions behind three of the syndics, and the failing syndics log:

"Authentication failure of type "user" occurred."

This error string appears eight times in the source. The one actually being hit is in master.py at line 1661 of salt 0.14. I poked around, and it appears to occur when the root key received in the payload from the master does not match the local root key. This WAS working. Nothing changed in the configuration, and then overnight it stopped working.
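For anyone following along, here is a minimal sketch of the kind of check described above: a key carried in the request payload is compared against a locally stored key, and a mismatch produces the logged message. This is not Salt's actual master.py code. The function names, the `key` payload field, and the return values are all hypothetical and only illustrate why a stale key on one side would fail every request.

```python
import hmac

# Message text taken from the log line quoted in this issue.
AUTH_FAILURE = 'Authentication failure of type "user" occurred.'


def keys_match(payload_key: str, local_key: str) -> bool:
    """Compare two keys in constant time (hypothetical helper)."""
    return hmac.compare_digest(payload_key.encode(), local_key.encode())


def handle_publish(payload: dict, local_root_key: str) -> str:
    """Reject the request when the key in the payload is missing or stale.

    `payload["key"]` is an assumed field name, not Salt's real wire format.
    """
    if not keys_match(payload.get("key", ""), local_root_key):
        return AUTH_FAILURE
    return "ok"
```

If the local key is regenerated while the sender keeps using the old one, every call takes the failure branch until the sender picks up the new key, which would match the "worked yesterday, fails today" behaviour.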

I believe this is reproducible. I had this situation earlier last week and somehow, after removing and reinstalling everything, got it to work. This is when I discovered that the syndic also needs to sign its own key, which makes no sense to me.

I restarted the syndics.... didn't help. I restarted the master master. Didn't help. I restarted the syndics again. Didn't help.

From master master, @Ubuntu 12.04.2

Linux apex 3.2.0-36-virtual #57-Ubuntu SMP Tue Jan 8 22:04:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
           Salt: 0.14.0
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: not installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0

From failing syndic, @Ubuntu 11.04.

           Salt: 0.14.0
         Python: 2.7.1+ (r271:86832, Sep 27 2012, 21:12:17)
         Jinja2: 2.5.5
       M2Crypto: 0.20.1
 msgpack-python: 0.1.9.final
   msgpack-pure: not installed
       pycrypto: 2.1.0
         PyYAML: 3.09
          PyZMQ: 13.0.0

From another failing syndic, @Ubuntu 11.04

           Salt: 0.14.0
         Python: 2.7.1+ (r271:86832, Sep 27 2012, 21:12:17)
         Jinja2: 2.5.5
       M2Crypto: 0.20.1
 msgpack-python: 0.1.10
   msgpack-pure: not installed
       pycrypto: 2.1.0
         PyYAML: 3.09
          PyZMQ: 13.0.0

From another failing syndic, @Ubuntu 11.04

           Salt: 0.13.3-21-g178f50f
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.20.1
 msgpack-python: 0.1.10
   msgpack-pure: not installed
       pycrypto: 2.1.0
         PyYAML: 3.09
          PyZMQ: 13.0.0

From a working syndic, @Ubuntu 12.04.2

           Salt: 0.13.3-21-g178f50f
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: not installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0
@thatch45
Member
thatch45 commented Apr 1, 2013

Sorry I am late on the uptake here. I have been spending a lot of time recently updating and fixing the syndic.
I think this has to do with client_acl on the master master. Are you using client_acl there?

@dgarstang

The master master config has:

root@apex:~# cat /etc/salt/master | grep -v "^#" | grep -ve "^$"
worker_threads: 10
order_masters: True
log_level: all
@thatch45
Member
thatch45 commented Apr 1, 2013

Hmm, thanks for the info, this helps (in that I know my first assumption is wrong)
I will get this figured out!

@dgarstang

Oh, you know what... I should have added this sooner. On Friday, as I was pasting the output of those various library version numbers above, I noticed that the versions of pycrypto and M2Crypto were newer on the working syndics. The failing ones also run an older version of Ubuntu, but I'd upgraded ZMQ on those earlier. So, I went and upgraded the Python crypto libraries on one of them. It seems to have helped. I just ran a test.ping on the master master, and that one is still working today. That might be the answer.

I went to upgrade on a second one to reinforce that this was the issue, and unfortunately seem to have broken some libraries. :(

ImportError: /usr/local/lib/python2.7/dist-packages/M2Crypto-0.21.1-py2.7-linux-x86_64.egg/M2Crypto/__m2crypto.so: undefined symbol: SSLv2_method

So, not 100% conclusive, but seems likely that pycrypto 2.1.0 and M2Crypto 0.20.1 have issues. We have a third syndic that I can try and upgrade those libraries on... I think...

@thatch45
Member
thatch45 commented Apr 1, 2013

M2Crypto needs to be patched to not have SSLv2 support, because a bunch of distros have stripped SSLv2 support out of OpenSSL.
But this is good info. We have never seen an issue with the crypto, though: the pycrypto baseline is rather old, and M2 has been unchanged for a while.

@dgarstang

This problem has mysteriously returned. I was running a test.ping all day, regularly getting results for all 184 minions across 5 syndics. Number suddenly dropped to 17 and I lost 3 of the syndics. It's VERY interesting to note that this happened a few minutes past 00:00 local time on all three syndics. The syndic masters are now logging the "Authentication failure of type "user" occurred." messages again.

@dgarstang

Yep... I realised I'd left this running continuously since Tuesday afternoon (36 hours ago)...

while true; do salt '*' test.ping --out text  | wc -l; sleep 60; done

17 minions each time. Failed every single time. Was working fine until the very next poll after 00:00 (about 5pm Tuesday) local time. Wondering if there's a date/time bug in the syndics. Most likely related to older versions of something though, because I've upgraded the m2crypto and pycrypto packages on those syndics, but maybe something still isn't as up to date as it needs to be.
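Since the timing of the drop seems to matter, a variant of the loop above that records a UTC timestamp next to each count may help pin down exactly when it happens. This is only a sketch: it shells out to the same `salt '*' test.ping --out text` command used above, and the helper names (`count_responses`, `run_salt_ping`, `poll`) are made up for illustration.

```python
import subprocess
import time
from datetime import datetime, timezone


def count_responses(output: str) -> int:
    """Count non-empty output lines (unlike `wc -l`, blank lines are skipped)."""
    return sum(1 for line in output.splitlines() if line.strip())


def run_salt_ping() -> str:
    """Run the same test.ping command as the shell loop and return its stdout."""
    result = subprocess.run(
        ["salt", "*", "test.ping", "--out", "text"],
        capture_output=True,
        text=True,
    )
    return result.stdout


def poll(run=run_salt_ping, interval=60, iterations=None, sleep=time.sleep):
    """Print 'timestamp count' once per interval and return the samples.

    With iterations=None the loop runs until interrupted.
    """
    samples = []
    done = 0
    while iterations is None or done < iterations:
        now = datetime.now(timezone.utc)
        count = count_responses(run())
        print(f"{now.isoformat()} {count}")
        samples.append((now, count))
        done += 1
        if iterations is None or done < iterations:
            sleep(interval)
    return samples
```

Calling `poll()` on the master master and redirecting the output to a file gives one line per minute, so the first sample after the count drops can be lined up against the syndic logs.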

@thatch45
Member
thatch45 commented Apr 4, 2013

A valid failure condition is certainly happening. I am planning on looking into this one once I finish getting the event stuff working on the syndic.

@dgarstang

Did a dist-upgrade on one of the master/syndics. It's now running Ubuntu 12.04 (like the working ones). Still doing it. :(

@thatch45
Member

Right, this is still an issue and I will get to it as soon as I can

@dgarstang

I deleted the syndic key on the master master and re-signed it, and the error has gone away. I'm now getting results from a test.ping on the master master for minions behind this syndic. We'll see if it lasts past 00:00, or fails again there like before...

@thatch45 thatch45 added a commit that closed this issue Apr 12, 2013
@thatch45 thatch45 Fix #4319
The master server's key is reset when the master is restarted, the
syndic needs to detect this authentication change and refresh the
local auth key
edcc3db
@thatch45 thatch45 closed this in edcc3db Apr 12, 2013
@thatch45 thatch45 added a commit that referenced this issue Apr 12, 2013
@thatch45 thatch45 Fix #4319
The master server's key is reset when the master is restarted, the
syndic needs to detect this authentication change and refresh the
local auth key
d17cf33
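
For readers landing here later, the behaviour described in the commit message (detect that the master's key changed, then refresh the locally cached key) can be sketched as a generic refresh-and-retry pattern. This is not the code from edcc3db. `AuthError`, `UpstreamClient`, `read_key`, and `send` are hypothetical names used only to illustrate the idea.

```python
class AuthError(Exception):
    """Raised by the (hypothetical) transport when the key is rejected."""


class UpstreamClient:
    """Caches an auth key and reloads it once when a request is rejected."""

    def __init__(self, read_key, send):
        self._read_key = read_key  # callable returning the current key
        self._send = send          # callable(job, key) that may raise AuthError
        self._key = read_key()     # key cached at startup

    def publish(self, job):
        try:
            return self._send(job, self._key)
        except AuthError:
            # The cached key may be stale (e.g. the master restarted and
            # regenerated it). Reload it and retry exactly once; a second
            # failure propagates to the caller.
            self._key = self._read_key()
            return self._send(job, self._key)
```

Without the `except` branch, a client that cached the key at startup keeps failing after the master rotates it, which matches the symptom reported in this thread.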