Minions stop communicating with masters until restart is done #20874

Closed
minddrive opened this issue Feb 20, 2015 · 10 comments
Labels
Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), severity-low (4th level, cosmetic problems, workaround exists)
Milestone

Comments

@minddrive

We have been attempting to replace our MCollective infrastructure with Salt, but have hit one major hurdle: after running fine for a while (usually 45-60 minutes), many of the minions suddenly stop communicating with the master (domains have been removed from hostnames in all output/logs below):

$ salt -E 'dpuppetklareau*' test.ping -v
Executing job with jid 20150220111427692727
-------------------------------------------

dpuppetklareau01:
    Minion did not return. [Not connected]

Restarting the minion immediately corrects the problem, but on many of our minions the problem recurs much of the time. We have set the logging levels on the minions and masters (we have two masters in each of our environments) to debug, but no useful information has been found. It seems that authentication continues even when the minion is no longer responding... here's part of the log from the master side:

2015-02-20 10:38:56,099 [salt.master                                ][INFO    ] Authentication request from dpuppetklareau01
2015-02-20 10:38:56,117 [salt.master                                ][INFO    ] Authentication accepted from dpuppetklareau01
2015-02-20 10:38:56,122 [salt.master                                ][INFO    ] Authentication request from dpuppetklareau01
2015-02-20 10:38:56,142 [salt.master                                ][INFO    ] Authentication accepted from dpuppetklareau01
2015-02-20 10:38:56,246 [salt.utils.event                           ][DEBUG   ] Sending event - data = {'id': 'dpuppetklareau01', '_stamp': '2015-02-20T10:38:56.245430', 'result': True, 'pub': '<public key>', 'act': 'accept'}
2015-02-20 10:38:56,258 [salt.utils.event                           ][DEBUG   ] Sending event - data = {'id': 'dpuppetklareau01', '_stamp': '2015-02-20T10:38:56.258226', 'result': True, 'pub': '<public key>', 'act': 'accept'}
2015-02-20 11:14:27,693 [salt.utils.event                           ][DEBUG   ]

and the logs from the minion side:

[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
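
(For reference, the debug logging mentioned above was enabled with roughly the following in the minion config, and the equivalent in the master config; this is just a sketch assuming the stock /etc/salt locations, and the daemons were restarted after the change:)

# /etc/salt/minion -- excerpt (same idea in /etc/salt/master)
# Raise the log level so key loading and auth exchanges show up in
# /var/log/salt/minion and /var/log/salt/master.
log_level: debug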

Another thing to note is that even after the minion no longer responds to the master, one internal job still seems to run:

[INFO    ] Running scheduled job: __mine_interval
[DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: This job was scheduled with a max number of 2
[INFO    ] Running scheduled job: __mine_interval
[DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: This job was scheduled with a max number of 2
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20150220083855680400', 'pid': 2032, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
[DEBUG   ] schedule.handle_func: Checking job against fun mine.update: {'fun': 'mine.update', 'jid': '20150220083855680400', 'pid': 2032, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
[DEBUG   ] schedule.handle_func: Incrementing jobcount, now 1, maxrunning is 2
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20150220083855686924', 'pid': 2037, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
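
(For anyone not familiar with those schedule log lines: the built-in __mine_interval job behaves roughly like an explicit schedule entry along the lines below in the minion config. This is only an illustration of what the jid_include/maxrunning messages refer to, not something we have configured ourselves:)

# Illustration only -- roughly what the implicit __mine_interval job
# would look like if written out as an explicit minion schedule entry.
schedule:
  __mine_interval:
    function: mine.update
    minutes: 60          # matches the default mine_interval of 60 minutes
    jid_include: True    # track runs in the minion job cache
    maxrunning: 2        # allow at most two overlapping runs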

We've noted that the minion usually stops responding to the master shortly after one of the '__mine_interval' jobs has run (usually one of the first three or four after a restart). A lot of searching on the web for a similar issue hasn't turned up anything that matches the problem we're seeing (and some of the suggested solutions have not helped, though there may still be some strange configuration issue on our end).
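
(One experiment we may try on a couple of test minions is stretching the mine update schedule out to see whether the lockups still track the __mine_interval job; a sketch of that minion config change, assuming the default 60-minute interval:)

# /etc/salt/minion -- excerpt, test boxes only
# Run mine.update once a day instead of the default every 60 minutes,
# purely to see whether the hangs still follow the __mine_interval job.
mine_interval: 1440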

The version output from one of the masters and one of the affected minions is below. The masters are currently running 2014.7.1 per Tom Hatch, who mentioned a bug in 2014.7.0 that caused the master to prematurely report a minion's status before the minion had had a chance to respond:

dsalt01 (master):
           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.7.2
       M2Crypto: 0.20.2
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 2.2.0.1
           RAET: Not Installed
            ZMQ: 3.2.4
           Mako: Not Installed
dpuppetklareau01 (affected minion):
               Salt: 2014.7.0
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.7.3
           M2Crypto: 0.20.2
     msgpack-python: 0.4.0
       msgpack-pure: Not Installed
           pycrypto: 2.6.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 2.2.0.1
               RAET: Not Installed
                ZMQ: 3.2.4
               Mako: Not Installed

If any of the configuration information is needed, I can gladly provide it, though I'm not sure which settings would be most relevant in this case; please let me know.

  • Ken Lareau
@minddrive
Author

Some clarifications:

  • Each of our environments has two masters, and all the minions in each environment point to both (we have 300-700 minions in each environment); a sketch of the relevant minion config follows this list.
  • When the minion stops responding, it stops responding to both masters, not just one.
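
The multi-master piece of the minion config is essentially the following (the second hostname is a stand-in for our real second master):

# /etc/salt/minion -- excerpt
# Each minion points at both masters in its environment.
master:
  - dsalt01
  - dsalt02    # stand-in name for the second master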

@rallytime
Contributor

@minddrive Thanks very much for this excellent report. I'm not sure if this will fix your problem exactly, but here are two things that I would try first:

  • First, I would try upgrading your affected minions to 2014.7.1
  • If you're still seeing problems there, you might also want to upgrade your version of ZMQ to 4.x.

That is interesting about the mine interval jobs. I am not super familiar with that function or code, but @basepi or @cachedout might have some more suggestions on how you can troubleshoot these connectivity issues.

@rallytime added the Bug (broken, incorrect, or confusing behavior) and severity-low (4th level, cosmetic problems, workaround exists) labels on Feb 20, 2015
@rallytime added this to the Approved milestone on Feb 20, 2015
@tbaker57
Contributor

Is this the same issue as #20735, relating to auth failures during mine.update?

cheers
Tim

@minddrive
Author

@tbaker57 In my case, the minion process continues to run and seemingly continues to communicate at some level, just not to the point where it can actually do anything.

@minddrive
Author

@rallytime Thanks for the information. I'm upgrading all the minions in our development environment to 2014.7.1 as we speak, and we'll see how that pans out over the weekend. The next step will be to try upgrading ZMQ and PyZMQ to more recent versions to see if that helps.

@rallytime
Contributor

@minddrive Excellent! Let us know how that works out and we can go from there.

@minddrive
Author

So after a bit of work, I now have:

dsalt01 (master):
           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.7.2
       M2Crypto: 0.20.2
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.3.1
           RAET: Not Installed
            ZMQ: 4.0.4
           Mako: Not Installed
dpuppetklareau01 (affected minion):
               Salt: 2014.7.1
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.7.3
           M2Crypto: 0.20.2
     msgpack-python: 0.4.0
       msgpack-pure: Not Installed
           pycrypto: 2.6.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 14.3.1
               RAET: Not Installed
                ZMQ: 4.0.4
               Mako: Not Installed

This has led to some more stability: fewer systems are randomly failing, but usually an initial 'test.ping' run needs to be done to 'prime the pump' before the truly non-responsive systems show up, and a few of our development VMs in particular still have issues. My next step is to update the master and minion configuration files and tweak a few settings to see if I can take care of the stragglers, but the upgrades have helped some. I will follow up once the new configurations are in place.

@rallytime
Contributor

@minddrive Awesome! Glad those updates helped. Let us know after you tweak some of your settings if this is still an issue, or what else we can help troubleshoot. I am also wondering if something like setting the ping_on_rotate option to True in your master config might help here. (I'm not sure if that is the exact issue you're seeing, but it very well might be.) You can learn a little more about that here: http://docs.saltstack.com/en/latest/ref/configuration/examples.html
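
For reference, that would look something like the following in /etc/salt/master (a sketch; the option defaults to False):

# /etc/salt/master -- excerpt
# Ping all connected minions whenever the master rotates its AES session
# key, so minions that missed the rotation re-authenticate rather than
# silently dropping off.
ping_on_rotate: True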

@rallytime added the fixed-pls-verify (fix is linked, bug author to confirm fix) label on Mar 12, 2015
@minddrive
Author

Looks like things are pretty stable now. Thanks for the help, folks, and in particular @rallytime. :)

@rallytime
Contributor

Glad you're back on your feet! I'm going to go ahead and close this issue. If you run into this problem again, leave a comment and we will be glad to reopen this issue and readdress the problem. Thanks!
