Minions stop communicating with masters until restart is done #20874

Closed
minddrive opened this issue Feb 20, 2015 · 10 comments
Labels
Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), severity-low (4th level, cosmetic problems, workaround exists)
Milestone

Comments

@minddrive

We have been attempting to replace our MCollective infrastructure with Salt, but have hit one major hurdle: after running fine for a while (usually 45-60 minutes), many of the minions suddenly stop communicating with the master (domains have been removed from hostnames in all output/logs below):

$ salt -E 'dpuppetklareau*' test.ping -v
Executing job with jid 20150220111427692727
-------------------------------------------

dpuppetklareau01:
    Minion did not return. [Not connected]

Restarting the minion immediately corrects the problem, but on many of our minions the problem recurs much of the time. We have set the logging levels on the minions and masters (we have two masters in each of our environments) to debug, but no useful information has been found. It seems that authentication continues even when the minion is no longer responding... here's part of the log from the master side:

2015-02-20 10:38:56,099 [salt.master                                ][INFO    ] Authentication request from dpuppetklareau01
2015-02-20 10:38:56,117 [salt.master                                ][INFO    ] Authentication accepted from dpuppetklareau01
2015-02-20 10:38:56,122 [salt.master                                ][INFO    ] Authentication request from dpuppetklareau01
2015-02-20 10:38:56,142 [salt.master                                ][INFO    ] Authentication accepted from dpuppetklareau01
2015-02-20 10:38:56,246 [salt.utils.event                           ][DEBUG   ] Sending event - data = {'id': 'dpuppetklareau01', '_stamp': '2015-02-20T10:38:56.245430', 'result': True, 'pub': '<public key>', 'act': 'accept'}
2015-02-20 10:38:56,258 [salt.utils.event                           ][DEBUG   ] Sending event - data = {'id': 'dpuppetklareau01', '_stamp': '2015-02-20T10:38:56.258226', 'result': True, 'pub': '<public key>', 'act': 'accept'}
2015-02-20 11:14:27,693 [salt.utils.event                           ][DEBUG   ]

and the logs from the minion side:

[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
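
(For reference, the debug logging mentioned above was enabled with roughly the following in the minion config, and the equivalent in the master config; this is just a sketch assuming the stock /etc/salt locations, and the daemons were restarted after the change:)

# /etc/salt/minion -- excerpt (same idea in /etc/salt/master)
# Raise the log level so key loading and auth exchanges show up in
# /var/log/salt/minion and /var/log/salt/master.
log_level: debug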

Another thing to note is that even after the minion no longer responds to the master, one internal job still seems to run:

[INFO    ] Running scheduled job: __mine_interval
[DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: This job was scheduled with a max number of 2
[INFO    ] Running scheduled job: __mine_interval
[DEBUG   ] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG   ] schedule: This job was scheduled with a max number of 2
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20150220083855680400', 'pid': 2032, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
[DEBUG   ] schedule.handle_func: Checking job against fun mine.update: {'fun': 'mine.update', 'jid': '20150220083855680400', 'pid': 2032, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
[DEBUG   ] schedule.handle_func: Incrementing jobcount, now 1, maxrunning is 2
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20150220083855686924', 'pid': 2037, 'id': 'dpuppetklareau01', 'schedule': '__mine_interval'}
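
(For anyone not familiar with those schedule log lines: the built-in __mine_interval job behaves roughly like an explicit schedule entry along the lines below in the minion config. This is only an illustration of what the jid_include/maxrunning messages refer to, not something we have configured ourselves:)

# Illustration only -- roughly what the implicit __mine_interval job
# would look like if written out as an explicit minion schedule entry.
schedule:
  __mine_interval:
    function: mine.update
    minutes: 60          # matches the default mine_interval of 60 minutes
    jid_include: True    # track runs in the minion job cache
    maxrunning: 2        # allow at most two overlapping runs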

We've noted that the minion usually stops responding to the master shortly after one of the '__mine_interval' jobs has run (usually one of the first three or four after a restart). A lot of searching on the web for a similar issue hasn't turned up anything that matches the problem we're seeing (and some of the suggested solutions have not helped, though there may still be some strange configuration issue on our end).
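
(One experiment we may try on a couple of test minions is stretching the mine update schedule out to see whether the lockups still track the __mine_interval job; a sketch of that minion config change, assuming the default 60-minute interval:)

# /etc/salt/minion -- excerpt, test boxes only
# Run mine.update once a day instead of the default every 60 minutes,
# purely to see whether the hangs still follow the __mine_interval job.
mine_interval: 1440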

The version output from one of the masters and one of the affected minions is below. The masters are currently running 2014.7.1 per Tom Hatch, who mentioned a bug in 2014.7.0 that caused the master to prematurely report a minion's status before the minion had had a chance to respond:

dsalt01 (master):
           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.7.2
       M2Crypto: 0.20.2
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 2.2.0.1
           RAET: Not Installed
            ZMQ: 3.2.4
           Mako: Not Installed
dpuppetklareau01 (affected minion):
               Salt: 2014.7.0
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.7.3
           M2Crypto: 0.20.2
     msgpack-python: 0.4.0
       msgpack-pure: Not Installed
           pycrypto: 2.6.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 2.2.0.1
               RAET: Not Installed
                ZMQ: 3.2.4
               Mako: Not Installed

If any of the configuration information is needed, I can gladly provide it, though I'm not sure which settings would be most relevant in this case; please let me know.

  • Ken Lareau
@minddrive
Author

Some clarifications:

  • Each of our environments has two masters, and all the minions in each environment point to both (we have 300-700 minions in each environment); a sketch of the relevant minion config follows this list.
  • When the minion stops responding, it stops responding to both masters, not just one.
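
The multi-master piece of the minion config is essentially the following (the second hostname is a stand-in for our real second master):

# /etc/salt/minion -- excerpt
# Each minion points at both masters in its environment.
master:
  - dsalt01
  - dsalt02    # stand-in name for the second master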

@rallytime
Contributor

@minddrive Thanks very much for this excellent report. I'm not sure if this will fix your problem exactly, but here are two things that I would try first:

  • First, I would try upgrading your affected minions to 2014.7.1
  • If you're still seeing problems there, you might also want to upgrade your version of ZMQ to 4.x.

That is interesting about the mine interval jobs. I am not super familiar with that function or code, but @basepi or @cachedout might have some more suggestions on how you can troubleshoot these connectivity issues.

@rallytime added the Bug (broken, incorrect, or confusing behavior) and severity-low (4th level, cosmetic problems, workaround exists) labels on Feb 20, 2015
@rallytime added this to the Approved milestone on Feb 20, 2015
@tbaker57
Contributor

Is this the same issue as #20735, relating to auth failures during mine.update?

cheers
Tim

@minddrive
Author

@tbaker57 In my case, the minion process continues to run and seemingly continues to communicate at some level, just not to the point where it can actually do anything.

@minddrive
Author

@rallytime Thanks for the information. I'm upgrading all the minions in our development environment to 2014.7.1 as we speak, and we'll see how that pans out over the weekend. The next step will be to try upgrading ZMQ and PyZMQ to more recent versions to see if that helps.

@rallytime
Contributor

@minddrive Excellent! Let us know how that works out and we can go from there.

@minddrive
Author

So after a bit of work, I now have:

dsalt01 (master):
           Salt: 2014.7.1
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.7.2
       M2Crypto: 0.20.2
 msgpack-python: 0.4.0
   msgpack-pure: Not Installed
       pycrypto: 2.6.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 14.3.1
           RAET: Not Installed
            ZMQ: 4.0.4
           Mako: Not Installed
dpuppetklareau01 (affected minion):
               Salt: 2014.7.1
             Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
             Jinja2: 2.7.3
           M2Crypto: 0.20.2
     msgpack-python: 0.4.0
       msgpack-pure: Not Installed
           pycrypto: 2.6.1
            libnacl: Not Installed
             PyYAML: 3.10
              ioflo: Not Installed
              PyZMQ: 14.3.1
               RAET: Not Installed
                ZMQ: 4.0.4
               Mako: Not Installed

This has led to some more stability: fewer systems are randomly failing, but usually an initial 'test.ping' run needs to be done to 'prime the pump' before the truly non-responsive systems show up, and a few of our development VMs in particular still have issues. My next step is to update the master and minion configuration files and tweak a few settings to see if I can take care of the stragglers, but the upgrades have helped some. I will follow up once the new configurations are in place.

@rallytime
Contributor

@minddrive Awesome! Glad those updates helped. Let us know after you tweak some of your settings if this is still an issue, or what else we can help troubleshoot. I am also wondering if something like setting the ping_on_rotate option to True in your master config might help here. (I'm not sure if that is the exact issue you're seeing, but it very well might be.) You can learn a little more about that here: http://docs.saltstack.com/en/latest/ref/configuration/examples.html
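
For reference, that would look something like the following in /etc/salt/master (a sketch; the option defaults to False):

# /etc/salt/master -- excerpt
# Ping all connected minions whenever the master rotates its AES session
# key, so minions that missed the rotation re-authenticate rather than
# silently dropping off.
ping_on_rotate: True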

@rallytime added the fixed-pls-verify (fix is linked, bug author to confirm fix) label on Mar 12, 2015
@minddrive
Author

Looks like things are pretty stable now. Thanks for the help, folks, and in particular @rallytime. :)

@rallytime
Contributor

Glad you're back on your feet! I'm going to go ahead and close this issue. If you run into this problem again, leave a comment and we will be glad to reopen this issue and readdress the problem. Thanks!
