Minions stop communicating with masters until restart is done #20874
Some clarifications:
@minddrive Thanks very much for this excellent report. I'm not sure if this will fix your problem exactly, but here are two things that I would try first:
That is interesting about the mine interval jobs. I am not super familiar with that function or code, but @basepi or @cachedout might have some more suggestions on how you can troubleshoot these connectivity issues.
Is this the same issue as #20735, relating to auth failures during mine.update? cheers
@tbaker57 In my case, the minion process continues to run and seemingly continues to communicate at some level, just not to the point where it can actually do anything.
@rallytime Thanks for the information. I'm upgrading all the minions to 2014.7.1 as we speak in our development environment, and we'll see how that pans out over the weekend. The next step will be to try upgrading ZMQ and PyZMQ to more recent versions to see if that helps.
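As a quick sanity check before and after such an upgrade, the libzmq and PyZMQ versions that the Python interpreter actually sees can be printed with a short snippet (a minimal sketch; `zmq_versions` is a hypothetical helper name, and the case where pyzmq is not installed is handled explicitly):

```python
# Minimal sketch: report the libzmq and pyzmq versions visible to this
# Python interpreter, or None if pyzmq is not importable at all.
def zmq_versions():
    try:
        import zmq  # provided by the pyzmq package
    except ImportError:
        return None
    return {
        "libzmq": zmq.zmq_version(),   # version of the underlying C library
        "pyzmq": zmq.pyzmq_version(),  # version of the Python bindings
    }

if __name__ == "__main__":
    print(zmq_versions())
```

Running this with the same interpreter Salt uses confirms the upgrade actually took effect, rather than leaving an older library on the interpreter's path.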
@minddrive Excellent! Let us know how that works out and we can go from there.
So after a bit of work, I now have:
This has led to more stability: fewer systems are randomly failing, though an initial 'test.ping' run usually needs to be done to 'prime the pump' and reveal the truly non-responsive systems, and a few of our development VMs in particular still have issues. My next step is to update the master and minion configuration files and tweak a few settings to see if I can take care of the stragglers, but the upgrades have helped some! Will follow up once the new configurations are in place.
@minddrive Awesome! Glad those updates helped. Let us know after you tweak some of your settings if this is still an issue, or what else we can help troubleshoot. I am also wondering if something like setting the
Looks like things are pretty stable now. Thanks for the help, folks, and in particular @rallytime . :) |
Glad you're back on your feet! I'm going to go ahead and close this issue. If you run into this problem again, leave a comment and we will be glad to reopen this issue and readdress the problem. Thanks!
We have been attempting to replace our MCollective infrastructure with Salt, but have hit one major hurdle: many of the minions, after running fine for a while (usually 45-60 minutes), suddenly stop communicating with the master (the domain has been removed from hostnames in all output/logs):
Restarting the minion immediately corrects the problem, but on many of our minions the problem recurs much of the time. We have set the logging levels on the minions and masters (we have two in each of our environments) to debug, but no useful information has been found. It seems that authentication continues even when the minion is no longer responding... here's part of the logs from the master side:
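For reference, raising the log level is done in the master and minion config files (a minimal sketch of the relevant keys; the paths are the stock defaults, adjust for your layout):

```yaml
# /etc/salt/minion (and analogously /etc/salt/master)
log_level: debug          # console log level
log_level_logfile: debug  # log file level; falls back to log_level if unset
```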
and the logs from the minion side:
Another thing to note is that even after the minion no longer responds to the master, one internal job still seems to run:
We've noted that usually shortly after one of the '__mine_interval' jobs has run (usually one of the first three or four after restart) is when the minion stops responding to the master. A lot of searching on the web for a similar issue hasn't found anything that seems to match the problem we're seeing (and some of the suggested solutions have not helped, though there still may be some strange configuration issue).
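For anyone tuning this, the mine refresh cadence is controlled per minion (a sketch of the relevant minion config keys; the function shown and the interval value are illustrative, not a recommendation):

```yaml
# /etc/salt/minion
mine_interval: 60     # minutes between __mine_interval runs (default 60)
mine_functions:       # what each run collects and pushes to the master
  network.ip_addrs: []
```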
The version output from one of the masters and one of the affected minions is below. The masters are currently running 2014.7.1, as per Tom Hatch, who mentioned a bug in 2014.7.0 that caused the master to report a minion's status prematurely, before the minion had had a chance to respond:
If some of the configuration information is needed, I can gladly give that, though I'm not sure which information would be most relevant in this case; please let me know.