"Minion did not return" executing state with long running command, 2016.3 regression #34872
I think I'm running into the same issue when running tasks that may not come back for 1-3 minutes:
Results in
But the task does complete. I'm using CentOS 7 with salt 2016.3.
@cbuechler I'm not seeing this on 2016.3.1 using this test example:
Are your minions also 2016.3.1?
It only happens when the minions are 2016.3.1; those versions shown are the minion versions. The 2015.8 minion shown on the same 2016.3.1 master works fine. That 'sleep 300' cmd.run exhibits the same behavior. A sleep of 15 seconds is the first that fails in the manner described; 14 seconds or less succeeds. Logs are just like what I included in the original report. This is on a freshly-deployed FreeBSD instance on EC2. It seems to be specific to FreeBSD; doing the same on Linux doesn't exhibit that behavior. From an Ubuntu minion that doesn't have the issue:
It's not just a long-running cmd.run either; applying state.highstate times out in the same fashion on FreeBSD minions even though it completes successfully. Logs are along the lines of what I included in the original post.
This is also a problem on CentOS 7.2. I have only observed this using
In my environment it appears to time out at about 18 seconds.
The same problem does not occur when running
Hmmm, I'm still not able to replicate this behavior on CentOS 7, even when adding a separate environment to the test case. Is there something unique about your environment? Anything in your master or minion configs?
Thanks for checking back in on this. I am now not able to reproduce this problem either. My minion/master configs are nearly default from the 2015.8 release except for the pillar and file root paths. @cbuechler are you still observing this problem?
@eradman what changed in your environment to make this problem go away? |
I was not able to identify anything that changed in my environment; I left my test configuration in a tmux session and we didn't update any RPMs. While trying to root-cause this I did restart the master and minion daemons, but at the time that didn't solve the problem.
Ok, please let us know if it resurfaces. Have the minion systems been under extra load when this happens? |
@eradman, as @cbuechler mentioned, for us it only happens when the minion is 2016.3.1. With a 2015.8 minion it works as expected.
That's very interesting. When the minion is 2016.3.1, does the job show up in the job cache? Can you look it up with
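For anyone who wants to poke at the job cache directly on disk rather than through the salt-run jobs runner, here is a rough sketch that walks the on-disk cache. The path below is the stock default for the local_cache returner, and the per-job "jid" file layout is an assumption based on a stock master; both may differ on customized setups.

```python
import os

def list_cached_jids(cache_root="/var/cache/salt/master/jobs"):
    """Collect JIDs from the master's on-disk job cache.

    Assumes the default local_cache returner layout, where each cached
    job directory contains a small file named "jid".
    """
    jids = []
    for root, _dirs, files in os.walk(cache_root):
        if "jid" in files:
            with open(os.path.join(root, "jid")) as fh:
                jids.append(fh.read().strip())
    return sorted(jids)
```

If the JID shows up here but the CLI still reported "Minion did not return", that points at the client timing out rather than the job being lost.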
@thatch45 here are relevant data:
Then on minion side:
And while it was being executed, I ran on master:
After the work is done and you run
Interesting @thatch45, here are the results after the job is done:
Excellent, then the issue can be isolated to the client: it is returning to the CLI before the job is finished. That is something difficult in Salt because of the disconnected nature of things, and when this happens it is the first thing to check.
Thanks @thatch45. When we had both master and minion on 2015.8.1 it was working. After the master was upgraded to 2016.3.1 it kept working. The problem started when we upgraded the minion to 2016.3.1. All the examples I sent you were run with both on 2016.3.1.
Very odd indeed, but this gives us a better place to hunt. I will ask more questions as we find them.
Greetings: this does not appear to be a problem in 2016.3.3, but I also can't reproduce it with the master at 2016.3.3 and the minion at 2016.3.1. The master is Ubuntu 14.04, the minion is FreeBSD 10.3. The versions report and the states I used are below. I'm happy to investigate further; I am curious about the network setup. Here are a few questions re: the setup:
top.sls:
sleeptest.sls
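The contents of those two files were lost in formatting. A minimal sketch of what such a test setup typically looks like, where the target, state ID, and sleep duration are guesses rather than the originals:

```yaml
# top.sls (hypothetical reconstruction)
base:
  '*':
    - sleeptest

# sleeptest.sls (hypothetical reconstruction)
long-sleep:
  cmd.run:
    - name: sleep 300
```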
state.highstate working:
Versions:
OK, I dug a little deeper and discovered that a similar FreeBSD 10.3 machine created in Amazon does show the issue. However, the FreeBSD machine I created in Vultr does not. So it seems to have something to do with Amazon (the configuration of that AMI, or Amazon's network).
This happens in our VMware environment, so it's not specific to Amazon.
@cypher386 are your minions FreeBSD also? |
I'm facing the same issue on a FreeBSD 10.1-RELEASE where I'm working on adding some FreeBSD modules, using
but the minion's logs show that the command is still executing and successfully returns when it's done, without the master being notified about it. What I see if I run
I think the master loses the minion X seconds after this message appears, where X=10 in the above case, but I'm not sure about the exact timing. The only thing I'm sure about is that the master loses the minion only after this message has appeared (I've tested with different delay times). I don't know if that helps, but I'd be glad to test your ideas on my machine in order to resolve this issue.
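For reference, the window before the master gives up is governed by two master-side settings: the CLI timeout and gather_job_timeout, the latter controlling how long the master waits for saltutil.find_job replies when it polls a busy minion. Raising them is only a workaround, but it can help confirm a timing-related cause. The values below are illustrative, not recommendations:

```yaml
# /etc/salt/master (illustrative values)
timeout: 60              # seconds the CLI waits before polling for returns
gather_job_timeout: 30   # seconds to wait for saltutil.find_job responses
```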
@mamalos Is this a VM? If so, what hypervisor or cloud provider? |
Also, it's excellent that you were able to reproduce it with master and minion on the same box. That rules out an issue with the network.
@cro The FreeBSD 10.1-RELEASE amd64 is running on a VirtualBox VM, but I'm experiencing the exact same issue, with the exact same configuration (master/minion on the same host), on a FreeBSD 10.2-RELEASE-p14 amd64 which is installed on an IBM workstation. So running
On FreeBSD 10.1-RELEASE:
and on my FreeBSD 10.2-RELEASE-p14:
It is very easy to reproduce it, as others have already mentioned in this thread, by using the
To be honest, I've only experienced this issue when using the
Greetings. I think I fixed this. It turned out to be far simpler than we thought. Incidentally it was also broken on macOS. Let me know if I nailed it or not. |
@cro, bravo that you found the source of the problem! I can confirm that the message does not appear any more, but
I just confirmed the root cause by mounting /proc before starting the command that takes a long time to return, and it worked. FreeBSD supports /proc but it's not mounted by default, and mounting it is not recommended; I only did it to validate the issue.
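That fits the eventual fix: the liveness check for the running job relied on /proc, which FreeBSD and macOS don't mount by default. For illustration, a process-existence check that avoids /proc entirely can be done with signal 0. This is a minimal sketch of the idea, not Salt's actual patch:

```python
import errno
import os

def pid_alive(pid):
    """Return True if a process with this PID exists, without touching /proc.

    Works on FreeBSD and macOS, where procfs is not mounted by default.
    """
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs only an existence/permission check
    except OSError as err:
        # EPERM means the process exists but belongs to another user
        return err.errno == errno.EPERM
    return True
```

A check like this keeps working regardless of whether procfs is mounted, which is exactly the property the FreeBSD and macOS minions needed.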
PR was merged, closing |
Sorry to comment on a closed topic, but this is happening on CentOS 7 + salt (minion & master) 2016.11.4-1. |
Hi @amraam83, could you please provide some code to reproduce? I've set up CentOS 7 on a VM with salt 2016.11.4 (not 2016.11.4-1, but I'm not using RHEL, so I followed Salt's guide for installing the latest version using bootstrap, and that's the version it installed; if you need me to install a different one, please give some guidance) and ran
Hi @mamalos, it turned out it was a very nasty bug in our states (a dependency that restarted the salt-minion daemon during a highstate). Sorry for bothering :(
Cool! Because it would be reaaally weird if this issue was affecting Linux systems as well! :) |
Hi,
Thanks for your help |
@ntvietvn, as stated before, this issue is unlikely to be related to a Linux machine, since Linux has a proc filesystem mounted that contains the information Salt is asking for. As @amraam83 commented a few months ago, his analogous issue (on a CentOS 7 machine) was eventually related to a bug in their states. Are you sure this is not the case for you? Have you run the above commands on a clean CentOS 6 installation using the Salt versions you've mentioned?
Happens to me also. We only get output when setting the timeout to 200 seconds or more. It doesn't matter which state it is; as long as it is a long-running state, it happens.
Hi @gaizeror, if you look at the code this issue refers to, you'll see that you shouldn't have this problem on a Windows Server (at least with respect to this issue's root cause). To dig deeper it would be good if you could submit some more details about your configuration, but based on your minion, I don't think your problem has the same cause this issue refers to.
This is a setup I've been using for about a year, on a variety of 2015 versions, with no issue. Upgrading the minion to 2016.3.1 makes it stop working. I believe it's the long-running commands, but maybe it's something else specific to what I'm doing.
Take the following state:
Where image-latest.img.gz is a gzipped disk image file which, in normal operation, is sent to dd with output to a hard drive; for the sake of example, /dev/null behaves in the same way. Any sufficiently large gz file (100 MB would suffice) will show the problem.
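The state block itself was lost in formatting. Based on the description, it presumably looked something like the following sketch, where the file path and state ID are hypothetical:

```yaml
# write-disk.sls (hypothetical reconstruction from the description above)
write-disk:
  cmd.run:
    - name: 'gzip -dc /root/image-latest.img.gz | dd of=/dev/null'
```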
When executing that state from the master via salt my-deploy state.sls write-disk, the master claims the minion didn't return. But if you watch the minion's log, it's running the command, and it finishes successfully. So the minion works; it just apparently doesn't reply back to the master, leaving the master thinking the minion disconnected.
Minion log when the master reports "Minion did not return":
Master log from 2015.8.1, which works correctly:
Versions Report
Not working:
Working:
The OS used doesn't matter either way; FreeBSD 10.2 with any 2015 Salt version works fine.