salt '*' highstate returns 'minion did not return', salt [minion] highstate works #8647
Comments
That's very strange. Is it just the one minion that is acting this way? And have you modified the default timeout at all? What OS is the minion?
Ubuntu 12.04, all the way. No, it happens across all minions. Haven't edited the timeout, AFAIK.
Stranger still! The only difference between […]
What timeouts are there to tune, by the way? An 'ls' works for all minions, for example. The weird part is that the periodic 'Execution is still running on XXX' log messages (I always run with -v; I've aliased 'salt' to 'salt -v') cover progressively smaller batches of servers, if you get my drift.
With […]
What kind of networking setup do you have? Could it be an issue of salt not failing gracefully when it can't communicate with minions?
Nothing has changed (AFAIK) in the network setup recently. However: the second location is quite small, and I would recognize that set of minions. Everything else is either on the same switch or even on the same physical machine. I don't think the network is the issue; traffic isn't high at the times I see this (backups run overnight).
I'm not claiming something has changed in your network that now prevents salt from working. I'm claiming that salt might have had a regression in how it communicates with its minions. What happens when you disconnect the VPN? Can you show us the logs of a hanging minion?
I also experienced this bug. Unfortunately it's not directly reproducible: on the next run the minion returns without problems. I'm targeting all (~10) minions. I'll try increasing the timeout to 120s and see if it appears again. But it looks like it "does not respond" much later than the current timeout of 5s.
This bug affects me as well :(
@zamotivator Can you elaborate? Does it happen every time? Same minions? Reproducible?
@zamotivator Ah, thanks.
I think this bug is the "salt randomly does nothing" bug that I've complained about.
I've had trouble reporting it in the past because it only happens once in hundreds of runs, and I've not been able to reproduce it. Finally, it was happening repeatedly, so I ran with […]
In this case, it turns out the minion had stopped on […]
Well, keep us posted if it happens again when the minion is running. Hopefully we'll be able to track down what's going wrong there.
I have a feeling it must be the minion crashing, or being killed by the OOM killer, or something of that sort. My complaint is that the salt master doesn't give any output to tell you what might have gone wrong, and there is nothing in the logs to indicate something went wrong. Since it's pretty rare, we should prepare now rather than wait until it next happens. What should I do to prepare the system for understanding what's wrong? The minion is very quiet by default, which I find surprising.
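If the OOM killer is indeed the culprit, the kernel logs it; a quick way to check, assuming Ubuntu's default log locations:

```
# Search the syslog and kernel log for OOM-killer activity
grep -i -e 'out of memory' -e 'oom-killer' /var/log/syslog /var/log/kern.log

# Or check the kernel ring buffer directly
dmesg | grep -i 'killed process'
```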
So there's nothing in the minion config, and nothing in the master config? This is relevant for reporting back to the master: #11631. As far as the minion goes, if there was an uncaught exception or something, I would definitely expect it to be in the log. Maybe change the log level to debug on your minions, in the hopes that there's information there? I'm not sure I've ever heard of a minion just stopping without putting anything in the logs, unless it was stopped externally.
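Both of these are standard salt-minion options; the config path below assumes a default install:

```
# One-off: run the minion in the foreground with debug logging
salt-minion -l debug

# Persistent: set the log level in the minion config and restart
echo 'log_level: debug' >> /etc/salt/minion
service salt-minion restart
```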
I keep seeing this fairly regularly with Windows minions in 2014.1.4 :(
I observe this on 2014.1.10. Highstate is kind of a promise that all hosts are OK, so I find this critical for confidence in Saltstack's ability to keep systems in a controlled state.
@oyvjel Adding a […]
Not sure if it is related, but I have to repeat "salt-run manage.status" 4-5 times before it returns the correct status.
It's possible this is related to the resetting AES session that @cachedout refers to here.
I'm using salt 2014.1.10 and I also have this problem; highstate sometimes only works if I pass -t ABIGAMOUNTOFTIME.
Unfortunately […]
@frumpel That's another bug. Timeouts should be the maximum number of seconds to wait for responses.
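For reference, the timeout being discussed is the -t flag on the salt CLI (5 seconds by default at the time):

```
# Wait up to 60 seconds for minion returns instead of the 5-second default
salt -t 60 '*' state.highstate
```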
Are people still running into this? Reading back through, it sounds to me like the […]
Yes, I am facing a similar issue.
Version info: […]
In an older 2014.1.5 environment, adding "gather_job_timeout: 15" fixed this for us.
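For anyone else landing here: gather_job_timeout is a master-side setting, so a sketch of that fix (assuming the default config path) looks like:

```
# Give minions 15 seconds to answer the master's "is the job still running?" query
echo 'gather_job_timeout: 15' >> /etc/salt/master
service salt-master restart
```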
@Mrten - can we close this issue or are you still seeing a problem?
I doubt any code is the same since then :)
@kracekumar Can you try increasing the gather_job_timeout, as mentioned above?
Thanks @basepi and @jordanrinke, increasing the timeout works.
I'm thinking we should probably increase the gather_job_timeout by default. It will increase the total time it takes the CLI to return for fast-running jobs when minions are offline, but it may be worth it to avoid issues such as this one.
As it is now, once a minion doesn't report, the master stops trying. How about a minimum of n retries within an interval of x?
That's not correct, actually. Or it shouldn't be. What happens is the master sends a job to the minions. It then waits up to the CLI timeout; if the job hasn't finished by then, it sends a saltutil.find_job query to see whether the job is still running on each minion. If a minion doesn't reply to the find_job query within gather_job_timeout seconds, the master gives up on that minion. Are you seeing different behavior?
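The two-stage behavior described above can be poked at by hand; saltutil.find_job and the jobs runner are standard, and <jid> below is a placeholder:

```
# List recent jobs on the master to find the jid of the hung highstate
salt-run jobs.list_jobs

# Ask the minions directly whether that job is still running
salt '*' saltutil.find_job <jid>
```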
This same behavior occurs while running test.ping. We're seeing it while using 2015.5.3 when running 'salt -t 10 host test.ping' on a largely idle master and minion (~0.5 load average). We have a single master configuration and there is low latency between the master and minion. Here is the debug output when the test.ping request fails:
and on the minion:
And here's the debug output when the test.ping request succeeds:
and on the minion:
The zmq connection is up throughout. tcpdump shows that the response is making it back to the master server (or at least some data is making it back). It looks like the master is either ignoring the result or dropping it on the floor.
Unfortunately I don't have master server debug logs to show. I'll add those when I have 'em. (Restarting the master with debug logging enabled "fixed" the problem, naturally!)
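For the record, the capture described above can be reproduced with something like this, assuming Salt's default ports (4505 publish, 4506 return):

```
# Watch traffic between master and minion on Salt's default ZeroMQ ports
tcpdump -i any -nn 'tcp port 4505 or tcp port 4506'
```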
This post includes debug logs from the master.
Logs on the master for this request:
I may not have captured every line of this response -- the master is busy with other things. However, this is at least every line with either jid *3788 or jid *6344. salt-run jobs.lookup_jid for either of these jids returns the expected True result. A subsequent test found that jobs.lookup_jid returns a result even while the original salt command line shows the 'minion did not return' error. I've verified that none of the salt processes are out of fds or hitting other resource limits (e.g. ulimit limits). strace shows that the connection is active and that the client is continuously polling. All of this points to the bug being in the server, IMO. I suspect the server is not sending the result down the zmq socket.
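To make the check above concrete (the jid is a placeholder), the stored return can be pulled from the master's job cache even after the CLI has given up:

```
# Fetch the stored return for a job the CLI already reported as "did not return"
salt-run jobs.lookup_jid <jid>
```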
@dpkirchner Interesting, thanks for the detective work. The issue you're seeing looks a bit different from this original issue. Would you mind opening a new one with the same findings?
There's an apparent difference between targeting '*' and targeting a host directly. (0.17.2 all the way now.)
I have this state that updates salt to the latest version:
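The state itself didn't survive the formatting; a minimal sketch of such a state (the file name is hypothetical, and pkg.latest is the standard state for this):

```
# Hypothetical reconstruction -- the original snippet wasn't preserved
cat > /srv/salt/update-salt.sls <<'EOF'
salt-minion:
  pkg.latest: []
EOF
```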
It should be a no-op, since I'm on the latest version (I checked this time :)
It returns with a no-op when targeted directly, but it doesn't return at all when targeted with '*'. It does run, I can see that because I run the minion on the console with
salt-minion -d debug
.So, to re-iterate:
gives me […]
but
gives me […]
Complete output from salt-minion -d debug is on http://pastebin.com/10h2YETr