Timeouts (Minion did not return) since 2014.1.7 #14343
Comments
Very strange! Thanks for the report, we'll investigate this.
When you run the commands, does the job show up in the minion log? Wondering if it's a loss of connection, or just a problem with the return data, or what...
Experiencing the same issue. CentOS 7.
No indications in minion logs apart from:
@JoeHazzers Did you also just start seeing it on 2014.1.7, or have you seen it previously?
It is not easy to reproduce the problem. Currently it is gone: after posting this issue I enabled debug mode for the master and 2 minions. Also, we rebooted every server in our network over the weekend (kernel upgrade). Since we rebooted every system we haven't seen this problem anymore, so it looks like a missing reboot or something like that. Or maybe running in debug mode doesn't trigger the bug?
I doubt debug log level affects this bug, but who knows? Keep us posted if you find a way to reproduce this issue, or if you get those logs that I mentioned.
Very strange. I'll keep this open for now, keep us posted!
Now it happened again. No software was updated, so I am 100% sure that there was no pending reboot. But as said, debug log level was turned off, so there is nothing I can show you. I'll turn on logging again; maybe this only happens when the master has been running for some hours...
Curses. =) It happens. If you don't mind keeping debug logging on for a time, that would be helpful.
OK, here we go. I was able to reproduce the issue in a reduced test network. It seems to be an authentication problem:
master debug log:
srv3 minion debug log:
srv8 minion debug log:
Because the re-authentication time is different per client, I think that's why it feels like it takes 1-2 minutes until all minions are back, as stated in #14343 (comment). Now the question is why authentication sometimes fails. I also wonder why the master doesn't log any details regarding the failed authentication.
Well, interestingly enough, in both cases the failed authentication was before the actual command was received by the master, so it doesn't seem like that's actually the issue. The command is received and returned quickly.
Here's another minion log:
Weird! For now I would recommend setting up a cron job with just a basic test.ping.
FYI, the current develop version has a keepalive patch that might help keep the minion online: the minion will ping the master every 10 minutes, and if the master doesn't respond, the minion will restart. The above might get you by for now, but the root issue should still be investigated.
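If that keepalive later landed as the ping_interval minion option (an assumption on my part; check the config reference for your version), enabling it would look like this:

    # /etc/salt/minion
    # Ping the master every 10 minutes; 0 (the default) disables the keepalive.
    ping_interval: 10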
I tend to wonder if this might be related to the ZeroMQ connection caching that was added in that time frame. (cc: @jacksontj for his thoughts.)
+1, I'm also getting this since 2014.1.7.
Actually, looking at the logs it doesn't seem to be related to my zmq caching, since it happens so close to when the event happens (and before the command is run)-- and the channel re-use stuff is on rep/recv (the return side of things). The failure that you are seeing looks like the master AES key has rotated in that time. If you look at srv8 minion's log you can see the authentication failure message at the exact time that the job was published. The only 2 places that print that exact message are in crypt.py (https://github.com/saltstack/salt/blob/develop/salt/crypt.py#L671), which both mean that either the AES key was bad or the pub job was malformed-- I'd bet on the AES key rotating.

With that, you are kicking in the acceptance_wait_time back-off (https://github.com/saltstack/salt/blob/develop/salt/minion.py#L874), which makes the re-auth much less aggressive. This was changed back to higher numbers (50s+) in this release, although I know for a fact that it has been set significantly lower in previous releases. This would cause your returns to take significantly longer to come back because the minion is taking significantly longer to get them (time of pub + time to re-auth + job time). Now, the reason this affects some minions and not all is that the re-auth delay is randomized, so it makes sense that "some" minions "don't" return, because they got a longer re-auth timer.

If your environment is relatively small you can tune down the acceptance_wait_time to have the minions re-auth faster, and maybe disable AES key rotation if you have it enabled (a sketch of such a config is below). The ping every N seconds will help because it will force the minion to re-auth sooner (instead of waiting for a job), but it doesn't solve the problem. Basically it's a tradeoff between minion availability and master load (for the re-auth). If you want to play around with it more, you can probably reproduce this by doing a salt-key -d on some specific minion key, as this forces an AES key rotation, or by otherwise forcing an AES key rotation.
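A minimal sketch of that tuning, assuming the default config paths; the values are illustrative, not recommendations:

    # /etc/salt/minion
    # Base delay (in seconds) between auth retries; the randomized back-off
    # described above grows from this value up to the _max cap.
    acceptance_wait_time: 10
    acceptance_wait_time_max: 30

To experiment with reproducing it, deleting a minion key forces an AES key rotation:

    salt-key -d some-minion-id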
@jacksontj, is what you just suggested consistent with running |
Please check your system logs on the minion to be sure nothing strange correlates with the minion down time.
This might sound a little silly, but could you log pings to the master...
I'm not sure if the above command will work... if not, don't worry about it :)
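The suggested command itself was not preserved in the thread; a stand-in sketch that logs whether the master's publish port answers (assuming the default port 4505 and a resolvable saltmaster hostname):

    #!/bin/sh
    # Append one line per minute: timestamp plus ok/FAIL for TCP port 4505.
    while true; do
        if nc -z -w 2 saltmaster 4505; then status=ok; else status=FAIL; fi
        echo "$(date) $status" >> /var/log/master-ping.log
        sleep 60
    done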
@pwaller Yes, the re-auth is a blocking call within the minion's main process. So, until it is able to re-auth, no new jobs will be pulled in. Depending on the HWMs (ZeroMQ high-water marks) it may or may not process those jobs after the auth is finished.
Just adding a +1 for me. Version 2014.1.7. Seems to be "several minutes" (2-10) before the minions are responsive again.
+1 for me as well, we've added a cronjob to test.ping every 5 minutes, but a fix in salt for this would be awesome. :)
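For reference, such a workaround could look like this cron entry (schedule and paths are assumptions, not taken from the thread):

    # /etc/cron.d/salt-keepalive -- force a publish every 5 minutes
    */5 * * * * root salt '*' test.ping > /dev/null 2>&1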
I've just submitted a pull req to log the AES key rotation on the master. Key rotation in salt happens for a few different reasons:
Key rotation doesn't really "kill" the connection, it just means that the minion can no longer decrypt publishes until it re-authenticates.
Thanks for the info @jacksontj. I've not heard of publish_session before this ticket. Just had a look, and by default it's set to 86400.
Yea, that is in seconds-- so that would mean every 24h. We also use the reauth_delay stuff in our infrastructure.
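For context, a sketch of that master setting (86400 is the default publish_session value in seconds; verify against your version's documentation):

    # /etc/salt/master
    # How often (in seconds) the master rotates the AES session key
    # used to encrypt publishes; 86400 = once per 24h.
    publish_session: 86400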
OK, I applied #15603 to salt-2014.1.10 and deactivated the 5-minute test.ping cronjob. For a short moment I thought the problem was gone because everything was working for ~20 hours. No timeout. But then it happened again. And thanks to the new log output I saw the reason:
The first salt command after this was logged ran into a timeout. All following requests succeeded until the next key rotation was logged.
To be clear:
From my understanding (please correct me if I am wrong) this is happening because the master doesn't notify the minions "Hey, I have rotated my keys... check your authentication...". This could be improved if the master actively notified the minions about the rotation.
The notification of minions should be relatively easy (I'd think the reactor/event system could be used for this).
As for the minion-cannot-decrypt thing, it actually does exactly what you suggest: when the minion fails to decrypt a publish, it re-authenticates.
I've looked into solving this issue a bit, but zeromq makes solving a few of these things tricky.
@jacksontj Regarding AES key rotation, there's some work over in #15722. |
Are you sure that this is happening? I mean, if this is actually happening, shouldn't I see the following work:
The command will time out - no minion will return (because that's the first command after the key was rotated). But if you are right and the minion will detect a key change, re-authenticate and resume work, then /i-was-here should have a last-modified time after 01:15 + backoff_timer_value... I am not sure about the current backoff_timer value, but if I check each minion at 02:15 (60 minutes after I run the touch command) I should see a recently touched file /i-was-here, shouldn't I? If this is not working (my feeling), /i-was-here should not exist/shouldn't be modified on any minion... because the job (touching /i-was-here) was lost. I'll test that in 3 hours... :)
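The elided test commands were presumably along these lines (the /i-was-here path and the 01:15/02:15 times come from the comment; the exact invocations are assumptions):

    # 01:15, right after a key rotation -- expected to time out on the CLI
    salt '*' cmd.run 'touch /i-was-here'

    # 02:15, check whether the job nevertheless ran once minions re-authed
    salt '*' cmd.run 'stat -c %y /i-was-here'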
Guess I should be clearer there too. And yes, for your test, what I expect to happen is something like this:
If the back-off time is relatively long, it's very likely that the return will arrive after the CLI has already timed out.
@jacksontj OK, what you said is correct:
I don't understand the output, though:
From my understanding, minions srv2, srv3, srv4, srv5 and srv7 are saying "Minion did not return",
but srv6 and srv8 did run this job. So I don't understand why these two minions are also listed with "Minion did not return" status. Using the reactor system sounds nice, but we need a way to tell the CLI that the job isn't lost. Well, as shown, the job isn't lost, but when you are using the remote execution stuff and get such a response, you will probably re-send the command. That could be a problem if the command shouldn't/cannot run twice, for example.
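One way to convince yourself (and the CLI user) that such a job isn't lost is to look its return up in the job cache afterwards; a sketch, where the JID is a placeholder:

    # List recent jobs, then fetch the returns for the one that "timed out".
    salt-run jobs.list_jobs
    salt-run jobs.lookup_jid 20140912011500123456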
👍 I am also affected by this problem. I have about 100 minions distributed across many DCs. The problem happens quite often, as our servers scale up and scale down unattended. The salt master loses connectivity to many minions and ends up disrupting other orchestrations when the AES key is rotated due to a minion being deleted. Are there any stop-gap measures that I can use while this issue is being resolved?
On 2014.7.0, if you set rotate_aes_key: False, the master no longer rotates the AES key when minion keys are deleted. This does allow minions with those keys to continue to eavesdrop on publishes, but they can't interact with the master anymore.
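A sketch of that stop-gap on the master (assuming rotate_aes_key is the option meant here, since its documented behavior matches the description above; verify for your version):

    # /etc/salt/master
    # Do not rotate the AES key when a minion key is deleted/rejected.
    # Deleted minions can still read publishes but cannot auth or return.
    rotate_aes_key: False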
Same issues here... running Debian 7. I did set up a basic cron job to run every 5 minutes, but it still doesn't help. Makes it quite unusable.
I've been dealing with this issue for the past few weeks without much success in finding a solution. Running a test.ping cron every few minutes seemed to help, but really didn't fix the issue at all, and I saw no improvement after further configuration changes. Last night, out of desperation, I tried completely disabling the salt reactor configuration I had in place on my master. I've had zero issues with minions dropping or not returning since making this change.
@kelchm Just to verify, are you on 2014.7.0? Also, the output of salt --versions-report would be helpful.
@kelchm We would also be interested to see your reactor config and sls files. |
@kelchm AFAIK there should be nothing inherent in the reactor that would cause AES key rotation (and hence your problem). My best guess would be that one of your reactor events causes a key rotation (deleting a salt-minion key, maybe?). But the reactor config and sls would help us nail that down.
@Whissi - Is this still a problem for you?
Not sure, we are still pinging all minions every 5 minutes to prevent this problem. I'll stop the cronjob next week to see if this is still happening with salt-2015.8.3 and update the issue report.
@Whissi it seems that 2015.5.3 and later have resolved this problem.
@Whissi - Did you have time to update? Did it fix it?
After running salt-2015.8.3 for over a week I haven't seen this issue again. But this could be a false positive, because we currently have at least one salt event per 30 minutes. But let's consider this fixed for now. Thanks!
Thanks for the update @Whissi |
Hi,
when there has been no Salt activity for ~30-60 minutes, the next Salt command (for example
salt '*' --show-timeout cmd.run 'uname -a'
will time out for most minions. I need to run this command 2-3 times... after 1-2 minutes all minions are back and then I can run my desired command without any timeouts.
This didn't happen with 2014.1.5.