salt-master crashes every few days #34215
@rvora would you mind sharing your master/syndic configs to help me try to replicate this issue?
Here's the /etc/salt/master for the master of masters (some of the sensitive information has been redacted or removed). Note that this is from a non-production setup (open_mode: True) where I am seeing this behavior. I have also removed the external_auth and rest_cherrypy sections below.
And /etc/salt/master for the sub-level masters running salt-syndic:
We are having the same problem with 2016.3.1. We typically have to restart our salt-master and salt-api each morning. These errors seem to kick off the problem; it starts with dozens of pillar render errors:
We have
I doubled the memory on the salt-master; this has not helped.
@markahopper this is our other flavor of the salt master needing a restart several times a day.
@rvora thanks for the additional information. I'm setting up a test case right now and will report back the results today/tomorrow.
Also, @DrMerlin, @deuscapturus, @rvora: are all of you using salt-api when this occurs? I'm trying to narrow this down because I have not seen this in any of my environments. It also seems possibly correlated to pillars. @deuscapturus, is there anything unique about your pillars? Maybe git pillars, or an external pillar of some sort? Anything that will help me replicate this issue.
I wonder if @skizunov has any insight here. This might be pretty tricky to track down. Does memory usage increase at all prior to this happening, or do these errors just seemingly come out of nowhere?
I'm using salt-api.
Our master configuration
@cachedout We updated to version 2016.3.0 on June 8. Our memory utilization became erratic at that point in time. Our logs don't show 100% memory utilization; in general it is around 50%, but it could be peaking to 100% for short periods of time.
@deuscapturus Good info, thanks. I'm mostly curious if this is a memory pressure problem or something else entirely.
We also see this error
@deuscapturus Do you see it at or around the same time as the other error, or do they appear at largely different points in the log?
@cachedout the memory errors were higher in the log. We actually don't see the memory errors each time we get into this bad state. The error
@deuscapturus K. That's helpful. Thanks.
@cachedout: I haven't seen such an issue before myself. The first error seems to happen in ... It may be an infinite (or very deep) recursion problem. Unfortunately, I am not familiar with the pillar logic, as I don't use that feature myself.
UPDATE: I have not been able to replicate this, even after letting my master run through the night. Here is my master config:
Are you seeing this issue when running a certain command? If anyone is willing to do a git bisect, that would be useful as well. Also, is there any other information that might help me replicate this issue so we can track down how to fix it?
Can anyone help me track this down to a simple test case? We will need to be able to reproduce this error in order to fix it. Any ideas, anyone?
This is very mysterious. Right now, I am wondering if what's happening here is that we're switching into an IOLoop that is currently waiting on a read callback to complete. (Though, what's even stranger is that I haven't yet found a place where that callback would even be set.) It looks like we're in the context manager. cc: @skizunov (Any insight on the above would be greatly appreciated.)
OK, so I think we have a better idea of what's going on here. What this looks like is that we're pushing data onto the Tornado write queue in the EventPublisher and that memory is never being reclaimed. There are a couple of possible factors in play here:
I think that what we likely want to do here is to set a max_write_buffer_size dynamically, perhaps as some percentage of memory on the machine. We should also allow it to be set statically. Upon filling the buffer, we'll need to drop messages. We should log these, hopefully batching the warnings together. We may also wish to consider an option to allow users to select the use of ZMQ for the event bus. (Though we totally removed this a while ago.) Some users may wish to use the buffering algorithms already implemented there.
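For anyone following along, here is a minimal sketch of the drop-on-full idea described above. This is not the actual EventPublisher code; the publish helper, the 50 MiB cap, and the log messages are my own illustrative assumptions. Only max_write_buffer_size, StreamBufferFullError, and StreamClosedError come from Tornado itself.

```python
import logging

from tornado.iostream import StreamBufferFullError, StreamClosedError

log = logging.getLogger(__name__)

# Illustrative cap; the idea above is to derive this from total memory
# ("dynamic") or let the operator set it statically in the master config.
MAX_WRITE_BUFFER = 50 * 1024 * 1024  # 50 MiB per subscriber stream


def publish(stream, payload):
    # 'stream' is assumed to be a tornado.iostream.BaseIOStream constructed
    # with max_write_buffer_size=MAX_WRITE_BUFFER. Without that cap, Tornado
    # queues outbound writes indefinitely when a subscriber stops reading,
    # which matches the unreclaimed memory described above.
    try:
        stream.write(payload)
    except StreamBufferFullError:
        # Buffer is full: drop the event instead of queueing it forever,
        # and log it so dropped messages are visible to operators.
        log.warning('Subscriber write buffer full; dropping event')
    except StreamClosedError:
        log.debug('Subscriber disconnected; discarding event')
```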
Hello, just to add that we have the same problem here with a fresh 2016.3.1:
This is with a simple salt "somehost" test.ping.
All right. I have a first pass at a fix that does seem to resolve the leak. Please see #34683. We do need extensive real-world testing of this. However, be advised that messages will be dropped when this buffer is reached. Using this in production, especially with a high volume of events, may be risky. We need further tuning and logging before this is truly ready to go. That said, any testing that people can do here would be incredibly helpful. :]
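If it helps anyone testing #34683, a small, hypothetical helper like the one below can be used to watch the salt-master's resident memory over a day of normal traffic and confirm whether it still climbs. It only reads /proc, so it is Linux-only; the script itself and the five-minute interval are my own assumptions, not anything shipped with Salt.

```python
#!/usr/bin/env python
# Hypothetical test helper (not part of Salt): poll the combined RSS of all
# salt-master processes so it is easy to see whether memory still climbs
# over a day of normal traffic.
import glob
import time


def master_rss_kib():
    total = 0
    for proc in glob.glob('/proc/[0-9]*'):
        try:
            with open(proc + '/cmdline') as fh:
                cmdline = fh.read()
            if 'salt-master' not in cmdline:
                continue
            with open(proc + '/status') as fh:
                for line in fh:
                    if line.startswith('VmRSS:'):
                        total += int(line.split()[1])  # value is in kB
        except (IOError, OSError):
            continue  # process exited while we were reading it
    return total


if __name__ == '__main__':
    # Print a sample every five minutes; adjust as needed.
    while True:
        print('{} salt-master RSS: {} KiB'.format(
            time.strftime('%Y-%m-%d %H:%M:%S'), master_rss_kib()))
        time.sleep(300)
```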
This fix has not solved my problem after upgrading salt-master to 2016.3.2. Here's some new information / correlation I have found. I have a cronjob that runs every night at 2:00 UTC:
I also see the following error in the /var/log/salt-run-manage-down file, though there are no timestamps in that file, so it is hard to tell exactly what time this error happened. The monitoring system also shows memory climbing steadily (over a 24-hour period) to 100%.
Did you set the ipc_write_buffer option?
@cachedout Should I try ipc_write_buffer: 'dynamic'? If not, how should I go about estimating a value in number of bytes?
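In case a static number is preferred over 'dynamic', one rough way to derive a starting value from the box's physical memory is sketched below; the 1% fraction is purely my own assumption, not a documented Salt default, so treat the result as a starting point to tune.

```python
import os

# Linux-only: total physical memory in bytes via sysconf.
total_mem = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')

# Arbitrary starting point: 1% of RAM (e.g. ~160 MiB on a 16 GiB master).
candidate = int(total_mem * 0.01)
print('ipc_write_buffer: {}'.format(candidate))
```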
Following up: our salt master has not had any issues in the past few days after upgrading to 2016.3.2.
Description of Issue/Question
Every few days (4 or 5 days), salt-master stops responding and we need to restart it.
Setup
We have a number of salt-syndics that connect to this salt-master. The syndics are running a very old version (2014.7.5). We get a fair number of "Salt minion claiming to be has attempted to communicate with the master and could not be verified" messages in the logs, but functionally everything works until salt-master crashes.
Steps to Reproduce Issue
and the StreamClosedError exception keeps repeating about 15 times (within 2 or 3 seconds) and then the log stops completely.
Versions Report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt-Master (master of masters that crashes every few days):
On the Salt-syndics with older versions: