[BUG] intermittent connection between master and minion #65265
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
Hi - I am seeing this same issue.
It seems worse with 3006.5 with Linux as the master when managing Windows minions.
This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.
Restarting the salt master seems to fix the issue for all minions for a while, but the issue returns after about 12 hours on a different, seemingly random selection of minions.
I seem to have a very similar issue with 3006.x, but in my case restarting the master does not have any effect and only a minion restart resolves the issue. Another oddity is that I can see in the minion logs that the minion is still receiving commands from the master and is able to execute them just fine, but the master seemingly never receives the response data. If I issue a … I don't recall having this issue on 3005.x, but I have not downgraded that far yet; so far both 3006.5 and 3006.4 minions have the problem for me. I'll try to run a tcpdump if I have time.
I am encountering similar issues; everything is 3006.5. I've spent two days thinking I broke something in some recent changes I made, but I've found that the minions' jobs are succeeding, yet they time out trying to communicate back to the master. I'm thinking this may be related to concurrency + load. I use this for test environment automation, and during tests I have concurrent jobs fired off by the scheduler for test data collection, and that is where the issues start to show up in the logs. When this happens, the minions seem to try to re-send the data, which just compounds the problem. The logs on the master show that it is getting the messages, because it is flagging duplicate messages, but something seems to be getting lost processing the return data. The traces all look the same and seem to indicate something is getting dropped in concurrency-related code:

2024-01-29 15:22:57,215 [salt.master :1924][ERROR ][115353] Error in function minion_pub:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
payload = channel.send(payload_kwargs, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
ret = yield self._uncrypted_transfer(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
ret = yield self.transport.send(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
recv = yield future
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
pub_data = self.pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
ret = getattr(self, func)(load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
return self.masterapi.minion_pub(clear_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
ret["jid"] = self.local.cmd_async(**pub_load)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
pub_data = self.run_job(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
raise SaltClientError(general_exception)
I just discovered something. At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to ... buuuut they ARE actually listening to the master. So my workflow is stupidly:
@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. In my experience they are always still receiving events, however, as you say.
This seems to still be an issue on 3006.7 when both minion and master are the same version |
3007.0 is...worse? Woke up to all ~600 minions in an environment being offline.
The log showed returns from every minion, but the master spit out … Restarted the salt-master service, got distracted for ~15 minutes, ran another …

Used Cluster SSH to connect in to every machine I can reach across the internet and restarted the salt-minion service, and I'm down to a mix of ~60 minions (Windows, Linux, and BSD) that don't respond and that I can't reach. Maybe 10 of them are 3006.7.

I'd love to test/switch to a different transport like websockets, which would probably be more stable, but it appears to be "all or nothing": if I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets... and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.
It just happened on my master, which is 3007.0. I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate with no error messages in the salt master log. I tried restarting the salt master, no dice. About 10 minutes later I decided to restart the salt master's own minion... and suddenly certificate signing worked. The minion on the master wasn't communicating with the master... locally... on the same box...
Try some ZMQ tuning; I did it on my 3006.4 (the latest really stable version):
Where do I need to add this, @gregorg? In the salt master? And how do we add it?
Add this in …
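The exact settings gregorg posted did not survive in this transcript. Purely as an illustrative sketch (the option names are real Salt master settings, but the values below are guesses, not the ones from the comment), ZMQ-related tuning usually goes in a drop-in file under the master's include directory:

```yaml
# Example path: /etc/salt/master.d/tuning.conf (any .conf file in master.d is read)
# Illustrative values only -- tune to your fleet size; restart salt-master afterwards.
worker_threads: 12        # more MWorker processes to handle concurrent returns
pub_hwm: 10000            # ZeroMQ publisher high-water mark (messages buffered per subscriber)
tcp_keepalive: True       # keep idle TCP connections alive through NAT/firewalls
tcp_keepalive_idle: 300   # seconds of idle time before the first keepalive probe
```

After editing, restart the salt-master service so the new values take effect.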
We upgraded salt to 3006.4 on the master and 20 minions, out of which 10 minions are not upgraded.
This is not a support ticket; look at the salt master logs.
I tried those settings, @gregorg. It's been intermittent for the last three days... and this morning 100% of my minions are offline (even the local one on the salt master). If I connect to a box with a minion, I see the service is running, and I can totally state.highstate and everything works properly. Restarting the master brings everything online. There's nothing that appears unusual in the master log; I can even see minions reporting their results if I do something like …

I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time... so if I enable something like websockets, it looks like all my minions will be knocked offline until I reconfigure them.
I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..." So I restarted salt-master and noticed this in the log:
Specifically this:
Maybe something's hanging the MWorkerQueue? |
Any improvement with these settings?
I didn't use those exact settings because my master is smaller and has fewer minions. It's no longer dropping all the minions every few hours...it's more like once or twice a week. |
Also, I'm not sure if this is related or not, but it seems to be in the same vein: communication between the minions and the master is pretty unreliable.
Checking back in here, I think this is actually resolved for me once I got all my minions to 3007.0. I've removed all restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?
Yeah, this may still be an issue for me as well. I'm not sure yet. I noticed some odd things last night in testing, but it could be unrelated. I definitely don't have the …
Hi all, I have the same problem on 3007.0. |
We are currently encountering the same issue (salt-master and minions are both 3006.7).
3007.1 is completely dead for me. Now under 3007.1 (I skipped 3007.0 because it was pretty well b0rked for Windows minions), minions disconnect after a few minutes. If I restart the salt master and issue a command, I'm good. If I restart the salt master and wait ~5 minutes, all the minions are offline and won't come back with a …

The salt master logs show a non-stop stream of "Authentication requested from" and "Authentication accepted from" messages. Typically I would get those messages right after restarting the 3006.7 master or after issuing a command like … Now I'm getting 10-15 per second non-stop. Using the minion on the master, I can view the logs and verify the minion doesn't receive the … Even stranger, I can connect out to a minion and manually run …
Hmm...I noticed something interesting and potentially significant. After restarting the master, I get lots of entries like this in the log:
It sits there and counts up (if I'm idle or issuing a command like test.ping) until it hits about 300-400 entries in the cache... then, with no warnings or errors in the log, it resets and starts counting up again in the middle of the flood of "Authentication requested from" and "Authentication accepted from" messages.
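To put a number on the auth flood described above, one could bucket the master log's "Authentication requested from" lines by minute. This is a rough standalone sketch, not part of Salt; the log line layout (timestamp-first, as in the traceback excerpt earlier in this thread) is an assumption, and `auth_requests_per_minute` is a hypothetical helper name:

```python
import re
from collections import Counter

# Assumed salt-master log format: "YYYY-MM-DD HH:MM:SS,mmm [module][LEVEL][pid] message"
AUTH_RE = re.compile(r"^(\d{4}-\d\d-\d\d \d\d:\d\d):\d\d.*Authentication requested from")

def auth_requests_per_minute(lines):
    """Count 'Authentication requested' log lines, bucketed by minute."""
    counts = Counter()
    for line in lines:
        m = AUTH_RE.match(line)
        if m:
            counts[m.group(1)] += 1  # key is the timestamp truncated to the minute
    return counts

# Hypothetical sample lines for illustration:
sample = [
    "2024-05-01 09:15:02,123 [salt.master ][INFO ][1234] Authentication requested from minion-01",
    "2024-05-01 09:15:02,456 [salt.master ][INFO ][1234] Authentication accepted from minion-01",
    "2024-05-01 09:15:03,001 [salt.master ][INFO ][1234] Authentication requested from minion-02",
    "2024-05-01 09:16:10,500 [salt.master ][INFO ][1234] Authentication requested from minion-01",
]
print(auth_requests_per_minute(sample))
# Counter({'2024-05-01 09:15': 2, '2024-05-01 09:16': 1})
```

A healthy master should show a brief burst after a restart and then near-zero; a sustained rate of hundreds per minute would match the 10-15/second flood reported above.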
I downgraded the master and the minion running on the master to 3006.8 and semi-reliable connectivity appears to have been restored. All the minions are still running 3007.1 and appear to be working fine.
I can confirm!
Same issue: upgraded the master to 3007.1 and minions are dropping like flies. The crazy part: when I run tcpdump on a minion and run salt-call from the minion, it works OK, but that bypasses the connection between master and minion, since salt-call creates its own connection for that call. With the latest releases, salt is becoming more and more unusable in a prod environment, so it is very uncomfortable upgrading/downgrading when you need a specific feature to work.
Description
A clear and concise description of what the bug is.
I am seeing a weird connection issue in my salt setup. There are ~30 minions registered with the master. For a few of them, the master couldn't connect to them anymore after a while:

salt '*' test.ping

failed with the following error message: …

Here are a few observations: salt-call test.ping works fine on the minion side. Other commands like salt-call state.apply also work fine. This indicates minion-to-master communication is fine, but master-to-minion communication is not.

Setup
(Please provide relevant configs and/or SLS files. Be sure to remove sensitive info. There is no general set-up of Salt.)
Minions were installed via sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3; no custom config on the minion.

The master runs saltstack/salt:3006.3. Master configs: …

State file: …
Please be as specific as possible and give set-up details.
Steps to Reproduce the behavior
(Include debug logs if possible and relevant)
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Additional context
Add any other context about the problem here.