
[BUG] intermittent connection between master and minion #65265

Open · qianguih opened this issue Sep 21, 2023 · 31 comments
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage, Transport

@qianguih

Description
I am seeing a weird connection issue in my Salt setup. There are ~30 minions registered with the master. For a few of them, the master can no longer reach the minion after a while. salt '*' test.ping fails with the following error message:

    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
    
    salt-run jobs.lookup_jid 20230920213139507242

Here are a few observations:

  • Restarting the salt-minion service helps, but the same minion loses its connection again after a while.
  • salt-call test.ping works fine on the minion side, and other commands like salt-call state.apply also work fine. This suggests minion-to-master communication is healthy but master-to-minion communication is not.
  • Below is the error message from the minion log. Bumping up the timeout (e.g. salt '*' -t 600 test.ping) does not help:
2023-09-20 13:00:08,121 [salt.minion      :2733][ERROR   ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request
Does anyone have any idea what's wrong here and how to debug this issue?
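
One way to narrow down where the return is being lost (a debugging sketch; the minion ID below is a placeholder) is to watch the master's event bus while pinging a single affected minion. If a salt/job/<jid>/ret/... event shows up but the CLI still reports "No response", the return reaches the master and is dropped afterwards; if no event appears, the return never makes it back at all.

    # terminal 1, on the master: watch the event bus
    salt-run state.event pretty=True

    # terminal 2, on the master: ping one affected minion with a generous timeout
    salt '<affected-minion-id>' -t 60 test.ping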

Setup

  • The minion was installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3; there is no custom config on the minion.
  • The master runs inside a container using the image saltstack/salt:3006.3. Master config:
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4

file_roots:
  base:
    - <path/to/custom/state/file>

state file:

pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu
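
For reference, the loaded schedule can be checked on a minion with the schedule execution module (a quick sketch; function names per the 3006 schedule module docs):

    # on a minion: confirm the pull_state_job schedule was actually loaded
    salt-call schedule.list
    # optionally trigger the scheduled job once by hand
    salt-call schedule.run_job pull_state_job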



Versions Report

Output of salt --versions-report:
Salt Version:
          Salt: 3006.3
 
Python Version:
        Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: Not Installed
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: alpine 3.14.6 
        locale: utf-8
       machine: x86_64
       release: 5.11.0-1022-aws
        system: Linux
       version: Alpine Linux 3.14.6 


qianguih added the Bug and needs-triage labels on Sep 21, 2023
@brettgilmer

Hi - I am seeing this same issue.
I am also seeing it on the periodic pings configured by the "ping_interval" minion configuration parameter.
Running Salt 3006.5 on both minion and master.

@darkpixel
Contributor

It seems worse on 3006.5 with a Linux master managing Windows minions.

@brettgilmer

This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.

@brettgilmer

Restarting the salt-master seems to fix the issue for all minions for a while, but it returns after about 12 hours on a different, seemingly random selection of minions.

@raddessi
Contributor

I seem to have a very similar issue with 3006.x but in my case restarting the master does not have any effect and only a minion restart resolves the issue.

Another oddity: the minion logs show that the minion is still receiving commands from the master and executes them just fine, but the master seemingly never receives the response data. If I issue salt '*' service.restart salt-minion from the master, all of the minions receive the command, restart, and pop back up just fine, and then communication works for probably another 12 hours or so.

I don't recall having this issue on 3005.x, but I have not downgraded that far yet; so far both 3006.5 and 3006.4 minions show the problem for me. I'll try to run a tcpdump if I have time.
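
For anyone else capturing this, a minimal capture of the two Salt transport ports on an affected minion (assuming the default 4505/4506 ports; the minion ID is a placeholder) would be something like:

    # on the affected minion: capture Salt's publish (4505) and return (4506) traffic
    tcpdump -i any -nn -w salt-minion.pcap 'tcp port 4505 or tcp port 4506'
    # then, from the master, target that minion and stop the capture afterwards
    salt '<affected-minion-id>' test.ping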

@ReubenM

ReubenM commented Jan 29, 2024

I am encountering similar issues. Everything is 3006.5.

I've spent two days thinking I broke something in recent changes, but I've found that the minions' jobs are succeeding; they just time out trying to communicate back to the master. I think this may be related to concurrency plus load. I use Salt for test-environment automation, and during tests the scheduler fires off concurrent jobs for test data collection; that is where the issues start to show up in the logs. When this happens, the minions seem to re-send the data, which just compounds the problem. The master's logs show it is getting the messages, because it flags duplicates, but something gets lost while processing the return data.

The traces all look the same and seem to indicate something is getting dropped in concurrency related code:

2024-01-29 15:22:57,215 [salt.master      :1924][ERROR   ][115353] Error in function minion_pub:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
    ret = yield self.transport.send(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
    recv = yield future
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
    pub_data = self.pub(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
    raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
    ret = getattr(self, func)(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
    return self.masterapi.minion_pub(clear_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
    ret["jid"] = self.local.cmd_async(**pub_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
    pub_data = self.run_job(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)

@darkpixel
Contributor

I just discovered something.

At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to test.ping, but definitely won't respond to something like state.sls somestate.

...buuuut they ARE actually listening to the master.

So my workflow is stupidly:

salt '*' minion.restart
# Wait for every minion to return that it failed to respond
salt '*' state.sls somestate
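
Before restarting everything, it can also be worth checking whether the "failed" job actually landed in the master's job cache; this uses the same jobs runner the CLI error message suggests (the jid is a placeholder):

    # on the master: list recent jobs, then look up the one that "failed"
    salt-run jobs.list_jobs
    salt-run jobs.lookup_jid <jid>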

@raddessi
Contributor

raddessi commented Feb 13, 2024

@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. In my experience the minions are always still receiving events, as you say.

@raddessi
Contributor

This still seems to be an issue on 3006.7 when both minion and master are on the same version.

@darkpixel
Contributor

3007.0 is...worse?

Woke up to all ~600 minions in an environment being offline.

salt '*' test.ping failed for every minion.

The log showed returns from every minion, but the master spit out Minion did not return: [Not connected] for every single one.

Restarted the salt-master service, got distracted for ~15 minutes, ran another test.ping and several hundred failed to respond.

Used Cluster SSH to connect to every machine I can reach across the internet and restarted the salt-minion service; I'm down to a mix of ~60 minions (Windows, Linux, and BSD) that don't respond and that I can't reach. Maybe 10 of them are 3006.7.

I'd love to test/switch to a different transport like websockets that would probably be more stable, but it appears to be "all or nothing". If I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets...and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.

@darkpixel
Contributor

It just happened on my master which is 3007.0...

I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate, with no error messages in the salt-master log.

I tried restarting the salt master, no dice.

About 10 minutes later I decided to restart the salt master's minion...and suddenly certificate signing worked.

The minion on the master wasn't communicating with the master...locally...on the same box...

@gregorg

gregorg commented Mar 25, 2024

It just happened on my master which is 3007.0...

Try some ZMQ tuning; I did it on my 3006.4 master (the latest really stable version):

# The number of salt-master worker threads that process commands
# and replies from minions and the Salt API
# Default: 5
# Recommendation: 1 worker thread per 200 minions, max 1.5x cpu cores
# 24x1.5 = 36, should handle 7200 minions
worker_threads: 96

# The listen queue size / backlog
# Default: 1000
# Recommendation: 1000-5000
zmq_backlog: 2000

# The publisher interface ZeroMQPubServerChannel
# Default: 1000
# Recommendation: 10000-100000
pub_hwm: 50000

# Default: 100
# Recommendation: 100-500
thread_pool: 200

max_open_files: 65535
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000

@sasidharjetb

Where do I need to add this on the salt master, @gregorg, and how do we add it?

@gregorg

gregorg commented Mar 28, 2024

Where do I need to add this on the salt master, @gregorg, and how do we add it?

Add this in /etc/salt/master and restart salt-master.

@sasidharjetb

(screenshot attached)

@sasidharjetb

We upgraded Salt to 3006.4 on the master and on our 20 minions, out of which 10 minions are not yet upgraded.
Will this solve my issue?

@gregorg

gregorg commented Mar 28, 2024

This is not a support ticket; look at the salt-master logs.

@darkpixel
Contributor

I tried those settings, @gregorg. It's been intermittent for the last three days...and this morning 100% of my minions are offline (even the local one on the salt master).

If I connect to a box with a minion, the service shows as running, and I can totally run state.highstate locally and everything works properly.

Restarting the master brings everything online.

There's nothing that appears unusual in the master log. I can even see minions reporting their results if I do something like salt '*' test.ping, but all I get back is Minion did not return. [Not connected].

I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time...so if I enable something like websockets it looks like all my minions will be knocked offline until I reconfigure them.

@darkpixel
Contributor

I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..."

So I restarted salt-master and noticed this in the log:

2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140336] MWorkerQueue under PID 2140336 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140337] MWorkerQueue under PID 2140337 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140318] MWorkerQueue under PID 2140318 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140341] MWorkerQueue under PID 2140341 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140339] MWorkerQueue under PID 2140339 is closing
2024-04-07 14:40:30,310 [salt.transport.zeromq:477 ][INFO    ][2140335] MWorkerQueue under PID 2140335 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140319] MWorkerQueue under PID 2140319 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140338] MWorkerQueue under PID 2140338 is closing
2024-04-07 14:40:30,316 [salt.transport.zeromq:477 ][INFO    ][2140320] MWorkerQueue under PID 2140320 is closing
2024-04-07 14:40:30,335 [salt.transport.zeromq:477 ][INFO    ][2140343] MWorkerQueue under PID 2140343 is closing
2024-04-07 14:40:30,360 [salt.transport.zeromq:477 ][INFO    ][2140333] MWorkerQueue under PID 2140333 is closing
2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)
2024-04-07 14:40:31,308 [salt.utils.process:752 ][INFO    ][2140315] kill_children retries left: 3
2024-04-07 14:40:31,334 [salt.utils.parsers:1061][WARNING ][2140139] Master received a SIGTERM. Exiting.
2024-04-07 14:40:31,334 [salt.cli.daemons :99  ][INFO    ][2140139] The Salt Master is shut down
2024-04-07 14:40:32,241 [salt.cli.daemons :83  ][INFO    ][2407186] Setting up the Salt Master

Specifically this:

2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)

Maybe something's hanging the MWorkerQueue?
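
If someone can reproduce this, grabbing a Python stack of the stuck worker before restarting might confirm it. A rough sketch using py-spy (a third-party tool; that salt-pip is available and that the resulting py-spy binary ends up on PATH are assumptions about the onedir install):

    # on the master, before restarting: dump the stack of a hung MWorker process
    salt-pip install py-spy
    py-spy dump --pid 2140316    # PID taken from the log line above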

@amalaguti

(quoting gregorg's ZMQ tuning settings above)

Any improvement with these settings?

@darkpixel
Contributor

I didn't use those exact settings because my master is smaller and has fewer minions.
8 cores * 1.5 threads per core = 12 threads = 2,400 minions (I only have ~700 on this test box).

It's no longer dropping all the minions every few hours...it's more like once or twice a week.
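
For reference, the 1.5-threads-per-core rule of thumb quoted above can be computed on the master itself; this is just that heuristic, not an official sizing formula:

    # 1.5 x CPU cores, per the recommendation quoted above
    echo "suggested worker_threads: $(( $(nproc) * 3 / 2 ))"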

@darkpixel
Contributor

Also, I'm not sure if this is related or not, but it seems to be in the same vein: communication between the minions and the master is pretty unreliable.

root@salt:~# salt 'US*' state.sls win.apps
'str' object has no attribute 'pop'

root@salt:~# salt 'US*' state.sls win.apps
US-REDACTED-54:
    Data failed to compile:
----------
    The function "state.sls" is running as PID 7340 and was started at 2024, May 02 03:22:05.301817 with jid 20240502032205301817
'str' object has no attribute 'pop'

root@salt:~# service salt-master restart
root@salt:~# service salt-minion restart
# wait 30 seconds or so for minions to reconnect
root@salt:~# salt 'US*' state.sls win.apps
<snip lots of output>
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 428
# of minions returned: 405
# of minions that did not return: 23
# of minions with errors: 6
-------------------------------------------
ERROR: Minions returned with non-zero exit code
root@salt:~# 

@raddessi
Contributor

raddessi commented May 2, 2024

Checking back in here, I think this is actually resolved for me once I got all my minions to 3007.0. I've removed all restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?

communication between the minions and master is pretty unreliable

Yeah... this may still be an issue for me as well; I'm not sure yet. I noticed some odd things last night in testing, but it could be unrelated. I definitely don't get the 'str' object has no attribute 'pop' error or any other error, but sometimes minions do not return in time.

@Rosamaha1

Hi all,

I have the same problem on 3007.0.
Master-to-minion connectivity always fails.
The other way around works fine.
I did some tuning of the master config but it didn't help!

@tomm144

tomm144 commented May 21, 2024

We are currently encountering the same issue (salt-master and minions are both 3006.7).
A ping to one minion causes high load on the master, and the minion becomes "unavailable", i.e. the master cannot receive the answer.
The minion log shows this message (multiple times):
May 21 09:56:28 salt-minion[276145]: [ERROR ] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20240521075610312606', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2024-05-21T07:56:10.447647', 'nonce': ''} request

@darkpixel
Contributor

3007.1 is completely dead for me.
Under 3006.7, minions would slowly become unavailable over a day or two until I ran a minion.restart. Even though all commands to the minions returned "Minion did not return. [Not connected]", they actually were connected, restarted themselves, and started communicating properly.

Now under 3007.1 (skipped 3007.0 because it was pretty well b0rked for Windows minions), minions disconnect after a few minutes.

If I restart the salt master and issue a command, I'm good. If I restart the salt master and wait ~5 minutes, all the minions are offline and won't come back with a minion.restart; only restarting the salt master brings them back.

The salt master logs show a non-stop stream of "Authentication requested from" and "Authentication accepted from" messages. Typically I would get those messages right after restarting the 3006.7 master or after issuing a command like salt '*' test.ping, but they'd settle down when nothing was going on.

Now I'm getting 10-15 per second non-stop.

Using the minion on the master, I can view the logs and verify the minion doesn't receive the minion.restart command, but there are also no errors about communication issues with the master.

Even stranger, I can connect out to a minion and manually run state.highstate and it works perfectly fine. No issues communicating with the master there....just receiving commands I guess.
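
One way to put a number on the auth storm (assuming the default log location /var/log/salt/master and the default timestamped log format) is to count auth messages per minute:

    # count "Authentication accepted" log lines per minute on the master
    grep 'Authentication accepted' /var/log/salt/master | cut -c1-16 | sort | uniq -c | tail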

@darkpixel
Contributor

Hmm...I noticed something interesting and potentially significant.
I saw the con_cache setting in the config file, which defaults to false. I figured I would turn it on and see what happened.

After restarting the master, I get lots of entries like this in the log:

2024-05-28 02:39:17,506 [salt.utils.master:780 ][INFO    ][806466] ConCache 299 entries in cache

It sits there and counts up (if I'm idle or issuing a command like test.ping) until it hits about ~300-400 entries in the cache....then with no warnings or errors in the log it resets and starts counting up again in the middle of the flood of "Authentication requested from" and "Authentication accepted from" messages.

@darkpixel
Contributor

I downgraded the master and the minion running on the master to 3006.8 and semi-reliable connectivity appears to have been restored. All the minions are still running 3007.1 and appear to be working fine.

con_cache was probably a red herring as it keeps dropping back to 0 and counting back up constantly.

@Rosamaha1

(quoting darkpixel's downgrade to 3006.8 above)

I can confirm!
3007.0 for the salt master was a total mess for me as well! On the minion side the version works fine!
I also downgraded my salt master to 3006.8 and everything is working fine now.
Hopefully a stable release for 3007 will arrive soon!

@frenkye

frenkye commented Jun 7, 2024

Same issue: I upgraded the master to 3007.1 and minions are dropping like flies. The crazy part is that when I run tcpdump on the minion and run salt 'minion' test.ping, it doesn't even show a single packet on the minion; the connection is totally broken. In netstat on the master you can see ESTABLISHED and CLOSE_WAIT connections to 4505 from the minion.
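
For comparison, the socket states on both ends can be summarized like this (ss is used here; netstat shows the same thing):

    # on the master: connection states toward the publish (4505) and request (4506) ports
    ss -tan '( sport = :4505 or sport = :4506 )' | awk 'NR>1 {print $1}' | sort | uniq -c
    # on the minion: what it has open toward the master
    ss -tan '( dport = :4505 or dport = :4506 )'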

salt-call from the minion works OK, but it bypasses the existing connection between master and minion, since it creates its own for that call.

With the latest releases Salt is becoming more and more unusable in a prod environment.
3007.1 - minions don't work
3007.0 - Windows broken
3006.6 - grains don't work
3005 - salt-ssh was semi-working

So it is very uncomfortable upgrading/downgrading when you need a specific feature to work.
