
[BUG] 3007.0 - Vulnerability scanning causes Salt Master denial of service #66519

Open
3 of 9 tasks
clayoster opened this issue May 14, 2024 · 12 comments
Labels: Bug (broken, incorrect, or confusing behavior), needs-triage
@clayoster
Contributor

clayoster commented May 14, 2024

Description
I have found that my Salt Master servers running 3007.0 become unresponsive on a weekly basis after our internal vulnerability scans run (Tenable Vulnerability Management). This is very similar to the issue described in #64061 that was fixed in versions 3005.2/3006.2 (CVE-2023-20897).

I took a packet capture while running a scan against the server and noticed that attempts to start TLS sessions on port 4506 are what trigger the errors below in /var/log/salt/master. The number of errors appears to equal the number of worker processes configured on the master, and the problem only occurs when the scan probes TCP port 4506.

2024-05-13 16:47:48,935 [salt.transport.zeromq:572 ][ERROR   ][189978] Exception in request handler
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
    request = await asyncio.wait_for(self._socket.recv(), 0.3)
  File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
    result = recv(**kwargs)
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
[The identical traceback is logged again for worker PIDs 189979, 189988, 189989, and 189981 between 16:47:49 and 16:47:50; repeats omitted.]

Once these errors occur, the master service becomes completely unresponsive to minion requests. Attempting to issue commands from the affected Salt Master results in an error that the master is not responding.

user@salt1:~$ sudo salt '*' test.ping
[ERROR   ] Request client send timedout
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

Restarting the salt-master service resolves the issue.

Setup
I am running a Master Cluster with 4 servers built on Oracle Linux 8.9 and Salt 3007.0.

Please be as specific as possible and give set-up details.

  • on-prem machine
  • VM (VMware)
  • VM running on a cloud service, please be explicit and add details
  • container (Kubernetes, Docker, containerd, etc. please specify)
  • or a combination, please be explicit
  • jails if it is FreeBSD
  • classic packaging
  • onedir packaging
  • used bootstrap to install

Steps to Reproduce the behavior
Initiating a scan with Tenable against one of the master servers triggers this issue. Based on the similarity to #64061, I imagine a scan from Rapid7 InsightVM / Nexpose would also trigger the issue.

An easier way to reproduce the issue is to use openssl to attempt opening TLS connections to port 4506 in quick succession:
for i in {1..30}; do openssl s_client -connect salt.example.com:4506 -tls1_2 </dev/null; sleep .2; done
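For hosts without openssl available, the same traffic pattern can likely be reproduced with a short Python sketch (a hypothetical helper, not part of Salt or its test suite) that opens a burst of TCP connections to the request port and writes the first bytes of a TLS record header, which is not valid ZeroMQ framing:

```python
import socket

def poke_port(host: str, port: int,
              payload: bytes = b"\x16\x03\x01\x00\x00",
              attempts: int = 30, timeout: float = 1.0) -> int:
    """Open `attempts` short-lived TCP connections and send `payload`
    (the leading bytes of a TLS 1.x handshake record, as a scanner's
    TLS probe would). Returns the number of successful sends."""
    sent = 0
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(payload)
                sent += 1
        except OSError:
            # Connection refused or timed out -- the master is
            # unreachable or already wedged.
            pass
    return sent
```

Usage would be something like `poke_port("salt.example.com", 4506)`; the hostname here is the same placeholder used in the openssl one-liner above.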

A restart of the salt-master service brings it back to life.

Expected behavior
The Salt Master service should not become unresponsive when port 4506 is investigated by vulnerability scanners or receives other invalid requests.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.7.2
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.14.1
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: oracle 8.9
        locale: utf-8
       machine: x86_64
       release: 5.15.0-205.149.5.1.el8uek.x86_64
        system: Linux
       version: Oracle Linux Server 8.9

Additional context
This only seems to affect version 3007.0. I tested with versions 3005.5 and 3006.8 and they log the following messages when attempting to reproduce the issue, but do not become unresponsive.

2024-05-13 19:02:56,152 [salt.payload     :111 ][CRITICAL][1358] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
@clayoster clayoster added Bug broken, incorrect, or confusing behavior needs-triage labels May 14, 2024
@tjyang
Contributor

tjyang commented May 15, 2024

@clayoster, thanks for the one-liner script to bring down 3007 salt-master. I am able to confirm this on my test salt-master.

@tjyang
Contributor

tjyang commented May 23, 2024

I used the same one-liner script to bring down a 3007.1 salt-master; 3006.8 withstands this test. A salt-master restart will restore the service.

for i in {1..30}; do openssl s_client -connect salt.example.com:4506 -tls1_2 </dev/null; sleep .2; done

@tjyang
Contributor

tjyang commented Jun 2, 2024

I used the same one-liner script to bring down a 3007.1 salt-master; 3006.8 withstands this test. A salt-master restart will restore the service.

for i in {1..30}; do openssl s_client -connect salt.example.com:4506 -tls1_2 </dev/null; sleep .2; done

The one-liner test can still bring down port 4506 after the OS was upgraded from Rocky Linux 8.9 to 8.10 (released a few days ago).

  • Here is the log with trace-level logging enabled for the log file. The following entries show the zmq backend reporting "Resource temporarily unavailable" when the one-liner script hits.
2024-06-02 09:33:05,632 [salt.transport.zeromq:572 ][ERROR   ][42819] Exception in request handler
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
    request = await asyncio.wait_for(self._socket.recv(), 0.3)
  File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
    result = recv(**kwargs)
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
2024-06-02 09:33:11,074 [salt.utils.process:32  ][TRACE   ][42741] Process manager iteration


@dynek

dynek commented Jun 21, 2024

Is there a light at the end of the tunnel? I have the same issue, and it doesn’t take much effort to reproduce it.

@tjyang
Contributor

tjyang commented Jun 21, 2024

Is there a light at the end of the tunnel? I have the same issue, and it doesn’t take much effort to reproduce it.
Until this issue is fixed by the core dev team, which is short on manpower, I would suggest downgrading to the LTS 3006.8 release.

@clayoster clayoster changed the title [BUG] 3007.0 - Vulnerability scanning causes Salt Master to become unresponsive [BUG] 3007.0 - Vulnerability scanning causes Salt Master denial of service Jun 28, 2024
@clayoster
Contributor Author

A quick band-aid fix is to configure the local firewall on the master server(s) to drop incoming connections to TCP 4506 from the IP addresses of the vulnerability scanning systems that trigger the issue. If you are running a master cluster behind a load balancer, the same configuration needs to be added to the load balancer as well. This has kept my master cluster stable for over a month now.
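As an illustration of that band-aid on a firewalld-based distro such as Oracle Linux 8, the drop rules might look like the following sketch (the scanner addresses are placeholders from the 192.0.2.0/24 documentation range; substitute your own, and run as root):

```shell
# Drop connections to the Salt request port from known scanner IPs.
# 192.0.2.10 / 192.0.2.11 are placeholder addresses for this example.
for scanner in 192.0.2.10 192.0.2.11; do
    firewall-cmd --permanent --add-rich-rule="rule family=ipv4 source address=${scanner} port port=4506 protocol=tcp drop"
done
firewall-cmd --reload
```

On a cluster behind a load balancer, equivalent rules would be needed wherever the scanner's traffic can reach port 4506.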

@dwoz - Is there any chance a fix for this issue will be included in the next minor release? It allows a pretty simple denial of service attack, whether intentional or not.

@dynek

dynek commented Jul 1, 2024

@clayoster the thing is, on my side just a few machines connecting to their master reproduce this bug; not much effort is required. Maybe I'm hitting a different bug 🤷
Rolled back to 3006.6-r0 (the version available on Alpine v3.19) for the moment.

@clayoster
Contributor Author

@dynek That sounds like a different issue to me. In my environment, normal minion/master communication does not cause any instability or error logs. All minions and masters are on 3007.1.

@clayoster
Contributor Author

I have noticed that the number of errors that are logged before the master becomes unresponsive seems to match the number of worker threads defined. This makes me think that each time one of these errors occurs, a worker thread becomes unresponsive. Once errors == worker threads, the Salt Master becomes unresponsive.

I noticed there is a break after the logging code that generates the message I have been seeing. If I switch this to continue and try to reproduce the issue, I still see errors logged but the master process does not break and minion communication seems to continue successfully.

except Exception as exc:  # pylint: disable=broad-except
    log.error("Exception in request handler", exc_info=True)
    break

Perhaps the larger issue is whatever is causing "Resource temporarily unavailable" and the except statement to be reached though.
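To make the break/continue distinction concrete, here is a minimal paraphrase of the loop shape being described (stand-in function names, not the actual salt/transport/zeromq.py source): with break, a single malformed request permanently retires that worker's handler coroutine; with continue, the error is logged and the worker goes back to serving requests.

```python
import asyncio
import logging

log = logging.getLogger(__name__)

async def request_handler(recv_request, handle_request, running):
    """Paraphrased handler loop. `recv_request`, `handle_request`, and
    `running` stand in for the real socket plumbing and lifecycle checks."""
    while running():
        try:
            request = await asyncio.wait_for(recv_request(), 0.3)
        except asyncio.TimeoutError:
            continue  # nothing to read yet; poll again
        except Exception:  # pylint: disable=broad-except
            log.error("Exception in request handler", exc_info=True)
            continue  # was `break`: one bad request retired this worker for good
        await handle_request(request)
```

Under the continue variant, the openssl one-liner would still produce error log entries, but each worker returns to polling instead of exiting its loop.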

@jsansone-pw

I believe we may be (or were) experiencing the same issue with 3007; we also run Tenable vulnerability scanning software. Unfortunately I cannot test the fix now, as we rolled back to 3006.8 since 3007 was essentially broken for us. We are also replacing Tenable with Rapid7, so it would be interesting to determine whether this happens with any vulnerability scanner or is specific to Tenable.

@henri9813

In my case, my instance is public because I also manage public servers, so it's possible it has been scanned by a random bot.

@clayoster
Copy link
Contributor Author

@jsansone-pw - I think there is a high likelihood that you would see the same issue. In a previous environment I managed, Rapid7 was in use and triggered the issue #64061 / CVE-2023-20897 which was present in 3005.1/3006.1 and had the same cause and effect as this issue.

No branches or pull requests · 6 participants