Description
Possibly as a regression in onedir Salt: high load, which traditionally would only result in timeouts and "master is not responding" messages, now additionally causes the salt-master process to fail and exit.
Setup
Install salt-master and salt-minion on a single machine. Note that my test case used a machine with a single CPU; the number of CPUs may have a large effect on the results shown below. A master configuration and a set of orchestration states were used to demonstrate this behavior.
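The exact configuration and states from the original report are not reproduced here; the following is a minimal sketch of the kind of master config and orchestration state that produce heavy load (file paths, target, and values are assumptions):

```yaml
# /etc/salt/master -- minimal sketch; the worker_threads value is an assumption,
# chosen low to make a single-CPU master easy to saturate
worker_threads: 3
timeout: 30

# ---
# /srv/salt/orch/busy.sls -- hypothetical orchestration that fans out a
# highstate to all minions via the saltmod salt.state state
run_highstate:
  salt.state:
    - tgt: '*'
    - highstate: True
```

Any orchestration that publishes jobs to minions and waits on their returns should serve the same purpose here.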
Steps to Reproduce the behavior
We are trying to produce a situation where the master is under heavy load, so the following bash commands can be used to launch 10 orchestrations concurrently.
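The original commands were not captured here; a minimal sketch along these lines (the orchestration name `orch.busy` is an assumption) launches 10 concurrent runs:

```shell
#!/bin/sh
# Launch 10 orchestrations concurrently and wait for all of them to finish.
# The runner command and orchestration name are assumptions.
run_all() {
    cmd="${1:-salt-run}"          # allow substituting the runner (e.g. for a dry run)
    i=1
    while [ "$i" -le 10 ]; do
        "$cmd" state.orchestrate orch.busy &
        i=$((i + 1))
    done
    wait                          # block until every background run finishes
}
```

Invoke as `run_all` on the master (or `run_all echo` to dry-run without Salt installed). On a single-CPU master this saturates the worker processes, which is the point of the test.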
Notice the following error messages in the master logs:
2024-04-19 19:57:24,903 [salt.loaded.int.states.saltmod:384 ][WARNING ][11443] Output from salt state not highstate
2024-04-19 19:57:25,973 [salt.state :323 ][ERROR ][11443] {'out': 'highstate', 'ret': {'salt-3006': False}}
2024-04-19 19:57:50,409 [salt.client :1912][ERROR ][11888] Message timed out
2024-04-19 19:57:54,875 [salt.client :1912][ERROR ][11496] Message timed out
2024-04-19 19:58:07,149 [salt.client :1912][ERROR ][11593] Message timed out
2024-04-19 19:58:07,149 [salt.state :323 ][ERROR ][11888] An exception occurred in this state: Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
payload = channel.send(payload_kwargs, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
ret = yield self._uncrypted_transfer(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
ret = yield self.transport.send(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 915, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 594, in send
recv = yield future
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
pub_data = self.pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Now notice that the salt-master process has failed and exited:
# systemctl status salt-master
× salt-master.service - The Salt Master Server
Loaded: loaded (/lib/systemd/system/salt-master.service; enabled; vendor preset: enabled)
Active: failed (Result: oom-kill) since Fri 2024-04-19 19:58:08 UTC; 4min 59s ago
Docs: man:salt-master(1)
file:///usr/share/doc/salt/html/contents.html
https://docs.saltproject.io/en/latest/contents.html
Process: 9044 ExecStart=/usr/bin/salt-master (code=exited, status=0/SUCCESS)
Main PID: 9044 (code=exited, status=0/SUCCESS)
CPU: 54.873s
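The `Result: oom-kill` in the unit status suggests the kernel OOM killer terminated the process. That can be confirmed from the kernel log; this is a sketch, not output from the original report:

```shell
#!/bin/sh
# Grep pattern matching typical kernel OOM-killer messages
# (e.g. "Out of memory: Killed process 9044 (salt-master)").
OOM_PATTERN='out of memory|oom-kill'

# Scan recent kernel messages for OOM events; requires journal access.
check_oom() {
    journalctl -k --since "1 hour ago" | grep -iE "$OOM_PATTERN"
}
```

`dmesg | grep -iE "$OOM_PATTERN"` gives the same information on systems without a persistent journal.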
Expected behavior
The master should keep running, ideally allowing all minions to respond without timeouts once CPU capacity frees up. The master failing and exiting under these circumstances was not observed on classic (non-onedir) installations of Salt.
Versions Report
Salt Version:
Salt: 3006.7
Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.15.1
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: ubuntu 22.04.2 jammy
locale: utf-8
machine: x86_64
release: 5.15.0-67-generic
system: Linux
version: Ubuntu 22.04.2 jammy