I decided to open a new ticket given the related issues #32465, #49105, #33750 and #46868. This is to record the process of debugging this issue, and I hope others can help to resolve it.
Description of Issue
This happened to my 2019.2.2 salt-master. I restored the VM back to a snapshot as old as 2019.2.0 and the issue still exists.
Or do older minions have a problem interacting with a 2019.2.2 master?
Setup
This is a multi-master setup; each minion allows management from both salt01 and salt02.
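For reference, a minimal sketch of the minion-side configuration such a setup implies (the hostnames come from this issue; the exact file contents are an assumption, not copied from the author's minions):
# /etc/salt/minion -- point the minion at both masters
master:
  - salt01
  - salt02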
Steps to Reproduce Issue
The following exact command sequence is able to interact with most minions, but still times out in the end:
me@salt02> sudo systemctl stop salt-master && sleep 5 && sudo systemctl start salt-master \
&& sleep 5 && sudo salt \* test.version --output=txt
<snipped>
a31: 2019.2.0
b114: 2019.2.2
c01: 2019.2.0
dms01: 2019.2.0
eslave01: 2019.2.0
Salt request timed out. The master is not responding. You may need to run your command with
`--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job
id (jid) and exit immediately without listening for responses. You can then use
`salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$
A few minutes later, not even one version number is returned from the minions:
[me@salt01 ~]$ sudo salt \* test.version --output=txt
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$
Versions Report
Salt minion versions range from 2017.x to 2019.2.2.
The salt-master has the following hardware resources:
8 CPU cores
8 GB RAM, around 4 GB used
The following settings were all commented out or not present; the master configuration is the same one used since version 2017:
#max_open_files
#worker_threads
#pub_hwm:
#salt_event_pub_hwm
#event_publisher_pub_hwm
Setting example in /etc/salt/master:
# These two ZMQ HWM settings, salt_event_pub_hwm and event_publisher_pub_hwm
# are significant for masters with thousands of minions. When these are
# insufficiently high it will manifest in random responses missing in the CLI
# and even missing from the job cache. Masters that have fast CPUs and many
# cores with appropriate worker_threads will not need these set as high.
# On deployment with 8,000 minions, 2.4GHz CPUs, 24 cores, 32GiB memory has
# these settings:
#
# salt_event_pub_hwm: 128000
# event_publisher_pub_hwm: 64000
Versions report:
[me@salt01 ~]$ sudo salt --versions-report ;date
Salt Version:
Salt: 2019.2.2
Dependency Versions:
cffi: 1.6.0
cherrypy: unknown
dateutil: 1.5
docker-py: Not Installed
gitdb: 0.6.4
gitpython: 1.0.1
ioflo: 1.3.8
Jinja2: 2.7.2
libgit2: 0.26.3
libnacl: 1.6.1
M2Crypto: 0.31.0
Mako: 0.8.1
msgpack-pure: Not Installed
msgpack-python: 0.5.6
mysql-python: 1.2.5
pycparser: 2.14
pycrypto: 2.6.1
pycryptodome: 3.7.3
pygit2: 0.26.4
Python: 2.7.5 (default, Aug 7 2019, 00:51:29)
python-gnupg: Not Installed
PyYAML: 3.11
PyZMQ: 15.3.0
RAET: Not Installed
smmap: 0.9.0
timelib: 0.2.4
Tornado: 4.2.1
ZMQ: 4.1.4
System Versions:
dist: centos 7.7.1908 Core
locale: UTF-8
machine: x86_64
release: 3.10.0-1062.9.1.el7.x86_64
system: Linux
version: CentOS Linux 7.7.1908 Core
Thu Dec 12 16:37:33 CST 2019
[me@salt01 ~]$
[me@salt01 ~]$ grep ^log_fmt_console /etc/salt/minion
log_fmt_console: '%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s'
[me@salt01 ~]$
I ran a command like the following to see minion/master interaction details:
sudo time salt-call pkg.version bash -l debug
With the log file level set to debug, these are the messages when the CLI fails to get a connection to the salt-master:
sudo salt minion01 test.version --log-file-level=debug
<snipped>
2019-12-13 17:50:01,285 [salt.config :2185][DEBUG ][27609] Reading configuration from /etc/salt/master
2019-12-13 17:50:01,290 [salt.config :2329][DEBUG ][27609] Including configuration from '/etc/salt/master.d/reactor.conf'
2019-12-13 17:50:01,291 [salt.config :2185][DEBUG ][27609] Reading configuration from /etc/salt/master.d/reactor.conf
2019-12-13 17:50:01,291 [salt.config :2329][DEBUG ][27609] Including configuration from '/etc/salt/master.d/salt-api.conf'
2019-12-13 17:50:01,291 [salt.config :2185][DEBUG ][27609] Reading configuration from /etc/salt/master.d/salt-api.conf
2019-12-13 17:50:01,293 [salt.config :3666][DEBUG ][27609] Using cached minion ID from /etc/salt/minion_id: salt01
2019-12-13 17:50:01,295 [salt.config :2289][DEBUG ][27609] Missing configuration file: /root/.saltrc
2019-12-13 17:50:01,297 [salt.utils.event :322 ][DEBUG ][27609] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2019-12-13 17:50:01,297 [salt.utils.event :323 ][DEBUG ][27609] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2019-12-13 17:50:01,439 [salt.transport.zeromq:138 ][DEBUG ][27609] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'salt01_master', u'tcp://127.0.0.1:4506', u'clear')
2019-12-13 17:50:01,440 [salt.transport.zeromq:209 ][DEBUG ][27609] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
2019-12-13 17:50:01,440 [salt.transport.zeromq:1189][DEBUG ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:06,448 [salt.transport.zeromq:1244][DEBUG ][27609] SaltReqTimeoutError, retrying. (1/3)
2019-12-13 17:50:11,452 [salt.transport.zeromq:1244][DEBUG ][27609] SaltReqTimeoutError, retrying. (2/3)
2019-12-13 17:50:16,457 [salt.transport.zeromq:1244][DEBUG ][27609] SaltReqTimeoutError, retrying. (3/3)
2019-12-13 17:50:21,464 [salt.transport.zeromq:1214][DEBUG ][27609] Re-init ZMQ socket: Message timed out
2019-12-13 17:50:21,464 [salt.transport.zeromq:1189][DEBUG ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:21,465 [salt.transport.ipc:364 ][DEBUG ][27609] Closing IPCMessageSubscriber instance
2019-12-13 17:50:21,469 [salt.utils.lazy :104 ][DEBUG ][27609] LazyLoaded nested.output
2019-12-13 17:50:21,489 [salt.transport.zeromq:233 ][DEBUG ][27609] Closing AsyncZeroMQReqChannel instance
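Since the ReqServer on tcp://127.0.0.1:4506 simply stops answering here, one basic sanity check is to confirm the port is still being served and the worker processes are alive (a sketch using standard CentOS 7 tools; the descriptive process titles only show up when setproctitle is installed):
# is anything still listening on the master's request port?
sudo ss -ltnp | grep ':4506'
# are the master's worker/ReqServer processes still running?
ps -ef | grep -E 'salt-master|MWorker|ReqServer' | grep -v grep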
Are these from older minions?
2019-12-13 15:03:16,712 [salt.loaded.int.returner.local_cache:156 ][ERROR ][1937] An extra return was detected from minion xxcims01, please verify the minion, this could be a replay attack
<snipped>
was detected from minion xxre01, please verify the minion, this could be a replay attack
2019-12-13 15:03:45,523 [salt.loaded.int.returner.local_cache:156 ][ERROR ][1935] An extra return was detected from minion pcims01, please verify the minion, this could be a replay attack
From /var/log/salt/master, this traceback comes from a minion whose cached public key has "-----BEGIN RSA PUBLIC KEY-----" as its first line, i.e. a public key format issue:
2019-12-13 12:57:05,172 [tornado.application:123 ][ERROR ][503] Future exception was never retrieved: Traceback (most recent call last):
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 232, in wrapper
yielded = next(result)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 754, in handle_message
stream.send(self.serial.dumps(self._auth(payload['load'])))
File "/usr/lib/python2.7/site-packages/salt/transport/mixins/auth.py", line 436, in _auth
pub = salt.crypt.get_rsa_pub_key(pubfn)
File "/usr/lib/python2.7/site-packages/salt/crypt.py", line 213, in get_rsa_pub_key
key = RSA.load_pub_key_bio(bio)
File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 441, in load_pub_key_bio
rsa_error()
File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 330, in rsa_error
raise RSAError(Err.get_error_message())
RSAError: no start line
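A quick way to spot which cached minion keys are in that format (a sketch; the key directory is the Salt default, and deleting/re-accepting a key as the fix is my assumption, not something confirmed in this thread):
# list cached minion public keys whose first line is the PKCS#1 header
for f in /etc/salt/pki/master/minions/*; do
  head -n1 "$f" | grep -q 'BEGIN RSA PUBLIC KEY' && echo "$f"
done
# such a key can be deleted and re-accepted after restarting that minion:
# sudo salt-key -d <minion_id> -y   (then restart salt-minion on that host and sudo salt-key -a <minion_id> -y)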
Possible Resolutions
Performance settings for a master with 16 GB RAM / 16 CPU cores and 1,000 connected minions.
The following settings are able to pass the "for i in 1 2 3 4 5; do sudo salt \* test.version --out=txt; sleep 60; done" stress test.
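The exact values used are not preserved in this issue; below is a hypothetical sketch of the kind of /etc/salt/master tuning being described, with illustrative values borrowed from the shipped config comment quoted earlier rather than the author's actual configuration:
# /etc/salt/master -- illustrative values only
worker_threads: 16              # roughly one per CPU core is a common starting point
pub_hwm: 10000                  # ZMQ high-water mark for the publisher
salt_event_pub_hwm: 128000      # example values from the default config comment above
event_publisher_pub_hwm: 64000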
I tried raising the worker_threads and ZMQ thresholds as shown above, but it only lasts for a while before I need to restart the salt-master to be able to send test.version to even a single minion. I think there is a bug somewhere.
This is definitely strange. Have you tried running your command with --async, and then using salt-run jobs.find_job <JID> with whatever the JID was? That would show you that at least the results are returning 🤔
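Concretely, that flow looks something like this (the job ID below is a placeholder; jobs.lookup_jid is the runner the timeout message itself points to):
sudo salt --async \* test.version
# prints something like: Executed command with job ID: 20191213175001234567
sudo salt-run jobs.lookup_jid 20191213175001234567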
You can use the state event runner to watch your event bus: salt-run state.event pretty=True
It sounds like your master is getting swamped with events; that should give you an idea about what those events actually are. Right now it's not clear what's causing your issue.
I am closing this issue since the root cause was that my salt-master was below the hardware specification needed to support 1K minion connections. Once CPU/RAM were doubled and the ZMQ HWM settings were increased, the master can now handle "salt \* test.ping" without timing out.
Debugging
I also removed all existing minion keys (sudo salt-key -D -y) and kept only a few minions for testing the version command; still the same problem.
"sudo salt-run state.event pretty=True" was used in another terminal to display event bus traffic, but nothing related to the salt-master showed up.
I looked at this Google Groups post about raising the thresholds in case this is an event congestion issue.
There are some errors in /var/log/salt/master (see the log excerpts above).
Test case
The stress test loop (written out below) passes every time on a CentOS 7.7 / salt-master 2019.2.2 with only 3 minions ;) but not on another master with 1,000 minions connected.
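For reference, the stress test loop quoted earlier, written out (the target is quoted so the shell does not glob *):
for i in 1 2 3 4 5; do
  sudo salt '*' test.version --out=txt
  sleep 60
done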