
master 9.2.2: Salt request timed out. The master is not responding. You may need to run your command with --async #55629

Closed
tjyang opened this issue Dec 13, 2019 · 5 comments
Labels: info-needed waiting for more info

tjyang (Contributor) commented Dec 13, 2019

Hi,

I decided to open a new ticket, given the related issues #32465, #49105, #33750, and #46868. This ticket records the process of debugging this problem, in the hope that others can help resolve it.

Description of Issue

This happened on my 2019.2.2 salt-master. I restored the VM to a snapshot as old as 2019.2.0 and the issue still exists.

Setup

This is a multi-master setup; each minion allows management from both salt01 and salt02.

Steps to Reproduce Issue

  • The following exact command sequence is able to interact with most minions but still times out in the end:
me@salt02> sudo systemctl stop salt-master && sleep 5 && sudo systemctl start salt-master \
&& sleep 5 &&  sudo salt \* test.version --output=txt

<snipped>
a31: 2019.2.0
b114: 2019.2.2
c01: 2019.2.0
dms01: 2019.2.0
eslave01: 2019.2.0
Salt request timed out. The master is not responding. You may need to run your command with
 `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job 
id (jid) and exit immediately without listening for responses. You can then use
 `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$
  • A few minutes later, not even one version number is returned from the minions:
[me@salt01 ~]$ sudo salt \* test.version --output=txt
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$

Versions Report

  • salt-minion versions range from 2017.x to 2019.2.2
  • The salt master has the following hardware resources:
8 CPU cores
8 GB RAM, with around 4 GB used
The following settings were all commented out or absent; the master configuration is the one that has been in use since version 2017:
#max_open_files
#worker_threads
#pub_hwm:
#salt_event_pub_hwm
#event_publisher_pub_hwm
  • Example settings documented (commented out) in /etc/salt/master:
# These two ZMQ HWM settings, salt_event_pub_hwm and event_publisher_pub_hwm
# are significant for masters with thousands of minions.  When these are
# insufficiently high it will manifest in random responses missing in the CLI
# and even missing from the job cache.  Masters that have fast CPUs and many
# cores with appropriate worker_threads will not need these set as high.

# On deployment with 8,000 minions, 2.4GHz CPUs, 24 cores, 32GiB memory has
# these settings:
#
#   salt_event_pub_hwm: 128000
#   event_publisher_pub_hwm: 64000

  • Versions report:
[me@salt01 ~]$ sudo salt  --versions-report ;date
Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: 1.6.0
       cherrypy: unknown
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: 1.3.8
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: 1.6.1
       M2Crypto: 0.31.0
           Mako: 0.8.1
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.3
         pygit2: 0.26.4
         Python: 2.7.5 (default, Aug  7 2019, 00:51:29)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.7.1908 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1062.9.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.7.1908 Core

Thu Dec 12 16:37:33 CST 2019
[me@salt01 ~]$

Debugging

[me@salt01 ~]$ grep ^log_fmt_console /etc/salt/minion
log_fmt_console: '%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s'
[me@salt01 ~]$
and then run a command like the following to see minion/master interaction details:
sudo time salt-call pkg.version bash -l debug
  • Look at the first line of each minion's public key:
sudo salt-run survey.hash \*    cmd.run 'head -1 /etc/salt/pki/minion/minion.pub'
  • I also removed all existing minion keys (sudo salt-key -D -y) and kept only a few minions for testing the version command; the problem is still the same.

  • " sudo salt-run state.event pretty=True" was used in another vt100 terminal to display event bus traffic, but not thing related to salt-master.

  • I also looked at a Google Groups post about raising thresholds in case this is an event-congestion issue.

Some errors appear in /var/log/salt/master:

  • With the log file level set to debug, these are the messages seen when failing to get a connection to the salt-master:
sudo salt minion01  test.version  --log-file-level=debug
<snipped>
2019-12-13 17:50:01,285 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master
2019-12-13 17:50:01,290 [salt.config      :2329][DEBUG   ][27609] Including configuration from '/etc/salt/master.d/reactor.conf'
2019-12-13 17:50:01,291 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master.d/reactor.conf
2019-12-13 17:50:01,291 [salt.config      :2329][DEBUG   ][27609] Including configuration from '/etc/salt/master.d/salt-api.conf'
2019-12-13 17:50:01,291 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master.d/salt-api.conf
2019-12-13 17:50:01,293 [salt.config      :3666][DEBUG   ][27609] Using cached minion ID from /etc/salt/minion_id: salt01
2019-12-13 17:50:01,295 [salt.config      :2289][DEBUG   ][27609] Missing configuration file: /root/.saltrc
2019-12-13 17:50:01,297 [salt.utils.event :322 ][DEBUG   ][27609] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2019-12-13 17:50:01,297 [salt.utils.event :323 ][DEBUG   ][27609] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2019-12-13 17:50:01,439 [salt.transport.zeromq:138 ][DEBUG   ][27609] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'salt01_master', u'tcp://127.0.0.1:4506', u'clear')
2019-12-13 17:50:01,440 [salt.transport.zeromq:209 ][DEBUG   ][27609] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
2019-12-13 17:50:01,440 [salt.transport.zeromq:1189][DEBUG   ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:06,448 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (1/3)
2019-12-13 17:50:11,452 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (2/3)
2019-12-13 17:50:16,457 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (3/3)
2019-12-13 17:50:21,464 [salt.transport.zeromq:1214][DEBUG   ][27609] Re-init ZMQ socket: Message timed out
2019-12-13 17:50:21,464 [salt.transport.zeromq:1189][DEBUG   ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:21,465 [salt.transport.ipc:364 ][DEBUG   ][27609] Closing IPCMessageSubscriber instance
2019-12-13 17:50:21,469 [salt.utils.lazy  :104 ][DEBUG   ][27609] LazyLoaded nested.output
2019-12-13 17:50:21,489 [salt.transport.zeromq:233 ][DEBUG   ][27609] Closing AsyncZeroMQReqChannel instance

  • From an older minion?
2019-12-13 15:03:16,712 [salt.loaded.int.returner.local_cache:156 ][ERROR   ][1937] An extra return was detected from minion xxcims01, please verify the minion, this could be a replay attack
<snipped>
was detected from minion xxre01, please verify the minion, this could be a replay attack
2019-12-13 15:03:45,523 [salt.loaded.int.returner.local_cache:156 ][ERROR   ][1935] An extra return was detected from minion pcims01, please verify the minion, this could be a replay attack

  • From /var/log/salt/master, this entry is from a minion whose public key's first line is "-----BEGIN RSA PUBLIC KEY-----" (a public key format issue; see the master-side check sketched after the traceback):
2019-12-13 12:57:05,172 [tornado.application:123 ][ERROR   ][503] Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 232, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 754, in handle_message
    stream.send(self.serial.dumps(self._auth(payload['load'])))
  File "/usr/lib/python2.7/site-packages/salt/transport/mixins/auth.py", line 436, in _auth
    pub = salt.crypt.get_rsa_pub_key(pubfn)
  File "/usr/lib/python2.7/site-packages/salt/crypt.py", line 213, in get_rsa_pub_key
    key = RSA.load_pub_key_bio(bio)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 441, in load_pub_key_bio
    rsa_error()
  File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 330, in rsa_error
    raise RSAError(Err.get_error_message())
RSAError: no start line
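
A master-side check may be more practical here than the survey.hash run above, since minions cannot be reached while the master is timing out. A minimal sketch, assuming the stock key directory /etc/salt/pki/master/minions/ (adjust if pki_dir is customized), to list accepted minion keys whose first line is the header called out above:

# List accepted minion keys that begin with "-----BEGIN RSA PUBLIC KEY-----".
# The "--" keeps grep from treating the leading dashes of the pattern as an option.
sudo grep -l -- '-----BEGIN RSA PUBLIC KEY-----' /etc/salt/pki/master/minions/*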

Possible Resolutions

  • Performance settings for a master with 16 GB RAM / 16 CPU cores and 1,000 connected minions.
    The following settings are able to pass the "for i in 1 2 3 4 5; do sudo salt * test.version --out=txt; sleep 60; done" stress test (a restart note follows the listing):
[me@salt01 ~]$ egrep '^pub_hwm|^salt_event|^event|^worker|^max_open' /etc/salt/master
max_open_files: 65535
worker_threads: 80
pub_hwm: 10000
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
[me@salt01 ~]$
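
For completeness (assuming a systemd-managed master, as in the commands earlier in this report), the settings above only take effect after the master service is restarted:

# Apply the changed /etc/salt/master settings.
sudo systemctl restart salt-master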

Test case

  • Stress test 1
    The following passes every time on a CentOS 7.7 / salt-master 2019.2.2 host with only 3 minions ;) but not on another one with 1,000 minions connected.
[me@salt01 ~]$ for i in 1 2 3 4 5; do sudo salt -t 5 \* test.version --out=txt; sleep 10; done
tjyang (Contributor, Author) commented Dec 13, 2019

I tried raising the worker_threads and ZMQ thresholds as shown above, but it only lasts for a while before I need to restart the salt-master to be able to send test.version to even a single minion. I am thinking there is a bug somewhere.

Looks like a bug in version 2019.2.2 to me.

waynew (Contributor) commented Dec 13, 2019

This is definitely strange. Have you tried running your command with --async, and then using salt-run jobs.find_job <JID> with whatever the JID was? That would show you that at least the results are returning 🤔

You can use the state event runner to watch your event bus: salt-run state.event pretty=True

It sounds like your master is getting swamped with events - that should give you an idea about what events they actually are. Right now it's not clear what's causing your issue.
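
For reference, the --async workflow mentioned above looks roughly like this; the job ID shown is illustrative, and the lookup uses the salt-run jobs.lookup_jid runner named in the timeout message:

sudo salt \* test.version --async                     # prints a job id (jid) and exits immediately
sudo salt-run jobs.lookup_jid 20191213175001123456    # fetch that job's results from the job cache later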

@waynew waynew added the info-needed waiting for more info label Dec 13, 2019
@waynew waynew added this to the Blocked milestone Dec 13, 2019
tjyang (Contributor, Author) commented Dec 14, 2019

@waynew
After the salt-master hardware resources were doubled to 16 GB RAM / 16 CPU cores and the performance settings were configured, the master is still responsive after 3 hours.

  • Please lower the importance of this issue; it looks like this is an operational issue.
  • Is there a formula for how the performance settings should be set, given the following inputs? (A rough sketch follows the config example below.)
Number of minions
CPU speed
CPU number
RAM
max_open_files
worker_threads
pub_hwm:
salt_event_pub_hwm
event_publisher_pub_hwm
  • The following example is in the latest version of /etc/salt/master:
# On deployment with 8,000 minions, 2.4GHz CPUs, 24 cores, 32GiB memory has
# these settings:
#
#   salt_event_pub_hwm: 128000
#   event_publisher_pub_hwm: 64000
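
There is no exact formula in this thread; as a purely illustrative sketch (an assumption, not official Salt guidance), one could scale the values that worked here - worker_threads: 80 on 16 cores (5 per core) for 1,000 minions - by the local core count, and keep the HWM values from the shipped example:

# Rough, hypothetical starting point derived from this thread's working
# 16-core / 1,000-minion master; treat the output as a first guess for
# /etc/salt/master and validate it with the stress test from the issue body.
cores=$(nproc)
echo "worker_threads: $((cores * 5))"
echo "max_open_files: 65535"
echo "pub_hwm: 10000"
echo "salt_event_pub_hwm: 128000"
echo "event_publisher_pub_hwm: 64000"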

tjyang (Contributor, Author) commented Dec 15, 2019

I am closing this issue since the root cause was that my salt-master was below the hardware specification needed to support 1K minion connections. Once CPU/RAM were doubled and the ZMQ HWM settings were increased, the master can now handle "salt * test.ping" without timing out.

@tjyang tjyang closed this as completed Dec 15, 2019
enmanuel17 commented
@tjyang

Just want to clarify: the options

salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000

did not do anything, since you were using Salt version 2019.2.0 and those options were removed long ago in #49934 and #49925.

I believe the only thing that might have helped you was increasing your master resources.
