
master 9.2.2: Salt request timed out. The master is not responding. You may need to run your command with --async #55629

Closed
tjyang opened this issue Dec 13, 2019 · 5 comments
Labels: info-needed waiting for more info

tjyang (Contributor) commented Dec 13, 2019

Hi,

I decided to open a new ticket, given the related issues #32465, #49105, #33750, and #46868. This ticket records the process of debugging this problem, in the hope that others can help resolve it.

Description of Issue

This happened on my 2019.2.2 salt-master. I restored the VM to a snapshot as old as 2019.2.0 and the issue still exists.

Setup

This is a multi-master setup; each minion allows management from both salt01 and salt02.

Steps to Reproduce Issue

  • The following exact command sequence is able to interact with most minions but still times out in the end:
me@salt02> sudo systemctl stop salt-master && sleep 5 && sudo systemctl start salt-master \
&& sleep 5 &&  sudo salt \* test.version --output=txt

<snipped>
a31: 2019.2.0
b114: 2019.2.2
c01: 2019.2.0
dms01: 2019.2.0
eslave01: 2019.2.0
Salt request timed out. The master is not responding. You may need to run your command with
 `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job 
id (jid) and exit immediately without listening for responses. You can then use
 `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$
  • A few minutes later, not even one version number is returned from the minions:
[me@salt01 ~]$ sudo salt \* test.version --output=txt
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[me@salt01 ~]$

Versions Report

  • salt-minion versions range from 2017.x to 2019.2.2
  • The salt master has the following hardware resources:
8 CPU cores
8 GB RAM, with around 4 GB used
The following settings were all commented out or absent; the master configuration is the one that has been in use since version 2017:
#max_open_files
#worker_threads
#pub_hwm:
#salt_event_pub_hwm
#event_publisher_pub_hwm
  • Example settings documented (commented out) in /etc/salt/master:
# These two ZMQ HWM settings, salt_event_pub_hwm and event_publisher_pub_hwm
# are significant for masters with thousands of minions.  When these are
# insufficiently high it will manifest in random responses missing in the CLI
# and even missing from the job cache.  Masters that have fast CPUs and many
# cores with appropriate worker_threads will not need these set as high.

# On deployment with 8,000 minions, 2.4GHz CPUs, 24 cores, 32GiB memory has
# these settings:
#
#   salt_event_pub_hwm: 128000
#   event_publisher_pub_hwm: 64000

  • Versions report:
[me@salt01 ~]$ sudo salt  --versions-report ;date
Salt Version:
           Salt: 2019.2.2

Dependency Versions:
           cffi: 1.6.0
       cherrypy: unknown
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: 1.3.8
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: 1.6.1
       M2Crypto: 0.31.0
           Mako: 0.8.1
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: 3.7.3
         pygit2: 0.26.4
         Python: 2.7.5 (default, Aug  7 2019, 00:51:29)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.1.4

System Versions:
           dist: centos 7.7.1908 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-1062.9.1.el7.x86_64
         system: Linux
        version: CentOS Linux 7.7.1908 Core

Thu Dec 12 16:37:33 CST 2019
[me@salt01 ~]$

Debugging

[me@salt01 ~]$ grep ^log_fmt_console /etc/salt/minion
log_fmt_console: '%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s'
[me@salt01 ~]$
and then run a command like the following to see minion/master interaction details:
sudo time salt-call pkg.version bash -l debug
  • Look at the first line of each minion's public key:
sudo salt-run survey.hash \*    cmd.run 'head -1 /etc/salt/pki/minion/minion.pub'
  • I also removed all existing minion keys (sudo salt-key -D -y) and kept only a few minions for testing the version command; the problem is still the same.

  • " sudo salt-run state.event pretty=True" was used in another vt100 terminal to display event bus traffic, but not thing related to salt-master.

  • I also looked at a Google Groups post about raising thresholds in case this is an event-congestion issue.

Some errors appear in /var/log/salt/master:

  • With the log file level set to debug, these are the messages seen when failing to get a connection to the salt-master:
sudo salt minion01  test.version  --log-file-level=debug
<snipped>
2019-12-13 17:50:01,285 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master
2019-12-13 17:50:01,290 [salt.config      :2329][DEBUG   ][27609] Including configuration from '/etc/salt/master.d/reactor.conf'
2019-12-13 17:50:01,291 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master.d/reactor.conf
2019-12-13 17:50:01,291 [salt.config      :2329][DEBUG   ][27609] Including configuration from '/etc/salt/master.d/salt-api.conf'
2019-12-13 17:50:01,291 [salt.config      :2185][DEBUG   ][27609] Reading configuration from /etc/salt/master.d/salt-api.conf
2019-12-13 17:50:01,293 [salt.config      :3666][DEBUG   ][27609] Using cached minion ID from /etc/salt/minion_id: salt01
2019-12-13 17:50:01,295 [salt.config      :2289][DEBUG   ][27609] Missing configuration file: /root/.saltrc
2019-12-13 17:50:01,297 [salt.utils.event :322 ][DEBUG   ][27609] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2019-12-13 17:50:01,297 [salt.utils.event :323 ][DEBUG   ][27609] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2019-12-13 17:50:01,439 [salt.transport.zeromq:138 ][DEBUG   ][27609] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/master', u'salt01_master', u'tcp://127.0.0.1:4506', u'clear')
2019-12-13 17:50:01,440 [salt.transport.zeromq:209 ][DEBUG   ][27609] Connecting the Minion to the Master URI (for the return server): tcp://127.0.0.1:4506
2019-12-13 17:50:01,440 [salt.transport.zeromq:1189][DEBUG   ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:06,448 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (1/3)
2019-12-13 17:50:11,452 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (2/3)
2019-12-13 17:50:16,457 [salt.transport.zeromq:1244][DEBUG   ][27609] SaltReqTimeoutError, retrying. (3/3)
2019-12-13 17:50:21,464 [salt.transport.zeromq:1214][DEBUG   ][27609] Re-init ZMQ socket: Message timed out
2019-12-13 17:50:21,464 [salt.transport.zeromq:1189][DEBUG   ][27609] Trying to connect to: tcp://127.0.0.1:4506
2019-12-13 17:50:21,465 [salt.transport.ipc:364 ][DEBUG   ][27609] Closing IPCMessageSubscriber instance
2019-12-13 17:50:21,469 [salt.utils.lazy  :104 ][DEBUG   ][27609] LazyLoaded nested.output
2019-12-13 17:50:21,489 [salt.transport.zeromq:233 ][DEBUG   ][27609] Closing AsyncZeroMQReqChannel instance

  • From an older minion?
2019-12-13 15:03:16,712 [salt.loaded.int.returner.local_cache:156 ][ERROR   ][1937] An extra return was detected from minion xxcims01, please verify the minion, this could be a replay attack
<snipped>
was detected from minion xxre01, please verify the minion, this could be a replay attack
2019-12-13 15:03:45,523 [salt.loaded.int.returner.local_cache:156 ][ERROR   ][1935] An extra return was detected from minion pcims01, please verify the minion, this could be a replay attack

  • From /var/log/salt/master, this entry is from a minion whose public key's first line is "-----BEGIN RSA PUBLIC KEY-----" (a public key format issue; see the master-side check sketched after the traceback):
2019-12-13 12:57:05,172 [tornado.application:123 ][ERROR   ][503] Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 232, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 754, in handle_message
    stream.send(self.serial.dumps(self._auth(payload['load'])))
  File "/usr/lib/python2.7/site-packages/salt/transport/mixins/auth.py", line 436, in _auth
    pub = salt.crypt.get_rsa_pub_key(pubfn)
  File "/usr/lib/python2.7/site-packages/salt/crypt.py", line 213, in get_rsa_pub_key
    key = RSA.load_pub_key_bio(bio)
  File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 441, in load_pub_key_bio
    rsa_error()
  File "/usr/lib64/python2.7/site-packages/M2Crypto/RSA.py", line 330, in rsa_error
    raise RSAError(Err.get_error_message())
RSAError: no start line
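
A master-side check may be more practical here than the survey.hash run above, since minions cannot be reached while the master is timing out. A minimal sketch, assuming the stock key directory /etc/salt/pki/master/minions/ (adjust if pki_dir is customized), to list accepted minion keys whose first line is the header called out above:

# List accepted minion keys that begin with "-----BEGIN RSA PUBLIC KEY-----".
# The "--" keeps grep from treating the leading dashes of the pattern as an option.
sudo grep -l -- '-----BEGIN RSA PUBLIC KEY-----' /etc/salt/pki/master/minions/*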

Possible Resolutions

  • Performance settings for a master with 16 GB RAM / 16 CPU cores and 1,000 connected minions.
    The following settings are able to pass the "for i in 1 2 3 4 5; do sudo salt * test.version --out=txt; sleep 60; done" stress test (a restart note follows the listing):
[me@salt01 ~]$ egrep '^pub_hwm|^salt_event|^event|^worker|^max_open' /etc/salt/master
max_open_files: 65535
worker_threads: 80
pub_hwm: 10000
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
[me@salt01 ~]$
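
For completeness (assuming a systemd-managed master, as in the commands earlier in this report), the settings above only take effect after the master service is restarted:

# Apply the changed /etc/salt/master settings.
sudo systemctl restart salt-master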

Test case

  • Stress test 1
    The following passes every time on a CentOS 7.7 / salt-master 2019.2.2 host with only 3 minions ;) but not on another one with 1,000 minions connected.
[me@salt01 ~]$ for i in 1 2 3 4 5; do sudo salt -t 5 \* test.version --out=txt; sleep 10; done
tjyang (Contributor, Author) commented Dec 13, 2019

I tried raising the worker_threads and ZMQ thresholds as shown above, but it only lasts for a while before I need to restart the salt-master to be able to send test.version to even a single minion. I am thinking there is a bug somewhere.

Looks like a bug in version 2019.2.2 to me.

waynew (Contributor) commented Dec 13, 2019

This is definitely strange. Have you tried running your command with --async, and then using salt-run jobs.find_job <JID> with whatever the JID was? That would show you that at least the results are returning 🤔

You can use the state event runner to watch your event bus: salt-run state.event pretty=True

It sounds like your master is getting swamped with events - that should give you an idea about what events they actually are. Right now it's not clear what's causing your issue.
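
For reference, the --async workflow mentioned above looks roughly like this; the job ID shown is illustrative, and the lookup uses the salt-run jobs.lookup_jid runner named in the timeout message:

sudo salt \* test.version --async                     # prints a job id (jid) and exits immediately
sudo salt-run jobs.lookup_jid 20191213175001123456    # fetch that job's results from the job cache later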

@waynew waynew added the info-needed waiting for more info label Dec 13, 2019
@waynew waynew added this to the Blocked milestone Dec 13, 2019
tjyang (Contributor, Author) commented Dec 14, 2019

@waynew
After the salt-master hardware resources were doubled to 16 GB RAM / 16 CPU cores and the performance settings were configured, the master is still responsive after 3 hours.

  • Please lower the importance of this issue; it looks like this is an operational issue.
  • Is there a formula for how the performance settings should be set, given the following inputs? (A rough sketch follows the config example below.)
Number of minions
CPU speed
CPU number
RAM
max_open_files
worker_threads
pub_hwm:
salt_event_pub_hwm
event_publisher_pub_hwm
  • The following example is in the latest version of /etc/salt/master:
# On deployment with 8,000 minions, 2.4GHz CPUs, 24 cores, 32GiB memory has
# these settings:
#
#   salt_event_pub_hwm: 128000
#   event_publisher_pub_hwm: 64000
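
There is no exact formula in this thread; as a purely illustrative sketch (an assumption, not official Salt guidance), one could scale the values that worked here - worker_threads: 80 on 16 cores (5 per core) for 1,000 minions - by the local core count, and keep the HWM values from the shipped example:

# Rough, hypothetical starting point derived from this thread's working
# 16-core / 1,000-minion master; treat the output as a first guess for
# /etc/salt/master and validate it with the stress test from the issue body.
cores=$(nproc)
echo "worker_threads: $((cores * 5))"
echo "max_open_files: 65535"
echo "pub_hwm: 10000"
echo "salt_event_pub_hwm: 128000"
echo "event_publisher_pub_hwm: 64000"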

tjyang (Contributor, Author) commented Dec 15, 2019

I am closing this issue since the root cause was that my salt-master was below the hardware specification needed to support 1K minion connections. Once CPU/RAM were doubled and the ZMQ HWM settings were increased, the master can now handle "salt * test.ping" without timing out.

@tjyang tjyang closed this as completed Dec 15, 2019
enmanuel17 commented
@tjyang

Just want to clarify: the options

salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000

did not do anything, since you were using Salt version 2019.2.0 and those options were removed long ago in #49934 and #49925.

I believe the only thing that might have helped you was increasing your master resources.
