Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]salt.exceptions.SaltReqTimeoutError: Message timed out when working on 1000 Minions #63310

Open
saiaprameya opened this issue Dec 13, 2022 · 3 comments
Labels
Bug broken, incorrect, or confusing behavior needs-triage Performance

Comments

@saiaprameya
Copy link

Description
We are trying to execute scripts by copying on to the minion and a custom module.
We have close to 1000 VMs and when the below said commands are executed it exits out with the following backtrace:

...:     salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
...:     salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

First command would copy the file and 2nd command is calling the custom built module present on the minions to execute the copied script.

Master side i see the following messages:

Dec 1 16:20:07 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec 1 16:20:07 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#41973: Invalid argument
Dec 1 16:20:07 bb11t1-mgmt named[1689]: client @0x7f9cb402dc80 172.16.0.5#41973 (bb12-5gfi-d-c14-b02.bb11t1.local): error sending response: invalid file
Dec 1 16:20:10 bb11t1-mgmt salt-master[509142]: [DEBUG ] Process <class 'salt.master.FileserverUpdate'> (1230778) died with exit status 0, restarting...

SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with --async in order to bypass the congested event bus. With --async, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use salt-run jobs.lookup_jid to look up the results of the job in the job cache later.

Here is the complete traceback:

 53         trg = f"/root/clean_{target}.sh"
 54         salt_handle.cmd_async(f"{target}", 'cp.get_file', [src, trg])
 55         salt_handle.cmd_async(f"{target}", 'ping_minion.clean_ping')

ipdb> c
56%|██████████████████████████████████████████████████████████████████████████▉ | 556/987 [00:23<00:17, 24.12it/s]
Traceback (most recent call last):
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/init.py", line 1901, in pub
payload = channel.send(payload_kwargs, timeout=timeout)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
raise exc_info[1].with_traceback(exc_info[2])
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 131, in _target
result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
return future_cell[0].result()
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "", line 4, in raise_exc_info
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 292, in send
ret = yield self._uncrypted_transfer(load, timeout=timeout)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "", line 4, in raise_exc_info
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 269, in _uncrypted_transfer
timeout=timeout,
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "", line 4, in raise_exc_info
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 914, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "", line 4, in raise_exc_info
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 624, in send
recv = yield future
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/init.py", line 396, in run_job
**kwargs
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/init.py", line 1905, in pub
"Salt request timed out. The master is not responding. You "
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with --async in order to bypass the congested event bus. With --async, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use salt-run jobs.lookup_jid to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "clean.py", line 59, in
run()
File "clean.py", line 17, in run
result = {}
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/init.py", line 495, in cmd_async
tgt, fun, arg, tgt_type, ret, jid=jid, kwarg=kwarg, listen=False, **kwargs
File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/init.py", line 409, in run_job
raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with --async in order to bypass the congested event bus. With --async, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use salt-run jobs.lookup_jid to look up the results of the job in the job cache later.

If you suspect this is an IPython 7.20.0 bug, please report it at:
https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipython-dev@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
%config Application.verbose_crash=True

[ERROR ] Message timed out

Setup

salt --versions-report
Salt Version:
Salt: 3005.1
 
Dependency Versions:
cffi: 1.14.5
cherrypy: Not Installed
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.11.3
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.4
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.20
pycrypto: 3.9.8
pycryptodome: 3.16.0
pygit2: Not Installed
Python: 3.7.7 (default, Feb 16 2021, 12:37:08)
python-gnupg: Not Installed
PyYAML: 5.3.1
PyZMQ: 20.0.0
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.3
 
System Versions:
dist: centos 8 Core
locale: UTF-8
machine: x86_64
release: 4.18.0-193.el8.x86_64
system: Linux
version: CentOS Linux 8 Core

Please be as specific as possible and give set-up details.

  • [ NO] on-prem machine
  • [YES ] VM (Virtualbox, KVM, etc. please specify)
  • [ NO] VM running on a cloud service, please be explicit and add details
  • [ NO] container (Kubernetes, Docker, containerd, etc. please specify)
  • [NO ] or a combination, please be explicit
  • [ NO] jails if it is FreeBSD
  • [ NO] classic packaging
  • [ NO] onedir packaging
  • [ NO] used bootstrap to install
  • Used pip install of SALT

Steps to Reproduce the behavior
(Include debug logs if possible and relevant)
Debug logs:

b12-5gfi-g-r002-VM4', 'success': True, 'return': '/root/clean_bb12-5gfi-g-r002-VM4.sh', 'retcode': 0, 'jid': '20221212185651822144', 'fun': 'cp.get_file', 'fun_args': ['salt://scratchpad/clean_bb12-5gfi-g-r002-VM4.sh', '/root/clean_bb12-5gfi-g-r002-VM4.sh'], '_stamp': '2022-12-12T18:56:52.208636'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Returning cached minion list
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-h-r018-VM20 for job 20221212185651858822
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185651858822/ret/bb12-5gfi-h-r018-VM20; data = {'cmd': '_return', 'id': 'bb12-5gfi-h-r018-VM20', 'success': True, 'return': '/root/clean_bb12-5gfi-h-r018-VM20.sh', 'retcode': 0, 'jid': '20221212185651858822', 'fun': 'cp.get_file', 'fun_args': ['salt://scratchpad/clean_bb12-5gfi-h-r018-VM20.sh', '/root/clean_bb12-5gfi-h-r018-VM20.sh'], '_stamp': '2022-12-12T18:56:52.210548'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = 20221212185652210790; data = {'minions': ['bb12-5gfi-h-r024-VM3'], '_stamp': '2022-12-12T18:56:52.211073'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185652210790/new; data = {'jid': '20221212185652210790', 'tgt_type': 'glob', 'tgt': 'bb12-5gfi-h-r024-VM3', 'user': 'root', 'fun': 'cp.get_file', 'arg': ['salt://scratchpad/clean_bb12-5gfi-h-r024-VM3.sh', '/root/clean_bb12-5gfi-h-r024-VM3.sh'], 'minions': ['bb12-5gfi-h-r024-VM3'], 'missing': [], '_stamp': '2022-12-12T18:56:52.211584'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Adding minions for job 20221212185652210790: ['bb12-5gfi-h-r024-VM3']
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-d-c11-b01-VM6 for job 20221212185643902167
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] User root Published command cp.get_file with jid 20221212185652210790
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185643902167/ret/bb12-5gfi-d-c11-b01-VM6; data = {'cmd': '_return', 'id': 'bb12-5gfi-d-c11-b01-VM6', 'success': True, 'return': True, 'retcode': 0, 'jid': '20221212185643902167', 'fun': 'ping_minion.clean_ping', 'fun_args': [], '_stamp': '2022-12-12T18:56:52.212508'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Published command details {'fun': 'cp.get_file', 'arg': ['salt://scratchpad/clean_bb12-5gfi-h-r024-VM3.sh', '/root/clean_bb12-5gfi-h-r024-VM3.sh'], 'tgt': 'bb12-5gfi-h-r024-VM3', 'jid': '20221212185652210790', 'ret': '', 'tgt_type': 'glob', 'user': 'root'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-g-r011-VM10 for job 20221212185643931271
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending payload to publish daemon. jid=20221212185652210790 load={'fun': 'cp.get_file', 'arg': ['salt://s
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sent payload to publish daemon.
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185643931271/ret/bb12-5gfi-g-r011-VM10; data = {'cmd': '_return', 'id': 'bb12-5gfi-g-r011-VM10', 'success': True, 'return': True, 'retcode': 0, 'jid': '20221212185643931271', 'fun': 'ping_minion.clean_ping', 'fun_args': [], '_stamp': '2022-12-12T18:56:52.212801'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Signing data packet
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.get_rsa_key: Loading private key
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.sign_message: Signing message.
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Returning cached minion list
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-d-r032-VM2 for job 20221212185643971507
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185643971507/ret/bb12-5gfi-d-r032-VM2; data = {'cmd': '_return', 'id': 'bb12-5gfi-d-r032-VM2', 'success': True, 'return': True, 'retcode': 0, 'jid': '20221212185643971507', 'fun': 'ping_minion.clean_ping', 'fun_args': [], '_stamp': '2022-12-12T18:56:52.216822'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-g-r013-VM13 for job 20221212185643944287
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185643944287/ret/bb12-5gfi-g-r013-VM13; data = {'cmd': '_return', 'id': 'bb12-5gfi-g-r013-VM13', 'success': True, 'return': True, 'retcode': 0, 'jid': '20221212185643944287', 'fun': 'ping_minion.clean_ping', 'fun_args': [], '_stamp': '2022-12-12T18:56:52.217004'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = 20221212185652216834; data = {'minions': ['bb12-5gfi-h-r024-VM3'], '_stamp': '2022-12-12T18:56:52.217068'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185652216834/new; data = {'jid': '20221212185652216834', 'tgt_type': 'glob', 'tgt': 'bb12-5gfi-h-r024-VM3', 'user': 'root', 'fun': 'ping_minion.clean_ping', 'arg': [], 'minions': ['bb12-5gfi-h-r024-VM3'], 'missing': [], '_stamp': '2022-12-12T18:56:52.217588'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Signing data packet
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Adding minions for job 20221212185652216834: ['bb12-5gfi-h-r024-VM3']
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.get_rsa_key: Loading private key
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.sign_message: Signing message.
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] User root Published command ping_minion.clean_ping with jid 20221212185652216834
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Published command details {'fun': 'ping_minion.clean_ping', 'arg': [], 'tgt': 'bb12-5gfi-h-r024-VM3', 'jid': '20221212185652216834', 'ret': '', 'tgt_type': 'glob', 'user': 'root'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending payload to publish daemon. jid=20221212185652216834 load={'fun': 'ping_minion.clean_ping', 'arg':
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sent payload to publish daemon.
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] Got return from bb12-5gfi-h-r019-VM9 for job 20221212185651833818
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185651833818/ret/bb12-5gfi-h-r019-VM9; data = {'cmd': '_return', 'id': 'bb12-5gfi-h-r019-VM9', 'success': True, 'return': '/root/clean_bb12-5gfi-h-r019-VM9.sh', 'retcode': 0, 'jid': '20221212185651833818', 'fun': 'cp.get_file', 'fun_args': ['salt://scratchpad/clean_bb12-5gfi-h-r019-VM9.sh', '/root/clean_bb12-5gfi-h-r019-VM9.sh'], '_stamp': '2022-12-12T18:56:52.221690'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Returning cached minion list
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = 20221212185652222972; data = {'minions': ['bb12-5gfi-h-r019-VM17'], '_stamp': '2022-12-12T18:56:52.223220'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Signing data packet
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.get_rsa_key: Loading private key
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] salt.crypt.sign_message: Signing message.
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending event: tag = salt/job/20221212185652222972/new; data = {'jid': '20221212185652222972', 'tgt_type': 'glob', 'tgt': 'bb12-5gfi-h-r019-VM17', 'user': 'root', 'fun': 'cp.get_file', 'arg': ['salt://scratchpad/clean_bb12-5gfi-h-r019-VM17.sh', '/root/clean_bb12-5gfi-h-r019-VM17.sh'], 'minions': ['bb12-5gfi-h-r019-VM17'], 'missing': [], '_stamp': '2022-12-12T18:56:52.223731'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Adding minions for job 20221212185652222972: ['bb12-5gfi-h-r019-VM17']
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [INFO ] User root Published command cp.get_file with jid 20221212185652222972
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Published command details {'fun': 'cp.get_file', 'arg': ['salt://scratchpad/clean_bb12-5gfi-h-r019-VM17.sh', '/root/clean_bb12-5gfi-h-r019-VM17.sh'], 'tgt': 'bb12-5gfi-h-r019-VM17', 'jid': '20221212185652222972', 'ret': '', 'tgt_type': 'glob', 'user': 'root'}
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sending payload to publish daemon. jid=20221212185652222972 load={'fun': 'cp.get_file', 'arg': ['salt://s
Dec 12 10:56:52 bb11t1-mgmt salt-master[3242056]: [DEBUG ] Sent payload to publish daemon.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
PASTE HERE

Additional context
Add any other context about the problem here.

@saiaprameya saiaprameya added Bug broken, incorrect, or confusing behavior needs-triage labels Dec 13, 2022
@OrangeDog
Copy link
Contributor

You are probably just overloading the master by making 1000 copy requests simultaneously.
Or it's possible this is another case of whatever is happening in #62881.

@saiaprameya
Copy link
Author

saiaprameya commented Dec 14, 2022

One observation is that we had these 2 statements in a single for loop:

...: salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
...: salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

This causes problem when the host size is 980+.

But when we split the 2 statements in to different for loop then it works.
Can you comment on why would it not work when we have those 2 statements in the same loop as its a async call and should not be blocking.

Is it something to do with the ZMQ since the same source is being used twice for 2 different commands when the host size is ~1000?

More over the system config where salt is being run has 16 cores, 64 GB RAM. Do you think controlling 1000VMs is a problem with this kind of configuration?

Regards
Sai

@saiaprameya
Copy link
Author

Can anyone please give some pointers to the above said problem.

"""
But when we split the 2 statements in to different for loop then it works.
Can you comment on why would it not work when we have those 2 statements in the same loop as its a async call and should not be blocking.

Is it something to do with the ZMQ since the same source is being used twice for 2 different commands?
"""

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior needs-triage Performance
Projects
None yet
Development

No branches or pull requests

2 participants