[BUG] Pillar fetch timeouts on proxy minion after ~60 seconds #63824

Open
dmulyalin opened this issue Mar 7, 2023 · 7 comments
Labels
Bug broken, incorrect, or confusing behavior needs-triage Pillar

Comments

@dmulyalin

Description

Pillar fetch times out on the proxy minion after ~60 seconds.

Setup

SaltStack 3005.1

Master and Proxy Minion are running in containers on a Rocky Linux VM in VirtualBox.

  • VM (Virtualbox, KVM, etc. please specify)
  • container (Kubernetes, Docker, containerd, etc. please specify)

Steps to Reproduce the behavior

Create an external pillar and make it sit idle for longer than 60 seconds.
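
A minimal sketch of such a pillar, assuming the usual `ext_pillar(minion_id, pillar, ...)` entry point for external pillar modules. The `delay` parameter and the returned `slow_pillar` key are made up for illustration and are not taken from the reporter's setup.

```python
import time


def ext_pillar(minion_id, pillar, delay=90):
    """Hypothetical external pillar that stalls to mimic a slow backend."""
    # Sleep past the ~60 second limit reported in this issue.
    time.sleep(delay)
    # Return a small dictionary to be merged into the minion's pillar.
    return {"slow_pillar": {"minion": minion_id, "waited_seconds": delay}}
```

A module like this would be enabled through the master's `ext_pillar` setting; any delay above 60 seconds should be enough to trigger the behaviour described here.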

Expected behavior

Being able to configure the external pillar timeout as a parameter.

Versions Report

bash-4.4# salt --versions-report
Salt Version:
          Salt: 3005
 
Dependency Versions:
          cffi: 1.15.1
      cherrypy: Not Installed
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: 4.0.10
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.4
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.17
        pygit2: Not Installed
        Python: 3.9.13 (main, Nov 16 2022, 15:31:39)
  python-gnupg: Not Installed
        PyYAML: 6.0
         PyZMQ: 20.0.0
         smmap: 5.0.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.3
 
Salt Extensions:
   salt-nornir: 0.19.0
 
System Versions:
          dist: rocky 8.5 Green Obsidian
        locale: utf-8
       machine: x86_64
       release: 4.18.0-425.3.1.el8.x86_64
        system: Linux
       version: Rocky Linux 8.5 Green Obsidian

Additional context

I created a custom external pillar, but it takes longer than 60 seconds to fetch data. While the master is still retrieving that data, the proxy minion errors out with a timeout error and keeps sending new pillar refresh requests to the master.

Getting this traceback on proxy-minion:

salt-minion-3005-1  | 14:59:37,425 [salt.pillar                              ][ERROR   ] Exception getting pillar:
salt-minion-3005-1  | Traceback (most recent call last):
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/pillar/__init__.py", line 262, in compile_pillar
salt-minion-3005-1  |     ret_pillar = yield self.channel.crypted_transfer_decode_dictentry(
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1056, in run
salt-minion-3005-1  |     value = future.result()
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
salt-minion-3005-1  |     raise_exc_info(self._exc_info)
salt-minion-3005-1  |   File "<string>", line 4, in raise_exc_info
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1064, in run
salt-minion-3005-1  |     yielded = self.gen.throw(*exc_info)
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/channel/client.py", line 171, in crypted_transfer_decode_dictentry
salt-minion-3005-1  |     ret = yield self.transport.send(
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1056, in run
salt-minion-3005-1  |     value = future.result()
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
salt-minion-3005-1  |     raise_exc_info(self._exc_info)
salt-minion-3005-1  |   File "<string>", line 4, in raise_exc_info
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1064, in run
salt-minion-3005-1  |     yielded = self.gen.throw(*exc_info)
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/transport/zeromq.py", line 914, in send
salt-minion-3005-1  |     ret = yield self.message_client.send(load, timeout=timeout)
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1056, in run
salt-minion-3005-1  |     value = future.result()
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
salt-minion-3005-1  |     raise_exc_info(self._exc_info)
salt-minion-3005-1  |   File "<string>", line 4, in raise_exc_info
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1064, in run
salt-minion-3005-1  |     yielded = self.gen.throw(*exc_info)
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/transport/zeromq.py", line 624, in send
salt-minion-3005-1  |     recv = yield future
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 1056, in run
salt-minion-3005-1  |     value = future.result()
salt-minion-3005-1  |   File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
salt-minion-3005-1  |     raise_exc_info(self._exc_info)
salt-minion-3005-1  |   File "<string>", line 4, in raise_exc_info
salt-minion-3005-1  | salt.exceptions.SaltReqTimeoutError: Message timed out
salt-minion-3005-1  | 14:59:37,432 [salt.minion                              ][ERROR   ] Error while bringing up minion for multi-master. Is master at salt-master responding?

I tried setting the master and proxy minion timeout parameter to a large value, as recommended here, but that does not seem to have any effect.

Looking at the traceback, I found this call: the timeout value seems to be hardcoded to 60 seconds here, and I could not figure out how to influence it through proxy minion or master settings or command-line parameters.

If the pillar fetch timeout is not configurable at the moment, a setting to control it would be a very useful feature to implement as a workaround for this behaviour.

@dmulyalin dmulyalin added Bug broken, incorrect, or confusing behavior needs-triage labels Mar 7, 2023
@OrangeDog
Contributor

OrangeDog commented Mar 7, 2023

Please update the issue title to something descriptive, e.g. "[BUG] Cannot increase timeout for custom pillar".

Though note that a pillar that takes that long is going to cripple your entire system, as state.apply is going to call it and wait for it every time for every minion.

@dmulyalin dmulyalin changed the title from "[BUG]" to "[BUG] Pillar fetch timeouts on proxy minion after ~60 seconds" Mar 7, 2023
@dmulyalin
Author

In my case it would take about 2-3 minutes to fetch data from an external database. My plan was to also use the pillar cache to speed things up after the first fetch, but for the pillar cache to kick in, the pillar needs to be fetched at least once.

@OrangeDog
Contributor

That is incredibly slow for a database query.

I'm going to guess this is also more than just secret information? If it is that slow, and you really can't fix it, at least fetch it as part of the state template, not in pillar.

@dmulyalin
Author

It is what it is; I can only speed up the DB query to a certain level. But in the general case, forcing a 60-second limit on getting pillar, which includes going through all external pillars, rendering, and sending the result back to the minion, seems a bit concerning.

What I am trying to say is that there might be legitimate cases where a pillar fetch takes longer than 60 seconds, and allowing users to adjust the system's behaviour to accommodate those cases is desirable.

@OrangeDog
Contributor

OrangeDog commented Mar 7, 2023

Again, if pillar does take that long, the whole Salt installation will be almost useless. Pillar needs to be fast. Do not use it to store arbitrary data, especially data that takes minutes(!) to build.

And a database query that also takes minutes to run is a big indicator of a massive design failure.

I agree that the timeout should be configurable, but with a view to making it shorter, not longer.

@dmulyalin
Author

OK, what is a reasonable amount of time for pillar to finish its work in that case? Also, is there anything else we can use instead of pillar or sourcing data on the fly during state execution? The idea was to fetch data into pillar once and use the pillar cache, so that all the necessary data is available to the proxy minion for fast state and template rendering, while also lowering the load on the DB and leaving the Salt master as the only entity to whitelist on the DB side.
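
The fetch-once idea can also be sketched inside the data source itself, independently of Salt's own pillar cache. The helper below is a hypothetical illustration, not a Salt API: `cached_fetch`, its arguments, and the JSON file cache are all assumptions. It would not remove the first slow fetch, so it does not avoid the timeout on a cold cache.

```python
import json
import os
import tempfile
import time


def cached_fetch(path, ttl, fetch):
    """Return JSON cached at `path` if it is younger than `ttl` seconds,
    otherwise call `fetch()` and store its result (hypothetical helper)."""
    try:
        if time.time() - os.path.getmtime(path) < ttl:
            with open(path) as handle:
                return json.load(handle)
    except (OSError, ValueError):
        pass  # missing or unreadable cache file: fall through and refetch
    data = fetch()
    # Write to a temporary file and rename it, so readers never see a partial file.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle)
    os.replace(tmp_path, path)
    return data
```

An external pillar could wrap its slow database query in a call like this, so that only the first request within each `ttl` window pays the full cost.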

@davidrjonas

I'm running into this same timeout. Modifying the call to crypted_transfer_decode_dictentry() in compile_pillar() to include a timeout a bit larger than 60s fixes the issue we see.
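
The shape of that change can be illustrated with a self-contained simulation. This is a sketch under stated assumptions, not Salt's code: `FakeChannel` is a stand-in for the request channel, the simulated master reply is invented, and `pillar_timeout` is a hypothetical option name rather than an existing Salt setting. Only the method and function names are borrowed from the traceback.

```python
import asyncio

DEFAULT_TIMEOUT = 60  # seconds, the default reported in this issue


class FakeChannel:
    """Stand-in for the minion's request channel; not Salt's real class."""

    def __init__(self, master_delay):
        self.master_delay = master_delay  # seconds the simulated master needs

    async def crypted_transfer_decode_dictentry(
        self, load, dictkey=None, timeout=DEFAULT_TIMEOUT
    ):
        async def master_reply():
            await asyncio.sleep(self.master_delay)
            return {dictkey: {"compiled_for": load["id"]}}

        # Give up if the simulated master does not answer within `timeout`.
        reply = await asyncio.wait_for(master_reply(), timeout=timeout)
        return reply[dictkey]


async def compile_pillar(channel, opts):
    # "pillar_timeout" is a hypothetical option name, not a Salt setting.
    timeout = opts.get("pillar_timeout", DEFAULT_TIMEOUT)
    load = {"id": opts["id"], "cmd": "_pillar"}
    # Pass the timeout explicitly instead of relying on the default.
    return await channel.crypted_transfer_decode_dictentry(
        load, dictkey="pillar", timeout=timeout
    )
```

With a simulated master that needs 80 seconds, `asyncio.run(compile_pillar(FakeChannel(80), {"id": "proxy1"}))` raises `asyncio.TimeoutError` once the 60-second default expires, while adding `"pillar_timeout": 120` to the options lets the same call complete.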

We use GPG-encrypted pillar data and have about 140 entries to decode. Gpg-agent is single-threaded, and on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz that seems to take about 80s. There doesn't seem to be any way to speed this up; gpg-agent is the bottleneck.

We'd gladly convert to something faster. For now we use the pillar cache and run pillar.items first, which hits the same timeout, but then the state.apply is reasonably short and successful. I plan on trying out the python bindings for sequoia-pgp at some point.
