
Reactor periodically can't find files in gitfs #47206

Closed
clallen opened this issue Apr 20, 2018 · 14 comments
Labels
Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged
Milestone

Comments

@clallen
Contributor

clallen commented Apr 20, 2018

Description of Issue/Question

Every so often (sometimes days, sometimes minutes) the reactor will throw these errors:

2018-04-20 03:45:09,750 [salt.utils.reactor][ERROR   ][5409] Can not render SLS  for tag minion_start. File missing or not found.
2018-04-20 06:32:41,921 [salt.loaded.int.module.cp][ERROR   ][5409] Unable to cache file 'salt://_reactors/minion_start.sls' from saltenv 'base'.

However, files are always accessible via states, and fileserver.file_list shows all files available.
Restarting the master fixes it for a while.
I tried turning on granular debugging for salt.utils.reactor and salt.loaded.int.module.cp, but didn't see anything that looked useful.
I have also tried shutting down the master, deleting all the gitfs cache files, and restarting it to rebuild them. The issue still comes back a while later.
I realize we're running a pretty old version and will be upgrading this year; I'm mainly looking for clues as to where to look for the cause. If I can patch it temporarily, that's fine with me.

Setup

Master config settings:

fileserver_backend:
  - git
  - roots

gitfs_provider: gitpython

gitfs_remotes:
  - file:///srv/salt.git:
    - root: files

Steps to Reproduce Issue

None that I know of; it works for a while and then starts failing.

Versions Report

Salt Version:
Salt: 2016.11.6

Dependency Versions:
cffi: 1.10.0
cherrypy: unknown
dateutil: 2.6.1
docker-py: Not Installed
gitdb: 2.0.2
gitpython: 2.1.5
ioflo: Not Installed
Jinja2: 2.9.6
libgit2: Not Installed
libnacl: 1.5.1
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: Not Installed
pycparser: 2.18
pycrypto: 2.7a1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.11 (default, Jul 18 2017, 12:45:26)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: 2.0.3
timelib: 0.2.4
Tornado: 4.5.1
ZMQ: 4.1.6

System Versions:
dist: redhat 7.4 Maipo
machine: x86_64
release: 4.1.12-112.14.10.el7uek.x86_64
system: Linux
version: Red Hat Enterprise Linux Server 7.4 Maipo

@Ch3LL
Contributor

Ch3LL commented Apr 20, 2018

Could you add this logging so we can get some more insight:

diff --git a/salt/utils/reactor.py b/salt/utils/reactor.py
index 4289f097b4..f7f72623b3 100644
--- a/salt/utils/reactor.py
+++ b/salt/utils/reactor.py
@@ -58,9 +58,12 @@ class Reactor(salt.utils.process.SignalHandlingMultiprocessingProcess, salt.stat
         react = {}
 
         if glob_ref.startswith('salt://'):
+            log.debug('glob_ref before: {0}'.format(glob_ref))
             glob_ref = self.minion.functions['cp.cache_file'](glob_ref) or ''
         globbed_ref = glob.glob(glob_ref)
         if not globbed_ref:
+            log.error('globbed_ref: {0}'.format(globbed_ref))
+            log.error('glob_ref after: {0}'.format(glob_ref))
             log.error('Can not render SLS {0} for tag {1}. File missing or not found.'.format(glob_ref, tag))
         for fn_ in globbed_ref:
             try:

@Ch3LL Ch3LL added the info-needed waiting for more info label Apr 20, 2018
@Ch3LL Ch3LL added this to the Blocked milestone Apr 20, 2018
@clallen
Contributor Author

clallen commented Apr 20, 2018

Sure, I have that in place now and restarted the master. Will update when I get results.

@clallen
Contributor Author

clallen commented Apr 20, 2018

Got a couple of hits:

2018-04-20 12:48:34,773 [salt.loaded.int.module.cp][ERROR   ][10734] Unable to cache file 'salt://_reactors/minion_start.sls' from saltenv 'base'.
2018-04-20 12:48:34,773 [salt.utils.reactor][ERROR   ][10734] globbed_ref: []
2018-04-20 12:48:34,774 [salt.utils.reactor][ERROR   ][10734] glob_ref after:
2018-04-20 12:48:34,774 [salt.utils.reactor][ERROR   ][10734] Can not render SLS  for tag minion_start. File missing or not found.
2018-04-20 12:55:16,789 [salt.loaded.int.module.cp][ERROR   ][10734] Unable to cache file 'salt://_reactors/nagios/enable.sls' from saltenv 'base'.
2018-04-20 12:55:16,789 [salt.utils.reactor][ERROR   ][10734] globbed_ref: []
2018-04-20 12:55:16,789 [salt.utils.reactor][ERROR   ][10734] glob_ref after:
2018-04-20 12:55:16,790 [salt.utils.reactor][ERROR   ][10734] Can not render SLS  for tag nagios/enable. File missing or not found.

Files are there, according to fileserver.file_list:

# salt-run fileserver.file_list|grep enable.sls
- _reactors/nagios/enable.sls
# salt-run fileserver.file_list|grep minion_start.sls
- _reactors/minion_start.sls

@Ch3LL
Contributor

Ch3LL commented Apr 23, 2018

Thanks for adding that information.

Can you share your sanitized reactor master config?

And also when this occurs can you include more sanitized debug output before and after to show more context?

@clallen
Contributor Author

clallen commented Apr 23, 2018

Reactor config:

reactor:
  - 'minion_start':
    - salt://_reactors/minion_start.sls

  - 'autobld/postinst':
    - salt://_reactors/autobld/postinst/base.sls

  - 'autobld/postinst/rac':
    - salt://_reactors/autobld/postinst/rac.sls

  - 'autobld/complete/rac':
    - salt://_reactors/autobld/complete/rac.sls

  - 'nagios/failover':
    - salt://_reactors/nagios/failover.sls

  - 'nagios/disable':
    - salt://_reactors/nagios/disable.sls

  - 'nagios/enable':
    - salt://_reactors/nagios/enable.sls

  - 'salt/beacon/*/netapp_wfa':
    - salt://_reactors/gsi_refresh.sls

  - 'salt/netapi/hook/gsi_refresh':
    - salt://_reactors/gsi_refresh_webhook.sls

  - 'salt/fileserver/gitfs/update':
    - salt://_reactors/update_fileserver.sls

I've turned up logging to debug level, will post that when I see the issue again.
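For reference, here's a rough stdlib-only sketch of how I understand a reactor map like the one above gets matched against incoming event tags (Salt uses fnmatch-style globbing for reactor tags, as far as I know; `reactor_map` is a trimmed-down version of my config):

```python
# Hedged sketch: match incoming event tags against reactor tag patterns.
# Assumption: Salt's reactor uses fnmatch-style globbing, so '*' in a tag
# pattern like 'salt/beacon/*/netapp_wfa' matches any tag segment(s).
import fnmatch

reactor_map = {
    "minion_start": ["salt://_reactors/minion_start.sls"],
    "nagios/enable": ["salt://_reactors/nagios/enable.sls"],
    "salt/beacon/*/netapp_wfa": ["salt://_reactors/gsi_refresh.sls"],
}

def reactors_for_tag(tag):
    """Collect every SLS ref whose tag pattern matches this event tag."""
    hits = []
    for glob_tag, sls_refs in reactor_map.items():
        if fnmatch.fnmatch(tag, glob_tag):
            hits.extend(sls_refs)
    return hits
```

The tag matching itself never fails for us; it's the subsequent fetch of the matched `salt://` ref that intermittently comes back empty.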

@clallen
Contributor Author

clallen commented Apr 27, 2018

Got some more hits; unfortunately, it doesn't look like debug level adds much more info. I've included all the lines that I think might be relevant.
It's a fair amount of text, so I attached a file.
reactor_debug.txt

@Ch3LL
Contributor

Ch3LL commented May 1, 2018

Yeah, doesn't look like much more information. Thanks for doing that.

ping @terminalmage any other ideas here?

@Ch3LL Ch3LL added Pending-Discussion The issue or pull request needs more discussion before it can be closed or merged and removed info-needed waiting for more info labels May 1, 2018
@johje349

johje349 commented May 8, 2018

I have a similar problem on version 2017.7.5, though with local files.

Reactor:

startup-orchestration:
  runner.state.orchestrate:
    - args:
        - mods: orchestration.minion-created
        - pillar:
            id: {{ data['name'] }}

Master config:

reactor:
  - 'salt/cloud/*/created':
    - '/srv/reactor/minion-created.sls'

Master logfile:
2018-05-04 13:44:41,833 [salt.fileclient :1070][DEBUG ][8811] Could not find file 'salt://orchestration/minion-created.sls' in saltenv 'base'
2018-05-04 13:44:41,836 [salt.fileclient :1070][DEBUG ][8811] Could not find file 'salt://orchestration/minion-created/init.sls' in saltenv 'base'
2018-05-04 13:44:41,837 [salt.template :48 ][DEBUG ][8811] compile template: False
2018-05-04 13:44:41,837 [salt.utils.event :35 ][TRACE ][11825] _get_event() waited 0 seconds and received nothing
2018-05-04 13:44:41,837 [salt.template :62 ][ERROR ][8811] Template was specified incorrectly: False

The reactor works like 50% of the time. The minion-created.sls file does indeed exist.
Calling it manually works every time:
salt-run state.orchestrate orchestration.minion-created pillar='{"id":"xxx-xxx"}'

I also have several other reactors configured, including beacons with 120+ minions reporting in, and a reactor on each job return (to check for highstate results). I removed the reactor for job return earlier today and things have been working better since then, but I still get errors occasionally.

@anitakrueger
Contributor

I have the exact same issue on 2018.3.0, also with local files. The reactor basically stopped working, with the following errors in the master log file:

2018-05-29 14:17:12,112 [salt.utils.reactor                                    ][DEBUG   ] Gathering reactors for tag git/salt/master/repos/update
2018-05-29 14:17:12,112 [salt.utils.reactor                                    ][DEBUG   ] Compiling reactions for tag git/salt/master/repos/update
2018-05-29 14:17:12,113 [salt.fileclient                                       ][DEBUG   ] Could not find file 'salt://reactor/saltmaster_orch.sls' in saltenv 'base'
2018-05-29 14:17:12,113 [salt.loaded.int.module.cp                             ][ERROR   ] Unable to cache file 'salt://reactor/saltmaster_orch.sls' from saltenv 'base'.
2018-05-29 14:17:12,113 [salt.utils.reactor                                    ][ERROR   ] Can not render SLS  for tag git/salt/master/repos/update. File missing or not found.
2018-05-29 14:17:12,117 [salt.loaded.int.returner.local_cache                  ][DEBUG   ] Adding minions for job 20180529141712115064: [u'minion']

versions report:

Salt Version:
           Salt: 2018.3.0

Dependency Versions:
           cffi: 1.11.2
       cherrypy: 11.0.0
       dateutil: 2.7.2
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.18
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Aug  4 2017, 00:39:18)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.5

System Versions:
           dist: centos 7.4.1708 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-693.11.6.el7.x86_64
         system: Linux
        version: CentOS Linux 7.4.1708 Core

I know it stopped working today, because I have successful reactor events from yesterday.

I tried clearing the minion cache (which also runs on the master), but that didn't help. The reactor files themselves are located in /var/cache/salt/master/files/reactor/*.

Restarting the salt-master process seems to fix it temporarily.

@clallen
Contributor Author

clallen commented May 29, 2018

@anitakrueger, @johje349 - are you running masters in a failover configuration by any chance?

That's what we're doing and it just occurred to me that it might have some impact on this.

@anitakrueger
Contributor

@clallen no, no multiple masters here. Just a single master with about 100 minions.

@Ch3LL
Contributor

Ch3LL commented May 30, 2018

this is possibly a duplicate of #47539 which is assigned to a core engineer.

@sathieu
Contributor

sathieu commented Nov 7, 2018

#47539 is marked fixed "and it will be in the 2018.3.4 release"

@clallen
Contributor Author

clallen commented Sep 15, 2019

I haven't seen the issue after upgrading masters to 2018.3.3 about 6 months ago, so I'll close this.

@clallen clallen closed this as completed Sep 15, 2019