Copying files from gitfs in file.recurse state fails #23110

Closed
martinhoefling opened this issue Apr 27, 2015 · 20 comments
Labels
Bug, File Servers, P3 (Priority 3), Platform, severity-medium
Milestone
Approved

Comments

@martinhoefling
Contributor

What's the problem?

When highstating, copying a 'large' file tree (~250 files) from a git repository sometimes fails. The traceback looks as if a lock file is cleaned up too early in some cases. The fix in #18839 does not resolve the issue.

How are we using gitfs

We have several minions (which are masters themselves), and each one obtains the correct version of the (state-)tree by matching the branch in git. We also use an 'intermediate' top file to work around issue #12483, something like:

salt-beta:
  'beta-instancemaster':
    - instance-beta

The matching branch here is salt-beta, the environment salt-beta exclusively targets beta-instancemaster and the top.sls replacement is instance-beta.sls.

Traceback on Server

[DEBUG   ] Updating fileserver for git_pillar module
[ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 1348, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/__init__.py", line 512, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/__init__.py", line 466, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/gitfs.py", line 1509, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.7/dist-packages/salt/utils/__init__.py", line 1086, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/salt-staging/top.sls.lk'

Minion Output

----------
          ID: /srv/salt-common
    Function: file.recurse
      Result: False
     Comment: #### /srv/salt-common/redis-server/init.sls ####
              Source file salt://|common/states/redis-server/init.sls not found
     Started: 01:53:45.613956
    Duration: 52353.076 ms
     Changes:  

Salt Version

root@77ff44f4d585:/# salt --versions-report
                  Salt: 2014.7.4
                Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
                Jinja2: 2.7.2
              M2Crypto: 0.21.1
        msgpack-python: 0.3.0
          msgpack-pure: Not Installed
              pycrypto: 2.6.1
               libnacl: Not Installed
                PyYAML: 3.10
                 ioflo: Not Installed
                 PyZMQ: 14.0.1
                  RAET: Not Installed
                   ZMQ: 4.0.4
                  Mako: 0.9.1
 Debian source package: 2014.7.4+ds-1trusty1

Master Config

fileserver_backend:
  - git
  - roots

gitfs_provider: gitpython

gitfs_env_whitelist:
  - salt-*

gitfs_remotes:

# State tree used by the overlord itself:
  - file:///somerepo.git:
    - root: deployment/salt/common/states
  - file:///somerepo.git:
    - root: deployment/salt/overlord/states

# State trees and Pillar trees for export to minions:
  - file:///somerepo.git:
    - root: deployment/salt/common
    - mountpoint: salt://common
  - file:///somerepo.git:
    - root: deployment/salt/instance-masters
    - mountpoint: salt://instance-masters

file_roots:
  base:
    - /packages

file_recv: True

ext_pillar:
  - git: salt-daily file:///somerepo.git root=deployment/salt/overlord/pillar/daily
  - git: salt-beta file:///somerepo.git root=deployment/salt/overlord/pillar/beta
  - git: salt-staging file:///somerepo.git root=deployment/salt/overlord/pillar/staging
  - git: salt-www file:///somerepo.git root=deployment/salt/overlord/pillar/www
@rallytime rallytime added the Bug, severity-medium, P3, and File Servers labels Apr 28, 2015
@rallytime rallytime added this to the Approved milestone Apr 28, 2015
@rallytime
Contributor

Thanks for this very thorough bug report; it's too bad that the previous fix didn't work! I thought we had all of the file-locking problems cleaned up. Maybe something else is happening here.

ping @terminalmage

@terminalmage
Contributor

We didn't have any file locking issues; you may be remembering the repo locking issues from 2014.7.2, which have since been fixed.

This is not a problem of a lock file being cleaned up. Note that the line in the traceback occurs when we're trying to open the file for writing, not reading.

This is a problem of the destination directory for the lock file not existing. The weird thing is that we create this directory in this block of code (note that lk_fn and blobshadest are in the same directory), well before the point where we try to lay down the lock file here. So this traceback should not happen.

The only thing I can think of is that some code is triggering a clearing of the gitfs cache during execution. Either that or os.makedirs() doesn't block until the directory is actually created, and we end up with a race condition.
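
To make the race concrete, here is a minimal sketch of a defensive write (not Salt's actual code; write_lock_file and its argument are hypothetical) that re-creates the parent directory and retries once if it has vanished between the earlier os.makedirs() and the open():

import errno
import os

def write_lock_file(lk_fn):
    # Hypothetical helper: lay down a .lk marker next to the cached hash file.
    # If a concurrent cache clear removes the parent directory between the
    # earlier os.makedirs() and this open(), re-create the directory once and
    # retry instead of letting the ENOENT bubble up as an IOError.
    for attempt in range(2):
        try:
            with open(lk_fn, 'w+'):
                return
        except (IOError, OSError) as exc:
            if exc.errno != errno.ENOENT or attempt:
                raise
            # Parent directory disappeared (e.g. removed by a cache cleanup);
            # re-create it and retry the open once.
            try:
                os.makedirs(os.path.dirname(lk_fn))
            except OSError as mkerr:
                if mkerr.errno != errno.EEXIST:
                    raise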

@martinhoefling
Contributor Author

Interesting. One missing piece of information in my report is that our salt-master runs as a Docker container and therefore writes to an aufs filesystem. We could mount a volume at /var/cache/salt to check whether this is an aufs / Docker bug.

@terminalmage
Contributor

@martinhoefling Sure, by all means test that.

@jhenry82

I'm (still) trying to upgrade our 2014.1.7 install to the 2014.7 series, and am getting consistent highstate failures that look like this. I manually patched 2014.7.5 with the fix for issue #22987, which had been preventing GitFS from working at all. Now our custom grains and modules are syncing properly, but I still get random tracebacks that look very much like what @martinhoefling posted. The file in question does exist on the master, and copies just fine 100% of the time using version 2014.1.7.

We are NOT using Docker in any way.

Let me know if there's any more testing or info I can provide. Thanks.

2015-04-28 15:45:40,078 [salt.master                                ][ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1348, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 512, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 466, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1539, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.6/site-packages/salt/utils/__init__.py", line 1098, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/production/geoip/GeoIPCity.dat.lk'
           Salt: 2014.7.5
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.1.9.final
   msgpack-pure: Not Installed
       pycrypto: 2.0.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.0.2
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: Not Installed

CentOS 6.6, 64-bit

@martinhoefling
Contributor Author

As @jhenry82 also indicated, this seems to have nothing to do with aufs. Replacing the caching volume in the docker container had no effect.

@martinhoefling
Contributor Author

@terminalmage I was able to reproduce the trace, and it seems to correlate with fileserver.update.

Setup: Two docker containers, master / minion, master exports gitfs as described above.

On the minion in a loop:

salt-call cp.get_dir salt://common/states /tmp/cstates saltenv=salt-daily

This works fine, and copies ~200 files. When I start (also in a loop) on the master...

salt-run fileserver.update

...I sometimes get the identical stacktrace as described in my report.
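
For convenience, the same reproduction can be driven from a small script; this is only a sketch wrapping the exact commands above (the function names are made up), not anything shipped with Salt:

import subprocess
import time

def minion_copy_loop(iterations=100):
    # Same as the salt-call command above: copy the ~200-file tree repeatedly.
    for _ in range(iterations):
        subprocess.call([
            'salt-call', 'cp.get_dir',
            'salt://common/states', '/tmp/cstates', 'saltenv=salt-daily',
        ])

def master_update_loop(iterations=100, pause=1.0):
    # Same as the salt-run command above: keep triggering fileserver.update.
    for _ in range(iterations):
        subprocess.call(['salt-run', 'fileserver.update'])
        time.sleep(pause)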

@martinhoefling
Contributor Author

@terminalmage: I have prepared a fix/workaround in pull request #23496. If this is an acceptable solution, it would be nice to see it in 2014.7 as well.

jfindlay pushed a commit that referenced this issue May 8, 2015
This resolves issues when the freshly created directory is removed
by fileserver.update.
jfindlay added a commit that referenced this issue May 8, 2015
@jfindlay jfindlay added the fixed-pls-verify label May 8, 2015
@jfindlay
Contributor

jfindlay commented May 8, 2015

@martinhoefling, @jhenry82, report here if the fix doesn't resolve the issues you were having. If it does then we can close this issue.

@jhenry82

jhenry82 commented May 8, 2015

Can that commit be backported to 2014.7 (and/or 2015.5)? Then I can give it a whirl. I'm not really in a position to test random builds straight from develop right now.

@jfindlay
Contributor

jfindlay commented May 8, 2015

Yes, it will make it back to 2014.7.

@martinhoefling
Contributor Author

So far, I have verified the fix only in my test setup with two Docker containers, where I was no longer able to reproduce the above stacktrace. I have also patched our salt master, and there have been no problems so far. I'll report back after a couple of days / deployments if another trace like the one reported shows up in production.

rallytime pushed a commit to rallytime/salt that referenced this issue May 12, 2015
This resolves issues when the freshly created directory is removed
by fileserver.update.
@jhenry82

I've manually patched a copy of the 2015.5.0 release with @martinhoefling's fix, and restarted the Salt master. I am still seeing the same basic stack trace (with different line numbers, obviously, since the original report was against the 2014.7 series).

2015-05-20 11:08:09,442 [salt.master                              ][ERROR   ][14576] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1456, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 523, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 477, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1540, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.6/site-packages/salt/utils/__init__.py", line 1031, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/production/zabbix/scripts/images/k20d-image.jpg.lk'

The state that is looking for the above file:

zabbix-scripts:
  file.recurse:
    - name: /usr/local/bin/zabbix_scripts/
    - source: salt://zabbix/scripts
    - user: zabbix
    - group: zabbix
    - dir_mode: 755
    - file_mode: 755
    - require:
{% if grains['os_family'] == 'RedHat' %}
      - pkg: zabbix-pkgs
{% endif %}
      - user: zabbix

Versions:

Salt: 2015.5.0
Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
Jinja2: 2.2.1
M2Crypto: 0.20.2
msgpack-python: 0.1.9.final
msgpack-pure: Not Installed
pycrypto: 2.0.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 13.0.2
RAET: Not Installed
ZMQ: 3.2.3
Mako: Not Installed

cat /etc/centos-release
CentOS release 6.6 (Final)

yum list installed python-pygit2
python-pygit2.x86_64                                             0.20.3-4.el6

Non-default master settings:

worker_threads: 24
timeout: 30
auto_accept: True
external_auth:
  pam:
    salt-api:
      - service.*
      - test.ping
      - 'saltutil.*'
      - '@runner'
      - cmd.run_all
      - cmd.run
      - grains.*
      - state.highstate
      - pkg.*
state_verbose: False
fileserver_backend:
  - git
gitfs_provider: pygit2
gitfs_remotes:
  - file:///srv/sls.git
ext_pillar:
  - git: master file:///srv/repos/sysops/conf/pillar.git
  - git: production file:///srv/repos/sysops/conf/pillar.git
  - git: stage file:///srv/repos/sysops/conf/pillar.git
log_level: warning
log_level_logfile: warning
rest_cherrypy:
  port: 8000
  ssl_crt: /etc/pki/tls/certs/salt-api.crt
  ssl_key: /etc/pki/tls/certs/salt-api.key

@jfindlay jfindlay removed the fixed-pls-verify label May 20, 2015
@martinhoefling
Contributor Author

So first of all: I can confirm that the fix solves the problem in production for our case. Version is still 2014.7.4 with patch applied.

@jhenry82 can you add debugging output to the directory cleanup in fileserver.update? Maybe also at the point where the parent directory is created. If for some reason the time between creating and using the directory is > 60s, the patch does not work; that would then point to a bug / problem elsewhere. @terminalmage @jfindlay any idea whether there are other cleanup jobs that could cause @jhenry82's problem?

@jhenry82

I'll try adding debug code today.

One pattern I've noticed is that it's almost always large (>10MB) binary files that trigger the stack trace on my 2015.5.0 master, though I have seen it once on a random tiny .sls file. I have also seen it hit entire branches in gitfs:

2015-05-22 01:07:18,174 [salt.master                              ][ERROR   ][14588] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1456, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 523, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 477, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1426, in find_file
    os.remove(hashdir)
OSError: [Errno 21] Is a directory: '/var/cache/salt/master/gitfs/hash/apachestats_refactor'

We do have a number of large binary files in the repo. Perhaps that is making processing of the repo slow enough that the 60 second delay is not sufficient?
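
As a side note on the "Is a directory" error above, a defensive removal along these lines would tolerate the cached hash path being either a file or a directory (just a sketch with hypothetical names, not Salt's implementation):

import errno
import os
import shutil

def remove_cached_hash(path):
    # Hypothetical cleanup helper: remove a stale gitfs hash-cache entry,
    # whether it is a regular file or (unexpectedly) a directory, and ignore
    # the case where it is already gone.
    try:
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise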

@martinhoefling
Contributor Author

@jhenry82, that sounds like a possible explanation. Let's see whether your debug log of the directory deletes correlates with the missing dir. @terminalmage: How about entirely disabling / disallowing fileserver.update cleanup during find_file? I'm not sure what consequences this might have, since one could end up in a situation where the cache is never cleaned up. What would be the best way to coordinate such mutual exclusion? A file lock?
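
To illustrate the file-lock idea (purely a sketch with hypothetical names and paths, not how Salt implements it), the cleanup and find_file paths could serialize on an advisory inter-process lock:

import contextlib
import fcntl
import os

CACHE_LOCK = '/var/cache/salt/master/gitfs/.cache.lock'  # hypothetical path

@contextlib.contextmanager
def cache_lock(path=CACHE_LOCK):
    # Advisory inter-process lock: find_file would hold it while laying down
    # .lk files, and the fileserver.update cleanup would hold it while
    # removing hash directories, so the two can never interleave.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the other holder is done
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage sketch (both called functions are hypothetical placeholders):
#     with cache_lock():
#         clean_stale_hash_dirs()
#     with cache_lock():
#         write_lock_file(lk_fn)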

@jfindlay jfindlay added the Platform label May 26, 2015
@jhenry82

For what it's worth, I updated the check in salt/fileserver/__init__.py from 60 seconds to 120 seconds. I have deployed the patched version to 1000 minions and haven't seen the stack trace, or a failed highstate resulting from it, in 2 days. I was seeing it dozens of times per day beforehand.

So that "works", but it suggests to me that we are working around some deeper problem. For someone with an even bigger git repo than ours, 120 seconds might still not be long enough. I don't know the code well enough right now to comment further, though.

@martinhoefling
Contributor Author

The default loop interval for cleanup jobs is 60s, so increasing the check to 2 * loop_interval should be no problem. In fact, making it loop_interval dependent at all is probably a good idea. However, this still feels more like a workaround than a fix for the actual problem.
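
A minimal sketch of what a loop_interval-dependent staleness check could look like (hypothetical function; the real check lives in salt/fileserver/__init__.py):

import os
import time

def is_cache_entry_stale(path, loop_interval=60):
    # Treat a cached hash entry as stale only once it is older than two
    # cleanup cycles, rather than a hard-coded 60 seconds, so a slow run over
    # a large repo does not get its freshly created directories removed
    # mid-highstate.
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return False  # already gone; nothing to clean
    return age > 2 * loop_interval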

@martinhoefling
Contributor Author

@jfindlay We have not observed this problem anymore; shall we close the issue?

@jfindlay
Contributor

jfindlay commented Sep 9, 2015

@martinhoefling, sure, thanks.
