Copying files from gitfs in file.recurse state fails #23110

Closed
martinhoefling opened this issue Apr 27, 2015 · 20 comments
Labels
Bug, File Servers, P3 (Priority 3), Platform, severity-medium
Milestone
Approved

Comments

@martinhoefling
Contributor

What's the problem?

When highstating, copying a 'large' file tree (~250 files) from a git repository sometimes fails. The traceback looks as if a lock file is cleaned up too early in some cases. The fix in #18839 does not resolve the issue.

How are we using gitfs

We have several minions (which are masters themselves), and each one obtains the correct version of the (state-)tree by matching the branch in git. We also use an 'intermediate' top file to work around issue #12483, something like:

salt-beta:
  'beta-instancemaster':
    - instance-beta

The matching branch here is salt-beta, the environment salt-beta exclusively targets beta-instancemaster and the top.sls replacement is instance-beta.sls.

Traceback on Server

[DEBUG   ] Updating fileserver for git_pillar module
[ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/salt/master.py", line 1348, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/__init__.py", line 512, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/__init__.py", line 466, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/fileserver/gitfs.py", line 1509, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.7/dist-packages/salt/utils/__init__.py", line 1086, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/salt-staging/top.sls.lk'

Minion Output

----------
          ID: /srv/salt-common
    Function: file.recurse
      Result: False
     Comment: #### /srv/salt-common/redis-server/init.sls ####
              Source file salt://|common/states/redis-server/init.sls not found
     Started: 01:53:45.613956
    Duration: 52353.076 ms
     Changes:  

Salt Version

root@77ff44f4d585:/# salt --versions-report
                  Salt: 2014.7.4
                Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
                Jinja2: 2.7.2
              M2Crypto: 0.21.1
        msgpack-python: 0.3.0
          msgpack-pure: Not Installed
              pycrypto: 2.6.1
               libnacl: Not Installed
                PyYAML: 3.10
                 ioflo: Not Installed
                 PyZMQ: 14.0.1
                  RAET: Not Installed
                   ZMQ: 4.0.4
                  Mako: 0.9.1
 Debian source package: 2014.7.4+ds-1trusty1

Master Config

fileserver_backend:
  - git
  - roots

gitfs_provider: gitpython

gitfs_env_whitelist:
  - salt-*

gitfs_remotes:

# State tree used by the overlord itself:
  - file:///somerepo.git:
    - root: deployment/salt/common/states
  - file:///somerepo.git:
    - root: deployment/salt/overlord/states

# State trees and Pillar trees for export to minions:
  - file:///somerepo.git:
    - root: deployment/salt/common
    - mountpoint: salt://common
  - file:///somerepo.git:
    - root: deployment/salt/instance-masters
    - mountpoint: salt://instance-masters

file_roots:
  base:
    - /packages

file_recv: True

ext_pillar:
  - git: salt-daily file:///somerepo.git root=deployment/salt/overlord/pillar/daily
  - git: salt-beta file:///somerepo.git root=deployment/salt/overlord/pillar/beta
  - git: salt-staging file:///somerepo.git root=deployment/salt/overlord/pillar/staging
  - git: salt-www file:///somerepo.git root=deployment/salt/overlord/pillar/www
@rallytime rallytime added the Bug, severity-medium, P3, and File Servers labels Apr 28, 2015
@rallytime rallytime added this to the Approved milestone Apr 28, 2015
@rallytime
Contributor

Thanks for this very thorough bug report; it's too bad that the previous fix didn't work! I thought we had all of the file-locking problems cleaned up. Maybe something else is happening here.

ping @terminalmage

@terminalmage
Contributor

We didn't have any file locking issues; you may be remembering the repo locking issues from 2014.7.2, which have since been fixed.

This is not a problem of a lock file being cleaned up. Note that the line in the traceback occurs when we're trying to open the file for writing, not reading.

This is a problem of the destination directory for the lock file not existing. The weird thing is that we create this directory in this block of code (note that lk_fn and blobshadest are in the same directory), well before the point where we try to lay down the lock file here. So this traceback should not happen.

The only thing I can think of is that some code is triggering a clearing of the gitfs cache during execution. Either that or os.makedirs() doesn't block until the directory is actually created, and we end up with a race condition.
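
To make the race concrete, here is a minimal sketch of a defensive write (not Salt's actual code; write_lock_file and its argument are hypothetical) that re-creates the parent directory and retries once if it has vanished between the earlier os.makedirs() and the open():

import errno
import os

def write_lock_file(lk_fn):
    # Hypothetical helper: lay down a .lk marker next to the cached hash file.
    # If a concurrent cache clear removes the parent directory between the
    # earlier os.makedirs() and this open(), re-create the directory once and
    # retry instead of letting the ENOENT bubble up as an IOError.
    for attempt in range(2):
        try:
            with open(lk_fn, 'w+'):
                return
        except (IOError, OSError) as exc:
            if exc.errno != errno.ENOENT or attempt:
                raise
            # Parent directory disappeared (e.g. removed by a cache cleanup);
            # re-create it and retry the open once.
            try:
                os.makedirs(os.path.dirname(lk_fn))
            except OSError as mkerr:
                if mkerr.errno != errno.EEXIST:
                    raise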

@martinhoefling
Contributor Author

Interesting. One missing piece of information in my report is that our salt-master runs as a Docker container and therefore writes to an aufs filesystem. We could mount a volume at /var/cache/salt to check whether this is an aufs / Docker bug.

@terminalmage
Contributor

@martinhoefling Sure, by all means test that.

@jhenry82

I'm (still) trying to upgrade our 2014.1.7 install to the 2014.7 series, and am getting consistent highstate failures that look like this. I manually patched 2014.7.5 with the fix for issue #22987, which had been preventing GitFS from working at all. Now our custom grains and modules are syncing properly, but I still get random tracebacks that look very much like what @martinhoefling posted. The file in question does exist on the master, and copies just fine 100% of the time using version 2014.1.7.

We are NOT using Docker in any way.

Let me know if there's any more testing or info I can provide. Thanks.

2015-04-28 15:45:40,078 [salt.master                                ][ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1348, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 512, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 466, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1539, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.6/site-packages/salt/utils/__init__.py", line 1098, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/production/geoip/GeoIPCity.dat.lk'
           Salt: 2014.7.5
         Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.1.9.final
   msgpack-pure: Not Installed
       pycrypto: 2.0.1
        libnacl: Not Installed
         PyYAML: 3.10
          ioflo: Not Installed
          PyZMQ: 13.0.2
           RAET: Not Installed
            ZMQ: 3.2.3
           Mako: Not Installed

CentOS 6.6, 64-bit

@martinhoefling
Contributor Author

As @jhenry82 also indicated, this seems to have nothing to do with aufs. Replacing the caching volume in the docker container had no effect.

@martinhoefling
Contributor Author

@terminalmage I was able to reproduce the trace, and it seems to correlate with fileserver.update.

Setup: Two docker containers, master / minion, master exports gitfs as described above.

On the minion in a loop:

salt-call cp.get_dir salt://common/states /tmp/cstates saltenv=salt-daily

This works fine, and copies ~200 files. When I start (also in a loop) on the master...

salt-run fileserver.update

...I sometimes get the identical stacktrace as described in my report.
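
For convenience, the same reproduction can be driven from a small script; this is only a sketch wrapping the exact commands above (the function names are made up), not anything shipped with Salt:

import subprocess
import time

def minion_copy_loop(iterations=100):
    # Same as the salt-call command above: copy the ~200-file tree repeatedly.
    for _ in range(iterations):
        subprocess.call([
            'salt-call', 'cp.get_dir',
            'salt://common/states', '/tmp/cstates', 'saltenv=salt-daily',
        ])

def master_update_loop(iterations=100, pause=1.0):
    # Same as the salt-run command above: keep triggering fileserver.update.
    for _ in range(iterations):
        subprocess.call(['salt-run', 'fileserver.update'])
        time.sleep(pause)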

@martinhoefling
Contributor Author

@terminalmage: I have prepared a fix/workaround in pull request #23496. If this is an acceptable solution, it would be nice to see it in 2014.7 as well.

jfindlay pushed a commit that referenced this issue May 8, 2015
This resolves issues when the freshly created directory is removed
by fileserver.update.
jfindlay added a commit that referenced this issue May 8, 2015
@jfindlay jfindlay added the fixed-pls-verify label May 8, 2015
@jfindlay
Contributor

jfindlay commented May 8, 2015

@martinhoefling, @jhenry82, report here if the fix doesn't resolve the issues you were having. If it does then we can close this issue.

@jhenry82

jhenry82 commented May 8, 2015

Can that commit be backported to 2014.7 (and/or 2015.5)? Then I can give it a whirl. I'm not really in a position to test random builds straight from develop right now.

@jfindlay
Contributor

jfindlay commented May 8, 2015

Yes, it will make it back to 2014.7.

@martinhoefling
Contributor Author

So far, I have verified the fix only in my test setup with two Docker containers, where I was no longer able to reproduce the above stacktrace. I have also patched our salt master, and there have been no problems so far. I'll report back after a couple of days / deployments if another trace like the one reported shows up in production.

rallytime pushed a commit to rallytime/salt that referenced this issue May 12, 2015
This resolves issues when the freshly created directory is removed
by fileserver.update.
@jhenry82

I've manually patched a copy of the 2015.5.0 release with @martinhoefling's fix, and restarted the Salt master. I am still seeing the same basic stack trace (with different line numbers, obviously, since the original report was against the 2014.7 series).

2015-05-20 11:08:09,442 [salt.master                              ][ERROR   ][14576] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1456, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 523, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 477, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1540, in find_file
    with salt.utils.fopen(lk_fn, 'w+') as fp_:
  File "/usr/lib/python2.6/site-packages/salt/utils/__init__.py", line 1031, in fopen
    fhandle = open(*args, **kwargs)
IOError: [Errno 2] No such file or directory: '/var/cache/salt/master/gitfs/hash/production/zabbix/scripts/images/k20d-image.jpg.lk'

The state that is looking for the above file:

zabbix-scripts:
  file.recurse:
    - name: /usr/local/bin/zabbix_scripts/
    - source: salt://zabbix/scripts
    - user: zabbix
    - group: zabbix
    - dir_mode: 755
    - file_mode: 755
    - require:
{% if grains['os_family'] == 'RedHat' %}
      - pkg: zabbix-pkgs
{% endif %}
      - user: zabbix

Versions:

Salt: 2015.5.0
Python: 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
Jinja2: 2.2.1
M2Crypto: 0.20.2
msgpack-python: 0.1.9.final
msgpack-pure: Not Installed
pycrypto: 2.0.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 13.0.2
RAET: Not Installed
ZMQ: 3.2.3
Mako: Not Installed

cat /etc/centos-release
CentOS release 6.6 (Final)

yum list installed python-pygit2
python-pygit2.x86_64                                             0.20.3-4.el6

Non-default master settings:

worker_threads: 24
timeout: 30
auto_accept: True
external_auth:
  pam:
    salt-api:
      - service.*
      - test.ping
      - 'saltutil.*'
      - '@runner'
      - cmd.run_all
      - cmd.run
      - grains.*
      - state.highstate
      - pkg.*
state_verbose: False
fileserver_backend:
  - git
gitfs_provider: pygit2
gitfs_remotes:
  - file:///srv/sls.git
ext_pillar:
  - git: master file:///srv/repos/sysops/conf/pillar.git
  - git: production file:///srv/repos/sysops/conf/pillar.git
  - git: stage file:///srv/repos/sysops/conf/pillar.git
log_level: warning
log_level_logfile: warning
rest_cherrypy:
  port: 8000
  ssl_crt: /etc/pki/tls/certs/salt-api.crt
  ssl_key: /etc/pki/tls/certs/salt-api.key

@jfindlay jfindlay removed the fixed-pls-verify label May 20, 2015
@martinhoefling
Contributor Author

So first of all: I can confirm that the fix solves the problem in production for our case. Version is still 2014.7.4 with patch applied.

@jhenry82 can you add debugging output to the directory cleanup in fileserver.update? Maybe also at the point where the parent directory is created. If for some reason the time between creating and using the directory is > 60s, the patch does not work; that would then point to a bug / problem elsewhere. @terminalmage @jfindlay any idea whether there are other cleanup jobs that could cause @jhenry82's problem?

@jhenry82

I'll try adding debug code today.

One pattern I've noticed is that it's almost always large (>10MB) binary files that trigger the stack trace on my 2015.5.0 master, though I have seen it once on a random tiny .sls file. I have also seen it hit entire branches in gitfs:

2015-05-22 01:07:18,174 [salt.master                              ][ERROR   ][14588] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1456, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 523, in file_hash
    fnd = self.find_file(load['path'], load['saltenv'])
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 477, in find_file
    fnd = self.servers[fstr](path, saltenv, **kwargs)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/gitfs.py", line 1426, in find_file
    os.remove(hashdir)
OSError: [Errno 21] Is a directory: '/var/cache/salt/master/gitfs/hash/apachestats_refactor'

We do have a number of large binary files in the repo. Perhaps that is making processing of the repo slow enough that the 60 second delay is not sufficient?
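
As a side note on the "Is a directory" error above, a defensive removal along these lines would tolerate the cached hash path being either a file or a directory (just a sketch with hypothetical names, not Salt's implementation):

import errno
import os
import shutil

def remove_cached_hash(path):
    # Hypothetical cleanup helper: remove a stale gitfs hash-cache entry,
    # whether it is a regular file or (unexpectedly) a directory, and ignore
    # the case where it is already gone.
    try:
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise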

@martinhoefling
Contributor Author

@jhenry82, that sounds like a possible explanation. Let's see whether your debug log of the directory deletes correlates with the missing dir. @terminalmage: How about entirely disabling / disallowing fileserver.update cleanup during find_file? I'm not sure what consequences this might have, since one could end up in a situation where the cache is never cleaned up. What would be the best way to coordinate such mutual exclusion? A file lock?
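
To illustrate the file-lock idea (purely a sketch with hypothetical names and paths, not how Salt implements it), the cleanup and find_file paths could serialize on an advisory inter-process lock:

import contextlib
import fcntl
import os

CACHE_LOCK = '/var/cache/salt/master/gitfs/.cache.lock'  # hypothetical path

@contextlib.contextmanager
def cache_lock(path=CACHE_LOCK):
    # Advisory inter-process lock: find_file would hold it while laying down
    # .lk files, and the fileserver.update cleanup would hold it while
    # removing hash directories, so the two can never interleave.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the other holder is done
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage sketch (both called functions are hypothetical placeholders):
#     with cache_lock():
#         clean_stale_hash_dirs()
#     with cache_lock():
#         write_lock_file(lk_fn)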

@jfindlay jfindlay added the Platform label May 26, 2015
@jhenry82

For what it's worth, I updated the check in salt/fileserver/__init__.py from 60 seconds to 120 seconds. I have deployed the patched version to 1000 minions and haven't seen the stack trace, or a failed highstate resulting from it, in 2 days. I was seeing it dozens of times per day beforehand.

So that "works", but it suggests to me that we are working around some deeper problem. For someone with an even bigger git repo than ours, 120 seconds might still not be long enough. I don't know the code well enough right now to comment further, though.

@martinhoefling
Contributor Author

The default loop interval for cleanup jobs is 60s, so increasing the check to 2 * loop_interval should be no problem. In fact, making it loop_interval dependent at all is probably a good idea. However, this still feels more like a workaround than a fix for the actual problem.
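
A minimal sketch of what a loop_interval-dependent staleness check could look like (hypothetical function; the real check lives in salt/fileserver/__init__.py):

import os
import time

def is_cache_entry_stale(path, loop_interval=60):
    # Treat a cached hash entry as stale only once it is older than two
    # cleanup cycles, rather than a hard-coded 60 seconds, so a slow run over
    # a large repo does not get its freshly created directories removed
    # mid-highstate.
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return False  # already gone; nothing to clean
    return age > 2 * loop_interval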

@martinhoefling
Contributor Author

@jfindlay We have not observed this problem anymore; shall we close the issue?

@jfindlay
Contributor

jfindlay commented Sep 9, 2015

@martinhoefling, sure, thanks.
