Copying files from gitfs in file.recurse state fails #23110
Comments
Thanks for this very thorough bug report; it's too bad that the previous fix didn't work! I thought we had all of the file-locking problems cleaned up. Maybe something else is happening here. ping @terminalmage
We didn't have any file locking issues; you may be remembering the repo locking issues from 2014.7.2, which have since been fixed. This is not a problem of a lock file being cleaned up. Note that the line in the traceback occurs when we're trying to open the file for writing, not reading. This is a problem of the destination directory for the lock file not existing. The weird thing is that we create this directory in this block of code (note that lk_fn and blobshadest are in the same directory), well before where we try to lay down the lock file here. So this traceback should not happen. The only thing I can think of is that some code is triggering a clearing of the gitfs cache during execution. Either that or…
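To make the failure mode concrete, here is a minimal, self-contained sketch of the race being described: a directory is created, a concurrent cache clear removes it, and the subsequent open-for-writing of the lock file fails with ENOENT. The paths and layout below are made up for illustration and are not Salt's actual gitfs cache layout.

```python
import os
import shutil
import tempfile

# Stand-ins for the gitfs cache directory and the lock file that lives next to
# blobshadest. These names are hypothetical; they only mirror the structure of
# the problem, not the real code.
cache_root = tempfile.mkdtemp()
dest_dir = os.path.join(cache_root, 'refs', 'base')
lk_fn = os.path.join(dest_dir, 'some_blob.lk')

os.makedirs(dest_dir)        # the directory exists at this point...
shutil.rmtree(cache_root)    # ...but a concurrent cache clear removes it...

try:
    with open(lk_fn, 'w') as fp:   # ...so opening the lock file for writing fails
        fp.write('')
except (IOError, OSError) as exc:
    print('write failed: {0}'.format(exc))   # Errno 2: No such file or directory
finally:
    shutil.rmtree(cache_root, ignore_errors=True)
```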
Interesting; one missing piece of information in my report is that our salt-master runs as a Docker container, and thus writes to an aufs filesystem. We could mount a volume as…
@martinhoefling Sure, by all means test that.
I'm (still) trying to upgrade our 2014.1.7 install to the 2014.7 series, and am getting consistent highstate failures that look like this. I manually patched 2014.7.5 with the fix from issue #22987 (which was preventing GitFS from working at all). Now our custom grains and modules are syncing properly, but I still get random tracebacks that look very much like what @martinhoefling posted. The file in question does exist on the master, and copies just fine 100% of the time using version 2014.1.7. We are NOT using Docker in any way. Let me know if there's any more testing or info I can provide. Thanks.
CentOS 6.6, 64-bit
As @jhenry82 also indicated, this seems to have nothing to do with aufs. Replacing the caching volume in the Docker container had no effect.
@terminalmage I was able to reproduce the trace, and it seems to correlate with fileserver.update. Setup: two Docker containers, master / minion; the master exports gitfs as described above. On the minion, in a loop:
This works fine, and copies ~200 files. When I start (also in a loop) on the master...
...I sometimes get the identical stack trace described in my report.
@terminalmage: I have prepared a fix/workaround in pull request #23496. If this is an acceptable solution, it would be nice to see it in 2014.7 as well.
This resolves issues when the freshly created directory is removed by fileserver.update.
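A minimal sketch of the kind of defensive re-creation being discussed, assuming a hypothetical helper name; this is not the actual diff from #23496, just an illustration of re-creating the lock file's parent directory immediately before writing, so that an intervening cleanup no longer causes a traceback.

```python
import errno
import os

def write_lock_file(lk_fn, contents=''):
    '''Write ``lk_fn``, tolerating a parent directory that was removed concurrently.'''
    lk_dir = os.path.dirname(lk_fn)
    try:
        os.makedirs(lk_dir)              # re-create the directory right before the write
    except OSError as exc:
        if exc.errno != errno.EEXIST:    # an already-existing directory is fine
            raise
    with open(lk_fn, 'w') as fp:
        fp.write(contents)
```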
@martinhoefling, @jhenry82, report here if the fix doesn't resolve the issues you were having. If it does, then we can close this issue.
Can that commit be backported to 2014.7 (and/or 2015.5)? Then I can give it a whirl. I'm not really in a position to test random builds straight from develop right now.
Yes, it will make it back to 2014.7.
So far I have verified the effect only in my test setup with two Docker containers, where I was no longer able to reproduce the above stack trace. I have also patched our salt-master, with no problems so far. I'll report back after a couple of days / deployments if another trace shows up in production.
This resolves issues when the freshly created directory is removed by fileserver.update.
I've manually patched a copy of the 2015.5.0 release with @martinhoefling's fix, and restarted the Salt master. I am still seeing the same basic stack trace (with different line numbers, obviously, since the original report was against the 2014.7 series).
The state that is looking for the above file:
Versions:
Non-default master settings:
So, first of all: I can confirm that the fix solves the problem in production for our case. The version is still 2014.7.4 with the patch applied. @jhenry82, can you add debugging output to the directory cleanup in fileserver.update? Maybe also at the point where the parent directory is created. If for some reason the time between creating and using the directory is > 60s, the patch does not work; that would then sound like a bug / problem elsewhere. @terminalmage @jfindlay, any idea if there are other cleanup jobs that could cause @jhenry82's problem?
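For reference, a rough sketch of the kind of debug output being asked for here; the function and names are hypothetical and are not the actual fileserver.update code.

```python
import logging
import os
import shutil
import time

log = logging.getLogger(__name__)

def clean_stale_cache_dirs(cache_root, max_age=60):
    '''Remove cache subdirectories older than ``max_age`` seconds, logging each decision.'''
    now = time.time()
    for name in os.listdir(cache_root):
        path = os.path.join(cache_root, name)
        if not os.path.isdir(path):
            continue
        age = now - os.path.getmtime(path)
        if age > max_age:
            log.debug('Removing stale cache dir %s (age: %.1fs)', path, age)
            shutil.rmtree(path, ignore_errors=True)
        else:
            log.debug('Keeping cache dir %s (age: %.1fs)', path, age)
```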
I'll try adding debug code today. One pattern I've noticed is that it's almost always large (>10 MB) binary files that are triggering the stack trace on my 2015.5.0 master, though I have seen it once on a random tiny .sls file. That, and entire branches in gitfs:
We do have a number of large binary files in the repo. Perhaps that is making processing of the repo slow enough that the 60-second delay is not sufficient?
@jhenry82, sounds like a possible explanation. Let's see if your debug log on the directory deletes correlates with the missing dir. @terminalmage: How about entirely disabling / disallowing a fileserver.update cleanup during…
For what it's worth, I updated the check in salt/fileserver/__init__.py from 60 seconds to 120 seconds. I have deployed the patched version to 1000 minions and haven't seen the stack trace, or a failed highstate resulting from it, in 2 days. I was seeing it dozens of times per day beforehand. So that "works", but it suggests to me that we are working around some deeper problem. For someone with an even bigger git repo than ours, 120 seconds might still not be long enough. I don't know the code well enough right now to comment further, though.
The default loop interval for cleanup jobs is 60s, so increasing to 2 * loop_interval should be no problem. In fact, making it loop-interval dependent at all is probably a good idea. However, this still feels more like a workaround for the actual problem.
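A short sketch of what a loop-interval-dependent grace period could look like; the option name and the 2x multiplier are assumptions for illustration, not the actual implementation.

```python
import os
import time

def too_new_to_delete(path, opts):
    '''Return True if ``path`` was modified within 2 * loop_interval seconds.'''
    grace = 2 * opts.get('loop_interval', 60)   # assumed option name and default
    return (time.time() - os.path.getmtime(path)) < grace
```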
@jfindlay We have not observed this problem anymore; shall we close the issue?
@martinhoefling, sure, thanks.
What's the problem?
When highstating, copying a 'large' file tree (~250 files) from a git repository sometimes fails. The traceback looks as if a lock file is cleaned up too early in some cases. The fix in #18839 does not resolve the issue.
How are we using gitfs?
We have several minions (which are masters themselves), and each one obtains the correct version of the (state) tree by matching the branch in git. We also use an 'intermediate' top file to work around issue #12483, something like:
The matching branch here is `salt-beta`, the environment `salt-beta` exclusively targets `beta-instancemaster`, and the `top.sls` replacement is `instance-beta.sls`.
Traceback on Server
Minion Output
Salt Version
Master Config