Error in function _file_hash since upgrading to 0.17.0 #7755
Interesting. No problems like this on whatever version you upgraded from? With the same config and so forth? From which version did you upgrade?
I've upgraded from 0.16.4. I searched the logs; the error does not appear before the upgrade. The config is almost the same, meaning I've only played with the log_* directives.
Got the same problem after upgrading from 0.16.4 to 0.17.0. The issue is reproducible. I have a state which installs vim, manages /root/.vimrc, and recursively copies /root/.vim from salt://configs/vim/.vim.
If we run this state on one node, the output is fine:

State: - file
Name: /root/.vim
Function: recurse
    Result: True
    Comment: The directory /root/.vim is in the correct state
    Changes:

However, when applying the state to all 73 nodes, random "file not found" errors start popping up, like:

----------
node059:
----------
State: - pkg
Name: vim
Function: installed
    Result: True
    Comment: Package vim is already installed
    Changes:
----------
State: - file
Name: /root/.vimrc
Function: managed
    Result: True
    Comment: File /root/.vimrc is in the correct state
    Changes:
----------
State: - file
Name: /root/.vim
Function: recurse
    Result: False
    Comment: #### /root/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim ####
             Source file salt://configs/vim/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim not found
    Changes:

If the state is applied on this individual node (e.g. node059), everything goes fine.

When the 'source file not found' errors appear, the following is recorded in the master log:

2013-10-25 17:51:26,386 [salt.master ][ERROR ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 153, in file_hash
    hsum, mtime = fp_.read().split(':')
ValueError: need more than 1 value to unpack

The file names themselves are random, too; no visible pattern can be recognized.

My environment:

# salt-call --versions-report
           Salt: 0.17.0
         Python: 2.7.3 (default, Aug 1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.2

# cat /etc/salt/master | grep ^[^#]
interface: 0.0.0.0
file_roots:
  base:
    - /srv/salt
The exception is thrown because the cache file is empty at the moment it's read:

2013-10-25 18:45:54,057 [salt.master ][ERROR ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 155, in file_hash
    raise Exception(" | ".join([cache_path, file_content]))
Exception: /var/cache/salt/master/roots/hash/base/configs/vim/.vim/bundle/vim-powerline-develop/doc/Powerline.txt.hash.md5 |

(An ugly way to debug, but still.) Looks like some sort of race condition. It doesn't crawl to the surface while the number of nodes is relatively low.
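For context, the failing line in salt/fileserver/roots.py unpacks the cached hash file as hash:mtime. The sketch below is a simplified, hypothetical version of that read/write pair, not the actual Salt code, and it shows why a reader arriving while another worker is rewriting the cache file sees an empty file and hits exactly this ValueError:

```python
# Hypothetical, simplified version of the hash-cache read/write in roots.py.
# The real code lives in salt/fileserver/roots.py; names here are illustrative.
import hashlib
import os

def read_cached_hash(cache_path):
    """Return (hsum, mtime) stored as 'hash:mtime' in the cache file."""
    with open(cache_path, 'r') as fp_:
        contents = fp_.read()
    # If another worker has truncated the file but not yet flushed its data,
    # `contents` is '' and this unpack raises:
    # ValueError: need more than 1 value to unpack
    hsum, mtime = contents.split(':')
    return hsum, mtime

def write_cached_hash(cache_path, full_path):
    """Write 'hash:mtime' for full_path; not atomic, so readers can race."""
    with open(full_path, 'rb') as src:
        hsum = hashlib.md5(src.read()).hexdigest()
    mtime = str(os.path.getmtime(full_path))
    # open(..., 'w') truncates immediately; a reader that arrives between the
    # truncate and the write sees an empty file.
    with open(cache_path, 'w') as fp_:
        fp_.write('{0}:{1}'.format(hsum, mtime))
```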
@ddsstt Thanks for the additional info. We're upgrading this to high severity so we can get some eyeballs on this issue ASAP.
First pass at some file-locking and retry logic for file hashes. Also a ...
@victormuse and @ddsstt, the above merged PR introduces some file-locking logic which is designed to prevent this issue from occurring. Given that you're obviously seeing this issue in production, I'm guessing you don't want to run development code on those servers. It's working thus far in my tests, however. Would you like to close this issue for the time being and re-open if you see it in future releases of Salt, or is this fix something that's easy for you to verify? I'll tag this as Fixed Pending Verification in the meantime. Many thanks to both of you for the issue report and the detective work! Very helpful!
@cachedout That's great news, thank you! I'll be able to verify the fix tomorrow.
@cachedout I've applied the patch (https://github.com/saltstack/salt/commit/db2e4758fdc1ff3483dab97b9b74bf845abb29f1.patch) on the master and all nodes, then restarted all services. However, the issue still keeps popping up, with the same symptoms. Is it possible I'm doing it wrong here?
@ddsstt I doubt you've applied it incorrectly. Thanks so much for taking the time to apply that patch and test it. I'll keep poking at it and see if I can improve things. We'll get this figured out.
I don't know if it's of help, but I have the same issue with random source files not being found, getting the same "need more than 1 value to unpack" error. I use Ubuntu 12.04.3 LTS.
I put an inotifywait -m on /var/cache/salt/master/roots/hash/base/configs/rsyslog and saw that the cache files are written to dozens of times in a short sequence (and, apparently, sometimes at the same time). Shouldn't once be enough? I was trying to get the output of inotify, but I can't reproduce the bug right now.
@halfgaar I didn't think of using inotify to gather more debugging info! Output of pyinotify when observing the issue is available here: http://pastebin.com/LKCbg5mu Hope this helps.
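For anyone who wants to reproduce this observation, a minimal pyinotify watcher (roughly equivalent to inotifywait -m) on the hash cache directory could look like the following; the watch path and event mask are assumptions, not anything taken from Salt:

```python
# Minimal pyinotify sketch for watching the master's hash cache directory.
# The path and event selection are illustrative only.
import pyinotify

WATCH_DIR = '/var/cache/salt/master/roots/hash/base'

class CacheEventHandler(pyinotify.ProcessEvent):
    def process_default(self, event):
        # Print every matching event with its path, which makes repeated
        # rewrites of the same .hash.md5 file visible.
        print('{0}: {1}'.format(event.maskname, event.pathname))

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_MODIFY | pyinotify.IN_CLOSE_WRITE
wm.add_watch(WATCH_DIR, mask, rec=True, auto_add=True)

notifier = pyinotify.Notifier(wm, CacheEventHandler())
notifier.loop()
```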
The above merge basically ups the scope of the try/catch and also fires off a retry if for some reason the file locking doesn't behave as expected.
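In rough terms, and only as a sketch of the idea described above rather than the actual patch, the read path ends up looking something like this: the whole parse sits inside one try/except, and the handler simply calls the function again if the cache file is locked or half-written:

```python
# Illustrative sketch of "wider try/except plus retry"; the real change lives
# in salt/fileserver/roots.py and differs in its details.
def read_cached_hash(cache_path):
    try:
        with open(cache_path, 'r') as fp_:
            hsum, mtime = fp_.read().split(':')
        return hsum, mtime
    except (OSError, IOError, ValueError):
        # OSError/IOError: the file is locked or briefly missing while being
        # rewritten; ValueError: the file exists but is empty or partial.
        # Retrying recursively handles transient contention, but places no
        # bound on depth if the file stays busy (see the recursion errors
        # reported further down in this thread).
        return read_cached_hash(cache_path)
```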
@victormuse and @ddsstt, the above merge for more aggressive retry logic should have made it into 0.17.2. Have either of you had a chance to try it? If so, could you post your experience? Thanks so much!
Thank you for your hard work, Mike! I'll definitely test your patch either tomorrow or on Monday and will share the results. By the way, as one can see from the comments in the relevant parts of the code, we have to check whether a file is locked by catching OSError on a failed salt.utils.fopen() call. You note that we can't use select() since we don't want to lose Windows compatibility. However, would using select() save us from having this issue pop up on Linux? If so, what would you think of a workaround that uses select() on Linux and falls back to fopen() on Windows? Correct me, please, if I misunderstood something.
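For reference, select() on a regular file does not actually report advisory-lock status, so the usual cross-platform alternative to "catch OSError on fopen()" is explicit advisory locking: fcntl.flock on POSIX and msvcrt.locking on Windows. The sketch below only illustrates that platform split; it is not what Salt does, and the helper names are made up:

```python
# Illustrative cross-platform advisory lock around a hash-cache write.
# Not Salt's implementation; it only sketches the Linux/Windows split
# discussed above. Both variants raise on contention (non-blocking), which is
# what an OSError-based check would key off.
import sys

if sys.platform.startswith('win'):
    import msvcrt

    def lock_file(fp_):
        # Lock one byte at the start of the file, non-blocking.
        msvcrt.locking(fp_.fileno(), msvcrt.LK_NBLCK, 1)

    def unlock_file(fp_):
        msvcrt.locking(fp_.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(fp_):
        fcntl.flock(fp_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

    def unlock_file(fp_):
        fcntl.flock(fp_.fileno(), fcntl.LOCK_UN)

def write_cached_hash(cache_path, hsum, mtime):
    # Caveat: open(..., 'w') truncates before the lock is taken, so this alone
    # does not remove the read race; it only shows the platform-specific calls.
    with open(cache_path, 'w') as fp_:
        lock_file(fp_)
        try:
            fp_.write('{0}:{1}'.format(hsum, mtime))
        finally:
            unlock_file(fp_)
```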
The current solution worked for me in most cases. However, when copying relatively heavy directories (like the vim config directory), the following can sometimes appear in the log file:
The corresponding action on the minion fails at this point. This makes sense: a heavily accessed cache file might stay unavailable long enough for the recursion limit to become an issue, since we keep calling the same function recursively on every retry.
Disregard my two pull requests, please; they just add
Just a side note: I believe a file-based caching approach might be an issue when scaling SaltStack in specific scenarios. Even if we manage to come up with a reliable file-locking mechanism, with a large number of objects most of the code will be waiting for the lock to become free. This, in turn, might lead to serious performance degradation. Would it be worth looking into alternative caching approaches?
@ddsstt Sorry for the delay in getting back to you here. I agree that a file-based caching approach has practical limitations and that, in the long run, it's worth exploring alternatives to this approach. I also agree with your analysis of the underlying cause of the recursion errors. I tend to wonder, however, if we might be able to get away with simply throwing in a very short sleep before the retry loop kicks off. In all but the most extreme scenarios, we shouldn't be waiting all that long for a lock to become free, and an artificial performance penalty incurred by a sleep seems quite preferable to exceptions. Thoughts?
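A sketch of that suggestion: bound the retries in an iterative loop with a short sleep, so the worst case is a small delay rather than a maximum-recursion-depth error. The attempt count and delay are illustrative values, not anything Salt ships with:

```python
# Illustrative bounded retry with a short sleep between attempts.
import time

MAX_ATTEMPTS = 10
RETRY_DELAY = 0.05  # seconds; made-up value for the sketch

def read_cached_hash(cache_path):
    last_exc = None
    for _ in range(MAX_ATTEMPTS):
        try:
            with open(cache_path, 'r') as fp_:
                hsum, mtime = fp_.read().split(':')
            return hsum, mtime
        except (OSError, IOError, ValueError) as exc:
            # Locked, missing, or half-written cache file: wait briefly and
            # try again instead of recursing.
            last_exc = exc
            time.sleep(RETRY_DELAY)
    # Give up after a bounded number of attempts.
    raise last_exc
```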
I believe this was ultimately resolved by the commit here: d2c492f. I haven't seen this error re-occur since then, and as a result, I'm going to go ahead and close this. Thanks to all for the help resolving this issue!
Hi,

[DEBUG ] S3 Response Status Code: 200

Using Salt: 2014.7.0rc2 on the master.
@soodr Mind opening a new issue? Include your
Hi,
The following error keeps popping up:
However, it doesn't seem to affect the functionality of this test setup.