
Error in function _file_hash since upgrading to 0.17.0 #7755

Closed
victormuse opened this issue Oct 11, 2013 · 22 comments
Labels: Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), severity-medium (3rd level: incorrect or bad functionality, confusing and lacks a workaround)

@victormuse

Hi,

# salt-call --versions-report
           Salt: 0.17.0
         Python: 2.6.8 (unknown, Mar 14 2013, 09:31:22)
         Jinja2: 2.6
       M2Crypto: 0.20.2
 msgpack-python: 0.1.13
   msgpack-pure: Not Installed
       pycrypto: 2.6
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.4

# cat /etc/salt/master
user: salt
open_mode: true
log_file: file:///dev/log/LOG_LOCAL3
log_level_logfile: warning
log_level: quiet

fileserver_backend:
  - roots

file_roots:
  base:
    - /var/data/salt/srv/salt
file_ignore_regex:
  - '/\.svn($|/)'
  - '/\.git($|/)'
file_ignore_glob:
  - '*.pyc'

pillar_roots:
  base:
    - /var/data/salt/srv/pillar

client_acl:
  jenkins:
    - .*

The following error keeps popping up:

[ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/python2.6/site-packages/salt/fileserver/roots.py", line 153, in file_hash
    hsum, mtime = fp_.read().split(':')
ValueError: need more than 1 value to unpack

However, it doesn't seem to affect the functionality of this test setup.

@basepi
Contributor

basepi commented Oct 11, 2013

Interesting. No problems like this from whatever version you upgraded from? With same config and so forth?

From which version did you upgrade?

@victormuse
Author

I've upgraded from 0.16.4. I searched the logs; the error does not appear before the upgrade. The config is almost the same: I've only changed the log_* directives.

@ghost ghost assigned cachedout Oct 15, 2013
@ddsstt

ddsstt commented Oct 25, 2013

Got the same problem after upgrading from 0.16.4 to 0.17.0. The issue is reproducible.

I have a state, which defines:

/root/.vim:
    file.recurse:
        - source: salt://configs/vim/.vim
        - user: root
        - group: root
        - dir_mode: 755
        - file_mode: 644

If we run this state on one node, the output is fine:

    State: - file
    Name:      /root/.vim
    Function:  recurse
        Result:    True
        Comment:   The directory /root/.vim is in the correct state
        Changes:

However, when applying the state to all 73 nodes, random "file not found" errors start popping up, like:

----------
node059:
----------
    State: - pkg
    Name:      vim
    Function:  installed
        Result:    True
        Comment:   Package vim is already installed
        Changes:
----------
    State: - file
    Name:      /root/.vimrc
    Function:  managed
        Result:    True
        Comment:   File /root/.vimrc is in the correct state
        Changes:
----------
    State: - file
    Name:      /root/.vim
    Function:  recurse
        Result:    False
        Comment:   #### /root/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim ####
Source file salt://|configs/vim/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim not found
        Changes:

If the state is applied on this individual node (e.g. node059), everything goes fine.

When 'source file not found' errors appear, the following is recorded in the master log:

2013-10-25 17:51:26,386 [salt.master      ][ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 153, in file_hash
    hsum, mtime = fp_.read().split(':')
ValueError: need more than 1 value to unpack

The file names themselves are random, too; no visible pattern can be recognized.

My environment:

# salt-call --versions-report
           Salt: 0.17.0
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.2
# cat /etc/salt/master | grep ^[^#]
interface: 0.0.0.0
file_roots:
  base:
    - /srv/salt

@ddsstt

ddsstt commented Oct 25, 2013

The exception is thrown because the cache file is empty at the moment it's read:

2013-10-25 18:45:54,057 [salt.master      ][ERROR   ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 155, in file_hash
    raise Exception(" | ".join([cache_path, file_content]))
Exception: /var/cache/salt/master/roots/hash/base/configs/vim/.vim/bundle/vim-powerline-develop/doc/Powerline.txt.hash.md5 |

(ugly way to debug, but still.)

Looks like some sort of race condition. It doesn't surface while the number of nodes is relatively low.
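
For anyone reading along, a minimal illustration of that failure mode (not Salt code, just the unpack on an empty read):

    # What the failing line in roots.py's file_hash() effectively does when
    # the cache file is empty at the moment it is read: read() returns '',
    # and ''.split(':') yields [''] -- a single element -- so the two-name
    # unpack raises "ValueError: need more than 1 value to unpack" (Python 2
    # wording of the error).
    content = ''                       # what fp_.read() returns for an empty file
    hsum, mtime = content.split(':')   # raises ValueError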

@cachedout
Contributor

@ddsstt Thanks for the additional info. We're upgrading this to high severity so we can get some eyeballs on this issue ASAP.

thatch45 added a commit that referenced this issue Oct 26, 2013
First pass at some file-locking and retry logic for file hashes. Also a ...
@cachedout
Contributor

@victormuse and @ddsstt, the above merged PR introduces some file-locking logic which is designed to prevent this issue from occurring. Given that you're obviously seeing this issue in production, I'm guessing you don't want to run development code on those servers. It's working thus far in my tests, however.

Would you like to close this issue for the time being and re-open if you see it in future releases of Salt or is this fix something that's easy for you to verify? I'll tag this as Fixed Pending Verification in the meantime.

Many thanks to both of you for the issue report and the detective work! Very helpful!
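
For readers following along, the general shape of such a guard is roughly the following. This is a hedged sketch, not the merged patch; the helper name and the lock-file suffix are made up for illustration:

    import os
    import time

    def read_hash_cache(cache_path, attempts=5, wait=0.05):
        # Rough sketch only (hypothetical helper, not the merged code):
        # skip the read while a writer holds a lock file next to the hash
        # cache, and retry a few times before giving up.
        lock_path = cache_path + '.lk'   # hypothetical lock-file name
        for _ in range(attempts):
            if os.path.exists(lock_path):
                time.sleep(wait)         # writer is busy; wait and retry
                continue
            with open(cache_path) as fp_:
                content = fp_.read()
            if ':' in content:           # complete "hsum:mtime" record
                hsum, mtime = content.split(':')
                return hsum, mtime
            time.sleep(wait)             # empty/partial file; retry
        return None, None                # caller falls back to recomputing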

@ddsstt

ddsstt commented Oct 27, 2013

@cachedout That's great news, thank you!

I'll be able to verify the fix tomorrow.

@ddsstt

ddsstt commented Oct 29, 2013

@cachedout I've applied the patch (https://github.com/saltstack/salt/commit/db2e4758fdc1ff3483dab97b9b74bf845abb29f1.patch) on master and all nodes, then restarted all services. However, the issue still keeps popping up, with the same symptoms.

Is it possible I'm doing it wrong here?

@cachedout
Contributor

@ddsstt I doubt you've applied it incorrectly. Thanks so much for taking the time to apply that patch and test it. I'll keep poking at it and see if I can improve things. We'll get this figured out.

@halfgaar
Contributor

halfgaar commented Nov 1, 2013

I don't know if it's of help, but I have the same issue with random source files not being found, getting the same "need more than 1 value to unpack" error. I use Ubuntu 12.04.3 LTS.

2013-11-01 11:37:52,893 [salt.master      ][ERROR   ] Error in function _file_hash:      
Traceback (most recent call last):                                                       
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1494, in run_func             
    ret = getattr(self, func)(load)                                                      
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)                                                 
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 153, in file_hash   
    hsum, mtime = fp_.read().split(':')                                                  
ValueError: need more than 1 value to unpack                                             
# salt-call --versions-report                          
           Salt: 0.17.1                                
         Python: 2.7.3 (default, Sep 26 2013, 20:03:06)
         Jinja2: 2.6                                   
       M2Crypto: 0.21.1                                
 msgpack-python: 0.1.10                                
   msgpack-pure: Not Installed                         
       pycrypto: 2.4.1                                 
         PyYAML: 3.10                                  
          PyZMQ: 13.0.0                                
            ZMQ: 3.2.2

I put an inotifywait -m on /var/cache/salt/master/roots/hash/base/configs/rsyslog and saw that the cache files are written to dozens of times in quick succession (and apparently sometimes at the same time). Shouldn't once be enough?

I was trying to get the output of inotify, but I can't reproduce the bug right now.

@ddsstt

ddsstt commented Nov 1, 2013

@halfgaar I didn't think of using inotify to gather more debugging info!

Output of pyinotify when observing the issue is available here: http://pastebin.com/LKCbg5mu

Hope this helps.

basepi added a commit that referenced this issue Nov 1, 2013
#7755 More aggressive re-try logic
@cachedout
Contributor

The above merge basically ups the scope of the try/catch and also fires off a retry if for some reason the file locking doesn't behave as expected.
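
In other words, the read now looks roughly like the sketch below (simplified and hypothetical, not the merged code; salt.utils.fopen is the same helper that appears in the tracebacks above):

    import salt.utils  # provides fopen(), the helper seen in the tracebacks

    def _cached_hash(cache_path):
        # Simplified sketch (not the merged code): widen the try/except so
        # that a locked file (IOError/OSError on open) and a half-written
        # file (ValueError on the unpack) both trigger another attempt.
        try:
            with salt.utils.fopen(cache_path, 'rb') as fp_:
                hsum, mtime = fp_.read().split(':')
            return hsum, mtime
        except (IOError, OSError, ValueError):
            return _cached_hash(cache_path)  # retry by calling ourselves again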

@cachedout
Contributor

@victormuse and @ddsstt, the above merge for more aggressive retry logic should have made it into 0.17.2. Have either of you had a chance to try it? If so, could you post your experience? Thanks so much!

@ddsstt

ddsstt commented Nov 21, 2013

Thank you for your hard work, Mike! I'll definitely test your patch either tomorrow or on Monday and will share my experience.

By the way: as one can see from the comments in the relevant parts of the code, we have to check whether a file is locked by catching OSError on a failed salt.utils.fopen() call. You note that we can't use select() since we don't want to lose Windows compatibility. However, would using select() save us from having this issue pop up on Linux? If so, how would you feel about a workaround that uses select() on Linux and falls back to fopen() on Windows? Correct me, please, if I misunderstood something.


ddsstt added a commit to ddsstt/salt that referenced this issue Nov 22, 2013
@ddsstt

ddsstt commented Nov 22, 2013

The current solution worked for me in most cases.

However, when copying relatively heavy directories (like the vim config directory), the following can sometimes appear in the log file:

  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 144, in file_hash
    with salt.utils.fopen(mtime_map_path, 'w') as fp_:
  File "/usr/lib/python2.7/genericpath.py", line 32, in isfile
    return stat.S_ISREG(st.st_mode)
  File "/usr/lib/python2.7/stat.py", line 50, in S_ISREG
    return S_IFMT(mode) == S_IFREG
RuntimeError: maximum recursion depth exceeded

The corresponding action on the minion fails at this point.

This makes sense: a heavily accessed cache file might stay unavailable long enough for the recursion limit to become an issue, since we keep calling file_hash() recursively.

@ddsstt

ddsstt commented Nov 22, 2013

Please disregard my two pull requests; they just add ret['hsum'] = None before returning ret in case we catch a ValueError on the salt.utils.fopen() call. This helped with the random crashes occurring on minions, which looked like this:

----------
    State: - file
    Name:      /root/.vim
    Function:  recurse
        Result:    False
        Comment:   An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/state.py", line 1256, in call
    *cdata['args'], **cdata['kwargs'])
  File "/usr/lib/pymodules/python2.7/salt/states/file.py", line 1621, in recurse
    manage_file(dest, src)
  File "/usr/lib/pymodules/python2.7/salt/states/file.py", line 1544, in manage_file
    **pass_kwargs)
  File "/usr/lib/pymodules/python2.7/salt/states/file.py", line 1131, in managed
    contents)
  File "/usr/lib/pymodules/python2.7/salt/modules/file.py", line 1944, in manage_file
    if source and source_sum['hsum'] != name_sum:
KeyError: 'hsum'

        Changes:
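
Roughly, what those (now discarded) pull requests did is the following; a sketch with a hypothetical function name, not the actual diff:

    def _hash_record(fp_):
        # Hypothetical sketch of the discarded change: always return a dict
        # containing 'hsum', so the minion-side comparison
        # source_sum['hsum'] != name_sum cannot raise KeyError.
        ret = {}
        try:
            hsum, mtime = fp_.read().split(':')
            ret['hsum'] = hsum
        except ValueError:
            ret['hsum'] = None   # incomplete/empty cache file
        return ret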

@ddsstt

ddsstt commented Nov 22, 2013

Just a side note: I believe a file-based caching approach might become an issue when scaling SaltStack in certain scenarios. Even if we manage to come up with a reliable file-locking mechanism, with a large number of objects most of the code will be waiting for the lock to become free. This, in turn, might lead to serious performance degradation.

Would it be worth looking into alternative caching approaches?

@cachedout
Contributor

@ddsstt Sorry for the delay in getting back to you here.

I agree that a file-based caching approach has practical limitations and that in the long-run, it's worth exploring alternatives to this approach.

I also agree with your analysis of the underlying cause of the recursion errors. I tend to wonder, however, if we might be able to get away with simply throwing in a very short sleep before the retry loop kicks off. In all but the most extreme scenarios, we shouldn't be waiting all that long for a lock to become free, and an artificial performance penalty incurred by a sleep seems quite preferable to exceptions.

Thoughts?
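
As an illustration of that suggestion (not Salt code), a bounded loop with a short sleep replaces the recursive retry and cannot hit the recursion limit:

    import time

    def read_hash_record(cache_path, attempts=10, wait=0.02):
        # Illustrative sketch only: retry in a loop with a short sleep
        # instead of recursing, so a contended cache file costs a few
        # milliseconds rather than a RuntimeError.
        for _ in range(attempts):
            try:
                with open(cache_path) as fp_:
                    hsum, mtime = fp_.read().split(':')
                return hsum, mtime
            except (IOError, OSError, ValueError):
                time.sleep(wait)   # give the writer a moment to finish
        return None, None          # give up; caller recomputes the hash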

@basepi basepi modified the milestones: Approved, Outstanding Bugs Apr 21, 2014
@cachedout
Contributor

I believe this was ultimately resolved by the commit here: d2c492f. I haven't seen this error re-occur since then, and as a result I'm going to go ahead and close this. Thanks to all for the help resolving this issue!

@soodr

soodr commented Nov 5, 2014

Hi,
I noticed this when using s3fs. It happens for a file whose name ends with a "zip+" extension and whose size is > 450 MB.

[DEBUG ] S3 Response Status Code: 200
[ERROR ] Exception need more than 1 value to unpack occurred in file server update
Traceback (most recent call last):
  File "/pyvenv/salt/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 216, in fileserver_update
    fileserver.update()
  File "/pyvenv/salt/lib/python2.7/site-packages/salt/fileserver/__init__.py", line 306, in update
    self.servers[fstr]()
  File "/pyvenv/salt/lib/python2.7/site-packages/salt/fileserver/s3fs.py", line 115, in update
    _get_file_from_s3(metadata, saltenv, bucket, file_path, cached_file_path)
  File "/pyvenv/salt/lib/python2.7/site-packages/salt/fileserver/s3fs.py", line 556, in _get_file_from_s3
    name, value = header.split(':', 1)

Using Salt: 2014.7.0rc2 on the master.
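
For context, the failing s3fs line splits each response header on its first colon, so any header line without a colon triggers the same unpack error. An illustrative example (the header value is made up):

    # Illustrative only: a line without a colon among the S3 response
    # headers fails the two-way unpack exactly like the roots.py case above.
    header = 'HTTP/1.1 200 OK'          # hypothetical colon-free line
    name, value = header.split(':', 1)  # ValueError: need more than 1 value to unpack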

@basepi
Contributor

basepi commented Nov 10, 2014

@soodr Mind opening a new issue? Include your salt --versions-report and as much information as you have, including the error you've included here.

@soodr

soodr commented Nov 10, 2014

#17270
