Error in function _file_hash since upgrading to 0.17.0 #7755
Interesting. No problems like this on whatever version you upgraded from? With the same config and so forth? From which version did you upgrade?
I've upgraded from 0.16.4. I searched the logs; the error does not appear before the upgrade. The config is almost the same, meaning I've only played with the log_* directives.
Got the same problem after upgrading from 0.16.4 to 0.17.0. The issue is reproducible. I have a state which installs vim, manages /root/.vimrc, and recursively copies /root/.vim from salt://configs/vim/.vim.
If we run this state on one node, the output is fine:

State: - file
Name: /root/.vim
Function: recurse
    Result: True
    Comment: The directory /root/.vim is in the correct state
    Changes:

However, when applying the state to all 73 nodes, random "file not found" errors start popping up, like:

----------
node059:
----------
State: - pkg
Name: vim
Function: installed
    Result: True
    Comment: Package vim is already installed
    Changes:
----------
State: - file
Name: /root/.vimrc
Function: managed
    Result: True
    Comment: File /root/.vimrc is in the correct state
    Changes:
----------
State: - file
Name: /root/.vim
Function: recurse
    Result: False
    Comment: #### /root/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim ####
             Source file salt://configs/vim/.vim/bundle/syntastic-master/syntax_checkers/matlab/mlint.vim not found
    Changes:

If the state is applied on this individual node (e.g. node059), everything goes fine.

When the 'source file not found' errors appear, the following is recorded in the master log:

2013-10-25 17:51:26,386 [salt.master ][ERROR ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 153, in file_hash
    hsum, mtime = fp_.read().split(':')
ValueError: need more than 1 value to unpack

The file names themselves are random, too; no visible pattern can be recognized.

My environment:

# salt-call --versions-report
           Salt: 0.17.0
         Python: 2.7.3 (default, Aug 1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.2

# cat /etc/salt/master | grep ^[^#]
interface: 0.0.0.0
file_roots:
  base:
    - /srv/salt
The exception is thrown because the cache file is empty at the moment it's read:

2013-10-25 18:45:54,057 [salt.master ][ERROR ] Error in function _file_hash:
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/salt/master.py", line 1444, in run_func
    ret = getattr(self, func)(load)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/__init__.py", line 233, in file_hash
    return self.servers[fstr](load, fnd)
  File "/usr/lib/pymodules/python2.7/salt/fileserver/roots.py", line 155, in file_hash
    raise Exception(" | ".join([cache_path, file_content]))
Exception: /var/cache/salt/master/roots/hash/base/configs/vim/.vim/bundle/vim-powerline-develop/doc/Powerline.txt.hash.md5 |

(An ugly way to debug, but still.) Looks like some sort of race condition. It doesn't crawl to the surface while the number of nodes is relatively low.
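For context, the failing line in salt/fileserver/roots.py unpacks the cached hash file as hash:mtime. The sketch below is a simplified, hypothetical version of that read/write pair, not the actual Salt code, and it shows why a reader arriving while another worker is rewriting the cache file sees an empty file and hits exactly this ValueError:

```python
# Hypothetical, simplified version of the hash-cache read/write in roots.py.
# The real code lives in salt/fileserver/roots.py; names here are illustrative.
import hashlib
import os

def read_cached_hash(cache_path):
    """Return (hsum, mtime) stored as 'hash:mtime' in the cache file."""
    with open(cache_path, 'r') as fp_:
        contents = fp_.read()
    # If another worker has truncated the file but not yet flushed its data,
    # `contents` is '' and this unpack raises:
    # ValueError: need more than 1 value to unpack
    hsum, mtime = contents.split(':')
    return hsum, mtime

def write_cached_hash(cache_path, full_path):
    """Write 'hash:mtime' for full_path; not atomic, so readers can race."""
    with open(full_path, 'rb') as src:
        hsum = hashlib.md5(src.read()).hexdigest()
    mtime = str(os.path.getmtime(full_path))
    # open(..., 'w') truncates immediately; a reader that arrives between the
    # truncate and the write sees an empty file.
    with open(cache_path, 'w') as fp_:
        fp_.write('{0}:{1}'.format(hsum, mtime))
```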
@ddsstt Thanks for the additional info. We're upgrading this to high severity so we can get some eyeballs on this issue ASAP.
First pass at some file-locking and retry logic for file hashes. Also a ...
@victormuse and @ddsstt, the above merged PR introduces some file-locking logic which is designed to prevent this issue from occurring. Given that you're obviously seeing this issue in production, I'm guessing you don't want to run development code on those servers. It's working thus far in my tests, however. Would you like to close this issue for the time being and re-open if you see it in future releases of Salt, or is this fix something that's easy for you to verify? I'll tag this as Fixed Pending Verification in the meantime. Many thanks to both of you for the issue report and the detective work! Very helpful!
@cachedout That's great news, thank you! I'll be able to verify the fix tomorrow.
@cachedout I've applied the patch (https://github.com/saltstack/salt/commit/db2e4758fdc1ff3483dab97b9b74bf845abb29f1.patch) on the master and all nodes, then restarted all services. However, the issue still keeps popping up, with the same symptoms. Is it possible I'm doing it wrong here?
@ddsstt I doubt you've applied it incorrectly. Thanks so much for taking the time to apply that patch and test it. I'll keep poking at it and see if I can improve things. We'll get this figured out.
I don't know if it's of help, but I have the same issue with random source files not being found, getting the same "need more than 1 value to unpack" error. I use Ubuntu 12.04.3 LTS.
I put an inotifywait -m on /var/cache/salt/master/roots/hash/base/configs/rsyslog and saw that the cache files are written to dozens of times in a short sequence (and, apparently, sometimes at the same time). Shouldn't once be enough? I was trying to get the output of inotify, but I can't reproduce the bug right now.
@halfgaar I didn't think of using inotify to gather more debugging info! Output of pyinotify when observing the issue is available here: http://pastebin.com/LKCbg5mu Hope this helps.
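For anyone who wants to reproduce this observation, a minimal pyinotify watcher (roughly equivalent to inotifywait -m) on the hash cache directory could look like the following; the watch path and event mask are assumptions, not anything taken from Salt:

```python
# Minimal pyinotify sketch for watching the master's hash cache directory.
# The path and event selection are illustrative only.
import pyinotify

WATCH_DIR = '/var/cache/salt/master/roots/hash/base'

class CacheEventHandler(pyinotify.ProcessEvent):
    def process_default(self, event):
        # Print every matching event with its path, which makes repeated
        # rewrites of the same .hash.md5 file visible.
        print('{0}: {1}'.format(event.maskname, event.pathname))

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CREATE | pyinotify.IN_MODIFY | pyinotify.IN_CLOSE_WRITE
wm.add_watch(WATCH_DIR, mask, rec=True, auto_add=True)

notifier = pyinotify.Notifier(wm, CacheEventHandler())
notifier.loop()
```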
The above merge basically ups the scope of the try/catch and also fires off a retry if for some reason the file locking doesn't behave as expected.
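In rough terms, and only as a sketch of the idea described above rather than the actual patch, the read path ends up looking something like this: the whole parse sits inside one try/except, and the handler simply calls the function again if the cache file is locked or half-written:

```python
# Illustrative sketch of "wider try/except plus retry"; the real change lives
# in salt/fileserver/roots.py and differs in its details.
def read_cached_hash(cache_path):
    try:
        with open(cache_path, 'r') as fp_:
            hsum, mtime = fp_.read().split(':')
        return hsum, mtime
    except (OSError, IOError, ValueError):
        # OSError/IOError: the file is locked or briefly missing while being
        # rewritten; ValueError: the file exists but is empty or partial.
        # Retrying recursively handles transient contention, but places no
        # bound on depth if the file stays busy (see the recursion errors
        # reported further down in this thread).
        return read_cached_hash(cache_path)
```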
@victormuse and @ddsstt, the above merge for more aggressive retry logic should have made it into 0.17.2. Have either of you had a chance to try it? If so, could you post your experience? Thanks so much!
Thank you for your hard work, Mike! I'll definitely test your patch either tomorrow or on Monday and will share the results. By the way, as one can see from the comments in the relevant parts of the code, we have to check whether a file is locked by catching OSError on a failed salt.utils.fopen() call. You note that we can't use select() since we don't want to lose Windows compatibility. However, would using select() save us from having this issue pop up on Linux? If so, what would you think of a workaround that uses select() on Linux and falls back to fopen() on Windows? Correct me, please, if I misunderstood something.
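For reference, select() on a regular file does not actually report advisory-lock status, so the usual cross-platform alternative to "catch OSError on fopen()" is explicit advisory locking: fcntl.flock on POSIX and msvcrt.locking on Windows. The sketch below only illustrates that platform split; it is not what Salt does, and the helper names are made up:

```python
# Illustrative cross-platform advisory lock around a hash-cache write.
# Not Salt's implementation; it only sketches the Linux/Windows split
# discussed above. Both variants raise on contention (non-blocking), which is
# what an OSError-based check would key off.
import sys

if sys.platform.startswith('win'):
    import msvcrt

    def lock_file(fp_):
        # Lock one byte at the start of the file, non-blocking.
        msvcrt.locking(fp_.fileno(), msvcrt.LK_NBLCK, 1)

    def unlock_file(fp_):
        msvcrt.locking(fp_.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def lock_file(fp_):
        fcntl.flock(fp_.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

    def unlock_file(fp_):
        fcntl.flock(fp_.fileno(), fcntl.LOCK_UN)

def write_cached_hash(cache_path, hsum, mtime):
    # Caveat: open(..., 'w') truncates before the lock is taken, so this alone
    # does not remove the read race; it only shows the platform-specific calls.
    with open(cache_path, 'w') as fp_:
        lock_file(fp_)
        try:
            fp_.write('{0}:{1}'.format(hsum, mtime))
        finally:
            unlock_file(fp_)
```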
The current solution worked for me in most cases. However, when copying relatively heavy directories (like the vim config directory), the following can sometimes appear in the log file:
The corresponding action on the minion fails at this point. This makes sense: a heavily accessed cache file might stay unavailable long enough for the recursion limit to become an issue, since we keep calling the same function recursively on every retry.
Disregard my two pull requests, please; they just add
Just a side note: I believe a file-based caching approach might be an issue when scaling SaltStack in specific scenarios. Even if we manage to come up with a reliable file-locking mechanism, with a large number of objects most of the code will be waiting for the lock to become free. This, in turn, might lead to serious performance degradation. Would it be worth looking into alternative caching approaches?
@ddsstt Sorry for the delay in getting back to you here. I agree that a file-based caching approach has practical limitations and that, in the long run, it's worth exploring alternatives to this approach. I also agree with your analysis of the underlying cause of the recursion errors. I tend to wonder, however, if we might be able to get away with simply throwing in a very short sleep before the retry loop kicks off. In all but the most extreme scenarios, we shouldn't be waiting all that long for a lock to become free, and an artificial performance penalty incurred by a sleep seems quite preferable to exceptions. Thoughts?
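A sketch of that suggestion: bound the retries in an iterative loop with a short sleep, so the worst case is a small delay rather than a maximum-recursion-depth error. The attempt count and delay are illustrative values, not anything Salt ships with:

```python
# Illustrative bounded retry with a short sleep between attempts.
import time

MAX_ATTEMPTS = 10
RETRY_DELAY = 0.05  # seconds; made-up value for the sketch

def read_cached_hash(cache_path):
    last_exc = None
    for _ in range(MAX_ATTEMPTS):
        try:
            with open(cache_path, 'r') as fp_:
                hsum, mtime = fp_.read().split(':')
            return hsum, mtime
        except (OSError, IOError, ValueError) as exc:
            # Locked, missing, or half-written cache file: wait briefly and
            # try again instead of recursing.
            last_exc = exc
            time.sleep(RETRY_DELAY)
    # Give up after a bounded number of attempts.
    raise last_exc
```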
I believe this was ultimately resolved by the commit here: d2c492f. I haven't seen this error re-occur since then, and as a result, I'm going to go ahead and close this. Thanks to all for the help resolving this issue!
Hi,

[DEBUG ] S3 Response Status Code: 200

Using Salt: 2014.7.0rc2 on the master.
@soodr Mind opening a new issue? Include your
Hi,
The following error keeps popping up:
However, it doesn't seem to affect the functionality of this test setup.