
Occasionally git_pillar pull fails causing incorrect results of highstate (when running highstate for multiple minions) #29239

Closed
timwsuqld opened this issue Nov 27, 2015 · 15 comments
Labels: Bug, fixed-pls-verify, P1 (Priority 1), Pillar, Platform, severity-medium
Milestone: Approved

Comments

@timwsuqld

When running state.highstate for a single minion, everything works fine.
When running state.highstate for all minions (5), it sometimes gives incorrect results. (All highstate commands are being run with test=True)
Digging down, it appears that for some minions the git_pillar update fails, so the pillar data for that minion is empty, causing the states to give the wrong output. Ideally, if the git_pillar (ext_pillar) update fails, Salt shouldn't try to compile states for the minion, as the data is incorrect. I'm also not sure why the pillar data appears to be empty instead of falling back to the last successful pull.

Some of the workarounds I've seen involve simply using cron to pull the pillar repo and then pointing git_pillar at that local clone (see the sketch below). This would probably speed things up, but I'd expect Salt to already do that.
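For illustration only, a minimal sketch of that cron-based workaround; the paths, schedule, and repo location are hypothetical, not taken from this issue. A cron job keeps a local clone fresh, and the master's ext_pillar reads from it via a file:// URL (assuming the configured provider accepts a local file:// remote):

# /etc/cron.d/pillar-mirror -- hypothetical: keep a local mirror of the pillar repo fresh
*/5 * * * *  root  cd /srv/pillar-mirror && git pull --quiet

# /etc/salt/master -- point git_pillar at the local clone instead of the upstream remote
ext_pillar:
  - git:
    - master file:///srv/pillar-mirror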

Lines such as the following appear in the logs when this occurs

2015-11-27 10:12:18,883 [salt.utils.gitfs ][ERROR   ][26560] Failed to checkout master from git_pillar remote 'master git@version-control:it-internal/saltstack-config.git': aabb80c2f754bdcb6a9100c16445a05c4858c309: The index is locked. This might be due to a concurrent or crashed process
2015-11-27 10:38:09,410 [salt.utils.gitfs ][ERROR   ][26556] Failed to checkout master from git_pillar remote 'master git@version-control:it-internal/saltstack-config.git': Failed to create locked file '/var/cache/salt/master/git_pillar/5e3205db799031016a50dbe438df411c/.git/index.lock': File exists
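If no update is actually in progress for that remote, the stale lock named in the second error can be removed by hand so the next git_pillar update can proceed. A cautious sketch, with the cache path copied verbatim from the log line above (only do this while the master is idle for that remote):

rm /var/cache/salt/master/git_pillar/5e3205db799031016a50dbe438df411c/.git/index.lock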

My understanding of #22962 and #19994 is that this should have been fixed in 2015.8.0. Maybe this is related, maybe not.

Running on CentOS 7 with 4 GB of RAM

$ salt --version
salt 2015.8.1 (Beryllium)
@jfindlay added the Bug, severity-medium, Platform, Pillar, and P2 (Priority 2) labels on Nov 30, 2015
@jfindlay added this to the Approved milestone on Nov 30, 2015
@jfindlay
Contributor

@timwsuqld, thanks for the report.

@oznah

oznah commented Jan 9, 2016

+1

RHEL 7

salt --versions-report
Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.7.0
         Python: 2.7.5 (default, Oct 11 2015, 17:47:16)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.21.4
   python-gnupg: Not Installed
          smmap: 0.8.1
        timelib: Not Installed

System Versions:
           dist: redhat 7.2 Maipo
        machine: x86_64
        release: 3.10.0-229.el7.x86_64
         system: Red Hat Enterprise Linux Server 7.2 Maipo

@hal58th
Contributor

hal58th commented Feb 2, 2016

Just ran into this problem with a similar error. It seems to happen when I have multiple hosts trying to refresh their pillar at the same time. Can this get escalated to P1? It breaks my boxes at random and is severely annoying.

2016-02-01 16:22:18,028 [salt.utils.gitfs ][ERROR ][31180] Failed to checkout master from git_pillar remote 'master ssh://git@myhost.com:/myrepo.git': remote ref does not exist

The problem I was experiencing is that a random pillar file will not be found by the minion, and this error will pop up in the salt master log. While I was able to get this to occur with highstate, I was able to reproduce the issue more consistently with the following command.

salt '*' saltutil.refresh_pillar

But I was able to get it to happen less often when I used a batch size of 1.
salt '*' -b 1 saltutil.refresh_pillar

Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: Not Installed
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-32-generic
         system: Ubuntu 14.04 trusty

@jfindlay added the P1 (Priority 1) label and removed the P2 (Priority 2) label on Feb 3, 2016
@aabognah
Contributor

aabognah commented Mar 8, 2016

Hi,

I am seeing the same problem, where for some minions the wrong set of data (old data or no data at all) is returned by git_pillar. I am using pillar data to template the sudoers file, and this is resulting in corrupted sudoers files.

I tried to trace the error in the log file with debug level. This is the only error I see:

2016-03-08 14:08:54,109 [salt.utils.gitfs ][ERROR   ][5582] Failed to checkout master from git_pillar remote 'master gitlab@don.private.uwaterloo.ca:ist-tis-sas/salt-root.git': remote ref does not exist
2016-03-08 14:08:54,119 [salt.loaded.int.pillar.git_pillar][DEBUG   ][5591] git_pillar is processing pillar SLS from /var/cache/salt/master/git_pillar/0bac491499545b545cd9d407aa125c19/pillar/base for pillar env 'base'

The remote exists, and the error seems random (maybe caused by multiple attempts to check out the repo at the same time!).

Moreover, the files in the cache are fine and the returned pillar should not be corrupted, but it is.

salt '*' pillar.item returns the correct set of data, so I think the problem is happening when highstate is templating the files.

salt --versions-report

Salt Version:
           Salt: 2015.8.7

Dependency Versions:
         Jinja2: unknown
       M2Crypto: 0.20.2
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
         Python: 2.6.6 (r266:84292, May 22 2015, 08:34:51)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: 1.4.1
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: 0.20.0
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: 0.20.3
   python-gnupg: Not Installed
          smmap: 0.8.1
        timelib: Not Installed

System Versions:
           dist: redhat 6.7 Santiago
        machine: x86_64
        release: 2.6.32-573.12.1.el6.x86_64
         system: Red Hat Enterprise Linux Server 6.7 Santiago

@aabognah
Contributor

aabognah commented Mar 9, 2016

Removing the git_pillar_provider: gitpython option from the master config file, so that Salt uses the default pygit2 provider, seems to have resolved the issue for me (a config sketch follows). I do not get corrupted pillar data anymore, so the issue seems to have been with GitPython.
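In config terms, the change described above amounts to something like this in /etc/salt/master (a sketch; pygit2 is only picked as the default provider when it is actually installed):

# /etc/salt/master
# before: forcing the GitPython provider
#git_pillar_provider: gitpython
# after: drop the option entirely, or set it explicitly so pygit2 is used
git_pillar_provider: pygit2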

@terminalmage
Contributor

This seems to be related to #31293, which was caused by concurrent master funcs attempting to evaluate git_pillar at the same time and hitting a race condition. I have addressed this in this pull request, which was opened last night.

Anyone who is willing to test can either use this GitHub walkthrough to check out the pull request into your git clone (a generic sketch follows), or wait until it is merged and install from the head of the 2015.8 branch. Only the master needs to be updated.
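For anyone unfamiliar with testing a pull request, the generic GitHub flow looks roughly like this; the PR number is not spelled out above, so <PR-NUMBER> is a placeholder, and 'upstream' is assumed to be a remote pointing at the saltstack/salt repository:

git fetch upstream pull/<PR-NUMBER>/head:git-pillar-race-fix
git checkout git-pillar-race-fix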

@anlutro
Contributor

anlutro commented Apr 21, 2016

I switched back to gitfs in production using 2015.8.8.2 and haven't run into any issues so far.

@terminalmage
Contributor

@anlutro Thanks for confirming, I'll go ahead and close this.

@anlutro
Contributor

anlutro commented Apr 27, 2016

I am seeing a lot of these now instead:

2016-04-27 13:17:51,834 [ WARNING] [12961] [salt.utils.gitfs] Update lock file is present for git_pillar remote 'master REDACTED', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/git_pillar/0fee6ef19f5d8fea99738e1d23b5f4a79616c41661d58ec009e485f062130a38/.git/update.lk or running 'salt-run cache.clear_git_lock git_pillar type=update' will allow updates to continue for this remote.
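Spelled out, the two remedies the warning itself offers are the following; the cache hash is the one from the message above and will differ per remote:

# remove the stale update lock directly
rm /var/cache/salt/master/git_pillar/0fee6ef19f5d8fea99738e1d23b5f4a79616c41661d58ec009e485f062130a38/.git/update.lk

# or have Salt clear it
salt-run cache.clear_git_lock git_pillar type=update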

@anlutro
Contributor

anlutro commented Apr 27, 2016

I'll open a separate issue for it, I think I see a pattern.

@lsh-0

lsh-0 commented Jun 20, 2016

@anlutro did you find a satisfactory solution to all those "Update lock file is present" issues?

@anlutro
Contributor

anlutro commented Jun 20, 2016

#32888

@EvaSDK
Contributor

EvaSDK commented Jul 15, 2016

Hello there, I am currently running 2015.8.10+ds-1 and 2016.3.1+ds-1 from the SaltStack Debian repository and am seeing the same issue (tested with salt '*' saltutil.refresh_pillar):

# salt-master --versions-report
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: 0.6.4
      gitpython: 2.0.2
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.5 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.5 

@EvaSDK
Contributor

EvaSDK commented Jul 15, 2016

OK, after cleaning up the cache like so:

# rm /var/cache/salt/gitfs/* /var/cache/salt/git_pillar/*
# rm /var/cache/salt/minion/*

Pillar is now returning good data, but the master's log now shows a worrying error message: 2016-07-15 14:18:35,696 [salt.template ][ERROR ][2584] Template does not exist: (yes, with no template name given). Hopefully this isn't related.
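For comparison, on a default install the master-side caches usually live under /var/cache/salt/master, so an equivalent cleanup would look roughly like the following (a sketch only, assuming the default cachedir and a systemd-managed master):

systemctl stop salt-master
rm -rf /var/cache/salt/master/gitfs /var/cache/salt/master/git_pillar
systemctl start salt-master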

@terminalmage
Contributor

@EvaSDK Please open a new issue, and provide the information requested in the issue template to assist us in troubleshooting. Feel free to link to this issue.
