
Occasionally git_pillar pull fails causing incorrect results of highstate (when running highstate for multiple minions) #29239

Closed
timwsuqld opened this issue Nov 27, 2015 · 15 comments
Labels: Bug, fixed-pls-verify, P1 (Priority 1), Pillar, Platform, severity-medium
Milestone: Approved

Comments

@timwsuqld

When running state.highstate for a single minion, everything works fine.
When running state.highstate for all minions (5), it sometimes gives incorrect results. (All highstate commands are being run with test=True)
Digging down, it appears that for some minions the git_pillar update fails, so the pillar data for that minion is empty, causing the states to give the wrong output. Ideally, if the git_pillar (ext_pillar) update fails, Salt shouldn't try to compile states for the minion, as the data is incorrect. I'm also not sure why the pillar data appears to be empty instead of falling back to the last successful pull.

Some of the workarounds I've seen involve simply using cron to pull the pillar repo and then pointing git_pillar at that local clone (see the sketch below). This would probably speed things up, but I'd expect Salt to already do that.
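For illustration only, a minimal sketch of that cron-based workaround; the paths, schedule, and repo location are hypothetical, not taken from this issue. A cron job keeps a local clone fresh, and the master's ext_pillar reads from it via a file:// URL (assuming the configured provider accepts a local file:// remote):

# /etc/cron.d/pillar-mirror -- hypothetical: keep a local mirror of the pillar repo fresh
*/5 * * * *  root  cd /srv/pillar-mirror && git pull --quiet

# /etc/salt/master -- point git_pillar at the local clone instead of the upstream remote
ext_pillar:
  - git:
    - master file:///srv/pillar-mirror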

Lines such as the following appear in the logs when this occurs

2015-11-27 10:12:18,883 [salt.utils.gitfs ][ERROR   ][26560] Failed to checkout master from git_pillar remote 'master git@version-control:it-internal/saltstack-config.git': aabb80c2f754bdcb6a9100c16445a05c4858c309: The index is locked. This might be due to a concurrent or crashed process
2015-11-27 10:38:09,410 [salt.utils.gitfs ][ERROR   ][26556] Failed to checkout master from git_pillar remote 'master git@version-control:it-internal/saltstack-config.git': Failed to create locked file '/var/cache/salt/master/git_pillar/5e3205db799031016a50dbe438df411c/.git/index.lock': File exists
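If no update is actually in progress for that remote, the stale lock named in the second error can be removed by hand so the next git_pillar update can proceed. A cautious sketch, with the cache path copied verbatim from the log line above (only do this while the master is idle for that remote):

rm /var/cache/salt/master/git_pillar/5e3205db799031016a50dbe438df411c/.git/index.lock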

My understanding of #22962 and #19994 is that this should have been fixed in 2015.8.0. Maybe this is related, maybe not.

Running on CentOS 7 with 4 GB of RAM

$ salt --version
salt 2015.8.1 (Beryllium)
@jfindlay added the Bug, severity-medium, Platform, Pillar, and P2 (Priority 2) labels on Nov 30, 2015
@jfindlay added this to the Approved milestone on Nov 30, 2015
@jfindlay
Contributor

@timwsuqld, thanks for the report.

@oznah

oznah commented Jan 9, 2016

+1

RHEL 7

salt --versions-report
Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.21.1
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.7.0
         Python: 2.7.5 (default, Oct 11 2015, 17:47:16)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
         pygit2: 0.21.4
   python-gnupg: Not Installed
          smmap: 0.8.1
        timelib: Not Installed

System Versions:
           dist: redhat 7.2 Maipo
        machine: x86_64
        release: 3.10.0-229.el7.x86_64
         system: Red Hat Enterprise Linux Server 7.2 Maipo

@hal58th
Contributor

hal58th commented Feb 2, 2016

Just ran into this problem with a similar error. It seems to happen when I have multiple hosts trying to refresh their pillar at the same time. Can this get escalated to P1? It breaks my boxes at random and is severely annoying.

2016-02-01 16:22:18,028 [salt.utils.gitfs ][ERROR ][31180] Failed to checkout master from git_pillar remote 'master ssh://git@myhost.com:/myrepo.git': remote ref does not exist

The problem I was experiencing is that a random pillar file will not be found by the minion, and this error will pop up in the salt master log. While I was able to get this to occur with highstate, I was able to reproduce the issue more consistently with the following command.

salt '*' saltutil.refresh_pillar

But I was able to get it to happen less often when I used a batch size of 1.
salt '*' -b 1 saltutil.refresh_pillar

Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: Not Installed
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-32-generic
         system: Ubuntu 14.04 trusty

@jfindlay added the P1 (Priority 1) label and removed the P2 (Priority 2) label on Feb 3, 2016
@aabognah
Contributor

aabognah commented Mar 8, 2016

Hi,

I am seeing the same problem, where for some minions the wrong set of data (old data or no data at all) is returned by git_pillar. I am using pillar data to template the sudoers file, and this is resulting in corrupted sudoers files.

I tried to trace the error in the log file with debug level. This is the only error I see:

2016-03-08 14:08:54,109 [salt.utils.gitfs ][ERROR   ][5582] Failed to checkout master from git_pillar remote 'master gitlab@don.private.uwaterloo.ca:ist-tis-sas/salt-root.git': remote ref does not exist
2016-03-08 14:08:54,119 [salt.loaded.int.pillar.git_pillar][DEBUG   ][5591] git_pillar is processing pillar SLS from /var/cache/salt/master/git_pillar/0bac491499545b545cd9d407aa125c19/pillar/base for pillar env 'base'

The remote exists, and the error seems random (maybe caused by multiple attempts to check out the repo at the same time!).

Moreover, the files in the cache are fine and the returned pillar should not be corrupted, but it is.

salt '*' pillar.item returns the correct set of data, so I think the problem is happening when highstate is templating the files.

salt --versions-report

Salt Version:
           Salt: 2015.8.7

Dependency Versions:
         Jinja2: unknown
       M2Crypto: 0.20.2
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.5.0
         Python: 2.6.6 (r266:84292, May 22 2015, 08:34:51)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: 1.4.1
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: 0.20.0
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: 0.20.3
   python-gnupg: Not Installed
          smmap: 0.8.1
        timelib: Not Installed

System Versions:
           dist: redhat 6.7 Santiago
        machine: x86_64
        release: 2.6.32-573.12.1.el6.x86_64
         system: Red Hat Enterprise Linux Server 6.7 Santiago

@aabognah
Contributor

aabognah commented Mar 9, 2016

Removing the git_pillar_provider: gitpython option from the master config file, so that Salt uses the default pygit2 provider, seems to have resolved the issue for me (a config sketch follows). I do not get corrupted pillar data anymore, so the issue seems to have been with GitPython.
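In config terms, the change described above amounts to something like this in /etc/salt/master (a sketch; pygit2 is only picked as the default provider when it is actually installed):

# /etc/salt/master
# before: forcing the GitPython provider
#git_pillar_provider: gitpython
# after: drop the option entirely, or set it explicitly so pygit2 is used
git_pillar_provider: pygit2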

@terminalmage
Contributor

This seems to be related to #31293, which was caused by concurrent master funcs attempting to evaluate git_pillar at the same time and hitting a race condition. I have addressed this in this pull request, which was opened last night.

Anyone who is willing to test can either use this GitHub walkthrough to check out the pull request into your git clone (a generic sketch follows), or wait until it is merged and install from the head of the 2015.8 branch. Only the master needs to be updated.
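For anyone unfamiliar with testing a pull request, the generic GitHub flow looks roughly like this; the PR number is not spelled out above, so <PR-NUMBER> is a placeholder, and 'upstream' is assumed to be a remote pointing at the saltstack/salt repository:

git fetch upstream pull/<PR-NUMBER>/head:git-pillar-race-fix
git checkout git-pillar-race-fix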

@anlutro
Contributor

anlutro commented Apr 21, 2016

I switched back to gitfs in production using 2015.8.8.2 and haven't run into any issues so far.

@terminalmage
Contributor

@anlutro Thanks for confirming, I'll go ahead and close this.

@anlutro
Contributor

anlutro commented Apr 27, 2016

I am seeing a lot of these now instead:

2016-04-27 13:17:51,834 [ WARNING] [12961] [salt.utils.gitfs] Update lock file is present for git_pillar remote 'master REDACTED', skipping. If this warning persists, it is possible that the update process was interrupted, but the lock could also have been manually set. Removing /var/cache/salt/master/git_pillar/0fee6ef19f5d8fea99738e1d23b5f4a79616c41661d58ec009e485f062130a38/.git/update.lk or running 'salt-run cache.clear_git_lock git_pillar type=update' will allow updates to continue for this remote.
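Spelled out, the two remedies the warning itself offers are the following; the cache hash is the one from the message above and will differ per remote:

# remove the stale update lock directly
rm /var/cache/salt/master/git_pillar/0fee6ef19f5d8fea99738e1d23b5f4a79616c41661d58ec009e485f062130a38/.git/update.lk

# or have Salt clear it
salt-run cache.clear_git_lock git_pillar type=update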

@anlutro
Contributor

anlutro commented Apr 27, 2016

I'll open a separate issue for it, I think I see a pattern.

@lsh-0

lsh-0 commented Jun 20, 2016

@anlutro did you find a satisfactory solution to all those "Update lock file is present" issues?

@anlutro
Contributor

anlutro commented Jun 20, 2016

#32888

@EvaSDK
Contributor

EvaSDK commented Jul 15, 2016

Hello there, I am currently running 2015.8.10+ds-1 and 2016.3.1+ds-1 from the SaltStack Debian repository and am seeing the same issue (tested with salt '*' saltutil.refresh_pillar):

# salt-master --versions-report
Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: 0.8.6
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: 0.6.4
      gitpython: 2.0.2
          ioflo: Not Installed
         Jinja2: 2.7.3
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: Not Installed
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: debian 8.5 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: Linux
        version: debian 8.5 

@EvaSDK
Contributor

EvaSDK commented Jul 15, 2016

OK, after cleaning up the cache like so:

# rm /var/cache/salt/gitfs/* /var/cache/salt/git_pillar/*
# rm /var/cache/salt/minion/*

Pillar is now returning good data, but the master's log now shows a worrying error message: 2016-07-15 14:18:35,696 [salt.template ][ERROR ][2584] Template does not exist: (yes, with no template name given). Hopefully this isn't related.
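For comparison, on a default install the master-side caches usually live under /var/cache/salt/master, so an equivalent cleanup would look roughly like the following (a sketch only, assuming the default cachedir and a systemd-managed master):

systemctl stop salt-master
rm -rf /var/cache/salt/master/gitfs /var/cache/salt/master/git_pillar
systemctl start salt-master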

@terminalmage
Contributor

@EvaSDK Please open a new issue, and provide the information requested in the issue template to assist us in troubleshooting. Feel free to link to this issue.
