New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file.managed erased content of the file #33223

Open
kt97679 opened this Issue May 12, 2016 · 13 comments

Comments

Projects
None yet
3 participants
@kt97679
Contributor

kt97679 commented May 12, 2016

Description of Issue/Question

Hi folks,

I manage /etc/pam.d/ files using salt and here is excerpt from the salt state that does this:

{% for file in pam_files %}
/etc/pam.d/{{ file }}:
  file.managed:
    - mode: 644
{% if (some-condition) %}
    - source: salt://ad/exception/common-session
{% else %}
    - source: salt://ad/pam.d/{{ grains['os'] }}/{{ file }}
{% endif %}
    - template: jinja
{% endfor %}

here is content of the common-account file:

#
# /etc/pam.d/common-account - authorization settings common to all services
#
# This file is included from other service-specific PAM config files,
# and should contain a list of the authorization modules that define
# the central access policy for use on the system.  The default is to
# only deny service to users whose accounts are expired in /etc/shadow.
#
# As of pam 1.0.1-6, this file is managed by pam-auth-update by default.
# To take advantage of this, it is recommended that you configure any
# local modules either before or after the default block, and use
# pam-auth-update to manage selection of other modules.  See
# pam-auth-update(8) for details.
#

# here are the per-package modules (the "Primary" block)
account [success=1 new_authtok_reqd=done default=ignore]        pam_unix.so 
# here's the fallback if no module succeeds
account requisite                       pam_deny.so
# prime the stack with a positive return value if there isn't one already;
# this avoids us returning an error just because nothing sets a success code
# since the modules above will each just jump around
account required                        pam_permit.so
# and here are more per-package modules (the "Additional" block)
account sufficient                                      pam_localuser.so 
account [default=bad success=ok user_unknown=ignore]    pam_sss.so 
# end of pam-auth-update config

After running salt state using salt-ssh this file got empty on the target machine. From the log I saw, that salt really removed every single line from the file. This happened so far only once and I can't reproduce this, but this change broke machine severely. All cron jobs stopped running, there was no way to login to the machine even using predeployed root ssh key. Nevertheless I'm very concerned by this case. I see 2 possible ways for this to happen. Either wrong jinja template was packed in the bundle that was sent to the target machine or rendering failed on the target machine itself. Please let me know if you know what caused issue above and what can be done to avoid this in the future.

Setup

(Please provide relevant configs and/or SLS files (Be sure to remove sensitive info).)
Provided above.

Steps to Reproduce Issue

(Include debug logs if possible and relevant.)
Unfortunately I can't reproduce this issue.

Versions Report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 2015.8.7

Dependency Versions:
Jinja2: 2.8
M2Crypto: Not Installed
Mako: Not Installed
PyYAML: 3.11
PyZMQ: 15.2.0
Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
RAET: Not Installed
Tornado: 4.3
ZMQ: 4.1.2
cffi: Not Installed
cherrypy: Not Installed
dateutil: Not Installed
gitdb: Not Installed
gitpython: Not Installed
ioflo: Not Installed
libgit2: Not Installed
libnacl: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.7
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pygit2: Not Installed
python-gnupg: Not Installed
smmap: Not Installed
timelib: Not Installed

System Versions:
dist: Ubuntu 14.04 trusty
machine: x86_64
release: 3.13.0-85-generic
system: Ubuntu 14.04 trusty

@Ch3LL

This comment has been minimized.

Contributor

Ch3LL commented May 13, 2016

I also am not able to replicate this. Here is my test case:

➜  salt git:(tagv2015.8.7) cat /srv/salt/issues/33223.sls 
{% set pam_files = ['testfile', 'testfile2'] %}
{% for file in pam_files %}
/etc/pam.d/{{ file }}:
  file.managed:
    - mode: 644
{% if grains['os'] == 'CentOS' %}
    - source: salt://testfile
{% else %}
    - source: salt://testfile2
{% endif %}
    - template: jinja
{% endfor %}
➜  salt sudo salt-ssh 'ch3ll-vm*' state.sls issues.33223
ch3ll-vmware:
----------
          ID: /etc/pam.d/testfile
    Function: file.managed
      Result: True
     Comment: File /etc/pam.d/testfile updated
     Started: 09:17:21.870249
    Duration: 249.15 ms
     Changes:   
              ----------
              diff:
                  New file
              mode:
                  0644
----------
          ID: /etc/pam.d/testfile2
    Function: file.managed
      Result: True
     Comment: File /etc/pam.d/testfile2 updated
     Started: 09:17:22.119680
    Duration: 34.276 ms
     Changes:   
              ----------
              diff:
                  New file
              mode:
                  0644

Summary for ch3ll-vmware
------------
Succeeded: 2 (changed=2)
Failed:    0
------------
Total states run:     2

And the files are not empty:

➜  salt sudo salt-ssh 'ch3ll-vm*' cmd.run "cat /etc/pam.d/{testfile,testfile2}"
ch3ll-vmware:
    testfile
    testfile

I'm not sure what would have caused this, although some debug logs would be very useful in this situation. @kt97679 : Do you happen to have the debug logs to provide when this occurred?

@Ch3LL Ch3LL added the Info Needed label May 13, 2016

@Ch3LL Ch3LL added this to the Blocked milestone May 13, 2016

@kt97679

This comment has been minimized.

Contributor

kt97679 commented May 13, 2016

@Ch3LL sorry, I don't have debug logs. I totally understand that value of this issue without robust way to reproduce is low. But I'm also very concerned, because salt manages critical parts of our infrastructure and if this will happen again we have a good chance to face production issue affecting our customers.

@Ch3LL

This comment has been minimized.

Contributor

Ch3LL commented May 13, 2016

@kt97679 I completely understand the concern. My only guess is that some reason the jinja did not work correctly and when file.managed does not have a source and the file does not exist it will create an empty file. But I'm guessing the files you are managing do already exist so this would not apply either.

@cachedout do you have any ideas as to why the above might have happened? Or ideas on how to possibly replicate this.

@kt97679

This comment has been minimized.

Contributor

kt97679 commented May 13, 2016

@Ch3LL in our infrastructure we have daily scheduled job that applies certain salt states (including pam configuration) to all machines (6k+ servers). So we are still at risk.

@Ch3LL

This comment has been minimized.

Contributor

Ch3LL commented May 18, 2016

@kt97679 do the /etc/pam.d/* files you are managing on the boxes already exist? Also can you supply a sanitized version of the roster file for the box this occurred on?

@kt97679

This comment has been minimized.

Contributor

kt97679 commented May 18, 2016

yes, pam file existed before this happened. Unfortunately I can't provide roster file, it is generated on the fly by our wrapper, but it is very straightforward, just label and host for each machine. Here is diff generated by the salt:

"changes": {"diff": "--- \n+++ \n@@ -1,
27 +0,
0 @@\n-#\n-# /etc/pam.d/common-account - authorization settings common to all services\n-#\n-# This file is included from other service-specific PAM config files,
\n-# and should contain a list of the authorization modules that define\n-# the central access policy for use on the system.  The default is to\n-# only deny service to users whose accounts are expired in /etc/shadow.\n-#\n-# As of pam 1.0.1-6,
 this file is managed by pam-auth-update by default.\n-# To take advantage of this,
 it is recommended that you configure any\n-# local modules either before or after the default block,
 and use\n-# pam-auth-update to manage selection of other modules.  See\n-# pam-auth
-update(8) for details.\n-#\n-\n-# here are the per-package modules (the \"Primary\" block)\n-account\t[success=1 new_authtok_reqd=done default=ignore]\tpam_unix.so \n-# here's the fallback if no module succeeds\n-account\trequisite\t\t\tpam_deny.so\n-# prime the stack with a positive return value if there isn't one already;\n-# this avoids us returning an error just because nothing sets a success code\n-# since the modules above will each just jump around\n-account\trequired\t\t\tpam_permit.so\n-# and here are more per-package modules (the \"Additional\" block)\n-account\tsufficient\t\t\t\t\tpam_localuser.so \n-account\t[default=bad success=ok user_unknown=ignore]\tpam_sss.so \n-# end of pam-auth-update config\n"}}

Unfortunately this is all information I have at this point.

@kt97679

This comment has been minimized.

Contributor

kt97679 commented Jun 29, 2016

I looked at the salt state once again and noticed that jinja templating is not needed for all pam files. So I made templating conditional and added alerting. I've got alert today that salt created empty file that is not using jinja templating so this increases chance that this issue may be caused by salt itself.

@kt97679

This comment has been minimized.

Contributor

kt97679 commented Jul 2, 2016

Got another alert today, checked salt_state.tgz on the client, file is empty in the archive. Something fishy happens during archive creation.

@kt97679

This comment has been minimized.

Contributor

kt97679 commented Jul 7, 2016

I think I found race condition in the fileclient.py. I'm going to check if following fix will work:

--- fileclient.py.orig  2016-07-07 12:22:22.860338370 -0700
+++ fileclient.py       2016-07-07 13:49:02.436409000 -0700
@@ -11,6 +11,7 @@
 import os
 import shutil
 import ftplib
+import uuid
 
 # Import salt libs
 from salt.exceptions import (
@@ -1047,6 +1048,7 @@
             load['gzip'] = gzip
 
         fn_ = None
+        dest_tmp = None
         if dest:
             destdir = os.path.dirname(dest)
             if not os.path.isdir(destdir):
@@ -1085,11 +1087,12 @@
                 if not fn_:
                     with self._cache_loc(data['dest'], saltenv) as cache_dest:
                         dest = cache_dest
+                        dest_tmp = "{0}/{1}".format(os.path.dirname(dest), str(uuid.uuid4()))
                         # If a directory was formerly cached at this path, then
                         # remove it to avoid a traceback trying to write the file
                         if os.path.isdir(dest):
                             salt.utils.rm_rf(dest)
-                        fn_ = salt.utils.fopen(dest, 'wb+')
+                        fn_ = salt.utils.fopen(dest_tmp, 'wb+')
                 if data.get('gzip', None):
                     data = salt.utils.gzip_util.uncompress(data['data'])
                 else:
@@ -1112,6 +1115,8 @@
 
         if fn_:
             fn_.close()
+            if dest_tmp:
+                os.rename(dest_tmp, dest)
             log.info(
                 'Fetching file from saltenv {0!r}, ** done ** {1!r}'.format(
                     saltenv, path
@kt97679

This comment has been minimized.

Contributor

kt97679 commented Jul 18, 2016

Patched salt-ssh is running for a week on 6k+ machines, no empty files so far.

@kt97679

This comment has been minimized.

Contributor

kt97679 commented May 2, 2018

I checked that this is still an issue on the release 2018.3.0. Any chance this can be fixed?

@rallytime

This comment has been minimized.

Contributor

rallytime commented May 2, 2018

Hey @kt97679 - Looks like you've already got a good grasp on the fix. :)

Would you mind submitting a pull request?

@kt97679

This comment has been minimized.

Contributor

kt97679 commented May 2, 2018

Sure, #47440

rallytime added a commit to rallytime/salt that referenced this issue May 9, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment