nodegroups compound change between 2015.5.3 and 2015.8.10/2016.3.0 #33553

MelkorLord · 2016-05-26T16:44:23Z

Description of Issue/Question

I manage several sets of servers. Each set is "one master + several minions", and the sets are independent of one another. On one set, I upgraded Salt from 2015.5.3 to the latest version (2016.3.0) to check for any functional regressions.

The nodegroups configuration now produces different results after the upgrade.

Setup

salt-master relevant configuration part

nodegroups:
  Linux: '* and G@kernel:Linux and not G@virtual_subtype:LXC'
  LXC: '* and G@kernel:Linux and G@virtual_subtype:LXC'

This allows me to target the physical servers (Linux) and LXC containers.
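To make the expected split concrete, here is a minimal sketch of how the two nodegroup expressions above should partition minions. It is not Salt's matcher: it handles only `*`-style globs, `G@key:value`, and `and`/`or`/`not`, and the minion names and grains (`host1`, `lxc1`, `lxc2`) are hypothetical.

```python
import fnmatch


def match_compound(expr, minion_id, grains):
    """Evaluate a tiny subset of compound-target syntax for one minion."""
    parts = []
    for word in expr.split():
        if word in ("and", "or", "not"):
            parts.append(word)
        elif word.startswith("G@"):
            # Grain match: G@<key>:<value>; a missing grain never matches.
            key, _, value = word[2:].partition(":")
            parts.append(str(str(grains.get(key, "")) == value))
        else:
            # Anything else is treated as a glob on the minion ID.
            parts.append(str(fnmatch.fnmatch(minion_id, word)))
    # 'parts' now holds only True/False/and/or/not, so eval is safe here.
    return eval(" ".join(parts), {"__builtins__": {}}, {})


# Hypothetical minions and grains, for illustration only.
MINIONS = {
    "host1": {"kernel": "Linux"},
    "lxc1": {"kernel": "Linux", "virtual_subtype": "LXC"},
    "lxc2": {"kernel": "Linux", "virtual_subtype": "LXC"},
}

NODEGROUPS = {
    "Linux": "* and G@kernel:Linux and not G@virtual_subtype:LXC",
    "LXC": "* and G@kernel:Linux and G@virtual_subtype:LXC",
}


def members(group):
    """Return the sorted minion IDs that a nodegroup is expected to match."""
    expr = NODEGROUPS[group]
    return sorted(m for m, g in MINIONS.items() if match_compound(expr, m, g))


print(members("Linux"))
print(members("LXC"))
```

Under these assumptions the two groups are disjoint: the physical host lands only in `Linux` and the containers only in `LXC`, which is the behaviour described for 2015.5.3.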

Steps to Reproduce Issue

With version 2015.5.3 (and all versions I worked with before that), salt '*' test.ping, salt -N Linux test.ping and salt -N LXC test.ping work exactly as expected.

Starting with version 2015.8.10, and including 2016.3.0, we have:

salt '*' test.ping => Works as expected : OK
salt -N LXC test.ping => Returns True from all LXC targets, then stalls for some time and returns Minion did not return. [Not connected] from all physical servers, which should NOT have been targeted in the first place!
salt -N Linux test.ping => Returns

No minions matched the target. No command was sent, no jid was assigned.
ERROR: No return received

jfindlay · 2016-05-26T22:10:59Z

@MelkorLord, thanks for the report.

MelkorLord · 2016-06-09T08:07:19Z

Hi,

I've taken some time to thoroughly investigate this issue from a user perspective. Here are my findings.

I used one of the systems I manage: a single host running Ubuntu Server 14.04 with 6 LXC instances, all of them Ubuntu 14.04, built by hand (debootstrap + a few post-install scripts).

One of the LXC instances is a salt-master, and every other LXC instance and the host run a salt-minion. The host and LXC instances had been running Salt 2015.5.3 without trouble for a long time.

I only worked on the salt-master (a dedicated LXC instance) to see where the problem lies. I decided to upgrade Salt gradually, step by step.

1/ Change the APT repo and key to point to the new SaltStack repo (I was using the old PPA repo) and upgrade to the latest release of the 2015.5 branch, which is 2015.5.10. The upgrade went well and, surprise: the "nodegroups" issue described above does not show up, everything seems to work fine. This is strange, but OK then, proceed.

2/ Upgrade to the 2015.8 branch, which is 2015.8.10. Same as above! This is strange: on my other system, the "nodegroups" issue is clearly showing!

3/ Upgrade to the 2016.3 branch, which is 2016.3.0. A lot more packages were pulled in to upgrade Salt. The logs showed a complaint (the log is on one line; I broke it up for readability):

[ERROR   ] salt.log.setup: An un-handled exception was caught by salt's global exception
handler:
#012 OSError: [Errno 3] No such process
#012 Traceback (most recent call last):#012  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
#012    func(*targs, **kargs)#012  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
#012    p._popen.terminate()#012  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
#012    os.kill(self.pid, signal.SIGTERM)
#012 OSError: [Errno 3] No such process
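The `#012` sequences in the log above look like a syslog daemon's octal escaping of newline characters (octal 012 is `\n`); that reading is an assumption, not something the log itself states. Under that assumption, a minimal sketch for turning such a one-line entry back into a readable traceback:

```python
import re


def unescape_syslog(line):
    """Turn '#ooo' octal escapes (e.g. '#012' for newline) back into characters."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)


# Shortened excerpt of the log entry quoted above, joined on one line.
sample = (
    "handler:#012 OSError: [Errno 3] No such process"
    "#012 Traceback (most recent call last):"
)
print(unescape_syslog(sample))
```

Note that the pattern rewrites any `#` followed by three octal digits, so it could also alter unrelated text such as an issue number written as `#012`.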

At this point, if I run invoke-rc.d salt-master stop, it works but leaves orphan "salt-master" processes that I have to kill by hand (SIGTERM). This happens every time I stop the salt-master.

Anyway, the "nodegroups" issue is still not showing up which is really confusing now!

4/ Then I remembered: I had read something about cleaning up /var/cache/salt/master after upgrades, especially since the "hash_type" in the "master" config must be changed. So I stopped the salt-master, cleaned up the master cache and restarted the salt-master. OK, now the "nodegroups" issue is back as "expected"!

Obviously, something that was "cached" in some form allowed the salt-master to behave correctly even after major upgrades, but the salt-master's "nodegroups" handling breaks once that cache is cleared.

I hope this helps pin-point the source of the problem. Sorry for being so lengthy but I think more is better when trying to debug something :-)

MelkorLord · 2016-06-09T08:59:21Z

Hi,

I'm getting really annoyed with Salt's behaviour; it is unpredictable at best in the current situation...

I took some more actions to see what happened.

1/ Downgrade 2016.3.0 to 2015.8.10. Whether I keep /var/cache/salt/master or delete it makes no difference: the "nodegroups" issue shows up.

2/ Downgrade 2015.8.10 to 2015.5.10. Same as above

3/ Downgrade 2015.5.10 to 2015.5.3 (from PPA). Same as above.

This is a problem: even going back to the original version does not fix the issue :-(

Fortunately, I backed up the LXC instance before playing with it. I stopped salt-master and salt-minion and I only restored "/var/cache/salt/master". Restarting the salt-master (and minion) gave me back the "nodegroups" functionality as I want it to work.
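Since restoring only the cache directory was enough here, snapshotting it before any clean-up makes the step reversible. The following is a minimal sketch, not a Salt tool: the helper names `snapshot_dir` and `restore_dir` are hypothetical, and it assumes the salt-master is stopped while the directory is archived or restored.

```python
import shutil
from pathlib import Path


def snapshot_dir(cache_dir, archive_base):
    """Archive cache_dir into <archive_base>.tar.gz and return the archive path.

    archive_base must live outside cache_dir.
    """
    return shutil.make_archive(str(archive_base), "gztar", root_dir=str(cache_dir))


def restore_dir(archive, cache_dir):
    """Replace cache_dir with the contents of a snapshot archive."""
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True)
    shutil.unpack_archive(archive, str(cache_dir))
```

With the master stopped, usage would look like `archive = snapshot_dir("/var/cache/salt/master", "/root/salt-master-cache")` before the clean-up and `restore_dir(archive, "/var/cache/salt/master")` to roll back; the `/root/...` destination is only an example.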

Obviously, there's something in the way "/var/cache/salt/master" is handled that makes Salt behave erratically. Whatever broke must already have been broken in 2015.5.3 or earlier.

I never had to clean up /var/cache/salt/master until I upgraded to 2016.3, which recommended it because of the "hash_type" option (the md5 default is deprecated).

I started using Salt with 0.17 (the official Ubuntu 14.04 package), then used the PPA to upgrade to the 2014.x branch and then to 2015.5.3, where I stayed until I wanted to use repo.saltstack.com.

I hope this helps.

stale · 2018-05-20T09:43:46Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

MelkorLord changed the title ~~nodegroups compound change between 2015.5.3 and 2015.8.10~~ nodegroups compound change between 2015.5.3 and 2015.8.10/2016.3.0 May 26, 2016

jfindlay added this to the Approved milestone May 26, 2016

stale bot added the stale label May 20, 2018

stale bot closed this as completed May 27, 2018
