nodegroups compound change between 2015.5.3 and 2015.8.10/2016.3.0 #33553

Closed
MelkorLord opened this issue May 26, 2016 · 4 comments
Labels
Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt P2 Priority 2 Regression The issue is a bug that breaks functionality known to work in previous releases. severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around stale
Milestone

Comments

@MelkorLord

MelkorLord commented May 26, 2016

Description of Issue/Question

I manage several sets of servers. Each set is "one master + several minions", and the sets are independent of each other. On one set, I upgraded Salt from 2015.5.3 to the latest version (2016.3.0) to check for functional regressions.

The nodegroups configuration now produces different results after the update.

Setup

salt-master relevant configuration part

nodegroups:
  Linux: '* and G@kernel:Linux and not G@virtual_subtype:LXC'
  LXC: '* and G@kernel:Linux and G@virtual_subtype:LXC'

This allows me to target the physical servers (Linux) and LXC containers.
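To make the expected behaviour concrete, here is a minimal sketch of how these two expressions should partition the minions. It is a simplified stand-in, not Salt's real matcher: it only understands `*`, `G@key:value` with exact matching, and the `and`/`or`/`not` operators. The minion IDs and grains are hypothetical.

```python
# Simplified stand-in for compound matching; NOT Salt's real matcher.
# Supports only '*', 'G@key:value' (exact match) and the and/or/not operators.

NODEGROUPS = {
    "Linux": "* and G@kernel:Linux and not G@virtual_subtype:LXC",
    "LXC": "* and G@kernel:Linux and G@virtual_subtype:LXC",
}

# Hypothetical minions and grains, for illustration only.
MINIONS = {
    "host1": {"kernel": "Linux"},
    "lxc1": {"kernel": "Linux", "virtual_subtype": "LXC"},
    "lxc2": {"kernel": "Linux", "virtual_subtype": "LXC"},
}


def matches(expr, grains):
    """Return True if the simplified compound expression matches the grains."""
    terms = []
    for token in expr.split():
        if token in ("and", "or", "not"):
            terms.append(token)
        elif token == "*":
            terms.append("True")
        elif token.startswith("G@"):
            key, _, value = token[2:].partition(":")
            terms.append(str(grains.get(key) == value))
        else:
            raise ValueError("unsupported token: %r" % token)
    # Every term is now an operator or a literal True/False, so eval is safe here.
    return eval(" ".join(terms), {"__builtins__": {}})


def targets(group):
    """List the minion IDs a nodegroup is expected to select."""
    return sorted(m for m, g in MINIONS.items() if matches(NODEGROUPS[group], g))


print(targets("Linux"))  # ['host1']
print(targets("LXC"))    # ['lxc1', 'lxc2']
```

With correct matching, the two groups never overlap: a minion is in "LXC" exactly when its `virtual_subtype` grain is `LXC`, and in "Linux" otherwise.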

Steps to Reproduce Issue

With version 2015.5.3 (and all versions I worked with before that), salt '*' test.ping, salt -N Linux test.ping and salt -N LXC test.ping work exactly as expected.

Starting with version 2015.8.10, and including 2016.3.0, we have:

  • salt '*' test.ping => Works as expected : OK
  • salt -N LXC test.ping => Returns True from all LXC targets, then stalls for some time and returns Minion did not return. [Not connected] from all physical servers, which should NOT have been targeted in the first place!
  • salt -N Linux test.ping => Returns
No minions matched the target. No command was sent, no jid was assigned.
ERROR: No return received
@MelkorLord MelkorLord changed the title nodegroups compound change between 2015.5.3 and 2015.8.10 nodegroups compound change between 2015.5.3 and 2015.8.10/2016.3.0 May 26, 2016
@jfindlay jfindlay added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around Regression The issue is a bug that breaks functionality known to work in previous releases. P2 Priority 2 Core relates to code central or existential to Salt labels May 26, 2016
@jfindlay jfindlay added this to the Approved milestone May 26, 2016
@jfindlay
Contributor

@MelkorLord, thanks for the report.

@MelkorLord
Author

Hi,

I've taken some time to thoroughly investigate this issue from a user perspective. Here are my findings:

I used one of the systems I manage: a single host running Ubuntu Server 14.04 with 6 LXC instances, all of them Ubuntu 14.04 built by hand (debootstrap + a few postinstall scripts).

One of the LXC instances is a salt-master, and every other LXC instance and the host run a salt-minion. The host and LXC instances had been running Salt 2015.5.3 without trouble for a long time.

I only worked on the salt-master (dedicated LXC instance) to see where the problem lies. I decided to upgrade Salt gradually, one step at a time.

1/ Change the APT repo and key to point to the new SaltStack repo (I was using the old PPA repo) and upgrade to the latest release of the 2015.5 branch, which is 2015.5.10. The upgrade went well and, surprise: the "nodegroups" issue described above does not show up, everything seems to work fine. This is strange, but OK, proceed.

2/ Upgrade to the 2015.8 branch, which is 2015.8.10. Same as above! This is strange: on my other system, the "nodegroups" issue clearly shows up!

3/ Upgrade to the 2016.3 branch, which is 2016.3.0. A lot more packages were pulled in to upgrade Salt. The logs showed a complaint (the log is a single line; I broke it up for readability):

[ERROR   ] salt.log.setup: An un-handled exception was caught by salt's global exception
handler:
#012 OSError: [Errno 3] No such process
#012 Traceback (most recent call last):#012  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
#012    func(*targs, **kargs)#012  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
#012    p._popen.terminate()#012  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
#012    os.kill(self.pid, signal.SIGTERM)
#012 OSError: [Errno 3] No such process
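The `#012` sequences in the log above are most likely the syslog daemon's escape for a newline character (octal 012). As a small sketch, assuming that is the case here, the original multi-line traceback can be recovered like this; the `sample` string is just a fragment of the log above.

```python
def unescape_syslog(line):
    """Turn '#012' newline escapes back into real line breaks."""
    return line.replace("#012", "\n")


# Fragment of the log entry quoted above, joined back into one line.
sample = (
    "handler:#012 OSError: [Errno 3] No such process"
    "#012 Traceback (most recent call last):"
)
print(unescape_syslog(sample))
```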

At this point, invoke-rc.d salt-master stop works but leaves orphan "salt-master" processes that I have to kill by hand (SIGTERM). This happens every time I stop the salt-master.

Anyway, the "nodegroups" issue is still not showing up which is really confusing now!

4/ Then I remembered: I had read something about cleaning up /var/cache/salt/master after upgrades, especially since the "hash_type" in the "master" config must be changed. So I stopped the salt-master, cleaned up the master cache and restarted the salt-master. OK, now the "nodegroups" issue is back as "expected"!

Obviously, something that was "cached" in some form allowed the salt-master to behave correctly even after major upgrades, but the "nodegroups" handling breaks once that cache is cleared.
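Since clearing the cache is what triggers the breakage, it may be worth moving the cache aside rather than deleting it, so it can be inspected or put back later. The following is a minimal sketch of that idea; the function name and the backup location are assumptions, and the salt-master should be stopped before running it.

```python
import os
import shutil
import time


def backup_and_clear(cache_dir, backup_root):
    """Move cache_dir into a timestamped folder under backup_root,
    then recreate cache_dir as an empty directory."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_dir = os.path.join(backup_root, "master-cache-" + stamp)
    os.makedirs(backup_root, exist_ok=True)
    shutil.move(cache_dir, backup_dir)
    os.makedirs(cache_dir)
    # Copy permission bits from the old directory; ownership is NOT copied
    # and may need fixing by hand if salt-master does not run as root.
    shutil.copystat(backup_dir, cache_dir)
    return backup_dir
```

For example, `backup_and_clear("/var/cache/salt/master", "/root/salt-cache-backups")` would leave an empty cache directory and return the path of the saved copy (the backup path here is hypothetical).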

I hope this helps pin-point the source of the problem. Sorry for being so lengthy but I think more is better when trying to debug something :-)

@MelkorLord
Author

Hi,

I'm getting really annoyed with Salt's behaviour; it is unpredictable at best in the current situation...

I took some more actions to see what happened.

1/ Downgrade 2016.3.0 to 2015.8.10. Whether I keep /var/cache/salt/master or delete it makes no difference: the "nodegroups" issue shows up.

2/ Downgrade 2015.8.10 to 2015.5.10. Same as above

3/ Downgrade 2015.5.10 to 2015.5.3 (from PPA). Same as above.

This is a problem: even going back to the original version does not fix it :-(

Fortunately, I backed up the LXC instance before playing with it. I stopped salt-master and salt-minion and I only restored "/var/cache/salt/master". Restarting the salt-master (and minion) gave me back the "nodegroups" functionality as I want it to work.
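The restore step can be sketched as follows. This is only an illustration of "replace the cache directory with a saved copy": the function name is hypothetical, the backup path depends on how the backup was taken, and salt-master and salt-minion should be stopped first, as described above.

```python
import os
import shutil


def restore_cache(backup_dir, cache_dir):
    """Replace cache_dir with a copy of backup_dir."""
    if not os.path.isdir(backup_dir):
        raise FileNotFoundError(backup_dir)
    if os.path.exists(cache_dir):
        shutil.rmtree(cache_dir)
    # symlinks=True keeps symlinks as links instead of following them.
    shutil.copytree(backup_dir, cache_dir, symlinks=True)
```

Here `cache_dir` would be "/var/cache/salt/master". Note that `copytree` preserves permission bits but not file ownership, so ownership may need to be checked afterwards.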

Obviously, there's something in the way "/var/cache/salt/master" is handled that makes Salt behave erratically. Something got broken at least before 2015.5.3.

I never had to clean up /var/cache/salt/master until I upgraded to 2016.3, which recommended it because of the "hash_type" option (deprecated md5 default).

I started using Salt with 0.17 (Ubuntu 14.04 official package), then used the PPA to upgrade to the 2014.x branch and then to 2015.5.3, where I stayed until I wanted to use repo.saltstack.com.

I hope this helps.

@stale

stale bot commented May 20, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot added the stale label May 20, 2018
@stale stale bot closed this as completed May 27, 2018