
publish.publish should return same info as running on salt-master #4402

Closed
ahammond opened this issue Apr 5, 2013 · 31 comments
@ahammond (Contributor) commented Apr 5, 2013

When I run a command directly via salt on the salt-master, I get responses from all my minions. When I run the same command on a minion via salt-call publish.publish, I get responses from only about 3/4 of my minions.

I updated Salt and ZeroMQ to the latest stable versions:

root@salt:~# salt \*  test.version --out=yaml | sort 
api01: 0.14.0
app04: 0.14.0
app05: 0.14.0
app06: 0.14.0
data01: 0.14.0
data02: 0.14.0
db-admin03: 0.14.0
db-clone: 0.14.0
db-command02: 0.14.0
db-staging01: 0.14.0
flint-mq01: 0.14.0
flint-mux01: 0.14.0
front02: 0.14.0
front03: 0.14.0
front05: 0.14.0
front06: 0.14.0
graphite01: 0.14.0
heartbeat01: 0.14.0
jenkins01: 0.14.0
ls-broker01: 0.14.0
ls-elasticsearch01: 0.14.0
ls-indexer01: 0.14.0
ls-shipper01: 0.14.0
ls-web01: 0.14.0
mail02: 0.14.0
mq01: 0.14.0
proc01: 0.14.0
proc02: 0.14.0
repo01: 0.14.0
rsdev01: 0.14.0
rsdev02: 0.14.0
salt: 0.14.0
spf01: 0.14.0
staging01: 0.14.0
staging02: 0.14.0
staging03: 0.14.0
staging04: 0.14.0
staging05: 0.14.0
staging-heartbeat01: 0.14.0
ws-trans02: 0.14.0
ws-trans03: 0.14.0
ws-trans04: 0.14.0
ws-trans05: 0.14.0
ws-trans06: 0.14.0

And also the master:

root@salt:~# salt --versions-report
           Salt: 0.14.0
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: not installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0

I'm using the following command on the master for testing:

root@salt:~# salt '*' network.ip_addrs --out=yaml | grep ':$'  | sort
api01:
app04:
app05:
app06:
data01:
data02:
db-admin03:
db-clone:
db-command02:
db-staging01:
flint-mq01:
flint-mux01:
front02:
front03:
front05:
front06:
graphite01:
heartbeat01:
jenkins01:
ls-broker01:
ls-elasticsearch01:
ls-indexer01:
ls-shipper01:
ls-web01:
mail02:
mq01:
proc01:
proc02:
repo01:
rsdev01:
rsdev02:
salt:
spf01:
staging01:
staging02:
staging03:
staging04:
staging05:
staging-heartbeat01:
ws-trans02:
ws-trans03:
ws-trans04:
ws-trans05:
ws-trans06:

That looks like it's getting results from all minions. However:

ahammond@staging05:~$ sudo salt-call publish.publish '*' network.ip_addrs --out=yaml | grep ':$' | sort
[INFO    ] Loaded configuration file: /etc/salt/minion
[INFO    ] Package debconf-utils is not installed.
api01:
app04:
app05:
data01:
data02:
db-command02:
db-staging01:
flint-mq01:
front03:
front05:
front06:
graphite01:
heartbeat01:
jenkins01:
ls-broker01:
ls-elasticsearch01:
ls-indexer01:
ls-shipper01:
ls-web01:
mail02:
mq01:
proc01:
proc02:
repo01:
salt:
staging01:
staging02:
staging03:
staging04:
staging05:
staging-heartbeat01:
ws-trans02:
ws-trans03:
ws-trans04:
ws-trans05:
ws-trans06:

Even when I increase the timeout for publish.publish, I still don't get responses from all systems:

ahammond@staging05:~$ sudo salt-call publish.publish '*' network.ip_addrs '' 'glob' 30 --out=yaml | grep ':$' | sort
[INFO    ] Loaded configuration file: /etc/salt/minion
[INFO    ] Package debconf-utils is not installed.
api01:
app04:
app05:
app06:
data01:
data02:
db-admin03:
db-clone:
db-staging01:
flint-mq01:
flint-mux01:
front02:
front05:
front06:
heartbeat01:
ls-broker01:
ls-indexer01:
ls-shipper01:
ls-web01:
mail02:
mq01:
proc01:
proc02:
repo01:
rsdev01:
rsdev02:
salt:
spf01:
staging01:
staging02:
staging03:
staging04:
staging-heartbeat01:
ws-trans02:
ws-trans05:
ws-trans06:

I ran this command a number of times and saved the output, looking for commonalities in which minions are not responding, but all of the minions responded at least once. As near as I can tell, which minions are included in the output is random, but it always hovers around 36. Changing the command to test.ping has no effect on the number of responding minions, nor which minions respond.

I checked logfiles for both minion and master. The only error I see is "The value for the 'max_open_files' setting, 100000, is higher than what the user running salt is allowed to raise to, 4096. Defaulting to 4096."

@thatch45 (Contributor) commented Apr 5, 2013

Hmm, this is serious. We may need to update some of the peer system; I will take care of this.

@ahammond (Contributor, Author) commented Apr 9, 2013

If there's anything I can do, please let me know.

@thatch45 (Contributor) commented Apr 9, 2013

Thanks. I think this interface probably just needs to be updated with some of the changes in the event bus. It's mostly a matter of finding the time; I might be a few more days.

If you want to take a look, it is in the minion_publish method in master.py.

@ahammond (Contributor, Author) commented Apr 9, 2013

I found minion_publish. Can you point me at the relevant changes to the event bus?

@thatch45 (Contributor):

The question is which method is being used in the LocalClient. The object is called self.client or self.local, and it should use a function that rides the event bus, like cmd_iter.

@ahammond (Contributor, Author):

It calls self.local three times in the bottom third of the method. The first is https://github.com/saltstack/salt/blob/develop/salt/master.py#L1087, which doesn't look relevant. The next two are:

        if ret_form == 'clean':
            return self.local.get_returns(                             # 2nd call
                jid,
                self.ckminions.check_minions(
                    clear_load['tgt'],
                    expr_form
                ),
                timeout
            )
        elif ret_form == 'full':
            ret = self.local.get_full_returns(                         # 3rd call
                    jid,
                    self.ckminions.check_minions(
                        clear_load['tgt'],
                        expr_form
                        ),
                    timeout
                    )
            ret['__jid__'] = jid
            return ret

I don't see a cmd_iter used in any of these contexts. I drilled down into self.local.get_returns and get_full_returns in client.py.
https://github.com/saltstack/salt/blob/develop/salt/client.py#L621
https://github.com/saltstack/salt/blob/develop/salt/client.py#L665
I do not see cmd_iter used anywhere in there, either, but it looks like the place where the minion (I think) waits for other hosts to check in.

There's also a CkMinions.check_minions method that looks relevant.
https://github.com/saltstack/salt/blob/develop/salt/utils/minions.py#L195

I traced through the various self._check_foo_minions methods and did not see anything that looked relevant.

Next steps? Where can I read about event bus stuff?

@thatch45 (Contributor):

Ouch, yes, this needs to be changed over. There are some docs on the event system:
http://docs.saltstack.com/topics/event/index.html

The local client has some legacy commands that use filesystem polling. I thought I had gotten them all out, but these get_* commands still poll the filesystem.
We should be able to use something like cmd_iter instead; a straight swap should do it. Take a look at the other commands in client.py. I should go in there and beef up the docstrings.
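For illustration only (this is a toy sketch, not Salt's actual API): the difference between the two styles thatch45 describes — polling a results store until a timeout, versus consuming an event stream as returns arrive — can be shown with plain Python. The function and variable names here are hypothetical.

```python
import time

def poll_returns(store, expected, timeout=1.0, interval=0.1):
    """Filesystem-polling style: repeatedly re-read a shared store
    until all expected minions are present or the timeout expires.
    Returns seen so far when the loop ends -- late arrivals are lost."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if expected <= set(store):
            break
        time.sleep(interval)
    return dict(store)

def iter_returns(events, expected):
    """Event-bus style (the cmd_iter idea): consume each return as it
    is published, stopping once every expected minion has answered."""
    seen = {}
    for minion, ret in events:
        seen[minion] = ret
        if expected <= set(seen):
            break
    return seen

# Toy demo: with an event stream, every published return is observed.
events = [("app01", True), ("app02", True), ("db01", True)]
print(iter_returns(iter(events), {"app01", "app02", "db01"}))
```

The point of the swap is that the event-driven consumer cannot race a slow filesystem sync: it sees each return the moment it is published.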

@ahammond (Contributor, Author):

Fixed by 029bdad

@ahammond reopened this May 6, 2013
@ahammond (Contributor, Author) commented May 6, 2013

So, I just upgraded to 0.15.0 across all boxes. Still have this issue.

ahammond@staging05:~$ sudo salt-call publish.publish '*' network.ip_addrs '' 'glob' 30 --out=yaml | grep ':$' | sort
[INFO    ] Loaded configuration file: /etc/salt/minion
[INFO    ] Package debconf-utils is not installed.
api01:
app04:
app05:
app06:
chat:
data02:
db-admin03:
db-staging01:
domU-12-31-39-09-A2-16.compute-1.internal:
front02:
front06:
graphite01:
jenkins01:
ls-broker01:
ls-elasticsearch01:
proc01:
proc02:
repo01:
rsdev01:
salt:
spf01:
staging01:
staging04:
staging05:
staging-heartbeat01:
st-nimbus01:
ws-trans02:
ws-trans04:
zookeeper01:
zookeeper02:
zookeeper03:

There are currently 51 boxes, and here I have only 31 responses.

@thatch45 (Contributor) commented May 6, 2013

Very odd indeed. I could by no means reproduce this once I made the changes; we will revisit this!

@ahammond (Contributor, Author) commented May 6, 2013

Let me know if there's anything I can do to help repro the issue. We have a pretty vanilla deploy, but maybe it's environmental? How would I nail that down?

@thatch45 (Contributor) commented May 6, 2013

I wonder if bumping up the worker_threads on the master might help? What are they at now?

@ahammond (Contributor, Author) commented May 6, 2013

#worker_threads: 5

Commented out in the config (default). What would be a reasonable setting? Any other settings I should also consider adjusting?

@thatch45 (Contributor) commented May 6, 2013

The peer system can hold onto workers, which can cause some things to be dropped here. Let's try bumping it up to 15 and see if that helps.

@ahammond (Contributor, Author) commented May 6, 2013

I bumped to 15 and that seems to reduce the number of missing results. I got 45 answers. I bumped to 20 worker threads and got 46 answers pretty consistently. I note that my master's logfile has

2013-05-06 23:34:51,663 [salt.master      ][WARNING ] The value for the 'max_open_files' setting, 100000, is higher than what the user running salt is allowed to raise to, 4096. Defaulting to 4096.
2013-05-06 23:34:51,663 [salt.master      ][WARNING ] Raising max open files value to 4096
2013-05-06 23:34:51,663 [salt.master      ][WARNING ] New values for max open files soft/hard values: 4096/4096

I'm guessing there's some tuning I need to do here. Is there a wiki page with instructions for this?
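For reference, the limit that warning is about can be inspected from Python with the standard `resource` module (this is plain stdlib usage, not Salt code; the exact numbers will vary by system):

```python
import resource

# RLIMIT_NOFILE is the per-process open-file limit that the
# "max_open_files" warning above compares against.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("max open files: soft=%s hard=%s" % (soft, hard))

# An unprivileged process may only raise its soft limit up to the
# hard limit -- which is why salt-master "defaults to 4096" when the
# configured 100000 exceeds what the user is allowed to raise to.
new_soft = soft if hard == resource.RLIM_INFINITY else min(100000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

Raising the hard limit itself requires root (or a limits.conf / init-script change), which is what the `ulimit` fix discussed later in this thread does.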

@thatch45 (Contributor) commented May 6, 2013

Oh, those are always there; I need to change those log messages to info. I am wondering if there is a timeout issue somewhere. How long does it take for the data to return?

@ahammond (Contributor, Author) commented May 6, 2013

I fixed that error message by putting

ulimit 100000

in the /etc/init/salt-master.conf script stanza. That'd be a pretty easy thing to throw into the .debs. Who can I talk to about that?

Running

time sudo salt-call publish.publish '*' network.ip_addrs '' 'glob' 30 | grep ':$' | sort | wc -l

I'm getting between 11s and 15s pretty consistently.

@thatch45 (Contributor) commented May 7, 2013

Hmm, so it is hitting the timeouts. This is good information, although it just confirms that the minions are not all being caught; it might be a case of Salt being too fast for Salt as well, although I thought we had those covered. After the peer call, does the command show up in:
salt-run jobs.list_jobs
And does looking up the jid return results for all minions, or just the ones that made it back to the publish command?

This is important to hunt down, but once 0.15.1 comes out with the mine interface fixed up, it might be a faster and more reliable interface to use here.

@ahammond (Contributor, Author) commented May 7, 2013

ahammond@staging05:~$ time sudo salt-call publish.publish '*' network.ip_addrs '' 'glob' 30 | grep ':$' | sort | wc -l
[INFO    ] Configuration file path: /etc/salt/minion
[INFO    ] Package debconf-utils is not installed.
[INFO    ] Publishing 'network.ip_addrs' to tcp://209.114.36.150:4506
46

real    0m12.815s
user    0m0.890s
sys 0m0.380s

Appears in list_jobs:

'20130507013136745102':
  Arguments: []
  Function: network.ip_addrs
  Start Time: 2013, May 07 01:31:36.745102
  Target: '*'
  Target-type: glob
  User: root

And the output matches what I'm getting when I run the command directly on the master:

root@salt:/srv/salt# salt-run jobs.lookup_jid 20130507013136745102 | grep ':$' | wc -l
48
root@salt:/srv/salt# salt \* network.ip_addrs | grep ':$' | wc -l
48

So... apparently it's a matter of getting the results back to the minion that published the call?

@thatch45 (Contributor) commented May 9, 2013

Very good data, this!
I think the problem is that a few minions are returning before the publishing function starts listening to the event bus. The fix is to add a call to LocalClient.get_returns (or one of the get_* returns) to double-check via the cache before sending it back. I am on it!
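The fix described here — backfill from the master's job cache any minions that returned before the event listener attached — boils down to a dictionary merge. A hypothetical sketch, with `event_returns` and `cached_returns` standing in for the two sources (this is not Salt's actual code):

```python
def collect_returns(event_returns, cached_returns):
    """Hypothetical sketch: merge returns seen on the event bus with
    the master's job-cache entries, so minions that answered before
    the listener attached still appear in the result set."""
    merged = dict(cached_returns)   # backfill from the cache first...
    merged.update(event_returns)    # ...live event data wins on overlap
    return merged

# db01 returned before the event listener attached, so it is only
# present in the job cache; the merge recovers it.
print(collect_returns({"app01": True, "app02": True}, {"db01": True}))
```

This matches ahammond's earlier observation: `salt-run jobs.lookup_jid` showed 48 returns in the cache while publish.publish handed back only 46.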

@thatch45 (Contributor) commented May 9, 2013

OK, this is going to require a new method on the LocalClient class, so I will have this soon :)

@ghost assigned thatch45 May 9, 2013
@ahammond (Contributor, Author):

Any news on this? Can I help?

@UtahDave (Contributor):

I believe this has been fixed in the develop branch. Can you test that @ahammond ?

@ahammond (Contributor, Author):

I'm willing to do some testing. Do I need to upgrade just the master or all the minions, too?

@UtahDave (Contributor):

You'd probably have to upgrade both the master and the minions. Is that incorrect, @thatch45 ?

@thatch45 (Contributor):

Yes, I just pushed updates to git while I was on the plane that should fix this, but I have only been able to test it a little. You should only need to update the master.

@ahammond (Contributor, Author):

I upgraded the master to 665965f this afternoon.

root@salt:~# salt --versions-report 
           Salt: 0.15.0-774-g665965f
         Python: 2.7.3 (default, Aug  1 2012, 05:14:39)
         Jinja2: 2.6
       M2Crypto: 0.21.1
 msgpack-python: 0.1.10
   msgpack-pure: Not Installed
       pycrypto: 2.4.1
         PyYAML: 3.10
          PyZMQ: 13.0.0
            ZMQ: 3.2.2

And now my publish.publish call from the minion is getting zero results:

ahammond@staging05:~$ sudo salt-call publish.publish \* network.ip_addrs
local:

I checked /var/log/salt/master and found:

2013-05-24 23:22:24,298 [salt.master      ][ERROR   ] Error in function minion_publish:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1234, in run_func
    ret = getattr(self, func)(load)
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1199, in minion_publish
    for key, val in self.local.get_cache_returns(ret['__jid__']).items():
KeyError: '__jid__'
2013-05-24 23:23:05,406 [salt.master      ][ERROR   ] Error in function minion_publish:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1234, in run_func
    ret = getattr(self, func)(load)
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1199, in minion_publish
    for key, val in self.local.get_cache_returns(ret['__jid__']).items():
KeyError: '__jid__'
2013-05-24 23:23:14,797 [salt.master      ][ERROR   ] Error in function minion_publish:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1234, in run_func
    ret = getattr(self, func)(load)
  File "/usr/local/lib/python2.7/dist-packages/salt/master.py", line 1199, in minion_publish
    for key, val in self.local.get_cache_returns(ret['__jid__']).items():
KeyError: '__jid__'

@thatch45 (Contributor):

Yes, I just fixed this yesterday; there was a case where it worked and a case where it failed that I did not see initially. It should be better now.

@ahammond (Contributor, Author):

Looks good, although I see that __jid__ is being returned as part of the result set:

ahammond@staging05:~$ sudo salt-call publish.publish \* network.ip_addrs
__jid__:
    20130528184413773119
api01:
    - 10.177.219.117
...

That strikes me as a little bit odd / awkward. Will I need to filter it out of other publish.publish result sets?
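Until the metadata key is removed from the return, a caller can strip it out itself. A minimal sketch, assuming the result arrives as a plain dict shaped like the salt-call output above:

```python
def strip_meta(returns):
    """Drop dunder metadata keys such as '__jid__' from a
    publish.publish result set, leaving only per-minion returns."""
    return {k: v for k, v in returns.items()
            if not (k.startswith('__') and k.endswith('__'))}

# Shape taken from the salt-call output above.
result = {'__jid__': '20130528184413773119',
          'api01': ['10.177.219.117']}
print(strip_meta(result))  # only the 'api01' entry remains
```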

@thatch45 (Contributor):

Thanks, nice catch

@ahammond (Contributor, Author):

Thanks for fixing this! Now I've just got to get the

2013-05-28 19:25:20,239 [salt.minion      ][WARNING ] Missing arguments executing "state.sls": ArgSpec(args=['mods', 'env', 'test', 'exclude'], varargs=None, keywords='kwargs', defaults=('base', None, None))

issue resolved and my (rather complicated, but hopefully clever and useful) hosts management state will be working. 😄
