
salt '*' highstate returns 'minion did not return', salt [minion] highstate works #8647

Closed
Mrten opened this issue Nov 19, 2013 · 39 comments

Labels: Bug (broken, incorrect, or confusing behavior), cannot-reproduce (cannot be replicated with info/context provided), Core (relates to code central or existential to Salt), P3 (Priority 3), severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around)

@Mrten (Contributor) commented Nov 19, 2013

There's an apparent difference between targeting '*' and targeting a host directly. (0.17.2 all the way now.)

I have this state that updates salt to the latest version:

salt-minion:
  pkg:
    - latest
    - order: last  # so that an upgrade doesn't wreck the states halfway through
  service.running:
    - require:
      - pkg: salt-minion
    - enable: true
    - watch:
      - file: /etc/salt/minion.d/id.conf
      - file: /etc/salt/minion.d/tcp-keepalive.conf

It should be a no-op, since I'm on the latest version (I checked this time :)

It returns a no-op when targeted directly, but it doesn't return at all when targeted with '*'. It does run, though; I can see that because I run the minion on the console with salt-minion -d debug.

So, to re-iterate:

salt '*' state.sls all-salt-minion-config

gives me

[lots of output]
idd0010:
    Minion did not return
[more output]

but

salt idd0010 state.sls all-salt-minion-config

gives me

root@salt-master:/home/salt/conf# salt idd0010 state.sls all-salt-minion-config
Executing job with jid 20131119222944774230
-------------------------------------------

Execution is still running on idd0010
idd0010:

Summary
-----------
Succeeded: 0
Failed:   0
-----------
Total:    0

Complete output from salt-minion -d debug is on http://pastebin.com/10h2YETr

@basepi (Contributor) commented Nov 20, 2013

That's very strange. Is it just the one minion that is acting this way? And have you modified the default timeout at all? What OS is the minion?

@Mrten (Contributor, Author) commented Nov 20, 2013

Ubuntu 12.04, all the way. No, it happens across all minions. Haven't edited the timeout afaik.

@basepi (Contributor) commented Nov 20, 2013

Stranger still! The only difference between '*' targeting and exact match targeting is that we only wait for the required minions to return in the latter case. I can't imagine why the targeting would be affecting this output. I'll have to think about this.

@Mrten (Contributor, Author) commented Nov 20, 2013

What timeouts are there to tune, by the way? An 'ls' works for all minions, for example.

The weird part is that (I always run with -v; I've aliased 'salt' to 'salt -v') the batches of servers in the periodic 'Execution is still running on XXX' logging get progressively smaller, if you get my drift.

@Mrten (Contributor, Author) commented Nov 21, 2013

With salt -v -t 60 I do get results. Is there perhaps a maximum to the number of times a minion is queried?
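
For reference, that is roughly this invocation (state name taken from the commands earlier in this thread; 60 is simply the value that happened to work here):

salt -v -t 60 '*' state.sls all-salt-minion-config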

@thedrow (Contributor) commented Nov 21, 2013

What kind of networking setup do you have? Could it be an issue of Salt failing to communicate with minions and not failing gracefully?

@Mrten (Contributor, Author) commented Nov 21, 2013

Nothing has changed (AFAIK) in the network setup recently. However:
Most of the minions are VMs, distributed over seven physical servers (also minions), spread over two locations. Between the locations is a layer-2 VPN, so all minions see the same internal network.

The second location is quite small; I would recognize the set of minions over there. Everything else is either on the same switch or even on the same physical machine. I don't think the network is an issue; there are no high amounts of traffic at the times I see this (backups run overnight).

@thedrow (Contributor) commented Nov 22, 2013

I am not claiming that something has changed with your network that now prevents Salt from working. I'm claiming that Salt might have had a regression in how it communicates with minions. What happens when you disconnect the VPN?

Can you show us the logs of a hanging minion?
Can you show us system logs?

@tomasfejfar commented

I also experienced this bug. Unfortunately it's not directly reproducible: on the next run the minion returns without problems. I'm targeting all (~10) minions. I'll try increasing the timeout to 120s and see if it appears again, but it looks like the "does not respond" comes much later than the current timeout of 5s.

@basepi modified the milestones: Approved, Outstanding Bugs · Apr 21, 2014
@excavador commented

This bug affects me as well :(
2014.1.3

@basepi (Contributor) commented Apr 24, 2014

@zamotivator Can you elaborate? Does it happen every time? Same minions? Reproducible?

@excavador commented

@basepi #12246

@basepi (Contributor) commented Apr 25, 2014

@zamotivator Ah, thanks.

@pwaller (Contributor) commented Jun 18, 2014

@basepi

I think this bug is the "salt randomly does nothing" bug that I've complained about.

$ salt 'specific-machine' state.sls specific-state

real    0m9.794s
user    0m0.523s
sys 0m0.150s

I've had trouble reporting it in the past because it only happens one in hundreds of times and I've not been able to reproduce it before. Finally, it was happening repeatedly so I ran with -ldebug:

$ salt -ldebug 'specific-machine' state.sls specific-state
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Including configuration from '/etc/salt/master.d/disable-state-events.conf'
[DEBUG   ] Reading configuration from /etc/salt/master.d/disable-state-events.conf
[DEBUG   ] Missing configuration file: /home/pwaller/.salt
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Including configuration from '/etc/salt/master.d/disable-state-events.conf'
[DEBUG   ] Reading configuration from /etc/salt/master.d/disable-state-events.conf
[DEBUG   ] Missing configuration file: /home/pwaller/.salt
[DEBUG   ] LocalClientEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] LocalClientEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Checking whether jid 20140618160754499734 is still running
[DEBUG   ] get_returns for jid 20140618160759507638 sent to set(['specific-machine']) will timeout at 17:08:01
[INFO    ] jid 20140618160759507638 minions set(['specific-machine']) did not return in time

real    0m9.844s
user    0m0.504s
sys 0m0.222s

@pwaller (Contributor) commented Jun 18, 2014

In this case, it turns out the minion had stopped on specific-machine. It's unhelpful that salt gives no indication as to what is wrong. And it has definitely happened in the past when the minion was running.

@basepi (Contributor) commented Jun 18, 2014

Well, keep us posted if it happens again when the minion is running. Hopefully we'll be able to track down what's going wrong there.

@pwaller (Contributor) commented Jun 19, 2014

I have a feeling that it must be the minion crashing or being killed by the OOM killer or something of this form. My complaint is that the salt master doesn't give any output to tell you what might have gone wrong.

And then there is nothing in the logs to indicate something went wrong.

Since it's pretty rare, we should prepare now rather than wait until it next happens. What should I do to prepare the system for understanding what's wrong? The minion is very quiet by default, which I find surprising.

@basepi (Contributor) commented Jun 19, 2014

So there's nothing in the minion log, and nothing in the master log?

This is relevant for reporting back to master: #11631

As far as the minion goes, if there was an uncaught exception or something, I would definitely expect it to be in the log. Maybe change the log to debug mode on your minions in the hopes that there's information there? I'm not sure I've ever heard of a minion just stopping without putting anything in the logs, unless it was stopped externally, maybe.
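
For example, a minimal way to do that on a minion (a sketch; log_level and log_level_logfile are standard minion config options, and the drop-in path is just an example):

# /etc/salt/minion.d/logging.conf  (example drop-in)
log_level: info             # level for console/daemon output
log_level_logfile: debug    # capture debug detail in the minion log file

Then restart the minion so the new log level takes effect.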

@frumpel commented Jul 31, 2014

I keep seeing this fairly regularly with Windows minions in 2014.1.4 :(

@oyvjel commented Sep 1, 2014

I observe this on 2014.1.10:
salt '*' state.highstate : returns nothing, no errors in master or minion logs.
salt -v '*' state.highstate : lots of "Minion did not return". Usually only 1 or 2 minions respond properly. Still nothing in the logs.
salt -v -t 10 '*' state.highstate : Problem solved!(?)

Highstate is meant to be a promise that all hosts are OK, so I find this critical for confidence in Saltstack's ability to keep systems in a controlled state.

@basepi (Contributor) commented Sep 2, 2014

@oyvjel Adding a -t 10 made them all return properly? Wonder if it's one of the find_job race conditions we fixed recently?

@oyvjel commented Sep 3, 2014

Not sure if it is related, but I have to repeat "salt-run manage.status" 4-5 times before it returns the correct status.
In a test environment I have 8 KVM guests, 5 of which are up, but salt-run starts out reporting only 1 up, slowly increasing on each run to the correct number of 5. This only happens if I have not run any salt commands for a while (hours).

@basepi (Contributor) commented Sep 3, 2014

It's possible this is related to the resetting AES session that @cachedout refers to here.

@hvnsweeting (Contributor) commented

I'm using salt 2014.1.10 and I also get this problem; highstate sometimes only works if I pass -t ABIGAMOUNTOFTIME:

root@master:~# restart salt-master
salt-master start/running, process 5049
root@master:~# salt ci test.ping # probably https://github.com/saltstack/salt/issues/16518
root@master:~# salt ci test.ping
ci:
    True
root@master:~# salt ci state.highstate -v
Executing job with jid 20150116102524461622
-------------------------------------------

ci:
    Minion did not return
root@master:~# salt ci state.highstate -v -t86400
Executing job with jid 20150116102759159587
-------------------------------------------

ci:
    Data failed to compile:
----------
    The function "state.highstate" is running as PID 13311 and was started at 2015, Jan 16 10:25:24.461622 with jid 20150116102524461622

@frumpel commented Feb 11, 2015

Unfortunately -t ABIGAMOUNTOFTIME is not a viable solution in automation because it will wait for ABIGAMOUNTOFTIME even if all minions have returned.

@thedrow (Contributor) commented Feb 11, 2015

@frumpel That's another bug. Timeouts should be the maximum number of seconds to wait for responses.
Can you please open a new bug report about it?

@basepi (Contributor) commented Aug 11, 2015

Are people still running into this? Reading back through, it sounds to me like saltutil.find_job wasn't working properly to determine whether minions were done with the job. It would be useful to set gather_job_timeout: 15 (it's 5 by default) in the master config to see if that resolves these issues.
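
In master-config terms, that suggestion looks like this (a sketch; the value is the one proposed above):

# /etc/salt/master
gather_job_timeout: 15   # seconds to wait for saltutil.find_job replies (default 5)

followed by a restart of the salt-master service.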

@basepi added severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around), P3 (Priority 3), cannot-reproduce (cannot be replicated with info/context provided), and Core (relates to code central or existential to Salt), and removed severity-low (4th level, cosmetic problems, work around exists) and Regression (the issue is a bug that breaks functionality known to work in previous releases) · Aug 11, 2015
@basepi modified the milestones: Blocked, Approved · Aug 11, 2015
@kracekumar commented

Yes, I am facing a similar issue:

ubuntu@ip-172-31-25-111:/opt/switch/provisioner$ sudo salt 'worker-stage-*' state.highstate -l debug
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:00:11.155890'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20150910100011163303 sent to set(['worker-stage-1']) will timeout at 10:00:16.166048
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:00:16.173730'}
[DEBUG   ] Checking whether jid 20150910100011163303 is still running
[DEBUG   ] LazyLoaded no_return.output
worker-stage-1:
    Minion did not return. [No response]
ubuntu@ip-172-31-25-111:/opt/switch/provisioner$ sudo salt 'worker-stage-*' state.highstate -l debug -t 10
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:06:33.398880'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20150910100633405998 sent to set(['worker-stage-1']) will timeout at 10:06:43.408915
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:06:43.417091'}
[DEBUG   ] Checking whether jid 20150910100633405998 is still running
[DEBUG   ] LazyLoaded no_return.output
worker-stage-1:
    Minion did not return. [No response]
ubuntu@ip-172-31-25-111:/opt/switch/provisioner$ sudo salt '*' state.highstate -l debug -t 10
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Missing configuration file: /home/ubuntu/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:06:59.775947'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20150910100659782572 sent to set(['worker-stage-1']) will timeout at 10:07:09.785015
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2015-09-10T10:07:09.787750'}
[DEBUG   ] Checking whether jid 20150910100659782572 is still running
[DEBUG   ] LazyLoaded no_return.output
worker-stage-1:
    Minion did not return. [No response]

Version info

ubuntu@ip-172-31-25-111:/opt/switch/provisioner$ sudo salt --versions-report
                  Salt: 2015.5.3
                Python: 2.7.6 (default, Mar 22 2014, 22:59:56)
                Jinja2: 2.7.2
              M2Crypto: 0.21.1
        msgpack-python: 0.3.0
          msgpack-pure: Not Installed
              pycrypto: 2.6.1
               libnacl: Not Installed
                PyYAML: 3.10
                 ioflo: Not Installed
                 PyZMQ: 14.0.1
                  RAET: Not Installed
                   ZMQ: 4.0.4
                  Mako: Not Installed
               Tornado: Not Installed
 Debian source package: 2015.5.3+ds-1trusty1
ubuntu@ip-172-31-25-111:/opt/switch/provisioner$

@jordanrinke commented

In an older 2014.1.5 environment, adding "gather_job_timeout: 15" fixed this for us.

@ssgward commented Nov 9, 2015

@Mrten - can we close this issue or are you still seeing a problem?

@Mrten (Contributor, Author) commented Nov 10, 2015

I doubt any code is the same since then :)

@basepi (Contributor) commented Nov 11, 2015

@kracekumar Can you try increasing the gather_job_timeout as @jordanrinke mentioned?

@kracekumar commented

Thanks @basepi and @jordanrinke, increasing the timeout works.

@basepi (Contributor) commented Nov 11, 2015

I'm thinking we should probably increase the gather_job_timeout by default. It will increase the total time it takes the CLI to return for fast-running jobs when minions are offline, but it may be worth it to avoid issues such as this one.

@kracekumar commented

As of now, once a minion doesn't report, the master stops trying. How about retrying n times at an interval of x?

@basepi (Contributor) commented Nov 12, 2015

That's not correct, actually. Or it shouldn't be. What happens is the master sends a job to the minions. It then waits timeout seconds for the minions to return. If not all minions have returned, it sends a saltutil.find_job command to the minions to see which minions are still running the job. That secondary job has a timeout as well, gather_job_timeout. If minions reply that they are still running the job, then the master starts the cycle over again, waiting timeout seconds. This is how salt allows for long-running jobs.

If a minion doesn't reply to the saltutil.find_job within gather_job_timeout seconds, then the master decides that minion is offline and stops listening.

Are you seeing different behavior?
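
Put another way, the two settings driving that cycle live in the master config (a sketch; the option names are real, the values shown are the defaults mentioned in this thread):

# /etc/salt/master
timeout: 5              # how long the CLI waits for returns before sending saltutil.find_job
gather_job_timeout: 5   # how long to wait for find_job replies before declaring a minion offline

Minions that answer find_job with "still running" restart the timeout wait; minions that stay silent are dropped from the wait set.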

@basepi closed this as completed in c09243d · Nov 13, 2015
@dpkirchner commented

This same behavior occurs while running test.ping. We're seeing it while using 2015.5.3 when running 'salt -t 10 host test.ping' on a largely idle master and minion (~0.5 load average). We have a single master configuration and there is low latency between the master and minion.

Here is the debug output when the test.ping request fails:

[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-08T15:18:11.300338'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20160108151812308781 sent to set(['minion-1']) will timeout at 15:18:42.311917
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-08T15:18:42.315115'}
[DEBUG   ] Checking whether jid 20160108151812308781 is still running
[DEBUG   ] LazyLoaded no_return.output
minion-1:
    Minion did not return. [No response]

and on the minion:

2016-01-08 15:18:12,329 [salt.minion                              ][INFO    ][9714] User sudo_dpk Executing command test.ping with jid 20160108151812308781
2016-01-08 15:18:12,335 [salt.minion                              ][INFO    ][10846] Starting a new job with PID 10846
2016-01-08 15:18:12,365 [salt.minion                              ][INFO    ][10846] Returning information for job: 20160108151812308781

And here's the debug output when the test.ping request succeeds:

[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-08T15:18:03.204118'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20160108151803210905 sent to set(['minion-1']) will timeout at 15:18:33.213893
[DEBUG   ] jid 20160108151803210905 return from minion-1
[DEBUG   ] LazyLoaded nested.output
minion-1:
    True

and on the minion:

2016-01-08 15:18:03,231 [salt.minion                              ][INFO    ][9714] User sudo_dpk Executing command test.ping with jid 20160108151803210905
2016-01-08 15:18:03,236 [salt.minion                              ][INFO    ][10701] Starting a new job with PID 10701
2016-01-08 15:18:03,265 [salt.minion                              ][INFO    ][10701] Returning information for job: 20160108151803210905

The zmq connection is up throughout. tcpdump shows that the response is making it back to the master server (or at least some data is making it back). It looks like the master is either ignoring the result or dropping it on the floor.

              Salt: 2015.5.3
            Python: 2.7.3 (default, Mar 13 2014, 11:03:55)
            Jinja2: 2.6
          M2Crypto: 0.21.1
    msgpack-python: 0.1.10
      msgpack-pure: Not Installed
          pycrypto: 2.6
           libnacl: Not Installed
            PyYAML: 3.10
             ioflo: Not Installed
             PyZMQ: 13.1.0
              RAET: Not Installed
               ZMQ: 3.2.3
              Mako: 0.7.0
           Tornado: 4.0.2
 Debian source package: 2015.5.3+ds-1~bpo70+2

Unfortunately I don't have master server debug logs to show. I'll add those when I have them. (Restarting the master with debug logging enabled "fixed" the problem, naturally!)

@dpkirchner commented

This post includes debug logs from the master.

$ sudo salt -l debug minion-* test.ping
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt-master
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: salt-master
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-11T14:41:48.378035'}
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] get_iter_returns for jid 20160111144149386344 sent to set(['minion-q1b1']) will timeout at 14:42:19.389744
[DEBUG   ] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-11T14:42:19.391802'}
[DEBUG   ] Checking whether jid 20160111144149386344 is still running
[DEBUG   ] LazyLoaded no_return.output
minion-q1b1:
    Minion did not return. [Not connected]

Logs on the master for this request:

2016-01-11 14:41:48,048 [salt.client      ][DEBUG   ][21021] jid 20160111144147891289 found all minions set(['minion-q1b1'])
2016-01-11 14:41:48,322 [salt.config      ][DEBUG   ][21037] Reading configuration from /etc/salt/master
2016-01-11 14:41:48,376 [salt.config      ][DEBUG   ][21037] Using cached minion ID from /etc/salt/minion_id: salt-master
2016-01-11 14:41:48,377 [salt.config      ][DEBUG   ][21037] Missing configuration file: /root/.saltrc
2016-01-11 14:41:48,377 [salt.utils.event ][DEBUG   ][21037] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
2016-01-11 14:41:48,378 [salt.utils.event ][DEBUG   ][21037] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
2016-01-11 14:41:48,378 [salt.utils.event ][DEBUG   ][21037] Sending event - data = {'_stamp': '2016-01-11T14:41:48.378035'}
2016-01-11 14:41:49,386 [salt.master                              ][INFO    ][12897] Clear payload received with command publish
2016-01-11 14:41:49,387 [salt.utils.event                         ][DEBUG   ][12897] Sending event - data = {'_stamp': '2016-01-11T14:41:49.386669', 'minions': ['minion-q1b1']}
2016-01-11 14:41:49,387 [salt.utils.event                         ][DEBUG   ][12897] Sending event - data = {'tgt_type': 'glob', 'jid': '20160111144149386344', 'tgt': 'minion-*', '_stamp': '2016-01-11T14:41:49.386865', 'user': 'sudo_dpk', 'arg': [], 'fun': 'test.ping', 'minions': ['minion-q1b1']}
2016-01-11 14:41:49,387 [salt.master                              ][INFO    ][12897] User sudo_dpk Published command test.ping with jid 20160111144149386344
2016-01-11 14:41:49,387 [salt.master                              ][DEBUG   ][12897] Published command details {'tgt_type': 'glob', 'jid': '20160111144149386344', 'tgt': 'minion-*', 'ret': '', 'user': 'sudo_dpk', 'arg': [], 'fun': 'test.ping'}
2016-01-11 14:41:49,389 [salt.utils.lazy  ][DEBUG   ][21037] LazyLoaded local_cache.get_load
2016-01-11 14:41:49,390 [salt.client      ][DEBUG   ][21037] get_iter_returns for jid 20160111144149386344 sent to set(['minion-q1b1']) will timeout at 14:42:19.389744
2016-01-11 14:41:49,532 [salt.utils.job                           ][INFO    ][12914] Got return from minion-q1b1 for job 20160111144149386344
2016-01-11 14:41:49,533 [salt.utils.event                         ][DEBUG   ][12914] Sending event - data = {'fun_args': [], 'jid': '20160111144149386344', 'return': True, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2016-01-11T14:41:49.532646', 'fun': 'test.ping', 'id': 'minion-q1b1'}
2016-01-11 14:42:19,391 [salt.utils.event ][DEBUG   ][21037] MasterEvent PUB socket URI: ipc:///var/run/salt/master/master_event_pub.ipc
2016-01-11 14:42:19,391 [salt.utils.event ][DEBUG   ][21037] MasterEvent PULL socket URI: ipc:///var/run/salt/master/master_event_pull.ipc
2016-01-11 14:42:19,392 [salt.utils.event ][DEBUG   ][21037] Sending event - data = {'_stamp': '2016-01-11T14:42:19.391802'}
2016-01-11 14:42:19,392 [salt.client      ][DEBUG   ][21037] Checking whether jid 20160111144149386344 is still running
2016-01-11 14:42:19,393 [salt.master                              ][INFO    ][12890] Clear payload received with command publish
2016-01-11 14:42:19,394 [salt.utils.event                         ][DEBUG   ][12890] Sending event - data = {'_stamp': '2016-01-11T14:42:19.394021', 'minions': ['minion-q1b1']}
2016-01-11 14:42:19,394 [salt.utils.event                         ][DEBUG   ][12890] Sending event - data = {'tgt_type': 'glob', 'jid': '20160111144219393788', 'tgt': 'minion-*', '_stamp': '2016-01-11T14:42:19.394225', 'user': 'sudo_dpk', 'arg': ['20160111144149386344'], 'fun': 'saltutil.find_job', 'minions': ['minion-q1b1']}
2016-01-11 14:42:19,395 [salt.master                              ][INFO    ][12890] User sudo_dpk Published command saltutil.find_job with jid 20160111144219393788
2016-01-11 14:42:19,395 [salt.master                              ][DEBUG   ][12890] Published command details {'tgt_type': 'glob', 'jid': '20160111144219393788', 'tgt': 'minion-*', 'ret': '', 'user': 'sudo_dpk', 'arg': ['20160111144149386344'], 'fun': 'saltutil.find_job'}
2016-01-11 14:42:19,524 [salt.utils.job                           ][INFO    ][12881] Got return from minion-q1b1 for job 20160111144219393788
2016-01-11 14:42:19,524 [salt.utils.event                         ][DEBUG   ][12881] Sending event - data = {'fun_args': ['20160111144149386344'], 'jid': '20160111144219393788', 'return': {}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2016-01-11T14:42:19.524379', 'fun': 'saltutil.find_job', 'id': 'minion-q1b1'}

I may not have captured every line of this response -- the master is busy with other things. However, this is at least every line with either jid *3788 or jid *6344.

salt-run jobs.lookup_jid for either of these jids returns the expected True result. A subsequent test found that jobs.lookup_jid returns a result even while the original salt command line shows the 'minion did not return' error.

I've verified that none of the salt processes are out of fds or other resource limits (e.g. ulimit limits).

strace shows that the connection is active and that the client is continuously polling (select()) for output on the zmq port.

All of this points to the bug being in the server, IMO. I suspect the server is not sending the result down the zmq socket.
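
For anyone hitting the same symptom, the job-cache check described above is roughly (jid taken from the logs above; jobs.lookup_jid is the runner being referred to):

salt-run jobs.lookup_jid 20160111144149386344

which, per the report above, shows the cached True return even while the CLI prints 'Minion did not return'.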

@basepi (Contributor) commented Jan 11, 2016

@dpkirchner Interesting, thanks for the detective work. The issue you're seeing looks a bit different from this original issue. Would you mind opening a new one with the same findings?
