
"Minion did not return" executing state with long running command, 2016.3 regression #34872

Closed
cbuechler opened this issue Jul 22, 2016 · 42 comments
Labels: Bug, Core, P2, Pending-Discussion, severity-medium, ZRELEASED - 2016.3.4

Comments

@cbuechler

This is a setup I've been using for about a year, on a variety of 2015 versions, with no issue. Upgrading the minion to 2016.3.1 makes it stop working. I believe the trigger is long-running commands, but it may be something else specific to what I'm doing.

Take the following state:

gzcat /root/image-latest.img.gz | dd of=/dev/null bs=1M:
  cmd.run

Here image-latest.img.gz is a gzipped disk image which, in normal operation, is piped to dd and written to a hard drive; for the sake of example, /dev/null behaves in the same way. Any sufficiently large gz file (100 MB would suffice) reproduces the problem.

When executing that state from the master via salt my-deploy state.sls write-disk, the master claims the minion didn't return.

[DEBUG   ] Reading configuration from /usr/local/etc/salt/master
[DEBUG   ] Using cached minion ID from /usr/local/etc/salt/minion_id: dev-salt-master.example.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Configuration file path: /usr/local/etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG   ] Reading configuration from /usr/local/etc/salt/master
[DEBUG   ] Using cached minion ID from /usr/local/etc/salt/minion_id: dev-salt-master.example.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/66/c9f0cb3dfb31a4e0eae44ad32a108d/.minions.p
[DEBUG   ] get_iter_returns for jid 20160722001528110729 sent to set(['my-deploy']) will timeout at 00:15:33.119337
[DEBUG   ] Checking whether jid 20160722001528110729 is still running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] LazyLoaded no_return.output
my-deploy:
    Minion did not return. [Not connected]

But if you watch the minion's log, it is running the command, and it finishes successfully. So the minion works; it just apparently doesn't reply back to the master, leaving the master thinking it disconnected.

Minion log when the master reports "Minion did not return":

[INFO    ] Running state [gzcat /root/image-latest.img.gz | dd of=/dev/null bs=1M] at time 00:15:37.880888
[INFO    ] Executing state cmd.run for gzcat /root/image-latest.img.gz | dd of=/dev/null bs=1M
[INFO    ] Executing command 'gzcat /root/image-latest.img.gz | dd of=/dev/null bs=1M' in directory '/root'
[INFO    ] {'pid': 2344, 'retcode': 0, 'stderr': '0+131072 records in\n8191+1 records out\n8589888000 bytes transferred in 33.619007 secs (255506891 bytes/sec)', 'stdout': ''}
[INFO    ] Completed state [gzcat /root/image-latest.img.gz | dd of=/dev/null bs=1M] at time 00:16:11.539727 duration_in_ms=33658.839
[INFO    ] Returning information for job: 20160722001528110729

Master log from 2015.8.1, which works correctly:

[DEBUG   ] Reading configuration from /usr/local/etc/salt/master
[DEBUG   ] Using cached minion ID from /usr/local/etc/salt/minion_id: dev-salt-master.example.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] Configuration file path: /usr/local/etc/salt/master
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[DEBUG   ] Reading configuration from /usr/local/etc/salt/master
[DEBUG   ] Using cached minion ID from /usr/local/etc/salt/minion_id: dev-salt-master.example.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] LazyLoaded local_cache.get_load
[DEBUG   ] Reading minion list from /var/cache/salt/master/jobs/ac/a63465b4ec61ac13c59cb3545fa507/.minions.p
[DEBUG   ] get_iter_returns for jid 20160722042210217054 sent to set(['my-deploy']) will timeout at 04:22:15.225950
[DEBUG   ] Checking whether jid 20160722042210217054 is still running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Checking whether jid 20160722042210217054 is still running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Checking whether jid 20160722042210217054 is still running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] Checking whether jid 20160722042210217054 is still running
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/master', 'dev-salt-master.example.com_master', 'tcp://[::]:4506', 'clear')
[DEBUG   ] jid 20160722042210217054 return from my-deploy
[DEBUG   ] LazyLoaded highstate.output
[DEBUG   ] LazyLoaded nested.output

Versions Report

Not working:

Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.3
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.11 (default, Mar 17 2016, 01:24:00)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.3
            ZMQ: 4.1.4

System Versions:
           dist:
        machine: amd64
        release: 10.2-RELEASE-p14
         system: FreeBSD
        version: Not Installed

Working:

    Salt Version:
               Salt: 2015.8.1

    Dependency Versions:
             Jinja2: 2.8
           M2Crypto: Not Installed
               Mako: Not Installed
             PyYAML: 3.11
              PyZMQ: 14.7.0
             Python: 2.7.10 (default, Nov  5 2015, 01:15:39)
               RAET: Not Installed
            Tornado: 4.3
                ZMQ: 4.1.3
               cffi: Not Installed
           cherrypy: Not Installed
           dateutil: 2.3
              gitdb: Not Installed
          gitpython: Not Installed
              ioflo: Not Installed
            libnacl: Not Installed
       msgpack-pure: Not Installed
     msgpack-python: 0.4.2
       mysql-python: Not Installed
          pycparser: Not Installed
           pycrypto: 2.6.1
             pygit2: Not Installed
       python-gnupg: Not Installed
              smmap: Not Installed
            timelib: Not Installed

    System Versions:
               dist:
            machine: amd64
            release: 10.1-RELEASE-p24

The OS release isn't the deciding factor either way; FreeBSD 10.2 with any 2015 Salt version works fine.

@eradman
Contributor

eradman commented Jul 22, 2016

I think I'm running into the same issue when running tasks that may not come back for 1-3 minutes:

postgresql-9.5-replicate:
  cmd.run:
    - name: /usr/pgsql-9.5/bin/pg_basebackup -D /pg_data/9.5 -X stream -h virtdb01
    - creates: /pg_data/9.5/pg_hba.conf
    - runas: postgres

Results in

Minion did not return. [No response] 

But the task does complete. I'm using CentOS 7 with salt 2016.3.

@Ch3LL
Contributor

Ch3LL commented Jul 26, 2016

@cbuechler i'm not seeing this on 2016.3.1 using this test example:

sleep 300:
  cmd.run

Are your minions also 2016.3.1?

@Ch3LL Ch3LL added the info-needed waiting for more info label Jul 26, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Jul 26, 2016
@Ch3LL Ch3LL added the cannot-reproduce cannot be replicated with info/context provided label Jul 26, 2016
@cbuechler
Author

It only happens when the minions are 2016.3.1; the versions shown are the minion versions. The 2015.8 minion shown, on the same 2016.3.1 master, works fine.

That 'sleep 300' cmd.run exhibits the same behavior. A sleep of 15 seconds is the shortest that fails in the manner described; 14 seconds or less succeeds. The logs are just like what I included in the original report. This is on a freshly-deployed FreeBSD instance on EC2.

It seems to be specific to FreeBSD; doing the same on Linux doesn't exhibit that behavior. From an Ubuntu minion that doesn't have the issue:

Salt Version:
           Salt: 2016.3.1

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.5.0
       dateutil: 2.4.2
          gitdb: 0.6.4
      gitpython: 1.0.1
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.21.1
           Mako: 1.0.3
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Jul  1 2016, 15:12:24)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.2.0
           RAET: Not Installed
          smmap: 0.9.0
        timelib: Not Installed
        Tornado: 4.2
            ZMQ: 4.1.4

System Versions:
           dist: Ubuntu 16.04 xenial
        machine: x86_64
        release: 4.4.0-31-generic
         system: Linux
        version: Ubuntu 16.04 xenial

@cbuechler
Author

It's not just a long-running cmd.run either; applying state.highstate times out in the same fashion on FreeBSD minions even though it completes successfully. The logs are along the lines of what I included in the original post.

@eradman
Contributor

eradman commented Jul 27, 2016

This is also a problem on CentOS 7.2. I have only observed this using state.highstate. Here is a minimal test case:

# top.sls
{{saltenv}}:
  '*':
    - test_wait

# test_wait.sls
sleep 20:
  cmd.run

In my environment it appears to time out at about 18 seconds.

sudo salt '*' state.highstate saltenv=$USER

The same problem does not occur when running state.sls. It's also not intermittent--I can reproduce this reliably on multiple hosts.

@Ch3LL
Contributor

Ch3LL commented Aug 5, 2016

Hmm, I'm still not able to replicate this behavior on CentOS 7, even when adding a separate environment to the test case. Is there something unique about your environment? Anything in your master or minion configs?

@eradman
Contributor

eradman commented Aug 5, 2016

Thanks for checking back in on this. I am now not able to reproduce this problem either.

My minion/master configs are nearly default from the 2015.8 release except for the pillar and file root paths.

@cbuechler are you still observing this problem?

@thatch45
Contributor

thatch45 commented Aug 5, 2016

@eradman what changed in your environment to make this problem go away?
@cbuechler I am just curious whether increasing the timeout option helps here. We did introduce a name caching system in the CLI client in 2016.3 to speed up resolution (I think @cachedout added it); I wonder if that could be the cause.
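
As a quick way to test that timeout suggestion from Salt's Python API rather than the CLI, here is a minimal sketch; it is not part of the original discussion and assumes only the stock LocalClient interface of that era, with the target and SLS names taken from the original report.

import salt.client

# Roughly equivalent to `salt -t 300 my-deploy state.sls write-disk`:
# wait up to 300 seconds for the minion's return instead of the default.
local = salt.client.LocalClient()
result = local.cmd('my-deploy', 'state.sls', ['write-disk'], timeout=300)
print(result)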

@eradman
Contributor

eradman commented Aug 5, 2016

I was not able to identify anything that changed in my environment--I left my test configuration in a tmux session, and we didn't update any RPMs. While trying to root-cause this I did restart the master and minion daemons, but at the time that didn't solve the problem.

@thatch45
Contributor

thatch45 commented Aug 5, 2016

Ok, please let us know if it resurfaces.

Have the minion systems been under extra load when this happens?

@rbgarga

rbgarga commented Aug 5, 2016

@eradman, as @cbuechler mentioned, for us it only happens when the minion is 2016.3.1. With a 2015.8 minion it works as expected.

@thatch45
Contributor

thatch45 commented Aug 5, 2016

That's very interesting. When the minion is 2016.3.1, does the job show up in the job cache? Can you look it up with salt-run jobs.lookup_jid <jid>?

@rbgarga

rbgarga commented Aug 5, 2016

@thatch45 here is the relevant data:

# salt pfsense-deploy state.sls ec2-pfs-deploy.copy-write-image
pfsense-deploy:
    Minion did not return. [Not connected]

Then on minion side:

2016-08-05 19:51:17,786 [salt.minion                                                ][INFO    ][986] User root Executing command state.sls with jid 20160805195121484805
2016-08-05 19:51:17,807 [salt.minion                                                ][INFO    ][46849] Starting a new job with PID 46849
2016-08-05 19:51:21,328 [salt.state                                                 ][INFO    ][46849] Loading fresh modules for state activity
2016-08-05 19:51:21,782 [salt.minion                                                ][INFO    ][986] User root Executing command saltutil.find_job with jid 20160805195126878542
2016-08-05 19:51:21,803 [salt.minion                                                ][INFO    ][46852] Starting a new job with PID 46852
2016-08-05 19:51:21,817 [salt.minion                                                ][INFO    ][46852] Returning information for job: 20160805195126878542
2016-08-05 19:51:21,847 [salt.fileclient                                            ][INFO    ][46849] Fetching file from saltenv 'base', ** done ** 'ec2-pfs-deploy/copy-write-image.sls'
2016-08-05 19:51:21,861 [salt.state                                                 ][INFO    ][46849] Running state [/root/image-latest.img.gz] at time 19:51:21.860895
2016-08-05 19:51:21,861 [salt.state                                                 ][INFO    ][46849] Executing state file.managed for /root/image-latest.img.gz
2016-08-05 19:52:08,790 [salt.fileclient                                            ][INFO    ][46849] Fetching file from saltenv 'base', ** done ** 'ec2-pfs-deploy/pfSense-netgate-ec2-2.3.2-RELEASE-amd64.img.gz'
2016-08-05 19:52:09,644 [salt.state                                                 ][INFO    ][46849] File /root/image-latest.img.gz is in the correct state
2016-08-05 19:52:09,648 [salt.state                                                 ][INFO    ][46849] Completed state [/root/image-latest.img.gz] at time 19:52:09.648240 duration_in_ms=47787.345
2016-08-05 19:52:09,649 [salt.state                                                 ][INFO    ][46849] Running state [gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M] at time 19:52:09.649200
2016-08-05 19:52:09,650 [salt.state                                                 ][INFO    ][46849] Executing state cmd.run for gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M
2016-08-05 19:52:09,651 [salt.loaded.int.module.cmdmod                              ][INFO    ][46849] Executing command 'gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M' in directory '/'
2016-08-05 19:54:30,555 [salt.state                                                 ][INFO    ][46849] {'pid': 46853, 'retcode': 0, 'stderr': '0+131072 records in\n8191+1 records out\n8589888000 bytes transferred in 140.889495 secs (60968974 bytes/sec)', 'stdout': ''}
2016-08-05 19:54:30,555 [salt.state                                                 ][INFO    ][46849] Completed state [gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M] at time 19:54:30.554877 duration_in_ms=140905.677
2016-08-05 19:54:30,555 [salt.state                                                 ][INFO    ][46849] Running state [gpart recover /dev/xbd5] at time 19:54:30.555153
2016-08-05 19:54:30,555 [salt.state                                                 ][INFO    ][46849] Executing state cmd.run for gpart recover /dev/xbd5
2016-08-05 19:54:30,556 [salt.loaded.int.module.cmdmod                              ][INFO    ][46849] Executing command 'gpart recover /dev/xbd5' in directory '/'
2016-08-05 19:54:30,578 [salt.state                                                 ][INFO    ][46849] {'pid': 46856, 'retcode': 0, 'stderr': '', 'stdout': 'xbd5 recovered'}
2016-08-05 19:54:30,578 [salt.state                                                 ][INFO    ][46849] Completed state [gpart recover /dev/xbd5] at time 19:54:30.578211 duration_in_ms=23.058
2016-08-05 19:54:30,578 [salt.state                                                 ][INFO    ][46849] Running state [fsck -p /dev/xbd5p2] at time 19:54:30.578459
2016-08-05 19:54:30,579 [salt.state                                                 ][INFO    ][46849] Executing state cmd.run for fsck -p /dev/xbd5p2
2016-08-05 19:54:30,579 [salt.loaded.int.module.cmdmod                              ][INFO    ][46849] Executing command 'fsck -p /dev/xbd5p2' in directory '/'
2016-08-05 19:54:30,625 [salt.state                                                 ][INFO    ][46849] {'pid': 46857, 'retcode': 0, 'stderr': '', 'stdout': '/dev/xbd5p2: FILE SYSTEM CLEAN; SKIPPING CHECKS\n/dev/xbd5p2: clean, 1887675 free (1987 frags, 235711 blocks, 0.1% fragmentation)'}
2016-08-05 19:54:30,625 [salt.state                                                 ][INFO    ][46849] Completed state [fsck -p /dev/xbd5p2] at time 19:54:30.624970 duration_in_ms=46.511
2016-08-05 19:54:30,629 [salt.minion                                                ][INFO    ][46849] Returning information for job: 20160805195121484805

And while it was being executed, I ran this on the master:

# salt-run jobs.lookup_jid 20160805195121484805
# salt-run jobs.lookup_jid 20160805195126878542
pfsense-deploy:
    ----------

@thatch45
Contributor

thatch45 commented Aug 5, 2016

After the work is done and you run salt-run jobs.lookup_jid 20160805195126878542, does it show the job? I am wondering whether, once you are sure the job has finished running on the minion, it is in the job cache.
I am only asking because your statement leads me to think you ran lookup_jid before it had a chance to return.
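
The same job-cache lookup can also be scripted against the master. A minimal sketch, assuming the standard RunnerClient API and the master config path shown in the debug logs earlier in this thread; the jid is the state job from rbgarga's run above:

import salt.config
import salt.runner

# Load the master config and query the job cache the same way
# `salt-run jobs.lookup_jid <jid>` does.
opts = salt.config.master_config('/usr/local/etc/salt/master')
runner = salt.runner.RunnerClient(opts)
print(runner.cmd('jobs.lookup_jid', ['20160805195121484805']))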

@rbgarga

rbgarga commented Aug 5, 2016

Interesting, @thatch45. Here are the results after the job is done:

root@dev-salt-master:/usr/local/etc/salt # salt-run jobs.lookup_jid 20160805195126878542
pfsense-deploy:
    ----------
root@dev-salt-master:/usr/local/etc/salt # salt-run jobs.lookup_jid 20160805195121484805
pfsense-deploy:
    ----------
    cmd_|-fsck -p /dev/xbd5p2_|-fsck -p /dev/xbd5p2_|-run:
        ----------
        __id__:
            fsck -p /dev/xbd5p2
        __run_num__:
            3
        changes:
            ----------
            pid:
                46857
            retcode:
                0
            stderr:
            stdout:
                /dev/xbd5p2: FILE SYSTEM CLEAN; SKIPPING CHECKS
                /dev/xbd5p2: clean, 1887675 free (1987 frags, 235711 blocks, 0.1% fragmentation)
        comment:
            Command "fsck -p /dev/xbd5p2" run
        duration:
            46.511
        name:
            fsck -p /dev/xbd5p2
        result:
            True
        start_time:
            19:54:30.578459
    cmd_|-gpart recover /dev/xbd5_|-gpart recover /dev/xbd5_|-run:
        ----------
        __id__:
            gpart recover /dev/xbd5
        __run_num__:
            2
        changes:
            ----------
            pid:
                46856
            retcode:
                0
            stderr:
            stdout:
                xbd5 recovered
        comment:
            Command "gpart recover /dev/xbd5" run
        duration:
            23.058
        name:
            gpart recover /dev/xbd5
        result:
            True
        start_time:
            19:54:30.555153
    cmd_|-gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M_|-gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M_|-run:
        ----------
        __id__:
            gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M
        __run_num__:
            1
        changes:
            ----------
            pid:
                46853
            retcode:
                0
            stderr:
                0+131072 records in
                8191+1 records out
                8589888000 bytes transferred in 140.889495 secs (60968974 bytes/sec)
            stdout:
        comment:
            Command "gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M" run
        duration:
            140905.677
        name:
            gzcat /root/image-latest.img.gz | dd of=/dev/xbd5 bs=1M
        result:
            True
        start_time:
            19:52:09.649200
    file_|-/root/image-latest.img.gz_|-/root/image-latest.img.gz_|-managed:
        ----------
        __id__:
            /root/image-latest.img.gz
        __run_num__:
            0
        changes:
            ----------
        comment:
            File /root/image-latest.img.gz is in the correct state
        duration:
            47787.345
        name:
            /root/image-latest.img.gz
        pchanges:
            ----------
        result:
            True
        start_time:
            19:51:21.860895

@thatch45
Contributor

thatch45 commented Aug 5, 2016

Excellent. Then the issue can be isolated to the client: it is returning to the CLI before the job is finished, which is difficult to get right in Salt because of the disconnected nature of things. When this happens, this is the first thing to check.
The problem I see, though, is the claim that the minion version matters, so I need to ask: what are the minion and master versions? We may have a bug due to a version hiccup.

@rbgarga

rbgarga commented Aug 5, 2016

Thanks @thatch45,

When we had both the master and minion on 2015.8.1 it was working. After the master was upgraded to 2016.3.1 it kept working. The problem started when we upgraded the minion to 2016.3.1.

All the examples I sent were run with both on 2016.3.1.

@thatch45
Contributor

thatch45 commented Aug 5, 2016

Very odd indeed, but this gives us a better place to hunt. I will ask more questions as we find them.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior severity-medium 3rd level, incorrect or bad functionality, confusing and lacks a work around P2 Priority 2 Core relates to code central or existential to Salt and removed cannot-reproduce cannot be replicated with info/context provided info-needed waiting for more info labels Aug 11, 2016
@Ch3LL Ch3LL modified the milestones: Approved, Blocked Aug 11, 2016
@cro
Contributor

cro commented Oct 7, 2016

Greetings--this does not appear to be a problem in 2016.3.3, but I also can't reproduce it with the master at 2016.3.3 and the minion at 2016.3.1. The master is Ubuntu 14.04, the minion is FreeBSD 10.3. The versions report and the states I used are below. I'm happy to investigate further, and I am curious about the network setup. Here are a few questions re: the setup:

  1. Is it possible there is a NAT or other aggressive firewall in the mix that might be dropping connections?
  2. Is multiprocessing: False present in any of the master or minion configs?
  3. After a highstate or other command that returns 'Minion not connected', can you target that minion and get a response, or is it gone for good?

top.sls:

base:
  '*':
    - sleeptest

sleeptest.sls

sleep 300:
  cmd.run

state.highstate working:

root@utest:/srv/salt# salt \* state.highstate
utest:
----------
          ID: sleep 300
    Function: cmd.run
      Result: True
     Comment: Command "sleep 300" run
     Started: 19:33:22.376720
    Duration: 300017.041 ms
     Changes:
              ----------
              pid:
                  8630
              retcode:
                  0
              stderr:
              stdout:

Summary for utest
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1
Total run time: 300.017 s
fbsdtest:
----------
          ID: sleep 300
    Function: cmd.run
      Result: True
     Comment: Command "sleep 300" run
     Started: 19:33:22.393582
    Duration: 300024.806 ms
     Changes:
              ----------
              pid:
                  6967
              retcode:
                  0
              stderr:
              stdout:

Summary for fbsdtest
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1
Total run time: 300.025 s

Versions:

root@utest:/srv/salt# salt \* test.versions_report
utest:
    Salt Version:
               Salt: 2016.3.3

    Dependency Versions:
               cffi: Not Installed
           cherrypy: Not Installed
           dateutil: 1.5
              gitdb: Not Installed
          gitpython: Not Installed
              ioflo: Not Installed
             Jinja2: 2.7.2
            libgit2: Not Installed
            libnacl: Not Installed
           M2Crypto: Not Installed
               Mako: Not Installed
       msgpack-pure: Not Installed
     msgpack-python: 0.4.6
       mysql-python: Not Installed
          pycparser: Not Installed
           pycrypto: 2.6.1
             pygit2: Not Installed
             Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
       python-gnupg: Not Installed
             PyYAML: 3.10
              PyZMQ: 14.0.1
               RAET: Not Installed
              smmap: Not Installed
            timelib: Not Installed
            Tornado: 4.2.1
                ZMQ: 4.0.5

    System Versions:
               dist: Ubuntu 14.04 trusty
            machine: x86_64
            release: 3.13.0-96-generic
             system: Linux
            version: Ubuntu 14.04 trusty

fbsdtest:
    Salt Version:
               Salt: 2016.3.1

    Dependency Versions:
               cffi: Not Installed
           cherrypy: Not Installed
           dateutil: 2.5.3
              gitdb: Not Installed
          gitpython: Not Installed
              ioflo: Not Installed
             Jinja2: 2.8
            libgit2: Not Installed
            libnacl: Not Installed
           M2Crypto: Not Installed
               Mako: Not Installed
       msgpack-pure: Not Installed
     msgpack-python: 0.4.7
       mysql-python: Not Installed
          pycparser: Not Installed
           pycrypto: 2.6.1
             pygit2: Not Installed
             Python: 2.7.12 (default, Oct  4 2016, 01:33:46)
       python-gnupg: Not Installed
             PyYAML: 3.11
              PyZMQ: 15.4.0
               RAET: Not Installed
              smmap: Not Installed
            timelib: Not Installed
            Tornado: 4.4.2
                ZMQ: 4.1.5

    System Versions:
               dist:
            machine: amd64
            release: 10.3-RELEASE-p7
             system: FreeBSD
            version: Not Installed

@cro cro modified the milestones: C 2, C 3 Oct 7, 2016
@cro
Contributor

cro commented Oct 8, 2016

OK, I dug a little deeper and discovered that a similar FreeBSD 10.3 machine created in Amazon does show the issue. However, the FreeBSD machine I created in Vultr does not. So it seems to have something to do with Amazon (the configuration of that AMI, or Amazon's network).

@cypher386

This happens in our VMware environment, so it's not specific to Amazon.
We see the problem on our network, which is flat with no NAT or firewall. The templates deploy, but the salt bootstrapping process fails.
The master and minions are at defaults for virtually all options (in the minions' case this is literally true). The master might have a few options configured, but nothing special.

@cro
Contributor

cro commented Oct 11, 2016

@cypher386 are your minions FreeBSD also?

@mamalos
Contributor

mamalos commented Oct 13, 2016

I'm facing the same issue on FreeBSD 10.1-RELEASE, where I'm working on adding some FreeBSD modules, using py27-salt-2016.3.3 (master and minion on the same host). The salt commands I'm sending return with the message:

Minion did not return. [No response]

but the minion's log shows that the command is still executing and that it returns successfully when it's done, without the master being notified about it.

What I see if I run salt with the flag -t 60 is that the master decides the minion won't return a few seconds after the following messages appear in the minion's log:

[INFO    ] User root Executing command saltutil.find_job with jid 20161013121823197477
[DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20161013121823197477', 'tgt': '*', 'ret': '', 'user': 'root', 'arg': ['20161013121722815280'], 'fun': 'saltutil.find_job'}
[INFO    ] Starting a new job with PID 41389
[DEBUG   ] LazyLoaded saltutil.find_job
[DEBUG   ] LazyLoaded direct_call.get
[DEBUG   ] Minion return retry timer set to 10 seconds (randomized)
[INFO    ] Returning information for job: 20161013121823197477
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/usr/local/etc/salt/pki/minion', 'myhost', 'tcp://10.0.2.15:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/usr/local/etc/salt/pki/minion', 'fbsd10.1', 'tcp://10.0.2.15:4506')

I think the master loses the minion X seconds after this message appears, where X=10 in the above case, though I'm not sure about the exact timing. The only thing I'm sure of is that the master loses the minion only after this message has appeared (I've used different delay times with -t and always got the same behaviour).

I don't know if that helps, but I'd be glad to test your ideas on my machine in order to resolve this issue.
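
For context, the debug lines above correspond to the client-side wait loop: while the CLI waits for the real return, it periodically sends saltutil.find_job to ask the minion whether the jid is still running, and it gives up early if the minion says it is not. The following is only a rough illustration of that logic with hypothetical helper names (get_cache_returns, run_find_job), not Salt's actual source:

import time

def wait_for_return(client, jid, minion, timeout, poll_interval=10):
    # get_cache_returns and run_find_job are hypothetical helpers standing
    # in for the real job-cache lookup and the saltutil.find_job round trip.
    deadline = time.time() + timeout
    while time.time() < deadline:
        returns = client.get_cache_returns(jid)
        if minion in returns:
            return returns[minion]  # the real return arrived
        if not client.run_find_job(minion, jid):
            # The minion reported the job as no longer running, so the client
            # stops waiting and prints "Minion did not return". On FreeBSD
            # without a mounted /proc that report turns out to be wrong
            # (see the later comments in this thread).
            return None
        time.sleep(poll_interval)
    return None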

@cro
Contributor

cro commented Oct 13, 2016

@mamalos Is this a VM? If so, what hypervisor or cloud provider?

@cro
Contributor

cro commented Oct 13, 2016

Also, it's excellent that you were able to reproduce it with master and minion on the same box. That rules out an issue with the network.

@mamalos
Contributor

mamalos commented Oct 14, 2016

@cro The FreeBSD 10.1-RELEASE amd64 is running on a Virtualbox VM, but I'm experiencing the exact same issue, with the exact same configuration (master/minion on the same host) on a FreeBSD 10.2-RELEASE-p14 amd64 which is installed on an IBM workstation.

So running salt --versions-report on my machines shows:

On FreeBSD 10.1-RELEASE:

Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.5.3
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Sep 27 2016, 01:24:16)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.2
            ZMQ: 4.1.5

System Versions:
           dist:   
        machine: amd64
        release: 10.1-RELEASE
         system: FreeBSD
        version: Not Installed

and on my FreeBSD 10.2-RELEASE-p14:

Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.5.0
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.8
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.7
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.12 (default, Sep 27 2016, 01:49:05)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.4.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.1
            ZMQ: 4.1.4

System Versions:
           dist:   
        machine: amd64
        release: 10.2-RELEASE-p14
         system: FreeBSD
        version: Not Installed

It is very easy to reproduce (as others have already mentioned in this thread) by using the cmd.run module with a command that takes a long time to execute. I usually reproduce it by running salt '*' cmd.run 'freebsd-update fetch' and watching my master return a Minion did not return. [No response] message while the minion is still executing the freebsd-update command; some time later I can see the command's output in the minion's log and verify that it ran successfully. Running salt '*' cmd.run 'sleep 30' works like a charm too.

To be honest, I've only experienced this issue when using the cmdmod module, because all the commands I've run so far that behave like this are using it. On the other hand, I'm very new to salt and I'm still experimenting with it, so I haven't tried many of its features yet.

@cro
Contributor

cro commented Oct 14, 2016

Greetings. I think I fixed this. It turned out to be far simpler than we thought. Incidentally it was also broken on macOS. Let me know if I nailed it or not.

@mamalos
Contributor

mamalos commented Oct 15, 2016

@cro, bravo for finding the source of the problem! I can confirm that the message no longer appears, but _check_cmdline() still doesn't run correctly on FreeBSD (since it always returns True). More info and suggestions are on your pull request.

@rbgarga

rbgarga commented Oct 17, 2016

I just confirmed the root cause by mounting /proc before starting the command that takes a long time to return, and it worked. FreeBSD supports /proc, but it's not mounted by default and mounting it is not recommended; I only did it to validate the issue.
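
Putting the pieces together: the minion's saltutil.find_job answer relies on a /proc/<pid>/cmdline check (the _check_cmdline() mentioned above), and with no procfs mounted that check misfires, so the minion tells the master the job is gone. The snippet below is a minimal editorial sketch of that kind of check, with the sort of fallback the fix needed; it is an illustration under those assumptions, not the actual salt.utils.minion code:

import os

def job_process_alive(pid, expected_fragment):
    # If procfs is not usable (e.g. FreeBSD, where /proc exists as a mount
    # point but is not mounted by default), we cannot verify the process;
    # report it as still running rather than telling the master it is gone.
    if not os.path.exists('/proc/{0}'.format(os.getpid())):
        return True
    cmdline_path = '/proc/{0}/cmdline'.format(pid)
    if not os.path.exists(cmdline_path):
        return False  # the job's process has exited
    with open(cmdline_path, 'rb') as handle:
        cmdline = handle.read().replace(b'\x00', b' ')
    return expected_fragment.encode() in cmdline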

@meggiebot

PR was merged, closing

@ghost

ghost commented May 4, 2017

Sorry to comment on a closed topic, but this is happening on CentOS 7 + salt (minion & master) 2016.11.4-1.

@cachedout cachedout reopened this May 4, 2017
@mamalos
Contributor

mamalos commented May 8, 2017

Hi @amraam83, could you please provide some code to reproduce this?

I've set up CentOS 7 on a VM with salt 2016.11.4 (not 2016.11.4-1; I'm not using RH, so I followed Salt's guide for installing the latest version using bootstrap, and that's the version it installed. If you need me to install a different one, please give some guidance.) and ran salt '*' cmd.run 'sleep 30', which worked without any issues.

@ghost

ghost commented May 9, 2017

Hi @mamalos, it turned out it was a very nasty bug in our states (a dependency that restarted the salt-minion daemon during a highstate). Sorry for bothering :(

@mamalos
Contributor

mamalos commented May 9, 2017

Cool! Because it would be reaaally weird if this issue was affecting Linux systems as well! :)
Glad you sorted it out; I guess this issue can be closed again now.

@ntvietvn

Hi,
I have the same issue on CentOS 6. I have a salt state which launches several unit tests on the target machine (so the state takes time), and I've reproduced this issue with salt-minion versions 2016.11.5 and 2016.3.6, regardless of the master version. It works if the salt-minion is downgraded to 2015.8.13. Please note that:

  • cmd.run "sleep 60" works perfectly

  • if I set the timeout to 60 in /etc/salt/master, the state also works

Thanks for your help

@mamalos
Contributor

mamalos commented Jun 28, 2017

@ntvietvn, as stated before, this issue is unlikely to be related to a Linux machine, since Linux has a proc filesystem mounted that contains the information Salt is asking about.

As @amraam83 commented a few months ago, his analogous issue (on a CentOS 7 machine) was eventually traced to a bug in their states. Are you sure this is not the case for you? Have you run the above commands on a clean CentOS 6 installation using the salt versions you've mentioned?

@gaizeror

gaizeror commented Sep 5, 2017

Happens to me also.
Version 2017.7.1 (it has happened since 2017.7.0)
The salt master is Ubuntu 16.04
The salt-minion is Windows Server 2012 R2

We only get output when setting the timeout to 200 seconds or more.
It's hard for me to paste the salt info because we are in an offline environment; if there is something we can tell you to help, that would be great.

It doesn't matter which state it is; as long as it is a long-running state, it happens.

@mamalos
Contributor

mamalos commented Sep 6, 2017

Hi @gaizeror, if you look at the code this issue refers to, you'll see that you shouldn't have this problem on a Windows Server (at least with respect to this issue's cause).

To dig deeper, it would be good if you could submit some more details about your configuration, but based on your minion OS, I don't think your problem's cause is the same one this issue refers to.
