
Restarting master causes minion to hang #37238

Closed
cmclaughlin opened this issue Oct 25, 2016 · 6 comments

@cmclaughlin
Contributor

Description of Issue/Question

I use https://github.com/saltstack-formulas/salt-formula to manage Salt itself. When I was running 2015.8 I could run highstate on my master, and if the master process restarted, the job output would still be returned to the local minion and I'd see the changes applied. After upgrading to 2016.3.3, the minion on my master hangs when the master process restarts during a highstate run.

If I stop my highstate command with Ctrl+C and re-run it, it works. The changes from the first highstate were already applied in this scenario, so the second run completes with no changes.
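
For context, the pattern in play looks roughly like the following. This is only a minimal sketch using Salt's pure-Python (#!py) renderer, not the actual salt-formula states, and the source path is made up: the master config is managed and salt-master is restarted (via watch) whenever that file changes, which is exactly what restarts the master in the middle of the highstate run.

#!py
# Hedged sketch only -- not the real salt-formula states.
# Manage the master config and restart salt-master whenever it changes.

def run():
    return {
        '/etc/salt/master': {
            'file.managed': [
                # hypothetical source path, purely for illustration
                {'source': 'salt://salt/files/master.conf'},
            ],
        },
        'salt-master': {
            'service.running': [
                {'enable': True},
                # the watch requisite triggers service.mod_watch (a restart)
                # when /etc/salt/master changes, as seen in the minion log below
                {'watch': [{'file': '/etc/salt/master'}]},
            ],
        },
    }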

I noticed this in my master logs:

2016-10-25 18:22:57,094 [salt.log.setup   ][ERROR   ][20141] An un-handled exception was caught by salt's global exception handler:
OSError: [Errno 3] No such process
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/lib/python2.7/multiprocessing/util.py", line 321, in _exit_function
    p._popen.terminate()
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 171, in terminate
    os.kill(self.pid, signal.SIGTERM)
OSError: [Errno 3] No such process
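
The error itself is Python's multiprocessing atexit handler sending SIGTERM to a child PID that no longer exists. Purely as a generic illustration (this is not Salt's actual fix from the PRs below), the usual way to tolerate that race is to ignore ESRCH when signalling a child that may have already exited:

import errno
import os
import signal

def terminate_if_running(pid):
    # Send SIGTERM to pid, ignoring the race where the process already exited.
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError as exc:
        if exc.errno != errno.ESRCH:  # ESRCH == "No such process" (Errno 3)
            raise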

So I found this issue and PR:
#35480
#36555

I tried that patch and the "No such process" error went away, but I still have the problem where restarting the master causes the local minion to hang.

Here are some debug logs from my minion:

[INFO    ] File changed:

---
+++
@@ -851,4 +851,4 @@



-# TEST6
+# TEST7
[INFO    ] Completed state [/etc/salt/master] at time 18:22:55.932824 duration_in_ms=43.519
[INFO    ] Running state [salt-master] at time 18:22:55.935496
[INFO    ] Executing state service.running for salt-master
[INFO    ] Executing command ['service', 'salt-master', 'status'] in directory '/root'
[DEBUG   ] output: salt-master start/running, process 20141
[INFO    ] The service salt-master is already running
[INFO    ] Completed state [salt-master] at time 18:22:55.957080 duration_in_ms=21.584
[INFO    ] Running state [salt-master] at time 18:22:55.957265
[INFO    ] Executing state service.mod_watch for salt-master
[INFO    ] Executing command ['service', 'salt-master', 'status'] in directory '/root'
[DEBUG   ] output: salt-master start/running, process 20141
[INFO    ] Executing command ['service', 'salt-master', 'restart'] in directory '/root'
[DEBUG   ] output: salt-master stop/waiting
salt-master start/running, process 23422
[INFO    ] {'salt-master': True}
[INFO    ] Completed state [salt-master] at time 18:22:57.125567 duration_in_ms=1168.302

....

[DEBUG   ] Minion return retry timer set to 7 seconds (randomized)
[INFO    ] Returning information for job: 20161025182244052511
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '...mysaltmaster address...', 'tcp://127.0.0.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '...mysaltmaster address...', 'tcp://127.0.0.1:4506')
^[^[





[DEBUG   ] Failed to authenticate message
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '...mysaltmaster address...', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[INFO    ] User root Executing command saltutil.find_job with jid 20161025183004333300
[DEBUG   ] Command details {'tgt_type': 'glob', 'jid': '20161025183004333300', 'tgt': '*saltmaster*', 'ret': '', 'user': 'root', 'arg': ['20161025182244052511'], 'fun': 'saltutil.find_job'}
[INFO    ] Starting a new job with PID 26660
[DEBUG   ] LazyLoaded saltutil.find_job
[DEBUG   ] LazyLoaded direct_call.get
[DEBUG   ] Minion return retry timer set to 8 seconds (randomized)
[INFO    ] Returning information for job: 20161025183004333300
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', '...mysaltmaster address...', 'tcp://127.0.0.1:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', '...mysaltmaster address...', 'tcp://127.0.0.1:4506')

I don't really understand what's happening here... I guess the master restarts and the minion never reconnects to return the job. Any idea what changed in the upgrade that broke this? Do you know of any other issues or patches I could try?
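
For what it's worth, the hang matches how a bare ZeroMQ REQ socket behaves when the server goes away after a request has been sent: recv() blocks forever because the reply can never arrive. Salt's AsyncZeroMQReqChannel is tornado-based and more involved than this, so the snippet below is only a hedged pyzmq illustration of that failure mode and of the usual poll-with-timeout workaround (the timeout value is arbitrary); it is not Salt's code:

import zmq

REQUEST_TIMEOUT_MS = 5000                  # arbitrary, for illustration only
SERVER_ENDPOINT = "tcp://127.0.0.1:4506"   # the ret port shown in the logs above

ctx = zmq.Context.instance()

def request_with_timeout(payload):
    # Use a fresh REQ socket per attempt: a REQ socket whose reply was lost
    # (e.g. because the master restarted) cannot be reused and must be rebuilt.
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)         # don't block on close with unsent data
    sock.connect(SERVER_ENDPOINT)
    try:
        sock.send(payload)
        poller = zmq.Poller()
        poller.register(sock, zmq.POLLIN)
        if dict(poller.poll(REQUEST_TIMEOUT_MS)).get(sock):
            return sock.recv()
        return None                        # no reply: caller should retry / re-auth
    finally:
        sock.close()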

Versions Report

# salt --versions-report
Salt Version:
           Salt: 2016.3.3

Dependency Versions:
           cffi: Not Installed
       cherrypy: 3.2.2
       dateutil: 2.5.3
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: 0.9.1
   msgpack-pure: Not Installed
 msgpack-python: 0.4.6
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
   python-gnupg: 0.3.6
         PyYAML: 3.10
          PyZMQ: 14.0.1
           RAET: Not Installed
          smmap: 0.8.2
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-91-generic
         system: Linux
        version: Ubuntu 14.04 trusty
@gtmanfred gtmanfred added Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P2 Priority 2 TEAM Core labels Oct 26, 2016
@gtmanfred gtmanfred added this to the Approved milestone Oct 26, 2016
@gtmanfred
Contributor

#37254

It looks like @DmitryKuzmenko has already submitted a fix for this issue.

It won't make it into 2016.3.4, but it will be in 2016.3.5.

@meggiebot

@gtmanfred actually this will be in 2016.3.4.

@gtmanfred
Contributor

awesome

@meggiebot meggiebot added the fixed-pls-verify fix is linked, bug author to confirm fix label Oct 26, 2016
@meggiebot meggiebot modified the milestones: C 2, Approved Oct 26, 2016
@cmclaughlin
Contributor Author

Thanks... I tried the patch from PR #37254, but I still have the problem.

I just extracted a fresh copy of salt/log/setup.py from my apt cache and see that the changed line is the same:

grep -n "__MP_LOGGING_QUEUE_PROCESS.daemon" usr/lib/python2.7/dist-packages/salt/log/setup.py
801:    __MP_LOGGING_QUEUE_PROCESS.daemon = True

So I don't think commit c9c45a5 made it into the Debian/Ubuntu package for Salt 2016.3.3... so perhaps that's not the cause.

I also tried patching my master with salt/log/setup.py from the develop branch and still have the problem.

I'd be happy to try other things to troubleshoot if anyone has suggestions.

@DmitryKuzmenko
Contributor

@cmclaughlin got it. Thank you for testing. I'll check it.

@DmitryKuzmenko
Contributor

I've reproduced this and have it semi-fixed for now.

@DmitryKuzmenko DmitryKuzmenko modified the milestones: C 1, C 2 Nov 1, 2016
@meggiebot meggiebot removed the fixed-pls-verify fix is linked, bug author to confirm fix label Nov 1, 2016
cachedout pushed a commit that referenced this issue Nov 4, 2016
…_master_restart

Fix for #37238 salt hang on master restart
@cachedout cachedout added the fixed-pls-verify fix is linked, bug author to confirm fix label Nov 4, 2016