
Transport TCP minions don't reconnect/recover #39463

Closed
githubcdr opened this issue Feb 16, 2017 · 12 comments
@githubcdr
Contributor

githubcdr commented Feb 16, 2017

Description of Issue/Question

When using "tcp" as transport option I noticed that minions don't alway reconnect when connection to master is lost. Simply restart the salt-master and minions are lost... forever :(

Setup

Salt master running on CentOS 6.8, with 250+ minions on CentOS 5, 6, and 7. The network is stable and this issue seems to be related to 2016.11.2 only.

Steps to Reproduce Issue

  • Run salt-master -l debug (default config, normal worker threads, and the "transport: tcp" option; see the config sketch after this list)
  • Run salt-minion -l debug
  • Stop and restart the salt-master

Sometimes minions reconnect, but most of the time I see errors like this:

Failing minion

[DEBUG   ] tcp stream to 3xxx5:4505 closed, unable to recv
[DEBUG   ] Sending event: tag = __master_disconnected; data = {'_stamp': '2017-02-16T20:53:23.504616', 'master': 'saltstack'}
[DEBUG   ] Minion of "saltstack" is handling event tag '__master_disconnected'
[INFO    ] Connection to master saltstack lost
[DEBUG   ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/minion', 'minion', 'tcp://37xxxx5:4506', 'clear')
[ERROR   ] Exception in callback <functools.partial object at 0x5afd520>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/tornado/ioloop.py", line 591, in _run_callback
    ret = callback()
  File "/usr/lib64/python2.6/site-packages/tornado/stack_context.py", line 274, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/tornado/ioloop.py", line 597, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/tcp.py", line 423, in connect_callback
    yield self.send_id(self.tok, self._reconnected)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/tcp.py", line 408, in send_id
    yield self.auth.authenticate()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltClientError: Attempt to authenticate with the salt master failed with timeout error
[ERROR   ] Got response for message_id 1 that we are not tracking

Working minion

[DEBUG   ] tcp stream to 3xxxx5:4505 closed, unable to recv
[DEBUG   ] Sending event: tag = __master_disconnected; data = {'_stamp': '2017-02-16T20:56:35.601754', 'master': 'saltstack'}
[DEBUG   ] Minion of "saltstack" is handling event tag '__master_disconnected'
[INFO    ] Connection to master saltstack lost
[DEBUG   ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/minion', 'minion', 'tcp://3xxxx5:4506', 'clear')
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_7cdd3fe6c4_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_7cdd3fe6c4_pull.ipc
[DEBUG   ] Initializing new IPCClient for path: /var/run/salt/minion/minion_event_7cdd3fe6c4_pull.ipc
[DEBUG   ] Sending event: tag = salt/auth/creds; data = {'_stamp': '2017-02-16T20:56:51.170782', 'creds': {'publish_port': 4505, 'aes': 'yQfqC+JngW7sKlUDqUbjAAvGZSJJliu0YT9ZMlxy66SBCemXVSXvjJK5t7KDpzBgehxfIvqkgkw=', 'master_uri': 'tcp://37.203.216.75:4506'}, 'key': ('/etc/salt/pki/minion', 'minion', 'tcp://3xxxx5:4506')}
[DEBUG   ] tcp stream to 3xxxx5:4506 closed, unable to recv
[DEBUG   ] Sending event: tag = __master_connected; data = {'_stamp': '2017-02-16T20:56:51.173179', 'master': 'saltstack'}
[DEBUG   ] Initializing new AsyncTCPReqChannel for ('/etc/salt/pki/minion', 'minion', 'tcp://37.203.216.75:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion', 'tcp://37.203.216.75:4506')
[DEBUG   ] Minion of "saltstack" is handling event tag 'salt/auth/creds'
[DEBUG   ] Updating auth data for ('/etc/salt/pki/minion', 'minion', 'tcp://3xxxx5:4506'): {'publish_port': 4505, 'aes': 'yQfqC+JngW7sKlUDqUbjAAvGZSJJliu0YT9ZMlxy66SBCemXVSXvjJK5t7KDpzBgehxfIvqkgkw=', 'master_uri': 'tcp://3xxxx5:4506'} -> {'publish_port': 4505, 'aes': 'yQfqC+JngW7sKlUDqUbjAAvGZSJJliu0YT9ZMlxy66SBCemXVSXvjJK5t7KDpzBgehxfIvqkgkw=', 'master_uri': 'tcp://3xxxx5:4506'}
[DEBUG   ] Minion of "saltstack" is handling event tag '__master_connected'
[INFO    ] Connection to master saltstack re-established

I'm sure the salt-master has enough resources and no I/O wait, etc. I even disabled reactors and increased worker threads, but no dice. The result seems to depend on "luck": when restarting the salt-master, sometimes minions reconnect.
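
To illustrate the tuning mentioned above (the values here are hypothetical, not the actual config from this report), the master-side changes amount to something like:

# /etc/salt/master (excerpt)
transport: tcp
# raised from the default of 5 while testing
worker_threads: 20
# reactor configuration removed entirely while testing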

Versions Report

Salt Version:
           Salt: 2016.11.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.4.0
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: 1.1.0
         Jinja2: 2.5.5
        libgit2: Not Installed
        libnacl: 1.4.0
       M2Crypto: 0.21.1
           Mako: 1.0.0
   msgpack-pure: 0.1.3
 msgpack-python: 0.4.5
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
         Python: 2.6.8 (unknown, Aug 19 2015, 14:55:21)
   python-gnupg: Not Installed
         PyYAML: 3.08
          PyZMQ: 14.5.0
           RAET: 0.4.2
          smmap: Not Installed
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.0.5

System Versions:
           dist: redhat 5.11 Final
        machine: x86_64
        release: 2.6.18-417.el5
         system: Linux
        version: CentOS 5.11 Final

@Ch3LL
Contributor

Ch3LL commented Feb 16, 2017

Looks like I'm able to replicate this. I replicated it using your exact steps, but when I stopped the master I had to wait 5-10 minutes before starting it up again to see this behavior. We will definitely need to get this fixed.

@Ch3LL Ch3LL added Bug broken, incorrect, or confusing behavior Core relates to code central or existential to Salt severity-high 2nd top severity, seen by most users, causes major problems P2 Priority 2 Transport labels Feb 16, 2017
@Ch3LL Ch3LL added this to the Approved milestone Feb 16, 2017
@githubcdr
Contributor Author

Thanks, this issue has cost me some hair ;)

@Talkless
Contributor

I noticed a similar exception after upgrading to 2016.3.5 on Debian Jessie minions. A few times a week I get this:

2017-02-14 15:34:53,628 [tornado.application][ERROR   ][2101] Future exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run 
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 434, in wrap_callback
    payload = yield self._decode_messages(messages)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run 
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run 
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 411, in _decode_messages
    ret = yield self._decode_payload(payload)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run 
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run 
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/dist-packages/salt/transport/mixins/auth.py", line 47, in _decode_payload
    yield self.auth.authenticate()
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run 
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltClientError: Attempt to authenticate with the salt master failed with timeout error

I can't say that the minions actually lose connections, though; they seem to be accessible without restarting. The master keeps running 24/7.

@Ch3LL Ch3LL added ZRELEASED - 2016.11.4 severity-critical top severity, seen by most users, serious issues labels Feb 17, 2017
@gtmanfred gtmanfred self-assigned this Mar 31, 2017
@cachedout cachedout modified the milestones: Nitrogen 7, Approved Apr 7, 2017
@gtmanfred gtmanfred added the fixed-pls-verify fix is linked, bug author to confirm fix label Apr 10, 2017
@gtmanfred
Contributor

@githubcdr we have a pending PR that should fix this if you want to take a look #40614
Thanks!
Daniel

@githubcdr
Contributor Author

I will give it a shot this week; work is kinda busy.

@Ch3LL
Contributor

Ch3LL commented Apr 11, 2017

Will close since this is resolved. Please let us know if there is an issue with the fix and we need to re-open.

@Ch3LL Ch3LL closed this as completed Apr 11, 2017
@austinpapp
Contributor

Will this be making its way into 2016.11.4?

@Ch3LL
Contributor

Ch3LL commented Apr 14, 2017

Yes

@qihanglys

Could you tell me whether the above problem is solved? I am using 2016.11.3 and also encountered the same problem.

@gtmanfred
Contributor

It is in 2016.11.4

@qihanglys

To confirm, is it solved in version 2016.11.4?

@gtmanfred
Contributor

I fixed it in 2016.11.4
