
Tornado: error deleting fd from IOLoop when master disconnects for a long time in multi-failover #32768

Closed
anlutro opened this issue Apr 22, 2016 · 3 comments
Labels: Bug, Core, Multi-Master, P3 (Priority 3), severity-high, Transport

anlutro (Contributor) commented on Apr 22, 2016

Using a multi-master failover configuration with a single master, if the master is shut down for a while, I get the following in the salt-minion logs on the master host:

2016-04-22 08:16:51,740 [salt.minion] [INFO] fire_master failed: master could not be contacted. Request timed out.
2016-04-22 08:16:51,741 [salt.utils.schedule] [INFO] Running scheduled job: __master_alive
2016-04-22 08:16:51,742 [salt.utils.schedule] [DEBUG] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-04-22 08:16:51,742 [salt.utils.schedule] [DEBUG] schedule: This job was scheduled with a max number of 2
2016-04-22 08:16:51,746 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun status.master: {'fun': 'status.master', 'jid': '20160422080453546650', 'pid': 1936, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,747 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun status.master: {'fun': 'status.master', 'jid': '20160422081151722659', 'pid': 2002, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,748 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun status.master: {'fun': 'status.master', 'jid': '20160422080452545186', 'pid': 1927, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,752 [salt.utils.schedule] [INFO] Running scheduled job: __update_grains
2016-04-22 08:16:51,752 [salt.utils.schedule] [DEBUG] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-04-22 08:16:51,752 [salt.utils.schedule] [DEBUG] schedule: This job was scheduled with a max number of 1
2016-04-22 08:16:51,760 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun event.fire: {'fun': 'status.master', 'jid': '20160422080453546650', 'pid': 1936, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,762 [salt.utils.schedule] [DEBUG] schedule.handle_func: adding this job to the jobcache with data {'fun': 'status.master', 'jid': '20160422081651745533', 'pid': 2052, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,763 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun event.fire: {'fun': 'status.master', 'jid': '20160422081151722659', 'pid': 2002, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,766 [salt.utils.schedule] [DEBUG] schedule.handle_func: Checking job against fun event.fire: {'fun': 'status.master', 'jid': '20160422080452545186', 'pid': 1927, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__master_alive'}
2016-04-22 08:16:51,770 [salt.transport.zeromq] [DEBUG] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-04-22 08:16:51,771 [salt.crypt] [DEBUG] Initializing new SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-04-22 08:16:51,777 [salt.utils.schedule] [DEBUG] schedule.handle_func: adding this job to the jobcache with data {'fun': 'event.fire', 'jid': '20160422081651759116', 'pid': 2054, 'id': 'orch.test.infra.vagrant.local', 'schedule': '__update_grains'}
2016-04-22 08:16:51,780 [salt.utils.event] [DEBUG] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_b1192db92a_pub.ipc
2016-04-22 08:16:51,781 [salt.utils.event] [DEBUG] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_b1192db92a_pull.ipc
2016-04-22 08:16:51,782 [salt.utils.event] [DEBUG] Sending event - data = {'_stamp': '2016-04-22T06:16:51.782290'}
2016-04-22 08:16:51,784 [salt.minion] [DEBUG] Handling event 'grains_refresh\n\n\x81\xa6_stamp\xba2016-04-22T06:16:51.782290'
2016-04-22 08:16:51,811 [salt.transport.zeromq] [DEBUG] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-04-22 08:16:51,811 [salt.crypt] [DEBUG] Initializing new SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-04-22 08:17:51,543 [salt.transport.zeromq] [DEBUG] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506', 'aes')
2016-04-22 08:17:51,543 [salt.crypt] [DEBUG] Initializing new SAuth for ('/etc/salt/pki/minion', 'orch.test.infra.vagrant.local', 'tcp://10.0.1.99:4506')
2016-04-22 08:17:51,830 [salt.transport.zeromq] [DEBUG] SaltReqTimeoutError, retrying. (1/3)
2016-04-22 08:17:51,830 [salt.transport.zeromq] [DEBUG] SaltReqTimeoutError, retrying. (1/3)
2016-04-22 08:18:51,599 [salt.transport.zeromq] [DEBUG] SaltReqTimeoutError, retrying. (1/3)
2016-04-22 08:18:51,891 [tornado.general] [DEBUG] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 115, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 71, in unregister
    idx = self._map.pop(socket)
KeyError: None

It looks like it might be related to trying to reconnect in multiple threads at the same time; "retrying. (1/3)" shows up three times within a short period.
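
The KeyError itself appears to come from pyzmq's poller bookkeeping: Poller.unregister() pops the socket from an internal dict, so if the stream's socket reference has already been cleared (None) by the time the IOLoop handler is removed, the pop fails. A minimal sketch of just that failure mode (not the actual Salt/Tornado code path; the socket setup here is purely illustrative):

import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REQ)

poller = zmq.Poller()
poller.register(sock, zmq.POLLIN)

# Stand-in for a ZMQStream whose socket was already closed and set to None
# by one reconnect attempt before another attempt removes the handler.
fd = None
poller.unregister(fd)  # raises KeyError: None, as in zmq/sugar/poll.py line 71 above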

Salt Version:
           Salt: 2015.8.8.2

Dependency Versions:
         Jinja2: 2.7.3
       M2Crypto: Not Installed
           Mako: Not Installed
         PyYAML: 3.11
          PyZMQ: 14.4.0
         Python: 2.7.9 (default, Mar  1 2015, 12:57:24)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.5
           cffi: 0.8.6
       cherrypy: 3.2.3
       dateutil: 2.2
          gitdb: 0.5.4
      gitpython: 0.3.2 RC1
          ioflo: Not Installed
        libgit2: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.2
   mysql-python: 1.2.3
      pycparser: 2.10
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: 0.8.2
        timelib: Not Installed

System Versions:
           dist: debian 8.3 
        machine: x86_64
        release: 3.16.0-4-amd64
         system: debian 8.3

Possibly related to #32609
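
For context, a minimal minion-side failover configuration of the kind described above would look roughly like this (a sketch only; apart from the master address taken from the logs, the values are illustrative and not the exact config from this environment):

# /etc/salt/minion.d/multimaster.conf (illustrative)
master:
  - 10.0.1.99
master_type: failover
master_alive_interval: 60
retry_dns: 0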

@jfindlay added the info-needed label on Apr 22, 2016
@jfindlay modified the milestones: Blocked, Approved on Apr 22, 2016
@jfindlay added the Bug, Core, Multi-Master, P3, severity-high, and Transport labels and removed the info-needed label on Apr 22, 2016
jfindlay (Contributor) commented

Thanks for reporting, @anlutro.

rallytime (Contributor) commented

@DmitryKuzmenko Can you take a look at this one?

cc @meggiebot

DmitryKuzmenko (Contributor) commented

Fixed by #31164, which is already in the 2015.8 branch and will be available in the next release.
