
Minion startup extremely delayed when first master in failover multi-master setup is down #30183

Closed
jakehilton opened this issue Jan 6, 2016 · 21 comments
Labels: Bug, Confirmed, Core, fixed-pls-verify, Multi-Master, P2 Priority 2, severity-medium

@jakehilton

My minion config looks like so:

master:
  - smaster2-dusw.xxxx.com
  - smaster1-dusw.xxxx.com
master_type: failover
master_alive_interval: 10

When I start up my minion, I would expect that if smaster2 is down or unresponsive, it would try the next master in the list right away.

As stated here: https://docs.saltstack.com/en/latest/topics/highavailability/index.html
"Changing the master_type parameter from str to failover will cause minions to connect to the first responding master in the list of masters."

Instead, it seems to try the first master until all retries are complete, and only then moves on.

Here is the output.

# salt-minion -l debug
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: dusw1-content17
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[INFO    ] Processing `log_handlers.sentry`
[DEBUG   ] No 'sentry_handler' key was found in the configuration
[INFO    ] The `log_handlers.sentry.setup_handlers()` function returned `False` which means no logging handler was configured on purpose. Continuing...
[DEBUG   ] Configuration file path: /etc/salt/minion
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[INFO    ] Setting up the Salt Minion "dusw1-content17"
[DEBUG   ] Created pidfile: /var/run/salt-minion.pid
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[INFO    ] The salt minion is starting up
[INFO    ] Minion is starting as user 'root'
[DEBUG   ] AsyncEventPublisher PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] AsyncEventPublisher PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[INFO    ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[INFO    ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Minion 'dusw1-content17' trying to tune in
[DEBUG   ] sync_connect_master
[INFO    ] Got list of available master addresses: ['smaster2-dusw.xxxx.com', 'smaster1-dusw.xxxx.com']
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (6337)
[DEBUG   ] Setting zmq_reconnect_ivl to '6337ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'clear')
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:26:05.664440'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb472f788>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'
[DEBUG   ] SaltReqTimeoutError, retrying. (1/7)
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:27:05.759040'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb46d2b50>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

Here is the 7th attempt log:

[DEBUG   ] SaltReqTimeoutError, retrying. (7/7)
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:33:06.504627'
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb46d2af8>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

So it goes through all 7 tries and then finally shows this:

[INFO    ] Master smaster2-dusw.xxxx.com could not be reached, trying next master (if any)
[WARNING ] Master ip address changed from 172.31.32.38 to 172.31.47.127
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (1245)
[DEBUG   ] Setting zmq_reconnect_ivl to '1245ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'clear')
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Re-using SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[DEBUG   ] LazyLoaded timezone.get_offset
[DEBUG   ] LazyLoaded cmd.run
[INFO    ] Executing command ['date', '+%z'] in directory '/root'
[DEBUG   ] output: +0000
[DEBUG   ] LazyLoaded config.merge
[DEBUG   ] LazyLoaded mine.update
[INFO    ] Added mine.update to scheduler
[INFO    ] Updating job settings for scheduled job: __mine_interval
[DEBUG   ] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-06T19:33:57.585724', 'complete': True, 'schedule': {'__master_alive': {'function': 'status.master', 'seconds': 10, 'jid_include': True, 'maxrunning': 2, 'kwargs': {'connected': False, 'master': 'smaster2-dusw.xxxx.com'}}, '__mine_interval': {'function': 'mine.update', 'jid_include': True, 'minutes': 60, 'maxrunning': 2}}}
[DEBUG   ] Persisting schedule
[INFO    ] Updating job settings for scheduled job: __master_alive
[DEBUG   ] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-06T19:33:57.613148', 'complete': True, 'schedule': {'__master_alive': {'function': 'status.master', 'seconds': 10, 'jid_include': True, 'maxrunning': 1, 'kwargs': {'connected': True, 'master': 'smaster1-dusw.xxxx.com'}}, '__mine_interval': {'function': 'mine.update', 'jid_include': True, 'minutes': 60, 'maxrunning': 2}}}
[DEBUG   ] Persisting schedule
[DEBUG   ] Handling event '/salt/minion/minion_schedule_add_complete\n\n\x83\xa6_stamp\xba2016-01-06T19:33:57.585724\xa8complete\xc3\xa8schedule\x82\xae__master_alive\x85\xa8function\xadstatus.master\xa7seconds\n\xabjid_include\xc3\xaamaxrunning\x02\xa6kwargs\x82\xa9connected\xc2\xa6master\xbasmaster2-dusw.xxxx.com\xaf__mine_interval\x84\xa8function\xabmine.update\xabjid_include\xc3\xa7minutes<\xaamaxrunning\x02'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[INFO    ] Minion is ready to receive requests!

All in all it took over 10 minutes for the minion to fail over to the secondary master. That seems flawed. Is there any way to speed that up?

Thank you!

Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.22
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-74-generic
         system: Ubuntu 14.04 trusty
@Ch3LL
Contributor

Ch3LL commented Jan 7, 2016

@jakehilton I believe this is related to #24243 and #29567

I am inclined to label this a duplicate, but I have one question for further clarification: once your minion does fail over to the second master, are you able to run commands against the minion from the second master?

Some of the other issues reported for failover multi-master state that the minion does not fail over unless you change the master_alive_interval setting. In my testing I found that if I changed master_alive_interval to 15 or below, the minion would fail over, but I would not be able to run commands against the minion.

@Ch3LL Ch3LL added the Bug, Core, severity-high, Multi-Master, and P2 Priority 2 labels Jan 7, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Jan 7, 2016
@Ch3LL Ch3LL added the info-needed and severity-medium labels and removed the severity-high label Jan 7, 2016
@jakehilton
Author

I just ran a test and can verify that I can send commands against the minion once it has connected to the secondary master.

@Ch3LL
Contributor

Ch3LL commented Jan 11, 2016

I am going to test this today or tomorrow, because this behavior is a little different from the other issue reports. I'll update my findings after I test. Thanks for the update.

@Ch3LL
Contributor

Ch3LL commented Jan 12, 2016

@jakehilton it looks like I can recreate this, although I am not seeing these errors at all:

[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb472f788>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

Here is what my output looks like while the minion is running through the SaltReqTimeoutError retries:

2016-01-12 23:23:58,465 [salt.transport.zeromq][DEBUG   ][19032] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.10:4506', 'clear')
2016-01-12 23:24:58,531 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (1/7)
2016-01-12 23:25:58,591 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (2/7)
2016-01-12 23:26:58,651 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (3/7)
2016-01-12 23:27:58,712 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (4/7)
2016-01-12 23:28:58,772 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (5/7)
2016-01-12 23:29:58,833 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (6/7)
2016-01-12 23:30:58,893 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (7/7)
2016-01-12 23:31:58,961 [salt.minion      ][INFO    ][19032] Master 192.168.50.10 could not be reached, trying next master (if any)
2016-01-12 23:31:58,961 [salt.minion      ][WARNING ][19032] Master ip address changed from 192.168.50.10 to 192.168.50.11
2016-01-12 23:31:58,961 [salt.crypt       ][DEBUG   ][19032] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.11:4506')
2016-01-12 23:31:58,963 [salt.transport.zeromq][DEBUG   ][19032] Generated random reconnect delay between '1000ms' and '11000ms' (1720)

This process took about 10 minutes, just as you stated previously. I agree with you that this is a long time to fail over to the other master when the first master is down at minion startup.
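
For reference, a rough back-of-the-envelope sketch (Python, not Salt code) of where most of that time goes, assuming the 60-second spacing and 7 retries visible in the timestamps above:

# Rough arithmetic only: one initial attempt plus 7 SaltReqTimeoutError
# retries, each waiting about 60 seconds, before the minion gives up on
# the first master and tries the next one.
attempts = 1 + 7          # initial try + retries (1/7 .. 7/7)
timeout_seconds = 60      # spacing between the timestamped retries above
print(attempts * timeout_seconds)   # 480 seconds, i.e. about 8 minutes of retries alone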

Also of note, when it fails over I see this stack trace:

[DEBUG   ] schedule.handle_func: The scheduled job __master_alive was not started, 1 already running
[ERROR   ] Unhandled exception running status.master
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 710, in handle_func
    channel.send(load)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 73, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 83, in _block_future
    return future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 231, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 199, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 185, in _do_transfer
    tries=tries,
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
[DEBUG   ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160112223537410440
^C[INFO    ] The salt minion is shut down
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
2016-01-12 23:31:59,891 [salt.loaded.int.module.cmdmod][DEBUG   ][19032] output: +0000

Thank you for the report.

@meggiebot

We need this fixed for the March point release, 2015.8.8.

@DmitryKuzmenko DmitryKuzmenko added the Confirmed label Feb 19, 2016
@DmitryKuzmenko
Contributor

Currently (on the develop branch) the described use case works properly with the TCP transport, but it hangs the minion forever with ZeroMQ. PR #31364 makes the ZeroMQ transport behave again as described in this issue, because it fixes a ZeroMQ transport timeout-handling issue.

The issue described here lies deeper, inside the core of the transport logic. I'm continuing to work on it.

@DmitryKuzmenko
Contributor

Fixed by disabling auth retries when multi-master is set to failover mode.
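
For anyone following along, a minimal sketch (not the actual Minion code; the option handling is simplified and the values are illustrative) of what that change amounts to, matching the PR #31382 backport diff quoted later in this thread:

# Simplified sketch: in failover multi-master mode, don't retry auth
# against an unreachable master -- give up after the first timeout and
# move on to the next master in the list.
opts = {'master_type': 'failover', 'auth_tries': 7}   # illustrative values
if opts.get('master_type') == 'failover':
    opts['auth_tries'] = 0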

@Ch3LL
Contributor

Ch3LL commented Mar 3, 2016

@DmitryKuzmenko I'm re-opening this because I just tested on the head of 2015.8 again and I am still seeing this particular issue with multi-master failover. I'm guessing those two PRs still need to be added to 2015.8.

@Ch3LL Ch3LL reopened this Mar 3, 2016
@DmitryKuzmenko
Contributor

@Ch3LL thank you for checking. I'll re-test this when I'm less busy.

@meggiebot meggiebot modified the milestones: B -1, B 2 Mar 10, 2016
@meggiebot

@DmitryKuzmenko can you re-test this now that 2015.8.8 is live?

@DmitryKuzmenko
Contributor

@meggiebot my PR was ported into 2015.5 and then merged into 2015.8. During the merge the fix was lost. I've created a new PR for 2015.8.

@meggiebot

@DmitryKuzmenko So this did not make it into 2015.8.8 but will be in the next release, 2015.8.9?

@DmitryKuzmenko
Contributor

@meggiebot it sounds like it.
Before the release I checked that all critical bugfixes were merged, but I checked by commit IDs. Here is how I checked this one:

$ git tag --contains 3d09c3b7a326d9e705e4714ea1b937e40407193d
v2015.8.8
$ git show 3d09c3b7a326d9e705e4714ea1b937e40407193d
commit 3d09c3b7a326d9e705e4714ea1b937e40407193d
Author: Dmitry Kuzmenko <dmitry.kuzmenko@dsr-company.com>
Date:   Wed Feb 24 19:00:32 2016 +0300

    Set auth retry count to 0 if multimaster mode is failover.

    Backport of PR #31382 into 2015.5.

diff --git a/salt/minion.py b/salt/minion.py
index 8413c91..2c99279 100644
--- a/salt/minion.py
+++ b/salt/minion.py
@@ -815,6 +815,7 @@ class Minion(MinionBase):
                     opts['master_active_list'] = opts['master']
                     if opts.get('master_shuffle'):
                         shuffle(opts['master_list'])
+                    opts['auth_tries'] = 0
                 elif isinstance(opts['master'], str):
                     # We have a string, but a list was what was intended. Convert.
                     # See issue 23611 for details

But the actual change was lost during the merge of 2015.5 into 2015.8, and unfortunately git doesn't show this; it could only be found by code review.

@DmitryKuzmenko DmitryKuzmenko added the fixed-pls-verify label Apr 7, 2016
@meggiebot

@DmitryKuzmenko so the fix for this issue on 2015.8 is #32143?

@Ch3LL
Contributor

Ch3LL commented Apr 7, 2016

@DmitryKuzmenko I tested this again at the head of 2015.8

[root@saltminion2 ~]# salt --version
salt 2015.8.8-281-g51fb2ac (Beryllium)

It appears that it is failing over, but I have two concerns:

  1. It takes about 30-60 seconds to fail over each time, even though I have the following setting in the minion config file: master_alive_interval: 15. Is this expected behavior?
  2. I am seeing a stack trace when it fails over:
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.11:4506')
[INFO    ] Minion is ready to receive requests!
[DEBUG   ] Handling event '/salt/minion/minion_schedule_add_complete\n\n\x83\xa6_stamp\xba2016-04-07T21:37:02.750031\xa8complete\x
c3\xa8schedule\x82\xae__master_alive\x85\xa8function\xadstatus.master\xa7seconds\x0f\xabjid_include\xc3\xaamaxrunning\x01\xa6kwarg
s\x82\xa9connected\xc3\xa6master\xad192.168.50.11\xaf__mine_interval\x84\xa8function\xabmine.update\xabjid_include\xc3\xa7minutes<
\xaamaxrunning\x02'
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 714, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/ioloop.py", line 115, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib64/python2.7/site-packages/zmq/sugar/poll.py", line 71, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] LazyLoaded status.master

@DmitryKuzmenko
Contributor

@Ch3LL the error message should be fixed by 7bd97d6, which isn't in 2015.8.8. Should I cherry-pick the commit from the original pull request #31164 and send a new PR for 2015.8.8?

@DmitryKuzmenko
Contributor

#31364 is also missing from 2015.8.8; it should fix the timeout problem. Before that PR, timeouts were handled incorrectly at the transport level.

@meggiebot

#32441

@Ch3LL
Contributor

Ch3LL commented Apr 11, 2016

When my minion config looks like the following:

master:
  - 192.168.50.10
  - 192.168.50.11

master_type: failover
master_alive_interval: 5

It still takes about 30-60 seconds before it fails over to the other master. But when I change it to the following:

master:
  - 192.168.50.10
  - 192.168.50.11

master_type: failover
master_alive_interval: 5
auth_timeout: 10
auth_tries: 0

It now fails over within 10 seconds. I am guessing this is expected, because in this issue the minion is initially attempting to connect to a master, so it is using the auth_timeout and auth_tries settings. @DmitryKuzmenko is that correct? My concern is that a user would expect to only need to configure master_alive_interval to control how long it takes to fail over to the next master in the list on startup, instead of also needing to change the auth_* settings.
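
To illustrate the distinction, a rough sketch with a made-up helper (not Salt code): the startup delay is governed by auth_timeout and auth_tries, while master_alive_interval only matters once the minion is already connected.

# Illustrative helper only: worst-case seconds before a starting minion
# abandons an unreachable master -- one initial attempt plus auth_tries
# retries, each waiting auth_timeout seconds.
def startup_failover_delay(auth_timeout, auth_tries):
    return auth_timeout * (auth_tries + 1)

print(startup_failover_delay(60, 7))   # values seen in the logs above: 480 s
print(startup_failover_delay(10, 0))   # auth_timeout: 10, auth_tries: 0 -> 10 s
# Once connected, the __master_alive schedule re-checks the master every
# master_alive_interval seconds instead.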

@DmitryKuzmenko DmitryKuzmenko modified the milestones: B -2, B -1 Apr 11, 2016
@DmitryKuzmenko
Contributor

At least I can say it's not a bug; it's how it's programmed.
But I agree, our timeout configuration isn't obvious enough.

@Ch3LL
Contributor

Ch3LL commented Apr 12, 2016

@DmitryKuzmenko thanks for the clarification. I'm going to go ahead and close this now since it's now working as expected. Thanks for all your help.

@Ch3LL Ch3LL closed this as completed Apr 12, 2016