
Minion startup extremely delayed when first master in failover multi-master setup is down #30183

Closed
jakehilton opened this issue Jan 6, 2016 · 21 comments
Labels: Bug, Confirmed, Core, fixed-pls-verify, Multi-Master, P2 Priority 2, severity-medium

@jakehilton

My minion config looks like so:

master:
  - smaster2-dusw.xxxx.com
  - smaster1-dusw.xxxx.com
master_type: failover
master_alive_interval: 10

When I start up my minion, I would expect that if smaster2 is down or unresponsive, it would try the next master in the list right away.

As stated here: https://docs.saltstack.com/en/latest/topics/highavailability/index.html
"Changing the master_type parameter from str to failover will cause minions to connect to the first responding master in the list of masters."

Instead, it seems to try the first master until all retries are complete, and only then moves on.

Here is the output.

# salt-minion -l debug
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: dusw1-content17
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[INFO    ] Processing `log_handlers.sentry`
[DEBUG   ] No 'sentry_handler' key was found in the configuration
[INFO    ] The `log_handlers.sentry.setup_handlers()` function returned `False` which means no logging handler was configured on purpose. Continuing...
[DEBUG   ] Configuration file path: /etc/salt/minion
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[INFO    ] Setting up the Salt Minion "dusw1-content17"
[DEBUG   ] Created pidfile: /var/run/salt-minion.pid
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[INFO    ] The salt minion is starting up
[INFO    ] Minion is starting as user 'root'
[DEBUG   ] AsyncEventPublisher PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] AsyncEventPublisher PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[INFO    ] Starting pub socket on ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[INFO    ] Starting pull socket on ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Minion 'dusw1-content17' trying to tune in
[DEBUG   ] sync_connect_master
[INFO    ] Got list of available master addresses: ['smaster2-dusw.xxxx.com', 'smaster1-dusw.xxxx.com']
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (6337)
[DEBUG   ] Setting zmq_reconnect_ivl to '6337ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'clear')
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:26:05.664440'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb472f788>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'
[DEBUG   ] SaltReqTimeoutError, retrying. (1/7)
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:27:05.759040'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb46d2b50>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

Here is the 7th attempt log:

[DEBUG   ] SaltReqTimeoutError, retrying. (7/7)
[DEBUG   ] Handling event '_minion_mine\n\n\x85\xa5clear\xc2\xa3cmd\xa5_mine\xa4data\x80\xa2id\xafdusw1-content17\xa6_stamp\xba2016-01-06T19:33:06.504627'
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.32.38:4506')
[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb46d2af8>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

So it goes through all 7 tries and then finally shows this:

[INFO    ] Master smaster2-dusw.xxxx.com could not be reached, trying next master (if any)
[WARNING ] Master ip address changed from 172.31.32.38 to 172.31.47.127
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (1245)
[DEBUG   ] Setting zmq_reconnect_ivl to '1245ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'clear')
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Re-using SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Reading configuration from /etc/salt/minion
[DEBUG   ] Including configuration from '/etc/salt/minion.d/_schedule.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/_schedule.conf
[DEBUG   ] Including configuration from '/etc/salt/minion.d/uplynk.conf'
[DEBUG   ] Reading configuration from /etc/salt/minion.d/uplynk.conf
[DEBUG   ] Please install 'virt-what' to improve results of the 'virtual' grain.
[DEBUG   ] LazyLoaded timezone.get_offset
[DEBUG   ] LazyLoaded cmd.run
[INFO    ] Executing command ['date', '+%z'] in directory '/root'
[DEBUG   ] output: +0000
[DEBUG   ] LazyLoaded config.merge
[DEBUG   ] LazyLoaded mine.update
[INFO    ] Added mine.update to scheduler
[INFO    ] Updating job settings for scheduled job: __mine_interval
[DEBUG   ] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-06T19:33:57.585724', 'complete': True, 'schedule': {'__master_alive': {'function': 'status.master', 'seconds': 10, 'jid_include': True, 'maxrunning': 2, 'kwargs': {'connected': False, 'master': 'smaster2-dusw.xxxx.com'}}, '__mine_interval': {'function': 'mine.update', 'jid_include': True, 'minutes': 60, 'maxrunning': 2}}}
[DEBUG   ] Persisting schedule
[INFO    ] Updating job settings for scheduled job: __master_alive
[DEBUG   ] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_a6604d7069_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-06T19:33:57.613148', 'complete': True, 'schedule': {'__master_alive': {'function': 'status.master', 'seconds': 10, 'jid_include': True, 'maxrunning': 1, 'kwargs': {'connected': True, 'master': 'smaster1-dusw.xxxx.com'}}, '__mine_interval': {'function': 'mine.update', 'jid_include': True, 'minutes': 60, 'maxrunning': 2}}}
[DEBUG   ] Persisting schedule
[DEBUG   ] Handling event '/salt/minion/minion_schedule_add_complete\n\n\x83\xa6_stamp\xba2016-01-06T19:33:57.585724\xa8complete\xc3\xa8schedule\x82\xae__master_alive\x85\xa8function\xadstatus.master\xa7seconds\n\xabjid_include\xc3\xaamaxrunning\x02\xa6kwargs\x82\xa9connected\xc2\xa6master\xbasmaster2-dusw.xxxx.com\xaf__mine_interval\x84\xa8function\xabmine.update\xabjid_include\xc3\xa7minutes<\xaamaxrunning\x02'
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506', 'aes')
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'dusw1-content17', 'tcp://172.31.47.127:4506')
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 715, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 125, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib/python2.7/dist-packages/zmq/sugar/poll.py", line 80, in unregister
    idx = self._map.pop(socket)
KeyError: None
[INFO    ] Minion is ready to receive requests!

All in all it took over 10 minutes for the minion to fail over to the secondary master. That seems flawed. Is there any way to speed that up?

Thank you!

Salt Version:
           Salt: 2015.8.3

Dependency Versions:
         Jinja2: 2.7.2
       M2Crypto: 0.22
           Mako: 0.9.1
         PyYAML: 3.10
          PyZMQ: 14.0.1
         Python: 2.7.6 (default, Jun 22 2015, 17:58:13)
           RAET: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.0.4
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.2
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
        libnacl: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.3.0
   mysql-python: 1.2.3
      pycparser: Not Installed
       pycrypto: 2.6.1
         pygit2: Not Installed
   python-gnupg: Not Installed
          smmap: Not Installed
        timelib: Not Installed

System Versions:
           dist: Ubuntu 14.04 trusty
        machine: x86_64
        release: 3.13.0-74-generic
         system: Ubuntu 14.04 trusty
@Ch3LL
Contributor

Ch3LL commented Jan 7, 2016

@jakehilton I believe this is related to #24243 and #29567

I am inclined to label this a duplicate, but I have one question for further clarification: once your minion does fail over to the second master, are you able to run commands against the minion from the second master?

Some of the other issues reported for failover multi-master state that the minion does not fail over unless you change the master_alive_interval setting. In my testing I found that if I changed master_alive_interval to 15 or below, the minion would fail over, but I would not be able to run commands against the minion.

@Ch3LL Ch3LL added the Bug, Core, severity-high, Multi-Master, and P2 Priority 2 labels Jan 7, 2016
@Ch3LL Ch3LL added this to the Blocked milestone Jan 7, 2016
@Ch3LL Ch3LL added the info-needed and severity-medium labels and removed the severity-high label Jan 7, 2016
@jakehilton
Author

I just ran a test and can verify that I can send commands against the minion once it has connected to the secondary master.

@Ch3LL
Contributor

Ch3LL commented Jan 11, 2016

I am going to test this today or tomorrow, because this behavior is a little different from the other issue reports. I'll update my findings after I test. Thanks for the update.

@Ch3LL
Contributor

Ch3LL commented Jan 12, 2016

@jakehilton it looks like I can recreate this, although I am not seeing these errors at all:

[ERROR   ] Exception in callback <functools.partial object at 0x7f2cb472f788>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 592, in _run_callback
    ret = callback()
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/ioloop.py", line 598, in <lambda>
    self.add_future(ret, lambda f: f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 230, in wrapper
    yielded = next(result)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1528, in handle_event
    self._mine_send(package)
  File "/usr/lib/python2.7/dist-packages/salt/minion.py", line 1498, in _mine_send
    load['tok'] = self.tok
AttributeError: 'Minion' object has no attribute 'tok'

Here is what my output looks like while the minion is running through the SaltReqTimeoutError retries:

2016-01-12 23:23:58,465 [salt.transport.zeromq][DEBUG   ][19032] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.10:4506', 'clear')
2016-01-12 23:24:58,531 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (1/7)
2016-01-12 23:25:58,591 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (2/7)
2016-01-12 23:26:58,651 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (3/7)
2016-01-12 23:27:58,712 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (4/7)
2016-01-12 23:28:58,772 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (5/7)
2016-01-12 23:29:58,833 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (6/7)
2016-01-12 23:30:58,893 [salt.transport.zeromq][DEBUG   ][19032] SaltReqTimeoutError, retrying. (7/7)
2016-01-12 23:31:58,961 [salt.minion      ][INFO    ][19032] Master 192.168.50.10 could not be reached, trying next master (if any)
2016-01-12 23:31:58,961 [salt.minion      ][WARNING ][19032] Master ip address changed from 192.168.50.10 to 192.168.50.11
2016-01-12 23:31:58,961 [salt.crypt       ][DEBUG   ][19032] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.11:4506')
2016-01-12 23:31:58,963 [salt.transport.zeromq][DEBUG   ][19032] Generated random reconnect delay between '1000ms' and '11000ms' (1720)

This process took about 10 minutes, just as you stated previously. I agree with you that this is a long time to fail over to the other master when the first master is down at minion startup.
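
For reference, a rough back-of-the-envelope sketch (Python, not Salt code) of where most of that time goes, assuming the 60-second spacing and 7 retries visible in the timestamps above:

# Rough arithmetic only: one initial attempt plus 7 SaltReqTimeoutError
# retries, each waiting about 60 seconds, before the minion gives up on
# the first master and tries the next one.
attempts = 1 + 7          # initial try + retries (1/7 .. 7/7)
timeout_seconds = 60      # spacing between the timestamped retries above
print(attempts * timeout_seconds)   # 480 seconds, i.e. about 8 minutes of retries alone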

Also of note, when it fails over I see this stack trace:

[DEBUG   ] schedule.handle_func: The scheduled job __master_alive was not started, 1 already running
[ERROR   ] Unhandled exception running status.master
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 710, in handle_func
    channel.send(load)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 73, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 83, in _block_future
    return future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 231, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 199, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 185, in _do_transfer
    tries=tries,
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
[DEBUG   ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20160112223537410440
^C[INFO    ] The salt minion is shut down
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
2016-01-12 23:31:59,891 [salt.loaded.int.module.cmdmod][DEBUG   ][19032] output: +0000

Thank you for the report.

@meggiebot

We need this fixed for the March point release, 2015.8.8.

@DmitryKuzmenko DmitryKuzmenko added the Confirmed label Feb 19, 2016
@DmitryKuzmenko
Contributor

Currently (on the develop branch) the described use case works properly with the TCP transport, but it hangs the minion forever with ZeroMQ. PR #31364 makes the ZeroMQ transport behave again as described in this issue, because it fixes a ZeroMQ transport timeout-handling issue.

The issue described here lies deeper, inside the core of the transport logic. I'm continuing to work on it.

@DmitryKuzmenko
Contributor

Fixed by disabling auth retries when multi-master is set to failover mode.
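
For anyone following along, a minimal sketch (not the actual Minion code; the option handling is simplified and the values are illustrative) of what that change amounts to, matching the PR #31382 backport diff quoted later in this thread:

# Simplified sketch: in failover multi-master mode, don't retry auth
# against an unreachable master -- give up after the first timeout and
# move on to the next master in the list.
opts = {'master_type': 'failover', 'auth_tries': 7}   # illustrative values
if opts.get('master_type') == 'failover':
    opts['auth_tries'] = 0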

@Ch3LL
Contributor

Ch3LL commented Mar 3, 2016

@DmitryKuzmenko I'm re-opening this because I just tested on the head of 2015.8 again and I am still seeing this particular issue with multi-master failover. I'm guessing those two PRs still need to be added to 2015.8.

@Ch3LL Ch3LL reopened this Mar 3, 2016
@DmitryKuzmenko
Contributor

@Ch3LL thank you for checking. I'll re-test this when I'm less busy.

@meggiebot meggiebot modified the milestones: B -1, B 2 Mar 10, 2016
@meggiebot

@DmitryKuzmenko can you re-test this now that 2015.8.8 is live?

@DmitryKuzmenko
Contributor

@meggiebot my PR was ported into 2015.5 and then merged into 2015.8. During the merge the fix was lost. I've created a new PR for 2015.8.

@meggiebot

@DmitryKuzmenko So this did not make it into 2015.8.8 but will be in the next release, 2015.8.9?

@DmitryKuzmenko
Contributor

@meggiebot it sounds like it.
Before the release I checked that all critical bugfixes were merged, but I checked by commit IDs. Here is how I checked this one:

$ git tag --contains 3d09c3b7a326d9e705e4714ea1b937e40407193d
v2015.8.8
$ git show 3d09c3b7a326d9e705e4714ea1b937e40407193d
commit 3d09c3b7a326d9e705e4714ea1b937e40407193d
Author: Dmitry Kuzmenko <dmitry.kuzmenko@dsr-company.com>
Date:   Wed Feb 24 19:00:32 2016 +0300

    Set auth retry count to 0 if multimaster mode is failover.

    Backport of PR #31382 into 2015.5.

diff --git a/salt/minion.py b/salt/minion.py
index 8413c91..2c99279 100644
--- a/salt/minion.py
+++ b/salt/minion.py
@@ -815,6 +815,7 @@ class Minion(MinionBase):
                     opts['master_active_list'] = opts['master']
                     if opts.get('master_shuffle'):
                         shuffle(opts['master_list'])
+                    opts['auth_tries'] = 0
                 elif isinstance(opts['master'], str):
                     # We have a string, but a list was what was intended. Convert.
                     # See issue 23611 for details

But the actual change was lost during the merge of 2015.5 into 2015.8, and unfortunately git doesn't show this; it could only be found by code review.

@DmitryKuzmenko DmitryKuzmenko added the fixed-pls-verify label Apr 7, 2016
@meggiebot

@DmitryKuzmenko so the fix for this issue on 2015.8 is #32143?

@Ch3LL
Contributor

Ch3LL commented Apr 7, 2016

@DmitryKuzmenko I tested this again at the head of 2015.8

[root@saltminion2 ~]# salt --version
salt 2015.8.8-281-g51fb2ac (Beryllium)

It appears that it is failing over, but I have two concerns:

  1. It takes about 30-60 seconds to fail over each time, even though I have the following setting in the minion config file: master_alive_interval: 15. Is this expected behavior?
  2. I am seeing a stack trace when it fails over:
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion1', 'tcp://192.168.50.11:4506')
[INFO    ] Minion is ready to receive requests!
[DEBUG   ] Handling event '/salt/minion/minion_schedule_add_complete\n\n\x83\xa6_stamp\xba2016-04-07T21:37:02.750031\xa8complete\x
c3\xa8schedule\x82\xae__master_alive\x85\xa8function\xadstatus.master\xa7seconds\x0f\xabjid_include\xc3\xaamaxrunning\x01\xa6kwarg
s\x82\xa9connected\xc3\xa6master\xad192.168.50.11\xaf__mine_interval\x84\xa8function\xabmine.update\xabjid_include\xc3\xa7minutes<
\xaamaxrunning\x02'
[DEBUG   ] Error deleting fd from IOLoop
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 714, in remove_handler
    self._impl.unregister(fd)
  File "/usr/lib64/python2.7/site-packages/zmq/eventloop/ioloop.py", line 115, in unregister
    return self._poller.unregister(fd)
  File "/usr/lib64/python2.7/site-packages/zmq/sugar/poll.py", line 71, in unregister
    idx = self._map.pop(socket)
KeyError: None
[DEBUG   ] LazyLoaded status.master

@DmitryKuzmenko
Contributor

@Ch3LL the error message should be fixed by 7bd97d6, which isn't in 2015.8.8. Should I cherry-pick the commit from the original pull request #31164 and send a new PR for 2015.8.8?

@DmitryKuzmenko
Contributor

#31364 is also missing from 2015.8.8; it should fix the timeout problem. Before that PR, timeouts were handled incorrectly at the transport level.

@meggiebot

#32441

@Ch3LL
Contributor

Ch3LL commented Apr 11, 2016

When my minion config looks like the following:

master:
  - 192.168.50.10
  - 192.168.50.11

master_type: failover
master_alive_interval: 5

It still takes about 30-60 seconds before it fails over to the other master. But when I change it to the following:

master:
  - 192.168.50.10
  - 192.168.50.11

master_type: failover
master_alive_interval: 5
auth_timeout: 10
auth_tries: 0

It now fails over within 10 seconds. I am guessing this is expected, because in this issue the minion is initially attempting to connect to a master, so it is using the auth_timeout and auth_tries settings. @DmitryKuzmenko is that correct? My concern is that a user would expect to only need to configure master_alive_interval to control how long it takes to fail over to the next master in the list on startup, instead of also needing to change the auth_* settings.
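
To illustrate the distinction, a rough sketch with a made-up helper (not Salt code): the startup delay is governed by auth_timeout and auth_tries, while master_alive_interval only matters once the minion is already connected.

# Illustrative helper only: worst-case seconds before a starting minion
# abandons an unreachable master -- one initial attempt plus auth_tries
# retries, each waiting auth_timeout seconds.
def startup_failover_delay(auth_timeout, auth_tries):
    return auth_timeout * (auth_tries + 1)

print(startup_failover_delay(60, 7))   # values seen in the logs above: 480 s
print(startup_failover_delay(10, 0))   # auth_timeout: 10, auth_tries: 0 -> 10 s
# Once connected, the __master_alive schedule re-checks the master every
# master_alive_interval seconds instead.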

@DmitryKuzmenko DmitryKuzmenko modified the milestones: B -2, B -1 Apr 11, 2016
@DmitryKuzmenko
Contributor

At least I can say it's not a bug; it's how it's programmed.
But I agree, our timeout configuration isn't obvious enough.

@Ch3LL
Contributor

Ch3LL commented Apr 12, 2016

@DmitryKuzmenko thanks for the clarification. I'm going to go ahead and close this now since it's now working as expected. Thanks for all your help.

@Ch3LL Ch3LL closed this as completed Apr 12, 2016