multi-master failover stack trace when minion fails over to other master #30643

Closed
Ch3LL opened this issue Jan 26, 2016 · 9 comments

Comments

@Ch3LL
Contributor

commented Jan 26, 2016

Similar to issues #29567 and #24243.

In 2015.8.4 the behavior reported in those issues still occurs with this minion config:

master:
  - 192.168.50.10
  - 192.168.50.11
master_type: failover
master_alive_interval: 15

Steps to reproduce:

  1. Start both masters and the minion.
  2. Stop the first master in the list (192.168.50.10).
  3. Wait for the minion to fail over to the second master (this takes roughly 5-10 minutes).
  4. Commands cannot be run from the second (failover) master.
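
For anyone reproducing this, a quick sanity check of the failover settings can rule out a config typo before waiting on the failover itself. This is only a sketch: `check_failover_config` is a hypothetical helper, not part of Salt, and the rules it applies (at least two masters, a positive `master_alive_interval`) are my assumptions about what a usable failover setup needs. The config is written as a plain dict so the example has no YAML dependency.

```python
def check_failover_config(opts):
    """Return a list of problems found in a minion config dict (hypothetical helper)."""
    problems = []

    # Assumption: failover only makes sense with a list of two or more masters.
    masters = opts.get("master")
    if not isinstance(masters, list) or len(masters) < 2:
        problems.append("master should be a list with at least two addresses")

    if opts.get("master_type") != "failover":
        problems.append("master_type should be 'failover'")

    # Assumption: a missing or zero interval means the alive check never runs.
    interval = opts.get("master_alive_interval", 0)
    if not isinstance(interval, int) or interval <= 0:
        problems.append("master_alive_interval should be a positive number of seconds")

    return problems


# The configuration from this report, written as a plain dict.
reported = {
    "master": ["192.168.50.10", "192.168.50.11"],
    "master_type": "failover",
    "master_alive_interval": 15,
}
print(check_failover_config(reported))  # prints []
```

The reported config passes all three checks, so the slow failover and the stack trace are not explained by a missing option.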

The only new behavior I am reporting is in step 3: when the minion fails over to the second master, 2015.8.4 now emits the following stack trace:

[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion2', 'tcp://192.168.50.10:4506')
[DEBUG   ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (2/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (2/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (2/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (3/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (3/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (3/3)
Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 729, in handle_func
Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 729, in handle_func
    channel.send(load)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 73, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 83, in _block_future
    return future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 231, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 199, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 185, in _do_transfer
    channel.send(load)
    tries=tries,
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 73, in wrap
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
    ret = self._block_future(ret)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 83, in _block_future
    return future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 231, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 199, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 185, in _do_transfer
    tries=tries,
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
[DEBUG   ] Handling event '__master_disconnected\n\n\x82\xa6_stamp\xba2016-01-26T16:59:04.648499\xa6master\xad192.168.50.10'
[DEBUG   ] SaltEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_87e509139f_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_87e509139f_pull.ipc
[DEBUG   ] Sending event - data = {'_stamp': '2016-01-26T17:03:04.903718', 'complete': True, 'schedule': {'__mine_interval': {'function': 'mine.update', 'jid_include': True, 'minutes': 60, 'maxrunning': 2, 'name': '__mine_interval'}}}
[DEBUG   ] Persisting schedule
[DEBUG   ] Persisting schedule
Process Process-1:5:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/usr/lib64/python2.6/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 729, in handle_func
    channel.send(load)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 73, in wrap
    ret = self._block_future(ret)
  File "/usr/lib/python2.6/site-packages/salt/utils/async.py", line 83, in _block_future
    return future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 231, in send
    ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 199, in _crypted_transfer
    ret = yield _do_transfer()
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.6/site-packages/salt/transport/zeromq.py", line 185, in _do_transfer
    tries=tries,
  File "/usr/lib64/python2.6/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/usr/lib64/python2.6/site-packages/tornado/concurrent.py", line 214, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
[INFO    ] Connection to master 192.168.50.10 lost
[INFO    ] Trying to tune in to next master from master-list
[INFO    ] Removing possibly failed master 192.168.50.10 from list of masters
[WARNING ] Master ip address changed from 192.168.50.10 to 192.168.50.11
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'minion2', 'tcp://192.168.50.11:4506')
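
Since the tracebacks from the scheduler processes are interleaved, it can be easier to pull out just the failover events when comparing runs. A minimal sketch, assuming the minion debug output has been saved to a file or list of lines; `summarize_failover` is a hypothetical helper, and the patterns are copied from the log lines above rather than from any Salt API.

```python
import re

# Patterns copied from the minion log lines in this report.
LOST = re.compile(r"Connection to master (\S+) lost")
CHANGED = re.compile(r"Master ip address changed from (\S+) to (\S+)")


def summarize_failover(lines):
    """Collect lost-master and master-switch events plus a count of timeouts."""
    summary = {"lost": [], "switched": [], "timeouts": 0}
    for line in lines:
        # The final line of each traceback in the log starts with the exception name.
        if line.startswith("SaltReqTimeoutError:"):
            summary["timeouts"] += 1
            continue
        match = LOST.search(line)
        if match:
            summary["lost"].append(match.group(1))
            continue
        match = CHANGED.search(line)
        if match:
            summary["switched"].append((match.group(1), match.group(2)))
    return summary


sample = [
    "SaltReqTimeoutError: Message timed out",
    "[INFO    ] Connection to master 192.168.50.10 lost",
    "[WARNING ] Master ip address changed from 192.168.50.10 to 192.168.50.11",
]
print(summarize_failover(sample))
```

Counting the `SaltReqTimeoutError` lines before the "Connection to master ... lost" line gives a rough idea of how many scheduler processes failed before the minion switched masters.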

Ch3LL added this to the Approved milestone Jan 26, 2016

basepi added Critical and removed High Severity labels Feb 5, 2016

@cachedout

Collaborator

commented Feb 23, 2016

I suspect this may be fixed by #31382.

@Ch3LL can you please pull down that code and verify?

@Ch3LL

Contributor Author

commented Feb 26, 2016

@cachedout I've confirmed this is now working in 2015.5. I will test the 2016.3 branch tomorrow. According to the comments in the PR, it looks like 7bd97d6 needs to be merged into 2015.8 before I can test that branch so I will keep this open until that's been completed and I have verified.

@DmitryKuzmenko

Contributor

commented Feb 26, 2016

#31512 together with #30796 fix this.

@Ch3LL

Contributor Author

commented Feb 27, 2016

@DmitryKuzmenko I am a little confused about how to apply those PRs to the 2015.8 branch for testing, since they were added to develop. Do I just wait until they are added to 2015.8? I tested on the head of the 2015.8 branch just now and also tried pulling in those files, but did not see any change in multi-master behavior. I'm sure I'm applying these PRs incorrectly.

@DmitryKuzmenko

Contributor

commented Feb 27, 2016

@Ch3LL I merged them into 2015.8 for you. Please check it with #31525.

@Ch3LL

Contributor Author

commented Mar 1, 2016

@DmitryKuzmenko Thanks for merging those forward, I appreciate it. I have tested, and this particular issue is fixed on 2015.8, so I will go ahead and close it.

I did hit a problem while testing: issue #30183 is still occurring, but I believe that is because #31364 and #31382 have not been merged into 2015.8 either.

I did test this on the head of 2016.3 and it is not working, but I'm assuming this still needs to be merged forward to 2016, so I will test again later this week.

Ch3LL closed this Mar 1, 2016

@timyi1212

commented Mar 3, 2016

@Ch3LL
Could you please tell me whether the issue is fixed? I tried, but it still seems to have a problem.
When I shut down the first master, the minion does not fail over to the second master automatically within a few minutes.

cro reopened this Mar 3, 2016

@DmitryKuzmenko

Contributor

commented Mar 3, 2016

@timyi1212 could you please provide more details? Which Salt version are you using, and which patches are applied (the fix hasn't been released yet)? Please also include the salt-minion log output.

@timyi1212

commented Mar 4, 2016

@Ch3LL
I tested it again and it seems to work correctly.
I applied the PRs below to fix it; my version is 8.5.
31382 31512 30796 31525 31364

meggiebot closed this Mar 10, 2016
