Minion startup extremely delayed when first master in failover multi master setup is down #30183
Comments
@jakehilton I believe this is related to #24243 and #29567. I am inclined to label this a duplicate, but I have one question for further clarification: once your minion does fail over to the second master, are you able to run commands against the minion from the second master? Some of the other issues reported for failover multi-master state that the minion does not fail over unless you change the
I just ran a test and can verify that I can send commands to the minion once it has connected to the secondary master.
I am going to test this today or tomorrow, because this behavior is a little different from the other issue reports. I'll update with my findings after I test. Thanks for the update.
@jakehilton It looks like I can recreate this, although I am not seeing these errors at all:
Here is what my output looks like when the minion is running through the
This process took about 10 minutes, just as you stated previously. I agree that it takes a long time to fail over to the other master when the first master is down at minion startup. Also note that when it fails over I see this stack trace:
Thank you for the report.
Need this fixed for the March point release, 2015.8.8.
Currently (develop branch) the described use case works properly with the TCP transport and hangs the minion forever with ZeroMQ. PR #31364 makes the ZeroMQ transport behave again in the way described in this issue, because it fixes a timeout handling problem in the ZeroMQ transport. The issue described here lies deeper, in the core of the transport logic. I am continuing to work on it.
Fixed by disabling auth retry if multimaster is set to failover mode.
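To make the idea behind that fix concrete, here is a minimal sketch. It is not the actual patch: `effective_auth_tries` is a hypothetical helper, and it only assumes the minion options are a dict with the `master_type` and `auth_tries` keys (the default of 7 matches the seven attempts seen in the report).

```python
def effective_auth_tries(opts):
    """Return how many auth attempts to make against a single master.

    Hypothetical helper for illustration only; not Salt's real code.
    """
    if opts.get("master_type") == "failover":
        # In failover mode, do not retry: give up on this master after
        # one attempt so the caller can move on to the next one.
        return 1
    # Otherwise keep the configured retry count (assumed default: 7).
    return opts.get("auth_tries", 7)


print(effective_auth_tries({"master_type": "failover", "auth_tries": 7}))
print(effective_auth_tries({"master_type": "str", "auth_tries": 7}))
```

With retries disabled per master, the time spent on an unreachable first master is bounded by a single attempt rather than by the full retry budget.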
@DmitryKuzmenko I'm re-opening this because I just tested on the head of 2015.8 again and I am still seeing this particular issue with multi-master failover. I'm guessing those two PRs still need to be added to 2015.8.
@Ch3LL Thank you for checking. I'll re-test this when I'm less busy.
@DmitryKuzmenko Can you re-test this now that 2015.8.8 is live?
@meggiebot My PR was ported into 2015.5 and then merged into 2015.8. During the merge the fix was lost. I've created a new PR for 2015.8.
@DmitryKuzmenko So this did not make it into 2015.8.8 but will be in the next release, 2015.8.9?
@meggiebot Sounds like it.
But the actual change was broken during the merge of 2015.5 into 2015.8, and unfortunately git doesn't show this. It could only be found by code review.
@DmitryKuzmenko So the fix for this issue on 2015.8 is #32143.
@DmitryKuzmenko I tested this again at the head of 2015.8.
It appears that it is failing over, but I have two concerns.
#31364 is also missing in 2015.8.8. It should fix the timeout problem: before the PR, timeouts were handled incorrectly at the transport level.
When my minion config looks like the following:
It still takes about 30-60 seconds before it will fail over to the other master. But when I change the following:
It now fails over within 10 seconds. I am guessing this is to be expected, because in this issue the minion is initially attempting to connect to a master, so it is using the
At least I can say it's not a bug. It's how it's programmed.
@DmitryKuzmenko Thanks for the clarification. I'm going to go ahead and close this now, since it's now working as expected. Thanks for all your help.
My minion config looks like so:
When I start up my minion, I would expect that if smaster2 is down or unresponsive, it would try the next master in the list right away.
As stated here: https://docs.saltstack.com/en/latest/topics/highavailability/index.html
"Changing the master_type parameter from str to failover will cause minions to connect to the first responding master in the list of masters."
Instead, it seems to try the first master until all retries are exhausted, and only then moves on.
Here is the output.
Here is the 7th attempt log:
So it goes through all 7 tries and then finally shows this:
All in all it took over 10 minutes for the minion to fail over to the secondary master. That seems flawed. Is there any way to speed that up?
Thank you!
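The difference between the behavior reported above and the documented "first responding master" behavior can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the master names `master1` and `master2`, and the `is_up` probe are all hypothetical, and no real Salt API is used. The retry count of 7 mirrors the seven attempts described in the report.

```python
def exhaust_then_advance(masters, is_up, tries_per_master):
    """Observed behavior: retry each master fully before trying the next."""
    attempts = 0
    for master in masters:
        for _ in range(tries_per_master):
            attempts += 1
            if is_up(master):
                return master, attempts
    return None, attempts


def first_responding(masters, is_up):
    """Expected behavior: one attempt per master, move on immediately."""
    attempts = 0
    for master in masters:
        attempts += 1
        if is_up(master):
            return master, attempts
    return None, attempts


# Hypothetical setup: the first master is down, the second is up.
masters = ["master1", "master2"]
reachable = {"master2"}


def is_up(master):
    return master in reachable


print(exhaust_then_advance(masters, is_up, 7))  # ('master2', 8)
print(first_responding(masters, is_up))         # ('master2', 2)
```

Because every failed attempt also waits for its own timeout, seven wasted attempts against the first master multiply the startup delay, which is consistent with the roughly 10 minutes reported here.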