Minion connection does not survive IP address change #2358

Closed
madduck opened this Issue Oct 26, 2012 · 17 comments

@madduck
Contributor

madduck commented Oct 26, 2012

[from http://bugs.debian.org/690525]

If my provider disconnects the DSL line and subsequently assigns a new IP address, the salt-minion connection to the server dies and is not re-established.

Following the disconnect, the minion tries over and over again to establish a new connection, which the remote shuts down:

2.216183  10.178.17.2 -> 77.109.139.93 TCP 76 42836 > 4505 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=294780260 TSecr=0 WS=16
2.244185 77.109.139.93 -> 10.178.17.2  TCP 76 4505 > 42836 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1452 SACK_PERM=1 TSval=250894922 TSecr=294780260 WS=16
2.244185  10.178.17.2 -> 77.109.139.93 TCP 68 42836 > 4505 [ACK] Seq=1 Ack=1 Win=14608 Len=0 TSval=294780267 TSecr=250894922
2.244185  10.178.17.2 -> 77.109.139.93 TCP 93 42836 > 4505 [PSH, ACK] Seq=1 Ack=1 Win=14608 Len=25 TSval=294780267 TSecr=250894922
2.272187 77.109.139.93 -> 10.178.17.2  TCP 68 4505 > 42836 [ACK] Seq=1 Ack=26 Win=14480 Len=0 TSval=250894929 TSecr=294780267
2.272187 77.109.139.93 -> 10.178.17.2  TCP 68 4505 > 42836 [FIN, ACK] Seq=1 Ack=26 Win=14480 Len=0 TSval=250894929 TSecr=294780267
2.272187  10.178.17.2 -> 77.109.139.93 TCP 68 42836 > 4505 [FIN, ACK] Seq=26 Ack=2 Win=14608 Len=0 TSval=294780274 TSecr=250894929
2.300190 77.109.139.93 -> 10.178.17.2  TCP 68 4505 > 42836 [ACK] Seq=2 Ack=27 Win=14480 Len=0 TSval=250894936 TSecr=294780274
[and the next port:]
2.404198  10.178.17.2 -> 77.109.139.93 TCP 76 42837 > 4505 [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=294780307 TSecr=0 WS=16
[…]

On the master side, this corresponds to several hundred lingering connections, according to netstat:

tcp        0      0 77.109.139.93:4505      82.135.64.143:42250     TIME_WAIT   -

# netstat -natp | grep -c ':4505.*TIME_WAIT'
484

And obviously, the master cannot communicate with the minion.

I am running both minion and master with --log-level=trace, but there is nothing in the output about this.

There are a number of issues related to this:

  1. I think that the master should periodically ping the minion. If the minion does not respond, the master should at least log this and probably tear down the connection (a sketch of such a watchdog follows below).
  2. The minion should log when it finds an unusable connection, when it tears it down, and when it re-establishes the connection.
  3. The master should log why it closes a connection attempt like the one above.
  4. The master should not deny a connection attempt by an authenticated host.

I hope this is not a deep bug in ZMQ, but something trivial to fix.
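
A minimal sketch of the master-side watchdog that point 1 asks for, using Salt's Python client API (an illustration of the idea, not existing Salt behavior; the key directory assumes a default install):

# Hypothetical watchdog: every five minutes, ping all minions and log
# the accepted minions that fail to answer.
import os
import time

import salt.client

KEY_DIR = '/etc/salt/pki/master/minions'  # accepted minion keys
local = salt.client.LocalClient()

while True:
    accepted = set(os.listdir(KEY_DIR))
    responded = set(local.cmd('*', 'test.ping', timeout=30))
    for minion in sorted(accepted - responded):
        print('minion %s did not respond to ping' % minion)
    time.sleep(300)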

@thatch45
Member

thatch45 commented Oct 26, 2012

Thanks, this is a great analysis. There are some things in zeromq that I can do here; I will take a look.

@roncohen
Contributor

roncohen commented Nov 26, 2012

+1

@robinsmidsrod
Contributor

robinsmidsrod commented Dec 16, 2012

@thatch45 I've also had this problem, but in a much simpler setup, namely a laptop that switches between ethernet and wifi. Whenever I switch the laptop from one to the other (which means a different IP), I need to restart BOTH the minion and the master for communication to start flowing again. If this is a security (MITM) feature, I think it should be configurable on a per-minion basis, so that roaming minions will reconnect properly when they change IP address.

@thatch45
Member

thatch45 commented Dec 17, 2012

Thanks for the extra info on this one @robinsmidsrod

@madduck
Contributor

madduck commented Jan 22, 2013

Please don't make this a MITM-prevention feature. MITM is prevented using the PKI built into ZeroMQ/Salt. The master should not care at all about the IP address of the client if the client manages to authenticate. Then it should update its internal representation to match the new IP address.

@thatch45
Member

thatch45 commented Jan 22, 2013

I agree, the authentication should be entirely based on keys, not the network interface.

@umeboshi2
Contributor

umeboshi2 commented Mar 4, 2013

I built zeromq version 3.2.2 and pyzmq 13.0.0 and a fresh copy of salt from git ~0.13.2. Today, I noticed that my laptop did happen to survive an IP change, when it never would before. I will try to further verify that this is consistent as time permits.

@thatch45
Member

thatch45 commented Mar 4, 2013

This is great news @umeboshi2, keep us posted!

@umeboshi2
Contributor

umeboshi2 commented Mar 6, 2013

I have already verified that I no longer have to restart the salt-master when my laptop minion goes to another network. Today, I have verified that the minion also needs no restart, but it takes about five minutes (or possibly less) to reestablish the connection. I think that this problem has been fixed, at least with the 0mq, pyzmq, and salt that I'm using.

@UtahDave
Member

UtahDave commented Mar 6, 2013

Great to hear! Thanks for letting us know. Do you think we can close this issue now?

@umeboshi2
Contributor

umeboshi2 commented Mar 6, 2013

I feel confident that this can be closed if 0mq >= 3.2.2, pyzmq >= 13.0.0 and salt >= 0.13.1 are required. :)

@UtahDave closed this Mar 6, 2013

@sedie-photobucket
Contributor

sedie-photobucket commented May 15, 2013

I'm seeing this problem again in Salt 0.15.1. I'm running zeromq 3.2.3 and pyzmq 13.0.2. I configured aggressive keep-alive settings on the minion:

tcp_keepalive: True
tcp_keepalive_idle: 30
tcp_keepalive_cnt: 3
tcp_keepalive_intvl: 15

I have two masters, both with identical contents in /etc/salt/pki/master. A DNS CNAME points to the active master. I test by updating the CNAME to point to the fail-over master and then shutting down salt-master on the primary master. After doing so, minions don't connect to the fail-over master even if I wait a long time. I did a tcpdump on the minion, and it looks like the minion keeps trying to communicate with the primary master. If I restart salt-minion on the minion, the minion connects to the fail-over master without issue.

Here's an interesting twist: if, instead of restarting the minion, I run a salt-call on the minion, the minion does connect to the fail-over master and the salt-call runs okay. I can even run a state.highstate without problems. However, I still cannot run commands on the minion from the fail-over master using salt until I restart the minion. When I try to do so using -v, I get "Minion did not return".

Unfortunately, running salt-minion -l garbage doesn't provide any help at all. Nothing gets logged after the salt-master is shut down. Also, the behavior is no different even if I pull the plug on the master instead of shutting it down cleanly.
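
For reference, these tcp_keepalive_* settings correspond to the TCP keepalive socket options that pyzmq exposes. A minimal sketch of applying them to a ZeroMQ socket (an illustration of the mechanism, not Salt's actual connection code; the master address is a placeholder):

# Applying TCP keepalive options to a ZeroMQ socket via pyzmq,
# mirroring the tcp_keepalive_* minion settings above.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.DEALER)
sock.setsockopt(zmq.TCP_KEEPALIVE, 1)         # tcp_keepalive: True
sock.setsockopt(zmq.TCP_KEEPALIVE_IDLE, 30)   # tcp_keepalive_idle
sock.setsockopt(zmq.TCP_KEEPALIVE_CNT, 3)     # tcp_keepalive_cnt
sock.setsockopt(zmq.TCP_KEEPALIVE_INTVL, 15)  # tcp_keepalive_intvl
sock.connect('tcp://master.example.com:4506')  # placeholder address

With these values the kernel should detect a dead peer within roughly idle + cnt * intvl seconds, which is what the aggressive settings above aim for.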

@thatch45
Member

thatch45 commented May 15, 2013

I think what you are describing is a DNS failover system. Salt does not remap on DNS host changes like this. I agree that it would be a nice thing to have, but HA masters can be configured using a VIP, and in 0.16.0 support is being added for multiple active masters.
So I will say: wait for 0.16.0 to be released in the next few weeks, or set up a VIP using keepalived or ucarp.

@chhibber

chhibber commented Oct 28, 2013

Wanted to bump this, as it is a problem for cloud solutions that are entirely dependent on DNS. AWS comes to mind here.

For simplicity I have set up an ELB and a Salt autoscaling group; this is a pattern we constantly use because it is simple. If my Salt server or its availability zone takes a dive, it just comes up in another zone and bootstraps itself. If the Salt server is down for a short period of time, that is fine in our case. The problem here is that the ELB's IP will change, which breaks all the minions until I can restart them (not ideal). Is there any way the minions can do a DNS check after a timeout period instead of requiring a restart of the daemons?

Or is there a solution I may have missed in the docs?

Thanks,
Sono
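
Absent built-in re-resolution, one stop-gap is an external watchdog on each minion that re-resolves the master's hostname and restarts the minion when the address changes. A minimal sketch (the hostname and restart command are assumptions for illustration):

# Hypothetical watchdog: restart the local salt-minion whenever the
# master hostname (e.g. an ELB CNAME) resolves to a new address.
import socket
import subprocess
import time

MASTER = 'salt.example.com'  # placeholder for the master's DNS name

last_ip = socket.gethostbyname(MASTER)
while True:
    time.sleep(60)
    try:
        ip = socket.gethostbyname(MASTER)
    except socket.gaierror:
        continue  # transient DNS failure; try again later
    if ip != last_ip:
        last_ip = ip
        subprocess.call(['service', 'salt-minion', 'restart'])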

@basepi
Collaborator

basepi commented Oct 29, 2013

I don't think there's currently a solution to your problem. @thatch45 mentioned multiple masters above -- you could use a second master that stays connected to all the minions and can restart them so they reconnect to the first master (see the sketch below). Or you could try the other solutions he mentioned (a VIP using keepalived or ucarp). We definitely still want to improve the minion's reconnect abilities, but it's not there yet.
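
A sketch of that restart-from-the-second-master idea, using Salt's Python client API on the still-connected master (the service name is an assumption about the init system):

# Ask every minion still attached to this master to restart its
# salt-minion service, forcing a fresh connection to the first master.
import salt.client

local = salt.client.LocalClient()
local.cmd('*', 'cmd.run', ['service salt-minion restart'])

Note that restarting the minion from inside its own job can kill the job before it returns, so in practice the restart command may need to be detached (e.g. wrapped in nohup) to survive.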

@rawipfel

rawipfel commented Oct 16, 2015

Suppose I want to use Salt to manage the network configuration of the minions, and have them change the network interface IP address used by the minion itself. The flow would be (1) change IP, (2) restart minion, (3) reconnect to master. This seems like a reasonable configuration management use case; I'm guessing it's possible with Salt, but is it known/expected to work?

@dgorissen

dgorissen commented Sep 23, 2016

@rawipfel did you ever find a solution? I have been researching the exact same issue. I did find that network.system has a require_reboot argument, but it is unclear what that actually does.
