Minion losing connection and not returning without a restart of service #6231

Closed
leonhedding opened this issue Jul 19, 2013 · 57 comments

@leonhedding commented Jul 19, 2013

I did a test.ping and had a handful of minions not responding. The version reports are below. At random times I can also have 20 clients not responding, though not right now. I have a cron job which restarts the service on my minions every hour. Some of these servers are on the other side of a WAN connection. Because of the NAT'ing, I have configured a TCP keepalive of 60 seconds for them.

Why do I keep losing my connection with the minions?
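
For reference, the keepalive configuration on the minions is roughly the following (a sketch; this assumes the keepalive was set via the minion config's tcp_keepalive options rather than at the OS level):

# /etc/salt/minion (excerpt)
tcp_keepalive: True
tcp_keepalive_idle: 60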

Master on CentOS 6
[root@saltstack ~]# salt --versions-report
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3
[root@saltstack ~]# cat /etc/redhat-release
CentOS release 6.4 (Final)
[root@saltstack ~]#

CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 21 2013, 23:54:59)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3

CentOS 6.4 Minion
Salt: 0.15.3
Python: 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Jinja2: 2.2.1
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3

CentOS release 5.9 (Final)
Salt: 0.15.3
Python: 2.6.8 (unknown, Nov 7 2012, 14:47:34)
Jinja2: unknown
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.08
PyZMQ: 2.1.9
ZMQ: 2.1.9

Windows 2008 R2 64-bit Minion
Salt: 0.16.0
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2

@basepi (Collaborator) commented Jul 19, 2013

The primary cause for minions losing their connection is ZMQ2, which I see on at least one of your version reports. Definitely upgrade to ZMQ3 to prevent many of these issues.

However, we have been seeing a few reports of minions losing connection recently. Some problems are solved with a minion restart, some with a master restart. Is there anything in the logs of the disconnected minions or the master that might clue us in to what is happening?

@basepi (Collaborator) commented Jul 19, 2013

Also, are you using any IPv6? We're wondering if the recent problems are related to re-enabling IPv6 support.

@basepi (Collaborator) commented Jul 19, 2013

Oh, and I only just connected you with the other "restarting the master" issue. =P Thanks for creating a separate issue.

@leonhedding (Author) commented Jul 22, 2013

Yeah, I am the person who reported my issue incorrectly on another ticket. We are not using IPv6.

I have looked at upgrading to ZMQ3, but I can't find any RPMs for my CentOS 5 machines which work. There are too many version conflicts when I have tried to get ZMQ3 onto my CentOS 5 machines. But I am not that worried, since in reality the issue is more widespread and is affecting my ZMQ3 clients just as much. If I can get the ZMQ3 machines to work well I would be happy.

@basepi (Collaborator) commented Jul 22, 2013

Ya, this is high on our priority list. The problem is the difficulty of reproducing issues like this. =\

@leonhedding (Author) commented Jul 23, 2013

Well, I am now seeing 20 of my 30-odd minions go offline until the cron job restarts the minion; then they are reachable for about 5-15 minutes before becoming unreachable again until the cron job runs the next time. I am not finding salt to be that useful in this situation. This happens for both ZMQ3 and ZMQ2 minions.

@basepi (Collaborator) commented Jul 23, 2013

20 out of 30? That's high; I don't think I've seen anyone with that high a percentage of disconnects. We'll definitely look into it.

@tateeskew commented Jul 24, 2013

I'm having the same problem, but mine is reproducible.

If the master server's IP address changes and DNS is updated, the minions have to be restarted to regain connection. There was a fix about a year ago for this where the minion would re-resolve the IP address of the master if it lost connection.

I'm not sure if that is happening correctly now. The IP address of a salt master running in EC2 changes often when it's shut down and brought back up.

This is super annoying because my remote execution to restart the minion process is well...done by salt :)

@leonhedding (Author) commented Jul 30, 2013

My master's IP address is not changing; we have a static IP for it. I have 19 servers this morning not responding 30 minutes after the minion service was restarted.

I have a local cron job on each minion server that runs at 3 minutes past every hour to restart salt-minion, but I can then only reach them right after the service restarts. This is a real pain when trying to use this as a possible solution for my organisation.
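
For reference, the cron entry on each minion is roughly this (a sketch; it assumes a root crontab and the stock CentOS init script):

# root crontab on each minion (sketch)
3 * * * * /sbin/service salt-minion restart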

@basepi (Collaborator) commented Jul 30, 2013

This is happening for minions that are on the same LAN, as well as minions connecting over the WAN, right?

@leonhedding (Author) commented Jul 31, 2013

Same LAN and across a WAN. My Windows box that I have put into salt is usually the first to go, and I know it is running ZMQ3. I haven't bothered putting more machines into salt because of this issue.



@leonhedding (Author) commented Aug 7, 2013

OK, I think I might have a solution for my Linux machines. We run 20-odd CentOS 5 machines, and without ZMQ3 installed (with its ability to configure TCP keepalive) all these machines go offline after some time. I tried using a cron job on the master which did a test.ping every 5 minutes to all devices as a poor man's keepalive, but my minions would still go offline occasionally.

I eventually found some time to build my own 32-bit and 64-bit RPMs for PyZMQ 13.1.0, which was the limiting factor before. There is a public repo for ZMQ3, but none for a version of PyZMQ that supports ZMQ3 on CentOS. I was getting library incompatibilities until I had the latest version of PyZMQ.

It is still early days, but I have yet to lose any minions since getting ZMQ3 onto my CentOS 5 machines. I really wish there were a public repo for ZMQ3 and PyZMQ 13.1.0, because I don't want to have to maintain my own PyZMQ RPMs for both i386 and x64. I saw a ticket open about getting a salt repo set up for CentOS 5. This would be brilliant for others that plan to run CentOS 5 until its support ends in 2017 and also want to run salt.

My Windows machine still keeps dropping after 15 minutes or so. I have enabled keepalives on this Windows machine. I have a number of additional Windows machines I would like to use Salt on, but am holding off until I can get it working reliably on at least one Windows machine. I have seen that I need to increase the timeout for Windows machines and am using 45 seconds for a test.ping, but still get no response.

Windows Server 2008 R2 Datacenter SP1 Machine:
C:\salt> salt --versions-report
Salt: 0.16.2
Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.12
msgpack-pure: Not Installed
pycrypto: 2.3
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.2

Master running CentOS 6.4
[root@saltstack ~]# salt --versions-report
Salt: 0.16.0
Python: 2.6.6 (r266:84292, Jul 10 2013, 22:48:45)
Jinja2: unknown
M2Crypto: 0.20.2
msgpack-python: 0.1.13
msgpack-pure: Not Installed
pycrypto: 2.0.1
PyYAML: 3.10
PyZMQ: 2.2.0.1
ZMQ: 3.2.3

When I run tcpdump and do a test.ping to my Windows machine, absolutely nothing shows in the dump from the master's perspective. When I restart the minion, I then see traffic in my tcpdump. Somehow the connection is dropping and I don't know how to work out why this is happening.
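
For anyone wanting to repeat the check, the capture on the master is roughly this (a sketch; 192.0.2.50 is a placeholder for the Windows minion's IP, eth0 for the master's interface, and 4505/4506 are the default publish/return ports):

tcpdump -nn -i eth0 'host 192.0.2.50 and (port 4505 or port 4506)'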

@basepi (Collaborator) commented Aug 7, 2013

I talked to @UtahDave (who's in charge of Windows development) and he says he's going to package up a new Windows installer with ZMQ 3.2.3, just to make sure that's not the issue. Then we can work from there.

@equinoxefr (Contributor) commented Aug 12, 2013

Hi, I have the same problem with my Windows minions (0.16.2). Contact is lost after a few hours, except for one minion, which is in the same VLAN as the salt master. I think it's a keepalive problem. A simple restart of the minion solves the problem. I installed 0.16.3 today; some news tomorrow ;)

@equinoxefr (Contributor) commented Aug 13, 2013

Hello, today contact was lost with my Windows minion :-(. 0.16.3 didn't address this issue.

In the master log I can see some minion activity every hour:

2013-08-13 06:29:53,066 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 06:29:53,067 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 07:29:52,619 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication request from XML01.services.sib.fr
2013-08-13 08:29:52,147 [salt.master ][INFO ] Authentication accepted from XML01.services.sib.fr

But the minion doesn't respond to salt master requests...

@equinoxefr (Contributor) commented Aug 13, 2013

More tests:

I can use salt-call on the minion without any problems. A salt-call state.sls xxxx works like a charm. But on the master I still get no results.

[root@salt ~]# salt -v 'XML01*' test.version
Executing job with jid 20130813100802364942

XML01.services.sib.fr:
Minion did not return

Until I restart salt-minion.

Can I help with a pcap capture or more detailed logs?

@caseybea commented Aug 19, 2013

I think I am having the same or a similar issue. Minions stop responding; I have 50+ minions out of 133 not responding right now. I, too, cannot update ZMQ everywhere without massive pain.

@basepi (Collaborator) commented Aug 20, 2013

@caseybea For ZMQ < 3, we really can't do anything. There are severe bugs in those versions which make the connection very unstable. Some have gotten around it by using cron to restart the minion service. But your best bet is still to upgrade ZMQ, unfortunately.

@caseybea commented Aug 21, 2013

Damn. I'm not surprised to find out ZMQ 2.x is the main problem. The unfortunate part is that there's no clean way to install updates on a RHEL5 box with the zmq repo(s) out there, because there's a twisty maze of ugly dependencies that makes upgrading to ZMQ3 more or less unrealistic.

That said, when I have some time I will still try, and see how it goes.

(Meanwhile, I'm tossing in my vote to have SSH as an option for connectivity, just in case a crafty developer is listening :-)

SALT is still a really cool deal. I'll get it into production. Eventually.....

@equinoxefr (Contributor) commented Aug 21, 2013

@basepi, I did more tests with my Windows minions. If I kill the TCP connection to the salt server on the minion side (marked as ESTABLISHED on the minion but not present on the server side) with a utility like TCPView, everything becomes OK. Do you think it's a ZMQ issue (version 3.2.2)?

@basepi (Collaborator) commented Aug 21, 2013

@caseybea salt-ssh is shipping with 0.17. Your wish is our command! =P (Though it's way slower than ZMQ, as expected)

@equinoxefr Strange that it's ESTABLISHED on minion and doesn't show up at all on master. That shouldn't be possible, right? o.O We've actually heard of a few different people having connection problems recently on Windows, we're still trying to track down the cause. But 3.2.2 is the most up-to-date version of ZMQ on Windows, so if it's ZMQ, it's an unfixed bug.

@caseybea commented Aug 21, 2013

Wooo! I didn't know salt-ssh was already in the mix. This is good news for us stuck with RedHat 5. (Actually, the whole thing could be resolved if the 0MQ folks updated the repos with 3.x for RHEL, but I know that's not your responsibility.) I'll take whichever solution arrives first: salt-ssh or ZeroMQ 3.x for RHEL. ☺


@basepi (Collaborator) commented Aug 21, 2013

Ya, we've been considering creating our own repos for RHEL 5 to host ZMQ3 packages, but we just haven't had time.

@equinoxefr (Contributor) commented Aug 22, 2013

@basepi You are right, that shouldn't be possible but... http://www.evanjones.ca/tcp-stuck-connection-mystery.html

I did more tests and I can confirm that:

  • Only the Windows minion is affected (0.16.3)
  • Connections between minion and master without a firewall = OK
  • Connections between minion and master with a firewall (Cisco) = KO after a few hours
  • On the master there is no TCP connection from the minion on port 4505, but on the minion a stuck TCP connection shows as ESTABLISHED! If I restart the minion or kill this shadow TCP connection, everything becomes OK again (see the commands sketched after this list)
  • If I use salt-call on the minion, it opens a TCP connection to the master on 4506, but the connection on 4505 is still stuck
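
Roughly the commands I use to see the asymmetry (a sketch; 4505 is the default publish port):

# on the Linux master: established connections on the publish port
netstat -tn | grep ':4505'

# on the Windows minion: the stuck connection shows up here (or in TCPView)
netstat -ano | findstr :4505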

Do you want me to open another issue just for the Windows minion?

@basepi (Collaborator) commented Aug 22, 2013

@equinoxefr Yes, could you create a new issue? That seems to be a different problem from the Linux disconnection issues.

Please include as much information from this thread as you think is relevant. Thanks!

@nbari commented Aug 31, 2013

Same issue here; master and minion are on FreeBSD 9.1 with saltstack installed from ports. Any ideas or possible workarounds?

@Abukamel commented Oct 19, 2013

@basepi @equinoxefr I can confirm the issue of a connection that shows as established on the minion but not on the master, and the minion is running CentOS 5, not Windows.

@basepi (Collaborator) commented Oct 21, 2013

Thanks for the input, @Abukamel. It's helpful to know that it can occur on non-Windows machines as well.

@Abukamel commented Nov 18, 2013

I have written a simple script to install salt and its dependencies from source on CentOS 5 to solve the ZMQ problem. Here is the gist:
https://gist.github.com/Abukamel/7515248

@Mrten (Contributor) commented Dec 5, 2013

Try a cronjob, then?
*/5 * * * * salt '*' test.ping > /dev/null

I think salt should have this as a feature; maybe it does, but I haven't found it yet.

@basepi (Collaborator) commented Dec 5, 2013

Ya, we don't have the ability to just ping the minions on a regular basis built into salt yet. cron is the way to go.

If Azure isn't respecting keepalive, that could definitely be causing your problems. As of right now, the minions will not attempt to reconnect outside of the ZMQ keepalive routines. (We recognize that this is a problem -- the biggest blocker is the fact that ZMQ is not very good at reporting that connections are dead. We've been trying to find a good way around this problem)

@Plasma (Contributor) commented Dec 5, 2013

Pretty sure AWS ELBs and other load balancers in general shed "idle" connections (clear routing tables, etc.) constantly.

Would a cron job set up on the master to ping the clients be good enough?


@Abukamel commented Dec 6, 2013

You can watch for down minions from the master server via this command:

salt-run -t30 manage.down

If the return value is not empty, it will be a line-delimited list with one minion per line. You can loop over them and try to restart those minions to get them back online (a rough sketch of such a loop is below).

I suggest you monitor the master and minions via Nagios and NRPE, and then fire an NRPE script on the minions reported as down by the master-side check to restart them.

This is the solution that I ended up using yesterday to overcome this problem.
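
A rough sketch of such a loop, run from the master (assumptions: key-based SSH access from the master to the minions, minion IDs that resolve as hostnames, and the init-script service name salt-minion):

#!/bin/bash
# restart every minion that manage.down reports as unreachable
for minion in $(salt-run -t30 manage.down 2>/dev/null | sed 's/^- //'); do
    ssh "$minion" 'service salt-minion restart'
done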


@bkeroack commented Feb 4, 2014

I'm seeing this with 0.17.4-4 minions on Windows Server 2012 R2. The minion will stop responding to the master, but local salt-call commands work normally.

@bigmstone commented Feb 17, 2014

I'm seeing this issue with Windows Server 2008 R2 with 0.17.5-2 as well.

@bkeroack commented Feb 17, 2014

My workaround was to add a cron job on the master that test.pings all minions every minute.

@NoCoBonobo commented Feb 17, 2014

@bkeroack: We are trying this workaround.

@jakwas commented Mar 17, 2014

I am also experiencing this problem with hosts connected via VPN, running 2014.1.0 on Debian Wheezy (amd64) on all hosts except one.

I first tried using a cronjob on the master to do a test.ping to all hosts every 5 minutes, but that did not help, so I changed it to run every minute, which seems to help, except for the one Windows 7 minion...

@jakwas commented Apr 2, 2014

I suspect there might be another issue with the Windows minion and will investigate further, but the problem now is that the job history from the test.ping cronjob is causing my master's file system to run out of inodes :(

@basepi (Collaborator) commented Apr 2, 2014

@jakwas Have you set a non-default value for keep_jobs in your master config? We fixed some weirdness in the job cleanup routines which hopefully will have fixed this issue for you in future versions of salt. Additionally, the latest Windows installers contain the newest version of ZeroMQ, which fixes the keepalive routines for Windows. So you shouldn't actually need your test.ping cron job anymore!

@jakwas commented Apr 2, 2014

@basepi No, 'keep_jobs' is commented out in my master config, so the default of 24 hours should apply. That is good news, but unfortunately without the test.ping cron job my Debian Wheezy minions running 2014.1.0 don't reconnect until I restart the minion service on each host. Please let me know if there is anything else I should try.

@basepi (Collaborator) commented Apr 2, 2014

@jakwas Can you inspect your job cache (/var/cache/salt/master/jobs/) and see if there are any files in there with timestamps older than 24 hours? I'm curious whether you're being bitten by the cleanup, or whether you're just generating a ton of cached jobs in a 24-hour period and running out of inodes that way. You could also set keep_jobs to a lower setting (like 1 hour) and see if that solves your problem.
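
Concretely, the check and the config change would look roughly like this (a sketch; the cache path and the keep_jobs option are the ones mentioned above, with keep_jobs expressed in hours):

# look for cached job data older than 24 hours
find /var/cache/salt/master/jobs/ -mtime +1 | head

# /etc/salt/master (excerpt): keep job data for 1 hour instead of the default 24
keep_jobs: 1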

@jakwas commented Apr 2, 2014

@basepi I have already deleted it, since the master service did not want to start, but it was over 10GB in size, and I only have about 50 minions. I have set keep_jobs to 1 hour and will reply here if it happens again. Any idea if/why the other minions still require the cron job?

@basepi (Collaborator) commented Apr 3, 2014

As far as I understood, as long as you have at least ZMQ 3.2 on unix minions, and 4.0.4 on Windows minions, the keepalive routines are pretty solid. I suppose certain network situations could maybe cause problems, but most often an old ZMQ version is to blame. Can't imagine that you would have a very old version of ZMQ on Wheezy, though....

@jakwas commented Apr 3, 2014

All my Linux minions have ZMQ v3.2.3. I have installed the newest Windows version and will test.

It might be worth noting that most of my minions are connecting via dynamic public IP addresses to my master, which is also on a dynamic public IP address...

@basepi (Collaborator) commented Apr 3, 2014

Interesting. That may very well be the problem. But it seems like a problem that test.ping wouldn't necessarily solve, so I'm not sure.

@leonhedding (Author) commented Apr 16, 2014

As the original poster on this problem I would like to say that the latest salt-minion, 2014.1.1 with ZMQ 4.0.4, is actually working. Before, it was a joke for me because my salt master is in the DMZ and most of my Windows clients are across either WAN connections or on our inside network, and the keepalive was not working until now. I have now rolled out salt-minion to my Windows servers because I feel it finally works.

Thanks for getting this problem resolved.

@cachedout (Collaborator) commented Apr 16, 2014

@leonhedding I'm glad to hear this is working for you! I'll go ahead and close this issue out.

@cachedout closed this Apr 16, 2014

@UtahDave (Member) commented Apr 16, 2014

@leonhedding thanks for the report! I'm glad it's working for you now. Your help has been much appreciated.

@yytsui commented Apr 10, 2015

In my setup, I have 10 minions and a master Ubuntu 12.04 instance on Azure. The connections are not stable. Sometimes some of the minions can reconnect after I restart the salt-minion service, but then they lose the connection again soon, in less than 5 minutes. Here are the version reports:

$ salt-minion --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.4.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2

on master
$ sudo salt --versions-report
Salt: 2014.7.2
Python: 2.7.3 (default, Dec 18 2014, 19:10:20)
Jinja2: 2.6
M2Crypto: 0.21.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.6.1
libnacl: Not Installed
PyYAML: 3.10
ioflo: Not Installed
PyZMQ: 14.5.0
RAET: Not Installed
ZMQ: 4.0.5
Mako: Not Installed
Debian source package: 2014.7.2+ds-1precise2

The 10 minions were created by salt-cloud with a map file. I had to manually upgrade them from ZMQ2 to ZMQ4. I'm not sure if this is an Azure-related issue? Is there anything else I can try, or any other useful information I can provide?

@codekoala (Contributor) commented Mar 23, 2016

I was having problems with my Azure minions maintaining a connection to the salt master (which was also on Azure). I moved the master to another provider and continued to have issues maintaining contact with the Azure minions. Minions on other providers were fine.

Today I set some explicit keepalive settings on the Azure minions:

tcp_keepalive: True
tcp_keepalive_idle: 60

Since this change, I've not had the issues keeping in touch with these minions that I used to have. Just a few minutes ago, I created a new minion on Azure without these keepalive settings, and it has already lost contact with the master.

I'm going to bounce that new minion again and see if it loses contact. If it does, I'll update the keepalive settings and see how it looks.

@codekoala (Contributor) commented Mar 24, 2016

A day later, the keepalive settings seem to have solved everything for me with my Azure minions.

@lesar commented Jun 12, 2018

I put my minion in the DMZ and then there were no lost connections any more, so I removed the minion from the DMZ and used the @codekoala solution:

tcp_keepalive: True
tcp_keepalive_idle: 60

and it works for me. So there is some reason why we have to use this solution: some configuration on the master server or on the minion requires communication from the master to the minion on some closed port, unless the configuration above is used.

best regards

P.S. I'm using 2018.03
