master does not suspend minion that went down #2361

Closed
madduck opened this Issue Oct 26, 2012 · 20 comments

Comments

Projects
None yet
7 participants
@madduck
Contributor

madduck commented Oct 26, 2012

If a minion goes down (e.g. SIGTERM, or SIGINT), it appears to do so
gracefully. On the master, subsequently, the connection no longer appears in
the netstat output.

However, it seems as if the master still tries to publish commands to the
minion that is now no longer reachable. See:

# # all 4 minions running
# time su salt -s /bin/sh -c 'salt \* test.ping'
vizier.madduck.net: True
arnold.madduck.net: True
pict.gern: True
clegg.lehel.madduck.net: True

real    0m0.453s
user    0m0.276s
sys     0m0.048s

# # minion pict shut down
# time su salt -s /bin/sh -c 'salt \* test.ping'
vizier.madduck.net: True
arnold.madduck.net: True
clegg.lehel.madduck.net: True
[here it waits for 10+ seconds]

real    0m12.771s
user    0m0.284s
sys     0m0.060s

# # minion pict back up
# time su salt -s /bin/sh -c 'salt \* test.ping'
vizier.madduck.net: True
arnold.madduck.net: True
pict.gern: True
clegg.lehel.madduck.net: True

real    0m0.453s
user    0m0.252s
sys     0m0.072s

Since the minion tore down the connection, the master should know about it,
should probably log that there is no connection from this minion, but then
skip it instead of waiting for 10+ seconds.

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Oct 26, 2012

Member

Minions are not tracked by IP, so we cannot cleanly determine which minion is down, also much of the tracking code has been improved, what version are you running?

Member

thatch45 commented Oct 26, 2012

Minions are not tracked by IP, so we cannot cleanly determine which minion is down, also much of the tracking code has been improved, what version are you running?

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Oct 27, 2012

Contributor

also sprach Thomas S Hatch notifications@github.com [2012.10.26.1738 +0200]:

Minions are not tracked by IP, so we cannot cleanly determine
which minion is down, also much of the tracking code has been
improved, what version are you running?

I am running 0.10.4.

Without having inspected the code, somewhere inside the master there
must be a collection of client connection socket objects, or ZMQ
objects. I could imagine it as a dictionary mapping hostnames found
in the PKI directory to those objects, or just a list of those
objects. When I then run 'salt * test.ping', essentially this list
will be iterated one way or another.

So in fact, the master does track the minions (and could even
determine their IPs from the OS).

When a minion goes down and the connection is gracefully terminated,
then either the socket/ZMQ object provides a callback, or the object
is simply rendered into some sort of disconnected state.

When a minion goes down ungracefully, such a callback would not be
called, and it is not really clear what state the connecion will be
in.

So whenever the master iterates this list and it encounters
a disconnected object, it should log this and remove the object.
Having a callback in addition would make for even better logging.

In addition, I think that the master should iterate all client
connections, say every 180 seconds (configurable, obviously), e.g.
by sending a ping, and if only as a keepalive mechanism (not sure
how keepalive is currently handled). If it receives a pong, then it
moves on. If it does not, or encounters a disconnected object, it
should log and remove.

Potentially, this would also fix #2358.

Contributor

madduck commented Oct 27, 2012

also sprach Thomas S Hatch notifications@github.com [2012.10.26.1738 +0200]:

Minions are not tracked by IP, so we cannot cleanly determine
which minion is down, also much of the tracking code has been
improved, what version are you running?

I am running 0.10.4.

Without having inspected the code, somewhere inside the master there
must be a collection of client connection socket objects, or ZMQ
objects. I could imagine it as a dictionary mapping hostnames found
in the PKI directory to those objects, or just a list of those
objects. When I then run 'salt * test.ping', essentially this list
will be iterated one way or another.

So in fact, the master does track the minions (and could even
determine their IPs from the OS).

When a minion goes down and the connection is gracefully terminated,
then either the socket/ZMQ object provides a callback, or the object
is simply rendered into some sort of disconnected state.

When a minion goes down ungracefully, such a callback would not be
called, and it is not really clear what state the connecion will be
in.

So whenever the master iterates this list and it encounters
a disconnected object, it should log this and remove the object.
Having a callback in addition would make for even better logging.

In addition, I think that the master should iterate all client
connections, say every 180 seconds (configurable, obviously), e.g.
by sending a ping, and if only as a keepalive mechanism (not sure
how keepalive is currently handled). If it receives a pong, then it
moves on. If it does not, or encounters a disconnected object, it
should log and remove.

Potentially, this would also fix #2358.

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Oct 29, 2012

Member

This could require some serious work, I am going to approve it for the future.

Member

thatch45 commented Oct 29, 2012

This could require some serious work, I am going to approve it for the future.

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Oct 29, 2012

Contributor

Thank you!

Contributor

madduck commented Oct 29, 2012

Thank you!

@alekibango

This comment has been minimized.

Show comment
Hide comment

+1

@roncohen

This comment has been minimized.

Show comment
Hide comment
@roncohen

roncohen Nov 26, 2012

Contributor

+1

Contributor

roncohen commented Nov 26, 2012

+1

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Nov 26, 2012

Member

remind me to get this one on 0.10.7, it will most likely take a lot of work but I am planning more in this space anyway for 0.10.7

Member

thatch45 commented Nov 26, 2012

remind me to get this one on 0.10.7, it will most likely take a lot of work but I am planning more in this space anyway for 0.10.7

@mrud

This comment has been minimized.

Show comment
Hide comment
@mrud

mrud Jan 19, 2013

Contributor

Any update?

Contributor

mrud commented Jan 19, 2013

Any update?

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Jan 19, 2013

Member

Many updates have happened here to ease up the minion/master connections when the master goes down I might be able to close this guy out

Member

thatch45 commented Jan 19, 2013

Many updates have happened here to ease up the minion/master connections when the master goes down I might be able to close this guy out

@roncohen

This comment has been minimized.

Show comment
Hide comment
@roncohen

roncohen Jan 20, 2013

Contributor

I can confirm that this is still an issue in 0.12.0. I have two minion that are not connected to the master, and when executing commands, it hangs for about 10 seconds after each command - presumably to wait for answers from the disconnected nodes.

Contributor

roncohen commented Jan 20, 2013

I can confirm that this is still an issue in 0.12.0. I have two minion that are not connected to the master, and when executing commands, it hangs for about 10 seconds after each command - presumably to wait for answers from the disconnected nodes.

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Jan 21, 2013

Member

I think I misread the underlying issue here for that last comment.

Yes, this is still an issue that is on the roadmap, I will put it on 0.13.0 but it might get pushed off it is turns out to be too complicated for in the time available

Member

thatch45 commented Jan 21, 2013

I think I misread the underlying issue here for that last comment.

Yes, this is still an issue that is on the roadmap, I will put it on 0.13.0 but it might get pushed off it is turns out to be too complicated for in the time available

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Jan 23, 2013

Contributor

A piece of additional info:

I just ran, right after I temporarily turned off sysyphus.madduck.net:

% salt -v \* cmd.run 'uptime'
Executing job with jid 20130123233044112778
-------------------------------------------

vizier.madduck.net:
    23:30:44 up 19:22,  1 user,  load average: 0.00, 0.01, 0.05
arnold.madduck.net:
    23:30:44 up 16:24,  0 users,  load average: 0.00, 0.00, 0.00

The following minions did not return:
sysyphus.madduck.net

At the time, netstat -natp had no reference of a conncetion from sysyphus.
Therefore, it's obvious that that host could not respond to the published
command. It's nice that the master tells me about this, but it

  1. should say that the minion is not connected, rather than that it did not
    return information;
  2. be able to provide this detail immediately, rather than waiting for
    a timeout.
Contributor

madduck commented Jan 23, 2013

A piece of additional info:

I just ran, right after I temporarily turned off sysyphus.madduck.net:

% salt -v \* cmd.run 'uptime'
Executing job with jid 20130123233044112778
-------------------------------------------

vizier.madduck.net:
    23:30:44 up 19:22,  1 user,  load average: 0.00, 0.01, 0.05
arnold.madduck.net:
    23:30:44 up 16:24,  0 users,  load average: 0.00, 0.00, 0.00

The following minions did not return:
sysyphus.madduck.net

At the time, netstat -natp had no reference of a conncetion from sysyphus.
Therefore, it's obvious that that host could not respond to the published
command. It's nice that the master tells me about this, but it

  1. should say that the minion is not connected, rather than that it did not
    return information;
  2. be able to provide this detail immediately, rather than waiting for
    a timeout.
@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Jan 24, 2013

Member

I have a lot on my plate for 0.13.0 right now, so I might need to push this off, but I will try to get it in

Member

thatch45 commented Jan 24, 2013

I have a lot on my plate for 0.13.0 right now, so I might need to push this off, but I will try to get it in

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Jun 27, 2013

Contributor

Rather than looking into the kernel, I think the master should keep a list of sockets, and should send keep-alives to those peers. If a keep-alive is not answered, or a socket-type error occurs, then the corresponding entry should be pruned from the list until the minion connects again.

Contributor

madduck commented Jun 27, 2013

Rather than looking into the kernel, I think the master should keep a list of sockets, and should send keep-alives to those peers. If a keep-alive is not answered, or a socket-type error occurs, then the corresponding entry should be pruned from the list until the minion connects again.

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Jun 27, 2013

Contributor

In fact, let the minion send keep-alives unprompted and let the master prune the list in the case when it doesn't hear from a minion in a while. Obviously, regular traffic from the minion would also reset the timer.

Contributor

madduck commented Jun 27, 2013

In fact, let the minion send keep-alives unprompted and let the master prune the list in the case when it doesn't hear from a minion in a while. Obviously, regular traffic from the minion would also reset the timer.

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Jun 27, 2013

Member

Hmm, this I think is actually a ZeroMQ issue, I don't think that the zeromq sockets give us this data, we will need to look into that. They may now that 3.2 is out....

Member

thatch45 commented Jun 27, 2013

Hmm, this I think is actually a ZeroMQ issue, I don't think that the zeromq sockets give us this data, we will need to look into that. They may now that 3.2 is out....

@madduck

This comment has been minimized.

Show comment
Hide comment
@madduck

madduck Jun 28, 2013

Contributor

Well, then the keep-alive idea of my previous comment, which I find the most appealing anyway. I.e. the master keeps a list of connections corresponding to the minions that have checked in within the last 60 seconds (configurable interval, obviously, the master could tell the minions on connection that they are expected to report back within that time, just like DHCP does things…)

Contributor

madduck commented Jun 28, 2013

Well, then the keep-alive idea of my previous comment, which I find the most appealing anyway. I.e. the master keeps a list of connections corresponding to the minions that have checked in within the last 60 seconds (configurable interval, obviously, the master could tell the minions on connection that they are expected to report back within that time, just like DHCP does things…)

@kaithar

This comment has been minimized.

Show comment
Hide comment
@kaithar

kaithar Jan 8, 2014

Contributor

Hmmm....

@thatch45

Hmm, this I think is actually a ZeroMQ issue, I don't think that the zeromq sockets give us this data

I can help you out there actually. It /didn't/ historically give that data out, but they added this bit of awesome some time in the 3.X release cycle: http://api.zeromq.org/3-2:zmq-socket-monitor
That should allow for this mechanic?

Contributor

kaithar commented Jan 8, 2014

Hmmm....

@thatch45

Hmm, this I think is actually a ZeroMQ issue, I don't think that the zeromq sockets give us this data

I can help you out there actually. It /didn't/ historically give that data out, but they added this bit of awesome some time in the 3.X release cycle: http://api.zeromq.org/3-2:zmq-socket-monitor
That should allow for this mechanic?

@thatch45

This comment has been minimized.

Show comment
Hide comment
@thatch45

thatch45 Jan 8, 2014

Member

looks like a roundabout way of doing it in zeromq, but still better than reading from the kernel network tables, this would be great to get working in the publisher

Member

thatch45 commented Jan 8, 2014

looks like a roundabout way of doing it in zeromq, but still better than reading from the kernel network tables, this would be great to get working in the publisher

@stale

This comment has been minimized.

Show comment
Hide comment
@stale

stale bot Sep 5, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale bot commented Sep 5, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

@stale stale bot closed this Sep 12, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment