This repository has been archived by the owner on Mar 15, 2019. It is now read-only.

Consider a way to automatically re-enable black hole hosts #58

Open

mforsyth opened this issue Feb 24, 2016 · 6 comments

Comments

mforsyth (Contributor)

It would be nice if manual intervention weren't always required to bring a host back out of black hole status and get it re-added to the whitelist.

Need to think about a specific strategy for this.

mforsyth (Contributor, Author) commented Feb 25, 2016

@cmilloy You are well situated to think about how this could work in a way that would be most helpful. Let's use this comment stream as a sounding board for proposals and ideas. @wkf I know you suggested that black hole exile could simply be temporary: after a configurable period of time, we start sending a black hole host tasks again as a trial run.

tnn1t1s commented Feb 25, 2016

I'm not sold on this. I think if a host is black holed, it should be disabled until someone, or something, addresses the issues. We don't want to overload satellite with concerns.

cmilloy commented Feb 26, 2016

After socializing this internally, we have come up with a few initial ideas for implementation:

  • As you noted, being able to set an expiration when a host is black holed would be one option. It would also be nice to have some way to increment the black hole duration upon recurrence (a rough sketch of this idea follows at the end of this comment).
  • Adding a test/canary job that can still target the de-whitelisted hosts and will re-whitelist the host if the job is successful.
  • Allowing some arbitrary command to be run on the slave after de-whitelisting the host. The command may try to remediate obvious problems, reboot the host, and then re-whitelist after reboot (with a configurable # of retries). Perhaps this is decoupled from de-whitelisting and takes the form of a completely separate comet each node runs to see if it has been de-whitelisted?

@tnn1t1s We can certainly discuss further. This primarily came from the prediction that some issues which cause Mesos task failures will not originate inside the cluster (such as those caused by infrastructure). They will occur and be fixed independently of the support team(s) operating the cluster. The concern is that, without automatic re-whitelisting, such outages will cause unnecessary work for, and dependency on, the support team(s) operating the cluster before service to users can be resumed.

I think it makes sense for satellite to have a facility for automatic re-whitelisting that is configurable enough to apply to multiple use cases. If we decide not to use it, that's OK too.
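To make the first bullet concrete, here is a minimal sketch of the expiration-plus-backoff idea. It is plain Python with hypothetical names, purely for illustration; it is not Satellite's actual API.

```python
import time


class BlackHoleTracker:
    """Illustrative only -- not Satellite's actual API.

    Tracks when a black-holed host may be trialled again, doubling the
    exile period each time the same host is black holed again.
    """

    def __init__(self, base_duration=600.0, max_duration=6 * 3600.0):
        self.base_duration = base_duration  # first exile: 10 minutes
        self.max_duration = max_duration    # cap: 6 hours
        self._exiled = {}                   # host -> (expires_at, strikes)

    def black_hole(self, host):
        """Record that `host` was removed from the whitelist."""
        _, strikes = self._exiled.get(host, (0.0, 0))
        duration = min(self.base_duration * (2 ** strikes), self.max_duration)
        self._exiled[host] = (time.time() + duration, strikes + 1)

    def due_for_trial(self):
        """Hosts whose exile has expired; the caller may re-whitelist them."""
        now = time.time()
        return [host for host, (expires, _) in self._exiled.items() if expires <= now]
```

A monitoring loop could call `black_hole()` whenever the detector trips, poll `due_for_trial()` on each cycle, and re-add expired hosts to the whitelist; because the strike count is retained, a host that keeps relapsing waits longer each time.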

tnn1t1s commented Feb 27, 2016

@corey - I think black hole host detection only occurs on 'task-lost', not 'task-failed'. If that's correct, massive job failures due to downstream dependencies (e.g. my database is down) should not trigger this. Instead, it would catch events like 'this host can't start jobs' or 'this host has a broken mesos-agent'. If we're worried, the black hole host detector doesn't actually have to take the action of adding to the blacklist. For starters, we can just start notifying on this event.

That said, I'd like to keep this very simple and not overload satellite with concerns. A few of your suggestions require run-on-host semantics that don't exist in a Mesos world. We'd rather not try to invent that here.

As an alternative, I can imagine a configurable callback that is triggered on blacklist events (regardless of how they occur, e.g. via the black hole host detector or some other check). This callback could call arbitrary code; one example might be 'reboot host and remove from whitelist'.
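A rough sketch of what such a callback hook could look like, to make the idea concrete. This is hypothetical Python, not an existing Satellite feature; the registry, the `host_blacklisted` entry point, and the `logger` command are all assumptions for illustration.

```python
import subprocess

# A sketch of the "configurable callback on blacklist events" idea.
# Every name here is hypothetical; this hook does not exist in Satellite.
_blacklist_callbacks = []


def on_blacklist(callback):
    """Register a callback invoked as callback(host, reason)."""
    _blacklist_callbacks.append(callback)
    return callback


def host_blacklisted(host, reason):
    """Called by whichever check removed the host from the whitelist."""
    for callback in _blacklist_callbacks:
        try:
            callback(host, reason)
        except Exception as exc:  # a failing callback must not break monitoring
            print(f"blacklist callback failed for {host}: {exc}")


# An example callback an ops team might plug in: log/notify, or kick off an
# out-of-band remediation job. The command below is purely illustrative.
@on_blacklist
def notify(host, reason):
    subprocess.run(["logger", f"satellite: {host} blacklisted ({reason})"], check=False)
```

The point is that Satellite would only need to expose the event; what the callback actually does (notify, reboot, re-whitelist) stays in the hands of each ops team.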

Maybe a process outside of Satellite can monitor the blacklist and try remedial action as per your suggestion, but I wouldn't want to add that to Satellite, and if it were my ops team, I wouldn't use it. This could be left as an opinionated effort by individual ops teams.


mforsyth (Contributor, Author) commented Feb 27, 2016

@tnn1t1s the black hole detector does actually care about failed tasks (not lost tasks).

I really like the idea of, as a first step, just having the black hole detector alert admins, rather than removing the host from the whitelist. That has some advantages:

  1. It lets us introduce black hole detection in a way where it can only help, not harm, allowing the maintenance team (who are anxious about the idea of it possibly causing more work than it saves) to see when it kicks in before having to trust it to actually take hosts out of the rotation.
  2. It's very clear functionality, and it allows us to avoid (for now) what promises to be a lengthy negotiation over the functionality of both this issue and #57 (Limit number of hosts that can be disabled via black hole detector).
  3. I believe it will basically be a one-line change to the config, meaning that we don't have to divert resources from Cook right now (see the sketch below this list).
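To illustrate point 3: conceptually, the change is just swapping which action the black hole detector triggers. A hypothetical sketch of that wiring follows; it is not Satellite's actual configuration syntax, and both function names are made up.

```python
# Hypothetical wiring, illustrating the "alert only" first step; neither
# function nor the BLACK_HOLE_ACTION setting is Satellite's real config.

def remove_from_whitelist(host):
    ...  # current behavior: take the host out of rotation


def alert_admins(host):
    ...  # proposed first step: only notify the maintenance team


# The detector fires whichever action is configured here; switching to
# alert-only mode is the "one-line change" described above.
BLACK_HOLE_ACTION = alert_admins  # was: remove_from_whitelist
```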

tnn1t1s commented Feb 27, 2016

This seems like the way to go. We can alert, collect the data, and build confidence and understanding of the behavior before trying to design a solution.

