This repository has been archived by the owner on Mar 15, 2019. It is now read-only.

Consider a way to automatically re-enable black hole hosts #58

Open

mforsyth opened this issue Feb 24, 2016 · 6 comments

Comments

mforsyth (Contributor)

It would be nice if manual intervention weren't always required to bring a host back out of black hole status and get it re-added to the whitelist.

Need to think about a specific strategy for this.

mforsyth (Contributor, Author) commented Feb 25, 2016

@cmilloy You are well situated to think about how this could work in a way that would be most helpful. Let's use this comment stream as a sounding board for proposals and ideas. @wkf I know you suggested that black hole exile could simply be temporary: after a configurable period of time, we start sending a black hole host tasks again as a trial run.

tnn1t1s commented Feb 25, 2016

I'm not sold on this. I think if a host is black holed, it should be disabled until someone, or something, addresses the issues. We don't want to overload satellite with concerns.

cmilloy commented Feb 26, 2016

After socializing this internally, we have come up with a few initial ideas for implementation:

  • As you noted, being able to set an expiration when a host is black holed would be one option. It would also be nice to have some way to increment the black hole duration upon recurrence (a rough sketch of this idea follows at the end of this comment).
  • Adding a test/canary job that can still target the de-whitelisted hosts and will re-whitelist the host if the job is successful.
  • Allowing some arbitrary command to be run on the slave after de-whitelisting the host. The command may try to remediate obvious problems, reboot the host, and then re-whitelist after reboot (with a configurable # of retries). Perhaps this is decoupled from de-whitelisting and takes the form of a completely separate comet each node runs to see if it has been de-whitelisted?

@tnn1t1s We can certainly discuss further. This primarily came from the prediction that some issues which cause Mesos task failures will not originate inside the cluster (such as those caused by infrastructure). They will occur and be fixed independently of the support team(s) operating the cluster. The concern is that, without automatic re-whitelisting, such outages will cause unnecessary work for, and dependency on, the support team(s) operating the cluster before service to users can be resumed.

I think it makes sense for satellite to have a facility for automatic re-whitelisting that is configurable enough to apply to multiple use cases. If we decide not to use it, that's OK too.
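To make the first bullet concrete, here is a minimal sketch of the expiration-plus-backoff idea. It is plain Python with hypothetical names, purely for illustration; it is not Satellite's actual API.

```python
import time


class BlackHoleTracker:
    """Illustrative only -- not Satellite's actual API.

    Tracks when a black-holed host may be trialled again, doubling the
    exile period each time the same host is black holed again.
    """

    def __init__(self, base_duration=600.0, max_duration=6 * 3600.0):
        self.base_duration = base_duration  # first exile: 10 minutes
        self.max_duration = max_duration    # cap: 6 hours
        self._exiled = {}                   # host -> (expires_at, strikes)

    def black_hole(self, host):
        """Record that `host` was removed from the whitelist."""
        _, strikes = self._exiled.get(host, (0.0, 0))
        duration = min(self.base_duration * (2 ** strikes), self.max_duration)
        self._exiled[host] = (time.time() + duration, strikes + 1)

    def due_for_trial(self):
        """Hosts whose exile has expired; the caller may re-whitelist them."""
        now = time.time()
        return [host for host, (expires, _) in self._exiled.items() if expires <= now]
```

A monitoring loop could call `black_hole()` whenever the detector trips, poll `due_for_trial()` on each cycle, and re-add expired hosts to the whitelist; because the strike count is retained, a host that keeps relapsing waits longer each time.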

tnn1t1s commented Feb 27, 2016

@corey - I think black hole host detection only occurs on 'task-lost', not 'task-failed'. If that's correct, massive job failures due to downstream dependencies (e.g. my database is down) should not trigger this. Instead, it would catch events like 'this host can't start jobs' or 'this host has a broken mesos-agent'. If we're worried, the black hole host detector doesn't actually have to take the action of adding to the blacklist. For starters, we can just start notifying on this event.

That said, I'd like to keep this very simple and not overload satellite with concerns. A few of your suggestions require run-on-host semantics that don't exist in a Mesos world. We'd rather not try to invent that here.

As an alternative, I can imagine a configurable callback that is triggered on blacklist events (regardless of how they occur, e.g. via the black hole host detector or some other check). This callback could call arbitrary code; one example might be 'reboot host and remove from whitelist'.
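A rough sketch of what such a callback hook could look like, to make the idea concrete. This is hypothetical Python, not an existing Satellite feature; the registry, the `host_blacklisted` entry point, and the `logger` command are all assumptions for illustration.

```python
import subprocess

# A sketch of the "configurable callback on blacklist events" idea.
# Every name here is hypothetical; this hook does not exist in Satellite.
_blacklist_callbacks = []


def on_blacklist(callback):
    """Register a callback invoked as callback(host, reason)."""
    _blacklist_callbacks.append(callback)
    return callback


def host_blacklisted(host, reason):
    """Called by whichever check removed the host from the whitelist."""
    for callback in _blacklist_callbacks:
        try:
            callback(host, reason)
        except Exception as exc:  # a failing callback must not break monitoring
            print(f"blacklist callback failed for {host}: {exc}")


# An example callback an ops team might plug in: log/notify, or kick off an
# out-of-band remediation job. The command below is purely illustrative.
@on_blacklist
def notify(host, reason):
    subprocess.run(["logger", f"satellite: {host} blacklisted ({reason})"], check=False)
```

The point is that Satellite would only need to expose the event; what the callback actually does (notify, reboot, re-whitelist) stays in the hands of each ops team.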

Maybe a process outside of Satellite can monitor the blacklist and try remedial action as per your suggestion, but I wouldn't want to add that to Satellite, and if it were my ops team, I wouldn't use it. This could be left as an opinionated effort by individual ops teams.


mforsyth (Contributor, Author) commented Feb 27, 2016

@tnn1t1s the black hole detector does actually care about failed tasks (not lost tasks).

I really like the idea of, as a first step, just having the black hole detector alert admins, rather than removing the host from the whitelist. That has some advantages:

  1. It lets us introduce black hole detection in a way where it can only help, not harm, allowing the maintenance team (who are anxious about the idea of it possibly causing more work than it saves) to see when it kicks in before having to trust it to actually take hosts out of the rotation.
  2. It's very clear functionality, and it allows us to avoid (for now) what promises to be a lengthy negotiation over the functionality of both this issue and #57 (Limit number of hosts that can be disabled via black hole detector).
  3. I believe it will basically be a one-line change to the config, meaning that we don't have to divert resources from Cook right now (see the sketch below this list).
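To illustrate point 3: conceptually, the change is just swapping which action the black hole detector triggers. A hypothetical sketch of that wiring follows; it is not Satellite's actual configuration syntax, and both function names are made up.

```python
# Hypothetical wiring, illustrating the "alert only" first step; neither
# function nor the BLACK_HOLE_ACTION setting is Satellite's real config.

def remove_from_whitelist(host):
    ...  # current behavior: take the host out of rotation


def alert_admins(host):
    ...  # proposed first step: only notify the maintenance team


# The detector fires whichever action is configured here; switching to
# alert-only mode is the "one-line change" described above.
BLACK_HOLE_ACTION = alert_admins  # was: remove_from_whitelist
```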

tnn1t1s commented Feb 27, 2016

This seems like the way to go. We can alert, collect the data, and build confidence and understanding of the behavior before trying to design a solution.

