This repository has been archived by the owner on May 6, 2021. It is now read-only.

Instance down #24

Open
Dean-Christian-Armada opened this issue Feb 22, 2018 · 8 comments

Comments

@Dean-Christian-Armada

Have you ever tried creating a rule that fires an alert when a node goes down?

@stefanprodan
Owner

node-exporter and cAdvisor run on each Swarm node, so you can configure an alert on up{job="node-exporter"}.
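
As a rough illustration of this suggestion, an alert on the `up` metric could look like the sketch below, written in the Prometheus 1.x rule syntax used elsewhere in this thread. The alert name, the `FOR` duration, the `severity` label, and the annotation text are placeholders and are not taken from the repository.

```
# Hypothetical rule: name, duration, label, and annotations are illustrative.
ALERT node_exporter_down
  IF up{job="node-exporter"} == 0
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "node-exporter target {{ $labels.instance }} is down",
    description = "Prometheus has failed to scrape {{ $labels.instance }} for more than 1 minute.",
  }
```

`up` is 1 when the last scrape of a target succeeded and 0 when it failed, so the `== 0` filter keeps only the failing targets.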

@Dean-Christian-Armada
Author

Dean-Christian-Armada commented Feb 23, 2018

I don't think that is effective enough, because the 0 value for the affected node-exporter does not stay around for long. Also, it shows only the instance IP and not the node_name. I tried grouping it with node_name, but then nothing shows up at all. Please see the screenshots below.

[Screenshot: up with a down node-exporter]

[Screenshot: up grouped with node_meta]

@stefanprodan
Owner

You can use IF absent(node_meta) FOR 5m

@Dean-Christian-Armada
Author

Hi @stefanprodan, what is the expected value of the absent(node_meta) query? I want the alert to fire even if just a single node goes down. In my case, "swarm-node-2" went down.

The screenshot below shows what was returned when I intentionally took swarm-node-2 down.

[Screenshot: result of absent(node_meta) with swarm-node-2 down]
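
For context on an empty result here: `absent()` returns a value only when no series at all match its selector, so `absent(node_meta)` stays empty as long as any other node still reports `node_meta`. The sketch below shows two possible per-node variants. It assumes that `node_meta` exposes one series per node with a `node_name` label and that the cluster has three nodes. Both are assumptions, as are the alert names.

```
# Assumes node_meta carries a node_name label (hypothetical selector).
ALERT swarm_node_2_missing
  IF absent(node_meta{node_name="swarm-node-2"})
  FOR 5m
  ANNOTATIONS {
    summary = "swarm-node-2 is no longer reporting node_meta",
  }

# Assumes a cluster of 3 nodes; adjust the threshold to the real node count.
ALERT swarm_node_count_low
  IF count(node_meta) < 3
  FOR 5m
  ANNOTATIONS {
    summary = "Fewer than 3 nodes are reporting node_meta",
  }
```

The first variant needs one rule per node name. The second covers any single node going down but does not say which one.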

@abhisheks-cuelogic

@Dean-Christian-Armada, I am facing the same problem. I want a rule that fires whenever a node is down.
I should also get an alert when a container is down.

@Dean-Christian-Armada
Author

@abhisheks-cuelogic, by "container down", do you mean that if you have, say, a Python container that goes down, it should alert? I don't think that is possible for containers. Prometheus needs node-exporter or a similar scraping tool to collect metrics, unless there is an agent that can be installed inside the container to determine whether it went down.

@abhisheks-cuelogic

The container itself does not need to send the alert. Can we use something like:

ALERT piwik_nginx
IF count(time() - container_last_seen{name=~"^piwik_nginx.*"} < 60)
ANNOTATIONS {
summary = "piwik_nginx container is down",
description = "piwik_nginx is down for more than 1 minute",
}

I tried this rule, but the alert is always active, even when the container is up.
[Screenshot: prometheus-alert]
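
One likely reason for an always-active alert: a PromQL comparison such as `< 60` filters series rather than returning true or false, and an alert fires whenever its expression returns any series. While the container is up, `time() - container_last_seen` is below 60, so the expression returns a result and the alert fires in exactly the healthy case. A minimal sketch of an inverted rule follows. The alert name and the `FOR` duration are placeholders, and how quickly the series disappears after a container stops depends on cAdvisor and on Prometheus staleness handling.

```
# Hypothetical rewrite: fires when no matching container series exists.
ALERT piwik_nginx_down
  IF absent(container_last_seen{name=~"piwik_nginx.*"})
  FOR 1m
  ANNOTATIONS {
    summary = "piwik_nginx container is down",
    description = "No piwik_nginx container series has been present for more than 1 minute",
  }
```

Flipping the original comparison to `time() - container_last_seen{name=~"piwik_nginx.*"} > 60` is another option, but it returns nothing once the series itself is gone, so the `absent()` form may be the more durable signal.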

@Dean-Christian-Armada
Author

@stefanprodan, we need your advice.
