Add basic alert support to the dashboards (#267)
* Adding a general kill container script

We are about to add an additional container, so it's a good time to
remove the duplication from the kill container functionality.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

* remove the kill grafana and prometheus scripts

* add a script to start the alert manager

* add the alert manager datasource plugin

* set the prometheus alert manager datasource in the dashboards

* start and kill the alert manager container

* Base configuration for prometheus and the alertmanager

The base configuration is only added as a first step.
We expect that users will change it to fit their own use cases.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

* base alertmanager rule configuration

* Add the prometheus rules to the prometheus container

* add alarm_table class to the types.json

* add an alarm table to the main dashboard

The table is added here as a starting point.
It will probably be moved and better formatted.

* set the alert manager address based on its container

* set the downtime rule to 30s

* set the alert manager address dynamically

* make the prometheus config a template

* create the prometheus config file from template

* set severity to 1 instead of page

* Revert "add an alarm table to the main dashboard"

This reverts commit ca69085.

* remove the sudo from kill-container

* Revert "add alarm_table class to the types.json"

This reverts commit 0de7701.

* Add the alertmanager to the README

* alertmanager to optionally get its port from the command line
amnonh authored and tzach committed Feb 21, 2018
1 parent 2da72d7 commit 91a8c62
Showing 12 changed files with 293 additions and 98 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -11,6 +11,7 @@ ___
The monitoring infrastructure consists of several components, wrapped in Docker containers:
* `prometheus` - collects and stores metrics
* `grafana` - dashboard server
* `alertmanager` - The alert manager collects Prometheus alerts

### Prerequisites
* git
@@ -171,3 +172,10 @@ For example, if you have prometheus running at `192.168.0.1:9090`, and grafana a
```
./load-grafana.sh -p 192.168.0.1:9090 -g 3000
```

### Alertmanager
Prometheus [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) handles alerts that are generated by the Prometheus server.

Alerts are generated according to the [Alerting rules](https://prometheus.io/docs/prometheus/1.8/configuration/alerting_rules/).

The Alertmanager listens on port `9093`; you can connect to it with a web browser.
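
As a quick check that the Alertmanager container is up, you can also query it from the command line. This is only a sketch: the host name `localhost` and the Alertmanager v1 HTTP API endpoints below are assumptions, adjust them to your deployment.

```
# Confirm the Alertmanager answers on its published port
curl -s http://localhost:9093/api/v1/status

# List the alerts it currently holds (empty until Prometheus fires one)
curl -s http://localhost:9093/api/v1/alerts
```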
29 changes: 12 additions & 17 deletions kill-all.sh
@@ -1,15 +1,19 @@
#!/usr/bin/env bash

usage="$(basename "$0") [-h] [-g grafana port ] [ -p prometheus port ] -- kills existing Grafana and Prometheus Docker instances at given ports"

while getopts ':hg:p:' option; do
usage="$(basename "$0") [-h] [-g grafana port ] [ -p prometheus port ] [-m alertmanager port] -- kills existing Grafana and Prometheus Docker instances at given ports"
GRAFANA_PORT=""
PROMETHEUS_PORT=""
ALERTMANAGER_PORT=""
while getopts ':hg:p:m:' option; do
case "$option" in
h) echo "$usage"
exit
;;
g) GRAFANA_PORT=$OPTARG
g) GRAFANA_PORT="-p $OPTARG"
;;
p) PROMETHEUS_PORT="-p $OPTARG"
;;
p) PROMETHEUS_PORT=$OPTARG
m) ALERTMANAGER_PORT="-p $OPTARG"
;;
:) printf "missing argument for -%s\n" "$OPTARG" >&2
echo "$usage" >&2
@@ -22,18 +26,9 @@ while getopts ':hg:p:' option; do
esac
done

if [ -z $PROMETHEUS_PORT ]; then
./kill-prometheus.sh
else
./kill-prometheus.sh -p $PROMETHEUS_PORT
fi

if [ -z $GRAFANA_PORT ]; then
./kill-grafana.sh
else
./kill-grafana.sh -g $GRAFANA_PORT
fi

./kill-container.sh $PROMETHEUS_PORT -b aprom
./kill-container.sh $GRAFANA_PORT -b agraf
./kill-container.sh $ALERTMANAGER_PORT -b aalert
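
With this change, `kill-all.sh` simply forwards each optional port to `kill-container.sh` together with the container's base name. An illustrative invocation (the ports are examples only):

```
# Stop Grafana, Prometheus and the Alertmanager, overriding their ports
./kill-all.sh -g 3000 -p 9090 -m 9093
```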



40 changes: 40 additions & 0 deletions kill-container.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

usage="$(basename "$0") [-h] [ -p container port ] [-n optional name] [-b base name] -- kills existing Docker instances at given ports"

while getopts ':hb:p:n:' option; do
case "$option" in
h) echo "$usage"
exit
;;
p) PORT=$OPTARG
;;
n) NAME=$OPTARG
;;
b) BASE_NAME=$OPTARG
;;
:) printf "missing argument for -%s\n" "$OPTARG" >&2
echo "$usage" >&2
exit 1
;;
\?) printf "illegal option: -%s\n" "$OPTARG" >&2
echo "$usage" >&2
exit 1
;;
esac
done
if [ -z $NAME ]; then
if [ -z $PORT ]; then
NAME=$BASE_NAME
else
NAME=$BASE_NAME-$PORT
fi
fi

if [ "$(docker ps -q -f name=$NAME)" ]; then
docker kill $NAME
fi

if [[ "$(docker ps -aq --filter name=$NAME 2> /dev/null)" != "" ]]; then
docker rm -v $NAME
fi
35 changes: 0 additions & 35 deletions kill-grafana.sh

This file was deleted.

35 changes: 0 additions & 35 deletions kill-prometheus.sh

This file was deleted.

13 changes: 11 additions & 2 deletions load-grafana.sh
@@ -6,9 +6,9 @@ GRAFANA_HOST="localhost"
GRAFANA_PORT=3000
DB_ADDRESS="127.0.0.1:9090"

usage="$(basename "$0") [-h] [-v comma separated versions ] [-g grafana port ] [-H grafana hostname] [-p ip:port address of prometheus ] [-a admin password] [-j additional dashboard to load to Grafana, multiple params are supported] -- loads the prometheus datasource and the Scylla dashboards into an existing grafana installation"
usage="$(basename "$0") [-h] [-v comma separated versions ] [-g grafana port ] [-H grafana hostname] [-m alert_manager ip:port] [-p ip:port address of prometheus ] [-a admin password] [-j additional dashboard to load to Grafana, multiple params are supported] -- loads the prometheus datasource and the Scylla dashboards into an existing grafana installation"

while getopts ':hg:H:p:v:a:j:' option; do
while getopts ':hg:H:p:v:a:j:m:' option; do
case "$option" in
h) echo "$usage"
exit
@@ -23,6 +23,8 @@ while getopts ':hg:H:p:v:a:j:' option; do
;;
p) DB_ADDRESS=$OPTARG
;;
m) AM_ADDRESS=$OPTARG
;;
a) GRAFANA_ADMIN_PASSWORD=$OPTARG
;;
esac
@@ -32,6 +34,13 @@ curl -XPOST -i http://admin:$GRAFANA_ADMIN_PASSWORD@$GRAFANA_HOST:$GRAFANA_PORT/
--data-binary '{"name":"prometheus", "type":"prometheus", "url":"'"http://$DB_ADDRESS"'", "access":"proxy", "basicAuth":false}' \
-H "Content-Type: application/json"

if [ -n "$AM_ADDRESS" ]
then
curl -XPOST -i http://admin:$GRAFANA_ADMIN_PASSWORD@$GRAFANA_HOST:$GRAFANA_PORT/api/datasources \
--data-binary '{"orgId":1,"name":"alertmanager","type":"camptocamp-prometheus-alertmanager-datasource","typeLogoUrl":"public/img/icn-datasource.svg","access":"proxy","url":"'"http://$AM_ADDRESS"'","password":"","user":"","database":"","basicAuth":false,"isDefault":false,"jsonData":{}}' \
-H "Content-Type: application/json"
fi

mkdir -p grafana/build
IFS=',' ;for v in $VERSIONS; do
for f in scylla-dash scylla-dash-per-server scylla-dash-io-per-server; do
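Extending the README example above, the new `-m` flag points the script at the Alertmanager so the extra datasource gets created (the addresses below are illustrative):

```
./load-grafana.sh -p 192.168.0.1:9090 -g 3000 -m 192.168.0.1:9093
```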
9 changes: 9 additions & 0 deletions prometheus/prometheus.rules
@@ -0,0 +1,9 @@
# Alert for any instance that is unreachable for > 30 seconds.
ALERT InstanceDown
IF up == 0
FOR 30s
LABELS { severity = "1" }
ANNOTATIONS {
summary = "Instance {{ $labels.instance }} down",
description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds.",
}
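To see which instances would currently satisfy this rule's expression, you can evaluate it against the Prometheus HTTP API; a sketch, assuming Prometheus is reachable on `localhost:9090`:

```
# Instances whose last scrape failed; if any stays in the result for 30s, InstanceDown fires
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up == 0'
```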
11 changes: 11 additions & 0 deletions prometheus/prometheus.yml → prometheus/prometheus.yml.template
@@ -6,6 +6,17 @@ global:
# external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'scylla-monitor'
rule_files:
- /etc/prometheus/prometheus.rules
#
# Alerting specifies settings related to the Alertmanager.
alerting:
# alert_relabel_configs:
# [ - <relabel_config> ... ]
alertmanagers:
- static_configs:
- targets:
- AM_ADDRESS

scrape_configs:
- job_name: scylla
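Per the commit message, the actual `prometheus.yml` is now generated from this template with the `AM_ADDRESS` placeholder filled in. A minimal sketch of such a substitution follows; the `ALERTMANAGER_ADDRESS` variable, the example value `aalert:9093`, and the output path are illustrative assumptions, not taken from this commit:

```
# Fill in the Alertmanager address when generating the Prometheus config
ALERTMANAGER_ADDRESS="aalert:9093"
sed "s/AM_ADDRESS/$ALERTMANAGER_ADDRESS/" prometheus/prometheus.yml.template > prometheus/prometheus.yml
```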
115 changes: 115 additions & 0 deletions prometheus/rule_config.yml
@@ -0,0 +1,115 @@
global:
# The smarthost and SMTP sender used for mail notifications.
smtp_smarthost: 'localhost:25'
smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
# The root route must not have any matchers as it is the entry point for
# all alerts. It needs to have a receiver configured so alerts that do not
# match any of the sub-routes are sent to someone.
receiver: 'team-X-mails'

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: ['alertname', 'cluster']

# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s

# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 5m

# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
repeat_interval: 3h

# All the above attributes are inherited by all child routes and can be
# overwritten on each.

# The child route trees.
routes:
# This route performs a regular expression match on alert labels to
# catch alerts that are related to a list of services.
- match_re:
service: ^(foo1|foo2|baz)$
receiver: team-X-mails

# The service has a sub-route for critical alerts; any alerts
# that do not match (i.e. severity != critical) fall back to the
# parent node and are sent to 'team-X-mails'.
routes:
- match:
severity: critical
receiver: team-X-pager

- match:
service: files
receiver: team-Y-mails

routes:
- match:
severity: critical
receiver: team-Y-pager

# This route handles all alerts coming from a database service. If there's
# no team to handle it, it defaults to the DB team.
- match:
service: database

receiver: team-DB-pager
# Also group alerts by affected database.
group_by: [alertname, cluster, database]

routes:
- match:
owner: team-X
receiver: team-X-pager

- match:
owner: team-Y
receiver: team-Y-pager


# Inhibition rules allow muting a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
# Apply inhibition if the alertname is the same.
equal: ['alertname']


receivers:
- name: 'team-X-mails'
email_configs:
- to: 'team-X+alerts@example.org'

- name: 'team-X-pager'
email_configs:
- to: 'team-X+alerts-critical@example.org'
pagerduty_configs:
- service_key: <team-X-key>

- name: 'team-Y-mails'
email_configs:
- to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
pagerduty_configs:
- service_key: <team-Y-key>

- name: 'team-DB-pager'
pagerduty_configs:
- service_key: <team-DB-key>
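
One way to exercise a routing tree like this one is to post a synthetic alert and check which receiver it reaches. This is an assumption-laden sketch: it presumes the Alertmanager is reachable on `localhost:9093` and exposes the v1 alerts API.

```
# A critical alert for the "database" service owned by team-X;
# with the routes above it should end up at the team-X-pager receiver
curl -s -XPOST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  --data-binary '[{"labels":{"alertname":"TestAlert","service":"database","owner":"team-X","severity":"critical"}}]'
```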
