Add basic alert support to the dashboards (#267)
* Adding a general kill container script

We are about to add an additional container, so it's a good time to
remove the duplication from the kill container functionality.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

* remove the kill grafana and prometheus scripts

* add a script to start the alert manager

* add the alert manager datasource plugin

* set the prometheus alert manager datasource in the dashboards

* start and kill the alert manager container

* Base configuration for prometheus and the alertmanager

The base configuration is only added as a first step.
We expect that users will change it to fit their own use cases.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

* base alertmanager rule configuration

* Add the prometheus rules to the prometheus container

* add alarm_table class to the types.json

* add an alarm table to the main dashboard

The table is added here as a starting point.
It will probably be moved and better formatted.

* set the alert manager address based on its container

* set the downtime rule to 30s

* set the alert manager address dynamically

* make the prometheus config a template

* create the prometheus config file from template

* set severity to 1 instead of page

* Revert "add an alarm table to the main dashboard"

This reverts commit ca69085.

* remove the sudo from kill-container

* Revert "add alarm_table class to the types.json"

This reverts commit 0de7701.

* Add the alertmanager to the README

* alertmanager to optionally get its port from the command line
amnonh authored and tzach committed Feb 21, 2018
1 parent 2da72d7 commit 91a8c62
Showing 12 changed files with 293 additions and 98 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -11,6 +11,7 @@ ___
The monitoring infrastructure consists of several components, wrapped in Docker containers:
* `prometheus` - collects and stores metrics
* `grafana` - dashboard server
* `alertmanager` - The alert manager collects Prometheus alerts

### Prerequisites
* git
@@ -171,3 +172,10 @@ For example, if you have prometheus running at `192.168.0.1:9090`, and grafana a
```
./load-grafana.sh -p 192.168.0.1:9090 -g 3000
```

### Alertmanager
Prometheus [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) handles alerts that are generated by the Prometheus server.

Alerts are generated according to the [Alerting rules](https://prometheus.io/docs/prometheus/1.8/configuration/alerting_rules/).

The Alertmanager listens on port `9093`; you can connect to it with a web browser.
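
As a quick check that the Alertmanager container is up, you can also query it from the command line. This is only a sketch: the host name `localhost` and the Alertmanager v1 HTTP API endpoints below are assumptions, adjust them to your deployment.

```
# Confirm the Alertmanager answers on its published port
curl -s http://localhost:9093/api/v1/status

# List the alerts it currently holds (empty until Prometheus fires one)
curl -s http://localhost:9093/api/v1/alerts
```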
29 changes: 12 additions & 17 deletions kill-all.sh
@@ -1,15 +1,19 @@
#!/usr/bin/env bash

usage="$(basename "$0") [-h] [-g grafana port ] [ -p prometheus port ] -- kills existing Grafana and Prometheus Docker instances at given ports"

while getopts ':hg:p:' option; do
usage="$(basename "$0") [-h] [-g grafana port ] [ -p prometheus port ] [-m alertmanager port] -- kills existing Grafana and Prometheus Docker instances at given ports"
GRAFANA_PORT=""
PROMETHEUS_PORT=""
ALERTMANAGER_PORT=""
while getopts ':hg:p:m:' option; do
case "$option" in
h) echo "$usage"
exit
;;
g) GRAFANA_PORT=$OPTARG
g) GRAFANA_PORT="-p $OPTARG"
;;
p) PROMETHEUS_PORT="-p $OPTARG"
;;
p) PROMETHEUS_PORT=$OPTARG
m) ALERTMANAGER_PORT="-p $OPTARG"
;;
:) printf "missing argument for -%s\n" "$OPTARG" >&2
echo "$usage" >&2
@@ -22,18 +26,9 @@ while getopts ':hg:p:' option; do
esac
done

if [ -z $PROMETHEUS_PORT ]; then
./kill-prometheus.sh
else
./kill-prometheus.sh -p $PROMETHEUS_PORT
fi

if [ -z $GRAFANA_PORT ]; then
./kill-grafana.sh
else
./kill-grafana.sh -g $GRAFANA_PORT
fi

./kill-container.sh $PROMETHEUS_PORT -b aprom
./kill-container.sh $GRAFANA_PORT -b agraf
./kill-container.sh $ALERTMANAGER_PORT -b aalert
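
With this change, `kill-all.sh` simply forwards each optional port to `kill-container.sh` together with the container's base name. An illustrative invocation (the ports are examples only):

```
# Stop Grafana, Prometheus and the Alertmanager, overriding their ports
./kill-all.sh -g 3000 -p 9090 -m 9093
```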



40 changes: 40 additions & 0 deletions kill-container.sh
@@ -0,0 +1,40 @@
#!/usr/bin/env bash

usage="$(basename "$0") [-h] [ -p container port ] [-n optional name] [-b base name] -- kills existing Docker instances at given ports"

while getopts ':hb:p:n:' option; do
case "$option" in
h) echo "$usage"
exit
;;
p) PORT=$OPTARG
;;
n) NAME=$OPTARG
;;
b) BASE_NAME=$OPTARG
;;
:) printf "missing argument for -%s\n" "$OPTARG" >&2
echo "$usage" >&2
exit 1
;;
\?) printf "illegal option: -%s\n" "$OPTARG" >&2
echo "$usage" >&2
exit 1
;;
esac
done
if [ -z $NAME ]; then
if [ -z $PORT ]; then
NAME=$BASE_NAME
else
NAME=$BASE_NAME-$PORT
fi
fi

if [ "$(docker ps -q -f name=$NAME)" ]; then
docker kill $NAME
fi

if [[ "$(docker ps -aq --filter name=$NAME 2> /dev/null)" != "" ]]; then
docker rm -v $NAME
fi
35 changes: 0 additions & 35 deletions kill-grafana.sh

This file was deleted.

35 changes: 0 additions & 35 deletions kill-prometheus.sh

This file was deleted.

13 changes: 11 additions & 2 deletions load-grafana.sh
@@ -6,9 +6,9 @@ GRAFANA_HOST="localhost"
GRAFANA_PORT=3000
DB_ADDRESS="127.0.0.1:9090"

usage="$(basename "$0") [-h] [-v comma separated versions ] [-g grafana port ] [-H grafana hostname] [-p ip:port address of prometheus ] [-a admin password] [-j additional dashboard to load to Grafana, multiple params are supported] -- loads the prometheus datasource and the Scylla dashboards into an existing grafana installation"
usage="$(basename "$0") [-h] [-v comma separated versions ] [-g grafana port ] [-H grafana hostname] [-m alert_manager ip:port] [-p ip:port address of prometheus ] [-a admin password] [-j additional dashboard to load to Grafana, multiple params are supported] -- loads the prometheus datasource and the Scylla dashboards into an existing grafana installation"

while getopts ':hg:H:p:v:a:j:' option; do
while getopts ':hg:H:p:v:a:j:m:' option; do
case "$option" in
h) echo "$usage"
exit
@@ -23,6 +23,8 @@ while getopts ':hg:H:p:v:a:j:' option; do
;;
p) DB_ADDRESS=$OPTARG
;;
m) AM_ADDRESS=$OPTARG
;;
a) GRAFANA_ADMIN_PASSWORD=$OPTARG
;;
esac
@@ -32,6 +34,13 @@ curl -XPOST -i http://admin:$GRAFANA_ADMIN_PASSWORD@$GRAFANA_HOST:$GRAFANA_PORT/
--data-binary '{"name":"prometheus", "type":"prometheus", "url":"'"http://$DB_ADDRESS"'", "access":"proxy", "basicAuth":false}' \
-H "Content-Type: application/json"

if [ -n "$AM_ADDRESS" ]
then
curl -XPOST -i http://admin:$GRAFANA_ADMIN_PASSWORD@$GRAFANA_HOST:$GRAFANA_PORT/api/datasources \
--data-binary '{"orgId":1,"name":"alertmanager","type":"camptocamp-prometheus-alertmanager-datasource","typeLogoUrl":"public/img/icn-datasource.svg","access":"proxy","url":"'"http://$AM_ADDRESS"'","password":"","user":"","database":"","basicAuth":false,"isDefault":false,"jsonData":{}}' \
-H "Content-Type: application/json"
fi

mkdir -p grafana/build
IFS=',' ;for v in $VERSIONS; do
for f in scylla-dash scylla-dash-per-server scylla-dash-io-per-server; do
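Extending the README example above, the new `-m` flag points the script at the Alertmanager so the extra datasource gets created (the addresses below are illustrative):

```
./load-grafana.sh -p 192.168.0.1:9090 -g 3000 -m 192.168.0.1:9093
```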
9 changes: 9 additions & 0 deletions prometheus/prometheus.rules
@@ -0,0 +1,9 @@
# Alert for any instance that is unreachable for > 30 seconds.
ALERT InstanceDown
IF up == 0
FOR 30s
LABELS { severity = "1" }
ANNOTATIONS {
summary = "Instance {{ $labels.instance }} down",
description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds.",
}
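To see which instances would currently satisfy this rule's expression, you can evaluate it against the Prometheus HTTP API; a sketch, assuming Prometheus is reachable on `localhost:9090`:

```
# Instances whose last scrape failed; if any stays in the result for 30s, InstanceDown fires
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up == 0'
```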
11 changes: 11 additions & 0 deletions prometheus/prometheus.yml → prometheus/prometheus.yml.template
@@ -6,6 +6,17 @@ global:
# external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'scylla-monitor'
rule_files:
- /etc/prometheus/prometheus.rules
#
# Alerting specifies settings related to the Alertmanager.
alerting:
# alert_relabel_configs:
# [ - <relabel_config> ... ]
alertmanagers:
- static_configs:
- targets:
- AM_ADDRESS

scrape_configs:
- job_name: scylla
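Per the commit message, the actual `prometheus.yml` is now generated from this template with the `AM_ADDRESS` placeholder filled in. A minimal sketch of such a substitution follows; the `ALERTMANAGER_ADDRESS` variable, the example value `aalert:9093`, and the output path are illustrative assumptions, not taken from this commit:

```
# Fill in the Alertmanager address when generating the Prometheus config
ALERTMANAGER_ADDRESS="aalert:9093"
sed "s/AM_ADDRESS/$ALERTMANAGER_ADDRESS/" prometheus/prometheus.yml.template > prometheus/prometheus.yml
```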
115 changes: 115 additions & 0 deletions prometheus/rule_config.yml
@@ -0,0 +1,115 @@
global:
# The smarthost and SMTP sender used for mail notifications.
smtp_smarthost: 'localhost:25'
smtp_from: 'alertmanager@example.org'

# The root route on which each incoming alert enters.
route:
# The root route must not have any matchers as it is the entry point for
# all alerts. It needs to have a receiver configured so alerts that do not
# match any of the sub-routes are sent to someone.
receiver: 'team-X-mails'

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
group_by: ['alertname', 'cluster']

# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 30s

# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 5m

# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
repeat_interval: 3h

# All the above attributes are inherited by all child routes and can be
# overwritten on each.

# The child route trees.
routes:
# This route performs a regular expression match on alert labels to
# catch alerts that are related to a list of services.
- match_re:
service: ^(foo1|foo2|baz)$
receiver: team-X-mails

# The service has a sub-route for critical alerts; any alerts
# that do not match (i.e. severity != critical) fall back to the
# parent node and are sent to 'team-X-mails'.
routes:
- match:
severity: critical
receiver: team-X-pager

- match:
service: files
receiver: team-Y-mails

routes:
- match:
severity: critical
receiver: team-Y-pager

# This route handles all alerts coming from a database service. If there's
# no team to handle it, it defaults to the DB team.
- match:
service: database

receiver: team-DB-pager
# Also group alerts by affected database.
group_by: [alertname, cluster, database]

routes:
- match:
owner: team-X
receiver: team-X-pager

- match:
owner: team-Y
receiver: team-Y-pager


# Inhibition rules allow muting a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
# Apply inhibition if the alertname is the same.
equal: ['alertname']


receivers:
- name: 'team-X-mails'
email_configs:
- to: 'team-X+alerts@example.org'

- name: 'team-X-pager'
email_configs:
- to: 'team-X+alerts-critical@example.org'
pagerduty_configs:
- service_key: <team-X-key>

- name: 'team-Y-mails'
email_configs:
- to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
pagerduty_configs:
- service_key: <team-Y-key>

- name: 'team-DB-pager'
pagerduty_configs:
- service_key: <team-DB-key>
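
One way to exercise a routing tree like this one is to post a synthetic alert and check which receiver it reaches. This is an assumption-laden sketch: it presumes the Alertmanager is reachable on `localhost:9093` and exposes the v1 alerts API.

```
# A critical alert for the "database" service owned by team-X;
# with the routes above it should end up at the team-X-pager receiver
curl -s -XPOST http://localhost:9093/api/v1/alerts \
  -H 'Content-Type: application/json' \
  --data-binary '[{"labels":{"alertname":"TestAlert","service":"database","owner":"team-X","severity":"critical"}}]'
```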
