Prometheus example alert rules #59

DifferentialOrange · 2021-03-10T13:21:23Z

Closes #43

Added Prometheus example alert rules:

instance state (Prometheus up),
Lua memory (warning and alert),
arena/items limit warnings and alerts,
high HTTP latency,
too many HTTP 4xx responses (on a single instance or overall on cluster),
HTTP 5xx errors,
low router HTTP activity.

The alerts are split into three groups:

common holds the non-Tarantool alert on process state (Prometheus up),
tarantool-common holds general Tarantool alerts (Lua memory and slab_ratio) that apply to any Tarantool application,
tarantool-business is a set of reference examples showing how you can monitor your business logic.

You can base your own alert rules on the tarantool-business examples, but the thresholds are yours to choose: without knowing your app's business logic beforehand, it's impossible to say whether 1000 RPS of 4xx errors, or 0 requests on a router for an hour, is OK. That's also why I fixed job='example_project' in all tarantool-business alert rules while leaving the common and tarantool-common rules to process all Tarantool instances: two different apps are likely to have different HTTP load and business logic, while the 2 GB Lua memory threshold holds for both of them.
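To make the group layout concrete, here is a minimal sketch of a rules file with one rule per group. It is not the file added in this PR: the alert names, thresholds, and durations are illustrative, and the metric names tnt_info_memory_lua and http_server_request_latency_count are assumptions about what the application exports through tarantool/metrics.

```yaml
# Sketch only: alert names, thresholds and metric names are assumptions.
groups:
  - name: common
    rules:
      # No job matcher: fires for any scrape target Prometheus cannot reach.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance '{{ $labels.instance }}' ('{{ $labels.job }}') is down"

  - name: tarantool-common
    rules:
      # No job matcher either: the Lua memory limit is the same for every
      # Tarantool app. The metric is assumed to be a gauge in bytes.
      - alert: HighLuaMemoryWarning
        expr: tnt_info_memory_lua >= 1024 * 1024 * 1024
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance '{{ $labels.instance }}' uses a lot of Lua memory"

  - name: tarantool-business
    rules:
      # Pinned to one job: an acceptable 4xx rate depends on the application.
      # The threshold of 10 requests per second is a placeholder.
      - alert: HighHTTPClientErrorRate
        expr: sum by (job, instance) (rate(http_server_request_latency_count{job="example_project", status=~"4.."}[5m])) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance '{{ $labels.instance }}' returns many 4xx responses"
```

Only the last group carries a job matcher, so adding a second application to the same Prometheus would reuse the first two groups unchanged and need its own copy of the third.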

The example alert rules are tested with promtool.
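For readers unfamiliar with promtool unit tests, a test file might look roughly like the sketch below. The file names are hypothetical, and the test assumes a rule named InstanceDown with expr up == 0, for: 1m, a severity: page label, and the summary annotation shown in the comment; it is not copied from this PR.

```yaml
# test_alerts.yml (hypothetical name); run with: promtool test rules test_alerts.yml
#
# Assumed rule under test, defined in alerts.yml:
#   alert: InstanceDown
#   expr: up == 0
#   for: 1m
#   labels: { severity: page }
#   annotations:
#     summary: "Instance '{{ $labels.instance }}' ('{{ $labels.job }}') is down"
rule_files:
  - alerts.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # The target reports up == 0 for every sample from 0s to 150s.
      - series: 'up{job="example_project", instance="app:8081"}'
        values: '0+0x10'
    alert_rule_test:
      # At 2m the condition has held longer than the 1m "for" window,
      # so the alert is expected to be firing.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: example_project
              instance: "app:8081"
            exp_annotations:
              summary: "Instance 'app:8081' ('example_project') is down"
```

promtool compares labels and annotations exactly, so the expected values have to be updated whenever the rule's templates change.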

The next step should be adding some documentation based on this example (here or in tarantool/metrics), but I think it should be a different PR.

DifferentialOrange · 2021-03-12T10:35:08Z

Example:

DifferentialOrange · 2021-03-12T10:48:28Z

The final note here is labels where I set severity: warning. Inspired by Google SRE book I have decided to use only 2 severity levels for alerting – warning and page. warning alerts should go to the ticketing system and you should react to these alerts during normal working days. page alerts are emergencies and can wake up on-call engineer – this type of alerts should be crafted carefully to avoid burnout. Alerts routing based on levels is managed by Alertmanager.

https://alex.dzyoba.com/blog/prometheus-alerts/
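The routing described in the quote can be sketched as an Alertmanager configuration fragment. The receiver names are hypothetical and their notification settings are left out; the matchers syntax assumes a reasonably recent Alertmanager release (older ones use match instead).

```yaml
# Sketch only: receiver names are placeholders with no notification
# integrations configured.
route:
  # Default route: anything not matched below, including
  # severity: warning alerts, goes to the ticketing system.
  receiver: ticket-queue
  group_by: ['alertname', 'job']
  routes:
    # severity: page alerts are emergencies and go to the on-call engineer.
    - matchers:
        - severity = "page"
      receiver: on-call

receivers:
  - name: ticket-queue
  - name: on-call
```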

Closes #43

DifferentialOrange force-pushed the 43-prometheus-alert-rules branch from 37957ee to e892274 Compare March 12, 2021 09:53

DifferentialOrange changed the title ~~Add alert rule for dead instance~~ Prometheus alert Mar 12, 2021

Add Prometheus example alert rules

a2f9b58

Closes #43

DifferentialOrange force-pushed the 43-prometheus-alert-rules branch 2 times, most recently from b7a0f4b to df87842 Compare March 12, 2021 15:34

Test alert rules with promtool

452f63a

DifferentialOrange force-pushed the 43-prometheus-alert-rules branch from 2007e0d to 452f63a Compare March 12, 2021 16:54

DifferentialOrange changed the title ~~Prometheus alert~~ Prometheus example alert rules Mar 12, 2021

DifferentialOrange marked this pull request as ready for review March 12, 2021 17:08

DifferentialOrange requested a review from vasiliy-t March 12, 2021 17:08

vasiliy-t approved these changes Mar 16, 2021


DifferentialOrange merged commit 983bdf6 into master Mar 16, 2021

DifferentialOrange mentioned this pull request Mar 16, 2021

Alert rules documentation #64

Closed

DifferentialOrange deleted the 43-prometheus-alert-rules branch June 4, 2021 07:27
