Prometheus example alert rules #59

Merged 2 commits on Mar 16, 2021

DifferentialOrange (Member) commented Mar 10, 2021

Closes #43

Added Prometheus example alert rules:

  • instance state (Prometheus up),
  • Lua memory (warning and alert),
  • arena/items limit warnings and alerts,
  • high HTTP latency,
  • too many HTTP 4xx responses (on a single instance or overall on cluster),
  • HTTP 5xx errors,
  • low router HTTP activity.
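As an illustration of the "warning and alert" pairing in the list above, here is a minimal sketch of two Lua memory rules at different thresholds. It is not the rule file from this PR: the group name is made up, the metric name `tnt_info_memory_lua` is an assumption about what tarantool/metrics exports (a gauge in bytes), and the thresholds and severity values are placeholders.

```yaml
groups:
  - name: lua-memory-sketch  # hypothetical group name
    rules:
      # Lower threshold: worth a look during working hours.
      - alert: HighLuaMemoryWarning
        # Assumed metric: Lua runtime memory in bytes.
        expr: tnt_info_memory_lua >= 512 * 1024 * 1024
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance '{{ $labels.instance }}' uses a lot of Lua memory"

      # Higher threshold: closer to the limit, so more urgent.
      - alert: HighLuaMemoryAlert
        expr: tnt_info_memory_lua >= 1024 * 1024 * 1024
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance '{{ $labels.instance }}' is close to the Lua memory limit"
```

The `for: 1m` clause keeps a short spike from firing either alert; the condition has to hold for a full minute first.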

Alerts are separated into three groups:

  • `common` holds the non-Tarantool alert on the process itself (Prometheus `up`),
  • `tarantool-common` holds general Tarantool alerts (Lua memory and slab_ratio) that can be applied to any Tarantool application,
  • `tarantool-business` is a set of reference rules showing how you can monitor your business logic.

You can base your own alert rules on the `tarantool-business` examples. Without knowing an application's business logic, it is impossible to say whether 1000 RPS of 4xx errors, or zero requests on a router for an hour, is acceptable. That is also why I fixed `job='example_project'` in all `tarantool-business` alert rules, while the `common` and `tarantool-common` rules apply to every Tarantool instance: two different apps are likely to have different HTTP load and business logic, but the 2 GB Lua memory threshold holds for both.
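The group split described above might look roughly like the sketch below, with one rule per group. Only the group names, the `up` metric, and the `job='example_project'` selector come from this description; the alert names, the Tarantool and HTTP metric names, the `status` label, and all thresholds are assumptions made for illustration.

```yaml
groups:
  # Process-level alert, not specific to Tarantool.
  - name: common
    rules:
      - alert: InstanceDown
        # `up` is recorded by Prometheus itself for every scrape target.
        expr: up == 0
        for: 1m
        labels:
          severity: page

  # Generic Tarantool alerts: no job selector, so every scraped instance is covered.
  - name: tarantool-common
    rules:
      - alert: LowMemtxArenaRemaining
        # Assumed metric name; assumed to be a percentage (0-100).
        expr: tnt_slab_arena_used_ratio >= 90
        for: 1m
        labels:
          severity: warning

  # Application-specific reference rules: pinned to a single job.
  - name: tarantool-business
    rules:
      - alert: HTTPServerErrors
        # Assumed metric and label names; the right threshold depends on the app.
        expr: >
          sum by (job, instance) (
            rate(http_server_request_latency_count{job="example_project", status=~"5.."}[5m])
          ) > 0
        for: 1m
        labels:
          severity: warning
```

Pinning the job only in the last group means a second application can reuse the first two groups unchanged and supply its own business rules.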

The example alert rules are tested with promtool.
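For readers unfamiliar with promtool, a rule unit test could be sketched as follows. This is not the test from this PR: the file names, the instance address, and the `InstanceDown` rule it exercises (described in the leading comment) are all hypothetical.

```yaml
# test.yml (hypothetical file name)
# Assumes alerts.yml (also hypothetical) defines a rule with:
#   alert: InstanceDown, expr: up == 0, for: 1m,
#   labels: {severity: page}, and no annotations.
rule_files:
  - alerts.yml

evaluation_interval: 15s

tests:
  - interval: 15s
    input_series:
      # The target reports up == 0 for the whole test window.
      - series: 'up{job="example_project", instance="localhost:8081"}'
        values: '0+0x20'
    alert_rule_test:
      # 30s in: the condition holds, but the 1m "for" clause has not elapsed yet.
      - eval_time: 30s
        alertname: InstanceDown
        exp_alerts: []
      # 2m in: the alert is expected to be firing with these labels.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: example_project
              instance: localhost:8081
```

With files like these, `promtool check rules alerts.yml` validates the rule syntax and `promtool test rules test.yml` runs the unit test.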

The next step should be adding some documentation based on this example (here or in tarantool/metrics), but I think it should be a different PR.

DifferentialOrange changed the title from "Add alert rule for dead instance" to "Prometheus alert" on Mar 12, 2021

DifferentialOrange (Member, Author) commented Mar 12, 2021

Example:
image
image

DifferentialOrange (Member, Author) commented

The final note here is labels where I set severity: warning. Inspired by Google SRE book I have decided to use only 2 severity levels for alerting – warning and page. warning alerts should go to the ticketing system and you should react to these alerts during normal working days. page alerts are emergencies and can wake up on-call engineer – this type of alerts should be crafted carefully to avoid burnout. Alerts routing based on levels is managed by Alertmanager.

https://alex.dzyoba.com/blog/prometheus-alerts/
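The two-level routing described in the comment above could be expressed in an Alertmanager configuration along these lines. This is a minimal sketch: the receiver names and webhook URLs are hypothetical, and the `matchers` syntax assumes Alertmanager 0.22 or later (older versions use `match` instead).

```yaml
route:
  # Default: everything (including severity: warning) goes to the ticketing system.
  receiver: ticketing
  group_by: ['alertname', 'job']
  routes:
    # Emergencies are routed to the on-call engineer instead.
    - matchers:
        - severity = "page"
      receiver: on-call

receivers:
  - name: ticketing
    webhook_configs:
      - url: 'http://ticketing.example.internal/alerts'  # hypothetical endpoint
  - name: on-call
    webhook_configs:
      - url: 'http://pager.example.internal/alerts'      # hypothetical endpoint
```

Keeping the severity decision in the rule's `severity` label and the delivery decision in Alertmanager means a rule can be promoted from `warning` to `page` without touching the routing tree.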

DifferentialOrange force-pushed the 43-prometheus-alert-rules branch 2 times, most recently from b7a0f4b to df87842, on March 12, 2021 15:34
DifferentialOrange changed the title from "Prometheus alert" to "Prometheus example alert rules" on Mar 12, 2021
DifferentialOrange marked this pull request as ready for review on March 12, 2021 17:08
DifferentialOrange merged commit 983bdf6 into master on Mar 16, 2021
DifferentialOrange deleted the 43-prometheus-alert-rules branch on June 4, 2021 07:27
Linked issue: Define Prometheus alerting rules