The SLO Operator is a Kubernetes Operator that can be used to manage Service
Level Objectives for services. The SLO Operator allows users to define
ServiceLevelObjectives
CustomResources, to generate all Prometheus rules and
alerts required for an SLO.
The operator can be installed via the Helm chart present in the charts
directory. The chart can be installed with the following command:
helm upgrade --install slo-operator oci://ghcr.io/ricoberger/charts/slo-operator --version <VERSION>
By default the operator creates a PrometheusRule
for the
Prometheus Operator. If you are using the
VictoriaMetrics Operator a
VMRule
can be created by setting the SLO_OPERATOR_MODE
environment variable
to VictoriaMetrics
.
An example Grafana dashboard for the SLO Operator can be found in the servicelevelobjective.json file.
apiVersion: ricoberger.de/v1alpha1
kind: ServiceLevelObjective
metadata:
name:
namespace:
labels:
# Labels with a "slo-operator.ricoberger.de/" prefix are added to the
# generated Prometheus recording rules and alerts.
slo-operator.ricoberger.de/team: myteam
spec:
# A list of SLOs for the service.
slos:
- # The name of the SLO, e.g. "errors", "latency", etc.
name:
# The objective for the SLO, e.g. "99.9% uptime", "95% requests in 200ms",
# etc. It must be a percentage value between 0 and 100 as string, e.g.
# "99.9".
objective:
# A description for the SLO.
description:
# SLI contains the metrics to calculate the SLO. For example the total
# metric is the number of all requests, while the error metric is only the
# number of all 5xx requests.
#
# The total and error metric must contain a "${window}" placeholder, which
# will be replaced by the operator with the actual required window for the
# SLO (always 28 days) and the windows for the different burn rates.
sli:
totalQuery:
errorQuery:
alerting:
# Disabled can be used to disable the alerting. If the field is set to
# "true" the operator will not generate alerting rules for Prometheus.
disabled:
# Severities is a list of severities for the alerting rules created by
# the operator for the absent alert and the burn rate alerts. The list
# must contain 5 entries. The first one is used for the absent alert and
# the remaining 4 for the burn rate alerts ordered by criticality.
#
# The default list which is used, when the field is not set is
# ["critial", "error", "error", "warning", "warning"]
severities:
apiVersion: ricoberger.de/v1alpha1
kind: ServiceLevelObjective
metadata:
name: grafana
namespace: monitoring
labels:
slo-operator.ricoberger.de/team: myteam
spec:
slos:
- name: availability
objective: "99.9"
sli:
totalQuery: sum(rate(istio_requests_total{destination_workload_namespace=~"monitoring",destination_workload=~"grafana"}[${window}]))
errorQuery: sum(rate(istio_requests_total{destination_workload_namespace=~"monitoring",destination_workload=~"grafana",response_code=~"5.*"}[${window}]))
- name: latency
objective: "99.9"
sli:
totalQuery: sum(rate(istio_request_duration_milliseconds_count{destination_workload_namespace=~"monitoring",destination_workload=~"grafana"}[${window}]))
errorQuery: |
(
sum(rate(istio_request_duration_milliseconds_count{destination_workload_namespace=~"monitoring",destination_workload=~"grafana"}[${window}]))
-
sum(rate(istio_request_duration_milliseconds_bucket{destination_workload_namespace=~"monitoring",destination_workload=~"grafana",le="2500"}[${window}]))
)
- name: up
objective: "99.9"
sli:
totalQuery: sum(count_over_time(up{job="grafana"}[${window}]))
errorQuery: |
(
sum(count_over_time(up{job="grafana"}[${window}]))
-
sum(sum_over_time(up{job="grafana"}[${window}]))
)
After modifying the *_types.go
files in the api/v1alpha1
folder always run
the following command to update the generated code for that resource type:
make generate
The above Makefile target will invoke the
controller-gen utility to update the
api/v1alpha1/zz_generated.deepcopy.go
file to ensure our API's Go type
definitons implement the runtime.Object
interface that all Kind types must
implement.
Once the API is defined with spec/status fields and CRD validation markers, the CRD manifests can be generated and updated with the following command:
make manifests
This Makefile target will invoke controller-gen to generate the CRD manifest at
charts/slo-operator/crds/ricoberger.de_servicelevelobjectives.yaml
.
Deploy the CRD and run the operator locally with the default Kubernetes config
file present at $HOME/.kube/config
:
kubectl apply -f charts/slo-operator/crds/ricoberger.de_servicelevelobjectives.yaml
make run
The SLO Opeartor is heavily inspirred by Pyrra and Sloth. It creates a similar set of recording and alerting rules like Pyrra, but uses a pattern more similar to Sloth for define the Service Level Objectives.