Proposal for Saturation SLO #964

Open
ArthurSens opened this issue Oct 27, 2023 · 2 comments

Comments

@ArthurSens
Contributor

ArthurSens commented Oct 27, 2023

For a few days now I've been wondering what the implementation of a Saturation SLO based on Prometheus metrics would look like. I've come up with a design idea, so I'm opening this issue to discuss it further with the community.

The main idea here is to reuse the BoolGauge SLO as much as possible.

API:

type SaturationIndicator struct {
	// Utilization is the metric that represents the current utilization of the monitored resource.
	Utilization Query `json:"utilization"`

	// Capacity is the metric that represents the capacity of the monitored resource.
	Capacity Query `json:"capacity"`

	// Threshold is the maximum allowed utilization of the monitored resource,
	// expressed as a fraction of Capacity (Utilization / Capacity).
	// It should be a number between 0 and 1.
	Threshold float64 `json:"threshold"`

	// +optional
	// Grouping allows an SLO to be defined for many SLI at once, like HTTP handlers for example.
	Grouping []string `json:"grouping"`
}
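
To make the constraints in the field comments concrete, here is a minimal sketch of a validation helper. The `Validate` method and the simplified `Query` type (a single `Metric` string) are assumptions made for illustration and are not part of the proposal; treating the bounds 0 and 1 as inclusive is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Query is a simplified stand-in for the project's query type (assumption).
type Query struct {
	Metric string `json:"metric"`
}

// SaturationIndicator repeats the fields from the proposal above.
type SaturationIndicator struct {
	Utilization Query    `json:"utilization"`
	Capacity    Query    `json:"capacity"`
	Threshold   float64  `json:"threshold"`
	Grouping    []string `json:"grouping"`
}

// Validate is a hypothetical helper that checks the constraints stated in
// the field comments: both queries must be set and Threshold must lie
// between 0 and 1 (bounds treated as inclusive here).
func (s SaturationIndicator) Validate() error {
	if s.Utilization.Metric == "" {
		return errors.New("saturation: utilization metric must be set")
	}
	if s.Capacity.Metric == "" {
		return errors.New("saturation: capacity metric must be set")
	}
	if math.IsNaN(s.Threshold) || s.Threshold < 0 || s.Threshold > 1 {
		return fmt.Errorf("saturation: threshold must be between 0 and 1, got %v", s.Threshold)
	}
	return nil
}

func main() {
	// The metric names below are placeholders, not real recommendations.
	s := SaturationIndicator{
		Utilization: Query{Metric: "disk_used_bytes"},
		Capacity:    Query{Metric: "disk_size_bytes"},
		Threshold:   0.8,
	}
	fmt.Println("valid indicator, error:", s.Validate())

	s.Threshold = 1.5
	fmt.Println("invalid indicator, error:", s.Validate())
}
```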

The only additional Prometheus rule we need is one that produces vector(1) when (Utilization / Capacity) > Threshold and vector(0) when (Utilization / Capacity) <= Threshold. From there, we can reuse the same Prometheus rules used for BoolGauge:

- record: example-saturation-bool
  expr: |
    (vector(1) AND (Utilization / Capacity) > Threshold)
    OR
    vector(0)

## Same as BoolGauge below
- record: example-saturation-bool:count1w
  expr: sum (count_over_time(example-saturation-bool[1w]))

- record: example-saturation-bool:sum1w
  expr: sum (sum_over_time(example-saturation-bool[1w]))
  
- record: example-saturation-bool:burnrate1m
  expr: (sum (count_over_time(example-saturation-bool[1m])) - sum (sum_over_time(example-saturation-bool[1m]))) / sum (count_over_time(example-saturation-bool[1m]))
.
.
.
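
As a rough sketch of how these rule expressions might be generated from the indicator's fields, the Go snippet below fills in the expression templates shown above. The function names and the metric names are hypothetical, and the burn-rate template assumes the rule is meant to reference the recorded boolean series in all three places. The code only formats strings; it does not check that the resulting PromQL evaluates as intended.

```go
package main

import (
	"fmt"
	"strconv"
)

// saturationBoolExpr fills the boolean expression template from the proposal
// with concrete utilization and capacity selectors and a threshold.
// Hypothetical helper: it formats text and performs no PromQL validation.
func saturationBoolExpr(utilization, capacity string, threshold float64) string {
	t := strconv.FormatFloat(threshold, 'f', -1, 64)
	return fmt.Sprintf("(vector(1) AND (%s / %s) > %s)\nOR\nvector(0)", utilization, capacity, t)
}

// burnRateExpr fills the burn-rate template for a recorded series and a
// window such as "1m". It assumes every term refers to the same record.
func burnRateExpr(record, window string) string {
	return fmt.Sprintf(
		"(sum (count_over_time(%[1]s[%[2]s])) - sum (sum_over_time(%[1]s[%[2]s]))) / sum (count_over_time(%[1]s[%[2]s]))",
		record, window,
	)
}

func main() {
	// Placeholder metric names, used only for illustration.
	fmt.Println(saturationBoolExpr("disk_used_bytes", "disk_size_bytes", 0.8))
	fmt.Println()
	fmt.Println(burnRateExpr("example-saturation-bool", "1m"))
}
```

Keeping the templating in small functions like these would let the same burn-rate builder serve every window that the BoolGauge rules already cover.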
    
@ArthurSens
Contributor Author

@metalmatze, friendly ping! Would love to open a PR myself once we agree on a design :)

@metalmatze
Member

Sorry for the late reply. I was busy organizing PromCon, speaking at SRECon and afterward moving house.

The overall proposal looks good to me, and I want to make sure to try this. If we can figure out the PromQL, the rest should fall into place.
