KEP-5053: Fallback for HPA on failure to retrieve metrics #5054
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: be0x74a. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @be0x74a!
Hi @be0x74a. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:
- `fallback`: an optional new object containing the following fields:
  - `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be greater than 0 and is option with a default of 3.
What does this mean:
> Must be greater than 0 and is option with a default of 3.
Is that saying that it's an optional field and that it defaults to "3"?
Yes exactly, clarified the explanation. Thanks
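For illustration, the proposed spec addition could be sketched in Go roughly as below. This is only a sketch based on the fields described in this KEP draft (the `HPAFallback` type name is the one used later in this thread); it is not part of any released autoscaling API.

```go
// Sketch only: field and type names follow this KEP draft, not a released API.
type HorizontalPodAutoscalerSpec struct {
	// ... existing fields (ScaleTargetRef, MinReplicas, MaxReplicas, Metrics, Behavior) ...

	// fallback configures the behaviour to apply when metrics cannot be retrieved.
	// +optional
	Fallback *HPAFallback
}

type HPAFallback struct {
	// failureThreshold is the number of consecutive failures fetching metrics that
	// triggers the fallback behaviour. Optional; must be greater than 0; defaults to 3.
	// +optional
	FailureThreshold *int32

	// replicas is the replica count to scale to while in fallback. Mandatory; must be greater than 0.
	Replicas int32
}
```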
- Feature implemented behind a `HPAFallback` feature flag
- Initial e2e tests completed and enabled
### Upgrade / Downgrade Strategy
This is possibly worth filling in.
i.e. if someone enables the alpha feature and then does a rollback, what happens?
Filled it in with a brief overview of what would happen when the feature flag is enabled and disabled.
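As a rough illustration of why a rollback can be safe, alpha behaviour like this is normally guarded by a feature-gate check in the controller. The sketch below assumes a gate named `HPAFallback` as in this KEP; the exact wiring in kube-controller-manager is not specified here.

```go
// Sketch only: utilfeature is k8s.io/apiserver/pkg/util/feature and features.HPAFallback
// is the gate proposed in this KEP. With the gate disabled (or after a downgrade), the
// new field is simply ignored and the HPA behaves as it does today.
if utilfeature.DefaultFeatureGate.Enabled(features.HPAFallback) && hpa.Spec.Fallback != nil {
	// apply the fallback replica count instead of failing the reconcile
}
```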
Fill `Upgrade / Downgrade Strategy` section. Clarify `failureThreshold` field documentation
Thanks a lot for putting this together :) I think this will be tremendously useful for users.
Would love for the community to provide more feedback on the API, I think we are definitely on the right track here!
type HorizontalPodAutoscalerStatus struct {
	// metricRetrievalFailureCount tracks the number of consecutive failures in retrieving metrics.
	//+optional
	MetricRetrievalFailureCount int32
Should we maybe be explicit in the name that this is consecutive failures? `ConsecutiveMetricRetrievalFailureCount`?
Although it's maybe a big name, I agree it's better to be explicit that they're consecutive failures.
Updated
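For reference, a sketch of the status field after the rename discussed above (illustrative only; the doc comment mirrors the one in the diff):

```go
// Sketch only: the renamed field proposed in this thread, not a released API.
type HorizontalPodAutoscalerStatus struct {
	// ... existing fields ...

	// consecutiveMetricRetrievalFailureCount tracks the number of consecutive
	// failures in retrieving metrics.
	// +optional
	ConsecutiveMetricRetrievalFailureCount int32
}
```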
	// fallback state and defines the threshold for errors required to enter the
	// fallback state.
	//+optional
	Fallback *HPAFallback
Any thoughts on adding this to the Behavior field?
(I don't feel strongly, just curious what you think)
Hadn't thought of it, but I can see how it would make sense to define the fallback behavior on the Behavior field.
I semi-agree with this.
Behaviour feels right, but the description of Behaviour feels wrong: https://github.com/kubernetes/kubernetes/blob/0798325ba13643358aa3ebb7c6ddc3006ac26a7c/pkg/apis/autoscaling/types.go#L111-L113
// HorizontalPodAutoscalerBehavior configures a scaling behavior for Up and Down direction
// (scaleUp and scaleDown fields respectively).
It feels to me like "Scaling behaviour" and "Fallback behaviour" are two different things.
Maybe if the comment I linked to was reworded, then "Behaviour" is the right place?
Added a section about updating the comment on the behavior field, and moved the fallback into it.
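A rough sketch of what that Behavior-based placement could look like, assuming the `HorizontalPodAutoscalerBehavior` doc comment is reworded to cover more than scaling direction (illustrative only, not a final API):

```go
// HorizontalPodAutoscalerBehavior configures the behavior of the autoscaler,
// including scaling in the Up and Down directions and the fallback applied when
// metrics cannot be retrieved. (The reworded comment is an assumption of this sketch.)
type HorizontalPodAutoscalerBehavior struct {
	// ScaleUp and ScaleDown (existing fields) omitted for brevity.

	// fallback defines the replica count to apply, and the threshold of consecutive
	// metric-retrieval errors required, to enter the fallback state.
	// +optional
	Fallback *HPAFallback
}
```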
	inFallback = false
}

if !inFallback {
I think we'll probably want to set a status when the HPA is in fallback mode?
Maybe an event too?

Events:
  Type    Reason             Age    From                       Message
  ----    ------             ----   ----                       -------
  Normal  SuccessfulRescale  2m19s  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
That's an event for a regular scale, so I think a similar message but with a different reason would be useful
> I think we'll probably want to set a status when the HPA is in fallback mode?
I commented higher up about a condition (see https://github.com/kubernetes/enhancements/pull/5054/files#r1924247581).
I imagine that covers the status update you're talking about?
Right - yes a condition flagging that the HPA is in this fallback state. Good callout.
> That's an event for a regular scale, so I think a similar message but with a different reason would be useful
I added an event here but I guess it's better to have a specific message for fallback and not the error responsible for it.
Added the `FallbackActive` condition.
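For illustration, entering fallback could be surfaced roughly as below, reusing the HPA controller's existing event and condition helpers. The `FallbackActive` condition type is the one named in this thread; the event type, reason, and message strings are placeholders rather than a settled design.

```go
// Sketch only: setCondition and eventRecorder mirror existing patterns in the HPA
// controller; FallbackActive and the reason/message text are placeholders.
a.eventRecorder.Eventf(hpa, v1.EventTypeWarning, "FallbackActive",
	"New size: %d; reason: metrics retrieval failed %d consecutive times",
	fallbackReplicas, failureCount)
setCondition(hpa, autoscalingv2.FallbackActive, v1.ConditionTrue, "MetricsUnavailable",
	"the HPA is using the configured fallback replica count because metrics could not be retrieved")
```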
	CurrentReplicas int32
	DesiredReplicas int32
	CurrentMetrics []MetricStatus
	Conditions []HorizontalPodAutoscalerCondition
I think a new condition may be needed too?
This is what we currently have:
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
I imagine a "Fallback" Type may be useful
Yeah definitely
Done
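For completeness, the new condition type could sit alongside the existing ones roughly as below. The constant name comes from this thread; its placement and string value are assumptions.

```go
// Sketch only: a new HorizontalPodAutoscalerConditionType alongside the existing
// AbleToScale / ScalingActive / ScalingLimited conditions.
const (
	// FallbackActive indicates that the HPA is applying the configured fallback
	// replica count because metrics could not be retrieved.
	FallbackActive HorizontalPodAutoscalerConditionType = "FallbackActive"
)
```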
Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:
- `fallback`: an optional new object containing the following fields:
- `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be a value greater than 0. This field is optional and defaults to 3 if not specified. |
`failureThreshold` specifies a maximum number of retries. Isn't a duration what users are looking for (i.e. trigger fallback if scraping metrics fails for x seconds)?
(Note that for KEDA it's equivalent, as KEDA specifies the scraping frequency.)
Hmmm, that's a good question. As a user, I would naturally go for the retry count but as I don't control scrape interval, that doesn't translate into much.
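For context, the HPA controller reconciles on a fixed interval (kube-controller-manager's `--horizontal-pod-autoscaler-sync-period`, 15s by default), so a failure count does translate roughly into a duration. A quick sketch of the arithmetic, under that default:

```go
// Illustrative arithmetic only, assuming the default 15s HPA sync period
// (kube-controller-manager --horizontal-pod-autoscaler-sync-period).
const syncPeriod = 15 * time.Second
const failureThreshold = 3
// ≈ 45s of consecutive failed metric retrievals before fallback kicks in.
var fallbackAfter = failureThreshold * syncPeriod
```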
  - `replicas`: (integer) the number of replicas to scale to in case of fallback. Must be greater than 0 and it's mandatory.
To allow for tracking of failures to fetch metrics a new field should be added to the existing [`HorizontalPodAutoscalerStatus`][] object:
- `metricRetrievalFailureCount`: (integer) tracks the number of consecutive failures in retrieving metrics. |
It's a single integer, not an integer per metric? If two metrics start failing, don't we want to keep a separate counter for each?
The way I was thinking was to integrate on the exit code of `computeReplicasForMetrics`, which handles partial metric failure. That way we could track a single integer.
I wonder if it's worth writing that down in the KEP? Specifically what happens when:
- There is 1 metric configured, and it fails
- There are multiple metrics configured, and a single metric fails
- There are multiple metrics configured, and they take turns to fail (unlikely, but possible?)
If `failureThreshold` is set to 10, and at a given moment metric `foo` failed 9 times while metric `bar` failed only once, I can't see how we could summarize the current state with one integer.
Do you mind elaborating on how you think this could be handled? I guess you'd take the max (9 in my example)? Is this desirable?
Taking a look at `computeReplicasForMetrics`, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.
In the case you describe, if the first 9 cycles had a failure on metric `foo` and the 10th cycle had a failure on metric `bar`, then there would be 10 consecutive failures.
But if the first cycle had both metrics fail, then the next 8 cycles had metric `foo` fail, and the final 10th cycle had no metrics fail, then the count would max at 9 consecutive failures.
This is just my understanding of what @be0x74a is saying, though. It's worth clarifying in the KEP.
> Taking a look at `computeReplicasForMetrics`, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.
Yeah, that's exactly what I was thinking. I can add it to the KEP as the current state, but I don't mind discussing whether this is the correct approach.
Btw @adrianmoisey do you have a section in mind where I can put this? Maybe in the Design Details section?
The Design Details section sounds good. I think a section describing what a failure is should suffice?
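To make the semantics discussed above concrete, a sketch of the per-cycle counter update follows. It keys off the error returned by `computeReplicasForMetrics` (which already handles partial metric failure), so the counter moves at most once per reconcile; field and type names are the ones proposed in this thread, not existing API.

```go
// Sketch only, not the actual controller change. The counter moves at most once
// per reconcile cycle, based on the error returned by computeReplicasForMetrics,
// and resets on the first successful cycle. Field names are the ones proposed in
// this thread (fallback lives under Spec.Behavior per the latest revision).
replicas, _, _, _, err := a.computeReplicasForMetrics(ctx, hpa, scale, hpa.Spec.Metrics)
if err != nil {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount++
} else {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount = 0
}

fallback := hpa.Spec.Behavior.Fallback // nil checks omitted for brevity
inFallback := fallback != nil &&
	hpa.Status.ConsecutiveMetricRetrievalFailureCount >= *fallback.FailureThreshold
if inFallback {
	replicas = fallback.Replicas // scale to the configured fallback replica count
}
```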
* Replace `MetricRetrievalFailureCount` with `ConsecutiveMetricRetrievalFailureCount`
* Move `HPAFallback` to `HorizontalPodAutoscalerBehavior`
* Add FallbackActive condition
…allback
# Conflicts:
#	keps/sig-autoscaling/5053-hpa-fallback/README.md
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>
Add KEP-5053: Fallback for HPA on failure to retrieve metrics
Issue link: #5053