KEP-5053: Fallback for HPA on failure to retrieve metrics #5054

Open
wants to merge 5 commits into master

Conversation

@be0x74a be0x74a commented Jan 20, 2025

Add KEP-5053: Fallback for HPA on failure to retrieve metrics

Issue link: #5053

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Jan 20, 2025

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: be0x74a
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2025

@k8s-ci-robot (Contributor)

Welcome @be0x74a!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @be0x74a. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025

@adrianmoisey (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 20, 2025
Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:

- `fallback`: an optional new object containing the following fields:
- `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be greater than 0 and is option with a default of 3.

Member:

What does this mean:

Must be greater than 0 and is option with a default of 3.

Is that saying that it's an optional field and that it defaults to "3"?

Author:

Yes, exactly. I've clarified the explanation. Thanks.
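
For illustration, a minimal sketch (assumed names, not the KEP's actual types or defaulting code) of what "optional, defaulting to 3" typically looks like on the defaulting side:

```go
// Stand-in type matching the fields described in the excerpt above; the real
// definition would live in the autoscaling API group.
type HPAFallback struct {
	// FailureThreshold is nil when the user omits the field.
	FailureThreshold *int32
	// Replicas is the replica count to scale to while in fallback.
	Replicas int32
}

// setHPAFallbackDefaults is a hypothetical defaulting helper: if fallback is
// configured but failureThreshold is omitted, it defaults to 3.
func setHPAFallbackDefaults(f *HPAFallback) {
	if f != nil && f.FailureThreshold == nil {
		def := int32(3)
		f.FailureThreshold = &def
	}
}
```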

- Feature implemented behind a `HPAFallback` feature flag
- Initial e2e tests completed and enabled

### Upgrade / Downgrade Strategy

Member:

This is possibly worth filling in, i.e.: if someone enables the alpha feature and then does a rollback, what happens?

Author:

Filled it in with a brief overview of what happens when the feature flag is enabled or disabled.
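
As a rough illustration of the standard pattern such a strategy section usually describes (hypothetical names and stand-in types, not the KEP's code): when the HPAFallback gate is disabled, the API server would drop the new field on writes unless the existing object already uses it, so enabling the alpha feature and then rolling back does not silently strand or wipe configuration.

```go
// Stand-in shapes for illustration only.
type HPAFallback struct {
	FailureThreshold *int32
	Replicas         int32
}

type HorizontalPodAutoscalerSpec struct {
	Fallback *HPAFallback
	// ... existing HPA spec fields elided ...
}

// dropDisabledFallbackField sketches the usual treatment of an alpha field on
// create/update when its feature gate is off.
func dropDisabledFallbackField(newSpec, oldSpec *HorizontalPodAutoscalerSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.Fallback != nil {
		// The field was already in use before the gate was turned off
		// (e.g. after a rollback); keep it rather than clearing it.
		return
	}
	newSpec.Fallback = nil
}
```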

Fill `Upgrade / Downgrade Strategy` section.
Clarify `failureThreshold` field documentation
@be0x74a be0x74a requested a review from adrianmoisey January 20, 2025 23:21

@raywainman raywainman left a comment

Thanks a lot for putting this together :) I think this will be tremendously useful for users.

Would love for the community to provide more feedback on the API, I think we are definitely on the right track here!

type HorizontalPodAutoscalerStatus struct {
// metricRetrievalFailureCount tracks the number of consecutive failures in retrieving metrics.
//+optional
MetricRetrievalFailureCount int32

Should we maybe be explicit in the name that this is consecutive failures?

ConsecutiveMetricRetrievalFailureCount?

Author:

Although it's a bit of a long name, I agree it's better to be explicit that they're consecutive failures.

Author:

Updated

// fallback state and defines the threshold for errors required to enter the
// fallback state.
//+optional
Fallback *HPAFallback

Any thoughts on adding this to the Behavior field?

(I don't feel strongly, just curious what you think)

Author:

I hadn't thought of it, but I can see how it would make sense to define the fallback behavior on the Behavior field.

Member:

I semi-agree with this.
Behaviour feels right, but the description of Behaviour feels wrong: https://github.com/kubernetes/kubernetes/blob/0798325ba13643358aa3ebb7c6ddc3006ac26a7c/pkg/apis/autoscaling/types.go#L111-L113

// HorizontalPodAutoscalerBehavior configures a scaling behavior for Up and Down direction
// (scaleUp and scaleDown fields respectively).

It feels to me like "Scaling behaviour" and "Fallback behaviour" are two different things.
Maybe if the comment I linked to were reworded, "Behaviour" would be the right place?

Author:

Added a section about updating the Behavior field's comment and moved the fallback into it.
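
A rough sketch of what the move discussed above could look like (field shapes are illustrative, not the final API; `HPAScalingRules` is the existing autoscaling/v2 type):

```go
// HorizontalPodAutoscalerBehavior would grow beyond pure scaling policy to
// also carry the fallback configuration, with its doc comment reworded
// accordingly.
type HorizontalPodAutoscalerBehavior struct {
	// Existing scaling policies (HPAScalingRules is the current autoscaling/v2 type).
	ScaleUp   *HPAScalingRules
	ScaleDown *HPAScalingRules

	// Fallback defines the replica count to hold, and the number of
	// consecutive metric-retrieval failures required, to enter the fallback state.
	// +optional
	Fallback *HPAFallback
}

// HPAFallback mirrors the fields proposed in this KEP.
type HPAFallback struct {
	// Optional; defaults to 3. Must be greater than 0.
	// +optional
	FailureThreshold *int32
	// Required; must be greater than 0.
	Replicas int32
}
```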

inFallback = false
}

if !inFallback {

I think we'll probably want to set a status when the HPA is in fallback mode?

Member:

Maybe an event too?

Events:
  Type     Reason                        Age                From                       Message
  ----     ------                        ----               ----                       -------
  Normal   SuccessfulRescale             2m19s              horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

That's an event for a regular scale, so I think a similar message but with a different reason would be useful

Member:

I think we'll probably want to set a status when the HPA is in fallback mode?

I commented higher up about a condition (see https://github.com/kubernetes/enhancements/pull/5054/files#r1924247581).

I imagine that covers the status update you're talking about?

Right - yes a condition flagging that the HPA is in this fallback state. Good callout.

Author:

That's an event for a regular scale, so I think a similar message but with a different reason would be useful

I added an event here, but I guess it's better to have a specific message for fallback rather than the error responsible for it.

Author:

Added the FallbackActive condition
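
For illustration, the event plus condition discussed in this thread might be recorded roughly like the fragment below, in the same spirit as the KEP's pseudocode excerpt above. The `FallbackActive` type, reason strings, and `setCondition` helper are placeholder names, not settled API; `eventRecorder` is assumed to follow client-go's record.EventRecorder interface.

```go
if inFallback {
	// Surface the state in .status.conditions so `kubectl describe hpa` shows it.
	setCondition(hpa, "FallbackActive", "True", "MetricRetrievalFailing",
		fmt.Sprintf("%d consecutive metric retrieval failures reached the threshold of %d; holding %d replicas",
			failureCount, failureThreshold, fallbackReplicas))
	// Emit an event with a fallback-specific reason, mirroring SuccessfulRescale.
	eventRecorder.Eventf(hpa, "Warning", "FallbackRescale",
		"New size: %d; reason: metrics unavailable, fallback active", fallbackReplicas)
} else {
	setCondition(hpa, "FallbackActive", "False", "MetricsAvailable",
		"metric retrieval is healthy")
}
```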

CurrentReplicas int32
DesiredReplicas int32
CurrentMetrics []MetricStatus
Conditions []HorizontalPodAutoscalerCondition

Member:

I think a new condition may be needed too?
This is what we currently have:

Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count

I imagine a "Fallback" Type may be useful

Author:

Yeah definitely

Author:

Done

Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:

- `fallback`: an optional new object containing the following fields:
- `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be a value greater than 0. This field is optional and defaults to 3 if not specified.

Contributor:

failureThreshold specifies a maximum number of retries. Isn't a duration what users are looking for (i.e. trigger fallback if scraping metrics fails for x seconds)?

(Note that for KEDA it's equivalent, as KEDA specifies the scraping frequency.)

Author:

Hmmm, that's a good question. As a user, I would naturally go for the retry count, but since I don't control the scrape interval, that doesn't translate into much.

- `replicas`: (integer) the number of replicas to scale to in case of fallback. Must be greater than 0 and is mandatory.

To allow for tracking of failures to fetch metrics a new field should be added to the existing [`HorizontalPodAutoscalerStatus`][] object:
- `metricRetrievalFailureCount`: (integer) tracks the number of consecutive failures in retrieving metrics.

Contributor:

It's a single integer, not an integer per metric? If two metrics start failing, don't we want to keep a separate counter for each?

Author:

The way I was thinking of doing it was to key off the result of computeReplicasForMetrics, which handles partial metric failure. That way we could track a single integer.

Member:

I wonder if it's worth writing that down in the KEP? Specifically what happens when:

  1. There is 1 metric configured, and it fails
  2. There are multiple metrics configured, and a single metric fails
  3. There are multiple metrics configured, and they take turns to fail (unlikely, but possible?)

Contributor:

If failureThreshold is set to 10, and at a given moment metric foo failed 9 times while metric bar failed only once, I can't see how we could summarize the current state with one integer.

Do you mind elaborating on how you think this could be handled? I guess you'd take the max (9 in my example)? Is this desirable?

Member:

Taking a look at computeReplicasForMetrics, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.

In the case you describe, if the first 9 cycles had a failure on metric foo and the 10th cycle had a failure on metric bar, then there would be 10 consecutive failures.

But if the first cycle had both metrics fail, then the next 8 cycles had metric foo and the final 10th cycle had no metrics fail, then the count would max at 9 consecutive failures.

This is just my understanding of what @be0x74a is saying, though. Either way, it's worth clarifying in the KEP.

@be0x74a (Author), Feb 9, 2025:

Taking a look at computeReplicasForMetrics, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.

Yeah, that's exactly what I was thinking. I can add it to the KEP as the current state, but I don't mind discussing whether this is the correct approach.
Btw @adrianmoisey do you have a section in mind where I can put this? Maybe in the Design Details section?

Member:

The Design Details section sounds good. I think a section describing what a failure is should suffice?
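
To make the semantics discussed in this thread concrete, here is a sketch with placeholder names (the real hook would sit around computeReplicasForMetrics in the HPA controller): at most one increment per reconcile cycle, a reset on a clean cycle, and fallback once the threshold is reached.

```go
// metricsFailedThisCycle is the per-cycle failure signal derived from
// computeReplicasForMetrics; exactly what counts as a failure (any metric vs.
// all metrics) is the open question this thread says the KEP should pin down.
if metricsFailedThisCycle {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount++
} else {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount = 0
}

inFallback := hpa.Spec.Behavior != nil && hpa.Spec.Behavior.Fallback != nil &&
	hpa.Status.ConsecutiveMetricRetrievalFailureCount >= *hpa.Spec.Behavior.Fallback.FailureThreshold // threshold already defaulted

if inFallback {
	// Hold the configured replica count instead of a metric-based recommendation.
	desiredReplicas = hpa.Spec.Behavior.Fallback.Replicas
}
```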

* Replace `MetricRetrievalFailureCount` with `ConsecutiveMetricRetrievalFailureCount`
* Move `HPAFallback` to `HorizontalPodAutoscalerBehavior`
* Add FallbackActive condition
…allback

# Conflicts:
#	keps/sig-autoscaling/5053-hpa-fallback/README.md
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>