KEP-5053: Fallback for HPA on failure to retrieve metrics #5054

Open
wants to merge 5 commits into master

Conversation

@be0x74a be0x74a commented Jan 20, 2025

Add KEP-5053: Fallback for HPA on failure to retrieve metrics

Issue link: #5053

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Jan 20, 2025

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: be0x74a
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 20, 2025

@k8s-ci-robot (Contributor)

Welcome @be0x74a!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @be0x74a. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025

@adrianmoisey (Member)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 20, 2025
Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:

- `fallback`: an optional new object containing the following fields:
- `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be greater than 0 and is option with a default of 3.

Member:

What does this mean:

Must be greater than 0 and is option with a default of 3.

Is that saying that it's an optional field and that it defaults to "3"?

Author:

Yes, exactly. I've clarified the explanation. Thanks.
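
For illustration, a minimal sketch (assumed names, not the KEP's actual types or defaulting code) of what "optional, defaulting to 3" typically looks like on the defaulting side:

```go
// Stand-in type matching the fields described in the excerpt above; the real
// definition would live in the autoscaling API group.
type HPAFallback struct {
	// FailureThreshold is nil when the user omits the field.
	FailureThreshold *int32
	// Replicas is the replica count to scale to while in fallback.
	Replicas int32
}

// setHPAFallbackDefaults is a hypothetical defaulting helper: if fallback is
// configured but failureThreshold is omitted, it defaults to 3.
func setHPAFallbackDefaults(f *HPAFallback) {
	if f != nil && f.FailureThreshold == nil {
		def := int32(3)
		f.FailureThreshold = &def
	}
}
```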

- Feature implemented behind a `HPAFallback` feature flag
- Initial e2e tests completed and enabled

### Upgrade / Downgrade Strategy

Member:

This is possibly worth filling in, i.e.: if someone enables the alpha feature and then does a rollback, what happens?

Author:

Filled it in with a brief overview of what happens when the feature flag is enabled or disabled.
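
As a rough illustration of the standard pattern such a strategy section usually describes (hypothetical names and stand-in types, not the KEP's code): when the HPAFallback gate is disabled, the API server would drop the new field on writes unless the existing object already uses it, so enabling the alpha feature and then rolling back does not silently strand or wipe configuration.

```go
// Stand-in shapes for illustration only.
type HPAFallback struct {
	FailureThreshold *int32
	Replicas         int32
}

type HorizontalPodAutoscalerSpec struct {
	Fallback *HPAFallback
	// ... existing HPA spec fields elided ...
}

// dropDisabledFallbackField sketches the usual treatment of an alpha field on
// create/update when its feature gate is off.
func dropDisabledFallbackField(newSpec, oldSpec *HorizontalPodAutoscalerSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.Fallback != nil {
		// The field was already in use before the gate was turned off
		// (e.g. after a rollback); keep it rather than clearing it.
		return
	}
	newSpec.Fallback = nil
}
```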

Fill `Upgrade / Downgrade Strategy` section.
Clarify `failureThreshold` field documentation
@be0x74a be0x74a requested a review from adrianmoisey January 20, 2025 23:21

@raywainman raywainman left a comment

Thanks a lot for putting this together :) I think this will be tremendously useful for users.

Would love for the community to provide more feedback on the API, I think we are definitely on the right track here!

type HorizontalPodAutoscalerStatus struct {
// metricRetrievalFailureCount tracks the number of consecutive failures in retrieving metrics.
//+optional
MetricRetrievalFailureCount int32

Should we maybe be explicit in the name that this is consecutive failures?

ConsecutiveMetricRetrievalFailureCount?

Author:

Although it's a bit of a long name, I agree it's better to be explicit that they're consecutive failures.

Author:

Updated

// fallback state and defines the threshold for errors required to enter the
// fallback state.
//+optional
Fallback *HPAFallback

Any thoughts on adding this to the Behavior field?

(I don't feel strongly, just curious what you think)

Author:

I hadn't thought of it, but I can see how it would make sense to define the fallback behavior on the Behavior field.

Member:

I semi-agree with this.
Behaviour feels right, but the description of Behaviour feels wrong: https://github.com/kubernetes/kubernetes/blob/0798325ba13643358aa3ebb7c6ddc3006ac26a7c/pkg/apis/autoscaling/types.go#L111-L113

// HorizontalPodAutoscalerBehavior configures a scaling behavior for Up and Down direction
// (scaleUp and scaleDown fields respectively).

It feels to me like "Scaling behaviour" and "Fallback behaviour" are two different things.
Maybe if the comment I linked to were reworded, "Behaviour" would be the right place?

Author:

Added a section about updating the Behavior field's comment and moved the fallback into it.
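
A rough sketch of what the move discussed above could look like (field shapes are illustrative, not the final API; `HPAScalingRules` is the existing autoscaling/v2 type):

```go
// HorizontalPodAutoscalerBehavior would grow beyond pure scaling policy to
// also carry the fallback configuration, with its doc comment reworded
// accordingly.
type HorizontalPodAutoscalerBehavior struct {
	// Existing scaling policies (HPAScalingRules is the current autoscaling/v2 type).
	ScaleUp   *HPAScalingRules
	ScaleDown *HPAScalingRules

	// Fallback defines the replica count to hold, and the number of
	// consecutive metric-retrieval failures required, to enter the fallback state.
	// +optional
	Fallback *HPAFallback
}

// HPAFallback mirrors the fields proposed in this KEP.
type HPAFallback struct {
	// Optional; defaults to 3. Must be greater than 0.
	// +optional
	FailureThreshold *int32
	// Required; must be greater than 0.
	Replicas int32
}
```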

inFallback = false
}

if !inFallback {

I think we'll probably want to set a status when the HPA is in fallback mode?

Member:

Maybe an event too?

Events:
  Type     Reason                        Age                From                       Message
  ----     ------                        ----               ----                       -------
  Normal   SuccessfulRescale             2m19s              horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

That's an event for a regular scale, so I think a similar message but with a different reason would be useful

Member:

I think we'll probably want to set a status when the HPA is in fallback mode?

I commented higher up about a condition (see https://github.com/kubernetes/enhancements/pull/5054/files#r1924247581).

I imagine that covers the status update you're talking about?

Right - yes a condition flagging that the HPA is in this fallback state. Good callout.

Author:

That's an event for a regular scale, so I think a similar message but with a different reason would be useful

I added an event here, but I guess it's better to have a specific message for fallback rather than the error responsible for it.

Author:

Added the FallbackActive condition
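
For illustration, the event plus condition discussed in this thread might be recorded roughly like the fragment below, in the same spirit as the KEP's pseudocode excerpt above. The `FallbackActive` type, reason strings, and `setCondition` helper are placeholder names, not settled API; `eventRecorder` is assumed to follow client-go's record.EventRecorder interface.

```go
if inFallback {
	// Surface the state in .status.conditions so `kubectl describe hpa` shows it.
	setCondition(hpa, "FallbackActive", "True", "MetricRetrievalFailing",
		fmt.Sprintf("%d consecutive metric retrieval failures reached the threshold of %d; holding %d replicas",
			failureCount, failureThreshold, fallbackReplicas))
	// Emit an event with a fallback-specific reason, mirroring SuccessfulRescale.
	eventRecorder.Eventf(hpa, "Warning", "FallbackRescale",
		"New size: %d; reason: metrics unavailable, fallback active", fallbackReplicas)
} else {
	setCondition(hpa, "FallbackActive", "False", "MetricsAvailable",
		"metric retrieval is healthy")
}
```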

CurrentReplicas int32
DesiredReplicas int32
CurrentMetrics []MetricStatus
Conditions []HorizontalPodAutoscalerCondition

Member:

I think a new condition may be needed too?
This is what we currently have:

Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count

I imagine a "Fallback" Type may be useful

Author:

Yeah definitely

Author:

Done

Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerSpec`][] object:

- `fallback`: an optional new object containing the following fields:
- `failureThreshold`: (integer) the number of failures fetching metrics to trigger the fallback behaviour. Must be a value greater than 0. This field is optional and defaults to 3 if not specified.

Contributor:

failureThreshold specifies a maximum number of retries. Isn't a duration what users are looking for (i.e. trigger fallback if scraping metrics fails for x seconds)?

(Note that for KEDA it's equivalent, as KEDA specifies the scraping frequency.)

Author:

Hmmm, that's a good question. As a user, I would naturally go for the retry count, but since I don't control the scrape interval, that doesn't translate into much.

- `replicas`: (integer) the number of replicas to scale to in case of fallback. Must be greater than 0 and is mandatory.

To allow for tracking of failures to fetch metrics a new field should be added to the existing [`HorizontalPodAutoscalerStatus`][] object:
- `metricRetrievalFailureCount`: (integer) tracks the number of consecutive failures in retrieving metrics.

Contributor:

It's a single integer, not an integer per metric? If two metrics start failing, don't we want to keep a separate counter for each?

Author:

The way I was thinking of doing it was to key off the result of computeReplicasForMetrics, which handles partial metric failure. That way we could track a single integer.

Member:

I wonder if it's worth writing that down in the KEP? Specifically what happens when:

  1. There is 1 metric configured, and it fails
  2. There are multiple metrics configured, and a single metric fails
  3. There are multiple metrics configured, and they take turns to fail (unlikely, but possible?)

Contributor:

If failureThreshold is set to 10, and at a given moment metric foo failed 9 times while metric bar failed only once, I can't see how we could summarize the current state with one integer.

Do you mind elaborating on how you think this could be handled? I guess you'd take the max (9 in my example)? Is this desirable?

Member:

Taking a look at computeReplicasForMetrics, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.

In the case you describe, if the first 9 cycles had a failure on metric foo and the 10th cycle had a failure on metric bar, then there would be 10 consecutive failures.

But if the first cycle had both metrics fail, then the next 8 cycles had metric foo and the final 10th cycle had no metrics fail, then the count would max at 9 consecutive failures.

This is just my understanding of what @be0x74a is saying, though. Either way, it's worth clarifying in the KEP.

@be0x74a (Author), Feb 9, 2025:

Taking a look at computeReplicasForMetrics, I assume the counter can only be incremented once per cycle, and only if at least 1 metric has failed.

Yeah, that's exactly what I was thinking. I can add it to the KEP as the current state, but I don't mind discussing whether this is the correct approach.
Btw @adrianmoisey do you have a section in mind where I can put this? Maybe in the Design Details section?

Member:

The Design Details section sounds good. I think a section describing what a failure is should suffice?
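
To make the semantics discussed in this thread concrete, here is a sketch with placeholder names (the real hook would sit around computeReplicasForMetrics in the HPA controller): at most one increment per reconcile cycle, a reset on a clean cycle, and fallback once the threshold is reached.

```go
// metricsFailedThisCycle is the per-cycle failure signal derived from
// computeReplicasForMetrics; exactly what counts as a failure (any metric vs.
// all metrics) is the open question this thread says the KEP should pin down.
if metricsFailedThisCycle {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount++
} else {
	hpa.Status.ConsecutiveMetricRetrievalFailureCount = 0
}

inFallback := hpa.Spec.Behavior != nil && hpa.Spec.Behavior.Fallback != nil &&
	hpa.Status.ConsecutiveMetricRetrievalFailureCount >= *hpa.Spec.Behavior.Fallback.FailureThreshold // threshold already defaulted

if inFallback {
	// Hold the configured replica count instead of a metric-based recommendation.
	desiredReplicas = hpa.Spec.Behavior.Fallback.Replicas
}
```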

* Replace `MetricRetrievalFailureCount` with `ConsecutiveMetricRetrievalFailureCount`
* Move `HPAFallback` to `HorizontalPodAutoscalerBehavior`
* Add FallbackActive condition
…allback

# Conflicts:
#	keps/sig-autoscaling/5053-hpa-fallback/README.md
Co-authored-by: Adrian Moisey <adrian@changeover.za.net>