feat: Introduce object limits #2626

Open: wants to merge 1 commit into main

Member

@mrueg mrueg commented Mar 10, 2025

What this PR does / why we need it:
This change allows user-controlled limits on how many objects KSM will list from the API. This is helpful to prevent resource exhaustion on KSM, in case the API creates too many resources.

The object limit is set globally and applied per resource watched.
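
For readers skimming the description, here is a minimal sketch of what "set globally and applied per resource watched" could look like. It is not the code from this PR: `listOptions` and `applyLimit` are hypothetical names, and the only upstream fact relied on is that `metav1.ListOptions` has a signed `Limit` field of type `int64`.

```go
package main

import "fmt"

// listOptions is a hypothetical stand-in for the Limit field of
// metav1.ListOptions, which is a signed 64-bit integer upstream.
type listOptions struct {
	Limit int64
}

// applyLimit copies the global limit into the list options of one resource.
// A limit of 0 is treated as "not set" and leaves the options untouched.
func applyLimit(opts listOptions, limit int64) listOptions {
	if limit != 0 {
		opts.Limit = limit
	}
	return opts
}

func main() {
	const globalLimit int64 = 5000 // illustrative value only

	// The same global value is applied to each watched resource separately.
	for _, resource := range []string{"pods", "configmaps"} {
		opts := applyLimit(listOptions{}, globalLimit)
		fmt.Printf("%s: list limit %d\n", resource, opts.Limit)
	}
}
```

In this sketch each resource gets its own copy of the options, so the limit caps every list call independently rather than capping a shared total.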

This is currently a WIP as I'm not sure if it will work as expected and it needs further testing.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Introduces a new metric on ksm telemetry, which is static over KSM config lifetime: kube_state_metrics_list_limit
This will help with alerting when a threshold is reached.
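
One way to picture the alerting use case is the comparison an alert would make between an observed object count and the configured limit. The sketch below is an assumption-laden illustration: `nearLimit`, the 0.9 ratio, and the idea that the object count comes from some other metric are all hypothetical and not taken from this PR.

```go
package main

import "fmt"

// nearLimit reports whether count has reached the given fraction of limit.
// A limit of 0 or less means no limit is configured, so it never fires.
func nearLimit(count, limit int64, ratio float64) bool {
	if limit <= 0 {
		return false
	}
	return float64(count) >= ratio*float64(limit)
}

func main() {
	// Hypothetical values: the limit as exposed by
	// kube_state_metrics_list_limit, the count from another source.
	fmt.Println(nearLimit(4800, 5000, 0.9))
	fmt.Println(nearLimit(1200, 5000, 0.9))
}
```

Because the limit metric is static for the lifetime of the KSM config, only the object count side of such a comparison changes over time.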

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2622

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2025
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 10, 2025
@mrueg mrueg force-pushed the limit-list branch 3 times, most recently from 9f17eee to 915ef4e Compare March 10, 2025 22:38
@mrueg mrueg force-pushed the limit-list branch 2 times, most recently from adee22a to dcffe5f Compare March 27, 2025 19:41
@mrueg mrueg changed the title WIP: feat: Introduce object limits feat: Introduce object limits Mar 27, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2025
@mrueg mrueg force-pushed the limit-list branch 2 times, most recently from e1c99cf to b7f6f2b Compare March 27, 2025 20:38
}
i.metrics.ListRequestsTotal.WithLabelValues("success", i.resource).Inc()

if i.limit != 0 {
Member

We'd want to validate against negative values, and preferably switch the type from int to uint. I'm reviewing this on my way back from work, so apologies if I missed something obvious here.

Member Author

Good point, to be honest, I have no idea why the kubernetes API made limits a signed int. https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#ListOptions

@mrueg mrueg force-pushed the limit-list branch 2 times, most recently from 00b2c6b to 93b4b9f Compare April 12, 2025 23:49
Member

@rexagod rexagod left a comment

Can we have an e2e test-case as well that covers --object-limit? Also one nit, otherwise LGTM.

@@ -202,5 +204,9 @@ func (o *Options) Validate() error {
return fmt.Errorf("value for --auto-gomemlimit-ratio=%f must be greater than 0 and less than or equal to 1", o.AutoGoMemlimitRatio)
}

if o.ObjectLimit < 0 {
return fmt.Errorf("value for --object-limit=%d must be equal or greater than 0", o.ObjectLimit)
Member

Do we want this to be greater than zero? I see in a couple of instances in watch.go that we check for this being greater than zero? Or maybe we want the conditions to reflect this value (0) as well, and as such, accommodate this in watch.go as well?

The latter would make more sense as in we keep the same expectations as cache.ListerWatcher allows upstream.

Member Author

It throws an error because we assume that a negative object limit is invalid. If it's zero, we don't want to throw an error because this is the default and means that it's not set.
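
The three cases discussed in this thread (negative is invalid, zero means unset, positive is an active limit) can be sketched as follows. The check mirrors the `o.ObjectLimit < 0` condition quoted in the diff; the `int64` type and the `limitEnabled` helper are assumptions made for illustration, not code from this PR.

```go
package main

import "fmt"

// validateObjectLimit mirrors the check quoted in the diff: only negative
// values are rejected. The int64 type is an assumption for this sketch.
func validateObjectLimit(limit int64) error {
	if limit < 0 {
		return fmt.Errorf("value for --object-limit=%d must be equal or greater than 0", limit)
	}
	return nil
}

// limitEnabled is a hypothetical helper: zero is the default and means
// that no limit is set, so only positive values enable limiting.
func limitEnabled(limit int64) bool {
	return limit > 0
}

func main() {
	for _, limit := range []int64{-1, 0, 100} {
		err := validateObjectLimit(limit)
		fmt.Printf("limit=%d valid=%t enabled=%t\n", limit, err == nil, limitEnabled(limit))
	}
}
```

Keeping validation (reject negatives) separate from the "is a limit active" check is what lets zero stay a valid default without triggering any limiting.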

@mrueg mrueg force-pushed the limit-list branch 8 times, most recently from 3613b0d to fcd7a8b Compare May 31, 2025 20:22
@mrueg mrueg force-pushed the limit-list branch 5 times, most recently from 00bacd5 to 9ec4b78 Compare May 31, 2025 21:01
This change allows user-controlled limits on how many objects KSM will
list from the API. This is helpful to prevent resource exhaustion on
KSM, in case the API creates too many resources.

The object limit is set globally and applied per resource watched.
@mrueg
Member Author

mrueg commented May 31, 2025

@rexagod I added another e2e test, please take a look if you have some spare cycles.

Labels: approved, cncf-cla: yes, needs-triage, size/L
Projects: None yet
Development: Successfully merging this pull request may close these issues: Cap / Limit number of objects ingested for native and Custom Resource Metrics
3 participants