Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

volumebinding: scheduler queueing hints - CSIStorageCapacity #124961

Conversation

bells17
Copy link
Contributor

@bells17 bells17 commented May 20, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

kube-scheduler implements scheduling hints for the VolumeBinding plugin.
The scheduling hints allow the scheduler to determine whether to retry or skip scheduling a Pod based on the changes made to the CSIStorageCapacity resource referenced by the plugin.

Which issue(s) this PR fixes:

Part of #118893
KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4247-queueinghint/README.md
Base PR: #124939

Special notes for your reviewer:

Fields Impacting QueueingHintFn

PersistentVolume (PV) is not included in this table because it can undergo extensive changes when a conversion is performed by csi-translation-lib.

resource field Referenced in PreFilter+Filter? Admission Overwrite Prevention Config Need to Check Changes in QHint?
CSIStorageCapacity .metadata.labels x x x
CSIStorageCapacity .metadata.annotations x x x
CSIStorageCapacity .nodeTopology o o x
CSIStorageCapacity .storageClassName o o x
CSIStorageCapacity .capacity o x o
CSIStorageCapacity .maximumVolumeSize o x o

ref(ja): https://zenn.dev/bells17/scraps/65bd6891012bdc

Does this PR introduce a user-facing change?

kube-scheduler implements scheduling hints for the VolumeBinding plugin.
The scheduling hints allow the scheduler to retry scheduling a Pod that was previously rejected by the VolumeBinding plugin only if a new resource referenced by the plugin was created or an existing resource referenced by the plugin was updated.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 20, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bells17
Once this PR has been reviewed and has the lgtm label, please assign thockin for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/storage Categorizes an issue or PR as relevant to SIG Storage. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 20, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 20, 2024
@bells17
Copy link
Contributor Author

bells17 commented May 20, 2024

/cc @sanposhiho @utam0k

@sanposhiho
Copy link
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 20, 2024
@bells17 bells17 force-pushed the qhint-volume-binding-csistoragecapacity branch from 20b3c2b to 411d7ae Compare May 20, 2024 12:44
@bells17 bells17 force-pushed the qhint-volume-binding-csistoragecapacity branch from bedc76c to 74f80bf Compare May 20, 2024 16:17
Copy link

linux-foundation-easycla bot commented May 21, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot removed the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 21, 2024
}
return events
}

func (pl *VolumeBinding) isSchedulableAfterCSIStorageCapacityChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often is CSIStorageCapacity updated in an actual world?

If it's infrequent in general, can we just return Queue when CSIStorageCapacity is created/updated and a Pod has specific .spec.volumes?

e.g.,

func (pl *VolumeBinding) isSchedulableAfterCSIStorageCapacityChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
//...
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil || vol.Ephemeral != nil:
			// This Pod might have got unschedulable due to CSIStorageCapacity in a past scheduling cycle.			
			return framework.Queue, nil 
		}
	}
	return framework.QueueSkip, nil

Copy link
Contributor Author

@bells17 bells17 May 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment.

How often is CSIStorageCapacity updated in an actual world?

CSIStorageCapacity is updated by the external-provisioner of each CSI Driver. The relevant code can be found here:
https://github.com/kubernetes-csi/external-provisioner/blob/v5.0.0/pkg/capacity/capacity.go#L592-L669

There are several events that trigger the update, but with the default settings, each CSIStorageCapacity for a single CSI Driver is updated at least every minute:
https://github.com/kubernetes-csi/external-provisioner/blob/v5.0.0/pkg/capacity/capacity.go#L512-L526
https://github.com/kubernetes-csi/external-provisioner/blob/v5.0.0/cmd/csi-provisioner/csi-provisioner.go#L111

The number of CSIStorageCapacity objects depends on the NodeTopology configuration of the target CSI Driver.
For example, when using network volumes like the gcp-compute-persistent-disk-csi-driver, it can be on a per-zone basis:
https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/v999.1.0/pkg/gce-pd-csi-driver/node.go#L377-L392
When using lvm running on each node, like topolvm, CSIStorageCapacity is created on a per-node basis:
https://github.com/topolvm/topolvm/blob/main/internal/driver/node.go#L557-L566

If it's infrequent in general, can we just return Queue when CSIStorageCapacity is created/updated and a Pod has specific .spec.volumes?

Yes, that approach alone can help skip queueing Pods unrelated to CSIStorageCapacity, reducing unnecessary overhead.

@bells17 bells17 force-pushed the qhint-volume-binding-csistoragecapacity branch 2 times, most recently from 5346abf to be636e8 Compare May 26, 2024 13:19
@bells17 bells17 force-pushed the qhint-volume-binding-csistoragecapacity branch from be636e8 to 362a702 Compare May 26, 2024 13:38
@bells17
Copy link
Contributor Author

bells17 commented May 26, 2024

@sanposhiho @toVersus Thank you for your review. I have made the necessary changes, so please take another look.

Copy link

@toVersus toVersus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some nits, other than that LGTM.

for _, vol := range pod.Spec.Volumes {
if vol.PersistentVolumeClaim != nil || vol.Ephemeral != nil {
// This Pod might have got unschedulable due to CSIStorageCapacity in a past scheduling cycle.
logger.V(5).Info("CSIStorageCapacity was created or updated, but the pod doesn't mount any persistent volumes and generic ephemeral storage volumes. It doesn't make this pod schedulable.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me wonder: in the first place, could a Pod without any PVC or ephemeral volume get unschedulable by VolumeBinding plugin?
If No, we likely end up having no QHint for CSIStorageCapacity (for now, at least until we build some observability like I said).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I made a mistake in correcting the log message.
(This has been fixed)

It makes me wonder: in the first place, could a Pod without any PVC or ephemeral volume get unschedulable by VolumeBinding plugin?

I think the answer to this is No.

If No, we likely end up having no QHint for CSIStorageCapacity (for now, at least until we build some observability like I said).

Regarding this point, I think that if we don't return QueueSkip in the QHint, scheduling will be attempted due to CSIStorageCapacity changes even in cases where the Pod doesn't have a Volume. Would this not cause any issues in your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this point, I think that if we don't return QueueSkip in the QHint, scheduling will be attempted due to CSIStorageCapacity changes even in cases where the Pod doesn't have a Volume. Would this not cause any issues in your opinion?

Only QHints of plugins that have rejected a Pod would be run. e.g., if, in a previous scheduling cycle, a Pod is failed by resource fit plugin and PodAffinity plugin, then only QHints of those plugins would be run for the Pod.

Then, if the answer to my first question "could a Pod without any PVC or ephemeral volume get unschedulable by VolumeBinding plugin?" is No, filtering Pods with volume in QHint doesn't make any point since only Pods with volume would reach this QHint in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your input. I understand.
I'll proceed to close the unnecessary PRs.

@bells17 bells17 force-pushed the qhint-volume-binding-csistoragecapacity branch from 82b1102 to 8a942f3 Compare May 29, 2024 14:58
@toVersus
Copy link

/lgtm

@k8s-ci-robot
Copy link
Contributor

@toVersus: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bells17 bells17 requested a review from toVersus May 30, 2024 04:11
@bells17
Copy link
Contributor Author

bells17 commented May 31, 2024

/close
#124961 (comment)

@k8s-ci-robot
Copy link
Contributor

@bells17: Closed this PR.

In response to this:

/close
#124961 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants