
KEP-5075: DRA: Consumable Capacity #5104

Draft
wants to merge 3 commits into
base: master

Conversation

sunya-ch

@sunya-ch sunya-ch commented Jan 30, 2025

  • One-line PR description: Enable DRA API and scheduling to support a shared device allocation with consumable capacity.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2025
@sunya-ch sunya-ch marked this pull request as draft January 30, 2025 01:58
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 30, 2025
@k8s-ci-robot
Contributor

Welcome @sunya-ch!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 30, 2025
@k8s-ci-robot
Contributor

Hi @sunya-ch. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 30, 2025
@sunya-ch sunya-ch mentioned this pull request Jan 30, 2025
@pohly
Contributor

pohly commented Feb 4, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Feb 4, 2025
Member

@johnbelamaric johnbelamaric left a comment


Thanks, this is a good start. After reading this thoroughly, I think we can meet all the use cases with "per-device allocatable resources", and do not need to expose the "device provisioning" aspect to the K8s control plane. I think the driver may still do that, but it's actually not something the scheduler needs to care about.

I think if we repurpose this to just "per-device allocatable resources", we will not only meet these use cases, but we can meet some of the ones for modeling standard resources in DRA.

Please take a look and if you want to meet to discuss it, let me know.

quantity: 1Gi
```
#### Story 4
Users can define more than one provision request to get devices from different sources. With the following claim spec, the claim should provide two devices that have different device sources.
Member


To support this, we probably need to create a distinctAttributes constraint type. Otherwise they could land on the same device.

Author


Is this considered one of the items from the must-have list in Revisiting Kubernetes Hardware Resource Model?

- Do we have a need for “anti-affinity”? For example, “select these two VFs make sure they are on different PFs”.

Member


The "must have" in there was for my prototype (which I never really did all those "must haves" :)).

Adding a distinctAttributes constraint can be done in a separate, smaller KEP and proceed independently.

-->
- Introduce the ability to allocate on-demand provisioned devices via DRA. This should cover the use cases of macvlan and ipvlan in a DRA driver for CNI, and virtual accelerator devices with an on-demand memory fraction.
- Enhance the capability of secondary networks to be dynamically allocated based on currently available capacity, such as bandwidth.
- Leverage the enhancement of capacity consumption from [KEP-4815](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
Member


I want to do something distinct from the capacity consumption in there. That consumption is 100% internal to the resource slice, and is about shared resources. In this case, we are talking about consumption by the claims. The reason capacity is separate from attributes was exactly to support this kind of consumption model.

Author


@johnbelamaric I updated the third goal to

- Enable capacity field to be consumable.

required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
This enhancement introduces the terms `device source`, `provision limit`, `provision request`, and `provisioned device`. In addition to the existing device instance, a `device source` represents the root device or device specification used as a source to provision or generate a new device allocated to the Pod. The number of provisions per `device source` can be limited by the `provision limit`. Device sources are exposed by device vendors or provisioners via the Publish API.
Member


Ok, I like this KEP, but I don't think it's necessary to introduce all this machinery. I am also not sure there is a need to "provision" a new device. Instead, I would suggest we can allocate a partial device and share the underlying device. That is, we can move toward "platform-mediated sharing of devices", rather than just the current mechanism of pointing multiple containers or pods at the same resource claim ("user-mediated sharing of devices").

In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.

For some more background see this (outdated) document, and the prototype linked in there.

Consider if the capacity values in a device were considered allocatable. Then, you could have a resource claim for a portion of the device. The scheduler would be able to reconstruct "how much" of the device is consumed by aggregating all resource claims that are attached to that device.

Rather than a specific "provisioning limit" on the number of devices, the limit is hit when there are insufficient resources in capacity to satisfy the claim. This allows the limit to be hit based upon specific resources consumed.

With this, I think in fact we should be able to model "compute devices" that represent the CPU, memory, and other standard node resources. I think this would satisfy the requirements @catblade has for aligning CPU with GPU with NIC, for example.

WDYT? Would you be willing/able to rework this a bit and see if this idea can satisfy your needs? Pretty much, we need to be able to write claims that "allocate" from the capacity fields. In the minimal implementation, we wouldn't actually need a change in the ResourceSlice API, I don't think, only in the ResourceClaim API. In the case of things like virtual interfaces, the name of the created interface could be stored in the ResourceClaim.Status data.

For a more robust implementation, we would want the DeviceCapacity entries in the Capacities to be able to do things like block allocation ("you asked for 1 byte, but I can only allocate in 1Mi increments" - see the doc above). We may also need to be able to model different segments of resources, tagged with different topology, and be able to identify which types of resources are aggregatable across those topologies (for example, "8Gi is in numa0, 8Gi is in numa1 - if you ask for numa alignment and 4Gi, we give you a CPU and memory numa aligned, but if you ask for 10Gi we have to aggregate across the boundary"). Those are harder, maybe we start by writing it up without those, then wait until @catblade complains :)
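The fitting logic described in this comment can be sketched in a few lines. This is a minimal illustration only, with hypothetical types that are not the actual Kubernetes API: the scheduler aggregates the amounts consumed by existing claims attached to a device and admits a new request only if it still fits within the device's capacity.

```go
package main

import "fmt"

// Quantity is a simplified stand-in for resource.Quantity,
// e.g. bandwidth in bytes per second. Illustrative only.
type Quantity int64

// Device models a shareable device with capacity per named resource.
type Device struct {
	Name     string
	Capacity map[string]Quantity
}

// ClaimAllocation records how much of a device one claim consumes.
type ClaimAllocation struct {
	Device   string
	Consumed map[string]Quantity
}

// fits reports whether req can still be satisfied by dev after
// subtracting everything already consumed by existing allocations
// on that device (the aggregation step described above).
func fits(dev Device, existing []ClaimAllocation, req map[string]Quantity) bool {
	for resource, want := range req {
		used := Quantity(0)
		for _, a := range existing {
			if a.Device == dev.Name {
				used += a.Consumed[resource]
			}
		}
		if used+want > dev.Capacity[resource] {
			return false
		}
	}
	return true
}

func main() {
	eth1 := Device{Name: "eth1", Capacity: map[string]Quantity{"bandwidth": 10 << 30}} // 10Gi
	existing := []ClaimAllocation{
		{Device: "eth1", Consumed: map[string]Quantity{"bandwidth": 8 << 30}}, // 8Gi already claimed
	}
	fmt.Println(fits(eth1, existing, map[string]Quantity{"bandwidth": 1 << 30})) // 9Gi total: fits
	fmt.Println(fits(eth1, existing, map[string]Quantity{"bandwidth": 4 << 30})) // 12Gi total: does not fit
}
```

Note how the "provisioning limit" falls out naturally: allocation fails exactly when the remaining capacity is insufficient, with no separate count needed.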

Member


(by the way, there may well be use cases for "provisioning", I just think the ones listed so far don't need it...if you have more use cases, please do add them)



@johnbelamaric I think your comment is apt, and agree.

Author

@sunya-ch sunya-ch Feb 12, 2025


In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.

I totally agree with this direction.

The objective of this KEP is to distinguish the case when the device which is allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are a new virtual network device (i.e., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (i.e., eth1).

However, at the same time, I agree that we can consider provision limit as one capacity, and leave all provisioned data (specific to the new device) to the resourceclaim status.

Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaims status such as for provisioned device, pre-sliced device like MIG, dynamically-slice like virtual GPU ?

Member


The objective of this KEP is to distinguish the case when the device which is allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are a new virtual network device (i.e., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (i.e., eth1).

While at the actual node layer, this is what is happening, it's not clear to me that the K8s control plane needs to model this. It's not clear to me there needs to be awareness at the ResourceSlice/ResourceClaim level of these devices that are provisioned just to contain the allocation of a shared underlying device. It might be necessary, I am just not 100% sure yet and want to explore the options.

Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaims status such as for provisioned device, pre-sliced device like MIG, dynamically-slice like virtual GPU ?

I think we are probably going to need a way to indicate that a device can be shared by platform-mediated sharing. Earlier version of DRA had a Shareable flag to indicate that a device can be shared by user-mediated sharing, but we decided it was unnecessary (since the user controls this anyway). But if the platform needs to manage it, then the driver author may need to let K8s know if a device is shareable. More to explore!

bandwidth:
quantity: 1Gi
```

Member


I think we can add CPU, memory, and other standard resource requests as use cases. @catblade



True.

Author


@johnbelamaric I consider this a per-device attribute, comparable to the memory attribute of a GPU.

Member


Yes, agreed.

devices:
  provisions:
  - name: net1
    deviceClassName: vlan-cni.networking.x-k8s.io
Member


I think you'll also need to attach some config to specify the VLAN ID, for example

Author


Yes, the CNI configuration will also be required in the ResourceClaim, per @LionelJouin's design in cni-dra-driver.
https://github.com/kubernetes-sigs/cni-dra-driver/blob/eb94de8c61835de1f13eefe700de9d9d89615531/docs/design/device-configuration-api.md#design

I will add the reference in this KEP too.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

A motivating use case for supporting dynamically-provisioned devices is a virtual network device, where the to-be-allocated device is created or partitioned on demand (on claim). The original discussion is in [this PR's comment thread](https://github.com/kubernetes-sigs/cni-dra-driver/pull/1#discussion_r1889265214) and the limitation of the current implementation has been addressed [here](https://github.com/kubernetes-sigs/cni-dra-driver/pull/1#discussion_r1890166449). The virtual network device is created and configured once the CNI is called, based on the information of the master network device. The configuration information specific to the generated device should be in the ResourceClaim, not the ResourceSlice. However, the device in the ResourceClaim can currently only be a device listed in the ResourceSlice, and there is no attribute to differentiate between the master device and the actual provisioned device.


Edited for clarity, spelling, and grammar: A motivating use case for supporting dynamically-provisioned device is a virtual network device when to-be-allocated device will be created or partitioned on demand (on claim). The original discussion is in this PR's comment thread and the limitation of current implementation has been addressed here. The virtual network device is created and configured once the CNI is called based on the information of the master network device. The configured information specific to the generated device should be at the ResourceClaim, not the ResourceSlice. However, the device in the ResourceClaim is now present, even though the device is listed in the ResourceSlice, and there is no attribute to differentiate between the master device and the actual provisioned device.

List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
- Introduce an ability to allocating on-demand provisioned devices via DRA. This should cover the use case of macvlan, ipvlan in DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.


Suggestions for grammar/clarity: Introduce the ability to allocate on-demand provisioned devices via DRA. This should cover the use cases of macvlan or ipvlan in a DRA driver for CNI and virtual accelerator devices with an on-demand memory fraction.

-->

#### Story 1
Users can request for secondary networks based on their bandwidth demands without the need to concern about neither device name nor a present available bandwidth of each specific device. DRA scheduler can dynamically select available device based on its availability.


Edited for clarity: A user requests a secondary network based on their bandwidth demands without wanting to specify a device name nor present available bandwidth of each specific device. The DRA scheduler can dynamically select available devices based on their availability.

bandwidth:
quantity: 1Gi
```



True.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 6a57299 to 72f9f82 Compare February 12, 2025 04:34
@sunya-ch
Author

sunya-ch commented Feb 12, 2025

@johnbelamaric @catblade Thank you so much for taking care of this KEP.
I have updated the KEP to address some comments, such as filling in the KEP information that I can and applying the rewrite suggestions first.
To reduce the number of notifications, I have put a 🚀 icon on the comments that I think I have addressed. Please click the resolve button if you agree.

For the comments related to design changes, I will also update the KEP correspondingly next.

@sunya-ch
Copy link
Author

sunya-ch commented Feb 12, 2025

@johnbelamaric I have tried drafting your suggestion below. Please correct my understanding.

kind: ResourceSlice
...
spec:
  driver: cni.dra.networking.x-k8s.io
  devices:
  - name: eth1
    basic:
      attributes:
        name:
          string: "eth1"
      capacity:
        vlan: # <- define provision limit here
          quantity: 1000 
        bandwidth:
          quantity: 10Gi
---
kind: DeviceClass
metadata:
  name: provisionable.networking.x-k8s.io
---
kind: ResourceClaim
...
spec:
  devices: # <- move back the provision request to device request
    requests:
    - name: macvlan
      deviceClassName: provisionable.networking.x-k8s.io
      allocationMode: ExactCount
      count: 1
      resources:  # <- add resource in request (like PodCapacityClaim?). This field does not exist yet; however, I think you also have an idea to have this available based on your suggestion. 
      # Discussion point 1: should the following resource requests be per each count? 
        vlan:
          quantity: 1 # <- I guess you expect this resource consumption to be defined in the resource claim.
          # Discussion point 2: do we need an explicit vlan quantity here when it should never be other than 1?
        bandwidth:
          quantity: 1Gi
    config:
    - requests:
      - macvlan
      opaque:
        driver: cni.dra.networking.x-k8s.io
        parameters: # CNIParameters with the GVK, interface name and CNI Config (in YAML format).
          apiVersion: cni.networking.x-k8s.io/v1alpha1
          kind: CNI
          ifName: "net1"
          config:
            cniVersion: 1.0.0
            name: net1
            plugins:
            - type: macvlan
              master: eth0
              mode: bridge
              ipam:
                type: host-local
                ranges:
                - - subnet: 10.10.1.0/24

For the resource claim status, I would still like to propose a way to distinguish how the device is shared. Can we instead add an optional field to the AllocatedDeviceStatus? For example,

type AllocatedDeviceStatus struct {
...
	// ProvisionData contains provisioning-related information specific to the provisioned device.
	//
	// This data may include provisioner-specific information.
	//
	// +optional
	ProvisionData *ProvisionData
}

updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
Contributor


Procedural comment:

@johnbelamaric
Member

johnbelamaric commented Feb 12, 2025

@johnbelamaric I have tried drafting your suggestion below. Please correct my understanding.

Yes, what you propose in this comment is very much what I am suggesting. And your "discussion" points are definitely things I agree need more discussion and thought.

For the resource claim status, I still would like to propose the way that we can distinguish the way device is shared there. Can we instead add some optional field in the AllocatedDeviceStatus? For example,

Maybe yes? Let's ground it in the specific use cases.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from c861fb3 to 32e1c9c Compare February 18, 2025 03:09
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 18, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sunya-ch
Once this PR has been reviewed and has the lgtm label, please assign huang-wei for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 32e1c9c to d0eb088 Compare February 18, 2025 03:11
@sunya-ch sunya-ch changed the title KEP-5075: DRA: Dynamic Device Provisioning Support KEP-5075: DRA: Consumable Capacity Feb 18, 2025
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from d0eb088 to 3e4b3b4 Compare February 18, 2025 08:58
Comment on lines 176 to 214
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  resources:
    requests:
      bandwidth: "1Gi"
Member

@aojea aojea Feb 18, 2025


I do not understand how this is going to be related to eth1. What if there are eth2 and eth3 too?
Config is opaque, so we should not depend on opaque values; is that what this is suggesting?

Author


The user can specify the device via a selector (such as specifying the device attribute name = eth1). Additionally, DRA allows a pod to dynamically allocate a device based on its available resources, such as bandwidth. The selected device information will also be sent to the driver, and the network driver then configures the selected device.

Member


I assume the idea here is to underspecify when possible, but be more specific if needed. So if there are also eth2 and eth3, then leaving it like this means that any one of eth1-3 will work. If that is not the case, then a selector is needed to further narrow down the eligible devices.

Member


Yes exactly, Morten. We can "schedule" onto any of the underlying eth* devices. If we want a specific one - or one with a specific attribute (like network or trunking) - then that should be specified like any other selector.

Comment on lines 374 to 380
type ResourceRequest struct {
	// All marks requesting all resources from the shared device.
	// If All is set to true, Quantity will be ignored.
	//
	// +optional
	// +default=false
	All bool `json:"all" protobuf:"bytes,1,name=all"`
Member


if I set All then we operate in the previous model, no?

Author


Mostly yes. All can be set only when the allocation mode is Shared.
With the Shared allocation mode on a shared device, the device is first selected by selectors and then checked against constraints, the same way it works in the other allocation modes.
Regardless of the All flag, what is added to the allocation process is a check that the resource is still consumable.
When All is set, only a shared device with its full resources (i.e., no share has been allocated) can be selected.
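To make the `All` semantics described in this comment concrete, here is a hedged sketch with purely illustrative types (not the actual API): with `All` set, a shared device is eligible only if nothing has been allocated from it yet; otherwise the request just has to fit in the remaining capacity.

```go
package main

import "fmt"

// share records one allocation of a single resource from a shared device.
// Illustrative only; not a proposed API type.
type share struct {
	device   string
	quantity int64
}

// eligible reports whether dev can satisfy a request:
//   - with all=true, the device must have no existing shares at all
//     (the claim takes the full device);
//   - with all=false, the requested amount must fit in what remains.
func eligible(dev string, allocated []share, all bool, want int64, capacity int64) bool {
	used := int64(0)
	for _, s := range allocated {
		if s.device == dev {
			used += s.quantity
		}
	}
	if all {
		return used == 0 // full device only
	}
	return used+want <= capacity
}

func main() {
	allocated := []share{{device: "eth1", quantity: 1}}
	fmt.Println(eligible("eth1", allocated, true, 0, 10)) // already shared: not eligible for All
	fmt.Println(eligible("eth2", allocated, true, 0, 10)) // untouched device: eligible for All
}
```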

Member


I don't think we will need this, if they want All, they shouldn't allow Shared.

Author


@johnbelamaric I'm thinking that there are two actors in the system: the device driver and the user who creates a claim. The device driver is the one who decides whether the device is shared. However, a user may want it all to themselves; this flag allows that use case. In that case, the user also doesn't have to check the resource slice for how much capacity is defined in order to request all of it.

allocatedShare:
bandwidth: "1Gi"
```
- ResourceClaim without a resource request (request full available device) (A-type ResourceSlice)
Member


We can already do this, no? Why do we need to create this exception in this model, so that we have two ways of doing the same thing?

}
```

### Test Plan
Member


this part is missing


## Summary

This KEP is an extended use case of the partitionable devices KEP [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
Member


The Partitionable Devices KEP has been moved to sig-scheduling, so the correct link is https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices

For instance, this feature will allow reserving a memory fraction of a virtual GPU in [the AWS virtual GPU device plugin](https://github.com/awslabs/aws-virtual-gpu-device-plugin) via DRA.
The number of shares can be limited if needed.

### Goals
Member


So I think the main issue with the partitionable devices feature that this attempts to solve, is to make it possible to share capacity from a device without having to explicitly list every possible partition. We already know this can be very verbose in some situations (example in https://docs.google.com/document/d/1lXGfnrBixRIMW9ESa-mv09Kisb2myVFV_A3nqPJ4FCQ/edit?tab=t.0#bookmark=id.mmv8k6mzxscm). I think it might be useful to spell out more clearly here how this extends the already planned support for partitionable devices (which already allow for sharing devices and to some extent also on-demand sharing).


spec:
  devices:
    requests:
    - name: vgpu0
Member


Should this also set allocationMode to Shared?

Member


Right now, allocation mode is optional, defaulting to ExactCount: 1, is that correct?

I guess maybe we need an allocation mode AllowShared. I mean, in general I don't think people require shared, but they'll take it if that's all they can get. So, this would imply you are willing to take a shared resource, OR an ExactCount of 1. Since sharing implies < 1, I think that works.

From a user point of view, I would think specifying requests implies shared unless the user disallows it. For example, as a user this would seem most convenient:

  • No resource requests specified, no allocation mode specified => allocation mode ExactCount: 1
  • Resource requests specified, no allocation mode specified => allocation mode AllowShared

But I think that sort of complex validation / defaulting is frowned upon by API reviewers because it's complex and can cause bugs. @pohly?

Note that in the future, we may want to allow a resource request to be met by multiple devices, so whatever we do here with AllocationMode should be expandable to that case. The question is how much control do we allow? For example, if a user wants 128 Gi GPU memory, they would specify that in resource requests. Then, allocation mode blank ideally would mean "however you can get it - shared device, one device, 6 devices, whatever". They could constrain that with the allocation mode to turn on/off shared, and limit to a range (or maybe just limit to a max). They would have to be able turn shared on and off independently of setting the max. Maybe something like this?

# allow at most 4 devices
allocationMode: AllowMultiple
max: 4

# allow both shared or multiple, max of 4 devices
allocationMode: AllowMultipleOrShared
max: 4

Given the flexibility we need here, probably avoiding complicated defaulting rules is the right move.
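For concreteness, the extended enum sketched above might look like this in Go. All names other than the existing `ExactCount` and `All` are hypothetical and only illustrate the discussion, not the KEP's actual API:

```go
package main

import "fmt"

// DeviceAllocationMode mirrors the shape of the existing DRA enum.
// The first two values exist today; the rest are hypothetical
// extensions discussed in this thread.
type DeviceAllocationMode string

const (
	DeviceAllocationModeExactCount            DeviceAllocationMode = "ExactCount" // existing
	DeviceAllocationModeAll                   DeviceAllocationMode = "All"        // existing
	DeviceAllocationModeAllowShared           DeviceAllocationMode = "AllowShared"
	DeviceAllocationModeAllowMultiple         DeviceAllocationMode = "AllowMultiple"
	DeviceAllocationModeAllowMultipleOrShared DeviceAllocationMode = "AllowMultipleOrShared"
)

// allowsShared reports whether a request in this mode may be satisfied
// by a slice of a shared device rather than a whole device.
func allowsShared(m DeviceAllocationMode) bool {
	return m == DeviceAllocationModeAllowShared || m == DeviceAllocationModeAllowMultipleOrShared
}

func main() {
	for _, m := range []DeviceAllocationMode{DeviceAllocationModeExactCount, DeviceAllocationModeAllowMultipleOrShared} {
		fmt.Printf("%s: shared allowed = %v\n", m, allowsShared(m))
	}
}
```

Under this sketch, "blank" defaulting stays trivial (ExactCount: 1) and each extra capability is an explicit opt-in value rather than a validation rule.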

Author

@sunya-ch sunya-ch Feb 20, 2025

@johnbelamaric I have added AllowShared mode to cover a new Story 7 (A user requests a device regardless of whether the device is sharable or not).

For AllowMultiple and AllowMultipleOrShared, can we have those in a separate KEP?

Member

So I'm not sure I follow why sharing implies <1. Wouldn't it be possible for someone to request more than one device, but also accept that one or more of those devices might be shared? I'm not sure if it makes sense in the network use-case, but it does seem like something that might be useful for other use-cases.

I'm also uncertain about adding an additional option for the AllocationMode enum. I think I agree that combining AllocationMode: All with shared devices shouldn't be possible, it still feels a bit strange to add the new AllowShared value. The current values All and ExactCount specify how many devices that will be allocated for the request, which seems separate from whether or not shared devices can be used to satisfy the claim.

Member

For AllowMultiple and AllowMultipleOrShared, can we have those in a separate KEP?

Yes, for sure. Although based on Morten's comment maybe we should think about other ways to specify this? Maybe AllocationMode isn't the right place? Do you have an alternative suggestion, @mortent ?

Comment on lines 312 to 313
Capacity map[QualifiedName]DeviceCapacity
ConsumableCapacity map[QualifiedName]DeviceConsumableCapacity
Member

@mortent mortent Feb 18, 2025

The capacity field specifies the available capacities for the device, so my intuition here is that we don't need a new ConsumableCapacity field. With the introduction of the ConsumesCapacity field in https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices, we make a distinction between the capacity a device consumes from the internal pool and the advertised capacity. Having three fields named Capacity, ConsumesCapacity and ConsumableCapacity will probably get somewhat confusing.

//
// +optional
// +default=false
InfinityResource bool `json:"infinity" protobuf:"bytes,1,name=infinity"`
Member

Do we need to advertise capacity if it is infinite? I would assume every device will have some kind of capacity that is finite and I would assume just listing those would be enough.

Member

@johnbelamaric johnbelamaric left a comment

Got through use case 2, will review more later


This KEP is an extended usecase from the partitionable device [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
The enhancement enables a shared device allocation with consumable capacity values.
A shared device can be allocated to more than one resource claims with pre-device resource requests.
Member

s/claims/claim/
s/pre/per/


## Summary

This KEP is an extended usecase from the partitionable device [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
Member

s/usecase/use case/


- Introduce an ability to allocating on-demand shared devices via DRA. This should cover the use cases of macvlan or ipvlan in a DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.
- Enhance a capability of secondary networks to dynamically allocate secondary networks based on present availabilities such as bandwidth.
- Leverage the enhancement of capacity consumption from [KEP-4815](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
Member

I don't think we need to refer to this here.

The capacity consumption in that case is between partitions of the devices.

The capacity consumption in this case is between claims.

The actual pools of capacity being consumed are completely separate. This is a layered approach. This current KEP should latch on to the "Device" layer. The partitionable KEP resource pools are a layer lower. The partitionable KEP enables the scheduler to calculate what "whole devices" to offer. That is orthogonal to whether or not that "whole device" can be shared and provides its own, allocatable resources.

For example, let's pretend there is a physical GPU (8 gpu cores, 10 Gi memory) that can support two hardware-level partitions: left (6 gpu cores, 8 Gi) and right (2 gpu cores, 2 Gi). Suppose further that using software-level sharing, these types of GPUs can allocate individual GPU cores and GPU memory in 1GB blocks. You would use the features of the "partitionable" KEP to advertise:

  • whole GPU, 8 gpu cores, 10 Gi gpu memory
  • left GPU, 6 gpu cores, 8 Gi gpu memory
  • right GPU, 2 gpu cores, 2 Gi gpu memory

The partitionable "shared resources" would allow the scheduler to:

  • Withdraw "left" and "right" partitions from the available list when "whole" is allocated
  • Withdraw "whole" from the available list when "left" or "right" is allocated

That's all it does. The users never see the "consumption" of resources - they only see "here is a list of devices, when I use this device, these other ones become unavailable".

In fact though, because of the software-level sharing, a user could point multiple resource claims at, say, the "right" GPU. The goal of this KEP is to regulate that sharing.

In this example (and in NVIDIA MIGs), the resources used for the partitioning match the resources advertised by the partitions. However, that is just a "coincidence". Our model allows the resources advertised by the device (and thus consumable in this KEP) to be completely distinct from the resources consumed internally to manage the partitioning scheme.
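To make the layering concrete, here is a rough accounting sketch for the example above. All names are hypothetical; the partitionable-devices layer only tracks whole-device availability against a shared internal pool, and per-claim consumption of an available device is the separate concern this KEP adds:

```go
package main

import "fmt"

// pool is the internal capacity pool that partitions draw from
// (the partitionable-devices KEP layer).
type pool struct{ cores, memGi int64 }

// partition is one advertised device and what it consumes from the pool.
type partition struct {
	name          string
	consumesCores int64
	consumesMemGi int64
}

// available reports whether a partition can still be handed out, given
// what already-allocated partitions consume from the same pool.
func available(p partition, total pool, allocated []partition) bool {
	used := pool{}
	for _, a := range allocated {
		used.cores += a.consumesCores
		used.memGi += a.consumesMemGi
	}
	return used.cores+p.consumesCores <= total.cores && used.memGi+p.consumesMemGi <= total.memGi
}

func main() {
	total := pool{cores: 8, memGi: 10}
	whole := partition{"whole", 8, 10}
	left := partition{"left", 6, 8}
	right := partition{"right", 2, 2}

	// Allocating "whole" withdraws both "left" and "right".
	fmt.Println(available(left, total, []partition{whole}))  // false
	fmt.Println(available(right, total, []partition{whole})) // false
	// Allocating "left" still leaves "right", but withdraws "whole".
	fmt.Println(available(right, total, []partition{left})) // true
	fmt.Println(available(whole, total, []partition{left})) // false
}
```

Nothing in this sketch models per-claim sharing of an available partition; that is exactly the gap this KEP fills at the device layer.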

attributes:
name:
string: "eth1"
consumableCapacity:
Member

The reason capacity is separate from attributes is because it is consumable. So, this should just be listed under "capacity". If we think there will be things that can't be shared, we should have a flag or something. It's not clear to me if that should be at the device level, or at the individual resource level. If it's not consumable, maybe it should be an attribute?

Contributor

The current capacities as defined in 1.32 without this KEP are not consumable. Yes, I still think they should have been attributes with a capacity value type, but you and Tim wanted them to be a separate field with the goal to extend the usage of that field later.

This KEP looks like that extension, so instead of consumableCapacity we should have:

capacity:
   bandwidth:
     value: 10Gi
     consumable: true

I'm using a boolean here for the sake of simplicity. Usually from an API perspective, an enum is preferred because it can be extended later. But such a "type" enum is now a bit weird for "traditional" usage:

  • type: consumable = new semantic
  • type: attribute = traditional semantic, default if unset

Either way, we need a new field, which leads to the problem that the apiserver will drop it and thus "downgrades" the capacity to the traditional "attribute" semantic. The mitigation from https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit?tab=t.0#bookmark=id.jkplkq9taqcd will have to be used by DRA drivers.

Author

@pohly Yes, I agree. We also discussed this topic in our call today.
I will update the KEP accordingly.

You can also find the other discussion points we had in this doc: https://docs.google.com/document/d/1U0u2uErpYcf-RooPEws5oDMiJ9kT2uDoWNjcWrLH224/edit?tab=t.btyzk7v0acfo

Member

From the Slack conversation: Is there a reason not to make all "capacity" entries consumable, as long as the device is marked as shared? I think in CEL, capacities would always be their initial values. Only by using a resource request would you account for allocated (consumed) resources. If we require the device to be marked as shared, then existing devices continue to work as they are (assuming shared is not the default)

Comment on lines 176 to 214
requests:
- name: macvlan
deviceClassName: vlan-cni.networking.x-k8s.io
allocationMode: Shared
resources:
requests:
bandwidth: "1Gi"
Member

Yes exactly, Morten. We can "schedule" onto any of the underlying eth* devices. If we want a specific one - or one with a specific attribute (like network or trunking) - then that should be specified like any other selector.

@pohly
Contributor

pohly commented Feb 19, 2025

/milestone v1.34

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Feb 19, 2025
Member

@johnbelamaric johnbelamaric left a comment

Ok, some more comments.

Looking forward to the next revision.

```

#### Story 4
A user defines more than one requests to get devices from different sources. From the following claim spec, the claim should select two devices which have different device sources.
Member

This cannot be supported without adding a distinctAttributes analogous to our existing matchAttributes. So, either that has to be added to this KEP, or that needs to be a separate KEP and the use case move there.

The consumableCapacity can be set to infinity.
Devices that have at least one consumableCapacity are considered to be shared devices.

Users can only claim a shared device using the newly added `Shared` allocation mode (`DeviceAllocationMode`). In other words, the shared device will be skipped in all other modes, while the unsharable device will be skipped in the Shared mode.
Member

Why would we require shared? If we think that's a real use case, then let's have AllowShared and RequireShared allocation modes.


Users can define specific per-device resource requests through
the newly added resources field in the `DeviceRequest` of the `ResourceClaim`.
If the resources field is not specified, all consumable resources are assumed.
Member

I think it should be "none", at least from the point of view of scheduling. In other words, these should be seen as "minimums", like requests for containers.

Are you saying that they should consume the whole device if they specify no requests, even if the device is marked as Shared?

and its attributes match the request's selectors and constraints.
"share" refers to amount of resources reserved or allocated to the claim request.
The newly added `allocatedShare` field in the `DeviceRequestAllocationResult` will be set when the allocation is successful.
A device can be shared (allocated) to different pods.
Member

I think multi-node resources could have those pods split across nodes. I don't think there is a need to nail it down here.

Comment on lines 374 to 380
type ResourceRequest struct {
// All marks requesting all resource from the shared device.
// If All is set to true, Quantity will be ignored.
//
// +optional
// +default=false
All bool `json:"all" protobuf:"bytes,1,name=all"`
Member

I don't think we will need this, if they want All, they shouldn't allow Shared.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 3e4b3b4 to 1404ee5 Compare February 20, 2025 02:59
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 1404ee5 to 8a8a8d9 Compare February 20, 2025 03:52
@sunya-ch
Author

sunya-ch commented Feb 20, 2025

@johnbelamaric @mortent @pohly @aojea
I have updated the use cases and design based on the discussion in the meeting and in Slack.
Now, only a shared field is added to the ResourceSlice, and all capacity is considered consumable if this flag is set. I believe this change resolves some of the questions and suggestions above. Could you please confirm the resolved ones?

@johnbelamaric I haven't added the attribute for refinement or for setting a minimum share requirement in the capacity yet. I think those can be considered independently of this KEP and added later with a concrete use case. I've added them to the non-goals for now. What do you think?

There are three discussions left in my understanding.

  1. The All field, since I think we still need it for users to claim all resources (and block other allocations) of devices that are flagged as shared.

  2. Whether to allow the same device to be (shared) allocated multiple times from the same claimsToAllocate coming from the same pod. This is related to the comment about distinctAttributes. I've updated this to a non-goal for now.

@aojea About test plan, I will notify you again when it's ready.

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch 2 times, most recently from 62faa31 to 5725286 Compare February 20, 2025 05:16
@googs1025
Member

/cc

@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch 3 times, most recently from e54e9ce to fdd689c Compare February 20, 2025 06:32

Users can define specific per-device resource requests through
the newly added `ResourceRequest` field in the `DeviceRequest` of the `ResourceClaim`.
In a resource request, users may define `Requests` and `Limits` or either of them
Member

The KEP is supposed to be pretty generic, so I think it should mention other typical use cases, so that we could have a better picture what kind of resources could be consumed in this way.

I suspect that some nomenclature has already settled, but the part that is a bit risky to me is the use of the word limit to describe burstable consumption. If we consider memory as a consumable resource, then we should sum up limits, instead of requests.

I think it should be super clear which value decreases available capacity. If we need additional concepts (like burstable use), are we sure that these concepts are generic enough and have a future-proof naming?

Member

I think the current scheduling code works off of requests, not limits. Limits are only enforced by the kubelet. But maybe @dom4ha you can double check? We should do whatever we already do...
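If scheduling follows the container model, the accounting sums requests and leaves limits to be enforced at runtime. A minimal sketch of that accounting, with hypothetical names and plain int64 values standing in for quantities:

```go
package main

import "fmt"

// allocation is one claim's share of a consumable capacity.
type allocation struct{ request, limit int64 }

// remaining returns how much of a consumable capacity is left for new
// claims if the scheduler sums requests, as it does for containers.
// Limits do not affect the result; they would be enforced at runtime.
func remaining(capacity int64, allocs []allocation) int64 {
	var used int64
	for _, a := range allocs {
		used += a.request
	}
	return capacity - used
}

func main() {
	// 10Gi of bandwidth; two claims request 4Gi each but may burst to 8Gi.
	allocs := []allocation{{request: 4, limit: 8}, {request: 4, limit: 8}}
	fmt.Println(remaining(10, allocs)) // 2
}
```

The open question above is whether some resources (like memory) would instead need `used` to sum limits to stay safe against bursting.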

deviceClassName: vgpu.nvidia.com
resources:
requests:
memory: "10Gi"
Member

How should users think about the requests specified here vs selectors that reference capacity?

Author

@mortent
I think this is similar to the node level, where we have a node selector to filter (select) nodes and a resource request to request resources within the node. At the DRA device level, the selector is used to filter devices, while the resources here are used to request the resources of that device.

Member

Yeah, I agree that is probably the most logical way to think about it. But it probably also means that the selector will be evaluated against the full device and not the logical "shared device". That might be a bit surprising, since a selector that matches less capacity than the request means the claim can never be satisfied, while it also lets users do things like request 10Gb of bandwidth from a device that must have at least 40Gb.

- name: net1
deviceClassName: vlan-cni.networking.x-k8s.io
resources:
requests:
Member

We should probably also clarify here how shared devices works with the abstract pools introduced in https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices. If part of a shared device is allocated, I would assume that means that all the consumesCapacity specified for the device would be considered "used" by the capacity pool?

Member

Yes, if any part of a partition is used, then that partition is in use and allocated.

// +optional
// +default=false
// +featureGate=ConsumableCapacity
Shared bool `json:"shared" protobuf:"bytes,3,opt,name=shared"`
Member

// this flag will block the allocation of the shared device from the other claims.
// +optional
// +default=false
All bool `json:"all" protobuf:"bytes,1,name=all"`
Member

The All and Requirements fields are mutually exclusive, so we should add a +oneOf directive.



// Requirements define specific values of resource requirements of the device request.
// If all="true", this field is ignored.
Member

Rather than saying the field will be ignored, we want to validate that only one of the two fields are set.
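The mutual-exclusion validation described above could look roughly like this. This is only a sketch with hypothetical types; real API validation would use apimachinery's field.ErrorList helpers and the actual quantity-based Requirements type:

```go
package main

import (
	"errors"
	"fmt"
)

// resourceRequest mirrors the proposed oneOf: exactly one of All or
// Requirements must be set. Requirements is a stand-in for the real
// quantity-valued requirements map.
type resourceRequest struct {
	All          bool
	Requirements map[string]string
}

// validate rejects requests that set both fields or neither, instead of
// silently ignoring one of them.
func validate(r resourceRequest) error {
	switch {
	case r.All && len(r.Requirements) > 0:
		return errors.New("all and requirements are mutually exclusive")
	case !r.All && len(r.Requirements) == 0:
		return errors.New("exactly one of all or requirements must be set")
	}
	return nil
}

func main() {
	fmt.Println(validate(resourceRequest{All: true})) // <nil>
	fmt.Println(validate(resourceRequest{All: true, Requirements: map[string]string{"bandwidth": "1Gi"}}))
}
```

Rejecting the invalid combination at admission time keeps the "ignored field" ambiguity out of the API semantics entirely.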

@sunya-ch
Author

Since there are multiple aspects of the KEP's API design to discuss, it may be easier to track everything in an online document. I would appreciate it if we could move the design discussions to this document.

Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from fdd689c to 647e34d Compare March 4, 2025 09:55
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
@sunya-ch sunya-ch force-pushed the dra-dynamic-provision branch from 647e34d to edef87c Compare March 4, 2025 10:13