KEP-5075: DRA: Consumable Capacity #5104
base: master
Conversation
sunya-ch commented Jan 30, 2025 (edited)
- One-line PR description: Enable DRA API and scheduling to support a shared device allocation with consumable capacity.
- Issue link: DRA: Consumable Capacity #5075
- Other comments:
- WIP forked repo: https://github.com/sunya-ch/kubernetes/tree/kep-5075
- Document for technical discussion: https://docs.google.com/document/d/1U0u2uErpYcf-RooPEws5oDMiJ9kT2uDoWNjcWrLH224/edit?usp=sharing
/wg device-management
Thanks, this is a good start. After reading this thoroughly, I think we can meet all the use cases with "per-device allocatable resources", and do not need to expose the "device provisioning" aspect to the K8s control plane. I think the driver may still do that, but it's actually not something the scheduler needs to care about.
I think if we repurpose this to just "per-device allocatable resources", we will not only meet these use cases, but we can meet some of the ones for modeling standard resources in DRA.
Please take a look and if you want to meet to discuss it, let me know.
quantity: 1Gi
```
#### Story 4
Users can define more than one provision request to get devices from different sources. From the following claim spec, the claim should provide two devices that have different device sources.
To support this, we probably need to create a `distinctAttributes` constraint type. Otherwise they could land on the same device.
Is this considered one of the items from the must-have list in Revisiting Kubernetes Hardware Resource Model?
> - Do we have a need for "anti-affinity"? For example, "select these two VFs, make sure they are on different PFs".
The "must have" in there was for my prototype (which I never really did all those "must haves" :)).
Adding a `distinctAttributes` constraint can be done in a separate, smaller KEP and proceed independently.
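For illustration, such a constraint could mirror the existing `matchAttribute` constraint with inverted semantics. A minimal sketch of a claim using it (the `distinctAttribute` field name and the attribute used are hypothetical, not part of any current API):

```yaml
kind: ResourceClaim
spec:
  devices:
    requests:
    - name: vf0
      deviceClassName: vlan-cni.networking.x-k8s.io
    - name: vf1
      deviceClassName: vlan-cni.networking.x-k8s.io
    constraints:
    # Hypothetical: require the two requests to resolve to devices whose
    # values for this attribute differ, e.g. VFs on different PFs.
    - requests: ["vf0", "vf1"]
      distinctAttribute: parentDevice
```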
-->
- Introduce an ability to allocating on-demand provisioned devices via DRA. This should cover the use case of macvlan, ipvlan in DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.
- Enhance a capability of secondary networks to dynamically allocate secondary networks based on present availabilities such as bandwidth.
- Leverage the enhancement of capacity consumption from [KEP-4815](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
I want to do something distinct from the capacity consumption in there. That consumption is 100% internal to the resource slice, and is about shared resources. In this case, we are talking about consumption by the claims. The reason `capacity` is separate from `attributes` was exactly to support this kind of consumption model.
@johnbelamaric I updated the third goal to
- Enable capacity field to be consumable.
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->
This enhancement's system design introduces the terms `device source`, `provision limit`, `provision request`, and `provisioned device`. In addition to the existing device instance, a `device source` represents the root device or device specification used as a source to provision or generate a new device allocated to the Pod. The number of provisions per `device source` can be limited by the `provision limit`. Device sources are exposed by device vendors or provisioners via the Publish API.
Ok, I like this KEP, but I don't think it's necessary to introduce all this machinery. I also am not sure there is a need to "provision" a new device. Instead, I would suggest we can allocate a partial device and share the underlying device. That is, we can move toward "platform mediated sharing of devices", rather than just the current mechanism of pointing multiple containers or pods at the same resource claim ("user mediated sharing of devices").
In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.
For some more background see this (outdated) document, and the prototype linked in there.
Consider if the `capacity` values in a device were considered allocatable. Then, you could have a resource claim for a portion of the device. The scheduler would be able to reconstruct "how much" of the device is consumed by aggregating all resource claims that are attached to that device.
Rather than a specific "provisioning limit" on the number of devices, the limit is hit when there are insufficient resources in `capacity` to satisfy the claim. This allows the limit to be hit based upon specific resources consumed.
With this, I think in fact we should be able to model "compute devices" that represent the CPU, memory, and other standard node resources. I think this would satisfy the requirements @catblade has for aligning CPU with GPU with NIC, for example.
WDYT? Would you be willing/able to rework this a bit and see if this idea can satisfy your needs? Pretty much, we need to be able to write claims that "allocate" from the `capacity` fields. In the minimal implementation, we wouldn't actually need a change in the ResourceSlice API, I don't think, only in the ResourceClaim API. In the case of things like virtual interfaces, the name of the created interface could be stored in the ResourceClaim.Status data.
For a more robust implementation, we would want the `DeviceCapacity` entries in the `Capacities` to be able to do things like block allocation ("you asked for 1 byte, but I can only allocate in 1Mi increments" - see the doc above). We may also need to be able to model different segments of resources, tagged with different topology, and be able to identify which types of resources are aggregatable across those topologies (for example, "8Gi is in numa0, 8Gi is in numa1 - if you ask for numa alignment and 4Gi, we give you a CPU and memory numa aligned, but if you ask for 10Gi we have to aggregate across the boundary"). Those are harder, maybe we start by writing it up without those, then wait until @catblade complains :)
(by the way, there may well be use cases for "provisioning", I just think the ones listed so far don't need it...if you have more use cases, please do add them)
@johnbelamaric I think your comment is apt, and agree.
> In platform mediated sharing, we do not share resource claims, but instead we have separate resource claims which hold a partial allocation of the same underlying device from the same resource slice. Each of those resource claims contains the amount of allocation. This is analogous to how resources are consumed by pods on nodes; in this case, resources are consumed by claims on devices.
I totally agree with this direction.
The objective of this KEP is to distinguish the case where the device allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are new virtual network devices (e.g., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (e.g., eth1).
However, at the same time, I agree that we can consider the provision limit as one capacity, and leave all provisioned data (specific to the new device) to the ResourceClaim status.
Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaim status, such as for a provisioned device, a pre-sliced device like MIG, or a dynamically-sliced device like a virtual GPU?
> The objective of this KEP is to distinguish the case when the device which is allocated to the pod is actually a new device provisioned dynamically based on the master device. For instance, macvlan and ipvlan are a new virtual network device (i.e., net1) which will be provisioned, configured, and removed by the cni-dra-driver based on the selected master device (i.e., eth1).
While at the actual node layer, this is what is happening, it's not clear to me that the K8s control plane needs to model this. It's not clear to me there needs to be awareness at the ResourceSlice/ResourceClaim level of these devices that are provisioned just to contain the allocation of a shared underlying device. It might be necessary, I am just not 100% sure yet and want to explore the options.
> Is it possible to have an explicit way to clarify the type of sharing in the ResourceClaims status such as for provisioned device, pre-sliced device like MIG, dynamically-slice like virtual GPU?

I think we are probably going to need a way to indicate that a device can be shared by platform-mediated sharing. An earlier version of DRA had a `Shareable` flag to indicate that a device can be shared by user-mediated sharing, but we decided it was unnecessary (since the user controls this anyway). But if the platform needs to manage it, then the driver author may need to let K8s know whether a device is shareable. More to explore!
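As one possible shape for that exploration, a driver could advertise per-device shareability in the ResourceSlice; the `shareable` field below is purely illustrative and not part of any current API:

```yaml
kind: ResourceSlice
spec:
  driver: cni.dra.networking.x-k8s.io
  devices:
  - name: eth1
    basic:
      # Hypothetical flag: the driver opts this device into
      # platform-mediated sharing.
      shareable: true
      capacity:
        bandwidth:
          quantity: 10Gi
```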
bandwidth:
  quantity: 1Gi
```
I think we can add CPU, memory, and other standard resource requests as use cases. @catblade
True.
@johnbelamaric I consider this a per-device attribute, comparable to the memory attribute of a GPU.
Yes, agreed.
devices:
  provisions:
  - name: net1
    deviceClassName: vlan-cni.networking.x-k8s.io
I think you'll also need to attach some config to specify the VLAN ID, for example
Yes, the CNI configuration will also be required for the ResourceClaim, as in @LionelJouin's design in cni-dra-driver:
https://github.com/kubernetes-sigs/cni-dra-driver/blob/eb94de8c61835de1f13eefe700de9d9d89615531/docs/design/device-configuration-api.md#design
I will add the reference in this KEP too.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->

A motivating use case for supporting dynamically-provisioned devices is a virtual network device where the to-be-allocated device will be created or partitioned on demand (on claim). The original discussion is in [this PR's comment thread](https://github.com/kubernetes-sigs/cni-dra-driver/pull/1#discussion_r1889265214) and the limitation of the current implementation has been addressed [here](https://github.com/kubernetes-sigs/cni-dra-driver/pull/1#discussion_r1890166449). The virtual network device is created and configured once the CNI is called based on the information of the master network device. The configured information specific to the generated device should be at the ResourceClaim, not the ResourceSlice. However, the device in the ResourceClaim can currently only be a device listed in the ResourceSlice, and there is no attribute to differentiate between the master device and the actual provisioned device.
Edited for clarity, spelling, and grammar: A motivating use case for supporting dynamically-provisioned device is a virtual network device when to-be-allocated device will be created or partitioned on demand (on claim). The original discussion is in this PR's comment thread and the limitation of current implementation has been addressed here. The virtual network device is created and configured once the CNI is called based on the information of the master network device. The configured information specific to the generated device should be at the ResourceClaim, not the ResourceSlice. However, the device in the ResourceClaim is now present, even though the device is listed in the ResourceSlice, and there is no attribute to differentiate between the master device and the actual provisioned device.
List the specific goals of the KEP. What is it trying to achieve? How will we
know that this has succeeded?
-->
- Introduce an ability to allocating on-demand provisioned devices via DRA. This should cover the use case of macvlan, ipvlan in DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.
Suggestions for grammar/clarity: Introduce an ability to allocating on-demand provisioned devices via DRA. This should cover the use cases of macvlan or ipvlan in a DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.
-->

#### Story 1
Users can request for secondary networks based on their bandwidth demands without the need to concern about neither device name nor a present available bandwidth of each specific device. DRA scheduler can dynamically select available device based on its availability.
Edited for clarity: A user requests a secondary network based on their bandwidth demands without wanting to specify a device name nor present available bandwidth of each specific device. The DRA scheduler can dynamically select available devices based on their availability.
bandwidth:
  quantity: 1Gi
```
True.
@johnbelamaric @catblade Thank you so much for taking care of this KEP. For the comment related to design change, I will also update the KEP correspondingly next.
@johnbelamaric I have tried drafting your suggestion below. Please correct my understanding.

```yaml
kind: ResourceSlice
...
spec:
  driver: cni.dra.networking.x-k8s.io
  devices:
  - name: eth1
    basic:
      attributes:
        name:
          string: "eth1"
      capacity:
        vlan: # <- define provision limit here
          quantity: 1000
        bandwidth:
          quantity: 10Gi
---
kind: DeviceClass
metadata:
  name: provisionable.networking.x-k8s.io
```

```yaml
kind: ResourceClaim
...
spec:
  devices: # <- move the provision request back to the device request
    requests:
    - name: macvlan
      deviceClassName: provisionable.networking.x-k8s.io
      allocationMode: ExactCount
      count: 1
      # Add resources in the request (like PodCapacityClaim?). This field does
      # not exist yet; however, I think you also have an idea to have this
      # available based on your suggestion.
      resources:
        # Discussion point 1: should the following resource requests be per each count?
        vlan:
          quantity: 1 # <- I guess that you expect this resource consumption to be defined in the resource claim.
          # Discussion point 2: do we need an explicit vlan quantity here while it should not be other than 1?
        bandwidth:
          quantity: 1Gi
    config:
    - requests:
      - macvlan
      opaque:
        driver: cni.dra.networking.x-k8s.io
        # CNIParameters with the GVK, interface name and CNI Config (in YAML format).
        parameters:
          apiVersion: cni.networking.x-k8s.io/v1alpha1
          kind: CNI
          ifName: "net1"
          config:
            cniVersion: 1.0.0
            name: net1
            plugins:
            - type: macvlan
              master: eth0
              mode: bridge
              ipam:
                type: host-local
                ranges:
                - - subnet: 10.10.1.0/24
```

For the resource claim status, I still would like to propose a way to distinguish how the device is shared there. Can we instead add an optional field in the AllocatedDeviceStatus? For example:

```go
type AllocatedDeviceStatus struct {
	...
	// ProvisionData contains provisioning-related information specific to the provisioned device.
	//
	// This data may include provisioner-specific information.
	//
	// +optional
	*ProvisionData
}
```
updates.

[documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
-->
Procedural comment:
- you can remove comments in the YAML like this once you have filled out a section
- please use line wrapping as suggested in the KEP template: add guidance for line breaks #5085

Yes, what you propose in this comment is very much what I am suggesting. And your "discussion" points are definitely things I agree need more discussion and thought.
Maybe yes? Let's ground it in the specific use cases.
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  resources:
    requests:
      bandwidth: "1Gi"
I do not understand how this is going to be related to eth1; what if there are eth2 and eth3 too?
The config is opaque, so we should not depend on opaque values; is that what this is suggesting?
A user can specify the device via a selector (such as specifying the device's attribute name = eth1). Additionally, DRA allows a pod to dynamically allocate a device based on its available resources, such as bandwidth. The selected device information will also be sent to the driver, so the network driver can configure the selected device.
I assume the idea here is to underspecify when possible, but be more specific if needed. So if there are also `eth2` and `eth3`, then leaving it like this means that any one of `eth1`-`eth3` will work. If that is not the case, then a selector is needed to further narrow down the eligible devices.
Yes exactly, Morten. We can "schedule" onto any of the underlying eth* devices. If we want a specific one - or one with a specific attribute (like network or trunking) - then that should be specified like any other selector.
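For example, narrowing the request to a specific interface could be done with an ordinary CEL selector; the attribute qualifier and name below are assumptions about what this driver would publish (matching the `name` attribute in the earlier ResourceSlice examples):

```yaml
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  selectors:
  # Assumed attribute qualifier; depends on what the driver publishes.
  - cel:
      expression: device.attributes["cni.dra.networking.x-k8s.io"].name == "eth1"
  resources:
    requests:
      bandwidth: "1Gi"
```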
type ResourceRequest struct {
	// All marks requesting all resource from the shared device.
	// If All is set to true, Quantity will be ignored.
	//
	// +optional
	// +default=false
	All bool `json:"all" protobuf:"bytes,1,name=all"`
If I set `All`, then we operate in the previous model, no?
Mostly yes. `All` can be set only when the allocation mode is `Shared`.
With the Shared allocation mode on a shared device, the device will first be selected by selectors and finally checked against constraints, the same way it works in the other allocation modes.
Regardless of the `All` flag, what is added in the allocation process is a check that the resource is still consumable.
When `All` is set, only a shared device with its full resources (i.e., no share has been allocated) can be selected.
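Under that proposal, a claim that wants exclusive use of a shareable device might look like this sketch (field shapes follow the API proposed in this KEP and are still under discussion, not an existing API):

```yaml
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  resources:
    # Proposed flag: take the device's entire capacity; the allocation
    # succeeds only if no share has been allocated from the device yet.
    all: true
```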
I don't think we will need this, if they want All, they shouldn't allow Shared.
@johnbelamaric I'm thinking that there are two actors in the system: the device driver and the user who creates a claim. The device driver is the one who decides whether the device is shared. However, a user may want it all for themselves. This flag will allow this use case. Also, in that case, the user doesn't have to check the resource slice for how much capacity is defined and then request all items.
allocatedShare:
  bandwidth: "1Gi"
```
- ResourceClaim without a resource request (request full available device) (A-type ResourceSlice)
We can already do this, no? Why do we need to create this exception in this model, so that we end up with two ways of doing the same thing?
}
```

### Test Plan
This part is missing.
## Summary

This KEP is an extended usecase from the partitionable device [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
The Partitionable Devices KEP has been moved to sig-scheduling, so the correct link is https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices
For instance, this feature will allow reserving a memory fraction of a virtual GPU in [the AWS virtual GPU device plugin](https://github.com/awslabs/aws-virtual-gpu-device-plugin) via DRA.
The number of shares can be limited if needed.

### Goals
So I think the main issue with the partitionable devices feature that this attempts to solve, is to make it possible to share capacity from a device without having to explicitly list every possible partition. We already know this can be very verbose in some situations (example in https://docs.google.com/document/d/1lXGfnrBixRIMW9ESa-mv09Kisb2myVFV_A3nqPJ4FCQ/edit?tab=t.0#bookmark=id.mmv8k6mzxscm). I think it might be useful to spell out more clearly here how this extends the already planned support for partitionable devices (which already allow for sharing devices and to some extent also on-demand sharing).
spec:
  devices:
    requests:
    - name: vgpu0
Should this also set `allocationMode` to `Shared`?
Right now, allocation mode is optional, defaulting to ExactCount: 1, is that correct?

I guess maybe we need an allocation mode `AllowShared`. I mean, in general I don't think people require shared, but they'll take it if that's all they can get. So, this would imply you are willing to take a shared resource, OR an ExactCount of 1. Since sharing implies < 1, I think that works.

From a user point of view, I would think specifying requests implies shared unless the user disallows it. For example, as a user this would seem most convenient:

- No resource requests specified, no allocation mode specified => allocation mode `ExactCount: 1`
- Resource requests specified, no allocation mode specified => allocation mode `AllowShared`

But I think that sort of complex validation / defaulting is frowned upon by API reviewers because it's complex and can cause bugs. @pohly?

Note that in the future, we may want to allow a resource request to be met by multiple devices, so whatever we do here with AllocationMode should be expandable to that case. The question is how much control do we allow? For example, if a user wants 128 Gi GPU memory, they would specify that in resource requests. Then, allocation mode blank ideally would mean "however you can get it - shared device, one device, 6 devices, whatever". They could constrain that with the allocation mode to turn on/off shared, and limit to a range (or maybe just limit to a max). They would have to be able to turn shared on and off independently of setting the max. Maybe something like this?

```yaml
# allow at most 4 devices
allocationMode: AllowMultiple
max: 4

# allow both shared or multiple, max of 4 devices
allocationMode: AllowMultipleOrShared
max: 4
```

Given the flexibility we need here, probably avoiding complicated defaulting rules is the right move.
@johnbelamaric I have added an `AllowShared` mode to cover a new Story 7 (a user requests a device regardless of whether the device is sharable or not).
For `AllowMultiple` and `AllowMultipleOrShared`, can we have that in a separate KEP?
So I'm not sure I follow why sharing implies < 1. Wouldn't it be possible for someone to request more than one device, but also accept that one or more of those devices might be shared? I'm not sure if it makes sense in the network use case, but it does seem like something that might be useful for other use cases.
I'm also uncertain about adding an additional option to the `AllocationMode` enum. While I agree that combining `AllocationMode: All` with shared devices shouldn't be possible, it still feels a bit strange to add the new `AllowShared` value. The current values `All` and `ExactCount` specify how many devices will be allocated for the request, which seems separate from whether or not shared devices can be used to satisfy the claim.
> For `AllowMultiple` and `AllowMultipleOrShared`, can we have that for a separate KEP?

Yes, for sure. Although based on Morten's comment maybe we should think about other ways to specify this? Maybe `AllocationMode` isn't the right place? Do you have an alternative suggestion, @mortent?
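One alternative shape along those lines would keep the device count and the sharing policy orthogonal, rather than folding both into `AllocationMode`. Purely illustrative (the `sharingPolicy` field and the class name are invented for this sketch):

```yaml
requests:
- name: vgpu0
  deviceClassName: gpu.example.com   # hypothetical class name
  allocationMode: ExactCount         # how many devices
  count: 1
  sharingPolicy: Allowed             # invented: whether shared devices may satisfy the request
  resources:
    requests:
      memory: "4Gi"
```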
Capacity map[QualifiedName]DeviceCapacity
ConsumableCapacity map[QualifiedName]DeviceConsumableCapacity
The `capacity` field specifies the available capacities for the device, so my intuition here is that we don't need a new `ConsumableCapacity` field. With the introduction of the `ConsumesCapacity` field in https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices, we make a distinction between the capacity a device consumes from the internal pool and the advertised capacity. Having three fields named `Capacity`, `ConsumesCapacity`, and `ConsumableCapacity` will probably get somewhat confusing.
//
// +optional
// +default=false
InfinityResource bool `json:"infinity" protobuf:"bytes,1,name=infinity"`
Do we need to advertise capacity if it is infinite? I would assume every device will have some kind of capacity that is finite and I would assume just listing those would be enough.
Got through use case 2, will review more later
> This KEP is an extended usecase from the partitionable device [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
> The enhancement enables a shared device allocation with consumable capacity values.
> A shared device can be allocated to more than one resource claims with pre-device resource requests.
s/claims/claim/
s/pre/per/
> ## Summary
>
> This KEP is an extended usecase from the partitionable device [KEP-4815: DRA: Add support for partitionable devices](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
s/usecase/use case/
> - Introduce an ability to allocating on-demand shared devices via DRA. This should cover the use cases of macvlan or ipvlan in a DRA driver for CNI and virtual accelerator devices with on-demand memory fraction.
> - Enhance a capability of secondary networks to dynamically allocate secondary networks based on present availabilities such as bandwidth.
> - Leverage the enhancement of capacity consumption from [KEP-4815](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4815-dra-partitionable-devices).
I don't think we need to refer to this here. The capacity consumption in that case is between partitions of the devices; the capacity consumption in this case is between claims. The actual pools of capacity being consumed are completely separate.

This is a layered approach: this KEP should latch on to the "Device" layer, while the partitionable KEP's resource pools are a layer lower. The partitionable KEP enables the scheduler to calculate which "whole devices" to offer. That is orthogonal to whether or not a "whole device" can be shared and provides its own allocatable resources.
For example, let's pretend there is a physical GPU (8 gpu cores, 10 Gi memory) that can support two hardware-level partitions: left (6 gpu cores, 8 Gi) and right (2 gpu cores, 2 Gi). Suppose further that using software-level sharing, these types of GPUs can allocate individual GPU cores and GPU memory in 1GB blocks. You would use the features of the "partitionable" KEP to advertise:
- whole GPU, 8 gpu cores, 10 Gi gpu memory
- left GPU, 6 gpu cores, 8 Gi gpu memory
- right GPU, 2 gpu cores, 2 Gi gpu memory
The partitionable "shared resources" would allow the scheduler to:
- Withdraw "left" and "right" partitions from the available list when "whole" is allocated
- Withdraw "whole" from the available list when "left" or "right" is allocated
That's all it does. The users never see the "consumption" of resources - they only see "here is a list of devices, when I use this device, these other ones become unavailable".
In fact though, because of the software-level sharing, a user could point multiple resource claims at, say, the "right" GPU. The goal of this KEP is to regulate that sharing.
In this example (and in NVIDIA MIGs), the resources used for the partitioning match the resources advertised by the partitions. However, that is just a "coincidence". Our model allows the resources advertised by the device (and thus consumable in this KEP) to be completely distinct from the resources consumed internally to manage the partitioning scheme.
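To make the layering concrete, here is a rough Go sketch of the partitionable-devices bookkeeping described above. All types and names here are hypothetical illustrations, not the real DRA API: each advertised device consumes resources from a shared internal pool, so allocating one device withdraws any sibling whose internal consumption no longer fits.

```go
package main

import "fmt"

type quantities map[string]int64

// device is a hypothetical advertised device that draws from an
// internal capacity pool when allocated.
type device struct {
	name     string
	consumes quantities
}

// pool tracks the internal capacity shared by all partitions of one
// physical device.
type pool struct {
	capacity  quantities
	allocated quantities
}

// fits reports whether the device's internal consumption still fits
// in the pool, given what has already been allocated.
func (p *pool) fits(d device) bool {
	for res, need := range d.consumes {
		if p.allocated[res]+need > p.capacity[res] {
			return false
		}
	}
	return true
}

// allocate records the device's consumption against the pool.
func (p *pool) allocate(d device) {
	for res, need := range d.consumes {
		p.allocated[res] += need
	}
}

// available lists the devices whose consumption still fits the pool.
func available(p *pool, devices []device) []string {
	var out []string
	for _, d := range devices {
		if p.fits(d) {
			out = append(out, d.name)
		}
	}
	return out
}

func main() {
	// The example GPU above: 8 cores and 10 Gi shared by three partitions.
	p := &pool{
		capacity:  quantities{"gpu-cores": 8, "memory-gi": 10},
		allocated: quantities{},
	}
	whole := device{"whole", quantities{"gpu-cores": 8, "memory-gi": 10}}
	left := device{"left", quantities{"gpu-cores": 6, "memory-gi": 8}}
	right := device{"right", quantities{"gpu-cores": 2, "memory-gi": 2}}
	devices := []device{whole, left, right}

	fmt.Println(available(p, devices)) // all three fit an empty pool

	p.allocate(left) // allocating "left" withdraws "whole", leaves "right"
	fmt.Println(available(p, devices))
}
```

Note that the users never see this internal accounting; they only see the resulting list of devices that remain allocatable. The per-claim sharing this KEP proposes is a separate layer on top of each advertised device.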
```yaml
attributes:
  name:
    string: "eth1"
consumableCapacity:
```
The reason capacity is separate from attributes is because it is consumable. So, this should just be listed under "capacity". If we think there will be things that can't be shared, we should have a flag or something. It's not clear to me if that should be at the device level, or at the individual resource level. If it's not consumable, maybe it should be an attribute?
The current capacities as defined in 1.32 without this KEP are not consumable. Yes, I still think they should have been attributes with a capacity value type, but you and Tim wanted them to be a separate field with the goal to extend the usage of that field later.

This KEP looks like that extension, so instead of `consumableCapacity` we should have:

```yaml
capacity:
  bandwidth:
    value: 10Gi
    consumable: true
```

I'm using a boolean here for the sake of simplicity. Usually from an API perspective, an enum is preferred because it can be extended later. But such a "type" enum is now a bit weird for "traditional" usage:

- `type: consumable` = new semantic
- `type: attribute` = traditional semantic, default if unset

Either way, we need a new field, which leads to the problem that the apiserver will drop it and thus "downgrade" the capacity to the traditional "attribute" semantic. The mitigation from https://docs.google.com/document/d/1sCfO4hJUUhZpzld-_wEZi85XqSclEVvjUfsEe2-duZs/edit?tab=t.0#bookmark=id.jkplkq9taqcd will have to be used by DRA drivers.
@pohly Yes, I agree. We also discussed this topic in our call today. I will update the KEP accordingly.

You may find the other discussion points we had in this doc too: https://docs.google.com/document/d/1U0u2uErpYcf-RooPEws5oDMiJ9kT2uDoWNjcWrLH224/edit?tab=t.btyzk7v0acfo
From the Slack conversation: Is there a reason not to make all "capacity" entries consumable, as long as the device is marked as shared? I think in CEL, capacities would always be their initial values. Only by using a resource request would you account for allocated (consumed) resources. If we require the device to be marked as shared, then existing devices continue to work as they are (assuming shared is not the default)
```yaml
requests:
- name: macvlan
  deviceClassName: vlan-cni.networking.x-k8s.io
  allocationMode: Shared
  resources:
    requests:
      bandwidth: "1Gi"
```
Yes exactly, Morten. We can "schedule" onto any of the underlying eth* devices. If we want a specific one - or one with a specific attribute (like network or trunking) - then that should be specified like any other selector.
/milestone v1.34
Ok, some more comments.
Looking forward to the next revision.
```yaml
spec:
  devices:
    requests:
    - name: vgpu0
```
Right now, allocation mode is optional, defaulting to `ExactCount: 1`, is that correct?

I guess maybe we need an allocation mode `AllowShared`. I mean, in general I don't think people require shared, but they'll take it if that's all they can get. So, this would imply you are willing to take a shared resource, OR an ExactCount of 1. Since sharing implies < 1, I think that works.

From a user point of view, I would think specifying requests implies shared unless the user disallows it. For example, as a user this would seem most convenient:

- No resource requests specified, no allocation mode specified => allocation mode `ExactCount: 1`
- Resource requests specified, no allocation mode specified => allocation mode `AllowShared`

But I think that sort of complex validation / defaulting is frowned upon by API reviewers because it's complex and can cause bugs. @pohly?

Note that in the future, we may want to allow a resource request to be met by multiple devices, so whatever we do here with AllocationMode should be expandable to that case. The question is how much control do we allow? For example, if a user wants 128 Gi GPU memory, they would specify that in resource requests. Then, a blank allocation mode would ideally mean "however you can get it - shared device, one device, 6 devices, whatever". They could constrain that with the allocation mode to turn on/off shared, and limit to a range (or maybe just limit to a max). They would have to be able to turn shared on and off independently of setting the max. Maybe something like this?

```yaml
# allow at most 4 devices
allocationMode: AllowMultiple
max: 4

# allow both shared or multiple, max of 4 devices
allocationMode: AllowMultipleOrShared
max: 4
```

Given the flexibility we need here, probably avoiding complicated defaulting rules is the right move.
> #### Story 4
> A user defines more than one request to get devices from different sources. From the following claim spec, the claim should select two devices which have different device sources.
This cannot be supported without adding a `distinctAttributes` analogous to our existing `matchAttributes`. So, either that has to be added to this KEP, or it needs to be a separate KEP and the use case moved there.
> The consumableCapacity can be set to infinity.
> Devices that have at least one consumableCapacity are considered to be shared devices.
>
> Users can only claim a shared device using the newly added `Shared` allocation mode (`DeviceAllocationMode`). In other words, the shared device will be skipped in all other modes, while the unsharable device will be skipped in the Shared mode.
Why would we require shared? If we think that's a real use case, then let's have `AllowShared` and `RequireShared` allocation modes.
> Users can define specific per-device resource requests through the newly added resources field in the `DeviceRequest` of the `ResourceClaim`. If the resources field is not specified, all consumable resources are assumed.
I think it should be "none", at least from the point of view of scheduling. In other words, these should be seen as "minimums", like `requests` for containers.

Are you saying that they should consume the whole device if they specify no requests, even if the device is marked as Shared?
> and its attributes match the request's selectors and constraints.
> "share" refers to amount of resources reserved or allocated to the claim request.
> The newly added `allocatedShare` field in the `DeviceRequestAllocationResult` will be set when the allocation is successful.
> A device can be shared (allocated) to different pods.
I think multi-node resources could have those pods split across nodes. I don't think there is a need to nail it down here.
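The accounting implied by `allocatedShare` can be sketched as follows. This is a hypothetical illustration, not the proposed implementation: a shared device can back several claims as long as the sum of their shares stays within each consumable capacity.

```go
package main

import "fmt"

// share maps a consumable resource name to a requested or allocated amount.
type share map[string]int64

// sharedDevice is a hypothetical stand-in for a device with consumable
// capacities, tracking one allocated share per successful claim.
type sharedDevice struct {
	capacity        share
	allocatedShares []share
}

// fits reports whether a new request can be satisfied from what remains
// after summing the shares already allocated to other claims.
func (d *sharedDevice) fits(request share) bool {
	for res, want := range request {
		used := int64(0)
		for _, s := range d.allocatedShares {
			used += s[res]
		}
		if used+want > d.capacity[res] {
			return false
		}
	}
	return true
}

// allocate records the request as a new allocated share if it fits.
func (d *sharedDevice) allocate(request share) bool {
	if !d.fits(request) {
		return false
	}
	d.allocatedShares = append(d.allocatedShares, request)
	return true
}

func main() {
	// A 10 Gbit NIC shared among bandwidth claims.
	eth1 := &sharedDevice{capacity: share{"bandwidth-mbit": 10000}}
	fmt.Println(eth1.allocate(share{"bandwidth-mbit": 4000})) // true
	fmt.Println(eth1.allocate(share{"bandwidth-mbit": 4000})) // true
	fmt.Println(eth1.allocate(share{"bandwidth-mbit": 4000})) // false: only 2000 left
}
```

The key property is that the device stays in the allocatable pool across claims; only its remaining consumable capacity shrinks.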
```go
type ResourceRequest struct {
	// All marks requesting all resource from the shared device.
	// If All is set to true, Quantity will be ignored.
	//
	// +optional
	// +default=false
	All bool `json:"all" protobuf:"bytes,1,name=all"`
```
I don't think we will need this, if they want All, they shouldn't allow Shared.
Force-pushed 3e4b3b4 to 1404ee5 (compare)
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Force-pushed 1404ee5 to 8a8a8d9 (compare)
@johnbelamaric @mortent @pohly @aojea I haven't added the attribute for refinement or for setting a minimum requirement of the share in the capacity yet. I think those can be considered independently from this KEP and can be added later with a concrete use case. I have added them to the non-goal items for now. What do you think? There are three discussions left in my understanding.

@aojea About the test plan, I will notify you again when it's ready.
Force-pushed 62faa31 to 5725286 (compare)

/cc
Force-pushed e54e9ce to fdd689c (compare)
> Users can define specific per-device resource requests through the newly added `ResourceRequest` field in the `DeviceRequest` of the `ResourceClaim`. In a resource request, users may define `Requests` and `Limits` or either of them
The KEP is supposed to be pretty generic, so I think it should mention other typical use cases, so that we could have a better picture of what kinds of resources could be consumed in this way.

I suspect that some nomenclature is already settled, but the part that is a bit risky to me is the use of the word `limit` to describe burstable consumption. If we consider memory as a consumable resource, then we should sum up limits instead of requests.

I think it should be super clear which value decreases available capacity. If we need additional concepts (like burstable use), are we sure that these concepts are generic enough and have future-proof naming?
I think the current scheduling code works off of requests, not limits. Limits are only enforced by the kubelet. But maybe @dom4ha you can double check? We should do whatever we already do...
```yaml
deviceClassName: vgpu.nvidia.com
resources:
  requests:
    memory: "10Gi"
```
How should users think about the requests specified here vs selectors that reference capacity?
@mortent I think this is similar to node-level selection, where we have a node selector to filter (select) nodes and resource requests to request resources within the node. At the DRA device level, the selector is used for filtering devices, while the resources here request resources of that device.
Yeah, I agree that is probably the most logical way to think about it. But it probably also means that the selector will be evaluated against the full device and not the logical "shared device". It might be a bit surprising as it means that using a selector that is smaller than the request means the claim can never be satisfied, while it also lets users do things like request 10Gb of bandwidth from a device that must have at least 40Gb.
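That distinction can be sketched in a few lines of Go (all names here are hypothetical illustrations): a selector is evaluated against the device's full advertised capacity, while a resource request is checked against what remains after prior allocations.

```go
package main

import "fmt"

// dev is a hypothetical shared device with a single consumable capacity.
type dev struct {
	capacity  int64 // full advertised bandwidth, what a selector sees
	allocated int64 // sum of shares already handed out to other claims
}

// matches mimics a selector like `capacity >= minCapacity`: it looks at
// the nominal capacity and ignores current allocations entirely.
func matches(d dev, minCapacity int64) bool {
	return d.capacity >= minCapacity
}

// fits checks a request against the remaining, unallocated share.
func fits(d dev, request int64) bool {
	return d.allocated+request <= d.capacity
}

func main() {
	d := dev{capacity: 40, allocated: 35} // 40Gb NIC, 35Gb already shared out
	fmt.Println(matches(d, 40))           // true: selector sees the full device
	fmt.Println(fits(d, 10))              // false: only 5Gb remains
}
```

So a claim can insist on "a device of at least 40Gb" via a selector while only requesting 10Gb of it, and conversely a selector threshold above the request does not guarantee the request fits.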
```yaml
- name: net1
  deviceClassName: vlan-cni.networking.x-k8s.io
  resources:
    requests:
```
We should probably also clarify here how shared devices work with the abstract pools introduced in https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4815-dra-partitionable-devices. If part of a shared device is allocated, I would assume that means that all the `consumesCapacity` specified for the device would be considered "used" by the capacity pool?
Yes. If any part of a partition is used, then that partition is in-use and allocated.
```go
// +optional
// +default=false
// +featureGate=ConsumableCapacity
Shared bool `json:"shared" protobuf:"bytes,3,opt,name=shared"`
```
```go
// this flag will block the allocation of the shared device from the other claims.
// +optional
// +default=false
All bool `json:"all" protobuf:"bytes,1,name=all"`
```
The `All` and `Requirements` fields are mutually exclusive, so we should add a `+oneOf` directive.
```go
// Requirements define specific values of resource requirements of the device request.
// If all="true", this field is ignored.
```
Rather than saying the field will be ignored, we want to validate that only one of the two fields is set.
Since there are multiple aspects of the KEP's API design to discuss, it may be easier to track everything in an online document. I would appreciate it if we could move the design discussions to this document.
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Force-pushed fdd689c to 647e34d (compare)
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Force-pushed 647e34d to edef87c (compare)