
kep-3695-beta update #5346


Open
wants to merge 7 commits into base: master

Conversation

guptaNswati

/wg device-management
/assign @moshe010
/assign @klueska
/assign @johnbelamaric for PRR

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label May 27, 2025

linux-foundation-easycla bot commented May 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @guptaNswati!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2025
@k8s-ci-robot
Contributor

Hi @guptaNswati. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 27, 2025
Co-authored-by: Kevin Klues <klueska@gmail.com>
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 28, 2025
@kannon92
Contributor

kannon92 commented Jun 2, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2025
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Contributor

@ffromani ffromani left a comment

LGTM assuming the existing comments (e.g. KEP template) are addressed

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.
Contributor

PTAL to check whether kubernetes/kubernetes#132028 is relevant for the Get endpoint.
I think it is not, which is why I didn't change the implementation of Get, but it is worth a look in the context of this work.

Author

@guptaNswati guptaNswati Jun 12, 2025

Hmm, we do List all the pods and read this via the client bindings of k8s.io/kubelet/pkg/apis/podresources/v1.GetPodResourcesResponse.GetPodResources, but your change is conditional on a useActivePods bool, so it should not impact this work.
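For illustration only, a minimal sketch of that consumer pattern (this is not the dcgm-exporter code; the socket path, timeout, and polling interval are assumptions):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResourcesLister API on a unix socket;
	// the default path is assumed here for the sketch.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to the pod-resources socket: %v", err)
	}
	defer conn.Close()
	client := podresourcesv1.NewPodResourcesListerClient(conn)

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
		cancel()
		if err != nil {
			log.Printf("List failed: %v", err)
		} else {
			// GetPodResources() is the client binding mentioned above; each
			// entry describes one pod and its per-container allocations.
			for _, pr := range resp.GetPodResources() {
				fmt.Printf("pod %s/%s: %d containers\n",
					pr.GetNamespace(), pr.GetName(), len(pr.GetContainers()))
			}
		}
		time.Sleep(30 * time.Second) // assumed metric collection interval
	}
}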

Contributor

@ffromani ffromani Jun 13, 2025

Thanks for checking. So, if Get is used to get info about a non-running pod, maybe because of a race or a user error, what would happen? I can think of:

  1. Get returns the state without error as long as the pod is known to the kubelet (non-deleted). Are the clients of the API ok with that?
  2. Get returns an error, like pod not found or anything equivalent (pod not running)?

What I'm trying to do here is prevent us from fixing the same bug twice, once in List and once in Get.

Author

@guptaNswati guptaNswati Jun 13, 2025

I didn't find anything related to this in the dcgm-exporter issues. Our default metric collection interval is 30 seconds: we call the podresources API to list the running pods using GPUs and collect the GPU metrics associated with each GPU pod. So if a pod is not running, or is stale and reporting older metrics within this interval, I think it's okay.

Author

What's a non-running pod in this case? A pod that never started, was deleted, or had errors?

Contributor

What's a non-running pod in this case? A pod that never started, was deleted, or had errors?

A non-running pod is a pod bound to the kubelet (thus not pending) which is not in the Running phase. In practice, this maps to terminated but non-deleted pods, which in turn maps to pods whose restartPolicy is not Always. Probably the easiest example is pods belonging to Kubernetes Jobs.
It turns out this is a relatively easy combination to get, yet one that is not used too often. This is basically one of the major reasons why it took us roughly 2 years to figure out.

I'm glad you confirmed the Get endpoint is not affected; this is what my research also suggested. Nevertheless, I think we should fully specify in the KEP the semantics of calling Get with non-running pods as defined above. The reason I'm writing is that the last time I looked at the KEP (admittedly some months ago) this was not fully specified. It should be an easy fix and totally in scope for the beta stage.

Insufficiently detailed specifications have bitten us pretty badly (kubernetes/kubernetes#119423 (comment) - this was quite complex to figure out and required a lot of ancient-history research), and I'm submitting a doc PR/KEP PR fix ASAP.

Contributor

I believe this holds.

Author

I will add this. I ran this pr-test pod, which terminated and is in CrashLoopBackOff. But in my test, List() is also not returning it. Is it an ephemeral bug? I will test a bit more.

$ kubectl get pod | grep pr-test
pr-test                                0/1     CrashLoopBackOff   14 (4m39s ago)   52m

$ kubectl describe pod pr-test 
 Warning  BackOff    36s (x225 over 50m)   kubelet            Back-off restarting failed container busybox in pod pr-test_default(8931c1c9-7bc6-4721-98fe-e6193dc10a33)
 
$ sudo grpcurl -plaintext -import-path .  -import-path vendor  -proto staging/src/k8s.io/kubelet/pkg/apis/podresources/v1/api.proto -d '{"podName":"pr-test","podNamespace":"default"}'  unix:///var/lib/kubelet/pod-resources/kubelet.sock  v1.PodResourcesLister/Get
ERROR:
  Code: Unknown
  Message: pod pr-test in namespace default not found

Contributor

Clarification, at the risk of being redundant: Get itself and the changes pertaining to this KEP have had no bugs reported recently, to the best of my knowledge. From that perspective we're good.
But since we discovered a List bug largely caused by insufficient specification over the lifecycle of the API, I think this is a good chance to fully clarify and detail the semantics of Get (and of the changes covered by this KEP), solidifying what is otherwise mostly volatile knowledge. I think this should be quick and easy, documenting behavior we already have and want and just adding details, so it is a perfect fit for beta and won't detract from the goal nor cause any slowdown. Again, sorry if I'm being redundant; I just wanted to fully explain my PoV.

Contributor

Thank you, this clarifies this a bit more.

@guptaNswati
Author

@soltysh @SergeyKanzhelev what else is needed to get this merged? I see it's marked "At risk" for enhancements freeze.

@SergeyKanzhelev
Member

Please update to the latest template according to #3695 (comment)

My understanding is that @soltysh hadn't started reviews because the metadata was not correct. Now it should be correct; we need to work on the document.

@guptaNswati
Author

guptaNswati commented Jun 12, 2025

@SergeyKanzhelev

Please update to the latest template according to #3695 (comment)

It is using the latest template. See
KEP readme using the [latest template](https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template) has been merged into the k/enhancements repo.

Sorry, I am confused; are you referring to something else?

@ffromani
Contributor

@SergeyKanzhelev

Please update to the latest template according to #3695 (comment)

It is using the latest template. See KEP readme using the [latest template](https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template) has been merged into the k/enhancements repo.

Sorry, I am confused; are you referring to something else?

You need to check that the KEP layout in this PR matches the layout in the latest available KEP template. When folks update the KEP template, some sections may move or be created/merged/removed, new PRR questions can arise, and so forth.

@SergeyKanzhelev
Member

As @ffromani mentioned, some template sections might have moved or been added. I am relying on this comment: #3695 (comment), which says the KEP readme is not following the latest template.

@guptaNswati
Author

Got it, thanks both. I see I updated the implementation history, but that is within the format.

- Kubernetes 1.27: Alpha version of the KEP.

- Kubernetes 1.34: Beta version of the KEP.

Contributor

@soltysh soltysh left a comment

There are multiple unanswered questions whose answers are a requirement for beta promotion.

@@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests:

Contributor

Earlier in the doc:

  1. Please make sure to check appropriate boxes in the ## Release Signoff Checklist.
  2. Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

Contributor

  • Please make sure to check appropriate boxes in the ## Release Signoff Checklist.

This was addressed - thank you.

  • Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

This one still holds. We need links for integration and e2e based on the template in the appropriate section. I believe e2es were added in kubernetes/kubernetes#116846 so you should be able to quickly fill those in. Not sure if there are others.

Contributor

This still holds. I see Francesco mentioned several tests that were added; can we make sure they are explicitly linked in this document?

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.
Contributor

I believe this holds.

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.

#### GA

Contributor

Further in the document:

  1. In ###### Are there any tests for feature enablement/disablement? can you please link to those tests?
  2. In ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? can you explain why that question is not applicable?
  3. In ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? N/A is not correct. You're using explicit metrics pod_resources_endpoint_requests_total, pod_resources_endpoint_requests_list and pod_resources_endpoint_requests_get. What is the SLO for those?
  4. In ###### Will enabling / using this feature result in increasing size or count of the existing API objects? No is not correct. You're expanding the return type of the Get call with additional information coming from DRA, so I'd expect that to be expressed in the answer.
  5. In ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? your answer is not one I'd expect. What you're writing there is probably more suited to the Troubleshooting or Risks and Mitigations sections. In your particular case I'm expecting a rough estimation of whether CPU/RAM/disk/IO will be affected by this feature. Based on my understanding you're likely to increase memory consumption to hold the DRA information, no?
  6. You're missing the ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? question from the template.
  7. In ###### How does this feature react if the API server and/or etcd is unavailable? I'm expecting a concrete answer.
  8. In ###### What are other known failure modes? can you please use the proposed template? i.e. Detection -> Mitigations -> Diagnostics, etc.
  9. In ###### What steps should be taken if SLOs are not being met to determine the problem? you're missing an answer; N/A is not acceptable.
  10. What about Drawbacks and alternatives considered? I'm missing both sections; have we not considered any?

Author

@guptaNswati guptaNswati Jun 16, 2025

@soltysh Most of these have similar answers in the KEP this one extends: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments

Are these blocking updates?

And for this particular KEP, nothing has changed in the API from Alpha to Beta.
@ffromani where can I find the e2e tests? The test PR is this: kubernetes/kubernetes#116846


Contributor

We need to review which lane the podresources API tests (not pod-level resources) run in, which has changed multiple times over the months. Let's find and fix the configuration if needed.

Contributor

Looking at the config in k/test-infra, the tests should indeed run in the pull-kubernetes-node-kubelet-serial-containerd lane, but it's high time we added an explicit lane and/or called out the tests explicitly in some lane. We could do something similar to the somewhat confusing pull-kubernetes-node-kubelet-serial-podresources lane.

Contributor

@soltysh Most of these have similar answers in the KEP this one extends: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments

Are these blocking updates?

Yes, those are blocking changes. It doesn't matter that the answers are similar; we need them in this document. If they match, it's fine to just copy them over.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 16, 2025
Contributor

@ffromani ffromani left a comment

sig-node review: provisional LGTM. Two outstanding items from my POV.

  1. (nonblocking, in scope for Beta, "just" extra work) It turns out the CI config is obsolete and the podresources API tests run only incidentally. We should add them explicitly (as we do, for example, for cpumanager) and we should have a presubmit lane (as we do, for example, for pod-level resources). I think it is unfair to account all this work to this KEP, but some of it makes sense to do in the context of the beta graduation. Already begun, PRs linked. We may need to fix or add some e2e tests, but I think this is totally fair in the context of the beta graduation.
  2. (blocking) Clarify the Get semantics as discussed. Because of how the API works, there is an unavoidable TOCTOU issue. There is no atomicity or consistency guarantee between List and Get, nor a concept of a session (which I would push back on, btw). So it is possible, albeit probably unlikely, that Get is called with stale information (e.g. a pod becomes terminated between calls). We need to clarify and document the error paths.

Once the above are addressed, LGTM.

Contributor

@soltysh soltysh left a comment

All of my previous comments still hold, and they are blocking the merge of this KEP.


@@ -333,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed,

###### Are there any tests for feature enablement/disablement?

e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846)
Contributor

The linked PR isn't testing feature enablement/disablement, or am I misreading it? The closest place where you test this feature gate is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/podresources/server_v1_test.go, but there you only turn it on; I don't see the requested on/off test.

Contributor

We have on/off tests scattered across the existing tests: https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L977 and https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L1066
We can use a PR to make the tests more explicit, and some changes are needed if the FG goes to default-on: the FG status should be set explicitly.

Contributor

Great, can we make sure this is listed here?


Contributor

@ffromani ffromani left a comment

thanks, we need one more update please

To fix a long-standing desync where `List()` would return allocations for terminated pods because it iterated over the full kubelet pod list instead of the resource-manager’s active pods, PR #132028 adds a new `activePodsOnly` boolean flag to the v1 `List` RPC.

- **Default behavior** (`activePodsOnly=false`): backward-compatible, returns all pods as before.
- **With `activePodsOnly=true`**: filters results to only those pods currently tracked as “active” by the resource managers.
Contributor

I would omit this part; it is not relevant for this KEP. I will submit a separate update to both the KEP and the docs later on.

Author

Okay. I was also not sure if this is the right place, but I still wanted to add a bit of context since Get() is based on List().

Contributor

Once this and the comment right below are addressed (basically once this section is rephrased), LGTM

Note that even though the current most common usage pattern is List + a series of calls to Get, the two calls are independent of each other. It is perfectly legal for a client to call Get with a namespace+name pair obtained some other way than from the output of List.
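For illustration, a hedged sketch of such a client (not taken from any existing consumer; the error handling is an assumption), which follows List with a Get per pod and tolerates pods that disappear in between:

package collector

import (
	"context"
	"fmt"
	"log"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// collect lists all pods known to the podresources API and then calls Get on
// each of them. Because List and Get are independent calls with no shared
// snapshot, a pod may terminate in between; such failures are skipped rather
// than treated as fatal.
func collect(ctx context.Context, client podresourcesv1.PodResourcesListerClient) error {
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		return fmt.Errorf("List failed: %w", err)
	}
	for _, pr := range resp.GetPodResources() {
		got, err := client.Get(ctx, &podresourcesv1.GetPodResourcesRequest{
			PodName:      pr.GetName(),
			PodNamespace: pr.GetNamespace(),
		})
		if err != nil {
			// Most likely the pod went away between List and Get; tolerate it.
			log.Printf("skipping pod %s/%s: %v", pr.GetNamespace(), pr.GetName(), err)
			continue
		}
		_ = got.GetPodResources() // process the per-pod data here
	}
	return nil
}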

Comment on lines 119 to 120
**Race and inconsistency between `List()` and `Get()`:**
A pod’s phase may change between a successful `List()` and a subsequent `Get()`. Because there is no session or atomic snapshot between a `List()` and a later `Get()`, it’s possible (though uncommon) for a pod to transition to a non-Running phase in between calls. Clients must therefore be prepared for “stale” or “missing” data and handle the documented error paths when `Get()` is invoked on a pod that is no longer in the Running phase.
Contributor

We need to fully specify the behavior of Get in this case, or when, because of a mistake or a race or any other issue, it is fed with inconsistent data.
Currently the code does this:

	pod, exist := p.podsProvider.GetPodByName(req.PodNamespace, req.PodName)
	if !exist {
		metrics.PodResourcesEndpointErrorsGetCount.WithLabelValues("v1").Inc()
		return nil, fmt.Errorf("pod %s in namespace %s not found", req.PodName, req.PodNamespace)
	}

It seems to me the most logical extension is to take into account the phase of the pod and either

  1. return empty data and swallow the error, or
  2. return an error like pod not found.

The API in general (as we determined during the fixing and review process of 132028) is meant to return data pertaining to running pods, which thus actively hold resources.

Therefore, my take is that Get should also error out in the unlikely but possible case it is asked about non-running pods.

I'm open to the other approach or to any other sufficiently motivated approach.

Author

@guptaNswati guptaNswati Jun 19, 2025

@ffromani
How about this:

Get() returns a valid GetPodResourcesResponse with the pod's current PodResources when the pod is in the Running state. If, due to some race condition or other inconsistency, the specified pod has transitioned to terminated or its Status.Phase is otherwise not Running (possible though uncommon), the call must return a NotFound error.

Do you think adding an explicit check on the pod's Status.Phase would help? Something like:

pod, exist := p.podsProvider.GetPodByName(req.PodNamespace, req.PodName)
if !exist || pod.Status.Phase != v1.PodRunning {
    metrics.PodResourcesEndpointErrorsGetCount.WithLabelValues("v1").Inc()
    return nil, status.Errorf(codes.NotFound, 
        "pod %s in namespace %s not found or not running", 
        req.PodName, req.PodNamespace)
} 

Author

@guptaNswati guptaNswati Jun 19, 2025

Or we may be overthinking this. As you pointed out, it's possible that Get() mostly (though not 100% of the time) returns an error when asked about a non-running pod. I think just calling it out in the spec should be enough. Digging a bit deeper in the code, Get() should not report a removed pod except in a rare race condition: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod/pod_manager.go#L198

Your fix in List() to filterOutInactivePods() makes sense: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1065

If you still think it's better to improve Get(), we can add a check similar to the one I proposed above.

Contributor

I surely don't want to overcomplicate. The problem here is the insufficient validation and lack of checking, albeit in a corner case. TL;DR: we need to decide what Get does when the phase is not Running. A one-liner is sufficient for me, but it has to be there :)
e.g.
"Get() never fails if the kubelet knows about the pods" (but note this likely misreports resources of terminated pods)
"Get() returns error if the pod is known to the kubelet, but is terminated"

that's it.

@guptaNswati
Author

@soltysh PTAL, I tried to answer these: #5346 (comment).

If these are satisfactory, then the only pending items are:

  • adding the e2e test links. I am still not sure what an acceptable answer is here that would unblock this PR.
  • updating Get() based on @ffromani's feedback.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the kep-3695-beta-update branch from a804a04 to ae728ef on June 19, 2025 06:36
Contributor

@ffromani ffromani left a comment

/lgtm

sig-node review: my comment about Get was addressed. We may need to revisit in the next cycle (beta2? GA?), but the current changes are sufficient.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@ffromani
Contributor

my 2c about e2e tests (k/k/test/e2e_node/podresources_test.go):

  • we have FG on/off tests dispersed among others; they should be clearly separated and have their own block (a Context in ginkgo lingo); see the sketch after this list
  • I would like to see more tests for Get to cover e.g. multi-container pods and pods not holding exclusive resources
  • I would like tests that exercise the expected happy path (List + Get on each returned pod)
  • I would like to see a test which compares the output of List and Get in the happy path
  • I would like more tests that exercise the error paths, e.g. pod found but bogus container, or asking about a terminated pod.
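As a rough, hypothetical sketch of the first bullet (a dedicated FG on/off block in its own ginkgo Context), not the existing k/k test code; the spec names and skipped bodies are placeholders:

package podresources_test

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestPodResourcesFeatureGate(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "podresources API feature gate")
}

var _ = ginkgo.Describe("POD Resources API", func() {
	ginkgo.Context("with the KubeletPodResourcesGet feature gate disabled", func() {
		ginkgo.It("returns the documented error from Get", func() {
			// Placeholder: dial the pod-resources socket, call Get, and
			// assert the expected error code.
			ginkgo.Skip("sketch only; the real tests live in test/e2e_node/podresources_test.go")
		})
	})

	ginkgo.Context("with the KubeletPodResourcesGet feature gate enabled", func() {
		ginkgo.It("returns consistent data from List and Get (happy path)", func() {
			// Placeholder: List, then Get each returned pod and compare the
			// outputs, as suggested in the bullets above.
			ginkgo.Skip("sketch only")
		})
	})
})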

My PR kubernetes/kubernetes#132028 does a bit of cleanup in the e2e tests, so it may be wise to wait for it to merge (it's a fix we need anyway) and base the work on top of it.

Fixing the test-infra lane to better handle the e2e tests is already ongoing (kubernetes/kubernetes#132345) and can proceed in parallel.

My understanding is that doing all of the above is fair in the same cycle in which we attempt to graduate to beta.

Contributor

@soltysh soltysh left a comment

There are still answers missing inside the document itself, but since they were provided as comments, I'm willing to conditionally approve as is. Please make sure to either add the missing links before merging this PR, or shortly after.

/approve
the PRR section


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: guptaNswati, soltysh
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@klueska
Contributor

klueska commented Jun 19, 2025

There are still answers missing inside the document itself, but since they were provided as comments, I'm willing to conditionally approve as is. Please make sure to either add the missing links before merging this PR, or shortly after.

/approve

the PRR section

Thanks @soltysh. Given the holiday in the US today I don't think @guptaNswati will have time to make the update before tomorrow. I will work with her early next week to ensure all of the information in the comments gets incorporated.

@mrunalp, @haircommander, @SergeyKanzhelev can we please get a sig-node approval to move this forward.

Thanks!

@ffromani
Contributor

@mrunalp, @haircommander, @SergeyKanzhelev can we please get a sig-node approval to move this forward.

I think they are all based in the US. I'd support (FWIW) an exception if the KEP slips just because there is no time left for an approval review due to the holidays.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Development


10 participants