
kep-3695-beta update #5346


Open
wants to merge 7 commits into base: master

Conversation

guptaNswati

/wg device-management
/assign @moshe010
/assign @klueska
/assign @johnbelamaric for PRR

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label May 27, 2025

linux-foundation-easycla bot commented May 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @guptaNswati!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 27, 2025
@k8s-ci-robot
Contributor

Hi @guptaNswati. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 27, 2025
Co-authored-by: Kevin Klues <klueska@gmail.com>
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 28, 2025
@kannon92
Contributor

kannon92 commented Jun 2, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 2, 2025
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
Contributor

@ffromani ffromani left a comment

LGTM assuming the existing comments (e.g. KEP template) are addressed

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.
Contributor

PTAL to check whether kubernetes/kubernetes#132028 is relevant for the Get endpoint.
I think it is not, which is why I didn't change the implementation of Get, but it is worth a look in the context of this work.

Author

@guptaNswati guptaNswati Jun 12, 2025

Hmm, we do List all the pods and read this via the client bindings of k8s.io/kubelet/pkg/apis/podresources/v1.GetPodResourcesResponse.GetPodResources, but your change is conditional on a useActivePods bool, so it should not impact this work.
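For illustration only, a minimal sketch of that consumer pattern (this is not the dcgm-exporter code; the socket path, timeout, and polling interval are assumptions):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResourcesLister API on a unix socket;
	// the default path is assumed here for the sketch.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to the pod-resources socket: %v", err)
	}
	defer conn.Close()
	client := podresourcesv1.NewPodResourcesListerClient(conn)

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
		cancel()
		if err != nil {
			log.Printf("List failed: %v", err)
		} else {
			// GetPodResources() is the client binding mentioned above; each
			// entry describes one pod and its per-container allocations.
			for _, pr := range resp.GetPodResources() {
				fmt.Printf("pod %s/%s: %d containers\n",
					pr.GetNamespace(), pr.GetName(), len(pr.GetContainers()))
			}
		}
		time.Sleep(30 * time.Second) // assumed metric collection interval
	}
}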

Contributor

@ffromani ffromani Jun 13, 2025

Thanks for checking. So, if Get is used to get info about a non-running pod, maybe because of a race or a user error, what would happen? I can think of:

  1. Get returns the state without error as long as the pod is known to the kubelet (non-deleted). Are the clients of the API ok with that?
  2. Get returns an error, like pod not found or anything equivalent (pod not running)?

What I'm trying to do here is prevent us from fixing the same bug twice, once in List and once in Get.

Author

@guptaNswati guptaNswati Jun 13, 2025

I didn't find anything related to this in the dcgm-exporter issues. Our default metric collection interval is 30 seconds: we call the podresources API to list the running pods using GPUs and collect the GPU metrics associated with each GPU pod. So if a pod is not running, or is stale and reporting older metrics within this interval, I think it's okay.

Author

What's a non-running pod in this case? A pod that never started, was deleted, or had errors?

Contributor

What's a non-running pod in this case? A pod that never started, was deleted, or had errors?

A non-running pod is a pod bound to the kubelet (thus not pending) which is not in the Running phase. In practice, this maps to terminated but non-deleted pods, which in turn maps to pods whose restartPolicy is not Always. Probably the easiest example is pods belonging to Kubernetes Jobs.
It turns out this is a relatively easy combination to get, yet one that is not used too often. This is basically one of the major reasons why it took us roughly 2 years to figure out.

I'm glad you confirmed the Get endpoint is not affected; this is what my research also suggested. Nevertheless, I think we should fully specify in the KEP the semantics of calling Get with non-running pods as defined above. The reason I'm writing is that the last time I looked at the KEP (admittedly some months ago) this was not fully specified. It should be an easy fix and totally in scope for the beta stage.

Insufficiently detailed specifications have bitten us pretty badly (kubernetes/kubernetes#119423 (comment) - this was quite complex to figure out and required a lot of ancient-history research), and I'm submitting a doc PR/KEP PR fix ASAP.

Contributor

I believe this holds.

Author

I will add this. I ran this pr-test pod, which terminated and is in CrashLoopBackOff. But in my test, List() is also not returning it. Is it an ephemeral bug? I will test a bit more.

$ kubectl get pod | grep pr-test
pr-test                                0/1     CrashLoopBackOff   14 (4m39s ago)   52m

$ kubectl describe pod pr-test 
 Warning  BackOff    36s (x225 over 50m)   kubelet            Back-off restarting failed container busybox in pod pr-test_default(8931c1c9-7bc6-4721-98fe-e6193dc10a33)
 
$ sudo grpcurl -plaintext -import-path .  -import-path vendor  -proto staging/src/k8s.io/kubelet/pkg/apis/podresources/v1/api.proto -d '{"podName":"pr-test","podNamespace":"default"}'  unix:///var/lib/kubelet/pod-resources/kubelet.sock  v1.PodResourcesLister/Get
ERROR:
  Code: Unknown
  Message: pod pr-test in namespace default not found

Contributor

Clarification, at the risk of being redundant: Get itself and the changes pertaining to this KEP have had no bugs reported recently, to the best of my knowledge. From that perspective we're good.
But since we discovered a List bug largely caused by insufficient specification over the lifecycle of the API, I think this is a good chance to fully clarify and detail the semantics of Get (and of the changes covered by this KEP), solidifying what is otherwise mostly volatile knowledge. I think this should be quick and easy, documenting behavior we already have and want and just adding details, so it is a perfect fit for beta and won't detract from the goal nor cause any slowdown. Again, sorry if I'm being redundant; I just wanted to fully explain my PoV.

Contributor

Thank you, this clarifies this a bit more.

@guptaNswati
Author

@soltysh @SergeyKanzhelev what else is needed to get this merged? I see it's marked "At risk" for enhancements freeze.

@SergeyKanzhelev
Member

Please update to the latest template according to #3695 (comment)

My understanding is that @soltysh hadn't started reviews because the metadata was not correct. Now it should be correct; we need to work on the document.

@guptaNswati
Author

guptaNswati commented Jun 12, 2025

@SergeyKanzhelev

Please update to the latest template according to #3695 (comment)

It is using the latest template. See
KEP readme using the [latest template](https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template) has been merged into the k/enhancements repo.

Sorry, I am confused; are you referring to something else?

@ffromani
Contributor

@SergeyKanzhelev

Please update to the latest template according to #3695 (comment)

It is using the latest template. See KEP readme using the [latest template](https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template) has been merged into the k/enhancements repo.

Sorry, I am confused; are you referring to something else?

You need to check that the KEP layout in this PR matches the layout in the latest available KEP template. When folks update the KEP template, some sections may move or be created/merged/removed, new PRR questions can arise, and so forth.

@SergeyKanzhelev
Member

As @ffromani mentioned, some template sections might have moved or been added. I am relying on this comment: #3695 (comment), which says the KEP readme is not following the latest template.

@guptaNswati
Author

Got it, thanks both. I see I updated the implementation history, but that is within the format.

- Kubernetes 1.27: Alpha version of the KEP.

- Kubernetes 1.34: Beta version of the KEP.

Contributor

@soltysh soltysh left a comment

There are multiple unanswered questions whose answers are a requirement for beta promotion.

@@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests:

Contributor

Earlier in the doc:

  1. Please make sure to check appropriate boxes in the ## Release Signoff Checklist.
  2. Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

Contributor

  • Please make sure to check appropriate boxes in the ## Release Signoff Checklist.

This was addressed - thank you.

  • Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

This one still holds. We need links for integration and e2e based on the template in the appropriate section. I believe e2es were added in kubernetes/kubernetes#116846 so you should be able to quickly fill those in. Not sure if there are others.

Contributor

This still holds. I see Francesco mentioned several tests that were added; can we make sure they are explicitly linked in this document?

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.
Contributor

I believe this holds.

- [ ] No major bugs reported in the previous cycle.
- [] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter.
- [x] No major bugs reported in the previous cycle.

#### GA

Contributor

Further in the document:

  1. In ###### Are there any tests for feature enablement/disablement? can you please link to those tests?
  2. In ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? can you explain why that question is not applicable?
  3. In ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? N/A is not correct. You're using explicit metrics pod_resources_endpoint_requests_total, pod_resources_endpoint_requests_list and pod_resources_endpoint_requests_get. What is the SLO for those?
  4. In ###### Will enabling / using this feature result in increasing size or count of the existing API objects? No is not correct. You're expanding the return type of the Get call with additional information coming from DRA, so I'd expect that to be expressed in the answer.
  5. In ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? your answer is not one I'd expect. What you're writing there is probably more suited to the Troubleshooting or Risks and Mitigations sections. In your particular case I'm expecting a rough estimation of whether CPU/RAM/disk/IO will be affected by this feature. Based on my understanding you're likely to increase memory consumption to hold the DRA information, no?
  6. You're missing the ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? question from the template.
  7. In ###### How does this feature react if the API server and/or etcd is unavailable? I'm expecting a concrete answer.
  8. In ###### What are other known failure modes? can you please use the proposed template? i.e. Detection -> Mitigations -> Diagnostics, etc.
  9. In ###### What steps should be taken if SLOs are not being met to determine the problem? you're missing an answer; N/A is not acceptable.
  10. What about Drawbacks and alternatives considered? I'm missing both sections; have we not considered any?

Author

@guptaNswati guptaNswati Jun 16, 2025

@soltysh Most of these have similar answers in the KEP this one extends: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments

Are these blocking updates?

And for this particular KEP, nothing has changed in the API from Alpha to Beta.
@ffromani where can I find the e2e tests? The test PR is this: kubernetes/kubernetes#116846


Contributor

We need to review which lane the podresources API tests (not pod-level resources) run in, which has changed multiple times over the months. Let's find and fix the configuration if needed.

Contributor

Looking at the config in k/test-infra, the tests should indeed run in the pull-kubernetes-node-kubelet-serial-containerd lane, but it's high time we added an explicit lane and/or called out the tests explicitly in some lane. We could do something similar to the somewhat confusing pull-kubernetes-node-kubelet-serial-podresources lane.

Contributor

@soltysh Most of these have similar answers in the KEP this one extends: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2043-pod-resource-concrete-assigments

Are these blocking updates?

Yes, those are blocking changes. It doesn't matter that the answers are similar; we need them in this document. If they match, it's fine to just copy them over.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 16, 2025
Contributor

@ffromani ffromani left a comment

sig-node review: provisional LGTM. Two outstanding items from my POV.

  1. (nonblocking, in scope for Beta, "just" extra work) It turns out the CI config is obsolete and the podresources API tests run only incidentally. We should add them explicitly (as we do, for example, for cpumanager) and we should have a presubmit lane (as we do, for example, for pod-level resources). I think it is unfair to account all this work to this KEP, but some of it makes sense to do in the context of the beta graduation. Already begun, PRs linked. We may need to fix or add some e2e tests, but I think this is totally fair in the context of the beta graduation.
  2. (blocking) Clarify the Get semantics as discussed. Because of how the API works, there is an unavoidable TOCTOU issue. There is no atomicity or consistency guarantee between List and Get, nor a concept of a session (which I would push back on, btw). So it is possible, albeit probably unlikely, that Get is called with stale information (e.g. a pod becomes terminated between calls). We need to clarify and document the error paths.

Once the above are addressed, LGTM.

Contributor

@soltysh soltysh left a comment

All of my previous comments still hold, and they are blocking the merge of this KEP.


@@ -333,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed,

###### Are there any tests for feature enablement/disablement?

e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846)
Contributor

The linked PR isn't testing feature enablement/disablement, or am I misreading it? The closest place where you test this feature gate is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/podresources/server_v1_test.go, but there you only turn it on; I don't see the requested on/off test.

Contributor

We have on/off tests scattered across the existing tests: https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L977 and https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L1066
We can use a PR to make the tests more explicit, and some changes are needed if the FG goes to default-on: the FG status should be set explicitly.

Contributor

Great, can we make sure this is listed here?


Contributor

@ffromani ffromani left a comment

thanks, we need one more update please

To fix a long-standing desync where `List()` would return allocations for terminated pods because it iterated over the full kubelet pod list instead of the resource-manager’s active pods, PR #132028 adds a new `activePodsOnly` boolean flag to the v1 `List` RPC.

- **Default behavior** (`activePodsOnly=false`): backward-compatible, returns all pods as before.
- **With `activePodsOnly=true`**: filters results to only those pods currently tracked as “active” by the resource managers.
Contributor

I would omit this part; it is not relevant for this KEP. I will submit a separate update to both the KEP and the docs later on.

Author

Okay. I was also not sure if this is the right place, but I still wanted to add a bit of context since Get() is based on List().

Contributor

Once this and the comment right below are addressed (basically once this section is rephrased), LGTM

Note that even though the current most common usage pattern is List + a series of calls to Get, the two calls are independent of each other. It is perfectly legal for a client to call Get with a namespace+name pair obtained some other way than from the output of List.
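For illustration, a hedged sketch of such a client (not taken from any existing consumer; the error handling is an assumption), which follows List with a Get per pod and tolerates pods that disappear in between:

package collector

import (
	"context"
	"fmt"
	"log"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// collect lists all pods known to the podresources API and then calls Get on
// each of them. Because List and Get are independent calls with no shared
// snapshot, a pod may terminate in between; such failures are skipped rather
// than treated as fatal.
func collect(ctx context.Context, client podresourcesv1.PodResourcesListerClient) error {
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		return fmt.Errorf("List failed: %w", err)
	}
	for _, pr := range resp.GetPodResources() {
		got, err := client.Get(ctx, &podresourcesv1.GetPodResourcesRequest{
			PodName:      pr.GetName(),
			PodNamespace: pr.GetNamespace(),
		})
		if err != nil {
			// Most likely the pod went away between List and Get; tolerate it.
			log.Printf("skipping pod %s/%s: %v", pr.GetNamespace(), pr.GetName(), err)
			continue
		}
		_ = got.GetPodResources() // process the per-pod data here
	}
	return nil
}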

Comment on lines 119 to 120
**Race and inconsistency between `List()` and `Get()`:**
A pod’s phase may change between a successful `List()` and a subsequent `Get()`. Because there is no session or atomic snapshot between a `List()` and a later `Get()`, it’s possible (though uncommon) for a pod to transition to a non-Running phase in between calls. Clients must therefore be prepared for “stale” or “missing” data and handle the documented error paths when `Get()` is invoked on a pod that is no longer in the Running phase.
Contributor

We need to fully specify the behavior of Get in this case, or when, because of a mistake or a race or any other issue, it is fed with inconsistent data.
Currently the code does this:

	pod, exist := p.podsProvider.GetPodByName(req.PodNamespace, req.PodName)
	if !exist {
		metrics.PodResourcesEndpointErrorsGetCount.WithLabelValues("v1").Inc()
		return nil, fmt.Errorf("pod %s in namespace %s not found", req.PodName, req.PodNamespace)
	}

It seems to me the most logical extension is to take into account the phase of the pod and either

  1. return empty data and swallow the error, or
  2. return an error like pod not found.

The API in general (as we determined during the fixing and review process of 132028) is meant to return data pertaining to running pods, which thus actively hold resources.

Therefore, my take is that Get should also error out in the unlikely but possible case it is asked about non-running pods.

I'm open to the other approach or to any other sufficiently motivated approach.

Author

@guptaNswati guptaNswati Jun 19, 2025

@ffromani
How about this:

Get() returns a valid GetPodResourcesResponse with the pod's current PodResources when the pod is in the Running state. If, due to some race condition or other inconsistency, the specified pod has transitioned to terminated or its Status.Phase is otherwise not Running (possible though uncommon), the call must return a NotFound error.

Do you think adding an explicit check on the pod's Status.Phase would help? Something like:

pod, exist := p.podsProvider.GetPodByName(req.PodNamespace, req.PodName)
if !exist || pod.Status.Phase != v1.PodRunning {
    metrics.PodResourcesEndpointErrorsGetCount.WithLabelValues("v1").Inc()
    return nil, status.Errorf(codes.NotFound, 
        "pod %s in namespace %s not found or not running", 
        req.PodName, req.PodNamespace)
} 

Author

@guptaNswati guptaNswati Jun 19, 2025

Or we may be overthinking this. As you pointed out, it's possible that Get() mostly (though not 100% of the time) returns an error when asked about a non-running pod. I think just calling it out in the spec should be enough. Digging a bit deeper in the code, Get() should not report a removed pod except in a rare race condition: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod/pod_manager.go#L198

Your fix in List() to filterOutInactivePods() makes sense: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1065

If you still think it's better to improve Get(), we can add a check similar to the one I proposed above.

Contributor

I surely don't want to overcomplicate. The problem here is the insufficient validation and lack of checking, albeit in a corner case. TL;DR: we need to decide what Get does when the phase is not Running. A one-liner is sufficient for me, but it has to be there :)
e.g.
"Get() never fails if the kubelet knows about the pods" (but note this likely misreports resources of terminated pods)
"Get() returns error if the pod is known to the kubelet, but is terminated"

that's it.

@guptaNswati
Author

@soltysh PTAL, I tried to answer these: #5346 (comment).

If these are satisfactory, then the only pending items are:

  • adding the e2e test links. I am still not sure what an acceptable answer is here that would unblock this PR.
  • updating Get() based on @ffromani's feedback.

Signed-off-by: Swati Gupta <swatig@nvidia.com>
Signed-off-by: Swati Gupta <swatig@nvidia.com>
@guptaNswati guptaNswati force-pushed the kep-3695-beta-update branch from a804a04 to ae728ef on June 19, 2025 06:36
Contributor

@ffromani ffromani left a comment

/lgtm

sig-node review: my comment about Get was addressed. We may need to revisit in the next cycle (beta2? GA?), but the current changes are sufficient.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@ffromani
Contributor

my 2c about e2e tests (k/k/test/e2e_node/podresources_test.go):

  • we have FG on/off tests dispersed among others; they should be clearly separated and have their own block (a Context in ginkgo lingo); see the sketch after this list
  • I would like to see more tests for Get to cover e.g. multi-container pods and pods not holding exclusive resources
  • I would like tests that exercise the expected happy path (List + Get on each returned pod)
  • I would like to see a test which compares the output of List and Get in the happy path
  • I would like more tests that exercise the error paths, e.g. pod found but bogus container, or asking about a terminated pod.
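As a rough, hypothetical sketch of the first bullet (a dedicated FG on/off block in its own ginkgo Context), not the existing k/k test code; the spec names and skipped bodies are placeholders:

package podresources_test

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestPodResourcesFeatureGate(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "podresources API feature gate")
}

var _ = ginkgo.Describe("POD Resources API", func() {
	ginkgo.Context("with the KubeletPodResourcesGet feature gate disabled", func() {
		ginkgo.It("returns the documented error from Get", func() {
			// Placeholder: dial the pod-resources socket, call Get, and
			// assert the expected error code.
			ginkgo.Skip("sketch only; the real tests live in test/e2e_node/podresources_test.go")
		})
	})

	ginkgo.Context("with the KubeletPodResourcesGet feature gate enabled", func() {
		ginkgo.It("returns consistent data from List and Get (happy path)", func() {
			// Placeholder: List, then Get each returned pod and compare the
			// outputs, as suggested in the bullets above.
			ginkgo.Skip("sketch only")
		})
	})
})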

My PR kubernetes/kubernetes#132028 does a bit of cleanup in the e2e tests, so it may be wise to wait for it to merge (it's a fix we need anyway) and base the work on top of it.

Fixing the test-infra lane to better handle the e2e tests is already ongoing (kubernetes/kubernetes#132345) and can proceed in parallel.

My understanding is that doing all of the above is fair in the same cycle in which we attempt to graduate to beta.

Contributor

@soltysh soltysh left a comment

There are still answers missing inside the document itself, but since they were provided as comments, I'm willing to conditionally approve as is. Please make sure to either add the missing links before merging this PR, or shortly after.

/approve
the PRR section


@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: guptaNswati, soltysh
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@klueska
Contributor

klueska commented Jun 19, 2025

There are still answers missing inside the document itself, but since they were provided as comments, I'm willing to conditionally approve as is. Please make sure to either add the missing links before merging this PR, or shortly after.

/approve

the PRR section

Thanks @soltysh. Given the holiday in the US today I don't think @guptaNswati will have time to make the update before tomorrow. I will work with her early next week to ensure all of the information in the comments gets incorporated.

@mrunalp, @haircommander, @SergeyKanzhelev can we please get a sig-node approval to move this forward.

Thanks!

@ffromani
Contributor

@mrunalp, @haircommander, @SergeyKanzhelev can we please get a sig-node approval to move this forward.

I think they are all based in the US. I'd support (FWIW) an exception if the KEP slips just because there is no time left for an approval review due to the holidays.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Development


10 participants