NVIDIA GPU Reservation needs to be added to workload form #5005

Closed
catherineluse opened this issue Jan 28, 2022 · 8 comments

@catherineluse
Contributor

Someone in the Rancher Users Slack asked why the GPU reservation field is not in v2.6.x when it used to be in v2.5.

It looks like the field was intended to be added to v2.6.x as it is mentioned in this comment about the design of the form: #267 (comment)

The field would be added to the resources tab:
[Screenshot: Screen Shot 2022-01-28 at 11 55 20 AM]

Ember UI for reference:
[Screenshot: Ember UI]

@nwmac
Member

nwmac commented Jan 31, 2022

The task here is:

  • Add a new field under 'CPU Limit', similar to the CPU field; the unit is just 'GPUs' and the label is 'NVIDIA GPU Limit/Reservation'
  • This field is a numeric input

If the value is not a number or is 0, the resource limit/request for GPUs should be removed. Otherwise it should be set to the value in the input box.

The code for the CPU/memory limits/reservations is in the ContainerResourceLimit component. This is used in edit/workload/index.vue.

For this GPU setting, the key to use is nvidia.com/gpu.

Unlike CPU and Memory, we only have one input box, and we use its value for both the limits and requests.

You can test this feature by ensuring that the YAML created for a container contains the correct GPU limits/requests settings.
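
Roughly, the behaviour being asked for looks like the sketch below. This is a hypothetical TypeScript sketch only; the names applyGpuReservation, ContainerResources and Quantities are illustrative, and the real logic belongs in the ContainerResourceLimit component / edit/workload/index.vue.

    // Hypothetical sketch of the requested behaviour, not the dashboard's actual code.
    type Quantities = Record<string, number | string>;

    interface ContainerResources {
      limits?: Quantities;
      requests?: Quantities;
    }

    const GPU_KEY = 'nvidia.com/gpu';

    // Apply the single 'NVIDIA GPU Limit/Reservation' input to a container's resources.
    function applyGpuReservation(resources: ContainerResources, input: string): void {
      const value = Number(input);

      if (!input || Number.isNaN(value) || value === 0) {
        // Not a number, or 0: remove any existing GPU limit/request.
        if (resources.limits) { delete resources.limits[GPU_KEY]; }
        if (resources.requests) { delete resources.requests[GPU_KEY]; }
        return;
      }

      // One input box drives both the limit and the request.
      resources.limits = { ...(resources.limits || {}), [GPU_KEY]: value };
      resources.requests = { ...(resources.requests || {}), [GPU_KEY]: value };
    }

With this shape, entering 2 in the field would produce nvidia.com/gpu: 2 under both limits and requests in the saved YAML, and clearing the field would drop both entries.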

@MbolotSuse

No validation template was filled out for this, but I can try to go off of @nwmac's comment here (#5005 (comment)) for what to look for.

Validation Failed

While the resource limit was properly set as expected and is present in the UI, the request value was not set. Based on the earlier comment, I expected these to have the same value. Let me know if I am wrong there.

Reproduction steps

  • Rancher Version: 2.6.3 (Docker single node - docker pull rancher/rancher:v2.6.3 and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6.3)
  1. Start rancher:v2.6.3.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. In the job creation screen, select resources.
  5. Confirm that there is no ability to set an NVIDIA GPU Reservation.

[Screenshot: Before]

Validation steps

  • Rancher Version: 2.6-head (Docker single node - docker pull rancher/rancher:v2.6-head and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6-head)
  • 2.6-head digest: sha256:3f7cad0584b73619958c4b02cf1203b1c906c4ea0d609b58858cd013d980521e
  • Latest commit (from rancher): 572ab58344b409b452ee3a3144e21066a5d89245
  1. Start rancher:v2.6-head.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. Set the name to test-job.
  5. Set the container image to debian:stable-slim.
  6. In the job creation screen, select resources.
  7. Confirm that you can set a limit for NVIDIA GPU reservation.
  8. Create the job. Confirm that a limit was set for nvidia.com/gpu to the requested value. Confirm that a request was set for nvidia.com/gpu to the requested value.

This was also validated for other non-Pod workload types (CronJobs, DaemonSets, Deployments, StatefulSets). Screenshots from those resources are not included for brevity's sake. StatefulSets seem to have a separate issue that required me to use the YAML to set the serviceName, but that appears to be unrelated to this change.

[Screenshots: After_1, After_2, After_3]

@nwmac
Member

nwmac commented Mar 11, 2022

@neillsom The value is being set as a string, where it should be a number - see the YAML:

        resources:
          limits:
            nvidia.com/gpu: "1"

(Should not be quoted)

Also, the ask here is to use the same value for requests, so we should duplicate the value; you'd get:

        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1

This isn't strictly necessary, but it mirrors the Ember UI; this is what was being referred to earlier in this ticket:

Unlike CPU and Memory, we only have one input box, and we use its value for both the limits and requests.

@neillsom
Contributor

@nwmac This string issue appears to be coming from the backend. We are sending a number but receiving a string. I'll create another PR for the limits/requests issue.

[Screenshots: Screen Shot 2022-03-11 at 2.11.59 PM, Screen Shot 2022-03-11 at 2.12.25 PM]
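
For context, the Kubernetes API serializes resource quantities as strings (e.g. "1"), so a value sent as a number can legitimately come back quoted on read. A minimal sketch, using a hypothetical gpuValueForDisplay helper (not the dashboard's actual code), of coercing the stored value back to a number for the numeric input:

    // Hypothetical helper: Kubernetes returns resource quantities as strings,
    // so coerce the stored value back to a number before showing it in the input.
    type Quantities = Record<string, number | string>;

    function gpuValueForDisplay(limits?: Quantities): number | undefined {
      const raw = limits ? limits['nvidia.com/gpu'] : undefined;

      if (raw === undefined) {
        return undefined;
      }

      const value = Number(raw);

      return Number.isNaN(value) ? undefined : value;
    }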

@jtravee

jtravee commented Mar 16, 2022

Confirmed with @catherineluse and @gaktive to add the release-note label.

@gaktive
Member

gaktive commented Mar 21, 2022

PR #5424 addresses this and it was merged.

@MbolotSuse

Validation Passed

Reproduction steps

  • Rancher Version: 2.6.3 (Docker single node - docker pull rancher/rancher:v2.6.3 and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6.3)
  1. Start rancher:v2.6.3.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. In the job creation screen, select resources.
  5. Confirm that there is no ability to set an NVIDIA GPU Reservation.

[Screenshot: Before]

Validation steps

  • Rancher Version: 2.6-head (Docker single node - docker pull rancher/rancher:v2.6-head and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6-head)
  • 2.6-head digest: sha256:6b212fa904effc44676e8eb7315377f088ee3fa37cb6bc1f0132a57470eeeacd
  • Latest commit (from rancher): 0b4457317840d80c17990b5e04ba326eb411fb8f
  1. Start rancher:v2.6-head.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. Set the name to test-job.
  5. Set the container image to debian:stable-slim.
  6. In the job creation screen, select resources.
  7. Confirm that you can set a limit for NVIDIA GPU reservation.
  8. Create the job. Confirm that a limit was set for nvidia.com/gpu to the requested value. Confirm that a request was set for nvidia.com/gpu to the requested value.

This was also validated for other non-Pod workload types (CronJobs, DaemonSets, Deployments, StatefulSets). Screenshots from those resources are not included for brevity's sake.

Notes on other workload types

  • CronJobs appeared to suffer from another UI issue. When hitting Create or Edit as YAML, an error appeared (in the UI in the first case, in the console in the second) indicating that template.spec.containers is undefined. This appears to be a separate UI issue unrelated to this change.
  • StatefulSets seem to have a separate issue that required me to use the YAML to set the serviceName, but that appears to be unrelated to this change. I used the UI to set the GPU limits, which worked as expected (even though the final create was done through the YAML).

[Screenshots: After_1, After_2, After_3]

Notes

  • The values set in the YAML were strings, as noted by @nwmac. This did not appear to affect functionality (on my local cluster, the resources refused to schedule with 0/1 nodes are available: 1 Insufficient nvidia.com/gpu., as expected on a node without GPUs).
  • While I did validate that the YAML was properly set, I did not verify that this works end to end (i.e. that you get what you requested and that you can properly use a GPU in a container). I think that this would be a good test for QA to run to make sure that this works all the way through. @sowmyav27, let me know what you think.

@anupama2501

Verified on v2.6-head c2d8e32

  • Created an EKS cluster with 3 p3.large nodes with GPU option enabled.
  • Once the cluster is active, run kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
  • Verify GPU on each node by running the command `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu"`
  • Created a job, adding the value 1 in the Resources section for GPU
  • Verified that the values are updated for limits and requests under nvidia.com/gpu in the YAML of the job:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
  • Job/deployment comes up active.
