NVIDIA GPU Reservation needs to be added to workload form #5005

Closed
catherineluse opened this issue Jan 28, 2022 · 8 comments

@catherineluse
Contributor

Someone in the Rancher Users Slack asked why the GPU reservation field is not in v2.6.x when it used to be in v2.5.

It looks like the field was intended to be added to v2.6.x as it is mentioned in this comment about the design of the form: #267 (comment)

The field would be added to the resources tab:
[Screenshot: Screen Shot 2022-01-28 at 11 55 20 AM]

Ember UI for reference:
[Screenshot: Ember UI]

@nwmac
Member

nwmac commented Jan 31, 2022

The task here is:

  • Add a new field under 'CPU Limit', similar to the CPU field; the unit is just 'GPUs' and the label is 'NVIDIA GPU Limit/Reservation'
  • This field is a numeric input

If the value is not a number or is 0, the resource limit/request for GPUs should be removed. Otherwise it should be set to the value in the input box.

The code for the CPU/memory limits/reservations is in the ContainerResourceLimit component. This is used in edit/workload/index.vue.

For this GPU setting, the key to use is nvidia.com/gpu.

Unlike CPU and Memory, we only have one input box, and we use its value for both the limits and requests.

You can test this feature by ensuring that the YAML created for a container contains the correct GPU limits/requests settings.
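
Roughly, the behaviour being asked for looks like the sketch below. This is a hypothetical TypeScript sketch only; the names applyGpuReservation, ContainerResources and Quantities are illustrative, and the real logic belongs in the ContainerResourceLimit component / edit/workload/index.vue.

    // Hypothetical sketch of the requested behaviour, not the dashboard's actual code.
    type Quantities = Record<string, number | string>;

    interface ContainerResources {
      limits?: Quantities;
      requests?: Quantities;
    }

    const GPU_KEY = 'nvidia.com/gpu';

    // Apply the single 'NVIDIA GPU Limit/Reservation' input to a container's resources.
    function applyGpuReservation(resources: ContainerResources, input: string): void {
      const value = Number(input);

      if (!input || Number.isNaN(value) || value === 0) {
        // Not a number, or 0: remove any existing GPU limit/request.
        if (resources.limits) { delete resources.limits[GPU_KEY]; }
        if (resources.requests) { delete resources.requests[GPU_KEY]; }
        return;
      }

      // One input box drives both the limit and the request.
      resources.limits = { ...(resources.limits || {}), [GPU_KEY]: value };
      resources.requests = { ...(resources.requests || {}), [GPU_KEY]: value };
    }

With this shape, entering 2 in the field would produce nvidia.com/gpu: 2 under both limits and requests in the saved YAML, and clearing the field would drop both entries.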

@MbolotSuse

No validation template was filled out for this, but I can try to go off of @nwmac's comment here (#5005 (comment)) for what to look for.

Validation Failed

While the resource limit was properly set as expected and is present in the UI, the request value was not set. Based on the earlier comment, I expected these to have the same value. Let me know if I am wrong there.

Reproduction steps

  • Rancher Version: 2.6.3 (Docker single node - docker pull rancher/rancher:v2.6.3 and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6.3)
  1. Start rancher:v2.6.3.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. In the job creation screen, select resources.
  5. Confirm that there is no ability to set an NVIDIA GPU Reservation.

[Screenshot: Before]

Validation steps

  • Rancher Version: 2.6-head (Docker single node - docker pull rancher/rancher:v2.6-head and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6-head)
  • 2.6-head digest: sha256:3f7cad0584b73619958c4b02cf1203b1c906c4ea0d609b58858cd013d980521e
  • Latest commit (from rancher): 572ab58344b409b452ee3a3144e21066a5d89245
  1. Start rancher:v2.6-head.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. Set the name to test-job.
  5. Set the container image to debian:stable-slim.
  6. In the job creation screen, select resources.
  7. Confirm that you can set a limit for NVIDIA GPU reservation.
  8. Create the job. Confirm that a limit was set for nvidia.com/gpu to the requested value. Confirm that a request was set for nvidia.com/gpu to the requested value.

This was also validated for other non-Pod workload types (CronJobs, DaemonSets, Deployments, StatefulSets). Screenshots from those resources are not included for brevity's sake. StatefulSets seem to have a separate issue that required me to use the YAML to set the serviceName, but that appears to be unrelated to this change.

[Screenshots: After_1, After_2, After_3]

@nwmac
Member

nwmac commented Mar 11, 2022

@neillsom The value is being set as a string, where it should be a number - see the YAML:

        resources:
          limits:
            nvidia.com/gpu: "1"

(Should not be quoted)

Also, the ask here is to use the same value for requests, so we should duplicate the value; you'd get:

        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1

This isn't strictly necessary, but it mirrors the Ember UI; this is what was being referred to earlier in this ticket:

Unlike CPU and Memory, we only have one input box, and we use its value for both the limits and requests.

@neillsom
Contributor

@nwmac This string issue appears to be coming from the backend. We are sending a number but receiving a string. I'll create another PR for the limits/requests issue.

[Screenshots: Screen Shot 2022-03-11 at 2.11.59 PM, Screen Shot 2022-03-11 at 2.12.25 PM]
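
For context, the Kubernetes API serializes resource quantities as strings (e.g. "1"), so a value sent as a number can legitimately come back quoted on read. A minimal sketch, using a hypothetical gpuValueForDisplay helper (not the dashboard's actual code), of coercing the stored value back to a number for the numeric input:

    // Hypothetical helper: Kubernetes returns resource quantities as strings,
    // so coerce the stored value back to a number before showing it in the input.
    type Quantities = Record<string, number | string>;

    function gpuValueForDisplay(limits?: Quantities): number | undefined {
      const raw = limits ? limits['nvidia.com/gpu'] : undefined;

      if (raw === undefined) {
        return undefined;
      }

      const value = Number(raw);

      return Number.isNaN(value) ? undefined : value;
    }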

@jtravee

jtravee commented Mar 16, 2022

Confirmed with @catherineluse and @gaktive to add the release-note label.

@gaktive
Member

gaktive commented Mar 21, 2022

PR #5424 addresses this and it was merged.

@MbolotSuse

Validation Passed

Reproduction steps

  • Rancher Version: 2.6.3 (Docker single node - docker pull rancher/rancher:v2.6.3 and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6.3)
  1. Start rancher:v2.6.3.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. In the job creation screen, select resources.
  5. Confirm that there is no ability to set an NVIDIA GPU Reservation.

[Screenshot: Before]

Validation steps

  • Rancher Version: 2.6-head (Docker single node - docker pull rancher/rancher:v2.6-head and docker run -d --restart=unless-stopped -p 8080:80 -p 8081:443 --name rancher --privileged rancher/rancher:v2.6-head)
  • 2.6-head digest: sha256:6b212fa904effc44676e8eb7315377f088ee3fa37cb6bc1f0132a57470eeeacd
  • Latest commit (from rancher): 0b4457317840d80c17990b5e04ba326eb411fb8f
  1. Start rancher:v2.6-head.
  2. Log in as an admin user.
  3. In the local cluster, attempt to create a job.
  4. Set the name to test-job.
  5. Set the container image to debian:stable-slim.
  6. In the job creation screen, select resources.
  7. Confirm that you can set a limit for NVIDIA GPU reservation.
  8. Create the job. Confirm that a limit was set for nvidia.com/gpu to the requested value. Confirm that a request was set for nvidia.com/gpu to the requested value.

This was also validated for other non-Pod workload types (CronJobs, DaemonSets, Deployments, StatefulSets). Screenshots from those resources are not included for brevity's sake.

Notes on other workload types

  • CronJobs appeared to suffer from another UI issue. When hitting Create or Edit as YAML, an error appeared (in the UI in the first case, in the console in the second) indicating that template.spec.containers is undefined. This appears to be a separate UI issue unrelated to this change.
  • StatefulSets seem to have a separate issue that required me to use the YAML to set the serviceName, but that appears to be unrelated to this change. I used the UI to set the GPU limits, which worked as expected (even though the final create was done through the YAML).

[Screenshots: After_1, After_2, After_3]

Notes

  • The values set in the YAML were strings, as noted by @nwmac. This did not appear to affect functionality (on my local cluster, the resources refused to schedule with 0/1 nodes are available: 1 Insufficient nvidia.com/gpu., as expected on a node without GPUs).
  • While I did validate that the YAML was properly set, I did not verify that this works end to end (i.e. that you get what you requested and that you can properly use a GPU in a container). I think that this would be a good test for QA to run to make sure that this works all the way through. @sowmyav27, let me know what you think.

@anupama2501

Verified on v2.6-head c2d8e32

  • Created an EKS cluster with 3 p3.large nodes with GPU option enabled.
  • Once the cluster is active, run kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
  • Verify GPU on each node by running the command `kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu"`
  • Created a job, adding the value 1 in the Resources section for GPU
  • Verified that the values are updated for limits and requests under nvidia.com/gpu in the YAML of the job:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
  • Job/deployment comes up active.
