Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Add label validation during sky check #2653

Merged
merged 9 commits into from
Nov 19, 2023

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Oct 4, 2023

Closes #2652. Adds a validate_label_value method that can be implemented by LabelFormatters to verify labels are setup correctly during sky check (and hopefully in sky show-gpus too, with #2638).

Before:

$ sky check
...
Kubernetes: enabled
...

After:

  Kubernetes: enabled
    Hint: Node ip-192-168-56-141.us-west-2.compute.internal in Kubernetes cluster has invalid GPU label: skypilot.co/accelerator=A40. Label value "A40" must be lowercase if using the skypilot.co/accelerator label.

Also updates the docs to highlight that the gpu name needs to be lowercase.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual tests: label nodes with kubectl label nodes <node> skypilot.co/accelerator=A40, try launching/sky check.
  • Tested GPU Label Detection by manually setting kubectl label nodes <node> skypilot.co/accelerator=a40 (skypilot), kubectl label nodes <node> gpu.nvidia.com/class=A40(coreweave) and kubectl label nodes <node> cloud.google.com/gke-accelerator=nvidia-tesla-a40 (gke) on an EKS cluster.
  • sky launch -c test --cloud kubernetes --gpus A40:1 and sky launch -c test --cloud kubernetes --gpus a40:1
  • Docs rendered locally

Copy link
Collaborator

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj Tested out and it works as expected. Left some minor comments. LGTM!

docs/source/reference/kubernetes/kubernetes-setup.rst Outdated Show resolved Hide resolved
sky/utils/kubernetes_utils.py Outdated Show resolved Hide resolved
sky/utils/kubernetes_utils.py Show resolved Hide resolved
sky/utils/kubernetes_utils.py Outdated Show resolved Hide resolved
@romilbhardwaj romilbhardwaj merged commit feacc9f into master Nov 19, 2023
19 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_gpulabel_validation branch November 19, 2023 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] Custom GPU node labels need better error hints
2 participants