
[k8s] GPU support for sky local up #2890

Merged: 12 commits merged into master from localup_gpus on Jan 22, 2024

Conversation

@romilbhardwaj (Collaborator) commented Dec 21, 2023

Adds support for GPUs in kind clusters created by sky local up. This is useful for users who have a single machine with GPUs on which they want to run SkyPilot tasks.

Uses instructions from this guide: https://gist.github.com/romilbhardwaj/acde8657e319ecdc6ae9e50646acca33

Note that kind does not support GPUs natively - this is a hack that injects the GPU devices through the volume-mounts feature of the NVIDIA container runtime - https://docs.google.com/document/d/1uXVF-NWZQXgP1MLb87_kMkQvidpnkNWicdpO2l9g-fw/edit#

Does not auto-install or auto-configure nvidia-container-runtime and other dependencies, because doing so is highly dependent on the user's environment and requires sudo. Instead, we print out instructions on how to do so.
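For illustration, here is a minimal sketch (in Python, hedged - the PR performs these checks from a shell script, so this is not the actual implementation) of the kind of preflight check and hint printing described above. The file paths and fix-it commands are the ones that appear in the review transcripts below; the helper name is hypothetical.

import json
import pathlib

def check_nvidia_runtime_config() -> list:
    # Return hint messages for any missing host configuration.
    hints = []

    # 1. Docker must use the NVIDIA runtime by default.
    daemon_json = pathlib.Path('/etc/docker/daemon.json')
    default_runtime = None
    if daemon_json.exists():
        default_runtime = json.loads(daemon_json.read_text()).get('default-runtime')
    if default_runtime != 'nvidia':
        hints.append(
            'NVIDIA is not set as the default runtime for Docker. To fix, run:\n'
            '  sudo nvidia-ctk runtime configure --runtime=docker --set-as-default\n'
            '  sudo systemctl restart docker')

    # 2. The NVIDIA container runtime must expose GPU devices via volume mounts,
    #    which is what the kind-based hack above relies on.
    config_toml = pathlib.Path('/etc/nvidia-container-runtime/config.toml')
    if (not config_toml.exists() or
            'accept-nvidia-visible-devices-as-volume-mounts = true'
            not in config_toml.read_text()):
        hints.append(
            'NVIDIA visible devices are not set as volume mounts in container runtime. '
            'To fix, run:\n'
            "  sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/"
            "c\\accept-nvidia-visible-devices-as-volume-mounts = true' "
            '/etc/nvidia-container-runtime/config.toml')
    return hints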

Due to the brittle and non-native nature of this support, the --gpus flag will be hidden from users until we decide it is stable enough.

Tested:

  • AWS T4 instance
  • AWS V100 instance
  • GCP T4 instance

@concretevitamin (Collaborator)

Nice @romilbhardwaj!

Tried it on a sky-launched T4 machine. Some UX notes:

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠋ Creating local cluster...
Failed to create local cluster. No kind clusters found.
NVIDIA is not set as the default runtime for Docker. To fix, run:
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Successfully loaded config
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that the docker daemon be restarted.
(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠋ Creating local cluster...
Failed to create local cluster. No kind clusters found.
NVIDIA visible devices are not set as volume mounts in container runtime. To fix, run:
sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml

(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
(base) gcpuser@ray-sky-2d4a-zongheng-8a39-head-d50432f5-compute:~/skypilot$ sky local up --gpus
⠙ Creating local cluster...
Failed to create local cluster. No kind clusters found.
Creating cluster "skypilot" ...
 • Ensuring node image (kindest/node:v1.27.3) 🖼  ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 • Preparing nodes 📦   ...
 ✓ Preparing nodes 📦
 • Writing configuration 📜  ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️  ...
 ✓ Starting control-plane 🕹️
 • Installing CNI 🔌  ...
 ✓ Installing CNI 🔌
 • Installing StorageClass 💾  ...
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-skypilot"
You can now use your cluster with:

kubectl cluster-info --context kind-skypilot

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
helm is not installed. Please install helm and try again. Installation instructions: https://helm.sh/docs/intro/install/
  1. There are two sets of hinted commands (NVIDIA is not set as the default runtime for Docker -> NVIDIA visible devices are not set) that the user had to manually copy paste and run. Is it possible to make sky local up run them automatically?

  2. The last hint seems to be new and comes from kind? It's not clear if a SkyPilot user should run the hinted kubectl cluster-info --context kind-skypilot command.

I then ran

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.

$ python -m sky.utils.kubernetes.gpu_labeler
Found 0 GPU nodes in the cluster
No GPU nodes found in the cluster. If you have GPU nodes, please ensure that they have the label `nvidia.com/gpu: <number of GPUs>`

without luck.

@romilbhardwaj (Collaborator Author)

Changing from draft -> ready for review.

Fixed some bugs/UX and tested on GCP and AWS GPU VMs.

There are two sets of hinted commands (NVIDIA is not set as the default runtime for Docker -> NVIDIA visible devices are not set) that the user had to manually copy paste and run. Is it possible to make sky local up run them automatically?

That's a good point. I wanted to do that too, but realized that 1) the steps involve modifying the user's global settings, which will affect their environment, and 2) they require sudo permissions, which may require launching sky local up as a sudo user for users who don't have passwordless sudo set up. Figured it's best to have users be intentional about this and configure their environment themselves.

@concretevitamin (Collaborator) commented Dec 25, 2023

Thanks @romilbhardwaj. I tried out on

sky launch --cloud gcp --gpus L4:8 --use-spot --cpus 2+ -y -c dbg-docker

and couldn't seem to get it to work.

User journey
1.

$ sky local up --gpus
⠦ Creating local cluster with GPU support (this may take up to 15 minutes)...

It'd be great to print "To see detailed log: tail -f ..." just like sky launch. Currently it takes a while and leaves me wondering what it's doing and why it's taking so long.

$ sky local up --gpus
Local Kubernetes cluster created successfully with 96 CPUs. `sky launch` can now run tasks locally.
Hint: To change the number of CPUs, change your docker runtime settings. See https://kind.sigs.k8s.io/docs/user/quick-start/#settings-for-docker-desktop for more info.
  • with 96 CPUs -> with 96 CPUs and XX GPUs?
  • can we make the first line green?
  2. It is not clear whether the local up command labels the GPUs for the user - turns out it does not?
$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.

Making it auto-label would be very convenient. Recall the user feedback about minimizing the number of steps.

  3. I manually triggered labeling:
$ python -m sky.utils.kubernetes.gpu_labeler
Found 1 GPU nodes in the cluster
Created GPU labeler job for node skypilot-control-plane
GPU labeling started - this may take a few minutes to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerators: <gpu_name>`.

The job appears finished?

$ kubectl get jobs -n kube-system -l job=sky-gpu-labeler
NAME                                               COMPLETIONS   DURATION   AGE
sky-gpu-labeler-d71e5a46f04de559ca097dd3c16cd8c5   1/1           6s         2m27s

but it's not labeled

kubectl describe nodes | grep accelerators  # returns nothing

@romilbhardwaj (Collaborator Author)

Thanks for the UX feedback - good points. Will fix.

For autolabelling - yes, sky local up --gpus will indeed autolabel; there should be no need to run the labelling separately. It works on T4 and other GPUs. Looks like the labelling script was broken for L4 GPUs - I've pushed a fix to add L4 and tested it on L4; it seems to be working now.

Investigating whether we can do away with the need to maintain the list of allowed_gpus and instead label with whatever GPU name nvidia-smi reports.
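To make the autolabelling idea above concrete, here is a hedged sketch of labelling a node with whatever name nvidia-smi reports. This is illustrative only - the real labeller (sky.utils.kubernetes.gpu_labeler) runs as a Kubernetes Job, and the function name, node name default, and label formatting here are assumptions; only the skypilot.co/accelerators label key and the skypilot-control-plane node name are taken from this thread.

import subprocess

def label_node_with_gpu(node_name: str = 'skypilot-control-plane') -> None:
    # Query the GPU model name directly from nvidia-smi.
    gpu_name = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'],
        text=True).splitlines()[0].strip()
    # Kubernetes label values cannot contain spaces; lowercase-with-dashes
    # mirrors names like nvidia-rtx-a6000 seen later in this thread.
    label_value = gpu_name.lower().replace(' ', '-')
    subprocess.check_call([
        'kubectl', 'label', 'node', node_name, '--overwrite',
        f'skypilot.co/accelerators={label_value}'
    ])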

@creatorrr

AFAIK, this method doesn't allow being selective about which and how many GPUs are visible, and for local machines the possible configurations are practically endless (I have 2x A5000 and 1x 2080Ti on my desktop and 1x 4090 on my laptop). Maybe for local Kubernetes we could just create a new LOCAL GPU type?

@tobi (Contributor) commented Jan 4, 2024

I tried this out - thank you for doing this work. Let's make sure that the --gpus feature is enabled by default once this lands, because it's exceedingly nonsensical to have no GPU access when using sky where that is the entire point 😆.

I encountered a few issues:

  • Invoking the shell script that you created from sky local up --gpus eats all of its output, but the output is really necessary for debugging purposes, so maybe we should just shell out to it?
  • My cards are A6000s, which were missing from the discovery process. This has happened before at some point; I feel like sky should just expose all NVIDIA cards and not rely on multiple implementations of allow-lists.
  • I got stuck in an endless loop of Waiting for GPU operator installation... after running the script directly. The gpu-operator never finds my A6000s. This is likely some latent configuration problem on my machine that I need to debug. Oddly, running nvidia-smi through Docker works fine, so I'm confused. But yes, maybe handle this case in such a way that after 10 retries it errors out.

@romilbhardwaj (Collaborator Author)

Hey @tobi - thanks for the super useful feedback! I've made a bunch of UX and functionality fixes - hopefully this should fix your issues. Please give it a go now :)

  • Logging: The detailed logs are now written to disk and the path to tail them is shown at the start. If any error happens, the stderr and the path to the full logs are shown again. Hope this helps with debugging.
  • Fine-grained spinner: The progress spinner now shows exactly the step that's being executed (e.g., "Installing NVIDIA GPU Operator", "Waiting for nodes to be labelled", etc.)
  • GPU discovery fixes: GPU discovery has been reworked - your A6000 machines should work now. The allow list has been changed to simply be a canonical-name "suggestion" list. If a GPU name matches any of the names in the list, it's canonicalized so that the SkyPilot optimizer can handle failover to the cloud, if necessary. If the GPU name does not have a match in the list, we directly use the name provided by nvidia-smi (useful for GPUs unknown to SkyPilot). See the sketch after this list.
  • Timeouts and hints for debugging: Sorry to hear you got stuck on Waiting for GPU operator installation. This does seem like a configuration issue. Can you try running kubectl get pods -n gpu-operator and kubectl describe pod <stuck_pod_name> -n gpu-operator to help debug? (I've also added a timeout to this step so that it doesn't get stuck forever, plus hints for how to debug if it does get stuck.)
  • GPU support is default now: GPU support for sky local up is now automatically enabled on supported hosts (can override with --no-gpus)
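A minimal sketch of the canonicalization idea from the "GPU discovery fixes" bullet above. The name list is taken from the GPU name list changed in this PR (quoted later in the review); the matching logic, fallback formatting, and function name are illustrative assumptions, not the exact implementation.

# Canonical-name "suggestion" list (from the GPU name list changed in this PR).
CANONICAL_GPU_NAMES = [
    'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4', 'V100',
    'A10', 'P100', 'P40', 'P4', 'L4', 'A6000'
]

def canonicalize_gpu_name(nvidia_smi_name: str) -> str:
    # Map an nvidia-smi name to a canonical name if one matches,
    # e.g. 'Tesla V100-SXM2-16GB' -> 'V100'.
    compact = nvidia_smi_name.upper()
    for canonical in CANONICAL_GPU_NAMES:
        if canonical.upper() in compact:
            return canonical
    # No match: use whatever nvidia-smi reports, with spaces made label-safe,
    # so GPUs unknown to SkyPilot still get a usable name.
    return nvidia_smi_name.replace(' ', '-')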

Some screenshots:

  • Spinner showing intermediate steps (screenshot)
  • sky local up on a 4xA6000 machine (screenshot)
  • Canonicalization in action on 4x V100-SXM on fluidstack - the optimizer shows both local V100s and AWS V100s as targets for provisioning (only AWS was enabled for this screenshot)

lmk if you run into any issues!

Tested:

  • GCP 1x L4 machine
  • GCP 1x T4 machine
  • 4x A6000 machine (on fluidstack) - gets listed as a new GPU nvidia-rtx-a6000.
  • 4x V100-SXM2 machine (on fluidstack) - gets canonicalized to v100

(Merge commit pushed: "…o localup_gpus", with conflicts in sky/utils/kubernetes/create_cluster.sh)
@romilbhardwaj marked this pull request as ready for review on January 5, 2024 at 17:14.
@Michaelvll (Collaborator) left a comment:

Thanks for adding GPU support to sky local up @romilbhardwaj! I just tried it on a new fluidstack VM with 4 QUADRO-RTX-5000 GPUs, and it works well. Left several comments, mainly related to UX.

Comment on lines +3482 to 3483
for gpu, _ in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
Collaborator:

nit:

Suggested change
for gpu, _ in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))])
for gpu, info_list in sorted(result.items()):
gpu_table.add_row([gpu, _list_to_str(info_list)])

Collaborator Author:

I think we need to "pop" from the list to avoid having the GPU show up again in the other GPUs table at L3502 (see the small illustration after this thread).

Collaborator:

Ahh, I see. Missed that part! Sounds good to me!
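For readers following along, a tiny standalone illustration of the pattern being discussed (hypothetical data and variable names, not the actual sky/cli.py code): popping entries while building the first table leaves only the unmatched GPUs for the later "other GPUs" table.

# Hypothetical GPU -> info mapping, as if collected from the cluster.
result = {
    'V100': ['4x V100 on node-1'],
    'nvidia-rtx-a6000': ['4x A6000 on node-2'],
}
common_gpus = ['V100', 'T4', 'A100']  # hypothetical "common GPUs" set

common_rows = []
for gpu in sorted(list(result.keys())):
    if gpu in common_gpus:
        # pop() removes the entry so it cannot show up again below.
        common_rows.append((gpu, result.pop(gpu)))

# Whatever is left in `result` goes into the "other GPUs" table.
other_rows = sorted(result.items())
print(common_rows)  # [('V100', ['4x V100 on node-1'])]
print(other_rows)   # [('nvidia-rtx-a6000', ['4x A6000 on node-2'])]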

@romilbhardwaj (Collaborator Author)

Thanks @Michaelvll! Addressed your comments and added a UX fix to show all dependency errors in one go:

(screenshot)

@tobi (Contributor) commented Jan 8, 2024

really nice!

@Michaelvll (Collaborator) left a comment:

Sorry for missing this PR and thank you for the update @romilbhardwaj! It looks great to me.

Comment on lines +3482 to +3483
if (cloud_obj is not None and
cloud_obj.is_same_cloud(clouds.Kubernetes())):
Collaborator:

nit:

Suggested change
if (cloud_obj is not None and
cloud_obj.is_same_cloud(clouds.Kubernetes())):
if isinstance(cloud_obj, clouds.Kubernetes):

Comment on the changed GPU name list (old list, then new list):
'A100-80GB', 'A100', 'A10G', 'K80', 'M60', 'T4g', 'T4', 'V100', 'A10',
'P100', 'P40', 'P4', 'L4'
'A100-80GB', 'A100', 'A10G', 'H100', 'K80', 'M60', 'T4g', 'T4', 'V100',
'A10', 'P100', 'P40', 'P4', 'L4', 'A6000'
Collaborator:

Is this A6000 the same as RTXA6000 in other clouds? If the labelling is A6000 by default in k8s, we should probably rename it for the other clouds as well in the future. : )

Collaborator Author:

Yes! nvidia-smi lists this as NVIDIA RTX A6000 - perhaps we can converge onto nvidia-smi's naming convention in the future?

@romilbhardwaj (Collaborator Author)

Thanks for the review @Michaelvll! Tested again on a V100 - merging now.

@romilbhardwaj merged commit c05a0ea into master on Jan 22, 2024 (19 checks passed) and deleted the localup_gpus branch on January 22, 2024 at 18:54.