[Examples] Add docker compose example to run multiple containers #2745

Merged on Nov 6, 2023 (6 commits)


romilbhardwaj
Collaborator

Simple example showing how to use docker compose to launch multiple containers on a SkyPilot cluster.

  • sky launch -c myclus compose_example.yaml
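
For readers without the file open, a SkyPilot task that wraps Docker Compose could look roughly like the sketch below. This is an illustration under assumptions, not the contents of compose_example.yaml: the task name is hypothetical, the GPU type and count are taken from the commands and comments in this thread, and it assumes `docker compose` is already available on the cluster image and that docker-compose.yml sits in the synced workdir.

```yaml
# Hypothetical sketch of a SkyPilot task that runs Docker Compose.
# Not the merged compose_example.yaml; field values are assumptions.
name: docker-compose-sketch

resources:
  accelerators: T4:2   # two GPUs, one per container (assumed)

# Sync the current directory (containing docker-compose.yml) to the cluster.
workdir: .

run: |
  # Assumes the docker compose plugin is installed on the image.
  docker compose up
```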

@concretevitamin
Collaborator

concretevitamin commented Nov 2, 2023

This is great to have @romilbhardwaj!

One issue with running it on Azure:

» sky launch compose_example.yaml -c dbg --cloud azure --cpus 2+ --down --gpus T4:4
...
(task, pid=26616)  gpu-app1 Pulled
(task, pid=26616)  gpu-app2 Pulled
(task, pid=26616)  Network sky_workdir_default  Creating
(task, pid=26616)  Network sky_workdir_default  Created
(task, pid=26616)  Container sky_workdir-gpu-app2-1  Creating
(task, pid=26616)  Container sky_workdir-gpu-app1-1  Creating
(task, pid=26616)  Container sky_workdir-gpu-app2-1  Created
(task, pid=26616)  Container sky_workdir-gpu-app1-1  Created
(task, pid=26616) Attaching to sky_workdir-gpu-app1-1, sky_workdir-gpu-app2-1
(task, pid=26616) Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown
ERROR: Job 1 failed with return code list: [1]
INFO: Job finished (status: FAILED).
...

This may be because the NVIDIA driver in our default GPU image on Azure is too old for CUDA 11.8. Is it a quick fix?

@romilbhardwaj
Collaborator Author

Ah, good catch - we do need to update our Azure image (#2751). For this PR, I've changed the CUDA image version to 11.5.2 and tested that it works on AWS, Azure, and GCP.

services:
  gpu-app1:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04
    command: nvidia-smi
Collaborator

I added -l 1 here and at L17 so that nvidia-smi loops forever.

It appears both containers print the same GPU ID. Note the SkyPilot task has 2 GPUs assigned, so GPUs 0 and 1 are available.

Is there any env var (CUDA_VISIBLE_DEVICES?) we can add to this file to show how to distribute the containers to GPUs 0 and 1 respectively? Can even be a comment.

Collaborator Author

Good point - I've changed from count to explicit device_ids. Note that nvidia-docker remaps device IDs, so from within the gpu-app2 container the visible GPU ID will be ['0'] (though it maps to physical device 1). I've also added this as a comment in the file.
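
As a rough illustration of the change described above, pinning each service to one GPU in a Compose file can be sketched as follows. The service names and image come from this thread; the `-l 1` flag follows the review comment, and the rest (including gpu-app2 using the same image) is an assumption rather than the merged file. In the Compose GPU reservation syntax, `count` and `device_ids` are mutually exclusive, which is why one replaces the other.

```yaml
# Sketch only, not the merged docker-compose.yml.
services:
  gpu-app1:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04
    command: nvidia-smi -l 1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']    # physical GPU 0
              capabilities: [gpu]
  gpu-app2:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04   # assumed same image
    command: nvidia-smi -l 1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']    # physical GPU 1; appears as GPU 0 inside the container
              capabilities: [gpu]
```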

Collaborator

@concretevitamin concretevitamin left a comment

LGTM!

examples/docker/compose/docker-compose.yml (outdated, resolved)
@romilbhardwaj merged commit 5a35ab6 into master on Nov 6, 2023
18 checks passed
@romilbhardwaj deleted the examples_docker_compose branch on November 6, 2023, 20:24