docker rehaul: remove repo2docker and add gpu support #3161

stephchen · 2022-01-20T23:15:33Z

Fixes WB-7840

Description

- removes repo2docker
- adds CUDA support for GPUs
- some smaller things:
  - builds the docker image based on the current python version if not launching from a previous wandb run
  - adds a --gpu flag to the CLI to enable building a GPU-enabled image (TODO: if rerunning a wandb run that previously ran on GPU, default to GPU on)
  - TODO: I don't think buildx partial caching (e.g. if we change some but not all of requirements.txt) is working at the moment; looking into this

metrics:

for a test image with (numpy, torch, wandb):
- 203.7s build time from scratch
- 2.42 GB image size (2.55 GB with gpu)
- 8.2s build time when rebuilding an identical image

compare with master:
- 380.3s build time from scratch (370s building the base image in repo2docker, 10.3s building on top)
- 4.22 GB image size (plus another 3.91 GB base image stored on system)
- 3.1s build time when rebuilding an identical image (this caching also depends entirely on the target project name, which is brittle)
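
To put the two sets of numbers side by side, here is a small sketch that computes the relative reduction from the figures quoted above. The helper name `percent_reduction` is made up for illustration; only the four input values come from this description.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100.0


# Figures quoted in the description (master vs. this branch).
build_time = percent_reduction(before=380.3, after=203.7)  # seconds, from scratch
image_size = percent_reduction(before=4.22, after=2.42)    # GB, CPU image

print("from-scratch build time reduced by about {:.1f}%".format(build_time))
print("image size reduced by about {:.1f}%".format(image_size))
```

Note that the rebuild of an identical image goes the other way (8.2s here versus 3.1s on master), so the gain is in cold builds and image size rather than in the cached path.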

Testing

tested manually locally with a few different python versions, tested on gpu via gcp
writing unit tests now

Checklist

Name PR "[WB-NNNN][WB-MMMM] Add support for..." similar to entries in CHANGELOG.md
Include reference to internal ticket "Fixes WB-NNNN" (and github issue "Fixes #NNNN" if applicable)

vanpelt

This is looking cool, a couple things to consider:

The base image is really important here. When "gpu" is enabled we should default to the latest version of CUDA and allow the user to specify which CUDA version they want. Of course a user can override this entirely with a custom base image or image. I wonder if we can derive the CUDA version from wandb-metadata.json or elsewhere...
We should consider standardizing around miniconda for our builds. The benefits will be faster build times for users that have conda configs and the ability to use any python version we want without doing an entire rebuild. We're currently using miniconda in our wandb/local build.

The cool thing about conda is that we can install any version of python we want. Potentially we make early steps of the build always install the miniconda py39 distribution, but then create a conda environment later with the appropriate python version, i.e.

conda create --name wandb python=3.7 pip
pip install -r requirements.txt

The trick is ensuring the appropriate conda environment is activated by default, happy to discuss more here if that's interesting.
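
A minimal sketch of how the build could emit those steps, assuming a miniconda base image that installs conda under `/opt/conda`. The function name, the env name default and the install path are all assumptions for illustration, not what this PR implements. Putting the env's `bin` directory first on `PATH` is one common way to make an environment the default without running `conda activate` in every layer.

```python
from typing import List


def conda_env_dockerfile_lines(py_version: str, env_name: str = "wandb") -> List[str]:
    """Render Dockerfile lines that create a conda env and make it the default.

    Assumes conda lives in /opt/conda (hypothetical; depends on the base image).
    """
    env_bin = "/opt/conda/envs/{}/bin".format(env_name)
    return [
        # Create the env with the requested python version plus pip.
        "RUN conda create --yes --name {} python={} pip".format(env_name, py_version),
        # Prepend the env's bin dir so `python` and `pip` resolve inside the env.
        "ENV PATH={}:$PATH".format(env_bin),
        "RUN pip install -r requirements.txt",
    ]


print("\n".join(conda_env_dockerfile_lines("3.7")))
```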

We definitely have a lot of users using conda today and I think that's growing. It's especially popular for anyone in windows land and MLFlow has standardized on it.

stephchen · 2022-01-21T03:24:25Z

@vanpelt yeah conda is a good idea, repo2docker basically built their pip stuff on top of conda infra I think — lemme see if I can sub that in easily

cuda versioning is a good note, I haven't really tested it but eg cuda/torch compatibility is definitely a common issue

codecov · 2022-01-28T19:06:08Z

Codecov Report

Merging #3161 (9b89537) into master (412f7eb) will decrease coverage by 0.20%.
The diff coverage is 91.90%.

@@            Coverage Diff             @@
##           master    #3161      +/-   ##
==========================================
- Coverage   80.15%   79.95%   -0.21%     
==========================================
  Files         213      213              
  Lines       27875    27882       +7     
==========================================
- Hits        22343    22292      -51     
- Misses       5532     5590      +58

| Flag | Coverage Δ | |
|---|---|---|
| functest | `56.65% <1.73%> (-0.55%)` | ⬇️ |
| unittest | `69.56% <91.90%> (-0.02%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| wandb/sdk/launch/launch_add.py | `82.85% <ø> (ø)` | |
| wandb/sdk/launch/launch.py | `79.24% <80.00%> (-1.15%)` | ⬇️ |
| wandb/sdk/launch/runner/aws.py | `86.76% <87.50%> (+0.25%)` | ⬆️ |
| wandb/sdk/launch/docker.py | `88.99% <91.12%> (+2.01%)` | ⬆️ |
| wandb/sdk/launch/_project_spec.py | `89.52% <93.75%> (-0.89%)` | ⬇️ |
| wandb/cli/cli.py | `66.20% <100.00%> (+0.16%)` | ⬆️ |
| wandb/sdk/launch/runner/local.py | `82.85% <100.00%> (-2.09%)` | ⬇️ |
| wandb/sdk/launch/utils.py | `89.00% <100.00%> (-4.85%)` | ⬇️ |
| wandb/integration/metaflow/metaflow.py | `52.29% <0.00%> (-32.76%)` | ⬇️ |
| ... and 13 more | | |

KyleGoyette · 2022-01-31T23:14:50Z

wandb/sdk/launch/docker.py

+# make sure `python` points at the right version
+RUN update-alternatives --install /usr/bin/python python /usr/bin/python{py_version} 1 \
+    && update-alternatives --install /usr/local/bin/python python /usr/bin/python{py_version} 1
+"""


We should probably keep these in a different file, docker_templates or something

stephchen (Feb 10, 2022): Sorta prefer keeping these here for now, since they're only used in one place?

KyleGoyette (Feb 10, 2022): Okay, that's fine.

KyleGoyette · 2022-01-31T23:17:05Z

wandb/sdk/launch/docker.py

+def get_current_python_version():
+    full_version = sys.version.split()[0].split(".")
+    major = full_version[0]
+    version = ".".join(full_version[:2]) if len(full_version) >= 2 else major + ".0"


Is this just dropping the + for the dev versions of python potentially?

stephchen (Feb 10, 2022): Dropping the + would be correct, since this version is mostly used to pick the right docker base image, and I don't think base images exist for dev versions.
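
To make the behaviour discussed in this thread concrete, here is a standalone sketch of the same major.minor extraction. The name `major_minor` and the sample dev-build string are illustrative assumptions; the splitting logic mirrors the lines under review.

```python
import sys


def major_minor(version_string: str) -> str:
    """Reduce a `sys.version`-style string to "major.minor"."""
    # The first whitespace-separated token is the version, e.g. "3.9.7" or "3.11.0a3+".
    parts = version_string.split()[0].split(".")
    if len(parts) >= 2:
        # Keeping only the first two fields discards the patch field,
        # which is where a dev build's trailing "+" lives.
        return ".".join(parts[:2])
    return parts[0] + ".0"


print(major_minor(sys.version))
print(major_minor("3.11.0a3+ (heads/main:abc1234, Jan 1 2022) [GCC 9.4.0]"))
```

The "+" never reaches the result because it is attached to the third field, which is why the value is safe to use as a docker image tag component.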

KyleGoyette · 2022-01-31T23:24:29Z

wandb/sdk/launch/docker.py

+            "Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing"
+        )
+        requirements_line = "RUN WANDB_DISABLE_CACHE=true "
+    requirements_line += "pip install -r requirements.txt"


Doesn't this drop the use of _wandb_bootstrap? What if the repo doesn't have a requirements.txt?

stephchen (Feb 10, 2022): Fixed, and re-added this.

KyleGoyette

Couple of questions where I'm not sure if we're doing the right thing especially around installing requirements. Also still needs some tests.

I don't think we should block this on the CUDA version. We should allow users to specify it, defaulting to the newest. We should be able to add the CUDA version to RunInfo and pull it down with the rest.

KyleGoyette · 2022-02-10T22:40:07Z

wandb/sdk/launch/docker.py

+    """Fill in the Dockerfile templates for stage 2 of build. CPU version is built on python:slim, GPU
+    version is built on nvidia:cuda"""
+
+    python_base_image = "python:{}-slim-buster".format(py_version)  # slim for running


We could allow users to specify a base image using the docker key of the launch spec config, in case they want to use another image as the base directly.

We may also need a windows option for windows users.
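
One way the suggested override could look, as a sketch only: a user-supplied base image wins, otherwise the choice falls back to the CPU or GPU default. The function name, the `user_base_image` parameter and the `nvidia/cuda` tag format are assumptions; only the `python:{}-slim-buster` pattern comes from the diff above.

```python
from typing import Optional


def choose_base_image(
    py_version: str,
    gpu: bool = False,
    cuda_version: str = "11.0",  # hypothetical default; check which tags exist
    user_base_image: Optional[str] = None,
) -> str:
    """Pick the base image for the final build stage."""
    if user_base_image:
        # An explicit base image from the launch spec takes priority.
        return user_base_image
    if gpu:
        # Tag format is an assumption; real nvidia/cuda tags also encode the OS.
        return "nvidia/cuda:{}-runtime".format(cuda_version)
    return "python:{}-slim-buster".format(py_version)


print(choose_base_image("3.8"))
print(choose_base_image("3.8", gpu=True))
print(choose_base_image("3.8", user_base_image="my-registry/my-base:latest"))
```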

KyleGoyette · 2022-02-10T22:49:00Z

wandb/sdk/launch/docker.py

+            wandb.termwarn(
+                "Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing"
+            )
+            prefix = "RUN WANDB_DISABLE_CACHE=true"


Could we factor this warning and prefix out of the other if statements?
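
A sketch of what that refactor might look like: the BuildX check, warning and prefix are decided in one helper, and the pip/conda branching only chooses the command. The helper names, the cache directories and the conda command are assumptions for illustration; `print` stands in for `wandb.termwarn` so the example is self-contained.

```python
from typing import Callable

BUILDX_WARNING = (
    "Docker BuildX is not installed, for faster builds upgrade docker: "
    "https://github.com/docker/buildx#installing"
)


def run_prefix(buildx_installed: bool, cache_dir: str,
               warn: Callable[[str], None] = print) -> str:
    """Decide the RUN prefix in one place, warning once if BuildX is missing."""
    if buildx_installed:
        # BuildKit cache mount so the package cache survives between builds.
        return "RUN --mount=type=cache,target={}".format(cache_dir)
    warn(BUILDX_WARNING)
    return "RUN WANDB_DISABLE_CACHE=true"


def install_line(buildx_installed: bool, use_conda: bool,
                 warn: Callable[[str], None] = print) -> str:
    """Build the dependency-install line; only the command differs per branch."""
    if use_conda:
        # Hypothetical conda branch and cache location.
        cache_dir, command = "/opt/conda/pkgs", "conda env update -f environment.yml"
    else:
        cache_dir, command = "/root/.cache/pip", "pip install -r requirements.txt"
    return run_prefix(buildx_installed, cache_dir, warn) + " " + command


print(install_line(buildx_installed=True, use_conda=False))
print(install_line(buildx_installed=False, use_conda=False))
```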

wandb/sdk/launch/launch.py

KyleGoyette · 2022-02-10T22:53:14Z

wandb/sdk/launch/runner/local.py

            wandb.termwarn(
                "Using supplied docker image: {}. Artifact swapping and launch metadata disabled".format(
                    launch_project.docker_image
                )
            )
-            image_uri = launch_project.docker_image
-            _logger.info("Getting docker command...")
+


We should be able to apply the final build step on top of a user-provided image, where we would set the environment variables and inject the launch-metadata.json file. Then we can get rid of the above warning.

stephchen (Feb 14, 2022): I'm gonna push this to another PR. This one is complicated enough, and doing this would require totally different handling.

KyleGoyette · 2022-02-10T22:54:52Z

wandb/sdk/launch/_project_spec.py

@@ -370,6 +374,7 @@ def create_project_from_spec(launch_spec: Dict[str, Any], api: Api) -> LaunchPro
        launch_spec.get("overrides", {}),
        launch_spec.get("resource", "local"),
        launch_spec.get("resource_args", {}),
+        launch_spec.get("cuda", False),


When we pull the wandb-metadata.json file we can check that for a cuda version if this is false.
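
A sketch of that fallback, assuming the downloaded wandb-metadata.json has already been parsed into a dict and that it stores the version under a `cuda` key. Both the helper name and that key are assumptions here.

```python
from typing import Any, Dict, Union


def resolve_cuda(launch_spec: Dict[str, Any], run_metadata: Dict[str, Any]) -> Union[bool, str]:
    """Prefer the launch spec's `cuda` value, else fall back to the run's metadata."""
    cuda = launch_spec.get("cuda", False)
    if cuda:
        return cuda
    # Hypothetical key: the original run's recorded CUDA version, if any.
    return run_metadata.get("cuda") or False


print(resolve_cuda({"cuda": True}, {}))
print(resolve_cuda({}, {"cuda": "11.2"}))
print(resolve_cuda({}, {}))
```

Returning the version string from the metadata (rather than just `True`) would also let the build pick a matching CUDA base image, which ties in with the earlier versioning discussion.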

KyleGoyette

A few small changes and some more questions. This is getting pretty close.

vanpelt

So much HOTNESS 🔥 . Nice work!

I can imagine simplifying this build even further in the future. Micromamba looks really sick. Instead of installing the specific python version directly in Ubuntu or Debian, we could let micromamba do it. It can handle conda envs, virtual envs, pip installations and python installations all in a tiny package. Their official docker image is 35MB 😮

KyleGoyette · 2022-02-17T20:05:52Z

tests/fixtures/launch/launch_sagemaker_config.json

@@ -16,7 +16,7 @@
                "S3OutputPath": "s3://test-bucket/test-output"
            },
            "StoppingCondition": {
-                "MaxRuntimeInSeconds": "60"
+                "MaxRuntimeInSeconds": 60


tests/tests_launch/test_launch_aws.py

KyleGoyette · 2022-02-17T20:10:15Z

tox.ini

@@ -40,7 +40,7 @@ commands_pre =
 commands =
    mkdir -p test-results
    py{35,36,37,38,39,310}:     python -m pytest {env:CI_PYTEST_SPLIT_ARGS:} -n={env:CI_PYTEST_PARALLEL:4} --durations=20 --junitxml=test-results/junit.xml --cov-config=.coveragerc --cov --cov-report= --no-cov-on-fail {posargs:tests/}
-    pylaunch:               jupyter-repo2docker --no-run ./tests/fixtures/
+    ; pylaunch:               jupyter-repo2docker --no-run ./tests/fixtures/


The caching here is busted, this can be removed.

KyleGoyette · 2022-02-17T20:34:57Z

wandb/sdk/launch/utils.py

@@ -227,12 +236,12 @@ def download_entry_point(
 def download_wandb_python_deps(
    entity: str, project: str, run_name: str, api: Api, dir: str
 ) -> Optional[str]:
-    metadata = api.download_url(
+    reqs = api.download_url(  # @@@


nit: Remove this comment?

KyleGoyette · 2022-02-17T20:35:49Z

wandb/sdk/launch/runner/aws.py

@@ -341,6 +304,10 @@ def build_sagemaker_args(
            "Sagemaker launcher requires a StoppingCondition Sagemaker resource argument"
        )

+    # remove args that were passed in for launch but not passed to sagemaker
+    sagemaker_args.pop("region", None)
+    sagemaker_args.pop("profile", None)


nit, there's another arg popped on line 291 if we want to group them
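
If the pops were grouped, one possible shape is a single tuple of launch-only keys that is stripped in one pass. The constant and function names are made up for this sketch, and the key popped elsewhere is left out because the thread does not name it.

```python
from typing import Any, Dict

# Keys consumed by launch itself and not forwarded to SageMaker.
# Any other launch-only key could be appended to this tuple.
LAUNCH_ONLY_KEYS = ("region", "profile")


def strip_launch_only_args(resource_args: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the resource args without launch-only keys."""
    sagemaker_args = dict(resource_args)  # copy so the caller's dict is untouched
    for key in LAUNCH_ONLY_KEYS:
        sagemaker_args.pop(key, None)
    return sagemaker_args


args = {
    "region": "us-east-1",
    "profile": "default",
    "StoppingCondition": {"MaxRuntimeInSeconds": 60},
}
print(strip_launch_only_args(args))
```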

KyleGoyette

nice!

stephchen added 10 commits January 11, 2022 00:44

tmp setup

16d3c2d

tmp name

09a5a6b

gpu

8624413

no cache dir

360e89a

pythonslim

f48cffe

multi p1

dc23d93

cli gpu option

9c970a8

python versions

89b724a

clean up codepath

3656dc7

lint

778f3a8

stephchen requested review from KyleGoyette and vanpelt January 20, 2022 23:18

vanpelt reviewed Jan 21, 2022

stephchen added 4 commits January 26, 2022 15:22

Merge branch 'master' into launch/docker-rehaul

d92312a

fix args bug, use python full image for build

262d352

cuda format

ede1dab

rm docker args cli arg

ded8d74

some organizational refactoring

d8372a2

KyleGoyette reviewed Jan 31, 2022

stephchen added 6 commits February 9, 2022 01:18

add conda support

911a65d

change gpu varname

bd3de94

readd support for frozen reqs

cc3bd02

fix buildx for pip

8747f95

buildx conda support

9013146

delete old fns

5c48b8d

KyleGoyette reviewed Feb 10, 2022

wandb/sdk/launch/launch.py (resolved)

KyleGoyette reviewed Feb 10, 2022

KyleGoyette requested changes Feb 10, 2022

add some tests

0efdbfc

vanpelt approved these changes Feb 11, 2022

stephchen added 5 commits February 14, 2022 15:14

fix cuda flag handling

e454695

review fixes + tests

a8e43af

Merge branch 'master' into launch/docker-rehaul

84a5573

fix sagemaker

f31cf95

fix command

d8a2aca

stephchen requested a review from KyleGoyette February 17, 2022 05:35

stephchen added 3 commits February 17, 2022 02:43

tests

2d1345e

Merge branch 'master' into launch/docker-rehaul

13686c4

tests

586b4e6

KyleGoyette reviewed Feb 17, 2022

tests/tests_launch/test_launch_aws.py (resolved)

KyleGoyette reviewed Feb 17, 2022

KyleGoyette approved these changes Feb 17, 2022

stephchen added 5 commits February 17, 2022 22:17

fixes

65cce23

coverage.......

c477e57

Merge branch 'master' into launch/docker-rehaul

916207f

timeout

a189fe6

bump yea-wandb

9b89537

stephchen merged commit 95795a1 into master Feb 18, 2022

stephchen deleted the launch/docker-rehaul branch February 18, 2022 22:07
