docker rehaul: remove repo2docker and add gpu support #3161

Merged
stephchen merged 39 commits into master from launch/docker-rehaul on Feb 18, 2022

Conversation

stephchen
Contributor

Fixes WB-7840

Description

  • removes repo2docker
  • adds CUDA support for gpus
  • some smaller things:
    • builds the Docker image based on the current Python version if not launching from a previous wandb run
    • adds a --gpu flag to the CLI to build a GPU-enabled image (TODO: if rerunning a wandb run that previously ran on GPU, turn GPU on by default)
    • TODO: I don't think buildx partial caching (e.g. if we change some but not all of requirements.txt) is working at the moment, looking into this

metrics:

  • for a test image with (numpy, torch, wandb):
    • 203.7s build time from scratch
    • 2.42gb image size (2.55gb with gpu)
    • 8.2s build time when rebuilding identical image
  • compare with master:
    • 380.3s build time from scratch (370s building base image in repo2docker, 10.3s building on top)
    • 4.22gb image size (plus another 3.91gb base image stored on system)
    • 3.1s build time when rebuilding identical image (this caching also depends entirely on the target project name, which is brittle)
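
To put the from-scratch numbers above side by side, here is a small arithmetic sketch. It uses only the figures quoted in the description; the helper name `percent_reduction` is made up for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return (1 - after / before) * 100


# Figures from the description: master vs. this PR (CPU image).
build = percent_reduction(380.3, 203.7)  # seconds, build from scratch
image = percent_reduction(4.22, 2.42)    # GB, final image size

print(f"from-scratch build time: {build:.1f}% lower")
print(f"image size: {image:.1f}% lower")
```

That works out to roughly 46% less build time and 43% less image size, not counting the extra 3.91gb base image that master keeps on disk. The identical-image rebuild is the one metric that got slower (3.1s to 8.2s).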

Testing

  • tested manually locally with a few different python versions, tested on gpu via gcp
  • writing unit tests now

Checklist

  • Name PR "[WB-NNNN][WB-MMMM] Add support for..." similar to entries in CHANGELOG.md
  • Include reference to internal ticket "Fixes WB-NNNN" (and github issue "Fixes #NNNN" if applicable)

Contributor

@vanpelt vanpelt left a comment

This is looking cool, a couple things to consider:

  1. The base image is really important here. When "gpu" is enabled we should default to the latest version of CUDA and allow the user to specify which CUDA version they want. Of course, a user can override this entirely with a custom base image or image. I wonder if we can derive the CUDA version from wandb-metadata.json or elsewhere...
  2. We should consider standardizing around miniconda for our builds. The benefits will be faster build times for users that have conda configs and the ability to use any python version we want without doing an entire rebuild. We're currently using miniconda in our wandb/local build.

The cool thing about conda is that we can install any version of python we want. Potentially we make early steps of the build always install the miniconda py39 distribution, but then create a conda environment later with the appropriate python version, i.e.

conda create --name wandb python=3.7 pip
pip install -r requirements.txt

The trick is ensuring the appropriate conda environment is activated by default, happy to discuss more here if that's interesting.
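
One way the "activated by default" part could work is to put the environment's bin directory first on PATH instead of running `conda activate`. The sketch below renders the Dockerfile lines from Python. It is not what the PR does: the function name, the `wandb` env name default, and the `/opt/conda` install root are assumptions for illustration, and it uses `conda create`, which accepts package specs on the command line.

```python
from typing import List


def conda_env_lines(
    py_version: str,
    env_name: str = "wandb",          # assumed env name, taken from the comment above
    conda_root: str = "/opt/conda",   # assumed miniconda install location
) -> List[str]:
    """Render Dockerfile lines that create a conda env and make it the default."""
    env_bin = f"{conda_root}/envs/{env_name}/bin"
    return [
        # Create the env with the requested Python version plus pip.
        f"RUN conda create --yes --name {env_name} python={py_version} pip",
        # Putting the env's bin directory first means `python` and `pip`
        # resolve to the env in every later RUN and at container start.
        f"ENV PATH={env_bin}:$PATH",
        "RUN pip install -r requirements.txt",
    ]


print("\n".join(conda_env_lines("3.7")))
```

The PATH approach skips conda's activation scripts, so packages that rely on activation hooks would still need a real `conda activate` or `conda run`.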

We definitely have a lot of users using conda today and I think that's growing. It's especially popular for anyone in Windows land and MLflow has standardized on it.

@stephchen
Contributor Author

@vanpelt yeah conda is a good idea, repo2docker basically built their pip stuff on top of conda infra I think — lemme see if I can sub that in easily

cuda versioning is a good note, I haven't really tested it but eg cuda/torch compatibility is definitely a common issue

@codecov

codecov bot commented Jan 28, 2022

Codecov Report

Merging #3161 (9b89537) into master (412f7eb) will decrease coverage by 0.20%.
The diff coverage is 91.90%.

@@            Coverage Diff             @@
##           master    #3161      +/-   ##
==========================================
- Coverage   80.15%   79.95%   -0.21%     
==========================================
  Files         213      213              
  Lines       27875    27882       +7     
==========================================
- Hits        22343    22292      -51     
- Misses       5532     5590      +58     
| Flag | Coverage Δ |
| --- | --- |
| functest | 56.65% <1.73%> (-0.55%) ⬇️ |
| unittest | 69.56% <91.90%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| wandb/sdk/launch/launch_add.py | 82.85% <ø> (ø) |
| wandb/sdk/launch/launch.py | 79.24% <80.00%> (-1.15%) ⬇️ |
| wandb/sdk/launch/runner/aws.py | 86.76% <87.50%> (+0.25%) ⬆️ |
| wandb/sdk/launch/docker.py | 88.99% <91.12%> (+2.01%) ⬆️ |
| wandb/sdk/launch/_project_spec.py | 89.52% <93.75%> (-0.89%) ⬇️ |
| wandb/cli/cli.py | 66.20% <100.00%> (+0.16%) ⬆️ |
| wandb/sdk/launch/runner/local.py | 82.85% <100.00%> (-2.09%) ⬇️ |
| wandb/sdk/launch/utils.py | 89.00% <100.00%> (-4.85%) ⬇️ |
| wandb/integration/metaflow/metaflow.py | 52.29% <0.00%> (-32.76%) ⬇️ |

... and 13 more

# make sure `python` points at the right version
RUN update-alternatives --install /usr/bin/python python /usr/bin/python{py_version} 1 \
&& update-alternatives --install /usr/local/bin/python python /usr/bin/python{py_version} 1
"""
Contributor

We should probably keep these in a different file, docker_templates or something

Contributor Author

sorta prefer keeping these here for now since they are only used in one place?

Contributor

Okay, that's fine

def get_current_python_version():
    full_version = sys.version.split()[0].split(".")
    major = full_version[0]
    version = ".".join(full_version[:2]) if len(full_version) >= 2 else major + ".0"
Contributor

Is this just dropping the + for the dev versions of python potentially?

Contributor Author

dropping the + would be correct since this is getting the version mostly to get the right docker base image, and i don't think they have them for dev versions
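
For reference, a minimal sketch of the major.minor extraction being discussed, written so that pre-release suffixes and a trailing `+` cannot leak into the image tag. The names `major_minor` and `python_base_image` are made up for illustration and are not the PR's functions; the `python:{}-slim-buster` tag pattern is the one used elsewhere in this PR.

```python
import re
import sys


def major_minor(version_string: str) -> str:
    """Return 'MAJOR.MINOR' from a string shaped like sys.version."""
    # sys.version looks like "3.9.7 (default, ...)"; the first token is the version.
    token = version_string.split()[0]
    # Match only the leading numeric components, so "3.11.0a1+" yields 3 and 11.
    match = re.match(r"(\d+)(?:\.(\d+))?", token)
    if match is None:
        raise ValueError(f"unrecognized Python version: {version_string!r}")
    major = match.group(1)
    minor = match.group(2) or "0"  # fall back to ".0" when no minor is present
    return f"{major}.{minor}"


def python_base_image(version_string: str = sys.version) -> str:
    """Build the CPU base image tag for the given interpreter version."""
    return "python:{}-slim-buster".format(major_minor(version_string))


print(python_base_image())
```

Because only digits are matched, a dev build such as `3.11.0a1+` maps to `3.11`, which matches the intent of picking a released base image.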

"Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing"
)
requirements_line = "RUN WANDB_DISABLE_CACHE=true "
requirements_line += "pip install -r requirements.txt"
Contributor

Doesn't this drop the use of _wandb_bootstrap? What if the repo doesn't have a requirements.txt?

Contributor Author

fixed and readded this in

Contributor

@KyleGoyette KyleGoyette left a comment

Couple of questions where I'm not sure if we're doing the right thing especially around installing requirements. Also still needs some tests.

I don't think we should block this on the CUDA version. We should allow users to specify it, defaulting to the newest. We should be able to add the CUDA version to RunInfo and pull it down with the rest.

"""Fill in the Dockerfile templates for stage 2 of build. CPU version is built on python:slim, GPU
version is built on nvidia:cuda"""

python_base_image = "python:{}-slim-buster".format(py_version) # slim for running
Contributor

We could allow users to specify a base image using the docker key of the launch spec config, in case they want to directly specify another image as the base.
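
A rough sketch of what that override could look like. Everything here beyond the `python:{}-slim-buster` pattern is an assumption: the `base_image` sub-key under the docker config, the function name, and the GPU tag template are placeholders, not the PR's actual schema.

```python
from typing import Any, Dict, Optional

# Placeholder tag pattern for the GPU image; check the real nvidia/cuda tags before use.
GPU_IMAGE_TEMPLATE = "nvidia/cuda:{cuda_version}-runtime"


def select_base_image(
    py_version: str,
    gpu: bool = False,
    docker_config: Optional[Dict[str, Any]] = None,
    cuda_version: str = "11.0",  # illustrative default only
) -> str:
    """Pick the base image, letting a user-supplied image win over the defaults."""
    # "base_image" is a hypothetical key inside the launch spec's docker config.
    user_image = (docker_config or {}).get("base_image")
    if user_image:
        return str(user_image)
    if gpu:
        return GPU_IMAGE_TEMPLATE.format(cuda_version=cuda_version)
    return "python:{}-slim-buster".format(py_version)


print(select_base_image("3.8"))
print(select_base_image("3.8", docker_config={"base_image": "my-registry/base:latest"}))
```

Checking the user override first keeps the GPU and CPU defaults as fallbacks only, so a custom image never gets silently replaced.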

Contributor

We may also need a Windows option for Windows users.

wandb.termwarn(
    "Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing"
)
prefix = "RUN WANDB_DISABLE_CACHE=true"
Contributor

could factor out this warn and prefix to outside of the other if statements no?
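
The refactor suggested here might look roughly like the sketch below: decide the prefix (and warn) once, then reuse it for each install line. The function names and the `warn` parameter are made up for illustration. The buildx branch is an assumption, since the excerpt only shows the non-buildx case; it uses BuildKit's `RUN --mount=type=cache` syntax.

```python
from typing import Callable, Iterable, List

BUILDX_WARNING = (
    "Docker BuildX is not installed, for faster builds upgrade docker: "
    "https://github.com/docker/buildx#installing"
)


def pip_run_prefix(buildx_installed: bool, warn: Callable[[str], None] = print) -> str:
    """Return the RUN prefix for pip installs, warning once if buildx is missing."""
    if buildx_installed:
        # Assumed buildx branch: a BuildKit cache mount keeps pip's cache between builds.
        return "RUN --mount=type=cache,target=/root/.cache/pip"
    warn(BUILDX_WARNING)
    return "RUN WANDB_DISABLE_CACHE=true"


def install_lines(
    buildx_installed: bool,
    requirement_files: Iterable[str],
    warn: Callable[[str], None] = print,
) -> List[str]:
    """Render one pip install line per requirements file with a shared prefix."""
    prefix = pip_run_prefix(buildx_installed, warn)  # computed once, outside any branch
    return [f"{prefix} pip install -r {name}" for name in requirement_files]


print("\n".join(install_lines(False, ["requirements.txt"])))
```

In the real code `warn` would presumably be `wandb.termwarn`; passing it in keeps the sketch free of a wandb import.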

wandb.termwarn(
    "Using supplied docker image: {}. Artifact swapping and launch metadata disabled".format(
        launch_project.docker_image
    )
)
image_uri = launch_project.docker_image
_logger.info("Getting docker command...")

Contributor

We should be able to apply the final build step on top of a user-provided image, where we would set the environment variables and inject the launch-metadata.json file. Then we can get rid of the above warning.
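
As a rough illustration of that idea (deferred to a later PR in this thread), the final stage could be a tiny Dockerfile layered on the supplied image. The function name, the `/app` destination, and the example variable are assumptions; only the launch-metadata.json filename comes from the comment above.

```python
import json
from typing import Dict


def final_stage_dockerfile(
    base_image: str,
    env_vars: Dict[str, str],
    metadata_filename: str = "launch-metadata.json",
    workdir: str = "/app",  # hypothetical destination directory
) -> str:
    """Render a Dockerfile that layers env vars and launch metadata on an image."""
    lines = [f"FROM {base_image}"]
    for key in sorted(env_vars):
        # json.dumps double-quotes the value so spaces survive in the ENV line.
        lines.append(f"ENV {key}={json.dumps(env_vars[key])}")
    lines.append(f"COPY {metadata_filename} {workdir}/{metadata_filename}")
    return "\n".join(lines) + "\n"


print(final_stage_dockerfile("my-registry/train:latest", {"WANDB_PROJECT": "demo"}))
```

Docker still expands `$` inside double-quoted ENV values, so values containing `$` would need extra escaping in a real implementation.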

Contributor Author

i'm gonna push this to another pr, this one is complicated enough and doing this would require totally different handling

Contributor

Agreed

@@ -370,6 +374,7 @@ def create_project_from_spec(launch_spec: Dict[str, Any], api: Api) -> LaunchPro
launch_spec.get("overrides", {}),
launch_spec.get("resource", "local"),
launch_spec.get("resource_args", {}),
launch_spec.get("cuda", False),
Contributor

When we pull the wandb-metadata.json file we can check that for a cuda version if this is false.
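
Pulling the CUDA suggestions in this review together, the resolution order could be sketched as: an explicit version in the launch spec, then whatever the original run recorded, then a default when GPU is requested. The `cuda` key inside the run metadata and the default version are hypothetical; this page does not show what wandb-metadata.json actually contains.

```python
from typing import Any, Dict, Optional

DEFAULT_CUDA_VERSION = "11.0"  # placeholder default, not taken from the PR


def resolve_cuda_version(
    launch_spec: Dict[str, Any],
    run_metadata: Optional[Dict[str, Any]] = None,
) -> Optional[str]:
    """Return the CUDA version to build with, or None for a CPU-only image."""
    requested = launch_spec.get("cuda", False)
    # 1. An explicit version string in the launch spec always wins.
    if isinstance(requested, str) and requested:
        return requested
    # 2. Otherwise reuse the version recorded by the original run, if any
    #    ("cuda" is a hypothetical key in the downloaded metadata).
    recorded = (run_metadata or {}).get("cuda")
    if recorded:
        return str(recorded)
    # 3. GPU requested with no version anywhere: fall back to the default.
    if requested is True:
        return DEFAULT_CUDA_VERSION
    return None


print(resolve_cuda_version({"cuda": True}))
print(resolve_cuda_version({}, {"cuda": "11.2"}))
```

Returning None for the CPU case lets the caller keep using the plain Python base image without a separate boolean.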

Contributor

@KyleGoyette KyleGoyette left a comment

A few small changes and some more questions. This is getting pretty close.

Contributor

@vanpelt vanpelt left a comment

So much HOTNESS 🔥 . Nice work!

I can imagine simplifying this build even further in the future. Micromamba looks really sick. Instead of installing the specific python version directly in Ubuntu or Debian, we could let micromamba do it. It can handle conda envs, virtual envs, pip installations and python installations all in a tiny package. Their official docker image is 35MB 😮

@@ -16,7 +16,7 @@
"S3OutputPath": "s3://test-bucket/test-output"
},
"StoppingCondition": {
- "MaxRuntimeInSeconds": "60"
+ "MaxRuntimeInSeconds": 60
Contributor

ty

tox.ini Outdated
@@ -40,7 +40,7 @@ commands_pre =
commands =
mkdir -p test-results
py{35,36,37,38,39,310}: python -m pytest {env:CI_PYTEST_SPLIT_ARGS:} -n={env:CI_PYTEST_PARALLEL:4} --durations=20 --junitxml=test-results/junit.xml --cov-config=.coveragerc --cov --cov-report= --no-cov-on-fail {posargs:tests/}
- pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/
+ ; pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/
Contributor

The caching here is busted, this can be removed.

@@ -227,12 +236,12 @@ def download_entry_point(
def download_wandb_python_deps(
entity: str, project: str, run_name: str, api: Api, dir: str
) -> Optional[str]:
-    metadata = api.download_url(
+    reqs = api.download_url(  # @@@
Contributor

nit: Remove this comment?

@@ -341,6 +304,10 @@ def build_sagemaker_args(
"Sagemaker launcher requires a StoppingCondition Sagemaker resource argument"
)

# remove args that were passed in for launch but not passed to sagemaker
sagemaker_args.pop("region", None)
sagemaker_args.pop("profile", None)
Contributor

nit, there's another arg popped on line 291 if we want to group them
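
If the pops do get grouped, one possible shape is a single tuple of launch-only keys filtered in one place. This is only a sketch: the helper name is made up, and the third key popped earlier in the file is not shown in this excerpt, so it is left out here.

```python
from typing import Any, Dict, Iterable

# Keys consumed by launch itself; extend with the other popped arg when grouping.
LAUNCH_ONLY_KEYS = ("region", "profile")


def strip_launch_only_args(
    args: Dict[str, Any], keys: Iterable[str] = LAUNCH_ONLY_KEYS
) -> Dict[str, Any]:
    """Return a copy of args without the keys that SageMaker should not receive."""
    excluded = set(keys)
    return {name: value for name, value in args.items() if name not in excluded}


resource_args = {
    "region": "us-east-1",
    "profile": "default",
    "StoppingCondition": {"MaxRuntimeInSeconds": 60},
}
print(strip_launch_only_args(resource_args))
```

Returning a copy instead of popping in place leaves the caller's dict untouched, which differs from the `pop` calls in the diff.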

Contributor

@KyleGoyette KyleGoyette left a comment

nice!

@stephchen stephchen merged commit 95795a1 into master Feb 18, 2022
@stephchen stephchen deleted the launch/docker-rehaul branch February 18, 2022 22:07

3 participants