-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docker rehaul: remove repo2docker and add gpu support #3161
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking cool, a couple things to consider:
- The base image is really important here. When "gpu" is enabled we should default to the latest version of cuda and allow the user to specify what cuda version they want. Of course a user can override this entirely with a custom baseimage or image. I wonder if we can derive the cuda version from wandb-metadata.json or elsewhere...
- We should consider standardizing around miniconda for our builds. The benefits will be faster build times for users that have conda configs and the ability to use any python version we want without doing an entire rebuild. We're currently using miniconda in our wandb/local build.
The cool thing about conda is that we can install any version of python we want. Potentially we make early steps of the build always install the miniconda py39 distribution, but then create a conda environment later with the appropriate python version, i.e.
conda env create --name=wandb python=3.7 pip
pip install -r requirements.txt
The trick is ensuring the appropriate conda environment is activated by default, happy to discuss more here if that's interesting.
We definitely have a lot of users using conda today and I think that's growing. It's especially popular for anyone in windows land and MLFlow has standardized on it.
@vanpelt yeah conda is a good idea, repo2docker basically built their pip stuff on top of conda infra I think — lemme see if I can sub that in easily cuda versioning is a good note, I haven't really tested it but eg cuda/torch compatibility is definitely a common issue |
Codecov Report
@@ Coverage Diff @@
## master #3161 +/- ##
==========================================
- Coverage 80.15% 79.95% -0.21%
==========================================
Files 213 213
Lines 27875 27882 +7
==========================================
- Hits 22343 22292 -51
- Misses 5532 5590 +58
Flags with carried forward coverage won't be shown. Click here to find out more.
|
# make sure `python` points at the right version | ||
RUN update-alternatives --install /usr/bin/python python /usr/bin/python{py_version} 1 \ | ||
&& update-alternatives --install /usr/local/bin/python python /usr/bin/python{py_version} 1 | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably keep these in a different file, docker_templates or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorta prefer keeping these here for now since they are only used in once place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thats fine
def get_current_python_version(): | ||
full_version = sys.version.split()[0].split(".") | ||
major = full_version[0] | ||
version = ".".join(full_version[:2]) if len(full_version) >= 2 else major + ".0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just dropping the + for the dev versions of python potentially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropping the + would be correct since this is getting the version mostly to get the right docker base image, and i don't think they have them for dev versions
wandb/sdk/launch/docker.py
Outdated
"Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing" | ||
) | ||
requirements_line = "RUN WANDB_DISABLE_CACHE=true " | ||
requirements_line += "pip install -r requirements.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this drop the use of _wandb_bootstrap
? What if the repo doesn't have a requirements.txt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed and readded this in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of questions where I'm not sure if we're doing the right thing especially around installing requirements. Also still needs some tests.
I don't think we should block this on CUDA version. We should allow users to specify, defaulting to the newest. We should be able to add the Cuda version to RunInfo and pull it down with the rest.
wandb/sdk/launch/docker.py
Outdated
"""Fill in the Dockerfile templates for stage 2 of build. CPU version is built on python:slim, GPU | ||
version is built on nvidia:cuda""" | ||
|
||
python_base_image = "python:{}-slim-buster".format(py_version) # slim for running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could allow users to specify a base image using the docker key of the launch spec config. in case they want to directly specify another image as base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also need a windows option for windows users.
wandb/sdk/launch/docker.py
Outdated
wandb.termwarn( | ||
"Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing" | ||
) | ||
prefix = "RUN WANDB_DISABLE_CACHE=true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could factor out this warn and prefix to outside of the other if statements no?
wandb.termwarn( | ||
"Using supplied docker image: {}. Artifact swapping and launch metadata disabled".format( | ||
launch_project.docker_image | ||
) | ||
) | ||
image_uri = launch_project.docker_image | ||
_logger.info("Getting docker command...") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to apply the final build step ontop of a user provided image, where we would set the environment variables and inject the launch-metadata.json file. Then we can get rid of the above warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm gonna push this to another pr, this one is complicated enough and doing this would require totally different handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
wandb/sdk/launch/_project_spec.py
Outdated
@@ -370,6 +374,7 @@ def create_project_from_spec(launch_spec: Dict[str, Any], api: Api) -> LaunchPro | |||
launch_spec.get("overrides", {}), | |||
launch_spec.get("resource", "local"), | |||
launch_spec.get("resource_args", {}), | |||
launch_spec.get("cuda", False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we pull the wandb-metadata.json
file we can check that for a cuda version if this is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small changes and some more questions. This is getting pretty close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So much HOTNESS 🔥 . Nice work!
I can imagine simplifying this build even further in the future. Micromamba looks really sick. Instead of installing the specific python version directly in Ubuntu or Debian, we could let micromamba do it. It can handle conda envs, virtual envs, pip installations and python installations all in a tiny package. Their official docker image is 35MB 😮
@@ -16,7 +16,7 @@ | |||
"S3OutputPath": "s3://test-bucket/test-output" | |||
}, | |||
"StoppingCondition": { | |||
"MaxRuntimeInSeconds": "60" | |||
"MaxRuntimeInSeconds": 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty
tox.ini
Outdated
@@ -40,7 +40,7 @@ commands_pre = | |||
commands = | |||
mkdir -p test-results | |||
py{35,36,37,38,39,310}: python -m pytest {env:CI_PYTEST_SPLIT_ARGS:} -n={env:CI_PYTEST_PARALLEL:4} --durations=20 --junitxml=test-results/junit.xml --cov-config=.coveragerc --cov --cov-report= --no-cov-on-fail {posargs:tests/} | |||
pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/ | |||
; pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caching here is busted, this can be removed.
wandb/sdk/launch/utils.py
Outdated
@@ -227,12 +236,12 @@ def download_entry_point( | |||
def download_wandb_python_deps( | |||
entity: str, project: str, run_name: str, api: Api, dir: str | |||
) -> Optional[str]: | |||
metadata = api.download_url( | |||
reqs = api.download_url( # @@@ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Remove this comment?
@@ -341,6 +304,10 @@ def build_sagemaker_args( | |||
"Sagemaker launcher requires a StoppingCondition Sagemaker resource argument" | |||
) | |||
|
|||
# remove args that were passed in for launch but not passed to sagemaker | |||
sagemaker_args.pop("region", None) | |||
sagemaker_args.pop("profile", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, there's another arg popped on line 291 if we want to group them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
Fixes WB-7840
Description
--gpu
flag to CLI to enable building a gpu enabled image (TODO: if rerunning a wandb run previously on gpu, set gpu on as default)metrics:
Testing
Checklist