Description
The current workflow for developing a component with Docker-based schedulers requires manually building the Docker image every time there's a change. This adds an extra layer of friction and requires significant Docker knowledge and maintenance. It would be nice to provide a way to do this via the torchx CLI to reduce user friction.
Detailed Proposal
This is going to add the concept of "workspaces". A workspace looks like a file system and can be implemented as one via fsspec. It can map either to an on-disk project with a .torchxconfig or to an in-memory file system for use with notebooks.
This requires adding workspace support to the runner and schedulers. There will be a couple of standard patching implementations: a stub one for local, one for Docker, etc.
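As a minimal sketch, a scheduler with workspace support could implement a small interface along these lines (the WorkspaceScheduler name and method are hypothetical, not a final API):

from abc import ABC, abstractmethod

class WorkspaceScheduler(ABC):
    # Hypothetical mixin implemented by schedulers with workspace support.
    # build_workspace_image overlays the workspace contents on top of the
    # role's base image and returns the identifier of the patched image.
    @abstractmethod
    def build_workspace_image(self, img: str, workspace: str) -> str:
        ...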
Programmatic Experience
For programmatic access we need to implement a high-level concept of a workspace. For CLI commands this will map to the project folder (i.e. the one with the .torchxconfig). For programmatic use it's a bit more abstract in order to support the notebook workspaces #344
Workspaces basically just track files. This means that we can potentially use any fsspec filesystem interface to find and build files. When using Docker, we need to build a tarball of the local files to upload as the build context; fsspec provides a clean interface to find all the files and tarball them (see the sketch after the examples below).
from torchx import specs
from torchx.runner import get_runner

app: specs.AppDef = ...
runner = get_runner()
runner.run(app, "kubernetes", workspace="file:///home/d4l3k/my_project")
For things like notebooks we can use an in memory file system:
runner.run(app, "kubernetes", workspace="memory://torchx-notebook/")
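As a rough sketch of how the Docker build context could be assembled from a workspace via fsspec (build_context is a hypothetical helper, not an existing TorchX API):

import io
import tarfile

import fsspec

def build_context(workspace: str) -> io.BytesIO:
    # Resolve the workspace URL (file://, memory://, ...) to a filesystem.
    fs, path = fsspec.core.url_to_fs(workspace)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        # Walk every file in the workspace and add it to the tarball that
        # will be uploaded as the Docker build context.
        for file_path in fs.find(path):
            with fs.open(file_path, "rb") as f:
                data = f.read()
            info = tarfile.TarInfo(name=file_path[len(path):].lstrip("/"))
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf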
CLI Experience
Before:
$ docker build -t repo.sh/my_image:my_tag .
$ docker push repo.sh/my_image:my_tag
$ torchx run -s kubernetes dist.ddp --image repo.sh/my_image:my_tag my_trainer.py
After:
# in folder w/ .torchxconfig
$ torchx run -s kubernetes -cfg push=repo.sh/my_image dist.ddp my_trainer.py
This same syntax with Docker can work with local_docker, kubernetes, and potentially Ray.
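For example, the push repository could live in the project's .torchxconfig so it doesn't need to be repeated on every run (the push key is the proposed run config; the exact layout here is illustrative):

# .torchxconfig
[kubernetes]
push = repo.sh/my_image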
Docker
Docker supports layering so we just have to create a small Dockerfile such as:

# the specified base image
FROM ghcr.io/pytorch/torchx:0.1.1dev0
COPY . .
We'll walk the workspace and upload all files as the Docker context when building.
For local running we just have to build the image and use the local tag. For remote running we need a repository to push it to. We can default to pushing to the same repository the package specifies and use the image hash as the tag. This will be an extra run config that users must override if they're building off of a standard Docker image such as the provided torchx one, and it can be specified in the .torchxconfig file.
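A rough sketch of that flow using the Docker Python SDK (linked below); build_and_push is a hypothetical helper, and context is the workspace tarball from the earlier sketch:

import io

import docker

def build_and_push(context: io.BytesIO, repo: str) -> str:
    client = docker.from_env()
    # Build from the workspace tarball; custom_context tells the SDK that
    # fileobj is a full build context rather than a lone Dockerfile.
    image, _logs = client.images.build(
        fileobj=context,
        custom_context=True,
        rm=True,
    )
    # Use the image hash as the tag so the pushed image is content addressed.
    tag = image.id.split(":")[-1][:12]
    image.tag(repo, tag=tag)
    client.images.push(repo, tag=tag)
    return f"{repo}:{tag}"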
Alternatives
There's some question about whether we should only support Docker or use buildah instead, since that seems to be the more robust option: https://github.com/containers/buildah. However, for now users seem most familiar with Docker, and buildah provides a Docker-compatible API, so for maximum support we can use the existing Docker API.
For small components we can inline the file via the existing Python component. This will work for many things but not everything.
This also doesn't quite address how we can do the same thing on Slurm, though we could potentially support Docker on Slurm, which would be interesting. Slurm does support OCI images, so we should potentially migrate towards supporting those as first class in TorchX.
Additional context/links
Docker Python SDK: https://docker-py.readthedocs.io/en/stable/images.html