TL;DR
Proposing a -D <stage> option for the run command to declare dependencies on other stages.
One Possible Scenario
In order to reproduce results that depend on the whole software and OS environment, I'm executing my ML software in a Docker container. In this way, I can control not only the versions of the Python packages used but also the Python version, the OS version, the CUDA/cuDNN versions, etc. Thus, in my opinion, it makes sense to build the Docker image using dvc run, for example:
dvc run -n build_docker -d docker/Dockerfile -d src/requirements.txt docker build -f docker/Dockerfile -t my_image . --network=host --no-cache=True
which depends on the Dockerfile and the requirements.txt file. The output of this command is the Docker image, whose storage location depends on the OS and the Docker installation. Therefore, it is difficult to specify it as an output, and usually one doesn't want to track the Docker image but rather the Dockerfile.
The next stage could be, e.g., a preprocessing step of the ML pipeline. It is executed within a container started from the previously built Docker image and thus depends on the build_docker stage:
dvc run -n preprocess -p config.yaml:data --external -o /tmp/preprocessed_data -d /data/ -d src/preprocess docker run -t --init --entrypoint= --runtime=nvidia --ipc=host --volume=/home/user/src:/app my_image:latest python3 -B src/main.py
There are the following options to make the preprocess stage dependent on the build_docker stage:
- With the current dependency option, only files and directories can be specified as dependencies. In this case, one could add the dependencies of the build_docker stage to the preprocess stage as well. However, this is error-prone because the same things have to be specified multiple times, and it does not ensure that the image has actually been built before preprocess runs.
- Because it is difficult and probably not useful to define the Docker image as an output, one can wrap the build_docker command in a separate script which also writes a dummy file. This file can be defined as an output of build_docker and as a dependency of the preprocess stage. However, this is not very handy and is somewhat confusing.
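To make the second workaround concrete, here is a minimal sketch of such a wrapper. The function name build_image, the DOCKER override variable, and the marker path docker/image.id are assumptions made for illustration, not part of DVC or Docker. Instead of an empty dummy file, the sketch writes the image ID reported by docker image inspect, so the marker changes whenever the image is rebuilt.

```shell
#!/bin/sh
# Hypothetical wrapper for the build_docker stage (names are illustrative).
# It builds the image and then writes a small marker file that DVC can track
# as an output, since the image itself lives inside the Docker installation.

build_image() {
    marker="$1"                      # path of the marker file to write
    docker_bin="${DOCKER:-docker}"   # overridable, e.g. with a stub for testing

    # Same build command as in the build_docker stage above.
    "$docker_bin" build -f docker/Dockerfile -t my_image . --network=host --no-cache=True &&
        # Record the ID of the freshly built image in the marker file.
        "$docker_bin" image inspect --format '{{.Id}}' my_image > "$marker"
}

# Assumed usage: call `build_image docker/image.id` at the end of this script,
# save it as e.g. build_docker.sh, and register it with
#   dvc run -n build_docker -d docker/Dockerfile -d src/requirements.txt \
#       -o docker/image.id sh build_docker.sh
# The preprocess stage would then add `-d docker/image.id`.
```

This keeps the dependency chain explicit, but it also shows the drawback described above: the marker file exists only to connect the two stages.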
Proposed Solution
I'm proposing a third, more elegant way to define this dependency: specifying a stage as a dependency using, e.g., a -D option. Thus, we could just add -D build_docker to the preprocess stage without specifying the dependency indirectly via a dummy file.