
Direct stage dependencies on other stages specifiable by the stage's name #4640

@ChristophJud


TL;DR

Proposing a -D <stage> option for the run command to specify dependencies on other stages.

One Possible Scenario

To reproduce results that depend on the whole software and OS environment, I run my ML software in a Docker container. This lets me control not only the versions of the Python packages used but also the Python version, the OS version, the CUDA/cuDNN versions, etc. In my opinion, it therefore makes sense to build the Docker image with dvc run, for example:
dvc run -n build_docker -d docker/Dockerfile -d src/requirements.txt docker build -f docker/Dockerfile -t my_image . --network=host --no-cache=True

This stage depends on the Dockerfile and the requirements.txt file. Its output is the Docker image, which is stored at a location that depends on the OS and the Docker installation. It is therefore difficult to specify the image as an output, and usually one wants to track the Dockerfile rather than the image itself.

The next stage could be, e.g., a preprocessing step of the ML pipeline. It runs in a container started from the previously built image and thus depends on the build_docker stage:
dvc run -n preprocess -p config.yaml:data --external -o /tmp/preprocessed_data -d /data/ -d src/preprocess docker run -t --init --entrypoint= --runtime=nvidia --ipc=host --volume=/home/user/src:/app my_image:latest python3 -B src/main.py

There are currently two options for making the preprocess stage depend on the build_docker stage:

  1. The current dependency option (-d) accepts only files and directories. One could therefore repeat the dependencies of the build_docker stage in the preprocess stage. However, this is error-prone because the same things have to be specified multiple times, and it does not ensure that the image has actually been built before preprocess runs.
  2. Because it is difficult and probably not useful to define the Docker image as an output, one can wrap the build_docker command in a separate script that also writes a dummy file. This file can be declared as an output of build_docker and as a dependency of the preprocess stage. However, this is not very convenient and is somewhat confusing.
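For illustration, the dummy-file workaround from option 2 could look roughly like the sketch below. The function name, the DOCKER variable, and the marker path docker/image.id are hypothetical and not part of DVC or Docker; the sketch assumes the marker should contain the image ID so that it changes whenever the image does.

```shell
#!/bin/sh
# Sketch of option 2: build the image, then write a marker file that DVC can
# track as an output. DOCKER, the function name, and the marker path are
# hypothetical names chosen for this example.
DOCKER="${DOCKER:-docker}"

build_image_with_marker() {
    image="$1"       # image tag, e.g. my_image
    dockerfile="$2"  # e.g. docker/Dockerfile
    marker="$3"      # dummy output file, e.g. docker/image.id

    # Build first; if the build fails, stop before any marker is written.
    "$DOCKER" build -f "$dockerfile" -t "$image" . || return 1

    # Store the image ID so the marker content changes when the image does.
    "$DOCKER" image inspect --format '{{.Id}}' "$image" > "$marker"
}
```

If a wrapper script calls this function, build_docker can declare the marker with -o docker/image.id and preprocess can list it with -d docker/image.id. This works with the existing file-based options, but the dependency between the two stages is still expressed only indirectly.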

Proposed Solution

I'm proposing a third, more elegant way to define this dependency: specifying a stage itself as a dependency, e.g. with a -D option. We could then simply add -D build_docker to the preprocess stage instead of expressing the dependency indirectly through a dummy file.


Labels: feature request (requesting a new feature), p2-medium (medium priority, should be done, but less important)
