Granular pipeline dependency status #9431

@johan-sightic

Feature Request

When I run dvc repro, dvc detects which dependencies have changed and therefore which stages need to be reproduced. I would like to access the granular changes to all of a stage's dependencies since it was last reproduced.

Example usage

Change in dependencies for stage preprocess:

$ dvc stage status preprocess --granular --json  (New command or addition to "dvc status" or "dvc data status")
{
    "new": [
        "path/to/new/dependency/file",
        ...
    ],
    "modified": [
        "path/to/modified/dependency/file",
        ...
    ],
    "deleted": [
        "path/to/deleted/dependency/file",
        ...
    ]
}
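
Until such a command exists, the shape of the output above can be approximated outside DVC. The following is a minimal sketch, assuming two snapshots that map each dependency path to a content hash: one recorded when the stage was last reproduced and one computed now. The names `snapshot` and `granular_status` are hypothetical and are not part of any DVC API; how the previous snapshot gets stored is left open.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (relative POSIX path) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def granular_status(previous, current):
    """Group paths as new/modified/deleted by comparing two path -> hash mappings."""
    return {
        "new": sorted(p for p in current if p not in previous),
        "modified": sorted(
            p for p in current if p in previous and current[p] != previous[p]
        ),
        "deleted": sorted(p for p in previous if p not in current),
    }
```

Hashing every file on each run is itself a cost, which is part of why having dvc expose the information it already computes would be preferable.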

Motivation

This feature would be very useful for pipelines which process many independent samples and take a long time to run.

Imagine the following simple data setup where samples get preprocessed and stored in a new folder.

data
├── raw
│   ├── sample_001.jpeg
│   └── sample_002.jpeg
└── preprocessed
    ├── sample_001.jpeg
    └── sample_002.jpeg

And the corresponding simple pipeline.

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw/
    outs:
      - data/preprocessed:
          persist: true

With this feature the pipeline stage code could check which samples have changed (new/modified/deleted) and only process those. It could also detect that the code has changed and reprocess all samples.
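
A minimal sketch of how preprocess.py might act on such a status, assuming it arrives as a dict with the new/modified/deleted keys proposed above and with paths relative to the project root. `plan_work`, `RAW_DIR`, `OUT_DIR` and `CODE_DEPS` are hypothetical names chosen for illustration; nothing here is an existing DVC API.

```python
from pathlib import PurePosixPath

# Assumed layout, taken from the example pipeline in this issue.
RAW_DIR = PurePosixPath("data/raw")
OUT_DIR = PurePosixPath("data/preprocessed")
CODE_DEPS = {"preprocess.py"}


def plan_work(status, all_raw_samples):
    """Return (raw samples to process, preprocessed outputs to delete)."""
    changed = set(status.get("new", [])) | set(status.get("modified", []))
    deleted = set(status.get("deleted", []))

    if changed & CODE_DEPS:
        # The code itself changed, so every sample has to be reprocessed.
        to_process = sorted(all_raw_samples)
    else:
        # Only new or modified raw samples need work.
        to_process = sorted(
            p for p in changed if PurePosixPath(p).parent == RAW_DIR
        )

    # Outputs of deleted raw samples are stale and should be removed.
    to_remove = sorted(
        str(OUT_DIR / PurePosixPath(p).name)
        for p in deleted
        if PurePosixPath(p).parent == RAW_DIR
    )
    return to_process, to_remove
```

Because the output directory is declared with persist: true, untouched samples would remain in data/preprocessed between runs, which is what makes this kind of incremental processing possible.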

This would save me a lot of time since we have a long and slow pipeline where the raw data gets updated quite often.
Link to extended Discord discussion: https://discord.com/channels/485586884165107732/1093361005754585109
Link to another discussion of the same problem: #5917

Labels: A: pipelines (related to the pipelines feature), A: status (related to the dvc diff/list/status), p2-medium (medium priority, should be done, but less important)
