Description
Feature Request
When I run dvc repro, DVC detects which dependencies have changed and therefore which stages need to be reproduced. I would like to access the granular changes to all dependencies of a stage since it was last reproduced.
Example usage
Changes in dependencies for stage preprocess (this could be a new command or an addition to "dvc status" or "dvc data status"):
$ dvc stage status preprocess --granular --json
{
  "new": [
    "path/to/new/dependency/file",
    ...
  ],
  "modified": [
    "path/to/modified/dependency/file",
    ...
  ],
  "deleted": [
    "path/to/deleted/dependency/file",
    ...
  ]
}
Motivation
This feature would be very useful for pipelines which process many independent samples and take a long time to run.
Imagine the following simple data setup where samples get preprocessed and stored in a new folder.
data
├── raw
│   ├── sample_001.jpeg
│   └── sample_002.jpeg
└── preprocessed
    ├── sample_001.jpeg
    └── sample_002.jpeg
And the corresponding simple pipeline:
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - preprocess.py
      - data/raw/
    outs:
      - data/preprocessed:
          persist: true
With this feature, the pipeline stage code could check which samples have changed (new/modified/deleted) and process only those. It could also detect that the code has changed and reprocess all samples.
This would save me a lot of time since we have a long and slow pipeline where the raw data gets updated quite often.
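To make the intended use concrete, here is a rough sketch of how preprocess.py might consume the proposed output. It assumes the JSON shape shown in the example usage above, which DVC does not currently produce. The function name plan_preprocessing and its parameters are hypothetical and only serve as illustration.

```python
import json
from pathlib import PurePosixPath


def plan_preprocessing(changes, all_raw, raw_dir="data/raw",
                       out_dir="data/preprocessed",
                       code_deps=("preprocess.py",)):
    """Decide which raw samples to (re)process and which outputs to remove.

    `changes` is assumed to be the parsed JSON of the *proposed* granular
    status output: a dict with "new", "modified" and "deleted" path lists.
    `all_raw` lists every raw sample currently on disk.
    """
    raw = PurePosixPath(raw_dir)
    out = PurePosixPath(out_dir)
    changed = set(changes.get("new", [])) | set(changes.get("modified", []))

    def is_sample(path):
        # A sample is any file located somewhere under the raw data folder.
        return raw in PurePosixPath(path).parents

    if any(dep in changed for dep in code_deps):
        # The code itself changed: every sample has to be reprocessed.
        to_process = sorted(all_raw)
    else:
        # Only new or modified samples need work.
        to_process = sorted(p for p in changed if is_sample(p))

    # Outputs whose raw sample was deleted should be removed as well.
    to_remove = sorted(
        str(out / PurePosixPath(p).relative_to(raw))
        for p in changes.get("deleted", [])
        if is_sample(p)
    )
    return to_process, to_remove


if __name__ == "__main__":
    # Hypothetical output of the proposed command, hard-coded for the demo.
    status = json.loads(
        '{"new": ["data/raw/sample_003.jpeg"], "modified": [], '
        '"deleted": ["data/raw/sample_002.jpeg"]}'
    )
    on_disk = ["data/raw/sample_001.jpeg", "data/raw/sample_003.jpeg"]
    print(plan_preprocessing(status, on_disk))
```

Because the output folder is marked persist: true, untouched samples would stay in place between runs, so the stage only has to write the files in the first list and delete the ones in the second.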
Link to extended Discord discussion: https://discord.com/channels/485586884165107732/1093361005754585109
Link to another discussion of the same problem: #5917