Bringing this over from Discord.
What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:
- All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
- All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
- If any stage has an out that is a dep of another stage, both entries have the same hash in dvc.lock
- All params match what's in dvc.lock
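The third condition above can be checked from dvc.lock alone, without touching the cache or the remote. Here is a minimal sketch, not an existing DVC command. It assumes the lock file has already been parsed into a dict (for example with a YAML loader), that each dep and out entry carries `path` and `md5` keys, and that stages live either under a top-level `stages` key or directly at the top level. The function name `find_hash_mismatches`, the `prepare` stage, and the hash values are made up for illustration.

```python
from __future__ import annotations

from typing import Any


def find_hash_mismatches(lock: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (path, producing_stage, consuming_stage) for every path whose
    recorded out hash differs from the hash recorded where it is a dep."""
    # Assumption: stages are nested under "stages" when that key exists,
    # otherwise the stage names are the top-level keys.
    stages = lock.get("stages", lock)

    # Map each output path to (stage name, recorded md5).
    outs: dict[str, tuple[str, Any]] = {}
    for name, stage in stages.items():
        for out in stage.get("outs", []):
            outs[out["path"]] = (name, out.get("md5"))

    mismatches = []
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            producer = outs.get(dep["path"])
            # Deps that no stage produces (e.g. source code) are skipped.
            if producer is not None and producer[1] != dep.get("md5"):
                mismatches.append((dep["path"], producer[0], name))
    return mismatches


# Hypothetical lock contents; the hashes are placeholders.
example_lock = {
    "stages": {
        "prepare": {
            "outs": [{"path": "data/processed/data_train.npz", "md5": "aaa"}],
        },
        "train": {
            "deps": [
                {"path": "data/processed/data_train.npz", "md5": "bbb"},
                {"path": "src/train", "md5": "ccc"},
            ],
        },
    }
}
print(find_hash_mismatches(example_lock))
# [('data/processed/data_train.npz', 'prepare', 'train')]
```

A CI job could fail the build whenever the returned list is non-empty.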
Essentially, this is the same as running dvc pull then dvc repro --dry and checking that "Data and pipelines are up to date", except I don't want to have to run dvc pull. This is especially important if you're working with large datasets, as pulling them every time on a CI machine could be quite costly in time and/or actual dollar bills 💰
I played around a bit to see if there's a workaround with only the current functionality. Here's what I found.
dvc status by itself will tell you if any non-cached deps (e.g. source code) don't match dvc.lock. That'll look like this:
train:
    changed deps:
        deleted:    data/processed/data_train.npz
        deleted:    data/processed/data_val.npz
        modified:   src/train    <-- This one
dvc status -c will tell you if any outputs of any stages listed in dvc.lock aren't in remote storage. That'll look like this:
    missing:    data/processed/data_test.npz    <-- This one
    deleted:    data/processed/data_train.npz
    deleted:    data/processed/data_val.npz
dvc params diff will catch param changes.
I don't think anything covers the third condition above (matching hashes between an out and the dep that consumes it). Even if I stitched all of these together, the result would be very brittle, because it relies on the output format of all these commands not changing.
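One way to avoid depending on command output is to compare dvc.lock with the params files directly. The sketch below is an illustration only, not how dvc params diff works internally. It assumes each stage in the parsed lock file has a `params` mapping of file name to recorded values, that recorded keys use dotted paths into the params file (such as `train.epochs`), and that the caller has already loaded each params file into a dict. The names `lookup` and `find_param_mismatches` are hypothetical.

```python
from __future__ import annotations

from typing import Any

_MISSING = object()  # sentinel for keys absent from the current params


def lookup(params: dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as 'train.epochs' in a nested dict."""
    node: Any = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def find_param_mismatches(
    lock: dict[str, Any], params_by_file: dict[str, dict[str, Any]]
) -> list[tuple[str, str, str]]:
    """Return (stage, params_file, key) for every recorded param whose
    current value is missing or differs from the value in the lock."""
    # Assumption: stages sit under "stages" if present, else at top level.
    stages = lock.get("stages", lock)
    mismatches = []
    for name, stage in stages.items():
        for file_name, recorded in stage.get("params", {}).items():
            current = params_by_file.get(file_name, {})
            for key, value in recorded.items():
                if lookup(current, key) != value:
                    mismatches.append((name, file_name, key))
    return mismatches


# Hypothetical data: the lock recorded 10 epochs, the workspace now has 20.
lock = {"stages": {"train": {"params": {"params.yaml": {"train.epochs": 10}}}}}
current = {"params.yaml": {"train": {"epochs": 20}}}
print(find_param_mismatches(lock, current))
# [('train', 'params.yaml', 'train.epochs')]
```

This sketch would misread a params key that contains a literal dot, so treat it as a starting point only.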
As always, I'm happy to contribute the change if you all think it would be valuable. I know I would use it right away in several projects. Or let me know if I'm simply overlooking some existing functionality that would serve the same purpose.