
New command: dvc verify - check that the pipeline is up to date without having to pull or run it #5369

@sjawhar

Bringing this over from Discord.

What I'd like is a command to use in a CI/CD process to check that the DVC pipeline is in a valid state. By "valid", I mean that:

  1. All deps of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
  2. All outs of all stages are either in the remote cache or match what's in the workspace, using the hashes in dvc.lock
  3. If any stage has an out that is the dep of any other stage, they have the same hash in dvc.lock
  4. All params match what's in dvc.lock
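Point 3 can be checked from dvc.lock alone, without touching the cache. The following is a minimal sketch, not an existing DVC feature. It assumes dvc.lock has already been parsed into a dict with a top-level `stages` mapping whose entries carry `deps` and `outs` lists of `{"path": ..., "md5": ...}` items. The function name `find_hash_mismatches` is hypothetical. It compares every recorded hash for a given path, which covers the out-versus-dep case and also flags two stages that disagree about a shared dep.

```python
from collections import defaultdict


def find_hash_mismatches(lock: dict) -> list:
    """Return one message per path whose recorded hashes disagree.

    Assumption: `lock` is dvc.lock parsed into a dict, with a "stages"
    mapping whose entries hold "deps"/"outs" lists of {"path", "md5"}.
    """
    # path -> hash -> set of "stage (role)" labels that recorded that hash
    seen = defaultdict(lambda: defaultdict(set))
    for stage_name, stage in lock.get("stages", {}).items():
        for role in ("deps", "outs"):
            for entry in stage.get(role, []):
                label = f"{stage_name} ({role})"
                seen[entry["path"]][entry.get("md5")].add(label)

    problems = []
    for path, by_hash in sorted(seen.items()):
        if len(by_hash) > 1:  # more than one distinct hash for the same path
            detail = "; ".join(
                f"{digest}: {', '.join(sorted(labels))}"
                for digest, labels in sorted(
                    by_hash.items(), key=lambda item: str(item[0])
                )
            )
            problems.append(f"{path} has conflicting hashes -> {detail}")
    return problems
```

An empty list means every path is recorded with a single hash across all stages. A CI job could fail the build whenever the list is non-empty.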

Essentially, this is the same as running dvc pull then dvc repro --dry and checking that "Data and pipelines are up to date", except I don't want to have to run dvc pull. This is especially important if you're working with large datasets, as pulling them every time on a CI machine could be quite costly in time and/or actual dollar bills 💰
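For reference, the pull-based baseline described above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the check relies on the message text quoted above appearing in the command output, and it still pays the full cost of `dvc pull`, which is exactly what the proposed command would avoid.

```python
import subprocess

# Message quoted in the issue text; matching on it is an assumption
# about the command's output staying stable.
UP_TO_DATE_MESSAGE = "Data and pipelines are up to date"


def reports_up_to_date(output: str) -> bool:
    """Return True if the captured output contains the up-to-date message."""
    return UP_TO_DATE_MESSAGE in output


def pull_then_check() -> bool:
    """Pull all data, then dry-run the pipeline and inspect the output."""
    subprocess.run(["dvc", "pull"], check=True)
    result = subprocess.run(
        ["dvc", "repro", "--dry"],
        check=True,
        capture_output=True,
        text=True,
    )
    # Inspect both streams, since the message may land on either one.
    return reports_up_to_date(result.stdout + result.stderr)
```

Calling `pull_then_check()` inside a DVC repository returns `True` only when the dry run prints the up-to-date message.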

I played around a bit to see if there's a workaround with only the current functionality. Here's what I found.

dvc status by itself will tell you if any non-cached deps (e.g. source code) don't match dvc.lock. That'll look like this:

train:
        changed deps:
                deleted:            data/processed/data_train.npz
                deleted:            data/processed/data_val.npz
                modified:           src/train <-- This one

dvc status -c will tell you if any outputs of any stages listed in dvc.lock aren't in remote storage. That'll look like this:

        missing:            data/processed/data_test.npz   <-- This one 
        deleted:            data/processed/data_train.npz
        deleted:            data/processed/data_val.npz

dvc params diff will catch param changes. I don't think anything covers point 3 above. Even if I stitched all of these together, the result would be very brittle, since it relies on the output format of these commands never changing.
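As an alternative to parsing command output, point 4 could be checked directly against the lock data. The sketch below makes several assumptions: each stage in the parsed dvc.lock has a `params` mapping of params file name to recorded keys and values, keys may be dotted paths such as `train.lr`, and the workspace params files have already been parsed into dicts. The names `lookup` and `find_param_mismatches` are hypothetical.

```python
_MISSING = object()  # sentinel for "key not present in the workspace"


def lookup(params: dict, dotted_key: str):
    """Resolve a dotted key like "train.lr" inside a nested dict."""
    node = params
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def find_param_mismatches(lock: dict, workspace_params: dict) -> list:
    """Compare params recorded in the lock data with workspace values.

    Assumption: `workspace_params` maps a params file name (for example
    "params.yaml") to that file's parsed content.
    """
    problems = []
    for stage_name, stage in lock.get("stages", {}).items():
        for params_file, locked in stage.get("params", {}).items():
            current = workspace_params.get(params_file, {})
            for key, locked_value in locked.items():
                value = lookup(current, key)
                where = f"{stage_name}: {params_file}:{key}"
                if value is _MISSING:
                    problems.append(f"{where} is missing from the workspace")
                elif value != locked_value:
                    problems.append(
                        f"{where} is {value!r}, lock has {locked_value!r}"
                    )
    return problems
```

As with the hash check, an empty result means the workspace params agree with what was recorded for every stage.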

As always, I'm happy to contribute the change if you all think it would be valuable. I know I would use it right away in several projects. Or let me know if I'm simply overlooking some existing functionality that would serve the same purpose.

    Labels

    A: pipelines (Related to the pipelines feature)
    A: status (Related to the dvc diff/list/status)
    feature request (Requesting a new feature)
    p1-important (Important, aka current backlog of things to do)
