
Get rid of .dvc files? #4278

@jorgeorpinel

TLDR: Skip rant and go to the last bullet and code blocks.

If we're planning a backward-incompatible change in the near future, maybe it would be a good time to consider moving completely from .dvc files to dvc.yaml/dvc.lock. Or we could implement this and keep the old .dvc file approach as an option for backward compatibility (for some time at least).

Some context in treeverse/dvc.org#1384 (review)

My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications too? Leaving that open to discussion (cc @iterative/engineering).

In DVC 1.x we created the pipelines file dvc.yaml, which contains all the stages. From that point on, .dvc files stopped being "stage files" and remain only as placeholders for data files. They're no longer considered any kind of stage (we have even removed the terms "stage file" and "orphan stage" from the docs already). This has caused headaches in the docs when explaining commands that use or affect both .dvc files AND dvc.yaml or dvc.lock, such as status, checkout, and repro (pretty important ones), because we constantly need to mention both "stages and .dvc files" or "dvc.yaml and .dvc files", etc.

The options I see here are:

  • Create some concept that encompasses both .dvc files and stages. I already discussed this with Ivan and we came up with "DVC file", but we don't love it because it's too similar to ".dvc file" and will probably cause confusion. This solution is purely docs-related and implies no action here (we would close this issue).

  • Rethink some of these commands a little so that by default they only use/affect stages, and either have a separate set of commands for .dvc files or require explicit options for them to use .dvc files. To me this seems overkill.

  • Get rid of .dvc files! Why can't dvc.yaml (and dvc.lock) be used for this? It's a matter of introducing a new top-level section. Example below:

dvc.yaml

data:
- corpus.csv
- dataset/

stages:
  cleanup:
    cmd: python clean.py corpus.csv df.h5
    deps:
    - corpus.csv
    outs:
    - df.h5
  ...

dvc.lock

outs:
- md5: 6137cde4893c59f76f005a8123d8e8e6
  path: corpus.csv
- md5: cde4876f0r5137c59f8e6a8423d8e936.dir
  path: dataset/

cleanup:
  cmd: python clean.py corpus.csv df.h5
  deps:
  - path: corpus.csv
    md5: 6137cde4893c59f76f005a8123d8e8e6
  outs:
  - path: df.h5
    md5: f40e3db3e1aa25562945045864a28deb
  ...

Something like that.
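
To make the idea a bit more concrete, here is a minimal Python sketch of how the contents of existing .dvc files could be folded into a top-level outs list like the one in the dvc.lock example above. It is only an illustration under some assumptions: fold_dvc_files is a hypothetical helper (not part of DVC), the files are assumed to be already parsed into dicts, every .dvc file is assumed to sit at the repo root (so paths need no rewriting), and the dataset hash is a placeholder.

```python
from __future__ import annotations

from typing import Any


def fold_dvc_files(
    dvc_files: dict[str, dict[str, Any]], lock: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of `lock` with a top-level `outs` list built from .dvc files.

    Hypothetical helper, not a DVC API. `dvc_files` maps a .dvc file name to
    its parsed content; `lock` is the parsed dvc.lock content.
    """
    merged = dict(lock)  # shallow copy so the caller's dict is not modified
    outs = list(merged.get("outs", []))
    seen = {entry["path"] for entry in outs}

    # Sort by file name so the resulting order is deterministic.
    for name in sorted(dvc_files):
        for out in dvc_files[name].get("outs", []):
            if out["path"] in seen:
                continue  # skip paths that are already tracked
            outs.append({"md5": out["md5"], "path": out["path"]})
            seen.add(out["path"])

    merged["outs"] = outs
    return merged


# Parsed content of two .dvc files (assumed layout: an `outs` list with
# `md5` and `path`). The dataset hash below is a placeholder.
dvc_files = {
    "corpus.csv.dvc": {
        "outs": [{"md5": "6137cde4893c59f76f005a8123d8e8e6", "path": "corpus.csv"}]
    },
    "dataset.dvc": {
        "outs": [{"md5": "0" * 32 + ".dir", "path": "dataset"}]
    },
}

# Parsed content of dvc.lock before the change (stage entries only).
lock = {
    "cleanup": {
        "cmd": "python clean.py corpus.csv df.h5",
        "deps": [{"path": "corpus.csv", "md5": "6137cde4893c59f76f005a8123d8e8e6"}],
        "outs": [{"path": "df.h5", "md5": "f40e3db3e1aa25562945045864a28deb"}],
    }
}

result = fold_dvc_files(dvc_files, lock)
for entry in result["outs"]:
    print(entry["path"], entry["md5"])
```

The stage entries are passed through untouched; only the new top-level list is added, which is what would let the .dvc files themselves be deleted afterwards.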

An additional advantage of this would be that it reduces the possible confusion between .dvc/ (the internal directory) and .dvc files.

Labels: discussion (requires active participation to reach a conclusion), feature request (Requesting a new feature)
