TLDR: Skip rant and go to the last bullet and code blocks.
If we're planning a backward-incompatible change in the near future, it may be a good time to consider moving completely from .dvc files to dvc.yaml/dvc.lock. Alternatively, we could implement this and keep the old .dvc file approach as an option for backward compatibility (for some time at least).
Some context in treeverse/dvc.org#1384 (review)
My motivation to suggest this is mostly conceptual right now but maybe it has some very practical implications too? Leaving that open to discussion (cc @iterative/engineering).
In DVC 1.x we created the pipelines file dvc.yaml, which contains all the stages. From that point on, .dvc files stopped being "stage files" and remain only as placeholders for data files. They're no longer considered any kind of stage (we have already removed the terms "stage file" and "orphan stage" from the docs). This has caused headaches in the docs when explaining commands that use or affect both .dvc files AND dvc.yaml or dvc.lock, such as status, checkout, and repro (pretty important ones), because we constantly need to mention both "stages and .dvc files" or "dvc.yaml and .dvc files", etc.
The options I see here are:
- Create some concept that encompasses both .dvc files and stages. This was already discussed with Ivan and we came up with "DVC file", but we don't love it because it's too similar to ".dvc file" and will probably cause confusion. This solution is purely docs-related and implies no action here (close this issue).
- Rethink some of these commands a little and make sure that by default they only use/affect stages, with a separate set of commands or explicit options required for them to use .dvc files. To me this seems overkill.
- Get rid of .dvc files! Why can't dvc.yaml (and dvc.lock) be used for this? It's a matter of introducing a new top-level section. Example below:
dvc.yaml

    data:
      - corpus.csv
      - dataset/
    stages:
      cleanup:
        cmd: python clean.py corpus.csv df.h5
        deps:
          - corpus.csv
        outs:
          - df.h5
    ...

dvc.lock

    outs:
      - md5: 6137cde4893c59f76f005a8123d8e8e6
        path: df.h5
      - md5: cde4876f0r5137c59f8e6a8423d8e936.dir
        path: dataset/
    cleanup:
      cmd: python clean.py corpus.csv df.h5
      deps:
        - path: corpus.csv
          md5: 6137cde4893c59f76f005a8123d8e8e6
      outs:
        - path: df.h5
          md5: f40e3db3e1aa25562945045864a28deb
    ...

Something like that.
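To make the idea a bit more concrete, here is a rough Python sketch of how existing .dvc files might be folded into such a layout. Everything in it is an assumption for illustration only: `fold_dvc_files` is a hypothetical helper (not part of DVC), the top-level `data` and `outs` keys are just the ones proposed in this issue, all .dvc files are assumed to live in the repo root, and the inputs are already-parsed dicts rather than files read from disk. The hash for `dataset/` is a made-up placeholder.

```python
from copy import deepcopy


def fold_dvc_files(dvc_files, dvc_yaml, dvc_lock):
    """Fold the outs of standalone .dvc files into the proposed layout.

    dvc_files: {".dvc file name": parsed content with an "outs" list}
    dvc_yaml / dvc_lock: parsed dvc.yaml / dvc.lock contents (dicts).
    Returns new (dvc_yaml, dvc_lock) dicts; the inputs are not modified.
    """
    new_yaml = deepcopy(dvc_yaml)
    new_lock = deepcopy(dvc_lock)
    # Hypothetical sections from this proposal, not existing DVC keys.
    data = new_yaml.setdefault("data", [])
    lock_outs = new_lock.setdefault("outs", [])

    for name in sorted(dvc_files):  # sorted for a stable result
        for out in dvc_files[name].get("outs", []):
            if out["path"] in data:
                continue  # already tracked, skip duplicates
            data.append(out["path"])
            lock_outs.append({"md5": out["md5"], "path": out["path"]})
    return new_yaml, new_lock


# Example input: two .dvc files plus the pipeline from the example above.
dvc_files = {
    "corpus.csv.dvc": {
        "outs": [{"md5": "6137cde4893c59f76f005a8123d8e8e6", "path": "corpus.csv"}]
    },
    "dataset.dvc": {
        # Placeholder hash, for illustration only.
        "outs": [{"md5": "0123456789abcdef0123456789abcdef.dir", "path": "dataset/"}]
    },
}
pipeline = {
    "stages": {
        "cleanup": {
            "cmd": "python clean.py corpus.csv df.h5",
            "deps": ["corpus.csv"],
            "outs": ["df.h5"],
        }
    }
}

new_yaml, new_lock = fold_dvc_files(dvc_files, pipeline, {})
print(new_yaml["data"])
```

In this sketch the paths go into dvc.yaml and the hashes go into dvc.lock, mirroring how stage outputs are already split between the two files.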
An additional advantage of this would be that it reduces the possible confusion between .dvc/ (the internal directory) and .dvc files.