Description
Currently, all the checksums are scattered among the DVC-files. This was a design decision to simplify git merge for ML experiments, since changes to a single data file or DVC stage stayed localized. However, we have learned that in many cases the -X theirs strategy is the best way to bring ML experiments to another branch without manual merging, so it is a good time to revisit this design decision.
There are three issues with keeping checksums in many DVC-files:
- It makes DVC-files hard for users to read
- DVC (a tool) has to modify the files, which is not the best practice
- It could be convenient to have all the changes in a single file for automation tools (like CD4ML), which usually cannot make a Git commit after dvc repro. The changes in the repo (changed DVC-files) need to be copied somewhere (e.g. GitLab artifacts).
To solve the issues above, it might be worth extracting all the checksums into a separate "State"-file. For example: Dvc.state or <anyname>.dvcstate or .dvc/state
Note that this is not the same as the current .dvc/state, which is an ephemeral DB file that is not committed to Git. The proposed state file would need to be committed to Git.
Example: Terraform keeps all the infrastructure configuration in *.tf files but stores state in a single, separate file terraform.tfstate.
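To make the proposal concrete, here is a rough sketch of what the extraction could look like. It is only an illustration, not how DVC works today: the function names (split_stage, build_state), the state key format, and the use of JSON for the state file are all hypothetical, and the stage dictionary only approximates a DVC-file with md5 fields at the top level and under deps and outs.

```python
import copy
import json


def split_stage(stage_name, stage):
    """Return (stage without checksums, {state_key: checksum}).

    Hypothetical helper: assumes a DVC-file parsed into a dict with an
    optional top-level "md5" and "md5" fields under "deps" and "outs".
    """
    clean = copy.deepcopy(stage)  # leave the caller's data untouched
    state = {}

    # Checksum of the stage itself
    stage_md5 = clean.pop("md5", None)
    if stage_md5 is not None:
        state[stage_name] = stage_md5

    # Checksums of dependencies and outputs
    for section in ("deps", "outs"):
        for entry in clean.get(section, []):
            checksum = entry.pop("md5", None)
            if checksum is not None:
                # Key format is an assumption made for this sketch
                state[f"{stage_name}:{section}:{entry['path']}"] = checksum

    return clean, state


def build_state(stages):
    """Collect the checksums of all stages into a single state mapping."""
    cleaned, state = {}, {}
    for name, stage in stages.items():
        cleaned[name], entries = split_stage(name, stage)
        state.update(entries)
    return cleaned, state


# Made-up example data; the checksums are placeholders, not real hashes.
stages = {
    "train.dvc": {
        "md5": "aaa",
        "cmd": "python train.py",
        "deps": [{"md5": "bbb", "path": "data.csv"}],
        "outs": [{"md5": "ccc", "path": "model.pkl"}],
    }
}

cleaned, state = build_state(stages)

# The single file that would be committed to Git (or copied to CI artifacts)
state_text = json.dumps(state, indent=2, sort_keys=True)
```

With this split, the DVC-files would contain only what the user wrote (cmd, paths), and an automation tool would have to carry around just one changed file after dvc repro.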
Related issues: This FR might be related to the single-DAG FR #1871