add: performance and reliability issues #6227

@skshetry

  • Repeated dvc add is not skipped.

    $ dvc add data
    $ dvc add data

    In 1.x, the second call would have been skipped. DVC also still deletes the file and tries to restore it from the cache, which makes the repeated add slower.

  • DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than just copying it.

    This is slow and might result in data loss if it fails between the two operations.

  • DVC deletes the stage file before even adding those files. This means that if the dvc add operation fails, the existing pointer file, which is the only way to get access to the data, is lost.

  • DVC resets the stages multiple times (only if multiple targets are provided) and forces stage recollection, which is slow.

  • Similarly, it resets the internal state of the repo after creating each stage, which also resets dulwich's ignore manager. This makes it horribly slow with many targets (or with -R).

https://github.com/iterative/dvc/blob/4e792ae61c5927ab2e5f6a6914d985d43aa705b4/dvc/repo/add.py#L266
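For discussion, here is a minimal sketch of the ordering the first three points seem to call for: skip when nothing changed, copy into the cache instead of moving, and replace the pointer file only after the data is safely cached. This is not DVC's code or API. `safe_add`, `file_md5`, the `.ptr` JSON pointer file and the cache layout are all hypothetical stand-ins chosen for illustration.

```python
import hashlib
import json
import os
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_add(data: Path, cache_dir: Path) -> str:
    """Hypothetical add for a single file; returns "skipped" or "added"."""
    # Stand-in for a .dvc pointer file (assumption: JSON with an "md5" key).
    pointer = data.with_name(data.name + ".ptr")
    md5 = file_md5(data)
    cached = cache_dir / md5[:2] / md5[2:]

    # 1. Repeated add: the pointer already records this hash and the cache
    #    already holds the content, so there is nothing to do.
    if pointer.exists() and cached.exists():
        if json.loads(pointer.read_text()).get("md5") == md5:
            return "skipped"

    # 2. Copy (never move) into the cache. The copy goes to a temp file
    #    first, so a crash cannot leave a half-written cache entry, and
    #    the workspace file is never removed.
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=cached.parent)
        os.close(fd)
        shutil.copyfile(data, tmp)
        os.replace(tmp, cached)

    # 3. Only now swap in the new pointer, atomically. If anything above
    #    failed, the old pointer is still on disk.
    fd, tmp = tempfile.mkstemp(dir=pointer.parent)
    with os.fdopen(fd, "w") as fobj:
        json.dump({"md5": md5, "path": data.name}, fobj)
    os.replace(tmp, pointer)
    return "added"
```

Calling `safe_add(Path("data"), Path(".cache"))` twice in a row would return `"added"` and then `"skipped"`, and the workspace file stays in place throughout. The sketch ignores directories, links and cleanup of leftover temp files, and it does not address the stage reset cost from the last two points.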

Labels: A: data-management, enhancement, performance, ui
