unexpected behavior of consecutive use of RemoteDataset.pull() with use_folders=True #603

@filippocastelli

Issue description:

Pulling two different releases with RemoteDataset.pull() and use_folders=True can leave images missing from the second pull unless force_replace=True is specified.

When RemoteDataset.pull(release) is called consecutively with different releases, any file in the second release that shares a filename with a file in the first release (even under a different path) is skipped unless force_replace=True is set.

release_A references the following files:

  • /dir_A/000.png
  • /dir_A/001.png

release_B references:

  • /dir_B/000.png
  • /dir_B/002.png

After calling

RemoteDataset.pull(release_A, use_folders=True)
RemoteDataset.pull(release_B, use_folders=True)

the resulting images directory is missing dir_B/000.png:

images:
    - dir_A:
        - 000.png
        - 001.png
    - dir_B:
        - 002.png

Expected Behavior:

Since /dir_B/000.png would not overwrite /dir_A/000.png, the second pull is expected to download it as well.
force_replace=True should only be necessary to re-download files whose local paths actually conflict.
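
To make the difference between the observed and the expected behavior concrete, here is a minimal sketch in plain Python. It is not darwin-py code: `plan_downloads` and its `match_on` argument are hypothetical names, and the idea that existing files are matched on filename alone is only an assumption inferred from the symptoms above, not something confirmed against the library source.

```python
from pathlib import PurePosixPath


def plan_downloads(remote_paths, local_paths, match_on):
    """Return the remote paths that still need to be downloaded.

    Hypothetical helper, not part of darwin-py.
    match_on="name": skip a file if any local file has the same filename
                     (models the behavior observed in this issue).
    match_on="path": skip a file only if the full relative path already exists
                     (models the expected behavior).
    """
    if match_on == "name":
        def key(path):
            return PurePosixPath(path).name
    elif match_on == "path":
        def key(path):
            return str(PurePosixPath(path))
    else:
        raise ValueError(f"unknown match_on value: {match_on!r}")

    existing = {key(path) for path in local_paths}
    return [path for path in remote_paths if key(path) not in existing]


# Files already on disk after pulling release_A, and files referenced by release_B.
release_a = ["/dir_A/000.png", "/dir_A/001.png"]
release_b = ["/dir_B/000.png", "/dir_B/002.png"]

observed = plan_downloads(release_b, release_a, match_on="name")
expected = plan_downloads(release_b, release_a, match_on="path")

print(observed)  # ['/dir_B/002.png']
print(expected)  # ['/dir_B/000.png', '/dir_B/002.png']
```

With filename matching, /dir_B/000.png is treated as already present because /dir_A/000.png exists. With full-path matching, both files of release_B are scheduled for download.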

Workaround

Specify force_replace=True when pulling the second release.

Environment

  • python==3.10
  • darwin-py==0.8.24
