-
Notifications
You must be signed in to change notification settings - Fork 47
[IO-1586][external] Changes to RemoteDataset.pull() to account for identically named files in different export releases #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…s in different releases
IO-1586 Bug: External issue with sequential dataset pull operations
BUG submission from: Summary (describe an issue): Running dataset pull twice in a row, different folders, with the same filename in, overwrites the oldest uploaded. Share Loom/Screenshots with Console/Network opened: Darwin affected version Darwin-py, all recent versions. Impact One known customer Priority Medium Steps to reproduce See linked ticket |
darwin/dataset/remote_dataset.py
Outdated
| self.console.print(f"\t - {error}") | ||
|
|
||
| downloaded_file_count = len([f for f in self.local_images_path.rglob("*") if f.is_file()]) | ||
| # Remove images that don't have a corresponding annotation file in the release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? The underlying download_all_images_from_annotations function gets passed the remove_extra flag and is meant to already unlink/delete files that don't have an annotation. If that's not doing it's job and still needs to be cleaned up afterwards then perhaps this is a separate ticket?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually want to clarify one thing here - Is the intended behaviour of remove_extra=True to remove files that don't have a corresponding annotation file, or to remove files that are not part of the release currently being pulled?
Currently, the behaviour is the latter, because we remove all files that don't have a stem matching any stem in the release-specific annotations folder
If the latter is the intended behaviour, it works unless there are duplicate filename stems in the release, because it relies on the existing_images dict that I changed to a list (see comment above for why a dict was problematic). If existing_images is changed to a list or a set of paths where there are duplicate stems in the release, I can't think of a way to get a guaranteed mapping between annotation file and corresponding image/video etc. other than by inspecting the local_path field inside source_files for each annotation file. However, this isn't populated in annotation files that are part of the current pull until the download function is actually finished (i.e. exhaust_generator() is run). As far as I can tell, this means that the removal of extra files has to take place after this (I chose to put it inside remote_dataset.py, but it could easily be put into a function)
Would this be good to put into a new ticket? There was already an issue with remove_extra before my changes (problematic existing_images dict), but my changes broke it more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind this here, but creating an extra ticket and progressing it with this merge would be good for tracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have made changes to ensure that remove_extra functions exactly as it did before the ticket, will raise it separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rslota Would you be able to clarify on my comment above? In working on this issue, I discovered a problem with the remove_extra flag, and neither Nathan nor I are able to tell what it's intended behaviour is
I can create a new ticket to work on it once I know what it should be doing
For now, I've made sure it's behaviour is identical with the changes in this ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for late reply, I missed the notification.
remove_extra afaik is here to remove any image files that is already downloaded that would not get overridden by the current release being pulled. In other words, it should work the same as wiping image directory before pulling the release, however, it can't be implemented this way, in case we have issues downloading "new" images, and customer would ended up with empty target directory.
Nathanjp91
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment to clarify
owencjones
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments for now, good first PR though :)
darwin/dataset/remote_dataset.py
Outdated
| self.console.print(f"\t - {error}") | ||
|
|
||
| downloaded_file_count = len([f for f in self.local_images_path.rglob("*") if f.is_file()]) | ||
| # Remove images that don't have a corresponding annotation file in the release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind this here, but creating an extra ticket and progressing it with this merge would be good for tracking.
Nathanjp91
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the change to Set + full filename rather than overwrite on partial filename in dict
Problem
As described in this GitHub issue, pulling two different releases using
RemoteDataset.pull()passingwith_folders=Truecan result in missing files from the second release unlessforce_replace=Trueis passed as well. This is becausepull()does not download files if the stem of the filename isalready present anywhere in the
~/.darwin/datasets/{team_slug}/{dataset_slug}/imagesfolder.A further breakdown of the cause is available in the Linear ticket IO-1586
Solution
Instead of checking the stem of existing filenames in the images folder, check the full local filepath of each file against the "planned" local path for each file in the export release. This way, if the "planned" local path already exists, we know that file doesn't need to be downloaded again.
This led to the discovery of an issue with the
remove_extraflag, the purpose of which is to remove files from theimagesfolder that aren't contained within the export release currently being pulled. Similarly to above, it was using filename stems to make decisions about which files should be removed, which meant it couldn't account for situations where a release contained duplicate filenames.To resolve this, the functionality was moved up a level to
remote_dataset.pywhere it has access to build a list of full local filepaths from the complete annotation files. Files in theimagesfolder are then compared against this list, and those without a path contained in this list are removed.Changelog
Changes to RemoteDataset.pull() to prevent situations where identically named files in different folders could be overwritten when pulling different export releases