Skip to content

Conversation

@JBWilkie
Copy link
Contributor

@JBWilkie JBWilkie commented Sep 25, 2023

Problem

As described in this GitHub issue, pulling two different releases using RemoteDataset.pull() passing with_folders=True can result in missing files from the second release unless force_replace=True is passed as well. This is because pull() does not download files if the stem of the filename is
already present anywhere in the ~/.darwin/datasets/{team_slug}/{dataset_slug}/images folder.

A further breakdown of the cause is available in the Linear ticket IO-1586

Solution

Instead of checking the stem of existing filenames in the images folder, check the full local filepath of each file against the "planned" local path for each file in the export release. This way, if the "planned" local path already exists, we know that file doesn't need to be downloaded again.

This led to the discovery of an issue with the remove_extra flag, the purpose of which is to remove files from the images folder that aren't contained within the export release currently being pulled. Similarly to above, it was using filename stems to make decisions about which files should be removed, which meant it couldn't account for situations where a release contained duplicate filenames.

To resolve this, the functionality was moved up a level to remote_dataset.py where it has access to build a list of full local filepaths from the complete annotation files. Files in the images folder are then compared against this list, and those without a path contained in this list are removed.

Changelog

Changes to RemoteDataset.pull() to prevent situations where identically named files in different folders could be overwritten when pulling different export releases

@linear
Copy link

linear bot commented Sep 25, 2023

IO-1586 Bug: External issue with sequential dataset pull operations

BUG submission from:

#603

Summary (describe an issue):

Running dataset pull twice in a row, different folders, with the same filename in, overwrites the oldest uploaded.

Share Loom/Screenshots with Console/Network opened:

Darwin affected version

Darwin-py, all recent versions.

Impact

One known customer

Priority

Medium

Steps to reproduce

See linked ticket

@JBWilkie JBWilkie changed the title [IO-1586][external] Changes to RemoteDataset.pull() to account for identically named files in different export release [IO-1586][external] Changes to RemoteDataset.pull() to account for identically named files in different export releases Sep 25, 2023
self.console.print(f"\t - {error}")

downloaded_file_count = len([f for f in self.local_images_path.rglob("*") if f.is_file()])
# Remove images that don't have a corresponding annotation file in the release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? The underlying download_all_images_from_annotations function gets passed the remove_extra flag and is meant to already unlink/delete files that don't have an annotation. If that's not doing it's job and still needs to be cleaned up afterwards then perhaps this is a separate ticket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually want to clarify one thing here - Is the intended behaviour of remove_extra=True to remove files that don't have a corresponding annotation file, or to remove files that are not part of the release currently being pulled?

Currently, the behaviour is the latter, because we remove all files that don't have a stem matching any stem in the release-specific annotations folder

If the latter is the intended behaviour, it works unless there are duplicate filename stems in the release, because it relies on the existing_images dict that I changed to a list (see comment above for why a dict was problematic). If existing_images is changed to a list or a set of paths where there are duplicate stems in the release, I can't think of a way to get a guaranteed mapping between annotation file and corresponding image/video etc. other than by inspecting the local_path field inside source_files for each annotation file. However, this isn't populated in annotation files that are part of the current pull until the download function is actually finished (i.e. exhaust_generator() is run). As far as I can tell, this means that the removal of extra files has to take place after this (I chose to put it inside remote_dataset.py, but it could easily be put into a function)

Would this be good to put into a new ticket? There was already an issue with remove_extra before my changes (problematic existing_images dict), but my changes broke it more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind this here, but creating an extra ticket and progressing it with this merge would be good for tracking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have made changes to ensure that remove_extra functions exactly as it did before the ticket, will raise it separately

Copy link
Contributor Author

@JBWilkie JBWilkie Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rslota Would you be able to clarify on my comment above? In working on this issue, I discovered a problem with the remove_extra flag, and neither Nathan nor I are able to tell what it's intended behaviour is

I can create a new ticket to work on it once I know what it should be doing

For now, I've made sure it's behaviour is identical with the changes in this ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late reply, I missed the notification.
remove_extra afaik is here to remove any image files that is already downloaded that would not get overridden by the current release being pulled. In other words, it should work the same as wiping image directory before pulling the release, however, it can't be implemented this way, in case we have issues downloading "new" images, and customer would ended up with empty target directory.

Copy link
Contributor

@Nathanjp91 Nathanjp91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment to clarify

Copy link
Contributor

@owencjones owencjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments for now, good first PR though :)

self.console.print(f"\t - {error}")

downloaded_file_count = len([f for f in self.local_images_path.rglob("*") if f.is_file()])
# Remove images that don't have a corresponding annotation file in the release
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind this here, but creating an extra ticket and progressing it with this merge would be good for tracking.

@JBWilkie JBWilkie requested a review from Nathanjp91 October 3, 2023 11:22
Copy link
Contributor

@Nathanjp91 Nathanjp91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the change to Set + full filename rather than overwrite on partial filename in dict

@JBWilkie JBWilkie requested a review from owencjones October 3, 2023 16:12
@JBWilkie JBWilkie merged commit 013a988 into master Oct 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants