Provide complete filepath to is_valid_file in make_dataset rather than only the filename #4871

@kotchin

Description

🚀 The feature

(First issue/feature request, tried my best to follow the guidelines, apologies if I missed something).

In torchvision.datasets.folder.make_dataset, we are given the option to use is_valid_file (or extensions).

My feature request is to allow is_valid_file to get the whole path to the file rather than just the filename.

Motivation, pitch

Currently, if we wish to use is_valid_file, it receives only the filename, not the whole path to the file. This makes it tricky to open the file, check whether it meets certain criteria, and decide whether it is valid.

Perhaps this was not the original intent, but passing is_valid_file the whole path rather than just the filename would make it considerably more flexible.

My exact implementation idea would be the following:
Replace the following snippet:

for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
    for fname in sorted(fnames):
        if is_valid_file(fname):
            path = os.path.join(root, fname)
            item = path, class_index
            instances.append(item)

with:

for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
    for fname in sorted(fnames):
        path = os.path.join(root, fname)
        if is_valid_file(path):
            item = path, class_index
            instances.append(item)
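
To try the proposed loop outside torchvision, it can be wrapped in a small standalone function. This is only a sketch: `collect_instances` is a hypothetical name used here for illustration, not a torchvision function, and it covers a single class directory only.

```python
import os
from typing import Callable, List, Tuple


def collect_instances(
    target_dir: str,
    class_index: int,
    is_valid_file: Callable[[str], bool],
) -> List[Tuple[str, int]]:
    """Hypothetical helper mirroring the proposed loop for one class directory."""
    instances = []
    for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
        for fname in sorted(fnames):
            # Build the full path first so the predicate can open the file.
            path = os.path.join(root, fname)
            if is_valid_file(path):
                instances.append((path, class_index))
    return instances
```

Because the predicate sees the full path, something like `lambda p: os.path.getsize(p) > 0` becomes possible, which a filename-only predicate cannot do.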

This change might break backward compatibility for users who rely on is_valid_file, but their fix would be simple: they could add the following line to their is_valid_file function:
root, fname = os.path.split(path)
where path is the argument passed to is_valid_file. They could then continue using fname (or whatever they've named it) as before.
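
As a rough illustration of that migration, an existing filename-only check could be wrapped rather than edited. The names `adapt_filename_check` and `legacy_check` below are hypothetical and exist only for this sketch.

```python
import os
from typing import Callable


def adapt_filename_check(check_fname: Callable[[str], bool]) -> Callable[[str], bool]:
    """Wrap a filename-only predicate so that it accepts a full path."""

    def is_valid_file(path: str) -> bool:
        # Same one-line fix as above: drop the directory part, keep the filename.
        _root, fname = os.path.split(path)
        return check_fname(fname)

    return is_valid_file


def legacy_check(fname: str) -> bool:
    # Example of an existing filename-based rule: visible PNG files only.
    return fname.lower().endswith(".png") and not fname.startswith(".")


is_valid_file = adapt_filename_check(legacy_check)
```

The wrapper keeps the old predicate untouched, so the same function can still be used wherever only filenames are available.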

Alternatives

No response

Additional context

One example of how this could be useful: selecting only pictures that meet certain resolution, channel-count, or other criteria, in a configurable way. The criteria could be coded in the is_valid_file function, without having to delete or move files.
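
A minimal sketch of such a filter, assuming is_valid_file receives the full path as proposed. To stay dependency-free it only reads the width and height from a PNG header; a real filter would more likely open the image with an imaging library. `png_size` and `make_min_resolution_filter` are hypothetical names introduced for this example.

```python
import struct
from typing import Callable, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path: str) -> Optional[Tuple[int, int]]:
    """Return (width, height) from a PNG header, or None if it cannot be read."""
    try:
        with open(path, "rb") as f:
            header = f.read(24)
    except OSError:
        return None
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def make_min_resolution_filter(min_width: int, min_height: int) -> Callable[[str], bool]:
    """Build an is_valid_file predicate that keeps PNGs of at least the given size."""

    def is_valid_file(path: str) -> bool:
        size = png_size(path)
        return size is not None and size[0] >= min_width and size[1] >= min_height

    return is_valid_file
```

The thresholds are passed in when the predicate is built, so the criteria stay configurable and no files need to be deleted or moved.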
