
The group_by_keys function in tariterators.py #157

Open
EEthinker opened this issue Feb 23, 2022 · 9 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

@EEthinker

I am not able to understand the logic of this code in the group_by_keys function in tariterators.py:

if suffix in current_sample:
    raise ValueError(f"{fname}: duplicate file name in tar file {suffix} {current_sample.keys()}")

In particular, in my application it sometimes works fine and sometimes throws this error. I found that while this iterator runs, we are actively adding keys and values to the dictionary, and one of the keys we add is in fact the suffix, so naturally it should appear in current_sample. I am not sure why this should be an error, or how to fix it. Any help/comment would be greatly appreciated. Many thanks!

@tmbdev
Collaborator

tmbdev commented Feb 25, 2022

Can you attach the output of tar tvf shard.tar please?

This error is triggered when you have a file that contains repeated file names, something like:

dir/base.jpg
dir/base.json
dir/base.jpg

Such files are valid tar files (the contents of the later entry override those of the first occurrence), but for WebDataset we consider them an error.
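For context, here is a simplified sketch of the grouping logic that raises this error (not the actual tariterators.py source, which splits keys with a more careful helper): files arrive in tar order, consecutive entries sharing a basename are grouped into one sample, and a repeated suffix within the current group is treated as a duplicate.

import os

def group_by_keys_sketch(files):
    # files: iterator of (fname, data) pairs in the order they appear in the tar
    current_sample = None
    for fname, data in files:
        prefix, suffix = os.path.splitext(fname)
        if current_sample is None or current_sample["__key__"] != prefix:
            if current_sample is not None:
                yield current_sample  # a new basename closes the previous sample
            current_sample = {"__key__": prefix}
        if suffix in current_sample:
            # the same basename/extension pair was seen twice in a row
            raise ValueError(f"{fname}: duplicate file name in tar file {suffix}")
        current_sample[suffix] = data
    if current_sample is not None:
        yield current_sample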

@tmbdev tmbdev added the question Further information is requested label Feb 25, 2022
@tmbdev tmbdev self-assigned this Feb 25, 2022
@EEthinker
Author

Actually, each of my tar files contains only a single image, and the images share the same name across different tar files. When my input URLs include all the tar files, it throws this error. Is this expected?

@tmbdev
Collaborator

tmbdev commented Feb 28, 2022

Yes, that will trigger this error. File names are supposed to be distinct across WebDataset files; that's not just a very useful convention, it is needed to segment tar files into samples.

You do have the option of addressing this with explicit pipeline construction. Something like the following might work:

# warning: untested code

import os
import webdataset as wds
from webdataset.tariterators import url_opener, tar_file_expander

def my_tarfile_to_samples(src, handler=wds.reraise_exception):
    streams = url_opener(src, handler=handler)
    files = tar_file_expander(streams, handler=handler)
    # use a running counter as the key so duplicate basenames cannot collide
    for count, filesample in enumerate(files):
        fname, value = filesample["fname"], filesample["data"]
        ext = os.path.splitext(fname)[1].lstrip(".")  # "jpg", "json", ...
        yield {"__key__": str(count), ext: value}

dataset = wds.DataPipeline(
    wds.SimpleShardList(url),
    wds.shuffle(100),
    wds.split_by_worker,
    my_tarfile_to_samples,
    wds.shuffle(1000),
    wds.decode("torchrgb"),
    wds.to_tuple("jpg;png;jpeg"),
    wds.batched(16),
)
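With a running counter as __key__, duplicate basenames across shards can no longer collide; the tradeoff is that the original basenames are no longer available as sample keys.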

However, I strongly recommend giving samples unique basenames. Ultimately, the view of WebDataset-style files is that you ought to be able to take all the tar files, extract them somewhere, and end up with a reasonable file-based representation of the dataset.

@tmbdev tmbdev closed this as completed Feb 28, 2022
@parkitny

This issue isn't resolved and should be re-opened. @EEthinker is correct: group_by_keys takes the handler as an argument but never uses it, so the ValueError is always thrown when a duplicate is found, regardless of whether the warn_and_continue handler is provided. I hit this error recently and looked into the code details. I could create a PR with a fix, if you like, sometime in the next few weeks.

@tmbdev
Collaborator

tmbdev commented Mar 18, 2023

I have added a rename_files option that lets you rename files from the tar file before they are grouped by the group_by_keys function. There is also a select_files argument that lets you skip files during tar file reading, which gives you speedups by skipping unused data when reading from local disk.

The ValueError is always thrown, but it is also always caught; the handler function is called in the exception handler.
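A minimal sketch of how these options might be used, assuming rename_files maps each file name to a new name and select_files is a predicate on file names; the shard pattern below is hypothetical:

import webdataset as wds

dataset = wds.WebDataset(
    "shards-{000000..000099}.tar",  # hypothetical shard pattern
    # normalize names so e.g. IMG.JPG and img.jpg don't produce surprises
    rename_files=lambda fname: fname.lower(),
    # read only the files we actually use; everything else is skipped early
    select_files=lambda fname: fname.endswith((".jpg", ".json")),
)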

@tmbdev tmbdev added the documentation Improvements or additions to documentation label Mar 18, 2023
@FuchenUSTC

@tmbdev I found that the same error is raised when I read many tar files, each containing only a single video/image. If I merge those videos/images into tar files with ~100 files each, the error disappears. Maybe there should be a test for such a case (many tars, each containing only one file).

@tmbdev
Collaborator

tmbdev commented Apr 30, 2023

File names inside tar files must be unique across .tar files. That's because webdataset considers the entire collection of tar files to be the dataset and requires unique file names across it.

WebDataset notices non-unique file names when you happen to shuffle shards in such a way that two tar files containing identical file names end up next to each other.

When you put all files into a single tar file, the shard shuffle won't shuffle the files, so if you happen to have picked an order in which identical file names are not next to one another, you won't get this error.

We could consider tar file boundaries to be sample boundaries, but having duplicate keys still causes problems elsewhere, e.g. with sample caching.

You can find duplicates using tar tf shards-*.tar | sort | uniq -d

Todo: maybe we need an explicit checker for this.
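A minimal sketch of such a checker, using only the Python standard library (the script name and CLI usage are assumptions, not part of WebDataset):

import sys
import tarfile
from collections import Counter

def find_duplicate_names(shard_paths):
    # count every regular-file member name across all shards
    counts = Counter()
    for path in shard_paths:
        with tarfile.open(path) as tf:
            counts.update(m.name for m in tf if m.isfile())
    return sorted(name for name, n in counts.items() if n > 1)

if __name__ == "__main__":
    # usage: python check_duplicates.py shards-*.tar
    for name in find_duplicate_names(sys.argv[1:]):
        print(name)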

@piEsposito

@tmbdev I've opened #327 proposing a change that I believe solves this problem in a harmless way - at least it did for my specific use case.

@jpc
Contributor

jpc commented Jun 9, 2024

I think I've seen this error when I had one shard containing just a single sample (in two files) and I was doing infinite reshuffling. I think it randomly selected the same shard twice in a row and ended up throwing this error, despite all the file names being unique.

Tomorrow I'll try to reproduce this with a dataset containing a single sample in a single shard; it should fail every time.
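A sketch of that repro, assuming resampling can draw the same shard twice in a row (the shard path is hypothetical):

import webdataset as wds

dataset = wds.DataPipeline(
    wds.ResampledShards("single-sample.tar"),  # hypothetical one-sample shard
    wds.tarfile_to_samples(),
)

# with a single shard, every draw repeats the same file names back to back,
# so per the hypothesis above this loop should hit the duplicate-name ValueError
for i, sample in enumerate(dataset):
    print(i, sample["__key__"])
    if i >= 5:
        break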
