The group_by_keys function in tariterators.py #157
Can you attach the output of …

This error is triggered when you have a file that contains repeated file names, something like:
Such files are valid tar files (the contents of the later occurrence override those of the first), but WebDataset considers them an error.
Actually, each of my tar files contains only a single image, and the images share the same name across different tar files. When my input urls include all the tar files, it throws this error. Is that expected?
Yes, that will trigger this error. File names are supposed to be distinct in WebDataset files; that's just a very useful convention, and it is needed to segment tar files into samples. You do have the option of addressing this by using explicit pipeline construction. Something like the following might work:
However, I strongly recommend giving samples unique basenames. Ultimately, the view of WebDataset-style files is that you ought to be able to take all the tar files, extract them somewhere, and end up with a reasonable file-based representation of the dataset.
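One concrete way to follow the unique-basename recommendation is to repack each shard so that member names carry a shard-specific prefix. This is a hedged, stdlib-only sketch; the helper name `repack_with_unique_keys` and the prefixing scheme are my own illustration, not part of the WebDataset API:

```python
import os
import tarfile

def repack_with_unique_keys(src_paths, dst_dir):
    """Rewrite each tar so that member names are prefixed with the
    shard index, making sample keys unique across the whole
    collection. (Illustrative helper, not part of WebDataset.)"""
    os.makedirs(dst_dir, exist_ok=True)
    out_paths = []
    for shard_idx, src in enumerate(src_paths):
        dst = os.path.join(dst_dir, os.path.basename(src))
        with tarfile.open(src) as tin, tarfile.open(dst, "w") as tout:
            for member in tin.getmembers():
                if not member.isfile():
                    continue
                data = tin.extractfile(member)
                # e.g. "000.jpg" in the first shard -> "shard0000/000.jpg"
                member.name = f"shard{shard_idx:04d}/{member.name}"
                tout.addfile(member, data)
        out_paths.append(dst)
    return out_paths
```

After repacking, every sample key includes its shard prefix, so two shards that both contain `000.jpg` no longer collide when shuffled next to each other.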
This issue isn't resolved and should be re-opened. @EEthinker is correct that …
I have added a rename_files option that lets you rename files from the tar file before they are grouped by the group_by_keys function. There is also a select_files argument that lets you skip files during tarfile reading, giving you speedups by skipping unused data when reading from local disk. The ValueError is always thrown, but it is also always caught: the handler function is called in the exception handler.
@tmbdev I found that the same error is raised when I read many tar files and each one contains only a single video/image. If I merge those videos/images into tars of ~100 files each, the error disappears. Maybe there should be a test for this case (many tars, each containing only one file).
File names inside tar files must be unique across .tar files. That's because WebDataset considers the entire collection of tar files to be the dataset and requires unique file names across it. WebDataset notices non-unique file names when you happen to shuffle shards in such a way that two tar files containing identical file names end up next to each other. When you put all files into a single tar file, the shard shuffle won't shuffle the individual files, so if you happen to have picked an order in which identical file names are not next to one another, you won't get this error. We could consider tar file boundaries to be sample boundaries, but having duplicate keys still causes problems elsewhere, e.g. with sample caching. You can find duplicates using …

Todo: maybe we need an explicit checker for this.
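The command the author had in mind for finding duplicates was not preserved in this thread. As one stdlib-only way to do it (the helper names `sample_key` and `find_duplicate_keys` are my own, and `sample_key` only approximates WebDataset's key-splitting rule):

```python
import tarfile
from collections import Counter

def sample_key(name):
    """Approximate WebDataset's grouping rule: the sample key is the
    member path up to the first dot of the basename."""
    dirname, _, base = name.rpartition("/")
    key = base.split(".", 1)[0]
    return f"{dirname}/{key}" if dirname else key

def find_duplicate_keys(shard_paths):
    """Return sample keys that occur in more than one shard across
    the given tar files. (Illustrative helper only.)"""
    counts = Counter()
    for path in shard_paths:
        with tarfile.open(path) as t:
            # count each key at most once per shard
            keys = {sample_key(m.name) for m in t.getmembers() if m.isfile()}
            counts.update(keys)
    return sorted(k for k, n in counts.items() if n > 1)
```

Any key this returns is a candidate for triggering the duplicate-file-name ValueError whenever the affected shards are shuffled next to each other.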
I think I've seen this error when I had one shard containing just a single sample (in two files) and I was doing infinite reshuffling. I think it randomly reselected the same shard twice in a row and ended up throwing this error despite all the file names being unique. Tomorrow I'll try to reproduce this with a dataset containing a single sample in a single shard – it should fail every time.
I am not able to understand the logic of this code in the group_by_keys function in file tariterators.py:

```python
if suffix in current_sample:
    raise ValueError(f"{fname}: duplicate file name in tar file {suffix} {current_sample.keys()}")
```

In particular, in my application it sometimes works well and sometimes throws this error. I found that while this iterator runs we are actively adding keys and values to the dictionary, and one of the keys we added is in fact the suffix, so naturally it appears in current_sample. I am not sure why this should be an error, or how to actually fix it. Any help/comments would be greatly appreciated. Many thanks!
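For context, here is a simplified, stdlib-only sketch of the grouping behavior described in this thread. It is condensed and approximate (error handling and options omitted), so treat it as an illustration rather than the actual WebDataset source:

```python
def base_plus_ext(fname):
    """Split "dir/key.suffix" into ("dir/key", "suffix"): the key is
    everything up to the first dot of the basename."""
    dirname, _, base = fname.rpartition("/")
    if "." not in base:
        return None, None
    key, suffix = base.split(".", 1)
    return (f"{dirname}/{key}" if dirname else key), suffix

def group_by_keys_sketch(files):
    """Group a stream of (fname, data) pairs into sample dicts.
    A new sample starts whenever the key changes; a repeated suffix
    within one sample means a duplicate file name."""
    current_sample = None
    for fname, data in files:
        prefix, suffix = base_plus_ext(fname)
        if prefix is None:
            continue
        if current_sample is None or prefix != current_sample["__key__"]:
            if current_sample:
                yield current_sample
            current_sample = {"__key__": prefix}
        if suffix in current_sample:
            raise ValueError(
                f"{fname}: duplicate file name in tar file "
                f"{suffix} {current_sample.keys()}"
            )
        current_sample[suffix] = data
    if current_sample:
        yield current_sample
```

This makes the failure mode concrete: if two shards each containing `000.jpg` are streamed back to back, the key never changes between them, so the second file lands in the same sample and trips the duplicate-suffix check.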