
The group_by_keys function in tariterators.py #157

Open
EEthinker opened this issue Feb 23, 2022 · 9 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

@EEthinker

I am not able to understand the logic of this code in the group_by_keys function in tariterators.py:

if suffix in current_sample:
    raise ValueError(f"{fname}: duplicate file name in tar file {suffix} {current_sample.keys()}")

In particular, in my application it sometimes works fine and sometimes throws this error. I found that while this iterator runs, we are actively adding keys and values to the dictionary, and one of the keys we add is in fact the suffix, so naturally it should appear in current_sample. I am not sure why this should be an error, or how to fix it. Any help/comment would be greatly appreciated. Many thanks!

@tmbdev
Collaborator

tmbdev commented Feb 25, 2022

Can you attach the output of tar tvf shard.tar please?

This error is triggered when you have a file that contains repeated file names, something like:

dir/base.jpg
dir/base.json
dir/base.jpg

Such files are valid tar files (the contents of the later entry override those of the first occurrence), but for WebDataset we consider them an error.
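For context, here is a simplified sketch of the grouping logic that raises this error (not the actual tariterators.py source, which splits keys with a more careful helper): files arrive in tar order, consecutive entries sharing a basename are grouped into one sample, and a repeated suffix within the current group is treated as a duplicate.

import os

def group_by_keys_sketch(files):
    # files: iterator of (fname, data) pairs in the order they appear in the tar
    current_sample = None
    for fname, data in files:
        prefix, suffix = os.path.splitext(fname)
        if current_sample is None or current_sample["__key__"] != prefix:
            if current_sample is not None:
                yield current_sample  # a new basename closes the previous sample
            current_sample = {"__key__": prefix}
        if suffix in current_sample:
            # the same basename/extension pair was seen twice in a row
            raise ValueError(f"{fname}: duplicate file name in tar file {suffix}")
        current_sample[suffix] = data
    if current_sample is not None:
        yield current_sample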

@tmbdev tmbdev added the question Further information is requested label Feb 25, 2022
@tmbdev tmbdev self-assigned this Feb 25, 2022
@EEthinker
Author

Actually, each of my tar files contains only a single image, and the images share the same name across different tar files. When my input URLs include all the tar files, it throws this error. Is this expected?

@tmbdev
Collaborator

tmbdev commented Feb 28, 2022

Yes, that will trigger this error. File names are supposed to be distinct across WebDataset files; that's not just a very useful convention, it is needed to segment tar files into samples.

You do have the option of addressing this with explicit pipeline construction. Something like the following might work:

# warning: untested code

import os
import webdataset as wds
from webdataset.tariterators import url_opener, tar_file_expander

def my_tarfile_to_samples(src, handler=wds.reraise_exception):
    streams = url_opener(src, handler=handler)
    files = tar_file_expander(streams, handler=handler)
    # use a running counter as the key so duplicate basenames cannot collide
    for count, filesample in enumerate(files):
        fname, value = filesample["fname"], filesample["data"]
        ext = os.path.splitext(fname)[1].lstrip(".")  # "jpg", "json", ...
        yield {"__key__": str(count), ext: value}

dataset = wds.DataPipeline(
    wds.SimpleShardList(url),
    wds.shuffle(100),
    wds.split_by_worker,
    my_tarfile_to_samples,
    wds.shuffle(1000),
    wds.decode("torchrgb"),
    wds.to_tuple("jpg;png;jpeg"),
    wds.batched(16),
)
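With a running counter as __key__, duplicate basenames across shards can no longer collide; the tradeoff is that the original basenames are no longer available as sample keys.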

However, I strongly recommend giving samples unique basenames. Ultimately, the view of WebDataset-style files is that you ought to be able to take all the tar files, extract them somewhere, and end up with a reasonable file-based representation of the dataset.

@tmbdev tmbdev closed this as completed Feb 28, 2022
@parkitny

This issue isn't resolved and should be re-opened. @EEthinker is correct: group_by_keys takes the handler as an argument but never uses it, so the ValueError is always thrown when a duplicate is found, regardless of whether the warn_and_continue handler is provided. I hit this error recently and looked into the code details. I could create a PR with a fix, if you like, sometime in the next few weeks.

@tmbdev
Collaborator

tmbdev commented Mar 18, 2023

I have added a rename_files option that lets you rename files from the tar file before they are grouped by the group_by_keys function. There is also a select_files argument that lets you skip files during tar file reading, which gives you speedups by skipping unused data when reading from local disk.

The ValueError is always thrown, but it is also always caught; the handler function is called in the exception handler.
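A minimal sketch of how these options might be used, assuming rename_files maps each file name to a new name and select_files is a predicate on file names; the shard pattern below is hypothetical:

import webdataset as wds

dataset = wds.WebDataset(
    "shards-{000000..000099}.tar",  # hypothetical shard pattern
    # normalize names so e.g. IMG.JPG and img.jpg don't produce surprises
    rename_files=lambda fname: fname.lower(),
    # read only the files we actually use; everything else is skipped early
    select_files=lambda fname: fname.endswith((".jpg", ".json")),
)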

@tmbdev tmbdev added the documentation Improvements or additions to documentation label Mar 18, 2023
@FuchenUSTC

@tmbdev I found that the same error is raised when I read many tar files, each containing only a single video/image. If I merge those videos/images into tar files with ~100 files each, the error disappears. Maybe there should be a test for such a case (many tars, each containing only one file).

@tmbdev
Collaborator

tmbdev commented Apr 30, 2023

File names inside tar files must be unique across .tar files. That's because webdataset considers the entire collection of tar files to be the dataset and requires unique file names across it.

WebDataset notices non-unique file names when you happen to shuffle shards in such a way that two tar files containing identical file names end up next to each other.

When you put all files into a single tar file, the shard shuffle won't shuffle the files, so if you happen to have picked an order in which identical file names are not next to one another, you won't get this error.

We could consider tar file boundaries to be sample boundaries, but having duplicate keys still causes problems elsewhere, e.g. with sample caching.

You can find duplicates using tar tf shards-*.tar | sort | uniq -d

Todo: maybe we need an explicit checker for this.
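A minimal sketch of such a checker, using only the Python standard library (the script name and CLI usage are assumptions, not part of WebDataset):

import sys
import tarfile
from collections import Counter

def find_duplicate_names(shard_paths):
    # count every regular-file member name across all shards
    counts = Counter()
    for path in shard_paths:
        with tarfile.open(path) as tf:
            counts.update(m.name for m in tf if m.isfile())
    return sorted(name for name, n in counts.items() if n > 1)

if __name__ == "__main__":
    # usage: python check_duplicates.py shards-*.tar
    for name in find_duplicate_names(sys.argv[1:]):
        print(name)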

@piEsposito

@tmbdev I've opened #327 proposing a change that I believe solves this problem in a harmless way - at least it did for my specific use case.

@jpc
Contributor

jpc commented Jun 9, 2024

I think I've seen this error when I had one shard containing just a single sample (in two files) and I was doing infinite reshuffling. I think it randomly selected the same shard twice in a row and ended up throwing this error, despite all the file names being unique.

Tomorrow I'll try to reproduce this with a dataset containing a single sample in a single shard; it should fail every time.
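A sketch of that repro, assuming resampling can draw the same shard twice in a row (the shard path is hypothetical):

import webdataset as wds

dataset = wds.DataPipeline(
    wds.ResampledShards("single-sample.tar"),  # hypothetical one-sample shard
    wds.tarfile_to_samples(),
)

# with a single shard, every draw repeats the same file names back to back,
# so per the hypothesis above this loop should hit the duplicate-name ValueError
for i, sample in enumerate(dataset):
    print(i, sample["__key__"])
    if i >= 5:
        break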
