
__getitem()__ not implemented? #1524

Open
ShairozS opened this issue Jan 15, 2022 · 5 comments

Comments

ShairozS commented Jan 15, 2022

❓ Questions and Help

Description

For some reason, calling __getitem__() on the torchtext Multi30k dataset raises a NotImplementedError for me, even though the dataset downloads correctly and calling next(iter(m30k)) on it returns valid output. Can someone help me understand this? I need the method because I'm wrapping the dataset in a larger dataset class and will have to call __getitem__() explicitly to perform joint pre-processing with other dataset products.

Sample

m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de'))
m30k.__getitem__(0)

ShairozS changed the title from "__getitem__ not implemented?" to "__getitem()__ not implemented?" on Jan 15, 2022
erip (Contributor) commented Jan 16, 2022

Multi30k (like all torchtext datasets) is iterable-style and therefore does not implement __getitem__. You can convert it to a map-style dataset (which does implement __getitem__) with torchtext.data.functional.to_map_style_dataset:

>>> import torchtext
>>> m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de'))
>>> map_m30k = torchtext.data.functional.to_map_style_dataset(m30k)
>>> map_m30k[0]
('A man in an orange hat starring at something.\n', 'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.\n')
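
For intuition, the conversion above amounts to exhausting the iterable into an in-memory sequence that supports indexing. The sketch below is illustrative only, not torchtext's actual implementation of to_map_style_dataset:

```python
# Illustrative sketch (NOT torchtext's code): wrap an iterable-style
# dataset so it supports __getitem__ by materializing it in memory.
class _MapStyle:
    def __init__(self, iterable):
        self._data = list(iterable)  # materializes the whole dataset

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)

pairs = _MapStyle((f"src {i}", f"tgt {i}") for i in range(3))
print(pairs[0])    # ('src 0', 'tgt 0')
print(len(pairs))  # 3
```

Note the trade-off this implies: the full dataset now lives in memory, which is exactly what the iterable style was avoiding.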

parmeet (Contributor) commented Jan 16, 2022

Thanks @erip for the reply. Please note that to_map_style_dataset is experimental functionality (its source file is marked "This file contains experimental functionality."). Also, in light of the migration (#1494), I wonder if there is a better way to do this such that we do not lose the datapipe properties. @ejguan, is there a provision to support __getitem__ for a datapipe (even if it means materializing the whole dataset)?

ejguan (Contributor) commented Jan 18, 2022

It's doable using MapDataPipe, but that's a different concept. It's currently a second-class citizen in TorchData, as we recommend IterDataPipe to favor streaming, especially for large datasets.

And there are ways to convert from MapDataPipe to IterDataPipe, or to combine the two types of DataPipes, like https://github.com/pytorch/data/blob/24c25c030e1fce6c75c41b18c837c049a14410f1/torchdata/datapipes/iter/util/combining.py#L90

Tbh, we don't have a plan to add __getitem__ to IterDataPipe, as __getitem__ doesn't fit the streaming model: it would always require materializing the data or objects for each index within the DataPipe instance.
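
To see why random access clashes with streaming, consider this generic Python sketch (not TorchData code): indexing into a one-shot stream forces you to either buffer everything read so far or restart the stream on each call.

```python
# Sketch (NOT TorchData): a naive __getitem__ over a streaming source
# must buffer every element up to the requested index.
class StreamIndexer:
    def __init__(self, make_stream):
        self._make_stream = make_stream  # factory, so the stream could be restarted
        self._cache = []
        self._it = None

    def __getitem__(self, idx):
        if self._it is None:
            self._it = self._make_stream()
        while len(self._cache) <= idx:
            try:
                self._cache.append(next(self._it))  # materialize as we go
            except StopIteration:
                raise IndexError(idx)
        return self._cache[idx]

lines = StreamIndexer(lambda: iter(["a", "b", "c"]))
print(lines[2])  # 'c' -- but items 0 and 1 had to be read and buffered first
```

This is the materialization cost ejguan refers to: for a dataset that doesn't fit in memory, such an index is either slow or impossible.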

ShairozS (Author) commented

Can I understand the reasoning behind implementing torchtext datasets as iterable-style instead of map-style? Many significantly larger image datasets (such as ImageNet and CIFAR-10) are implemented as map-style in torchvision (and loading the entire dataset into memory is not a requirement of map-style anyway), and I'm not sure why batch size would be element-dependent in this case. Those are really the only two cases where convention suggests an iterable-style dataset should be used.
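
For reference, the two styles PyTorch distinguishes differ only in which protocol they implement; here is a minimal stand-in for each (illustrative only; real code would subclass torch.utils.data.Dataset / IterableDataset):

```python
# Minimal stand-ins for the two dataset styles (illustrative, no torch dependency).
class MapStyle:
    def __init__(self, items):
        self.items = items

    def __getitem__(self, idx):  # random access: index-based shuffling/sampling works
        return self.items[idx]

    def __len__(self):
        return len(self.items)

class IterableStyle:
    def __init__(self, source):
        self.source = source     # e.g. a factory over a file or network stream

    def __iter__(self):          # sequential access only; no __getitem__, no __len__
        yield from self.source()

m = MapStyle(["x", "y"])
it = IterableStyle(lambda: ["x", "y"])
print(m[1], list(it))  # y ['x', 'y']
```

Map-style requires that the elements be addressable by index (on disk or in memory); iterable-style only requires that they can be produced in order, which is why it suits streamed sources.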

erip (Contributor) commented Jan 24, 2022

Batch size can certainly be element-dependent in NLP, where you may want to form batches based on the length of examples (as in max-token post-pad batching).
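
A length-dependent batching scheme of the kind mentioned above might look like this sketch (illustrative only, not a torchtext API): pack examples so that batch_size × longest_example stays under a token budget.

```python
# Sketch of max-token batching (illustrative, not torchtext code):
# each batch's padded size (num examples * longest example) stays <= max_tokens.
def max_token_batches(examples, max_tokens):
    examples = sorted(examples, key=len)  # group similar lengths to reduce padding
    batches, batch, longest = [], [], 0
    for ex in examples:
        longest = max(longest, len(ex))
        if batch and longest * (len(batch) + 1) > max_tokens:
            batches.append(batch)
            batch, longest = [], len(ex)
        batch.append(ex)
    if batch:
        batches.append(batch)
    return batches

sents = [["a"], ["b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]]
for b in max_token_batches(sents, max_tokens=6):
    print(len(b), max(len(x) for x in b))  # 2 2 / 1 3 / 1 4
```

The batch sizes (2, 1, 1) vary with example length, which is exactly the property that makes a fixed index-to-batch mapping (and hence pure map-style consumption) awkward here.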

Some datasets in torchtext are modestly sized, but others (like CC100, coming soon) are significantly larger, and iterable-style is the only realistic way to consume them. Additionally, datapipes in the PyTorch ecosystem prefer iterable-style, which enables slightly cleaner, intent-revealing semantics at the dataset level (vs. at the loader level).
