
__getitem()__ not implemented? #1524

Open
ShairozS opened this issue Jan 15, 2022 · 5 comments

Comments

ShairozS commented Jan 15, 2022

❓ Questions and Help

Description

For some reason, calling __getitem__() on the torchtext Multi30k dataset raises a NotImplementedError for me, even though the dataset downloads correctly and calling next(iter(m30k)) on it returns valid output. Can someone help me understand this? I need the method because I'm wrapping the dataset in a larger dataset class and will have to call __getitem__() explicitly to perform joint pre-processing with other dataset products.

Sample

m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de'))
m30k.__getitem__(0)

ShairozS changed the title from "__getitem__ not implemented?" to "__getitem()__ not implemented?" on Jan 15, 2022
erip (Contributor) commented Jan 16, 2022

Multi30k (like all torchtext datasets) is iterable-style and therefore does not implement __getitem__. You can convert it to a map-style dataset (which does implement __getitem__) with torchtext.data.functional.to_map_style_dataset:

>>> import torchtext
>>> m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de'))
>>> map_m30k = torchtext.data.functional.to_map_style_dataset(m30k)
>>> map_m30k[0]
('A man in an orange hat starring at something.\n', 'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.\n')
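
For intuition, the conversion above amounts to exhausting the iterable into an in-memory sequence that supports indexing. The sketch below is illustrative only, not torchtext's actual implementation of to_map_style_dataset:

```python
# Illustrative sketch (NOT torchtext's code): wrap an iterable-style
# dataset so it supports __getitem__ by materializing it in memory.
class _MapStyle:
    def __init__(self, iterable):
        self._data = list(iterable)  # materializes the whole dataset

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)

pairs = _MapStyle((f"src {i}", f"tgt {i}") for i in range(3))
print(pairs[0])    # ('src 0', 'tgt 0')
print(len(pairs))  # 3
```

Note the trade-off this implies: the full dataset now lives in memory, which is exactly what the iterable style was avoiding.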

parmeet (Contributor) commented Jan 16, 2022

Thanks @erip for the reply. Please note that to_map_style_dataset is experimental functionality (its source file is marked "This file contains experimental functionality."). Also, in light of the migration (#1494), I wonder if there is a better way to do this such that we do not lose the datapipe properties. @ejguan, is there a provision to support __getitem__ for a datapipe (even if it means materializing the whole dataset)?

ejguan (Contributor) commented Jan 18, 2022

It's doable using MapDataPipe, but that's a different concept. It's currently a second-class citizen in TorchData, as we recommend IterDataPipe to favor streaming, especially for large datasets.

And there are ways to convert from MapDataPipe to IterDataPipe, or to combine the two types of DataPipes, like https://github.com/pytorch/data/blob/24c25c030e1fce6c75c41b18c837c049a14410f1/torchdata/datapipes/iter/util/combining.py#L90

Tbh, we don't have a plan to add __getitem__ to IterDataPipe, as __getitem__ doesn't fit the streaming model: it would always require materializing the data or objects for each index within the DataPipe instance.
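
To see why random access clashes with streaming, consider this generic Python sketch (not TorchData code): indexing into a one-shot stream forces you to either buffer everything read so far or restart the stream on each call.

```python
# Sketch (NOT TorchData): a naive __getitem__ over a streaming source
# must buffer every element up to the requested index.
class StreamIndexer:
    def __init__(self, make_stream):
        self._make_stream = make_stream  # factory, so the stream could be restarted
        self._cache = []
        self._it = None

    def __getitem__(self, idx):
        if self._it is None:
            self._it = self._make_stream()
        while len(self._cache) <= idx:
            try:
                self._cache.append(next(self._it))  # materialize as we go
            except StopIteration:
                raise IndexError(idx)
        return self._cache[idx]

lines = StreamIndexer(lambda: iter(["a", "b", "c"]))
print(lines[2])  # 'c' -- but items 0 and 1 had to be read and buffered first
```

This is the materialization cost ejguan refers to: for a dataset that doesn't fit in memory, such an index is either slow or impossible.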

ShairozS (Author) commented

Can I understand the reasoning behind implementing torchtext datasets as iterable-style instead of map-style? Many significantly larger image datasets (such as ImageNet and CIFAR-10) are implemented as map-style in torchvision (and loading the entire dataset into memory is not a requirement of map-style anyway), and I'm not sure why batch size would be element-dependent in this case. Those are really the only two cases where convention suggests an iterable-style dataset should be used.
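
For reference, the two styles PyTorch distinguishes differ only in which protocol they implement; here is a minimal stand-in for each (illustrative only; real code would subclass torch.utils.data.Dataset / IterableDataset):

```python
# Minimal stand-ins for the two dataset styles (illustrative, no torch dependency).
class MapStyle:
    def __init__(self, items):
        self.items = items

    def __getitem__(self, idx):  # random access: index-based shuffling/sampling works
        return self.items[idx]

    def __len__(self):
        return len(self.items)

class IterableStyle:
    def __init__(self, source):
        self.source = source     # e.g. a factory over a file or network stream

    def __iter__(self):          # sequential access only; no __getitem__, no __len__
        yield from self.source()

m = MapStyle(["x", "y"])
it = IterableStyle(lambda: ["x", "y"])
print(m[1], list(it))  # y ['x', 'y']
```

Map-style requires that the elements be addressable by index (on disk or in memory); iterable-style only requires that they can be produced in order, which is why it suits streamed sources.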

erip (Contributor) commented Jan 24, 2022

Batch size can certainly be element-dependent in NLP, where you may want to form batches based on the length of examples (as in max-token post-pad batching).
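
A length-dependent batching scheme of the kind mentioned above might look like this sketch (illustrative only, not a torchtext API): pack examples so that batch_size × longest_example stays under a token budget.

```python
# Sketch of max-token batching (illustrative, not torchtext code):
# each batch's padded size (num examples * longest example) stays <= max_tokens.
def max_token_batches(examples, max_tokens):
    examples = sorted(examples, key=len)  # group similar lengths to reduce padding
    batches, batch, longest = [], [], 0
    for ex in examples:
        longest = max(longest, len(ex))
        if batch and longest * (len(batch) + 1) > max_tokens:
            batches.append(batch)
            batch, longest = [], len(ex)
        batch.append(ex)
    if batch:
        batches.append(batch)
    return batches

sents = [["a"], ["b", "c"], ["d", "e", "f"], ["g", "h", "i", "j"]]
for b in max_token_batches(sents, max_tokens=6):
    print(len(b), max(len(x) for x in b))  # 2 2 / 1 3 / 1 4
```

The batch sizes (2, 1, 1) vary with example length, which is exactly the property that makes a fixed index-to-batch mapping (and hence pure map-style consumption) awkward here.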

Some datasets in torchtext are modestly sized, but others (like CC100, coming soon) are significantly larger, and iterable-style is the only realistic way to consume them. Additionally, datapipes in the PyTorch ecosystem prefer iterable-style, which enables slightly cleaner, intent-revealing semantics at the dataset level (vs. at the loader level).
