__getitem__() not implemented? #1524
Comments
>>> import torchtext
>>> m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de'))
>>> map_m30k = torchtext.data.functional.to_map_style_dataset(m30k)
>>> map_m30k[0]
('A man in an orange hat starring at something.\n', 'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.\n')
Thanks @erip for the reply. Please note that this is experimental functionality (see text/torchtext/data/functional.py, line 15 at a2ab974). Is there a plan to support __getitem__ for datapipes (even if it means materializing the whole dataset)?
It's doable using to_map_style_dataset. And there are ways to convert from iterable-style to map-style. Tbh, we don't have a plan to add __getitem__ to the datapipe-based datasets.
Can I understand the reasoning behind implementing torchtext datasets as iterable-style instead of map-style? Many significantly larger image datasets (such as ImageNet and CIFAR-10) are implemented as map-style in torchvision (indeed, loading the entire dataset into memory is not a requirement of the map-style interface anyway), and I'm not sure why batch size would be element-dependent in this case. Those are really the only two cases where convention dictates that an iterable-style dataset be used.
Batch size can certainly be element-dependent in NLP, where you may want to form batches based on the length of examples (like max-token post-pad batching). Some datasets in torchtext are modestly sized, but others (like CC100 soon) are significantly larger, and iterable-style is the only way to realistically consume them. Additionally, datapipes in the PyTorch ecosystem prefer iterable-style, which enables slightly cleaner, intent-revealing semantics at the dataset level (vs. at the loader level).
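To make the element-dependent batch size concrete, the max-token batching idea above can be sketched as a plain generator over (src, tgt) pairs, with no torchtext dependency (the function name `max_token_batches` and the whitespace tokenization are illustrative, not part of the library):

```python
def max_token_batches(pairs, max_tokens):
    """Group (src, tgt) pairs into batches whose total source-token
    count stays at or under max_tokens; batch size varies per element."""
    batch, count = [], 0
    for src, tgt in pairs:
        n = len(src.split())  # toy tokenizer: whitespace split
        if batch and count + n > max_tokens:
            yield batch
            batch, count = [], 0
        batch.append((src, tgt))
        count += n
    if batch:
        yield batch
```

Because the number of examples per batch depends on the examples themselves, this kind of batching is naturally expressed while streaming an iterable-style dataset, rather than by indexing a map-style one.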
❓ Questions and Help
Description
For some reason, calling __getitem__() on the torchtext Multi30k dataset raises a NotImplementedError for me, despite the dataset being properly downloaded and next(iter(...)) on it providing valid output. Can someone help me understand this? I need the method because I'm wrapping the dataset in a larger dataset class and will have to call __getitem__() explicitly to perform joint pre-processing with other dataset products.
Sample
m30k = torchtext.datasets.Multi30k(root='.\Data', split='test', language_pair=('en', 'de')) ; m30k.__getitem__(0)
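For reference, the workaround suggested in the replies is torchtext.data.functional.to_map_style_dataset. A minimal stand-in below illustrates the idea for wrapping any iterable-style dataset so it supports indexing (the class name MapStyleWrapper is hypothetical, and the assumption is that the conversion simply materializes the iterable into a list, as the replies note):

```python
class MapStyleWrapper:
    """Illustrative stand-in for to_map_style_dataset: materialize an
    iterable-style dataset so it supports __getitem__ and __len__."""
    def __init__(self, iter_data):
        self._data = list(iter_data)  # materializes the whole dataset in memory

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)
```

A wrapper like this (or the real helper) can then be composed into a larger Dataset class that indexes several sources jointly, at the cost of holding the materialized examples in memory.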