
Raise AttributeError when attribute is unknown for torchtext.data.dataset #835

Open
ToddMorrill opened this issue Jun 21, 2020 · 5 comments


@ToddMorrill

🐛 Bug

Describe the bug
In short, an empty generator is returned when calling __getattr__ with an unknown attribute on a torchtext.data.Dataset. Here is the code responsible for this. See a more complete explanation here: skorch-dev/skorch#605 (comment)

To Reproduce
Steps to reproduce the behavior:

import torch

from torchtext.datasets import IMDB
from torchtext.data import Field, LabelField
from transformers import BertTokenizer

MAX_SEQ_LEN = 512  # discard everything after this many tokens, for speed

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:MAX_SEQ_LEN - 2]
    return tokens

TEXT = Field(
    batch_first=True,
    use_vocab=False,
    tokenize=tokenize_and_cut,
    preprocessing=tokenizer.convert_tokens_to_ids,
    init_token=tokenizer.cls_token_id,
    eos_token=tokenizer.sep_token_id,
    pad_token=tokenizer.pad_token_id,
    unk_token=tokenizer.unk_token_id,
)

LABEL = LabelField(dtype=torch.int64)

ds_train, ds_test = IMDB.splits(TEXT, LABEL)
ds_train.shape # this results in an empty generator

Expected behavior
If the attribute is not known, an AttributeError should be raised (as prescribed by the Python docs). However, for that to happen, __getattr__ should not be a generator.
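A minimal, self-contained sketch of the kind of fix this implies (the `Dataset` class below is a toy stand-in, not torchtext's actual implementation): because the real `__getattr__` contains `yield`, Python turns the whole method into a generator function, so calling it never raises. Returning a generator *expression* for known fields instead lets unknown attributes fall through to `AttributeError`:

```python
from types import SimpleNamespace

class Dataset:
    """Toy stand-in for torchtext.data.Dataset, to illustrate the fix."""

    def __init__(self, examples, fields):
        self.examples = examples
        self.fields = fields

    # A method containing `yield` always returns a generator and never
    # raises on its own. Returning a generator expression keeps the lazy
    # iteration for known fields while restoring AttributeError semantics.
    def __getattr__(self, attr):
        if attr in self.fields:
            return (getattr(x, attr) for x in self.examples)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {attr!r}")

examples = [SimpleNamespace(text="good movie", label="pos")]
ds = Dataset(examples, fields={"text": None, "label": None})

print(list(ds.text))  # ['good movie']
try:
    ds.shape
except AttributeError as e:
    print(e)  # 'Dataset' object has no attribute 'shape'
```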

Environment

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0
[pip3] torchtext==0.6.0
[conda] Could not collect

Additional context
The goal here is to be able to integrate torchtext with skorch and sklearn's RandomizedSearchCV. Here's a great example of torchtext working with skorch. We want to go one step further and make it work with RandomizedSearchCV too.

Please help make all these great tools work together :)

@zhangguanheng66
Contributor

Could you try our new IMDB dataset in torchtext.experimental.datasets?

@ToddMorrill
Author

Ok, I just took a look at this. The short answer is that something like ds_train.shape does raise an AttributeError as expected. ✅
Here's the code and output:

ds_train.shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-cb86048069af> in <module>
----> 1 ds_train.shape

AttributeError: 'TextClassificationDataset' object has no attribute 'shape'

The longer answer is that I tried to actually use this new dataset in this example. Is there any documentation on how these new experimental datasets play with torchtext.data.Field and torchtext.data.Iterator? Here's the result of trying to plug this dataset into torchtext.data.BucketIterator.

train_iter = BucketIterator(ds_train, 16, sort=False, sort_key=lambda x: len(x.text))
next(iter(train_iter))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-cef5238e265d> in <module>
      1 train_iter = BucketIterator(ds_train, 16, sort=False, sort_key=lambda x: len(x.text))
----> 2 next(iter(train_iter))

~/Documents/regtech/.venv/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)
    154                     else:
    155                         minibatch.sort(key=self.sort_key, reverse=True)
--> 156                 yield Batch(minibatch, self.dataset, self.device)
    157             if not self.repeat:
    158                 return

~/Documents/regtech/.venv/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)
     23             self.batch_size = len(data)
     24             self.dataset = dataset
---> 25             self.fields = dataset.fields.keys()  # copy field names
     26             self.input_fields = [k for k, v in dataset.fields.items() if
     27                                  v is not None and not v.is_target]

AttributeError: 'TextClassificationDataset' object has no attribute 'fields'

@bentrevett
Contributor

I believe the intention is that the new experimental datasets will not use the old torchtext iterators and will instead use the standard torch.utils.data.DataLoader.
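As a sketch of what that looks like: a DataLoader with a collate_fn replaces the torchtext iterator. The toy `data` list below stands in for an experimental dataset, and the (label, token_ids) item ordering is an assumption here; check the dataset you actually use:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for an experimental dataset: a plain map-style dataset
# whose items are (label, token_id_tensor) pairs of varying lengths.
data = [(1, torch.tensor([101, 2307, 3185, 102])),
        (0, torch.tensor([101, 2919, 102]))]

def collate(batch):
    # Stack labels into a tensor and pad the variable-length texts
    # to the longest sequence in the batch.
    labels = torch.tensor([label for label, _ in batch])
    texts = pad_sequence([text for _, text in batch],
                         batch_first=True, padding_value=0)
    return labels, texts

loader = DataLoader(data, batch_size=2, shuffle=False, collate_fn=collate)
labels, texts = next(iter(loader))
print(labels.shape, texts.shape)  # torch.Size([2]) torch.Size([2, 4])
```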

@ToddMorrill
Author

That makes sense. I just stumbled upon this, which sheds some more light on the plans.

I do hope that the current great functionality of torchtext will be ported over to these new design patterns. For example, the BucketIterator is quite handy. I hope there's a way to provide this out-of-the-box.

Finally, it looks like I'll need to hack around a little to port this code over to this new design pattern. I'll have to learn where everything lives (e.g. tokenizers, custom pad tokens, pretrained vocab/embeddings, etc.). Currently, everything is very torchtext.data.Field-centric.
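In the meantime, BucketIterator-style length bucketing can be approximated on top of DataLoader's batch_sampler argument. The `bucket_batches` helper below is hypothetical, not part of torchtext: it sorts indices by length so each batch pads similar-length examples, then shuffles the batch order:

```python
import random

def bucket_batches(lengths, batch_size, shuffle=True, seed=0):
    """Group indices of similar-length examples into batches (to keep
    padding waste low), then shuffle the batch order for training."""
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    batches = [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches

lengths = [7, 3, 9, 3, 8, 2]
batches = bucket_batches(lengths, batch_size=2)
# Every index appears exactly once across the batches.
print(sorted(i for b in batches for i in b))  # [0, 1, 2, 3, 4, 5]
```

The result can then be passed to a loader as `DataLoader(dataset, batch_sampler=batches, collate_fn=...)`.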

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 22, 2020

Yes. We want users to adopt those building blocks, instead of the "black box".
Could you try the new experimental datasets and see if they solve your issue? Any feedback is appreciated.
