
Raise AttributeError when attribute is unknown for torchtext.data.dataset #835

Open
ToddMorrill opened this issue Jun 21, 2020 · 5 comments


@ToddMorrill

🐛 Bug

Describe the bug
In short, an empty generator is returned when calling __getattr__ with an unknown attribute on a torchtext.data.Dataset. Here is the code responsible for this. See a more complete explanation here: skorch-dev/skorch#605 (comment)

To Reproduce
Steps to reproduce the behavior:

import torch

from torchtext.datasets import IMDB
from torchtext.data import Field, LabelField
from transformers import BertTokenizer

MAX_SEQ_LEN = 512  # discard everything after this many tokens, for speed

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:MAX_SEQ_LEN - 2]
    return tokens

TEXT = Field(
    batch_first=True,
    use_vocab=False,
    tokenize=tokenize_and_cut,
    preprocessing=tokenizer.convert_tokens_to_ids,
    init_token=tokenizer.cls_token_id,
    eos_token=tokenizer.sep_token_id,
    pad_token=tokenizer.pad_token_id,
    unk_token=tokenizer.unk_token_id,
)

LABEL = LabelField(dtype=torch.int64)

ds_train, ds_test = IMDB.splits(TEXT, LABEL)
ds_train.shape # this results in an empty generator

Expected behavior
If the attribute is not known, an AttributeError should be raised (as prescribed by the Python docs). However, for that to happen, __getattr__ should not be a generator.
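A minimal, self-contained sketch of the kind of fix this implies (the `Dataset` class below is a toy stand-in, not torchtext's actual implementation): because the real `__getattr__` contains `yield`, Python turns the whole method into a generator function, so calling it never raises. Returning a generator *expression* for known fields instead lets unknown attributes fall through to `AttributeError`:

```python
from types import SimpleNamespace

class Dataset:
    """Toy stand-in for torchtext.data.Dataset, to illustrate the fix."""

    def __init__(self, examples, fields):
        self.examples = examples
        self.fields = fields

    # A method containing `yield` always returns a generator and never
    # raises on its own. Returning a generator expression keeps the lazy
    # iteration for known fields while restoring AttributeError semantics.
    def __getattr__(self, attr):
        if attr in self.fields:
            return (getattr(x, attr) for x in self.examples)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {attr!r}")

examples = [SimpleNamespace(text="good movie", label="pos")]
ds = Dataset(examples, fields={"text": None, "label": None})

print(list(ds.text))  # ['good movie']
try:
    ds.shape
except AttributeError as e:
    print(e)  # 'Dataset' object has no attribute 'shape'
```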

Environment

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.5.0
[pip3] torchtext==0.6.0
[conda] Could not collect

Additional context
The goal here is to be able to integrate torchtext with skorch and sklearn's RandomizedSearchCV. Here's a great example of torchtext working with skorch. We want to go one step further and make it work with RandomizedSearchCV too.

Please help make all these great tools work together :)

@zhangguanheng66
Contributor

Could you try our new IMDB dataset in torchtext.experimental.datasets?

@ToddMorrill
Author

Ok, I just took a look at this. The short answer is that something like ds_train.shape does raise an AttributeError as expected. ✅
Here's the code and output:

ds_train.shape
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-29-cb86048069af> in <module>
----> 1 ds_train.shape

AttributeError: 'TextClassificationDataset' object has no attribute 'shape'

The longer answer is that I tried to actually use this new dataset in this example. Is there any documentation on how these new experimental datasets play with torchtext.data.Field and torchtext.data.Iterator? Here's the result of trying to plug this dataset into torchtext.data.BucketIterator.

train_iter = BucketIterator(ds_train, 16, sort=False, sort_key=lambda x: len(x.text))
next(iter(train_iter))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-cef5238e265d> in <module>
      1 train_iter = BucketIterator(ds_train, 16, sort=False, sort_key=lambda x: len(x.text))
----> 2 next(iter(train_iter))

~/Documents/regtech/.venv/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)
    154                     else:
    155                         minibatch.sort(key=self.sort_key, reverse=True)
--> 156                 yield Batch(minibatch, self.dataset, self.device)
    157             if not self.repeat:
    158                 return

~/Documents/regtech/.venv/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)
     23             self.batch_size = len(data)
     24             self.dataset = dataset
---> 25             self.fields = dataset.fields.keys()  # copy field names
     26             self.input_fields = [k for k, v in dataset.fields.items() if
     27                                  v is not None and not v.is_target]

AttributeError: 'TextClassificationDataset' object has no attribute 'fields'

@bentrevett
Contributor

I believe the intention is that the new experimental datasets will not use the old torchtext iterators and will instead use the standard torch.utils.data.DataLoader.
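As a sketch of what that looks like: a DataLoader with a collate_fn replaces the torchtext iterator. The toy `data` list below stands in for an experimental dataset, and the (label, token_ids) item ordering is an assumption here; check the dataset you actually use:

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for an experimental dataset: a plain map-style dataset
# whose items are (label, token_id_tensor) pairs of varying lengths.
data = [(1, torch.tensor([101, 2307, 3185, 102])),
        (0, torch.tensor([101, 2919, 102]))]

def collate(batch):
    # Stack labels into a tensor and pad the variable-length texts
    # to the longest sequence in the batch.
    labels = torch.tensor([label for label, _ in batch])
    texts = pad_sequence([text for _, text in batch],
                         batch_first=True, padding_value=0)
    return labels, texts

loader = DataLoader(data, batch_size=2, shuffle=False, collate_fn=collate)
labels, texts = next(iter(loader))
print(labels.shape, texts.shape)  # torch.Size([2]) torch.Size([2, 4])
```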

@ToddMorrill
Author

That makes sense. I just stumbled upon this, which sheds some more light on the plans.

I do hope that the current great functionality of torchtext will be ported over to these new design patterns. For example, the BucketIterator is quite handy. I hope there's a way to provide this out-of-the-box.

Finally, it looks like I'll need to hack around a little to port this code over to this new design pattern. I'll have to learn where everything lives (e.g. tokenizers, custom pad tokens, pretrained vocab/embeddings, etc.). Currently, everything is very torchtext.data.Field-centric.
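In the meantime, BucketIterator-style length bucketing can be approximated on top of DataLoader's batch_sampler argument. The `bucket_batches` helper below is hypothetical, not part of torchtext: it sorts indices by length so each batch pads similar-length examples, then shuffles the batch order:

```python
import random

def bucket_batches(lengths, batch_size, shuffle=True, seed=0):
    """Group indices of similar-length examples into batches (to keep
    padding waste low), then shuffle the batch order for training."""
    order = sorted(range(len(lengths)), key=lengths.__getitem__)
    batches = [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches

lengths = [7, 3, 9, 3, 8, 2]
batches = bucket_batches(lengths, batch_size=2)
# Every index appears exactly once across the batches.
print(sorted(i for b in batches for i in b))  # [0, 1, 2, 3, 4, 5]
```

The result can then be passed to a loader as `DataLoader(dataset, batch_sampler=batches, collate_fn=...)`.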

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 22, 2020

Yes. We want users to adopt those building blocks, instead of the "black box".
Could you try the new experimental datasets and see if they solve your issue? Any feedback is appreciated.
