
Re-write IMDB dataset in torchtext.experimental.datasets #651

Merged (119 commits) on Dec 4, 2019

Conversation

@zhangguanheng66 (Contributor) commented Nov 26, 2019

Re-write the IMDB dataset. See #624 for motivation and the API change. To load the new IMDB data with torch.utils.data.DataLoader:

# Load a single dataset
from torchtext.experimental.datasets import IMDB
train1, = IMDB(data_select='train')

# Generate batches of 64 examples with torch.utils.data.DataLoader
import torch
from torch.utils.data import DataLoader

def generate_rows(data):
    label_list = []
    txt_list = []
    for label, txt in data:
        label_list.append(label)
        txt_list.append(txt)
    return txt_list, label_list

dataloader = DataLoader(train1, batch_size=64, num_workers=4, collate_fn=generate_rows)
for txt, label in dataloader:
    # Send the batch to the model.
    pass
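Since the text entries in a batch generally have different lengths, a typical next step is to pad them into a single tensor inside the collate function before calling the model. A minimal, hypothetical sketch (not part of this PR), assuming each example is a (label, txt) pair with an integer label and a 1-D LongTensor of token ids:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_batch(data):
    # data is a list of (label, txt) pairs; txt is assumed to be a 1-D
    # LongTensor of token ids and label an integer class id.
    labels = torch.tensor([label for label, txt in data])
    txts = pad_sequence([txt for label, txt in data], batch_first=True)
    return txts, labels

padded_loader = DataLoader(train1, batch_size=64, num_workers=4, collate_fn=pad_batch)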

@zhangguanheng66 zhangguanheng66 changed the title Moving the IMDB dataset under TextClassification Move TextClassification datasets to torch.prototype.datasets plus IMDB dataset Dec 2, 2019
@zhangguanheng66 zhangguanheng66 changed the title Move TextClassification datasets to torch.prototype.datasets plus IMDB dataset [BC Breaking]Move TextClassification datasets to torchtext.prototype.datasets plus IMDB dataset Dec 3, 2019
return self._vocab


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
Contributor

I think this can be combined with _generate_imdb_data_iterators and abstracted further by allowing an iters_group to be passed in as an argument.

Contributor Author

fixed

if isinstance(data_select, str):
data_select = [data_select]
if dataset_name == 'IMDB':
return _setup_datasets(_generate_imdb_data_iterators(dataset_name, root, ngrams,
Contributor

The only difference here is _generate_imdb_data_iterators vs. _generate_data_iterators. If we create a more generic version of these two functions that can accept an iters_group, we should be able to write less code.
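A rough sketch of what the suggested refactor might look like (the name and signature below are assumptions, not code from this PR): one shared helper builds the per-split iterators, and the common setup routine then consumes the resulting iters_group.

def _build_iters_group(iterator_fn, data_select, root, ngrams, tokenizer):
    # iterator_fn is the dataset-specific raw-example factory (e.g. an
    # IMDB-style or CSV-style iterator), so IMDB and the TextClassification
    # datasets can share one code path.
    return {split: iterator_fn(split, root, ngrams, tokenizer)
            for split in data_select}

_setup_datasets could then take the returned iters_group directly instead of each dataset defining its own _generate_*_data_iterators variant.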

Contributor Author

fixed.

@@ -1,7 +1,7 @@
# This is an example to create a text classification dataset and train a sentiment model.

In the basic case, users can train the sentiment model in model.py with
AG_NEWS dataset in torchtext.datasets.text_classification. The dataset is
Contributor

Is this meant to indicate that the interface is in pre-release and will evolve?

Contributor Author (Dec 3, 2019)

We need to fix it in this release, as all the text classification datasets have been moved to the prototype folder.

Contributor

@vincentqb - we talked a bit offline and thought it would be a good idea to move these new pieces into a preview or prototype folder first (even though we think they're much better), since the changes are quite drastic. What do you think?

Contributor

Are the previous datasets still being deprecated? Is the new interface finalized?

If yes and yes, then why delay? :)

Contributor

The current datasets won't be deprecated during this release; instead, we'll introduce a new prototype section.

The new interface is not finalized, and we've noticed various inconsistencies bubbling up.

Contributor Author

The prototype folder allows us to evolve the interface without a lot of pressure from BC breaking. We can improve the datasets API while receiving feedback from OSS users. Once the datasets have matured, we can move them to the main folder.

path['test'] = fname

iters_group = {}
if 'train' in data_select:
Contributor

Both "_imdb_iterator" and "_csv_iterator" have a similar interface, except that _csv_iterator is passed a path. Instead of accepting a path['train'] it could simply accept 'train' like _imdb_iterator does. Then you only need to create an iterator via something like _gen_imdb_iterator and all of this branching etc. can disappear.

@zhangguanheng66 zhangguanheng66 changed the title [BC Breaking]Move TextClassification datasets to torchtext.prototype.datasets plus IMDB dataset Re-write IMDB dataset in torchtext.prototype.datasets Dec 3, 2019
@@ -160,10 +178,21 @@ def AG_NEWS(*args, **kwargs):
Default: 1
vocab: Vocabulary used for dataset. If None, it will generate a new
vocabulary based on the train data set.
include_unk: include unknown token in the data (Default: False)
removed_tokens: removed tokens from output dataset (Default: [])
Contributor

Are these changes BC breaking since they'll live outside the experimental folder after all?

Contributor

Could you break out these BC breaking changes, which are unrelated to IMDB, into a separate PR?

yield ngrams_iterator(tokenizer(f.read()), ngrams)


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
Contributor

I don't think you need to repeat all of this; you can reuse it from the torchtext/datasets/text_classification file.
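For example, the existing helpers could simply be imported rather than redefined (a sketch of the suggestion; the exact symbol names are assumptions, not taken from the PR):

# Reuse the helpers defined in torchtext/datasets/text_classification.py
# instead of duplicating them here (symbol names assumed).
from torchtext.datasets.text_classification import _csv_iterator, _setup_datasets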


__version__ = '0.4.0'

__all__ = ['data',
'datasets',
'utils',
'vocab',
'legacy']
'legacy',
Contributor

You probably need to delete this

@cpuhrsch (Contributor) left a comment

LGTM - see one comment, then we can merge

@cpuhrsch cpuhrsch changed the title Re-write IMDB dataset in torchtext.prototype.datasets Re-write IMDB dataset in torchtext.experimental.datasets Dec 4, 2019
@cpuhrsch cpuhrsch merged commit 9776c53 into pytorch:master Dec 4, 2019