
Re-write IMDB dataset in torchtext.experimental.datasets #651

Merged (119 commits) on Dec 4, 2019

Conversation

@zhangguanheng66 (Contributor) commented Nov 26, 2019

Re-write the IMDB dataset. See #624 for motivation and the API change. To load the new IMDB data with torch.utils.data.DataLoader:

# Load a single dataset
from torchtext.experimental.datasets import IMDB
train1, = IMDB(data_select='train')

# Generate batches of 64 examples with torch.utils.data.DataLoader
import torch
from torch.utils.data import DataLoader

def generate_rows(data):
    label_list = []
    txt_list = []
    for label, txt in data:
        label_list.append(label)
        txt_list.append(txt)
    return txt_list, label_list

dataloader = DataLoader(train1, batch_size=64, num_workers=4, collate_fn=generate_rows)
for txt, label in dataloader:
    # Send the batch to the model.
    pass
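Since the text entries in a batch generally have different lengths, a typical next step is to pad them into a single tensor inside the collate function before calling the model. A minimal, hypothetical sketch (not part of this PR), assuming each example is a (label, txt) pair with an integer label and a 1-D LongTensor of token ids:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_batch(data):
    # data is a list of (label, txt) pairs; txt is assumed to be a 1-D
    # LongTensor of token ids and label an integer class id.
    labels = torch.tensor([label for label, txt in data])
    txts = pad_sequence([txt for label, txt in data], batch_first=True)
    return txts, labels

padded_loader = DataLoader(train1, batch_size=64, num_workers=4, collate_fn=pad_batch)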

@zhangguanheng66 zhangguanheng66 changed the title Moving the IMDB dataset under TextClassification Move TextClassification datasets to torch.prototype.datasets plus IMDB dataset Dec 2, 2019
@zhangguanheng66 zhangguanheng66 changed the title Move TextClassification datasets to torch.prototype.datasets plus IMDB dataset [BC Breaking]Move TextClassification datasets to torchtext.prototype.datasets plus IMDB dataset Dec 3, 2019
return self._vocab


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
Contributor

I think this can be combined with _generate_imdb_data_iterators and abstracted further by allowing an iters_group to be passed in as an argument.

Contributor Author

fixed

if isinstance(data_select, str):
data_select = [data_select]
if dataset_name == 'IMDB':
return _setup_datasets(_generate_imdb_data_iterators(dataset_name, root, ngrams,
Contributor

The only difference here is _generate_imdb_data_iterators vs. _generate_data_iterators. If we create a more generic version of these two functions that can accept an iters_group, we should be able to write less code.
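A rough sketch of what the suggested refactor might look like (the name and signature below are assumptions, not code from this PR): one shared helper builds the per-split iterators, and the common setup routine then consumes the resulting iters_group.

def _build_iters_group(iterator_fn, data_select, root, ngrams, tokenizer):
    # iterator_fn is the dataset-specific raw-example factory (e.g. an
    # IMDB-style or CSV-style iterator), so IMDB and the TextClassification
    # datasets can share one code path.
    return {split: iterator_fn(split, root, ngrams, tokenizer)
            for split in data_select}

_setup_datasets could then take the returned iters_group directly instead of each dataset defining its own _generate_*_data_iterators variant.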

Contributor Author

fixed.

@@ -1,7 +1,7 @@
# This is an example to create a text classification dataset and train a sentiment model.

In the basic case, users can train the sentiment model in model.py with
AG_NEWS dataset in torchtext.datasets.text_classification. The dataset is
Contributor

Is this meant to indicate that the interface is in pre-release and will evolve?

Contributor Author (Dec 3, 2019)

We need to fix it in this release, as all the text classification datasets have been moved to the prototype folder.

Contributor

@vincentqb - we talked a bit offline and thought it would be a good idea to move these new pieces into a preview or prototype folder first (even though we think they're much better), since the changes are quite drastic. What do you think?

Contributor

Are the previous datasets still being deprecated? Is the new interface finalized?

If yes and yes, then why delay? :)

Contributor

The current datasets won't be deprecated during this release; instead, we'll introduce a new prototype section.

The new interface is not finalized, and we've noticed various inconsistencies bubbling up.

Contributor Author

The prototype folder allows us to evolve the interface without a lot of pressure from BC breaking. We can improve the datasets API while receiving feedback from OSS users. Once the datasets have matured, we can move them to the main folder.

path['test'] = fname

iters_group = {}
if 'train' in data_select:
Contributor

Both "_imdb_iterator" and "_csv_iterator" have a similar interface, except that _csv_iterator is passed a path. Instead of accepting a path['train'] it could simply accept 'train' like _imdb_iterator does. Then you only need to create an iterator via something like _gen_imdb_iterator and all of this branching etc. can disappear.

@zhangguanheng66 zhangguanheng66 changed the title [BC Breaking]Move TextClassification datasets to torchtext.prototype.datasets plus IMDB dataset Re-write IMDB dataset in torchtext.prototype.datasets Dec 3, 2019
@@ -160,10 +178,21 @@ def AG_NEWS(*args, **kwargs):
Default: 1
vocab: Vocabulary used for dataset. If None, it will generate a new
vocabulary based on the train data set.
include_unk: include unknown token in the data (Default: False)
removed_tokens: removed tokens from output dataset (Default: [])
Contributor

Are these changes BC breaking since they'll live outside the experimental folder after all?

Contributor

Could you break out these BC breaking changes, which are unrelated to IMDB, into a separate PR?

yield ngrams_iterator(tokenizer(f.read()), ngrams)


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
Contributor

I don't think you need to repeat all of this; you can reuse it from the torchtext/datasets/text_classification file.
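For example, the existing helpers could simply be imported rather than redefined (a sketch of the suggestion; the exact symbol names are assumptions, not taken from the PR):

# Reuse the helpers defined in torchtext/datasets/text_classification.py
# instead of duplicating them here (symbol names assumed).
from torchtext.datasets.text_classification import _csv_iterator, _setup_datasets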


__version__ = '0.4.0'

__all__ = ['data',
'datasets',
'utils',
'vocab',
'legacy']
'legacy',
Contributor

You probably need to delete this

@cpuhrsch (Contributor) left a comment

LGTM - see one comment, then we can merge

@cpuhrsch cpuhrsch changed the title Re-write IMDB dataset in torchtext.prototype.datasets Re-write IMDB dataset in torchtext.experimental.datasets Dec 4, 2019
@cpuhrsch cpuhrsch merged commit 9776c53 into pytorch:master Dec 4, 2019