
experimental.dataset WikiText2, WikiText103, PennTreeBank, WMTNewsCrawl #774

Merged Jun 4, 2020 (28 commits)

Conversation

rmz59
Contributor

@rmz59 rmz59 commented May 15, 2020

@cpuhrsch cpuhrsch changed the title New language modeling experimental.dataset WikiText2, WikiText103, PennTreeBank May 15, 2020
@rmz59 rmz59 marked this pull request as draft May 15, 2020 16:34
@rmz59 rmz59 marked this pull request as ready for review May 15, 2020 17:09
self.vocab = vocab
self.transforms = transforms
self.data = torch.cat(tuple(transforms(row) for row in data), dim=0)
Contributor
To be consistent with the text classification datasets, please call the transform func in the __getitem__ func.

Contributor Author
It might be hard to call the transform func in __getitem__. For datasets like WikiText2, the raw data wiki.train.tokens is stored as a multi-line txt file, and self.__getitem__(i) is expected to output the i-th token. Therefore, pre-processing is required to concatenate and tokenize the multi-line file. Can you advise whether we need an additional preprocess func?
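The point above can be sketched as follows. This is a hypothetical toy class, not the actual torchtext implementation: the class name, the plain-dict vocab, and str.split as the tokenizer are all illustrative assumptions. It shows why the tokenization has to happen eagerly in __init__ so that __getitem__(i) can return the i-th token id of the flattened corpus.

```python
# Hypothetical sketch (NOT torchtext code): pre-process the multi-line raw
# text in __init__, so that __getitem__(i) can return the i-th token id.
import torch

class ToyLanguageModelingDataset(torch.utils.data.Dataset):
    def __init__(self, lines, vocab, tokenizer):
        # Tokenize every line and concatenate into one flat id tensor.
        ids = [vocab[token] for line in lines for token in tokenizer(line)]
        self.data = torch.tensor(ids).long()

    def __getitem__(self, i):
        # i-th token id of the whole corpus, across the original line breaks.
        return self.data[i]

    def __len__(self):
        return len(self.data)

lines = ["hello world", "hello there"]
vocab = {"hello": 0, "world": 1, "there": 2}
dataset = ToyLanguageModelingDataset(lines, vocab, str.split)
print(len(dataset))     # 4
print(int(dataset[0]))  # 0
```

If the transform were instead applied per item in __getitem__, indexing by token position would not be possible, because a single raw line maps to a variable number of tokens.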

Contributor Author
I can't move transforms into __getitem__, because the tokenizer must be applied before __getitem__. Otherwise, the unit test https://github.com/pytorch/text/blob/master/test/data/test_builtin_datasets.py#L54-L57 will fail.

Possible solutions:

  • Move the tokenizer out of transforms
  • Or: split transforms into two parts: global transforms and token-level transforms

Please advise how I can move forward. Thanks.
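The second option above could look roughly like this. It is a minimal sketch under the same toy assumptions as before (plain-dict vocab, str.split tokenizer, illustrative names): the global transform (tokenize + numericalize) runs once in __init__, while the token-level transform is deferred to __getitem__.

```python
# Hypothetical sketch (NOT torchtext code) of splitting transforms into a
# global part (applied once in __init__) and a token-level part (applied
# lazily in __getitem__).
import torch

class SplitTransformDataset(torch.utils.data.Dataset):
    def __init__(self, raw_lines, tokenizer, vocab, token_transform=None):
        # Global transform: tokenize + numericalize the whole corpus eagerly.
        ids = [vocab[tok] for line in raw_lines for tok in tokenizer(line)]
        self.data = torch.tensor(ids).long()
        # Token-level transform: deferred until an item is requested.
        self.token_transform = token_transform or (lambda x: x)

    def __getitem__(self, i):
        return self.token_transform(self.data[i])

    def __len__(self):
        return len(self.data)

vocab = {"a": 0, "b": 1}
ds = SplitTransformDataset(["a b a"], str.split, vocab,
                           token_transform=lambda t: t + 10)
print(int(ds[0]))  # 10
print(len(ds))     # 3
```

This keeps per-token transforms in __getitem__ (as the reviewer asked), while the tokenizer still runs before indexing, so position-based tests keep passing.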

Contributor

@zhangguanheng66 zhangguanheng66 left a comment
Looks good to me and very close to the dataset abstraction. Only made a few suggestions for changes.

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Could you also consolidate WMTNewsCrawl here into word language modeling datasets?

@rmz59 rmz59 marked this pull request as draft May 20, 2020 02:05
@@ -210,14 +159,13 @@ def PennTreebank(*args, **kwargs):
Separately returns the train/test/valid set

Arguments:
root: Directory where the datasets are saved. Default: ".data"
vocab: Vocabulary used for dataset. If None, it will generate a new
Contributor

Since you changed the order of vocab and tokenizer, will this be BC-breaking?

Contributor Author

Yes, it could be BC-breaking if people pass positional args instead of kwargs. I restored the previous order of vocab and tokenizer in commit f433b40 above.
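A toy illustration of the concern (not torchtext code; the function names are made up): reordering two parameters is invisible to keyword callers but silently swaps the arguments for positional callers.

```python
# Hypothetical before/after signatures to show why reordering parameters
# is BC-breaking for positional callers only.
def old_api(vocab, tokenizer):
    return vocab, tokenizer

def new_api(tokenizer, vocab):  # same params, order swapped
    return vocab, tokenizer

# Keyword callers are unaffected by the reordering:
assert old_api(vocab="v", tokenizer="t") == new_api(vocab="v", tokenizer="t")

# Positional callers silently get swapped arguments:
print(old_api("v", "t"))  # ('v', 't')
print(new_api("v", "t"))  # ('t', 'v') -- silently wrong, no error raised
```

Because the breakage is silent rather than an exception, restoring the original order (as done in f433b40) is the safer choice.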

Contributor

Great. Let me know when you are done with the revision. I will have another look.

Contributor

I thought we agreed to go with the order tokenizer, root, vocab.

@zhangguanheng66
Contributor

@z-Runmin Please kindly let us know if you need a review.

@rmz59 rmz59 marked this pull request as ready for review June 3, 2020 02:24
@rmz59
Contributor Author

rmz59 commented Jun 3, 2020

Not sure why unittest_windows failed to install torchtext

@zhangguanheng66 zhangguanheng66 changed the title experimental.dataset WikiText2, WikiText103, PennTreeBank experimental.dataset WikiText2, WikiText103, PennTreeBank, WMTNewsCrawl Jun 3, 2020
@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 3, 2020

Not sure why unittest_windows failed to install torchtext

@peterjc123 Could you take a look at here? The master branch is green.

token_id_3, token_id_1]).long()
>>> vocab = build_vocab_from_iterator([['language', 'modeling']])
>>> dataset = LanguageModelingDataset(data, vocab)
transforms: Text string transforms.

Contributor

And please add docs for single_line.

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

LGTM. I added only two comments. Once the CI tests are fixed, we can merge the PR. Then I will switch my BERT pipeline to the new datasets.

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

And could you add a test for the WMTNewsCrawl dataset? Something like:

def test_penntreebank(self):

@rmz59 rmz59 marked this pull request as draft June 4, 2020 00:39
@peterjc123
Contributor

Not sure why unittest_windows failed to install torchtext

@peterjc123 Could you take a look at here? The master branch is green.

@z-Runmin @zhangguanheng66 You might need to rebase your branch on master.

@rmz59 rmz59 marked this pull request as ready for review June 4, 2020 03:51
@peterjc123
Contributor

Looks like the wmt dataset is so large that it couldn't be downloaded within the time limit (30 min).

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 4, 2020

Looks like the wmt dataset is so large that it couldn't be downloaded within the time limit (30 min).

OK, let me remove the test.

Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Added some minor changes. Will merge after the CI tests pass.

@zhangguanheng66 zhangguanheng66 merged commit 6bca9f3 into pytorch:master Jun 4, 2020
zhangguanheng66 pushed a commit to zhangguanheng66/text that referenced this pull request Jun 5, 2020