experimental.dataset WikiText2, WikiText103, PennTreeBank, WMTNewsCrawl #774
Conversation
rmz59 commented May 15, 2020 (edited)
- Update word language modeling datasets ([HELP WANTED] Re-write datasets in torchtext #742)
self.vocab = vocab
self.transforms = transforms
self.data = torch.cat(tuple(transforms(row) for row in data), axis=0)
To be consistent with the text classification datasets, please call the transform func in `__getitem__`.
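For context, the lazy pattern being suggested could look roughly like this. This is an illustrative sketch, not the PR's actual code; the class and argument names are assumptions:

```python
class LazyLanguageModelingDataset:
    """Sketch (hypothetical names): store raw rows, transform on access.

    Implements the minimal Dataset protocol (__getitem__/__len__), so it
    can be used wherever a torch.utils.data.Dataset-style object is expected.
    """

    def __init__(self, data, vocab, transforms):
        self.data = list(data)        # raw text rows, kept untransformed
        self.vocab = vocab
        self.transforms = transforms  # callable: raw row -> processed item

    def __getitem__(self, i):
        # The transform runs lazily, once per access, as in the
        # text classification datasets.
        return self.transforms(self.data[i])

    def __len__(self):
        return len(self.data)
```

Usage: `LazyLanguageModelingDataset(["hello world"], None, str.split)[0]` yields the tokenized row `["hello", "world"]` only when the item is accessed.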
It might be hard to call the transform func in `__getitem__`. For a dataset like WikiText2, the raw data `wiki.train.tokens` is stored as a multi-line txt file, and `self.__getitem__(i)` is expected to output the i-th token. Therefore, pre-processing is required to concatenate and tokenize the multi-line file. Can you advise if we need an additional preprocess func?
I can't move `transforms` into `__getitem__`, because the `tokenizer` must be applied before `__getitem__`. Otherwise, the unit test https://github.com/pytorch/text/blob/master/test/data/test_builtin_datasets.py#L54-L57 will fail.

Possible solutions:
- Move `tokenizer` out of `transforms`
- OR: split `transforms` into two parts: global transforms / token-level transforms.

Please advise how I can move forward. Thanks.
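The second option could be sketched like this (a hypothetical illustration of the split, not the PR's API; all names are assumptions):

```python
class SplitTransformDataset:
    """Sketch: eager global transform vs. lazy token-level transform.

    The global transform (e.g. tokenization) runs once up front,
    concatenating the multi-line file into one token stream so that
    dataset[i] can address the i-th token. The token-level transform
    (e.g. numericalization) stays lazy in __getitem__.
    """

    def __init__(self, raw_lines, global_transform, token_transform):
        # Global pass: tokenize every line and flatten into one stream.
        self.tokens = [tok for line in raw_lines
                       for tok in global_transform(line)]
        self.token_transform = token_transform

    def __getitem__(self, i):
        # Token-level pass is applied per item, at access time.
        return self.token_transform(self.tokens[i])

    def __len__(self):
        return len(self.tokens)
```

This keeps the `__getitem__`-returns-the-i-th-token contract (which the existing unit test relies on) while still deferring part of the transform pipeline.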
Looks good to me and very close to the dataset abstraction. Only made a few suggestions for changes.
Could you also consolidate `WMTNewsCrawl` here into the word language modeling datasets?
@@ -210,14 +159,13 @@ def PennTreebank(*args, **kwargs):
    Separately returns the train/test/valid set

    Arguments:
        root: Directory where the datasets are saved. Default: ".data"
        vocab: Vocabulary used for dataset. If None, it will generate a new
Since you changed the order of `vocab` and `tokenizer`, will this be BC-breaking?
Yes, it's potentially BC-breaking if people pass args positionally instead of as kwargs. I restored the previous order of `vocab` and `tokenizer` in commit f433b40 above.
Great. Let me know when you are done with the revision. I will have another look.
I thought we agreed to go with `tokenizer`, `root`, `vocab`, in that order.
@z-Runmin Please kindly let us know if you need a review.

Not sure why unittest_windows failed to install.

@peterjc123 Could you take a look here? The master branch is green.
            token_id_3, token_id_1]).long()
>>> vocab = build_vocab_from_iterator([['language', 'modeling']])
>>> dataset = LanguageModelingDataset(data, vocab)
transforms: Text string transforms.
And add docs for `single_line`.
LGTM. Added only two comments. Once the CI tests are fixed, we can merge the PR. Then I will switch my BERT pipeline to the new datasets.
And could you add a test for the WMTNewsCrawl dataset, like `def test_penntreebank(self):` in text/test/data/test_builtin_datasets.py (line 79 in e709553)?
@z-Runmin @zhangguanheng66 You might need to rebase your branch on master.

Looks like the wmt dataset is so large that it couldn't be downloaded within the time limit (30 min).

OK, let me remove the test.
Added some minor changes. Will merge after CI tests pass.