Re-write IMDB dataset in torchtext.experimental.datasets #651
Conversation
    return self._vocab


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
I think this can be combined with _generate_imdb_data_iterators and abstracted further by allowing an iters_group to be passed as an argument.
fixed
if isinstance(data_select, str):
    data_select = [data_select]
if dataset_name == 'IMDB':
    return _setup_datasets(_generate_imdb_data_iterators(dataset_name, root, ngrams,
The only difference here is _generate_imdb_data_iterators vs. _generate_data_iterators. If we create a more generic version of these two functions that accepts an iters_group, we should be able to write less code.
fixed.
@@ -1,7 +1,7 @@
# This is an example to create a text classification dataset and train a sentiment model.

In the basic case, users can train the sentiment model in model.py with
AG_NEWS dataset in torchtext.datasets.text_classification. The dataset is
Is this meant to indicate that the interface is in pre-release and will evolve?
We need to fix it in this release, as all the text classification datasets are moved to the prototype folder.
@vincentqb - we talked a bit offline and thought it should be a good idea to move these new pieces into a preview or prototype folder first (even though we think they're much better) since the changes are quite drastic. What do you think?
Are the previous datasets still being deprecated? Is the new interface finalized?
If yes and yes, then why delay? :)
The current datasets won't be deprecated during this release; instead we'll introduce a new prototype section.
The new interface is not finalized, and we have noticed various inconsistencies bubbling up.
The prototype folder allows us to evolve the interface without a lot of pressure from BC breaking. We could improve the datasets API while receiving feedback from OSS users. Once the datasets mature, we could move them to the main folder.
path['test'] = fname

iters_group = {}
if 'train' in data_select:
Both _imdb_iterator and _csv_iterator have a similar interface, except that _csv_iterator is passed a path. Instead of accepting path['train'], it could simply accept 'train' like _imdb_iterator does. Then you only need to create an iterator via something like _gen_imdb_iterator, and all of this branching can disappear.
@@ -160,10 +178,21 @@ def AG_NEWS(*args, **kwargs):
        Default: 1
    vocab: Vocabulary used for dataset. If None, it will generate a new
        vocabulary based on the train data set.
    include_unk: include unknown token in the data (Default: False)
    removed_tokens: removed tokens from output dataset (Default: [])
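For context, a pure-Python sketch of what the two new options do to a token stream (an illustration of the documented behaviour, not torchtext's implementation; the function name is made up):

```python
def numericalize(tokens, vocab, include_unk=False, removed_tokens=()):
    # Drop any token listed in removed_tokens; map the rest to ids.
    # Out-of-vocabulary tokens map to '<unk>' only when include_unk=True,
    # otherwise they are silently skipped.
    ids = []
    for tok in tokens:
        if tok in removed_tokens:
            continue
        if tok in vocab:
            ids.append(vocab[tok])
        elif include_unk:
            ids.append(vocab['<unk>'])
    return ids
```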
Are these changes BC breaking since they'll live outside the experimental folder after all?
Could you break out these BC breaking changes, which are unrelated to IMDB, into a separate PR?
        yield ngrams_iterator(tokenizer(f.read()), ngrams)


def _generate_data_iterators(dataset_name, root, ngrams, tokenizer, data_select):
I don't think you need to repeat all of this; you can use it from the torchtext/datasets/text_classification file instead.
__version__ = '0.4.0'

__all__ = ['data',
           'datasets',
           'utils',
           'vocab',
-           'legacy']
+           'legacy',
You probably need to delete this
LGTM - see one comment, then we can merge
Re-write IMDB dataset. See #624 for motivation and API change. To load the new IMDB data with
torch.utils.data.DataLoader
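The description breaks off at DataLoader; a hedged sketch of the kind of collate function such a call would typically need (the function name and the (label, token_ids) element shape are assumptions, not the PR's actual code):

```python
def generate_batch(batch):
    # Each dataset element is assumed to be a (label, token_ids) pair.
    # Flatten the variable-length texts and record per-example offsets so
    # a model (e.g. one built on EmbeddingBag) can recover boundaries.
    labels = [label for label, _ in batch]
    texts = [ids for _, ids in batch]
    offsets = [0]
    for ids in texts[:-1]:
        offsets.append(offsets[-1] + len(ids))
    flat = [tok for ids in texts for tok in ids]
    return labels, flat, offsets

# Then, roughly:
#     from torch.utils.data import DataLoader
#     loader = DataLoader(train_dataset, batch_size=8,
#                         collate_fn=generate_batch)
```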