Re-write language_modeling datasets (PennTreebank, WikiText103, WikiText2) #624
Conversation
```python
def _read_text_iterator(data_path, tokenizer):
```
These two functions look similar to what's in the classification datasets - is there a way to share this code via some simple abstractions?
I was thinking about that. The only difference is the label in the text classification datasets.
```python
def _create_data_from_iterator(vocab, iterator, include_unk):
    _data = []
    with tqdm(unit_scale=0, unit='lines') as t:
```
Is this noisy by default? Printing progress bars can cause a lot of noise.
I could remove it, but I saw some issues where users complained about no response while loading the data. They wanted to see the progress.
Let's make it quiet by default, because otherwise it'll spam the logs of programs running outside of an interactive environment.
Generally I think this looks good. We also need to write clear documentation that describes how to migrate to this new Dataset in comparison to the old one.
```python
yield tokens


def _create_data_from_iterator(vocab, iterator, include_unk):
```
This function could be unified as well.
Yes. I could write some instructions after the IMDB datasets.
torchtext/data/functional.py
Outdated
```diff
@@ -178,33 +176,31 @@ def read_text_iterator(path, tokenizer):
     yield tokens


-def create_data_from_iterator(vocab, iterator, include_unk):
-    r"""Create data from an token iterator.
+def create_data_from_iterator(vocab, iterator, removed_tokens=None):
```
Make sure to keep track of things that break BC for the release notes. As far as I can tell this does.
It's the first time we add `create_data_from_iterator` in `torch.data.functional`.
test/data/test_builtin_datasets.py
Outdated
```python
self.assertEqual(tokens_ids, [2, 285, 502, 699])

# Delete the dataset after we're done to save disk space on CI
if os.environ.get("TRAVIS") == "true":
```
This seems like a generic cleanup pattern. You should use the capabilities the test framework brings along for this.
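For instance, `unittest`'s `addCleanup` hook runs after a test whether it passes or fails, so the explicit `TRAVIS` environment check could be replaced with something like the following (a standalone sketch, not the actual torchtext test code):

```python
import os
import tempfile
import unittest

class DatasetCleanupTest(unittest.TestCase):
    def setUp(self):
        # Create a scratch file standing in for a downloaded dataset.
        fd, self.data_path = tempfile.mkstemp()
        os.close(fd)
        # Registered cleanups run even if the test fails, so no
        # CI-specific environment check is needed.
        self.addCleanup(os.remove, self.data_path)

    def test_dataset_file_exists(self):
        self.assertTrue(os.path.exists(self.data_path))

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(DatasetCleanupTest))
```

The cleanup then applies uniformly on every CI provider and on developer machines alike.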
```python
return (LanguageModelingDataset(torch.Tensor(train_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(test_data).long(), vocab),
        LanguageModelingDataset(torch.Tensor(valid_data).long(), vocab))
```
A "dataset" is therefore a list of 3 datasets corresponding to train/test/validation? I assume the division is dictated by the source itself, right?
Could the user ever want to just get one of train/test/validation? What do you think of letting the user get one at a time when building a dataset?
For reference: librispeech in audio simply lets the user choose which one to download.
That could be an option, with an extra input for it.
But it's kind of a convention to generate the three datasets together (when available), and most users will need all of them for training and inference.
What do you think of having a function returning each separately, and then wrapping all three with the API here?
Something like
```python
vocab = build_vocab(train)
train = _setup_datasets_part("train", vocab)
test = _setup_datasets_part("test", vocab)
valid = _setup_datasets_part("valid", vocab)
return train, test, valid
```
The flag is probably a good idea. Someone might only want to run validation or testing in a separate process. Either accept a string or a tuple of strings, I'd say.
We talked about this offline. The decision is to add a keyword argument (a tuple of strings) to identify which datasets will be generated. The default value is a tuple of ('train', 'test'). Users have the flexibility to choose a single dataset with a proper `vocab` object.
Ok.
In CommonVoice, the implementation doesn't know about the meaning of train/test/etc. Instead, the zip file contains many tsv files (e.g. for train, test, ...), and the user simply specifies which file to use.
An advantage is that the user could decide to create their own custom tsv file (say for training) and load that one.
Yes, that's kind of the same case for the translation dataset, so we need to provide an option to choose the file.
However, for most text datasets, the raw data files have been properly tagged with `train`, `test`, and `valid`.
Some datasets in audio even have "train_100", "train_500" etc. built in, see this. With only "train", "test", "valid", how would you like to deal with that here?
In my earlier response, I proposed adding three more flags, `train_filename`, `valid_filename`, `test_filename`, which give you the flexibility to choose a specific file for training. I think either way should work, and we just need to stay consistent across datasets within a domain.
```python
    vocab, read_text_iterator(valid_path, tokenizer), removed_tokens)
valid_data = []
for tokens in valid_iter:
    valid_data += tokens
```
nit: I don't know that it's valuable to simplify this block, but it'd be more readable by just adding a space between lines 91 and 92 :)
```python
def _setup_datasets(dataset_name, tokenizer=get_tokenizer("basic_english"),
                    root='.data', vocab=None, removed_tokens=['<unk>']):
    if dataset_name == 'PennTreebank':
```
If we use that function for future datasets, is there a more general strategy than just making `PennTreebank` a special case?
```python
    _data[item] += tokens

return tuple(LanguageModelingDataset(torch.tensor(_data[d]).long(), vocab)
             for d in data_select if _data[d] != [])
```
I think if `_data[d]` is empty we actually want to raise an error, because the data the user requested is empty, OR we simply return a dataset that loads nothing.
If the data is empty we might have downloaded something corrupted.
Done!
```python
if 'valid' in data_select:
    extracted_files.append(download_from_url(URLS['PennTreebank'][2], root=root))
else:
    dataset_tar = download_from_url(URLS[dataset_name], root=root)
```
Another thing we could do with these download functions is compare the result to a checksum. We added those capabilities in torchaudio. I think that might actually be worth it, if the checksum calculation is fast. If it's not fast, we might need to make this a flag. This could save a lot of grief and is something rather common. Should we do this now? cc @vincentqb
Having a checksum is a good idea. However, since the goal is to soonish put the download function in sync, we should avoid investing time optimizing this particular version. I'd recommend using the one we worked on recently in torchaudio instead.
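For illustration, a download-verification helper along these lines can hash the file in streaming fashion with Python's stdlib (a sketch only, not the torchaudio implementation being referred to):

```python
import hashlib
import os
import tempfile

def validate_file(path, expected_sha256):
    # Hash in 1 MiB chunks so large archives never need to fit in memory.
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256

# Demonstrate on a throwaway file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'dataset bytes')
ok = validate_file(path, hashlib.sha256(b'dataset bytes').hexdigest())
bad = validate_file(path, '0' * 64)
os.remove(path)
```

Streaming keeps the check cheap enough that it could run unconditionally after every download, with a mismatch signalling a corrupted archive.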
Yes we should copy this one into here as well. Do you want to create a PR?
@vincentqb @cpuhrsch I'm fine with putting one `download` function across domains if it doesn't break any current download activity.
I think it's worth testing this, since it's our goal down the road :)
Thanks for updating the description! It's much clearer. Could you also add a paragraph that contrasts the old and new datasets? As in, do we give up on any capabilities, how much does the API differ at each step, etc.
torchtext/data/functional.py
Outdated
```python
if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    tokens = list(filter(lambda x: x not in removed_tokens, tokens))
```
I think you can do this on the fly :). filter([...]) should be able to feed into iter and simply skip undesired items.
Fixed. Thanks.
Add a section to compare the differences at the very end. @cpuhrsch
torchtext/data/functional.py
Outdated
```python
if removed_tokens is None:
    yield iter(vocab[token] for token in tokens)
else:
    yield iter(vocab[token] for token in
```
You can write this more compactly as `iter(filter(lambda x: x is not None, [3, 4, None]))`
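That pattern is easy to check in isolation; `filter()` is lazy, so skipped items are never materialized into an intermediate list:

```python
# filter() yields only the items passing the predicate; wrapping it in
# iter() is effectively a no-op since filter objects are already iterators.
kept = list(iter(filter(lambda x: x is not None, [3, 4, None])))
```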
torchtext/data/functional.py
Outdated
```python
yield tokens


def create_data_from_iterator(vocab, iterator, removed_tokens=None):
```
Is the name of the function clear? `convert_tokens_to_ids`?
torchtext/legacy/__init__.py
Outdated
```diff
@@ -0,0 +1,5 @@
+from . import datasets
+
+__version__ = '0.4.0'
```
nit: What does the version mean here?
README.rst
Outdated
```rst
Legacy Code
===========

We are currently retiring several datasets as legacy code ```torchtext.legacy```:
```
nit: We have currently retired several datasets and moved them under `torch.legacy`.
LGTM. See Vincent's comment "create_data_from_iterator".
Re-write the language modeling datasets with a new pattern, which was first applied to the text classification datasets in the v0.4.0 release.
Motivation
The motivation for the new pattern is to simplify data processing in Torchtext and grant users more flexibility to build the pipeline. There are three major issues we want to solve:

- The `Field` class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. It's like a "black box" and confuses users about what's going on inside it. Instead, with the new pattern, the components mentioned above become basic building blocks, and users can build the data processing pipeline from these orthogonal components.
- Incompatibility with `DataLoader` and `Sampler` in `torch.utils.data`. Some duplicate functionals (e.g. `Iterator`, `Batch`, `splits` in `torchtext`) should be replaced by the corresponding functionals in `torch.utils.data` to reduce the maintenance effort.
- The `Example` class adds no structure, and should be replaced with a tuple/dict or namedtuple.

API for new language modeling datasets
To support "one-command" data loading, we have built a pipeline here to support PennTreebank, WikiText103, and WikiText2. Users are welcome to build their own if they follow the pattern. To load the new datasets, simply call the dataset API, as follows:
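The code block that belonged here was lost in rendering; based on the surrounding text it presumably resembled the following (hedged reconstruction showing `WikiText2`; the exact import path and return order may differ, and running it downloads the dataset):

```python
from torchtext.datasets import WikiText2

# One call downloads the data, tokenizes, builds the vocab,
# and returns numericalized LanguageModelingDataset objects.
train_dataset, test_dataset, valid_dataset = WikiText2()
```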
If you want to use a specific tokenizer:
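The example snippet was stripped here as well; a plausible reconstruction, assuming the tokenizer argument described in this PR (requires the spacy tokenizer to be installed):

```python
from torchtext.data.utils import get_tokenizer
from torchtext.datasets import WikiText2

# Any callable tokenizer can be passed directly to the constructor.
tokenizer = get_tokenizer("spacy")
train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=tokenizer)
```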
If you just want the valid set:
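Reconstructing the stripped snippet using the `data_select` keyword and the pre-built `vocab` object discussed in this thread (argument names per the review discussion; hedged, and running it downloads the dataset):

```python
from torchtext.datasets import WikiText2

# Build the vocab from the training split, then load only 'valid' with it.
train_dataset, = WikiText2(data_select='train')
vocab = train_dataset.get_vocab()
valid_dataset, = WikiText2(vocab=vocab, data_select='valid')
```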
Legacy code
We decided to move the old language modeling datasets (a.k.a. PennTreebank, WikiText103, WikiText2) to a legacy folder, `torchtext.legacy.datasets`. In the past you may have used those datasets directly, and you can still use them from the legacy namespace, as follows:
Difference

- In the old API, users have to create a `Field` object including a specific tokenizer. In the new dataset API, users can pass the tokenizer directly to the dataset constructor.
- In the old API, the `vocab` object is associated with the `Field` class, and there is no way to apply a pre-trained `vocab` object. In the new datasets, the `vocab` object can be obtained from an existing dataset and applied to generate new datasets.
- Instead of `BucketIterator.splits`, users are encouraged to use `torch.utils.data.DataLoader` to generate batches of data. You can specify how exactly the samples need to be batched using `collate_fn`.
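The `collate_fn` point can be demonstrated without torchtext at all; this sketch batches a toy list of fixed-length tensors (assuming only that each sample is a tensor of token ids):

```python
import torch
from torch.utils.data import DataLoader

# Toy "dataset": six samples of two token ids each.
samples = [torch.tensor([i, i + 1]) for i in range(6)]

def collate_fn(batch):
    # collate_fn decides exactly how a list of samples becomes one batch.
    return torch.stack(batch)

loader = DataLoader(samples, batch_size=3, collate_fn=collate_fn)
shapes = [tuple(batch.shape) for batch in loader]
```

Swapping in a different `collate_fn` (e.g. one that pads variable-length samples) changes the batching behavior without touching the dataset itself.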