Add a new dataset - Enwik9 #610
Conversation
Given that we seem to be ok with loading the data in memory, this looks good to me. I mentioned a few points, but I don't think any are blocking.
```diff
@@ -16,7 +16,7 @@
 def generate_sp_model(filename, vocab_size=20000,
                       model_type="unigram",
                       model_prefix='m_user'):
-    """Train a SentencePiece tokenizer.
+    r"""Train a SentencePiece tokenizer.
```
nit: This string change could have gone in a separate PR :)
I feel it's a very small change, so I just made it here :).
```python
dataset_zip = download_from_url(url,
                                path=os.path.join(root, 'enwik9.zip'),
                                root=root)
extracted_file = extract_archive(dataset_zip)
```
If the data is already downloaded or extracted, could we skip these steps?
The download function will automatically skip the download if the downloaded file is detected.
If the file was downloaded then extracted, and then the user deletes the downloaded file, it will be downloaded again even though it's already extracted, right?
Yes. If the downloaded file is deleted, it will be downloaded again. But I have an `if` check above to see if the processed file exists, and in this case downloading is much more expensive than extracting.
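For context, here is a minimal sketch of the caching flow being described, assuming `download_from_url` / `extract_archive` from `torchtext.utils` and a placeholder for the PR's preprocessing step; it is not the PR's actual code.

```python
import os

from torchtext.utils import download_from_url, extract_archive


def _clean_enwik9(src, dst):
    # Stand-in for the PR's expensive preprocessing step (stripping HTML,
    # images, etc.); here it only copies the text so the sketch runs.
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        fout.writelines(fin)


def _prepare_enwik9(root='.data'):
    processed_file = os.path.join(root, 'norm_enwik9')
    # If the cleaned file is already on disk, skip download, extraction and
    # preprocessing entirely; otherwise download_from_url itself skips the
    # network step when enwik9.zip is already present.
    if not os.path.exists(processed_file):
        url = 'http://mattmahoney.net/dc/enwik9.zip'
        dataset_zip = download_from_url(url,
                                        path=os.path.join(root, 'enwik9.zip'),
                                        root=root)
        extracted_files = extract_archive(dataset_zip)
        _clean_enwik9(extracted_files[0], processed_file)
    return processed_file
```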
```diff
@@ -10,7 +10,7 @@
     AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
     YelpReviewFull, YahooAnswers, \
     AmazonReviewPolarity, AmazonReviewFull
 
+from .unsupervised_learning import EnWik9
```
nit: same comment about the import
I'm confused; this is grouped together with the other imports.
Still needs some work. Some very basic benchmarks using `-m cProfile -s tottime` would also be appreciated as a sanity check.
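For reference, one way such a benchmark could be run from a small script; the import path and constructor arguments here are assumptions, not this PR's confirmed API.

```python
import cProfile
import pstats

# Assumed import path for the dataset added in this PR.
from torchtext.datasets import EnWik9

# Profile dataset construction and sort by total time spent per function,
# the in-script equivalent of `python -m cProfile -s tottime script.py`.
# Run this file as a script so EnWik9 is visible in __main__'s namespace.
cProfile.run('EnWik9(root=".data")', 'enwik9.prof')
pstats.Stats('enwik9.prof').sort_stats('tottime').print_stats(20)
```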
Especially since the data is loaded in memory, I second that a benchmark would be useful.
Also, the interface seems different from the other torchtext datasets (e.g. IMDB). Is this the new format? Could you document the format and the differences here in the PR?
@vincentqb - we consider datasets such as IMDB to be legacy at this point. The supervised learning datasets are more in line with our simplified philosophy.
Thanks for pointing this out. It'd be nice if the PR would say which ones are the new format, and which ones are the old.
@vincentqb - since we're still in the process of more rigorously defining data processing and can't reference an RFC or wider discussion, that kind of notice would mostly amount to noting that this does not comply with the current standard, but nothing more.
Sure. I will post the performance benchmark results in the PR later.
```python
_patterns = list((re.compile(p), r)
                 for (p, r) in replace_pattern)

def _internal_func(txt_iter):
```
What's the goal of wrapping the yield inside a function? Could you remove the `def` and the `return _internal_func`?
The `custom_replace` func returns `_internal_func`, which accepts `txt_iter` for replacement. I could merge those two into a single func, but then I would have to re.compile the replace pattern every time.
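For readers following along, a minimal sketch of the closure pattern being described: the patterns are compiled once and reused by the inner generator. This is illustrative only, not necessarily the PR's exact code.

```python
import re


def custom_replace(replace_pattern):
    # Compile each (pattern, replacement) pair once, up front.
    _patterns = [(re.compile(p), r) for (p, r) in replace_pattern]

    def _internal_func(txt_iter):
        # The inner generator reuses the pre-compiled patterns for every line,
        # avoiding a re.compile call per item.
        for line in txt_iter:
            for pattern_re, replacement in _patterns:
                line = pattern_re.sub(replacement, line)
            yield line

    return _internal_func


# Usage sketch:
# normalize = custom_replace([(r'<[^>]+>', ''), (r'\s+', ' ')])
# cleaned_lines = normalize(open('norm_enwik9'))
```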
```python
for x in self._data:
    yield x

def get_vocab(self):
```
Will the vocab be part of the dataset? We had a discussion internally here about this.
There are some arguments for this. For the old torchtext, `vocab` stays with `Field` and we don't like that. However, I also don't want a `vocab` that lives completely apart from the datasets. We still need to find a way to carry `vocab` somewhere.
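As a toy illustration of one option (not the API in this PR), the dataset itself can build and hand back a vocab-like object through `get_vocab()`:

```python
from collections import Counter


class ToyTextDataset:
    """Sketch of a dataset that carries its own vocabulary via get_vocab()."""

    def __init__(self, lines):
        self._data = list(lines)
        # A plain token-frequency Counter stands in for torchtext's Vocab here.
        self._vocab = Counter(tok for line in self._data for tok in line.split())

    def get_vocab(self):
        return self._vocab


# Usage sketch:
# dataset = ToyTextDataset(['hello world', 'hello enwik9'])
# vocab = dataset.get_vocab()
```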
```python
def __len__(self):
    return len(self._data)

def __iter__(self):
```
Since you have `__getitem__` and `__len__`, `__iter__` is implied.
Yes. I don't need to have `__iter__()` here if we decide to use the map-style pattern.
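To spell out the point with a generic sketch (not this PR's class): once `__getitem__` is defined, Python's legacy sequence protocol already makes the object iterable by calling `__getitem__(0)`, `__getitem__(1)`, ... until `IndexError`, so a separate `__iter__` is optional for a map-style dataset.

```python
class MapStyleDataset:
    """No __iter__ defined; iteration still works via __getitem__/__len__."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, i):
        return self._data[i]

    def __len__(self):
        return len(self._data)


# The for-loop falls back to the sequence protocol:
for token in MapStyleDataset(['an', 'iterable', 'dataset']):
    print(token)
```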
```python
processed_file = os.path.join(root, 'norm_enwik9')
if not os.path.exists(processed_file):
    url = 'http://mattmahoney.net/dc/enwik9.zip'
```
The user may want to customize the URL, etc.
For this dataset, no. But we could make a more general case for similar datasets.
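If that ever becomes useful, a common pattern would be an optional constructor argument; the signature below is hypothetical and not part of this PR.

```python
class EnWik9Sketch:
    """Toy illustration of letting callers override the download URL."""

    DEFAULT_URL = 'http://mattmahoney.net/dc/enwik9.zip'

    def __init__(self, root='.data', url=None):
        # Fall back to the Large Text Compression Benchmark mirror by default;
        # a caller can point 'url' at an internal cache or alternative host.
        self.root = root
        self.url = url or self.DEFAULT_URL
```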
```python
read_lines = read_lines_from_iterator(processed_file,
                                      offsets, begin_line, num_lines)

self._data = []
```
The data is all loaded in memory and kept there. Is that desired?
Not all of the data. Only a portion is loaded into memory, depending on `begin_line` and `num_lines`.
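For readers unfamiliar with the approach, here is a rough sketch of what offset-based partial loading can look like; the PR's actual `generate_offsets` and `read_lines_from_iterator` may differ in their details.

```python
def generate_offsets(filename):
    # Record the byte offset at which every line starts.
    offsets = []
    with open(filename, 'rb') as f:
        offset = 0
        for line in f:
            offsets.append(offset)
            offset += len(line)
    return offsets


def read_lines_from_iterator(filename, offsets, begin_line, num_lines):
    # Seek directly to begin_line and yield only num_lines lines, so memory
    # usage is bounded by the requested slice rather than the whole file.
    with open(filename, 'rb') as f:
        f.seek(offsets[begin_line])
        for _ in range(num_lines):
            line = f.readline()
            if not line:
                break
            yield line.decode('utf-8')
```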
enwik9: the compressed first 10^9 bytes of enwiki-20060303-pages-articles.xml.
It's part of the Large Text Compression Benchmark project (here).
Benchmark results:
389.595s download_from_url: downloading
7.571s extract_archive
504.140s preprocess_raw_enwik9: remove html, images.
35.772s generate_offsets
0.128s get_vocab
0.012s read_lines_from_iterator
So the most time-consuming tasks are downloading and pre-processing (i.e., cleaning up the raw data). Therefore, the processed data are saved to disk for re-use.