
Add a new dataset - Enwik9 #610

Merged 26 commits into pytorch:master on Oct 24, 2019

Conversation

@zhangguanheng66 (Contributor) commented Oct 2, 2019

enwik9: the first 10^9 bytes of enwiki-20060303-pages-articles.xml, distributed as a compressed archive.

It's part of the Large Text Compression Benchmark project.

Benchmark results (seconds):
389.595  download_from_url (downloading)
  7.571  extract_archive
504.140  preprocess_raw_enwik9 (remove HTML and images)
 35.772  generate_offsets
  0.128  get_vocab
  0.012  read_lines_from_iterator

So the most time-consuming tasks are downloading and pre-processing (i.e. cleaning up the raw data). Therefore, the processed data are saved to disk for re-use.
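
For reference, here is a minimal usage sketch of the new dataset. It is illustrative only; it assumes the import path added in this PR and the root/begin_line/num_lines keywords that appear in the diff below, whose actual defaults may differ.

from torchtext.datasets import EnWik9

# Load a slice of the processed corpus into memory and build the vocab.
dataset = EnWik9(begin_line=0, num_lines=100, root='.data')
print(len(dataset))          # number of lines held in memory
vocab = dataset.get_vocab()  # vocabulary over the processed text
for line in dataset:         # iterate over the loaded lines
    pass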

@vincentqb (Contributor) left a comment

Given that we seem to be ok with loading the data in memory, this looks good to me. I mentioned a few points, but I don't think any are blocking.

@@ -16,7 +16,7 @@
 def generate_sp_model(filename, vocab_size=20000,
                       model_type="unigram",
                       model_prefix='m_user'):
-    """Train a SentencePiece tokenizer.
+    r"""Train a SentencePiece tokenizer.
Contributor:

nit: This string change could have gone in a separate PR :)

Contributor (Author):

I feel it's a very small change so I just made it here :).

Resolved (outdated) review threads: torchtext/data/functional.py, torchtext/datasets/unsupervised_learning.py (×5).
            dataset_zip = download_from_url(url,
                                            path=os.path.join(root, 'enwik9.zip'),
                                            root=root)
            extracted_file = extract_archive(dataset_zip)
Contributor:

If the data is already downloaded or extracted, could we skip these steps?

Contributor (Author):

The download function will automatically skip downloading if the downloaded file is detected.

Contributor:

If the file was downloaded then extracted, and then the user deletes the downloaded file, it will be downloaded again even though it's already extracted, right?

Contributor (Author):

Yes. If the downloaded file is deleted, it will be downloaded again. But there is an if check above to see if the processed file exists. And in this case, downloading is much more expensive than extracting.
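
(For clarity, a rough sketch of the skip logic being referred to; illustrative, not the exact PR code. The expensive download/extract/preprocess path only runs when the processed file is missing, so later constructions reuse the file on disk.)

import os
from torchtext.utils import download_from_url, extract_archive

root = '.data'
processed_file = os.path.join(root, 'norm_enwik9')
if not os.path.exists(processed_file):
    url = 'http://mattmahoney.net/dc/enwik9.zip'
    # download_from_url skips the download if enwik9.zip is already present.
    dataset_zip = download_from_url(url,
                                    path=os.path.join(root, 'enwik9.zip'),
                                    root=root)
    extracted_file = extract_archive(dataset_zip)
    # ... preprocess extracted_file and write the result to processed_file ...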

Resolved (outdated) review thread: torchtext/data/functional.py.
@@ -10,7 +10,7 @@
AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
YelpReviewFull, YahooAnswers, \
AmazonReviewPolarity, AmazonReviewFull

from .unsupervised_learning import EnWik9
Contributor:

nit: same as above, about the import.

Contributor (Author):

I'm confused. This is grouped together with the other imports.

@cpuhrsch (Contributor) left a comment

Still needs some work. Some very basic benchmarks using -m cProfile -s tottime would also be appreciated as a sanity check.
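
(For example, a sanity-check profile could look roughly like this; a sketch that assumes the EnWik9 constructor takes a root keyword as in the diff.)

import cProfile

from torchtext.datasets import EnWik9

# Sort by total time per function, mirroring `python -m cProfile -s tottime`.
cProfile.runctx("EnWik9(root='.data')", globals(), locals(), sort='tottime')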

@vincentqb (Contributor) left a comment

Especially since the data is loaded in memory, I second that a benchmark would be useful.

Also, the interface seems different from the other torchtext datasets (e.g. IMDB). Is this the new format? Could you document the format and the differences here in the PR?

@vincentqb self-requested a review on October 3, 2019, 18:45.
@cpuhrsch (Contributor) commented Oct 3, 2019

@vincentqb - we consider datasets such as IMDB to be legacy at this point. The supervised learning datasets are more in line with our simplified philosophy.

@vincentqb (Contributor) commented:
> @vincentqb - we consider datasets such as IMDB to be legacy at this point. The supervised learning datasets are more in line with our simplified philosophy.

Thanks for pointing this out. It'd be nice if the PR said which datasets use the new format and which use the old one.

@cpuhrsch (Contributor) commented Oct 3, 2019

@vincentqb - since we're still in the process of more rigorously defining data processing and can't reference an RFC or a wider discussion, that kind of notice would mostly amount to noting that this does not comply with the current standard, but nothing more.

@zhangguanheng66 (Contributor, Author) commented:

> Still needs some work. Some very basic benchmarks using -m cProfile -s tottime would also be appreciated as a sanity check.

Sure. I will post the performance benchmark results in the PR later.

@zhangguanheng66 merged commit 9afcfcd into pytorch:master on Oct 24, 2019.
    _patterns = list((re.compile(p), r)
                     for (p, r) in replace_pattern)

    def _internal_func(txt_iter):
Contributor:

What's the goal of wrapping the yield inside a function? Could you remove the def and the return _internal_func?

Contributor (Author):

The custom_replace func returns _internal_func, which accepts txt_iter for replacement. I could merge them into a single func, but then I would have to re.compile the replace patterns on every call.
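
(In other words, the closure compiles the patterns once and reuses them across calls. A minimal sketch of that pattern, following the names in the snippet above:)

import re

def custom_replace(replace_pattern):
    # Compile each (pattern, replacement) pair a single time.
    _patterns = list((re.compile(p), r)
                     for (p, r) in replace_pattern)

    def _internal_func(txt_iter):
        # Lazily apply the pre-compiled patterns to each incoming line.
        for line in txt_iter:
            for pattern_re, replaced_str in _patterns:
                line = pattern_re.sub(replaced_str, line)
            yield line
    return _internal_func

# Example: collapse runs of whitespace.
normalize = custom_replace([(r'\s+', ' ')])
print(list(normalize(['hello   world', 'a\tb'])))  # ['hello world', 'a b']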

        for x in self._data:
            yield x

    def get_vocab(self):
Contributor:

Will the vocab be part of the dataset? We had a discussion about this internally.

Contributor (Author):

There are some arguments for this. In the old torchtext, the vocab stays with Field, and we don't like that. However, I also don't want the vocab completely separate from the datasets. We still need to find a way to carry the vocab somewhere.

    def __len__(self):
        return len(self._data)

    def __iter__(self):
Contributor:

Since you have __getitem__ and __len__, __iter__ is implied.

Contributor (Author):

Yes. I don't need to have __iter__() here if we decide to use the map-style pattern.
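
(For context: iter() falls back to the old sequence protocol whenever __getitem__ is defined, so a map-style dataset already iterates without an explicit __iter__. A standalone illustration, not the PR's code:)

class MapStyle:
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # An IndexError past the end is what stops the implicit iteration.
        return self._data[i]

for x in MapStyle(['a', 'b', 'c']):  # no __iter__ needed
    print(x)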


        processed_file = os.path.join(root, 'norm_enwik9')
        if not os.path.exists(processed_file):
            url = 'http://mattmahoney.net/dc/enwik9.zip'
Contributor:

The user may want to customize the URL, etc.

Contributor (Author):

For this dataset, no. But we could make a more general case for similar datasets.

        read_lines = read_lines_from_iterator(processed_file,
                                              offsets, begin_line, num_lines)

        self._data = []
Contributor:

The data is all loaded in memory and kept there. Is that desired?

Contributor (Author):

Not all the data. Only part of the data is loaded into memory, depending on begin_line and num_lines.
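
(A hedged sketch of how such a slice can be read using precomputed byte offsets; the helper below is illustrative, and read_lines_from_iterator's real implementation may differ.)

def read_lines_from_offsets(path, offsets, begin_line, num_lines):
    # offsets[i] is assumed to hold the byte offset of line i, e.g. recorded
    # with f.tell() during a single pass over the processed file.
    lines = []
    with open(path, 'rb') as f:
        f.seek(offsets[begin_line])
        for _ in range(num_lines):
            lines.append(f.readline().decode('utf-8').rstrip('\n'))
    return lines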

@zhangguanheng66 deleted the enwik9 branch on November 25, 2019 (15:29).