Add a new dataset - Enwik9 #610
Conversation
Given that we seem to be ok with loading the data in memory, this looks good to me. I mentioned a few points, but I don't think any are blocking.
```diff
@@ -16,7 +16,7 @@
 def generate_sp_model(filename, vocab_size=20000,
                       model_type="unigram",
                       model_prefix='m_user'):
-    """Train a SentencePiece tokenizer.
+    r"""Train a SentencePiece tokenizer.
```
nit: This string change could have gone in a separate PR :)
I feel it's a very small change, so I just made it here :).
```python
dataset_zip = download_from_url(url,
                                path=os.path.join(root, 'enwik9.zip'),
                                root=root)
extracted_file = extract_archive(dataset_zip)
```
If the data is already downloaded or extracted, could we skip these steps?
The download function will automatically skip the download if the downloaded file is detected.
If the file was downloaded then extracted, and then the user deletes the downloaded file, it will be downloaded again even though it's already extracted, right?
Yes. If the downloaded file is deleted, it will be downloaded again. But I have an `if` check above to see if the processed file exists, and in this case downloading is much more expensive than extracting.
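For context, here is a minimal sketch of the caching flow being described, assuming `download_from_url` / `extract_archive` from `torchtext.utils` and a placeholder for the PR's preprocessing step; it is not the PR's actual code.

```python
import os

from torchtext.utils import download_from_url, extract_archive


def _clean_enwik9(src, dst):
    # Stand-in for the PR's expensive preprocessing step (stripping HTML,
    # images, etc.); here it only copies the text so the sketch runs.
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        fout.writelines(fin)


def _prepare_enwik9(root='.data'):
    processed_file = os.path.join(root, 'norm_enwik9')
    # If the cleaned file is already on disk, skip download, extraction and
    # preprocessing entirely; otherwise download_from_url itself skips the
    # network step when enwik9.zip is already present.
    if not os.path.exists(processed_file):
        url = 'http://mattmahoney.net/dc/enwik9.zip'
        dataset_zip = download_from_url(url,
                                        path=os.path.join(root, 'enwik9.zip'),
                                        root=root)
        extracted_files = extract_archive(dataset_zip)
        _clean_enwik9(extracted_files[0], processed_file)
    return processed_file
```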
```diff
@@ -10,7 +10,7 @@
     AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, \
     YelpReviewFull, YahooAnswers, \
     AmazonReviewPolarity, AmazonReviewFull
 
+from .unsupervised_learning import EnWik9
```
nit: same comment about the import
I'm confused; this is grouped together with the other imports.
Still needs some work. Some very basic benchmarks using `-m cProfile -s tottime` would also be appreciated as a sanity check.
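For reference, one way such a benchmark could be run from a small script; the import path and constructor arguments here are assumptions, not this PR's confirmed API.

```python
import cProfile
import pstats

# Assumed import path for the dataset added in this PR.
from torchtext.datasets import EnWik9

# Profile dataset construction and sort by total time spent per function,
# the in-script equivalent of `python -m cProfile -s tottime script.py`.
# Run this file as a script so EnWik9 is visible in __main__'s namespace.
cProfile.run('EnWik9(root=".data")', 'enwik9.prof')
pstats.Stats('enwik9.prof').sort_stats('tottime').print_stats(20)
```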
Especially since the data is loaded in memory, I second that a benchmark would be useful.
Also, the interface seems different from the other torchtext datasets (e.g. IMDB). Is this the new format? Could you document the format and the differences here in the PR?
@vincentqb - we consider datasets such as IMDB to be legacy at this point. The supervised learning datasets are more in line with our simplified philosophy.
Thanks for pointing this out. It'd be nice if the PR would say which ones are the new format, and which ones are the old.
@vincentqb - since we're still in the process of more rigorously defining data processing and can't reference an RFC or wider discussion, that kind of notice would mostly amount to noting that this does not comply with the current standard, but nothing more.
Sure. I will post the performance benchmark results in the PR later.
```python
_patterns = list((re.compile(p), r)
                 for (p, r) in replace_pattern)

def _internal_func(txt_iter):
```
What's the goal of wrapping the yield inside a function? Could you remove the `def` and the `return _internal_func`?
The `custom_replace` func returns `_internal_func`, which accepts `txt_iter` for replacement. I could merge those two into a single func, but then I would have to re.compile the replace pattern every time.
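For readers following along, a minimal sketch of the closure pattern being described: the patterns are compiled once and reused by the inner generator. This is illustrative only, not necessarily the PR's exact code.

```python
import re


def custom_replace(replace_pattern):
    # Compile each (pattern, replacement) pair once, up front.
    _patterns = [(re.compile(p), r) for (p, r) in replace_pattern]

    def _internal_func(txt_iter):
        # The inner generator reuses the pre-compiled patterns for every line,
        # avoiding a re.compile call per item.
        for line in txt_iter:
            for pattern_re, replacement in _patterns:
                line = pattern_re.sub(replacement, line)
            yield line

    return _internal_func


# Usage sketch:
# normalize = custom_replace([(r'<[^>]+>', ''), (r'\s+', ' ')])
# cleaned_lines = normalize(open('norm_enwik9'))
```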
```python
for x in self._data:
    yield x

def get_vocab(self):
```
Will the vocab be part of the dataset? We had a discussion internally here about this.
There are some arguments for this. For the old torchtext, `vocab` stays with `Field` and we don't like that. However, I also don't want a `vocab` that lives completely apart from the datasets. We still need to find a way to carry `vocab` somewhere.
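As a toy illustration of one option (not the API in this PR), the dataset itself can build and hand back a vocab-like object through `get_vocab()`:

```python
from collections import Counter


class ToyTextDataset:
    """Sketch of a dataset that carries its own vocabulary via get_vocab()."""

    def __init__(self, lines):
        self._data = list(lines)
        # A plain token-frequency Counter stands in for torchtext's Vocab here.
        self._vocab = Counter(tok for line in self._data for tok in line.split())

    def get_vocab(self):
        return self._vocab


# Usage sketch:
# dataset = ToyTextDataset(['hello world', 'hello enwik9'])
# vocab = dataset.get_vocab()
```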
```python
def __len__(self):
    return len(self._data)

def __iter__(self):
```
Since you have `__getitem__` and `__len__`, `__iter__` is implied.
Yes. I don't need to have `__iter__()` here if we decide to use the map-style pattern.
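To spell out the point with a generic sketch (not this PR's class): once `__getitem__` is defined, Python's legacy sequence protocol already makes the object iterable by calling `__getitem__(0)`, `__getitem__(1)`, ... until `IndexError`, so a separate `__iter__` is optional for a map-style dataset.

```python
class MapStyleDataset:
    """No __iter__ defined; iteration still works via __getitem__/__len__."""

    def __init__(self, data):
        self._data = list(data)

    def __getitem__(self, i):
        return self._data[i]

    def __len__(self):
        return len(self._data)


# The for-loop falls back to the sequence protocol:
for token in MapStyleDataset(['an', 'iterable', 'dataset']):
    print(token)
```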
```python
processed_file = os.path.join(root, 'norm_enwik9')
if not os.path.exists(processed_file):
    url = 'http://mattmahoney.net/dc/enwik9.zip'
```
The user may want to customize the URL, etc.
For this dataset, no. But we could make a more general case for similar datasets.
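If that ever becomes useful, a common pattern would be an optional constructor argument; the signature below is hypothetical and not part of this PR.

```python
class EnWik9Sketch:
    """Toy illustration of letting callers override the download URL."""

    DEFAULT_URL = 'http://mattmahoney.net/dc/enwik9.zip'

    def __init__(self, root='.data', url=None):
        # Fall back to the Large Text Compression Benchmark mirror by default;
        # a caller can point 'url' at an internal cache or alternative host.
        self.root = root
        self.url = url or self.DEFAULT_URL
```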
```python
read_lines = read_lines_from_iterator(processed_file,
                                      offsets, begin_line, num_lines)

self._data = []
```
The data is all loaded in memory and kept there. Is that desired?
Not all of the data. Only a portion is loaded into memory, depending on `begin_line` and `num_lines`.
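For readers unfamiliar with the approach, here is a rough sketch of what offset-based partial loading can look like; the PR's actual `generate_offsets` and `read_lines_from_iterator` may differ in their details.

```python
def generate_offsets(filename):
    # Record the byte offset at which every line starts.
    offsets = []
    with open(filename, 'rb') as f:
        offset = 0
        for line in f:
            offsets.append(offset)
            offset += len(line)
    return offsets


def read_lines_from_iterator(filename, offsets, begin_line, num_lines):
    # Seek directly to begin_line and yield only num_lines lines, so memory
    # usage is bounded by the requested slice rather than the whole file.
    with open(filename, 'rb') as f:
        f.seek(offsets[begin_line])
        for _ in range(num_lines):
            line = f.readline()
            if not line:
                break
            yield line.decode('utf-8')
```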
enwik9: the compressed first 10^9 bytes of enwiki-20060303-pages-articles.xml.
It's part of the Large Text Compression Benchmark project (here).
Benchmark results:
389.595s download_from_url: downloading
7.571s extract_archive
504.140s preprocess_raw_enwik9: remove html, images.
35.772s generate_offsets
0.128s get_vocab
0.012s read_lines_from_iterator
So the most time-consuming tasks are downloading and pre-processing (i.e., cleaning up the raw data). Therefore, the processed data are saved to disk for re-use.