Experimental translation datasets #751
Conversation
akurniawan commented on May 2, 2020
- Add a new abstraction for translation datasets with a new API
- Add utilities for automatically building datasets from files
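For a sense of the intended usage, here is a minimal sketch based on the diffs discussed below (the module path and the behavior of the defaults are assumptions, not necessarily the merged API):

```python
from torchtext.experimental.datasets import Multi30k  # module path assumed

# With no arguments, the data files are downloaded into the default root
# and the tokenizers/vocabularies are built automatically.
train, valid, test = Multi30k()

# Each example is assumed to be a (source, target) pair of token-id tensors.
src, tgt = train[0]
print(src.shape, tgt.shape)
```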
Merging from upstream
…_remove for removing multiple files at once
@zhangguanheng66 since the WMT test is pretty long, it causes the test to fail. Should we skip the test for WMT? I initially marked the test as slow, but after pulling from the latest master the annotation has been removed, so I'm not sure whether we are still allowed to skip the test in this case.
We recently made some changes to our CI tests and switched to CircleCI. There is no slow decorator anymore.
@akurniawan Just wanted to check whether you want to continue the PR? Thanks.
@zhangguanheng66 yes, I still want to continue the PR. I was hospitalized for the last two weeks and had no access to my laptop. I will start on the fix on Monday if that's okay with you.
Hope you are fine and recover soon. No rush on the PR; I was just checking whether you are still interested in it.
@zhangguanheng66 I have already rebased onto the latest master and it fixes the problem. The PR is now ready for review again. Let me know if you have further concerns.
requirements.txt (Outdated)
@@ -7,6 +7,9 @@ requests
# Optional NLP tools
nltk
spacy
# Needed for machine translation tokenizers
https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz#egg=de_core_news_sm==2.2.5
What's this for? If spacy is not installed for a specific language, it will throw an error message.
Initially I added this to resolve issues in the unittest, since the dataset used was primarily German and I was using spacy as the tokenizer. But after some thinking, maybe this is unnecessary to include in requirements.txt.
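For context, the pinned model is what a spacy-backed tokenizer loads; a minimal sketch of that usage (assuming the de_core_news_sm model has been installed, e.g. via `python -m spacy download de_core_news_sm`):

```python
from torchtext.data.utils import get_tokenizer

# Loading the spacy pipeline fails if the de_core_news_sm model is missing,
# which is why the unittest originally needed it pinned in requirements.txt.
de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
print(de_tokenizer("Zwei Männer stehen am Herd."))
```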
yield " ".join(row) | ||
|
||
|
||
def _clean_xml_file(f_xml): |
What's this function for?
Some files in IWSLT are originally xml files, so we need this function to extract their contents. This function was already there in the branch you created previously, and it was also present in the old translation dataset class.
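For reference, such an XML-cleaning helper looks roughly like the following sketch (based on the old translation dataset code; the exact version in this PR may differ):

```python
import codecs
import os
import xml.etree.ElementTree as ET


def _clean_xml_file(f_xml):
    """Write the plain-text <seg> contents of an IWSLT XML file to a sibling text file."""
    f_txt = os.path.splitext(f_xml)[0]  # e.g. "...tst2013.de-en.de.xml" -> "...tst2013.de-en.de"
    with codecs.open(f_txt, mode='w', encoding='utf-8') as fd_txt:
        root = ET.parse(f_xml).getroot()[0]
        for doc in root.findall('doc'):
            for seg in doc.findall('seg'):
                fd_txt.write(seg.text.strip() + '\n')
```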
fd_txt.write(e.text.strip() + '\n')


def _clean_tags_file(f_orig):
What's this function for?
There are files such as train.tags.en-de.en, train.tags.en-cs.en, etc., which contain something similar to xml files. The contents look like the following:
<translator href="http://www.ted.com/profiles/383709">Abdulatif Mahgoub</translator>
<reviewer href="http://www.ted.com/profiles/122540">Anwar Dafa-Alla</reviewer>
<url>http://www.ted.com/talks/steve_keil_a_manifesto_for_play_for_bulgaria_and_beyond</url>
<keywords>talks, Europe, TEDx, business, culture, economics, play, social change</keywords>
<speaker>Steve Keil</speaker>
<talkid>1170</talkid>
<title>ستيف كايل: بيان رسمي للعب ، الى بلغاريا و ماوراءها</title>
<description>TED Talk Subtitles and Transcript: في تيدأكس بيىجي في صوفيا، بلغاريا ، في صوفيا ، يحارب ستيف كايل "المعلومات الثقافية المتوارثة" والتي أصابت وطنه بلغاريا - ويدعو للعودة الى اللعب لتنشيط الاقتصاد، التعليم ، والمجتمع. حديث رائع ذو رسالة عالمية للأفراد في كل مكان ، الذين يعيدون ابتكار اماكن عملهم ، مدارسهم ، حياتهم.</description>
إنني هنا اليوم لأبدأ ثورة.
والآن وقبل أن تهبوا، أو تبدأوا بالغناء، أو اختيار لون مفضل، أريد تحديد ما أعنيه بثورة.
بكلمة ثورة، أعني تغييرا جذريا وبعيد المدى في طريقة تفكيرنا وتصرفاتنا-- في طريقة تفكيرنا وطريقة تصرفاتنا.
والآن لماذا ، ستيف، لماذا نحتاج الى ثورة؟
نحتاج إلى ثورة لأن الأشياء لا تعمل؛ انها لا تعمل تماماً.
وهذا يجعلني حزينا جدا، لأنني سئمت وتعبت من الأشياء التي لا تعمل.
أوتعرفون، لقد سئمت وتعبت منّا ونحن لا نحيا بامكانياتنا.
سئمت وتعبت منّا وقد أصبحنا متأخرين.
Do we need that information? IMO, without those tag files, the PR is much simpler.
Sorry for the lack of context. Those tags files are part of the IWSLT dataset when you download it.
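For reference, the tag-cleaning helper is roughly the following sketch: it drops the metadata lines shown above and keeps only the plain-text sentences (based on the old translation dataset code; the version in this PR may differ):

```python
import codecs
import io


def _clean_tags_file(f_orig):
    """Strip metadata lines (<url>, <talkid>, ...) from a train.tags.* file."""
    xml_tags = ['<url', '<keywords', '<talkid', '<description',
                '<reviewer', '<translator', '<title', '<speaker']
    f_txt = f_orig.replace('.tags', '')
    with codecs.open(f_txt, mode='w', encoding='utf-8') as fd_txt, \
            io.open(f_orig, mode='r', encoding='utf-8') as fd_orig:
        for line in fd_orig:
            # Keep only lines that carry actual sentence text.
            if not any(tag in line for tag in xml_tags):
                fd_txt.write(line.strip() + '\n')
```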
if 'xml' in fname:
    _clean_xml_file(fname)
    file_archives.append(os.path.splitext(fname)[0])
elif "tags" in fname:
What's in the tags files? Do we need to provide them to users?
If I understand correctly, this is needed to extract the real content and remove the tags from .tags files. This function was there in the branch you created previously, and it was also present in the old translation dataset class.
valid_filename="val",
test_filename="test_2016_flickr",
root='.data'):
    """ Define translation datasets: Multi30k
In the docs, it's better to summarize all the combinations of languages and years as the options for users.
Fixed, please let me know if it's sufficient.
The available datasets include:

Arguments:
    languages: The source and target languages for the datasets.
Same as previous comments
Fixed, please let me know if it's sufficient.
Thanks @akurniawan. LGTM. Added some questions to help me understand some of the code.
return tuple(datasets)


class RawTextIterableDataset(torch.utils.data.IterableDataset):
Should we call this RawTranslationIterableDataset?
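For reference, a stripped-down sketch of what such a raw-text iterable dataset can look like (an illustration only, not necessarily the exact class in the PR):

```python
import torch


class RawTextIterableDataset(torch.utils.data.IterableDataset):
    """Wraps a generator of raw (source, target) sentence pairs."""

    def __init__(self, iterator):
        super().__init__()
        self._iterator = iterator

    def __iter__(self):
        # Yield raw text pairs; tokenization and numericalization
        # happen later in the setup functions.
        for src_line, tgt_line in self._iterator:
            yield src_line, tgt_line
```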
def Multi30k(languages="de-en",
             train_filename="train",
Seems to me that we can combine languages and train_filename together and have the train data named explicitly by train_filename, for example train_filename="train.de". Same for valid_filename.
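Under that suggestion, a call might look like the following (a hypothetical illustration of the proposed signature; the filenames are taken from the defaults shown later in the diff):

```python
from torchtext.experimental.datasets import Multi30k  # module path assumed

# Hypothetical: the language pair is carried by the filename suffixes,
# so a separate `languages` argument is no longer needed.
train, valid, test = Multi30k(
    train_filenames=("train.de", "train.en"),
    valid_filenames=("val.de", "val.en"),
    test_filenames=("test2016.de", "test2016.en"),
)
```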
from torchtext.data.utils import get_tokenizer


def vocab_func(vocab):
You could simply import those utils from https://github.com/pytorch/text/blob/master/torchtext/experimental/functional.py
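Those utils are small closures, roughly along these lines (sketched from torchtext/experimental/functional.py; the exact code may differ):

```python
import torch


def vocab_func(vocab):
    # Map an iterable of tokens to a list of ids using the given vocab.
    def func(tok_iter):
        return [vocab[tok] for tok in tok_iter]
    return func


def totensor(dtype):
    # Convert a list of ids into a tensor of the given dtype.
    def func(ids_list):
        return torch.tensor(ids_list).to(dtype)
    return func
```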
vocab=(None, None),
removed_tokens=['<unk>']):
    """ Define translation datasets: Multi30k
    Separately returns train/valid/test datasets as a tuple
I think we should keep a copy of the available language/train_filename options here because most users will read the docs here.
done
        Default: ('val.de', 'val.en')
    test_filenames: the source and target filenames for test.
        Default: ('test2016.de', 'test2016.en')
    tokenizer: the tokenizer used to preprocess source and target raw text data.
We should explicitly say that the tokenizer needs a tuple of tokenizers. And the default is (get_tokenizer("spacy", language='de_core_news_sm'), get_tokenizer("spacy", language='en_core_web_sm')). Same for vocab.
done
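For illustration, the documented default amounts to a pair of spacy tokenizers, constructed and passed like this (module path for Multi30k assumed; the spacy models must be installed separately):

```python
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import Multi30k  # module path assumed

# (source, target) tokenizers, mirroring the default described above.
tokenizers = (get_tokenizer("spacy", language='de_core_news_sm'),
              get_tokenizer("spacy", language='en_core_web_sm'))

train, valid, test = Multi30k(tokenizer=tokenizers)
```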
@akurniawan Thanks for the contribution and everything looks good to me.
nit: add a comment for docs.
Just found that we need a doc string in https://github.com/pytorch/text/blob/master/docs/source/experimental_datasets.rst
@akurniawan Let me know when you are done so I can merge this PR.
@zhangguanheng66 I have already added the doc string, please take a look.