
Experimental translation datasets #751

Merged (44 commits) on Jun 5, 2020

Conversation

akurniawan
Contributor

  • Adding new abstraction for translation datasets with new API
  • Add utilities for automatically building dataset from file

@akurniawan akurniawan marked this pull request as draft May 2, 2020 07:05
@akurniawan akurniawan marked this pull request as ready for review May 2, 2020 07:06
@akurniawan
Contributor Author

@zhangguanheng66 since the WMT test is pretty long, it causes the test to fail. Should we skip the test for WMT? I initially marked the test as slow, but after pulling from the latest master the annotation was removed, so I'm not sure whether we are still allowed to skip tests in this case.

@zhangguanheng66
Contributor

> @zhangguanheng66 since the WMT test is pretty long, it causes the test to fail. Should we skip the test for WMT? I initially marked the test as slow, but after pulling from the latest master the annotation was removed, so I'm not sure whether we are still allowed to skip tests in this case.

We recently made some changes to our CI tests and switched to CircleCI. There is no slow decorator anymore.
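With the slow decorator gone, one common workaround is to gate long-running tests behind an environment variable with the standard library's skip decorators. This is only a sketch of the pattern; the `RUN_SLOW_TESTS` variable and `TestWMT` class are hypothetical, not part of the torchtext repo:

```python
import os
import unittest

# Hypothetical guard: run the long WMT download tests only when explicitly enabled.
RUN_SLOW = os.environ.get("RUN_SLOW_TESTS") == "1"

class TestWMT(unittest.TestCase):
    @unittest.skipUnless(RUN_SLOW, "slow WMT test; set RUN_SLOW_TESTS=1 to enable")
    def test_wmt_dataset(self):
        # Placeholder for the real (long-running) dataset check.
        self.assertTrue(True)
```

CI would then leave `RUN_SLOW_TESTS` unset so the expensive download is skipped by default.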

@zhangguanheng66
Contributor

@akurniawan Just wanted to check whether you want to continue the PR. Thanks.

@akurniawan
Contributor Author

@zhangguanheng66 yes, I still want to continue the PR. I was hospitalized for the last 2 weeks and had no access to my laptop. I will start on the fix on Monday if that's okay with you.

@zhangguanheng66
Contributor

> @zhangguanheng66 yes, I still want to continue the PR. I was hospitalized for the last 2 weeks and had no access to my laptop. I will start on the fix on Monday if that's okay with you.

Hope you are fine and recover soon. No rush on the PR; I was just checking whether you are still interested in it.

@akurniawan
Contributor Author

@zhangguanheng66 I have already rebased onto the latest master and that fixes the problem. The PR is now ready for review again. Let me know if you have any further concerns.

requirements.txt Outdated
@@ -7,6 +7,9 @@ requests
# Optional NLP tools
nltk
spacy
# Needed for machine translation tokenizers
https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz#egg=de_core_news_sm==2.2.5
Contributor

What's this for? If spacy is not installed for a specific language, it will throw an error message.

Contributor Author

Initially I added this to resolve issues in the unit tests, since the dataset used was primarily German and I was using spacy as the tokenizer. But after some thought, maybe it is unnecessary to include this in requirements.txt.

yield " ".join(row)


def _clean_xml_file(f_xml):
Contributor

what's this func for?

Contributor Author

Some files in IWSLT are originally XML files, so we need this function to extract their contents. This function was already there in the branch you created previously, and it was also in the old translation dataset class.
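For context, the core of such an XML-cleaning step is pulling the sentence payloads out of the `<seg>` elements of an IWSLT-style document. A simplified, string-based sketch of the idea (the real `_clean_xml_file` in the PR reads and writes files and may handle more element types):

```python
import xml.etree.ElementTree as ET

def clean_xml_string(xml_string):
    """Extract the text of every <seg> element from an IWSLT-style XML document.

    Simplified sketch: illustrates the extraction only, not the PR's exact code.
    """
    root = ET.fromstring(xml_string)
    return [seg.text.strip() for seg in root.iter("seg")]

sample = """<mteval>
  <doc docid="1">
    <seg id="1"> Hello world. </seg>
    <seg id="2"> Second sentence. </seg>
  </doc>
</mteval>"""
```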

fd_txt.write(e.text.strip() + '\n')


def _clean_tags_file(f_orig):
Contributor

what's this func for?

Contributor Author

There are files such as train.tags.en-de.en, train.tags.en-cs.en, etc., which contain something similar to XML. The contents look like the following:

 <translator href="http://www.ted.com/profiles/383709">Abdulatif Mahgoub</translator>
 <reviewer href="http://www.ted.com/profiles/122540">Anwar Dafa-Alla</reviewer>
 <url>http://www.ted.com/talks/steve_keil_a_manifesto_for_play_for_bulgaria_and_beyond</url>
 <keywords>talks, Europe, TEDx, business, culture, economics, play, social change</keywords>
 <speaker>Steve Keil</speaker>
 <talkid>1170</talkid>
 <title>ستيف كايل: بيان رسمي للعب ، الى بلغاريا و ماوراءها</title>
 <description>TED Talk Subtitles and Transcript: في تيدأكس بيىجي في صوفيا، بلغاريا ، في صوفيا ، يحارب ستيف كايل "المعلومات الثقافية المتوارثة" والتي أصابت وطنه بلغاريا - ويدعو للعودة الى اللعب لتنشيط الاقتصاد، التعليم ، والمجتمع. حديث رائع ذو رسالة عالمية للأفراد في كل مكان ، الذين يعيدون ابتكار اماكن عملهم ، مدارسهم ، حياتهم.</description>
  إنني هنا اليوم  لأبدأ ثورة. 
  والآن وقبل أن تهبوا،  أو تبدأوا بالغناء،  أو اختيار لون مفضل،  أريد تحديد ما أعنيه بثورة. 
  بكلمة ثورة،  أعني تغييرا جذريا وبعيد المدى  في طريقة تفكيرنا وتصرفاتنا--  في طريقة تفكيرنا وطريقة تصرفاتنا. 
  والآن لماذا ، ستيف، لماذا نحتاج الى ثورة؟ 
  نحتاج إلى ثورة  لأن الأشياء لا تعمل؛ انها لا تعمل تماماً. 
  وهذا يجعلني حزينا جدا،  لأنني سئمت وتعبت من الأشياء التي لا تعمل. 
  أوتعرفون، لقد سئمت وتعبت منّا ونحن لا نحيا بامكانياتنا. 
  سئمت وتعبت منّا وقد أصبحنا متأخرين. 

Contributor

Do we need that information? IMO, without those tag files, the PR is much simpler.

Contributor Author

Sorry for the lack of context. Those tags files are part of the IWSLT dataset when you download it.

if 'xml' in fname:
    _clean_xml_file(fname)
    file_archives.append(os.path.splitext(fname)[0])
elif "tags" in fname:
Contributor

what's in tags file? Do we need to provide them to users?

Contributor Author

If I understand correctly, this is needed to extract the real content and remove the tags from the .tags files. This function was already there in the branch you created previously, and it was also in the old translation dataset class.
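The essence of such a tags-cleaning step is dropping the metadata lines (`<url>`, `<talkid>`, `<title>`, ...) and keeping only the transcript sentences. A simplified, line-based sketch of the idea (illustrative only; the real `_clean_tags_file` works on files and handles a fixed list of tags):

```python
def clean_tags_lines(lines):
    """Drop metadata tag lines from an IWSLT train.tags.* file and keep
    only the transcript sentences.

    Simplified sketch: any line that is entirely wrapped in <...> is
    treated as metadata and removed; everything else is kept, stripped.
    """
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<") and stripped.endswith(">"):
            continue  # metadata tag line, e.g. <talkid>1170</talkid>
        cleaned.append(stripped)
    return cleaned
```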

valid_filename="val",
test_filename="test_2016_flickr",
root='.data'):
""" Define translation datasets: Multi30k
Contributor

In the docs, it's better to summarize all the combinations of language and years as the options for users.

Contributor Author

fixed, please let me know if it's sufficient

The available datasets include:

Arguments:
languages: The source and target languages for the datasets.
Contributor

Same as previous comments

Contributor Author

fixed, please let me know if it's sufficient

@zhangguanheng66
Contributor

> @zhangguanheng66 I have already rebased onto the latest master and that fixes the problem. The PR is now ready for review again. Let me know if you have any further concerns.

Thanks @akurniawan. LGTM. I added some questions to help me understand some of the code.

return tuple(datasets)


class RawTextIterableDataset(torch.utils.data.IterableDataset):
Contributor

Should we call this RawTranslationIterableDataset?
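For reference, the class under discussion is essentially a thin iterable wrapper over raw (source, target) sentence pairs. A torch-free sketch of the pattern (names and signature are illustrative, not the PR's exact API; the real class subclasses torch.utils.data.IterableDataset):

```python
class RawTranslationIterable:
    """Minimal stand-in for an iterable dataset over raw sentence pairs.

    Illustrative only: shows the iteration/length contract such a
    class provides, without the torch dependency.
    """
    def __init__(self, pair_iterable, length=None):
        self._pairs = list(pair_iterable)
        self._length = length if length is not None else len(self._pairs)

    def __iter__(self):
        # Yield raw (source, target) string pairs one at a time.
        yield from self._pairs

    def __len__(self):
        return self._length
```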



def Multi30k(languages="de-en",
train_filename="train",
Contributor

Seems to me that we can combine languages and train_filename together and name the train data explicitly via train_filename, for example train_filename="train.de". Same for valid_filename.
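One way to read this suggestion is the reverse direction the current API takes: derive the two per-language file names from the languages pair and a base name. A hypothetical helper (not code from the PR) sketching that mapping:

```python
def expand_filenames(languages="de-en", base="train"):
    """Turn a 'src-tgt' language pair plus a base name into the two
    per-language file names, e.g. ('train.de', 'train.en').

    Hypothetical helper for illustration; the PR instead takes
    languages and train_filename as separate arguments.
    """
    src, tgt = languages.split("-")
    return (f"{base}.{src}", f"{base}.{tgt}")
```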

from torchtext.data.utils import get_tokenizer


def vocab_func(vocab):
Contributor

vocab=(None, None),
removed_tokens=['<unk>']):
""" Define translation datasets: Multi30k
Separately returns train/valid/test datasets as a tuple
Contributor

I think we should keep a copy of the available language/train_filename combinations here, because most users will read the docs here.

Contributor Author

done

Default: ('val.de', 'val.en')
test_filenames: the source and target filenames for test.
Default: ('test2016.de', 'test2016.en')
tokenizer: the tokenizer used to preprocess source and target raw text data.
Contributor

We should explicitly say that tokenizer needs to be a tuple of tokenizers, and that the default is (get_tokenizer("spacy", language='de_core_news_sm'), get_tokenizer("spacy", language='en_core_web_sm')). Same for vocab.

Contributor Author

done
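To illustrate the tuple convention from the comment above: one tokenizer per side, indexed as (source, target). Simple whitespace tokenizers stand in here for the spacy defaults, so the sketch runs without torchtext or spacy installed:

```python
def make_whitespace_tokenizer():
    # Stand-in for get_tokenizer("spacy", language=...); splits on whitespace.
    return lambda text: text.split()

# One tokenizer per side, mirroring the documented default of
# (get_tokenizer("spacy", language='de_core_news_sm'),
#  get_tokenizer("spacy", language='en_core_web_sm')).
tokenizer = (make_whitespace_tokenizer(), make_whitespace_tokenizer())

src_tokens = tokenizer[0]("ein kleines Haus")  # source side
tgt_tokens = tokenizer[1]("a small house")     # target side
```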

Contributor

@zhangguanheng66 left a comment

@akurniawan Thanks for the contribution and everything looks good to me.
nit: add a comment for docs.

@zhangguanheng66
Contributor

@akurniawan Let me know when you are done so I can merge this PR.

@akurniawan
Contributor Author

@zhangguanheng66 I have already added the doc string, please take a look

@zhangguanheng66 changed the title from "New translation datasets" to "Experimental translation datasets" on Jun 5, 2020
@zhangguanheng66 merged commit 9022871 into pytorch:master on Jun 5, 2020
zhangguanheng66 pushed a commit to zhangguanheng66/text that referenced this pull request Jun 5, 2020