Experimental translation datasets #751
Conversation
akurniawan commented on May 2, 2020
- Add a new abstraction for translation datasets with a new API
- Add utilities for automatically building datasets from files
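For a sense of the intended usage, here is a minimal sketch based on the diffs discussed below (the module path and the behavior of the defaults are assumptions, not necessarily the merged API):

```python
from torchtext.experimental.datasets import Multi30k  # module path assumed

# With no arguments, the data files are downloaded into the default root
# and the tokenizers/vocabularies are built automatically.
train, valid, test = Multi30k()

# Each example is assumed to be a (source, target) pair of token-id tensors.
src, tgt = train[0]
print(src.shape, tgt.shape)
```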
Merging from upstream
…_remove for removing multiple files at once
@zhangguanheng66 since the WMT test is pretty long, it causes the test to fail. Should we skip the test for WMT? I initially marked the test as slow, but after pulling from the latest master the annotation has been removed, so I'm not sure whether we are still allowed to skip the test in this case.
We recently made some changes to our CI tests and switched to CircleCI. There is no slow decorator anymore.
@akurniawan Just wanted to check whether you want to continue the PR? Thanks.
@zhangguanheng66 yes, I still want to continue the PR. I was hospitalized for the last two weeks and had no access to my laptop. I will start on the fix on Monday if that's okay with you.
Hope you are fine and recover soon. No rush on the PR; I was just checking whether you are still interested in it.
@zhangguanheng66 I have already rebased onto the latest master and it fixes the problem. The PR is now ready for review again. Let me know if you have further concerns.
requirements.txt (Outdated)
@@ -7,6 +7,9 @@ requests
# Optional NLP tools
nltk
spacy
# Needed for machine translation tokenizers
https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz#egg=de_core_news_sm==2.2.5
What's this for? If spacy is not installed for a specific language, it will throw an error message.
Initially I added this to resolve issues in the unittest, since the dataset used was primarily German and I was using spacy as the tokenizer. But after some thinking, maybe this is unnecessary to include in requirements.txt.
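For context, the pinned model is what a spacy-backed tokenizer loads; a minimal sketch of that usage (assuming the de_core_news_sm model has been installed, e.g. via `python -m spacy download de_core_news_sm`):

```python
from torchtext.data.utils import get_tokenizer

# Loading the spacy pipeline fails if the de_core_news_sm model is missing,
# which is why the unittest originally needed it pinned in requirements.txt.
de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
print(de_tokenizer("Zwei Männer stehen am Herd."))
```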
yield " ".join(row) | ||
|
||
|
||
def _clean_xml_file(f_xml): |
What's this function for?
Some files in IWSLT are originally xml files, so we need this function to extract their contents. This function was already there in the branch you created previously, and it was also present in the old translation dataset class.
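For reference, such an XML-cleaning helper looks roughly like the following sketch (based on the old translation dataset code; the exact version in this PR may differ):

```python
import codecs
import os
import xml.etree.ElementTree as ET


def _clean_xml_file(f_xml):
    """Write the plain-text <seg> contents of an IWSLT XML file to a sibling text file."""
    f_txt = os.path.splitext(f_xml)[0]  # e.g. "...tst2013.de-en.de.xml" -> "...tst2013.de-en.de"
    with codecs.open(f_txt, mode='w', encoding='utf-8') as fd_txt:
        root = ET.parse(f_xml).getroot()[0]
        for doc in root.findall('doc'):
            for seg in doc.findall('seg'):
                fd_txt.write(seg.text.strip() + '\n')
```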
fd_txt.write(e.text.strip() + '\n')


def _clean_tags_file(f_orig):
What's this function for?
There are files such as train.tags.en-de.en, train.tags.en-cs.en, etc., which contain something similar to xml files. The contents look like the following:
<translator href="http://www.ted.com/profiles/383709">Abdulatif Mahgoub</translator>
<reviewer href="http://www.ted.com/profiles/122540">Anwar Dafa-Alla</reviewer>
<url>http://www.ted.com/talks/steve_keil_a_manifesto_for_play_for_bulgaria_and_beyond</url>
<keywords>talks, Europe, TEDx, business, culture, economics, play, social change</keywords>
<speaker>Steve Keil</speaker>
<talkid>1170</talkid>
<title>ستيف كايل: بيان رسمي للعب ، الى بلغاريا و ماوراءها</title>
<description>TED Talk Subtitles and Transcript: في تيدأكس بيىجي في صوفيا، بلغاريا ، في صوفيا ، يحارب ستيف كايل "المعلومات الثقافية المتوارثة" والتي أصابت وطنه بلغاريا - ويدعو للعودة الى اللعب لتنشيط الاقتصاد، التعليم ، والمجتمع. حديث رائع ذو رسالة عالمية للأفراد في كل مكان ، الذين يعيدون ابتكار اماكن عملهم ، مدارسهم ، حياتهم.</description>
إنني هنا اليوم لأبدأ ثورة.
والآن وقبل أن تهبوا، أو تبدأوا بالغناء، أو اختيار لون مفضل، أريد تحديد ما أعنيه بثورة.
بكلمة ثورة، أعني تغييرا جذريا وبعيد المدى في طريقة تفكيرنا وتصرفاتنا-- في طريقة تفكيرنا وطريقة تصرفاتنا.
والآن لماذا ، ستيف، لماذا نحتاج الى ثورة؟
نحتاج إلى ثورة لأن الأشياء لا تعمل؛ انها لا تعمل تماماً.
وهذا يجعلني حزينا جدا، لأنني سئمت وتعبت من الأشياء التي لا تعمل.
أوتعرفون، لقد سئمت وتعبت منّا ونحن لا نحيا بامكانياتنا.
سئمت وتعبت منّا وقد أصبحنا متأخرين.
Do we need that information? IMO, without those tag files, the PR is much simpler.
Sorry for the lack of context. Those tags files are part of the IWSLT dataset when you download it.
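For reference, the tag-cleaning helper is roughly the following sketch: it drops the metadata lines shown above and keeps only the plain-text sentences (based on the old translation dataset code; the version in this PR may differ):

```python
import codecs
import io


def _clean_tags_file(f_orig):
    """Strip metadata lines (<url>, <talkid>, ...) from a train.tags.* file."""
    xml_tags = ['<url', '<keywords', '<talkid', '<description',
                '<reviewer', '<translator', '<title', '<speaker']
    f_txt = f_orig.replace('.tags', '')
    with codecs.open(f_txt, mode='w', encoding='utf-8') as fd_txt, \
            io.open(f_orig, mode='r', encoding='utf-8') as fd_orig:
        for line in fd_orig:
            # Keep only lines that carry actual sentence text.
            if not any(tag in line for tag in xml_tags):
                fd_txt.write(line.strip() + '\n')
```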
if 'xml' in fname:
    _clean_xml_file(fname)
    file_archives.append(os.path.splitext(fname)[0])
elif "tags" in fname:
What's in the tags files? Do we need to provide them to users?
If I understand correctly, this is needed to extract the real content and remove the tags from .tags files. This function was there in the branch you created previously, and it was also present in the old translation dataset class.
valid_filename="val",
test_filename="test_2016_flickr",
root='.data'):
    """ Define translation datasets: Multi30k
In the docs, it's better to summarize all the combinations of languages and years as the options for users.
Fixed, please let me know if it's sufficient.
The available datasets include:

Arguments:
    languages: The source and target languages for the datasets.
Same as previous comments
Fixed, please let me know if it's sufficient.
Thanks @akurniawan. LGTM. Added some questions to help me understand some of the code.
return tuple(datasets)


class RawTextIterableDataset(torch.utils.data.IterableDataset):
Should we call this RawTranslationIterableDataset?
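For reference, a stripped-down sketch of what such a raw-text iterable dataset can look like (an illustration only, not necessarily the exact class in the PR):

```python
import torch


class RawTextIterableDataset(torch.utils.data.IterableDataset):
    """Wraps a generator of raw (source, target) sentence pairs."""

    def __init__(self, iterator):
        super().__init__()
        self._iterator = iterator

    def __iter__(self):
        # Yield raw text pairs; tokenization and numericalization
        # happen later in the setup functions.
        for src_line, tgt_line in self._iterator:
            yield src_line, tgt_line
```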
def Multi30k(languages="de-en",
             train_filename="train",
Seems to me that we can combine languages and train_filename together and have the train data named explicitly by train_filename, for example train_filename="train.de". Same for valid_filename.
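Under that suggestion, a call might look like the following (a hypothetical illustration of the proposed signature; the filenames are taken from the defaults shown later in the diff):

```python
from torchtext.experimental.datasets import Multi30k  # module path assumed

# Hypothetical: the language pair is carried by the filename suffixes,
# so a separate `languages` argument is no longer needed.
train, valid, test = Multi30k(
    train_filenames=("train.de", "train.en"),
    valid_filenames=("val.de", "val.en"),
    test_filenames=("test2016.de", "test2016.en"),
)
```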
from torchtext.data.utils import get_tokenizer


def vocab_func(vocab):
You could simply import those utils from https://github.com/pytorch/text/blob/master/torchtext/experimental/functional.py
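Those utils are small closures, roughly along these lines (sketched from torchtext/experimental/functional.py; the exact code may differ):

```python
import torch


def vocab_func(vocab):
    # Map an iterable of tokens to a list of ids using the given vocab.
    def func(tok_iter):
        return [vocab[tok] for tok in tok_iter]
    return func


def totensor(dtype):
    # Convert a list of ids into a tensor of the given dtype.
    def func(ids_list):
        return torch.tensor(ids_list).to(dtype)
    return func
```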
vocab=(None, None),
removed_tokens=['<unk>']):
    """ Define translation datasets: Multi30k
    Separately returns train/valid/test datasets as a tuple
I think we should keep a copy of the available language/train_filename options here because most users will read the docs here.
done
        Default: ('val.de', 'val.en')
    test_filenames: the source and target filenames for test.
        Default: ('test2016.de', 'test2016.en')
    tokenizer: the tokenizer used to preprocess source and target raw text data.
We should explicitly say that the tokenizer needs a tuple of tokenizers. And the default is (get_tokenizer("spacy", language='de_core_news_sm'), get_tokenizer("spacy", language='en_core_web_sm')). Same for vocab.
done
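For illustration, the documented default amounts to a pair of spacy tokenizers, constructed and passed like this (module path for Multi30k assumed; the spacy models must be installed separately):

```python
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import Multi30k  # module path assumed

# (source, target) tokenizers, mirroring the default described above.
tokenizers = (get_tokenizer("spacy", language='de_core_news_sm'),
              get_tokenizer("spacy", language='en_core_web_sm'))

train, valid, test = Multi30k(tokenizer=tokenizers)
```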
@akurniawan Thanks for the contribution and everything looks good to me.
nit: add a comment for docs.
Just found that we need a doc string in https://github.com/pytorch/text/blob/master/docs/source/experimental_datasets.rst
@akurniawan Let me know when you are done so I can merge this PR.
@zhangguanheng66 I have already added the doc string, please take a look.