Rework the translate problem #370
Conversation
Split language pairs for clarity; dissociate ende / enfr and make them independent.
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g. "I signed it!").

CLAs look good, thanks!
@lukaszkaiser please review.
@@ -338,13 +338,19 @@ def generate():
   # Use Tokenizer to count the word occurrences.
   with tf.gfile.GFile(filepath, mode="r") as source_file:
-    file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+    file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
I agree 1e6 is a better default than 3.5e5, but:
- ideally it should be a parameter (which can be overridden in the problem spec);
- now the if branch is redundant and misleading.
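A minimal sketch of what such a parameter could look like, assuming file_byte_budget becomes an overridable property on the problem class; the class names here are illustrative, not the actual tensor2tensor API:

# Sketch: file_byte_budget as an overridable property instead of a constant.

class TranslateProblem(object):  # stand-in for the real base problem class

  @property
  def file_byte_budget(self):
    """Approximate number of bytes to sample per training file."""
    return 1e6  # the new default proposed in this PR


class TranslateEnzhSmall(TranslateProblem):  # hypothetical problem spec

  @property
  def file_byte_budget(self):
    return 3.5e5  # a problem that wants a smaller sample simply overrides it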
-    file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+    file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
     counter = 0
     countermax = int(source_file.size() / 1e6)
I understand the goal: every countermax-th line is sampled, but:
- I would suggest making countermax lower, otherwise in 50% of cases we reach the end of the file without collecting file_byte_budget bytes of sampled data (I expect line lengths are distributed randomly in the training data).
- Don't repeat the constant 1e6; use file_byte_budget instead.
- Add a comment about the intended goal: to get a representative sample from (possibly) unshuffled training data.
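Putting those three points together, a sketch of the sampling loop might look like this; variable names follow the diff above, but the exact integration into generate() is an assumption:

import tensorflow as tf


def sample_lines(filepath, file_byte_budget=1e6):
  """Yield roughly file_byte_budget bytes of lines spread across the file.

  Taking every countermax-th line gives a representative sample even when
  the training data is not shuffled.
  """
  with tf.gfile.GFile(filepath, mode="r") as source_file:
    counter = 0
    # Halve the step so we normally exhaust the byte budget before EOF.
    countermax = int(source_file.size() / file_byte_budget / 2)
    for line in source_file:
      if counter < countermax:
        counter += 1
        continue
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      counter = 0
      yield line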
     line = line.strip()
     file_byte_budget -= len(line)
     counter = 0
     yield line
- Maybe it would be better to have this file_byte_budget change as a separate PR, as it is not really related to the wmt.py refactoring and (unlike the refactoring) it changes the behavior and may affect BLEU.
- Now I realize that an even better solution would be to increase file_byte_budget dynamically as long as the subword/BPE algorithm ends up with min_count=1 (or 2, or a user-specified constant). Or (to make it faster) we could make sure the space-delimited-token vocabulary (or e.g. the 10-byte-token vocabulary for non-space languages) is significantly bigger than the expected subword vocab size.
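A rough illustration of the second, faster idea: before building the subword vocab, check that the sampled token vocabulary is comfortably larger than the target subword vocab size, and enlarge the budget if not. The helper name, the ratio, and the usage are assumptions for illustration, not a concrete proposal from the thread:

import collections


def sample_has_enough_tokens(sampled_lines, target_vocab_size, ratio=3):
  """Heuristic: the space-delimited token vocabulary of the sample should be
  noticeably larger than the intended subword vocab size, otherwise the
  subword algorithm is likely to bottom out at min_count=1."""
  token_counts = collections.Counter()
  for line in sampled_lines:
    token_counts.update(line.split())
  return len(token_counts) >= ratio * target_vocab_size


# Usage sketch: keep doubling the budget until the sample looks rich enough
# (sample_lines is the hypothetical sampler sketched earlier in this thread).
# budget = 1e6
# while not sample_has_enough_tokens(sample_lines(path, budget), 2**15):
#   budget *= 2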
Let's not make it too complicated. Using a parameter file_budget (per file) should be good enough for the great majority of cases, no?
I think it looks good to merge, but I would like to wait until Martin's comments are addressed; let me know!
OK guys, let's make this simple; we can still adjust later on.
Just a thought: what do you think of applying bi_vocabs not only for enzh but also for all other problems that are bilingual?
According to my experiments on web crawl data, straightforward subword works fine; there is no need for bi_vocab given sufficient data size.
@mehmedes: I don't think bi_vocab is a good idea because it prevents segmenting named entities the same way in both languages (and thus translating them correctly/acceptably). Even for English-Chinese it is an open question whether bi_vocab gives better results (probably depending on the data).
I think we should make it easy to use bi-vocab, but keep it all on a problem-to-problem basis. There might also be more work to do on the vocab side and to make the budget easier to tweak. But I think this PR is large enough and good as it is. I think it's ready to merge; should I wait for anything more? I'll wait an hour or so and merge then if no one complains. Great, thanks for doing this!
Looks good, thanks!
Yes, it's OK to merge.
Thanks guys!
Rename wmt to translate and make it a single class file.
Split language pairs for clarity into separate translate_enxx.py files.
Dissociate ende / enfr and make them independent; there is no real reason to combine them.
For enfr, I commented out the huge WMT dataset and replaced it with a baseline 1M-segment corpus we use in OpenNMT. Good for benchmarking too :)
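A hedged sketch of the kind of per-pair problem file this split produces; the class name and property names follow the old wmt.py problems and are assumptions, not necessarily what the PR ends up with:

# translate_enfr.py -- illustrative layout only.
from tensor2tensor.data_generators import translate
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnfrSmall(translate.TranslateProblem):  # hypothetical name
  """EN-FR on a ~1M-segment baseline corpus instead of the full WMT data."""

  @property
  def targeted_vocab_size(self):
    return 2**15

  @property
  def vocab_name(self):
    return "vocab.enfr"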