
Conversation

vince62s
Contributor

Rename wmt to translate, making it a single class file.
Split the language pairs into separate translate_enxx.py files for clarity (layout sketched below).
Dissociate ende / enfr and make them independent; there is no real reason to combine them.

For enfr, I commented out the huge WMT dataset and replaced it with the baseline 1M-segment corpus we use in OpenNMT. Good for benchmarking too :)

split language pairs for clarity
dissociate ende / enfr, make them independent
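
Roughly, the layout this implies; the translate_ende / translate_enfr names below just follow the enxx pattern from the description, and the exact file set is in the diff:

```
data_generators/wmt.py   (removed; previously held all language pairs)
  -> data_generators/translate.py        (single shared Translate class)
     data_generators/translate_ende.py   (English-German problems)
     data_generators/translate_enfr.py   (English-French problems)
     ...
```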
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

@vince62s
Contributor Author

I signed it!

@googlebot

CLAs look good, thanks!

@vince62s changed the title from "Rework the translate problem" to "[do not merge yet] Rework the translate problem" on Oct 20, 2017
@vince62s changed the title from "[do not merge yet] Rework the translate problem" to "Rework the translate problem" on Oct 22, 2017
@vince62s
Contributor Author

@lukaszkaiser please review.

@@ -338,13 +338,19 @@ def generate():

      # Use Tokenizer to count the word occurrences.
      with tf.gfile.GFile(filepath, mode="r") as source_file:
-       file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+       file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
Contributor

I agree 1e6 is a better default than 3.5e5, but:

  • ideally it should be a parameter (which can be overridden in the problem spec; a sketch follows below)
  • now the if branch is redundant and misleading
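
A minimal sketch of what such an overridable parameter could look like, assuming a problem-style base class that exposes the budget as a property; the class names here are illustrative, not the actual tensor2tensor API:

```python
class TranslateProblem(object):
  """Illustrative base class: subclasses override the budget as needed."""

  @property
  def file_byte_budget(self):
    # Default byte budget per training file when sampling lines for
    # vocabulary generation; no per-language if branch needed anymore.
    return 1e6


class TranslateEnfrSmall(TranslateProblem):
  """Hypothetical problem spec that overrides the default."""

  @property
  def file_byte_budget(self):
    return 3.5e5
```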

-       file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+       file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
+       counter = 0
+       countermax = int(source_file.size() / 1e6)
Contributor

I understand the goal: every countermax-th line is sampled, but:

  • I would suggest making countermax lower, otherwise in 50% of cases we reach the end of the file without collecting file_byte_budget bytes of sampled data (I expect line lengths to be distributed randomly in the training data).
  • Don't repeat the constant 1e6; use file_byte_budget instead.
  • Add a comment about the intended goal: to get a representative sample from (possibly) unshuffled training data.

(A sketch with these suggestions applied follows below.)
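
A rough sketch of the loop with all three suggestions applied; the wrapper function name and the stride-halving factor are my assumptions, not the exact code in this PR:

```python
import tensorflow as tf

def sample_lines(filepath, file_byte_budget=1e6):
  """Yield a representative sample of lines from (possibly unshuffled)
  training data, totaling roughly file_byte_budget bytes."""
  with tf.gfile.GFile(filepath, mode="r") as source_file:
    counter = 0
    # Derive the stride from the budget instead of repeating the 1e6
    # constant; halving it keeps the stride low enough that the budget
    # is usually exhausted before the end of the file is reached.
    countermax = int(source_file.size() / file_byte_budget / 2)
    for line in source_file:
      if counter < countermax:
        counter += 1
        continue
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      counter = 0
      yield line
```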

+         line = line.strip()
+         file_byte_budget -= len(line)
+         counter = 0
+         yield line
Contributor

  • Maybe it would be better to make this file_byte_budget change a separate PR, as it is not really related to the wmt.py refactoring and (unlike the refactoring) it changes behavior and may affect BLEU.
  • Now I realize that an even better solution would be to increase file_byte_budget dynamically as long as the subword/BPE algorithm ends up with min_count=1 (or 2, or a user-specified constant). Or (to make it faster) we could make sure the space-delimited-token vocabulary (or e.g. a 10-byte-token vocabulary for languages without spaces) is significantly bigger than the expected subword vocab size. (A sketch of this faster variant follows below.)
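
Purely to illustrate the faster variant, a sketch under stated assumptions: sample_lines is the hypothetical helper sketched above, and the 2x margin and doubling schedule are placeholders, not anything specified in this PR:

```python
def find_file_byte_budget(filepath, target_subword_vocab_size,
                          initial_budget=1e6, max_budget=1e9):
  """Grow the byte budget until the sampled space-delimited-token
  vocabulary is significantly bigger than the expected subword vocab."""
  budget = initial_budget
  while budget < max_budget:
    token_types = set()
    for line in sample_lines(filepath, file_byte_budget=budget):
      token_types.update(line.split())
    # Heuristic margin (assumed 2x): enough raw token types that the
    # subword algorithm is not forced down to min_count=1.
    if len(token_types) >= 2 * target_subword_vocab_size:
      return budget
    budget *= 2
  return budget
```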

Contributor

Let's not make it too complicated. Using a file_budget parameter (per file) should be good enough for the vast majority of cases, no?

@lukaszkaiser
Contributor

I think it looks good to merge, but I would like to wait until Martin's comments are addressed. Let me know!

@vince62s
Contributor Author

OK guys, let's make this simple; we can still adjust later on.
I adjusted countermax to take into account Martin's comment on the file distribution.
As it is now, it stays similar to the original concept, i.e. the same file_byte_budget for each file.
Let's get results from a few training runs, but my feeling is that this is good enough for most cases.

@mehmedes

Just a thought: what do you think of applying bi_vocabs not only to enzh but to all other bilingual problems as well?

@colmantse

According to my experiments on web-crawl data, straightforward subword segmentation works fine; there is no need for bi_vocab given sufficient data size.

@martinpopel
Contributor

@mehmedes: I don't think bi_vocab is a good idea, because it prevents segmenting named entities the same way in both languages (and thus translating them correctly/acceptably): with separate vocabularies, a rare name may be split into different subwords on the source and target side, so the model cannot simply copy it through. Even for English-Chinese it is an open question whether bi_vocab gives better results (probably depending on the data).

@lukaszkaiser
Contributor

I think we should make it easy to use bi-vocab, but keep it all on a problem-by-problem basis. There might also be more work to do on the vocab side and on making the budget easier to tweak. But I think this PR is large enough and good as it is. I think it's ready to merge; should I wait for anything more? I'll wait an hour or so and then merge if no one complains. Thanks a lot for doing this!

Contributor

@lukaszkaiser lukaszkaiser left a comment

Looks good, thanks!

@vince62s
Contributor Author

Yes, it's OK to merge.

@lukaszkaiser lukaszkaiser merged commit a836d66 into tensorflow:master Oct 25, 2017
@lukaszkaiser
Contributor

Thanks guys!
