Rework the translate problem #370
Conversation
Split language pairs for clarity; dissociate ende / enfr and make them independent.
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g. "I signed it!").

CLAs look good, thanks!
@lukaszkaiser please review.
@@ -338,13 +338,19 @@ def generate():
   # Use Tokenizer to count the word occurrences.
   with tf.gfile.GFile(filepath, mode="r") as source_file:
-    file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+    file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
I agree 1e6 is a better default than 3.5e5, but:
- ideally it should be a parameter (which can be overridden in the problem spec);
- now the if branch is redundant and misleading.
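A minimal sketch of what such a parameter could look like, assuming file_byte_budget becomes an overridable property on the problem class; the class names here are illustrative, not the actual tensor2tensor API:

# Sketch: file_byte_budget as an overridable property instead of a constant.

class TranslateProblem(object):  # stand-in for the real base problem class

  @property
  def file_byte_budget(self):
    """Approximate number of bytes to sample per training file."""
    return 1e6  # the new default proposed in this PR


class TranslateEnzhSmall(TranslateProblem):  # hypothetical problem spec

  @property
  def file_byte_budget(self):
    return 3.5e5  # a problem that wants a smaller sample simply overrides it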
-    file_byte_budget = 3.5e5 if filepath.endswith("en") else 7e5
+    file_byte_budget = 1e6 if filepath.endswith("en") else 1e6
     counter = 0
     countermax = int(source_file.size() / 1e6)
I understand the goal: every countermax-th line is sampled, but:
- I would suggest making countermax lower, otherwise in 50% of cases we reach the end of the file without collecting file_byte_budget bytes of sampled data (I expect line lengths are distributed randomly in the training data).
- Don't repeat the constant 1e6; use file_byte_budget instead.
- Add a comment about the intended goal: to get a representative sample from (possibly) unshuffled training data.
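Putting those three points together, a sketch of the sampling loop might look like this; variable names follow the diff above, but the exact integration into generate() is an assumption:

import tensorflow as tf


def sample_lines(filepath, file_byte_budget=1e6):
  """Yield roughly file_byte_budget bytes of lines spread across the file.

  Taking every countermax-th line gives a representative sample even when
  the training data is not shuffled.
  """
  with tf.gfile.GFile(filepath, mode="r") as source_file:
    counter = 0
    # Halve the step so we normally exhaust the byte budget before EOF.
    countermax = int(source_file.size() / file_byte_budget / 2)
    for line in source_file:
      if counter < countermax:
        counter += 1
        continue
      if file_byte_budget <= 0:
        break
      line = line.strip()
      file_byte_budget -= len(line)
      counter = 0
      yield line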
     line = line.strip()
     file_byte_budget -= len(line)
     counter = 0
     yield line
- Maybe it would be better to have this file_byte_budget change as a separate PR, as it is not really related to the wmt.py refactoring and (unlike the refactoring) it changes the behavior and may affect BLEU.
- Now I realize that an even better solution would be to increase file_byte_budget dynamically as long as the subword/BPE algorithm ends up with min_count=1 (or 2, or a user-specified constant). Or (to make it faster) we could make sure the space-delimited-token vocabulary (or e.g. the 10-byte-token vocabulary for non-space languages) is significantly bigger than the expected subword vocab size.
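A rough illustration of the second, faster idea: before building the subword vocab, check that the sampled token vocabulary is comfortably larger than the target subword vocab size, and enlarge the budget if not. The helper name, the ratio, and the usage are assumptions for illustration, not a concrete proposal from the thread:

import collections


def sample_has_enough_tokens(sampled_lines, target_vocab_size, ratio=3):
  """Heuristic: the space-delimited token vocabulary of the sample should be
  noticeably larger than the intended subword vocab size, otherwise the
  subword algorithm is likely to bottom out at min_count=1."""
  token_counts = collections.Counter()
  for line in sampled_lines:
    token_counts.update(line.split())
  return len(token_counts) >= ratio * target_vocab_size


# Usage sketch: keep doubling the budget until the sample looks rich enough
# (sample_lines is the hypothetical sampler sketched earlier in this thread).
# budget = 1e6
# while not sample_has_enough_tokens(sample_lines(path, budget), 2**15):
#   budget *= 2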
Let's not make it too complicated. Using a parameter file_budget (per file) should be good enough for the great majority of cases, no?
I think it looks good to merge, but I would like to wait until Martin's comments are addressed; let me know!
OK guys, let's make this simple; we can still adjust later on.
Just a thought: what do you think of applying bi_vocabs not only for enzh but also for all other problems that are bilingual?
According to my experiments on web crawl data, straightforward subword works fine; there is no need for bi_vocab given sufficient data size.
@mehmedes: I don't think bi_vocab is a good idea because it prevents segmenting named entities the same way in both languages (and thus translating them correctly/acceptably). Even for English-Chinese it is an open question whether bi_vocab gives better results (probably depending on the data).
I think we should make it easy to use bi-vocab, but keep it all on a problem-to-problem basis. There might also be more work to do on the vocab side and to make the budget easier to tweak. But I think this PR is large enough and good as it is. I think it's ready to merge; should I wait for anything more? I'll wait an hour or so and merge then if no one complains. Great, thanks for doing this!
Looks good, thanks!
Yes, it's OK to merge.
Thanks guys!
Rename wmt to translate and make it a single class file.
Split language pairs for clarity into separate translate_enxx.py files.
Dissociate ende / enfr and make them independent; there is no real reason to combine them.
For enfr, I commented out the huge WMT dataset and replaced it with a baseline 1M-segment corpus we use in OpenNMT. Good for benchmarking too :)
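A hedged sketch of the kind of per-pair problem file this split produces; the class name and property names follow the old wmt.py problems and are assumptions, not necessarily what the PR ends up with:

# translate_enfr.py -- illustrative layout only.
from tensor2tensor.data_generators import translate
from tensor2tensor.utils import registry


@registry.register_problem
class TranslateEnfrSmall(translate.TranslateProblem):  # hypothetical name
  """EN-FR on a ~1M-segment baseline corpus instead of the full WMT data."""

  @property
  def targeted_vocab_size(self):
    return 2**15

  @property
  def vocab_name(self):
    return "vocab.enfr"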