
Final vocabulary size is not equal to character vocabulary plus num_operations? #16

Closed
huache opened this issue Apr 1, 2017 · 7 comments


@huache commented Apr 1, 2017

For this fake corpus:
when engage what
its character vocabulary size is 7 (e a h w n g t).
Learn BPE with two merge operations, and apply it with the two generated codes (wh and en); we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@), not 9.
Did I calculate it wrong?

In my opinion, the equation final vocabulary size = character vocabulary + num_operations is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge operation of e and n generates two tokens, en and en@@, in the encoded text, and this cannot be predicted in advance. To make sure there are no unknown words, should the final vocabulary size be 18?
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@)
I am really confused!
How should I generate the final vocabulary, and how can I control its size exactly?
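
A minimal Python sketch of this counting (illustrative only, not part of subword-nmt; the corpus and the two merges are just the fake example above):

```python
# Reproduce the counting argument for the fake corpus "when engage what".

# Encoded output after applying the two learned merges (wh, en):
encoded = "wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t".split()

# Observed vocabulary: 7 distinct symbols, not 7 characters + 2 merges = 9.
observed = set(encoded)
print(len(observed), sorted(observed))

# Conservative "no unknown word" vocabulary: every character and every merge,
# each in both word-internal (@@) and word-final form -> 18 symbols here.
chars = sorted(set("when engage what".replace(" ", "")))
merges = ["wh", "en"]
conservative = set(chars + merges) | {s + "@@" for s in chars + merges}
print(len(conservative))  # 18
```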

@rsennrich (Owner)

The equation is a bit simplified, and there are several factors that make a difference:

  • for the initial character vocabulary, we will see most characters in both word-internal and word-final position, so we may need to reserve two symbols per character.
  • it is possible that all occurrences of a character (or a subword unit) get merged into larger subword units.
  • there are other factors that can cause a mismatch between the number of BPE units and the vocabulary size. For instance, when doing joint BPE on a parallel corpus, some subword units may appear only in the source, others only in the target.

Unfortunately, it is a bit tricky to control the size of the final vocabulary exactly, but what I do in practice is to run apply_bpe.py on the training corpus and then extract the vocabulary from that.
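
For illustration, a rough sketch of that practice (the filenames are placeholders; subword-nmt also provides a get_vocab.py script that does essentially this):

```python
# Sketch: extract the vocabulary (with frequencies) from a BPE-encoded
# training corpus. "corpus.bpe" and "vocab.txt" are placeholder filenames.

from collections import Counter

vocab = Counter()
with open("corpus.bpe", encoding="utf-8") as f:
    for line in f:
        vocab.update(line.split())

# One "token count" pair per line, most frequent first.
with open("vocab.txt", "w", encoding="utf-8") as out:
    for token, count in vocab.most_common():
        out.write(f"{token} {count}\n")
```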

@huache (Author) commented Apr 2, 2017

Thank you for your quick reply!
If the vocabulary is extracted from the encoded training corpus, how do we avoid unknown words in the test corpus?

I have a rough idea: how about encoding every word-final unit that is not in the subword vocabulary as single characters? For example, when could be encoded as wh@@ e@@ n in my fake corpus, and then both forms of the initial characters would be added to the extracted vocabulary. There should then be no unknown words in the validation or test corpus.
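
A tiny sketch of that idea (illustrative only; the function name and toy vocabulary below are made up for this example):

```python
# Illustrative only: fall back to single characters when a unit produced by
# BPE is not in the extracted vocabulary, keeping the @@ continuation marker.

def char_fallback(units, vocab):
    """units: BPE segmentation of one word, e.g. ['wh@@', 'en']."""
    out = []
    for u in units:
        if u in vocab:
            out.append(u)
            continue
        bare = u[:-2] if u.endswith("@@") else u
        chars = [c + "@@" for c in bare]
        if not u.endswith("@@"):       # word-final unit: last character gets no @@
            chars[-1] = chars[-1][:-2]
        out.extend(chars)
    return out

vocab = {"wh@@", "a@@", "e@@", "g@@", "e", "t", "n"}  # toy vocabulary without "en"
print(char_fallback(["wh@@", "en"], vocab))           # ['wh@@', 'e@@', 'n']
```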

@rsennrich (Owner)

I pushed some code that should prevent unknown words in valid/test corpora (except for unknown characters). This is done with a more consistent handling of the end-of-word token (you need to rerun learn_bpe.py for this), and by passing a vocabulary file to apply_bpe.py, suppressing any subword units that are out-of-vocabulary (or whose frequency is below a given threshold).

See the README for usage instructions.
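
As a hedged illustration of that workflow in Python (the README is the authoritative reference; the imports, the BPE/read_vocabulary API, and the filenames below are assumptions that may differ between versions of apply_bpe.py):

```python
# Sketch: apply BPE to a test file while suppressing subword units whose
# training-corpus frequency is below a threshold. Filenames are placeholders,
# and the exact API of apply_bpe.py may differ between versions.

import codecs
from subword_nmt.apply_bpe import BPE, read_vocabulary

with codecs.open("vocab.txt", encoding="utf-8") as vocab_file:
    vocab = read_vocabulary(vocab_file, 50)   # frequency threshold = 50

with codecs.open("codes.bpe", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file, vocab=vocab)

with codecs.open("test.txt", encoding="utf-8") as fin, \
     codecs.open("test.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.segment(line.strip()) + "\n")
```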

@huache (Author) commented Apr 22, 2017

Great! I really appreciate it. I'd like to retrain my model with the new code; I expect its performance will be better.
Thanks again!

@hoangcuong2011

Hi @rsennrich,

I noticed the number of unique tokens extracted from an encoded training corpus is larger than the number of merge operations.

For instance, when I set the BPE size to 32K, the number of unique tokens in the encoded training corpus is 32755.

Several tokens that are in the encoded corpus but not in the vocab file generated by learn-bpe are listed below. I am curious why this happens. Thanks!

ectiv
inc
seri
ş
pri
ť
ث@@
Ń@@
propri
ל@@
फ@@
ь@@
ศ@@
Å@@
parti

@rsennrich (Owner)

Hi Hoang,

The number of unique tokens is typically larger than the number of merge operations because you have to add the size of the character vocabulary (×2 to account for the fact that characters can be word-internal or word-final). This also explains some of the unique tokens you see, such as "फ@@".

The numbers won't match perfectly, because some characters may only occur word-internally for example, or because all occurrences of a character or subword may have been merged into larger subwords.

As for your question of why your encoded corpus contains larger subwords that are not in the list of merge operations: this shouldn't happen. How did you search for these mismatches?
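
One way such a check might be done (a rough sketch; the filenames are placeholders, and the assumed codes-file format — one merge pair per line, "</w>" marking word-final units, "#" starting the version header — may not match every version of learn_bpe.py):

```python
# Sketch: find tokens in a BPE-encoded corpus that are explained neither by a
# merge operation nor by the single-character fallback.

# Tokens actually present in the encoded corpus ("corpus.bpe" is a placeholder).
corpus_tokens = set()
with open("corpus.bpe", encoding="utf-8") as f:
    for line in f:
        corpus_tokens.update(line.split())

# Tokens produced directly by the merge operations ("codes.bpe" is a placeholder).
explained = set()
with open("codes.bpe", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if line.startswith("#") or len(parts) != 2:
            continue
        a, b = parts
        merged = (a + b).replace("</w>", "")
        explained.add(merged if b.endswith("</w>") else merged + "@@")

# Single characters need no merge; add both word-internal and word-final forms.
chars = {c for tok in corpus_tokens for c in tok.replace("@@", "")}
explained |= chars | {c + "@@" for c in chars}

print(sorted(corpus_tokens - explained))
```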

@hoangcuong2011

Hi Rico,

Thanks for your detailed reply. In the beginning I naively thought we should use the list of merge operations as the vocab. Having learned from this thread, I now know that a better way to use subwords is to extract the vocab (I suppose the most frequent tokens) from the encoded corpus. I actually ran an experiment myself to verify this and observed a better BLEU score this way (please let me know if you have a different experience with this). For me it is interesting to know these details under the hood. Thanks.

BTW, I found these mismatches using a preprocessed training corpus uploaded by another team. I still have not figured out exactly why this happens, but since you think it should not happen, I suppose the oddity does not come from subword-nmt. Thanks.
