
Final vocabulary size is not equal to character vocabulary plus num_operations? #16

Closed
huache opened this issue Apr 1, 2017 · 7 comments


@huache commented Apr 1, 2017

For this fake corpus:
when engage what
its character vocabulary size is 7 (e a h w n g t).
Learn BPE with two merge operations, and apply it with the two generated codes (wh and en); we get:
wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t
The final vocabulary size is 7 (a@@ wh@@ g@@ e t en en@@), not 9.
Did I calculate it wrong?

In my opinion, the equation final vocabulary size = character vocabulary + num_operations is based on the assumption that every merge operation generates exactly one new token.
But in this case, the merge operation of e and n generates two tokens, en and en@@, in the encoded text, and this cannot be predicted in advance. To make sure there are no unknown words, should the final vocabulary size be 18?
(e a h w n g t wh en e@@ a@@ h@@ w@@ n@@ g@@ t@@ wh@@ en@@)
I am really confused!
How should I generate the final vocabulary, and how can I control its size exactly?
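
A minimal Python sketch of this counting (illustrative only, not part of subword-nmt; the corpus and the two merges are just the fake example above):

```python
# Reproduce the counting argument for the fake corpus "when engage what".

# Encoded output after applying the two learned merges (wh, en):
encoded = "wh@@ en en@@ g@@ a@@ g@@ e wh@@ a@@ t".split()

# Observed vocabulary: 7 distinct symbols, not 7 characters + 2 merges = 9.
observed = set(encoded)
print(len(observed), sorted(observed))

# Conservative "no unknown word" vocabulary: every character and every merge,
# each in both word-internal (@@) and word-final form -> 18 symbols here.
chars = sorted(set("when engage what".replace(" ", "")))
merges = ["wh", "en"]
conservative = set(chars + merges) | {s + "@@" for s in chars + merges}
print(len(conservative))  # 18
```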

@rsennrich (Owner)

The equation is a bit simplified, and there are several factors that make a difference:

  • for the initial character vocabulary, we will see most characters in both word-internal and word-final position, so we may need to reserve two symbols per character.
  • it is possible that all occurrences of a character (or a subword unit) get merged into larger subword units.
  • there are other factors that can cause a mismatch between the number of BPE units and the vocabulary size. For instance, when doing joint BPE on a parallel corpus, some subword units may appear only in the source, others only in the target.

Unfortunately, it is a bit tricky to control the size of the final vocabulary exactly, but what I do in practice is to run apply_bpe.py on the training corpus and then extract the vocabulary from that.
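
For illustration, a rough sketch of that practice (the filenames are placeholders; subword-nmt also provides a get_vocab.py script that does essentially this):

```python
# Sketch: extract the vocabulary (with frequencies) from a BPE-encoded
# training corpus. "corpus.bpe" and "vocab.txt" are placeholder filenames.

from collections import Counter

vocab = Counter()
with open("corpus.bpe", encoding="utf-8") as f:
    for line in f:
        vocab.update(line.split())

# One "token count" pair per line, most frequent first.
with open("vocab.txt", "w", encoding="utf-8") as out:
    for token, count in vocab.most_common():
        out.write(f"{token} {count}\n")
```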

@huache (Author) commented Apr 2, 2017

Thank you for your quick reply!
If the vocabulary is extracted from the encoded training corpus, how do we avoid unknown words in the test corpus?

I have a rough idea: how about encoding every word-final unit that is not in the subword vocabulary as single characters? For example, when could be encoded as wh@@ e@@ n in my fake corpus, and then both forms of the initial characters would be added to the extracted vocabulary. There should then be no unknown words in the validation or test corpus.
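
A tiny sketch of that idea (illustrative only; the function name and toy vocabulary below are made up for this example):

```python
# Illustrative only: fall back to single characters when a unit produced by
# BPE is not in the extracted vocabulary, keeping the @@ continuation marker.

def char_fallback(units, vocab):
    """units: BPE segmentation of one word, e.g. ['wh@@', 'en']."""
    out = []
    for u in units:
        if u in vocab:
            out.append(u)
            continue
        bare = u[:-2] if u.endswith("@@") else u
        chars = [c + "@@" for c in bare]
        if not u.endswith("@@"):       # word-final unit: last character gets no @@
            chars[-1] = chars[-1][:-2]
        out.extend(chars)
    return out

vocab = {"wh@@", "a@@", "e@@", "g@@", "e", "t", "n"}  # toy vocabulary without "en"
print(char_fallback(["wh@@", "en"], vocab))           # ['wh@@', 'e@@', 'n']
```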

@rsennrich (Owner)

I pushed some code that should prevent unknown words in valid/test corpora (except for unknown characters). This is done with a more consistent handling of the end-of-word token (you need to rerun learn_bpe.py for this), and by passing a vocabulary file to apply_bpe.py, suppressing any subword units that are out-of-vocabulary (or whose frequency is below a given threshold).

See the README for usage instructions.
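
As a hedged illustration of that workflow in Python (the README is the authoritative reference; the imports, the BPE/read_vocabulary API, and the filenames below are assumptions that may differ between versions of apply_bpe.py):

```python
# Sketch: apply BPE to a test file while suppressing subword units whose
# training-corpus frequency is below a threshold. Filenames are placeholders,
# and the exact API of apply_bpe.py may differ between versions.

import codecs
from subword_nmt.apply_bpe import BPE, read_vocabulary

with codecs.open("vocab.txt", encoding="utf-8") as vocab_file:
    vocab = read_vocabulary(vocab_file, 50)   # frequency threshold = 50

with codecs.open("codes.bpe", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file, vocab=vocab)

with codecs.open("test.txt", encoding="utf-8") as fin, \
     codecs.open("test.bpe", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.segment(line.strip()) + "\n")
```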

@huache (Author) commented Apr 22, 2017

Great! I really appreciate it. I'd like to retrain my model with the new code; I expect its performance will be better.
Thanks again!

@hoangcuong2011

Hi @rsennrich,

I noticed the number of unique tokens extracted from an encoded training corpus is larger than the number of merge operations.

For instance, when I set the BPE size to 32K, the number of unique tokens in the encoded training corpus is 32755.

Several tokens that are in the encoded corpus but not in the vocab file generated by learn-bpe are listed below. I am curious why this happens. Thanks!

ectiv
inc
seri
ş
pri
ť
ث@@
Ń@@
propri
ל@@
फ@@
ь@@
ศ@@
Å@@
parti

@rsennrich (Owner)

Hi Hoang,

The number of unique tokens is typically larger than the number of merge operations because you have to add the size of the character vocabulary (×2 to account for the fact that characters can be word-internal or word-final). This also explains some of the unique tokens you see, such as "फ@@".

The numbers won't match perfectly, because some characters may only occur word-internally for example, or because all occurrences of a character or subword may have been merged into larger subwords.

As for your question of why your encoded corpus contains larger subwords that are not in the list of merge operations: this shouldn't happen. How did you search for these mismatches?
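
One way such a check might be done (a rough sketch; the filenames are placeholders, and the assumed codes-file format — one merge pair per line, "</w>" marking word-final units, "#" starting the version header — may not match every version of learn_bpe.py):

```python
# Sketch: find tokens in a BPE-encoded corpus that are explained neither by a
# merge operation nor by the single-character fallback.

# Tokens actually present in the encoded corpus ("corpus.bpe" is a placeholder).
corpus_tokens = set()
with open("corpus.bpe", encoding="utf-8") as f:
    for line in f:
        corpus_tokens.update(line.split())

# Tokens produced directly by the merge operations ("codes.bpe" is a placeholder).
explained = set()
with open("codes.bpe", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if line.startswith("#") or len(parts) != 2:
            continue
        a, b = parts
        merged = (a + b).replace("</w>", "")
        explained.add(merged if b.endswith("</w>") else merged + "@@")

# Single characters need no merge; add both word-internal and word-final forms.
chars = {c for tok in corpus_tokens for c in tok.replace("@@", "")}
explained |= chars | {c + "@@" for c in chars}

print(sorted(corpus_tokens - explained))
```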

@hoangcuong2011

Hi Rico,

Thanks for your detailed reply. In the beginning I naively thought we should use the list of merge operations as the vocab. Having learned from this thread, I now know that a better way to use subwords is to extract the vocab (I suppose the most frequent tokens) from the encoded corpus. I actually ran an experiment myself to verify this and observed a better BLEU score this way (please let me know if you have a different experience with this). For me it is interesting to know these details under the hood. Thanks.

BTW, I found these mismatches using a preprocessed training corpus uploaded by another team. I still have not figured out exactly why this happens, but since you think it should not happen, I suppose the oddity does not come from subword-nmt. Thanks.
