About Programmatically usage #76

Closed
loretoparisi opened this issue Apr 3, 2019 · 9 comments

@loretoparisi

loretoparisi commented Apr 3, 2019

I'm trying to use the package programmatically. This is what I'm doing:

    import codecs

    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # open the codes and vocabulary files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    # build the BPE object once, then segment a line
    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure what to use for vocabulary_threshold, since I don't see a default value. Is there one?

Thank you.

@rsennrich
Owner

This looks mostly fine. Two remarks:

  • vocabulary is optional. Its function is described in the README. If you use it, you can provide a vocabulary-threshold (effectively filtering out low-frequency items from the vocabulary), but this is also optional.
  • you will typically want to apply BPE to more than one line. If so, make sure that only the last line (the process_line call) is executed repeatedly, so that the files are read and the BPE object is built only once.
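
The second remark can be sketched as follows. This is only an illustration: segment_lines is a hypothetical helper name, and bpe is assumed to be an object that was constructed once beforehand, as in the snippet above, and that exposes process_line.

```python
def segment_lines(bpe, lines):
    """Apply an already-constructed BPE object to each line in turn.

    `bpe` is assumed to be built once, outside this function, so that
    only the per-line call is repeated.
    """
    for line in lines:
        yield bpe.process_line(line)


# Hypothetical usage, with `bpe` built as in the snippet above:
# for segmented in segment_lines(bpe, ["first sentence", "second sentence"]):
#     print(segmented)
```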

@loretoparisi
Author

Thank you, I'm going to split the lines to handle both cases then. My model comes with both a vocabulary and codes, so my question now is: how do I choose the right threshold? Assuming I already have a vocabulary, should I compute some statistics to find the lowest-frequency words?

@loretoparisi
Author

loretoparisi commented Apr 3, 2019

@rsennrich I get this error:

Error: invalid line 1 in BPE codes file: e n 52708119
The line should exist of exactly two subword units, separated by whitespace

My codes and vocabulary files are from the FAIR LASER model. They look like this:

root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab 
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes 
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691

The vocabulary is loaded correctly through the read_vocabulary API, but I get the error immediately, I presume at this line:

    encoder = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)

@rsennrich
Owner

As to your first question, have a look at your vocabulary file - whether you set the threshold to 5 or 500 won't make a big difference for you, since most rare tokens are single (non-Latin) characters that won't be affected by this.
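
One way to check this on a given vocabulary file is to count how many entries survive at a few candidate thresholds. A minimal sketch, assuming the two-column "token frequency" layout shown earlier in this thread and assuming that entries at or above the threshold are kept; count_kept is a hypothetical helper, and the last sample entry is made up for illustration.

```python
def count_kept(vocab_lines, threshold):
    """Count vocabulary entries whose frequency is at least `threshold`."""
    kept = 0
    for line in vocab_lines:
        parts = line.strip("\r\n ").split(" ")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        if int(parts[1]) >= threshold:
            kept += 1
    return kept


# First three entries are from the vocabulary excerpt in this thread;
# the last one is an invented rare token.
sample = [". 87264459", ", 78156033", "de 19001435", "x 3"]

for threshold in (1, 5, 500):
    print(threshold, count_kept(sample, threshold))
```

If the counts barely change between thresholds, the choice is unlikely to matter much for that file.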

FAIR LASER uses a different BPE implementation ( https://github.com/glample/fastBPE ), which seems to store the BPE file in a different format. It might work if you simply remove the third item in each entry (the frequency), but I can't guarantee there's no other inconsistency, e.g. in how UTF-8 whitespace is handled.
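
Removing the third item could be done along these lines. This is a sketch under the assumption that every entry has the "left right frequency" layout shown in the excerpt above; strip_frequencies is a hypothetical helper, and it does nothing about other possible inconsistencies such as whitespace handling.

```python
def strip_frequencies(lines):
    """Keep only the first two space-separated fields of each merge entry."""
    for line in lines:
        fields = line.strip("\r\n ").split(" ")
        if len(fields) >= 2:
            yield " ".join(fields[:2])


# Entries taken from the codes excerpt earlier in this thread.
fast_bpe_lines = ["e n 52708119", "e r 51024442", "e n</w> 47209692"]
converted = list(strip_frequencies(fast_bpe_lines))
print(converted)
```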

@loretoparisi
Author

loretoparisi commented Apr 3, 2019

@rsennrich thank you, looking at the results it seems the problem is the third column only, so we did

    self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')[:2]) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]

Regarding compatibility with fastBPE, I thought there was some kind of official approach to follow. Even though I load the same codes and dictionary, I get different results:

Using fastBPE

hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad

Using subword-nmt

ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad

What could be the issue here?

@rsennrich
Owner

try adding this as the first line to the BPE file:

#version: 0.2

the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.
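
If the codes file itself cannot be edited, the header could instead be added to an in-memory copy. A minimal sketch: with_version_header is a hypothetical helper, and it is an assumption (not confirmed in this thread) that BPE accepts any seekable text stream in place of an open file.

```python
import io


def with_version_header(codes_text, version="0.2"):
    """Return a text stream of the codes with a '#version:' first line.

    The text is left unchanged if it already starts with a version header.
    """
    if codes_text.startswith("#version:"):
        return io.StringIO(codes_text)
    return io.StringIO("#version: {}\n{}".format(version, codes_text))


stream = with_version_header("e n\ne r\n")
print(stream.read())

# Hypothetical usage (assumes BPE works with a StringIO object):
# bpe = BPE(with_version_header(open(path, encoding="utf-8").read()))
```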

@loretoparisi
Author

@rsennrich ok, so basically subword-nmt needs the comment to detect the version. The only issue I see is that if I have a pre-trained file it can happen that I cannot modify it. Thanks, closing.

@RenShuhuai-Andy

RenShuhuai-Andy commented Feb 25, 2020

try adding this as the first line to the BPE file:

#version: 0.2

the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.

Hi~ it doesn't work for me. Before adding #version: 0.2, the error is "Error: invalid line 1 in BPE codes file: e n</w> 1423551864"; after adding it, the error becomes "Error: invalid line 2 in BPE codes file: e n</w> 1423551864 ...".
The BPE file I used was downloaded from fairseq (transformer.wmt19.en-de), and I ran export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8. Any advice? @rsennrich

@RenShuhuai-Andy

Oh, I have solved this problem: I had set the bpe parameter incorrectly, sorry.
