About Programmatically usage #76

loretoparisi · 2019-04-03T13:20:45Z

I'm trying to use the package programmatically. This is what I'm doing:

    import codecs

    from subword_nmt.apply_bpe import BPE, read_vocabulary

    # read/write files as UTF-8
    bpe_codes_fin = codecs.open(bpe_codes, encoding='utf-8')
    bpe_vocab_fin = codecs.open(bpe_vocab, encoding='utf-8')
    vocabulary = read_vocabulary(bpe_vocab_fin, vocabulary_threshold)

    bpe = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
    codes = bpe.process_line(line)

Is that correct? Also, I'm not sure about vocabulary_threshold, since I don't see any default value. Is there one?

Thank you.


rsennrich · 2019-04-03T13:42:08Z

This looks mostly fine. Two remarks:

1. vocabulary is optional. Its function is described in the README. If you use it, you can also provide a vocabulary threshold (which effectively filters low-frequency items out of the vocabulary), but this is optional too.
2. You will typically want to apply BPE to more than one line. If so, make sure that only the last line (the call to bpe.process_line) is executed repeatedly, so that the BPE object is built only once.
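
A minimal sketch of the second remark, assuming subword_nmt is installed: the BPE object is built once and only process_line runs per line. The helper names load_bpe and segment_lines and the path arguments are hypothetical, and passing None as the threshold is assumed to keep every vocabulary entry.

```python
import codecs


def load_bpe(codes_path, vocab_path=None, threshold=None):
    """Build a BPE object once (hypothetical helper; paths are placeholders)."""
    # Imported lazily so segment_lines() below can be used without the package.
    from subword_nmt.apply_bpe import BPE, read_vocabulary

    vocabulary = None
    if vocab_path is not None:
        with codecs.open(vocab_path, encoding="utf-8") as vocab_file:
            # Assumption: threshold=None disables frequency filtering.
            vocabulary = read_vocabulary(vocab_file, threshold)
    with codecs.open(codes_path, encoding="utf-8") as codes_file:
        return BPE(codes_file, merges=-1, separator="@@", vocab=vocabulary)


def segment_lines(bpe, lines):
    """Apply an already-built BPE object to each line; only this part repeats."""
    return [bpe.process_line(line.rstrip("\n")) for line in lines]
```

Typical use would be one call to load_bpe(...) at start-up, followed by segment_lines(bpe, batch) for every batch of input.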

loretoparisi · 2019-04-03T13:48:09Z

Thank you, I'm going to split the lines to handle both cases then. In my model I have both a vocabulary and codes, so my question now is: how do I choose the right threshold? Assuming I already have a vocabulary, should I compute some statistics to find the lowest-frequency words?

loretoparisi · 2019-04-03T14:04:23Z

@rsennrich I get this error:

Error: invalid line 1 in BPE codes file: e n 52708119
The line should exist of exactly two subword units, separated by whitespace

My codes and vocabulary files are from the FAIR LASER model; they look like this:

root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab 
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes 
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691

The vocabulary is loaded correctly through the read_vocabulary API, but I get that error immediately, I presume when execution reaches this line:

    encoder = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)
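
To see which entries trip that check before constructing BPE, a small pre-flight scan can help. This is only a sketch and not part of subword-nmt: find_invalid_code_lines is a hypothetical helper that mirrors the "exactly two subword units" rule from the error message and skips an optional leading #version: line.

```python
def find_invalid_code_lines(lines):
    """Return (line_number, text) for entries that are not exactly two units."""
    invalid = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip("\r\n ")
        if number == 1 and text.startswith("#version:"):
            continue  # optional header line, not a merge rule
        if len(text.split(" ")) != 2:
            invalid.append((number, text))
    return invalid


# The first and third entries carry a third column; the second is well-formed.
sample = ["e n 52708119\n", "e r\n", "e n</w> 47209692\n"]
print(find_invalid_code_lines(sample))
# [(1, 'e n 52708119'), (3, 'e n</w> 47209692')]
```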

rsennrich · 2019-04-03T15:15:34Z

As to your first question, have a look at your vocabulary file - whether you set the threshold to 5 or 500 won't make a big difference for you, since most rare tokens are single (non-Latin) characters that won't be affected by this.
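
One way to check this on your own vocabulary file is to count how many entries survive at each candidate threshold. The sketch below assumes the "token frequency" layout shown earlier in the thread; count_kept is a hypothetical helper, and the last two sample rows are made up for illustration.

```python
def count_kept(vocab_lines, thresholds):
    """Count vocabulary entries whose frequency is >= each threshold."""
    freqs = []
    for line in vocab_lines:
        # rsplit keeps the frequency separate even if the token is unusual.
        _, freq = line.strip("\r\n ").rsplit(" ", 1)
        freqs.append(int(freq))
    return {t: sum(1 for f in freqs if f >= t) for t in thresholds}


# First three rows come from the listing above; the last two are invented.
sample = [". 87264459", ", 78156033", "de 19001435", "rare_a 120", "rare_b 3"]
print(count_kept(sample, [5, 500]))
# {5: 4, 500: 3}
```

If the counts barely change between thresholds, the choice of threshold is unlikely to matter much for that file.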

FAIR LASER uses a different BPE implementation ( https://github.com/glample/fastBPE ), which seems to store the BPE file in a different format. It might work if you simply remove the third item in each entry (the frequency), but I can't guarantee there's no other inconsistency, e.g. in how UTF-8 whitespace is handled.
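
A minimal sketch of that suggestion, assuming each fastBPE entry is "unit unit frequency" separated by single spaces as in the listing above. strip_frequency_column is a hypothetical helper; it only drops the third item and does not address any other possible difference between the two formats.

```python
def strip_frequency_column(lines):
    """Keep only the two subword units of each fastBPE-style merge entry."""
    converted = []
    for raw in lines:
        parts = raw.strip("\r\n ").split(" ")
        if len(parts) < 2:
            continue  # skip blank lines rather than guess at their meaning
        converted.append(" ".join(parts[:2]))
    return converted


sample = ["e n 52708119", "e r 51024442", "e n</w> 47209692"]
print(strip_frequency_column(sample))
# ['e n', 'e r', 'e n</w>']
```

The converted lines can then be written to a new file, which leaves the original codes file untouched.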

loretoparisi · 2019-04-03T15:36:19Z

@rsennrich thank you, looking at the results it seems the problem is the third column only, so we did

    self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')[:2]) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]

Regarding compatibility with fastBPE, I thought there was a more or less official approach to follow. Even though I load the same codes and dictionary, I get different results:

Using fastBPE

hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad

Using subword-nmt

ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad

What could be the issue here?

rsennrich · 2019-04-04T09:45:27Z

Try adding this as the first line of the BPE file:

    #version: 0.2

The reason for this is explained in the README. It looks like fastBPE implements the new variant (v0.2) as well.

loretoparisi · 2019-04-04T09:53:15Z

@rsennrich ok, so basically subword-nmt needs the comment to detect the version. The only issue I see is that I may not always be able to modify a pre-trained file. Thanks, closing.
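
If the pre-trained file cannot be modified, one possible workaround is to add the header in memory instead. This sketch assumes BPE only needs a seekable, readable text stream as its first argument; codes_stream_with_version is a hypothetical helper, and whether 0.2 is the right version for a given file is something to confirm with whoever produced it.

```python
import io


def codes_stream_with_version(codes_text, version="0.2"):
    """Return an in-memory codes stream that starts with a version header."""
    if not codes_text.startswith("#version:"):
        # Prepend the header to the text only; the file on disk is untouched.
        codes_text = "#version: {}\n".format(version) + codes_text
    return io.StringIO(codes_text)


stream = codes_stream_with_version("e n\ne r\n")
print(stream.readline().strip())
# #version: 0.2
```

The returned stream would then be passed in place of the opened codes file.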

RenShuhuai-Andy · 2020-02-25T16:06:06Z

> try adding this as the first line to the BPE file:
> #version: 0.2
> the reason for this is explained in the README. It looks like fastBPE implements the new variant (v 0.2) as well.

Hi, this doesn't work for me. Before adding #version: 0.2 the error is "Error: invalid line 1 in BPE codes file: e n</w> 1423551864"; after adding it, the error becomes "Error: invalid line 2 in BPE codes file: e n</w> 1423551864 ...".
The BPE file I used was downloaded from fairseq (transformer.wmt19.en-de), and I have set export LANG=en_US.UTF-8; export LC_ALL=en_US.UTF-8. Any advice? @rsennrich

RenShuhuai-Andy · 2020-02-26T03:59:48Z

Oh, I have solved this problem: I had set the bpe parameter incorrectly, sorry.

loretoparisi mentioned this issue Apr 3, 2019

learn_bpe.py generates an invalid bpe file #46

Closed

loretoparisi mentioned this issue Apr 3, 2019

Differences with subword-nmt glample/fastBPE#13

Closed

loretoparisi closed this as completed Apr 4, 2019

RenShuhuai-Andy mentioned this issue Feb 25, 2020

Fail to load transformer.wmt18.en-de due to encoding facebookresearch/fairseq#1287

Closed
