Programmatic usage #76
This looks mostly fine. Two remarks:
Thank you, I'm going to split lines to handle both cases then. In my model I have both a vocabulary and codes, but at this point my question becomes: how do I choose the right threshold? I mean, assuming I already have a vocabulary, should I compute some statistics to find the lowest-frequency words?
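One way to approach the threshold question is to look at the frequency distribution in the vocabulary file itself. Below is a minimal, dependency-free sketch, assuming the file has one `word frequency` pair per line (the format produced by subword-nmt's `get_vocab`); the function name and the candidate thresholds are illustrative, not part of the library:

```python
def frequency_profile(vocab_lines):
    """Summarize a `word frequency` vocabulary to help pick a threshold.

    For a few candidate thresholds, count how many vocabulary entries
    would fall below each one (and thus be split further at apply time).
    """
    freqs = []
    for line in vocab_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        freqs.append(int(parts[1]))
    return {t: sum(1 for f in freqs if f < t) for t in (1, 5, 10, 50, 100)}

# Hypothetical vocabulary contents, for illustration only.
vocab = ["the 1000", "low@@ 40", "er@@ 7", "est@@ 3", "x 1"]
print(frequency_profile(vocab))
```

A threshold just above the long tail of hapax entries is usually a reasonable starting point.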
@rsennrich I get this error:

My codes and vocabulary files are from the FAIR LASER model, and they look like this:

The vocabulary is loaded correctly through `encoder = BPE(bpe_codes_fin, merges=-1, separator='@@', vocab=vocabulary, glossaries=None)`.
As to your first question: have a look at your vocabulary file. Whether you set the threshold to 5 or 500 won't make a big difference for you, since most rare tokens are single (non-Latin) characters that won't be affected by this. FAIR LASER uses a different BPE implementation ( https://github.com/glample/fastBPE ), which seems to store the BPE file in a different format. It might work if you simply remove the third item in each entry (the frequency), but I can't guarantee there's no other inconsistency, e.g. in how UTF-8 whitespace is handled.
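The column-stripping suggestion above can be done once, up front, producing a codes file in the two-column format subword-nmt expects. A minimal sketch, assuming each fastBPE line is `left right frequency`; as noted above, compatibility beyond the column count is not guaranteed:

```python
def strip_frequency_column(fastbpe_lines):
    """Convert fastBPE codes (`left right frequency` per line) into the
    two-column `left right` format that subword-nmt reads.
    """
    converted = []
    for line in fastbpe_lines:
        parts = line.strip('\r\n ').split(' ')
        if len(parts) >= 2:
            converted.append(' '.join(parts[:2]))
    return converted

# Hypothetical fastBPE codes entries, for illustration only.
codes = ["l o 5021", "lo w 3712", "e r</w> 2954"]
print(strip_frequency_column(codes))
```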
@rsennrich thank you. Looking at the results, it seems the problem was the third column only, so we did:

`self.bpe_codes = [tuple(item.strip('\r\n ').split(' ')[:2]) for (n, item) in enumerate(codes) if (n < merges or merges == -1)]`

Regarding compatibility, though, the two implementations still segment differently:

Using fastBPE:

Using subword-nmt:

What could be the issue here?
Try adding this as the first line of the BPE file:

The reason for this is explained in the README. It looks like fastBPE implements the new variant (v0.2) as well.
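If the pre-trained codes file must stay untouched, the header can be prepended into a copy instead of editing the original. A sketch, assuming the header is the `#version: 0.2` line described in the subword-nmt README; treat the exact string as an assumption and check it against your installed version:

```python
def prepend_header(codes_lines, header="#version: 0.2"):
    """Return a copy of the codes with a version header as the first line.

    The default header string follows the subword-nmt README (an
    assumption here); verify it before relying on it.
    """
    if codes_lines and codes_lines[0].startswith("#version"):
        return list(codes_lines)  # already carries a version header
    return [header] + list(codes_lines)

codes = ["l o", "lo w"]
print(prepend_header(codes))
```

Writing the result to a new file sidesteps the "cannot modify a pre-trained file" concern entirely.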
@rsennrich ok, so basically subword-nmt needs the comment to detect the version. The only issue I see is that with a pre-trained file it can happen that I cannot modify it. Thanks, closing.
Hi, it doesn't work for me. The error log is:
Oh, I have solved this problem: I had set the bpe parameter incorrectly, sorry.
I'm trying to use the package programmatically. I'm doing:

Is that correct? Also, I'm not sure about the `vocabulary_threshold` parameter, since I do not see any default value. Is there one? Thank you.
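For reference, what applying BPE codes to a word does can be illustrated with a dependency-free sketch of the greedy merge loop. This is an illustration of the algorithm only, not the library's implementation; for real use the entry point discussed in this thread is `subword_nmt.apply_bpe.BPE`:

```python
def apply_merges(word, codes):
    """Greedily apply BPE merge operations to a single word.

    `codes` is an ordered list of (left, right) pairs; earlier pairs
    have higher priority, mirroring how a codes file is read top-down.
    """
    ranks = {pair: i for i, pair in enumerate(codes)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks
        ]
        if not candidates:
            break  # no learned merge applies anymore
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical two-merge codes list, for illustration only.
codes = [('l', 'o'), ('lo', 'w')]
print(apply_merges("lower", codes))
```

The vocabulary threshold then only controls which merged symbols are kept versus split back into smaller known units.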