Roberta-large using BPE tokenizer generates multi tokens. #92
Comments
Could anyone do me a favor, please?
You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN).
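For context, here is a rough sketch of how that call differs from tokenizer.encode(); it assumes the Hugging Face transformers RobertaTokenizer, and "Society" is just an example word:

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

    # convert_tokens_to_ids() is a plain vocabulary lookup for a single token
    # string; if the exact string is not a vocabulary entry it returns the
    # <unk> id, whereas encode() runs the full BPE pipeline and may return
    # several ids for one word
    print(tokenizer.convert_tokens_to_ids("Society"))
    print(tokenizer.unk_token_id)
    print(tokenizer.encode("Society", add_special_tokens=False))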
Thanks for answering! For your first suggestion, I still don't understand how it works. Here is the related code.
The Roberta tokenizer converts one word into two tokens (with specific ids), but vanilla PET can only process one token. Could you please explain more explicitly how to modify it, at your convenience? Anyway, your suggestion is a great help to me, thanks again. Best wishes.
Hi, GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ". The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:

    if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
        space_word = "Ġ" + word
        id = tokenizer.convert_tokens_to_ids(space_word)
    else:
        id = tokenizer.convert_tokens_to_ids(word)

Refer to this thread for more details:
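For example, the effect of the leading space can be seen directly; this is a small sketch assuming the Hugging Face transformers RobertaTokenizer, and the exact sub-tokens depend on the vocabulary:

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

    # Without the leading-space marker, the word may fall apart into sub-word pieces
    print(tokenizer.tokenize("Society"))   # e.g. ['Soc', 'iety']

    # With a leading space, the "Ġ"-prefixed form is produced; for common words
    # this is usually a single vocabulary entry
    print(tokenizer.tokenize(" Society"))  # e.g. ['ĠSociety']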
That works! Thanks for the solution and reference. 🥳
Hi, how can I train a PET model using xlm-roberta with byte-level Byte-Pair Encoding?
Roberta-large uses byte-level Byte-Pair Encoding, which breaks the standard PET training.
For example,
Verbalization "Society" does not correspond to a single token, got ['Soc', 'iety']
For now I have just commented out the check

    assert len(ids) == 1

in utils.py to force it to use only the first token, but I don't know whether that will affect the accuracy. So is there any alternative, since PET uses Roberta-large by default?
Thanks~
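For reference, here is roughly what happens when the assertion is removed and only the first sub-token is kept. This is a sketch assuming the Hugging Face transformers library, not the exact PET code:

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

    # roberta-large's byte-level BPE splits the verbalizer word into two pieces
    ids = tokenizer.encode("Society", add_special_tokens=False)
    print(tokenizer.convert_ids_to_tokens(ids))  # e.g. ['Soc', 'iety']

    # Removing the single-token assertion effectively means only the first piece
    # ('Soc' here) is kept as the verbalizer id, which is why accuracy may suffer
    first_id = ids[0]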