
Roberta-large using BPE tokenizer generates multi tokens. #92

Closed
caidongqi opened this issue Sep 6, 2022 · 6 comments

@caidongqi

Roberta-large uses a byte-level Byte-Pair-Encoding tokenizer, which breaks the usual PET training.

For example, Verbalization "Society" does not correspond to a single token, got ['Soc', 'iety']
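
Minimal reproduction (a sketch, assuming the plain roberta-large tokenizer from Hugging Face transformers):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
ids = tokenizer.encode("Society", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # ['Soc', 'iety'] -- two tokens, not one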

For now I have just commented out the assert len(ids) == 1 check in utils.py to force it to use the first token.

But I don't know whether this will affect accuracy. Is there any alternative, since PET uses Roberta-large by default?

Thanks~

@caidongqi
Author

Could anyone help me out with this, please?

@huchinlp

You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN).
Since Roberta is case-sensitive, you may also try lowercase "society".
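
Roughly like this (a sketch; the point is that convert_tokens_to_ids looks up the vocab directly instead of running BPE):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
# convert_tokens_to_ids does no BPE splitting: it returns the id of the exact
# vocab entry, or tokenizer.unk_token_id if "society" is not in the vocab
token_id = tokenizer.convert_tokens_to_ids("society")
print(token_id == tokenizer.unk_token_id)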

@caidongqi
Author

Thanks for answering!
But using lowercase doesn't work for me 😭
Bug still exists: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']

As for your first suggestion, I don't yet see how it would work here. Here is the related code:

# GPT-2-style tokenizers need an explicit prefix space to produce the "Ġ" word-start form
kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
# vanilla PET requires each verbalization to map to exactly one token id
assert (
    len(ids) == 1
), f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'

The Roberta tokenizer converts one word into two tokens (with two ids), but vanilla PET can only handle a single token per verbalization.
So the assertion check sees two ids and raises the error.

Could you please explain more explicitly how to modify this, at your convenience?

Anyway, your suggestion helps a great deal. Thanks again, best wishes.

@huchinlp

huchinlp commented Oct 15, 2022

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ".
Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one.
You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.
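
To see the difference (a small sketch):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
print(tokenizer.tokenize(" Society"))  # ['ĠSociety'] -- the leading space becomes "Ġ"
print(tokenizer.tokenize("Society"))   # ['Soc', 'iety'] -- no word-start marker, so BPE splits it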

The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:

# Try the bare word first; if it is not a vocab entry, fall back to the
# "Ġ"-prefixed (word-initial) form
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    token_id = tokenizer.convert_tokens_to_ids(space_word)
else:
    token_id = tokenizer.convert_tokens_to_ids(word)
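
Wrapped up as a helper for the utils.py snippet above (a sketch; single_token_id is a hypothetical name, not part of PET):

def single_token_id(tokenizer, word):
    # Resolve a verbalization to exactly one token id, falling back to the
    # "Ġ"-prefixed word-initial form that byte-level BPE vocabularies use
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    assert token_id != tokenizer.unk_token_id, \
        f'Verbalization "{word}" has no single-token form in the vocab'
    return token_id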

Refer to this thread for more details:
https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante
Best.

@caidongqi
Author

That works! Thanks for the solution and reference.🥳

@nieallen

Hi, how can I train a PET model with xlm-roberta, which also uses byte-level Byte-Pair-Encoding?
