
Roberta-large using BPE tokenizer generates multi tokens. #92

Closed
caidongqi opened this issue Sep 6, 2022 · 6 comments

@caidongqi

Roberta-large uses a byte-level Byte-Pair-Encoding tokenizer, which breaks the usual PET training.

For example, Verbalization "Society" does not correspond to a single token, got ['Soc', 'iety']
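
Minimal reproduction (a sketch, assuming the plain roberta-large tokenizer from Hugging Face transformers):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
ids = tokenizer.encode("Society", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))  # ['Soc', 'iety'] -- two tokens, not one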

For now I have just commented out the assert len(ids) == 1 check in utils.py to force it to use the first token.

But I don't know whether this will affect accuracy. Is there any alternative, since PET uses Roberta-large by default?

Thanks~

@caidongqi
Author

Could anyone help me out with this, please?

@huchinlp

You can try another API: tokenizer.convert_tokens_to_ids(YOUR_TOKEN).
Since Roberta is case-sensitive, you may also try lowercase "society".
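
Roughly like this (a sketch; the point is that convert_tokens_to_ids looks up the vocab directly instead of running BPE):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
# convert_tokens_to_ids does no BPE splitting: it returns the id of the exact
# vocab entry, or tokenizer.unk_token_id if "society" is not in the vocab
token_id = tokenizer.convert_tokens_to_ids("society")
print(token_id == tokenizer.unk_token_id)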

@caidongqi
Author

Thanks for answering!
But using lowercase doesn't work for me 😭
Bug still exists: Verbalization "society" does not correspond to a single token, got ['soc', 'iety']

As for your first suggestion, I don't yet see how it would work here. Here is the related code:

# GPT-2-style tokenizers need an explicit prefix space to produce the "Ġ" word-start form
kwargs = {'add_prefix_space': True} if isinstance(tokenizer, GPT2Tokenizer) else {}
ids = tokenizer.encode(word, add_special_tokens=False, **kwargs)
if not force_single_token:
    return ids
# vanilla PET requires each verbalization to map to exactly one token id
assert (
    len(ids) == 1
), f'Verbalization "{word}" does not correspond to a single token, got {tokenizer.convert_ids_to_tokens(ids)}'

The Roberta tokenizer converts one word into two tokens (with two ids), but vanilla PET can only handle a single token per verbalization.
So the assertion check sees two ids and raises the error.

Could you please explain more explicitly how to modify this, at your convenience?

Anyway, your suggestion helps a great deal. Thanks again, best wishes.

@huchinlp

huchinlp commented Oct 15, 2022

Hi,

GPT-2 and Roberta tokenizers will recognize the space before a word and replace it with a "Ġ".
Actually, "Society" is not a token in the vocab but "ĠSociety" is a valid one.
You can call tokenizer.convert_tokens_to_ids("ĠSociety") and the result is 3930.
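
To see the difference (a small sketch):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
print(tokenizer.tokenize(" Society"))  # ['ĠSociety'] -- the leading space becomes "Ġ"
print(tokenizer.tokenize("Society"))   # ['Soc', 'iety'] -- no word-start marker, so BPE splits it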

The only thing you need to do is replace "tokenizer.encode(xxxxx)" with the following lines:

# Try the bare word first; if it is not a vocab entry, fall back to the
# "Ġ"-prefixed (word-initial) form
if tokenizer.convert_tokens_to_ids(word) == tokenizer.unk_token_id:
    space_word = "Ġ" + word
    token_id = tokenizer.convert_tokens_to_ids(space_word)
else:
    token_id = tokenizer.convert_tokens_to_ids(word)
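
Wrapped up as a helper for the utils.py snippet above (a sketch; single_token_id is a hypothetical name, not part of PET):

def single_token_id(tokenizer, word):
    # Resolve a verbalization to exactly one token id, falling back to the
    # "Ġ"-prefixed word-initial form that byte-level BPE vocabularies use
    token_id = tokenizer.convert_tokens_to_ids(word)
    if token_id == tokenizer.unk_token_id:
        token_id = tokenizer.convert_tokens_to_ids("Ġ" + word)
    assert token_id != tokenizer.unk_token_id, \
        f'Verbalization "{word}" has no single-token form in the vocab'
    return token_id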

Refer to this thread for more details:
https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475?u=joaogante
Best.

@caidongqi
Author

That works! Thanks for the solution and reference.🥳

@nieallen

Hi, how can I train a PET model with xlm-roberta, which also uses byte-level Byte-Pair-Encoding?
