
tokenizer.decode throwing an error #21

Closed
dsplog opened this issue Nov 14, 2023 · 3 comments
dsplog (Contributor) commented Nov 14, 2023

Kindly see the code snippet below. I was using the flow in preprocess.ipynb, but ran into an error in:

from tqdm import tqdm

# get each token's lower case: map each token id to the id of its lower-cased form
lower_tokens = []
for t in tqdm(unique_index):
    word = tokenizer.decode([t])
    if word.lower() != word:
        t = tokenizer.encode([word.lower()])[0]
    lower_tokens.append(t)

I can see that tokenizer.sym2idx is defined, but tokenizer.idx2sym is an empty list.

from transformers import TransfoXLTokenizer
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")

enc = tokenizer.encode("Hello, my dog is cute")
enc
[14049, 2, 617, 3225, 23, 16072]

tokenizer.decode
<bound method PreTrainedTokenizerBase.decode of TransfoXLTokenizer(name_or_path='transfo-xl-wt103', vocab_size=0, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<eos>', 'unk_token': '<unk>', 'additional_special_tokens': ['<formula>']}, clean_up_tokenization_spaces=True), added_tokens_decoder={
    0: AddedToken("<eos>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    24: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    3039: AddedToken("<formula>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}>
tokenizer.decode(enc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3738, in decode
    return self._decode(
  File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 1001, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/home/home/.local/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 982, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/home/home/.local/lib/python3.8/site-packages/transformers/models/transfo_xl/tokenization_transfo_xl.py", line 451, in _convert_id_to_token
    return self.idx2sym[idx]
IndexError: list index out of range
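A possible workaround (my own sketch, not something proposed in this thread): since the report says tokenizer.sym2idx is still populated while tokenizer.idx2sym is empty, the list can be rebuilt by inverting the mapping. Here `sym2idx` is a stand-in dict for tokenizer.sym2idx:

```python
# Hypothetical workaround: rebuild the empty idx2sym list from sym2idx.
# `sym2idx` stands in for tokenizer.sym2idx (token string -> token id).
sym2idx = {"<eos>": 0, "hello": 1, "world": 2}

idx2sym = [None] * len(sym2idx)
for sym, idx in sym2idx.items():
    idx2sym[idx] = sym

print(idx2sym)  # ['<eos>', 'hello', 'world']
```

On a real tokenizer the same inversion would be `tokenizer.idx2sym = [None] * len(tokenizer.sym2idx)` followed by the loop; this assumes sym2idx covers every id exactly once.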
yl4579 (Owner) commented Nov 18, 2023

What is your transformers version?

dsplog (Contributor, Author) commented Nov 19, 2023

Thanks for the pointer.

I checked the version history of transformers (https://pypi.org/project/transformers/#history): the issue is not present up to v4.33.3. From v4.34.0 through v4.35.2, the issue is present.
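The affected range can also be checked programmatically; a minimal sketch (the version bounds come from this thread, the helper name `is_affected` is my own):

```python
def is_affected(ver: str) -> bool:
    """True if `ver` falls in the 4.34.0 .. 4.35.2 range reported above."""
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return (4, 34, 0) <= parts <= (4, 35, 2)

print(is_affected("4.33.3"))  # False: decode works
print(is_affected("4.35.1"))  # True: idx2sym is empty and decode raises IndexError
```

This assumes a plain `major.minor.patch` version string; pre-release suffixes would need a real version parser such as `packaging.version`.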

yl4579 (Owner) commented Nov 19, 2023
yl4579 commented Nov 19, 2023

So I'll add the version to the README then.
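Given the range reported above, the README pin would presumably be a requirements-style constraint along these lines (hypothetical wording, not quoted from the repo):

```
transformers<4.34
```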

@yl4579 yl4579 closed this as completed Nov 19, 2023
yl4579 added a commit that referenced this issue Nov 19, 2023
@yl4579 yl4579 mentioned this issue Dec 5, 2023