Closed
Description
System Info
- `transformers` version: 4.49.0
- Platform: Linux-5.4.0-187-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- Huggingface_hub version: 0.29.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA L40S
Who can help?
Hi @ArthurZucker @itazap (tagging you per the instructions). When I encode the string ' ...' with the Llama 3 tokenizer and then decode the resulting tokens, I get the string '...' back instead of ' ...' (the leading space is missing).
I believe that decode should be the inverse of encode in this case, and it's unclear to me why it isn't.
Sorry if I'm misunderstanding something! Thanks for your time :)
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# [1:] drops the beginning-of-sequence token
tokenizer.decode(tokenizer.encode(" ...")[1:])
```
This outputs '...' (no leading space).
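For reference, the mismatch can be checked generically for any tokenizer exposing `encode`/`decode`. This is a minimal sketch; `roundtrip_check` is a hypothetical helper name, not a `transformers` API:

```python
# Sketch of a generic round-trip check: does decode invert encode for a
# given string? roundtrip_check is a hypothetical helper, not a library call.
def roundtrip_check(tokenizer, text):
    """Return (ok, decoded), where ok is True iff decode(encode(text)) == text."""
    # add_special_tokens=False avoids having to slice off the BOS token by hand
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return decoded == text, decoded
```

With the Llama 3 tokenizer above, `roundtrip_check(tokenizer, " ...")` returns `(False, '...')` per the reproduction, while most other inputs round-trip cleanly.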
Expected behavior
I believe that `decode` should be the inverse of `encode`. E.g.:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.decode(tokenizer.encode(" Hello world")[1:])  # [1:] removes the beginning-of-sequence token
```
outputs " Hello world", as expected.
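As possibly relevant background (an illustration, not a diagnosis of the bug): Llama 3 uses a GPT-2-style byte-level BPE, whose vocabulary stores a leading space *inside* the token string via the byte-to-unicode table, where the space byte 0x20 maps to "Ġ" (U+0120). A sketch of that mapping for spaces only (the real table covers all 256 byte values):

```python
# Byte-level BPE (GPT-2 style) represents the space byte 0x20 as the visible
# marker "Ġ" (U+0120) inside token strings, so a leading space is part of the
# token itself. Sketch of the space case only; the real table maps all bytes.
def space_to_marker(text):
    return "".join("\u0120" if ch == " " else ch for ch in text)

space_to_marker(" ...")  # token-level form: "Ġ..."
```

So the leading space does survive encoding as part of the token; the question is where `decode` loses it on the way back.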