Closed
Description
System Info
- `transformers` version: 4.49.0
- Platform: Linux-5.4.0-187-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- Huggingface_hub version: 0.29.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA L40S
Who can help?
Hi @ArthurZucker @itazap (tagging you per the instructions). When I encode the string ' ...' with the Llama 3 tokenizer and then decode the resulting tokens, I get the string '...' back instead of ' ...' (the leading space is missing).
I believe that decode should be the inverse of encode in this case, and it's unclear to me why it isn't.
Sorry if I'm misunderstanding something! Thanks for your time :)
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# [1:] drops the beginning-of-sequence token
tokenizer.decode(tokenizer.encode(" ...")[1:])
```
This outputs '...' (no leading space).
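For reference, the mismatch can be checked generically for any tokenizer exposing `encode`/`decode`. This is a minimal sketch; `roundtrip_check` is a hypothetical helper name, not a `transformers` API:

```python
# Sketch of a generic round-trip check: does decode invert encode for a
# given string? roundtrip_check is a hypothetical helper, not a library call.
def roundtrip_check(tokenizer, text):
    """Return (ok, decoded), where ok is True iff decode(encode(text)) == text."""
    # add_special_tokens=False avoids having to slice off the BOS token by hand
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return decoded == text, decoded
```

With the Llama 3 tokenizer above, `roundtrip_check(tokenizer, " ...")` returns `(False, '...')` per the reproduction, while most other inputs round-trip cleanly.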
Expected behavior
I believe that `decode` should be the inverse of `encode`. E.g.:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.decode(tokenizer.encode(" Hello world")[1:])  # [1:] removes the beginning-of-sequence token
```
outputs " Hello world", as expected.
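As possibly relevant background (an illustration, not a diagnosis of the bug): Llama 3 uses a GPT-2-style byte-level BPE, whose vocabulary stores a leading space *inside* the token string via the byte-to-unicode table, where the space byte 0x20 maps to "Ġ" (U+0120). A sketch of that mapping for spaces only (the real table covers all 256 byte values):

```python
# Byte-level BPE (GPT-2 style) represents the space byte 0x20 as the visible
# marker "Ġ" (U+0120) inside token strings, so a leading space is part of the
# token itself. Sketch of the space case only; the real table maps all bytes.
def space_to_marker(text):
    return "".join("\u0120" if ch == " " else ch for ch in text)

space_to_marker(" ...")  # token-level form: "Ġ..."
```

So the leading space does survive encoding as part of the token; the question is where `decode` loses it on the way back.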