
Conversation

@lucylq (Contributor) commented Oct 17, 2025

Summary:
Currently we check the tokenizer eos and then override it with the method eos. It sounds like this was for legacy cases where the tokenizer did not store an eos.

Use the tokenizer eos as the source of truth.

This is causing issues when we have a special_tokens_map.json (from Hugging Face), which contains the eos as a special token id. That id is overridden by the method eos, and the LLM generates excess output because we do not terminate correctly.

see meta-pytorch/tokenizers#139 where support for special_tokens_map.json is added.
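
A minimal sketch of the intended behavior change in the runner's eos handling (names are illustrative and mirror the reviewed snippet, not the exact ExecuTorch code):

// Sketch only, not the actual runner code.
#include <cstdint>
#include <unordered_set>

// Before: eos_ids was seeded from the tokenizer and then replaced by the
// "get_eos_ids" metadata stored in the PTE when that method was present.
// After this PR: the tokenizer eos is kept as the source of truth.
std::unordered_set<uint64_t> eos_ids_from_tokenizer(uint64_t tokenizer_eos) {
  // With special_tokens_map.json support (meta-pytorch/tokenizers#139), this id
  // already reflects e.g. <|eot_id|> for instruct models, so no override is needed.
  return {tokenizer_eos};
}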

Differential Revision: D84865804

@pytorch-bot bot commented Oct 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15215

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job

As of commit cbe4ba6 with merge base caa35f6:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 17, 2025
@meta-codesync bot commented Oct 17, 2025

@lucylq has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84865804.

@github-actions bot commented

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@jackzhxng (Contributor) left a comment

Which llm are you seeing this issue for? And do you know which models store eos and which don't? What about models that don't have a special_tokens_map.json?

@lucylq (Contributor, Author) commented Oct 17, 2025

Which llm are you seeing this issue for?

@jackzhxng it's overriding the correct eos token for a fine-tuned llama3_2 1B. See here: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/special_tokens_map.json
I added special_tokens_map.json parsing to tokenizers here: meta-pytorch/tokenizers#139

And do you know which models store eos and which don't?

No; from discussion with @larryliu0820, it was a legacy/older-tokenizer/internal case, but I'm not sure.

What about models that don't have a special_tokens_map.json?

The eos should be in the main tokenizer.json. If there are multiple eos tokens, I think special_tokens_map.json tells you which one to use, e.g. I think <|eot_id|> is used for chat conversations (end of turn), whereas <|end_of_text|> is used for non-chat.
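
For reference, the special_tokens_map.json for an instruct model looks roughly like this (abbreviated and illustrative, not copied verbatim from the linked file):

{
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|eot_id|>"
}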

@jackzhxng (Contributor) left a comment

Hmm okay seems reasonable but where does the method eos come from originally? And we should be sure the tokenizers will always contain an eos

@lucylq (Contributor, Author) commented Oct 17, 2025

Hmm okay seems reasonable but where does the method eos come from originally?

I think it comes from export time, if you export with 'get_eos_ids'.
It's still retrievable from the module; this change just removes it from the runner API.

And we should be sure the tokenizers will always contain an eos

Agree, though I think this requires us to expose this in the tokenizer API... currently tokenizer->eos_tok is an int, so hard to tell if it's valid or not...
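
A rough sketch of the kind of check that could help, assuming the tokenizers API exposes something like is_loaded() and vocab_size() (those names, and the omitted include path, are assumptions here rather than verified library API):

// Sketch only; is_loaded() and vocab_size() are assumed names, and the
// tokenizers header include is omitted because the exact path depends on the library.
#include <cstdint>

bool has_valid_eos(tokenizers::Tokenizer* tokenizer) {
  if (tokenizer == nullptr || !tokenizer->is_loaded()) {
    return false;
  }
  // eos_tok() is a plain integer, so the best available check is a range check
  // against the vocabulary size.
  return tokenizer->eos_tok() < tokenizer->vocab_size();
}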

A contributor left a comment on this snippet in the runner:

tokenizers::Tokenizer* tokenizer,
Module* module) {
std::unordered_set<uint64_t> eos_ids = {tokenizer->eos_tok()};
// Get EOS IDs if available

Yeah, previously we relied on metadata inside of the .pte to determine what eos to use, partially because some models don't respect the tokenizer's eos. I think we should get rid of this logic to avoid confusion.

@lucylq (Contributor, Author) commented Oct 17, 2025

Update: I exported with base.metadata='"{\"get_bos_id\":128000, \"get_eos_ids\":[128009, 128001]}"' and that works well.

It's not good UX for the user to have to manually specify eos ids just because the tokenizer's value gets overridden, though.

lucylq added a commit to lucylq/executorch-1 that referenced this pull request Oct 17, 2025
Summary:

See: pytorch#15215

Currently:
- default eos/bos tokens are embedded into the pte
- llama3 instruct has a different set of eos/bos tokens
- users must manually specify the llama3 instruct eos/bos tokens at export time, because the runner overrides the tokenizer eos/bos with the values in the PTE

This diff:
- removes the defaults
- relies on the tokenizer for eos/bos UNLESS the user explicitly specifies them in the metadata, in which case the eos/bos saved in the PTE are used (sketched below)

Reviewed By: jackzhxng

Differential Revision: D84942718
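
A minimal sketch of that precedence (function and parameter names are placeholders, not the actual runner code):

// Sketch only: the tokenizer eos is the default; explicit PTE metadata wins when present.
#include <cstdint>
#include <optional>
#include <unordered_set>
#include <vector>

std::unordered_set<uint64_t> resolve_eos_ids(
    uint64_t tokenizer_eos,
    const std::optional<std::vector<uint64_t>>& explicit_pte_eos_ids) {
  if (explicit_pte_eos_ids.has_value()) {
    // The user opted in at export time, e.g. via
    // base.metadata='{"get_eos_ids":[128009, 128001]}'.
    return std::unordered_set<uint64_t>(
        explicit_pte_eos_ids->begin(), explicit_pte_eos_ids->end());
  }
  // Otherwise the tokenizer (including special_tokens_map.json) is authoritative.
  return {tokenizer_eos};
}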
@lucylq closed this Oct 21, 2025

Labels

CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, meta-exported
