Use tokenizer eos #15215
Conversation
Summary: Currently we check the tokenizer eos, and then override it with the method eos. It sounds like this was for legacy cases where the tokenizer did not store eos. Use the tokenizer eos as the source of truth. Differential Revision: D84865804
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15215
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job as of commit cbe4ba6 with merge base caa35f6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Which LLM are you seeing this issue for? And do you know which models store eos and which don't? What about models that don't have a special_tokens_map.json?
@jackzhxng it's overriding the correct eos token for a fine-tuned llama3_2 1B. See here: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/special_tokens_map.json
No, from discussion with @larryliu0820 it was a legacy/older-tokenizers/internal case, but I'm not sure.
The eos should be in the main tokenizer.json. If there are multiple eos tokens, I think
Hmm okay, seems reasonable, but where does the method eos come from originally? And we should be sure the tokenizers will always contain an eos.
I think it comes from export time, if you export with 'get_eos_ids'.
Agree, though I think this requires us to expose this in the tokenizer API... currently tokenizer->eos_tok is an int, so hard to tell if it's valid or not...
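One hedged way to address the "hard to tell if it's valid" point, purely as a sketch of a hypothetical API change (the class and member names below are illustrative and not the existing tokenizers API): return an optional instead of a bare integer.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch: an accessor that can report "no EOS was loaded"
// instead of returning an arbitrary integer. Not the current tokenizers API.
class TokenizerWithOptionalEos {
 public:
  // Returns the EOS id only if the loaded tokenizer artifact defined one.
  std::optional<uint64_t> eos_tok() const {
    return has_eos_ ? std::optional<uint64_t>(eos_tok_) : std::nullopt;
  }

  // Would be called by the loader when an EOS token is found.
  void set_eos_tok(uint64_t id) {
    eos_tok_ = id;
    has_eos_ = true;
  }

 private:
  bool has_eos_ = false;  // set during load when an EOS token is present
  uint64_t eos_tok_ = 0;  // meaningful only when has_eos_ is true
};
```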
    tokenizers::Tokenizer* tokenizer,
    Module* module) {
  std::unordered_set<uint64_t> eos_ids = {tokenizer->eos_tok()};
  // Get EOS IDs if available
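To make the proposed precedence concrete, here is a minimal sketch (not the actual runner code; `resolve_eos_ids`, `TokenizerT`, and `metadata_eos_ids` are placeholder names): start from the tokenizer's eos and only let explicitly exported metadata replace it.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>

// Sketch only; TokenizerT stands in for tokenizers::Tokenizer (only
// eos_tok() is used) and resolve_eos_ids is a hypothetical helper name.
template <typename TokenizerT>
std::unordered_set<uint64_t> resolve_eos_ids(
    const TokenizerT& tokenizer,
    const std::optional<std::unordered_set<uint64_t>>& metadata_eos_ids) {
  // Tokenizer is the source of truth by default.
  std::unordered_set<uint64_t> eos_ids = {tokenizer.eos_tok()};
  // Export-time metadata takes precedence only when the user explicitly
  // provided eos ids (e.g. via get_eos_ids at export time).
  if (metadata_eos_ids.has_value() && !metadata_eos_ids->empty()) {
    eos_ids = *metadata_eos_ids;
  }
  return eos_ids;
}
```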
Yeah, previously we relied on metadata inside of the .pte to determine what eos to use, partially because some models don't respect the tokenizer's eos. I think we should get rid of this logic to avoid confusion.
Update: I exported with 'get_eos_ids'. Not good UX for the user to manually specify eos ids because the tokenizer one is overridden, though.
Summary: See: pytorch#15215

Currently:
- default eos/bos tokens are embedded into the pte
- llama3 instruct has a different set of eos/bos tokens
- users must manually specify the llama3 instruct eos/bos tokens at export time, because the runner overrides the tokenizer eos/bos with the values in the PTE

This diff:
- removes the defaults
- relies on the tokenizer for eos/bos UNLESS the user explicitly specifies them in the metadata, in which case the eos/bos saved in the PTE are used

Reviewed By: jackzhxng
Differential Revision: D84942718
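Runner-side, the behavior described above could look roughly like the following sketch, assuming (hypothetically) that the exporter no longer emits default eos metadata, so the mere presence of the metadata marks it as explicit; `ModuleT`, `has_method`, and `get_eos_ids` are placeholder names, not the ExecuTorch Module API.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>

// Placeholder sketch: absence of embedded eos metadata now means
// "defer to the tokenizer"; presence means the user asked for an override.
template <typename ModuleT>
std::optional<std::unordered_set<uint64_t>> explicit_eos_ids(ModuleT& module) {
  if (!module.has_method("get_eos_ids")) {
    return std::nullopt;  // nothing embedded; tokenizer stays the source of truth
  }
  return module.get_eos_ids();  // explicitly exported eos ids win
}
```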
Summary:
Currently we check the tokenizer eos, and then override it with the method eos. It sounds like this was for legacy cases where the tokenizer did not store eos.
Use the tokenizer eos as the source of truth.
This is causing issues when we have a special_tokens_map.json (from Hugging Face), which contains eos as a special id. It is overridden by the method, and the LLM generates excess output because we do not terminate correctly.
See meta-pytorch/tokenizers#139, where support for special_tokens_map.json is added.
Differential Revision: D84865804
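To illustrate the failure mode described in the summary, here is a simplified sketch of the termination check (not the actual runner loop; `generate` and `sample_next` are placeholders): decoding only stops when a sampled token is in eos_ids, so an overridden or wrong id lets generation run to the token budget.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Simplified illustration of eos-based termination; not the actual runner.
template <typename SampleFn>
std::vector<uint64_t> generate(
    SampleFn sample_next,
    const std::unordered_set<uint64_t>& eos_ids,
    std::size_t max_new_tokens) {
  std::vector<uint64_t> out;
  for (std::size_t i = 0; i < max_new_tokens; ++i) {
    const uint64_t tok = sample_next();
    if (eos_ids.count(tok) > 0) {
      break;  // correct eos id -> generation stops here
    }
    out.push_back(tok);  // wrong eos id -> excess output keeps accumulating
  }
  return out;
}
```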