Use tokenizer eos #15215
Conversation
Summary: Currently we check the tokenizer eos, and then override it with the method eos. It sounds like this was for legacy cases where the tokenizer did not store eos. Use the tokenizer eos as the source of truth. Differential Revision: D84865804
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15215
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Cancelled Job as of commit cbe4ba6 with merge base caa35f6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Which LLM are you seeing this issue for? And do you know which models store eos and which don't? What about models that don't have a special_tokens_map.json?
@jackzhxng it's overriding the correct eos token for a fine-tuned llama3_2 1B. See here: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct/blob/main/special_tokens_map.json
No, from discussion with @larryliu0820 it was a legacy/older-tokenizers/internal case, but I'm not sure.
The eos should be in the main tokenizer.json. If there are multiple eos tokens, I think
Hmm okay, seems reasonable, but where does the method eos come from originally? And we should be sure the tokenizers will always contain an eos.
I think it comes from export time, if you export with 'get_eos_ids'.
Agree, though I think this requires us to expose this in the tokenizer API... currently tokenizer->eos_tok is an int, so hard to tell if it's valid or not...
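One hedged way to address the "hard to tell if it's valid" point, purely as a sketch of a hypothetical API change (the class and member names below are illustrative and not the existing tokenizers API): return an optional instead of a bare integer.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch: an accessor that can report "no EOS was loaded"
// instead of returning an arbitrary integer. Not the current tokenizers API.
class TokenizerWithOptionalEos {
 public:
  // Returns the EOS id only if the loaded tokenizer artifact defined one.
  std::optional<uint64_t> eos_tok() const {
    return has_eos_ ? std::optional<uint64_t>(eos_tok_) : std::nullopt;
  }

  // Would be called by the loader when an EOS token is found.
  void set_eos_tok(uint64_t id) {
    eos_tok_ = id;
    has_eos_ = true;
  }

 private:
  bool has_eos_ = false;  // set during load when an EOS token is present
  uint64_t eos_tok_ = 0;  // meaningful only when has_eos_ is true
};
```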
    tokenizers::Tokenizer* tokenizer,
    Module* module) {
  std::unordered_set<uint64_t> eos_ids = {tokenizer->eos_tok()};
  // Get EOS IDs if available
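To make the proposed precedence concrete, here is a minimal sketch (not the actual runner code; `resolve_eos_ids`, `TokenizerT`, and `metadata_eos_ids` are placeholder names): start from the tokenizer's eos and only let explicitly exported metadata replace it.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>

// Sketch only; TokenizerT stands in for tokenizers::Tokenizer (only
// eos_tok() is used) and resolve_eos_ids is a hypothetical helper name.
template <typename TokenizerT>
std::unordered_set<uint64_t> resolve_eos_ids(
    const TokenizerT& tokenizer,
    const std::optional<std::unordered_set<uint64_t>>& metadata_eos_ids) {
  // Tokenizer is the source of truth by default.
  std::unordered_set<uint64_t> eos_ids = {tokenizer.eos_tok()};
  // Export-time metadata takes precedence only when the user explicitly
  // provided eos ids (e.g. via get_eos_ids at export time).
  if (metadata_eos_ids.has_value() && !metadata_eos_ids->empty()) {
    eos_ids = *metadata_eos_ids;
  }
  return eos_ids;
}
```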
Yeah, previously we relied on metadata inside of the .pte to determine what eos to use, partially because some models don't respect the tokenizer's eos. I think we should get rid of this logic to avoid confusion.
Update: I exported with 'get_eos_ids'. Not good UX for the user to manually specify eos ids because the tokenizer one is overridden, though.
Summary: See: pytorch#15215

Currently:
- default eos/bos tokens are embedded into the pte
- llama3 instruct has a different set of eos/bos tokens
- users must manually specify the llama3 instruct eos/bos tokens at export time, because the runner overrides the tokenizer eos/bos with the values in the PTE

This diff:
- removes the defaults
- relies on the tokenizer for eos/bos UNLESS the user explicitly specifies them in the metadata, in which case the eos/bos saved in the PTE are used

Reviewed By: jackzhxng
Differential Revision: D84942718
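Runner-side, the behavior described above could look roughly like the following sketch, assuming (hypothetically) that the exporter no longer emits default eos metadata, so the mere presence of the metadata marks it as explicit; `ModuleT`, `has_method`, and `get_eos_ids` are placeholder names, not the ExecuTorch Module API.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_set>

// Placeholder sketch: absence of embedded eos metadata now means
// "defer to the tokenizer"; presence means the user asked for an override.
template <typename ModuleT>
std::optional<std::unordered_set<uint64_t>> explicit_eos_ids(ModuleT& module) {
  if (!module.has_method("get_eos_ids")) {
    return std::nullopt;  // nothing embedded; tokenizer stays the source of truth
  }
  return module.get_eos_ids();  // explicitly exported eos ids win
}
```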
Summary:
Currently we check the tokenizer eos, and then override it with the method eos. It sounds like this was for legacy cases where the tokenizer did not store eos.
Use the tokenizer eos as the source of truth.
This is causing issues when we have a special_tokens_map.json (from Hugging Face), which contains eos as a special id. It is overridden by the method, and the LLM generates excess output because we do not terminate correctly.
See meta-pytorch/tokenizers#139, where support for special_tokens_map.json is added.
Differential Revision: D84865804
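To illustrate the failure mode described in the summary, here is a simplified sketch of the termination check (not the actual runner loop; `generate` and `sample_next` are placeholders): decoding only stops when a sampled token is in eos_ids, so an overridden or wrong id lets generation run to the token budget.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Simplified illustration of eos-based termination; not the actual runner.
template <typename SampleFn>
std::vector<uint64_t> generate(
    SampleFn sample_next,
    const std::unordered_set<uint64_t>& eos_ids,
    std::size_t max_new_tokens) {
  std::vector<uint64_t> out;
  for (std::size_t i = 0; i < max_new_tokens; ++i) {
    const uint64_t tok = sample_next();
    if (eos_ids.count(tok) > 0) {
      break;  // correct eos id -> generation stops here
    }
    out.push_back(tok);  // wrong eos id -> excess output keeps accumulating
  }
  return out;
}
```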