🚀 The feature, motivation and pitch
As a follow-up to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we should improve it further to make onboarding new tokenizer types simpler:
Tasks
- Move `TokenizerType` to a centralized place (a consolidated sketch follows the task list)
  - We currently have two definitions of it:
    Lines 67 to 69 in 0299a37:

    ```python
    class TokenizerType(Enum):
        Tiktoken = auto()
        SentencePiece = auto()
    ```

    torchchat/torchchat/cli/builder.py, Lines 241 to 245 in 0299a37:

    ```python
    class TokenizerType(Enum):
        NONE = 0
        TIKTOKEN = 1
        SENTENCEPIECE = 2
        HF_TOKENIZER = 3
    ```
- Check all getters of tokenizer types
  - They can likely be simplified into inline checks (see the sketch after this task list), e.g.:
    torchchat/torchchat/generate.py, Line 368 in 0299a37:

    ```python
    self.is_llama3_model = self.tokenizer_args.is_tiktoken()
    ```
- Add documentation for onboarding future tokenizer types
  - We may need to point people to update the model validation logic (one possible simplification is sketched after this task list):
    torchchat/torchchat/cli/builder.py, Lines 290 to 322 in 0299a37:

    ```python
    def validate_model(
        self,
        model: Optional[Model],
        model_description: str = "model",
    ) -> None:
        if model is None:
            return

        if self.tokenizer_type == TokenizerType.NONE:
            raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

        is_tiktoken = self.is_tiktoken()
        is_sentencepiece = self.is_sentencepiece()
        is_hf_tokenizer = self.is_hf_tokenizer()

        use_tiktoken = model.config.use_tiktoken
        use_hf_tokenizer = model.config.use_hf_tokenizer
        use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

        if (
            (is_tiktoken and not use_tiktoken)
            or (is_hf_tokenizer and not use_hf_tokenizer)
            or (is_sentencepiece and not use_sentencepiece)
        ):
            raise RuntimeError(
                "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                    tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                    tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                    model_description,
                )
            )

        return
    ```
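To make the first task concrete, here is a minimal sketch of what a single, centralized `TokenizerType` could look like, merging the two existing definitions. The helper methods and the module it would live in are assumptions for illustration, not something this issue has decided:

```python
# Sketch only: a possible shared definition of TokenizerType. Where it should
# live (e.g. a shared tokenizer module) is exactly the open question above.
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()

    # Thin helpers so call sites don't have to import-and-compare everywhere.
    def is_tiktoken(self) -> bool:
        return self is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self is TokenizerType.HF_TOKENIZER
```

With one shared enum, `torchchat/cli/builder.py` and the second definition would both import it, and `TokenizerArgs.__post_init__` would only need to set `self.tokenizer_type` once.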
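For the getter check in `generate.py`, a hedged sketch of what the inlined version might look like. The stand-in classes below only exist to make the snippet self-contained; the real `TokenizerArgs` lives in `torchchat/cli/builder.py`:

```python
from enum import Enum, auto


class TokenizerType(Enum):  # stand-in for the centralized enum sketched above
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()


class TokenizerArgs:  # trimmed stand-in for the real TokenizerArgs
    def __init__(self, tokenizer_type: TokenizerType) -> None:
        self.tokenizer_type = tokenizer_type


# The check in generate.py could then read straight off the enum instead of
# going through an is_tiktoken() getter:
tokenizer_args = TokenizerArgs(TokenizerType.TIKTOKEN)
is_llama3_model = tokenizer_args.tokenizer_type is TokenizerType.TIKTOKEN
assert is_llama3_model
```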
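For the validation logic, one possible direction (purely a sketch; `expected_tokenizer_type` is a hypothetical helper and `ModelConfig` is a trimmed stand-in for `model.config`) is to map the model config to the same enum, so onboarding a new tokenizer means adding one enum member and one mapping branch instead of another set of boolean flags:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenizerType(Enum):  # stand-in for the centralized enum sketched above
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


@dataclass
class ModelConfig:  # trimmed stand-in for model.config
    use_tiktoken: bool = False
    use_hf_tokenizer: bool = False


def expected_tokenizer_type(config: ModelConfig) -> TokenizerType:
    # Hypothetical helper: collapses the use_* flags into a single enum member.
    if config.use_tiktoken:
        return TokenizerType.TIKTOKEN
    if config.use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    return TokenizerType.SENTENCEPIECE


def validate_tokenizer(
    found: TokenizerType, config: ModelConfig, model_description: str = "model"
) -> None:
    # Sketch of how the check in builder.py could collapse once both sides
    # speak the same enum.
    expected = expected_tokenizer_type(config)
    if found != expected:
        raise RuntimeError(
            f"model-specified tokenizer ({expected.name}) does not match "
            f"provided tokenizer ({found.name}) for {model_description}"
        )


# Example: a model that expects tiktoken, but a SentencePiece tokenizer was found.
try:
    validate_tokenizer(TokenizerType.SENTENCEPIECE, ModelConfig(use_tiktoken=True))
except RuntimeError as e:
    print(e)
```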
To test, run a model with each tokenizer type:
- `python torchchat.py generate llama2`
- `python torchchat.py generate llama3`
- `python torchchat.py generate granite-code`
cc @Jack-Khuu @byjlw