🚀 The feature, motivation and pitch
As a follow-up to #1518, where we added an enum for tokenizer types to simplify `TokenizerArgs.__post_init__`, we should improve it further to make onboarding new tokenizer types simpler:
Tasks
- Move `TokenizerType` to a centralized place (a consolidated sketch follows the task list)
  - We currently have two definitions of it:
    Lines 67 to 69 in 0299a37:

    ```python
    class TokenizerType(Enum):
        Tiktoken = auto()
        SentencePiece = auto()
    ```

    torchchat/torchchat/cli/builder.py, Lines 241 to 245 in 0299a37:

    ```python
    class TokenizerType(Enum):
        NONE = 0
        TIKTOKEN = 1
        SENTENCEPIECE = 2
        HF_TOKENIZER = 3
    ```
- Check all getters of tokenizer types
  - They can likely be simplified into inline checks (see the sketch after this task list), e.g.:
    torchchat/torchchat/generate.py, Line 368 in 0299a37:

    ```python
    self.is_llama3_model = self.tokenizer_args.is_tiktoken()
    ```
- Add documentation for onboarding future tokenizer types
  - We may need to point people to update the model validation logic (one possible simplification is sketched after this task list):
    torchchat/torchchat/cli/builder.py, Lines 290 to 322 in 0299a37:

    ```python
    def validate_model(
        self,
        model: Optional[Model],
        model_description: str = "model",
    ) -> None:
        if model is None:
            return

        if self.tokenizer_type == TokenizerType.NONE:
            raise RuntimeError(f"no tokenizer was found at {self.tokenizer_path}")

        is_tiktoken = self.is_tiktoken()
        is_sentencepiece = self.is_sentencepiece()
        is_hf_tokenizer = self.is_hf_tokenizer()

        use_tiktoken = model.config.use_tiktoken
        use_hf_tokenizer = model.config.use_hf_tokenizer
        use_sentencepiece = not (use_tiktoken or use_hf_tokenizer)

        if (
            (is_tiktoken and not use_tiktoken)
            or (is_hf_tokenizer and not use_hf_tokenizer)
            or (is_sentencepiece and not use_sentencepiece)
        ):
            raise RuntimeError(
                "model-specified tokenizer ({}) does not match provided tokenizer ({}) for {}".format(
                    tokenizer_setting_to_name(use_tiktoken, use_hf_tokenizer),
                    tokenizer_setting_to_name(is_tiktoken, is_hf_tokenizer),
                    model_description,
                )
            )

        return
    ```
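To make the first task concrete, here is a minimal sketch of what a single, centralized `TokenizerType` could look like, merging the two existing definitions. The helper methods and the module it would live in are assumptions for illustration, not something this issue has decided:

```python
# Sketch only: a possible shared definition of TokenizerType. Where it should
# live (e.g. a shared tokenizer module) is exactly the open question above.
from enum import Enum, auto


class TokenizerType(Enum):
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()

    # Thin helpers so call sites don't have to import-and-compare everywhere.
    def is_tiktoken(self) -> bool:
        return self is TokenizerType.TIKTOKEN

    def is_sentencepiece(self) -> bool:
        return self is TokenizerType.SENTENCEPIECE

    def is_hf_tokenizer(self) -> bool:
        return self is TokenizerType.HF_TOKENIZER
```

With one shared enum, `torchchat/cli/builder.py` and the second definition would both import it, and `TokenizerArgs.__post_init__` would only need to set `self.tokenizer_type` once.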
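For the getter check in `generate.py`, a hedged sketch of what the inlined version might look like. The stand-in classes below only exist to make the snippet self-contained; the real `TokenizerArgs` lives in `torchchat/cli/builder.py`:

```python
from enum import Enum, auto


class TokenizerType(Enum):  # stand-in for the centralized enum sketched above
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()


class TokenizerArgs:  # trimmed stand-in for the real TokenizerArgs
    def __init__(self, tokenizer_type: TokenizerType) -> None:
        self.tokenizer_type = tokenizer_type


# The check in generate.py could then read straight off the enum instead of
# going through an is_tiktoken() getter:
tokenizer_args = TokenizerArgs(TokenizerType.TIKTOKEN)
is_llama3_model = tokenizer_args.tokenizer_type is TokenizerType.TIKTOKEN
assert is_llama3_model
```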
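For the validation logic, one possible direction (purely a sketch; `expected_tokenizer_type` is a hypothetical helper and `ModelConfig` is a trimmed stand-in for `model.config`) is to map the model config to the same enum, so onboarding a new tokenizer means adding one enum member and one mapping branch instead of another set of boolean flags:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenizerType(Enum):  # stand-in for the centralized enum sketched above
    NONE = auto()
    TIKTOKEN = auto()
    SENTENCEPIECE = auto()
    HF_TOKENIZER = auto()


@dataclass
class ModelConfig:  # trimmed stand-in for model.config
    use_tiktoken: bool = False
    use_hf_tokenizer: bool = False


def expected_tokenizer_type(config: ModelConfig) -> TokenizerType:
    # Hypothetical helper: collapses the use_* flags into a single enum member.
    if config.use_tiktoken:
        return TokenizerType.TIKTOKEN
    if config.use_hf_tokenizer:
        return TokenizerType.HF_TOKENIZER
    return TokenizerType.SENTENCEPIECE


def validate_tokenizer(
    found: TokenizerType, config: ModelConfig, model_description: str = "model"
) -> None:
    # Sketch of how the check in builder.py could collapse once both sides
    # speak the same enum.
    expected = expected_tokenizer_type(config)
    if found != expected:
        raise RuntimeError(
            f"model-specified tokenizer ({expected.name}) does not match "
            f"provided tokenizer ({found.name}) for {model_description}"
        )


# Example: a model that expects tiktoken, but a SentencePiece tokenizer was found.
try:
    validate_tokenizer(TokenizerType.SENTENCEPIECE, ModelConfig(use_tiktoken=True))
except RuntimeError as e:
    print(e)
```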
To test, run a model with each tokenizer type:
- `python torchchat.py generate llama2`
- `python torchchat.py generate llama3`
- `python torchchat.py generate granite-code`
cc @Jack-Khuu @byjlw