Support vocabulary expansion by directly injecting new tokens + fix the FastTokenizer issue when extending the LLaMA vocabulary #70
common-char-3500.txt: 3,500 common Chinese characters
common-word-6000.txt: 6,000 common Chinese words
Thanks to #65 for pointing out the problem. For the LLaMA model, the fast and slow tokenizers behave differently; earlier testing loaded the tokenizer via LlamaTokenizer.from_pretrained (the slow tokenizer), so the issue went unnoticed.

#65 also proposed three solutions, and #66 adopted the third one, setting normalized=False; however, this also sets the flag to False on tokens already in the vocabulary. Taking the second option, the Chinese-LLaMA-Alpaca approach, into account as well, we ultimately chose to disable use_fast for LLaMA models instead.

Additionally, the Chinese-LLaMA-Alpaca approach is suited to models tokenized with sentencepiece, such as Baichuan, ChatGLM, and the many LLaMA-like models; support for those will likely be added separately later.
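A minimal sketch of the two options discussed above, using the transformers library. The sample words and the checkpoint path are placeholders, not the actual files in this PR; the key points are the AddedToken(..., normalized=False) flag from #66 and the use_fast=False loading that this PR adopts.

```python
from transformers import AddedToken

# New tokens to inject directly into the vocabulary (example words only).
new_words = ["你好", "世界"]

# Option from #66: normalized=False keeps the fast tokenizer's normalizer
# from altering these tokens before matching. Setting the flag only on the
# newly added tokens leaves existing vocabulary entries untouched.
new_tokens = [AddedToken(w, normalized=False) for w in new_words]

# Option adopted in this PR: load the slow (sentencepiece-based) tokenizer
# instead, then add the tokens. "path/to/llama" is a placeholder path:
#   tokenizer = AutoTokenizer.from_pretrained("path/to/llama", use_fast=False)
#   tokenizer.add_tokens(new_tokens)
print([t.content for t in new_tokens])
```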