Vocabulary expansion code needs optimization #65

Closed
tiandiweizun opened this issue Oct 25, 2023 · 0 comments
Comments


tiandiweizun commented Oct 25, 2023

The add_new_tokens call inside expand_vocab adds new tokens to tokens_trie, which only works for the slow tokenizer. With the fast tokenizer the tokens are nominally added, but tokenize calls encode_plus directly, which goes through the Rust backend, so the new tokens are ignored. For example, if you have downloaded llama2-xxx-hf:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf")
tokenizer.add_tokens("谁")
print(tokenizer.tokenize("你是谁"))

you will see that the added token has no effect.

Suggested fixes (choose any one):

  1. AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf", use_fast=False)
  2. Follow the merge_tokenizer procedure in https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_tokenizer/merge_tokenizers.py
  3. tokenizer.add_tokens(AddedToken(token, normalized=False)) (see the sketch below)
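
A minimal sketch of option 3, assuming the same local checkpoint path as above and the AddedToken class exported by transformers; the token "谁" is only an illustrative example:

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf")

# Wrapping the token in AddedToken with normalized=False registers it with the
# fast (Rust) backend so it is matched literally instead of being normalized away.
tokenizer.add_tokens(AddedToken("谁", normalized=False))

print(tokenizer.tokenize("你是谁"))  # "谁" should now come back as a single token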

#66
