
Support vocabulary expansion by directly injecting new tokens + fix the FastTokenizer issue when expanding LLaMA's vocabulary #70

Merged
merged 5 commits into stanleylsx:main on Oct 28, 2023

Conversation

Mxoder
Contributor

@Mxoder Mxoder commented Oct 28, 2023

  1. Added support for expanding the vocabulary by directly injecting new tokens, along with two example files:
  • common-char-3500.txt: 3,500 common Chinese characters
  • common-word-6000.txt: 6,000 common Chinese words
  2. Thanks to #65 for pointing out the problem. For LLaMA models, the fast and slow tokenizers behave differently; earlier tests loaded the tokenizer via LlamaTokenizer.from_pretrained, so the issue was not noticed. #65 also proposed three fixes, and #66 adopted the third one, setting normalized=False, which would also set normalized to False for tokens already in the vocabulary. Taking the second option, the Chinese-LLaMA-Alpaca approach, into account, this PR instead disables use_fast for LLaMA models (a minimal sketch of the resulting flow follows this list).

  3. In addition, the Chinese-LLaMA-Alpaca approach is suited to models tokenized with sentencepiece, such as Baichuan, ChatGLM, and the many LLaMA-like models; separate support for these will likely be added later.
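For reference, here is a minimal sketch of the direct-injection flow described in items 1 and 2, assuming the standard transformers AutoTokenizer/AutoModelForCausalLM API; the model name is a placeholder, and only the common-char-3500.txt example file added in this PR is used:

```python
# Minimal sketch of vocabulary expansion by direct injection (item 1 above).
# The model name is a placeholder; common-char-3500.txt is one of the
# example files added in this PR.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder

# For LLaMA, load the slow (sentencepiece-based) tokenizer to avoid the
# LlamaTokenizerFast issue discussed in #65 / #66.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Read the new vocabulary entries, one token per line.
with open("common-char-3500.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

# add_tokens skips entries already present in the vocabulary and
# returns the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)
print(f"Injected {num_added} new tokens")

# Grow the embedding matrix so the new token ids have embedding rows.
model.resize_token_embeddings(len(tokenizer))
```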

@stanleylsx stanleylsx merged commit 6cbcfdc into stanleylsx:main Oct 28, 2023
@Mxoder
Contributor Author

Mxoder commented Oct 28, 2023

add_tokens is currently used for the expansion with other model classes in mind. Also, in earlier tests, among the fast tokenizers only LlamaTokenizerFast had this problem; others such as BloomTokenizerFast, BartTokenizerFast, GPTNeoXTokenizerFast, GPT2TokenizerFast, and T5TokenizerFast all inject new tokens correctly, so use_fast is currently disabled only for LLaMA. A quick check of the divergence is sketched below.
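A quick way to reproduce the fast/slow divergence for a single injected word (a sketch; the model name and sample word are placeholders):

```python
# Sketch of the check behind disabling use_fast only for LLaMA: the slow
# tokenizer encodes an injected word as a single id, while
# LlamaTokenizerFast may split it unless normalized=False is set
# on the added token.
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
word = "你好"                              # placeholder injected word

for use_fast in (False, True):
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    tok.add_tokens([word])
    ids = tok.encode(word, add_special_tokens=False)
    print(f"use_fast={use_fast}: ids={ids}, "
          f"tokens={tok.convert_ids_to_tokens(ids)}")
```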
