
Support vocabulary expansion by directly injecting new tokens + fix the FastTokenizer issue when expanding LLaMA's vocabulary #70

Merged
merged 5 commits into stanleylsx:main on Oct 28, 2023

Conversation

Mxoder
Contributor

@Mxoder Mxoder commented Oct 28, 2023

  1. Added support for expanding the vocabulary by directly injecting new tokens, along with two example files:
  • common-char-3500.txt: 3,500 common Chinese characters
  • common-word-6000.txt: 6,000 common Chinese words
  2. Thanks to #65 for pointing out the problem. For LLaMA models, the fast and slow tokenizers behave differently; earlier tests loaded the tokenizer via LlamaTokenizer.from_pretrained, so the issue was not noticed. #65 also proposed three fixes, and #66 adopted the third one, setting normalized=False, which would also set normalized to False for tokens already in the vocabulary. Taking the second option, the Chinese-LLaMA-Alpaca approach, into account, this PR instead disables use_fast for LLaMA models (a minimal sketch of the resulting flow follows this list).

  3. In addition, the Chinese-LLaMA-Alpaca approach is suited to models tokenized with sentencepiece, such as Baichuan, ChatGLM, and the many LLaMA-like models; separate support for these will likely be added later.
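For reference, here is a minimal sketch of the direct-injection flow described in items 1 and 2, assuming the standard transformers AutoTokenizer/AutoModelForCausalLM API; the model name is a placeholder, and only the common-char-3500.txt example file added in this PR is used:

```python
# Minimal sketch of vocabulary expansion by direct injection (item 1 above).
# The model name is a placeholder; common-char-3500.txt is one of the
# example files added in this PR.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder

# For LLaMA, load the slow (sentencepiece-based) tokenizer to avoid the
# LlamaTokenizerFast issue discussed in #65 / #66.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Read the new vocabulary entries, one token per line.
with open("common-char-3500.txt", encoding="utf-8") as f:
    new_tokens = [line.strip() for line in f if line.strip()]

# add_tokens skips entries already present in the vocabulary and
# returns the number of tokens actually added.
num_added = tokenizer.add_tokens(new_tokens)
print(f"Injected {num_added} new tokens")

# Grow the embedding matrix so the new token ids have embedding rows.
model.resize_token_embeddings(len(tokenizer))
```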

@stanleylsx stanleylsx merged commit 6cbcfdc into stanleylsx:main Oct 28, 2023
@Mxoder
Contributor Author

Mxoder commented Oct 28, 2023

add_tokens is currently used for the expansion with other model classes in mind. Also, in earlier tests, among the fast tokenizers only LlamaTokenizerFast had this problem; others such as BloomTokenizerFast, BartTokenizerFast, GPTNeoXTokenizerFast, GPT2TokenizerFast, and T5TokenizerFast all inject new tokens correctly, so use_fast is currently disabled only for LLaMA. A quick check of the divergence is sketched below.
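A quick way to reproduce the fast/slow divergence for a single injected word (a sketch; the model name and sample word are placeholders):

```python
# Sketch of the check behind disabling use_fast only for LLaMA: the slow
# tokenizer encodes an injected word as a single id, while
# LlamaTokenizerFast may split it unless normalized=False is set
# on the added token.
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
word = "你好"                              # placeholder injected word

for use_fast in (False, True):
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    tok.add_tokens([word])
    ids = tok.encode(word, add_special_tokens=False)
    print(f"use_fast={use_fast}: ids={ids}, "
          f"tokens={tok.convert_ids_to_tokens(ids)}")
```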
