MacBERT-CSC tokenizer encoding may produce outputs of unequal length #469

Closed
yongzhuo opened this issue Jan 25, 2024 · 3 comments
Labels
question Further information is requested

Comments

@yongzhuo

Describe the Question

Please provide a clear and concise description of what the question is.
图书信息作\u3000者:张丽珊出版社:天津人民出版社ISBN:978-7-201-08411-4出版时间:2014-01-02定\u3000价:¥34.魃0内容简介没有经过婚姻治疗的离婚是人世间最不负责任的务定,完整的家庭是父母给予孩子最好的礼物

冰河漂流  沸腾阿尔山之冬


@yongzhuo added the question label (Further information is requested) on Jan 25, 2024
@shibing624
Owner

The model cannot guarantee 100% accuracy. Please don't open issues for single cases like this one; they are of little value.

@yongzhuo
Author

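It is really still an encoding problem. One kind is like the earlier issues/448, where a particular character is missing;
the other is the "¥34.魃0" case in this issue: the model corrected 魃 to 8, the digits were then merged and treated as a single id, so the encoding no longer matches the original, and everything after that point is aligned incorrectly.
Because BPE does not map each character to exactly one id, it will introduce this problem to some degree.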
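
The length mismatch described in this comment can be illustrated with a toy example. This is only a sketch under stated assumptions: `toy_tokenize` is a hypothetical stand-in that keeps runs of ASCII letters and digits together and splits everything else into single characters. It is not the MacBERT-CSC tokenizer, and all names below are made up for illustration.

```python
import re


def toy_tokenize(text):
    """Hypothetical tokenizer (NOT the real MacBERT one): runs of ASCII
    letters/digits form one token, every other non-space character is
    its own token."""
    return re.findall(r"[A-Za-z0-9]+|\S", text)


def positional_diff(a, b):
    """Compare two sequences position by position and return the
    (index, left, right) triples that differ. zip() stops at the shorter
    sequence, so trailing items of the longer one are ignored."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Fragment taken from the issue text, plus the correction described above.
src = "¥34.魃0内容"
fixed = "¥34.80内容"

src_tokens = toy_tokenize(src)      # "魃" separates "34", ".", "0"
fixed_tokens = toy_tokenize(fixed)  # "8" and "0" merge into one token "80"

# Token-level comparison: every position after the merge is shifted.
token_diffs = positional_diff(src_tokens, fixed_tokens)

# Character-level comparison: both strings have the same number of
# characters, so only the real correction is reported.
char_diffs = positional_diff(src, fixed)

print(len(src_tokens), len(fixed_tokens))
print(token_diffs)
print(char_diffs)
```

Under these assumptions the source fragment yields 7 tokens and the corrected fragment 6, so the token-level comparison reports three differences (the merged digits plus two shifted positions), while the character-level comparison reports only the single change from 魃 to 8. Aligning on characters rather than token ids is one possible way to avoid the shift; whether the actual fix takes that approach is not stated in this thread.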

@shibing624
Owner

Good point, I'll fix it.
