Too few characters can be recognized #33

Open
QWERTY770 opened this issue Jan 23, 2022 · 1 comment
Labels
enhancement New feature or request todo

Comments

@QWERTY770

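I just checked whether each Chinese character from 0x4e00 to 0x9fff can be recognized. A character counts as recognized if the service returns related near-synonyms, and as unrecognized if it returns {"error": 1}.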

Verification code (may take several hours to run):

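import requests as r

URL = "https://wantwords.thunlp.org/ChineseRD/"
result = 0    # unrecognized characters overall
result_2 = 0  # unrecognized characters in the current 256-character block
# Start at 0x4e00 so that all 20992 characters up to 0x9fff are covered.
for i in range(0x4e00, 0x9fff + 1):
    # Let requests URL-encode the character instead of formatting it into the URL.
    t = r.get(URL, params={"description": chr(i), "mode": "CC"})
    if t.text == '{"error": 1}':
        result += 1
        result_2 += 1
    # Report once the last character of each 256-character block has been counted.
    if i % 256 == 255:
        print(f"There are {result_2} unrecognizable characters in 256 characters({hex(i - 255)}~{hex(i)}).")
        result_2 = 0
print(result)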
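
Because the loop sends one request per character over several hours, it can help to keep the counting logic separate from the network call so it can be tried offline first. The following is only a minimal sketch, assuming the endpoint behaves as described above (a body of {"error": 1} for an unrecognized character). The helper names `is_unrecognized`, `fetch`, and `count_unrecognized` and the 10-second timeout are illustrative assumptions, not part of WantWords.

```python
import json

URL = "https://wantwords.thunlp.org/ChineseRD/"  # endpoint taken from the script above


def is_unrecognized(body: str) -> bool:
    """Return True if the response body is the {"error": 1} marker."""
    try:
        data = json.loads(body)
    except ValueError:
        return False  # not JSON at all, so not the error marker
    return isinstance(data, dict) and data.get("error") == 1


def fetch(session, ch, timeout=10.0):
    """Query the service for one character using a requests.Session-like object."""
    resp = session.get(URL, params={"description": ch, "mode": "CC"}, timeout=timeout)
    return resp.text


def count_unrecognized(start, end, fetch_text):
    """Count code points in [start, end] whose response is the error marker.

    fetch_text is any callable mapping a character to a response body,
    so the counting logic can be exercised with a stub before a real run.
    """
    return sum(is_unrecognized(fetch_text(chr(cp))) for cp in range(start, end + 1))
```

For a real run, one would pass something like `lambda ch: fetch(session, ch)` with `session = requests.Session()`, which reuses a single connection instead of opening a new one for each character.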

The result: of all 20992 Chinese characters in that range, 9033 cannot be recognized, and only 11959 can!

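So I think the software supports too few Chinese characters. Coverage of the basic CJK block is only 57%, and the extension blocks fare even worse. Many characters that are not especially rare cannot be recognized. Please consider expanding the vocabulary. This should certainly be feasible, since some of the unsupported characters can even be found with a Baidu search.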
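
The 57% figure follows directly from the counts reported above. A short sketch of the arithmetic, where the only inputs are the range bounds and the 9033 unrecognized characters from this post:

```python
FIRST, LAST = 0x4E00, 0x9FFF       # range checked in this issue
total = LAST - FIRST + 1           # number of code points in the range
unrecognized = 9033                # count reported by the script in this post
recognized = total - unrecognized
coverage = recognized / total

print(total, recognized, f"{coverage:.1%}")  # prints: 20992 11959 57.0%
```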

@Fanchao-Qi Fanchao-Qi added the enhancement New feature or request label Feb 28, 2022
@Fanchao-Qi
Contributor

Thanks for the feedback! We are training a new model to cover more Chinese characters.
