Too few characters are recognized #33

QWERTY770 · 2022-01-23T05:31:06Z

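I just checked whether each Chinese character from 0x4e00 to 0x9fff can be recognized (a character counts as recognized if related near-synonyms come back, and as unrecognized if the response is {"error": 1}).

Verification code (may take several hours to run):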

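import requests as r

START, END = 0x4e00, 0x9fff  # CJK Unified Ideographs block, 20992 code points
result = 0    # total number of unrecognized characters
result_2 = 0  # unrecognized characters in the current group of 256
for i in range(START, END + 1):
    # At each 256-character boundary, report the previous group and reset its counter.
    if i % 256 == 0 and i > START:
        print(f"There are {result_2} unrecognizable characters in 256 characters ({hex(i-256)}~{hex(i-1)}).")
        result_2 = 0
    t = r.get(f"https://wantwords.thunlp.org/ChineseRD/?description={chr(i)}&mode=CC")
    # The service answers {"error": 1} when it cannot recognize the input.
    if t.text == '{"error": 1}':
        result += 1
        result_2 += 1
# The loop never reaches another boundary after the last group, so report it here.
print(f"There are {result_2} unrecognizable characters in 256 characters ({hex(END-255)}~{hex(END)}).")
print(result)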
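
The loop above sends one request at a time, which is likely why it can take hours. Below is a minimal sketch of a concurrent variant using only the standard library. The helper names `is_unrecognized` and `count_unrecognized`, the worker count, and the timeout are all illustrative assumptions, as is the assumption that the response body is UTF-8. The figures reported in this issue came from the sequential loop, not from this sketch.

```python
import urllib.error
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://wantwords.thunlp.org/ChineseRD/"  # endpoint used in the loop above


def is_unrecognized(ch):
    """Return True when the service answers {"error": 1} for this character."""
    query = urllib.parse.urlencode({"description": ch, "mode": "CC"})
    try:
        with urllib.request.urlopen(f"{URL}?{query}", timeout=30) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        # Unlike requests, urllib raises on 4xx/5xx; the body is still readable.
        body = err.read()
    # Assumption: the body is UTF-8 text, compared exactly as in the loop above.
    return body.decode("utf-8").strip() == '{"error": 1}'


def count_unrecognized(start, end, check, workers=8):
    """Count code points in [start, end] for which check(char) is True."""
    chars = [chr(cp) for cp in range(start, end + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order; True counts as 1 when summed.
        return sum(pool.map(check, chars))
```

Calling `count_unrecognized(0x4E00, 0x9FFF, is_unrecognized)` would run the full scan. Keeping `workers` small seems advisable so the public server is not overloaded.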

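The result: of all 20992 characters, a full 9033 cannot be recognized, and only 11959 can!

So I think the software supports too few Chinese characters (coverage of the basic CJK set is only 57%, and the extension blocks fare even worse). Many characters that are not especially rare cannot be recognized. Please consider expanding the vocabulary (this is certainly feasible, since some of the unsupported characters can be found with a Baidu search).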
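
As a quick sanity check, the figures above can be reproduced from the code point range and the reported count of unrecognized characters. This sketch only redoes the arithmetic; it does not query the service, and the variable names are illustrative.

```python
first, last = 0x4E00, 0x9FFF       # range checked in this issue
total = last - first + 1           # number of code points in the range
unrecognized = 9033                # count reported by the scan
recognized = total - unrecognized
coverage = recognized / total

print(total, recognized, f"{coverage:.1%}")  # prints: 20992 11959 57.0%
```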


Fanchao-Qi · 2022-02-28T16:13:09Z

Thanks for the feedback! We are training a new model to cover more Chinese characters.

Fanchao-Qi added the enhancement New feature or request label Feb 28, 2022

Fanchao-Qi added the todo label May 13, 2022
