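import requests

URL = "https://wantwords.thunlp.org/ChineseRD/"
FIRST, LAST = 0x4E00, 0x9FFF  # code point range checked in this issue (inclusive)

total_failed = 0  # unrecognized characters over the whole range
block_failed = 0  # unrecognized characters in the current 256-character block
with requests.Session() as session:  # reuse one connection across all requests
    for i in range(FIRST, LAST + 1):
        t = session.get(URL, params={"description": chr(i), "mode": "CC"})
        # The service answers {"error": 1} when it cannot recognize the input.
        if t.text == '{"error": 1}':
            total_failed += 1
            block_failed += 1
        # Report once the last character of each 256-character block is counted.
        if i % 256 == 255:
            print(f"There are {block_failed} unrecognizable characters in 256 characters ({hex(i - 255)}~{hex(i)}).")
            block_failed = 0
print(total_failed)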
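I just checked whether each Chinese character from 0x4e00 to 0x9fff can be recognized. A character counts as recognized if related synonyms are returned, and as unrecognized if the response is {"error": 1}.
Verification code (above; it may take several hours to run).
Result: of the 20,992 characters in that range, 9,033 cannot be recognized; only 11,959 can!
So I think the software supports too few Chinese characters. Coverage of the basic CJK set is only 57%, and the extension blocks fare even worse. Many characters that are not especially rare cannot be recognized. Please consider expanding the vocabulary. This should be feasible, since some of the unsupported characters can even be found with a Baidu search.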
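Because a full run can take hours, it may help to record which characters fail rather than only how many. The following is a minimal sketch, not the script that produced the numbers in this issue. It assumes the same `{"error": 1}` response body as above; `scan`, `fetch` and `fake_fetch` are hypothetical names introduced here, and the stand-in `fake_fetch` lets the logic be tried offline.

```python
ERROR_BODY = '{"error": 1}'  # response body the script above treats as "unrecognized"


def scan(first, last, fetch):
    """Return the characters in [first, last] whose response equals ERROR_BODY.

    `fetch` is a hypothetical callable mapping one character to the response text.
    """
    missing = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        if fetch(ch).strip() == ERROR_BODY:
            missing.append(ch)
    return missing


# Offline demonstration: a stand-in fetch that "recognizes" only U+4E00.
def fake_fetch(ch):
    return "[]" if ch == chr(0x4E00) else ERROR_BODY


print(scan(0x4E00, 0x4E02, fake_fetch))
```

For a real run, `fetch` could wrap the same `requests` call used in the verification script and return its `.text`; the resulting list can then be attached to the issue so the missing characters are known exactly.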
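The figures above can be cross-checked with a few lines of arithmetic. This sketch only uses the range bounds and the unrecognized count reported in this issue; the variable names are illustrative.

```python
# Inclusive code point range scanned, as stated in this issue.
FIRST, LAST = 0x4E00, 0x9FFF

total = LAST - FIRST + 1      # number of code points in the range
unrecognized = 9033           # count reported in this issue
recognized = total - unrecognized

coverage = recognized / total
print(total, recognized, f"{coverage:.1%}")  # 20992 11959 57.0%
```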