In ngram_utils.py and phrase_extraction.py, I noticed that the initial text processing is

`re.split(r'[;;.。,,!\n!??]',corpus)`

which first splits the corpus at the punctuation marks in this character class, followed by

`re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", corpus)`

which keeps only Chinese characters, digits, and upper/lowercase English letters (all other symbols and meaningless characters are deleted, but no split happens at their positions). Two small questions about this:
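For concreteness, here is a minimal runnable sketch of the two steps as I understand them (the toy corpus string and applying the re.sub per segment are my assumptions, not code taken from the repo):

```python
import re

corpus = u"《刑法》修订版(全一册),2021年!"

# Step 1: cut the corpus into segments at the listed punctuation marks.
segments = re.split(r'[;;.。,,!\n!??]', corpus)

# Step 2: within each segment, keep only Chinese characters, digits and
# ASCII letters; everything else is deleted, with no split at its position.
cleaned = [re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])", "", s)
           for s in segments]

print(cleaned)  # ['刑法修订版全一册', '2021年', '']
```

Note how 》 and ( are deleted in step 2, so 法/修 and 版/全 end up adjacent in the output even though they were never adjacent in the source text.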
A. Why does the punctuation set `[;;.。,,!\n!??]` get this special treatment? That is, why not split the text at every punctuation mark?
B. In step 2, once the meaningless characters are removed, the text on either side of each removed character is concatenated directly. Character pairs that were never actually adjacent now co-occur, which biases the information entropy upward and produces unreasonable n-gram candidates ("法修", "订版全", "版全", ...). How should this problem be handled?
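Not sure whether this matches the design intent, but one possible way to handle B that I was considering: treat every removed character as a boundary, i.e. split on it instead of deleting it, so that no candidate n-gram ever spans a deleted position. A rough sketch (the helper name clean_and_split is mine):

```python
import re

def clean_and_split(corpus):
    """Split at sentence punctuation, then split again at every run of
    characters outside the kept classes (Chinese characters, digits,
    ASCII letters) instead of deleting them."""
    pieces = []
    for seg in re.split(r'[;;.。,,!\n!??]', corpus):
        pieces.extend(p for p in re.split(
            u"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]+", seg) if p)
    return pieces

print(clean_and_split(u"《刑法》修订版(全一册),2021年!"))
# ['刑法', '修订版', '全一册', '2021年']  -> no spurious "法修" / "订版全"
```

The tradeoff is a shorter average segment length, but that seems cheaper than feeding spurious cross-boundary candidates into the entropy statistics.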