pip install pyunit-newword
The algorithm stores its statistics in hash dictionaries, which consumes a great deal of memory: a 100 MB pure-Chinese corpus needs more than 12 GB of RAM, otherwise processing becomes prohibitively slow.
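The note above says the statistics live in hash dictionaries. As a rough illustration of why memory grows so fast (this is the general technique behind new-word discovery, not pyunit-newword's actual code; `ngram_counts` and `cohesion` are hypothetical names), every substring up to some length becomes a dictionary key, and candidates are scored by how much more probable the whole string is than its best split:

```python
# Hypothetical sketch of hash-dictionary n-gram counting for new-word
# discovery (NOT pyunit-newword's real implementation).
from collections import Counter
import math

def ngram_counts(text, max_n=4):
    """Count every n-gram up to max_n in a dict.

    Each substring becomes its own key, which is why memory grows far
    faster than the input text itself.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cohesion(word, counts, total):
    """PMI-style cohesion: how much more likely the whole candidate is
    than its most probable two-part split. Higher means more word-like."""
    p_word = counts[word] / total
    best_split = max(
        (counts[word[:i]] / total) * (counts[word[i:]] / total)
        for i in range(1, len(word))
    )
    return math.log(p_word / best_split)

text = "南京市长江大桥南京市长江大桥南京市长江大桥"
counts = ngram_counts(text)
total = sum(c for gram, c in counts.items() if len(gram) == 1)
print(cohesion("大桥", counts, total))  # positive: "大桥" coheres as a word
```

Even this toy corpus of 21 characters produces dozens of dictionary keys; on a 100 MB corpus the key count explodes, which is consistent with the memory figures quoted above.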
A model that recognizes new words automatically has been added; it requires no manual parameter tuning.
Example with manually chosen filter parameters:

    from pyunit_newword import NewWords

    if __name__ == '__main__':
        nw = NewWords(filter_cond=10, filter_free=2)
        nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
        nw.analysis_data()
        with open('分析结果.txt', 'w', encoding='utf-8') as f:
            for word in nw.get_words():
                print(word)
                f.write(word[0] + '\n')
Example using the automatic model (only an `accuracy` value, no manual thresholds):

    from pyunit_newword import NewWords

    if __name__ == '__main__':
        nw = NewWords(accuracy=0.01)
        nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
        nw.analysis_data()
        with open('分析结果.txt', 'w', encoding='utf-8') as f:
            for word in nw.get_words():
                print(word)
                f.write(word[0] + '\n')
- Automatic filter-parameter search: the optimal parameter values are found automatically.
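The library does not document how its automatic parameter search works. As a purely hypothetical sketch of one simple strategy (a stability heuristic; `auto_threshold` and `candidates` are invented names, not pyunit-newword's API), one could scan frequency thresholds and keep the first one where raising the threshold stops changing the candidate set:

```python
# Hypothetical sketch of an automatic threshold search; NOT the
# library's actual strategy, which is undocumented.
from collections import Counter

def candidates(counts, threshold):
    """Words whose frequency reaches the threshold."""
    return {w for w, c in counts.items() if c >= threshold}

def auto_threshold(counts, thresholds=range(2, 10)):
    """Pick the threshold whose step drops the fewest candidates,
    i.e. where the candidate set is most stable."""
    best, best_delta = None, float("inf")
    prev = None
    for t in thresholds:
        cur = candidates(counts, t)
        if prev is not None:
            delta = len(prev) - len(cur)  # candidates lost by this step
            if delta < best_delta:
                best, best_delta = t, delta
        prev = cur
    return best

# Toy frequency table: real words are frequent, noise is rare.
counts = Counter({"的": 50, "桥": 10, "大桥": 8, "长江": 8, "噪": 1, "声": 1})
print(auto_threshold(counts))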