pip install pyunit-newword
The algorithm stores its statistics in hash dictionaries, which consumes a great deal of memory: a 100 MB pure-Chinese corpus needs more than 12 GB of RAM, otherwise processing becomes prohibitively slow.
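The note above says the statistics live in hash dictionaries. As a rough illustration of why memory grows so fast (this is the general technique behind new-word discovery, not pyunit-newword's actual code; `ngram_counts` and `cohesion` are hypothetical names), every substring up to some length becomes a dictionary key, and candidates are scored by how much more probable the whole string is than its best split:

```python
# Hypothetical sketch of hash-dictionary n-gram counting for new-word
# discovery (NOT pyunit-newword's real implementation).
from collections import Counter
import math

def ngram_counts(text, max_n=4):
    """Count every n-gram up to max_n in a dict.

    Each substring becomes its own key, which is why memory grows far
    faster than the input text itself.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cohesion(word, counts, total):
    """PMI-style cohesion: how much more likely the whole candidate is
    than its most probable two-part split. Higher means more word-like."""
    p_word = counts[word] / total
    best_split = max(
        (counts[word[:i]] / total) * (counts[word[i:]] / total)
        for i in range(1, len(word))
    )
    return math.log(p_word / best_split)

text = "南京市长江大桥南京市长江大桥南京市长江大桥"
counts = ngram_counts(text)
total = sum(c for gram, c in counts.items() if len(gram) == 1)
print(cohesion("大桥", counts, total))  # positive: "大桥" coheres as a word
```

Even this toy corpus of 21 characters produces dozens of dictionary keys; on a 100 MB corpus the key count explodes, which is consistent with the memory figures quoted above.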
A model that recognizes new words automatically has been added; it requires no manual parameter tuning.
Example with manually chosen filter parameters:

    from pyunit_newword import NewWords

    if __name__ == '__main__':
        nw = NewWords(filter_cond=10, filter_free=2)
        nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
        nw.analysis_data()
        with open('分析结果.txt', 'w', encoding='utf-8') as f:
            for word in nw.get_words():
                print(word)
                f.write(word[0] + '\n')
Example using the automatic model (only an `accuracy` value, no manual thresholds):

    from pyunit_newword import NewWords

    if __name__ == '__main__':
        nw = NewWords(accuracy=0.01)
        nw.add_text(r'C:\Users\Administrator\Desktop\微博数据.txt')
        nw.analysis_data()
        with open('分析结果.txt', 'w', encoding='utf-8') as f:
            for word in nw.get_words():
                print(word)
                f.write(word[0] + '\n')
- Automatic filter-parameter search: the optimal parameter values are found automatically.
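The library does not document how its automatic parameter search works. As a purely hypothetical sketch of one simple strategy (a stability heuristic; `auto_threshold` and `candidates` are invented names, not pyunit-newword's API), one could scan frequency thresholds and keep the first one where raising the threshold stops changing the candidate set:

```python
# Hypothetical sketch of an automatic threshold search; NOT the
# library's actual strategy, which is undocumented.
from collections import Counter

def candidates(counts, threshold):
    """Words whose frequency reaches the threshold."""
    return {w for w, c in counts.items() if c >= threshold}

def auto_threshold(counts, thresholds=range(2, 10)):
    """Pick the threshold whose step drops the fewest candidates,
    i.e. where the candidate set is most stable."""
    best, best_delta = None, float("inf")
    prev = None
    for t in thresholds:
        cur = candidates(counts, t)
        if prev is not None:
            delta = len(prev) - len(cur)  # candidates lost by this step
            if delta < best_delta:
                best, best_delta = t, delta
        prev = cur
    return best

# Toy frequency table: real words are frequent, noise is rare.
counts = Counter({"的": 50, "桥": 10, "大桥": 8, "长江": 8, "噪": 1, "声": 1})
print(auto_threshold(counts))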