znsoftm/baidubaike-corpus

 
 

baidubaike-corpus

Crawl a Chinese-language corpus from Baidu Baike.

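Crawling from a single seed page yields relatively few entries: starting from '科学' ("science") reached 70,000+ entries. A crude workaround is to start from several seed pages, ideally using terms related to your own project.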
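The multi-seed idea can be sketched as a breadth-first crawl that starts from several entry terms at once. This is a minimal illustration, not the repository's actual code: `crawl_entries` and the `get_links` callable (which would fetch an entry page and return the terms it links to) are hypothetical names introduced here, and the network step is left to the caller.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_entries(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_entries: int = 100_000,
) -> List[str]:
    """Breadth-first crawl over entry terms, starting from several seeds.

    `get_links` is a hypothetical callable: given an entry term, it returns
    the terms linked from that entry's page (fetching and parsing are the
    caller's responsibility).
    """
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited: List[str] = []

    while queue and len(visited) < max_entries:
        term = queue.popleft()
        visited.append(term)
        for linked in get_links(term):
            if linked not in seen:  # never enqueue the same entry twice
                seen.add(linked)
                queue.append(linked)
    return visited
```

With more seeds, the crawl reaches parts of the link graph that a single seed never touches, which is why adding project-related seed terms increases the entry count.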

Update: the crawler now uses multiple processes combined with multiple threads to increase crawling speed.
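One way to combine processes and threads, shown here only as a sketch under assumptions, is to split the pending terms into one batch per process and let each process download its batch with a thread pool. The names `fetch_entry`, `fetch_batch`, `split_batches`, and `crawl_parallel` are hypothetical, and `fetch_entry` is a placeholder standing in for the real request-and-parse step.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial


def fetch_entry(term):
    """Hypothetical placeholder for the real HTTP request and parsing."""
    return term, f"<text of {term}>"


def fetch_batch(terms, fetch=fetch_entry, threads=8):
    """Fetch one batch of terms concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, terms))  # results keep input order


def split_batches(terms, n):
    """Distribute terms round-robin into at most n non-empty batches."""
    return [terms[i::n] for i in range(n) if terms[i::n]]


def crawl_parallel(terms, fetch=fetch_entry, processes=4, threads=8):
    """Run one thread pool per worker process over its own batch.

    `fetch` must be picklable (for example a module-level function),
    because it is sent to the worker processes.
    """
    batches = split_batches(list(terms), processes)
    worker = partial(fetch_batch, fetch=fetch, threads=threads)
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for batch_result in pool.map(worker, batches):
            results.extend(batch_result)
    return results
```

Threads suit the I/O-bound download step, while separate processes let CPU-bound parsing run in parallel. On platforms that spawn rather than fork worker processes, `crawl_parallel` should be called from under an `if __name__ == "__main__":` guard.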

Added corpus preprocessing and model training code for word vectors.
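The README does not say which tools the preprocessing and training code uses, so the following is only an illustrative sketch. It assumes a simple cleaning step (split into sentences, keep only Chinese characters, tokenize) and, for training, the gensim library's `Word2Vec` class (gensim 4.0 or later). `preprocess` and `train_word_vectors` are hypothetical names; the default tokenizer splits into single characters, and a word segmenter such as `jieba.lcut` could be passed instead.

```python
import re

_SENTENCE_END = re.compile(r"[。！？!?\n]+")
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text, tokenize=list):
    """Turn raw entry text into a list of token lists, one per sentence.

    Everything outside the basic CJK range (punctuation, Latin letters,
    digits) is dropped. `tokenize` defaults to character-level splitting;
    pass a word segmenter for word-level vectors.
    """
    sentences = []
    for raw in _SENTENCE_END.split(text):
        cleaned = _NON_CHINESE.sub("", raw)
        if not cleaned:
            continue
        tokens = [tok for tok in tokenize(cleaned) if tok.strip()]
        if tokens:
            sentences.append(tokens)
    return sentences


def train_word_vectors(sentences, vector_size=100):
    """Train word vectors; assumes gensim >= 4.0 is installed."""
    from gensim.models import Word2Vec  # imported lazily: optional dependency

    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,
        window=5,
        min_count=1,
        workers=4,
    )
```

The hyperparameters above are placeholder values, not settings taken from the repository.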

Languages

  • Python 100.0%