# 1 Introduction


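Gensim is an open-source Python library for extracting semantic topics from documents conveniently and efficiently. It is designed to process raw, unstructured digital text ("plain text"). Several of its algorithms, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover semantic structure by examining co-occurrence patterns in a corpus of training documents.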

# 2 Quick start

In [12]:
import logging
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)

In [13]:
# Create a small corpus: each document is a list of (token_id, weight) pairs
from gensim import corpora,models,similarities

corpus=[[(0,1.0),(1,1.0),(2,1.0)],
        [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
        [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
        [(0, 1.0), (4, 2.0), (7, 1.0)],
        [(3, 1.0), (5, 1.0), (6, 1.0)],
        [(9, 1.0)],
        [(9, 1.0), (10, 1.0)],
        [(9, 1.0), (10, 1.0), (11, 1.0)],
        [(8, 1.0), (10, 1.0), (11, 1.0)]]

2018-05-09 15:06:25,490:INFO:'pattern' package not found; tag filters are not available for English
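
Each document in the corpus above is a sparse bag-of-words vector, that is, a list of `(token_id, count)` pairs. The sketch below shows one way such vectors could be built from tokenized text using only the standard library. The toy documents and the names `token2id` and `to_bow` are assumptions for illustration and are not the corpus used in this section. Gensim's own `corpora.Dictionary.doc2bow` does the same job, though it may assign ids in a different order.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized (not the corpus from this section).
docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "system"],
]

# Give every distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted((tid, float(n)) for tid, n in counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)
# [[(0, 1.0), (1, 1.0), (2, 1.0)], [(1, 1.0), (3, 2.0)]]
```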


In [14]:
# Fit a TF-IDF model that re-weights the vectors
tfidf=models.TfidfModel(corpus)

2018-05-09 15:06:25,500:INFO:collecting document frequencies
2018-05-09 15:06:25,502:INFO:PROGRESS: processing document #0
2018-05-09 15:06:25,504:INFO:calculating IDF weights for 9 documents and 11 features (28 matrix non-zeros)


In [15]:
vec=[(0,1),(4,1)]
print(tfidf[vec])

[(0, 0.8075244024440723), (4, 0.5898341626740045)]
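
The two weights above can be reproduced by hand. The sketch below assumes the weighting that gensim's `TfidfModel` appears to use by default: raw term frequency times `log2(N / df)`, followed by scaling to unit length. The helper name `tfidf_vector` is made up for this example. Token 0 occurs in 2 of the 9 documents and token 4 in 3 of them, so token 0 receives the larger weight.

```python
import math

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

n_docs = len(corpus)

# Document frequency: in how many documents does each token id occur?
df = {}
for doc in corpus:
    for term_id, _ in doc:
        df[term_id] = df.get(term_id, 0) + 1

def tfidf_vector(bow):
    """Weight each term by tf * log2(N / df), then scale to unit L2 norm."""
    weighted = [(t, tf * math.log2(n_docs / df[t])) for t, tf in bow if t in df]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(t, w / norm) for t, w in weighted] if norm else []

result = tfidf_vector([(0, 1), (4, 1)])
print(result)  # approximately [(0, 0.8075), (4, 0.5898)]
```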


In [16]:
index= similarities.SparseMatrixSimilarity(tfidf[corpus],num_features=12)
sims=index[tfidf[vec]]
print(list(enumerate(sims)))

2018-05-09 15:06:25,726:INFO:creating sparse index
2018-05-09 15:06:25,727:INFO:creating sparse matrix from corpus
2018-05-09 15:06:25,728:INFO:PROGRESS: at document #0
2018-05-09 15:06:25,729:INFO:created <9x12 sparse matrix of type '<class 'numpy.float32'>'
	with 28 stored elements in Compressed Sparse Row format>


[(0, 0.4662244), (1, 0.19139354), (2, 0.2460055), (3, 0.82094586), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
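
Each score above is the cosine similarity between the TF-IDF query vector and one TF-IDF document vector. Because all vectors have unit length, the cosine reduces to a sparse dot product. The following standard-library sketch recomputes the scores under the same assumed weighting (tf times `log2(N / df)`, unit-normalized). The helper names `tfidf_unit` and `cosine` are hypothetical. Documents 4 to 8 share no token with the query, which is why their score is exactly 0.

```python
import math

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

n_docs = len(corpus)
df = {}
for doc in corpus:
    for term_id, _ in doc:
        df[term_id] = df.get(term_id, 0) + 1

def tfidf_unit(bow):
    """TF-IDF weights scaled to unit length, as a {term_id: weight} dict."""
    weights = {t: tf * math.log2(n_docs / df[t]) for t, tf in bow}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

query = tfidf_unit([(0, 1), (4, 1)])
sims = [cosine(query, tfidf_unit(doc)) for doc in corpus]
print([(i, round(s, 4)) for i, s in enumerate(sims)])
```

The printed values agree with the index output above to about four decimal places; the small differences come from gensim storing the index as `float32`.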


# 3 Training word2vec on a Chinese corpus

## 3.1 Segmenting the Chinese corpus into words

In [17]:
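import jieba

# Segment each line of the raw corpus and write the space-separated tokens, one sentence per line.
# The output path matches the file loaded in section 3.2; "w" mode avoids appending duplicates on re-runs.
with open("files/data/python32-data/sentence.txt", encoding="utf8") as sentences_file, \
        open("files/data/python32-data/word.txt", "w", encoding="utf8") as word_file:
    for line in sentences_file:
        # str.replace returns a new string, so the result must be assigned back
        line = line.replace('\t', '').replace('\n', '').replace(' ', '')
        if not line:
            continue  # skip blank lines
        segment_words = jieba.cut(line, cut_all=False)  # precise mode
        word_file.write(" ".join(segment_words) + "\n")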

Building prefix dict from the default dictionary ...
2018-05-09 15:06:26,133:DEBUG:Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\Yanqiang\AppData\Local\Temp\jieba.cache
2018-05-09 15:06:27,277:DEBUG:Dumping model to file cache C:\Users\Yanqiang\AppData\Local\Temp\jieba.cache
Loading model cost 1.236 seconds.
2018-05-09 15:06:27,370:DEBUG:Loading model cost 1.236 seconds.
Prefix dict has been built succesfully.
2018-05-09 15:06:27,371:DEBUG:Prefix dict has been built succesfully.


## 3.2 Training the model with gensim's word2vec

Reference: [python初步实现word2vec](http://blog.csdn.net/xiaoquantouer/article/details/53583980) (a first word2vec implementation in Python)

In [21]:
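# Imports
from gensim.models import word2vec
import logging

# Setup
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)
sentences=word2vec.Text8Corpus("files/data/python32-data/word.txt")  # load the segmented corpus
# sg defaults to 0, so this trains a CBOW model (pass sg=1 for skip-gram); window defaults to 5.
# In gensim >= 4.0 the size parameter is named vector_size.
model=word2vec.Word2Vec(sentences,size=200)
print("输出模型",model)  # the label means "model summary"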

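# Word vectors are queried through model.wv; the model-level shortcuts were removed in gensim 4.0.
# Similarity between two words ("企业" = enterprise, "公司" = company)
try:
    y1=model.wv.similarity("企业","公司")
except KeyError:
    y1=0  # one of the words is not in the vocabulary
print("【企业】和【公司】的相似度为：{}\n".format(y1))

# The 20 words most similar to "科技" (technology)
y2=model.wv.most_similar("科技",topn=20)
print("与【科技】最相关的词有：\n")
for word in y2:
    print(word[0],word[1])
print("*********\n")

# Analogy query: words closest to 公司 (company) + 产品 (product) - 生产 (produce)
print("公司-产品","生产")
y3=model.wv.most_similar(positive=["公司","产品"],negative=["生产"],topn=3)
for word in y3:
    print(word[0],word[1])
print("*********\n")

# Find the word that does not belong with the others
y4=model.wv.doesnt_match("企业 公司 是 合作伙伴".split())
print("不合群的词：{}".format(y4))  # the label means "odd word out"
print("***********\n")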

# Save the model
model.save("files/data/python32-data/企业关系.model")

2018-05-09 15:07:52,206:INFO:collecting all words and their counts
2018-05-09 15:07:52,208:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-09 15:07:52,214:INFO:collected 4152 word types from a corpus of 20740 raw words and 3 sentences
2018-05-09 15:07:52,215:INFO:Loading a fresh vocabulary
2018-05-09 15:07:52,218:INFO:min_count=5 retains 579 unique words (13% of original 4152, drops 3573)
2018-05-09 15:07:52,219:INFO:min_count=5 leaves 15399 word corpus (74% of original 20740, drops 5341)
2018-05-09 15:07:52,223:INFO:deleting the raw counts dictionary of 4152 items
2018-05-09 15:07:52,224:INFO:sample=0.001 downsamples 73 most-common words
2018-05-09 15:07:52,224:INFO:downsampling leaves estimated 9696 word corpus (63.0% of prior 15399)
2018-05-09 15:07:52,226:INFO:estimated required memory for 579 words and 200 dimensions: 1215900 bytes
2018-05-09 15:07:52,227:INFO:resetting layer weights
2018-05-09 15:07:52,242:INFO:training model with 3 workers on 579 v

输出模型 Word2Vec(vocab=579, size=200, alpha=0.025)
【企业】和【公司】的相似度为：0.9999506965308783

与【科技】最相关的词有：

， 0.9999631643295288
是 0.9999608993530273
的 0.9999598264694214
有限公司 0.9999590516090393
产品 0.9999586343765259
。 0.9999575614929199
公司 0.9999558925628662
成为 0.9999552965164185
合作 0.9999547004699707
合作伙伴 0.9999545216560364
： 0.999954104423523
核心 0.9999538660049438
和 0.9999529123306274
及 0.9999523162841797
等 0.9999514818191528
中国 0.9999513626098633
经销商 0.9999507069587708
在 0.9999505281448364
代理商 0.9999504089355469
供应商 0.9999495148658752
*********

公司-产品 生产
及 0.9998089075088501
月 0.999807596206665
成为 0.9998065829277039
*********

不合群的词：企业
***********
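
The training log above reports only 3 sentences for 20740 raw words. This is consistent with `Text8Corpus` reading the file as one continuous stream and cutting it into fixed-size chunks (10000 words by default, as far as I know), so the original sentence boundaries are lost. If the segmented file holds one sentence per line, a restartable iterable that yields one token list per line keeps those boundaries. The class below is a minimal standard-library sketch of that idea. The name `SegmentedLines` and the demo file contents are hypothetical. Gensim also provides `word2vec.LineSentence`, which serves the same purpose.

```python
import os
import tempfile

class SegmentedLines:
    """Hypothetical helper: yield one list of tokens per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, since training iterates the corpus several times.
        with open(self.path, encoding="utf8") as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens

# Demo on a small temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write("企业 与 公司 合作\n\n科技 公司 的 产品\n")
    sentences = SegmentedLines(path)
    first_pass = list(sentences)
    second_pass = list(sentences)  # a second pass yields the same sentences

print(first_pass)
```

An object like this could be passed to `word2vec.Word2Vec` in place of the `Text8Corpus` instance, because the trainer accepts any iterable of token lists that can be iterated more than once.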



# 4 Training word2vec on an English corpus

## 4.1 Data preprocessing

In [30]:
import nltk
import re

# Load the corpus
with open('files/data/python32-data/alice.txt',encoding='utf-8') as file:
    txt_raw=file.read()
# Split the text into sentences with NLTK's punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences_raw = tokenizer.tokenize(txt_raw) # list of sentences
# for sen in sentences_raw:
#     print(sen)

# Preprocess each sentence
sntncs = []
stops = set(nltk.corpus.stopwords.words('english'))
for sntnc in sentences_raw:
    lttr_only = re.sub('[^a-zA-Z]', " ", sntnc) # replace punctuation and digits with spaces
    wrds = lttr_only.lower().split() # lowercase and split into words
    wrds_mnng = [w for w in wrds if w not in stops] # drop stop words
    sntncs.append(wrds_mnng)
# print(sntncs)
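
To see what the cleaning loop does to a single sentence, the sketch below applies the same three steps (keep letters only, lowercase, drop stop words) to a made-up sentence. The tiny `stops` set is a stand-in for the NLTK stop-word list and `clean_sentence` is a hypothetical helper. It also shows why the character class needs to be `a-zA-Z`: a range that ends in lowercase `z` (`A-z`) additionally matches `[`, `\`, `]`, `^`, `_` and the backtick, so those characters would survive the substitution.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
stops = {"the", "s"}

def clean_sentence(sentence, pattern, stopwords):
    """Replace non-letters with spaces, lowercase, split, and drop stop words."""
    letters_only = re.sub(pattern, " ", sentence)
    words = letters_only.lower().split()
    return [w for w in words if w not in stopwords]

sample = "The Rabbit's watch_chain cost [10] pounds!"

strict = clean_sentence(sample, r"[^a-zA-Z]", stops)
loose = clean_sentence(sample, r"[^a-zA-z]", stops)

print(strict)  # ['rabbit', 'watch', 'chain', 'cost', 'pounds']
print(loose)   # ['rabbit', 'watch_chain', 'cost', '[', ']', 'pounds']
```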


## 4.2 Training word2vec with gensim

In [32]:
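from gensim.models import word2vec
num_features = 1000 # dimensionality of the word vectors (default 100). Larger values need more training data to pay off; tens to a few hundred is the usual recommendation, so 1000 is well above that range.
min_word_count = 10 # vocabulary cutoff: words occurring fewer than min_count times are dropped (default 5)
num_workers = 4 # number of parallel training threads
window = 5 # maximum distance between the current word and a context word
# Parameter reference for gensim's Word2Vec: https://blog.csdn.net/szlcw1/article/details/52751314
# (in gensim >= 4.0 the size parameter is named vector_size)
model = word2vec.Word2Vec(sntncs, workers=num_workers,
        size=num_features, min_count=min_word_count,
        window=window)
model.save('files/data/python32-data/alice.model')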

2018-05-09 15:28:00,788:INFO:collecting all words and their counts
2018-05-09 15:28:00,789:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-05-09 15:28:00,793:INFO:collected 2866 word types from a corpus of 14027 raw words and 1744 sentences
2018-05-09 15:28:00,794:INFO:Loading a fresh vocabulary
2018-05-09 15:28:00,798:INFO:min_count=10 retains 300 unique words (10% of original 2866, drops 2566)
2018-05-09 15:28:00,799:INFO:min_count=10 leaves 8052 word corpus (57% of original 14027, drops 5975)
2018-05-09 15:28:00,801:INFO:deleting the raw counts dictionary of 2866 items
2018-05-09 15:28:00,802:INFO:sample=0.001 downsamples 112 most-common words
2018-05-09 15:28:00,803:INFO:downsampling leaves estimated 5563 word corpus (69.1% of prior 8052)
2018-05-09 15:28:00,804:INFO:estimated required memory for 300 words and 1000 dimensions: 2550000 bytes
2018-05-09 15:28:00,806:INFO:resetting layer weights
2018-05-09 15:28:00,820:INFO:training model with 4 workers on 
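
The "estimated required memory" lines in the logs give a feel for how the vector size drives model size. Both reported figures are consistent with roughly 500 bytes per vocabulary entry plus two `float32` matrices of shape vocabulary × dimensions (input vectors and negative-sampling output weights). That breakdown is an assumption inferred from the numbers, not taken from the source. The function name and its parameters below are hypothetical.

```python
def estimate_w2v_bytes(vocab_size, vector_size,
                       bytes_per_vocab_entry=500, float_bytes=4, n_matrices=2):
    """Rough memory estimate: vocabulary bookkeeping plus the weight matrices."""
    vocab_bytes = vocab_size * bytes_per_vocab_entry
    matrix_bytes = n_matrices * vocab_size * vector_size * float_bytes
    return vocab_bytes + matrix_bytes

alice_bytes = estimate_w2v_bytes(300, 1000)    # English model in this section
chinese_bytes = estimate_w2v_bytes(579, 200)   # Chinese model from section 3.2

print(alice_bytes)    # 2550000
print(chinese_bytes)  # 1215900
```

With only 300 words kept, the 1000-dimensional model still needs about twice the memory of the 579-word, 200-dimensional one, since the matrix term grows linearly with the vector size.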