# word2vec (2)
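#### For both the CBOW and Skip-gram models, word2vec provides two training frameworks, one based on Hierarchical Softmax and one based on Negative Sampling. Under the Hierarchical Softmax framework, both models are built on a Huffman tree.
#### Training procedure
#### Assume the Huffman tree has already been built and all vectors have been initialized, so training on the input text can begin. As the figure below shows, training passes through three stages: the input layer, the projection layer, and the output layer.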
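The starting point assumed above (a Huffman tree built from word frequencies, plus initialized vectors) could look roughly like the sketch below. It is an illustration, not the word2vec source: the toy counts, the helper name `build_huffman_codes`, the rule that the more frequent child goes left and is coded 1, and the initialization scheme are all assumptions made for this example.

```python
import heapq
import itertools
import random


def build_huffman_codes(counts):
    """Return {word: (code, path)}: the 0/1 bits from root to leaf and
    the ids of the inner nodes visited on the way."""
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    heap = [(c, next(tie), w) for w, c in counts.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        c1, _, n1 = heapq.heappop(heap)  # least frequent node
        c2, _, n2 = heapq.heappop(heap)  # second least frequent node
        # Assumption: the more frequent child (n2) becomes the left child.
        inner = (next(inner_ids), n2, n1)
        heapq.heappush(heap, (c1 + c2, next(tie), inner))
    root = heap[0][2]

    codes = {}

    def walk(node, code, path):
        if isinstance(node, str):  # leaf: a vocabulary word
            codes[node] = (code, path)
            return
        node_id, left, right = node
        walk(left, code + [1], path + [node_id])   # left turn is coded 1
        walk(right, code + [0], path + [node_id])  # right turn is coded 0

    walk(root, [], [])
    return codes


# Hypothetical toy vocabulary with word counts.
counts = {"the": 50, "of": 30, "football": 8, "world": 7, "cup": 5}
codes = build_huffman_codes(counts)

dim = 4
rng = random.Random(0)
# Word vectors start as small random values, inner-node vectors as zeros
# (an assumed initialization for this sketch).
word_vecs = {w: [(rng.random() - 0.5) / dim for _ in range(dim)] for w in counts}
theta = {i: [0.0] * dim for i in range(len(counts) - 1)}
```

Frequent words end up close to the root, so they need fewer logistic classifications per training step than rare words.
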
![](https://github.com/xx674967/githubdesktop/blob/12/pic/11.30/4.png?raw=true)
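#### The input layer consists of the word vectors of the n-1 words surrounding some word A. With n = 5, and writing word A as w(t), these are the two words before it and the two words after it: w(t-2), w(t-1), w(t+1), w(t+2). Their word vectors are written v(w(t-2)), v(w(t-1)), v(w(t+1)), v(w(t+2)). Going from the input layer to the projection layer is simple: the n-1 word vectors are added together.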
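The input-to-projection step can be sketched in a few lines of plain Python. The helper names `context_indices` and `project` and the 3-dimensional toy vectors are made up for illustration; they are not part of word2vec or gensim.

```python
def context_indices(t, num_tokens, n=5):
    """Positions of the n-1 words around position t, clipped at the text boundaries."""
    half = (n - 1) // 2
    return [i for i in range(t - half, t + half + 1)
            if i != t and 0 <= i < num_tokens]


def project(context_vectors):
    """Add the context word vectors element-wise to get the projection-layer vector."""
    return [sum(column) for column in zip(*context_vectors)]


# Toy vectors standing in for v(w(t-2)), v(w(t-1)), v(w(t+1)), v(w(t+2)).
context = [
    [0.1, 0.2, 0.3],
    [0.0, -0.1, 0.2],
    [0.3, 0.1, -0.2],
    [-0.2, 0.0, 0.1],
]
pro_t = project(context)
```

Near the start or end of the text fewer than n-1 neighbours exist, which is why `context_indices` clips the window instead of assuming four neighbours everywhere.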
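#### From the projection layer to the output layer
#### This step relies on the Huffman tree built earlier. Starting at the root, the projection-layer value is passed down the tree through a sequence of logistic classifications, one at each inner node, and along the way the intermediate (inner-node) vectors and the word vectors are corrected, as shown in the figure below: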
![](https://github.com/xx674967/githubdesktop/blob/12/pic/11.30/5.png?raw=true)
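#### Here the center word is w(t), and the input to the projection layer is pro(t) = v(w(t-2)) + v(w(t-1)) + v(w(t+1)) + v(w(t+2)).
#### Suppose the current word is "足球" (football), i.e. w(t) = "足球". Its Huffman code is d(t) = "1001" (see the previous section), so the path from the root to the leaf is "left, right, right, left": starting at the root, turn left once, then right twice, then left again. Once the path is known, the intermediate vectors of the nodes on it are corrected one after another from top to bottom. At the first node, a logistic classification is performed using that node's intermediate vector Θ(t,1) and pro(t). If the result is 0, the classification is wrong (the path should turn left, i.e. the correct class is 1), so Θ(t,1) is corrected and the error is recorded.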
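The walk described above can be made concrete with a small sketch. It assumes the convention used in the text (bit 1 means "turn left", bit 0 means "turn right") and that the sigmoid output is read as the probability of class 1; the 2-dimensional `pro` and `thetas` values are invented toy numbers, not trained parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def directions(code):
    """Translate a Huffman code into turns: '1' is left, '0' is right."""
    return ["left" if bit == "1" else "right" for bit in code]


def classify_along_path(pro, thetas, code):
    """Run one logistic classification per inner node on the path.

    Returns a list of (probability_of_class_1, predicted_bit, is_correct)."""
    results = []
    for theta, bit in zip(thetas, code):
        p = sigmoid(sum(a * b for a, b in zip(theta, pro)))
        predicted = 1 if p > 0.5 else 0
        results.append((p, predicted, predicted == int(bit)))
    return results


code = "1001"                      # Huffman code of the center word
pro = [1.0, -1.0]                  # toy projection-layer vector
thetas = [[-2.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 0.0]]  # toy Θ(t,1)..Θ(t,4)

path = directions(code)
results = classify_along_path(pro, thetas, code)
```

With these toy numbers the first node predicts 0 although the code asks for 1, which is the misclassification case described in the text; the remaining three nodes agree with the code.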
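#### After the first node has been handled, the second node is processed in the same way: Θ(t,2) is corrected and the error is added to the running total. All later nodes follow the same pattern. Once every node on the path has been processed and the leaf is reached, the accumulated error is used to correct the word vectors (in the CBOW variant of the original word2vec code, these are the context vectors v(w(t-2)), v(w(t-1)), v(w(t+1)), v(w(t+2)) that were added up into pro(t)). This completes the processing of one word w(t). If the text contains N words, the procedure is repeated N times, for w(0) through w(N-1).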
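Putting the steps together, one training step for a single center word might look like the sketch below. It is a simplified illustration under stated assumptions: the label at each node is taken to be the Huffman bit itself (following the text's convention that 1 means left), the accumulated error is applied to the context vectors as in the reference CBOW code, and the learning rate `lr`, the function names, and all numbers are invented for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_log_likelihood(context_vecs, thetas, code):
    """Log-probability of following the Huffman code from the root to the leaf."""
    pro = [sum(column) for column in zip(*context_vecs)]
    total = 0.0
    for theta, bit in zip(thetas, code):
        f = sigmoid(dot(theta, pro))
        total += math.log(f if bit == 1 else 1.0 - f)
    return total


def train_step(context_vecs, thetas, code, lr=0.1):
    """One update for one center word; modifies context_vecs and thetas in place."""
    pro = [sum(column) for column in zip(*context_vecs)]
    err = [0.0] * len(pro)
    for theta, bit in zip(thetas, code):
        f = sigmoid(dot(theta, pro))
        g = lr * (bit - f)              # signed error at this node
        for i in range(len(pro)):
            err[i] += g * theta[i]      # accumulate error with the old theta
            theta[i] += g * pro[i]      # correct the inner-node vector
    for vec in context_vecs:            # apply the accumulated error once
        for i in range(len(vec)):
            vec[i] += err[i]


context_vecs = [[0.5, -0.2], [0.1, 0.3]]   # toy context word vectors
thetas = [[0.2, 0.1], [-0.3, 0.4]]         # toy inner-node vectors on the path
code = [1, 0]                              # toy Huffman code of the center word

before = path_log_likelihood(context_vecs, thetas, code)
train_step(context_vecs, thetas, code)
after = path_log_likelihood(context_vecs, thetas, code)
```

Comparing `before` and `after` shows the effect of the update: the probability of reaching the correct leaf goes up after the step. Training over a text of N words amounts to calling `train_step` once per position.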



```python
# -*- coding: utf-8 -*-
"""
Created on Fri Dec  1 16:05:09 2017

@author: XZ
"""

import logging

from gensim.models import word2vec

# Main program
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"E:\\ImageCaption\\ImageCaption\\text8")  # load the corpus
# Train the model (window defaults to 5). With gensim's defaults (sg=0, hs=0)
# this is CBOW with negative sampling; pass sg=1 for skip-gram or hs=1 for
# hierarchical softmax. In gensim >= 4.0 the `size` argument is named `vector_size`.
model = word2vec.Word2Vec(sentences, size=200)

# Similarity / relatedness of two words
y1 = model.wv.similarity("woman", "man")
print(u"Similarity between woman and man:", y1)
print("--------\n")

# List of the words most related to a given word
y2 = model.wv.most_similar("good", topn=20)  # the 20 most related words
print(u"Words most similar to good:\n")
for item in y2:
    print(item[0], item[1])
print("--------\n")

# Analogies
print(' "boy" is to "father" as "girl" is to ...? \n')
y3 = model.wv.most_similar(['girl', 'father'], ['boy'], topn=3)
for item in y3:
    print(item[0], item[1])
print("--------\n")

more_examples = ["he his she", "big bigger bad", "going went being"]
for example in more_examples:
    a, b, x = example.split()
    predicted = model.wv.most_similar([x, b], [a])[0][0]
    print("'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted))
print("--------\n")

# Find the word that does not belong
y4 = model.wv.doesnt_match("breakfast cereal dinner lunch".split())
print(u"Word that does not belong:", y4)
print("--------\n")

# The vector of a specific word
y5 = model.wv['man']
print(u'Vector representation of man:', y5)

# Save the model for reuse
model.save("text8.model")
# Corresponding way to load it
# model_2 = word2vec.Word2Vec.load("text8.model")

# Store the word vectors in a format the original C tool can read.
# Calling save_word2vec_format on the model itself is deprecated
# (it produced the DeprecationWarning in the original run), so go through model.wv.
model.wv.save_word2vec_format("text8.model.bin", binary=True)
# Corresponding way to load it
# from gensim.models import KeyedVectors
# model_3 = KeyedVectors.load_word2vec_format("text8.model.bin", binary=True)
```

```text
2017-12-01 17:06:00,745 : INFO : collecting all words and their counts
2017-12-01 17:06:00,773 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-01 17:06:07,124 : INFO : collected 253854 word types from a corpus of 17005207 raw words and 1701 sentences
2017-12-01 17:06:07,125 : INFO : Loading a fresh vocabulary
2017-12-01 17:06:07,515 : INFO : min_count=5 retains 71290 unique words (28% of original 253854, drops 182564)
2017-12-01 17:06:07,516 : INFO : min_count=5 leaves 16718844 word corpus (98% of original 17005207, drops 286363)
2017-12-01 17:06:07,795 : INFO : deleting the raw counts dictionary of 253854 items
2017-12-01 17:06:07,816 : INFO : sample=0.001 downsamples 38 most-common words
2017-12-01 17:06:07,817 : INFO : downsampling leaves estimated 12506280 word corpus (74.8% of prior 16718844)
2017-12-01 17:06:07,817 : INFO : estimated required memory for 71290 words and 200 dimensions: 149709000 bytes
2017-12-01 17:06:08,169 : INFO : resetting la
```

```text
2017-12-01 17:07:58,697 : INFO : saving Word2Vec object under text8.model, separately None
2017-12-01 17:07:58,698 : INFO : storing np array 'syn0' to text8.model.wv.syn0.npy
2017-12-01 17:07:59,107 : INFO : not storing attribute syn0norm
2017-12-01 17:07:59,108 : INFO : storing np array 'syn1neg' to text8.model.syn1neg.npy
2017-12-01 17:07:59,502 : INFO : not storing attribute cum_table
2017-12-01 17:07:59,753 : INFO : saved text8.model

DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
```

## Loading the corpus
![](https://github.com/xx674967/githubdesktop/blob/12/pic/11.30/6.png?raw=true)

## Results

#### Similarity between "woman" and "man": 0.69904747399
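The similarity score reported for two words is the cosine of the angle between their word vectors. A minimal pure-Python sketch, using made-up 3-dimensional toy vectors rather than the trained 200-dimensional embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
woman = [0.9, 0.3, 0.1]
man = [0.8, -0.2, 0.3]
score = cosine_similarity(woman, man)
```

The score depends only on the direction of the vectors, not their length: values near 1 mean the vectors point the same way, values near 0 mean they are close to orthogonal.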


#### Words most similar to "good":

```text
bad 0.7356971502304077
poor 0.5622576475143433
quick 0.5398947596549988
safe 0.5122597813606262
happy 0.5090628266334534
reasonable 0.5070134401321411
luck 0.5024716258049011
everyone 0.5022332668304443
courage 0.49608728289604187
easy 0.48916977643966675
really 0.48791834712028503
pleasant 0.4867483377456665
helpful 0.4853461682796478
simple 0.4842473268508911
you 0.48311030864715576
pleasure 0.47916221618652344
fun 0.47735875844955444
practical 0.4723237156867981
sick 0.46949368715286255
true 0.4688076972961426
```


#### "boy" is to "father" as "girl" is to ...?

```text
mother 0.755297064781189
wife 0.7149423360824585
lover 0.7115675210952759
```

```text
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'
```
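Analogy queries of this kind are commonly answered with vector arithmetic: add the vectors of the "positive" words, subtract the "negative" ones, and return the remaining words closest to the result by cosine similarity. The sketch below illustrates that idea with an invented 2-dimensional vocabulary; the function name `analogy` and the toy vectors are assumptions, and it is not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # query words cannot be answers
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors: the first axis loosely encodes gender, the second generation.
toy = {
    "boy": [1.0, -1.0],
    "girl": [-1.0, -1.0],
    "father": [1.0, 1.0],
    "mother": [-1.0, 1.0],
    "apple": [0.2, -0.9],
}
result = analogy(toy, positive=["girl", "father"], negative=["boy"])
```

In this toy space "father" minus "boy" isolates the generation direction, and adding it to "girl" lands on "mother", the same pattern as the trained model's answer above.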


#### Word that does not belong: cereal
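One simple way to pick the odd word out is to normalize every vector to unit length, average them, and return the word whose vector is least similar to that average. The sketch below shows this idea on invented 2-dimensional toy vectors; the helper name and the numbers are assumptions made for illustration, and the trained model's behaviour depends on its actual embeddings.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Toy vectors: three meal words point one way, "cereal" points elsewhere.
toy = {
    "breakfast": [0.9, 0.1],
    "cereal": [0.3, 0.9],
    "dinner": [0.95, 0.05],
    "lunch": [1.0, 0.0],
}
answer = odd_one_out(toy, "breakfast cereal dinner lunch".split())
```

Because the three meal words dominate the average, the mean vector stays close to them and "cereal" receives the lowest score.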


#### Vector representation of "man":

```text
[ 2.84990549  1.66100156 -0.46291554 -1.91552424  1.26991677 -0.53002125
 -1.65775323  0.16581546 -0.37104455 -0.66493225 -2.68795276 -0.98964429
  0.74027175  0.83170462  0.3797133  -0.36564422  0.78618819  0.33085719
  0.44892898  0.78222984 -0.07527302  1.03882289 -1.99759102  1.85718942
  2.02364588 -1.10365534  0.28239825 -0.72019976 -1.19803905  0.16315424
 -1.0026269   0.2484477   0.05102041  0.96586782  1.96214306 -2.19951439
  0.82643902 -0.63883281 -0.91182375 -0.25428298  1.67938423 -0.45735824
 -0.07179473 -1.43522036  0.82111633 -0.18213077 -1.57648122 -0.71429247
  0.36937949  0.25324577  0.31357452 -2.4207077  -0.0241117  -1.16797972
 -0.36098364  1.32045281  0.06746813  1.95746458  0.56274533  0.21990137
 -0.01401138  0.0574321   0.74110878 -0.67924148  0.12583764  1.28794312
 -1.37125194 -0.36716914 -0.08332597 -0.00648012  0.73951912 -0.36574325
 -0.37524316  0.65838474 -0.03875465 -1.91973841  2.54757881 -0.23880376
  1.2057147   0.15767473 -1.08966947  0.0893077  -1.56538272  0.56243378
  1.65325069 -0.26697111 -2.59878588  0.70056093 -0.78137916 -0.29614526
 -1.08889115  0.1567004   2.73792458  0.32043281  1.57779479  0.91225725
 -1.29899573 -1.18947673  0.47628629  0.54619694  2.03044605  0.69340295
 -4.02926826  1.31876552  0.59756422  0.43669066 -0.2351674   2.04298019
 -1.08337343  0.16523911  0.11295541 -0.52283877  0.01653624 -0.6800226
 -1.60326898  0.16517779 -1.72382617 -0.12129626  0.30450955  0.31627643
 -1.78943527  0.83437347  0.80744553 -0.62053263 -0.03210693  1.87041306
  2.54749107 -0.74052823 -0.00697328  1.22935104  0.20483659  0.79514205
 -1.25847864 -2.25778866 -0.21078563 -1.23147273  0.04141822  0.66311818
  0.43974233  1.46699226 -1.10270834  0.40597135  0.17748524  0.1116393
  0.35730216 -0.43965364  1.25133801  0.44471493 -0.91631937 -2.30782151
  1.3230077  -0.02770869 -0.58097899  0.34591112  0.21906033  0.36910716
 -1.18284619  0.68745226  1.17708778  0.23736008  1.04062402  2.32370615
  0.27462149  0.30262631  0.79007256  1.97054112  0.41340321  0.42966101
  0.1281496  -0.56907415  0.48163703 -1.56017959  0.13180627 -0.55732781
  0.12361874 -1.78840733 -0.61660272 -1.19609678  0.90965474  0.6987524
  0.17657876  1.22346163  0.05764034  1.02960694 -0.29207236 -0.65047288
  0.26938149  1.29409564 -0.0189758   0.02583921  1.06250405  0.07345543
 -0.52778798 -0.64760524  0.39429808  0.07163052  0.5894075  -0.68703949
 -0.20962349 -1.10046709]
```
## Saving the model
 