### Glossary
- POS (Part-Of-Speech) tagging: labeling each word with its part of speech, such as verb, noun, or adjective. https://en.wikipedia.org/wiki/Part-of-speech_tagging

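### Word representations
- Traditional NLP represents each word as a one-hot vector over the vocabulary, e.g. `[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]`.
- Problems with these traditional word vectors
    - Memory use grows with the size of the vocabulary.
    - The dimensionality is very large, which costs memory.
    - Sparse features make models less robust.

  Word vectors therefore need dimensionality reduction.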
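- Dimensionality reduction with SVD  
  High-level semantic patterns emerge.  
  Paper: An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. 2005  
  Problems:
    - The computation becomes too expensive when the data is large.
    - New words are not handled.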
- Learning low-dimensional word vectors directly  
  - Learning representations by back-propagating errors (Rumelhart et al., 1986)
  - A neural probabilistic language model (Bengio et al., 2003)
  - NLP from Scratch (Collobert & Weston, 2008)
  - A recent and even simpler model: word2vec (Mikolov et al. 2013), introduced next
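
To make the contrast between one-hot vectors and SVD-reduced vectors concrete, here is a minimal sketch. The toy corpus, the window size of 1, and the choice of `k = 2` are assumptions made for illustration only; they are not taken from the notes or from Rohde et al.

```python
import numpy as np

# Toy corpus (hypothetical); a real vocabulary has tens of thousands of words.
corpus = "i like deep learning i like nlp i enjoy flying".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
n_word = len(vocab)

def one_hot(word):
    """Return the one-hot vector of `word`: all zeros except a single 1."""
    v = np.zeros(n_word)
    v[word_to_id[word]] = 1.0
    return v

# Symmetric co-occurrence counts within a window of 1 word on each side.
window = 1
cooc = np.zeros((n_word, n_word))
for t, w in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            cooc[word_to_id[w], word_to_id[corpus[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
k = 2
U, S, Vt = np.linalg.svd(cooc)
dense_vectors = U[:, :k] * S[:k]

print(one_hot("nlp"))        # length n_word, a single nonzero entry
print(dense_vectors.shape)   # (n_word, k)
```

The one-hot vector grows with the vocabulary, while each row of `dense_vectors` stays at length `k`. The full SVD is what becomes expensive on large data, and a new word would require recomputing it.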



### Simplified word2vec model
Cost function (the average log-probability of the context words, which training maximizes)    
$$
J(\theta)= \frac{1}{T} \sum_{t=1}^T \sum_{-c\le j\le c, j \ne 0} \log p(w_{t+j}|w_t)
$$  

where the probability is defined as
$$
p(w_O|w_I) = \frac{\exp(\hat{v}_{w_O}^T v_{w_I})}{\sum_{w=1}^W \exp(\hat{v}_{w}^T v_{w_I})}
$$

That is, each word has two vectors, $v$ and $\hat{v}$.
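
The softmax above can be sketched in NumPy as follows. The matrix names `V` and `V_hat`, the sizes, and the random initialization are assumptions for illustration; subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 5, 4                      # hypothetical vocabulary size and vector dimension
V = rng.normal(size=(W, d))      # row i is v_i, the "input" vector of word i
V_hat = rng.normal(size=(W, d))  # row i is v_hat_i, the "output" vector of word i

def softmax_prob(w_in):
    """Return the vector p(. | w_in) over all W output words."""
    scores = V_hat @ V[w_in]     # v_hat_w . v_{w_I} for every word w
    scores -= scores.max()       # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = softmax_prob(2)
print(p, p.sum())
```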

Optimization uses gradient descent or stochastic gradient descent (SGD). With SGD, the gradient of each update is sparse.

The denominator dominates the cost: each iteration requires work proportional to the vocabulary size $W$.
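
A minimal sketch of one SGD step on a single (input word, output word) pair may help here. The names, sizes, and learning rate are assumptions. Under the full softmax, the gradient with respect to the input vectors is nonzero only in the row of the input word, while computing the denominator still touches all $W$ output vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
W, d, lr = 6, 3, 0.1                      # hypothetical sizes and learning rate
V = rng.normal(scale=0.1, size=(W, d))    # input vectors v
V_hat = rng.normal(scale=0.1, size=(W, d))  # output vectors v_hat

def neg_log_prob(w_in, w_out):
    """-log p(w_out | w_in) under the full softmax."""
    scores = V_hat @ V[w_in]
    scores = scores - scores.max()
    return -(scores[w_out] - np.log(np.exp(scores).sum()))

def sgd_step(w_in, w_out):
    """One descent step on -log p(w_out | w_in)."""
    scores = V_hat @ V[w_in]
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # O(W) work: the softmax denominator
    err = p.copy()
    err[w_out] -= 1.0                     # p - onehot(w_out)
    grad_v_in = V_hat.T @ err             # gradient w.r.t. v_{w_in} only
    grad_V_hat = np.outer(err, V[w_in])   # gradient w.r.t. all output vectors
    V[w_in] -= lr * grad_v_in             # touches a single row of V
    V_hat[...] -= lr * grad_V_hat         # full softmax touches every row of V_hat

V_before = V.copy()
loss_before = neg_log_prob(1, 4)
sgd_step(1, 4)
loss_after = neg_log_prob(1, 4)
changed_rows = np.flatnonzero(np.any(V != V_before, axis=1))
print(loss_before, loss_after, changed_rows)
```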

### Corpus
Google word2vec corpus: http://mattmahoney.net/dc/text8.zip

### Resources
- gensim word2vec 
- python-glove

In [21]:
import numpy as np
import theano
import theano.tensor as T

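# Read the first 1,000,000 characters of the text8 corpus.
with open('data/text8') as f:
    corpus = f.read(1000000)
# Drop the first and last tokens, which may be empty or cut off mid-word.
word_list = corpus.split(' ')[1:-1]
# Vocabulary: the distinct words in this slice of the corpus.
word_dict = list(set(word_list))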
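
To feed words into a model they are usually mapped to integer indices. The sketch below shows one way to do that; the short sample string stands in for the text read from `data/text8`, and the names `word_to_id` and `word_ids` are hypothetical.

```python
import numpy as np

# Stand-in for the text read from data/text8 (hypothetical sample string).
corpus = " anarchism originated as a term of abuse first used against early "
word_list = corpus.split(' ')[1:-1]

# Sort the vocabulary so the word-to-index mapping is reproducible.
vocab = sorted(set(word_list))
word_to_id = {w: i for i, w in enumerate(vocab)}

# The corpus as a sequence of integer word indices.
word_ids = np.array([word_to_id[w] for w in word_list])
print(len(vocab), word_ids)
```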

In [49]:
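def prob(theta, n_word, n_features):
    # theta packs two (n_features, n_word) matrices: input vectors, then output vectors.
    L = theta[:n_word*n_features].reshape((n_features, n_word))
    LL = theta[n_word*n_features:].reshape((n_features, n_word))
    # word_dot[o, i] = exp(v_hat_o . v_i)
    word_dot = np.exp(np.dot(LL.T, L))
    # Softmax denominator: sum over output words o for each input word i.
    B = np.sum(word_dot, axis=0).reshape((1,n_word))
    # p[o, i] = p(w_o | w_i); each column sums to 1.
    p = word_dot / B
    return p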

In [81]:
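from scipy.signal import convolve

n_word = 3
n_features = 100
theta = np.random.rand(2*n_word*n_features)
p = prob(theta, n_word, n_features)
# A random sequence of 1000 word indices stands in for a corpus.
word = np.random.randint(0,3,1000)
c = 3
# Window kernel: c ones on each side, 0 at the center position (j = 0).
ker = np.array([1]*c + [0] + [1]*c)

# Note: p[word, word] picks p(w_t | w_t) at each position, so this sums those
# terms over the window rather than the cross terms log p(w_{t+j} | w_t).
convp = convolve(np.log(p[word,word]), ker, 'valid')
J = convp.mean()
print(J)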

-11.8662970023
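
The cell above indexes `p[word, word]`, which pairs each word with itself. A sketch that follows the objective $J(\theta)$ term by term, pairing each center word $w_t$ with its neighbors $w_{t+j}$, might look like the following. The function name `objective`, the fixed seed, and the max-subtraction inside `prob` are assumptions added for illustration; positions without a full window are skipped.

```python
import numpy as np

def prob(theta, n_word, n_features):
    """p[o, i] = p(w_o | w_i), same layout as the cell above."""
    L = theta[:n_word * n_features].reshape((n_features, n_word))
    LL = theta[n_word * n_features:].reshape((n_features, n_word))
    scores = LL.T @ L
    scores -= scores.max(axis=0, keepdims=True)   # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

def objective(p, word, c):
    """Average over center positions t of sum_{j != 0} log p(w_{t+j} | w_t)."""
    T = len(word)
    centers = word[c:T - c]                       # positions with a full window
    total = 0.0
    for j in range(-c, c + 1):
        if j == 0:
            continue
        context = word[c + j:T - c + j]           # w_{t+j} aligned with w_t
        total += np.log(p[context, centers]).sum()
    return total / len(centers)

rng = np.random.default_rng(0)
n_word, n_features, c = 3, 100, 3
theta = rng.random(2 * n_word * n_features)
word = rng.integers(0, n_word, size=1000)
J = objective(prob(theta, n_word, n_features), word, c)
print(J)
```

As a sanity check, with `theta` set to all zeros every probability is $1/3$, so the objective equals $2c \log(1/3)$.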


### Supplement: NumPy's nditer
The iterator object nditer, introduced in NumPy 1.6, provides many flexible ways to visit all the elements of one or more arrays in a systematic fashion. This page introduces some basic ways to use the object for computations on arrays in Python, then concludes with how one can accelerate the inner loop in Cython. Since the Python exposure of nditer is a relatively straightforward mapping of the C array iterator API, these ideas will also provide help working with array iteration from C or C++.

In short, nditer can help speed up array traversal. See the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html#arrays-nditer).


In [85]:
a = np.arange(6).reshape(2,3)
for x in np.nditer(a):
    print(x, end=' ')


0 1 2 3 4 5


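By default, nditer visits elements in the order they are actually stored in memory, not in the array's logical row-first or column-first order.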

In [92]:
for x in np.nditer(a.T):
    print(x, end=' ')

0 1 2 3 4 5


In [91]:
for x in np.nditer(a.T.copy(order='C')):
    print(x, end=' ')

0 3 1 4 2 5
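
The difference between the two cells above comes from memory layout. A short sketch, using the same array `a`, shows that `a.T` is only a view on the same buffer while the C-ordered copy physically rearranges the data:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# a.T is a view: same buffer with swapped strides, so memory order is unchanged.
shares = np.shares_memory(a, a.T)
view_is_f = a.T.flags['F_CONTIGUOUS']
view_order = [int(x) for x in np.nditer(a.T)]

# Copying into C order lays the transposed data out row by row.
b = a.T.copy(order='C')
copy_is_c = b.flags['C_CONTIGUOUS']
copy_order = [int(x) for x in np.nditer(b)]

print(shares, view_is_f, view_order)
print(copy_is_c, copy_order)
```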


#### Controlling nditer's iteration order
The default, having the behavior described above, is `order='K'` to keep the existing order. This can be overridden with `order='C'` for C order and `order='F'` for Fortran order.

Fortran order is column-major; C order is row-major.

In [96]:
for x in np.nditer(a, order='F'):
    print(x, end=' ')
print()
for x in np.nditer(a, order='C'):
    print(x, end=' ')

0 3 1 4 2 5
0 1 2 3 4 5


### skip-gram model
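- Paper: Distributed Representations of Words and Phrases and their Compositionality, Mikolov, 2013
- Basic idea: for each pair of a context word and the center word, train a binary logistic regression, with negative examples sampled at random.
- Objective function, where $\sigma$ is the logistic sigmoid, $r$ is the vector of the center word, $p_n(w)$ is the noise distribution, and $K$ is the number of negative samples: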
$$
F(w_i|r) = \log \sigma(w_i^T r) + \sum_{j=1}^K E_{w_j \sim p_n(w)} \log(\sigma(-w_j^T r))
$$

$$
J_{\text{skip-gram}}(w_{i-c},\dots , w_{i+c}) = \sum_{-c\le j \le c, j\ne 0} F(w_{i+j}|r)
$$
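
The negative-sampling objective can be sketched as below, replacing the expectation with `K` sampled noise words. Several things here are assumptions for illustration: the function names, the separate `in_vecs` and `out_vecs` matrices (mirroring $v$ and $\hat{v}$ earlier), and the uniform noise distribution. The paper itself uses a unigram distribution raised to the 3/4 power.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_F(target, r, out_vecs, noise_probs, K, rng):
    """Monte Carlo estimate of F(w_target | r) using K sampled noise words."""
    negatives = rng.choice(len(noise_probs), size=K, p=noise_probs)
    positive_term = np.log(sigmoid(out_vecs[target] @ r))
    negative_term = np.log(sigmoid(-(out_vecs[negatives] @ r))).sum()
    return positive_term + negative_term

def skip_gram_J(word_ids, i, c, in_vecs, out_vecs, noise_probs, K, rng):
    """Sum F(w_{i+j} | r) over the window, with r the center word's vector."""
    r = in_vecs[word_ids[i]]
    return sum(neg_sampling_F(word_ids[i + j], r, out_vecs, noise_probs, K, rng)
               for j in range(-c, c + 1) if j != 0)

rng = np.random.default_rng(0)
n_word, d = 10, 8                                # hypothetical sizes
in_vecs = rng.normal(scale=0.1, size=(n_word, d))
out_vecs = rng.normal(scale=0.1, size=(n_word, d))
noise_probs = np.full(n_word, 1.0 / n_word)      # uniform noise, an assumption
word_ids = rng.integers(0, n_word, size=50)
J = skip_gram_J(word_ids, i=10, c=2, in_vecs=in_vecs, out_vecs=out_vecs,
                noise_probs=noise_probs, K=5, rng=rng)
print(J)
```

Each term costs work proportional to `K + 1` rather than to the vocabulary size, which is the point of replacing the full softmax.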

### CBOW
The skip-gram model uses the center word to predict its context, whereas CBOW uses the context to predict the center word.
$$
r_i = \sum_{-c\le j \le c, j\ne 0} w_{i+j}
$$
$$
J_{\text{CBOW}}(w_{i-c},\dots , w_{i+c}) = F(w_{i}|r_i)
$$
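
A matching sketch for CBOW sums the context vectors into $r_i$ and scores the center word against it with the same negative-sampling term $F$. As before, the function names, the two embedding matrices, and the uniformly drawn negatives are assumptions for illustration.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigmoid(x)) computed stably as -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def cbow_J(word_ids, i, c, in_vecs, out_vecs, negatives):
    """J_CBOW = F(w_i | r_i), with r_i the sum of the context word vectors."""
    context = [word_ids[i + j] for j in range(-c, c + 1) if j != 0]
    r_i = in_vecs[context].sum(axis=0)
    positive_term = log_sigmoid(out_vecs[word_ids[i]] @ r_i)
    negative_term = log_sigmoid(-(out_vecs[negatives] @ r_i)).sum()
    return positive_term + negative_term

rng = np.random.default_rng(0)
n_word, d, c, K = 10, 8, 2, 5                    # hypothetical sizes
in_vecs = rng.normal(scale=0.1, size=(n_word, d))
out_vecs = rng.normal(scale=0.1, size=(n_word, d))
word_ids = rng.integers(0, n_word, size=50)
negatives = rng.integers(0, n_word, size=K)      # uniform noise draws, an assumption
J = cbow_J(word_ids, 10, c, in_vecs, out_vecs, negatives)
print(J)
```

Compared with skip-gram, each window yields one prediction instead of $2c$, since all context vectors are combined before scoring.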