# Note on Word Representation
    Chao Yang
    Microsoft
    Suzhou yangchao.42@outlook.com

https://github.com/placebokkk/writing/blob/master/word_embedding_note/word_representation_note_en.pdf

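## Summary
1. Words that share the same contexts are more similar in meaning, and their vector representations are closer.
1. A matrix M captures the co-occurrence counts between each word and its contexts.
1. Factorizing this matrix with SVD yields W (the word vectors) and C (the context vectors).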

## Model
Two kinds of models:
1. the count-based model
1. the log-bilinear model
    1. Skip-Gram
    1. GloVe

Both the count-based and the log-bilinear models use only bag-of-words context information.

### word-context matrix
Each row represents a word by its context counts. We denote this matrix by $M^{count}$.

Table 1: The word-context count matrix $M^{count}$

$$
 \begin{bmatrix}
   - & c_1 & c_2 & \cdots & c_{|V_c|}\\
   w_1 & 17 & 0 & \cdots & 1\\
   w_2 & 1 & 42 & \cdots & 0\\
   \vdots & \vdots & \vdots & \ddots & \vdots\\
   w_{|V_w|} & 2 & 89 & \cdots & 0
  \end{bmatrix}
$$
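
To make the table concrete, here is a minimal sketch of how such a count matrix could be built. The toy corpus, the symmetric window of size 2, and the helper name `build_count_matrix` are assumptions made for illustration; the note does not prescribe a particular context definition.

```python
from collections import Counter

import numpy as np


def build_count_matrix(sentences, window=2):
    """Count how often each context word occurs within `window` tokens of each word."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = Counter()
    for sent in sentences:
        for t, word in enumerate(sent):
            lo, hi = max(0, t - window), min(len(sent), t + window + 1)
            for j in range(lo, hi):
                if j != t:  # a word is not its own context
                    counts[(index[word], index[sent[j]])] += 1
    # Here words and contexts share one vocabulary, so M is square.
    M = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for (i, k), n in counts.items():
        M[i, k] = n
    return M, vocab


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
M_count, vocab = build_count_matrix(toy_corpus, window=2)
```

Row `vocab.index("cat")` of `M_count` is then the count-based representation of "cat". In this sketch "cat" and "dog" end up with identical rows, because they occur in the same contexts.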

First, we make some assumptions:

- Any word $w$ can be represented by a vector $\vec{w}$, and any context $c$ can be represented by a vector $\vec{c}$.
- The dot product $\vec{w} \cdot \vec{c}$ can be used as a measure of the association between word $w$ and context $c$.
- Words that have similar contexts have similar meanings, and therefore close vector representations.

Then, if two words $w_i$ and $w_j$ are similar in semantic space, $\vec{w_i} \cdot \vec{c_k}$ and $\vec{w_j} \cdot \vec{c_k}$ will take similar values for most contexts $c_k$. If we put all the products $\vec{w} \cdot \vec{c}$ into a matrix $M$, in which $M_{ik} =\vec{w_i} \cdot \vec{c_k}$, we get

$M=WC^T\ (1)$

where

$$
W=
 \begin{bmatrix}
   \vec{w_1}^T \\
   \vec{w_2}^T \\
   \vdots \\
   \vec{w_{|V_w|}}^T
  \end{bmatrix}
$$

$$C^T=(\vec{c_1},\vec{c_2},\cdots,\vec{c_{|V_c|}})\ (2)$$

A good measure of the association between a word and a context is **PMI** (pointwise mutual information), together with its positive variant **PPMI**:

$$PMI(w_i,c_k)=\log\frac{p(w_i,c_k)}{p(w_i)p(c_k)}\ (3)$$

$$PPMI(w_i,c_k)=\max\{PMI(w_i,c_k),0\}\ (4)$$
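
As a rough illustration of (3) and (4), the sketch below computes a PPMI matrix from a count matrix, estimating the probabilities by relative frequencies. The function name `ppmi` and the toy counts are assumptions, and setting the PMI of unseen pairs to 0 is one common convention rather than something the note specifies.

```python
import numpy as np


def ppmi(M_count):
    """Positive PMI of every (word, context) cell of a count matrix."""
    M = np.asarray(M_count, dtype=np.float64)
    p_wc = M / M.sum()                      # joint p(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal p(w), column vector
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal p(c), row vector
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    # Zero counts give -inf (or nan for empty rows/columns); map them to 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


# Hypothetical 2x2 counts: each word occurs with exactly one context.
toy_counts = np.array([[2, 0],
                       [0, 2]])
toy_ppmi = ppmi(toy_counts)
```

For these toy counts the diagonal cells have $p(w,c)=0.5$ and $p(w)p(c)=0.25$, so their PMI is $\log 2$, while the unseen off-diagonal pairs are clipped to 0.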

We need to find a factorization of M to get W and C:

$$W=M,C=\Lambda\ (5)$$

In **SVD**, $M$ is first factorized into $U \cdot \Sigma \cdot V^T$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is a diagonal matrix of singular values in decreasing order. By keeping only the top $d$ elements of $\Sigma$, we obtain $M_d = U_d \cdot \Sigma_d \cdot V_d^T$, a low-rank approximation of $M$.

Now the words can be represented in a new linear space as $W^{SVD} = U_d \cdot \Sigma_d$, spanned by a more compact basis $C^{SVD} = V_d$.
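
The truncation step can be sketched with NumPy's `numpy.linalg.svd`. The helper name `svd_embeddings` and the small example matrix are assumptions chosen so that the result is easy to check by hand; in practice $M$ would be a (P)PMI or count matrix.

```python
import numpy as np


def svd_embeddings(M, d):
    """Return W = U_d * Sigma_d and C = V_d from a rank-d truncated SVD of M."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=np.float64), full_matrices=False)
    U_d, s_d, V_d = U[:, :d], s[:d], Vt[:d, :].T
    W = U_d * s_d   # scales column j of U_d by s_d[j], i.e. U_d @ diag(s_d)
    C = V_d
    return W, C


# Hypothetical 4-word x 3-context matrix with singular values 4, 2*sqrt(2), 2.
M = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 0.0]])
W, C = svd_embeddings(M, d=2)
M_d = W @ C.T   # rank-2 approximation of M
```

Here `W` has one 2-dimensional row per word and `C` one 2-dimensional row per context. The Frobenius norm of `M - M_d` equals the single dropped singular value, which is 2 for this matrix.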

### Skip Gram Negative Sampling
In the paper by Mikolov et al., the objective function of Skip-Gram is defined as:

$$\arg\max_{\theta}\sum_{t=1}^T\sum_{-l \le j \le l,j \ne 0} \log p(w_{t+j}|w_t)\ (6)$$

A more general formulation:

$$\arg\max_{\theta}\sum_{t=1}^T\sum_{c' \in Context(w_t)} \log p(c'|w_t)\ (7)$$

The conditional probability of a context given its word is defined as:

$$p(c|w)=\frac{\exp(\vec{c}\cdot\vec{w})}{\sum_{c' \in V_c}\exp(\vec{c'}\cdot\vec{w})}\ (8)$$

**Neural view** (a three-layer neural network):

- Input layer with one-hot input.
- Hidden layer with identity activation function.
- Softmax output layer.
- The weights between the input layer and the hidden layer form the matrix W.
- The weights between the hidden layer and the output layer form the matrix C.
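
The layers listed above can be sketched as a single forward pass. The sizes, the random initialization, and the function name `forward` are assumptions made for illustration. The point of the sketch is that a one-hot input simply selects one row of W, and the softmax over the rows of C reproduces the conditional probability (8).

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_contexts, dim = 5, 6, 3          # hypothetical sizes
W = rng.normal(size=(n_words, dim))         # input -> hidden weights (word vectors)
C = rng.normal(size=(n_contexts, dim))      # hidden -> output weights (context vectors)


def forward(word_id):
    """Return the hidden vector and p(c | w) for every context c."""
    x = np.zeros(n_words)
    x[word_id] = 1.0                        # one-hot input
    h = x @ W                               # identity activation: h is row word_id of W
    scores = C @ h                          # one dot product c . w per context
    scores = scores - scores.max()          # shift for numerical stability
    exp_scores = np.exp(scores)
    return h, exp_scores / exp_scores.sum() # softmax output layer


h, probs = forward(2)
```

Because the hidden layer has no nonlinearity, `h` is exactly `W[2]`, so training this network amounts to learning the two embedding matrices.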

**Optimization** Objective (6) could be optimized directly by gradient descent, but this is expensive because of the summation $\sum_{c' \in V_c}\exp(\vec{c'}\cdot\vec{w})$ over all contexts. Mikolov et al. therefore replace $\log p(c|w)$ with a cheaper objective:

$$\log\sigma(\vec{c}\cdot\vec{w})+\sum_{i=1}^k \mathbb{E}_{c_i \sim P_n(w)}\log \sigma(-\vec{c_i}\cdot\vec{w})\ (9)$$

$P_n(w)$ is the probability distribution over non-contexts of word $w$. Objective (9) changes the output layer's activation function from softmax to sigmoid. At each iteration, it considers the current context and $k$ randomly drawn non-contexts (called negative samples) to update the parameters.
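
A single update under objective (9) might look like the sketch below, where the expectation is replaced by $k$ sampled negatives. The function names, the learning rate, and in particular the uniform sampler standing in for $P_n(w)$ are assumptions made for brevity, not taken from the note.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_negatives(rng, n_contexts, k, context_id):
    """Draw k non-contexts uniformly (an assumed stand-in for P_n(w))."""
    candidates = np.array([c for c in range(n_contexts) if c != context_id])
    return rng.choice(candidates, size=k)


def sgns_step(W, C, word_id, context_id, negative_ids, lr=0.05):
    """One gradient-ascent step on (9) for one (word, context) pair.

    Updates W and C in place and returns the objective before the update.
    """
    w = W[word_id].copy()
    ids = np.concatenate(([context_id], negative_ids))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                          # 1 for the true context, 0 for negatives
    vecs = C[ids]                            # copy of the context vectors involved
    scores = vecs @ w
    objective = np.sum(np.log(sigmoid((2.0 * labels - 1.0) * scores)))
    grad = labels - sigmoid(scores)          # d objective / d score
    W[word_id] += lr * grad @ vecs
    np.add.at(C, ids, lr * np.outer(grad, w))  # add.at accumulates repeated ids
    return objective


rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))       # hypothetical sizes
C = rng.normal(scale=0.1, size=(6, 3))
negatives = sample_negatives(rng, n_contexts=6, k=2, context_id=1)
before = sgns_step(W, C, word_id=0, context_id=1, negative_ids=negatives)
after = sgns_step(W, C, word_id=0, context_id=1, negative_ids=negatives)
```

Each step touches only one row of W and at most $k+1$ rows of C, which is what makes this objective much cheaper than the full softmax in (8). Repeating the step on the same pair should raise the objective, so `after` is expected to be larger than `before`.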