# Distributed Representations of Sentences and Documents

    Quoc Le
    Tomas Mikolov

    2014 May 22

https://arxiv.org/pdf/1405.4053.pdf

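## Summary
- Word Vector
    - Objective
        - Maximize the average log probability: $\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_t|w_{t-k},\cdots,w_{t+k})$
    - Prediction
        - Multiclass softmax: $p(w_t|w_{t-k},\cdots,w_{t+k})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$
        - Each $y_i$ is the un-normalized log-probability of output word $i$: $y=b+Uh(w_{t-k},\cdots,w_{t+k};W)\ (1)$
            - U and b are the softmax parameters
            - h is a concatenation or average of word vectors extracted from W
- Paragraph Vector
    - Every paragraph is mapped to a column of matrix D
    - Every word is mapped to a column of matrix W
    - The paragraph vector and word vectors are averaged or concatenated to predict the next word
- Paragraph Vector without word ordering
    - Ignores the context words in the input; the model must predict words randomly sampled from the paragraph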

## Algorithms
### Learning Vector Representation of Words
Given a sequence of training words $w_1,w_2,w_3,\cdots,w_T$, the objective of the word vector model is to maximize the average log probability

$\frac{1}{T}\sum_{t=k}^{T-k}\log p(w_t|w_{t-k},\cdots,w_{t+k})$

The prediction task is typically done via a multiclass classifier, such as softmax. There, we have

$p(w_t|w_{t-k},\cdots,w_{t+k})=\frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$

Each $y_i$ is the un-normalized log-probability of output word $i$, computed as

$y=b+Uh(w_{t-k},\cdots,w_{t+k};W)\ (1)$

- U, b are the softmax parameters
- h is constructed by a concatenation or average of word vectors extracted from W

![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/80388170.jpg)
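The word vector model above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes `V`, `d`, `k`, the random parameter values, and the helper names `predict` and `average_log_prob` are hypothetical and do not come from the paper, and `h` uses the averaging variant rather than concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): vocabulary, embedding dimension, context half-width.
V, d, k = 10, 4, 2
W = rng.normal(size=(d, V))   # word vectors, one column per word
U = rng.normal(size=(V, d))   # softmax weights (h has dimension d when averaging)
b = np.zeros(V)               # softmax bias


def softmax(y):
    e = np.exp(y - y.max())   # subtract the max for numerical stability
    return e / e.sum()


def predict(context_ids):
    """Return p(w | context) over the whole vocabulary."""
    h = W[:, context_ids].mean(axis=1)   # h: average of the context word vectors
    y = b + U @ h                        # equation (1)
    return softmax(y)


def average_log_prob(words):
    """Average log probability over positions that have a full context."""
    T = len(words)
    total = 0.0
    for t in range(k, T - k):            # 0-indexed counterpart of t = k .. T-k
        context = words[t - k:t] + words[t + 1:t + k + 1]
        total += np.log(predict(context)[words[t]])
    return total / T
```

Training would adjust `W`, `U`, and `b` to increase the value returned by `average_log_prob`; the sketch only evaluates the objective.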

### Paragraph Vector: A distributed memory model
In our Paragraph Vector framework (see Figure 2), every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

The only change in this model compared to the word vector framework is in equation 1, where h is constructed from W and D.

![2](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/3001285.jpg)
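To make the change to equation 1 concrete, the sketch below builds `h` from both D and W in the two ways the text mentions. The sizes and the function names `h_average` and `h_concat` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): vocabulary, number of paragraphs, dimension.
V, N, d = 10, 3, 4
W = rng.normal(size=(d, V))   # one column per word
D = rng.normal(size=(d, N))   # one column per paragraph


def h_average(par_id, context_ids):
    """Average the paragraph vector with the context word vectors."""
    vecs = np.column_stack([D[:, par_id], W[:, context_ids]])  # shape (d, n + 1)
    return vecs.mean(axis=1)                                   # shape (d,)


def h_concat(par_id, context_ids):
    """Concatenate the paragraph vector with the context word vectors, in order."""
    word_part = W[:, context_ids].T.ravel()                    # word 1, word 2, ...
    return np.concatenate([D[:, par_id], word_part])           # shape ((n + 1) * d,)
```

With averaging, U keeps shape `(V, d)`. With concatenation, U needs `(n + 1) * d` columns, where `n` is the number of context words, so the context length must be fixed.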

In summary, the algorithm itself has two key stages: 

1. training to get word vectors W, softmax weights U, b and paragraph vectors D on already seen paragraphs;
1. “the inference stage” to get paragraph vectors D for new paragraphs (never seen before) by adding more columns to D and running gradient descent on D while holding W, U, b fixed.
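The inference stage can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the averaging variant of `h`, uses the `k` preceding words as context, runs full-batch gradient descent instead of stochastic updates, and substitutes random matrices for the trained `W`, `U`, `b`. The names `neg_log_lik` and `infer_paragraph_vector` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for parameters learned in stage 1 (assumption: random values).
V, d, k = 12, 5, 2
W = rng.normal(scale=0.5, size=(d, V))
U = rng.normal(scale=0.5, size=(V, d))
b = np.zeros(V)


def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()


def neg_log_lik(d_vec, words):
    """Negative log likelihood of the paragraph given its vector d_vec."""
    loss = 0.0
    for t in range(k, len(words)):
        h = (d_vec + W[:, words[t - k:t]].sum(axis=1)) / (k + 1)
        loss -= np.log(softmax(b + U @ h)[words[t]])
    return loss


def infer_paragraph_vector(words, steps=200, lr=0.1):
    """Fit a new column of D by gradient descent; W, U, b are never modified."""
    d_vec = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for t in range(k, len(words)):
            h = (d_vec + W[:, words[t - k:t]].sum(axis=1)) / (k + 1)
            p = softmax(b + U @ h)
            p[words[t]] -= 1.0            # d(loss)/dy = p - one_hot(target)
            grad += U.T @ p / (k + 1)     # chain rule through the average in h
        d_vec -= lr * grad
    return d_vec
```

Calling `infer_paragraph_vector([3, 1, 4, 1, 5, 9, 2, 6])` returns a vector of length `d` for the unseen paragraph, which could then be appended to D as a new column.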

### Paragraph Vector without word ordering: Distributed bag of words
Another way is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In reality, what this means is that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector.

![3](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/2194169.jpg)
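One stochastic gradient step of the distributed bag of words model described above might look like the sketch below. The sizes, learning rate, initialization, and the function name `dbow_step` are assumptions, and a full softmax is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): vocabulary, paragraphs, dimension, window length.
V, N, d, window = 12, 2, 5, 4
D = rng.normal(scale=0.1, size=(d, N))   # paragraph vectors
U = rng.normal(scale=0.1, size=(V, d))   # softmax weights
b = np.zeros(V)                          # softmax bias


def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()


def dbow_step(par_id, words, lr=0.1):
    """One SGD step: sample a window, sample a word from it, classify from D only."""
    start = rng.integers(0, len(words) - window + 1)   # sample a text window
    target = words[start + rng.integers(0, window)]    # sample a word in that window
    h = D[:, par_id].copy()                            # input is the paragraph vector alone
    p = softmax(b + U @ h)
    loss = -np.log(p[target])
    p[target] -= 1.0                                   # d(loss)/dy = p - one_hot(target)
    D[:, par_id] -= lr * (U.T @ p)                     # gradient w.r.t. the paragraph vector
    U[:] -= lr * np.outer(p, h)                        # gradient w.r.t. softmax weights
    b[:] -= lr * p                                     # gradient w.r.t. softmax bias
    return loss
```

No word matrix W appears in the sketch, because this model ignores the context words in the input and stores only the softmax weights and the paragraph vectors.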