# Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov
    Kai Chen
    Greg Corrado
    Jeffrey Dean

    7 Sep 2013

https://arxiv.org/pdf/1301.3781.pdf

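## Summary
- Continuous Bag-of-Words Model (CBOW)
    - Similar to the feedforward NNLM, except that the non-linear hidden layer is removed and the projection layer is shared by all words
- Continuous Skip-gram Model
    - Takes the current word as input
    - Uses a log-linear classifier with a continuous projection layer
    - Predicts the words within a certain range before and after the current word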

## Model Architectures
For all the following models, the training complexity is proportional to

$O=E \times T \times Q,\ (1)$

- E is the number of training epochs
- T is the number of words in the training set
- Q is defined separately for each model architecture

A common choice is E = 3 to 50 and T up to one billion. All models are trained using stochastic gradient descent and backpropagation.

### Feedforward Neural Net Language Model (NNLM)
It consists of **input**, **projection**, **hidden** and **output** layers.

1. At the input layer, N previous words are encoded using 1-of-V coding, where V is the size of the vocabulary.
1. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.
1. For a common choice of N = 10, the size of the projection layer (P) might be 500 to 2000, while the hidden layer size H is typically 500 to 1000 units. Moreover, the hidden layer is used to compute a probability distribution over all the words in the vocabulary, resulting in an output layer with dimensionality V. Thus, the computational complexity per training example is $Q=N \times D + N \times D \times H + H \times V,\ (2)$
    - where the dominating term is H × V.
    - With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around $\log_2(V)$. Thus, most of the complexity is caused by the term N × D × H.
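
The effect of the binary tree on equation (2) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `nnlm_cost` and the sizes (D = 100, H = 500, V = one million) are assumptions chosen to fall inside the ranges quoted above, not values from the paper.

```python
import math

def nnlm_cost(N, D, H, V, hierarchical=False):
    """Return the three terms of Q = N*D + N*D*H + H*V for one training example."""
    projection = N * D
    hidden = N * D * H
    # With a binary tree over the vocabulary, about log2(V) output units
    # are evaluated instead of V.
    output = H * (math.log2(V) if hierarchical else V)
    return {"projection": projection, "hidden": hidden, "output": output}

# Assumed sizes: N=10, D=100 (so P = N*D = 1000), H=500, V=1,000,000.
full = nnlm_cost(10, 100, 500, 1_000_000)
tree = nnlm_cost(10, 100, 500, 1_000_000, hierarchical=True)

print(max(full, key=full.get))  # output
print(max(tree, key=tree.get))  # hidden
```

With the full output layer the H × V term is three orders of magnitude larger than the rest; once it shrinks to roughly H × log2(V), the N × D × H term is the largest.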

### Recurrent Neural Net Language Model (RNNLM)
The RNN model does not have a projection layer; it has only **input**, **hidden** and **output** layers.

The complexity per training example of the RNN model is

$Q=H \times H + H \times V,\ (3)$

- where the word representations D have the same dimensionality as the hidden layer H

Again, the term H × V can be efficiently reduced to $H \times \log_2(V)$ by using hierarchical softmax. Most of the complexity then comes from H × H.
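
The same arithmetic applies to equation (3). This is a rough sketch with assumed sizes (H = 500, V = one million); `rnnlm_cost` is a hypothetical helper name.

```python
import math

def rnnlm_cost(H, V, hierarchical=False):
    """Return the two terms of Q = H*H + H*V for one training example."""
    recurrent = H * H
    output = H * (math.log2(V) if hierarchical else V)
    return recurrent, output

# Assumed sizes: H=500, V=1,000,000.
recurrent, output = rnnlm_cost(500, 1_000_000)
recurrent_hs, output_hs = rnnlm_cost(500, 1_000_000, hierarchical=True)

print(recurrent, output)         # 250000 500000000
print(recurrent_hs > output_hs)  # True
```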

## New Log-linear Models
### Continuous Bag-of-Words Model
The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged). 

We denote this model further as `CBOW`, as, unlike the standard bag-of-words model, it uses a continuous distributed representation of the context.
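
A minimal sketch of the CBOW forward pass described above, in plain Python. It makes several simplifying assumptions: the vocabulary has four words, all weights are made-up toy values, and a full softmax stands in for the hierarchical softmax used in the paper. The names `cbow_forward`, `embeddings` and `output_weights` are hypothetical.

```python
import math

def cbow_forward(context_ids, embeddings, output_weights):
    """Average the context word vectors, then score every vocabulary word."""
    dim = len(embeddings[0])
    # Shared projection layer: context vectors are averaged, so word order is lost.
    projection = [
        sum(embeddings[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # Log-linear classifier: one dot product per vocabulary word, then a softmax.
    scores = [sum(w * p for w, p in zip(row, projection)) for row in output_weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return projection, [e / total for e in exps]

# Toy vocabulary of 4 words with 2-dimensional vectors (values are made up).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
output_weights = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4], [0.1, 0.0]]

projection, probs = cbow_forward([0, 1], embeddings, output_weights)
print(projection)  # [0.5, 0.5]
```

Swapping the context to `[1, 0]` gives the same projection, which is the sense in which the model is a bag of words.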

### Continuous Skip-gram Model
The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word.
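
The training examples implied by this description can be sketched as (input word, target word) pairs. The snippet assumes a fixed window size for simplicity; `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Yield (input_word, target_word) pairs within `window` positions."""
    for i, current in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the current word does not predict itself
                yield current, tokens[j]

sentence = ["the", "quick", "brown", "fox"]
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

Each pair is one classification problem for the log-linear classifier: given the input word, predict the target word.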

![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/19271731.jpg)

## Results
To measure the quality of the word vectors, we define a comprehensive test set that contains five types of semantic questions and nine types of syntactic questions. Two examples from each category are shown in Table 1.

Table 1: Examples of five types of semantic and nine types of syntactic questions in the Semantic-Syntactic Word Relationship test set.

Type of relationship | Word Pair 1 | Word Pair 2
---------------------|-------------|-----------
Common capital city | Athens-Greece | Oslo-Norway
All capital cities | Astana-Kazakhstan | Harare-Zimbabwe
Currency | Angola-kwanza | Iran-rial
City-in-state | Chicago-Illinois | Stockton-California
Man-Woman | brother-sister | grandson-granddaughter
Adjective to adverb | apparent-apparently | rapid-rapidly
Opposite | possibly-impossibly | ethical-unethical
Comparative | great-greater | tough-tougher
Superlative | easy-easiest | lucky-luckiest
Present Participle | think-thinking | read-reading
Nationality adjective | Switzerland-Swiss | Cambodia-Cambodian
Past tense | walking-walked | swimming-swam
Plural nouns | mouse-mice | dollar-dollars
Plural verbs | work-works | speak-speaks

Table 2: Accuracy on subset of the Semantic-Syntactic Word Relationship test set, using word vectors from the CBOW architecture with limited vocabulary. Only questions containing words from the most frequent 30k words are used.

Dimensionality / Training words | 24M | 49M | 98M | 196M | 391M | 783M
--------------------------------|-----|-----|----|---|------|-----
50 | 13.4 | 15.7 | 18.6 | 19.1 | 22.5 | 23.2
100 | 19.4 | 23.1 | 27.8 | 28.7 | 33.4 | 32.2
300 | 23.2 | 29.2 | 35.3 | 38.6 | 43.7 | 45.9
600 | 24.0 | 30.1 | 36.5 | 40.8 | 46.6 | 50.4

Table 3: Comparison of architectures using models trained on the same data, with 640-dimensional word vectors. The accuracies are reported on our Semantic-Syntactic Word Relationship test set, and on the MSR syntactic relationship test set (last row of the table).

Model Architecture | RNNLM | NNLM | CBOW | Skip-gram
-------------------|-------|------|------|-------
Semantic-Syntactic Word Relationship test set, Semantic Accuracy [%] | 9 | 23 | 24 | 55
Semantic-Syntactic Word Relationship test set, Syntactic Accuracy [%] | 36 | 53 | 64 | 59
MSR Word Relatedness Test Set, Accuracy [%] | 35 | 47 | 61 | 56

Table 4: Comparison of publicly available word vectors on the Semantic-Syntactic Word Relationship test set, and word vectors from our models. Full vocabularies are used.

Model | Vector Dimensionality | Training words | Semantic Accuracy [%] | Syntactic Accuracy [%] | Total Accuracy [%]
------|-----------------------|----------------|-----------------------|------------------------|-------------------
Collobert-Weston NNLM | 50 | 660M | 9.3 | 12.3 | 11.0
Turian NNLM | 50 | 37M | 1.4 | 2.6 | 2.1
Turian NNLM | 200 | 37M | 1.4 | 2.2 | 1.8
Mnih NNLM | 50 | 37M | 1.8 | 9.1 | 5.8
Mnih NNLM | 100 | 37M | 3.3 | 13.2 | 8.8
Mikolov RNNLM | 80 | 320M | 4.9 | 18.4 | 12.7
Mikolov RNNLM | 640 | 320M | 8.6 | 36.5 | 24.6
Huang NNLM | 50 | 990M | 13.3 | 11.6 | 12.3
Our NNLM | 20 | 6B | 12.9 | 26.4 | 20.3
Our NNLM | 50 | 6B | 27.9 | 55.8 | 43.2
Our NNLM | 100 | 6B | 34.2 | 64.5 | 50.8
CBOW | 300 | 783M | 15.5 | 53.1 | 36.1
Skip-gram | 300 | 783M | 50.0 | 55.9 | 53.3

Table 5: Comparison of models trained for three epochs on the same data and models trained for one epoch. Accuracy is reported on the full Semantic-Syntactic data set.

Model | Vector Dimensionality | Training words | Semantic Accuracy [%] | Syntactic Accuracy [%] | Total Accuracy [%] | Training time [days]
------|-----------------------|----------------|-----------------------|------------------------|--------------------|---------------------
1 epoch CBOW | 300 | 783M | 13.8 | 49.9 | 33.6 | 0.3
1 epoch CBOW | 300 | 1.6B | 16.1 | 52.6 | 36.1 | 0.6
1 epoch CBOW | 600 | 783M | 15.4 | 53.3 | 36.2 | 0.7
1 epoch Skip-gram | 300 | 783M | 45.6 | 52.2 | 49.2 | 1
1 epoch Skip-gram | 300 | 1.6B | 52.2 | 55.1 | 53.8 | 2
1 epoch Skip-gram | 600 | 783M | 56.7 | 54.5 | 55.5 | 2.5
3 epoch CBOW | 300 | 783M | 15.5 | 53.1 | 36.1 | 1
3 epoch Skip-gram | 300 | 783M | 50.0 | 55.9 | 53.3 | 3

Table 6: Comparison of models trained using the DistBelief distributed framework. Note that training of NNLM with 1000-dimensional vectors would take too long to complete.

Model | Vector Dimensionality | Training words | Semantic Accuracy [%] | Syntactic Accuracy [%] | Total Accuracy [%] | Training time [days x CPU cores]
------|-----------------------|----------------|-----------------------|------------------------|--------------------|---------------------------------
NNLM | 100 | 6B | 34.2 | 64.5 | 50.8 | 14 x 180
CBOW | 1000 | 6B | 57.3 | 68.9 | 63.7 | 2 x 140
Skip-gram | 1000 | 6B | 66.1 | 65.1 | 65.6 | 2.5 x 125

Table 7: Comparison and combination of models on the Microsoft Sentence Completion Challenge.

Architecture | Accuracy[%]
-------------|------------
4-gram | 39
Average LSA similarity | 49
Log-bilinear model | 54.8
RNNLMs | 55.4
Skip-gram | 48.0
Skip-gram + RNNLMs | 58.9

Table 8: Examples of the word pair relationships, using the best word vectors from Table 4 (Skip-gram model trained on 783M words with 300 dimensionality).

Relationship | Example 1 | Example 2 | Example 3
-------------|-----------|-----------|----------
France - Paris | Italy: Rome | Japan: Tokyo | Florida: Tallahassee
big - bigger | small: larger | cold: colder | quick: quicker
Miami - Florida | Baltimore: Maryland | Dallas: Texas | Kona: Hawaii
Einstein - scientist | Messi: midfielder | Mozart: violinist | Picasso: painter
Sarkozy - France | Berlusconi: Italy | Merkel: Germany | Koizumi: Japan
copper - Cu | zinc: Zn | gold: Au | uranium: plutonium
Berlusconi - Silvio | Sarkozy: Nicolas | Putin: Medvedev | Obama: Barack
Microsoft - Windows | Google: Android | IBM: Linux | Apple: iPhone
Microsoft - Ballmer | Google: Yahoo | IBM: McNealy | Apple: Jobs
Japan - sushi | Germany: bratwurst | France: tapas | USA: pizza
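
In the paper, relationships like those in Table 8 are found by vector arithmetic: the difference between two word vectors is added to a third, and the nearest remaining word by cosine similarity is returned. The sketch below shows that lookup on made-up 2-dimensional toy vectors; the vectors and the helper names `cosine` and `analogy` are assumptions for illustration, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the search.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Toy vectors (made up): first axis separates countries, second marks "capital".
toy_vectors = {
    "France": [1.0, 1.0], "Paris": [1.0, 2.0],
    "Italy": [2.0, 1.0], "Rome": [2.0, 2.0],
    "Japan": [3.0, 1.0], "Tokyo": [3.0, 2.0],
}

print(analogy("France", "Paris", "Italy", toy_vectors))  # Rome
```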