# word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method

    Yoav Goldberg and Omer Levy
    {yoav.goldberg,omerlevy}@gmail.com

    February 14, 2014

https://arxiv.org/pdf/1402.3722v1.pdf

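## Summary
1. The skip-gram model aims to maximize the likelihood of the context c given the word w (here w and c stand for the word vectors of the current word and of its context, respectively).
1. Negative sampling: given a corpus D and its complement D' (pairs that are not in the corpus), optimize the maximum-likelihood objective so that the two sets are separated as well as possible.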

## The skip-gram model
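In this model we are given a corpus of words w and their contexts c. We consider the conditional probabilities p(c|w) and, given a corpus Text, set the parameters θ of p(c|w; θ) so as to maximize the corpus probability: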

$argmax_{\theta}\prod_{w \in Text}[\prod_{c \in C(w)}p(c|w;\theta)]\ (1)$

- C(w) is the set of contexts of word w.

$argmax_{\theta}\prod_{(w,c) \in D}p(c|w;\theta)\ (2)$

- D is the set of all word and context pairs we extract from the text.
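To make the set D concrete, here is a minimal sketch of extracting (w, c) pairs from a token sequence. The notes above do not say how contexts are chosen; the symmetric window and the names `extract_pairs` and `window` are assumptions made for illustration.

```python
def extract_pairs(tokens, window=2):
    """Return the list D of (word, context) pairs from a token sequence.

    Assumption: the contexts C(w) of a word are the tokens that lie within
    `window` positions on either side of it.
    """
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
D = extract_pairs(tokens, window=1)
print(D)
```

With a window of 1, the four tokens yield six pairs, and the product in (2) runs over exactly these pairs.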

$p(c|w;\theta)=\frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}\ (3)$

- $v_c$ and $v_w \in R^d$ are vector representations for c and w respectively,
- C is the set of all available contexts
- the parameters $\theta$ are $v_{c_i}, v_{w_i}$ for $w \in V$, $c \in C$, $i \in 1,\cdots,d$ (a total of (|C|+|V|)×d parameters: one d-dimensional vector per word and one per context)
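Equation (3) can be sketched in a few lines of plain Python. The vectors below are hand-picked toy values with d = 2, and the names `p_context_given_word`, `word_vecs` and `ctx_vecs` are hypothetical, chosen only for this illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def p_context_given_word(c, w, word_vecs, ctx_vecs):
    """Equation (3): softmax over all contexts of the score v_c . v_w."""
    v_w = word_vecs[w]
    scores = {c2: dot(v_c2, v_w) for c2, v_c2 in ctx_vecs.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    denom = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[c] - m) / denom


# Toy vectors (illustrative values only, not learned embeddings).
word_vecs = {"fox": [1.0, 0.5]}
ctx_vecs = {"quick": [0.8, 0.1], "brown": [0.9, 0.4], "the": [-0.2, 0.3]}

probs = {c: p_context_given_word(c, "fox", word_vecs, ctx_vecs) for c in ctx_vecs}
print(probs)
```

The probabilities sum to 1 over the set C, and the context whose vector has the largest dot product with $v_w$ receives the highest probability.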

Take the log and switch from product to sum:

$argmax_{\theta}\sum_{(w,c) \in D} log p(c|w)=\sum_{(w,c) \in D}(log e^{v_c \cdot v_w}-log\sum_{c'}e^{v_{c'} \cdot v_w})\ (4)$

Objective (4) can be computed, but doing so is expensive: evaluating p(c|w; θ) requires the summation $\sum_{c' \in C}e^{v_{c'} \cdot v_w}$ over all the contexts c′, and there can be hundreds of thousands of them. One way of making the computation more tractable is to replace the softmax with a **hierarchical softmax**.
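The cost of objective (4) is easy to see in code: every (w, c) pair needs one dot product per context in C for the normalizer. The sketch below evaluates (4) on toy data; the function name `log_likelihood` and all vectors are assumptions for illustration, and the log-sum-exp trick is a standard numerical safeguard rather than something the derivation requires.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_likelihood(D, word_vecs, ctx_vecs):
    """Objective (4): sum over (w, c) in D of log p(c|w)."""
    total = 0.0
    for w, c in D:
        v_w = word_vecs[w]
        # |C| dot products per pair: this is the expensive part.
        scores = [dot(v_c2, v_w) for v_c2 in ctx_vecs.values()]
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += dot(ctx_vecs[c], v_w) - log_norm
    return total


# Toy data (illustrative values only).
word_vecs = {"fox": [1.0, 0.5], "quick": [0.3, -0.2]}
ctx_vecs = {"quick": [0.8, 0.1], "brown": [0.9, 0.4], "the": [-0.2, 0.3]}
D = [("fox", "brown"), ("quick", "the")]

ll = log_likelihood(D, word_vecs, ctx_vecs)
print(ll)
```

The total work is |D| × |C| dot products per evaluation, which is what motivates cheaper alternatives to the full softmax.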

## Negative Sampling
Consider a pair (w,c) of word and context.

- p(D = 1|w, c) is the probability that (w, c) came from the corpus data
- p(D = 0|w, c) = 1 − p(D = 1|w, c) is the probability that (w, c) did not come from the corpus data

Our goal is now to find parameters to maximize the probabilities that all of the observations indeed came from the data:

$argmax_{\theta}\prod_{(w,c) \in D}p(D=1|w,c;\theta)$

$=argmax_{\theta}log\prod_{(w,c) \in D}p(D=1|w,c;\theta)$

$=argmax_{\theta}\sum_{(w,c) \in D}log p(D=1|w,c;\theta)$

The quantity p(D = 1|w,c; θ) can be defined using the logistic (sigmoid) function, which is the two-class case of the softmax:

$p(D=1|w,c;\theta)=\frac{1}{1+e^{-v_c \cdot v_w}}$

Leading to the objective:

$argmax_{\theta}\sum_{(w,c) \in D}log\frac{1}{1+e^{-v_c \cdot v_w}}$
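The formula for p(D = 1|w, c) is the logistic (sigmoid) function applied to the dot product $v_c \cdot v_w$. A minimal sketch follows, with toy vectors and hypothetical names (`sigmoid`, `p_in_data`); the branch on the sign of x is only there to avoid overflow in `math.exp`.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x}), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def p_in_data(v_c, v_w):
    """p(D = 1 | w, c): sigmoid of the dot product v_c . v_w."""
    return sigmoid(sum(a * b for a, b in zip(v_c, v_w)))


# Toy vectors (illustrative values only).
word_vecs = {"fox": [1.0, 0.5]}
ctx_vecs = {"brown": [0.9, 0.4], "quick": [0.8, 0.1]}
D = [("fox", "brown"), ("fox", "quick")]

p1 = p_in_data(ctx_vecs["brown"], word_vecs["fox"])  # p(D = 1 | w, c)
p0 = 1.0 - p1                                        # p(D = 0 | w, c)

# Objective over observed pairs only: sum of log p(D = 1 | w, c).
objective = sum(math.log(p_in_data(ctx_vecs[c], word_vecs[w])) for w, c in D)
print(p1, p0, objective)
```

Unlike equation (3), each term here needs a single dot product and no sum over all contexts.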

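This objective has a trivial solution: set θ such that p(D = 1|w,c;θ) = 1 for every pair (w,c). This is easily achieved by setting θ such that $v_c = v_w$ and $v_c \cdot v_w = K$ for all vectors, where K is a large enough number. We therefore need a mechanism that prevents all the vectors from having the same value, by disallowing some (w, c) combinations. One way to do so is to present the model with some (w,c) pairs for which p(D = 1|w,c;θ) must be low, i.e. pairs which are not in the data. This is achieved by generating a set D′ of random (w,c) pairs, assuming they are all incorrect (the name “negative sampling” stems from the set D′ of randomly sampled negative examples). The optimization objective then becomes: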

$argmax_{\theta}\prod_{(w,c) \in D}p(D=1|c,w;\theta)\prod_{(w,c) \in D'}p(D=0|c,w;\theta)$

$=argmax_{\theta}\prod_{(w,c) \in D}p(D=1|c,w;\theta)\prod_{(w,c) \in D'}(1-p(D=1|c,w;\theta))$

$=argmax_{\theta}\sum_{(w,c) \in D}log p(D=1|c,w;\theta)+\sum_{(w,c) \in D'}log(1-p(D=1|c,w;\theta))$

$=argmax_{\theta}\sum_{(w,c) \in D}log\frac{1}{1+e^{-v_c \cdot v_w}}+\sum_{(w,c) \in D'}log(1-\frac{1}{1+e^{-v_c \cdot v_w}})$

$=argmax_{\theta}\sum_{(w,c) \in D}log\frac{1}{1+e^{-v_c \cdot v_w}}+\sum_{(w,c) \in D'}log\frac{1}{1+e^{v_c \cdot v_w}}$
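The final objective can be sketched as follows. The example also shows why the negative term matters: if every vector is set to the same value with a large dot product K, the sum over D alone is pushed toward its maximum of 0, while the sum over D′ penalizes that configuration heavily. Everything named here (`log_sigmoid`, `sample_negatives`, `neg_sampling_objective`, the factor `k`) is hypothetical, and drawing D′ uniformly from the words and contexts seen in D is a simplifying assumption.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """log(1 / (1 + e^{-x})), computed in a numerically stable way."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sample_negatives(D, k, rng):
    """Assumption: draw k random (w, c) pairs per observed pair, uniformly
    from the words and contexts seen in D, and treat them all as incorrect."""
    words = sorted({w for w, _ in D})
    contexts = sorted({c for _, c in D})
    return [(rng.choice(words), rng.choice(contexts)) for _ in range(k * len(D))]


def neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs):
    """Sum of log sigmoid(v_c . v_w) over D plus log sigmoid(-v_c . v_w) over D'."""
    pos = sum(log_sigmoid(dot(ctx_vecs[c], word_vecs[w])) for w, c in D)
    neg = sum(log_sigmoid(-dot(ctx_vecs[c], word_vecs[w])) for w, c in D_neg)
    return pos + neg


D = [("fox", "brown"), ("fox", "quick"), ("dog", "lazy")]
D_neg = sample_negatives(D, k=2, rng=random.Random(0))

# Degenerate parameters: every vector identical, dot product K is large.
K = 50.0
same = [math.sqrt(K), 0.0]
word_vecs = {w: same for w, _ in D}
ctx_vecs = {c: same for _, c in D}

pos_only = neg_sampling_objective(D, [], word_vecs, ctx_vecs)
full = neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs)
print(pos_only)  # essentially 0: optimal when only D is considered
print(full)      # about -K per negative pair
```

Randomly drawn pairs can coincide with pairs that do occur in D; the derivation simply assumes they are all negative, and the sketch does the same.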