# Part 1: Unsupervised learning

© Anatolii Stehnii, 2018

## Lecture 2: Word2Vec

The main advantage of deep learning is **representation learning**. How can representation learning be used to find dense word vectors?

---
*Returning to the previous lecture: "Linguistic items with similar distributions have similar meanings".*

---

Deep learning is not always used to predict something; a **representation of data** that is learned without supervision can also be valuable (see autoencoders, neural language models, etc.).

Word2vec exploits this approach by **training a model** that takes each word and tries to predict all words from its context.

![Word2vec approach simple](Word2vec_words.png)

Let's denote:

- $w_t$ – a word at step $t \in 1..T$, center word;
- $n$ – window size;
- $w_{t+i}, i \in [-n/2, 0) \cup (0, n/2]$ – a word in the context of $w_t$;
- $\textbf{P}(w_{t+i}|w_t)$ – conditional probability of word $w_{t+i}$ given $w_t$;
- $\theta$ – model parameters (neural network weights).
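
To make the notation concrete, here is a minimal sketch that enumerates the (center word, context word) pairs for a toy sentence. The function name `skipgram_pairs` is hypothetical, and skipping positions that fall outside the text is an assumption; the formulas in this lecture do not say how boundaries are handled.

```python
def skipgram_pairs(tokens, n):
    """Yield (center, context) token pairs for a window of total size n."""
    half = n // 2  # n/2 words on each side of the center word
    for t, center in enumerate(tokens):
        for i in range(-half, half + 1):
            # skip the center word itself and positions outside the text
            # (the boundary handling is an assumption of this sketch)
            if i == 0 or not 0 <= t + i < len(tokens):
                continue
            yield center, tokens[t + i]


tokens = 'he was old man'.split()
pairs = list(skipgram_pairs(tokens, n=2))
print(pairs)
```

With $n = 2$ each inner word contributes two pairs and each boundary word contributes one.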

Then the likelihood for the neural network is:

$$
\textbf{L}(\theta) = \prod_{t=1}^T\prod_{-n/2 \le i \le n/2, i \ne 0} \textbf{P}(w_{t+i}|w_t; \theta)
$$

The objective function is the average negative log likelihood:

$$
\textbf{J}(\theta) = -\frac{1}{T}\log \textbf{L}(\theta) = -\frac{1}{T}\sum_{t=1}^T\sum_{-n/2 \le i \le n/2, i \ne 0} \log \textbf{P}(w_{t+i}|w_t; \theta)
$$

$\textbf{P}(w_{t+i}|w_t; \theta)$ is a conditional probability, approximated by a neural network. Minimizing $\textbf{J}(\theta)$ with respect to $\theta$ gives a neural model that predicts the context of words, and the model parameters $\theta$ will actually be dense representations of words. But which function should be used for $\textbf{P}(w_{t+i}|w_t; \theta)$?

Let's define two matrices $\textbf{W}_{in}$ and $\textbf{W}_{out}$ of size $len(V) \times m$, where $V$ is our vocabulary and $m$ is the desired number of dimensions for word embeddings. These matrices are all of our parameters, so $\theta = \{\textbf{W}_{in}, \textbf{W}_{out}\}$. Each word has a corresponding row vector in both $\textbf{W}_{in}$ and $\textbf{W}_{out}$. Let's denote a center word as $c$ and a context word as $o$, and the corresponding word vectors as $v_c$ (a row of $\textbf{W}_{in}$) and $u_o$ (a row of $\textbf{W}_{out}$).

$$
\textbf{P}(o|c; \textbf{W}_{in}, \textbf{W}_{out}) =
\frac{\exp(v_c^T u_o)}{\sum_{w \in V} \exp(v_c^T u_w)}
$$

The dot product $v_c^T u_o$ in the numerator measures the similarity of words $c$ and $o$. The exponent and the denominator normalize it by the similarities of $c$ to all words in the vocabulary (softmax).

From the formulas alone it may be hard to see how this works. Let's walk through an implementation of this formula as the forward propagation of a neural network.

![Word2vec implementation](Word2vec_diagram.png)

In [8]:
import numpy as np

corpus = 'he was old man'
vocab = {
    'he': 0,
    'was': 1,
    'old': 2,
    'man': 3

}
vocab_len = len(vocab)

center_word = 2
center_word_encoded = np.zeros((vocab_len, ))
center_word_encoded[center_word] = 1

# this is a skip-gram implementation: context words are predicted from the center word
context_words = [0, 1, 3]
context_words_encoded = np.zeros((len(context_words), vocab_len))
context_words_encoded[np.arange(len(context_words)), context_words] = 1
print('Center word one-hot encoding: {}\nContext words one-hot encodings: \n{}'.format(center_word_encoded, context_words_encoded))

Center word one-hot encoding: [ 0.  0.  1.  0.]
Context words one-hot encodings: 
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]]


In [12]:
# number of dimensions for word vectors
n_dim = 5
W_in = np.random.rand(vocab_len, n_dim)
W_out = np.random.rand(vocab_len, n_dim)
print('W in matrix:\n{}\nW out matrix:\n{}'.format(W_in, W_out))

W in matrix:
[[ 0.24186189  0.89357085  0.26148547  0.23270299  0.49560695]
 [ 0.01482146  0.68605003  0.74224183  0.91068227  0.35435903]
 [ 0.8115948   0.18869666  0.463782    0.31674507  0.27754381]
 [ 0.2423285   0.35453277  0.30785807  0.14330915  0.49802627]]
W out matrix:
[[ 0.44731698  0.59629897  0.89405164  0.12812246  0.93850568]
 [ 0.10365351  0.11948719  0.87945264  0.7286056   0.0193697 ]
 [ 0.92330252  0.35593004  0.03198809  0.14778626  0.61177179]
 [ 0.79678261  0.79271956  0.85932509  0.38838388  0.11762184]]


In [16]:
v_c = np.dot(center_word_encoded, W_in)
print('Center word vector v: {}'.format(v_c))

Center word vector v: [ 0.8115948   0.18869666  0.463782    0.31674507  0.27754381]
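
Multiplying a one-hot vector by $\textbf{W}_{in}$ simply selects one row of the matrix. The following sketch checks this on a stand-in matrix; `W_in_demo` and the fixed seed are assumptions made only so the example is self-contained and reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, chosen only for reproducibility
W_in_demo = rng.random((4, 5))      # hypothetical 4-word vocabulary, 5 dimensions

center = 2
one_hot = np.zeros(4)
one_hot[center] = 1

v_by_matmul = one_hot @ W_in_demo   # one-hot vector times matrix
v_by_lookup = W_in_demo[center]     # direct row lookup

print(np.allclose(v_by_matmul, v_by_lookup))
```

The row lookup avoids a full matrix product, which is why the one-hot multiplication is usually useful only for explaining the network, not for running it.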


In [18]:
U_w = np.dot(v_c, W_out.T)
print('Similarities:\n{}'.format(U_w))

Similarities:
[ 1.19126342  0.75070395  1.04794988  1.35045156]


In [19]:
def softmax(x):
    """Compute softmax values for the set of scores in x."""
    # subtracting the maximum avoids overflow in exp and leaves the result unchanged
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

conditional_probabilities = softmax(U_w)
print('Probabilities:\n{}'.format(conditional_probabilities))

Probabilities:
[ 0.27153864  0.17478296  0.23528344  0.31839495]
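
The `softmax` above subtracts the maximum score before exponentiating. A short sketch of why this is safe: softmax does not change when a constant is added to every score, so the shift only protects against overflow. The scores are the similarities printed above; the offset of 1000 is an arbitrary value chosen for illustration.

```python
import numpy as np

def softmax(x):
    """Softmax with the maximum subtracted for numerical stability."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

scores = np.array([1.19126342, 0.75070395, 1.04794988, 1.35045156])
probs = softmax(scores)
shifted_probs = softmax(scores + 1000.0)   # exp(1000) alone would overflow

print(probs.sum())                          # probabilities sum to one
print(np.allclose(probs, shifted_probs))    # same distribution after the shift
```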


In [22]:
probabilities_of_context_words = np.dot(conditional_probabilities, context_words_encoded.T)
print('Probabilities of context:\n{}'.format(probabilities_of_context_words)) 

Probabilities of context:
[ 0.27153864  0.17478296  0.31839495]


In [26]:
negative_log_likelihood = -np.sum(np.log(probabilities_of_context_words))
print('Negative log likelihood: {}'.format(negative_log_likelihood)) 

Negative log likelihood: 4.192323783664577
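
The value above is the loss for a single center word. As a rough sketch, the full objective $\textbf{J}(\theta)$ repeats the same forward pass for every position in the text and averages over $T$. The function name `average_nll`, the stand-in matrices and the skipping of out-of-range positions are assumptions of this sketch.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def average_nll(token_ids, W_in, W_out, n):
    """Average negative log likelihood J over a sequence of word indices."""
    half = n // 2
    total = 0.0
    for t, c in enumerate(token_ids):
        probs = softmax(W_out @ W_in[c])   # P(. | c) over the whole vocabulary
        for i in range(-half, half + 1):
            # skip the center word and positions outside the text (assumption)
            if i == 0 or not 0 <= t + i < len(token_ids):
                continue
            total -= np.log(probs[token_ids[t + i]])
    return total / len(token_ids)

rng = np.random.default_rng(0)             # stand-in random parameters
W_in_demo = rng.random((4, 5))
W_out_demo = rng.random((4, 5))
token_ids = [0, 1, 2, 3]                   # 'he was old man'
loss = average_nll(token_ids, W_in_demo, W_out_demo, n=2)
print('Average negative log likelihood: {}'.format(loss))
```

As a sanity check, all-zero matrices give a uniform distribution over the four words, so every pair contributes $\log 4$ to the sum.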


derivative
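
The lecture does not spell out the derivative, so the following is only a sketch of the standard gradient of a softmax negative log likelihood, not necessarily the derivation intended here. For one center word with scores $s = \textbf{W}_{out} v_c$, the gradient with respect to $s$ is the predicted probabilities (once per context word) minus the one-hot targets. The function name `loss_and_grads` is hypothetical, and the result is checked against finite differences.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def loss_and_grads(v_c, W_out, context_words):
    """Negative log likelihood for one center word, with its gradients."""
    probs = softmax(W_out @ v_c)
    loss = -np.sum(np.log(probs[context_words]))
    # d loss / d scores: softmax output once per context word, minus one-hot targets
    d_scores = len(context_words) * probs
    np.subtract.at(d_scores, context_words, 1.0)
    grad_v_c = W_out.T @ d_scores            # shape (m,)
    grad_W_out = np.outer(d_scores, v_c)     # shape (len(V), m)
    return loss, grad_v_c, grad_W_out

rng = np.random.default_rng(0)               # stand-in random parameters
v_c = rng.random(5)
W_out = rng.random((4, 5))
context_words = [0, 1, 3]
loss, grad_v_c, grad_W_out = loss_and_grads(v_c, W_out, context_words)

# central finite differences as an independent check of grad_v_c
eps = 1e-6
numeric = np.zeros_like(v_c)
for k in range(v_c.size):
    step = np.zeros_like(v_c)
    step[k] = eps
    numeric[k] = (loss_and_grads(v_c + step, W_out, context_words)[0]
                  - loss_and_grads(v_c - step, W_out, context_words)[0]) / (2 * eps)
print(np.allclose(grad_v_c, numeric, atol=1e-5))
```

A gradient descent step would then subtract a small multiple of `grad_v_c` from the center word's row of $\textbf{W}_{in}$ and of `grad_W_out` from $\textbf{W}_{out}$.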

After minimizing the loss function we will have two word-vector matrices: $\textbf{W}_{in}$ and $\textbf{W}_{out}$. We can use either of them, or their average, as word embeddings.
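
As a small illustration of the averaging option, the sketch below builds embeddings from two stand-in matrices and compares two words. The random matrices stand in for trained ones, and cosine similarity is a common choice for comparing word vectors that this lecture does not prescribe.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W_in_demo = rng.random((4, 5))    # stand-ins for trained matrices
W_out_demo = rng.random((4, 5))
vocab = {'he': 0, 'was': 1, 'old': 2, 'man': 3}

# one embedding per word: the average of its rows in both matrices
embeddings = (W_in_demo + W_out_demo) / 2
sim = cosine_similarity(embeddings[vocab['old']], embeddings[vocab['man']])
print('Similarity of "old" and "man": {}'.format(sim))
```

With untrained random matrices the similarity value carries no meaning; it only shows how the embeddings would be read out and compared.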