# Distributed representations

1. Word2vec  
    1.1 skip-gram model  
    1.2 Hierarchical Huffman Trees  
    1.3 Negative Sampling  
    1.4 CBoW model   
2. GloVe  
3. FastText  
4. Context-dependent Embeddings (e.g. BERT)  

# Readings
1. (general 1) https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
1. (general 2) https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795
1. (GloVe) https://towardsdatascience.com/emnlp-what-is-glove-part-i-3b6ce6a7f970
1. (embeddings in pytorch) https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
1. (BERT) http://jalammar.github.io/illustrated-bert/
1. (ELMo) https://arxiv.org/abs/1802.05365

# 1 Word2vec
<img src="images/w2v.png" style="height:300px">

## 1.1 Skip-gram model
<img src="images/skip.png" style="height:500px">

For each position $t$, predict the surrounding words within a window of size $m$ (the context).

The objective is to maximize the probability of the context words given the current center word:  

$$J(\theta) = \prod^T_{t=1} \prod_{-m \le j \le m; j \ne 0 }  p(x_{t+j} | x_t; \theta)  \rightarrow \max $$,
where  
$x_t$ - center word,  
$x_{t+j}$ - word from context,  
$m$ - context size,  
$T$ - number of words in the corpus.  

or, equivalently, to minimize the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T}\sum^T_{t=1} \sum_{-m \le j \le m; j \ne 0 }  \log p(x_{t+j} | x_t; \theta)  \rightarrow \min $$

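$$p(x_{t+j} | x_t) = p(out | center) = \frac{\exp(u_{out}^T v_{center})}{\sum_{k=1}^K \exp(u_{k}^T v_{center})}$$

where $u_k$ is the output (context) vector of word $k$, $v_{center}$ is the input vector of the center word, and $K$ is the vocabulary size.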
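The softmax above can be sketched in a few lines of plain Python. This is only an illustration on made-up toy vectors; the function name `softmax_prob` and the numbers are assumptions, not part of the notebook.

```python
import math

def softmax_prob(u, v_center, out_idx):
    """p(out | center): u holds one output vector per vocabulary word."""
    scores = [sum(ui * vi for ui, vi in zip(row, v_center)) for row in u]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[out_idx] / sum(exps)

# Toy example: vocabulary of K = 3 words, embedding dimension 2 (made-up numbers).
u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_center = [0.5, -0.5]
probs = [softmax_prob(u, v_center, k) for k in range(len(u))]
print(probs)
```

Note that the denominator sums over the whole vocabulary, which is what makes the plain softmax expensive for large $K$.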

## 1.2 Hierarchical Huffman trees

The cost of computing one output probability drops from $O(V)$ to $O(\log_2 V)$, where $V$ is the vocabulary size.

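$x = v_{n(w_O,j)}^T v_{w_I}$,   
where $w_I$ is the input (center) word, $w_O$ is the output (context) word and $n(w_O,j)$ is the j-th node on the path from the root to $w_O$.  

$p(n, left) = \sigma (v_n^T v_{w_I})$ - probability to go left at node $n$.  
$p(n, right) = 1 - p(n, left) = \sigma (- v_n^T v_{w_I} )$ - probability to go right at node $n$.  

Then,  
$p(w_O | w_I) = \prod_{j=1}^{L(w_O) - 1} \sigma ( [ n(w_O, j+1) == child(n(w_O,j)) ] \cdot v_{n(w_O,j)}^T v_{w_I})$,  
where $L(w_O)$ - number of nodes on the path from the root to $w_O$,  
$child(n)$ - left child of node $n$,  
$[a]$ - equals $1$ if $a$ is true and $-1$ otherwise.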
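A minimal sketch of the path product, assuming a hand-built toy tree with three leaves. The names `path_probability`, `node_vectors` and `paths` are hypothetical, and the vectors are made-up numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def path_probability(path, node_vectors, v_in):
    """path: list of (node_id, go_left) pairs from the root to the leaf's parent."""
    p = 1.0
    for node_id, go_left in path:
        sign = 1.0 if go_left else -1.0  # +1 for a left turn, -1 for a right turn
        p *= sigmoid(sign * dot(node_vectors[node_id], v_in))
    return p

# Toy tree: root (0) -> left leaf A; root -> right inner node (1) -> leaves B (left), C (right)
node_vectors = {0: [0.2, -0.1], 1: [-0.3, 0.4]}
v_in = [1.0, 2.0]  # input word vector
paths = {
    "A": [(0, True)],
    "B": [(0, False), (1, True)],
    "C": [(0, False), (1, False)],
}
probs = {w: path_probability(p, node_vectors, v_in) for w, p in paths.items()}
print(probs, sum(probs.values()))
```

Because $\sigma(x) + \sigma(-x) = 1$ at every inner node, the leaf probabilities sum to one without any explicit normalization over the vocabulary.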


<img src="images/hier.png" style="height:200px">

How to build the binary prefix tree? Use a Huffman tree: frequent words get short paths from the root, rare words get long ones.
<img src="images/huffman.png" style="height:300px">
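The construction can be sketched with the standard library heap: repeatedly merge the two least frequent subtrees. This is a generic Huffman coding sketch, not the word2vec implementation; `huffman_codes` and the word counts are made up for illustration.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return {word: binary code string} for a dict of word counts."""
    tiebreak = count()  # unique counter so that equal counts never compare the dicts
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)  # least frequent subtree
        f2, _, codes2 = heapq.heappop(heap)  # second least frequent subtree
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"the": 50, "flight": 20, "delay": 15, "gate": 10, "sna": 5})
for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(word, code)
```

The code length of a word equals the number of binary decisions needed to reach it, so the most frequent word here needs a single sigmoid evaluation.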

## 1.3 Negative sampling

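With negative sampling, the log-probability of each (center, context) pair is replaced by a binary classification objective with $k$ sampled negative words:   

$\log p(x_{t+j} | x_t; \theta) \approx \log \sigma(u_{out}^T v_{center})  + \sum_{i=1}^k E_{n_i \sim P(w)} [\log \sigma (-u_{n_i}^T v_{center})]$,  
where $n_i$ are negative words drawn from the noise distribution $P(w)$.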
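A minimal sketch of this objective for one (center, context) pair, assuming the negative vectors have already been sampled. The function names and toy vectors are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_objective(u_out, v_center, negatives):
    """log sigma(u_out . v) + sum over negatives of log sigma(-u_neg . v)."""
    positive_term = log_sigmoid(dot(u_out, v_center))
    negative_term = sum(log_sigmoid(-dot(u_neg, v_center)) for u_neg in negatives)
    return positive_term + negative_term

# Toy vectors (made-up numbers), k = 2 negative samples
u_out = [1.0, 0.5]
v_center = [0.8, 0.2]
negatives = [[-0.5, 0.1], [0.3, -0.9]]
objective = neg_sampling_objective(u_out, v_center, negatives)
print(objective)
```

The cost per training pair is $k + 1$ dot products instead of one per vocabulary word.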

In [14]:
# Whitespace tokenization; `df` is assumed to be a DataFrame with a `text` column loaded earlier
sentences = df.text.apply(lambda x: x.split()).values

In [15]:
%%time

from gensim.models.word2vec import Word2Vec


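# sg=1 selects skip-gram; negative=5 draws 5 negative samples per positive pair.
# Argument names are for gensim < 4.0; in gensim >= 4.0 use vector_size=100 and epochs=5.
w2v = Word2Vec(sentences, negative=5, size=100, iter=5, sg=1)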

CPU times: user 3.73 s, sys: 13.4 ms, total: 3.75 s
Wall time: 1.41 s


In [16]:
w2v.wv.most_similar('airline')

[('airline.', 0.9007248878479004),
 ('best', 0.8687483072280884),
 ('ever', 0.8615865111351013),
 ('awful', 0.8586821556091309),
 ('most', 0.8517616987228394),
 ('worst', 0.84912109375),
 ('disappointed', 0.8412606716156006),
 ('horrible', 0.838492751121521),
 ('company', 0.8372955322265625),
 ('absolute', 0.8371437788009644)]

## 1.4 CBOW model

= Predict center word from surrounding context

<img src="images/cbow.png" style="height:400px">

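$$x = [x_{j-m}, x_{j-m+1}, \dots, x_{j-1}, x_{j+1}, \dots, x_{j+m}] $$  
$$h = \frac{1}{2m} \sum_{-m \le i \le m; i \ne 0} W^T x_{j+i}$$  

$$p(x_j | x) = \frac{\exp(v_j^T h)}{\sum_{k=1}^K \exp(v_k^T h)}$$

where $x_{j+i}$ are one-hot vectors of the context words, $W$ is the input embedding matrix, $h$ is the averaged context embedding, $v_k$ is the output vector of word $k$, and $K$ is the vocabulary size.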
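The CBOW forward pass can be sketched as follows, assuming the context embeddings are averaged (gensim's default behaviour, as far as I know). `cbow_probs`, `W_in` and `V_out` are hypothetical names and the matrices are toy numbers.

```python
import math

def cbow_probs(W_in, V_out, context_ids):
    """Distribution over the center word given the ids of the context words."""
    dim = len(W_in[0])
    # h: average of the input embeddings of the context words
    h = [sum(W_in[i][c] for i in context_ids) / len(context_ids) for c in range(dim)]
    scores = [sum(vk * hk for vk, hk in zip(row, h)) for row in V_out]
    max_score = max(scores)  # for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy model: vocabulary of 4 words, embedding dimension 2
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
V_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
probs = cbow_probs(W_in, V_out, context_ids=[0, 2])
print(probs)
```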

In [None]:
%%time

from gensim.models.word2vec import Word2Vec


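# sg=0 selects CBOW. Argument names are for gensim < 4.0;
# in gensim >= 4.0 use vector_size=100 and epochs=100.
w2v = Word2Vec(sentences, negative=5, size=100, iter=100, sg=0)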

In [18]:
w2v.wv.most_similar('police')

[('assult', 0.6154369115829468),
 ('reported', 0.45402228832244873),
 ('most', 0.4082372188568115),
 ('communication,', 0.40658941864967346),
 ('Delays', 0.38628947734832764),
 ('Gate', 0.38524293899536133),
 ('Atlantic', 0.38465529680252075),
 ('engine', 0.3818504810333252),
 ('computer', 0.3697021007537842),
 ('SNA', 0.35934001207351685)]

# 2 GloVe

= Word embeddings through decomposition of the co-occurrence matrix

<img src="images/matrix.png" style="height:300px">

$P_{ij}$ - number of times the $i$-th word occurs together with the $j$-th word within a window of size $m$
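Counting $P_{ij}$ can be sketched with a `Counter` keyed by word pairs. This is an illustration on two made-up sentences; `cooccurrence_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def cooccurrence_counts(sentences, m):
    """Count how often each ordered word pair appears within a window of size m."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - m), min(len(tokens), i + m + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

toy = [["the", "flight", "was", "late"], ["the", "flight", "was", "cancelled"]]
P = cooccurrence_counts(toy, m=1)
print(P[("the", "flight")], P[("was", "late")])
```

The resulting matrix is symmetric because every pair inside a window is counted in both directions.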

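Cons of using rows of the co-occurrence matrix directly as word vectors: 
1. Very high-dimensional (one dimension per vocabulary word), so impractical
2. Hard to add new words and documents

Straightforward solution: apply a dimensionality-reduction method, usually SVD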

Singular Value Decomposition

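$M = U \Sigma V^{*}$  
$Mv = \sigma u$  
$M^{*}u = \sigma v$   
$U$, $V$ - unitary matrices whose columns are the left and right singular vectors $u$, $v$  
$\Sigma$ - diagonal matrix of the singular values $\sigma$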


Complexity: $O(n^2 m)$ for an $n \times m$ matrix with $n < m$
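As a worked form of the reduction step (a standard result, not specific to this notebook): keeping only the $k$ largest singular values gives the best rank-$k$ approximation of $M$ in the Frobenius norm, and the rows of $U_k \Sigma_k$ can serve as $k$-dimensional word vectors. Here $k$ is a hypothetical target dimension chosen by the user.

$$M \approx M_k = U_k \Sigma_k V_k^{*}, \qquad \text{vec}(i) = (U_k \Sigma_k)_{i,:}$$

where $U_k$, $V_k$ contain the first $k$ columns of $U$, $V$ and $\Sigma_k$ is the top-left $k \times k$ block of $\Sigma$.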


<img src="images/glove.png" style="height:300px">

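$J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^T v_j - \log P_{ij})^2$,  
where $f$ is a weighting function that limits the influence of very frequent co-occurrences and $W$ is the vocabulary size.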
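A sketch of this loss as a weighted squared error over the nonzero co-occurrence counts. The weighting function follows the GloVe paper ($f(x) = (x / x_{max})^{\alpha}$ below $x_{max}$, else 1, with $x_{max} = 100$, $\alpha = 0.75$); the bias terms of the full model are left out, and the names `glove_weight` and `glove_loss` are illustrative.

```python
import math

def glove_weight(p, p_max=100.0, alpha=0.75):
    """Weighting function f: grows with the count, capped at 1."""
    return (p / p_max) ** alpha if p < p_max else 1.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def glove_loss(P, U, V):
    """P: {(i, j): count > 0}; U, V: lists of word and context vectors."""
    total = 0.0
    for (i, j), p in P.items():
        err = dot(U[i], V[j]) - math.log(p)  # fit the log co-occurrence count
        total += glove_weight(p) * err ** 2
    return 0.5 * total

# Toy check: a single pair with count e, so log(count) = 1
P = {(0, 1): math.e}
zero_vectors = [[0.0, 0.0], [0.0, 0.0]]
loss_zero = glove_loss(P, zero_vectors, zero_vectors)
loss_fit = glove_loss(P, [[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]])
print(loss_zero, loss_fit)
```

Only pairs with a nonzero count enter the sum, since $\log 0$ is undefined and $f(0) = 0$.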

# 3 FastText

Subword embeddings.

Introduce **scoring function** (instead of scalar product in w2v):
$$s(w,c) = \sum_{g \in G_w} z_g^T v_c$$
where  
$G_w$ - set of 3-grams appearing in word $w$  
$z_g$ - embedding of 3-gram g  
$v_c$ - context vector  


**Objective** function for skip-gram case:

$$  \sum_{t=1}^T [\sum_{c \in C_t} \log (1 + \exp(- s(w_t, w_c))) + \sum_{n \in N_{t,c}} \log(1 + \exp(s(w_t, n)))] \rightarrow \min$$

where  
$c$ - chosen context position  
$C_t$ - set of context positions around the current word $w_t$  
$T$ - total number of words  
$N_{t,c}$ - set of negative samples for the chosen word and context position  


**Inference**:
Embedding of word $w$ from 3-grams $G_w$:
$$v_w = \sum_{g \in G_w} z_g$$
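A sketch of building a word vector from its 3-grams. The `<` and `>` boundary markers follow the fastText paper; storing the n-gram vectors in a plain dict `z` is a simplification made here (the reference implementation hashes n-grams into buckets), and the vectors are made-up numbers.

```python
def char_ngrams(word, n=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word, z, dim):
    """Sum the embeddings of all known 3-grams of the word."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for c, value in enumerate(z.get(gram, [0.0] * dim)):
            vec[c] += value
    return vec

# Toy 3-gram embeddings of dimension 2
z = {"<ca": [1.0, 0.0], "car": [0.0, 1.0], "ar>": [1.0, 1.0]}
print(char_ngrams("car"))
print(word_vector("car", z, dim=2))
print(word_vector("cart", z, dim=2))  # unseen word, still gets a vector
```

An out-of-vocabulary word such as "cart" shares the 3-grams "<ca" and "car" with "car", which is how subword models produce embeddings for unseen words.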



**Tweaks** in negative sampling: negative examples are sampled with probability 
$$ p(w) = \frac {\sqrt {U(w)}} {Z}$$
where $Z = \sum_w \sqrt {U(w)}$  
and $U(w)$ - the count of a particular word $w$  

Subsampling of frequent words: token $w$ is kept during training with probability
$$ P(w) = \min\left(1, \sqrt {\frac t {f(w)}} + \frac t {f(w)}\right) $$
and discarded otherwise, where $f(w) = \frac {U(w)} {\sum_{w'} U(w')}$ - relative frequency of token $w$ and $t$ - subsampling threshold  
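Both quantities can be computed directly from word counts. In this sketch $f(w)$ is taken to be the relative frequency (count divided by the total number of tokens), which is an assumption of the example, as are the function names, the threshold value and the toy counts.

```python
import math

def negative_sampling_probs(counts):
    """p(w) proportional to the square root of the word count."""
    z = sum(math.sqrt(c) for c in counts.values())
    return {w: math.sqrt(c) / z for w, c in counts.items()}

def subsampling_value(count, total, t=1e-4):
    """Raw value sqrt(t / f) + t / f with f = count / total."""
    f = count / total
    return math.sqrt(t / f) + t / f

counts = {"the": 100, "car": 25, "sna": 1}
total = sum(counts.values())
p_neg = negative_sampling_probs(counts)
values = {w: subsampling_value(c, total) for w, c in counts.items()}
print(p_neg)
print(values)
```

The square root flattens the distribution: "the" is 100 times more frequent than "sna" here but only 10 times more likely to be drawn as a negative.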



<img src="images/ft.png" style="height:300px">

# 4 Context-dependent Embeddings (e.g. BERT)

Embedding of token depends on the context.

<img src="images/bert.png" style="height:300px">

In [4]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM


# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text_1 = "I like to drive a car"
text_2 = "I had a ticket for 8th train car"
marked_text_1 = "[CLS] " + text_1 + " [SEP]"
marked_text_2 = "[CLS] " + text_2 + " [SEP]"

# Tokenize our sentence with the BERT tokenizer.
tokenized_text_1 = tokenizer.tokenize(marked_text_1)
tokenized_text_2 = tokenizer.tokenize(marked_text_2)

print(tokenized_text_1)
print(tokenized_text_2)



['[CLS]', 'i', 'like', 'to', 'drive', 'a', 'car', '[SEP]']
['[CLS]', 'i', 'had', 'a', 'ticket', 'for', '8th', 'train', 'car', '[SEP]']


In [5]:
indexed_tokens_1 = tokenizer.convert_tokens_to_ids(tokenized_text_1)
segments_ids_1 = [1] * len(tokenized_text_1)
indexed_tokens_2 = tokenizer.convert_tokens_to_ids(tokenized_text_2)
segments_ids_2 = [1] * len(tokenized_text_2)

# Convert inputs to PyTorch tensors
tokens_tensor_1 = torch.tensor([indexed_tokens_1])
segments_tensors_1 = torch.tensor([segments_ids_1])
tokens_tensor_2 = torch.tensor([indexed_tokens_2])
segments_tensors_2 = torch.tensor([segments_ids_2])

In [6]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

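# Put the model in "evaluation" mode, which disables dropout.
model.eval()
# Compute the last-layer hidden states for every token (no gradients needed).
# Indexing the output with [0] works both for the tuple returned by older
# `transformers` releases and for the output object returned by newer ones.
# The second positional argument of BertModel.forward is `attention_mask`, so the
# all-ones tensors are passed under that name; use `token_type_ids=` for segment ids.
with torch.no_grad():
    encoded_layers_1 = model(tokens_tensor_1, attention_mask=segments_tensors_1)[0]
    encoded_layers_2 = model(tokens_tensor_2, attention_mask=segments_tensors_2)[0]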


In [20]:
encoded_layers_1.shape

torch.Size([1, 8, 768])

In [21]:
car_embedding_1 = encoded_layers_1[0, tokenized_text_1.index('car'), :]
car_embedding_2 = encoded_layers_2[0, tokenized_text_2.index('car'), :]

# Squared Euclidean distance between the two contextual embeddings of 'car'
torch.sum((car_embedding_1 - car_embedding_2)**2)

tensor(158.6157)
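The squared distance above depends on the vector norms, so a scale-free measure such as cosine similarity is often easier to interpret. The sketch below works on plain Python lists; `cosine_similarity` is a hypothetical helper and the two vectors are made-up stand-ins for the 768-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for two contextual embeddings of the same token
vec_1 = [0.2, -1.0, 0.5]
vec_2 = [0.1, -0.8, 0.9]
similarity = cosine_similarity(vec_1, vec_2)
print(similarity)
```

With the tensors from the cell above, the same function could be called as `cosine_similarity(car_embedding_1.tolist(), car_embedding_2.tolist())`.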