<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Word2Vec">Word2Vec<a class="anchor-link" href="#Word2Vec">&#182;</a></h1><p>We use tf-idf to transform a document into a vector. This vector can be seen as the weighted sum of the <strong>one-hot vectors</strong> of the distinct words in the document. One-hot vectors have some limitations. For example, we know that "cat" is semantically more similar to "dog" than to "cow". However, the one-hot vectors of these three words are all equally far apart: every pair has the same Euclidean distance ($\sqrt{2}$) and zero cosine similarity. Can we embed words into a vector space that preserves the semantic distance between words?</p>
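<p>The limitation of one-hot vectors can be checked numerically. The sketch below is only an illustration: the three-word vocabulary and the helper names <code>one_hot</code>, <code>euclidean</code> and <code>cosine</code> are assumptions made for this example.</p>

```python
import math

# Hypothetical three-word vocabulary used only for this illustration.
vocab = ["cat", "dog", "cow"]

def one_hot(word):
    """Return the one-hot vector of `word` over `vocab`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, cow = (one_hot(w) for w in vocab)

# Every pair of distinct one-hot vectors is equally far apart.
d_cat_dog = euclidean(cat, dog)
d_cat_cow = euclidean(cat, cow)
s_cat_dog = cosine(cat, dog)
s_cat_cow = cosine(cat, cow)
```

<p>Both distances come out identical and both similarities are zero, so nothing in this representation says that "cat" is closer to "dog" than to "cow".</p>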
<p><strong>Word2vec</strong> is an NN-based word embedding method. It represents words in a continuous, low-dimensional vector space (i.e., the embedding space) where semantically similar words are mapped to nearby points. The main idea of word2vec is to train an NN with only <strong>1 linear hidden layer</strong> (i.e., each hidden unit has the linear activation function $a_j^{(1)}=z_j^{(1)}$) that uses a word to predict its neighbors, as shown below:</p>
<p><img src="../images/fig-word2vec-sg.png" width="350"></p>
<p>This is based on a key observation that <strong>semantically similar words are often used interchangeably in different contexts</strong>. For example, the words "cat" and "dog" may both appear in a context "___ is my favorite pet." When feeding "cat" and "dog" into the NN to predict their nearby words, these two words will be likely to share the same/similar hidden representation ($\boldsymbol{a}^{(1)}$ at layer 1) in order to predict the same/similar nearby words ($\boldsymbol{a}^{(2)}$ at layer 2). After training the NN, we can use the weight matrix $\boldsymbol{W}^{(1)}$ to encode a one-hot vector $\boldsymbol{x}$ into a low-dimensional dense vector $\boldsymbol{h}$, by $\boldsymbol{h}=\boldsymbol{W}^{(1)\top}\boldsymbol{x}$.</p>
<p>In the literature, the architecture of a word2vec NN comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the skip-gram (SG) model.</p>
<table>
<thead><tr>
<th style="text-align:center"><img src="../images/fig-word2vec-cbow.png" width="350"></th>
<th style="text-align:center"><img src="../images/fig-word2vec-sg_1.png" width="350"></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">CBOW</td>
<td style="text-align:center">Skip-gram</td>
</tr>
</tbody>
</table>
<p>Algorithmically, these models are similar, except that CBOW predicts target words (e.g. "cat") from context words ("is," "my," "favorite," and "pet"), while the skip-gram does the inverse and predicts context words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one example (C:["is," "my," "favorite," "pet"], T:"cat")). For the most part, this turns out to be a useful thing for smaller datasets. On the other hand, skip-gram treats each context-target pair (e.g., (T:"cat", C:"pet")) as a new observation and is shown to be able to capture the semantics better when we have a large dataset. We focus on the skip-gram model in the following.</p>
<p>Note that the weights are shared across words to ensure that each word has a single embedding. This is called <strong>weight tying</strong>. Also, word2vec is an <strong>unsupervised learning</strong> task as it does not require explicit labels. An NN can be used for both supervised and unsupervised learning tasks.</p>

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Cost-Function-and-Output-Layer">Cost Function and Output Layer<a class="anchor-link" href="#Cost-Function-and-Output-Layer">&#182;</a></h3><p>Like most NNs, a skip-gram word2vec model is trained using the maximum likelihood (ML) principle:</p>
$$\arg\min_{\Theta}\sum_{i=1}^{N}{-\log\mathrm{P}(\boldsymbol{y}^{(i)}\,|\,\boldsymbol{x}^{(i)},\Theta)}.$$<p>Unlike the binary classifier NN we have seen previously, it needs to output multiple classes (the vocabulary size $V$ in total). In a multiclass task where $y=1,\cdots,V$, we usually assume</p>
$$\Pr(y\,|\,\boldsymbol{x})\sim\mathrm{Categorical}(y\,|\,\boldsymbol{x};\boldsymbol{\rho})=\prod_{i=1}^{V}\rho_{i}^{1(y;\,y=i)}.$$<p>It is natural to use $V$ <strong>Softmax units</strong> in the output layer. That is, the activation function of each unit outputs one dimension of the softmax function, a generalization of the logistic sigmoid:</p>
$$a_i^{(L)}=\rho_i=\mathrm{softmax}(\boldsymbol{z}^{(L)})_{i}=\frac{\exp(z_{i}^{(L)})}{\sum_{j=1}^{{\color{red}V}}\exp(z_{j}^{(L)})}.$$<p>The cost function then becomes:</p>
$$\arg\min_{\Theta}\sum_{i}-\log\prod_{j}\left(\frac{\exp(z_{j}^{(L)})}{\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)})}\right)^{1(y^{(i)};y^{(i)}=j)}=\arg\min_{\Theta}\sum_{i}\left[-z_{y^{(i)}}^{(L)}+\log\sum_{k=1}^{{\color{red}V}}\exp(z_{k}^{(L)})\right]$$<p>Basically, we want to maximize $\rho_j$ when seeing an example of class $j$. However, this objective incurs a high training cost when $V$ is large. Recall from the lecture that, at every training step in SGD, we need to compute the gradient of the cost function with respect to $\boldsymbol{z}^{(L)}$. This gradient involves the $z_{i}^{(L)}$ of <strong>every unit</strong> at the output layer, which in turn leads to a lot of weight updates in $\boldsymbol{W}^{(1)}$ and $\boldsymbol{W}^{(2)}$ at every training step. The training will be very slow.</p>
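<p>To make the cost of the softmax objective concrete, here is a minimal pure-Python sketch for a single example. The logits <code>z</code>, the class index <code>y</code> and the function names are hypothetical; the gradient uses the standard result $\partial C/\partial z_k=\rho_k-1(y=k)$.</p>

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cost(z, y):
    """Per-example cost -z_y + log sum_k exp(z_k), computed stably."""
    m = max(z)
    return -z[y] + m + math.log(sum(math.exp(v - m) for v in z))

# Hypothetical logits for a tiny vocabulary (V = 3) and true class y = 0.
z = [2.0, 1.0, 0.1]
y = 0

rho = softmax(z)
cost = softmax_cost(z, y)

# Gradient of the cost with respect to each logit: rho_k - 1(y = k).
grad = [p - (1.0 if k == y else 0.0) for k, p in enumerate(rho)]
```

<p>Every entry of <code>grad</code> is non-zero, which is why a single training step touches the weights of all $V$ output units.</p>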
<h3 id="Negative-Sampling">Negative Sampling<a class="anchor-link" href="#Negative-Sampling">&#182;</a></h3><p>To speed up the training process, we can instead replace the $V$ Softmax units at the output layer with $V$ Sigmoid units and use a new cost function:</p>
$$\arg\min_{\Theta}\sum_{i=1}^{N}\left[-\log\mathrm{P}(y^{(i)}=1\,|\,\boldsymbol{x}^{(i)},\Theta)-\sum_{j\in\mathbb{N}^{(i)}}\log\mathrm{P}(y^{(j)}=0\,|\,\boldsymbol{x}^{(j)},\Theta)\right],$$<p>which amounts to:</p>
$$\arg\min_{\Theta}\sum_{i}\left[-\log\rho_{y^{(i)}}-\sum_{j\in\mathbb{N}^{(i)}}\log(1-\rho_{y^{(j)}})\right],$$<p>where $\rho_{y^{(i)}}=\sigma(z_{y^{(i)}}^{(L)})$ and $\mathbb{N}^{(i)}$, $\vert\mathbb{N}^{(i)}\vert=U\ll N$, is a set of indices of the <strong>noise words</strong> that have never been in the same context as the target word. The cost function is minimized when the model assigns a high $\rho_{y^{(i)}}$ to the context words, and a low $\rho_{y^{(j)}}$ to all its noise words. It can be shown that the new model approximates the original one (with the softmax output units) when trained on infinitely many examples. But the new model is computationally advantageous because the cost function involves only $O(U)$ entries of $\boldsymbol{z}^{(L)}$ (thus $O(U)$ output units). This leads to far fewer weight updates at each training step and higher efficiency.</p>
<p>The above trick for improving the training efficiency in a multiclass classification task is called <strong>negative sampling</strong>. Note that each example now contains one context word and additionally $U$ noise words. We need some extra preprocessing steps.</p>
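<p>The per-example negative-sampling cost can be sketched in a few lines of plain Python. The pre-activations <code>z</code>, the index choices and the function names below are hypothetical and serve only to illustrate that the loss depends on $1+U$ output units.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def negative_sampling_loss(z, context_idx, noise_idxs):
    """-log(rho_context) - sum over noise words of log(1 - rho_noise)."""
    loss = -math.log(sigmoid(z[context_idx]))
    for j in noise_idxs:
        loss -= math.log(1.0 - sigmoid(z[j]))
    return loss

# Hypothetical output pre-activations for a vocabulary of V = 6 words.
z = [3.0, -2.0, 0.5, -1.0, 0.0, 4.0]

# One context word (index 0) and U = 2 noise words (indices 1 and 3).
loss = negative_sampling_loss(z, context_idx=0, noise_idxs=[1, 3])

# Changing a unit that is neither the context nor a noise word has no effect.
z_changed = list(z)
z_changed[5] = -10.0
loss_changed = negative_sampling_loss(z_changed, context_idx=0, noise_idxs=[1, 3])
```

<p>Because <code>loss_changed</code> equals <code>loss</code>, the gradient with respect to the untouched units is zero and their weights need no update in this step.</p>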

</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h3 id="Preparing-Training-Examples">Preparing Training Examples<a class="anchor-link" href="#Preparing-Training-Examples">&#182;</a></h3><p>With negative sampling, each training example consists of one correct context word and $U$ noise words. To give an example, let's consider the corpus:</p>
<pre>
the quick brown fox jumped over the lazy dog ...
</pre><p>Suppose the context window is 2 words (1 at left and 1 at right). We scan through the corpus by moving the window from left to right and get the context words (C) and target words (T):</p>
<pre>
(C:[the, brown], T:quick), 
(C:[quick, fox], T:brown), 
(C:[brown, jumped], T:fox), 
...
</pre><p>Recall that the skip-gram model tries to predict <strong>each</strong> context word from its target word. So, we break a <code>(C:[context1, context2, ...], T:target)</code> pair into <code>(T:target, C:context1)</code>, <code>(T:target, C:context2)</code>, and so on. Our dataset becomes:</p>
<pre>
(T:quick, C:the), 
(T:quick, C:brown), 
(T:brown, C:quick), 
(T:brown, C:fox), 
...
</pre><p>To support negative sampling, we need to sample $U$ noise words (N) for each of the above pairs by following some distribution (typically the unigram distribution). For example, with $U=2$, our final dataset could be:</p>
<pre>
(T:quick, C:the, N:[over, lazy]), 
(T:quick, C:brown, N:[fox, over]), 
(T:brown, C:quick, N:[the, lazy]), 
(T:brown, C:fox, N:[jumped, dog]), 
...
</pre><p>Given an example tuple, we regard the context word (C) as positive and noise words (N1 and N2) as negative when evaluating the cost function defined above. Specifically, the Sigmoid unit in the output layer outputs $\rho_C$ for the context word and $\rho_{N1}$ and $\rho_{N2}$ for the noise words, and the loss function for this example is</p>
$$-\log\rho_{C}-\log(1-\rho_{N1})-\log(1-\rho_{N2}).$$
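<p>The preprocessing steps above can be sketched as follows. This is a simplified illustration, not a reference implementation: the function names are hypothetical, only targets with a full window are kept (as in the listing above), and a noise word is merely required to differ from the target and the context word rather than to have never shared a context with the target.</p>

```python
import random
from collections import Counter

def skip_gram_pairs(tokens, half_window=1):
    """Build (target, context) pairs for every target that has a full window."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        for j in range(i - half_window, i + half_window + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

def add_noise_words(pairs, tokens, num_noise, rng):
    """Attach `num_noise` words drawn from the unigram distribution to each pair."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] for w in words]
    examples = []
    for target, context in pairs:
        noise = []
        while len(noise) < num_noise:
            candidate = rng.choices(words, weights=weights, k=1)[0]
            # Simplifying assumption: reject only the target, the context and repeats.
            if candidate != target and candidate != context and candidate not in noise:
                noise.append(candidate)
        examples.append((target, context, noise))
    return examples

tokens = "the quick brown fox jumped over the lazy dog".split()
pairs = skip_gram_pairs(tokens, half_window=1)
examples = add_noise_words(pairs, tokens, num_noise=2, rng=random.Random(0))
```

<p>Each element of <code>examples</code> has the form <code>(T, C, [N1, N2])</code>, matching the tuples listed above; the particular noise words depend on the random seed.</p>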
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered">
<div class="prompt input_prompt">
</div>

## Word embeddings 

Let us look at PyTorch's high-level API, which already implements many of the building blocks needed to train neural networks. We will solve the same problem as last time, learning word embeddings, only now we will train them ourselves. First we need to prepare the training data.

Let's collect and tokenize the texts:

In [349]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

torch.manual_seed(1)

torch.set_default_dtype(torch.float64)

In [350]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward>)


In [351]:
torch.sum(hello_embed, dim=0)

tensor([ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519], grad_fn=<SumBackward2>)
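An embedding lookup is equivalent to multiplying a one-hot vector by the embedding matrix, $h = W^\top x$. The following plain-Python sketch illustrates this with a made-up 2-word, 3-dimensional matrix `W`; the helper names are assumptions for this example and are not part of PyTorch.

```python
# Hypothetical embedding matrix: 2 words in the vocabulary, 3 dimensions each.
W = [
    [0.1, 0.2, 0.3],  # row 0: embedding of word 0
    [0.4, 0.5, 0.6],  # row 1: embedding of word 1
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def transposed_matvec(W, x):
    """Compute h = W^T x for a row-major matrix W and a vector x."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][d] * x[i] for i in range(rows)) for d in range(cols)]

x = one_hot(1, size=2)
h = transposed_matvec(W, x)

# The product simply selects row 1 of W, which is what an index lookup returns.
lookup = W[1]
```

This is why an embedding layer can take integer indices as input: indexing a row gives the same result as the matrix product without materializing the one-hot vector.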

### Skip-Gram Word2vec

Let's start with the skip-gram word2vec model.

This is a simple model with just two layers. The idea is to train embeddings so that they predict the context of the corresponding word as well as possible. That is, if an embedding encodes well which words a given word occurs with, it also tells us something about the word itself. As a natural consequence, words that occur in the same contexts (say, `apple` and `orange`) end up with close embedding vectors.

<img src="https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/Word2vecExample.jpeg" width="50%">

To do this, we model the probabilities $\{P(w_{c+j} \mid w_c) : j = -k, \ldots, k,\ j \neq 0\}$, where $k$ is the size of the context window and $c$ is the index of the central word.

Let's build such a model: we will train a pair of matrices, $U$, the embedding matrix that we will later use for our own tasks, and $V$, the matrix of the output layer.


Each word in the vocabulary corresponds to a row of the matrix $U$ and a column of the matrix $V$.

<img src="https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/SkipGram.png" width="50%">


What's going on here? The word is mapped to its embedding, the row $u_c$. This embedding is then multiplied by the matrix $V$.

As a result, we obtain a set of numbers $v_j^T u_c$, each measuring the similarity between word $j$ and our word.

We turn these numbers into something like probabilities with the softmax function: $P(i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$.

We then compute the cross-entropy loss:

$$-\sum_{-k \leq j \leq k, j \neq 0} \log \frac{\exp(v_{c+j}^T u_c)}{\sum_{i=1}^{|V|} \exp(v_i^T u_c)} \to \min_{U, V}.$$

As a result, the vector $u_c$ moves closer to the vectors $v_{c+j}$ of the words in its context.
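The loss for one central word can be sketched without any framework. In the snippet below, `U` holds one embedding row per word and `V_cols` holds the columns $v_i$ of the output matrix as a list of vectors; both toy matrices and all function names are assumptions made for illustration.

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed with the usual max shift for stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def skip_gram_loss(U, V_cols, center, context_words):
    """Sum over context words of -log softmax(v_i^T u_c)[context word]."""
    u_c = U[center]
    # One score v_i^T u_c per vocabulary word i.
    scores = [sum(a * b for a, b in zip(v_i, u_c)) for v_i in V_cols]
    log_probs = log_softmax(scores)
    return -sum(log_probs[w] for w in context_words)

vocab_size, dim = 4, 2

# With all-zero vectors every word is equally likely, so the loss is log(vocab_size).
U_zero = [[0.0] * dim for _ in range(vocab_size)]
V_zero = [[0.0] * dim for _ in range(vocab_size)]
loss_uniform = skip_gram_loss(U_zero, V_zero, center=0, context_words=[1])

# Aligning u_0 with v_1 raises the score of word 1 and lowers the loss.
U_aligned = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
V_aligned = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss_aligned = skip_gram_loss(U_aligned, V_aligned, center=0, context_words=[1])
```

The comparison between `loss_uniform` and `loss_aligned` shows the effect described above: minimizing the loss pushes $u_c$ towards the output vectors of its context words.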

Let's implement all of this to understand it better.

#### Generating batches

First you need to collect contexts.

In [387]:
import time
import math
import random
import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from collections import Counter

random.seed(1024)

In [307]:
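def tokenizer(sentence):
    # Put a space before sentence punctuation and drop everything except letters and . ! ?
    sentence = re.sub(r"([.!?])", r" \1", sentence)
    sentence = re.sub(r"[^a-zA-Z.!?]+", r" ", sentence)
    
    # `nlp` is not defined in this part of the notebook; it is assumed to be a spaCy
    # pipeline object created earlier (for example with spacy.load or spacy.blank).
    return [x.text for x in nlp.tokenizer(sentence) if x.text != " "]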

In [364]:
quora_data = pd.read_csv('.data/quora_train.csv')
quora_data = quora_data[:3000]
quora_data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [365]:
quora_data.question1 = quora_data.question1.replace(np.nan, '', regex=True)
quora_data.question2 = quora_data.question2.replace(np.nan, '', regex=True)

texts = list(pd.concat([quora_data.question1, quora_data.question2]).unique())

In [366]:
tokenized_texts = [tokenizer(text.lower()) for text in texts]

In [367]:
MIN_COUNT = 5

words_counter = Counter(token for tokens in tokenized_texts for token in tokens)
word2index = {
    '<unk>': 0
}

for word, count in words_counter.most_common():
    if count < MIN_COUNT:
        break
        
    word2index[word] = len(word2index)
    
index2word = [word for word, _ in sorted(word2index.items(), key=lambda x: x[1])]

In [368]:
print('Vocabulary size:', len(word2index))
print('Tokens count:', sum(len(tokens) for tokens in tokenized_texts))
print('Unknown tokens appeared:', sum(1 for tokens in tokenized_texts for token in tokens if token not in word2index))
print('Most freq words:', index2word[1:21])

Vocabulary size: 1628
Tokens count: 73546
Unknown tokens appeared: 10052
Most freq words: ['?', 'the', 'what', 'is', 'i', 'how', 'to', 'in', 'a', 'do', 'of', 'are', 'and', 'can', 'for', 'you', 'why', 'my', 'it', 'best']


In [371]:
def build_contexts(tokenized_texts, window_size):
    contexts = []
    for tokens in tokenized_texts:
        for i in range(len(tokens)):
            central_word = tokens[i]
            context = [tokens[i + delta] for delta in range(-window_size, window_size + 1) 
                       if delta != 0 and i + delta >= 0 and i + delta < len(tokens)]

            contexts.append((central_word, context))
            
    return contexts

In [372]:
contexts = build_contexts(tokenized_texts, window_size=2)
contexts[:5]

[('what', ['is', 'the']),
 ('is', ['what', 'the', 'step']),
 ('the', ['what', 'is', 'step', 'by']),
 ('step', ['is', 'the', 'by', 'step']),
 ('by', ['the', 'step', 'step', 'guide'])]

In [373]:
contexts = [(word2index.get(central_word, 0), [word2index.get(word, 0) for word in context]) 
            for central_word, context in contexts]
contexts[:5]

[(3, [4, 2]),
 (4, [3, 2, 511]),
 (2, [3, 4, 511, 62]),
 (511, [4, 2, 62, 511]),
 (62, [2, 511, 511, 0])]

In [378]:
def make_skip_gram_batchs_iter(contexts, window_size, num_skips, batch_size):
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * window_size
    
    central_words = [word for word, context in contexts if len(context) == 2 * window_size and word != 0]
    contexts = [context for word, context in contexts if len(context) == 2 * window_size and word != 0]
    
    batch_size = int(batch_size / num_skips)
    batchs_count = int(math.ceil(len(contexts) / batch_size))
    
    print('Initializing batchs generator with {} batchs per epoch'.format(batchs_count))
    
    while True:
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)

        for i in range(batchs_count):
            batch_begin, batch_end = i * batch_size, min((i + 1) * batch_size, len(contexts))
            batch_indices = indices[batch_begin: batch_end]

            batch_data, batch_labels = [], []

            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                
                words_to_use = random.sample(context, num_skips)
                batch_data.extend([central_word] * num_skips)
                batch_labels.extend(words_to_use)
            
            yield batch_data, batch_labels

In [422]:
batch, labels = next(make_skip_gram_batchs_iter(contexts, window_size=2, num_skips=2, batch_size=32))

Initializing batchs generator with 2626 batchs per epoch


In [425]:
batch = torch.LongTensor(batch)
batch

tensor([   2,    2, 1419, 1419, 1183, 1183,   41,   41,   11,   11,    8,    8,
         661,  661,   28,   28,  380,  380,   99,   99,   61,   61,    1,    1,
          53,   53,  463,  463,    1,    1,  488,  488])

In [426]:
labels = torch.LongTensor(labels)
labels

tensor([ 315,    3,    0,    9,  417,    0,    0,  426,  704,   11,  161,  541,
           0,  456,   42,    9,    4,   31,    0,    2,  117,    0,    9, 1190,
         136,   18,   13, 1561,   17, 1546,   73,  789])

In [384]:
EMBEDDING_DIM = 32
VOCAB_SIZE = len(word2index)

#### Word2Vec Model

In [395]:
class SkipGram(nn.Module):
    def __init__(self):
        super(SkipGram, self).__init__()
        # Use the EMBEDDING_DIM constant defined above instead of a hard-coded 32
        self.model = nn.Sequential(nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM), nn.Linear(EMBEDDING_DIM, VOCAB_SIZE))

    def forward(self, x):
        output = self.model(x)
        return output
    
model = SkipGram()

In [396]:
loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()
stop = 0
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training
for step, (batch, labels) in enumerate(make_skip_gram_batchs_iter(contexts, window_size=2, num_skips=4, batch_size=128)):
    batch = torch.LongTensor(batch)
    labels = torch.LongTensor(labels)

    optimizer.zero_grad()
    output = model(batch)
    
    loss = criterion(output, labels)
    loss.backward()
    
    optimizer.step()

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()
        stop += 1
    if stop == 10:
        break
    

Initializing batchs generator with 1313 batchs per epoch
Step = 1000, Avg Loss = 5.5083, Time = 8.91s
Step = 2000, Avg Loss = 5.0394, Time = 8.38s
Step = 3000, Avg Loss = 4.9165, Time = 7.70s
Step = 4000, Avg Loss = 4.8583, Time = 8.71s
Step = 5000, Avg Loss = 4.7689, Time = 8.51s
Step = 6000, Avg Loss = 4.7246, Time = 8.85s
Step = 7000, Avg Loss = 4.7054, Time = 11.46s
Step = 8000, Avg Loss = 4.7047, Time = 14.48s
Step = 9000, Avg Loss = 4.6684, Time = 12.16s
Step = 10000, Avg Loss = 4.6527, Time = 9.94s


#### Analysis

The trained embeddings are the weights of the model's first layer. Let's inspect the layers and their parameters first:

In [403]:
for child in model.children():
    print(child)

Sequential(
  (0): Embedding(1628, 32)
  (1): Linear(in_features=32, out_features=1628, bias=True)
)


In [405]:
for name,parameters in model.named_parameters():
    print(name,':',parameters.size())

model.0.weight : torch.Size([1628, 32])
model.1.weight : torch.Size([1628, 32])
model.1.bias : torch.Size([1628])


In [410]:
for child in model.children():
    for param in child.parameters():
        print("This is what a parameter looks like - \n",param)

This is what a parameter looks like - 
 Parameter containing:
tensor([[-0.7005, -0.3949, -1.0316,  ...,  1.3106,  0.8832, -0.4425],
        [ 0.1537, -0.6839,  0.0646,  ..., -0.1717, -0.2411,  0.0848],
        [-0.6529,  0.5762, -0.3311,  ..., -0.2396, -0.1716, -0.1415],
        ...,
        [ 0.3938, -0.8552,  0.4637,  ..., -0.0057, -0.5812,  0.3205],
        [-0.8397,  0.7625,  0.6288,  ...,  0.9212,  0.1511, -0.1192],
        [-1.3223, -2.1495,  1.1332,  ...,  1.4895,  0.7985,  1.5277]],
       requires_grad=True)
This is what a parameter looks like - 
 Parameter containing:
tensor([[ 2.3062e-02, -1.3275e-01,  6.2402e-02,  ...,  2.8661e-02,
         -9.2706e-03,  1.3411e-02],
        [-8.6404e-04, -3.5879e-01,  4.5102e-01,  ..., -4.2325e-02,
          3.7672e-02,  6.7543e-02],
        [ 1.3683e-01, -3.8622e-01,  4.2439e-01,  ...,  1.0466e-01,
         -1.9804e-02,  4.5757e-01],
        ...,
        [ 2.7370e-01,  6.5045e-01,  1.8759e-01,  ..., -9.5559e-01,
          7.1552e-01, -1.8

In [416]:
# detach() is the recommended way to get the weights without gradient tracking
embeddings = model.model[0].weight.detach().cpu().numpy()

Let's check whether the result is at least somewhat reasonable.

In [419]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(embeddings, index2word, word2index, word):
    word_emb = embeddings[word2index[word]]
    
    similarities = cosine_similarity([word_emb], embeddings)[0]
    top10 = np.argsort(similarities)[-10:]
    
    return [index2word[index] for index in reversed(top10)]

most_similar(embeddings, index2word, word2index, 'best')

['best',
 'easiest',
 'funniest',
 'worst',
 'review',
 'downloading',
 'most',
 'fastest',
 'hardest',
 'favorite']

And visualize!

In [421]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100)
    return scale(tsne.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, index2word, word_count):
    word_vectors = embeddings[1: word_count + 1]
    words = index2word[1: word_count + 1]
    
    word_tsne = get_tsne_projection(word_vectors)
    draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)
    
    
visualize_embeddings(embeddings, index2word, 1000)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1000 samples in 0.007s...
[t-SNE] Computed neighbors for 1000 samples in 0.100s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1000
[t-SNE] Mean sigma: 1.360045
[t-SNE] Computed conditional probabilities in 0.039s
[t-SNE] Iteration 50: error = 83.9013748, gradient norm = 0.2899156 (50 iterations in 7.337s)
[t-SNE] Iteration 100: error = 84.1195221, gradient norm = 0.3060217 (50 iterations in 7.885s)
[t-SNE] Iteration 150: error = 84.7624664, gradient norm = 0.3165305 (50 iterations in 8.592s)
[t-SNE] Iteration 200: error = 86.3966522, gradient norm = 0.2803719 (50 iterations in 9.326s)
[t-SNE] Iteration 250: error = 86.0357513, gradient norm = 0.2777553 (50 iterations in 8.905s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 86.035751
[t-SNE] Iteration 300: error = 2.5670052, gradient norm = 0.0045316 (50 iterations in 6.542s)
[t-SNE] Iteration 350: error = 2.4186151, gradient norm = 0.00141



### N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$-th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
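
For comparison with the neural model that follows, the same conditional probability can be estimated directly from counts. The sketch below is a plain maximum-likelihood trigram estimate on a made-up toy sentence; the function name `trigram_probs` and the sentence are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def trigram_probs(tokens):
    """Estimate P(w_i | w_{i-2}, w_{i-1}) by relative frequency."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return {
        context: {word: n / sum(next_words.values()) for word, n in next_words.items()}
        for context, next_words in counts.items()
    }

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
probs = trigram_probs(tokens)

# The context ("the", "cat") is followed once by "sat" and once by "ran".
p_sat = probs[("the", "cat")]["sat"]
p_ran = probs[("the", "cat")]["ran"]
```

A count-based estimate assigns zero probability to any trigram it has never seen, which is one motivation for learning embeddings and a parametric model instead.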

In [205]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [206]:
class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

In [207]:
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[518.1398809110159, 515.6318509353883, 513.1396400811631, 510.66284127663016, 508.19781289771703, 505.7445230296734, 503.3024996208805, 500.8712962733133, 498.4513797301769, 496.0399662879884]


### Continuous Bag of Words (CBoW)

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always helps performance a couple
of percent.

The CBOW model is as follows. Given a target word $w_i$ and an
$N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
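
The objective above can be evaluated by hand on a toy example. In this sketch the embedding table `Q`, the output weights `A`, the bias `b` and the function name are all hypothetical; the values are chosen to be exactly representable so that the order of summation does not matter.

```python
import math

def cbow_neg_log_prob(Q, A, b, context, target):
    """-log Softmax(A * sum_{w in C} q_w + b)[target] for one example."""
    dim = len(Q[0])
    # Sum the embeddings of all context words (a bag of words: order is ignored).
    summed = [sum(Q[w][d] for w in context) for d in range(dim)]
    # One logit per vocabulary word.
    logits = [sum(a * s for a, s in zip(row, summed)) + bias for row, bias in zip(A, b)]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[target] - log_z)

# Hypothetical parameters for a vocabulary of 4 words and 2-dimensional embeddings.
Q = [[0.5, -0.25], [1.0, 0.5], [-0.5, 0.25], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
b = [0.0, 0.0, 0.0, 0.0]

loss = cbow_neg_log_prob(Q, A, b, context=[0, 1, 3], target=2)
loss_reordered = cbow_neg_log_prob(Q, A, b, context=[3, 1, 0], target=2)

# With all-zero parameters the prediction is uniform over the vocabulary.
zeros_Q = [[0.0, 0.0] for _ in range(4)]
zeros_A = [[0.0, 0.0] for _ in range(4)]
loss_uniform = cbow_neg_log_prob(zeros_Q, zeros_A, b, context=[0, 1, 3], target=2)
```

Reordering the context leaves the loss unchanged, which is the "bag of words" property: only the sum of the context embeddings enters the prediction.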


<img src="https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/CBOW.png" width="50%">

Implement this model in PyTorch by filling in the class below.

In [438]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
contexts = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    contexts.append((context, target))

print(contexts[:5])

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


In [441]:
# create your model and train.  here are some functions to help you make
# the data ready for use by your module
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


make_context_vector(contexts[0][0], word_to_ix)  # example

tensor([ 6, 39,  4, 37])

In [446]:
def make_cbow_batchs_iter(contexts, window_size, batch_size):
    # `contexts` must hold index-based (central_word, [context words]) pairs, as produced
    # by build_contexts followed by the word2index mapping.
    # For CBOW the full context window is the input and the central word is the label.
    data = np.array([context for word, context in contexts if len(context) == 2 * window_size and word != 0])
    labels = np.array([word for word, context in contexts if len(context) == 2 * window_size and word != 0])
        
    batchs_count = int(math.ceil(len(data) / batch_size))
    
    print('Initializing batchs generator with {} batchs per epoch'.format(batchs_count))
    
    while True:
        indices = np.arange(len(data))
        np.random.shuffle(indices)

        for i in range(batchs_count):
            batch_begin, batch_end = i * batch_size, min((i + 1) * batch_size, len(data))
            batch_indices = indices[batch_begin: batch_end]

            yield data[batch_indices], labels[batch_indices]

In [448]:
# `contexts` was overwritten by the toy `raw_text` example above, so rebuild the
# index-based pairs from the Quora data before batching.
quora_contexts = [(word2index.get(central_word, 0), [word2index.get(word, 0) for word in context])
                  for central_word, context in build_contexts(tokenized_texts, window_size=2)]
batch, labels = next(make_cbow_batchs_iter(quora_contexts, window_size=2, batch_size=32))

In [440]:
batch, labels = next(make_cbow_batchs_iter(contexts, window_size=2, batch_size=32))

Initializing batchs generator with 1 batchs per epoch


In [435]:
batch

[19, 19, 5, 5, 8, 8, 9, 9, 600, 600, 13, 13, 149, 149, 179, 179,
 11, 11, 517, 517, 807, 807, 2, 2, 16, 16, 1109, 1109, 8, 8, 493, 493]

In [430]:
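class CBoWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBoWModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: (batch_size,) for single words or (batch_size, context_size) for CBoW
        output = self.embeddings(inputs)
        if output.dim() == 3:
            # CBoW: sum the context word embeddings into one vector per example
            output = output.sum(dim=1)
        # Unnormalized scores over the vocabulary: (batch_size, vocab_size)
        output = self.out_layer(output)
        return output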
      

model = CBoWModel(VOCAB_SIZE, EMBEDDING_DIM)

In [431]:
loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()
stop = 0
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training
for step, (batch, labels) in enumerate(make_skip_gram_batchs_iter(contexts, window_size=2, num_skips=4, batch_size=128)):
    batch = torch.LongTensor(batch)
    labels = torch.LongTensor(labels)

    optimizer.zero_grad()
    output = model(batch)
    
    loss = criterion(output, labels)
    loss.backward()
    
    optimizer.step()

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()
        stop += 1
    if stop == 10:
        break
    

Initializing batchs generator with 1313 batchs per epoch
Step = 1000, Avg Loss = 5.4950, Time = 8.14s
Step = 2000, Avg Loss = 5.0360, Time = 8.18s
Step = 3000, Avg Loss = 4.9155, Time = 7.52s
Step = 4000, Avg Loss = 4.8623, Time = 7.57s
Step = 5000, Avg Loss = 4.7669, Time = 7.86s
Step = 6000, Avg Loss = 4.7298, Time = 8.20s
Step = 7000, Avg Loss = 4.7059, Time = 8.20s
Step = 8000, Avg Loss = 4.7011, Time = 8.24s
Step = 9000, Avg Loss = 4.6778, Time = 8.11s
Step = 10000, Avg Loss = 4.6427, Time = 8.73s


### Negative Sampling

What is the most expensive step now? Computing the softmax and applying gradients for every word in the vocabulary $V$.

One way to handle this is *Negative Sampling*.

Instead of predicting the index of a word from its context, the model predicts the probability that a word $w$ can occur in a context $c$: $P(D = 1 | w, c)$.

You can use a regular sigmoid to get this probability:
$$ P (D = 1 | w, c) = \sigma (v_w ^ T u_c) = \frac 1 {1 + \exp (-v ^ T_w u_c)}. $$

Training then works as follows: for each (word, context) pair, we generate a set of negative examples:
<img src="https://github.com/DanAnastasyev/DeepNLP-Course/raw/master/Week%2003/Images/Negative_Sampling.png" width="50%">

For CBoW, the loss function will look like this:

$$ - \log \sigma (v_c ^ T u_c) - \sum_ {k = 1} ^ K \log \sigma (- \tilde v_k ^ T u_c), $$

where $v_c$ is the vector of the central word, $u_c$ is the context vector (the sum of the context word vectors), and $\tilde v_1, \ldots, \tilde v_K$ are the vectors of the sampled negative examples.

Compare this formula with the usual CBoW:
$$ - v_c ^ T u_c + \log \sum_ {i = 1} ^ {| V |} \exp (v_i ^ T u_c). $$
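
To make the difference in cost concrete, here is a minimal NumPy sketch that evaluates both losses for one example. The sizes, the random vectors, and the uniform choice of negatives are assumptions made purely for illustration. They are not part of the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_neg = 1000, 16, 5  # illustrative sizes (assumed)

V = rng.normal(scale=0.1, size=(vocab_size, dim))  # output vectors v_i, one row per word
u_c = rng.normal(scale=0.1, size=dim)              # context vector u_c
center = 42                                        # index of the true central word (assumed)


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)


# Full softmax loss: needs the score of every word in the vocabulary
scores = V @ u_c
softmax_loss = -scores[center] + np.log(np.sum(np.exp(scores)))

# Negative-sampling loss: needs only 1 + K rows of V
negatives = rng.integers(0, vocab_size, size=num_neg)  # uniform sampling, for brevity only
ns_loss = -log_sigmoid(V[center] @ u_c) - np.sum(log_sigmoid(-(V[negatives] @ u_c)))

print("softmax loss:", softmax_loss)
print("negative-sampling loss:", ns_loss)
```

The two values are not directly comparable as numbers. The point is that the second expression touches only `1 + num_neg` rows of `V`, so its gradient is sparse.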

Usually negative words are sampled from $U^{3/4}$, where $U$ is the unigram distribution, that is, each word's count divided by the total number of words.

We have already computed the frequencies: they are stored in `Counter(words)`. Simply convert them to probabilities, raise these probabilities to the power $\frac 3 4$, and renormalize so that they sum to one. Why $\frac 3 4$? Some intuition can be found in the following example (values shown before renormalization):

$$P(\text{is}) = 0.9, \ P(\text{is})^{3/4} = 0.92$$
$$P(\text{Constitution}) = 0.09, \ P(\text{Constitution})^{3/4} = 0.16$$
$$P(\text{bombastic}) = 0.01, \ P(\text{bombastic})^{3/4} = 0.032$$

The value for a high-frequency word barely changes in relative terms, while low-frequency words end up being sampled with noticeably greater probability.
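
The arithmetic above can be checked with a few lines of plain Python. This is only a sketch over the three example probabilities from the text, and the variable names are chosen here for illustration.

```python
# Example unigram probabilities taken from the text above
probs = {"is": 0.9, "Constitution": 0.09, "bombastic": 0.01}

# Raise each probability to the power 3/4 ...
smoothed = {word: p ** 0.75 for word, p in probs.items()}

# ... and renormalize so the result is a valid sampling distribution
total = sum(smoothed.values())
renormalized = {word: s / total for word, s in smoothed.items()}

for word in probs:
    print(f"{word:>12}: p={probs[word]:.3f}  "
          f"p^(3/4)={smoothed[word]:.3f}  renormalized={renormalized[word]:.3f}")
```

After renormalization the frequent word loses some probability mass and the rare word gains some, which is the effect the exponent is meant to produce.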

**Task** Implement your Negative Sampling.

First, let's set the distribution for sampling:

In [None]:
words_sum_count = sum(words_counter.values())
word_distribution = np.array([(words_counter[word] / words_sum_count) ** (3 / 4) for word in index2word])
# Strictly speaking, this is done a bit unfairly; it could be done better
word_distribution /= word_distribution.sum()

indices = np.arange(len(word_distribution))

np.random.choice(indices, p=word_distribution, size=(32, 5))

In [None]:
class NegativeSamplingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs, targets, num_samples):
        '''
        inputs: (batch_size, context_size)
        targets: (batch_size)
        num_samples: int
        '''
        
        <calculate u_c's>
        
        <calculate v_c>
        
        <sample indices>
        <calculate negative vectors v'_c>
        
        <apply F.logsigmoid to v_c * u_c and to -v'_c * u_c>
        
        <calc result loss>

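One possible way to fill in the template is sketched below. It is not the course's reference solution. The class name `NegativeSamplingSketch`, the use of a second `nn.Embedding` for the central word vectors instead of the template's `nn.Linear`, and passing `word_distribution` to the constructor are all assumptions made for this illustration. For simplicity, a sampled negative may occasionally coincide with the true target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NegativeSamplingSketch(nn.Module):
    """Hypothetical filling of the template above; names are illustrative."""

    def __init__(self, vocab_size, embedding_dim, word_distribution):
        super().__init__()
        self.in_embeddings = nn.Embedding(vocab_size, embedding_dim)   # context word vectors
        self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)  # central word vectors v
        # Sampling distribution over the vocabulary, e.g. normalized unigram^(3/4)
        self.register_buffer(
            "word_distribution",
            torch.as_tensor(word_distribution, dtype=torch.float),
        )

    def forward(self, inputs, targets, num_samples):
        # inputs: (batch_size, context_size), targets: (batch_size,)
        batch_size = inputs.shape[0]

        u_c = self.in_embeddings(inputs).sum(dim=1)   # (batch, dim), sum of context vectors
        v_c = self.out_embeddings(targets)            # (batch, dim)

        # Draw num_samples negative word indices per example
        neg_idx = torch.multinomial(
            self.word_distribution, batch_size * num_samples, replacement=True
        ).view(batch_size, num_samples)
        v_neg = self.out_embeddings(neg_idx)          # (batch, K, dim)

        pos_score = (v_c * u_c).sum(dim=1)                          # (batch,)
        neg_score = torch.bmm(v_neg, u_c.unsqueeze(2)).squeeze(2)   # (batch, K)

        # -log sigma(v_c^T u_c) - sum_k log sigma(-v~_k^T u_c), averaged over the batch
        loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
        return loss.mean()


# Tiny smoke test on random data (all sizes are assumptions)
torch.manual_seed(0)
toy_distribution = torch.ones(50) / 50
sketch = NegativeSamplingSketch(vocab_size=50, embedding_dim=8,
                                word_distribution=toy_distribution)
toy_inputs = torch.randint(0, 50, (4, 4))
toy_targets = torch.randint(0, 50, (4,))
toy_loss = sketch(toy_inputs, toy_targets, num_samples=5)
toy_loss.backward()
print(toy_loss.item())
```

Because the loss is computed inside `forward`, the training loop only needs to call `loss.backward()` on the returned value. No separate `nn.CrossEntropyLoss` criterion is required.
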
In [None]:
model = NegativeSamplingModel(vocab_size=len(word2index), embedding_dim=32).cuda()

optimizer = optim.Adam(model.parameters(), lr=0.01)  

loss_every_nsteps = 1000
total_loss = 0
start_time = time.time()

for step, (batch, labels) in enumerate(make_cbow_batchs_iter(contexts, window_size=2, batch_size=128)):
    <copy-paste (mostly) learning cycle>

    total_loss += loss.item()
    
    if step != 0 and step % loss_every_nsteps == 0:
        print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, 
                                                                    time.time() - start_time))
        total_loss = 0
        start_time = time.time()

## GloVe: Global Vectors

The Global Vector (GloVe) model proposed by Pennington et al. ([2014](http://www.aclweb.org/anthology/D14-1162)) aims to combine the count-based matrix factorization and the context-based skip-gram model together.

We all know that counts and co-occurrences can reveal the meanings of words. To distinguish it from $p(w_O \vert w_I)$ used in the context of word embedding models, we define the co-occurrence probability as:

$$
p_{\text{co}}(w_k \vert w_i) = \frac{C(w_i, w_k)}{C(w_i)}
$$

$C(w_i, w_k)$ counts the co-occurrence between words $w_i$ and $w_k$.


Say, we have two words, $w_i$="ice" and $w_j$="steam". The third word $\tilde{w}_k$="solid" is related to "ice" but not "steam", and thus we expect $p_{\text{co}}(\tilde{w}_k \vert w_i)$ to be much larger than $p_{\text{co}}(\tilde{w}_k \vert w_j)$ and therefore $\frac{p_{\text{co}}(\tilde{w}_k \vert w_i)}{p_{\text{co}}(\tilde{w}_k \vert w_j)}$ to be very large. If the third word $\tilde{w}_k$ = "water" is related to both or $\tilde{w}_k$ = "fashion" is unrelated to either of them, $\frac{p_{\text{co}}(\tilde{w}_k \vert w_i)}{p_{\text{co}}(\tilde{w}_k \vert w_j)}$ is expected to be close to one. 
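
The ratio argument can be tried out on a toy corpus. The sketch below is only an illustration: the five sentences are invented, the context window is taken to be the whole sentence, and $C(w_i)$ is assumed to be the sum of the co-occurrence counts in the row of $w_i$.

```python
from collections import Counter, defaultdict

# Invented toy corpus; real co-occurrence statistics need far more text
corpus = [
    "ice is solid water",
    "ice is a cold solid",
    "steam is a hot gas",
    "steam is water as gas",
    "heat turns solid ice into water then steam a gas",
]

# cooc[w_i][w_k] = number of times w_k appears in the same sentence as w_i
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w_i in enumerate(tokens):
        for k, w_k in enumerate(tokens):
            if i != k:
                cooc[w_i][w_k] += 1


def p_co(w_k, w_i):
    # p_co(w_k | w_i) = C(w_i, w_k) / C(w_i), with C(w_i) taken as the row sum
    return cooc[w_i][w_k] / sum(cooc[w_i].values())


ratios = {probe: p_co(probe, "ice") / p_co(probe, "steam")
          for probe in ("solid", "gas", "water")}

for probe, ratio in ratios.items():
    print(f"p_co({probe}|ice) / p_co({probe}|steam) = {ratio:.3f}")
```

On this corpus the ratio is well above one for "solid", well below one for "gas", and close to one for "water", matching the qualitative pattern described above.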

The intuition here is that word meanings are captured by the ratios of co-occurrence probabilities rather than by the probabilities themselves. GloVe models the relationship between two words with respect to a third context word as:

$$
F(w_i, w_j, \tilde{w}_k) = \frac{p_{\text{co}}(\tilde{w}_k \vert w_i)}{p_{\text{co}}(\tilde{w}_k \vert w_j)}
$$


Further, since the goal is to learn meaningful word vectors, $F$ is designed to be a function of the linear difference between two words $w_i - w_j$:

$$
F((w_i - w_j)^\top \tilde{w}_k) = \frac{p_{\text{co}}(\tilde{w}_k \vert w_i)}{p_{\text{co}}(\tilde{w}_k \vert w_j)}
$$

Considering that $F$ should be symmetric between target words and context words, the final solution is to model $F$ as an **exponential** function. Please read the original paper ([Pennington et al., 2014](http://www.aclweb.org/anthology/D14-1162)) for more details of the equations.

$$
\begin{align}
F({w_i}^\top \tilde{w}_k) &= \exp({w_i}^\top \tilde{w}_k) = p_{\text{co}}(\tilde{w}_k \vert w_i) \\
F((w_i - w_j)^\top \tilde{w}_k) &= \exp((w_i - w_j)^\top \tilde{w}_k) = \frac{\exp(w_i^\top \tilde{w}_k)}{\exp(w_j^\top \tilde{w}_k)} = \frac{p_{\text{co}}(\tilde{w}_k \vert w_i)}{p_{\text{co}}(\tilde{w}_k \vert w_j)}
\end{align}
$$

Finally,

$$
{w_i}^\top \tilde{w}_k = \log p_{\text{co}}(\tilde{w}_k \vert w_i) = \log \frac{C(w_i, \tilde{w}_k)}{C(w_i)} = \log C(w_i, \tilde{w}_k) - \log C(w_i)
$$

Since the second term $-\log C(w_i)$ is independent of $k$, we can absorb it into a bias term $b_i$ for $w_i$. To keep the symmetric form, we also add a bias $\tilde{b}_k$ for $\tilde{w}_k$.

$$
\log C(w_i, \tilde{w}_k) = {w_i}^\top \tilde{w}_k + b_i + \tilde{b}_k
$$

The loss function for the GloVe model is designed to preserve the above formula by minimizing the sum of the squared errors:

$$
\mathcal{L}_\theta = \sum_{i=1, j=1}^V f(C(w_i,w_j)) ({w_i}^\top \tilde{w}_j + b_i + \tilde{b}_j - \log C(w_i, \tilde{w}_j))^2
$$

The weighting scheme $f(c)$ is a function of the co-occurrence count of $w_i$ and $w_j$, and it is an adjustable model configuration. It should be close to zero as $c \to 0$, should be non-decreasing, since higher co-occurrence counts should have more impact, and should saturate when $c$ becomes extremely large. The paper proposes the following weighting function:

$$
f(c) = 
  \begin{cases}
  (\frac{c}{c_{\max}})^\alpha & \text{if } c < c_{\max} \text{, where } c_{\max} \text{ is adjustable,} \\
  1 & \text{otherwise.}
  \end{cases}
$$
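
As a rough illustration, the weighting function and the weighted squared-error loss can be written in NumPy as follows. The function names, the toy sizes, and the random count matrix are assumptions. The defaults `c_max=100` and `alpha=0.75` follow the values used in the paper's experiments. The sum runs only over pairs with a nonzero count, since $f(0) = 0$ makes the remaining terms vanish.

```python
import numpy as np


def glove_weight(c, c_max=100.0, alpha=0.75):
    # f(c) = (c / c_max)^alpha for c < c_max, and 1 otherwise
    c = np.asarray(c, dtype=float)
    return np.where(c < c_max, (c / c_max) ** alpha, 1.0)


def glove_loss(C, W, W_tilde, b, b_tilde):
    # Only nonzero counts contribute (f(0) = 0), which also avoids log(0)
    i, j = np.nonzero(C)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(C[i, j])
    return np.sum(glove_weight(C[i, j]) * err ** 2)


rng = np.random.default_rng(0)
vocab_size, dim = 6, 4  # toy sizes (assumed)
C = rng.integers(0, 5, size=(vocab_size, vocab_size)).astype(float)  # toy co-occurrence counts
W = rng.normal(size=(vocab_size, dim))        # word vectors w_i
W_tilde = rng.normal(size=(vocab_size, dim))  # context vectors w~_j
b = np.zeros(vocab_size)
b_tilde = np.zeros(vocab_size)

print("loss on random vectors:", glove_loss(C, W, W_tilde, b, b_tilde))

# Sanity check: counts generated exactly by the model give (numerically) zero loss
C_fit = np.exp(W @ W_tilde.T + b[:, None] + b_tilde[None, :])
print("loss on perfectly fitted counts:", glove_loss(C_fit, W, W_tilde, b, b_tilde))
```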


# References
[On word embeddings - Part 1, Sebastian Ruder](http://ruder.io/word-embeddings-1/)  
[On word embeddings - Part 2: Approximating the Softmax, Sebastian Ruder](http://ruder.io/word-embeddings-softmax/index.html)  
[Word2Vec Tutorial - The Skip-Gram Model, Chris McCormick](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)  
[Word2Vec Tutorial Part 2 - Negative Sampling, Chris McCormick](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) 

[Word2vec Parameter Learning Explained (2014), Xin Rong](https://arxiv.org/abs/1411.2738)  
[Neural word embedding as implicit matrix factorization (2014), Levy, Omer, and Yoav Goldberg](http://u.cs.biu.ac.il/~nlp/wp-content/uploads/Neural-Word-Embeddings-as-Implicit-Matrix-Factorization-NIPS-2014.pdf) 
[Two/Too Simple Adaptations of Word2Vec for Syntax Problems (2015), Ling, Wang, et al.](https://www.aclweb.org/anthology/N/N15/N15-1142.pdf)  
[Not All Neural Embeddings are Born Equal (2014)](https://arxiv.org/pdf/1410.0718.pdf)  
[Retrofitting Word Vectors to Semantic Lexicons (2014), M. Faruqui, et al.](https://arxiv.org/pdf/1411.4166.pdf)  
[All-but-the-top: Simple and Effective Postprocessing for Word Representations (2017), Mu, et al.](https://arxiv.org/pdf/1702.01417.pdf)  

[Skip-Thought Vectors (2015), Kiros, et al.](https://arxiv.org/pdf/1506.06726)  

[Backpropagation, Intuitions, cs231n + next parts in the Module 1](http://cs231n.github.io/optimization-2/)   
[Calculus on Computational Graphs: Backpropagation, Christopher Olah](http://colah.github.io/posts/2015-08-Backprop/)

[cs224n "Lecture 2 - Word Vector Representations: word2vec"](https://www.youtube.com/watch?v=ERibwqs9p38&index=2&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)  
[cs224n "Lecture 5 - Backpropagation"](https://www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLqdrfNEc5QnuV9RwUAhoJcoQvu4Q46Lja&t=0s)   
