# <center> Diving into Word2Vec basics. Skip-Gram implementation with PyTorch. </center>
##### <center> By Artur Kolishenko (@payonear) </center>

### Structure
1. [Import libraries](#Import-libraries)
2. [Skip-Gram example with PyTorch](#Skip-Gram-example-with-PyTorch)
3. [Summary](#Summary)

Word embeddings are extremely important for NLP tasks. Ready-to-use solutions like GloVe, FastText etc. are very useful and relatively efficient, so why waste time on Skip-Gram, which empirically works worse and needs a lot of time to fit? The goal of this notebook is to implement the model manually to get a deeper understanding of the Word2Vec mechanism, and to prepare myself, and hopefully readers, for diving into more complex solutions. I'll use PyTorch in the notebook; if you're not familiar with it, no problem, the code is commented to make it easier to follow. It is assumed, however, that the reader is familiar with basic neural network concepts. I also won't discuss here why we need Word2Vec, as there is plenty of useful information on this topic across the internet.

### Import libraries

In [None]:
import torch
torch.manual_seed(10)
from torch.autograd import Variable
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from sklearn import decomposition
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (10,8)
import nltk
#Import stopwords
from nltk.corpus import stopwords

### Skip-Gram example with PyTorch

Suppose we have a simplified corpus like the one below.

In [None]:
corpus = [
    'drink milk',
    'drink cold water',
    'drink cold cola',
    'drink juice',
    'drink cola',
    'eat bacon',
    'eat mango',
    'eat cherry',
    'eat apple',
    'juice with sugar',
    'cola with sugar',
    'mango is fruit',
    'apple is fruit',
    'cherry is fruit',
    'Berlin is Germany',
    'Boston is USA',
    'Mercedes from Germany',
    'Mercedes is a car',
    'Ford from USA',
    'Ford is a car'
]

A Skip-Gram model tries to predict the context given a word. As input it expects a word, and as output the words that often appear alongside the input word. Below I implement some supporting functions.

In [None]:
def create_vocabulary(corpus):
    '''Creates a dictionary with all unique words in corpus with id'''
    vocabulary = {}
    i = 0
    for s in corpus:
        for w in s.split():
            if w not in vocabulary:
                vocabulary[w] = i
                i+=1
    return vocabulary

def prepare_set(corpus, n_gram = 1):
    '''Creates a dataset with Input column and Outputs columns for neighboring words.
       The number of neighbors = n_gram*2'''
    columns = ['Input'] + [f'Output{i+1}' for i in range(n_gram*2)]
    rows = []
    for sentence in corpus:
        words = sentence.split()
        for i,w in enumerate(words):
            inp = [w]
            out = []
            for n in range(1,n_gram+1):
                # look back
                if (i-n)>=0:
                    out.append(words[i-n])
                else:
                    out.append('<padding>')
                
                # look forward
                if (i+n)<len(words):
                    out.append(words[i+n])
                else:
                    out.append('<padding>')
            rows.append(inp+out)
    # build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns = columns)

def prepare_set_ravel(corpus, n_gram = 1):
    '''Creates a dataset with Input column and Output column for neighboring words.
       Each word yields up to n_gram*2 (Input, Output) rows; no padding is added'''
    columns = ['Input', 'Output']
    rows = []
    for sentence in corpus:
        words = sentence.split()
        for i,w in enumerate(words):
            inp = w
            for n in range(1,n_gram+1):
                # look back
                if (i-n)>=0:
                    rows.append([inp, words[i-n]])
                
                # look forward
                if (i+n)<len(words):
                    rows.append([inp, words[i+n]])
    # build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns = columns)
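
The windowing logic of `prepare_set_ravel` can also be sketched without pandas. The helper below, `skipgram_pairs`, is a hypothetical name introduced only for this illustration; it assumes whitespace tokenization and a symmetric window, and emits one (center, context) pair per real neighbor, skipping positions that fall outside the sentence.

```python
def skipgram_pairs(sentence, window=1):
    """Return (center, context) pairs for every word in a sentence."""
    tokens = sentence.split()
    pairs = []
    for i, center in enumerate(tokens):
        for n in range(1, window + 1):
            # look back
            if i - n >= 0:
                pairs.append((center, tokens[i - n]))
            # look forward
            if i + n < len(tokens):
                pairs.append((center, tokens[i + n]))
    return pairs

pairs = skipgram_pairs("drink cold water", window=1)
print(pairs)
```

Each pair becomes one training example: the first element is the input word, the second is the word the model should predict.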

A bit of preprocessing.

In [None]:
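# requires the NLTK stopwords corpus and tokenizer data (see nltk.download)
stop_words = set(stopwords.words('english'))

def preprocess(corpus):
    '''Tokenizes, lowercases and removes English stop words from each sentence'''
    result = []
    for i in corpus:
        out = nltk.word_tokenize(i)
        out = [x.lower() for x in out]
        out = [x for x in out if x not in stop_words]
        result.append(" ".join(out))
    return result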

corpus = preprocess(corpus)
corpus

Here we create a vocabulary that assigns an id to each word appearing in the corpus.

In [None]:
vocabulary = create_vocabulary(corpus)
vocabulary

Below we can observe the logic. We take two neighbors from each side of the center word. There are many padding tokens because the maximal length of our sentences is 3, so each word has at least two padding neighbors out of four. This is done just for presentation purposes. The picture below the table should help to understand the logic better.

In [None]:
train_emb = prepare_set(corpus, n_gram = 2)
train_emb.head()

![caption](https://miro.medium.com/max/335/1*uYiqfNrUIzkdMrmkBWGMPw.png)

Let's put it in the form needed for training.

In [None]:
train_emb = prepare_set_ravel(corpus, n_gram = 2)
train_emb.head()

Now, replace words with their indexes.

In [None]:
train_emb.Input = train_emb.Input.map(vocabulary)
train_emb.Output = train_emb.Output.map(vocabulary)
train_emb.head()

OK, now we have some inputs and outputs; what do they mean? Time to build a model. The architecture of the Skip-Gram model is pretty simple. We have a one-hot encoded vector as input and a one-hot encoded vector as output. This means the input and output layers are vectors whose length equals the vocabulary size, where all elements are zero except the one that identifies which word of the vocabulary is encoded. Between the input and output layers there is a hidden layer of a length we choose. The length of the hidden layer determines the dimension of the embedding vectors. The most interesting elements of this network are the weights between the hidden layer and the other two layers. Multiplying a one-hot encoded vector by the weight matrix selects just the one corresponding row of that matrix. In other words, for a single training pair, the input-side weight matrix is updated only in the row that corresponds to that particular word. This allows us to use the rows of the weight matrix as vector representations of words afterwards.

![caption](https://www.researchgate.net/publication/322905432/figure/fig1/AS:614314310373461@1523475353979/The-architecture-of-Skip-gram-model-20.png)
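
The claim that a one-hot vector picks out a single row of the weight matrix is easy to check numerically. The sketch below uses toy sizes and a hand-filled matrix `W`, both chosen only for illustration and not taken from the notebook's trained weights.

```python
import torch

vocab_size, embedding_dims = 6, 3   # toy sizes, chosen only for this illustration
W = torch.arange(vocab_size * embedding_dims, dtype=torch.float32).reshape(vocab_size, embedding_dims)

word_id = 4
one_hot = torch.zeros(1, vocab_size)
one_hot[0, word_id] = 1.0

hidden = one_hot.mm(W)                 # matrix product, shape 1 x embedding_dims
row_lookup = W[word_id].unsqueeze(0)   # direct indexing of the same row

print(torch.equal(hidden, row_lookup))
```

This is also why the hidden layer needs no activation function here: its output is simply the embedding of the input word.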

It's time to define the loss function. In Skip-Gram we want to predict the context given a word, so we want to maximize the following expression:

\begin{align} 
& max \prod_{center} \prod_{context} P(context | center; \theta) 
\end{align}
<br />
<br />
We want to maximize it because we are interested in maximizing $P(context|center)$ for each `context`, `center` pair. Neural networks, however, are conventionally trained by minimizing a loss rather than maximizing an objective. So the problem transforms to:

\begin{align} 
& min -\prod_{center} \prod_{context} P(context | center; \theta) 
\end{align}
<br />
<br />
Taking the logarithm of the product (a monotonic transformation, so the location of the optimum does not change) lets us use its useful property, concretely:

\begin{align} 
& min -log \prod_{center} \prod_{context} P(context | center; \theta)
\end{align}
<br />
<br />
\begin{align} 
& log(a * b) = log(a) + log(b) 
\end{align}
<br />
<br />
\begin{align} 
& min -\sum_{center} \sum_{context} log\;P(context | center; \theta) 
\end{align}
<br />
<br />
It remains to define $P(context|center; \theta)$; here the softmax function is used:

\begin{align} 
& P(context|center) = \frac{exp(u^T_{context}v_{center})}{\sum_{\omega\in vocab} exp(u^T_{\omega} v_{center})} 
\end{align}
<br />
<br />
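where $u^T_{context}v_{center}$ is the scalar product of vectors $u$ and $v$ (`context` and `center` respectively; in the network, $v$ is a row of the input-side weight matrix and $u$ is a column of the output-side one). Summarizing, the cost or loss function looks like this: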

\begin{align} 
& min -\sum_{center} \sum_{context} log\;\frac{exp(u^T_{context}v_{center})}{\sum_{\omega\in vocab} exp(u^T_{\omega} v_{center})} 
\end{align}
<br />
<br />
Thanks to the PyTorch developers, the library contains the CrossEntropyLoss function, which computes the function above, except that by default it averages over the pairs instead of summing them. [See details in PyTorch documentation](https://pytorch.org/docs/stable/nn.html).
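
The correspondence between the formula and the built-in loss can be checked on random numbers. In the sketch below, `scores` stands for the scalar products $u^T_{\omega} v_{center}$ and `targets` for the indexes of the true context words; both are made-up toy values. Note that the built-in function averages over the pairs by default, so the manual version takes the mean as well; this only rescales the sum by a constant.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 7)             # 4 (center, context) pairs, toy vocabulary of 7 words
targets = torch.tensor([2, 0, 6, 3])   # index of the true context word for each pair

# Softmax probability of every word, written out as in the formula
probs = torch.exp(scores) / torch.exp(scores).sum(dim=1, keepdim=True)
# Negative log-probability of the true context word, averaged over the pairs
manual_loss = -torch.log(probs[torch.arange(4), targets]).mean()

builtin_loss = F.cross_entropy(scores, targets)   # default reduction is 'mean'
print(manual_loss.item(), builtin_loss.item())
```

The built-in version works on raw scores and applies the log-softmax internally, which is also numerically safer than exponentiating by hand.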

Let's write a supporting function.

In [None]:
vocab_size = len(vocabulary)

def get_input_tensor(tensor):
    '''Transform 1D tensor of word indexes to one-hot encoded 2D tensor'''
    size = tensor.shape[0]
    # put a 1 in column tensor[i] of row i; all other entries stay 0
    inp = torch.zeros(size, vocab_size).scatter_(1, tensor.unsqueeze(1), 1.)
    return inp.float()

We want our words to be represented by vectors of 5 elements, so `embedding_dims` equals 5. If you want to perform calculations on a GPU, just replace `cpu` with a GPU id, for example `cuda:1`.

In [None]:
embedding_dims = 5
device = torch.device('cpu')

Next, weight initialization. Look at the architecture of the network above. W1 is a matrix of shape **vocab_size $\times $ embedding_dims**, and W2 has shape **embedding_dims $\times $ vocab_size**. Note that we set `requires_grad` to True, because we want PyTorch to compute gradients for these weight matrices so that they can be optimized. `torch.randn` creates the tensors, and the in-place `uniform_` call then overwrites their values. It is very important to initialize weights correctly, which here means small random numbers. If you are not careful with this step, the model can produce unexpected and useless results. This is why `uniform_` is used: it draws the weights from the interval `(-0.5 / embedding_dims, 0.5 / embedding_dims)`.

In [None]:
initrange = 0.5 / embedding_dims
# Variable is deprecated; plain tensors track gradients once requires_grad is set
W1 = torch.randn(vocab_size, embedding_dims, device=device).uniform_(-initrange, initrange).float().requires_grad_() # shape V*H
W2 = torch.randn(embedding_dims, vocab_size, device=device).uniform_(-initrange, initrange).float().requires_grad_() # shape H*V
print(f'W1 shape is: {W1.shape}, W2 shape is: {W2.shape}')
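
As a quick check of the initialization range, the sketch below fills a tensor with `torch.nn.init.uniform_` and inspects its extremes. This is an alternative spelling of the same idea, not the code used above; `vocab_size = 12` is a toy value chosen for the example.

```python
import torch

embedding_dims = 5
vocab_size = 12                       # toy value for this sketch
initrange = 0.5 / embedding_dims

W = torch.empty(vocab_size, embedding_dims)
torch.nn.init.uniform_(W, -initrange, initrange)   # fills W in place
W.requires_grad_()                                  # track gradients from now on

print(W.min().item(), W.max().item())
```

With `embedding_dims = 5` every weight starts within 0.1 of zero, so the initial scores are close to zero and the softmax starts out near a uniform distribution over the vocabulary.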

Initialization of model hyperparameters.

In [None]:
num_epochs = 2000
learning_rate = 2e-1
lr_decay = 0.99
loss_hist = []

Here I use DataLoader from PyTorch, which loads data in batches. It's a very useful tool; if you are not familiar with it, I encourage you to have a look, as it saves unnecessary coding. My dataset is very small, so I don't really need DataLoader here; it's added just for fun, which is why batch_size equals the number of rows in the dataset. I simply use the whole dataset in one iteration, so here one iteration == one epoch. Note that the `get_input_tensor` function defined above is used only for the input layer, because CrossEntropyLoss expects the true outputs as a vector of class indexes in long format, which is what DataLoader provides.

In [None]:
%%time
# define loss func once, outside the loop
loss_f = torch.nn.CrossEntropyLoss() # see details: https://pytorch.org/docs/stable/nn.html

for epo in range(num_epochs):
    for x,y in zip(DataLoader(train_emb.Input.values, batch_size=train_emb.shape[0]), DataLoader(train_emb.Output.values, batch_size=train_emb.shape[0])):
        
        # one-hot encode input tensor and move the batch to the chosen device
        input_tensor = get_input_tensor(x).to(device) # shape N*V
        y = y.to(device)
     
        # simple NN architecture
        h = input_tensor.mm(W1) # shape N*H
        y_pred = h.mm(W2) # shape N*V
        
        # compute loss
        loss = loss_f(y_pred, y)
        
        # backpropagation step
        loss.backward()
        
        # Update weights using gradient descent. For this step we just want to mutate
        # the values of W1 and W2 in-place; we don't want to build up a computational
        # graph for the update steps, so we use the torch.no_grad() context manager
        # to prevent PyTorch from building a computational graph for the updates
        with torch.no_grad():
            # SGD optimization is implemented in PyTorch, but it's very easy to implement manually providing better understanding of process
            W1 -= learning_rate*W1.grad
            W2 -= learning_rate*W2.grad
            # zero gradients of both matrices for next step
            W1.grad.zero_()
            W2.grad.zero_()
    if epo%10 == 0:
        learning_rate *= lr_decay
    # store a plain float, not the tensor with its graph attached
    loss_hist.append(loss.item())
    if epo%50 == 0:
        print(f'Epoch {epo}, loss = {loss.item()}')
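
For comparison, the same two-matrix architecture can be expressed with PyTorch's built-in modules. The sketch below is not the notebook's implementation: the class name `SkipGram`, the toy index pairs and the hyperparameters are assumptions made for illustration. `nn.Embedding` performs the row lookup that the one-hot multiplication does above, `nn.Linear` without bias plays the role of W2, `TensorDataset` yields inputs and outputs together, and `torch.optim.SGD` replaces the manual update.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(10)

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_dims):
        super().__init__()
        self.center = nn.Embedding(vocab_size, embedding_dims)            # role of W1
        self.context = nn.Linear(embedding_dims, vocab_size, bias=False)  # role of W2

    def forward(self, word_ids):
        # look up center embeddings, then score every vocabulary word: shape N x V
        return self.context(self.center(word_ids))

# Toy (center, context) index pairs; in the notebook they would come from train_emb
inputs = torch.tensor([0, 1, 1, 2, 3, 4])
outputs = torch.tensor([1, 0, 2, 1, 4, 3])
loader = DataLoader(TensorDataset(inputs, outputs), batch_size=len(inputs))

model = SkipGram(vocab_size=5, embedding_dims=3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)
loss_f = nn.CrossEntropyLoss()

losses = []
for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()          # clears the gradients of every parameter
        loss = loss_f(model(x), y)
        loss.backward()
        optimizer.step()
    losses.append(loss.item())

print(f'first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}')
```

Because `optimizer.zero_grad()` resets all parameters at once, there is no per-matrix bookkeeping to get wrong. After training, `model.center.weight` holds the input-side embeddings.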
    

Let's have a look at our embeddings. W2 is transposed to get the shape [vocab_size, embedding_dims]. The weights are tensors, so we convert them to NumPy arrays.

In [None]:
# .cpu() is a no-op on CPU and makes the conversion work if a GPU was used
W1 = W1.detach().cpu().numpy()
W2 = W2.T.detach().cpu().numpy()

Using a decomposition (PCA, for example, or truncated SVD as below), we can visualize our vectors in a 2D vector space.

In [None]:
svd = decomposition.TruncatedSVD(n_components=2)
W1_dec = svd.fit_transform(W1)
x = W1_dec[:,0]
y = W1_dec[:,1]
plot = sns.scatterplot(x=x, y=y)  # recent seaborn versions require keyword arguments

for i in range(0,W1_dec.shape[0]):
     plot.text(x[i], y[i]+2e-2, list(vocabulary.keys())[i], horizontalalignment='center', size='small', color='black', weight='semibold');
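
PCA and truncated SVD are closely related but not identical: scikit-learn's `PCA` centers the data before decomposing it, while `TruncatedSVD` works on the matrix as given. The sketch below illustrates this on a random toy matrix `X` (an assumed stand-in for an embedding matrix, not the trained weights): after centering by hand, both give the same 2D projection up to the sign of each component.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(20, 5))      # toy stand-in for an embedding matrix

# PCA subtracts the column means internally
pca_2d = decomposition.PCA(n_components=2, svd_solver="full").fit_transform(X)

# TruncatedSVD does not center, so we center by hand first
X_centered = X - X.mean(axis=0)
svd = decomposition.TruncatedSVD(n_components=2, algorithm="arpack")
svd_2d = svd.fit_transform(X_centered)

# Component signs are arbitrary, so compare absolute values
same = np.allclose(np.abs(pca_2d), np.abs(svd_2d), atol=1e-6)
print(same)
```

For weights initialized symmetrically around zero the column means tend to be small, so the two projections usually look similar, but that is an expectation rather than a guarantee.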


In [None]:
W2_dec = svd.fit_transform(W2)
x1 = W2_dec[:,0]
y1 = W2_dec[:,1]
plot1 = sns.scatterplot(x=x1, y=y1)  # recent seaborn versions require keyword arguments
for i in range(0,W2_dec.shape[0]):
     plot1.text(x1[i], y1[i]+1, list(vocabulary.keys())[i], horizontalalignment='center', size='small', color='black', weight='semibold');

That's interesting: we can use W1 or W2 as embeddings, or maybe both. Look at the plots above. In the case of W1, we see that the embedding has learnt that Mercedes and Berlin are somehow similar, and so are Ford and Boston. That's because they often co-occur with Germany and USA respectively. Food items are highly concentrated, and so are beverages. Cola and juice are near sugar. W2 is interesting as well. The Ford-USA and Mercedes-Germany vectors seem quite similar, as do Boston-USA and Berlin-Germany. Again, fruits are highly concentrated, as they co-occur only with the fruit and eat tokens. That's why eat and fruit are close to each other as well.
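
Similarity read off a 2D plot can be misleading, since the projection discards three of the five dimensions. A common complement is cosine similarity in the full embedding space. The sketch below is illustrative only: the helpers `cosine_similarity_matrix` and `nearest` are hypothetical names, and `toy_E` holds hand-made vectors, not trained values.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarity between the rows of E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

def nearest(word, vocabulary, E, k=2):
    """Return the k words whose vectors are closest to `word` by cosine similarity."""
    words = list(vocabulary)                 # assumes dict order matches the ids
    sims = cosine_similarity_matrix(E)[vocabulary[word]]
    order = np.argsort(-sims)                # most similar first
    return [words[j] for j in order if j != vocabulary[word]][:k]

# Hand-made toy embeddings, NOT trained values
toy_vocab = {'cola': 0, 'juice': 1, 'ford': 2, 'mercedes': 3}
toy_E = np.array([[1.0, 0.1],
                  [0.9, 0.2],
                  [0.1, 1.0],
                  [0.2, 0.9]])

print(nearest('cola', toy_vocab, toy_E, k=1))
```

In the notebook, calling `nearest('cola', vocabulary, W1)` after the conversion to NumPy would apply the same check to the trained vectors.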

[This](https://ronxin.github.io/wevi/) is a fantastic interactive tool that helps to understand how these models work, for those who love good visualization. I suggest you have a look.

### Summary

The Skip-Gram model is a good Word2Vec mechanism that can sometimes capture the semantic similarity and context of words. But why is it not the best solution for this competition? First and most obvious, there is too little data; it is simply not enough to train a good Word2Vec model. Second, cross-entropy error has the unfortunate property that distributions with long tails are often modeled poorly, with too much weight given to unlikely events. Third, evaluating the normalization factor of the softmax for each term is costly, and training the model even on our competition's tweet corpus takes a long time. Fourth, Skip-Gram is sensitive to noise and vectorizes rare words from the corpus poorly. Nevertheless, understanding the mechanism of Skip-Gram is important for diving further into more complex Word2Vec models. I hope this notebook is useful. If so, please upvote.