# Word2vec

In this work you should implement Word2vec and compare its results with LSA word embeddings.

In [1]:
%env LC_ALL=en_US.UTF-8
%env LANG=en_US.UTF-8

import os
import math

import nltk
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]

torch.manual_seed(42)

env: LC_ALL=en_US.UTF-8
env: LANG=en_US.UTF-8


<torch._C.Generator at 0x10b414f70>

## Data preparation
The idea is to compare LSA and Word2vec word embeddings, so we need to use the same data as we used for LSA. Load `vocabulary.txt` as a dictionary and `corpus_preprocessed.txt` as a list of documents.

In [2]:
data_dir = "../data/part1"
corpus_file = os.path.join(data_dir, "corpus_preprocessed.txt")
vocabulary_file = os.path.join(data_dir, "vocabulary.txt")

In [8]:
with open(vocabulary_file, 'r') as f:
    vocabulary = {w.strip():i for i, w in enumerate(f.readlines())}

with open(corpus_file, 'r') as f:
    corpus = [list(map(int, line.split())) for line in f.readlines()]

Next we need to prepare the X and y data for training. You have two options: CBOW and skip-gram. To decide, take the remainder of your age divided by 2.

In [10]:
remainder = 1990 % 2
if remainder == 0:
    print('Implementing CBOW')
else:
    print('Implementing skip-gram')

Implementing CBOW


*Disclaimer: if you are already familiar with Word2vec, you can implement the GloVe or fastText approach instead. For the sake of productivity, adjust the complexity level to suit you.*

Data preparation is straightforward for both options. Iterate over each word in a document (the center word) and select all context words within a predefined window size. For example, the sentence "Mike is a dog" with window size = 2 results in the following pairs:

| Center        | Context       | 
| ------------- |:-------------:| 
| Mike          | is            |
| Mike          | a             |
| is            | Mike          |
| is            | a             |
| is            | dog           |
| a             | Mike          |
| a             | is            |
| a             | dog           |
| dog           | is            |
| dog           | a             |

*Keep in mind that our preprocessed corpus contains indices instead of actual words, so your result should be pairs of indices.*
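As a quick sanity check of the windowing rule, the sketch below reproduces the table above for the example sentence. The helper name `context_pairs` and the toy word-to-index mapping are illustrative assumptions, not part of the assignment data.

```python
def context_pairs(tokens, window_size):
    """Return (center, context) index pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        # clamp the window to the document boundaries
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for k in range(start, end):
            if k != i:  # a word is not its own context
                pairs.append((center, tokens[k]))
    return pairs

# hypothetical toy vocabulary, used only for this example
toy_vocabulary = {"Mike": 0, "is": 1, "a": 2, "dog": 3}
toy_document = [toy_vocabulary[w] for w in "Mike is a dog".split()]

toy_pairs = context_pairs(toy_document, window_size=2)
print(toy_pairs)
```

With a window of 2, "Mike" and "dog" are three positions apart, so that pair never appears.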

In [14]:
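window_size = 2
pairs = []

for d in corpus:
    for i, center in enumerate(d):
        # look at up to window_size positions on each side of the center word
        for k in range(i-window_size, i+window_size+1):
            # skip positions outside the document and the center word itself
            if k < 0 or k > len(d)-1 or k == i:
                continue
            context = d[k]
            # stored as [context, center]: column 0 is context, column 1 is center
            pairs.append([context, center])

pairs = torch.tensor(pairs)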

Select which (center or context) is your `x` and which is your `y`, according to your implementation (CBOW or skip-gram).

In [23]:
x = pairs[:, 0]  # context word index (input)
y = pairs[:, 1]  # center word index (target)

One-hot encoded vectors would be highly sparse and might not fit into memory. But we don't need them: the `nn.Embedding` module takes word indices as input, and `nn.NLLLoss` accepts the true values as indices as well.
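To illustrate why indices are enough, here is a minimal sketch on toy sizes (the sizes and tensors are made up for the example). It checks that an `nn.Embedding` lookup gives the same result as multiplying one-hot vectors by the weight matrix, and that `nn.NLLLoss` takes class indices as targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 5, 3  # toy sizes, chosen only for illustration
embedding = nn.Embedding(vocab_size, embedding_dim)
indices = torch.tensor([0, 3, 3, 1])

# lookup by index: one row of the weight matrix per input index
looked_up = embedding(indices)

# the same result via explicit one-hot vectors and a matrix product
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight
same = torch.allclose(looked_up, multiplied)
print(same)

# NLLLoss expects log-probabilities and integer class indices as targets
log_probs = F.log_softmax(torch.randn(4, vocab_size), dim=1)
targets = torch.tensor([2, 0, 4, 1])
loss = nn.NLLLoss()(log_probs, targets)
```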

---
## Neural Network

Next we define the main module of our network. Write the code for `__init__` and `forward`. You may define `forward` to take both `x` and `y`, since we won't use this network for prediction, only for training. This way you can use the [`CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss) module, which combines `LogSoftmax` and [`NLLLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss) to simplify gradient calculation. Also keep in mind that `x` and `y` will contain batches of values.
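As a small check of how `CrossEntropyLoss` relates to `NLLLoss`, the sketch below compares it with a log-softmax followed by `NLLLoss` on random scores. The tensors are toy values invented for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(4, 6)             # unnormalized scores: (batch_size, n_classes)
targets = torch.tensor([1, 0, 5, 2])   # true class indices: (batch_size,)

# one step: cross-entropy applied directly to raw scores
ce = nn.CrossEntropyLoss()(scores, targets)

# two steps: log-softmax over classes, then negative log-likelihood
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

losses_match = torch.allclose(ce, nll)
print(losses_match)
```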

In [20]:
import torch.nn.init as init

class Word2vecModule(nn.Module):
    def __init__(self, vocabulary_length, embedding_dim):
        super(Word2vecModule, self).__init__()
        self.W1 = nn.Embedding(vocabulary_length, embedding_dim)
        self.W2 = nn.Parameter(torch.empty(embedding_dim, vocabulary_length))
        init.normal_(self.W2)  # in-place initializer; init.normal is deprecated
        self.loss = nn.CrossEntropyLoss()
        
    def forward(self, x, y): # x, y: (batch_size,) tensors of word indices
        # v: (batch_size, embedding_dim)
        v = self.W1(x)
        # h: (batch_size, vocabulary_length)
        h = v.mm(self.W2)
        
        loss = self.loss(h, y)
        return loss

What is coded should be trained. Define your optimizer.

In [21]:
n_embedding = 2
word2vec = Word2vecModule(len(vocabulary), n_embedding)
optimizer = optim.Adam(word2vec.parameters(), lr=0.001)

  


In [24]:
# shuffle training indices
batch_indices = torch.randperm(x.shape[0])
batch_size = 10000
# prepare minibatch generator
def batch_generator(batch_indices, batch_size):
    batches = math.ceil(len(batch_indices)/batch_size)
    for i in range(batches):
        # slicing past the end is safe, so the last batch is simply shorter
        batch = batch_indices[i*batch_size:(i+1)*batch_size]
        # yield the inputs together with their matching targets
        yield x[batch], y[batch]

In [None]:
for epoch in range(100):
    description = 'Training epoch {}'.format(epoch+1)
    batches = math.ceil(len(batch_indices)/batch_size)
    total_loss = 0
    for x_batch, y_batch in tqdm_notebook(batch_generator(batch_indices, batch_size), desc=description, total=batches):
        optimizer.zero_grad()
        loss = word2vec(x_batch, y_batch)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    print('Epoch finished, average loss: {0:.4f}'.format(total_loss/batches))


### Evaluation

1. Load your saved LSA embeddings and plot your test words using LSA and Word2vec principal components;
2. Try to evaluate word similarity for Word2vec and compare the results with LSA.
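One possible way to evaluate word similarity is cosine similarity between embedding rows. The sketch below uses a hypothetical helper `most_similar` and a made-up toy embedding matrix. For the trained model, the matrix could presumably be taken from `word2vec.W1.weight.detach().numpy()`, and the same helper could be applied to the LSA embeddings.

```python
import numpy as np

def most_similar(word, embeddings, vocabulary, top_n=3):
    """Return the top_n words closest to `word` by cosine similarity."""
    index_to_word = {i: w for w, i in vocabulary.items()}
    # normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    query = vocabulary[word]
    similarities = unit @ unit[query]
    order = np.argsort(-similarities)  # most similar first
    neighbours = [(index_to_word[int(i)], float(similarities[i]))
                  for i in order if int(i) != query]
    return neighbours[:top_n]

# toy data, invented only for this example
toy_vocabulary = {"cat": 0, "dog": 1, "car": 2}
toy_embeddings = np.array([[1.0, 0.1],
                           [0.9, 0.2],
                           [-0.2, 1.0]])

neighbours = most_similar("cat", toy_embeddings, toy_vocabulary, top_n=2)
print(neighbours)
```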

In [26]:
torch.save(word2vec, 'model.pth')


In [27]:
word2vec = torch.load('model.pth')
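
Saving the whole module with `torch.save(word2vec, ...)` relies on pickle and needs the class definition to be importable when loading. A commonly used alternative is to save only the `state_dict`. The sketch below shows that round trip on a small stand-in `nn.Embedding`; the stand-in module and the temporary file path are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Embedding(10, 2)  # stand-in for the trained module

# save only the parameters, not the pickled module object
path = os.path.join(tempfile.mkdtemp(), "model_state.pth")
torch.save(model.state_dict(), path)

# recreate the module with the same sizes, then load the parameters into it
restored = nn.Embedding(10, 2)
restored.load_state_dict(torch.load(path))

weights_match = torch.equal(model.weight, restored.weight)
print(weights_match)
```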