# Word2vec

In this work you should implement Word2vec and compare its results with LSA word embeddings.

Code from the article by [Mateusz Bednarski](https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb) is used.

In [1]:
%env LC_ALL=en_US.UTF-8
%env LANG=en_US.UTF-8

import os
import math

import nltk
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]

torch.manual_seed(42)

env: LC_ALL=en_US.UTF-8
env: LANG=en_US.UTF-8


<torch._C.Generator at 0x1a0e71bf10>

## Data preparation
The idea is to compare LSA and Word2vec word embeddings, so we need to use the same data as we used for LSA. Load `vocabulary.txt` as a dictionary and `corpus_preprocessed.txt` as a list of documents.

In [2]:
data_dir = "../data/part1"
corpus_file = os.path.join(data_dir, "corpus_preprocessed.txt")
vocabulary_file = os.path.join(data_dir, "vocabulary.txt")

In [3]:
vocabulary = {}
with open(vocabulary_file) as f:
    vocabulary = {w.strip():i for i, w in enumerate(f.readlines())}

corpus = []
with open(corpus_file) as f:
    corpus = [[int(w) for w in document.strip().split()] for document in f.readlines()]

Next we need to prepare X and y data for training. You have two options: CBOW and skip-gram. To decide, take the remainder of your age divided by 2.

In [4]:
remainder = 1990 % 2
if remainder == 0:
    print('Implementing CBOW')
else:
    print('Implementing skip-gram')

Implementing CBOW


Data preparation is straightforward for both options. Iterate over each word in a document (the center word) and select all context words within a predefined window size. For example, the sentence "Mike is a dog" with window size = 2 results in the following pairs:

| Center        | Context       | 
| ------------- |:-------------:| 
| Mike          | is            |
| Mike          | a             |
| is            | Mike          |
| is            | a             |
| is            | dog           |
| a             | Mike          |
| a             | is            |
| a             | dog           |
| dog           | is            |
| dog           | a             |

*Keep in mind that our preprocessed corpus contains indices instead of actual words. Therefore your result should be pairs of indices.*
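
As a quick check of the windowing rule described above, here is a minimal sketch that reproduces the table for the toy sentence. The helper `make_pairs` and the `toy_vocabulary` mapping are hypothetical names introduced only for this illustration; they are not part of the assignment code.

```python
def make_pairs(tokens, window_size):
    """Return (center, context) pairs for every token in `tokens`."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # clamp the window so it never leaves the sentence
        first = max(0, center_pos - window_size)
        last = min(len(tokens), center_pos + window_size + 1)
        for context_pos in range(first, last):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, tokens[context_pos]))
    return pairs


# hypothetical toy vocabulary, standing in for vocabulary.txt
toy_vocabulary = {"Mike": 0, "is": 1, "a": 2, "dog": 3}
sentence = ["Mike", "is", "a", "dog"]

word_pairs = make_pairs(sentence, window_size=2)
index_pairs = make_pairs([toy_vocabulary[w] for w in sentence], window_size=2)

for (center, context), indices in zip(word_pairs, index_pairs):
    print(center, context, indices)
```

The same function works on words and on indices, so the index pairs are simply the word pairs mapped through the vocabulary.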

In [7]:
window_size = 2
pairs = []
# for each document
for document in tqdm_notebook(corpus):
    # for each word, treated as the center word
    for center_word_pos in range(len(document)):
        # for each window position
        for w in range(-window_size, window_size + 1):
            context_word_pos = center_word_pos + w
            # make sure we do not step outside the document or pair a word with itself
            if context_word_pos < 0 or context_word_pos >= len(document) or center_word_pos == context_word_pos:
                continue
            pairs.append((document[center_word_pos], document[context_word_pos]))

pairs = torch.tensor(pairs)

Select what (center or context) is your `x` and what is your `y`, according to your implementation (CBOW or skip-gram).

In [8]:
x = pairs[:, 1]
y = pairs[:, 0]

One-hot encoded vectors would be highly sparse and might not fit into memory. Fortunately we don't need them: the `nn.Embedding` module takes word indices as input, and `nn.NLLLoss` accepts the true labels as indices as well.
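
To see why indices are enough, the following sketch compares an `nn.Embedding` lookup with an explicit one-hot matrix product on a toy vocabulary. The sizes and index values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 5, 3       # toy sizes, chosen only for this example
embedding = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([0, 3, 3, 1])   # a small batch of word indices

# direct lookup: picks the rows of the weight matrix
looked_up = embedding(indices)

# the same result through explicit one-hot vectors
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight

print(looked_up.shape)
print(torch.allclose(looked_up, multiplied))
```

Both paths select the same rows of the weight matrix, but the lookup never materializes a tensor of size batch × vocabulary.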

---
## Neural Network

Next we define the main module of our network. Write the code for `__init__` and `forward`. You may define `forward` to take both `x` and `y`, since we won't use this network for prediction, only for training. This way you can use the [`CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss) module, which combines `LogSoftmax` and [`NLLLoss`](https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss) to simplify gradient calculation. Also keep in mind that `x` and `y` will contain batches of values.
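
The relationship between the loss modules can be checked numerically. In PyTorch, `CrossEntropyLoss` applied to raw scores matches `NLLLoss` applied to the output of `LogSoftmax`; the sketch below uses random scores and made-up targets purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(4, 6)              # batch of 4, toy vocabulary of 6 words
targets = torch.tensor([0, 5, 2, 2])    # true word indices, one per example

# one step: raw scores go straight into CrossEntropyLoss
combined = nn.CrossEntropyLoss()(logits, targets)

# two steps: log-probabilities first, then negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
two_step = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(combined, two_step))
```

This is why the network's last layer can output unnormalized scores when `CrossEntropyLoss` is used.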

In [9]:
class Word2vecModule(nn.Module):
    def __init__(self, vocabulary_length, embedding_dim):
        super(Word2vecModule, self).__init__()
        self.W_in = nn.Embedding(vocabulary_length, embedding_dim)
        self.W_out = nn.Linear(embedding_dim, vocabulary_length, bias=False)
        self.cross_entropy = nn.CrossEntropyLoss()
        
    def forward(self, x, y):
        h = self.W_in(x)
        u = self.W_out(h)
        loss = self.cross_entropy(u, y)
        return loss

What is coded should be trained. Define your optimizer.

In [10]:
n_embedding = 200
word2vec = Word2vecModule(len(vocabulary), n_embedding)
optimizer = optim.Adam(word2vec.parameters(), lr=0.001)

In [11]:
# shuffle training indices
batch_indices = torch.randperm(x.shape[0])
batch_size = 10000
# prepare minibatch generator
def batch_generator(batch_indices, batch_size):
    batches = math.ceil(len(batch_indices)/batch_size)
    for i in range(batches):
        batch_start = i*batch_size
        batch_end = (i+1)*batch_size
        # slicing past the end is safe, so the last (shorter) batch needs no special case
        batch = batch_indices[batch_start:batch_end]
        # yield inputs and targets taken at the same shuffled positions
        yield x[batch], y[batch]
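
As an alternative sketch of the same mini-batching idea, `Tensor.split` can cut the shuffled index tensor into chunks, with a shorter final chunk when the size does not divide evenly. The tensors `x_toy` and `y_toy` and the helper `split_batches` are hypothetical stand-ins for this illustration, not the notebook's real data.

```python
import torch

torch.manual_seed(0)

x_toy = torch.arange(10)          # stand-in for the input indices
y_toy = torch.arange(10) * 10     # stand-in for the target indices
order = torch.randperm(x_toy.shape[0])


def split_batches(x, y, order, batch_size):
    """Yield aligned (x, y) mini-batches following the shuffled order."""
    for chunk in order.split(batch_size):
        # index both tensors with the same chunk so pairs stay aligned
        yield x[chunk], y[chunk]


batches = list(split_batches(x_toy, y_toy, order, batch_size=4))
print([len(x_batch) for x_batch, _ in batches])
```

Indexing inputs and targets with the same chunk keeps every input next to its own target after shuffling.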

In [None]:
for epoch in range(100):
    description = 'Training epoch {}'.format(epoch+1)
    batches = math.ceil(len(batch_indices)/batch_size)
    total_loss = 0
    for x_batch, y_batch in tqdm_notebook(batch_generator(batch_indices, batch_size), desc=description, total=batches):
        optimizer.zero_grad()
        loss = word2vec(x_batch, y_batch)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    print('Epoch finished, average loss: {0:.4f}'.format(total_loss/batches))

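
Once training has finished, the learned word vectors are the rows of the input embedding matrix (`W_in.weight` in the module above). The sketch below shows one possible way to inspect such vectors through cosine similarity, which may help when comparing against LSA. It uses a randomly initialized toy embedding as a stand-in for a trained model, and `nearest_words` and `toy_vocabulary` are hypothetical names introduced for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# hypothetical toy vocabulary and an untrained embedding, used only as stand-ins
toy_vocabulary = {"mike": 0, "is": 1, "a": 2, "dog": 3, "cat": 4}
toy_embedding = nn.Embedding(len(toy_vocabulary), 8)


def nearest_words(word, vocabulary, embedding_matrix, top_k=3):
    """Return the top_k (word, cosine similarity) pairs closest to `word`."""
    index_to_word = {i: w for w, i in vocabulary.items()}
    vectors = F.normalize(embedding_matrix, dim=1)        # unit-length rows
    similarities = vectors @ vectors[vocabulary[word]]    # cosine similarities
    similarities[vocabulary[word]] = -float("inf")        # exclude the query itself
    top = torch.topk(similarities, k=top_k)
    return [(index_to_word[i], s)
            for s, i in zip(top.values.tolist(), top.indices.tolist())]


neighbours = nearest_words("dog", toy_vocabulary,
                           toy_embedding.weight.detach(), top_k=3)
for neighbour, similarity in neighbours:
    print(neighbour, round(similarity, 3))
```

With a trained model, passing `word2vec.W_in.weight.detach()` and the real `vocabulary` instead of the toy objects should give neighbours that reflect the corpus; with the random stand-in the ranking is meaningless.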