# Word2vec using PyTorch

This notebook shows how to implement word2vec, an NLP technique for learning word embeddings, using PyTorch. The tutorial explains:
1. how to generate the dataset
2. how to build the neural network
3. how to sample data

## The data

Let's introduce the basic concepts:

- *Corpus*: the set of texts that make up the data set.
- *Vocabulary*: the set of distinct words in the data set.

For this example, we use the [Reuters dataset](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html), which is also available through sklearn and nltk. The text is converted to lowercase, and every character other than a letter, a hyphen or a space is removed.
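
To make the two concepts concrete, here is a minimal sketch on two made-up sentences rather than on Reuters. The `tokenize` helper and the `toy_*` names are illustrative assumptions, and the vocabulary is sorted only so that the indices are reproducible.

```python
import re

def tokenize(text):
    """Lowercase, drop everything except letters, hyphens and spaces, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z\- ]+", "", text)
    return text.split()

# Two made-up documents standing in for the corpus
toy_texts = ["The cat sleeps.", "The dog barks, the cat runs!"]

# Corpus: one list of tokens per document
toy_corpus = [tokenize(t) for t in toy_texts]

# Vocabulary: the distinct words across all documents (sorted for stable indices)
toy_vocabulary = sorted(set(w for doc in toy_corpus for w in doc))
toy_word_to_index = {w: idx for idx, w in enumerate(toy_vocabulary)}

print(toy_corpus)
# [['the', 'cat', 'sleeps'], ['the', 'dog', 'barks', 'the', 'cat', 'runs']]
print(toy_vocabulary)
# ['barks', 'cat', 'dog', 'runs', 'sleeps', 'the']
```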

In [131]:
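import itertools
import re

import nltk
from nltk.corpus import reuters

nltk.download('reuters')

corpus = []

# Only the first 500 documents are used
for text_id in reuters.fileids()[:500]:
    text = reuters.raw(text_id)
    text = text.lower()
    # Drop tags of the form "&lt;...>"
    text = re.sub(r"&lt;[a-z\- ]+>", '', text)
    # Keep only letters, hyphens and whitespace (so that words on
    # consecutive lines are not glued together)
    text = re.sub(r'[^a-z\-\s]+', '', text)
    # split() without argument splits on any whitespace and skips empty strings
    corpus.append(text.split())

vocabulary = set(itertools.chain.from_iterable(corpus))

word_to_index = {w: idx for (idx, w) in enumerate(vocabulary)}
index_to_word = {idx: w for (idx, w) in enumerate(vocabulary)}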

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/rguigoures/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


Word2vec is a bag-of-words approach. For each word in the data set, we need to extract its context words, i.e. the neighboring words within a window of fixed size. For example, consider the following sentence:

*My cat is lazy, he sleeps all day long*

For the target word *lazy* and a window of size 2, the context words are *cat*, *is*, *he* and *sleeps*.
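
The window extraction can be sketched on this very sentence. The `context_words` helper is a hypothetical name introduced for illustration; it takes `window` words on each side of the target and leaves out the target itself.

```python
def context_words(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    start = max(0, i - window)
    # +1 because range() excludes its upper bound
    end = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, end) if j != i]

sentence = "my cat is lazy he sleeps all day long".split()
lazy_context = context_words(sentence, sentence.index("lazy"), window=2)
print(lazy_context)
# ['cat', 'is', 'he', 'sleeps']
```

Near the start or end of a text the window is simply truncated, so the first word of the sentence only has right-hand context words.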

In [132]:
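import numpy as np

context_tuple_list = []
w = 2  # window size: number of context words taken on each side of the target

for text in corpus:
    for i, word in enumerate(text):
        first_context_word_index = max(0, i - w)
        # +1 because range() excludes its upper bound
        last_context_word_index = min(i + w + 1, len(text))
        for j in range(first_context_word_index, last_context_word_index):
            if i != j:
                # Store the pair (target word, context word)
                context_tuple_list.append((word, text[j]))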

## The neural network

There are two approaches to word2vec:

- CBOW (Continuous Bag Of Words). It predicts the target word given its context. In other words, the context words are the input and the target word is the output.
- Skip-gram. It predicts the context given the target word. In other words, the target word is the input and the context words are the output.
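
The difference between the two approaches shows up in how training examples are laid out. The sketch below builds both kinds of examples from a short made-up sentence; `cbow_examples` and `skipgram_examples` are hypothetical helper names, and no network is involved.

```python
def cbow_examples(tokens, window):
    """CBOW: one example per position, (list of context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window):
    """Skip-gram: one example per pair, target word -> one context word."""
    return [(target, c)
            for context, target in cbow_examples(tokens, window)
            for c in context]

tokens = "my cat is lazy".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # (['my', 'is'], 'cat')
print(skipgram[:3])  # [('my', 'cat'), ('cat', 'my'), ('cat', 'is')]
```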

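The following code implements a simplified form of CBOW: each training example uses a single context word as the input and the target word as the output.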

In [133]:
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.optim as optim
import torch.nn.functional as F


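class Word2Vec(nn.Module):

    def __init__(self, embedding_size, vocab_size):
        super().__init__()
        # Lookup table mapping each word index to its embedding vector
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        # Projects an embedding to one score per vocabulary word
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, context_word):
        emb = self.embeddings(context_word)
        hidden = self.linear(emb)
        # Normalize over the vocabulary dimension to get log-probabilities
        out = F.log_softmax(hidden, dim=1)
        return out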
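
The last step of the forward pass turns one score per vocabulary word into log-probabilities, and training then minimizes the negative log-probability of the correct word. Here is a minimal pure-Python sketch of that computation, using made-up scores for a three-word vocabulary. The `log_softmax` helper and the numbers are illustrative and are not taken from the notebook's model.

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed stably by subtracting the maximum score."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

# Made-up scores for a vocabulary of three words
scores = [2.0, 1.0, 0.1]
log_probs = log_softmax(scores)

# The probabilities recovered from the log-probabilities sum to 1
total_probability = sum(math.exp(lp) for lp in log_probs)

# Loss for one training pair whose correct word has index 0
target_index = 0
loss = -log_probs[target_index]
print(total_probability, loss)
```

The loss shrinks toward 0 as the score of the correct word grows relative to the others.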

In [134]:
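class EarlyStopping():
    """Stop training once the loss varies by less than min_gain over the last `patience` epochs."""

    def __init__(self, patience=5, min_gain=0.01):
        self.patience = patience
        # Seeded with 0 so that the spread stays large until
        # `patience` real losses have been recorded
        self.loss_list = [0]
        self.min_gain = min_gain

    def update_loss(self, loss):
        self.loss_list.append(loss)
        # Keep only the most recent `patience` values
        if len(self.loss_list) > self.patience:
            del self.loss_list[0]

    def stop_training(self):
        # True when the recent losses span less than min_gain
        return max(self.loss_list) - min(self.loss_list) < self.min_gain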
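
The class above reports a plateau when the most recent losses span less than `min_gain`. The sketch below applies the same rule to a made-up loss sequence. It is a self-contained variant, not the notebook's class: the `LossPlateau` name and the use of `collections.deque` are assumptions made for illustration.

```python
from collections import deque

class LossPlateau:
    """Hypothetical stand-in: flags a plateau over the last `patience` losses."""

    def __init__(self, patience=5, min_gain=0.01):
        self.min_gain = min_gain
        # deque(maxlen=...) discards the oldest value automatically
        self.recent = deque(maxlen=patience)

    def update(self, loss):
        self.recent.append(loss)

    def should_stop(self):
        # Only decide once a full window of losses has been seen
        if len(self.recent) < self.recent.maxlen:
            return False
        return max(self.recent) - min(self.recent) < self.min_gain

monitor = LossPlateau(patience=3, min_gain=0.01)
stopped_at = None
for epoch, epoch_loss in enumerate([2.0, 1.0, 0.5, 0.499, 0.498, 0.497]):
    monitor.update(epoch_loss)
    if monitor.should_stop():
        stopped_at = epoch
        break

print(stopped_at)
# 4
```

Training stops at epoch 4 because that is the first epoch at which the last three losses (0.5, 0.499, 0.498) differ by less than 0.01.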

In [None]:
vocabulary_size = len(vocabulary)

# The network already outputs log-probabilities, so the matching loss
# is the negative log-likelihood
loss_function = nn.NLLLoss()
net = Word2Vec(embedding_size=2, vocab_size=vocabulary_size)
optimizer = optim.Adam(net.parameters())
early_stopping = EarlyStopping()
context_tensor_list = []

# Convert each (target, context) pair of words into a pair of index tensors
for target, context in context_tuple_list:
    target_tensor = torch.tensor([word_to_index[target]], dtype=torch.long)
    context_tensor = torch.tensor([word_to_index[context]], dtype=torch.long)
    context_tensor_list.append((target_tensor, context_tensor))

while True:
    losses = []
    for target_tensor, context_tensor in context_tensor_list:
        net.zero_grad()
        # Predict the target word from a single context word
        log_probs = net(context_tensor)
        loss = loss_function(log_probs, target_tensor)
        loss.backward()
        optimizer.step()
        # item() extracts the loss as a plain Python float
        losses.append(loss.item())
    mean_loss = np.mean(losses)
    print(mean_loss)
    early_stopping.update_loss(mean_loss)
    if early_stopping.stop_training():
        break



In [None]:
import numpy as np

def get_closest_word(word):
    """Return all other words sorted by increasing distance to `word` in embedding space."""
    word_distance = []
    emb = net.embeddings
    pdist = nn.PairwiseDistance()  # Euclidean distance by default (p=2)
    i = word_to_index[word]
    lookup_tensor_i = torch.tensor([i], dtype=torch.long)
    # No gradients are needed when only reading the embeddings
    with torch.no_grad():
        v_i = emb(lookup_tensor_i)
        for j in range(len(vocabulary)):
            if j != i:
                lookup_tensor_j = torch.tensor([j], dtype=torch.long)
                v_j = emb(lookup_tensor_j)
                word_distance.append((index_to_word[j], float(pdist(v_i, v_j))))
    word_distance.sort(key=lambda x: x[1])
    return word_distance

In [None]:
print(get_closest_word('france')[:10])