# Word2Vec

> "Word2vec is a technique for natural language processing. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors." [ https://en.wikipedia.org/wiki/Word2vec ]


There are two Word2Vec architectures: 

- **CBOW (Continuous Bag-of-Words)** predicts the central word from the sum of its context word vectors. Because a sum ignores word order, the context is treated as a "bag of words", which gives the model its name.

- **Skip-Gram** predicts the context words given the central word. Skip-Gram with negative sampling is a particularly popular variant.
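
To make the difference between the two architectures concrete, the sketch below builds the training examples each one would see for a single short sentence. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) and the `PAD` filler are illustrative assumptions, not part of any library.

```python
PAD = "<pad>"  # hypothetical filler for positions outside the sentence


def context_window(tokens, i, k):
    """Return the k tokens on each side of position i, padding past the edges."""
    return [tokens[i + j] if 0 <= i + j < len(tokens) else PAD
            for j in range(-k, k + 1) if j != 0]


def cbow_examples(tokens, k=1):
    """CBOW: one example per position, (all context words) -> central word."""
    return [(context_window(tokens, i, k), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, k=1):
    """Skip-Gram: one example per (central word, context word) pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, k)
            if c != PAD]


tokens = "people create programs".split()
print(cbow_examples(tokens))
print(skipgram_examples(tokens))
```

CBOW produces one example per word with the whole window as input, while Skip-Gram produces one example per (central, context) pair, so it yields more, smaller training examples from the same text.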

Here we will build a PyTorch model that implements Word2Vec's CBOW strategy.

<img src="https://www.researchgate.net/profile/Daniel_Braun6/publication/326588219/figure/fig1/AS:652185784295425@1532504616288/Continuous-Bag-of-words-CBOW-CB-and-Skip-gram-SG-training-model-illustrations.png" width="60%" />

## What can we do with it?

To measure how close two words are, we usually compute the cosine similarity or the Euclidean distance between their vectors. With word embeddings, we can also build semantic proportions (also known as analogies) and solve examples like:

$$
\textit{king : man = queen : woman}  \\
\Downarrow \\
\textit{king - man + woman} \approx \textit{queen}
$$
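
As a rough illustration of how such an analogy can be solved, the sketch below uses hand-made 2D vectors rather than learned embeddings. The vectors, the axis meanings, and the `analogy` helper are assumptions made up for this example. The answer is the word whose vector has the highest cosine similarity to `king - man + woman`, excluding the three input words.

```python
import math

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


answer = analogy("king", "man", "woman", toy_vectors)
print(answer)
```

With real embeddings the match is only approximate, which is why the result is chosen by nearest neighbour rather than by exact equality.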

<img src="https://camo.githubusercontent.com/d136b7862ae0c1c6e55c2218bfd6749b5b927898a8bd696158bad6af6f58794f/68747470733a2f2f63646e2d696d616765732d312e6d656469756d2e636f6d2f6d61782f323630302f312a73584e5859664171664c556569445850436f313330772e706e67" />

## Implementing Word2vec CBOW

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(0);

In [None]:
class CBOW(nn.Module):

    def __init__(self, vocab_size, emb_size):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_size)
        self.linear = nn.Linear(emb_size, vocab_size)

    def forward(self, x):
        # (batch_size, context_size) -> (batch_size, context_size, emb_dim)
        x = self.word_emb(x)
        
        # (batch_size, context_size, emb_dim) -> (batch_size, emb_dim)
        x = x.sum(dim=1)

        # (batch_size, emb_dim) -> (batch_size, vocab_size)
        logits = self.linear(x)

        return torch.log_softmax(logits, dim=-1)
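
The shape comments in `forward` can be checked with a small standalone sketch. It repeats the same three steps (embed, sum, project) on random word ids. The sizes below are arbitrary values chosen for illustration, and the input is assumed to hold `2 * context_size` word ids per example (the words on both sides of the target).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_size = 10, 3      # arbitrary sizes for illustration
batch_size, context_size = 4, 2   # context_size words on each side

word_emb = nn.Embedding(vocab_size, emb_size)
linear = nn.Linear(emb_size, vocab_size)

# Random word ids standing in for a batch of context windows.
x = torch.randint(0, vocab_size, (batch_size, 2 * context_size))

embedded = word_emb(x)                                  # (4, 4, 3)
summed = embedded.sum(dim=1)                            # (4, 3)
log_probs = torch.log_softmax(linear(summed), dim=-1)   # (4, 10)

print(embedded.shape, summed.shape, log_probs.shape)
```

Each row of `log_probs` holds log-probabilities over the vocabulary, so exponentiating a row and summing it gives 1.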

## Exercise
Instantiate the model and write a proper training loop. Here are some functions to help you make the data ready for use:

## Data

In [None]:
from torch.utils.data import Dataset, DataLoader

class ContextDataset(Dataset):
    
    def __init__(self, tokenized_texts, context_size=2):
        super().__init__()
        # shifted by 2 due to special tokens for padding and unknown tokens
        self.word_to_ix = {}
        self.word_to_ix['<pad>'] = 0
        self.word_to_ix['<unk>'] = 1
        for text in tokenized_texts:
            self.add_to_vocab(text)
        self.context_size = context_size
        self.contexts = []
        self.targets = []
        for text in tokenized_texts:
            self.add_to_context_and_target(text)
    
    def add_to_vocab(self, text):
        for word in text:
            if word not in self.word_to_ix:
                self.word_to_ix[word] = len(self.word_to_ix)
    
    def add_to_context_and_target(self, text):
        # k words to the left and k to the right
        k = self.context_size
        for i in range(len(text)):
            context = [text[i+j] if 0 <= i+j < len(text) else '<pad>' for j in range(-k, k+1) if j != 0]
            target = text[i]
            self.contexts.append(self.get_words_ids(context))
            self.targets.append(self.get_word_id(target))
    
    def get_word_id(self, word):
        if word in self.word_to_ix:
            return self.word_to_ix[word]
        return self.word_to_ix['<unk>']

    def get_words_ids(self, words):
        return [self.get_word_id(w) for w in words]
    
    @property
    def ix_to_word(self):
        return list(self.word_to_ix.keys())
    
    @property
    def vocab_size(self):
        return len(self.word_to_ix)
    
    def __getitem__(self, idx):
        context = torch.tensor(self.contexts[idx], dtype=torch.long)
        target = torch.tensor(self.targets[idx], dtype=torch.long)
        return context, target
    
    def __len__(self):
        return len(self.contexts)


In [None]:
raw_texts = [
    "we are about to study the idea of a computational process .",
    "computational processes are abstract beings that inhabit computers .",
    "as they evolve , processes manipulate other abstract things called data .",
    "the evolution of a process is directed by a pattern of rules called a program .",
    "people create programs to direct processes .", 
    "in effect , we conjure the spirits of the computer with our spells ."
]
tokenized_texts = [text.lower().split() for text in raw_texts]

train_dataset = ContextDataset(tokenized_texts, context_size=2)
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
vocab = train_dataset.word_to_ix
print('Dataset size:', len(train_dataset))
print('Vocab size:', train_dataset.vocab_size)

## Model

In [None]:
emb_size = 2
lr = 0.1

model = CBOW(train_dataset.vocab_size, emb_size)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
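
The model above returns log-probabilities (it ends in `log_softmax`), which is why it is paired with `nn.NLLLoss`. As a quick sanity check, the sketch below suggests that this combination gives the same value as applying `nn.CrossEntropyLoss` directly to raw logits. The tensors are random stand-ins, not data from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-ins for a batch of 4 examples over a 10-word vocabulary.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 3, 9])

# log_softmax followed by NLLLoss ...
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), targets)
# ... versus CrossEntropyLoss applied to the raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```

Either pairing works, but mixing them (for example, feeding log-probabilities to `nn.CrossEntropyLoss`) would apply the softmax twice.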

## Training loop

In [None]:
def train_model(model, dataloader, optimizer, loss_function, num_epochs=1):
    model.train()
    losses = []
    for epoch in range(1, num_epochs+1):
        print('Starting epoch %d' % epoch)
        total_loss = 0
        hits = 0
        # iterate over the dataloader passed as argument, not a global variable
        for batch_x, batch_y in dataloader:
            optimizer.zero_grad()
            # the model returns log-probabilities, as expected by nn.NLLLoss
            log_probs = model(batch_x)
            loss = loss_function(log_probs, batch_y)
            loss.backward()
            optimizer.step()

            loss_value = loss.item()
            total_loss += loss_value
            losses.append(loss_value)
            y_pred = log_probs.argmax(dim=1)
            hits += torch.sum(y_pred == batch_y).item()
        # each batch loss is already a mean, so average over the number of batches
        avg_loss = total_loss / len(dataloader)
        print('Epoch loss: %.4f' % avg_loss)
        acc = hits / len(dataloader.dataset)
        print('Epoch accuracy: %.4f' % acc)
    print('Done!')
    return losses

In [None]:
losses = train_model(model, train_dataloader, optimizer, loss_function, num_epochs=10)

In [None]:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
ax.plot(losses, ".")

## Plot vectors

Since we mapped words to 2D vectors, we can actually plot them. In the real world, however, we would use much larger vector dimensionalities, so we would need some sort of dimensionality reduction algorithm to see a plot like this.
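
One common choice for that reduction step is PCA. The sketch below is a minimal version built on `torch.linalg.svd`, applied to a random matrix that stands in for a trained embedding table. The sizes and the `pca_2d` helper are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Stand-in for a trained embedding matrix: 50 words, 16 dimensions.
embeddings = torch.randn(50, 16)


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(dim=0, keepdim=True)
    # Rows of vh are the principal directions, ordered by decreasing variance.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:2].T


points = pca_2d(embeddings)
print(points.shape)
```

The resulting `(num_words, 2)` matrix could then be plotted in the same way as the 2D embeddings used here.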

In [None]:
def get_vector(w):
    return model.word_emb(torch.tensor(vocab[w]))

with torch.no_grad():
    fig, ax = plt.subplots(figsize=(12, 8))
    for w in train_dataset.word_to_ix:
        vec = get_vector(w)
        ax.plot(vec[0], vec[1], 'k.')
        ax.annotate(w, (vec[0], vec[1]), textcoords="offset points", xytext=(0, 5), ha='center')

## Finding closest words

In [None]:
def closest(word, n=10):
    vec = get_vector(word)
    all_dists = [(w, torch.dist(vec, get_vector(w)).item()) for w in vocab.keys()]
    return sorted(all_dists, key=lambda t: t[1])[:n]

In [None]:
closest('program', n=10)
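
The `closest` function above ranks words by Euclidean distance. A cosine-based variant is sketched below on a small hand-made embedding matrix. The vocabulary, the vectors, and the `closest_by_cosine` helper are hypothetical, and the code assumes the vocabulary dictionary lists words in the same order as the rows of the matrix.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary and embedding matrix (one row per word).
toy_vocab = {"process": 0, "program": 1, "data": 2, "spirits": 3}
toy_emb = torch.tensor([
    [1.0, 0.1],
    [2.0, 0.3],
    [0.1, 1.0],
    [-1.0, -0.2],
])


def closest_by_cosine(word, vocab, emb, n=3):
    """Rank all words by cosine similarity to `word`, most similar first."""
    query = emb[vocab[word]].unsqueeze(0)           # (1, emb_dim)
    sims = F.cosine_similarity(query, emb, dim=1)   # (vocab_size,)
    # Assumes dict order matches row order in `emb`.
    ranked = sorted(zip(vocab, sims.tolist()), key=lambda t: t[1], reverse=True)
    return ranked[:n]


result = closest_by_cosine("program", toy_vocab, toy_emb)
print(result)
```

Unlike Euclidean distance, cosine similarity ignores vector length: in this toy matrix `process` and `program` differ in magnitude but point in nearly the same direction, so they rank as very similar.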

## Exercise

Try to implement the Skip-Gram approach.

## More information

If you like, these PyTorch NLP tutorials are a good place to start building NLP models:

- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
- https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
- https://pytorch.org/tutorials/beginner/transformer_tutorial.html