# Word2Vec

"Word2vec is a technique for natural language processing. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors." [https://en.wikipedia.org/wiki/Word2vec]

Here we will build a PyTorch model that implements Word2Vec's continuous bag-of-words (CBOW) strategy, which predicts a target word from the words surrounding it.
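
To make the cosine similarity mentioned in the definition above concrete, here is a minimal sketch. The three vectors are hand-picked, hypothetical values chosen for illustration; they are not embeddings learned by any model, and the helper name `cosine` is an assumption of this example.

```python
import torch
import torch.nn.functional as F

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
king = torch.tensor([0.9, 0.1, 0.4])
queen = torch.tensor([0.8, 0.2, 0.5])
banana = torch.tensor([-0.3, 0.9, -0.2])


def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|), a value between -1 and 1
    return F.cosine_similarity(u, v, dim=0).item()


sim_close = cosine(king, queen)
sim_far = cosine(king, banana)
print(f"king vs queen:  {sim_close:.3f}")
print(f"king vs banana: {sim_far:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated or opposed directions score near 0 or below.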

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
from IPython.core.debugger import set_trace

In [4]:
np.random.seed(0)
torch.manual_seed(0);

In [7]:
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()
print(raw_text[:11])

['We', 'are', 'about', 'to', 'study', 'the', 'idea', 'of', 'a', 'computational', 'process.']


In [18]:
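# Deriving a set from `raw_text` deduplicates the list of tokens
vocab = set(raw_text)
vocab_size = len(vocab)

# Indices are shifted by 2 because ids 0 and 1 are reserved for the padding
# and unknown tokens, so an embedding table needs vocab_size + 2 rows
word_to_ix = {word: i + 2 for i, word in enumerate(vocab)}
# Inverse mapping from index back to word
ix_to_word = {i: word for word, i in word_to_ix.items()}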
print(vocab_size)
print(word_to_ix)

49
{'by': 2, 'we': 3, 'computers.': 4, 'they': 5, 'called': 6, 'computer': 7, 'The': 8, 'is': 9, 'rules': 10, 'a': 11, 'effect,': 12, 'to': 13, 'are': 14, 'things': 15, 'data.': 16, 'our': 17, 'computational': 18, 'direct': 19, 'process.': 20, 'of': 21, 'beings': 22, 'spells.': 23, 'We': 24, 'evolve,': 25, 'directed': 26, 'programs': 27, 'pattern': 28, 'In': 29, 'conjure': 30, 'program.': 31, 'manipulate': 32, 'evolution': 33, 'idea': 34, 'about': 35, 'with': 36, 'abstract': 37, 'inhabit': 38, 'As': 39, 'spirits': 40, 'Computational': 41, 'study': 42, 'process': 43, 'processes': 44, 'People': 45, 'processes.': 46, 'that': 47, 'the': 48, 'create': 49, 'other': 50}


In [15]:
context_size = 2  # 2 words to the left, 2 to the right
data = []
for i in range(context_size, len(raw_text) - context_size):
    # Context runs from i + 2 down to i - 2, skipping the target itself (j == 0)
    context = [raw_text[i - j] for j in range(-context_size, context_size + 1) if j != 0]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

[(['study', 'to', 'are', 'We'], 'about'), (['the', 'study', 'about', 'are'], 'to'), (['idea', 'the', 'to', 'about'], 'study'), (['of', 'idea', 'study', 'to'], 'the'), (['a', 'of', 'the', 'study'], 'idea')]


In [13]:
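class CBOW(nn.Module):

    def __init__(self, vocab_size, emb_size, context_len=4):
        # nn.Module must be initialised before submodules are registered
        super().__init__()
        # vocab_size must cover every id, including the 2 reserved ones
        self.embeddings = nn.Embedding(vocab_size, emb_size)
        # context_len (added here) is 2 * context_size: the context embeddings
        # are concatenated, so the linear layer takes context_len * emb_size inputs
        self.lin_out = nn.Linear(context_len * emb_size, vocab_size)

    def forward(self, x):
        # (bs, 4) -> (bs, 4, emb_dim)
        x = self.embeddings(x)
        # (bs, 4, emb_dim) -> (bs, 4*emb_dim)
        x = x.view(x.shape[0], -1)
        # (bs, 4*emb_dim) -> (bs, vocab_size)
        x = self.lin_out(x)
        return torch.log_softmax(x, dim=-1)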



## Exercise
Instantiate the model and write a proper training loop. Here are some functions to help you make the data ready for use:

In [21]:
def get_list_of_ids(context, word_to_ix):
    list_of_ids = []
    for w in context:
        if w in word_to_ix:
            list_of_ids.append(word_to_ix[w])
        else:
            list_of_ids.append(1)  # unknown id = 1
    return list_of_ids


def get_target_id(target, word_to_ix):
    target_word_id = 1  # unknown id = 1, consistent with get_list_of_ids
    if target in word_to_ix:
        target_word_id = word_to_ix[target]
    return target_word_id


def make_context_vector(context, word_to_ix):
    idxs = get_list_of_ids(context, word_to_ix)
    return torch.tensor(idxs, dtype=torch.long)

In [23]:
print(get_list_of_ids(data[0][0], word_to_ix))
print(get_target_id(data[0][1], word_to_ix))
print(make_context_vector(data[0][0], word_to_ix))

[42, 13, 14, 24]
35
tensor([42, 13, 14, 24])
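
One possible way to approach the exercise is sketched below. It is a minimal, self-contained illustration rather than a reference solution. It uses a tiny stand-in corpus so that it runs on its own, and the names `CBOWSketch`, `num_embeddings`, and `context_len`, as well as the hyperparameters (embedding size 10, learning rate 0.1, 50 epochs, full-batch SGD), are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Tiny stand-in corpus so the block runs on its own; swap in `raw_text`.
tokens = "we conjure the spirits of the computer with our spells".split()
context_size = 2

# Ids 0 and 1 stay reserved for padding and unknown tokens, as in the notebook.
word_to_ix = {w: i + 2 for i, w in enumerate(sorted(set(tokens)))}
num_embeddings = len(word_to_ix) + 2  # leave room for the two reserved ids

# Build (context ids, target id) pairs with a window of 2 words on each side.
contexts, targets = [], []
for i in range(context_size, len(tokens) - context_size):
    ctx = [tokens[i + j] for j in range(-context_size, context_size + 1) if j != 0]
    contexts.append([word_to_ix[w] for w in ctx])
    targets.append(word_to_ix[tokens[i]])
X = torch.tensor(contexts, dtype=torch.long)  # (n_examples, 4)
y = torch.tensor(targets, dtype=torch.long)   # (n_examples,)


class CBOWSketch(nn.Module):
    """Hypothetical stand-in model that concatenates the context embeddings."""

    def __init__(self, num_embeddings, emb_size, context_len):
        super().__init__()
        self.embeddings = nn.Embedding(num_embeddings, emb_size)
        self.lin_out = nn.Linear(context_len * emb_size, num_embeddings)

    def forward(self, x):
        x = self.embeddings(x)          # (bs, 4) -> (bs, 4, emb_size)
        x = x.view(x.shape[0], -1)      # (bs, 4, emb_size) -> (bs, 4*emb_size)
        return torch.log_softmax(self.lin_out(x), dim=-1)


model = CBOWSketch(num_embeddings, emb_size=10, context_len=2 * context_size)
loss_fn = nn.NLLLoss()  # pairs with the log-probabilities the model returns
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(50):
    optimizer.zero_grad()           # clear gradients from the previous step
    log_probs = model(X)            # (n_examples, num_embeddings)
    loss = loss_fn(log_probs, y)
    loss.backward()                 # backpropagate
    optimizer.step()                # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Because the model ends in `log_softmax`, `nn.NLLLoss` is the matching criterion; returning raw logits and using `nn.CrossEntropyLoss` would be an equivalent alternative. For the full corpus, mini-batches (for example via `torch.utils.data.DataLoader`) would likely be preferable to a single full batch.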


## More information

If you like, PyTorch's NLP tutorials are a good place to start building NLP models:

- https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
- https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html
- https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html