# Word2vec Word Embeddings with PyTorch

In most natural language processing tasks, the most basic data unit is a word. However, in order to work with machine learning models, we need a numerical representation for words to serve as model input features.

There are several ways to represent words numerically. The most basic way is to assign a unique integer to each word in the vocabulary and then create a one-hot encoding vector to represent each word. A one-hot vector is a vector where all values are 0 except for the value indexed by the word's ID which is set to 1. However, this approach suffers from several drawbacks:
- The size of each one-hot vector needs to be the size of the complete vocabulary. In text datasets where it is easy to encounter vocabularies with thousands of words, this would entail large memory requirements.
- Additionally, the one-hot encoding does not carry any semantic or contextual information. It simply models every word as a standalone entity without encoding anything about its meaning.
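
To make the encoding concrete, here is a minimal sketch using a made-up five-word vocabulary. The words and the `one_hot` helper are illustrative assumptions and are not part of the notebook's pipeline.

```python
# Toy vocabulary, chosen only for illustration
toy_vocab = ["apple", "banana", "carrot", "orange", "pear"]
toy_word2id = {word: idx for idx, word in enumerate(toy_vocab)}

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's ID."""
    vector = [0] * len(toy_vocab)
    vector[toy_word2id[word]] = 1
    return vector

carrot_vec = one_hot("carrot")
orange_vec = one_hot("orange")
print(carrot_vec)  # [0, 0, 1, 0, 0]
print(orange_vec)  # [0, 0, 0, 1, 0]

# The vector length always equals the vocabulary size
print(len(carrot_vec) == len(toy_vocab))  # True

# Any two different words have a dot product of 0, so the encoding
# says nothing about how related they are
dot = sum(a * b for a, b in zip(carrot_vec, orange_vec))
print(dot)  # 0
```

Both drawbacks show up directly: each vector grows with the vocabulary, and every pair of distinct words looks equally unrelated.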

Word embeddings are another word encoding method that addresses the issues with one-hot encoding. Word embeddings are also vectors, but they are dense and usually far smaller than the vocabulary size, as opposed to the sparse one-hot vectors. More importantly, the values in these vectors carry information regarding the meaning and semantic context of words.

Let us examine the case of the words "orange" and "carrot" and build a toy handcrafted embedding vector for each word considering relevant semantic attributes.

|  | color orange | elongated shape | sweet flavor |
|:---:|:---:|:---:|:---:|
| carrot | 5.2 | 6.4 | 1.1 |
| orange | 4.9 | 0.2 | 3.8 |

In the example above, every value in the embedding vector expresses a certain contextual attribute. Since both "carrot" and "orange" are orange in color, they have close values in the feature related to orange color. However, they differ in terms of shape and flavor. Word embeddings not only provide features that are informative of the meaning of a word, but are also strong indicators of the semantic similarity between words.
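
As a rough illustration of how such vectors indicate similarity, the sketch below compares the two toy rows from the table feature by feature and with cosine similarity. The `cosine` helper is written here for illustration only; it is one common similarity measure, not the only possible choice.

```python
import math

# Values copied from the table above: [color orange, elongated shape, sweet flavor]
carrot = [5.2, 6.4, 1.1]
orange = [4.9, 0.2, 3.8]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Per-feature gaps: small for color, large for shape and flavor
gaps = [round(abs(a - b), 1) for a, b in zip(carrot, orange)]
print(gaps)  # [0.3, 6.2, 2.7]

print(round(cosine(carrot, orange), 2))  # 0.6
```

The shared color pulls the similarity above zero, while the differences in shape and flavor keep it well below 1.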

In most NLP applications, word embeddings are used as input features to models. These embeddings are learned by training an embedding model. As opposed to the toy example shown above, learned embeddings are not interpretable, meaning that we do not know the semantic attribute expressed by each feature. However, we can demonstrate that they capture semantic information, as shown at the end of this notebook.

## Word2vec Model

There are two ways in general to train word embeddings. The simplest way is to include an embedding layer as the input of the model that is being trained for the NLP task of interest (classification, translation, relation extraction...). The embeddings would then be trained with the rest of the model which results in word embeddings that are attuned to the specific task being performed.

The second way is to separately train a word embedding model to extract robust and meaningful features. This is usually done by training on large corpora of text in an unsupervised framework. The resulting embedding weights can then be transferred to another model performing a certain task. They can be further trained as part of the model to fine-tune them for the given task. In most cases this results in improved performance.

[Word2vec](https://arxiv.org/abs/1301.3781) is an unsupervised framework for training word embeddings. For each word, Word2vec learns a representative embedding by observing the words that appear in its context. A context is defined as a group of words that surround the target word. This is similar to how we, as humans, can understand the meaning of a word from its context. The Word2vec paper proposed two ways to train word embeddings under this framework. The first is called continuous bag of words (CBOW), where the model learns to predict the target word given its context. The second, called skip-gram, trains the model to predict the context words given a target word. In this notebook, we demonstrate the CBOW model, since skip-gram differs only in how the inputs and outputs are modeled.
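
The difference between the two setups can be sketched on a single toy sentence. The helper names `cbow_pairs` and `skipgram_pairs`, the `window` parameter and the sentence are all assumptions made for this illustration, and the skip-gram variant shown is a simplified one that reuses the same full windows as CBOW.

```python
def cbow_pairs(tokens, window):
    """One (context words, target word) sample per window position."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window):
    """One (target word, context word) sample per context word."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        pairs.extend((target, word) for word in context)
    return pairs

sentence = "the quick brown fox jumps".split()

print(cbow_pairs(sentence, window=1))
# [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown', 'jumps'], 'fox')]

print(skipgram_pairs(sentence, window=1)[:2])
# [('quick', 'the'), ('quick', 'brown')]
```

In CBOW the whole context is the input and the center word is the label. In skip-gram the roles are swapped, and each context word becomes its own training sample.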

In practice, Word2vec can be trained on any text corpus. We will demonstrate this concept on the [WikiText2](https://arxiv.org/abs/1609.07843v1) dataset, which is a collection of Wikipedia articles with roughly 2 million tokens. This dataset can be easily loaded using the torchtext library as shown below.

*References:*
- *https://github.com/OlgaChernytska/word2vec-pytorch*
- *https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html*

In the cell below, we first load the dataset, which comes as three splits: train, validation and test. Each sample in the dataset is a string containing a paragraph. Paragraphs are of various lengths and contain different symbols and punctuation. We define a function that takes a string, lowercases it, removes symbols and punctuation, and splits it into a list of words (tokens).

In [None]:
from collections import Counter
from torchtext.datasets import WikiText2
from torchtext.data.functional import to_map_style_dataset

#by default the datasets are loaded as iterators
#we convert them to map style datasets for easier manipulation
train, val, test = WikiText2()
train = to_map_style_dataset(train)
val = to_map_style_dataset(val)
test = to_map_style_dataset(test)

def preprocess_text(text):
    #lowercase
    text = text.lower()
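    #remove punctuation ('<' and '>' are kept so that the <unk> token survives)
    #raw string so that the backslash is not read as an escape sequence
    punc = r"""!"#$%&'()*+,-./:;=?@[\]^_`{|}~"""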
    text = text.translate(str.maketrans('', '', punc))
    #remove new line characters
    text = text.translate(str.maketrans('', '', '\n'))
    #split into list of tokens, dropping the empty tokens left by repeated spaces
    tokens = [token for token in text.split(' ') if token]
    return tokens

To use the words as input to the embedding layer, we convert them into an integer encoding by assigning a unique integer to each word. We start by building the full vocabulary from all three splits. We exclude short paragraphs since we will not be able to use these to create training samples for the CBOW model. We also exclude words that appear 50 times or fewer since the model may not be able to learn robust embeddings for these rare words. These words are replaced with an unknown token. The dataset already has some words replaced by an \<unk> token. Finally, we create the word ID mapping. Unknown tokens are assigned an ID of 0.

In [2]:
CONTEXT_SIZE = 4

#build vocab from all three splits
vocab = []
for split in (train, val, test):
    for text in split:
        tokens = preprocess_text(text)
        #only include sentences of length larger than the CBOW window
        if len(tokens) > 2*CONTEXT_SIZE+1:
            vocab.extend(tokens)

#remove uncommon tokens (those appearing 50 times or fewer)
counts = Counter(vocab)
vocab = [word for word in counts if counts[word] > 50]
#drop the dataset's own <unk> token so that it maps to ID 0 like any other unknown word
if '<unk>' in vocab:
    vocab.remove('<unk>')

#word integer encoding
#0 is for unknown
word2id = {word: idx for idx, word in enumerate(sorted(vocab), start=1)}
id2word = {idx: word for word, idx in word2id.items()}
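
The lookup pattern used later in the notebook, `word2id.get(word, 0)`, can be sketched on a toy corpus. The corpus, the `sample_*` names and the threshold of 1 below are made up for illustration; the notebook itself uses the WikiText2 splits and a threshold of 50.

```python
from collections import Counter

sample_tokens = "the cat sat on the mat the cat slept".split()
sample_counts = Counter(sample_tokens)

# Keep words that occur more than once (stand-in for the threshold of 50)
sample_vocab = [word for word, count in sample_counts.items() if count > 1]

# IDs start at 1 so that 0 stays reserved for unknown words
sample_word2id = {word: idx for idx, word in enumerate(sorted(sample_vocab), start=1)}
sample_id2word = {idx: word for word, idx in sample_word2id.items()}
print(sample_word2id)  # {'cat': 1, 'the': 2}

# Words outside the vocabulary fall back to ID 0
encoded = [sample_word2id.get(word, 0) for word in "the cat sat".split()]
print(encoded)  # [2, 1, 0]

# Decoding ID 0 gives the unknown token
decoded = [sample_id2word.get(idx, "<unk>") for idx in encoded]
print(decoded)  # ['the', 'cat', '<unk>']
```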

## Continuous Bag of Words

The CBOW model learns semantic word features by training under a text completion setting. Given a set of context words, the model must predict a target word that completes the phrase being considered.

To build the training samples, we define a window that we slide across the text. At every position, the word in the center of the window is the target word and the words before and after are the context words. The context words form the model input and the target word is the output. We set the size of the window using the *CONTEXT_SIZE* constant. This means that we will have *CONTEXT_SIZE* words before and *CONTEXT_SIZE* words after the target word.

Below we show an example of how the data pairs (context, target) are created using this moving window.

In [None]:
tokens = preprocess_text(train[4])
print(' '.join(tokens))

In [4]:
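#print the first 6 (context, target) pairs produced by the sliding window
for n, i in enumerate(range(CONTEXT_SIZE, len(tokens) - CONTEXT_SIZE)):
    #CONTEXT_SIZE words before the target (nearest first), then CONTEXT_SIZE words after
    context = [tokens[i - j - 1] for j in range(CONTEXT_SIZE)] + [tokens[i + j + 1] for j in range(CONTEXT_SIZE)]
    print(f'Context: {context}; Target: {tokens[i]}')
    if n == 5:
        break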

Context: ['development', 'began', 'game', 'the', '2010', 'carrying', 'over', 'a']; Target: in
Context: ['in', 'development', 'began', 'game', 'carrying', 'over', 'a', 'large']; Target: 2010
Context: ['2010', 'in', 'development', 'began', 'over', 'a', 'large', 'portion']; Target: carrying
Context: ['carrying', '2010', 'in', 'development', 'a', 'large', 'portion', 'of']; Target: over
Context: ['over', 'carrying', '2010', 'in', 'large', 'portion', 'of', 'the']; Target: a
Context: ['a', 'over', 'carrying', '2010', 'portion', 'of', 'the', 'work']; Target: large


We put this into a function to call during the training process. Note that we also convert words to their integer IDs while building the samples.

In [14]:
import torch

def prepare_batch(data):
    X = []
    y = []
    for txt in data:
        tokens = preprocess_text(txt)
        #only consider sentences that are bigger than the context window
        if len(tokens) > 2*CONTEXT_SIZE+1:
            #start with the first target word (skipping CONTEXT_SIZE words)
            for i in range(CONTEXT_SIZE, len(tokens) - CONTEXT_SIZE):
                #select CONTEXT_SIZE words before and after the target word
                context = [tokens[i - j - 1] for j in range(CONTEXT_SIZE)] + [tokens[i + j + 1] for j in range(CONTEXT_SIZE)]
                #convert words to integer encoding
                context = [word2id.get(word, 0) for word in context]
                X.append(context)
                #select target word
                y.append(word2id.get(tokens[i], 0))
    X = torch.tensor(X)
    y = torch.tensor(y)
    return X, y

## Model

The CBOW model has a simple architecture consisting of an embedding layer followed by a dense softmax layer. For each sample, the embedding layer is given *CONTEXT_SIZE*×2 context words as input and produces an embedding vector for each word. We take the average of these embeddings, and the resulting vector is used as input to the softmax layer. The output is a vector of the size of the vocabulary that gives, for each word, the probability that it is the target word. Since this is a multiclass classification problem, we train the model with the categorical cross-entropy loss, implemented in the code as a log-softmax output combined with the negative log-likelihood loss.

We only train for 5 epochs since the validation loss starts increasing after that point. After the training is done, the embedding lookup matrix is simply the weight matrix of the embedding layer. The softmax layer can be discarded.
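
The forward pass described above can be traced by hand on a tiny example. The sketch below uses plain Python lists rather than PyTorch, with a hypothetical 4-word vocabulary, 2-dimensional embeddings and made-up weights. The names `embedding_table`, `out_weights`, `out_bias` and `cbow_forward` are illustrative and are not part of the notebook's model.

```python
import math

# Hypothetical tiny setup: 4-word vocabulary, 2-dimensional embeddings
embedding_table = [
    [0.0, 0.0],   # ID 0: unknown token
    [1.0, 0.0],   # ID 1
    [0.0, 1.0],   # ID 2
    [1.0, 1.0],   # ID 3
]
# Output layer: one weight row and one bias per vocabulary word
out_weights = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_bias = [0.0, 0.0, 0.0, 0.0]

def cbow_forward(context_ids):
    """Embed the context words, average them, and score every vocabulary word."""
    vectors = [embedding_table[i] for i in context_ids]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    logits = [sum(w * m for w, m in zip(row, mean)) + b
              for row, b in zip(out_weights, out_bias)]
    # Log-softmax: subtract the log of the summed exponentials
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return [z - log_norm for z in logits]

log_probs = cbow_forward([1, 2])   # context word IDs
target_id = 3
loss = -log_probs[target_id]       # negative log-likelihood of the target

print(len(log_probs))  # 4, one log-probability per vocabulary word
print(round(sum(math.exp(lp) for lp in log_probs), 6))  # 1.0
print(max(range(len(log_probs)), key=log_probs.__getitem__))  # 3
print(round(loss, 2))  # 0.95
```

Training adjusts both the embedding table and the output weights to lower this loss. Afterwards only the embedding table is kept.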

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)
    
    def forward(self, X):
        X = self.embeddings(X)
        #calculate the mean of the context word embeddings
        X = X.mean(dim=1)
        X = self.out(X)
        return F.log_softmax(X, dim=1)
    
    def get_embedding_weights(self):
        # extract the embedding lookup matrix to use after training
        return self.embeddings.weight.detach().cpu()

In [None]:
import numpy as np

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(41)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

#hyperparameters
num_epochs = 5
batch_size = 128
embedding_dim = 300
lr = 0.03

#model
model = CBOW(len(vocab)+1, embedding_dim)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

#evaluation function
#note: the whole split is scored in a single forward pass
def evaluate(data):
    X, y = prepare_batch(data)
    X = X.to(device)
    model.eval()
    with torch.no_grad():
        out = model(X).cpu()
        loss = F.nll_loss(input=out, target=y)
    #return a plain Python float
    return loss.item()

#training loop
for epoch in range(num_epochs):
    #training
    losses = []
    model.train()
    for i in range(0, len(train), batch_size):
        X_train_batch, y_train_batch = prepare_batch(train[i:i+batch_size])
        #make sure that X is not empty
        if len(X_train_batch):
            X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
            optimizer.zero_grad()
            out = model(X_train_batch)
            loss = F.nll_loss(input=out, target=y_train_batch)
            loss.backward()
            optimizer.step()
            losses.append(loss.detach().item())
    #validation
    loss_val = evaluate(val)
    #epoch end
    print(f'********Epoch {epoch+1}********')
    print(f'Train loss: {np.mean(losses):.2f};    Val loss {loss_val:.2f}\n')

## Word Embedding Evaluation and Visualization

In the last section of this notebook, we will explore the quality of our embeddings using some qualitative and quantitative techniques.

### TSNE Plot

In [8]:
embeddings = model.get_embedding_weights()
embeddings.shape

torch.Size([4581, 300])

In [9]:
from sklearn.manifold import TSNE

tsne_embeddings = TSNE(random_state=41).fit_transform(embeddings)
tsne_embeddings.shape



(4581, 2)

In [10]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=tsne_embeddings[:,0],
        y=tsne_embeddings[:,1],
        mode="text",
        text=[id2word.get(idx, '<unk>') for idx in range(tsne_embeddings.shape[0])],
        textposition="middle center"
    )
)
fig.show()

### King - Man = Queen - Woman

In [11]:
from sklearn.metrics.pairwise import cosine_similarity
import scipy.linalg as la

king_man = embeddings[word2id['king']] - embeddings[word2id['man']]
queen_woman = embeddings[word2id['queen']] - embeddings[word2id['woman']]

print(la.norm(king_man))
print(la.norm(queen_woman))

16.021299362182617
20.660226821899414
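
The two norms above measure the length of each offset, not whether the offsets point in the same direction. One common way to compare directions is the cosine similarity between the two difference vectors. The sketch below shows the idea on made-up 3-dimensional vectors. The values in `toy` are hypothetical and do not come from the trained model, and the helpers are written here for illustration only.

```python
import math

# Made-up 3-dimensional vectors, for illustration only
toy = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.8, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "woman": [0.1, 0.2, 0.8],
}

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

king_man_toy = subtract(toy["king"], toy["man"])
queen_woman_toy = subtract(toy["queen"], toy["woman"])

# Close to 1.0 means the two offsets point in nearly the same direction
print(round(cosine(king_man_toy, queen_woman_toy), 2))  # 0.98
```

With the notebook's tensors, the same comparison could be made by passing the two difference vectors, reshaped to a single row each, to the `cosine_similarity` function imported above.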


### Top Similar Words

In [12]:
def get_top_similar(word, n=5):
    try:
        wordid = word2id[word]
    except KeyError:
        raise KeyError('Out of vocabulary.')
    
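    #cosine similarity between the query word and every word in the vocabulary
    similarities = cosine_similarity(embeddings[wordid].reshape(1,-1), embeddings)[0]
    #IDs of the n most similar words (the query word itself normally ranks first)
    top_ids = similarities.argsort()[::-1][:n]
    print(f'Top {n} similar words to [{word}]:')
    print('\n'.join(id2word.get(sim_id, '<unk>') for sim_id in top_ids))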

get_top_similar('king')

Top 5 similar words to [king]:
king
governor
reign
pope
emperor


In [13]:
get_top_similar('book')

Top 5 similar words to [book]:
book
novel
poem
game
books
