## Training CBoW Model

This notebooks is a part of [AI for Beginners Curriculum](http://aka.ms/ai-beginners)

In this example, we will look at training CBoW language model to get our own Word2Vec embedding space. We will use AG News dataset as the source of text.

In [9]:
import torch
import torchtext
import os
import collections
import builtins
import random
import numpy as np

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

First let's load our dataset and define tokenizer and vocabulary. We will set `vocab_size` to 5000 to limit computations a bit.

In [14]:
def load_dataset(ngrams = 1, min_freq = 1, vocab_size = 5000 , lines_cnt = 500):
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    print("Loading dataset...")
    test_dataset, train_dataset  = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
    print('Building vocab...')
    counter = collections.Counter()
    for i, (_, line) in enumerate(train_dataset):
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line),ngrams=ngrams))
        if i == lines_cnt:
            break
    vocab = torchtext.vocab.Vocab({k: i for i, k in enumerate(dict(counter.most_common(vocab_size)).keys())})
    return train_dataset, test_dataset, classes, vocab, tokenizer, counter

In [12]:
import torch.nn as nn
from typing import List

class Vocab(nn.Module):
    def __init__ (self, vocab_list):
        self.vocab = {k: i for i, k in enumerate(vocab_list)}
        self.vocab.update({'<unk>': len(self.vocab), '<pad>': len(self.vocab) + 1})
    
    def __len__(self):
        return len(self.vocab)
    
    def __getitem__(self, token: str) -> int:
        return self.vocab[token] if token in self.vocab.keys() else self.vocab['<unk>']
    
    def forward(self, tokens: List[str]) -> List[int]:
        return [self.vocab[token] for token in tokens]


In [15]:
train_dataset, test_dataset, _, vocab, tokenizer, counter = load_dataset()

Loading dataset...
Building vocab...


In [17]:
# Filter the counter to include only the top 5000 most common tokens
most_common_tokens = dict(counter.most_common(5000))

# Create the vocabulary
vocab = torchtext.vocab.Vocab(collections.Counter(most_common_tokens))

In [19]:
vocab
print(f"Vocabulary size: {len(vocab)}")

Vocabulary size: 5000


In [20]:
vocab = torchtext.vocab.Vocab(collections.Counter(dict(counter.most_common(5000))))

In [21]:
vocab

Vocab()

In [22]:
def encode(x, vocabulary, tokenizer = tokenizer):
    return [vocabulary[s] for s in tokenizer(x)]

In [58]:
{k: i for i, k in enumerate(dict(counter.most_common(5000)).keys())}

{'.': 0,
 'the': 1,
 ',': 2,
 'to': 3,
 'a': 4,
 'in': 5,
 'of': 6,
 's': 7,
 'and': 8,
 'on': 9,
 'for': 10,
 '(': 11,
 ')': 12,
 '-': 13,
 "'": 14,
 '#39': 15,
 'at': 16,
 'as': 17,
 'reuters': 18,
 'that': 19,
 'ap': 20,
 'it': 21,
 'with': 22,
 'new': 23,
 'its': 24,
 'said': 25,
 'by': 26,
 'is': 27,
 'has': 28,
 'from': 29,
 'an': 30,
 'after': 31,
 'his': 32,
 'will': 33,
 'have': 34,
 'but': 35,
 'athens': 36,
 'u': 37,
 'olympic': 38,
 '--': 39,
 'he': 40,
 'gold': 41,
 'monday': 42,
 'us': 43,
 'their': 44,
 'was': 45,
 'oil': 46,
 'over': 47,
 'up': 48,
 'more': 49,
 'are': 50,
 'first': 51,
 'tuesday': 52,
 'be': 53,
 'wednesday': 54,
 'they': 55,
 'prices': 56,
 'two': 57,
 'who': 58,
 '&lt': 59,
 'into': 60,
 'inc': 61,
 'sunday': 62,
 'china': 63,
 'afp': 64,
 'team': 65,
 'one': 66,
 'quot': 67,
 'were': 68,
 'out': 69,
 'off': 70,
 'about': 71,
 'can': 72,
 'would': 73,
 'medal': 74,
 'night': 75,
 'president': 76,
 'not': 77,
 'united': 78,
 'win': 79,
 't': 80,
 'yor

In [59]:
vocab.vocab

<torchtext._torchtext.Vocab at 0x7f9ba0842970>

## CBoW Model

CBoW learns to predict a word based on the $2N$ neighboring words. For example, when $N=1$, we will get the following pairs from the sentence *I like to train networks*: (like,I), (I, like), (to, like), (like,to), (train,to), (to, train), (networks, train), (train,networks). Here, first word is the neighboring word used as an input, and second word is the one we are predicting.

To build a network to predict next word, we will need to supply neighboring word as input, and get word number as output. The architecture of CBoW network is the following:

* Input word is passed through the embedding layer. This very embedding layer would be our Word2Vec embedding, thus we will define it separately as `embedder` variable. We will use embedding size = 30 in this example, even though you might want to experiment with higher dimensions (real word2vec has 300)
* Embedding vector would then be passed to a linear layer that will predict output word. Thus it has the `vocab_size` neurons.

For the output, if we use `CrossEntropyLoss` as loss function, we would also have to provide just word numbers as expected results, without one-hot encoding.

In [23]:
vocab_size = len(vocab)

embedder = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = 30)
model = torch.nn.Sequential(
    embedder,
    torch.nn.Linear(in_features = 30, out_features = vocab_size),
)

print(model)

Sequential(
  (0): Embedding(5000, 30)
  (1): Linear(in_features=30, out_features=5000, bias=True)
)


## Preparing Training Data

Now let's program the main function that will compute CBoW word pairs from text. This function will allow us to specify window size, and will return a set of pairs - input and output word. Note that this function can be used on words, as well as on vectors/tensors - which will allow us to encode the text, before passing it to `to_cbow` function.

In [24]:
def to_cbow(sent,window_size=2):
    res = []
    for i,x in enumerate(sent):
        for j in range(max(0,i-window_size),min(i+window_size+1,len(sent))):
            if i!=j:
                res.append([sent[j],x])
    return res

print(to_cbow(['I','like','to','train','networks']))
print(to_cbow(encode('I like to train networks', vocab)))

[['like', 'I'], ['to', 'I'], ['I', 'like'], ['to', 'like'], ['train', 'like'], ['I', 'to'], ['like', 'to'], ['train', 'to'], ['networks', 'to'], ['like', 'train'], ['to', 'train'], ['networks', 'train'], ['to', 'networks'], ['train', 'networks']]
[[11, 14], [530, 14], [14, 11], [530, 11], [0, 11], [14, 530], [11, 530], [0, 530], [3, 530], [11, 0], [530, 0], [3, 0], [530, 3], [0, 3]]


In [25]:
to_cbow(['I','like','to','train','networks'], window_size=2)

[['like', 'I'],
 ['to', 'I'],
 ['I', 'like'],
 ['to', 'like'],
 ['train', 'like'],
 ['I', 'to'],
 ['like', 'to'],
 ['train', 'to'],
 ['networks', 'to'],
 ['like', 'train'],
 ['to', 'train'],
 ['networks', 'train'],
 ['to', 'networks'],
 ['train', 'networks']]

In [26]:
encode('I like to train networks', vocab)

[14, 11, 530, 0, 3]

Let's prepare the training dataset. We will go through all news, call `to_cbow` to get the list of word pairs, and add those pairs to `X` and `Y`. For the sake of time, we will only consider first 10k news items - you can easily remove the limitation in case you have more time to wait, and want to get better embeddings :)

In [27]:
X = []
Y = []
for i, x in zip(range(10000), train_dataset):
    for w1, w2 in to_cbow(encode(x[1], vocab), window_size = 5):
        X.append(w1)
        Y.append(w2)

X = torch.tensor(X)
Y = torch.tensor(Y)

We will also convert that data to one dataset, and create dataloader:

In [28]:
class SimpleIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, X, Y):
        super(SimpleIterableDataset).__init__()
        self.data = []
        for i in range(len(X)):
            self.data.append( (Y[i], X[i]) )
        random.shuffle(self.data)

    def __iter__(self):
        return iter(self.data)

We will also convert that data to one dataset, and create dataloader:

In [29]:
ds = SimpleIterableDataset(X, Y)
dl = torch.utils.data.DataLoader(ds, batch_size = 256)

Now let's do the actual training. We will use `SGD` optimizer with pretty high learning rate. You can also try playing around with other optimizers, such as `Adam`. We will train for 10 epochs to begin with - and you can re-run this cell if you want even lower loss.

In [30]:
import torch

def train_epoch(net, dataloader, optimizer, loss_fn, epochs, device=device, report_freq=1):
    # Move model to the specified device (GPU or CPU)
    net = net.to(device)
    loss_fn = loss_fn.to(device)  # Ensure the loss function is also on the same device
    net.train()

    for epoch in range(epochs):
        total_loss = 0.0
        batch_count = 0

        for labels, features in dataloader:
            # Move data to the specified device
            features, labels = features.to(device), labels.to(device)

            # Zero gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = net(features)

            # Calculate loss
            loss = loss_fn(outputs, labels)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            # Accumulate loss
            total_loss += loss.item()
            batch_count += 1

        # Report average loss for the epoch
        if epoch % report_freq == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {total_loss / batch_count:.4f}")

    return total_loss / batch_count

In [31]:
train_epoch(net = model, dataloader = dl, optimizer = torch.optim.SGD(model.parameters(), lr = 0.1), loss_fn = torch.nn.CrossEntropyLoss(), epochs = 10)

Epoch [1/10], Loss: 3.9329
Epoch [2/10], Loss: 3.6152
Epoch [3/10], Loss: 3.6075
Epoch [4/10], Loss: 3.6048
Epoch [5/10], Loss: 3.6033
Epoch [6/10], Loss: 3.6023
Epoch [7/10], Loss: 3.6016
Epoch [8/10], Loss: 3.6010
Epoch [9/10], Loss: 3.6005
Epoch [10/10], Loss: 3.6002


3.6001607200101673

## Trying out Word2Vec

To use Word2Vec, let's extract vectors corresponding to all words in our vocabulary:

In [40]:
from torchtext.vocab import vocab as Vocab
from collections import Counter

# Assuming `counter` is your word frequency counter
# Create a vocabulary from the counter
most_common_tokens = Counter(dict(counter.most_common(5000)))
vocab = Vocab(most_common_tokens)


In [41]:
tokens = vocab.get_itos()

In [45]:
vectors = torch.stack(
    [embedder(torch.tensor(vocab[token]).to(device)) for token in tokens], dim=0
)


Let's see, for example, how the word **Paris** is encoded into a vector:

In [47]:
paris_vec = embedder(torch.tensor(vocab['paris']).to(device))
print(paris_vec)

tensor([-1.5799, -0.6555, -0.3048,  0.7853, -1.3623,  3.3791,  0.8314, -0.4976,
         0.0909, -0.0541,  0.7252,  0.4757,  2.4713,  0.4871, -0.7350, -0.1321,
        -0.4680,  0.2433, -2.4521,  0.7610,  0.6885,  0.7293, -0.7003, -1.4590,
         1.1957, -0.7732,  0.3247,  0.7781, -0.3802, -0.4512], device='cuda:0',
       grad_fn=<EmbeddingBackward0>)


It is interesting to use Word2Vec to look for synonyms. The following function will return `n` closest words to a given input. To find them, we compute the norm of $|w_i - v|$, where $v$ is the vector corresponding to our input word, and $w_i$ is the encoding of $i$-th word in the vocabulary. We then sort the array and return corresponding indices using `argsort`, and take first `n` elements of the list, which encode positions of closest words in the vocabulary.  

In [55]:
def close_words(x, n=5):
    # Ensure all computations are performed on the same device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Move vectors to the correct device
    vec = embedder(torch.tensor(vocab[x]).to(device))
    vectors_on_device = vectors.to(device)
    
    # Compute distances and find closest words
    top5 = np.linalg.norm(
        vectors_on_device.detach().cpu().numpy() - vec.detach().cpu().numpy(),
        axis=1
    ).argsort()[:n]
    
    # Map indices to tokens using the appropriate method
    try:
        tokens = vocab.get_itos()  # Use for modern versions
    except AttributeError:
        tokens = vocab.itos  # Use for older versions
    
    # Return the closest tokens
    return [tokens[x] for x in top5]

In [56]:
close_words('basketball')

['basketball', 'qualifying', 'tang', 'expand', 'holding']

In [57]:
close_words('funds')

['funds', 'pitch', 'weary', 'maoist', 'm']

## Takeaway

Using clever techniques such as CBoW, we can train Word2Vec model. You may also try to train skip-gram model that is trained to predict the neighboring word given the central one, and see how well it performs. 