## Training CBoW Model

This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners).

In this example, we will train a CBoW language model to obtain our own Word2Vec embedding space. We will use the AG News dataset as the source of text.

In [1]:
import torch
import torchtext
import os
import collections
import builtins
import random
import numpy as np

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

First, let's load our dataset and define the tokenizer and vocabulary. We will set `vocab_size` to 5000 to limit the amount of computation.

In [3]:
def load_dataset(ngrams = 1, min_freq = 1, vocab_size = 5000 , lines_cnt = 500):
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
    print('Building vocab...')
    counter = collections.Counter()
    for i, (_, line) in enumerate(train_dataset):
        counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line),ngrams=ngrams))
        if i == lines_cnt:
            break
    vocab = torchtext.vocab.Vocab(collections.Counter(dict(counter.most_common(vocab_size))), min_freq=min_freq)
    return train_dataset, test_dataset, classes, vocab, tokenizer

The `torchtext.vocab.Vocab(counter, ...)` constructor used above belongs to older torchtext releases. With newer releases, the vocabulary can be built with `build_vocab_from_iterator` instead:

In [21]:
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import collections

def load_dataset(ngrams=1, min_freq=1, vocab_size=5000, lines_cnt=500):
    tokenizer = get_tokenizer('basic_english')
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    classes = ['World', 'Sports', 'Business', 'Sci/Tech']
    print('Building vocab...')

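    # Yield the token n-grams of each text; build_vocab_from_iterator counts them.
    # Note: vocab_size is not applied in this version.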
    def yield_tokens(data_iter):
        for _, text in data_iter:
            yield torchtext.data.utils.ngrams_iterator(tokenizer(text), ngrams=ngrams)

    vocab = build_vocab_from_iterator(yield_tokens(train_dataset[:lines_cnt]),
                                      min_freq=min_freq,
                                      specials=["<unk>"])
    vocab.set_default_index(vocab["<unk>"])

    return train_dataset, test_dataset, classes, vocab, tokenizer

train_dataset, test_dataset, _, vocab, tokenizer = load_dataset()

Loading dataset...
Building vocab...


Let's check the vocabulary object:

In [19]:
vocab

Vocab()

In [26]:
def encode(x, vocabulary, tokenizer = tokenizer):
    return [vocabulary[s] for s in tokenizer(x)]

## CBoW Model

CBoW learns to predict a word based on its $2N$ neighboring words. For example, with $N=1$, the sentence *I like to train networks* yields the following pairs: (like, I), (I, like), (to, like), (like, to), (train, to), (to, train), (networks, train), (train, networks). In each pair, the first word is the neighboring word used as input, and the second word is the one we are predicting.

To build a network that predicts the target word, we supply a neighboring word as input and get a word number as output. The architecture of the CBoW network is as follows:

* The input word is passed through the embedding layer. This embedding layer will become our Word2Vec embedding, so we define it separately as the `embedder` variable. We use an embedding size of 30 in this example, though you may want to experiment with higher dimensions (real word2vec uses 300).
* The embedding vector is then passed to a linear layer that predicts the output word, so this layer has `vocab_size` neurons.

For the output, if we use `CrossEntropyLoss` as the loss function, we only need to provide word numbers as the expected results, without one-hot encoding.
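As a small illustration of this point (a sketch with made-up values, not part of the original notebook), the snippet below checks that passing integer class indices to `CrossEntropyLoss` gives the same value as a manual computation with one-hot targets. The tensors `logits` and `targets` are hypothetical toy data.

```python
import torch

# Hypothetical toy data: 2 samples, 3 "words" in the vocabulary
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])  # word numbers, not one-hot vectors

# CrossEntropyLoss accepts the class indices directly
loss_fn = torch.nn.CrossEntropyLoss()
loss_from_indices = loss_fn(logits, targets)

# Manual equivalent with explicit one-hot targets
log_probs = torch.log_softmax(logits, dim=1)
one_hot = torch.nn.functional.one_hot(targets, num_classes=3).float()
loss_from_one_hot = -(one_hot * log_probs).sum(dim=1).mean()

print(loss_from_indices.item(), loss_from_one_hot.item())
```

Both values should agree up to floating-point error, which is why the training data below stores plain word indices in `Y`.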

In [27]:
vocab_size = len(vocab)

embedder = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = 30)
model = torch.nn.Sequential(
    embedder,
    torch.nn.Linear(in_features = 30, out_features = vocab_size),
)

print(model)

Sequential(
  (0): Embedding(5702, 30)
  (1): Linear(in_features=30, out_features=5702, bias=True)
)


## Preparing Training Data

Now let's write the main function that computes CBoW word pairs from text. It lets us specify the window size and returns a list of pairs, each consisting of an input word and an output word. Note that the function works on words as well as on vectors/tensors, which allows us to encode the text before passing it to `to_cbow`.

In [28]:
def to_cbow(sent,window_size=2):
    res = []
    for i,x in enumerate(sent):
        for j in range(max(0,i-window_size),min(i+window_size+1,len(sent))):
            if i!=j:
                res.append([sent[j],x])
    return res

print(to_cbow(['I','like','to','train','networks']))
print(to_cbow(encode('I like to train networks', vocab)))

[['like', 'I'], ['to', 'I'], ['I', 'like'], ['to', 'like'], ['train', 'like'], ['I', 'to'], ['like', 'to'], ['train', 'to'], ['networks', 'to'], ['like', 'train'], ['to', 'train'], ['networks', 'train'], ['to', 'networks'], ['train', 'networks']]
[[91, 60], [5, 60], [60, 91], [5, 91], [0, 91], [60, 5], [91, 5], [0, 5], [2063, 5], [91, 0], [5, 0], [2063, 0], [5, 2063], [0, 2063]]
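As a quick sanity check, the sketch below re-implements the same pairing logic under the hypothetical name `cbow_pairs` and confirms that `window_size=1` reproduces the eight pairs listed in the CBoW Model section.

```python
def cbow_pairs(sent, window_size=2):
    """Return [neighbor, target] pairs, mirroring the logic of to_cbow."""
    pairs = []
    for i, target in enumerate(sent):
        lo = max(0, i - window_size)
        hi = min(i + window_size + 1, len(sent))
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbor
                pairs.append([sent[j], target])
    return pairs

sentence = "I like to train networks".split()
pairs_n1 = cbow_pairs(sentence, window_size=1)
for neighbor, target in pairs_n1:
    print(f"({neighbor}, {target})")
```

Words at the edges of the sentence have fewer neighbors, so they contribute fewer pairs than words in the middle.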


Let's prepare the training dataset. We will go through the news items, call `to_cbow` to get the list of word pairs, and add those pairs to `X` and `Y`. To save time, we will only consider the first 10k news items. You can easily remove this limit if you have more time to wait and want better embeddings :)

In [29]:
X = []
Y = []
for i, x in zip(range(10000), train_dataset):
    for w1, w2 in to_cbow(encode(x[1], vocab), window_size = 5):
        X.append(w1)
        Y.append(w2)

X = torch.tensor(X)
Y = torch.tensor(Y)
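To get a feel for how quickly `X` and `Y` grow, the following sketch counts the pairs produced for a single text of `n` tokens. The helper names `count_pairs` and `count_pairs_closed_form` are hypothetical, and the closed form $2wn - w(w+1)$ (with $w$ the window size) only holds when $n \ge w + 1$.

```python
def count_pairs(n, window_size):
    """Count the pairs to_cbow would emit for a text of n tokens."""
    total = 0
    for i in range(n):
        left = min(i, window_size)           # neighbors available on the left
        right = min(n - 1 - i, window_size)  # neighbors available on the right
        total += left + right
    return total

def count_pairs_closed_form(n, window_size):
    """Closed form, valid only for n >= window_size + 1."""
    if n < window_size + 1:
        raise ValueError("closed form assumes n >= window_size + 1")
    return 2 * window_size * n - window_size * (window_size + 1)

print(count_pairs(5, 2))   # 14, the number of pairs in the example output above
print(count_pairs(40, 5))  # 2*5*40 - 5*6 = 370
```

With `window_size = 5`, each token away from the edges contributes ten pairs, so the number of training samples is roughly ten times the number of tokens.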

We will also wrap this data in a single dataset class:

In [30]:
class SimpleIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, X, Y):
        super().__init__()
        self.data = []
        for i in range(len(X)):
            self.data.append( (Y[i], X[i]) )
        random.shuffle(self.data)

    def __iter__(self):
        return iter(self.data)

Next, we instantiate the dataset and create a dataloader:

In [31]:
ds = SimpleIterableDataset(X, Y)
dl = torch.utils.data.DataLoader(ds, batch_size = 256)
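`SimpleIterableDataset` shuffles the pairs once, when it is constructed. As a possible alternative (an assumption added here, not what the notebook uses), a map-style `TensorDataset` combined with `shuffle=True` reshuffles the data on every pass. The sketch uses hypothetical toy tensors `X_demo` and `Y_demo` and keeps the (labels, features) order expected by the training loop.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data standing in for the real X and Y tensors
X_demo = torch.arange(10)
Y_demo = X_demo * 2

# Same (labels, features) order as SimpleIterableDataset
demo_ds = TensorDataset(Y_demo, X_demo)
demo_dl = DataLoader(demo_ds, batch_size=4, shuffle=True)

batch_sizes = []
for labels, features in demo_dl:
    batch_sizes.append(len(labels))
print(batch_sizes)
```

The last batch is smaller because 10 samples do not divide evenly into batches of 4.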

Now let's do the actual training. We will use the `SGD` optimizer with a fairly high learning rate. You can also try other optimizers, such as `Adam`. We will train for 10 epochs to begin with, and you can re-run the training cell if you want an even lower loss.

In [32]:
def train_epoch(net, dataloader, lr = 0.01, optimizer = None, loss_fn = torch.nn.CrossEntropyLoss(), epochs = 1, report_freq = 1):
    # Move the model and the loss function to the selected device
    net = net.to(device)
    loss_fn = loss_fn.to(device)
    # Fall back to Adam when no optimizer is passed in
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr = lr)
    net.train()

    for i in range(epochs):
        total_loss, j = 0.0, 0
        for labels, features in dataloader:
            optimizer.zero_grad()
            features, labels = features.to(device), labels.to(device)
            out = net(features)
            loss = loss_fn(out, labels)
            loss.backward()
            optimizer.step()
            # .item() keeps a plain float instead of accumulating tensors
            total_loss += loss.item()
            j += 1
        if i % report_freq == 0:
            print(f"Epoch: {i+1}: loss={total_loss/j}")

    return total_loss/j

In [33]:
import torch

def train_epoch(net, dataloader, optimizer, loss_fn, epochs, device=device, report_freq=1):
    # Move model to the specified device (GPU or CPU)
    net = net.to(device)
    loss_fn = loss_fn.to(device)  # Ensure the loss function is also on the same device
    net.train()

    for epoch in range(epochs):
        total_loss = 0.0
        batch_count = 0

        for labels, features in dataloader:
            # Move data to the specified device
            features, labels = features.to(device), labels.to(device)

            # Zero gradients
            optimizer.zero_grad()

            # Forward pass
            outputs = net(features)

            # Calculate loss
            loss = loss_fn(outputs, labels)

            # Backward pass and optimization
            loss.backward()
            optimizer.step()

            # Accumulate loss
            total_loss += loss.item()
            batch_count += 1

        # Report average loss for the epoch
        if epoch % report_freq == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {total_loss / batch_count:.4f}")

    return total_loss / batch_count

In [18]:
train_epoch(net = model, dataloader = dl, optimizer = torch.optim.SGD(model.parameters(), lr = 0.1), loss_fn = torch.nn.CrossEntropyLoss(), epochs = 10)

Epoch [1/10], Loss: 5.9816
Epoch [2/10], Loss: 5.6350
Epoch [3/10], Loss: 5.5669
Epoch [4/10], Loss: 5.5314
Epoch [5/10], Loss: 5.5085
Epoch [6/10], Loss: 5.4922
Epoch [7/10], Loss: 5.4798
Epoch [8/10], Loss: 5.4699
Epoch [9/10], Loss: 5.4618
Epoch [10/10], Loss: 5.4548


5.454845255846027
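
One way to put this number in context (an interpretation added here, not part of the original notebook): a model that assigns equal probability to every vocabulary entry would have a cross-entropy of $\ln V$, and the exponential of the loss can be read as a perplexity. The sketch below assumes the training run used the 5702-entry vocabulary shown in the model printout and takes the final loss from the output above.

```python
import math

vocab_size = 5702               # from the model printout above (assumed to match this run)
final_loss = 5.454845255846027  # value returned by train_epoch above

# Cross-entropy of a uniform guess over the whole vocabulary
uniform_loss = math.log(vocab_size)

# exp(loss) is the effective number of equally likely candidate words
perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.4f}")
print(f"perplexity after training: {perplexity:.1f}")
```

A loss well below the uniform baseline indicates that the model has learned something about which words occur together, even if the embeddings are still rough.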

## Trying out Word2Vec

To use Word2Vec, let's extract vectors corresponding to all words in our vocabulary:

In [34]:
vectors = torch.stack([embedder(torch.tensor(vocab[s])) for s in vocab.itos], 0)

AttributeError: 'Vocab' object has no attribute 'itos'

Recent torchtext releases expose the index-to-token list through the `get_itos()` method instead of the `itos` attribute:

In [35]:
vectors = torch.stack([embedder(torch.tensor(vocab[s])) for s in vocab.get_itos()], 0)

Let's see, for example, how the word **Paris** is encoded into a vector:

In [36]:
paris_vec = embedder(torch.tensor(vocab['paris']))
print(paris_vec)

tensor([-0.9116, -1.0474, -1.0554, -0.5568, -2.3852, -3.1054,  0.5675,  0.0480,
        -0.1078,  0.0874, -1.2703, -1.3895, -0.2785,  0.1189,  0.9470, -0.9221,
         0.3426,  0.8443,  1.4859,  0.5325,  0.4277, -1.9230, -0.5424, -0.6481,
        -1.4530,  1.4176, -1.1798, -0.1354,  1.7082, -0.9876],
       grad_fn=<EmbeddingBackward0>)


It is interesting to use Word2Vec to look for synonyms. The following function returns the `n` closest words to a given input. To find them, we compute the norm $\|w_i - v\|$, where $v$ is the vector corresponding to our input word and $w_i$ is the encoding of the $i$-th word in the vocabulary. We then sort the distances with `argsort` and take the first `n` indices, which give the positions of the closest words in the vocabulary. The input word itself normally comes first, because its distance to itself is zero.

In [37]:
def close_words(x, n = 5):
  vec = embedder(torch.tensor(vocab[x]))
  top5 = np.linalg.norm(vectors.detach().numpy() - vec.detach().numpy(), axis = 1).argsort()[:n]
  return [ vocab.itos[x] for x in top5 ]

close_words('microsoft')

AttributeError: 'Vocab' object has no attribute 'itos'

As before, the `itos` attribute has to be replaced with the `get_itos()` method:

In [38]:
def close_words(x, n=5):
    vec = embedder(torch.tensor(vocab[x]))
    # Compute the distance from the given word's vector to all other vectors
    distances = np.linalg.norm(vectors.detach().numpy() - vec.detach().numpy(), axis=1)
    # Find the indices of the `n` closest vectors
    top_n_indices = distances.argsort()[:n]
    # Convert indices back to tokens using `get_itos()`
    itos = vocab.get_itos()
    return [itos[idx] for idx in top_n_indices]

# Example usage
print(close_words('microsoft'))

['microsoft', 'stop', 'patrick', 'effectively', 'levels']


In [39]:
close_words('basketball')

['basketball', 'consultants', 'start', 'according', '\\come']

In [40]:
close_words('funds')

['funds', 'computerized', 'range', 'ducking', 'walters']
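
Euclidean distance depends on vector length as well as direction. A common alternative for comparing embeddings is cosine similarity, which only looks at direction. The following is a minimal sketch on hypothetical toy vectors, not the notebook's method. The name `closest_by_cosine` is made up for illustration, and the code assumes no vector has zero length.

```python
import numpy as np

def closest_by_cosine(query, vectors, n=3):
    """Return indices of the n rows of `vectors` most similar to `query`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    similarities = vectors @ query / norms
    # Negate so that argsort puts the highest similarity first
    return np.argsort(-similarities)[:n]

# Hypothetical 2-D "embeddings"; row 1 points almost the same way as row 0
toy_vectors = np.array([
    [1.0, 0.0],
    [5.0, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

cosine_order = closest_by_cosine(toy_vectors[0], toy_vectors)
euclid_order = np.linalg.norm(toy_vectors - toy_vectors[0], axis=1).argsort()
print("cosine:   ", cosine_order)
print("euclidean:", euclid_order)
```

In this toy data, row 1 is the nearest neighbor of row 0 by cosine similarity but the farthest by Euclidean distance, because it is much longer. To try this on the trained embeddings, one could pass `vectors.detach().numpy()` in place of `toy_vectors`.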

## Takeaway

Using techniques such as CBoW, we can train a Word2Vec model. You may also try training a skip-gram model, which predicts the neighboring word given the central one, and see how well it performs.
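
As a starting point for the skip-gram experiment, here is a minimal sketch of pair generation. It uses the same window logic as `to_cbow` but swaps the roles, so each pair is [central word, neighboring word]. The name `to_skipgram` is hypothetical.

```python
def to_skipgram(sent, window_size=2):
    """Return [center, neighbor] pairs: the center word is the input."""
    res = []
    for i, center in enumerate(sent):
        for j in range(max(0, i - window_size), min(i + window_size + 1, len(sent))):
            if i != j:
                res.append([center, sent[j]])
    return res

skipgram_pairs = to_skipgram(['I', 'like', 'to', 'train', 'networks'], window_size=1)
print(skipgram_pairs)
```

Because the input is still a single word index, the same embedding-plus-linear model and training loop could in principle be reused with these pairs.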