# PyTorch🔥Embeddings - Build Word2vec CBOW Model (150 Novels Dataset)

Subject: Building a Word2vec-like CBOW model that creates 100-dimensional embeddings for the dataset's words, so that words used in similar contexts end up close to each other.

Data: txtlab_Novel150_English (150 English novels from the 19th century)

Procedure:
- Tokenizing with nltk.tokenize.word_tokenize and nltk.corpus.stopwords
- Creating contexts and targets from five-word windows: words 0, 1, 3, and 4 as context, word 2 as target
- Tensorizing contexts and targets
- Creating a vocabulary with torchtext.vocab.build_vocab_from_iterator
- Creating a custom torch.utils.data.Dataset for a torch.utils.data.DataLoader
- Word2vec-like CBOW model with torch.nn.Module, torch.nn.Embedding, torch.nn.Linear, torch.nn.ReLU, and torch.nn.LogSoftmax
- Training with torch.nn.NLLLoss, torch.optim.SGD, and torch.optim.lr_scheduler.StepLR
- Evaluation by finding some nearest words and playing with word vectors
- Disappointing results (probably much more data required)

Others:
- CUDA support
- Working on Colab with Google Drive for saving/loading interim stages

Sources used:
- https://github.com/FraLotito/pytorch-continuous-bag-of-words/blob/master/cbow.py

Tested on Colab with CUDA for multiple parameter settings. Learning rate, batch size, and scheduler seem to make no big difference. After 5 epochs, the loss is usually around 8.3. The tests for KING, QUEEN, etc. fail. The dataset seems to be too small.

In [58]:
import torch
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Running on {DEVICE}')

if IN_COLAB := 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
  BASE_PATH = './drive/MyDrive/Colab/'
  import nltk
  nltk.download('stopwords')

else:
  BASE_PATH = './'

Running on cpu


## Dataset

In [1]:
import pandas as pd
import locale
locale.setlocale(locale.LC_ALL, locale='')  # for thousands separator via ... print(f'{value:n}')

'German_Germany.1252'

In [2]:
PATH_METADATA_TABLE = './data/novel/txtlab_Novel150_English.csv'
PATH_DIR_NOVELS = './data/novel/txtlab_Novel150_English/'

metadata_tb = pd.read_csv(PATH_METADATA_TABLE)  # contains novel filenames

In [3]:
novels: list[str] = []

for filename in metadata_tb['filename']:
    with open(PATH_DIR_NOVELS+filename, 'r', encoding="utf8") as file_in:
        novel = file_in.read()
    
    # add the whole novel text as single string
    novels.append(novel)


#todo
#novels = novels[:2]

print(f"Read list of {len(novels) :n} Novels.")

Read list of 150 Novels.


In [4]:
# print('Extract from 16th novel:\n\n', novels[15][1003:1105])

## Preprocessing

For CBOW, we need training data with...
- the two previous words and 
- the two next words as X
- and the word in-between as target (y).

Preprocessing:
- We start with the whole novels as a string each.
- We tokenize the words in each novel.
- We slide along the words, from index=2 to index=-3.
- For each of those target words, we collect the two previous and the two next words as X.
- We do that for each novel.
- The results are collected in a list of tuples, each tuple containing the two previous and two next words in a tuple (of 4), and the target word as string.

Note: Not splitting the text into sentences first is not ideal, because context windows can then span sentence boundaries. We skip sentence tokenization here for the sake of simplicity.
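One way to respect sentence boundaries is sketched below. It assumes the text has already been split into sentences and tokenized (for example with `nltk.tokenize.sent_tokenize`, which this notebook does not use). The function name `contexts_per_sentence` and the toy sentences are illustrative only.

```python
def contexts_per_sentence(sentences, window=2):
    """Build (context, target) pairs without crossing sentence boundaries.

    `sentences` is a list of token lists; sentences shorter than
    2 * window + 1 tokens yield no pairs.
    """
    pairs = []
    for tokens in sentences:
        for i in range(window, len(tokens) - window):
            # `window` words before and after the target word
            context = tuple(tokens[i - window:i] + tokens[i + 1:i + window + 1])
            pairs.append((context, tokens[i]))
    return pairs


toy_sentences = [
    ["dog", "made", "point", "piece", "ground"],
    ["led", "curate", "two"],  # too short for a five-word window
]
pairs = contexts_per_sentence(toy_sentences)
print(pairs)
```

The trade-off is that short sentences contribute no training pairs at all, which shrinks the training data further once stopwords are removed.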

### Tokenize Words

In [5]:
import numpy as np
import string
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')

import torch
import torch.nn as nn
import torchtext


In [6]:
MIN_WORD_OCCURRENCES = 20

In [7]:
average_size = round(np.average([len(novel) for novel in novels]))
print(f"Starting with a list of {len(novels)} Novels with an average size of {average_size :n} characters each.")

Starting with a list of 150 Novels with an average size of 688.646 characters each.


In [8]:
# to remove punctuation
# translator = str.maketrans('', '', string.punctuation + '“”-;'’)
translator = str.maketrans('', '', string.punctuation + '“”’…')  # —
novels_without_punctuation =  [novel.translate(translator) for novel in novels]


average_size_without_punctuation = round(np.average([len(novel) for novel in novels_without_punctuation]))
print(f"Removed an average of {average_size - average_size_without_punctuation :n} punctuation characters.",
      f"Remaining average characters per novel: {average_size_without_punctuation :n}.")

Removed an average of 25.383 punctuation characters. Remaining average characters per novel: 663.263.


In [9]:
tokenized_novels_ = [nltk.tokenize.word_tokenize(novel.lower(), 
                                                language='english')
                    for novel in novels_without_punctuation]

average_num_words = round(np.average([len(novel) for novel in tokenized_novels_]))
print(f'Tokenized words in {len(novels_without_punctuation) :n} novels with an average of {average_num_words :n} words per novel.')

Tokenized words in 150 novels with an average of 123.369 words per novel.


In [10]:
# Remove most common words
stop_words = set(stopwords.words('english'))

tokenized_novels = [[word for word in novel if word not in stop_words]
                    for novel in tokenized_novels_]

average_num_words = round(np.average([len(novel) for novel in tokenized_novels]))
print(f'Remaining average of {average_num_words :n} words per novel after removing most common words (stopwords).')

Remaining average of 59.212 words per novel after removing most common words (stopwords).


In [11]:
# Remove words with special characters, e.g. 'élégante', 'wicked–base–ever'
# this would remove only those with 2+ special characters:
#tokenized_novels = [[word for word in novel if not (special := re.findall(pattern='[^A-Za-z0-9.]+', string=word)) or len(special[0]) < 2  ]
#                    for novel in tokenized_novels]
tokenized_novels = [[word for word in novel if not (special := re.findall(pattern='[^A-Za-z0-9.]+', string=word))]
                    for novel in tokenized_novels]
average_num_words = round(np.average([len(novel) for novel in tokenized_novels]))
print(f'Remaining average of {average_num_words :n} words per novel after removing words with special characters.')

Remaining average of 58.341 words per novel after removing words with special characters.


In [12]:
# Remove words that have a word occurrence below the threshold
flat_words = [item for sublist in tokenized_novels for item in sublist]
words_counter = Counter(flat_words)

tokenized_novels = [[word for word in novel if words_counter[word] >= MIN_WORD_OCCURRENCES]
                    for novel in tokenized_novels]

average_num_words = round(np.average([len(novel) for novel in tokenized_novels]))
print(f'Remaining average of {average_num_words :n} words per novel after removing words below minimum occurrence count.')

Remaining average of 56.038 words per novel after removing words below minimum occurrence count.


### Create Context Words

In [13]:
contexts: list[tuple[tuple[str, str, str, str], str]] = []

for novel in tokenized_novels:
    # slide a five-word window over the novel; the middle word is the target
    for i in range(2, len(novel) - 2):
        context = (novel[i - 2], 
                   novel[i - 1],
                   novel[i + 1], 
                   novel[i + 2])
        target = novel[i]
        contexts.append((context, target))

print(f"Collected context training data of size {len(contexts) :n} with 4 context words and a target word each. ")

Collected context training data of size 8.405.037 with 4 context words and a target word each. 


In [14]:
#Example: From Beginning of novel 1 to first context words

print(f'Beginning of first novel: {novels[0][:62]}\n')

print(f'First tokens: {tokenized_novels[0][:8]}\n')
    
print(f'Context 0: {contexts[0]}')
print(f'Context 1: {contexts[1]}')
print(f'Context 2: {contexts[2]}')
print(f'Context 3: {contexts[3]}')
print(f'Context 4: {contexts[4]}')

Beginning of first novel: 


AUTHOR’S INTRODUCTION



My dog had made a point on a piece

First tokens: ['authors', 'introduction', 'dog', 'made', 'point', 'piece', 'led', 'curate']

Context 0: (('authors', 'introduction', 'made', 'point'), 'dog')
Context 1: (('introduction', 'dog', 'point', 'piece'), 'made')
Context 2: (('dog', 'made', 'piece', 'led'), 'point')
Context 3: (('made', 'point', 'led', 'curate'), 'piece')
Context 4: (('point', 'piece', 'curate', 'two'), 'led')


### Words to Index

Unlike gensim's Word2Vec, which accepts tokens as strings, PyTorch's embedding layer expects integer indices.

In [15]:
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_novels)

print(f'The {type(vocab)} has indices for a total of {len(vocab) :n} different words.')

The <class 'torchtext.vocab.vocab.Vocab'> has indices for a total of 24.326 different words.


In [16]:
# convert our context training data's contents to indices
context_indices: list[tuple[list[int], int]] = []
for (context, target) in contexts:
    context_ind = [vocab[c] for c in context]
    target_ind = vocab[target]
    context_indices.append((context_ind, target_ind))
    
print(f'Example: {contexts[2]} -> {context_indices[2]}')

Example: (('dog', 'made', 'piece', 'led'), 'point') -> ([935, 17, 949, 423], 321)
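The word-to-index mapping is itself a small data structure. The sketch below builds one in plain Python. It assumes indices are assigned by descending word frequency with alphabetical tie-breaking, which is a choice made for this sketch. The names `build_word_index`, `word_to_index`, and `index_to_word` are hypothetical and not part of torchtext.

```python
from collections import Counter


def build_word_index(tokenized_texts):
    """Map each word to an integer index, most frequent word first."""
    counts = Counter(word for text in tokenized_texts for word in text)
    # Sort by descending count, then alphabetically for a stable order.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    word_to_index = {word: i for i, word in enumerate(ordered)}
    index_to_word = ordered  # list position == index
    return word_to_index, index_to_word


toy_texts = [["dog", "made", "point"], ["dog", "led", "point", "dog"]]
word_to_index, index_to_word = build_word_index(toy_texts)

# Convert one context tuple to indices, as in the cell above.
context = ("dog", "made", "led", "point")
context_ind = [word_to_index[w] for w in context]
print(context_ind)
```

A mapping like this can stand in for the `vocab` object wherever only `vocab[word]` and the reverse lookup are needed.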


### Tensorize

Finally, we need to tensorize our indices and targets.
- Shape [N, 4] for the context words, where N is the number of collected contexts (8,405,037 for all 150 novels)
- Shape [N] for the targets, i.e. one index per context

In [17]:
# build an integer tensor directly instead of converting from float32
contexts = torch.tensor([context_ind for (context_ind, _) in context_indices], dtype=torch.long)

In [18]:
# build an integer tensor directly instead of converting from float32
targets = torch.tensor([target_ind for (_, target_ind) in context_indices], dtype=torch.long)

assert len(contexts) == len(targets)

## Dataset and DataLoader

In [19]:
from torch.utils.data import DataLoader, Dataset

### Dataset
Create a Torch Dataset

In [20]:
class CustomDataset(Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        assert len(x) == len(y)
        
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

In [21]:
dataset = CustomDataset(contexts.to(DEVICE), targets.to(DEVICE))

## Save and Load
Preprocessing takes a long time when including all novels.

In [22]:
DATASET_CONTEXTS = BASE_PATH + 'saves/contexts_novel.pt'
DATASET_TARGETS = BASE_PATH + 'saves/targets_novel.pt'
VOCAB_PATH = BASE_PATH + 'saves/vocab.pt'

In [23]:
torch.save(vocab, VOCAB_PATH)

In [24]:
torch.save(contexts, DATASET_CONTEXTS)
torch.save(targets, DATASET_TARGETS)

In [25]:
#vocab = torch.load(VOCAB_PATH)
#print(f'Loaded vocab of size {len(vocab) :n}.')

In [26]:
# we save & load not the dataset but x and y to make transfer to device easier
#contexts = torch.load(DATASET_CONTEXTS).to(DEVICE)
#targets = torch.load(DATASET_TARGETS).to(DEVICE)
#dataset = CustomDataset(contexts, targets)
#print(f'Loaded {len(dataset) :n} context tensors as training data.')

### DataLoader

In [27]:
BATCH_SIZE = 512

In [28]:
train_loader = DataLoader(dataset, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True)

## Model

In [29]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()

        #out: 1 x embedding_dim

        self.embeddings = nn.Embedding(num_embeddings=vocab_size,  # size of the dictionary of embeddings
                                       embedding_dim=embedding_dim)  # size of each embedding vector
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()
        
        #out: 1 x vocab_size
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim = -1)
        

    def forward(self, inputs):  # inputs: [batch_size, 4]
        embeds_ = self.embeddings(inputs)  # [batch_size, 4, embedding_dim]
        embeds = torch.sum(embeds_, dim=1)  # [batch_size, embedding_dim]
        # older, unbatched variant: embeds = sum(self.embeddings(inputs)).view(1, -1)
        out = self.linear1(embeds)  # [batch_size, 128]
        out = self.activation_function1(out)  # [batch_size, 128]
        out = self.linear2(out)  # [batch_size, vocab_size]
        out = self.activation_function2(out)  # [batch_size, vocab_size], log-probabilities
        return out

    #def get_word_emdedding(self, word):
    #    word = torch.tensor([word_to_ix[word]])
    #    return self.embeddings(word).view(1,-1)

## Training

In [30]:
from tqdm.auto import tqdm
import math
import time
import torch.optim.lr_scheduler as lr_scheduler

In [31]:
EMDEDDING_DIM = 100

N_EPOCHS = 1  # 50

DISPLAY_EVERY_N_STEPS = 10000

In [48]:
model = CBOW(len(vocab), EMDEDDING_DIM).to(DEVICE)

loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30)
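How quickly `StepLR` lowers the learning rate depends on how often `scheduler.step()` is called. The arithmetic below is a sketch that assumes the default `gamma=0.1`, one scheduler call per 10,000 training samples (the value of `DISPLAY_EVERY_N_STEPS`), and the 8,405,037 samples reported earlier. The helper `lr_after` is hypothetical and only mirrors the StepLR decay rule.

```python
def lr_after(num_scheduler_steps, base_lr=0.1, step_size=30, gamma=0.1):
    """Learning rate after a number of StepLR.step() calls."""
    return base_lr * gamma ** (num_scheduler_steps // step_size)


samples_per_epoch = 8_405_037
samples_per_scheduler_step = 10_000  # assumption: one call per display interval

calls_per_epoch = samples_per_epoch // samples_per_scheduler_step
decays_per_epoch = calls_per_epoch // 30

print(f"scheduler calls per epoch: {calls_per_epoch}")
print(f"10x decays per epoch:      {decays_per_epoch}")
print(f"lr after  30 calls: {lr_after(30):.0e}")
print(f"lr after 300 calls: {lr_after(300):.0e}")
```

Under these assumptions the learning rate drops by a factor of ten roughly every 300,000 samples, so it may be worth checking whether stepping the scheduler once per epoch changes the results.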

In [49]:
print(f'Starting training with N_EPOCHS = {N_EPOCHS} and a training data of {len(contexts) :n} context tensors.')

model.train()


steps_done = 0
total_steps = N_EPOCHS * len(train_loader.dataset)
recent_losses = []
interim_time = time.time()
for epoch in range(N_EPOCHS):

    # x_batch: [{batch_size}, 4]
    # y_batch: [{batch_size}]
    for batch, (x_batch, y_batch) in enumerate(tqdm(train_loader)):
            
        optimizer.zero_grad()
        
        log_probs = model(x_batch)  # [batch_size, vocab_size]
        current_loss = loss_function(log_probs, y_batch)  # scalar (mean over the batch)

        recent_losses.append(current_loss.item())

        current_loss.backward()
        optimizer.step()

        steps_done += len(x_batch)
        
        if (steps_done) % DISPLAY_EVERY_N_STEPS < len(x_batch):
            elapsed_time = time.time() - interim_time
            interim_time = time.time()
            average_loss = np.average(recent_losses)
            recent_losses = []
            scheduler.step()  # note: called once per display interval, not once per epoch
            print(f'| epoch {epoch + 1 :3d}/{N_EPOCHS} ',
                  f'| batch {batch + 1 :n}/{len(train_loader) :n} ',
                  f'| {steps_done :n}/{total_steps :n} vectors done ',
                  f'| {elapsed_time :.2f} sec. ',
                  f'| new lr {optimizer.param_groups[0]["lr"]}',
                  f'| loss {average_loss:5.2f}') #  :n
            

Starting training with N_EPOCHS = 1 and a training data of 8.405.037 context tensors.


  0%|          | 0/16417 [00:00<?, ?it/s]

| epoch   0/1  | batch 20/16.417  | 10.240/8.405.037 vectors done  | 5.64 sec.  | new lr 0.1 | loss 10.20
| epoch   0/1  | batch 40/16.417  | 20.480/8.405.037 vectors done  | 4.79 sec.  | new lr 0.1 | loss 10.14
| epoch   0/1  | batch 59/16.417  | 30.208/8.405.037 vectors done  | 4.53 sec.  | new lr 0.1 | loss 10.09
| epoch   0/1  | batch 79/16.417  | 40.448/8.405.037 vectors done  | 4.70 sec.  | new lr 0.1 | loss 10.05
| epoch   0/1  | batch 98/16.417  | 50.176/8.405.037 vectors done  | 4.44 sec.  | new lr 0.1 | loss 10.01
| epoch   0/1  | batch 118/16.417  | 60.416/8.405.037 vectors done  | 4.67 sec.  | new lr 0.1 | loss  9.96
| epoch   0/1  | batch 137/16.417  | 70.144/8.405.037 vectors done  | 4.45 sec.  | new lr 0.1 | loss  9.93
| epoch   0/1  | batch 157/16.417  | 80.384/8.405.037 vectors done  | 4.80 sec.  | new lr 0.1 | loss  9.88
| epoch   0/1  | batch 176/16.417  | 90.112/8.405.037 vectors done  | 4.64 sec.  | new lr 0.1 | loss  9.83
| epoch   0/1  | batch 196/16.417  | 100.3

KeyboardInterrupt: 
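To put the loss values in the log above into perspective: `NLLLoss` with its default mean reduction is the average negative log-likelihood of the target word, so a model that guesses uniformly over the vocabulary scores ln(vocab_size). The short calculation below uses the vocabulary size of 24,326 reported earlier and the loss of 8.3 mentioned in the introduction; it is a back-of-the-envelope check, not part of the original notebook.

```python
import math

vocab_size = 24_326  # vocabulary size reported above

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)

# Perplexity that corresponds to a loss of 8.3.
perplexity = math.exp(8.3)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"perplexity at 8.3:  {perplexity:.0f}")
```

The uniform baseline is about 10.1, close to the first logged loss of 10.20. A loss of 8.3 corresponds to a perplexity of roughly 4,000, meaning the model is still about as uncertain as a uniform choice among some 4,000 words.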

In [36]:
# save trained model
MODEL_PATH = BASE_PATH + 'saves/model.pt'

In [37]:
torch.save(model.state_dict(), MODEL_PATH)

In [None]:
# load
#model = CBOW(len(vocab), EMDEDDING_DIM)
#model.load_state_dict(torch.load(MODEL_PATH))
#model.eval()

## Evaluation

In [38]:
model.eval()

CBOW(
  (embeddings): Embedding(24326, 100)
  (linear1): Linear(in_features=100, out_features=128, bias=True)
  (activation_function1): ReLU()
  (linear2): Linear(in_features=128, out_features=24326, bias=True)
  (activation_function2): LogSoftmax(dim=-1)
)

In [50]:
def get_index(*tokens) -> list[int] | int:
    indices = []
    for token in tokens:
        if token not in vocab:
            raise ValueError(f'Token not found: {token}')
        indices.append(vocab[token])
    return indices if len(indices) > 1 else indices[0]

get_index('king')

670

### Normalize Embeddings

In [51]:
# read the embedding matrix from the model's embedding layer
embeddings = model.embeddings.weight.detach().cpu().numpy()  # (vocab_size, 100)


In [52]:
# normalize each row to unit length
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)  # ndarray (vocab_size,)
norms = np.reshape(norms, (len(norms), 1))  # (vocab_size, 1)
embeddings_normalized = embeddings / norms  # (vocab_size, 100)
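The same row-wise normalization can be written more compactly with `np.linalg.norm`. The sketch below uses a small random matrix as a stand-in for the real embedding matrix; `toy_embeddings` and the other names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(5, 3))  # stand-in for the (vocab_size, 100) matrix

# keepdims=True keeps the shape (5, 1), so no reshape is needed.
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
toy_normalized = toy_embeddings / norms  # broadcasting divides each row by its norm

# Dot products of unit vectors are cosine similarities.
similarities = toy_normalized @ toy_normalized[0]
print(similarities)
```

Because every row has length 1 after this step, a single matrix-vector product yields the cosine similarity of one word to all others, which is what the nearest-word search relies on.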

### Find Similar Words

In [53]:
def get_top_similar(word: str, top_n: int):
    if word not in vocab:
        raise ValueError(f'Not found: {word}')
    word_index = vocab[word]

    word_vector = embeddings_normalized[word_index]
    # dot products of unit vectors are cosine similarities (higher = closer)
    similarities = np.matmul(embeddings_normalized, word_vector)
    top_n_indices = np.argsort(-similarities)[1 : top_n + 1]  # skip rank 0, which is the word itself

    top_n_dict = {}
    for similar_word_index in top_n_indices:
        similar_word = vocab.lookup_token(int(similar_word_index))
        top_n_dict[similar_word] = similarities[similar_word_index]
    return top_n_dict

In [54]:
for word, similarity in get_top_similar("king", top_n=10).items():
    print(f"{word: <20}: {similarity :.3f}")

backhouse           : 0.424
samuel              : 0.375
afraid              : 0.356
frockcoat           : 0.353
isabella            : 0.353
feelingly           : 0.346
amory               : 0.345
maturity            : 0.345
uniformity          : 0.332
extenuate           : 0.331


### Vector Equations

In [55]:
emb1 = embeddings[vocab["king"]]
emb2 = embeddings[vocab["man"]]
emb3 = embeddings[vocab["woman"]]

emb4 = emb1 - emb2 + emb3
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_normalized, emb4).flatten()

# note: the query words themselves are not excluded, so they can appear among the top results
top5 = np.argsort(-dists)[:5]

for word_id in top5:
    print("{}: {:.3f}".format(vocab.lookup_token(word_id), dists[word_id]))

woman: 0.667
king: 0.599
fright: 0.346
signal: 0.345
rival: 0.326
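In the result above, the query words `woman` and `king` occupy the top two ranks. A common variation is to skip the query words when ranking. The sketch below shows this on hand-made two-dimensional toy vectors. Unlike the cell above, it combines the normalized vectors, and the function `analogy` plus all toy data are assumptions made for illustration.

```python
import numpy as np


def analogy(emb_normalized, word_to_index, index_to_word, a, b, c, top_n=3):
    """Return the top_n words closest to a - b + c, skipping a, b and c."""
    query = (emb_normalized[word_to_index[a]]
             - emb_normalized[word_to_index[b]]
             + emb_normalized[word_to_index[c]])
    query = query / np.linalg.norm(query)
    sims = emb_normalized @ query  # cosine similarity to every word
    excluded = {word_to_index[w] for w in (a, b, c)}
    ranked = [i for i in np.argsort(-sims) if i not in excluded]
    return [(index_to_word[i], float(sims[i])) for i in ranked[:top_n]]


# Hand-made 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
words = ["king", "queen", "man", "woman", "dog"]
vectors = np.array([[1.0, 1.0], [1.0, -1.0], [0.1, 1.0], [0.1, -1.0], [-1.0, 0.2]])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
word_to_index = {w: i for i, w in enumerate(words)}

result = analogy(vectors, word_to_index, words, "king", "man", "woman")
print(result)
```

On these toy vectors `queen` ranks first. With the trained embeddings the outcome depends on the quality of the training, so excluding the query words only removes the trivial matches.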
