<a href="https://colab.research.google.com/github/veekaybee/viberary/blob/main/notebooks/cbow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Wrap text results for easier display purposes
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# CBOW Implementation of Word2Vec

This is part of the background research that I'm working on for [viberary.pizza](https://viberary.pizza/).

## Background 

[Word2vec](https://arxiv.org/abs/1301.3781) was a turning point in NLP. It built on earlier work in dimensionality reduction such as tf-idf, topic modeling, and latent semantic analysis, both to shrink vocabulary sizes and cut computational cost and to add context by embedding similar words close together in the same latent space.

As of 2022, it has almost been superseded by [transformer-based architectures](https://e2eml.school/transformers.html), but it's still worth understanding how it works, both for historical context and because there is a fair amount of it [in production in Spark](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.ml.feature.Word2Vec.html).

# Word2Vec Implementation

There are numerous word2vec implementations in libraries like Spark and Tensorflow. PyTorch doesn't ship an exact equivalent, but by following [this code](https://github.com/FraLotito/pytorch-continuous-bag-of-words/blob/master/cbow.py), as well as reading about the architecture [here](https://towardsdatascience.com/word2vec-with-pytorch-implementing-original-paper-2cd7040120b0) and [here](https://jalammar.github.io/illustrated-word2vec/), I was able to implement it and understand how it works under the covers.


The original explanation is in the [PyTorch word embeddings tutorial](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

As a starting point, we take our raw data for input.
Our [training set for Viberary is a sample of the Goodreads input dataset](https://github.com/veekaybee/viberary#input-data-sample),
which is a string of text containing the metadata for each unique book id.

For a single book id, it will contain the book title, the book description, etc. So a sample for a single book will look like this:


```
Raw text: All's Fairy in Love and War (Avalon: Web of Magic, #8) To Kara's astonishment, she discovers that a portal has opened in her bedroom closet and two goblins 
have fallen through! They refuse to return to the fairy realms and be drafted for an impending war. 
In an attempt to roust the pesky creatures, Kara falls through the portal, smack into the middle of a huge war.
Kara meets Queen Selinda, who appoints Kara as a Fairy Princess and assigns her an impossible task: 
to put an end to the war using her diplomatic skills.
```

This is initially stored as a Python string. 

Our final goal in learning a Word2Vec model with CBOW is, given an input phrase over a context window, to predict the word that's missing. The context window is how many words before and after the word we care about. So, given the phrase "Kara falls X the portal", we should be able to predict that the correct word is "through."

We do this in Word2Vec by continuously sampling from the raw text over the context window, where the context window around the word is the X variable and the word itself is the target variable. 

For the first example, "Kara falls the portal" is the context and "through" is the response variable. Then we shift the window by 1 word and generate another entry. This is the whole of the [continuous bag of words approach.](https://arxiv.org/pdf/1301.3781.pdf)
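The sliding window can be sketched in a few lines of plain Python. The helper name `cbow_pairs` is made up for this illustration and is not part of the notebook; with a window of 2 and the five-word phrase from above, there is exactly one position that has a full window on both sides.

```python
def cbow_pairs(tokens, window):
    """Return (context, target) pairs for every position with a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        # `window` words to the left plus `window` words to the right
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "Kara falls through the portal".split()
pairs = cbow_pairs(tokens, window=2)
print(pairs)
# [(['Kara', 'falls', 'the', 'portal'], 'through')]
```

On a longer text, each step of the loop shifts the window by one word and emits one more pair.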

When we're first training the model, we pass these samples into the model and ask it to predict a single word for each sample. The output is a vector of probabilities, one for each word in the vocabulary. We then compare that prediction to the actual label (i.e. for the sample "Kara falls X the portal" we KNOW the correct word is "through").

We compare the actual vector (i.e. where through = 1 and every other word = 0) to the probability vector, and the difference between the two is the loss. The model's parameters are updated over multiple epochs until we minimize the loss, i.e. until the prediction is as close to the actual word as possible.
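To make the comparison concrete, here is a small sketch with a made-up five-word vocabulary and a made-up probability vector (both are assumptions for illustration, not model output). It measures the difference as the negative log-likelihood, which is the quantity the `nn.NLLLoss` used later in this notebook computes from log-probabilities.

```python
import math

vocab = ["Kara", "falls", "through", "the", "portal"]
target = "through"

# Hypothetical model output: one probability per vocabulary word (sums to 1)
predicted = [0.05, 0.10, 0.70, 0.10, 0.05]

# The "actual" vector is one-hot: 1 at the target word, 0 everywhere else
actual = [1.0 if word == target else 0.0 for word in vocab]

# Negative log-likelihood: only the probability at the target position counts
loss = -sum(a * math.log(p) for a, p in zip(actual, predicted))
print(f"loss = {loss:.4f}")
```

The more probability the model puts on "through", the closer the loss gets to 0.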

In the process of doing this prediction, we create a lookup table, or embeddings matrix, that maps each word to its vector representation. It is these vectors that become our embeddings. Andiamo!

In [None]:
!pip install -q torch # if you don't have it already 

In [None]:
# we need these bois
import torch
import torch.nn as nn

## Text Preparation

In [None]:
# First, we'll initialize our hyperparameters for the model: 

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right - this is our context window
EMBEDDING_DIM = 100 # length of each word's vector (columns of the embeddings matrix) - we'll get to this in a bit

In [None]:
# Our tiny training dataset 

raw_text = """To Kara's astonishment, she discovers that a portal has opened in her bedroom closet and two goblins have fallen through! They refuse to return to the fairy realms and be drafted for an impending war. 
In an attempt to roust the pesky creatures, Kara falls through the portal, 
smack into the middle of a huge war. Kara meets Queen Selinda, who appoints 
Kara as a Fairy Princess and assigns her an impossible task: 
to put an end to the war using her diplomatic skills.""".split()

In [None]:
# Text preprocessing: keep only the unique words
vocab = set(raw_text) # dedup
vocab_size = len(vocab)

In [None]:
print(vocab)

{'she', 'middle', 'put', 'be', 'discovers', 'into', 'Selinda,', 'opened', 'and', 'using', 'the', 'creatures,', 'has', 'Fairy', 'end', 'Princess', 'meets', 'realms', 'skills.', 'a', 'To', 'that', 'goblins', 'fairy', 'impending', 'appoints', 'impossible', 'astonishment,', 'fallen', 'for', 'an', 'closet', 'attempt', "Kara's", 'of', 'task:', 'war', 'They', 'In', 'war.', 'pesky', 'smack', 'through', 'diplomatic', 'refuse', 'assigns', 'bedroom', 'portal,', 'have', 'her', 'huge', 'as', 'to', 'Queen', 'return', 'through!', 'in', 'Kara', 'drafted', 'roust', 'two', 'portal', 'who', 'falls'}


In [None]:
# we create simple mappings of word to an index of the word
word_to_ix = {word: ix for ix, word in enumerate(vocab)}
ix_to_word = {ix: word for ix, word in enumerate(vocab)}

In [None]:
print(word_to_ix)

{'she': 0, 'middle': 1, 'put': 2, 'be': 3, 'discovers': 4, 'into': 5, 'Selinda,': 6, 'opened': 7, 'and': 8, 'using': 9, 'the': 10, 'creatures,': 11, 'has': 12, 'Fairy': 13, 'end': 14, 'Princess': 15, 'meets': 16, 'realms': 17, 'skills.': 18, 'a': 19, 'To': 20, 'that': 21, 'goblins': 22, 'fairy': 23, 'impending': 24, 'appoints': 25, 'impossible': 26, 'astonishment,': 27, 'fallen': 28, 'for': 29, 'an': 30, 'closet': 31, 'attempt': 32, "Kara's": 33, 'of': 34, 'task:': 35, 'war': 36, 'They': 37, 'In': 38, 'war.': 39, 'pesky': 40, 'smack': 41, 'through': 42, 'diplomatic': 43, 'refuse': 44, 'assigns': 45, 'bedroom': 46, 'portal,': 47, 'have': 48, 'her': 49, 'huge': 50, 'as': 51, 'to': 52, 'Queen': 53, 'return': 54, 'through!': 55, 'in': 56, 'Kara': 57, 'drafted': 58, 'roust': 59, 'two': 60, 'portal': 61, 'who': 62, 'falls': 63}
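One thing to be aware of: the mapping above is built from a Python `set`, and because string hashing is randomized per process by default, the word-to-index assignment can differ between runs. A minimal sketch of one way to make it reproducible, assuming that sorting the vocabulary is acceptable (the short `tokens` list is just a stand-in for `raw_text`):

```python
tokens = "to put an end to the war".split()

# sorted() fixes the order, so the same text always yields the same indices
vocab = sorted(set(tokens))
word_to_ix = {word: ix for ix, word in enumerate(vocab)}
ix_to_word = {ix: word for word, ix in word_to_ix.items()}

print(word_to_ix)
# {'an': 0, 'end': 1, 'put': 2, 'the': 3, 'to': 4, 'war': 5}
```

Stable indices matter if you save the embeddings matrix and want to look words up again in a later session.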


In [None]:
# Creating our training data and context window

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    # CONTEXT_SIZE words to the left and CONTEXT_SIZE words to the right of position i
    context = raw_text[i - CONTEXT_SIZE:i] + raw_text[i + 1:i + CONTEXT_SIZE + 1]
    target = raw_text[i]
    data.append((context, target))

In [None]:
# We have our [input, input, input, input, target]
# based on the context window of +2 words -2 words
# you can see how we're building words close to each other now
print(*data[0:10], sep="\n")

(['To', "Kara's", 'she', 'discovers'], 'astonishment,')
(["Kara's", 'astonishment,', 'discovers', 'that'], 'she')
(['astonishment,', 'she', 'that', 'a'], 'discovers')
(['she', 'discovers', 'a', 'portal'], 'that')
(['discovers', 'that', 'portal', 'has'], 'a')
(['that', 'a', 'has', 'opened'], 'portal')
(['a', 'portal', 'opened', 'in'], 'has')
(['portal', 'has', 'in', 'her'], 'opened')
(['has', 'opened', 'her', 'bedroom'], 'in')
(['opened', 'in', 'bedroom', 'closet'], 'her')


## Model Set Up

### CBOW Architecture

<img width="344" alt="Screen Shot 2023-02-14 at 3 48 16 PM" src="https://user-images.githubusercontent.com/3837836/218859716-495a0a6f-aed7-40aa-aba9-5c0f1949788c.png">

We have three layers in the CBOW implementation of Word2Vec: an input Embedding layer that maps each word index to a vector in the embeddings matrix, a hidden linear layer with a ReLU activation, and an output layer that produces the (log) [softmax](https://en.wikipedia.org/wiki/Softmax_function) probabilities over every word in the vocabulary, given an input window.

The critical part is the first part, creating the Embeddings lookup. 

First, we associate each word in the vocabulary with an index, aka `{'she': 0, 'middle': 1, 'put': 2, ...}`

Then, what we want to do is create an embeddings table, or matrix, with one row per index. Multiplying it by an index (as a one-hot vector) works as a table lookup that returns that word's row, placing the word in relation to all the others. The number of columns is how many dimensions you'd like to use to represent each word.

There is a [really good explanation](https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work/305032#305032) of how these are generated: 

``` 
For a given word, you create a one-hot vector based on its index and multiply it by the embeddings matrix, effectively replicating a lookup. For instance, for the word "soon" the index is 4, and the one-hot vector is [0, 0, 0, 0, 1, 0, 0]. If you multiply this (1, 7) matrix by the (7, 2) embeddings matrix you get the desired two-dimensional embedding, which in this case is [2.2, 1.4].
```
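A quick sketch to check that claim in PyTorch, using the same shapes as the quote (7 words, 2 dimensions). The weights here are randomly initialized, so the values will not match the `[2.2, 1.4]` in the quote; the point is only that the matrix product and the lookup return the same row.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 7, 2
embeddings = nn.Embedding(vocab_size, embedding_dim)

ix = torch.tensor([4])  # index of the example word "soon"

# Route 1: one-hot row vector (1, 7) times the embeddings matrix (7, 2)
one_hot = F.one_hot(ix, num_classes=vocab_size).float()
via_matmul = one_hot @ embeddings.weight

# Route 2: direct table lookup, which is what nn.Embedding does
via_lookup = embeddings(ix)

print(one_hot)
print(torch.allclose(via_matmul, via_lookup))
```

In practice the lookup is used because it skips building the mostly-zero one-hot vector.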

In [None]:
class CBOW(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim): # we pass in vocab_size and embedding_dim as hyperparams
        super(CBOW, self).__init__()

        # out: 1 x embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim) # initialize an Embedding matrix based on our inputs
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.activation_function1 = nn.ReLU()

        # out: 1 x vocab_size
        self.linear2 = nn.Linear(128, vocab_size)
        self.activation_function2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # sum the context word vectors into a single 1 x embedding_dim vector
        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear1(embeds)
        out = self.activation_function1(out)
        out = self.linear2(out)
        out = self.activation_function2(out)
        return out

    def get_word_embedding(self, word):
        word = torch.tensor([word_to_ix[word]])
        # Embeddings lookup of a single word once the Embeddings layer has been optimized 
        return self.embeddings(word).view(1, -1)
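To see what the forward pass does to the tensor shapes, here is a standalone sketch that rebuilds the same layers outside the class. It assumes the 64-word vocabulary printed above, and the four context indices are arbitrary; nothing here is trained.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 64, 100, 128
embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(embedding_dim, hidden_dim)
linear2 = nn.Linear(hidden_dim, vocab_size)

context = torch.tensor([57, 63, 10, 61])  # four context word indices

looked_up = embeddings(context)                          # (4, 100): one row per context word
summed = looked_up.sum(dim=0).view(1, -1)                # (1, 100): the "bag" of words
hidden = torch.relu(linear1(summed))                     # (1, 128)
log_probs = torch.log_softmax(linear2(hidden), dim=-1)   # (1, 64): one score per vocab word

print(looked_up.shape, summed.shape, hidden.shape, log_probs.shape)
print(log_probs.exp().sum().item())  # exponentiated log-probs sum to ~1
```

Summing the context vectors is what makes this a "bag" of words: the order of the four context words does not affect the result.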

In [None]:
# We initialize the model:

model = CBOW(vocab_size, EMBEDDING_DIM)

In [None]:
# then, we initialize the loss function 
# (aka how close our predicted word is to the actual word and how we want to minimize it using the optimizer)

loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
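A short sketch of what `nn.NLLLoss` computes when it is paired with the `LogSoftmax` output of the model: minus the log-probability assigned to the correct word. The logits and the target index below are made up for illustration. The last lines show that `nn.CrossEntropyLoss` on raw logits gives the same number, since it applies log-softmax internally.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(1, 64)   # hypothetical raw scores for a 64-word vocabulary
target = torch.tensor([42])   # index of the correct word

log_probs = nn.LogSoftmax(dim=-1)(logits)
nll = nn.NLLLoss()(log_probs, target)

# The same value read off by hand: minus the log-probability of the target
by_hand = -log_probs[0, target.item()]

# CrossEntropyLoss takes raw logits and applies log-softmax itself
cross_entropy = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), by_hand.item(), cross_entropy.item())
```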

In [None]:
# Training

# 50 epochs to start with, no correct answer here
for epoch in range(50):
    # we accumulate the loss over every training sample in this epoch
    total_loss = 0

    # for the x, y in the training data: 
    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)

        # forward pass: log-probabilities over the whole vocabulary
        log_probs = model(context_vector)

        # compare the predicted log-probabilities to the index of the actual word
        total_loss += loss_function(log_probs, torch.tensor([word_to_ix[target]]))

    # optimize at the end of each epoch (one update on the summed loss)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    
    # Log out some metrics to see if loss decreases
    print("end of epoch {} | loss {:2.3f}".format(epoch, total_loss.item()))

end of epoch 0 | loss 339.945
end of epoch 1 | loss 328.208
end of epoch 2 | loss 317.001
end of epoch 3 | loss 306.260
end of epoch 4 | loss 295.829
end of epoch 5 | loss 285.631
end of epoch 6 | loss 275.647
end of epoch 7 | loss 265.843
end of epoch 8 | loss 256.195
end of epoch 9 | loss 246.698
end of epoch 10 | loss 237.405
end of epoch 11 | loss 228.335
end of epoch 12 | loss 219.449
end of epoch 13 | loss 210.748
end of epoch 14 | loss 202.229
end of epoch 15 | loss 193.892
end of epoch 16 | loss 185.753
end of epoch 17 | loss 177.816
end of epoch 18 | loss 170.046
end of epoch 19 | loss 162.439
end of epoch 20 | loss 154.975
end of epoch 21 | loss 147.686
end of epoch 22 | loss 140.564
end of epoch 23 | loss 133.619
end of epoch 24 | loss 126.860
end of epoch 25 | loss 120.294
end of epoch 26 | loss 113.946
end of epoch 27 | loss 107.797
end of epoch 28 | loss 101.887
end of epoch 29 | loss 96.191
end of epoch 30 | loss 90.744
end of epoch 31 | loss 85.534
end of epoch 32 | los

In [None]:
# Now, let's test to see if the model predicts the correct word using our initial input
context = ["Kara","falls" , "the", "portal"]
context_vector = make_context_vector(context, word_to_ix)
a = model(context_vector)

In [None]:
print(f'Raw text: {" ".join(raw_text)}\n')
print(f"Context: {context}\n")
print(f"Prediction: {ix_to_word[torch.argmax(a[0]).item()]}")


Raw text: To Kara's astonishment, she discovers that a portal has opened in her bedroom closet and two goblins have fallen through! They refuse to return to the fairy realms and be drafted for an impending war. In an attempt to roust the pesky creatures, Kara falls through the portal, smack into the middle of a huge war. Kara meets Queen Selinda, who appoints Kara as a Fairy Princess and assigns her an impossible task: to put an end to the war using her diplomatic skills.

Context: ['Kara', 'falls', 'the', 'portal']

Prediction: through


In [None]:

# Now let's get what we care about, which is the embeddings!
print('Getting vectors for a sequence:\n', model.embeddings(torch.tensor([1, 2, 3])))

Getting vectors for a sequence:
 tensor([[ 3.6677e-01,  2.8267e-02,  9.5658e-01,  9.5077e-01,  1.0641e+00,
          8.9874e-01, -1.3958e-01, -8.8929e-01, -9.2349e-01,  2.4945e-01,
         -1.6626e+00,  3.1749e-01,  4.5353e-01,  7.8733e-01, -1.7945e+00,
          4.6523e-01,  1.4962e+00, -5.3494e-01,  3.3327e-01,  4.4590e-01,
          2.7505e-01,  4.8399e-01,  4.5670e-01, -9.6859e-01,  7.5164e-01,
          4.5564e-01, -1.8508e-01,  4.7951e-01, -5.0327e-01,  8.9468e-01,
         -5.2872e-01, -2.8511e-01, -1.3353e-02, -4.4388e-01,  3.2415e-01,
         -6.8152e-01, -3.0049e-01, -1.6878e+00, -1.6340e+00,  1.1231e+00,
          1.4558e+00, -4.5023e-01, -1.1745e-01,  1.2026e+00, -1.0683e+00,
         -3.7055e-01, -1.0187e-01,  6.5679e-01, -6.2459e-01, -6.7784e-01,
         -8.9891e-01, -8.0431e-01,  8.7571e-01, -4.1768e-01, -6.3977e-01,
          3.2761e-01,  1.9852e+00, -1.5843e-01, -2.9237e-01,  1.1127e+00,
          9.7873e-01,  1.2410e+00,  1.2661e+00, -1.3038e+00, -1.8683e+00,
     

In [None]:
print('Getting weights:\n', model.embeddings.weight.data[1]) # row 1 only; drop the [1] to get the entire matrix

Getting weights:
 tensor([ 3.6677e-01,  2.8267e-02,  9.5658e-01,  9.5077e-01,  1.0641e+00,
         8.9874e-01, -1.3958e-01, -8.8929e-01, -9.2349e-01,  2.4945e-01,
        -1.6626e+00,  3.1749e-01,  4.5353e-01,  7.8733e-01, -1.7945e+00,
         4.6523e-01,  1.4962e+00, -5.3494e-01,  3.3327e-01,  4.4590e-01,
         2.7505e-01,  4.8399e-01,  4.5670e-01, -9.6859e-01,  7.5164e-01,
         4.5564e-01, -1.8508e-01,  4.7951e-01, -5.0327e-01,  8.9468e-01,
        -5.2872e-01, -2.8511e-01, -1.3353e-02, -4.4388e-01,  3.2415e-01,
        -6.8152e-01, -3.0049e-01, -1.6878e+00, -1.6340e+00,  1.1231e+00,
         1.4558e+00, -4.5023e-01, -1.1745e-01,  1.2026e+00, -1.0683e+00,
        -3.7055e-01, -1.0187e-01,  6.5679e-01, -6.2459e-01, -6.7784e-01,
        -8.9891e-01, -8.0431e-01,  8.7571e-01, -4.1768e-01, -6.3977e-01,
         3.2761e-01,  1.9852e+00, -1.5843e-01, -2.9237e-01,  1.1127e+00,
         9.7873e-01,  1.2410e+00,  1.2661e+00, -1.3038e+00, -1.8683e+00,
        -6.9838e-01,  9.7294e-01,

In [None]:
# And, what we actually care about is being able to look up individual words with their embeddings: 
torch.set_printoptions(threshold=10_000)
print(f"Embedding for Kara: {model.embeddings.weight[word_to_ix['Kara']]}")

Embedding for Kara: tensor([-0.6111,  1.5321,  2.2814,  1.3241, -1.8293,  0.6344, -0.1314, -0.9478,
        -0.5118, -0.4566,  0.2793, -0.4865,  0.5040, -0.6995,  1.5808,  1.2579,
        -0.0353,  0.8555, -0.9626,  1.3800, -0.4329,  2.5045, -0.0540, -2.1763,
         1.7599,  0.1144, -0.3841, -0.6929, -0.6074,  1.4371, -0.1853, -1.0044,
        -0.8496, -0.3266, -1.7892,  0.2947, -0.9695,  1.2488, -1.3763,  0.4803,
         0.7801,  1.1124,  0.1927,  0.2877,  0.7960,  0.2902,  0.5956,  0.2125,
        -0.9660,  1.1759, -0.7608, -0.8166,  1.4223,  1.3507,  0.1251,  0.3420,
         0.9456,  0.2384,  1.4669, -0.0079, -0.2634,  1.2884,  0.9114,  0.6785,
         0.7871, -1.2732, -0.8136,  0.6783,  0.8478, -2.0053, -0.2100, -0.8468,
        -1.1765, -0.7192,  0.1634, -1.2002,  0.5847, -0.0994,  0.1786,  0.7368,
        -1.0057,  2.0974,  1.6705,  0.1553,  0.5119,  0.1653, -0.8831,  1.2920,
         0.6210, -0.7722, -0.1652, -1.2688,  0.6914, -1.1334,  0.7620, -0.4927,
        -0.1202, -0.

In [None]:
# This way, when we create our second tower of book words, we know which ones are likely related to a given book
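One common way to ask "which words are related" is to rank them by cosine similarity between their embedding vectors. The sketch below is an assumption about how that could look, not code from Viberary: the helper `most_similar` and the five-word vocabulary are made up, and the embeddings are an untrained stand-in, so the scores it prints carry no meaning. With the trained `model.embeddings` and the notebook's `word_to_ix`, the same lookup would apply.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
words = ["Kara", "portal", "war", "fairy", "goblins"]  # toy vocabulary
word_to_ix = {w: i for i, w in enumerate(words)}
embeddings = nn.Embedding(len(words), 100)             # untrained stand-in


def most_similar(word, top_k=3):
    """Rank the other words by cosine similarity to `word`."""
    weights = embeddings.weight.detach()
    query = weights[word_to_ix[word]].unsqueeze(0)        # (1, 100)
    scores = F.cosine_similarity(query, weights, dim=1)   # one score per word
    scores[word_to_ix[word]] = float("-inf")              # exclude the word itself
    best = torch.topk(scores, k=top_k).indices.tolist()
    return [(words[i], scores[i].item()) for i in best]


result = most_similar("Kara")
print(result)
```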