# General Note
To guide you through the homework, we put "...COMPLETE HERE..." as a placeholder wherever you have to complete the code.

# Word Embeddings
Word embeddings are *dense vectors of real numbers* (their entries are typically non-zero), with one dense vector per element of a dictionary.
In Natural Language Processing (NLP), these elements are words.

Thus, you have to represent words in a computer. How can you do it?
One option is to use the ASCII representation of the characters composing the word.

However, this kind of representation tells you nothing about the word's meaning, only about how it is spelled.
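As a small illustration of this limitation, the sketch below prints the character codes of three words. The helper name `ascii_codes` and the example words are chosen here for illustration only.

```python
# Character codes of a word, using Python's built-in ord()
def ascii_codes(word):
    return [ord(ch) for ch in word]

cat = ascii_codes("cat")
car = ascii_codes("car")
feline = ascii_codes("feline")

print(cat)     # [99, 97, 116]
print(car)     # [99, 97, 114]
print(feline)  # [102, 101, 108, 105, 110, 101]
```

"cat" and "car" differ by a single code although their meanings are unrelated, while "cat" and "feline" are close in meaning but have different lengths and few characters in common.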

Generally, you want to map your high-dimensional data to a smaller representation.

You can also use the so-called "one-hot encoding" of the word in a vocabulary. Let's represent the word $w$ in a vocabulary $V$ of $|V|$ words:

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where we have $1$ at the position of $w$.
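A minimal sketch of this encoding with `torch.nn.functional.one_hot`, assuming a hypothetical five-word vocabulary (the names `vocab`, `w_2_idx` and `one_hot` are introduced here for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary; the indices are arbitrary but fixed
vocab = ["the", "girl", "boy", "ran", "shop"]
w_2_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # F.one_hot expects a tensor of integer (int64) indices
    idx = torch.tensor(w_2_idx[word])
    return F.one_hot(idx, num_classes=len(vocab)).float()

girl = one_hot("girl")
boy = one_hot("boy")

print(girl)                   # tensor([0., 1., 0., 0., 0.])
print(torch.dot(girl, boy))   # tensor(0.)
print(torch.dot(girl, girl))  # tensor(1.)
```

The dot product between the vectors of two different words is always zero, whatever the words mean.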

This representation has several drawbacks. It can be enormous, since each vector has as many entries as there are words in the vocabulary, and it treats the words as independent of each other. We would like a representation of the words that carries some notion of *similarity*.

Let's see why.

Suppose we want to build a language model, and that we have seen the following sentences:


*   the girl ran to the shop
*   the boy ran to the shop
*   the girl bought a flower

in the training data. Then, suppose we see a sentence that never appeared in our training data:

*   the boy bought a flower

The language model might already judge this sentence to be probable. But it could also use the following information from the training data:
*  we have seen the boy and the girl in the same role in a sentence, so they may share some semantic information
*  we have seen the girl in the same relation in which we now see the boy

These two facts should increase the probability that the model assigns to the new sentence.
This is what we mean by *semantic similarity*. Exploiting it is a technique to combat the sparsity of linguistic data.

This representation, however, relies on the assumption that words appearing in similar contexts are semantically related to each other.
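The assumption can be made concrete on the three training sentences above. The sketch below is only an illustration: it collects, for every word, the pairs of immediate left and right neighbours it occurs with. The boundary markers `<s>` and `</s>` and the one-word window are assumptions made here.

```python
from collections import defaultdict

sentences = [
    "the girl ran to the shop",
    "the boy ran to the shop",
    "the girl bought a flower",
]

# Map each word to the set of (left neighbour, right neighbour) pairs it occurs with
contexts = defaultdict(set)
for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

shared = contexts["girl"] & contexts["boy"]
print(shared)  # {('the', 'ran')}
```

"girl" and "boy" share a context, which is the kind of evidence a model can use to treat them as related.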


## Getting Dense Word Embeddings
How could we encode semantic similarity in words? Maybe we can introduce some semantic attributes.

For example, we know that both the boy and the girl can run. Thus, we can give a high score to the attribute "is able to run". We can think of other semantic attributes, and in this way construct the embeddings of the two subjects, the boy and the girl:

\begin{align}q_\text{boy} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{likes beach}, \dots \right]\end{align}

\begin{align}q_\text{girl} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{likes beach}, \dots \right]\end{align}

We can measure the similarity between the boy and the girl by taking the dot product of the two representations:

\begin{align}\text{Similarity}(\text{boy}, \text{girl}) = q_\text{boy} \cdot q_\text{girl}\end{align}

It is more common, however, to normalize by the lengths of the vectors:

\begin{align}\text{Similarity}(\text{boy}, \text{girl}) = \frac{q_\text{boy} \cdot q_\text{girl}}
   {\| q_\text{boy} \| \| q_\text{girl} \|} = \cos (\phi)\end{align}

where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
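As a worked example, the two similarities can be computed for the vectors above. This sketch assumes that only the three entries shown are used (the "$\dots$" part is dropped), so the numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

# First three entries of the example vectors (assumption: the rest is ignored)
q_boy = torch.tensor([2.3, 9.4, -5.5])
q_girl = torch.tensor([2.5, 9.1, 6.4])

# Unnormalized similarity: the dot product
dot = torch.dot(q_boy, q_girl)

# Normalized similarity: the cosine of the angle between the vectors
cosine = dot / (q_boy.norm() * q_girl.norm())

# The same quantity through the built-in helper
cosine_builtin = F.cosine_similarity(q_boy, q_girl, dim=0)

print(dot.item(), cosine.item(), cosine_builtin.item())
```

The two vectors agree on "can run" and "likes coffee" but disagree on "likes beach", so the cosine is positive but well below 1.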


We let the network learn these semantic attributes by defining them as parameters. We will have some *latent semantic attributes* that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable.

In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**.

## Word Embeddings in Pytorch

Similarly to what we do with the one-hot encoding, we have to define an index for each word in the vocabulary. These indices will be the keys of a lookup table. That is, embeddings are stored in a $|V| \times D$ matrix, where $D$ is the dimensionality of the embeddings, such that the word assigned index $i$ has its embedding stored in row $i$ of the matrix.

The module that lets you do this is `torch.nn.Embedding`, which takes two parameters: the size of the vocabulary and the dimension of the embeddings.





In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import spacy
import numpy as np
import random

In [2]:
# this is all the necessary code to set the seed
def set_seed(seed : int = 123):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)

In [3]:
set_seed()

In [4]:
word_to_ix = {"hello": 0, "world": 1}

# rows = n. words in the vocabulary
# cols = dimension of the embedding
embeds = nn.Embedding(2, 5)
# tensor to store the index to access the embedding of a word
word = 'hello'
# word = 'world'
lookup_tensor = torch.tensor([word_to_ix[word]], dtype=torch.long)

word_embed = embeds(lookup_tensor)
print(word_embed)

tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969]],
       grad_fn=<EmbeddingBackward0>)
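The lookup above can be read as plain row selection in the $|V| \times D$ weight matrix. The sketch below, which reuses the same toy vocabulary, checks that the embedding of a word equals the corresponding row of `embeds.weight`, and also equals the product of the word's one-hot vector with that matrix. The variable names `by_lookup`, `by_row` and `by_matmul` are introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(123)
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(num_embeddings=len(word_to_ix), embedding_dim=5)

idx = torch.tensor([word_to_ix["world"]], dtype=torch.long)

by_lookup = embeds(idx)              # shape (1, 5), through the module
by_row = embeds.weight[idx]          # the same row, read directly from the matrix
one_hot = F.one_hot(idx, num_classes=len(word_to_ix)).float()
by_matmul = one_hot @ embeds.weight  # one-hot vector times the matrix

print(embeds.weight.shape)                   # torch.Size([2, 5])
print(torch.equal(by_lookup, by_row))        # True
print(torch.allclose(by_lookup, by_matmul))  # True
```

In this sense an embedding layer behaves like a linear layer applied to one-hot inputs, without ever building the one-hot vectors.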


In [None]:
# load google drive to see the files in google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!wget #link from where to download the file

# Exercise: Compute the Word Embedding Continuous Bag-of-Words

The Continuous Bag-of-Words (CBOW) model is widely used in NLP. It embeds the context, a few words before and after the target word, and uses it to predict the target.

CBOW is used to train word embeddings, and these embeddings are then used to initialise a more complicated model.

This is usually referred to as *pretraining embeddings*.

Given a target word $w_i$ and a context window of $N$ words on each side,
$w_{i-1}, \dots, w_{i-N}$ and $w_{i+1}, \dots, w_{i+N}$,
referring to all context words collectively as $C$,
CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}\left(A(\sum_{w \in C} q_w) + b\right)\end{align}

where $q_w$ is the embedding for word $w$.
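Writing the softmax out may help to see what is being minimized. The symbols $z$ and $j$ are introduced here for illustration. Since each $q_w$ has $D$ entries and the output must contain one score per vocabulary word, $A$ is a $|V| \times D$ matrix and $b$ has $|V|$ entries:

\begin{align}z &= A\Big(\sum_{w \in C} q_w\Big) + b \in \mathbb{R}^{|V|} \\
-\log p(w_i | C) &= -\log \frac{\exp(z_{w_i})}{\sum_{j=1}^{|V|} \exp(z_j)} = -z_{w_i} + \log \sum_{j=1}^{|V|} \exp(z_j)\end{align}

The loss is small when the score $z_{w_i}$ of the true target word is large compared with the scores of all the other words.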


Implement the CBOW class in the following snippet of code and train it.

In [5]:
# 2 words to the left and two to the right
CONTEXT_SIZE = 2
EMBEDDING_SIZE = 100
EPOCHS = 50
LEARNING_RATE = 1e-3

In [None]:
def create_dataset():
  # read the text file as utf-8 into a string
  with open('/content/drive/MyDrive/Colab Notebooks/data/anna.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    # tokenize raw text with spacy
    nlp = spacy.load("en_core_web_sm")
    text = text[:nlp.max_length//10]

    tokenized_text = [token.text for token in nlp(text)] # text, pos_, dep_
    # sorting the tokens is not strictly necessary, but it keeps the indices the same across runs
    vocab = sorted(set(tokenized_text))

    w_2_idx = {w:i for i,w in enumerate(vocab)}
    idx_2_w = {i:w for i,w in enumerate(vocab)}

    # create the data composed by the tuple of the context and the target
    data = []
    for i in range(CONTEXT_SIZE, len(tokenized_text)-CONTEXT_SIZE):

      context_idx = [w_2_idx[tokenized_text[i-j-1]] for j in range(CONTEXT_SIZE)]+ \
              [w_2_idx[tokenized_text[i+j+1]] for j in range(CONTEXT_SIZE)]
      target_idx = w_2_idx[tokenized_text[i]]

      data.append((context_idx, target_idx))
    return data, w_2_idx, idx_2_w
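To see what `create_dataset` produces, here is a sketch of the same windowing logic on a single toy sentence. It assumes whitespace tokenization instead of spaCy to stay dependency-free, and the names `pairs` and `indexed` are introduced here for illustration.

```python
CONTEXT_SIZE = 2  # same window size as in the notebook

# Assumption: whitespace tokenization instead of spaCy
tokens = "the girl ran to the shop".split()
vocab = sorted(set(tokens))
w_2_idx = {w: i for i, w in enumerate(vocab)}

pairs = []
for i in range(CONTEXT_SIZE, len(tokens) - CONTEXT_SIZE):
    left = [tokens[i - j - 1] for j in range(CONTEXT_SIZE)]   # nearest word first
    right = [tokens[i + j + 1] for j in range(CONTEXT_SIZE)]
    pairs.append((left + right, tokens[i]))

# The same pairs expressed as vocabulary indices
indexed = [([w_2_idx[w] for w in ctx], w_2_idx[tgt]) for ctx, tgt in pairs]

for context, target in pairs:
    print(context, "->", target)
# ['girl', 'the', 'to', 'the'] -> ran
# ['ran', 'girl', 'the', 'shop'] -> to
print(indexed)  # [([0, 3, 4, 3], 1), ([1, 0, 3, 2], 4)]
```

The left context is listed starting from the word closest to the target. This order does not matter for CBOW, because the context embeddings are summed.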

# Question 1
Implement the CBOW model by defining an appropriate class which extends `nn.Module`, and train it on the data.

In [None]:
# Create the model
class CBOW(nn.Module):

    def __init__(self, "...COMPLETE HERE..."):
        super(CBOW, self).__init__()
        "...COMPLETE HERE..."

    def forward(self, x):
        "...COMPLETE HERE..."
        return x

# Question 2
Choose the correct loss and explain why you chose that loss.

# Question 3
Make an example prediction and check whether it is correct. See the end of the following script.

In [None]:

data, w_2_idx, idx_2_word = create_dataset()
# implement the loss
# explain why you used the loss you chose
criterion = "...COMPLETE HERE..."
model = CBOW("...COMPLETE HERE...")
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# create the context data and targets in tensors
context_data = torch.tensor([d[0] for d in data])
target_data = torch.tensor([d[1] for d in data])

# create a dataset and a dataloader
dataset = torch.utils.data.TensorDataset(context_data, target_data)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for e in range(EPOCHS):
  # unpack each batch directly, without shadowing the `data` list
  for context, target in dataloader:
    predicted = model(context)
    loss = criterion(predicted, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {e} - loss: {loss.item()}')

# --- TESTING ---
data, w_2_idx, idx_2_word = create_dataset()
# create the context data and targets in tensors
context_data = torch.tensor([d[0] for d in data])
target_data = torch.tensor([d[1] for d in data])

# create a dataset and a dataloader
dataset = torch.utils.data.TensorDataset(context_data, target_data)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=True)

# take a single (context, target) example from the dataloader
for context, target in dataloader:
  out = model(context)
  break

#print result
print(f'Context: {"...COMPLETE HERE..."}')
print(f'Target: {"...COMPLETE HERE..."}')
print(f'Prediction: {"...COMPLETE HERE..."}')
