# Embeddings

In [1]:
import torch
from pathlib import Path

We start with a simple corpus of two sentences.

In [57]:
corpus = ['the cat sat on the mat', 'where is the cat']

Our models won't handle strings. So, what we typically do is to `tokenize` the text: we split it into words and assign an arbitrary integer to every word.

First, we need to get all the words. So we split the strings on spaces, which gives us the words in every sentence.

Then, we flatten the sentences so we get one very long list of words. After that, we run through the corpus to count all words.

In [79]:
from collections import Counter, OrderedDict
from typing import List

def split_and_flat(corpus: List[str]):
    corpus = [x.split() for x in corpus]
    corpus = [x for y in corpus for x in y]
    return corpus

data = split_and_flat(corpus)
counter = Counter(data)
counter

Counter({'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'where': 1, 'is': 1})

These counts are useful if our corpus grows too big and we want to drop some words. In that case, it is best to drop the least frequent words.
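As a minimal sketch of that idea, the counts can be used to keep only the words that occur often enough. The helper `prune_vocab` and the threshold `min_freq` are hypothetical names chosen for illustration; they are not part of the code in this notebook.

```python
from collections import Counter, OrderedDict

def prune_vocab(counter: Counter, min_freq: int = 2) -> OrderedDict:
    # most_common() sorts by count, highest first; ties keep their insertion order
    kept = [(word, n) for word, n in counter.most_common() if n >= min_freq]
    return OrderedDict(kept)

counter = Counter('the cat sat on the mat where is the cat'.split())
pruned = prune_vocab(counter, min_freq=2)
print(list(pruned.items()))  # [('the', 3), ('cat', 2)]
```

With a threshold of 2, only `the` and `cat` survive. As far as I know, the `vocab` factory in torchtext also accepts a `min_freq` argument that has a similar effect.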

In [80]:
ordered_dict = OrderedDict(counter)
ordered_dict

OrderedDict([('the', 3),
             ('cat', 2),
             ('sat', 1),
             ('on', 1),
             ('mat', 1),
             ('where', 1),
             ('is', 1)])

With this, we can build a vocab.

In [81]:
from torchtext.vocab import vocab

v1 = vocab(ordered_dict)
v1.set_default_index(-1)

This is just a mapping from tokens to arbitrary integers. 

In [82]:
v1["the"]

0

The default index is returned for unknown words.

In [83]:
v1["test"]

-1

And we can translate back to strings.

In [84]:
v1.lookup_token(0)

'the'
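To make explicit what such a vocab does, here is a minimal pure-Python sketch of the same behaviour. The class `TinyVocab` is a hypothetical stand-in written for illustration, not the torchtext implementation; it only mimics the three operations used above.

```python
from typing import Dict, List

class TinyVocab:
    """Hypothetical stand-in for a vocab: token <-> index, with a default index."""

    def __init__(self, tokens: List[str], default_index: int = -1):
        self.itos: List[str] = list(tokens)  # index -> token
        self.stoi: Dict[str, int] = {t: i for i, t in enumerate(self.itos)}  # token -> index
        self.default_index = default_index

    def __getitem__(self, token: str) -> int:
        # unknown tokens fall back to the default index
        return self.stoi.get(token, self.default_index)

    def lookup_token(self, index: int) -> str:
        return self.itos[index]

    def __len__(self) -> int:
        return len(self.itos)

tiny = TinyVocab(['the', 'cat', 'sat', 'on', 'mat', 'where', 'is'])
print(tiny['the'], tiny['test'], tiny.lookup_token(0))  # 0 -1 the
```

The two containers mirror each other: a dictionary for the string-to-integer direction and a list for the way back.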

So, we are now able to map the sentence from strings to integers.

In [85]:
sentence = corpus[0].split()
tokenized_sentence = [v1[word] for word in sentence]
tokenized_sentence

[0, 1, 2, 3, 0, 4]

Can you "read" the original sentence?

OK, so how do we represent this? A naive way would be to use a one-hot encoding.

<img src=https://www.tensorflow.org/text/guide/images/one-hot.png width=400/>

In [87]:
import torch.nn.functional as F

tokenized_tensor = torch.tensor(tokenized_sentence)
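# without num_classes, one_hot infers the width from the largest index (4),
# so we get 5 columns here instead of len(v1) = 7
oh = F.one_hot(tokenized_tensor)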
oh

tensor([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1]])

While this might seem like a nice workaround, it is very memory inefficient: every word needs a vector as long as the vocabulary.
Vocabularies can easily grow beyond 10,000 words!
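To put a number on this, the sketch below one-hot encodes the same six tokens against an assumed vocabulary of 10,000 words and compares the count of stored numbers with a dense representation. The values `vocab_size = 10_000` and `embedding_dim = 4` are assumptions picked for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size = 10_000  # assumed vocabulary size, for illustration only
embedding_dim = 4    # assumed size of a dense vector per word
tokens = torch.tensor([0, 1, 2, 3, 0, 4])

# pass num_classes explicitly so the width matches the vocabulary
one_hot = F.one_hot(tokens, num_classes=vocab_size)

numbers_one_hot = one_hot.numel()               # 6 words * 10_000 columns
numbers_dense = tokens.numel() * embedding_dim  # 6 words * 4 numbers
print(one_hot.shape)                   # torch.Size([6, 10000])
print(numbers_one_hot, numbers_dense)  # 60000 24
```

Nearly all of the 60,000 stored values are zeros; each row contains exactly one 1.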

So, let's make a denser space. We simply decide on a dimensionality, and start by assigning a random vector to every word.

<img src=https://www.tensorflow.org/text/guide/images/embedding2.png width=400/>

In [90]:
vocab_size = len(v1)
hidden_dim = 4

embedding = torch.nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=hidden_dim,
    padding_idx=-2  # negative values count from the end, so this marks index vocab_size - 2 as padding
)
x = embedding(tokenized_tensor)
x

tensor([[-1.0161,  0.2648, -0.3497, -0.9465],
        [-0.6656, -0.0263, -0.7860,  0.3781],
        [ 1.0528, -2.0017, -1.2581,  0.3277],
        [ 0.5182, -1.2019,  0.3222, -0.2016],
        [-1.0161,  0.2648, -0.3497, -0.9465],
        [-0.4293, -1.5902,  0.7622,  0.7740]], grad_fn=<EmbeddingBackward0>)

So:

- we started with a sentence of strings
- we mapped the strings to arbitrary integers
- we fed the integers to an Embedding layer; this is nothing more than a lookup table in which every word gets a random vector assigned
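The lookup-table claim can be checked directly. The sketch below, which builds a fresh `Embedding` with assumed sizes (7 words, 4 dimensions) instead of reusing the one above, compares the layer output with plain row indexing into its weight matrix and with a one-hot matrix product.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embedding = torch.nn.Embedding(num_embeddings=7, embedding_dim=4)
tokens = torch.tensor([0, 1, 2, 3, 0, 4])

looked_up = embedding(tokens)       # forward pass of the layer
indexed = embedding.weight[tokens]  # plain row indexing into the weight matrix

# multiplying a one-hot matrix with the weights selects the same rows
one_hot = F.one_hot(tokens, num_classes=7).float()
via_matmul = one_hot @ embedding.weight

print(torch.equal(looked_up, indexed))        # True
print(torch.allclose(looked_up, via_matmul))  # True
```

So an embedding layer gives the same result as a one-hot encoding followed by a linear layer without bias, but it skips building the large sparse matrix.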

We started with a 6-word sentence and ended up with a (6, 4) matrix of numbers: one 4-dimensional vector for every word.

So, let's say we have a batch of 32 sentences. With our 4-dimensional embedding, we can store this for example as a (32, 15, 4) tensor: batch size 32, every sentence has length 15 (use padding if the sentence is shorter), and every word in the sentence is represented by 4 numbers.
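A minimal sketch of the padding step, assuming we reserve one extra index for a padding token. The name `PAD_IDX` and the choice of index 7 are assumptions for illustration; the two sentences are the tokenized corpus from above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 7  # assumed: one extra index after the 7 real words (0..6)

sentences = [
    torch.tensor([0, 1, 2, 3, 0, 4]),  # 'the cat sat on the mat'
    torch.tensor([5, 6, 0, 1]),        # 'where is the cat'
]

# pad every sentence to the length of the longest one
batch = pad_sequence(sentences, batch_first=True, padding_value=PAD_IDX)

embedding = torch.nn.Embedding(
    num_embeddings=8,     # 7 words + 1 padding token
    embedding_dim=4,
    padding_idx=PAD_IDX,  # this row is initialised to zeros and gets no gradient
)
x = embedding(batch)
print(batch.shape, x.shape)  # torch.Size([2, 6]) torch.Size([2, 6, 4])
```

The padded positions of the second sentence come out as all-zero vectors, so they are easy to recognise downstream.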

This is exactly the same as what we did before with time series! We have 3-dimensional tensors, (batch x sequence_length x dimensionality), that we can feed into an RNN!

In [92]:
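x_ = x[None, ...]  # add a batch dimension: (6, 4) -> (1, 6, 4)
rnn = torch.nn.GRU(
    input_size=hidden_dim,
    hidden_size=16,
    num_layers=1,
    batch_first=True,  # read the input as (batch, sequence_length, dimensionality)
)

out, hidden = rnn(x_)
out.shape, hidden.shape

(torch.Size([1, 6, 16]), torch.Size([1, 1, 16]))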
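One detail worth checking is the tensor layout: `torch.nn.GRU` reads its input as (sequence_length, batch, dimensionality) unless `batch_first=True` is passed. The sketch below uses a random tensor in place of the embedded sentence and compares the two layouts; the variable names are chosen for illustration.

```python
import torch

x = torch.randn(1, 6, 4)  # meant as (batch=1, sequence_length=6, dimensionality=4)

default_rnn = torch.nn.GRU(input_size=4, hidden_size=16, num_layers=1)
batch_first_rnn = torch.nn.GRU(input_size=4, hidden_size=16, num_layers=1, batch_first=True)

# default layout: x is read as (sequence_length=1, batch=6, 4)
out_a, hidden_a = default_rnn(x)
# batch_first layout: x is read as (batch=1, sequence_length=6, 4)
out_b, hidden_b = batch_first_rnn(x)

print(out_a.shape, hidden_a.shape)  # torch.Size([1, 6, 16]) torch.Size([1, 6, 16])
print(out_b.shape, hidden_b.shape)  # torch.Size([1, 6, 16]) torch.Size([1, 1, 16])
```

The hidden state always has shape (num_layers, batch, hidden_size), so its middle dimension shows how the GRU interpreted the input: 6 means six sequences of one step each, 1 means one sequence of six steps.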