# Embeddings

In [1]:
import torch
from pathlib import Path
import tokenizers as tk


## Tokenizers
Because we want to work with a vector space in the hidden layers of our model, the question is: how do we convert words into vectors? The first step is tokenization. 
With tokenization we map strings to arbitrary integers, and then use those integers to look up a vector in a table (the embedding).

In mathematical terms:

$$
\begin{aligned}
f &\colon \text{str} \rightarrow \mathbb{N} \\
g &\colon \mathbb{N} \rightarrow \mathbb{R}^d
\end{aligned}
$$

where $f$ is the tokenizer and $g$ is the embedding function.
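
The two maps can be sketched in plain Python. This is only a toy illustration: the three-word `vocab`, the dimensionality `d = 4` and the randomly filled `table` are assumptions made up for this example, not part of the tokenizer built later.

```python
import random

# Hypothetical toy vocabulary: f maps each known string to an integer id.
vocab = {"the": 0, "cat": 1, "sat": 2}


def f(token: str) -> int:
    """Tokenizer: string -> integer id."""
    return vocab[token]


# g is a lookup table holding one d-dimensional vector per id.
d = 4
rng = random.Random(0)
table = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in vocab]


def g(token_id: int) -> list[float]:
    """Embedding: integer id -> vector in R^d."""
    return table[token_id]


vector = g(f("cat"))
print(len(vector))  # 4
```

Composing the two functions, `g(f(word))`, is all that is needed to turn a known word into a vector.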

### What is BPE?

BPE (Byte Pair Encoding) is a subword tokenization algorithm. Instead of splitting words into individual characters or using entire words as tokens, BPE breaks words into smaller subword units. It starts with individual characters and iteratively merges the most frequent pair of adjacent symbols, creating subwords that can effectively represent both common words and rare or new words through combinations of these subwords.

A rough outline of the BPE algorithm is as follows:
1. Start with characters: Initially, BPE represents each word as a sequence of characters.
2. Iterative merging: It then identifies the most frequent pair of adjacent symbols and merges it into a single token. This process repeats until the specified vocabulary size is reached.
3. Handling rare words: With BPE, rare words that haven’t been seen during training can still be decomposed into recognizable subword tokens (e.g., “unhappiness” might become [“un”, “happi”, “ness”]).
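
The merge loop from the outline above can be sketched in a few lines of plain Python. This is a simplified illustration, not what the `tokenizers` library does internally: the function name `learn_bpe_merges` is made up for this example, it works on characters instead of bytes, and it stops after a fixed number of merges rather than at a target vocabulary size.

```python
from collections import Counter


def learn_bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every distinct word becomes a tuple of symbols (initially characters) with a count.
    word_counts = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the best pair with the merged symbol.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return merges


words = "the cat sat on the mat where is the cat".split()
merges = learn_bpe_merges(words, num_merges=3)
print(merges)
```

On this tiny word list the pairs `h`+`e` and `a`+`t` are tied as most frequent, so which of the two is merged first depends on how ties are broken.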

### Why is BPE better than a naive tokenizer?

A naive tokenizer splits on spaces, treating each word as a token. This approach doesn’t work well for rare words, misspellings, or words not seen during training: each of them either needs its own vocabulary entry or cannot be represented at all. BPE helps by breaking these words into smaller subword units, ensuring that even rare or new words can still be tokenized into meaningful subparts, reducing the size of the vocabulary and improving generalization.
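
To make the problem concrete, here is a minimal sketch of such a naive tokenizer. The helper names `build_word_vocab` and `naive_encode`, and the choice of `-1` as the id for unknown words, are assumptions for this illustration.

```python
def build_word_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an id to every whitespace-separated word, in order of first appearance."""
    vocab: dict[str, int] = {}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def naive_encode(text: str, vocab: dict[str, int], unk_id: int = -1) -> list[int]:
    """Look up each word; every unseen word collapses to the same unknown id."""
    return [vocab.get(word, unk_id) for word in text.split()]


vocab = build_word_vocab(["the cat sat on the mat", "where is the cat"])
print(naive_encode("the cat sat", vocab))   # [0, 1, 2]
print(naive_encode("the cats sat", vocab))  # [0, -1, 2]
```

Even though "cats" differs from "cat" by a single letter, the naive tokenizer has no way to relate the two.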

In [None]:
def buildBPE(corpus: list[str], vocab_size: int) -> tk.Tokenizer:
    tokenizer = tk.Tokenizer(tk.models.BPE())
    trainer = tk.trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=1,
        special_tokens=["<pad>", "<s>", "</s>", "<mask>"],
    )

    # byte-level pre-tokenization; add_prefix_space=False means no space is prepended to the first word
    tokenizer.pre_tokenizer = tk.pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = tk.decoders.ByteLevel()

    # train the BPE model
    tokenizer.train_from_iterator(corpus, trainer)
    # Padding is enabled to make sure input sequences match in length during training or inference.
    tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
    return tokenizer

Special tokens:

- Padding (`<pad>`): Padding ensures all input sequences are of equal length by adding this token where needed.
- Start (`<s>`) and stop (`</s>`) tokens: These mark the beginning and end of a sequence, helping models understand where a sentence starts and ends.
- Unknown (`<unk>`): This token is used for words or subwords that the tokenizer doesn’t know or hasn’t been trained on. Note that `buildBPE` above does not include `<unk>` in its `special_tokens`.
- Mask (`<mask>`): This token is used in tasks like masked language modeling, where certain tokens are hidden, and the model is asked to predict them.
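
As a rough sketch of what the padding, start and stop tokens do to a batch of already tokenized sentences: the ids `0`, `1` and `2` are assumed here (they mirror the order of the `special_tokens` list in `buildBPE`), and the helper `wrap_and_pad` is made up for this example. The tokenizer above only pads; it does not add start and stop tokens by itself.

```python
# Assumed ids for the special tokens; the real ids depend on the trained vocabulary.
PAD, BOS, EOS = 0, 1, 2


def wrap_and_pad(batch: list[list[int]]) -> list[list[int]]:
    """Add start/stop ids to every sequence, then right-pad to the longest one."""
    wrapped = [[BOS, *ids, EOS] for ids in batch]
    max_len = max(len(ids) for ids in wrapped)
    return [ids + [PAD] * (max_len - len(ids)) for ids in wrapped]


batch = [[5, 6, 7], [8, 9]]
print(wrap_and_pad(batch))  # [[1, 5, 6, 7, 2], [1, 8, 9, 2, 0]]
```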

We start with a simple corpus of two sentences:

In [2]:
corpus = ["the cat sat on the mat", "where is the cat"]


In [None]:
tokenizer = buildBPE(corpus, 50)

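Ok, let's see how our vocabulary looks:

In [3]:
vocab = tokenizer.get_vocab()  # dict: token -> id, in no particular order
tokens_by_id = sorted(vocab, key=vocab.get)  # order the tokens by their id
print(f"First 10 tokens: {tokens_by_id[:10]}")
print(f"Last 10 tokens: {tokens_by_id[-10:]}")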


We can now encode a word

In [4]:
tokenizer.encode("the")


The `ids` attribute of the encoding holds the arbitrary integers the string is mapped to:

In [5]:
tokenizer.encode("the").ids



This works for full sentences:

In [6]:
s1 = tokenizer.encode(corpus[0])
s1.ids


And we can map back from the integers to the string:

In [7]:
tokenizer.decode(s1.ids)


We can see that an unknown word is broken down into subwords, or even letters

In [8]:
s2 = tokenizer.encode("barbapapa")
s2.ids


In [None]:
for i, token in enumerate(s2.ids):
    print(f"Token #{i} is {token} and {tokenizer.decode([token])}")

In [None]:
for i, token in enumerate(s1.ids):
    print(f"Token #{i} is {token} and {tokenizer.decode([token])}")

Here we see that some letters are missing. This is because our toy corpus does not contain every character. Normally you would train BPE on a much bigger corpus that covers at least the full alphabet, so that the tokenizer can always fall back to spelling out the word.
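
Which letters the toy corpus lacks can be checked without the tokenizer. The sketch below is a simplification: it compares plain characters, while the byte-level tokenizer works on bytes, which for these lowercase ASCII words amounts to the same thing. The helper `missing_characters` is made up for this example.

```python
corpus = ["the cat sat on the mat", "where is the cat"]

# The base alphabet is every character that occurs in the training corpus.
alphabet = set("".join(corpus))


def missing_characters(word: str, alphabet: set[str]) -> set[str]:
    """Characters of `word` that never occurred in the training corpus."""
    return set(word) - alphabet


print(sorted(missing_characters("barbapapa", alphabet)))  # ['b', 'p']
print(sorted(missing_characters("heat", alphabet)))       # []
```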

So, we are now able to map the sentence from strings to integers.

In [9]:
print(f"First sentence: {corpus[0]}")
tokenized_sentence = tokenizer.encode(corpus[0])
tokenized_sentence.ids

[0, 1, 2, 3, 0, 4]

Can you "read" the original sentence?

Ok, now, how do we represent this? A naive way would be to use a one-hot encoding.

<img src=https://www.tensorflow.org/text/guide/images/one-hot.png width=400/>

In [10]:
import torch.nn.functional as F

tokenized_tensor = torch.tensor(tokenized_sentence.ids)
oh = F.one_hot(tokenized_tensor)
oh


tensor([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1]])

While this might seem like a nice workaround, it is very memory inefficient: every token becomes a vector as long as the whole vocabulary, containing a single 1. 
Vocabularies can easily grow to 10,000+ tokens!
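
A quick back-of-the-envelope sketch of that cost, in plain Python. The `one_hot` helper is a hand-written stand-in for `F.one_hot`, and the vocabulary size of 10,000 is just an illustrative number.

```python
def one_hot(ids: list[int], vocab_size: int) -> list[list[int]]:
    """One row per token, one column per vocabulary entry."""
    return [[1 if col == i else 0 for col in range(vocab_size)] for i in ids]


ids = [0, 1, 2, 3, 0, 4]
print(one_hot(ids, vocab_size=5)[0])  # [1, 0, 0, 0, 0]

# Number of stored entries for one 6-token sentence with a realistic vocabulary.
seq_len, vocab_size = len(ids), 10_000
total_entries = seq_len * vocab_size
nonzero_entries = seq_len  # exactly one 1 per row
print(total_entries, nonzero_entries)  # 60000 6
```

Almost all of the stored entries are zeros, which is exactly the waste a denser representation avoids.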

So, let's make a more dense space. We simply decide on a dimensionality, and start with assigning a random vector to every word.

<img src=https://www.tensorflow.org/text/guide/images/embedding2.png width=400/>

In [11]:
len(tokenizer.get_vocab())

tensor([[-1.0231,  0.1896,  0.5405,  1.0020],
        [-0.1649, -1.5943,  0.3578, -1.5106],
        [-0.4982, -0.4390, -0.7737,  0.0240],
        [ 1.3668, -0.0218,  0.4670, -0.5132],
        [-1.0231,  0.1896,  0.5405,  1.0020],
        [ 0.2561, -0.7609,  0.4631, -1.8276]], grad_fn=<EmbeddingBackward0>)

In [None]:
tokenizer.encode("<pad>").ids

In [None]:
vocab_size = len(tokenizer.get_vocab())
hidden_dim = 4

embedding = torch.nn.Embedding(
    num_embeddings=vocab_size, embedding_dim=hidden_dim, padding_idx=0
)
x = embedding(tokenized_tensor)
x


So:

- we started with a sentence of strings.
- we map the strings to arbitrary integers
- the integers are used with an Embedding layer; this is nothing more than a lookup table where every token gets an (initially random) vector assigned

We started with a 6-word sentence. But we ended with a (6, 4) matrix of numbers.
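
That the embedding really is just a lookup table can be illustrated without PyTorch. In this sketch the table is a list of random rows, and the lookup gives the same result as multiplying one-hot rows with the table. The sizes, the ids and both helper functions are assumptions for this example.

```python
import random

vocab_size, dim = 5, 4
rng = random.Random(0)
# Lookup table: one randomly initialised row per token id.
table = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids: list[int]) -> list[list[float]]:
    """Embedding as a plain row lookup."""
    return [table[i] for i in ids]


def embed_via_one_hot(ids: list[int]) -> list[list[float]]:
    """The same result, computed as one-hot rows multiplied with the table."""
    out = []
    for i in ids:
        one_hot = [1.0 if col == i else 0.0 for col in range(vocab_size)]
        out.append(
            [sum(one_hot[r] * table[r][c] for r in range(vocab_size)) for c in range(dim)]
        )
    return out


ids = [0, 1, 2, 3, 0, 4]
x = embed(ids)
print(len(x), len(x[0]))  # 6 4
print(x[0] == x[4])       # True: the same id always gets the same vector
```

The lookup skips all the multiplications by zero, which is why an embedding layer is preferred over an explicit one-hot matrix product.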

So, let's say we have a batch of 32 sentences. We can now store this, for example, as a (32, 15, 4) tensor: batch size 32, every sentence has length 15 (padded if the sentence is shorter), and every word in the sentence is represented by 4 numbers.

This is exactly the same as what we did before with time series! We have 3-dimensional tensors, (batch x sequence_length x dimensionality), that we can feed into an RNN!

In [12]:
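x_ = x[None, ...]  # add a batch dimension: (6, 4) -> (1, 6, 4)
# batch_first=True makes the GRU read its input as (batch, sequence_length, dimensionality)
rnn = torch.nn.GRU(input_size=hidden_dim, hidden_size=16, num_layers=1, batch_first=True)

out, hidden = rnn(x_)
out.shape, hidden.shape


(torch.Size([1, 6, 16]), torch.Size([1, 1, 16]))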