# Embeddings

In [None]:
import tokenizers as tk
import torch
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer, WordLevelTrainer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase, StripAccents, Sequence


We start with a simple corpus of three sentences:

In [None]:
corpus = ["The cat sat on the mat", "Where is the cat?", "The cat is blasé"]


In [None]:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
normalizer.normalize_str(corpus[2])

The NFD (Normalization Form D) normalizer performs Unicode normalization on strings, converting them into canonical decomposition form.
Tokenizers use NFD normalization to ensure consistency in token representation, which is crucial for training effective machine learning models:

- Standardization: It guarantees that different ways of encoding the same character (for example, a precomposed 'é' versus 'e' followed by a combining accent) end up with the same representation.
- Vocabulary Reduction: By decomposing characters, it can simplify the model's vocabulary. Instead of having separate tokens for 'a', 'á', 'à', etc., the tokenizer might be able to create one token for the base 'a' and separate tokens for the combining accents, leading to a smaller, more efficient vocabulary.
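
To make the decomposition concrete, here is a minimal sketch that uses Python's standard `unicodedata` module instead of the `tokenizers` library. The helper name `strip_accents` is hypothetical, and it is only assumed to approximate what `NFD()` followed by `StripAccents()` does: decompose the characters, then drop the combining marks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose text with NFD, then drop combining marks (Unicode categories M*)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed_text if not unicodedata.category(ch).startswith("M")
    )


composed = "\u00e9"  # 'é' stored as one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + combining acute accent

print(len(composed), len(decomposed))  # 1 2
print(strip_accents("The cat is blas\u00e9"))  # The cat is blase
```

After NFD the accent is a separate code point, which is what makes it possible to remove it without touching the base letter.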

In [None]:
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.pre_tokenizer = Whitespace() # This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])
# Sequence chains multiple normalizers; they run one after another in the given order.
tokenizer.normalizer = normalizer
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.get_vocab()

Our models won't handle strings. So what we typically do is `tokenize` the text: we assign an arbitrary integer to every word.

First, we need to get all the words. So we split the strings on whitespace, which gives us the words in every sentence.

Then we assign an integer to every word. We can do this by creating a dictionary that maps words to integers. We can then use this dictionary to convert the words to integers.

This gives us a vocabulary, which is just a mapping from tokens to arbitrary integers. 
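
The steps above can be sketched in plain Python, without the `tokenizers` library. The names `build_vocab`, `encode`, `toy_corpus` and `toy_vocab` are hypothetical, and the resulting integers will differ from the ones the trained tokenizer assigns, since the mapping is arbitrary. Index 0 is reserved here for unknown words, as an assumption that mirrors the `[UNK]` special token.

```python
def build_vocab(sentences: list[str]) -> dict[str, int]:
    """Map every whitespace-separated word to an integer; 0 is kept for unknown words."""
    vocab = {"[UNK]": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free integer
    return vocab


def encode(sentence: str, vocab: dict[str, int]) -> list[int]:
    """Translate a sentence into integers, falling back to the [UNK] index."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]


toy_corpus = ["The cat sat on the mat", "The cat is here"]
toy_vocab = build_vocab(toy_corpus)

print(toy_vocab)
print(encode("the cat is drinking", toy_vocab))  # [1, 2, 6, 0]
```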

In [None]:
enc = tokenizer.encode("cat")
enc.ids


In [None]:
enc = tokenizer.encode("The cat is drinking")
enc.ids

Unknown words, such as "drinking", are mapped to the index of the `[UNK]` token, which is 0 here:

In [None]:
tokenizer.decode([0], skip_special_tokens=False)

And we can translate back to strings

In [None]:
tokenizer.decode(enc.ids, skip_special_tokens=False)

So, we are now able to map the sentence from strings to integers.

In [None]:
tokenized_sentence = tokenizer.encode(corpus[0])
tokenized_sentence.ids


Can you "read" the original sentence? You can use the vocab to translate back:

In [None]:
tokenizer.get_vocab()


OK, now how do we represent this? A naive way would be to use one-hot encoding.

<img src=https://www.tensorflow.org/text/guide/images/one-hot.png width=400/>

In [None]:
import torch.nn.functional as F

tokenized_tensor = torch.tensor(tokenized_sentence.ids)
oh = F.one_hot(tokenized_tensor)
oh


While this might seem like a nice workaround, it is very memory inefficient: every word becomes a vector as long as the whole vocabulary.
Vocabularies can easily grow to 10,000+ words!
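
To get a feeling for the numbers, here is a rough sketch in plain Python. The vocabulary size of 10,000 and the token ids are assumptions chosen for illustration, and `one_hot` is a hypothetical helper, not the PyTorch function used above.

```python
def one_hot(index: int, size: int) -> list[int]:
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


assumed_vocab_size = 10_000  # assumption, for illustration only
token_ids = [1, 2, 8, 7, 1, 6]  # hypothetical ids for a 6-word sentence

rows = [one_hot(i, assumed_vocab_size) for i in token_ids]
total_numbers = sum(len(row) for row in rows)
nonzero_numbers = sum(sum(row) for row in rows)

print(one_hot(2, 5))  # [0, 0, 1, 0, 0]
print(total_numbers, nonzero_numbers)  # 60000 6
```

A single 6-word sentence already needs 60,000 numbers, of which only 6 carry information.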

So, let's make a denser space. We simply decide on a dimensionality and start by assigning a random vector to every word.

<img src=https://www.tensorflow.org/text/guide/images/embedding2.png width=400/>

In [None]:
vocab_size = tokenizer.get_vocab_size()
print(f"the vocabulary size is {vocab_size}")
hidden_dim = 4

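embedding = torch.nn.Embedding(
    # no padding_idx here: this vocabulary has no padding token, and
    # padding_idx=-2 would zero out the vector of a regular word
    num_embeddings=vocab_size, embedding_dim=hidden_dim
)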
x = embedding(tokenized_tensor)
x


So:

- we started with a sentence of strings
- we mapped the strings to arbitrary integers
- the integers are used with an Embedding layer; this is nothing more than a lookup table where every word gets a random vector assigned

We started with a 6-word sentence. But we ended with a (6, 4) matrix of numbers.
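
The "lookup table" idea can be sketched without PyTorch. The snippet below is a plain-Python illustration under stated assumptions: `table`, `embed` and `one_hot_matmul` are hypothetical names, and the sizes (5 tokens, 4 dimensions) are made up. It also shows that looking up a row gives the same result as multiplying a one-hot vector with the table, only much cheaper.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
toy_vocab_size, toy_dim = 5, 4

# The "embedding layer": one random row of `toy_dim` numbers per token id.
table = [
    [random.gauss(0.0, 1.0) for _ in range(toy_dim)] for _ in range(toy_vocab_size)
]


def embed(ids: list[int]) -> list[list[float]]:
    """Look up one row of the table per token id."""
    return [table[i] for i in ids]


def one_hot_matmul(ids: list[int]) -> list[list[float]]:
    """Same result, computed as (one-hot vector) x (table)."""
    rows = []
    for i in ids:
        hot = [1.0 if j == i else 0.0 for j in range(toy_vocab_size)]
        rows.append(
            [
                sum(hot[j] * table[j][d] for j in range(toy_vocab_size))
                for d in range(toy_dim)
            ]
        )
    return rows


toy_ids = [1, 2, 3, 4, 1, 0]  # hypothetical ids for a 6-word sentence
vectors = embed(toy_ids)
print(len(vectors), len(vectors[0]))  # 6 4
```

Note that the same id always returns the same row, so the repeated word at positions 0 and 4 gets the same vector.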

So, let's say we have a batch of 32 sentences. We can now store this as, for example, a (32, 15, 6) tensor: batch size 32, every sentence has length 15 (use padding if the sentence is shorter), and every word in the sentence is represented with 6 numbers.
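
Padding itself happens on the integer ids, before the embedding lookup. A minimal sketch, assuming the padding token has id 0 (this depends on the tokenizer) and using a hypothetical helper `pad_batch`:

```python
PAD_ID = 0  # assumed id of the padding token; check your own tokenizer


def pad_batch(
    batch: list[list[int]], length: int, pad_id: int = PAD_ID
) -> list[list[int]]:
    """Truncate or right-pad every sequence of token ids to exactly `length` items."""
    return [seq[:length] + [pad_id] * (length - len(seq)) for seq in batch]


batch = [[4, 2, 7], [4, 2, 5, 9, 1, 3], [8]]  # hypothetical token ids
padded = pad_batch(batch, length=5)

print(padded)
# [[4, 2, 7, 0, 0], [4, 2, 5, 9, 1], [8, 0, 0, 0, 0]]
```

Once every sequence has the same length, the batch can be stacked into a single (batch, sequence_length) tensor of ids, and the embedding adds the last dimension.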

This is exactly the same as what we did before with time series! We have 3-dimensional tensors (batch x sequence_length x dimensionality) that we can feed into an RNN!

In [None]:
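x_ = x[None, ...]  # add a batch dimension: (1, sequence_length, hidden_dim)
# batch_first=True so the GRU reads the input as (batch, sequence_length, dimensionality)
rnn = torch.nn.GRU(input_size=hidden_dim, hidden_size=16, num_layers=1, batch_first=True)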

out, hidden = rnn(x_)
out.shape, hidden.shape


# The Problem with Simple Tokenization
Consider these two approaches:

- Word-level tokenization: "playing" and "played" are treated as completely different tokens
- Character-level tokenization: "p", "l", "a", "y", "i", "n", "g" are all separate tokens

Both approaches have issues:

- Word-level creates an enormous vocabulary and misses relationships between similar words
- Character-level creates very long sequences and loses meaning
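
A small plain-Python comparison makes the trade-off visible. The example text is made up for illustration:

```python
text = "playing played player"

word_tokens = text.split()
char_tokens = [ch for ch in text if ch != " "]

# Word-level: short sequence, but every word form needs its own vocabulary entry.
print(len(word_tokens), len(set(word_tokens)))  # 3 3

# Character-level: tiny vocabulary, but a much longer sequence.
print(len(char_tokens), len(set(char_tokens)))  # 19 10
```

At word level the three tokens share nothing, even though they all contain "play". At character level the vocabulary shrinks to 10 symbols, but the sequence grows from 3 to 19 tokens.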

## Enter BPE
BPE (Byte Pair Encoding) is a clever middle ground that automatically learns to break words into meaningful subwords. Here's how it works:

1. Start with characters as your base vocabulary
2. Count all pairs of adjacent tokens in your training data
3. Merge the most frequent pair to create a new token
4. Repeat steps 2-3 until you reach your desired vocabulary size

Initial text: `"low lower lowest"`. Initial tokens:
- `["l", "o", "w", " ", "l", "o", "w", "e", "r", " ", "l", "o", "w", "e", "s", "t"]`

After the first merge ("l" "o" → "lo"; this pair occurs three times, tied with "o" "w", and we take the first one):
- `["lo", "w", " ", "lo", "w", "e", "r", " ", "lo", "w", "e", "s", "t"]`

After second merge ("lo" "w" → "low"):
- `["low", " ", "low", "e", "r", " ", "low", "e", "s", "t"]`
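
The two merges above can be reproduced with a toy implementation in plain Python. This is a sketch under simplifying assumptions, not how the `tokenizers` library works internally: it treats the space as an ordinary token, as in the example above, it stops after a fixed number of merges instead of a target vocabulary size, and it breaks ties by taking the pair that was seen first. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Count all adjacent pairs and return the most frequent one."""
    counts = Counter(zip(tokens, tokens[1:]))
    return max(counts, key=counts.get)  # ties go to the pair seen first


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text: str, num_merges: int) -> tuple[list[str], list[tuple[str, str]]]:
    """Start from characters and apply `num_merges` merge steps."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        pair = most_frequent_pair(tokens)
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


tokens, merges = toy_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

The list of merges is what a trained BPE tokenizer stores: applying the same merges in the same order to new text gives the subword tokens.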

Let's see this in action on our example:

In [None]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])
# Sequence chains multiple normalizers; they run one after another in the given order.
tokenizer.normalizer = normalizer
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(f"the vocabulary size is {tokenizer.get_vocab_size()}")
tokenizer.get_vocab()

In [None]:
enc = tokenizer.encode("The cat is drinking")
enc.ids

In [None]:
tokenizer.decode(enc.ids, skip_special_tokens=False)

In [None]:
def buildBPE(corpus: list[str], vocab_size: int) -> tk.Tokenizer:
    tokenizer = tk.Tokenizer(tk.models.BPE())
    trainer = tk.trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=1,
        special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
    )

    # byte-level pre-tokenization; do not add a space in front of the first word
    tokenizer.pre_tokenizer = tk.pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = tk.decoders.ByteLevel()

    # train the BPE model
    tokenizer.train_from_iterator(corpus, trainer)
    tokenizer.enable_padding(pad_id=0, pad_token="<pad>")
    return tokenizer