# What are learned token embeddings?

Learned token embeddings are vector representations of tokens (such as words, subwords, or characters) in a neural network model, where the embedding vectors are initialized randomly and then optimized (learned) during model training. 

These embeddings capture semantic and syntactic information about the tokens based on the training data, allowing the model to understand relationships and similarities between different tokens. 

Learned token embeddings are a foundational component in many natural language processing (NLP) models, such as word2vec, GloVe, and transformer-based architectures like BERT and GPT.


#### Step-by-step explanation of learned token embeddings:

 1. **Tokenization**: The input text is first split into smaller units called tokens (such as words, subwords, or characters).
 
 2. **Embedding Initialization**: Each unique token is assigned a vector of numbers (embedding). These vectors are usually initialized randomly.
 
 3. **Embedding Lookup**: When processing text, each token is replaced by its corresponding embedding vector, creating a sequence of vectors.
 
 4. **Model Training**: As the neural network trains on data, the embedding vectors are updated through backpropagation. This means the model learns to adjust the vectors so that they capture useful information about the tokens.
 
 5. **Semantic Representation**: After training, tokens with similar meanings or usage patterns tend to have embedding vectors that are close together in the vector space. This helps the model capture relationships between tokens.
 
 6. **Usage in NLP Tasks**: These learned embeddings are used as input features for various NLP tasks, such as text classification, translation, or question answering.
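
The steps above can be sketched in plain Python. This is a minimal illustration rather than how any particular model is trained: the whitespace tokenizer, the names `table` and `sgd_step`, the vector size, and the toy objective (push the dot product of two chosen tokens toward 1) are all assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Step 1: tokenization (naive whitespace split, for illustration only).
tokens = "the cat sat on the mat".split()

# Step 2: one randomly initialized vector per unique token.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
dim = 4
table = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

# Step 3: lookup replaces each token id with its row of the table.
token_ids = [vocab[tok] for tok in tokens]
sequence = [table[i] for i in token_ids]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_step(i, j, target, lr=0.1):
    """One gradient step on the toy loss (dot(e_i, e_j) - target) ** 2."""
    e_i, e_j = table[i][:], table[j][:]  # copies, so both updates use old values
    err = dot(e_i, e_j) - target
    for k in range(dim):
        table[i][k] -= lr * 2 * err * e_j[k]
        table[j][k] -= lr * 2 * err * e_i[k]
    return err ** 2


# Step 4: training nudges the two rows until their dot product approaches 1.
losses = [sgd_step(vocab["cat"], vocab["mat"], target=1.0) for _ in range(200)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

Both occurrences of `the` map to the same row of the table, which is why an update made for one occurrence affects every other occurrence of that token. In a real model the gradient comes from the task loss (for example next-token prediction) instead of the hand-picked target used here.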


# What are subword embeddings?

Subword embeddings are a type of word representation in natural language processing where words are broken down into smaller units, such as character n-grams or subword tokens, and each unit gets its own embedding vector. This approach helps handle rare or unseen words by representing them as combinations of subword units, improving the model's ability to generalize and capture morphological patterns. The subword vocabulary itself is usually built by a tokenization algorithm; popular choices include Byte Pair Encoding (BPE), WordPiece, and SentencePiece.


### Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is a subword tokenization technique used in natural language processing to efficiently represent words as sequences of more frequent subword units. BPE starts with a base vocabulary of individual characters and iteratively merges the most frequent pairs of symbols (characters or character sequences) in the training data to form new subword units. This process continues for a predefined number of merges, resulting in a vocabulary that can represent both common words and rare or unseen words as combinations of subword units. ***BPE helps models handle out-of-vocabulary words and capture meaningful word structure.***
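
The merge loop described above can be sketched as follows. This is a minimal version under stated assumptions: the function names `learn_bpe` and `merge_pair`, the tiny word-count corpus, and the tie-breaking rule (first pair seen wins) are illustrative choices, and production implementations add details such as end-of-word markers or byte-level base vocabularies.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Base vocabulary: every word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges, corpus


word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_counts, num_merges=4)
print(merges)
print(list(corpus))
```

The learned `merges` list is the tokenizer: to segment a new word, split it into characters and replay the merges in the order they were learned.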



### WordPiece

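WordPiece is a subword tokenization algorithm commonly used in natural language processing, especially in models like BERT. It works by breaking words into smaller subword units based on their occurrence in a training corpus. Like Byte Pair Encoding (BPE), the algorithm starts with a base vocabulary of characters and iteratively merges pairs of symbols to form new subword tokens. Unlike BPE, it does not simply pick the most frequent pair: it picks the pair whose merge most increases the likelihood of the training data, which favors pairs that occur together more often than their individual frequencies would suggest. Pieces that continue a word are conventionally marked with a `##` prefix (for example, `play` + `##ing`). WordPiece helps handle rare or unknown words by representing them as combinations of subword units, improving the model's ability to generalize and understand word structure.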
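
The sketch below illustrates two aspects of WordPiece under stated assumptions. The pair score `count(ab) / (count(a) * count(b))` is a commonly cited approximation of the likelihood criterion, not a confirmed reproduction of the original training code, and the corpus, the vocabulary, and the function names are made up for the example. The greedy longest-match-first lookup in `wordpiece_tokenize` mirrors how BERT-style tokenizers segment a word at inference time.

```python
from collections import Counter


def to_pieces(word):
    """Split a word into characters, marking non-initial ones with '##'."""
    return [word[0]] + ["##" + ch for ch in word[1:]]


def pair_statistics(word_counts):
    """Count symbols and adjacent symbol pairs, weighted by word frequency."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word, count in word_counts.items():
        pieces = to_pieces(word)
        for piece in pieces:
            symbol_counts[piece] += count
        for pair in zip(pieces, pieces[1:]):
            pair_counts[pair] += count
    return symbol_counts, pair_counts


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
symbols, pairs = pair_statistics(word_counts)

# BPE-style choice: the most frequent pair.
best_by_frequency = max(pairs, key=pairs.get)
# WordPiece-style choice: frequent relative to the frequency of its parts.
best_by_score = max(pairs, key=lambda p: pairs[p] / (symbols[p[0]] * symbols[p[1]]))

print(best_by_frequency, best_by_score)
print(wordpiece_tokenize("unaffable", {"un", "##aff", "##able"}))
```

On this toy corpus the two criteria disagree: the frequency rule prefers a pair built from very common symbols, while the score prefers a rarer pair whose parts almost always appear together.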


### SentencePiece

SentencePiece is a text tokenization tool and algorithm that segments text into subword units, often used in natural language processing tasks. Unlike typical BPE and WordPiece pipelines, SentencePiece does not require pre-tokenized input: it operates directly on raw text and treats whitespace as an ordinary symbol (written `▁`), which makes it language-independent and suitable for languages without clear word boundaries. It uses algorithms like Unigram Language Model or BPE to learn a vocabulary of subword tokens from the training data. SentencePiece is widely used in models such as Google's T5 and ALBERT, enabling efficient handling of rare and unknown words by representing them as sequences of subword units.
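
To make the raw-text and Unigram ideas concrete, here is a small pure-Python sketch. It does not call the SentencePiece library: the function names are hypothetical, and the piece log-probabilities are hand-picked for illustration, whereas the real tool learns them from data and also applies text normalization. The sketch shows whitespace being encoded as the `▁` symbol (U+2581) and a Viterbi search for the most probable segmentation under a unigram model.

```python
import math

WS = "\u2581"  # the visible whitespace marker used by SentencePiece


def escape_whitespace(text):
    """Turn raw text into one symbol string; spaces become ordinary symbols."""
    return WS + text.replace(" ", WS)


def detokenize(pieces):
    """Invert the encoding: join the pieces and restore the spaces."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


def unigram_segment(text, log_probs):
    """Viterbi search for the highest-scoring split of `text` into pieces."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (best score, start of last piece)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:  # follow the back-pointers from the end of the text
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


raw = "hello world"
symbols = escape_whitespace(raw)

# Hypothetical vocabulary: a few multi-character pieces plus every single
# character as a low-probability fallback.
log_probs = {WS + "hello": -3.0, WS + "wor": -4.0, "ld": -4.0, WS: -2.0}
for ch in set(symbols):
    log_probs.setdefault(ch, -10.0)

pieces = unigram_segment(symbols, log_probs)
print(pieces)
print(detokenize(pieces))
```

Because the space is part of the symbol stream, joining the pieces and replacing `▁` with a space recovers the original text without any language-specific rules.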


### Character Embeddings

Character embeddings are a type of word representation in natural language processing where each character in a word is assigned a vector, and words are represented as sequences or combinations of these character vectors. This approach allows models to capture morphological patterns, handle misspellings, and process rare or unseen words by building word meaning from their constituent characters. Character embeddings are especially useful for languages with rich morphology or when dealing with noisy text data.
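
As a rough illustration, the sketch below builds a word vector by averaging per-character vectors. The averaging step, the vector size, and the names `char_vector` and `word_vector` are assumptions made for the example; practical models usually combine character vectors with a CNN or RNN so that character order matters, and they learn the vectors during training instead of leaving them random.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable
DIM = 32
char_table = {}  # character -> vector, filled lazily with random values


def char_vector(ch):
    """Return the vector for a character, creating it on first use."""
    if ch not in char_table:
        char_table[ch] = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    return char_table[ch]


def word_vector(word):
    """Represent a word as the element-wise mean of its character vectors."""
    vectors = [char_vector(ch) for ch in word]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


correct = word_vector("embedding")
misspelled = word_vector("embeding")  # unseen as a word, still gets a vector
unrelated = word_vector("jazz")

print(f"misspelling: {cosine(correct, misspelled):.3f}")
print(f"unrelated:   {cosine(correct, unrelated):.3f}")
```

Even with untrained vectors, the misspelling stays close to the correct spelling because the two share almost all of their characters, which is the property that makes character-level representations robust to noisy text.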


### Positional Embeddings

Positional embeddings are a technique used in natural language processing models, especially in transformer architectures, to encode the order of tokens in a sequence. Since models like transformers process input tokens in parallel and lack inherent knowledge of token positions, positional embeddings inject information about the position of each token into the model. This is typically done by adding or concatenating a unique vector (the positional embedding) to each token embedding, allowing the model to capture the sequential structure of the input. Common approaches include fixed sinusoidal embeddings and learnable positional embeddings.
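
The fixed sinusoidal variant can be sketched as follows, using the usual formulation in which even dimensions take a sine and odd dimensions a cosine of `pos / 10000^(2i/dim)`. The function names and the toy token vectors are assumptions made for the example, and the sketch adds (not concatenates) the position vector to the token vector.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding of one position as a `dim`-sized vector."""
    vec = []
    for i in range(dim):
        # Dimensions 2k and 2k+1 share one frequency.
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_vectors):
    """Add the positional vector for index p to the p-th token vector."""
    dim = len(token_vectors[0])
    return [
        [t + p for t, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(token_vectors)
    ]


# The same token vector repeated three times, as if one token occurred
# at positions 0, 1, and 2.
same_token = [[0.5, -0.5, 0.25, 0.0]] * 3
with_positions = add_positions(same_token)
for row in with_positions:
    print([round(x, 3) for x in row])
```

After the addition, the three copies of the same token are no longer identical, so later layers can tell the positions apart. A learnable positional embedding replaces `sinusoidal_position` with a trainable lookup table indexed by position.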


## Difference between BPE, WordPiece, and SentencePiece
 
 | Feature                | BPE (Byte Pair Encoding) | WordPiece                | SentencePiece              |
 |------------------------|-------------------------|--------------------------|----------------------------|
 | **Algorithm**          | Greedy pair merging     | Greedy pair merging      | Unigram LM or BPE          |
 | **Input**              | Pre-tokenized text      | Pre-tokenized text       | Raw text (no pre-tokenization needed) |
 | **Merge Criteria**     | Most frequent pairs     | Most likely pairs (maximizes likelihood) | Probabilistic (Unigram LM) or frequent pairs (BPE) |
 | **Used in**            | GPT, RoBERTa            | BERT, DistilBERT         | T5, ALBERT, XLNet          |
 | **Language Independence** | Limited (needs tokenization) | Limited (needs tokenization) | High (works on raw text)   |
 | **Special Features**   | Simple, fast            | Marks word-continuation pieces with a `##` prefix | Can handle languages without spaces |
 
 **Summary:**
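 - **BPE** and **WordPiece** both build their vocabulary by iteratively merging symbol pairs, but BPE picks the most frequent pair while WordPiece picks the pair that most increases the likelihood of the training data. WordPiece also marks word-continuation pieces with a `##` prefix.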
 - **SentencePiece** can operate directly on raw text, supports multiple algorithms (Unigram LM, BPE), and is more language-independent.
