# Understanding Embedding Layers for the Transformer architecture

This notebook explains the embedding process in PyTorch, first with a small 12-token vocabulary and a 512-dimensional embedding, then with a vocabulary built from a short piece of text.

## What is an Embedding Layer?

An embedding layer maps discrete input tokens (like words) into continuous vectors. It is used to represent words in a dense vector space where each word has its own unique vector representation.

### Why Use Embeddings?
- **Dimensionality Reduction:** Instead of a sparse one-hot vector with one dimension per vocabulary entry, embeddings provide a dense representation that, for realistic vocabulary sizes, is much lower-dimensional.
- **Semantic Meaning:** Once trained, the vectors can capture semantic relationships between words.
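
To make the comparison with one-hot encoding concrete, the sketch below checks that an embedding lookup gives the same result as multiplying a one-hot matrix by the embedding weights. The sizes (`vocab_size = 12`, `embedding_dim = 4`) and the token ids are illustrative assumptions, not values the notebook prescribes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

vocab_size, embedding_dim = 12, 4          # assumed sizes, kept small for readability
embedding = nn.Embedding(vocab_size, embedding_dim)
token_ids = torch.tensor([2, 10, 7])       # three arbitrary token indices

# One-hot route: a (3, 12) matrix with a single 1 per row, times the (12, 4) weights.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

# Lookup route: the layer selects rows 2, 10 and 7 directly.
via_lookup = embedding(token_ids)

same_result = torch.allclose(via_matmul, via_lookup)
print(same_result)
```

The lookup avoids building the one-hot matrix at all, which matters once the vocabulary has tens of thousands of entries.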

## Step-by-Step Explanation

### Step 1: Define the Vocabulary
For simplicity, we use a small subset of words to illustrate the process. In practice, the vocabulary can be much larger.

In [14]:
import torch
import torch.nn as nn

# Define a small vocabulary for illustration
vocab = {'<pad>': 0, '<unk>': 1, 'the': 2, 'to': 3, 'and': 4, 'of': 5, 'a': 6, 'in': 7, 'that': 8, 'is': 9, 'cat': 10, 'hat': 11}
vocab_size = len(vocab)

print(vocab_size)


12


## Step 2: Initialize the Embedding Layer
The embedding layer is initialized with a random matrix of size `(vocab_size, embedding_dim)`.

In [17]:
# Define the embedding dimension
embedding_dim = 512

# Initialize the embedding layer
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

print(f"Embedding layer: {embedding_layer} \n")

Embedding layer: Embedding(12, 512) 




## Step 3: Embedding Matrix Initialization
The embedding matrix $E$ has dimensions `(vocab_size, embedding_dim)`. Each row represents a word vector in the embedding space.

### Math Behind Embedding Initialization
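For each word in the vocabulary, a vector of size `embedding_dim` is initialized. By default, `nn.Embedding` draws every entry from a standard normal distribution $\mathcal{N}(0, 1)$. The matrix is a trainable parameter, so it is updated during training to capture the relationships between words.

Let's inspect the embedding matrix created in Step 2.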

In [18]:
# Inspect the size of the embedding matrix
embedding_matrix = embedding_layer.weight.data
print(f"Embedding Matrix Size: \n {embedding_matrix.size()}\n")


print(f"Embedding Matrix: \n {embedding_matrix}\n")



Embedding Matrix Size: 
 torch.Size([12, 512])

Embedding Matrix: 
 tensor([[-0.5306,  0.6974,  0.3685,  ..., -1.8144,  1.3594,  0.3562],
        [-1.0072,  0.0480, -0.0106,  ...,  0.3968,  2.0107,  0.1754],
        [-0.1164, -0.0656, -0.2980,  ...,  0.0169, -2.1042, -0.4275],
        ...,
        [ 1.2982,  1.4217, -0.7028,  ...,  1.1612,  0.6033, -1.4942],
        [ 0.2040,  0.0248, -0.7561,  ..., -0.6447, -0.5856, -0.4786],
        [-0.0211,  0.5515,  0.3734,  ..., -2.7401, -1.0484, -0.4281]])
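
As a rough check of the initialization described above, the following sketch inspects a freshly created layer: its parameter count, whether it is trainable, and the mean and standard deviation of its entries. Treating index 0 as the padding token via `padding_idx=0` is an assumption made here for illustration; the notebook's own layer does not set it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

vocab_size, embedding_dim = 12, 512
layer = nn.Embedding(vocab_size, embedding_dim)

weight = layer.weight                    # trainable parameter of shape (12, 512)
num_parameters = weight.numel()          # vocab_size * embedding_dim
init_mean = weight.mean().item()         # close to 0 for a standard normal init
init_std = weight.std().item()           # close to 1 for a standard normal init

# Assumption: index 0 is the padding token, as in the vocab defined in Step 1.
# With padding_idx set, that row starts as zeros and receives no gradient.
padded_layer = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
pad_row_is_zero = bool(torch.all(padded_layer.weight[0] == 0))

print(weight.shape, weight.requires_grad, num_parameters)
print(f"mean={init_mean:.3f}, std={init_std:.3f}, pad row zero: {pad_row_is_zero}")
```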



## Step 4: Input Sequence to Embeddings
Given an input sequence of token indices, the embedding layer retrieves the corresponding vectors from the embedding matrix.

In [19]:
# Example input sequence: "the cat in the hat"
input_indices = torch.tensor([2, 10, 7, 2, 11], dtype=torch.long)  # Example indices based on the vocab

# Get the embeddings for the input sequence
embeddings = embedding_layer(input_indices)

print(f"Input Sequence: {input_indices}")

Input Sequence: tensor([ 2, 10,  7,  2, 11])
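
The indices above were written by hand. A minimal sketch of deriving them from the vocabulary is shown below; `sentence_to_indices` is a hypothetical helper introduced here, and lowercasing plus whitespace splitting is an assumed tokenization, not one the notebook specifies.

```python
vocab = {'<pad>': 0, '<unk>': 1, 'the': 2, 'to': 3, 'and': 4, 'of': 5,
         'a': 6, 'in': 7, 'that': 8, 'is': 9, 'cat': 10, 'hat': 11}

def sentence_to_indices(sentence, vocab):
    """Hypothetical helper: lowercase, split on whitespace, map unknown words to <unk>."""
    return [vocab.get(word, vocab['<unk>']) for word in sentence.lower().split()]

known = sentence_to_indices("the cat in the hat", vocab)
unknown = sentence_to_indices("the dog in the hat", vocab)

print(known)    # [2, 10, 7, 2, 11]
print(unknown)  # [2, 1, 7, 2, 11]  ('dog' is not in the vocabulary)
```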


## Step 5: Inspect the Embeddings
Each token index in the input sequence is mapped to its corresponding 512-dimensional vector.

In [5]:
embeddings

tensor([[-0.2626,  1.2768,  0.1062,  ..., -0.1417, -0.6415,  1.0311],
        [-1.0452, -1.0182, -1.6060,  ..., -0.7584, -1.5634,  0.5559],
        [-1.2636, -0.2993,  0.5076,  ...,  1.8362,  0.6055, -1.3608],
        [-0.2626,  1.2768,  0.1062,  ..., -0.1417, -0.6415,  1.0311],
        [-0.2111, -0.4657, -1.3099,  ..., -0.4435, -0.6390, -0.2911]],
       grad_fn=<EmbeddingBackward0>)

## Explanation of the Embeddings
- Each row in the output corresponds to a token in the input sequence.
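- The values in each row are the layer's weights for that token in the 512-dimensional space. At this point they are still the random initial values; they become learned representations only through training.
- Repeated tokens share a row: `the` appears at positions 0 and 3, so those two output rows are identical.
- After training, these embeddings capture the semantic meaning and relationships between words.

For example, in a trained model, words that are similar or often appear in similar contexts tend to have similar vector representations.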
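
The points above can be checked directly. The sketch below rebuilds a layer with the same sizes (its random values will differ from the printed ones) and confirms that the lookup is plain row selection. The cosine similarity at the end is only there to show how vector similarity would be measured; on an untrained layer its value is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

layer = nn.Embedding(12, 512)
input_indices = torch.tensor([2, 10, 7, 2, 11])   # "the cat in the hat"
embeddings = layer(input_indices)

# Both occurrences of 'the' (positions 0 and 3) come from the same matrix row.
same_token_same_row = torch.equal(embeddings[0], embeddings[3])

# The whole output is just the weight matrix indexed by the token ids.
matches_weight_rows = torch.equal(embeddings, layer.weight[input_indices])

# Similarity between 'cat' and 'hat'; meaningful only after training.
cat_hat_similarity = F.cosine_similarity(embeddings[1], embeddings[4], dim=0).item()

print(same_token_same_row, matches_weight_rows)
```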

## Recap
1. **Vocabulary Size:** The number of unique tokens in the vocabulary (e.g., 10,000).
2. **Embedding Dimension:** The size of the vector representing each token (e.g., 512).
3. **Embedding Matrix:** A matrix of size `(vocab_size, embedding_dim)` initialized randomly.
4. **Token Indices:** Each word in the input sequence is converted to its corresponding index.
5. **Embedding Lookup:** The embedding layer retrieves the vectors for the input token indices, resulting in a sequence of vectors.

This process allows the model to learn and represent words in a continuous vector space, capturing their semantic relationships and improving the model's performance in tasks like text generation.
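
To illustrate how the matrix is updated during training, here is a sketch of a single optimization step. The squared-sum loss is a toy stand-in for a real training objective, and the embedding dimension of 8 is an assumption chosen to keep the example small. Only the rows whose indices appear in the input receive a gradient, so only those rows change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

layer = nn.Embedding(12, 8)                       # assumed small dimension
input_indices = torch.tensor([2, 10, 7, 2, 11])   # "the cat in the hat"
before = layer.weight.detach().clone()

optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = layer(input_indices).pow(2).sum()          # toy loss, not a real objective
loss.backward()
optimizer.step()

# Rows that differ from their initial values after the step.
changed = (layer.weight.detach() != before).any(dim=1)
updated_rows = changed.nonzero().flatten().tolist()

print(updated_rows)  # [2, 7, 10, 11]
```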

## Embeddings Passed to Positional Encoder

In a Transformer architecture, after converting tokens to embeddings, the embeddings are passed through the positional encoder. This step adds positional information to the embeddings, enabling the model to understand the order of the tokens.
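
The notebook does not say which positional encoder is used. As one possibility, the sketch below adds the fixed sine/cosine encoding from the original Transformer paper to the token embeddings. The function name `sinusoidal_positions` is hypothetical, and the code assumes an even embedding dimension.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, dim):
    """Fixed sine/cosine position table of shape (seq_len, dim); assumes dim is even."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )                                                                     # (dim / 2,)
    table = torch.zeros(seq_len, dim)
    table[:, 0::2] = torch.sin(position * div_term)   # even columns
    table[:, 1::2] = torch.cos(position * div_term)   # odd columns
    return table

torch.manual_seed(0)
embedding_layer = nn.Embedding(12, 512)
input_indices = torch.tensor([2, 10, 7, 2, 11])       # "the cat in the hat"
embeddings = embedding_layer(input_indices)           # (5, 512)

positions = sinusoidal_positions(embeddings.size(0), embeddings.size(1))
with_positions = embeddings + positions               # same shape as embeddings

print(with_positions.shape)
```

After the addition, the two occurrences of `the` no longer have identical vectors, which is how the model can tell their positions apart.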

## A More In-Depth Look at Embeddings

In [6]:
import torch
import torch.nn as nn

# Define the input text
text = """
This script implements a decoder-only Transformer model for text generation, similar to the architecture used in GPT (Generative Pre-trained Transformer). The model is trained on a text dataset, specifically the book "Pride and Prejudice," to learn to generate text in the style of the book.
The script trains the model on the text of "Pride and Prejudice" to generate text in a similar style. After training, the model can be used to generate text by predicting the next token in a sequence based on the previous tokens.
"""

# Tokenize the text and create the vocabulary
def create_vocab(text):
    words = text.split()
    print(f"Words: {words}")
    unique_words = set(words)  # removes duplicates (set order is not guaranteed)
    print(f"Unique Words: {unique_words}")
    vocab = {word: i + 4 for i, word in enumerate(unique_words)}  # map each word to an index; 0-3 are reserved for special tokens
    print(f"Vocab: {vocab}")

    # Add special tokens to the vocabulary with fixed indices
    vocab['<pad>'] = 0
    vocab['<unk>'] = 1
    vocab['<sos>'] = 2
    vocab['<eos>'] = 3
    return vocab

vocab = create_vocab(text)
print("Vocabulary size:", len(vocab))
print("Sample vocabulary:", {k: vocab[k] for k in list(vocab)[:10]})

Words: ['This', 'script', 'implements', 'a', 'decoder-only', 'Transformer', 'model', 'for', 'text', 'generation,', 'similar', 'to', 'the', 'architecture', 'used', 'in', 'GPT', '(Generative', 'Pre-trained', 'Transformer).', 'The', 'model', 'is', 'trained', 'on', 'a', 'text', 'dataset,', 'specifically', 'the', 'book', '"Pride', 'and', 'Prejudice,"', 'to', 'learn', 'to', 'generate', 'text', 'in', 'the', 'style', 'of', 'the', 'book.', 'The', 'script', 'trains', 'the', 'model', 'on', 'the', 'text', 'of', '"Pride', 'and', 'Prejudice"', 'to', 'generate', 'text', 'in', 'a', 'similar', 'style.', 'After', 'training,', 'the', 'model', 'can', 'be', 'used', 'to', 'generate', 'text', 'by', 'predicting', 'the', 'next', 'token', 'in', 'a', 'sequence', 'based', 'on', 'the', 'previous', 'tokens.']

In [7]:
# Define the embedding layer
vocab_size = len(vocab)  # Example vocabulary size
print(f"Vocab Size: {vocab_size}")

embedding_dim = 10  # Embedding dimension
embedding_layer = nn.Embedding(vocab_size, embedding_dim)
print(f"Embedding Layer: {embedding_layer}")

# Extract the embedding matrix
embedding_matrix = embedding_layer.weight.data

print("Embedding Matrix:")
print(embedding_matrix)


Vocab Size: 54
Embedding Layer: Embedding(54, 10)
Embedding Matrix:
tensor([[-0.7134,  0.8688,  1.3469,  1.1180,  0.1505, -0.0051, -0.3310,  1.2101,
          0.1746, -1.1810],
        [ 0.0110, -0.7756,  0.0655,  0.6984,  0.7567,  0.5876,  0.4589, -0.9991,
         -1.4687,  0.9774],
        [-0.8876,  0.1901,  0.6459,  0.7162, -0.1220, -1.2163,  1.0151,  1.9437,
          0.5806,  0.4763],
        [ 2.3391, -1.1586,  0.3593,  0.0393, -0.6560,  1.6169, -0.3945,  1.6842,
         -0.4104,  0.1307],
        [-0.6312,  0.9993,  0.1227,  0.1271, -0.7444, -1.6597, -0.6040, -0.3311,
         -0.5896,  0.4967],
        [ 1.6065, -0.7332,  0.1695,  0.9445, -0.0093, -0.5057,  1.2169,  0.4072,
          1.4626, -0.0852],
        [ 0.8678, -0.6703,  0.9492, -0.5831,  0.1969, -1.8614, -0.8551,  0.2058,
         -1.6572,  1.3315],
        [ 0.3368,  1.7116,  0.4607, -1.6562, -0.9687,  0.2387, -2.7380,  1.9829,
         -1.5454, -1.8589],
        [ 0.6160, -0.5725,  0.6636, -0.8246,  0.4503,

An embedding matrix of shape (54, 10) is created: one row per vocabulary entry (50 unique words plus 4 special tokens), and each row is a word vector of size 10. With a matrix this small, the full contents can be printed, as shown above.

### Extract the First Sentence and Convert to Indices

In [8]:
first_sentence = "This script implements a decoder-only Transformer model for text generation"
first_sentence_indices = [vocab.get(word, vocab['<unk>']) for word in first_sentence.split()]
print(f"Sentence indices: {first_sentence_indices}")

# Turn it into a tensor for processing
input_indices = torch.tensor(first_sentence_indices, dtype=torch.long)
print(f"Input indices: {input_indices}")


Sentence indices: [14, 33, 20, 53, 11, 38, 23, 34, 39, 1]
Input indices: tensor([14, 33, 20, 53, 11, 38, 23, 34, 39,  1])
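
The last index in the output is 1 (`<unk>`): the text contains `generation,` with a trailing comma, so the bare word `generation` is not in the vocabulary. One way to avoid this, sketched below under assumptions, is to strip surrounding punctuation and sort the words so the indices are stable across runs. `build_clean_vocab` and `encode_clean` are hypothetical helpers, and the sample sentence is a shortened stand-in for the notebook's text.

```python
import string

SPECIAL_TOKENS = ['<pad>', '<unk>', '<sos>', '<eos>']  # fixed indices 0-3

def build_clean_vocab(text):
    """Hypothetical helper: strip surrounding punctuation, sort for a stable order."""
    words = [w.strip(string.punctuation) for w in text.split()]
    unique_words = sorted({w for w in words if w})
    vocab = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for word in unique_words:
        vocab[word] = len(vocab)        # next free index after the special tokens
    return vocab

def encode_clean(sentence, vocab):
    """Apply the same stripping at lookup time; unknown words map to <unk>."""
    return [vocab.get(w.strip(string.punctuation), vocab['<unk>'])
            for w in sentence.split()]

sample = ("This script implements a decoder-only Transformer model "
          "for text generation, similar to GPT.")
vocab_clean = build_clean_vocab(sample)
ids = encode_clean(
    "This script implements a decoder-only Transformer model for text generation",
    vocab_clean,
)

print(len(vocab_clean), ids)
```

Stripping only leading and trailing punctuation keeps hyphenated words such as `decoder-only` intact.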


### Get the embeddings for the input sequence

In [20]:
embeddings = embedding_layer(input_indices)

print(f"Input sequence embedding: \n {embeddings} \n")

Input sequence embedding: 
 tensor([[-0.1164, -0.0656, -0.2980,  ...,  0.0169, -2.1042, -0.4275],
        [ 0.2040,  0.0248, -0.7561,  ..., -0.6447, -0.5856, -0.4786],
        [ 1.6655, -1.0964,  0.6137,  ..., -0.9008, -0.4949,  0.7380],
        [-0.1164, -0.0656, -0.2980,  ...,  0.0169, -2.1042, -0.4275],
        [-0.0211,  0.5515,  0.3734,  ..., -2.7401, -1.0484, -0.4281]],
       grad_fn=<EmbeddingBackward0>) 



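For each index, the layer retrieves the corresponding row of the embedding matrix $E$. With the 10-token sentence and `embedding_dim = 10` defined above, running the cells in order gives a result of shape `(10, 10)`. The outputs printed in this section have five rows of 512 values because they were produced with the layer and input from the earlier "the cat in the hat" example.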

For example, with `first_sentence_indices = [14, 33, 20, 53, 11, 38, 23, 34, 39, 1]` as printed above, the corresponding embeddings would look something like this. The values are illustrative placeholders, and the indices can change from run to run because the vocabulary is built from an unordered set:

In [21]:
embeddings_example = [
 [0.5, 0.6, ..., 0.8],  # Embedding for 'This' (index 14)
 [0.3, 0.4, ..., 0.7],  # Embedding for 'script' (index 33)
 [0.7, 0.1, ..., 0.9],  # Embedding for 'implements' (index 20)
 [0.6, 0.2, ..., 0.5],  # Embedding for 'a' (index 53)
 [0.4, 0.5, ..., 0.6],  # Embedding for 'decoder-only' (index 11)
 [0.2, 0.3, ..., 0.4],  # Embedding for 'Transformer' (index 38)
 [0.4, 0.5, ..., 0.6],  # Embedding for 'model' (index 23)
 [0.1, 0.2, ..., 0.3],  # Embedding for 'for' (index 34)
 [0.3, 0.4, ..., 0.7],  # Embedding for 'text' (index 39)
 [0.3, 0.4, ..., 0.7],  # Embedding for 'generation' (<unk>, index 1)
]

However, let's get the actual embeddings for `first_sentence_indices`:

In [22]:
print("Embeddings shape: ", embeddings.shape)

print(f"\nfirst_sentence_indices Embeddings: \n\n{embeddings}")

Embeddings shape:  torch.Size([5, 512])

first_sentence_indices Embeddings: 

tensor([[-0.1164, -0.0656, -0.2980,  ...,  0.0169, -2.1042, -0.4275],
        [ 0.2040,  0.0248, -0.7561,  ..., -0.6447, -0.5856, -0.4786],
        [ 1.6655, -1.0964,  0.6137,  ..., -0.9008, -0.4949,  0.7380],
        [-0.1164, -0.0656, -0.2980,  ...,  0.0169, -2.1042, -0.4275],
        [-0.0211,  0.5515,  0.3734,  ..., -2.7401, -1.0484, -0.4281]],
       grad_fn=<EmbeddingBackward0>)


Now that we have completed the first step and created embeddings for the input sequence (sentence), we can proceed to the next step in the Transformer architecture: adding positional embeddings. This helps the model understand the positions of words in a sequence, or in a batch of sequences, being processed.
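
Since the next step may operate on a batch of sequences, here is a sketch of embedding two sequences of different lengths. The second sequence and the use of `padding_idx=0` are assumptions for illustration; the notebook's own layer is created without a padding index.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

PAD = 0                                            # index of '<pad>' in the vocab
layer = nn.Embedding(54, 10, padding_idx=PAD)      # padding row stays at zero

seq_a = torch.tensor([14, 33, 20, 53, 11, 38, 23, 34, 39, 1])  # indices printed earlier
seq_b = torch.tensor([2, 23, 39, 3])               # hypothetical shorter sequence

# Pad the shorter sequence with PAD so both fit in one (batch, seq_len) tensor.
batch = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=PAD)
batch_embeddings = layer(batch)                    # (batch, seq_len, embedding_dim)

# Padded positions of the second sequence map to the all-zero padding row.
padding_is_zero = bool(torch.all(batch_embeddings[1, 4:] == 0))

print(batch.shape, batch_embeddings.shape, padding_is_zero)
```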