# Building LLMs From Scratch (Part 4): The Embedding Layer

Welcome to the notebook for Part 4 of the series! Here, we'll implement the concepts discussed in the Medium article, building our model's first and most critical layer: the embedding layer.

This layer is responsible for turning arbitrary integer token IDs into dense vectors that, once trained, capture both **semantic meaning** and **sequential order**.

### 🔗 Quick Links
- **Medium Article**: [Part 4: The Embedding Layer](https://soloshun.medium.com/building-llms-from-scratch-part-4-embedding-layer-0803f6b8495b)
- **GitHub Repository**: [llm-from-scratch](https://github.com/soloeinsteinmit/llm-from-scratch)


## 1. Setup

First, let's import the necessary libraries. We'll need `torch` for our neural network components and the `create_dataloader_v1` function we built in Part 3 to feed us data.


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import tiktoken
import sys

# Add the parent directory to the system path to allow imports
sys.path.insert(0, '../')

from src.part03_dataloader import create_dataloader_v1


## 2. Token Embeddings

A token embedding layer is essentially a lookup table (a matrix) where we store a dense vector for every token in our vocabulary. 

- **Shape**: `[vocab_size, embedding_dim]`
- **Function**: It maps a token ID (an integer) to its corresponding vector representation.
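
For a sense of scale, the shape above fixes the number of learnable weights: one row per token, one column per embedding dimension. The sketch below is plain arithmetic, using GPT-2's vocabulary size of 50,257 and two embedding widths (256 for this notebook's demo, 768 for the full 124M configuration). The helper name `embedding_param_count` is hypothetical, and the memory estimate assumes float32 storage.

```python
# Parameter count of an embedding table: one row per token, one column per dimension.
vocab_size = 50257  # GPT-2 vocabulary size, as used in this notebook


def embedding_param_count(vocab_size: int, embedding_dim: int) -> int:
    """Number of learnable weights in a [vocab_size, embedding_dim] lookup table."""
    return vocab_size * embedding_dim


for embedding_dim in (256, 768):
    n_params = embedding_param_count(vocab_size, embedding_dim)
    # Assumption: weights are stored as float32 (4 bytes each)
    size_mib = n_params * 4 / 2**20
    print(f"dim={embedding_dim}: {n_params:,} parameters, about {size_mib:.1f} MiB")
```

Even at the reduced width of 256, the token table holds roughly 12.9 million weights, which is why the embedding layer accounts for a noticeable share of a small model's parameters.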

Let's define our model's hyperparameters and create the layer.


In [None]:
# Define the hyperparameters for our model
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size (same as GPT-2)
    "context_size": 256,    # Context length
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-Key-Value bias
}

# For this notebook, we'll use smaller values to make things manageable
vocab_size = GPT_CONFIG_124M["vocab_size"]  # Keep the real vocab size
embedding_dim = 256  # Smaller embedding dimension for demonstration
context_size = 4     # Tiny context size for easy inspection

print("Configuration:")
print(f"  - Vocabulary Size: {vocab_size:,}")
print(f"  - Embedding Dimension: {embedding_dim}")
print(f"  - Context Size: {context_size}")
print(f"  - Embedding Matrix Shape: [{vocab_size:,}, {embedding_dim}]")


In [None]:
# Create the token embedding layer
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

print("✅ Token Embedding Layer Created!")
print(f"Shape of the embedding matrix: {token_embedding_layer.weight.shape}")
print(f"This matrix contains {token_embedding_layer.weight.numel():,} learnable parameters!")

# Let's examine a few token embeddings
print("\n🔍 Examining specific token embeddings:")
print(f"Token ID 0 embedding shape: {token_embedding_layer.weight[0].shape}")
print(f"Token ID 0 embedding (first 10 values): {token_embedding_layer.weight[0][:10]}")

# Demonstrate that we can look up embeddings for specific tokens
sample_token_ids = torch.tensor([0, 1, 2])
sample_embeddings = token_embedding_layer(sample_token_ids)
print(f"\nSample embeddings for tokens [0, 1, 2] shape: {sample_embeddings.shape}")


### Processing a Batch of Data

Now, let's see how this layer processes a batch of token IDs from our `DataLoader`.


In [None]:
# 1. Load some text data
with open("../data/the-verdict.txt", 'r', encoding="utf-8") as f:
    raw_text = f.read()

# 2. Create a dataloader with a small context size
dataloader = create_dataloader_v1(
    raw_text, 
    batch_size=8, 
    max_length=context_size,
    stride=context_size,
    shuffle=False
)

# 3. Get one batch of data
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs (inputs):\n", inputs)
print("\nShape of inputs:", inputs.shape)


In [None]:
# 4. Pass the token IDs through the embedding layer
token_embeddings = token_embedding_layer(inputs)

print("🎯 Token Embeddings Results:")
print(f"Shape of token embeddings: {token_embeddings.shape}")
print(f"Each token ID has been converted to a {embedding_dim}-dimensional vector!")

# Let's examine one sample in detail
print(f"\n🔍 Examining first sample:")
print(f"Token IDs: {inputs[0]}")
print(f"Token embeddings shape for first sample: {token_embeddings[0].shape}")
print(f"First token embedding (first 10 values): {token_embeddings[0][0][:10]}")

# Show that different tokens have different embeddings
print(f"\n📊 Embedding differences:")
print(f"Are embeddings for tokens {inputs[0][0]} and {inputs[0][1]} the same? {torch.equal(token_embeddings[0][0], token_embeddings[0][1])}")

# Calculate the distance between two token embeddings
distance = torch.norm(token_embeddings[0][0] - token_embeddings[0][1])
print(f"L2 distance between first two token embeddings: {distance:.4f}")


As you can see, the input tensor of shape `[8, 4]` has been transformed into an output tensor of shape `[8, 4, 256]`. Each of the original integer token IDs is now represented by a 256-dimensional vector.
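
The same shape rule holds for any integer tensor: the output has the input's shape with one extra trailing dimension of size `embedding_dim`. Here is a minimal sketch with toy sizes (a 10-token vocabulary and 3-dimensional vectors, both chosen only for illustration) that does not depend on the dataloader:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)  # toy sizes

single_sequence = torch.tensor([1, 5, 9])      # shape [3]
batch = torch.tensor([[1, 5, 9], [9, 5, 1]])   # shape [2, 3]

single_out = toy_embedding(single_sequence)
batch_out = toy_embedding(batch)

print(single_out.shape)  # torch.Size([3, 3])
print(batch_out.shape)   # torch.Size([2, 3, 3])

# The same token ID always retrieves the same row, wherever it sits in the batch
print(torch.equal(batch_out[0, 0], batch_out[1, 2]))  # True
```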


## 3. Positional Embeddings

Token embeddings capture meaning but lose the order of the words: the lookup returns the same vector for a token ID no matter where it appears in the sequence. To fix this, we introduce **positional embeddings**.

This is another learnable lookup table, but this time it encodes the *position* of a token in the sequence (0, 1, 2, ...), not its meaning.

- **Shape**: `[context_size, embedding_dim]`
- **Function**: It maps a position index to a vector.
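
To see why the position table is needed, consider a sequence in which the same token appears twice. The sketch below uses toy sizes (10 tokens, 3 positions, 4 dimensions, all picked for illustration) and randomly initialized layers, so it only shows the mechanics and not any learned behavior:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tok_emb = nn.Embedding(10, 4)  # toy vocabulary of 10 tokens
pos_emb = nn.Embedding(3, 4)   # toy context of 3 positions

seq = torch.tensor([2, 7, 2])  # token 2 appears at positions 0 and 2

tok_only = tok_emb(seq)
# Without positions, both occurrences of token 2 are indistinguishable
print(torch.equal(tok_only[0], tok_only[2]))  # True

with_pos = tok_only + pos_emb(torch.arange(3))
# With a position vector added, the two occurrences now differ
print(torch.equal(with_pos[0], with_pos[2]))  # False
```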


In [None]:
# Create the positional embedding layer
pos_embedding_layer = torch.nn.Embedding(context_size, embedding_dim)

print("✅ Positional Embedding Layer Created!")
print(f"Shape of the positional embedding matrix: {pos_embedding_layer.weight.shape}")
print(f"This matrix contains {pos_embedding_layer.weight.numel():,} learnable parameters")

# Show what each position embedding looks like
print(f"\n🔍 Position embeddings:")
for i in range(context_size):
    pos_emb = pos_embedding_layer.weight[i]
    print(f"Position {i} embedding (first 10 values): {pos_emb[:10]}")


To get the positional embeddings for our sequence of length 4, we pass the indices `0, 1, 2, 3` to this layer.
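
One consequence worth sketching: a learned position table only has rows for indices `0` to `context_size - 1`, so asking for a later position fails. The example below uses a toy embedding width of 8 and runs on CPU, where PyTorch raises an `IndexError` for an out-of-range index.

```python
import torch
import torch.nn as nn

context_size = 4
pos_layer = nn.Embedding(context_size, 8)  # toy embedding width

valid = pos_layer(torch.arange(context_size))  # indices 0..3 all have a row
print(valid.shape)  # torch.Size([4, 8])

try:
    pos_layer(torch.tensor([context_size]))    # index 4 has no row in the table
    out_of_range_failed = False
except IndexError as err:
    out_of_range_failed = True
    print("IndexError:", err)
```

This is why the context size has to be fixed before training: sequences longer than the table cannot be embedded without truncating or splitting them first.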


In [None]:
# Get the positional embeddings
pos_ids = torch.arange(context_size)
print("🔢 Position IDs:", pos_ids)

pos_embeddings = pos_embedding_layer(pos_ids)
print(f"\n🎯 Positional Embeddings Results:")
print(f"Shape of positional embeddings: {pos_embeddings.shape}")
print(f"These represent position information for {context_size} positions")

# Show that position embeddings are different for different positions
print(f"\n📊 Position differences:")
print(f"Are position 0 and position 1 embeddings the same? {torch.equal(pos_embeddings[0], pos_embeddings[1])}")

# Calculate distance between position embeddings
pos_distance = torch.norm(pos_embeddings[0] - pos_embeddings[1])
print(f"L2 distance between position 0 and position 1 embeddings: {pos_distance:.4f}")


## 4. Combining Token and Positional Embeddings

The final input to the transformer is the sum of the token and positional embeddings. PyTorch's **broadcasting** feature makes this easy. When we add the `[8, 4, 256]` token embeddings to the `[4, 256]` positional embeddings, PyTorch automatically expands the positional embeddings to match the batch dimension.

`Input Embeddings = Token Embeddings + Positional Embeddings`
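
As a small illustration of the broadcasting rule, the sketch below uses toy shapes and easy-to-read values (all ones for the "token" tensor, the position index for the "positional" tensor) and checks that the broadcast sum matches an explicit expansion along the batch dimension:

```python
import torch

batch_size, seq_len, emb_dim = 2, 3, 4  # toy sizes
tok = torch.ones(batch_size, seq_len, emb_dim)

# Row p of `pos` is filled with the value p, so the result is easy to read
pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1).expand(seq_len, emb_dim)

broadcast_sum = tok + pos  # [2, 3, 4] + [3, 4] -> [2, 3, 4]
explicit_sum = tok + pos.unsqueeze(0).expand(batch_size, seq_len, emb_dim)

print(broadcast_sum.shape)                       # torch.Size([2, 3, 4])
print(torch.equal(broadcast_sum, explicit_sum))  # True
print(broadcast_sum[:, 2, 0])                    # tensor([3., 3.])
```

Every sequence in the batch receives the same position vectors; only the token vectors differ between sequences.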


In [None]:
# Add the two embeddings together
input_embeddings = token_embeddings + pos_embeddings

print("🔄 Broadcasting and Addition:")
print(f"Token Embeddings shape:     {token_embeddings.shape}")
print(f"Positional Embeddings shape: {pos_embeddings.shape}")
print(f"Final Input Embeddings shape: {input_embeddings.shape}")

print("\n✨ PyTorch automatically broadcast the positional embeddings!")
print(f"The {list(pos_embeddings.shape)} positional tensor was expanded to {list(token_embeddings.shape)} to match the token embeddings")

# Verify that the addition worked correctly
print(f"\n🔍 Verification:")
print(f"Original token embedding for first token: {token_embeddings[0][0][:5]}")
print(f"Position 0 embedding: {pos_embeddings[0][:5]}")
print(f"Combined embedding for first token: {input_embeddings[0][0][:5]}")
print(f"Manual addition check: {(token_embeddings[0][0] + pos_embeddings[0])[:5]}")

# Show that all batches got the same positional information
print(f"\n📊 Batch consistency:")
print(f"Position embedding added to batch 0, position 0: {(input_embeddings[0][0] - token_embeddings[0][0])[:5]}")
print(f"Position embedding added to batch 7, position 0: {(input_embeddings[7][0] - token_embeddings[7][0])[:5]}")
print(f"Are they the same? {torch.equal(input_embeddings[0][0] - token_embeddings[0][0], input_embeddings[7][0] - token_embeddings[7][0])}")


## 5. Understanding the Embedding Lookup Process

Let's dive deeper into how the embedding layer works. It's essentially a **lookup table operation** that retrieves rows from the embedding matrix using token IDs.
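
Another way to read the lookup is as a matrix product: multiplying a one-hot vector by the embedding matrix selects exactly one row. The sketch below checks this equivalence on a toy table (6 tokens, 3 dimensions). It is meant as an illustration of the idea, not a description of how PyTorch implements the lookup internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(6, 3)          # toy table: 6 tokens, 3 dimensions
ids = torch.tensor([2, 3, 4, 1])

lookup = emb(ids)                 # direct row lookup, shape [4, 3]

one_hot = F.one_hot(ids, num_classes=6).float()  # shape [4, 6]
matmul = one_hot @ emb.weight                    # shape [4, 3]

print(torch.allclose(lookup, matmul))  # True
```

The lookup avoids building the mostly-zero one-hot matrix, which matters when the vocabulary has tens of thousands of entries.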


In [None]:
# Demonstrate the lookup process step by step
print("🔍 Step-by-step Embedding Lookup Process:")
print("="*50)

# 1. Create a simple example with small vocabulary
simple_vocab_size = 6
simple_emb_dim = 3
torch.manual_seed(123)
simple_embedding = nn.Embedding(simple_vocab_size, simple_emb_dim)

print(f"Simple embedding matrix shape: {simple_embedding.weight.shape}")
print(f"Embedding matrix weights:")
print(simple_embedding.weight)

# 2. Look up specific tokens
test_token_ids = torch.tensor([2, 3, 4, 1])
print(f"\nToken IDs to look up: {test_token_ids}")

# Method 1: Using the embedding layer directly
embeddings_method1 = simple_embedding(test_token_ids)
print(f"\nMethod 1 - Using embedding layer:")
print(f"Result shape: {embeddings_method1.shape}")
print(f"Embeddings:\n{embeddings_method1}")

# Method 2: Manual lookup (equivalent to what happens internally)
embeddings_method2 = simple_embedding.weight[test_token_ids]
print(f"\nMethod 2 - Manual weight lookup:")
print(f"Result shape: {embeddings_method2.shape}")
print(f"Embeddings:\n{embeddings_method2}")

# Verify they're the same
print(f"\nAre both methods identical? {torch.equal(embeddings_method1, embeddings_method2)}")

print("\n" + "="*50)
print("✅ The embedding layer is just an efficient lookup table!")


## 6. Creating a Complete GPT Embedding Module

Let's now create a reusable module that combines both token and positional embeddings, similar to what we'll use in our complete GPT model.


In [None]:
class GPTEmbedding(nn.Module):
    """
    Combined token and positional embedding layer for GPT-style models.
    """
    def __init__(self, vocab_size, emb_dim, context_size):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.pos_emb = nn.Embedding(context_size, emb_dim)
        
    def forward(self, token_ids):
        """
        Args:
            token_ids: Tensor of shape [batch_size, seq_len]
        Returns:
            Combined embeddings of shape [batch_size, seq_len, emb_dim]
        """
        batch_size, seq_len = token_ids.shape
        
        # Get token embeddings
        tok_embeddings = self.tok_emb(token_ids)
        
        # Get positional embeddings
        pos_ids = torch.arange(seq_len, device=token_ids.device)
        pos_embeddings = self.pos_emb(pos_ids)
        
        # Combine them
        return tok_embeddings + pos_embeddings

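# Test our complete embedding module
embedding_module = GPTEmbedding(vocab_size, embedding_dim, context_size)

# Copy the earlier layers' weights so the module can match the manual result
with torch.no_grad():
    embedding_module.tok_emb.weight.copy_(token_embedding_layer.weight)
    embedding_module.pos_emb.weight.copy_(pos_embedding_layer.weight)

# Test with the same data
final_embeddings = embedding_module(inputs)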

print("🚀 Complete GPT Embedding Module Test:")
print(f"Input shape: {inputs.shape}")
print(f"Output shape: {final_embeddings.shape}")
print(f"Total parameters: {sum(p.numel() for p in embedding_module.parameters()):,}")

# Verify it gives the same result as our manual approach
manual_result = token_embeddings + pos_embeddings
print(f"\nDoes our module match manual approach? {torch.allclose(final_embeddings, manual_result)}")

print("\n🎉 Success! Our embedding layer is ready for the transformer!")


## 🎯 Summary

We have successfully created the first neural network layer of our LLM! 

**What we've accomplished:**
- ✅ Built token embeddings to capture semantic meaning
- ✅ Added positional embeddings to capture word order  
- ✅ Combined them into a complete input representation
- ✅ Created a reusable `GPTEmbedding` module

**Key takeaways:**
- Embeddings transform meaningless token IDs into rich, learnable vectors
- Positional embeddings solve the "bag of words" problem 
- The embedding layer is just an efficient lookup table
- Broadcasting makes it easy to combine different tensor shapes

**What's next:** In Part 5, we'll build the self-attention mechanism that will process these embeddings and allow our model to understand relationships between words!
