Ref: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html

In [27]:
from importlib.metadata import version

print("torch version:", version("torch"))

import torch

torch version: 2.5.1


# Sentence Embedding

In [28]:
import torch
import torch.nn as nn

torch.manual_seed(123) #for reproducibility

# 1) Naive whitespace tokenization (commas removed; the final period stays attached to 'big.')
sentence = "my shoes are small, my feet are big."
tokens = sentence.replace(",", "").split()
print("Tokens:", tokens)

# 2) Build vocab
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)
print("Vocab:", vocab)

# 3) Convert tokens to IDs
token_ids = [vocab[tok] for tok in tokens]
token_ids_tensor = torch.tensor(token_ids)

print("Token IDs (no batch):", token_ids_tensor)
print("Shape:", token_ids_tensor.shape)  # [8]

# 4) Add a batch dimension
#    Now token_ids_tensor is [1, 8] instead of [8]
token_ids_tensor = token_ids_tensor.unsqueeze(0)
print("\nToken IDs with batch dimension:", token_ids_tensor)
print("Shape:", token_ids_tensor.shape)  # [1, 8]

# 5) Create an embedding layer
embedding_dim = 4
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# 6) Apply the embedding
embedded_tokens = embed(token_ids_tensor)
print("\nEmbedded Tokens with batch size = 1:")
print(embedded_tokens)
print("Shape:", embedded_tokens.shape)  # [1, 8, 4]

Tokens: ['my', 'shoes', 'are', 'small', 'my', 'feet', 'are', 'big.']
Vocab: {'my': 0, 'shoes': 1, 'are': 2, 'small': 3, 'feet': 4, 'big.': 5}
Token IDs (no batch): tensor([0, 1, 2, 3, 0, 4, 2, 5])
Shape: torch.Size([8])

Token IDs with batch dimension: tensor([[0, 1, 2, 3, 0, 4, 2, 5]])
Shape: torch.Size([1, 8])

Embedded Tokens with batch size = 1:
tensor([[[ 0.3374, -0.1778, -0.3035, -0.5880],
         [ 0.3486,  0.6603, -0.2196, -0.3792],
         [-0.1606, -0.4015,  0.6957, -1.8061],
         [ 1.8960, -0.1750,  1.3689, -1.6033],
         [ 0.3374, -0.1778, -0.3035, -0.5880],
         [-0.7849, -1.4096, -0.4076,  0.7953],
         [-0.1606, -0.4015,  0.6957, -1.8061],
         [ 0.9985,  0.2212,  1.8319, -0.3378]]], grad_fn=<EmbeddingBackward0>)
Shape: torch.Size([1, 8, 4])
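
An embedding layer is a lookup table: each token ID selects one row of the layer's weight matrix. The sketch below rebuilds a small layer (the 6×4 size mirrors the vocabulary and `embedding_dim` above; the names `ids`, `looked_up`, and `indexed` are illustrative) and checks that the forward pass matches plain row indexing, so repeated tokens such as "my" and "are" receive identical vectors.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Same sizes as above: 6 vocabulary entries, 4 embedding dimensions
embed = nn.Embedding(num_embeddings=6, embedding_dim=4)
ids = torch.tensor([[0, 1, 2, 3, 0, 4, 2, 5]])  # IDs for the example sentence

looked_up = embed(ids)        # forward pass, shape [1, 8, 4]
indexed = embed.weight[ids]   # direct row indexing, shape [1, 8, 4]

print(torch.equal(looked_up, indexed))                # same values either way
print(torch.equal(looked_up[0, 0], looked_up[0, 4]))  # both occurrences of "my"
print(torch.equal(looked_up[0, 2], looked_up[0, 6]))  # both occurrences of "are"
```

This is also why rows 0 and 4 (and rows 2 and 6) of the printed tensor above are identical: without positional information, the model cannot tell repeated tokens apart.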


# Positional Encoding

In [29]:
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """
    A simple learned positional embedding for a maximum sequence length of `max_len`.
    Each position i gets an embedding of size `d_model`.
    """
    def __init__(self, d_model, max_len=512):
        super().__init__()
        torch.manual_seed(123) #for reproducibility
        # Create an embedding layer for positions
        self.pos_embedding = nn.Embedding(num_embeddings=max_len, embedding_dim=d_model)

    def forward(self, x):
        """
        x: A tensor of shape (batch_size, seq_len, d_model)
           containing token embeddings.

        Returns:
        A tensor of shape (batch_size, seq_len, d_model),
        where each token embedding is augmented with its
        learned positional vector.
        """
        # Get the sequence length from the input
        seq_len = x.size(1)

        # Create a range of positions: 0..seq_len-1
        # shape: (1, seq_len)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)

        # Retrieve the positional embeddings for these positions
        # shape: (1, seq_len, d_model)
        pos_emb = self.pos_embedding(positions)

        # Add them to the original token embeddings
        # Broadcasting over batch dimension
        x = x + pos_emb

        return x

In [30]:
# Parameters
seq_len = len(tokens)
d_model = embedding_dim
max_len = 10

# Initialize learned positional encoding
learned_pe = LearnedPositionalEncoding(d_model=embedding_dim, max_len=max_len)

# Forward pass
x_with_pe = learned_pe(embedded_tokens)

print("Input shape:", embedded_tokens.shape)    # (1, 8, 4)
print("Output shape:", x_with_pe.shape)         # (1, 8, 4)
print("Positional Embeddings added to the input embedding:", x_with_pe)

Input shape: torch.Size([1, 8, 4])
Output shape: torch.Size([1, 8, 4])
Positional Embeddings added to the input embedding: tensor([[[ 0.6747, -0.3556, -0.6071, -1.1760],
         [ 0.6972,  1.3207, -0.4393, -0.7583],
         [ 0.6065, -1.5940,  1.3941, -3.2158],
         [ 2.0753,  1.7201,  1.8644, -1.3341],
         [ 0.2603, -1.1982, -0.4725,  0.3298],
         [ 0.7961, -0.1085,  0.8677,  0.5944],
         [ 0.8019, -0.1523,  0.2112, -3.8990],
         [ 0.1786, -0.1998,  0.8699,  0.9446]]], grad_fn=<AddBackward0>)
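
Because `pos_emb` has shape `(1, seq_len, d_model)`, the addition broadcasts across the batch: every sequence in a batch receives the same position vectors. A minimal sketch, assuming a hypothetical batch of two random inputs (the names `pos_table` and `x` are illustrative, not part of the notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, max_len, seq_len = 4, 10, 8
pos_table = nn.Embedding(max_len, d_model)      # one learned vector per position

x = torch.randn(2, seq_len, d_model)            # hypothetical batch of 2 sequences
positions = torch.arange(seq_len).unsqueeze(0)  # shape (1, seq_len)
pos_emb = pos_table(positions)                  # shape (1, seq_len, d_model)

out = x + pos_emb                               # broadcasts over the batch dimension
offset = out - x                                # what was added to each sequence

print(out.shape)
print(torch.allclose(offset[0], offset[1], atol=1e-5))                   # same offsets for both sequences
print(torch.allclose(offset[0], pos_table.weight[:seq_len], atol=1e-5))  # offsets are the table rows
```

A sequence longer than `max_len` would index past the end of the table, so `max_len` has to cover the longest input you expect.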


At this point:

- Batch size = 1
- Sequence length = 8 (tokens)
- Embedding dim = 4

# Single Self-Attention Head

## Define Q/K/V Projections

Next, we look at scaled dot-product attention, the self-attention mechanism used in the Transformer architecture.

Self-attention uses three weight matrices, $W_q$, $W_k$, and $W_v$, which are learned as model parameters during training. They project the inputs into the query, key, and value components of the sequence, respectively.

Because we compute the dot product between query and key vectors, the two must contain the same number of elements. However, the number of elements in the value vector $v^{(i)}$, which determines the size of the resulting context vector, is arbitrary.
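
To make the dimension constraint concrete, the sketch below uses assumed sizes `d_k = 2` for queries and keys and `d_v = 6` for values (both chosen only for illustration). The dot product works because queries and keys share `d_k`, and the context vectors inherit `d_v`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_in, d_k, d_v = 4, 2, 6           # illustrative sizes, not the ones used in this notebook
x = torch.randn(1, 8, d_in)        # random stand-in for the embedded sentence

W_q = nn.Linear(d_in, d_k, bias=False)
W_k = nn.Linear(d_in, d_k, bias=False)
W_v = nn.Linear(d_in, d_v, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)   # [1, 8, 2], [1, 8, 2], [1, 8, 6]

scores = q @ k.transpose(-1, -2) / d_k ** 0.5   # [1, 8, 8]; requires matching d_k
weights = F.softmax(scores, dim=-1)
context = weights @ v                            # [1, 8, 6]; width set by d_v

print(scores.shape, context.shape)
```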

We’ll create small linear layers that project the 4-dimensional embeddings into Q, K, and V. For simplicity, Q, K, and V also have dimension 4 (multi-head attention typically uses a smaller per-head dimension, but we keep a single full-width head here).

In [47]:
torch.manual_seed(123)

d_model = 4
embedded_tokens = x_with_pe

W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

# Transform to Q, K, V
Q = W_Q(embedded_tokens)  # shape: [1, 8, 4]
K = W_K(embedded_tokens)  # shape: [1, 8, 4]
V = W_V(embedded_tokens)  # shape: [1, 8, 4]
 

In [48]:
print("W_Q", W_Q)
print("Q", Q)

W_Q Linear(in_features=4, out_features=4, bias=False)
Q tensor([[[-0.2145,  0.2703, -0.4016,  0.0485],
         [-0.1542,  0.6481,  0.0180, -0.5414],
         [-1.1026, -0.0709, -1.4118,  0.2862],
         [-1.1092, -0.4009, -0.8602, -1.6137],
         [ 0.1066, -0.5095, -0.2046,  0.3074],
         [-0.2675, -0.9306, -0.3252, -0.5226],
         [-0.9536,  1.0757, -1.0564,  0.0503],
         [-0.0776, -0.8410, -0.0856, -0.2677]]], grad_fn=<UnsafeViewBackward0>)


## Compute Self-Attention for a Single Token: "shoes"

Let’s isolate the second token (index 1), “shoes.” Computing its attention still requires the keys and values of every token, but only the query for “shoes.”

- Recall that the three weight matrices project each embedded input token, $x^{(i)}$, into query, key, and value vectors via matrix multiplication:

  - Query vector: $q^{(i)} = W_q \,x^{(i)}$
  - Key vector: $k^{(i)} = W_k \,x^{(i)}$
  - Value vector: $v^{(i)} = W_v \,x^{(i)}$
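
In PyTorch, `nn.Linear` stores its weight with shape `(out_features, in_features)` and computes `x @ weight.T`, which for a single token is the same as the matrix-vector product $W_q\,x^{(i)}$ in the formulas above. A small check, using a random stand-in input rather than the notebook's tensors:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

W_q = nn.Linear(4, 4, bias=False)
x = torch.randn(8, 4)                 # 8 token vectors (random stand-ins)

q_all = W_q(x)                        # all queries at once, shape [8, 4]
q_1 = W_q.weight @ x[1]               # W_q x^(1) for the token at index 1

print(torch.allclose(q_all, x @ W_q.weight.T, atol=1e-6))  # batched form
print(torch.allclose(q_all[1], q_1, atol=1e-6))            # per-token form
```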

In [49]:
import torch.nn.functional as F

# "shoes" is token index 1
shoes_query = Q[0, 1, :]  # shape: [4], we drop the batch dimension index 0
shoes_query = shoes_query.unsqueeze(0)  # shape: [1, 4] -> (1, d_model)
print("shoes_query", shoes_query)

# Keys: shape [1, 8, 4] -> we drop batch dimension => shape: [8, 4]
all_keys = K[0]  # shape: [8, 4]
all_values = V[0]  # shape: [8, 4]

# 1) Dot product between query and all keys => shape: [1, 4] x [8, 4]^T = [1, 8]
attn_scores = shoes_query.matmul(all_keys.transpose(0,1))  # => shape: [1, 8]

# 2) Scale by sqrt(d_k), the key dimension (here d_k == d_model == 4)
attn_scores = attn_scores / (d_model**0.5)

# 3) Softmax
attn_weights = F.softmax(attn_scores, dim=-1)  # shape: [1, 8]

print("\nAttention Scores (shoes):", attn_scores)
print("Attention Weights (shoes):", attn_weights)

shoes_query tensor([[-0.1542,  0.6481,  0.0180, -0.5414]], grad_fn=<UnsqueezeBackward0>)

Attention Scores (shoes): tensor([[-0.0820,  0.1623, -0.5437,  0.0661, -0.0908,  0.0062, -0.3285,  0.0021]],
       grad_fn=<DivBackward0>)
Attention Weights (shoes): tensor([[0.1247, 0.1592, 0.0786, 0.1446, 0.1236, 0.1362, 0.0975, 0.1356]],
       grad_fn=<SoftmaxBackward0>)


Now, `attn_weights` is a probability distribution over the 8 tokens (the weights sum to 1), telling us how strongly “shoes” attends to each token, including itself.
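
As a sanity check, the softmax can be reproduced by hand from the printed scores: exponentiate each score and divide by the sum. The values below are copied from the output above, so the result agrees with the printed weights only up to rounding.

```python
import math

# Scaled attention scores for "shoes", copied from the printed output above
scores = [-0.0820, 0.1623, -0.5437, 0.0661, -0.0908, 0.0062, -0.3285, 0.0021]
printed_weights = [0.1247, 0.1592, 0.0786, 0.1446, 0.1236, 0.1362, 0.0975, 0.1356]

exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

print([round(w, 4) for w in weights])
print("sum of weights:", round(sum(weights), 6))
print("max difference from printed:", max(abs(w - p) for w, p in zip(weights, printed_weights)))
```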

## Context Vector for “shoes”

In [50]:
# shape: [1, 8] x [8, 4] = [1, 4]
shoes_context = attn_weights.matmul(all_values)
print("\nContext vector for 'shoes':", shoes_context)
print("Shape:", shoes_context.shape)  # [1, 4]


Context vector for 'shoes': tensor([[ 0.4238, -0.5244, -0.5151, -0.1296]], grad_fn=<MmBackward0>)
Shape: torch.Size([1, 4])


<b>Interpretation:</b>

- `shoes_context` is now the 4-dimensional “refined” embedding for “shoes” that incorporates information from all tokens in the sentence.
- If `attn_weights` is high for certain tokens (e.g., “small”), those tokens’ Value vectors contribute more to the final context.

- The values in `attn_scores` are essentially arbitrary here because `W_Q`, `W_K`, and `W_V` are randomly initialized and untrained.
- After softmax, `attn_weights` may favor some tokens over others, but purely by chance in this untrained scenario.
- `shoes_context` is a 4D vector that combines the Value vectors of all tokens in proportion to these attention weights.
- In a trained model, you might see “shoes” attend more strongly to tokens like “small” or to the repeated “my,” reflecting learned semantics. Either way, this snippet demonstrates exactly how to:

1. Get a single query for “shoes.”
2. Dot it with all keys.
3. Softmax → attention weights.
4. Weighted sum of Values → context vector for “shoes.”
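
The four steps above generalize from one query to the whole sequence by replacing the single query row with the full query matrix. The sketch below wraps this in a helper (the name `scaled_dot_product_attention_2d` and the random inputs are illustrative assumptions) and confirms that its row for index 1 matches the one-query computation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention_2d(Q, K, V):
    """Attention for unbatched inputs: Q, K are [seq_len, d_k], V is [seq_len, d_v]."""
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k ** 0.5          # [seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ V, weights            # context: [seq_len, d_v]

torch.manual_seed(0)
Q, K, V = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)  # random stand-ins

context, weights = scaled_dot_product_attention_2d(Q, K, V)

# Steps 1-4 for the single query at index 1
q1 = Q[1].unsqueeze(0)                                   # [1, 4]
w1 = F.softmax(q1 @ K.T / Q.size(-1) ** 0.5, dim=-1)     # [1, 8]
c1 = w1 @ V                                              # [1, 4]

print(torch.allclose(context[1], c1[0], atol=1e-6))
print(weights.sum(dim=-1))  # every row sums to 1, up to floating-point error
```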

# Multi-Head Attention

Below is a toy multi-head attention example in PyTorch, building on the previous single-head demo. We’ll assume:

- `d_model = 4`
- `num_heads = 2`
- `head_dim = d_model / num_heads = 2`


We’ll define a mini MultiHeadSelfAttention class that:

- Splits $Q$, $K$, $V$ into heads,
- Performs scaled dot-product attention per head,
- Concatenates the results,
- Applies a final output projection.

Then we’ll run it on a batch of size 1 with our sentence. Remember, this is still untrained with random weights—just a mechanical demonstration.

In [54]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=4, num_heads=2):
        """
        A simple multi-head self-attention for demonstration.
        d_model = total embedding dimension,
        num_heads = how many heads,
        head_dim = d_model // num_heads.
        """
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        # Linear layers for Q, K, V
        torch.manual_seed(123)
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

        # Final projection after concatenating heads
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        """
        x shape: (batch_size, seq_len, d_model)
        Returns: (batch_size, seq_len, d_model)
        """
        bsz, seq_len, _ = x.size()

        # 1) Compute Q, K, V => shape: (bsz, seq_len, d_model)
        Q = self.W_Q(x)
        K = self.W_K(x)
        V = self.W_V(x)

        # 2) Reshape for multi-head: => (bsz, seq_len, num_heads, head_dim) => transpose
        # to (bsz, num_heads, seq_len, head_dim)
        Q = Q.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        # Now shape is (bsz, num_heads, seq_len, head_dim)

        # 3) Compute scaled dot-product attention for each head
        #    scores => (bsz, num_heads, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-1, -2)) / (self.head_dim ** 0.5)
        attn_weights = F.softmax(scores, dim=-1)  # across last dim => seq_len

        # Weighted sum => context: (bsz, num_heads, seq_len, head_dim)
        context = torch.matmul(attn_weights, V)

        # 4) Transpose & reshape back to (bsz, seq_len, d_model)
        context = context.transpose(1, 2).contiguous()  # (bsz, seq_len, num_heads, head_dim)
        context = context.view(bsz, seq_len, self.d_model)

        # 5) Final linear projection
        out = self.out_proj(context)
        return out, attn_weights


__Explanation__

- We create $Q$, $K$, $V$, each of size `(batch_size, seq_len, d_model)`.
- Reshape to `(batch_size, num_heads, seq_len, head_dim)`.
- Dot-product attention per head → shape `(batch_size, num_heads, seq_len, head_dim)`.
- Concat heads → `(batch_size, seq_len, d_model)`.
- Final linear projection → `(batch_size, seq_len, d_model)`.
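
The reshape in step 2 only regroups features; no values change. A small sketch with a made-up tensor whose entries are 0..31 makes the layout visible: head 0 receives features 0-1 of every token, head 1 receives features 2-3, and the inverse transpose and view restore the original tensor.

```python
import torch

bsz, seq_len, d_model, num_heads = 1, 8, 4, 2
head_dim = d_model // num_heads

# Made-up input whose entries are 0..31, so positions are easy to track
x = torch.arange(bsz * seq_len * d_model, dtype=torch.float32).view(bsz, seq_len, d_model)

# Split: (bsz, seq_len, d_model) -> (bsz, num_heads, seq_len, head_dim)
heads = x.view(bsz, seq_len, num_heads, head_dim).transpose(1, 2)
print(heads.shape)

# Head 0 holds features 0-1 of every token, head 1 holds features 2-3
print(torch.equal(heads[0, 0], x[0, :, :head_dim]))
print(torch.equal(heads[0, 1], x[0, :, head_dim:]))

# Merge: undo the transpose, then flatten the heads back into d_model
merged = heads.transpose(1, 2).contiguous().view(bsz, seq_len, d_model)
print(torch.equal(merged, x))
```

The `.contiguous()` call matters because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.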

In [55]:
# Multi-head self-attention
mha = MultiHeadSelfAttention(d_model=d_model, num_heads=2)
mha_out, attn_weights_per_head = mha(embedded_tokens)

print("MHA Output shape:", mha_out.shape)  # (1, seq_len, 4)
print("MHA Output:", mha_out)
print("Attention Weights per head:", attn_weights_per_head)

MHA Output shape: torch.Size([1, 8, 4])
MHA Output: tensor([[[-0.1172,  0.0805, -0.3105,  0.2153],
         [-0.1017,  0.0579, -0.3384,  0.1675],
         [-0.1759,  0.1428, -0.3050,  0.3935],
         [-0.1817,  0.1242, -0.3209,  0.4163],
         [-0.0974,  0.0706, -0.2787,  0.1747],
         [-0.1218,  0.0870, -0.2922,  0.2572],
         [-0.1558,  0.1144, -0.3671,  0.3140],
         [-0.0999,  0.0696, -0.2889,  0.1924]]], grad_fn=<UnsafeViewBackward0>)
Attention Weights per head: tensor([[[[0.1174, 0.1157, 0.1151, 0.1445, 0.1276, 0.1423, 0.1021, 0.1353],
          [0.1049, 0.1176, 0.0787, 0.1652, 0.1316, 0.1727, 0.0659, 0.1635],
          [0.1124, 0.0697, 0.2446, 0.1225, 0.1006, 0.0949, 0.1778, 0.0776],
          [0.1093, 0.0617, 0.2912, 0.0955, 0.0866, 0.0709, 0.2260, 0.0587],
          [0.1327, 0.1206, 0.1679, 0.0930, 0.1109, 0.0895, 0.1923, 0.0932],
          [0.1206, 0.0836, 0.2523, 0.0667, 0.0839, 0.0561, 0.2813, 0.0554],
          [0.0907, 0.0821, 0.0897, 0.2099, 0.1253, 0.19

You’ll see some random values. In a __trained__ scenario, each head would learn distinct patterns of attention (e.g., focusing on “shoes” ↔ “small”), but here it’s random.

__Key Takeaways__

- __Split dimension__: We use `Q.view(...)` to split `d_model=4` into 2 heads of dimension 2 each.
- __Parallel__: Both heads compute attention in parallel, then we concatenate.
- __Final projection__: We ensure the output shape is `(batch_size, seq_len, d_model)` again.
- __Untrained__: The random initialization means it won’t yield meaningful attention patterns—but it shows you the mechanics.
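
PyTorch 2.x also ships a fused `torch.nn.functional.scaled_dot_product_attention`. Under the assumption that no mask or dropout is used, it should match the manual `softmax(Q K^T / sqrt(head_dim)) V` computation from step 3 of the class. The random per-head tensors below are stand-ins for the reshaped `Q`, `K`, `V`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

bsz, num_heads, seq_len, head_dim = 1, 2, 8, 2
# Random stand-ins for Q, K, V after the head split
Q = torch.randn(bsz, num_heads, seq_len, head_dim)
K = torch.randn(bsz, num_heads, seq_len, head_dim)
V = torch.randn(bsz, num_heads, seq_len, head_dim)

# Manual version, as in the class above
scores = Q @ K.transpose(-1, -2) / head_dim ** 0.5
manual = F.softmax(scores, dim=-1) @ V

# Built-in version; by default it scales by 1 / sqrt(head_dim)
fused = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(manual, fused, atol=1e-5))
```

The built-in function returns only the context tensor, not the attention weights, so the manual version remains useful when you want to inspect the weights as this notebook does.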

# Add & Norm

__1) Residuals__
- We add the original input embeddings (`embedded_tokens`) back to the MHA output (`mha_out`).
- This lets the model retain the original information even when the MHA transform contributes little for a given token.
- It also improves gradient flow in deep Transformers.

__2) Layer Normalization__
- Normalizes each token embedding to approximately zero mean and unit variance across its 4 features.
- This stabilizes training and balances the scale of embeddings for subsequent layers.
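
Layer normalization can be written out by hand: subtract the per-token mean and divide by the per-token standard deviation (computed from the biased variance plus a small `eps`). The sketch below compares that against `nn.LayerNorm` at its default initialization, where the learnable scale is 1 and the shift is 0; the input is a random stand-in.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(1, 8, 4)                # random stand-in for the residual output
layer_norm = nn.LayerNorm(4)            # default eps = 1e-5, weight = 1, bias = 0

mean = x.mean(dim=-1, keepdim=True)                 # per-token mean
var = x.var(dim=-1, unbiased=False, keepdim=True)   # per-token biased variance
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

out = layer_norm(x)
print(torch.allclose(out, manual, atol=1e-5))
print(out.mean(dim=-1))                   # close to 0 for every token
print(out.var(dim=-1, unbiased=False))    # close to 1 for every token
```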

In [60]:
# ================================
# ADD & NORM LAYER
# ================================

# 1) Residual
residual_1 = embedded_tokens + mha_out

# 2) Layer Normalization
layer_norm_1 = nn.LayerNorm(normalized_shape=d_model)
normed_1 = layer_norm_1(residual_1)

print("\nAfter Add & Norm:")
print("Residual shape:", residual_1.shape)      # [1, seq_len, 4]
print("Normalized output shape:", normed_1.shape)  # [1, seq_len, 4]
print("Normalized output:", normed_1)  # [1, seq_len, 4]


After Add & Norm:
Residual shape: torch.Size([1, 8, 4])
Normalized output shape: torch.Size([1, 8, 4])
Normalized output: tensor([[[ 1.5543,  0.2013, -0.8427, -0.9129],
         [ 0.5031,  1.3901, -1.0524, -0.8408],
         [ 0.7244, -0.4937,  1.1506, -1.3812],
         [ 0.6876,  0.6453,  0.3876, -1.7206],
         [ 0.7042, -1.2470, -0.6777,  1.2205],
         [ 0.4706, -1.6513,  0.1694,  1.0113],
         [ 0.8681,  0.4527,  0.3810, -1.7018],
         [-0.6900, -1.1166,  0.3356,  1.4711]]],
       grad_fn=<NativeLayerNormBackward0>)


# Feed Forward Network

- Processes each token independently with two linear layers + ReLU.
- Goes from `d_model=4` → `hidden_dim=8` → back to 4.
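
"Independently" means the FFN never mixes information across positions: running it on one token in isolation gives the same result as that token's row in the full-sequence output. A minimal sketch, with `nn.Sequential` standing in for the feed-forward module of this section and random inputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the feed-forward block: 4 -> 8 -> 4 with ReLU in between
ffn = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
x = torch.randn(1, 8, 4)           # random stand-in for the normalized MHA output

full = ffn(x)                      # all 8 tokens at once, shape [1, 8, 4]
single = ffn(x[:, 3:4, :])         # only the token at index 3, shape [1, 1, 4]

print(torch.allclose(full[:, 3:4, :], single, atol=1e-6))

# Changing a different token leaves the output at index 3 untouched
x_changed = x.clone()
x_changed[0, 0] += 10.0
print(torch.allclose(ffn(x_changed)[0, 3], full[0, 3], atol=1e-6))
```

Mixing across positions is the job of the attention sublayer; the FFN only transforms each token's features.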

In [61]:
class FeedForward(nn.Module):
    def __init__(self, d_model=4, hidden_dim=8):
        super().__init__()
        self.linear1 = nn.Linear(d_model, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, d_model)

    def forward(self, x):
        # x shape: (batch_size, seq_len, d_model)
        out = self.linear1(x)   # expand
        out = self.relu(out)    # non-linear
        out = self.linear2(out) # back to d_model
        return out

In [63]:
ffn = FeedForward(d_model=d_model, hidden_dim=8)
ffn_out = ffn(normed_1) # (1, seq_len, 4)
print(ffn_out)

tensor([[[ 0.1996, -0.7154, -0.0501, -0.1783],
         [ 0.1308, -0.4855,  0.0238, -0.0576],
         [ 0.2613, -0.3266,  0.0152, -0.0999],
         [ 0.3176, -0.4831, -0.0302, -0.1423],
         [-0.0124, -0.4780,  0.0971, -0.3309],
         [ 0.0464, -0.3165,  0.1180, -0.3305],
         [ 0.3129, -0.4818, -0.0272, -0.1422],
         [ 0.0503, -0.1795,  0.1856, -0.3700]]], grad_fn=<ViewBackward0>)



In a full Transformer encoder, you’d stack multiple such blocks, with positional information added to the embeddings once at the start. This walkthrough covers the core steps of a single block: the FFN after MHA, with a residual connection and layer normalization applied after each sublayer.
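
As a rough sketch of what stacking looks like, the block below packages the steps from this notebook (multi-head attention, add & norm, feed-forward, add & norm) into one module and chains two of them. It swaps in PyTorch's built-in `nn.MultiheadAttention` for the hand-written class; the names `EncoderBlock` and `hidden_dim` and the random input are illustrative assumptions. Note that `nn.MultiheadAttention` includes biases in its projections by default, unlike the bias-free layers above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Post-norm encoder block: MHA -> add & norm -> FFN -> add & norm."""
    def __init__(self, d_model=4, num_heads=2, hidden_dim=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)     # self-attention: query = key = value = x
        x = self.norm1(x + attn_out)        # first add & norm
        x = self.norm2(x + self.ffn(x))     # second add & norm
        return x

torch.manual_seed(0)
encoder = nn.Sequential(EncoderBlock(), EncoderBlock())  # two stacked blocks
x = torch.randn(1, 8, 4)                                 # stand-in for embeddings + positions
out = encoder(x)
print(out.shape)
```

Because every block maps `(batch_size, seq_len, d_model)` to the same shape, any number of blocks can be chained this way.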

# Add & Norm (again)

In [64]:
# 1) Residual Connection
# Add the FFN output back to the already normalized output of the MHA block
residual_2 = normed_1 + ffn_out

# 2) Layer Normalization
# Typically the same dimension as 'd_model'
d_model = 4  
layer_norm_2 = nn.LayerNorm(normalized_shape=d_model)
encoder_out = layer_norm_2(residual_2)

print("-- After Feed-Forward --")
print("FFN output shape:", ffn_out.shape)           # e.g. [1, seq_len, 4]
print("Final encoder output shape:", encoder_out.shape)  # [1, seq_len, 4]
print("Final encoder output:", encoder_out)  # [1, seq_len, 4]

-- After Feed-Forward --
FFN output shape: torch.Size([1, 8, 4])
Final encoder output shape: torch.Size([1, 8, 4])
Final encoder output: tensor([[[ 1.7031, -0.2881, -0.6204, -0.7946],
         [ 0.8375,  1.1476, -1.0672, -0.9180],
         [ 0.8981, -0.6871,  1.0562, -1.2672],
         [ 1.0155,  0.2299,  0.4118, -1.6572],
         [ 0.8274, -1.4635, -0.3788,  1.0149],
         [ 0.5928, -1.7174,  0.3793,  0.7452],
         [ 1.1443,  0.0501,  0.3964, -1.5908],
         [-0.5959, -1.2929,  0.6365,  1.2523]]],
       grad_fn=<NativeLayerNormBackward0>)


1. __Residual__

This ensures the original, already-normalized MHA information (`normed_1`) is kept, in case the FFN output is unhelpful or partially helpful for certain tokens.

2. __Layer Norm__
- Normalizes along the embedding dimension (e.g., 4 in our toy example).
- Ensures per-token embeddings have zero mean and unit variance, stabilizing training and balancing contributions from the residual and the FFN.