# Chapter 4 - Interactive

### This notebook contains code blocks, images, and GIFs to build your understanding and intuition of the topics listed below:

- #### Transformers
- #### GPTModel Outhead

## Transformer

#### The image below is from a website with a 3D visualization of the GPT architecture: https://bbycroft.net/llm

<div style="max-width:800px">
    
![](images/interactive_5.png)

</div>

#### The images below make up the transformer block in the GPT architecture.

<div style="max-width:800px">
    
![](images/interactive_4.png)

</div>

<div style="max-width:800px">
    
![](images/interactive_3.png)

</div>

#### Let's explain each part:
1. #### LayerNorm 1 - This is the first operation applied to the input from the previous layer (or the embedding layer for the first block). It standardizes each token’s embedding so that the values across features have zero mean and unit variance.
2. #### Q, K, V Projections - We take the normalized token embeddings and run them through three different linear layers to create: Q (Query), K (Key), V (Value).
3. #### Attention Scores (Scaled Dot Product) - Compute the dot product between the Query and Key matrices, scale it by the square root of the key dimension, then apply softmax so that each row becomes a probability distribution.
4. #### Apply Attention Weights to V - The attention weights are multiplied by the V vectors to get a new, context-aware representation for each token.
5. #### Concatenate Heads - The outputs of all heads are concatenated and passed through another linear layer (the “output projection”).
6. #### Residual Connections - Adds the original input (pre-attention) back to the attention output. This preserves the original signal and improves gradient flow, which helps deep networks train stably.
7. #### LayerNorm 2 - Same as before: normalize the updated representation token-wise across features. This prepares the token embeddings for the next transformation (MLP) in a more numerically stable way.
8. #### MLP FeedForward Layer - The first linear layer expands the embedding dimension (e.g., 768 → 3072). An activation function (usually GELU or ReLU) introduces non-linearity. The second linear layer contracts it back to the original dimension.
9. #### Final Residual Connection - Adds the input of the second LayerNorm (the post-attention residual output) back to the output of the MLP.

#### Now let's implement our own mini-version to enhance your understanding:

#### Step 1: Setup and Fake Input Embeddings 

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

# Toy settings
seq_len = 3
emb_dim = 6

# Simulate embeddings for a 3-token sequence
x = torch.randn(seq_len, emb_dim)
print("Input Embeddings:\n", x)

Input Embeddings:
 tensor([[ 1.9269,  1.4873, -0.4974,  0.4396, -0.7581,  1.0783],
        [ 0.8008,  1.6806,  0.3559, -0.6866,  0.6105,  1.3347],
        [-0.2316,  0.0418, -0.2516,  0.8599, -0.3097, -0.3957]])


#### We begin with a fake sequence of 3 token embeddings, each of size 6. This simulates the output from the previous transformer block or the embedding layer.



#### Step 2: First LayerNorm

In [9]:
# First LayerNorm
layer_norm1 = nn.LayerNorm(emb_dim)
x_norm = layer_norm1(x)
print("After First LayerNorm:\n", x_norm)

After First LayerNorm:
 tensor([[ 1.3309,  0.8856, -1.1243, -0.1754, -1.3883,  0.4715],
        [ 0.1565,  1.3214, -0.4327, -1.8131, -0.0956,  0.8635],
        [-0.4298,  0.2095, -0.4765,  2.1229, -0.6125, -0.8136]],
       grad_fn=<NativeLayerNormBackward0>)


#### We normalize each token’s features to have zero mean and unit variance. This stabilizes the inputs to the attention mechanism.
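#### To see what "zero mean and unit variance" means in practice, the sketch below recomputes the normalization by hand and compares it with `nn.LayerNorm`. It assumes the default `eps=1e-5` and a freshly created layer (scale initialized to 1, shift to 0); the names `x_manual` and `x_builtin` are ours, not part of the notebook above.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
x = torch.randn(3, 6)  # same toy shape as above: 3 tokens, 6 features

# Per-token statistics, computed across the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # LayerNorm uses the biased variance

eps = 1e-5  # assumed: the default eps of nn.LayerNorm
x_manual = (x - mean) / torch.sqrt(var + eps)

# A fresh LayerNorm has scale = 1 and shift = 0, so it should match the manual result
layer_norm = nn.LayerNorm(6)
x_builtin = layer_norm(x)

print("Matches nn.LayerNorm:", torch.allclose(x_manual, x_builtin, atol=1e-5))
print("Per-token mean:", x_manual.mean(dim=-1))
print("Per-token variance:", x_manual.var(dim=-1, unbiased=False))
```

#### Each row (token) is normalized independently, so the statistics never mix information between tokens.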



#### Step 3: Q, K, V Projections

In [10]:
# Linear projections for Q, K, V (no bias for clarity)
W_q = nn.Linear(emb_dim, emb_dim, bias=False)
W_k = nn.Linear(emb_dim, emb_dim, bias=False)
W_v = nn.Linear(emb_dim, emb_dim, bias=False)

Q = W_q(x_norm)
K = W_k(x_norm)
V = W_v(x_norm)

print("Query Vectors:\n", Q)
print("Key Vectors:\n", K)
print("Value Vectors:\n", V)

Query Vectors:
 tensor([[-0.2050, -0.6209,  0.6756, -0.4645, -0.0528, -0.0239],
        [-0.4284, -1.0410,  0.3973,  0.0177,  0.0273,  0.0124],
        [ 0.8572,  0.6222, -0.3372,  0.4344, -0.7646, -0.3994]],
       grad_fn=<MmBackward0>)
Key Vectors:
 tensor([[ 0.6382, -0.2814, -0.5932, -0.4497, -0.9684, -0.2530],
        [ 0.4410, -0.7229, -0.4119,  0.5000, -0.5736,  0.4171],
        [-0.5239,  1.1588, -0.3364, -0.5223,  0.0211, -0.2605]],
       grad_fn=<MmBackward0>)
Value Vectors:
 tensor([[-0.7430, -0.1025, -0.5163, -0.8837,  1.0014, -1.3047],
        [ 0.3022, -0.4374, -0.3117,  0.1834, -0.2640, -0.7827],
        [-0.5925, -0.0414, -0.0562, -0.6907,  0.8363, -0.0091]],
       grad_fn=<MmBackward0>)


#### We create linear transformations that project the input into query, key, and value spaces. These are used to compute attention.

#### Step 4: Compute Scaled Dot-Product Attention

In [11]:
# Compute attention scores and softmax
d_k = emb_dim
scores = Q @ K.T / d_k**0.5
attn_weights = F.softmax(scores, dim=-1)

print("Attention Scores:\n", scores)
print("Attention Weights (Softmaxed):\n", attn_weights)

Attention Scores:
 tensor([[-0.0371, -0.0538, -0.2415],
        [-0.1036,  0.1626, -0.4603],
        [ 0.4973,  0.2271,  0.1006]], grad_fn=<DivBackward0>)
Attention Weights (Softmaxed):
 tensor([[0.3573, 0.3514, 0.2913],
        [0.3328, 0.4343, 0.2329],
        [0.4105, 0.3133, 0.2761]], grad_fn=<SoftmaxBackward0>)


#### This computes the compatibility between each query and key (dot product), scales it, and converts it to a probability distribution using softmax.
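#### The toy example above lets every token attend to every other token. GPT-style models additionally apply a causal mask so that a token cannot attend to later positions. The sketch below shows one common way to do this, using random stand-in `Q` and `K` matrices rather than the values from the cell above; `causal_mask` and `causal_weights` are illustrative names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
seq_len, d_k = 3, 6
Q = torch.randn(seq_len, d_k)  # stand-in queries
K = torch.randn(seq_len, d_k)  # stand-in keys

scores = Q @ K.T / d_k**0.5

# True above the diagonal marks "future" positions that must be hidden
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# Setting masked scores to -inf makes softmax assign them zero weight
masked_scores = scores.masked_fill(causal_mask, float("-inf"))
causal_weights = F.softmax(masked_scores, dim=-1)

print("Causal Attention Weights:\n", causal_weights)
```

#### With the mask in place, the first token can only attend to itself, the second to the first two, and so on; each row still sums to 1.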

#### Step 5: Compute Attention Output

In [12]:
# Attention output is weighted sum of value vectors
attn_output = attn_weights @ V
print("Attention Output:\n", attn_output)

Attention Output:
 tensor([[-0.3319, -0.2024, -0.3104, -0.4525,  0.5086, -0.7439],
        [-0.2540, -0.2337, -0.3203, -0.3753,  0.4134, -0.7762],
        [-0.3739, -0.1906, -0.3252, -0.4960,  0.5593, -0.7834]],
       grad_fn=<MmBackward0>)


#### This combines value vectors according to attention weights — giving us contextualized token embeddings.

#### Step 6: Attention Output Projection + Residual

In [13]:
# Final linear projection back to emb_dim
proj = nn.Linear(emb_dim, emb_dim, bias=False)
projected = proj(attn_output)

# Add residual connection (original input + projected attention output)
x_resid_attn = x + projected
print("Post-Attention Residual Output:\n", x_resid_attn)

Post-Attention Residual Output:
 tensor([[ 1.7323,  1.6815, -0.4611,  0.5775, -0.7837,  0.8959],
        [ 0.6157,  1.9135,  0.4339, -0.5653,  0.5614,  1.1892],
        [-0.4475,  0.2324, -0.2285,  1.0092, -0.3245, -0.6006]],
       grad_fn=<AddBackward0>)


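#### We pass the attention output through the output projection (in a multi-head model this is where the concatenated heads are mixed; this toy example uses a single head) and add the residual connection to preserve the original input signal.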
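#### The walkthrough above uses a single attention head. As a rough sketch of the "Concatenate Heads" step, the code below splits the 6 features into 2 heads of 3 features each, runs attention per head, and concatenates the results. The values of `Q`, `K`, and `V` are random stand-ins, and `num_heads`, `head_dim`, and `split_heads` are names chosen for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
seq_len, emb_dim, num_heads = 3, 6, 2
head_dim = emb_dim // num_heads  # 3 features per head

# Stand-ins for the projected queries, keys, and values
Q = torch.randn(seq_len, emb_dim)
K = torch.randn(seq_len, emb_dim)
V = torch.randn(seq_len, emb_dim)

def split_heads(t):
    # (seq_len, emb_dim) -> (num_heads, seq_len, head_dim)
    return t.view(seq_len, num_heads, head_dim).transpose(0, 1)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Each head attends independently; the scale uses head_dim, not emb_dim
scores = Qh @ Kh.transpose(-2, -1) / head_dim**0.5  # (num_heads, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
head_out = weights @ Vh                              # (num_heads, seq_len, head_dim)

# Concatenate heads: (num_heads, seq_len, head_dim) -> (seq_len, emb_dim)
concat = head_out.transpose(0, 1).contiguous().view(seq_len, emb_dim)
print("Concatenated shape:", concat.shape)
```

#### The concatenated tensor has the same shape as a single-head output, which is why the output projection can stay a plain `emb_dim` to `emb_dim` linear layer.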



#### Step 7: Second LayerNorm

In [14]:
# Second LayerNorm before MLP
layer_norm2 = nn.LayerNorm(emb_dim)
x_norm2 = layer_norm2(x_resid_attn)
print("After Second LayerNorm:\n", x_norm2)


After Second LayerNorm:
 tensor([[ 1.1668,  1.1141, -1.1077, -0.0307, -1.4421,  0.2995],
        [-0.1004,  1.6212, -0.3416, -1.6670, -0.1725,  0.6604],
        [-0.7138,  0.5383, -0.3105,  1.9686, -0.4871, -0.9956]],
       grad_fn=<NativeLayerNormBackward0>)


#### Normalizes again before feeding into the feedforward MLP.

#### Step 8: MLP Feedforward (2-layer NN)

In [17]:
# MLP: Linear → GELU → Linear
mlp_hidden = 12  # Hidden size for demonstration
linear1 = nn.Linear(emb_dim, mlp_hidden)
linear2 = nn.Linear(mlp_hidden, emb_dim)

# Apply MLP
mlp_intermediate = linear1(x_norm2)
print("MLP Intermediate Projection Shape:", mlp_intermediate.shape)

mlp_activated = F.gelu(mlp_intermediate)
print("MLP Activation Shape:", mlp_activated.shape)

mlp_output = linear2(mlp_activated)
print("MLP Output Shape:", mlp_output.shape)
print("MLP Output:\n", mlp_output)


MLP Intermediate Projection Shape: torch.Size([3, 12])
MLP Activation Shape: torch.Size([3, 12])
MLP Output Shape: torch.Size([3, 6])
MLP Output:
 tensor([[ 0.1352,  0.5397, -0.1804, -0.3725, -0.2662,  0.3049],
        [-0.0028,  0.2650, -0.2442, -0.3500, -0.0057,  0.0221],
        [ 0.2010,  0.5542, -0.0727, -0.0812, -0.0789, -0.0450]],
       grad_fn=<AddmmBackward0>)


#### Applies a small feedforward neural network to each token independently. First expands, then contracts the representation, with a GELU non-linearity in the middle.
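#### For intuition about the GELU non-linearity, the sketch below evaluates its exact form, 0.5 * z * (1 + erf(z / sqrt(2))), by hand and compares it with `F.gelu` (whose default is the exact, non-approximate variant) and with ReLU. The input values are arbitrary sample points.

```python
import math
import torch
import torch.nn.functional as F

z = torch.linspace(-3.0, 3.0, steps=7)  # sample points: -3, -2, ..., 3

# Exact GELU: z * Phi(z), where Phi is the standard normal CDF
gelu_manual = 0.5 * z * (1.0 + torch.erf(z / math.sqrt(2.0)))
gelu_builtin = F.gelu(z)
relu = F.relu(z)

print("Inputs:       ", z)
print("GELU (manual):", gelu_manual)
print("GELU (torch): ", gelu_builtin)
print("ReLU:         ", relu)
```

#### Unlike ReLU, GELU lets small negative values through with a reduced magnitude instead of cutting them to exactly zero.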

#### Step 9: Final Residual Connection (Post-MLP)

In [18]:
# Final residual connection
final_output = x_resid_attn + mlp_output
print("Final Output of Transformer Block:\n", final_output)

Final Output of Transformer Block:
 tensor([[ 1.8675,  2.2212, -0.6416,  0.2050, -1.0499,  1.2008],
        [ 0.6129,  2.1785,  0.1896, -0.9153,  0.5557,  1.2112],
        [-0.2465,  0.7866, -0.3012,  0.9280, -0.4034, -0.6456]],
       grad_fn=<AddBackward0>)


#### Adds the result of the MLP back to the post-attention representation. This is the final output of the transformer block — ready to be passed to the next block or output head.
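#### To tie the nine steps together, here is a minimal sketch that wraps them in a single module. It keeps the same simplifications as the walkthrough (one head, no causal mask, no dropout); the class name `MiniTransformerBlock` is hypothetical, and the printed values are not expected to match the cells above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniTransformerBlock(nn.Module):
    """Single-head, unmasked block mirroring Steps 1-9 (illustrative only)."""

    def __init__(self, emb_dim, mlp_hidden):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.W_q = nn.Linear(emb_dim, emb_dim, bias=False)
        self.W_k = nn.Linear(emb_dim, emb_dim, bias=False)
        self.W_v = nn.Linear(emb_dim, emb_dim, bias=False)
        self.proj = nn.Linear(emb_dim, emb_dim, bias=False)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.linear1 = nn.Linear(emb_dim, mlp_hidden)
        self.linear2 = nn.Linear(mlp_hidden, emb_dim)

    def forward(self, x):
        # Steps 2-6: LayerNorm -> Q/K/V -> attention -> projection -> residual
        x_norm = self.ln1(x)
        Q, K, V = self.W_q(x_norm), self.W_k(x_norm), self.W_v(x_norm)
        scores = Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5
        attn_output = F.softmax(scores, dim=-1) @ V
        x = x + self.proj(attn_output)

        # Steps 7-9: LayerNorm -> MLP (Linear, GELU, Linear) -> residual
        mlp_output = self.linear2(F.gelu(self.linear1(self.ln2(x))))
        return x + mlp_output

torch.manual_seed(42)
x = torch.randn(3, 6)  # 3 tokens, 6 features
block = MiniTransformerBlock(emb_dim=6, mlp_hidden=12)
out = block(x)
print("Output shape:", out.shape)
```

#### Because the output has the same shape as the input, blocks like this can be stacked one after another.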

## GPTModel Outhead

#### The image below is from a website with a 3D visualization of the GPT architecture: https://bbycroft.net/llm

<div style="max-width:800px">
    
![](images/interactive_2.png)

</div>

#### The image below is the architecture of the block above containing LayerNorm -> Linear -> Softmax highlighted in blue.

<div style="max-width:800px">
    
![](images/interactive_1.png)

</div>

#### Let's explain each part:

1. #### LayerNorm - The output tensor from the last transformer block undergoes a final Layer Normalization. The LN Agg: <b>μ</b>, <b>σ</b> label shows that LayerNorm computes the mean (μ) and standard deviation (σ) across the features. It then normalizes each token embedding and applies <b>γ</b> and <b>β</b> (gamma and beta), the learnable scale and shift parameters.

2. #### Linear Layer (LM Head) - The normalized output is passed to a linear projection layer, referred to here as LM Head Weights. It transforms the final hidden state into logits, one for each token in the vocabulary.

3. #### Logits Calculation - Each token’s embedding now corresponds to a logit vector of shape [vocab_size]. These logits represent the unnormalized scores for each token being the next in the sequence.

4. #### SM Agg (Softmax Aggregation) - The logits are passed through a Softmax layer, converting the logits to a probability distribution over the vocabulary.

5. #### Logits Softmax Output - The output is the predicted probability distribution for the next token. In the visual, tokens like C, B, A, etc. are shown with their associated probabilities. The model seems to have assigned the highest probability to A, meaning it predicts that A is the most likely next token.

#### Now let's implement our own mini version to enhance your understanding.

#### Step 1: Setup and Imports

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set random seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x10cfeb650>

#### We start by importing PyTorch and setting a manual seed to ensure consistent results each time we run the notebook. This will help us get reproducible random embeddings and model weights.

#### Step 2: Define Fake Token Embedding (output of the last Transformer block)

In [3]:
# Simulate final token embedding of dimension 6
emb_dim = 6
final_embedding = torch.randn(emb_dim)
print("Final embedding:", final_embedding)

Final embedding: tensor([ 0.3367,  0.1288,  0.2345,  0.2303, -1.1229, -0.1863])


#### Here we simulate the output from the last transformer block for a single token. This is just a random vector of length 6 (our simulated embedding dimension), which represents the information learned about this token up to this point.

#### Step 3: Apply LayerNorm

In [4]:
# LayerNorm applied to a single token's embedding
layer_norm = nn.LayerNorm(emb_dim)
normalized_embedding = layer_norm(final_embedding)
print("After LayerNorm:", normalized_embedding)

After LayerNorm: tensor([ 0.7971,  0.3827,  0.5933,  0.5851, -2.1126, -0.2456],
       grad_fn=<NativeLayerNormBackward0>)


#### LayerNorm normalizes the embedding so that it has zero mean and unit variance (per token). It also learns a scale (γ) and shift (β) for each dimension during training. This helps stabilize training and improves convergence in transformer models.

#### Step 4: Project Embedding to Vocabulary Logits (LM Head)

In [5]:
# Define vocab size and linear projection
vocab_size = 10
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Get logits
logits = lm_head(normalized_embedding)
print("Logits:", logits)

Logits: tensor([-0.3350,  0.0321, -0.5233,  0.6060, -0.0765, -0.2038,  0.3173, -0.3729,
        -0.3371, -0.9188], grad_fn=<SqueezeBackward4>)


#### The normalized embedding is passed through a linear layer that projects it from embedding space to vocabulary space (size 10 in this toy example). The result is a vector of logits, which are raw scores, one for each vocabulary token.
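#### The example above projects a single token. As a small sketch of the full-sequence case, the code below applies the same LayerNorm and LM head to a made-up sequence of 3 hidden states, which yields one logit vector per position; the last row is the one used to predict the next token. The names `hidden`, `all_logits`, and `next_token_logits` are ours.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
seq_len, emb_dim, vocab_size = 3, 6, 10

# Stand-in for the output of the last transformer block
hidden = torch.randn(seq_len, emb_dim)

layer_norm = nn.LayerNorm(emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

normalized = layer_norm(hidden)
all_logits = lm_head(normalized)    # one row of logits per position
next_token_logits = all_logits[-1]  # logits for the token after the sequence

# Without a bias, nn.Linear is just a matrix multiply with the transposed weight
manual_logits = normalized @ lm_head.weight.T

print("All logits shape:", all_logits.shape)
print("Next-token logits shape:", next_token_logits.shape)
```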

#### Step 5: Convert Logits to Probabilities using Softmax

In [6]:
# Apply softmax to get token probabilities
probs = F.softmax(logits, dim=-1)
print("Probabilities:", probs)

Probabilities: tensor([0.0787, 0.1136, 0.0652, 0.2017, 0.1019, 0.0897, 0.1511, 0.0758, 0.0785,
        0.0439], grad_fn=<SoftmaxBackward0>)


#### The softmax function converts the raw logits into probabilities. Each value is now between 0 and 1, and the sum of all values is 1. This gives us a distribution over the vocabulary — i.e., how likely each token is to be the next token.
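#### As a sketch of what softmax does under the hood, the code below computes it by hand on a small made-up logit vector. Subtracting the maximum logit first is a common numerical-stability trick; it does not change the result because softmax is unaffected by adding a constant to every logit.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up example logits

# Shift by the max so the largest exponent is exp(0) = 1 (avoids overflow)
shifted = logits - logits.max()
exp = torch.exp(shifted)
probs_manual = exp / exp.sum()

probs_builtin = F.softmax(logits, dim=-1)

print("Manual softmax: ", probs_manual)
print("Built-in softmax:", probs_builtin)
print("Sum of probabilities:", probs_manual.sum().item())
```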

#### Step 6: Get Predicted Token

In [7]:
# Get token with highest probability
predicted_token = torch.argmax(probs).item()
print(f"Predicted token index: {predicted_token}")

Predicted token index: 3


#### We pick the token index with the highest probability — this is the model’s prediction for the next token in the sequence. In a real model, we would use this to generate text autoregressively.
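#### To illustrate what "autoregressively" means, here is a minimal greedy decoding loop. The "model" is a hypothetical stand-in (an embedding table, a LayerNorm, and an LM head with random, untrained weights and no transformer blocks), so the generated indices are meaningless; only the loop structure matters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
vocab_size, emb_dim = 10, 6

# Hypothetical stand-in for a trained model (untrained, no transformer blocks)
tok_emb = nn.Embedding(vocab_size, emb_dim)
layer_norm = nn.LayerNorm(emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

tokens = [0]  # arbitrary start token index
with torch.no_grad():
    for _ in range(5):
        hidden = tok_emb(torch.tensor(tokens))    # (current_len, emb_dim)
        logits = lm_head(layer_norm(hidden))[-1]  # logits at the last position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.argmax(probs).item()   # greedy choice
        tokens.append(next_token)                 # feed the prediction back in

print("Generated token indices:", tokens)
```

#### Replacing `torch.argmax(probs)` with `torch.multinomial(probs, num_samples=1)` would sample from the distribution instead of always taking the most likely token.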

