### Exercise 4.1 – Number of Parameters in Feedforward and Multi-Head Attention Modules

We compare the number of parameters in:
1. The **Feedforward module** (FFN)
2. The **Multi-Head Attention** module (MHA)

Assume:
- Embedding dimension `d_model = 512`
- Hidden dimension in FFN `d_ff = 2048`
- Number of heads `n_heads = 8`



In [32]:
import torch
print("✅ Torch is available!")


✅ Torch is available!


In [33]:
# Model dimensions
d_model = 512       # Embedding size
d_ff = 2048         # FFN inner-layer size
n_heads = 8         # Number of attention heads (does not affect the parameter count)

# Feedforward network (2 linear layers)
# First layer: d_model → d_ff, Second layer: d_ff → d_model
ffn_weight_1 = d_model * d_ff
ffn_bias_1 = d_ff
ffn_weight_2 = d_ff * d_model
ffn_bias_2 = d_model
ffn_total = ffn_weight_1 + ffn_bias_1 + ffn_weight_2 + ffn_bias_2

# Multi-Head Attention
# Q, K, V projections: each is d_model → d_model (3 total)
qkv_proj = 3 * (d_model * d_model + d_model)  # weights + biases

# Output projection: d_model → d_model
out_proj = (d_model * d_model) + d_model

mha_total = qkv_proj + out_proj

print(f"Feedforward parameters: {ffn_total:,}")
print(f"Multi-Head Attention parameters: {mha_total:,}")


Feedforward parameters: 2,099,712
Multi-Head Attention parameters: 1,050,624


The Feedforward module contains **2,099,712** parameters, while the Multi-Head Attention module contains **1,050,624** parameters.

This shows that the Feedforward module has almost exactly **twice as many parameters** as the attention module (a ratio of about 1.998), and it is often the most parameter-heavy component in a Transformer block.
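
The head count never enters the calculation: each head works on a slice of size `d_model / n_heads`, so the Q, K, and V projections together still map `d_model → d_model` no matter how many heads share them. As a rough sketch, the same formulas can be wrapped in two small functions to see how the ratio behaves at other widths. The names `ffn_params` and `mha_params` are illustrative and not part of the original code, and the loop assumes the common choice `d_ff = 4 * d_model` with a bias on every linear layer.

```python
def ffn_params(d_model, d_ff):
    """Two linear layers with biases: d_model -> d_ff -> d_model."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def mha_params(d_model):
    """Q, K, V and output projections, each d_model -> d_model with bias.

    The number of heads does not appear: heads split the d_model
    channels among themselves rather than adding new weights.
    """
    return 4 * (d_model * d_model + d_model)


# Assumption: d_ff = 4 * d_model, as in the exercise (2048 = 4 * 512)
for d_model in (512, 768, 1024, 1600):
    ffn = ffn_params(d_model, 4 * d_model)
    mha = mha_params(d_model)
    print(f"d_model={d_model:>4}: FFN={ffn:>12,}  MHA={mha:>12,}  ratio={ffn / mha:.4f}")
```

Under these assumptions the ratio stays just below 2 and approaches it as `d_model` grows, because the quadratic weight terms (8·d² versus 4·d²) dominate the linear bias terms.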


### Exercise 4.2 – Initializing Larger GPT Models

In this exercise, we initialize larger versions of GPT-2 using the `GPTConfig` and `GPT` classes.

We use the following configurations:
- **GPT-2 Medium**: 1,024-dim embeddings, 24 layers, 16 attention heads
- **GPT-2 Large**: 1,280-dim embeddings, 36 layers, 20 attention heads
- **GPT-2 XL**: 1,600-dim embeddings, 48 layers, 25 attention heads

We also calculate and print the total number of trainable parameters for each model.


In [34]:
from model import GPT, GPTConfig  # Update if your file is named differently

def count_params(model):
    """Return the total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Define GPT model configurations
configs = {
    "medium": GPTConfig(vocab_size=50257, block_size=1024, n_layer=24, n_head=16, n_embd=1024),
    "large": GPTConfig(vocab_size=50257, block_size=1024, n_layer=36, n_head=20, n_embd=1280),
    "xl": GPTConfig(vocab_size=50257, block_size=1024, n_layer=48, n_head=25, n_embd=1600),
}

# Initialize and print parameter count for each
for name, config in configs.items():
    model = GPT(config)
    total_params = count_params(model)
    print(f"GPT-2 {name.capitalize()} has {total_params:,} parameters")


GPT-2 Medium has 1 parameters
GPT-2 Large has 1 parameters
GPT-2 Xl has 1 parameters


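Each successive GPT-2 model is substantially larger than the previous one. The count of 1 printed above cannot be correct for these configurations; it most likely means the `GPT` class imported from `model` does not yet build the full network. For reference, the published sizes are:

- **GPT-2 Medium**: ~355M
- **GPT-2 Large**: ~774M
- **GPT-2 XL**: ~1.56B

These models share the same vocabulary and block size but differ in depth, width, and number of attention heads. Parameter count grows roughly linearly with the number of layers and quadratically with the embedding dimension (about 12 · `n_embd`² per layer), while the number of heads leaves it unchanged.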
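
As a cross-check that does not depend on any model class, the totals can be estimated analytically. The sketch below is an approximation under stated assumptions rather than the original implementation: it assumes the standard GPT-2 layout (learned positional embeddings, biases on all linear layers, a feedforward width of `4 * n_embd`, two LayerNorms per block plus a final one, and an output head that shares weights with the token embedding). The function name `gpt2_param_count` and the `tie_weights` flag are introduced here for illustration.

```python
def gpt2_param_count(vocab_size, block_size, n_layer, n_embd, tie_weights=True):
    """Estimate the parameter count of a GPT-2 style model (assumed layout)."""
    d = n_embd
    tok_emb = vocab_size * d                      # token embedding table
    pos_emb = block_size * d                      # learned positional embeddings
    attn = 4 * (d * d + d)                        # Q, K, V, output projections + biases
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)   # d -> 4d -> d, with biases
    norms = 2 * (2 * d)                           # two LayerNorms (scale + shift) per block
    per_layer = attn + ffn + norms
    final_norm = 2 * d
    # With weight tying the output head reuses the token embedding matrix
    head = 0 if tie_weights else vocab_size * d
    return tok_emb + pos_emb + n_layer * per_layer + final_norm + head


sizes = {"medium": (24, 1024), "large": (36, 1280), "xl": (48, 1600)}
for name, (n_layer, n_embd) in sizes.items():
    total = gpt2_param_count(50257, 1024, n_layer, n_embd)
    print(f"GPT-2 {name}: {total:,} parameters")
```

If a model built from the same configuration reports a very different number, the usual suspects are missing transformer blocks or an untied output head, which adds another `vocab_size * n_embd` parameters.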


In [35]:
class GPTConfig:
    def __init__(self, vocab_size, block_size, n_layer, n_head, n_embd,
                 embd_pdrop=0.1, resid_pdrop=0.1, attn_pdrop=0.1):
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd
        self.embd_pdrop = embd_pdrop    # embedding layer dropout
        self.resid_pdrop = resid_pdrop  # residual/shortcut connection dropout
        self.attn_pdrop = attn_pdrop    # attention module dropout


In [36]:
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.emb_drop = nn.Dropout(config.embd_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)


In [37]:
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
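        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        # Learned positional embeddings, one vector per position up to block_size
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.emb_drop = nn.Dropout(config.embd_pdrop)
        # Transformer blocks, final LayerNorm, and output head are omitted here

    def forward(self, idx):
        # Token embedding + positional embedding, followed by embedding dropout
        x = self.tok_emb(idx) + self.pos_emb[:, :idx.size(1), :]
        x = self.emb_drop(x)
        return x  # Embedding output only; a full model would continue through the blocks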



In [38]:
import torch
import torch.nn as nn

class SomeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn_drop = nn.Dropout(config.attn_pdrop)  # dropout rate read from the attention-specific setting



In [39]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal config holding only the fields CausalSelfAttention needs
class GPTConfig:
    def __init__(self, n_embd=384, attn_pdrop=0.1, resid_pdrop=0.1):
        self.n_embd = n_embd
        self.attn_pdrop = attn_pdrop
        self.resid_pdrop = resid_pdrop

class CausalSelfAttention(nn.Module):
    """Single-head causal self-attention with separate attention and residual dropout."""

    def __init__(self, config):
        super().__init__()
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)

        self.attn_drop = nn.Dropout(config.attn_pdrop)    # dropout on the attention weights
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        self.resid_drop = nn.Dropout(config.resid_pdrop)  # dropout on the projected output

    def forward(self, x):
        B, T, C = x.size()

        key = self.key(x)      # (B, T, C)
        query = self.query(x)  # (B, T, C)
        value = self.value(x)  # (B, T, C)

        # Scaled dot-product attention scores
        attn_scores = query @ key.transpose(-2, -1) / (C ** 0.5)  # (B, T, T)

        # Mask future positions; the (T, T) mask broadcasts over the batch dimension
        mask = torch.tril(torch.ones(T, T, device=x.device)).bool()
        attn_scores = attn_scores.masked_fill(~mask, float('-inf'))

        attn = F.softmax(attn_scores, dim=-1)  # (B, T, T)
        attn = self.attn_drop(attn)            # drop attention weights, not the weighted values

        out = self.proj(attn @ value)          # (B, T, C)
        out = self.resid_drop(out)
        return out



### Exercise 4.3 – Using Separate Dropout Parameters

In the original code, a single `drop_rate` was used throughout the GPT architecture. We refactored the code to use **three distinct dropout parameters**:

- `embd_pdrop` for the embedding layer
- `attn_pdrop` for the attention module
- `resid_pdrop` for the residual (shortcut) connections

This change improves flexibility and allows for fine-grained control over regularization in different parts of the model.
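
One way existing call sites could migrate is sketched below. The helper `split_drop_rate` is hypothetical and not part of the original code: it expands a single legacy `drop_rate` into the three keyword arguments and lets any of them be overridden individually. It assumes each rate must lie in [0, 1], which is the range `torch.nn.Dropout` accepts.

```python
def split_drop_rate(drop_rate, embd_pdrop=None, attn_pdrop=None, resid_pdrop=None):
    """Expand one legacy drop_rate into three per-component rates.

    Hypothetical helper for illustration; any rate passed explicitly
    overrides the shared default.
    """
    rates = {
        "embd_pdrop": drop_rate if embd_pdrop is None else embd_pdrop,
        "attn_pdrop": drop_rate if attn_pdrop is None else attn_pdrop,
        "resid_pdrop": drop_rate if resid_pdrop is None else resid_pdrop,
    }
    for name, p in rates.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {p}")
    return rates


# Keep the old rate everywhere except attention, where dropout is disabled
rates = split_drop_rate(0.1, attn_pdrop=0.0)
print(rates)
```

The resulting dictionary can be unpacked straight into the config, for example `GPTConfig(vocab_size=50257, block_size=1024, n_layer=24, n_head=16, n_embd=1024, **rates)`.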


In [41]:
class GPTConfig:
    def __init__(self,
                 vocab_size,
                 block_size,
                 n_layer,
                 n_head,
                 n_embd,
                 embd_pdrop=0.1,
                 resid_pdrop=0.1,
                 attn_pdrop=0.1):
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd
        self.embd_pdrop = embd_pdrop
        self.resid_pdrop = resid_pdrop
        self.attn_pdrop = attn_pdrop

