# Introduction


## Building GPT from Scratch

This Colab notebook is mostly a compilation of personal notes created while working through Andrej Karpathy's YouTube tutorial titled ["Let's build GPT: from scratch, in code, spelled out"](https://www.youtube.com/watch?v=kCc8FmEb1nY) for the third time.

- Contains large Notes section with much interesting Pytorch and GPT related information.

# Imports

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [52]:
torch.manual_seed(1337)

<torch._C.Generator at 0x7c1f6f105cd0>

In [53]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Download dataset

In [54]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-04-16 14:28:49--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-04-16 14:28:49 (18.8 MB/s) - ‘input.txt’ saved [1115394/1115394]



# Vocabulary

In [55]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [56]:
len(text)

1115394

In [57]:
text[:100]

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'

In [58]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [59]:
chars = sorted(list(set(text)))

In [60]:
print(chars)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [61]:
vocab_size = len(chars)

In [62]:
print(vocab_size)

65


# Tokenizer

In [63]:
stoi = {c:i for i,c in enumerate(chars)}

In [64]:
print(stoi)

{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}


In [65]:
itos = {i:c for i,c in enumerate(chars)}

In [66]:
print(itos)

{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i', 48: 'j', 49: 'k', 50: 'l', 51: 'm', 52: 'n', 53: 'o', 54: 'p', 55: 'q', 56: 'r', 57: 's', 58: 't', 59: 'u', 60: 'v', 61: 'w', 62: 'x', 63: 'y', 64: 'z'}


In [67]:
encode = lambda s: [stoi[c] for c in s]

In [68]:
print(encode("hello world"))

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]


In [69]:
decode = lambda l: "".join([itos[i] for i in l])

In [70]:
print(decode(encode("hello world")))

hello world


# Train and test splits

In [71]:
data = torch.tensor(encode(text), dtype=torch.long)

In [72]:
data.shape

torch.Size([1115394])

In [73]:
n = int(len(data)*0.9)
train_data = data[:n]
val_data = data[n:]

In [74]:
print(train_data.shape)

torch.Size([1003854])


# Hyperparameters

In [75]:
block_size = 8 # context length
batch_size = 2
n_embed = 2

# Data Loader

In [76]:
def get_batch(split):
    data = train_data if split == "train" else val_data

    ix = torch.randint(len(data)- block_size , (batch_size,))

    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1: i+block_size+1] for i in ix])

    x, y = x.to(device), y.to(device)
    return x, y

In [77]:
get_batch("train")

(tensor([[24, 43, 58,  5, 57,  1, 46, 43],
         [44, 53, 56,  1, 58, 46, 39, 58]]),
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
         [53, 56,  1, 58, 46, 39, 58,  1]]))

# Model: Embedding Layer and Output Linear Transformation

In [77]:
class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # Embedding tokens
        tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)

        # Generating logits using the language model head
        logits = self.lm_head(tok_emb)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx


# Model: Generate Function

In [None]:

    def generate(self, idx, max_new_tokens, temperature=1.0):
        """
        Generate new tokens given an initial index tensor.

        Args:
            idx (torch.Tensor): Starting tensor of indices, shape (B, T)
            max_new_tokens (int): Number of new tokens to generate
            temperature (float): Controls the randomness of predictions by scaling logits

        Returns:
            torch.Tensor: Tensor containing the original and new generated indices
        """

        for _ in range(max_new_tokens):
            logits, _ = self(idx)  # Generate logits for the last token
            logits = logits[:, -1, :] / temperature  # Use the last token's logits, apply temperature
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample from the probability distribution
            idx = torch.cat((idx, idx_next), dim=1)  # Append the new index to the sequence

        return idx

In [78]:
class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # Embedding tokens
        tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)

        # Generating logits using the language model head
        logits = self.lm_head(tok_emb)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0):
        """
        Generate new tokens given an initial index tensor.

        Args:
            idx (torch.Tensor): Starting tensor of indices, shape (B, T)
            max_new_tokens (int): Number of new tokens to generate
            temperature (float): Controls the randomness of predictions by scaling logits

        Returns:
            torch.Tensor: Tensor containing the original and new generated indices
        """

        for _ in range(max_new_tokens):
            logits, _ = self(idx)  # Generate logits for the last token
            logits = logits[:, -1, :] / temperature  # Use the last token's logits, apply temperature
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample from the probability distribution
            idx = torch.cat((idx, idx_next), dim=1)  # Append the new index to the sequence

        return idx


model = GPT(vocab_size)
model = model.to(device)

start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=10)[0].tolist()
print(decode(generated_indices))



,p-dazPolX


# Evaluation loop

In [46]:
eval_iters = 200

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "eval"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


# Training loop

In [None]:
model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

In [33]:
block_size = 8 # context length
batch_size = 2
n_embed = 2

eval_iters = 20

eval_interval = 500
learning_rate = 1e-2
max_iters = 3000


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # Embedding tokens
        tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)

        # Generating logits using the language model head
        logits = self.lm_head(tok_emb)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0):
        """
        Generate new tokens given an initial index tensor.

        Args:
            idx (torch.Tensor): Starting tensor of indices, shape (B, T)
            max_new_tokens (int): Number of new tokens to generate
            temperature (float): Controls the randomness of predictions by scaling logits

        Returns:
            torch.Tensor: Tensor containing the original and new generated indices
        """

        for _ in range(max_new_tokens):
            logits, _ = self(idx)  # Generate logits for the last token
            logits = logits[:, -1, :] / temperature  # Use the last token's logits, apply temperature
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample from the probability distribution
            idx = torch.cat((idx, idx_next), dim=1)  # Append the new index to the sequence

        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))

step 0: training loss 4.468097686767578, val loss 4.53944206237793 
step 500: training loss 3.06418776512146, val loss 3.1177852153778076 
step 1000: training loss 2.915592670440674, val loss 2.8232614994049072 
step 1500: training loss 2.8693275451660156, val loss 2.8686513900756836 
step 2000: training loss 2.7696008682250977, val loss 2.90069580078125 
step 2500: training loss 2.908082962036133, val loss 2.775252103805542 

:Oty thedo


In [34]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


Mteonweet uueon v yhauinreiSmtaf m fakong goLUSAtea chitheon?

Mb ?W tr.
Y be tel  Bshhiveanf hinhhon o ! bt,
LN reen Ru wit dererre s abldibesbeathos rry k marudhobehreot  m sys:

RC .hede t g thovep


In [38]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 2)
  (lm_head): Linear(in_features=2, out_features=65, bias=True)
)

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 2])
lm_head.weight 	 torch.Size([65, 2])
lm_head.bias 	 torch.Size([65])

Total Parameters: 325
Trainable Parameters: 325


# Model: Positional Embeddings

In [47]:
block_size = 8 # context length
batch_size = 2
n_embed = 2

eval_iters = 20

eval_interval = 500
learning_rate = 1e-2
max_iters = 3000


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss


    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.3499064445495605, val loss 4.3931756019592285 
step 500: training loss 3.116835355758667, val loss 3.150700807571411 
step 1000: training loss 2.948291063308716, val loss 2.844757556915283 
step 1500: training loss 2.816631317138672, val loss 2.887141466140747 
step 2000: training loss 2.8248682022094727, val loss 2.795835256576538 
step 2500: training loss 2.9078564643859863, val loss 2.781541347503662 


In [48]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))

hipe so i


In [50]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 2)
  (position_embedding_table): Embedding(8, 2)
  (lm_head): Linear(in_features=2, out_features=65, bias=True)
)

------------ 

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 2])
position_embedding_table.weight 	 torch.Size([8, 2])
lm_head.weight 	 torch.Size([65, 2])
lm_head.bias 	 torch.Size([65])

------------ 

Total Parameters: 341
Trainable Parameters: 341


# Mathematical tricks in self-attention

## Version 1: Inefficient attention using averages

In [55]:
torch.manual_seed(1337)

B, T, C = 4, 8, 2

x = torch.randn(B, T, C)

print(x.shape)
print(x[0])

torch.Size([4, 8, 2])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])


In [57]:
xbow = torch.zeros((B, T, C))
xbow[0]

tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]])

In [66]:
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        if b == 0:
            print(xprev)
        xbow[b][t] = xprev.sum(dim=0)

print("\n---------")
print(x[0])
print(xbow[0])

tensor([[ 0.1808, -0.0700]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398]])
tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9

In [62]:
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b][t] = xprev.mean(dim=0)

print(x[0])
print(xbow[0])

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


## Matrix multiplication - basic summing

In [75]:
torch.manual_seed(42)

a = torch.ones(3,3)
b = torch.randint(0,10, (3,2)).float()

c = a @ b

print('a=')
print(a)
print('b=')
print('----')
print(b)
print('c=')
print('----')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
b=
----
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c=
----
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


## `torch.tril` - **trick!!**

In [76]:
torch.manual_seed(42)

# sum
a = torch.tril(torch.ones(3,3))
b = torch.randint(0,10, (3,2)).float()

c = a @ b

print('a=')
print(a)
print('b=')
print('----')
print(b)
print('c=')
print('----')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
b=
----
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c=
----
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


In [96]:
torch.manual_seed(42)

# normalize the rows for INCREMENTAL AVERAGES
a = torch.tril(torch.ones(3,3))

print("\n---")
print(torch.sum(a, dim=1))
print(torch.sum(a, dim=1).shape)
print(torch.sum(a, dim=1, keepdim=True))
print(torch.sum(a, dim=1, keepdim=True).shape)
print("\n---")

a = a / torch.sum(a, dim=1, keepdim=True)
b = torch.randint(0,10, (3,2)).float()



c = a @ b

print('a=')
print(a)
print('b=')
print('----')
print(b)
print('c=')
print('----')
print(c)


---
tensor([1., 2., 3.])
torch.Size([3])
tensor([[1.],
        [2.],
        [3.]])
torch.Size([3, 1])

---
a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b=
----
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c=
----
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


## Version 2: Weighted aggregation using Batched Matrix Multiply - **trick!!!!**

In [120]:
torch.manual_seed(1337)

B, T, C = 4, 8, 2

x = torch.randn(B, T, C)

xbow = torch.zeros((B, T, C))

for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b][t] = xprev.mean(dim=0)


wei = torch.tril(torch.ones(T, T))
wei = wei / torch.sum(wei, 1, keepdim=True)

xbow2 = wei @ x

# Check with a more relaxed tolerance
close = torch.allclose(xbow, xbow2, atol=1e-6, rtol=1e-4)
print("Are the tensors close within a relaxed tolerance?", close)

Are the tensors close within a relaxed tolerance? True


In [121]:
# Calculate the maximum difference
max_diff = torch.max(torch.abs(xbow - xbow2))
print("Maximum difference between xbow and xbow2:", max_diff)


Maximum difference between xbow and xbow2: tensor(3.2363e-08)


## Version 3: Softmax - **trick!!!!**

In [127]:
tril = torch.tril(torch.ones(T,T))
print(tril)
print("\n----")
wei = torch.zeros(T,T)
print(wei)
print("\n----")
wei = wei.masked_fill(tril==0, float('-inf'))
print(wei)
print("\n----")
wei = F.softmax(wei, dim=1)
print(wei)
print("\n----")

xbow3 = wei @ x

close = torch.allclose(xbow, xbow3, atol=1e-6, rtol=1e-4)
print("Are the tensors close within a relaxed tolerance?", close)

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

----
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

----
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        

## Version 4: Self-Attention

In [24]:
torch.manual_seed(1337)

B, T, C = 4, 8, 32

x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros(T,T) # we don't uniform. we want data-dependant
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=1)
print(wei.shape)

out = wei @ x

out.shape

torch.Size([8, 8])


torch.Size([4, 8, 32])

In [29]:
torch.manual_seed(1337)

B, T, C = 4, 8, 32

x = torch.randn(B, T, C)

head_size = 16

query = nn.Linear(C, head_size, bias=False) # (32, 16)
key = nn.Linear(C, head_size, bias=False) # (32, 16)

q = query(x) # (4, T, 16)
k = key(x) # (4, T, 16)
wei = q @ k.transpose(-2,-1) # (B, T, T)
# old wei = torch.zeros(T, T)

tril = torch.tril(torch.ones(T,T))
# wei = torch.zeros(T, T)
wei = torch.masked_fill(wei, tril==0, float('-inf'))
wei = F.softmax(wei, dim=1)
out = wei @ x
print(wei.shape)
print(out.shape)

torch.Size([4, 8, 8])
torch.Size([4, 8, 32])


In [51]:
torch.manual_seed(1337)

B, T, C = 4, 8, 32

x = torch.randn(B, T, C)

head_size = 16

query = nn.Linear(C, head_size, bias=False) # (32, 16)
key = nn.Linear(C, head_size, bias=False) # (32, 16)
value = nn.Linear(C, head_size, bias=False)

q = query(x) # (B, T, 16)
k = key(x) # (B, T, 16)
v = value(x) # (B, T, 16)
wei = q @ k.transpose(-2,-1) * head_size**-0.5 # (B, T, T)
print(wei.var())

tril = torch.tril(torch.ones(T,T))
# wei = torch.zeros(T, T)
wei = torch.masked_fill(wei, tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v # v is the vector which we aggregate (for this single head)

# out shape: (B, T, head_size)
print(out.shape)
print(wei[0])

tensor(0.1201, grad_fn=<VarBackward0>)
torch.Size([4, 8, 16])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5221, 0.4779, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3602, 0.3210, 0.3188, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2980, 0.4039, 0.1578, 0.1404, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1643, 0.1243, 0.1678, 0.1865, 0.3570, 0.0000, 0.0000, 0.0000],
        [0.2656, 0.2110, 0.1137, 0.1214, 0.2018, 0.0865, 0.0000, 0.0000],
        [0.1761, 0.1327, 0.1371, 0.0974, 0.1476, 0.1918, 0.1173, 0.0000],
        [0.1046, 0.1260, 0.0922, 0.0906, 0.1476, 0.1588, 0.1432, 0.1371]],
       grad_fn=<SelectBackward0>)


# Model: Single Attention Head

In [98]:
block_size = 8 # context length
batch_size = 2
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000

head_size = 32


class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_head = Head(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_head(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.172791004180908, val loss 4.200776100158691 
step 500: training loss 3.2094807624816895, val loss 3.1048107147216797 
step 1000: training loss 2.867919445037842, val loss 2.8942389488220215 
step 1500: training loss 2.864621639251709, val loss 2.6551945209503174 
step 2000: training loss 2.5643837451934814, val loss 2.692009687423706 
step 2500: training loss 2.697690486907959, val loss 2.7760109901428223 


In [99]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))



CESSTk hitel mindendg,
Oe, win me
KWt wee.
LAlld t f rin.
QMMons best b ser yopince stele swy

Tersind ve;
B ll Vv:
IWousisind bsle bl
N; sheryo bovs iwerllerfon ord.
CHt fenovehat pfounapons.
OONTae


In [100]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (sa_head): Head(
    (query): Linear(in_features=32, out_features=32, bias=False)
    (key): Linear(in_features=32, out_features=32, bias=False)
    (value): Linear(in_features=32, out_features=32, bias=False)
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

------------ 

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_head.tril 	 torch.Size([8, 8])
sa_head.query.weight 	 torch.Size([32, 32])
sa_head.key.weight 	 torch.Size([32, 32])
sa_head.value.weight 	 torch.Size([32, 32])
lm_head.weight 	 torch.Size([65, 32])
lm_head.bias 	 torch.Size([65])

------------ 

Total Parameters: 7553
Trainable Parameters: 7553


# Model: Multi-head Attention

In [106]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)



class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_heads = MultiHeadAttention(4, n_embed // 4)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.2066192626953125, val loss 4.234907627105713 
step 500: training loss 2.9258944988250732, val loss 2.9589881896972656 
step 1000: training loss 2.581874370574951, val loss 2.6997199058532715 
step 1500: training loss 2.6127994060516357, val loss 2.601308822631836 
step 2000: training loss 2.514824867248535, val loss 2.552429676055908 
step 2500: training loss 2.52724027633667, val loss 2.5238077640533447 


In [111]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


MENAELK:!r Gond, ig' f oud tathee fus 'whaone th yon deemt.
Blroret, keu Rounf fe;

Shpar no, wis
Ca'dl'sd to athevilleno nume yowsnry yoo.
May ma&hep vere, ghr Hof wound ram, ito dred wow hag.
Waco, 


In [112]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

------------ 

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_heads.heads.0.tril 	 torch.Size([8, 8])
sa_heads.heads.0.query.weight 	 torch.Size([8, 32])
sa_heads.heads.0.key.weight 	 torch.Size([8, 32])
sa_heads.heads.0.value.weight 	 torch.Size([8, 32])
sa_heads.heads.1.tril 	 torch.Size([8, 8])
sa_heads.heads.1.query.weight 	 torch.Size([8, 32])
sa_heads.heads.1.key.weight 	 torch.Size([8, 32])
sa_heads.heads.1.value.weight 

# Model: Feed forward

In [115]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)

class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)




class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.sa_heads = MultiHeadAttention(4, n_embed // 4)
        self.ffwd = FeedForward(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.sa_heads(x)
        x = self.ffwd(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.222600936889648, val loss 4.2188401222229 
step 500: training loss 2.9812910556793213, val loss 2.913625955581665 
step 1000: training loss 2.76777720451355, val loss 2.5907235145568848 
step 1500: training loss 2.5140411853790283, val loss 2.4913179874420166 
step 2000: training loss 2.570660352706909, val loss 2.436124324798584 
step 2500: training loss 2.3858771324157715, val loss 2.5086424350738525 


In [119]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


Wat fhe e sarege paionst eaft eour doth P: lisas anding-eene Se
SESISINO:
I baml,
Meatof sape. Rurs nis
Youl, sou, inoth ou, creeng, dha waft!
 erot.

Yrouse Wes!d museds'd
Cere dronm heogor chelonp, 


In [120]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (sa_heads): MultiHeadAttention(
    (heads): ModuleList(
      (0-3): 4 x Head(
        (query): Linear(in_features=32, out_features=8, bias=False)
        (key): Linear(in_features=32, out_features=8, bias=False)
        (value): Linear(in_features=32, out_features=8, bias=False)
      )
    )
  )
  (ffwd): FeedForward(
    (net): Sequential(
      (0): Linear(in_features=32, out_features=32, bias=True)
      (1): ReLU()
    )
  )
  (lm_head): Linear(in_features=32, out_features=65, bias=True)
)

------------ 

Model's state_dict:
token_embedding_table.weight 	 torch.Size([65, 32])
position_embedding_table.weight 	 torch.Size([8, 32])
sa_heads.heads.0.tril 	 torch.Size([8, 8])
sa_heads.heads.0.query.weight 	 torch.Size([8, 32])
sa_heads.heads.0.key.weight 	 torch.Size([8, 32])
sa_heads.heads.0.value.weight 	 torch.Size([8, 32])
sa_heads.heads.1.tril 	 torch.Size([8, 

# Model: Transformer block

In [132]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.179017543792725, val loss 4.179043769836426 
step 500: training loss 3.1474251747131348, val loss 3.066397190093994 
step 1000: training loss 2.839433431625366, val loss 2.9424216747283936 
step 1500: training loss 2.7702505588531494, val loss 2.642995834350586 
step 2000: training loss 2.664306163787842, val loss 2.6620941162109375 
step 2500: training loss 2.6165897846221924, val loss 2.5415029525756836 


In [134]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


LLR:' me hic my nro ce Xed whe An
:
Sdyes tey wath hrate
Pest.

EINw eeond pcis, par sleilbige nghik cis oveord'ed tho yo'? foo ad w'e'tar I yasd isfe, buer thoneast tit
Cend, sour, int
Low; Ylou ooua


In [135]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=32, bias=True)
          (1): ReLU()
        )
      )
    )
    (1): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
    

# Model optimization help 1: Skip connections

In [141]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 5.1030402183532715, val loss 5.022217750549316 
step 500: training loss 2.778379201889038, val loss 2.8679325580596924 
step 1000: training loss 2.7662031650543213, val loss 2.660550355911255 
step 1500: training loss 2.5764899253845215, val loss 2.5275442600250244 
step 2000: training loss 2.4651472568511963, val loss 2.4526543617248535 
step 2500: training loss 2.3259408473968506, val loss 2.3860621452331543 


In [142]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


Tith be uldevithe
Ther yheteaiuwanems sow pis.

Gepeak dand to are bedreen alim, ancd veringr,
Gof des,
Mhip ol four he par sat'vy forwougor
Fit silf theith nower the wow onik thy firs lon;;.

Weres g


In [143]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=32, bias=True)
          (1): ReLU()
        )
      )
    )
    (1): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
    

# Model: Multihead Attention Projection Layer

In [148]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)

    def forward(self, x):
        head_outputs = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(head_outputs)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed)
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)

    def forward(self, x):
        x = x + self.sa(x)
        x = x + self.ffwd(x)
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.460845947265625, val loss 4.462778568267822 
step 500: training loss 2.599686622619629, val loss 2.7281298637390137 
step 1000: training loss 2.5071492195129395, val loss 2.47788667678833 
step 1500: training loss 2.5382513999938965, val loss 2.4370779991149902 
step 2000: training loss 2.3791794776916504, val loss 2.384490489959717 
step 2500: training loss 2.3141043186187744, val loss 2.386993408203125 


In [149]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


Fouf prened:
And so meelf. I tota I:
Thel My ghay whe, ot, mam, at geste utht that opotee urik wir ta, xn the wokss fabrtam?
Tao dirp,
Bo sore thare;
t.
Le- I Jinced to groatI howsine
fratesppastng s 


In [150]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): ReLU()
          (2): Linear(in_features=128, out_features=32, bias=True)
        )
      )
    )
    (1): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_feat

# Model optimization help 2: Layer Normalization

In [152]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)

    def forward(self, x):
        head_outputs = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(head_outputs)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed)
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(
            Block(n_embed, n_head=4),
            Block(n_embed, n_head=4),
            nn.LayerNorm(n_embed)
        )
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.372638702392578, val loss 4.395672798156738 
step 500: training loss 2.6571121215820312, val loss 2.645972728729248 
step 1000: training loss 2.5392327308654785, val loss 2.563103437423706 
step 1500: training loss 2.5982604026794434, val loss 2.486126661300659 
step 2000: training loss 2.4181649684906006, val loss 2.4356579780578613 
step 2500: training loss 2.403215169906616, val loss 2.3882241249084473 


In [153]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


NTo gou.

Hithe Xeswid.

This cull hbiedent thiserr:
Hot chast youll ly shilnd.

On be's ama?

thes thy the--bak a t tholl tjor, an to my,urtes,e nat pasketit, Vin en's wi'cge pruthhe ste mh Vit may v


In [154]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): ReLU()
          (2): Linear(in_features=128, out_features=32, bias=True)
        )
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (1): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(


# Model: Tidy Up

In [156]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000

n_head=4
n_layer=2



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)

    def forward(self, x):
        head_outputs = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(head_outputs)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed)
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.301537036895752, val loss 4.303574562072754 
step 500: training loss 2.633894681930542, val loss 2.664252281188965 
step 1000: training loss 2.502023458480835, val loss 2.4725911617279053 
step 1500: training loss 2.3814761638641357, val loss 2.540698528289795 
step 2000: training loss 2.3917155265808105, val loss 2.4259471893310547 
step 2500: training loss 2.5475194454193115, val loss 2.3882737159729004 


In [157]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


Hfamt haise to herooke fe, gurvealst let teasce,
Te ware chand cas; m?
-gorelfr:
And,
Illle;
Wefut dell beay ry Mou wat wafll laves,
Whankid yavehereucer cull haple t mick; otqres couj glo, hak had th


In [158]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
          )
        )
        (proj): Linear(in_features=32, out_features=32, bias=True)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): ReLU()
          (2): Linear(in_features=128, out_features=32, bias=True)
        )
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (1): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(


# Model: Dropout (prevent overfitting)

In [159]:
block_size = 8 # context length
batch_size = 4
n_embed = 32

eval_iters = 20

eval_interval = 500
learning_rate = 1e-3
max_iters = 3000

n_head = 4
n_layer = 2

dropout = 0.1



class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, T)
        wei = torch.masked_fill(wei, self.tril[:T, :T]==0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        out = wei @ v

        return out


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        head_outputs = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(head_outputs)
        return self.dropout(out)


class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),

        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size):
        """Initialize the GPT model components.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        """Forward pass for generating logits and optionally computing loss.

        Args:
            idx (torch.Tensor): Input tensor of token indices with shape (B, T).
            targets (torch.Tensor, optional): Target tensor of token indices with shape (B, T).
                Default is None, which skips loss computation.

        Returns:
            tuple: A tuple containing:
                - logits (torch.Tensor): Logits tensor of shape (B, T, vocab_size).
                - loss (torch.Tensor or None): Computed cross-entropy loss if targets are provided, else None.
        """
        # B, T = idx.shape
        # tok_emb = self.token_embedding_table(idx)  # Shape: (B, T, n_embed)
        # pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        # x = tok_emb + pos_emb
        B, T = idx.shape

        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        # Generating logits using the language model head
        logits = self.lm_head(x)  # Shape: (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # Reshape for cross-entropy loss computation
            B, T, C = logits.shape
            logits_flat = logits.view(B*T, C)  # Shape: (B*T, vocab_size)
            targets_flat = targets.view(B*T)  # Shape: (B*T)
            loss = F.cross_entropy(logits_flat, targets_flat)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.ones(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X,Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


model = GPT(vocab_size)
model = model.to(device)

optimizer = torch.optim.AdamW(params=model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    if(iter % eval_interval == 0):
        losses = estimate_loss();
        print(f"step {iter}: training loss {losses['train']}, val loss {losses['val']} ")

    xb, yb = get_batch('train')

    logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()



step 0: training loss 4.305357456207275, val loss 4.3150129318237305 
step 500: training loss 2.766637086868286, val loss 2.698511838912964 
step 1000: training loss 2.568530321121216, val loss 2.5664477348327637 
step 1500: training loss 2.4630465507507324, val loss 2.4896674156188965 
step 2000: training loss 2.4210150241851807, val loss 2.4258134365081787 
step 2500: training loss 2.412698268890381, val loss 2.338477373123169 


In [160]:
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)
generated_indices = model.generate(start_idx, max_new_tokens=200)[0].tolist()
print(decode(generated_indices))


oofs verars, shayd s be croanc, ho'e and he tliery aiscin, stheare tay sat, melt ablet, move shaot you she forkons bureendes. bis,
Aye
The fatese my fourse dedranld issgabeays yat wraak
Ywareng iche n


In [161]:
# Print the model structure
print("\nModel's Structure: ")
print(model)

print("\n------------ ")

# Print model's state_dict
print("\nModel's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\n------------ ")

# Calculate and print the number of parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal Parameters: {total_params}')
print(f'Trainable Parameters: {trainable_params}')


Model's Structure: 
GPT(
  (token_embedding_table): Embedding(65, 32)
  (position_embedding_table): Embedding(8, 32)
  (blocks): Sequential(
    (0): Block(
      (sa): MultiHeadAttention(
        (heads): ModuleList(
          (0-3): 4 x Head(
            (query): Linear(in_features=32, out_features=8, bias=False)
            (key): Linear(in_features=32, out_features=8, bias=False)
            (value): Linear(in_features=32, out_features=8, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (proj): Linear(in_features=32, out_features=32, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (ffwd): FeedForward(
        (net): Sequential(
          (0): Linear(in_features=32, out_features=128, bias=True)
          (1): Dropout(p=0.1, inplace=False)
          (2): ReLU()
          (3): Linear(in_features=128, out_features=32, bias=True)
        )
      )
      (ln1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
  

# GPT-2

In [171]:
!pip install --upgrade transformers




In [173]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_name = "gpt2"
m = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
input_text = "This was a dark and dangerous "
input_tokens = tokenizer.encode(input_text, return_tensors="pt")
output_tokens = m.generate(input_tokens, max_length=100, num_return_sequences=1, no_repeat_ngram_size=2)
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
print(output_text)


RuntimeError: Failed to import transformers.models.gpt2 because of the following error (look up to see its traceback):
cannot import name 'is_torch_xla_available' from 'transformers.utils' (/usr/local/lib/python3.10/dist-packages/transformers/utils/__init__.py)

# Notes

## Understanding Embedding Table and LM Head Dimensions in Language Models

In the realm of natural language processing (NLP), the architecture of language models plays a critical role in determining their efficiency and effectiveness. Two essential components of these models are the **embedding table** and the **language model head (lm_head)**. Understanding the dimensions of these components and their interactions is crucial for both designing and interpreting model behaviors.

### Embedding Table: Where Words Become Vectors

The **embedding table** serves as the bridge between discrete language symbols (e.g., words or characters) and their continuous vector representations. Think of it as a lookup table where each unique word in your vocabulary is associated with a dense vector. These vectors capture semantic properties such that similar words have similar vectors. The dimensions of this table are typically denoted as `(vocab_size, embedding_dim)`:

- `vocab_size` is the number of unique tokens in your model's vocabulary. It represents the "width" of the table.
- `embedding_dim` (or `n_embd` in some contexts) is the size of the vector representation for each token. It represents the "depth" of the table and is a key factor in the model's capacity to capture semantic nuances.

### LM Head: Projecting Embeddings to Predict Next Tokens

The **language model head (lm_head)** is essentially a projection layer situated at the end of the model. Its role is to transform the output embeddings from the model's last layer back into the vocabulary space, facilitating the prediction of the next token in a sequence. The dimensions of this component are `(embedding_dim, vocab_size)`, mirroring the inverse of the embedding table:

- `embedding_dim` here matches the `embedding_dim` of the embedding table, ensuring compatibility in the transformation process.
- `vocab_size` is the same as in the embedding table, representing the target space for predictions.

### The Envelope Structure: A Conceptual Visualization

You can think of the relationship between the embedding table and the lm_head as forming an **envelope structure** in the architecture of a language model. Initially, the embedding table "expands" the discrete input tokens into a higher-dimensional, continuous vector space (embedding_dim). This expansion allows the model to process and learn from the semantic intricacies of the language. After processing through the model's layers, the lm_head "contracts" these learned representations back into the original vocabulary space, making predictions about the next tokens.

This envelope structure is not just a physical manifestation but a conceptual framework that underscores the essence of transforming discrete language symbols into a form that a model can learn from and then translating those learnings back into the language domain.

Understanding these dimensions and their roles is pivotal for customizing models to specific tasks, optimizing performance, and innovating on the existing architectures. It highlights the delicate balance between model complexity, computational efficiency, and the capacity to capture and generate nuanced language patterns.

---

## Understanding Reshaping `(B, T, C)` to `(B*T, C)`

In the context of your model:

- **B** represents the batch size.
- **T** is the sequence length (number of time steps per batch).
- **C** is the number of classes (vocabulary size).

The output `logits` tensor of shape `(B, T, C)` holds the predictions of the next token at each position in each sequence for each sample in the batch. When you're using the `torch.nn.functional.cross_entropy` loss, the function expects inputs of shape `(N, C)` where:
- **N** is the number of samples, and
- **C** is the number of classes.

To fit this requirement, the logits tensor is reshaped from `(B, T, C)` to `(B*T, C)`, essentially treating each position in each sequence as an independent sample. This allows the loss function to compute the loss for each predicted token against its corresponding true token in `targets`, which is also reshaped to `(B*T)`.

Here’s a simple example to illustrate this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Parameters
batch_size = 2
seq_length = 3
vocab_size = 4

# Random sample data
logits = torch.randn(batch_size, seq_length, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_length))

# Before reshaping
print("Original logits shape:", logits.shape)  # (B, T, C)
print("Original targets shape:", targets.shape)  # (B, T)

# Reshape logits and targets to fit cross_entropy requirements
logits_flat = logits.view(batch_size * seq_length, vocab_size)
targets_flat = targets.view(-1)

# Compute loss
loss = F.cross_entropy(logits_flat, targets_flat)
print("Loss:", loss.item())
```

### Alternative: Keeping the Batch Dimension Intact

If you prefer to keep the batch dimension and process each timestep independently, you can use `torch.nn.CrossEntropyLoss`, which can handle inputs in `(N, C, d_1, d_2, ..., d_K)` format when given a target of `(N, d_1, d_2, ..., d_K)` format by setting the `reduction` argument to `mean` or `sum`.

Here's how you can apply it:

```python
# Using CrossEntropyLoss to keep batch dimension
loss_fn = nn.CrossEntropyLoss()  # By default, reduction='mean'

# Apply loss function without reshaping
loss = loss_fn(logits, targets)
print("Loss without reshaping:", loss.item())
```

Using `nn.CrossEntropyLoss` this way allows the loss computation to internally handle the sequence as separate dimensions, keeping your batch structure intact. This method is not only cleaner but also maintains the semantic grouping of data, which can be beneficial in understanding and debugging the model's behavior across different sequences within the batch.

## The difference between `torch.nn.functional.cross_entropy` and `torch.nn.CrossEntropyLoss`

The difference between `torch.nn.functional.cross_entropy` (usually imported as `F.cross_entropy`) and `torch.nn.CrossEntropyLoss` mainly revolves around their usage patterns in PyTorch models. Both perform the same fundamental operation—computing the cross-entropy loss between the input logits and the target classes—but they are used in slightly different contexts.

### `torch.nn.functional.cross_entropy`
- **Functional API:** `F.cross_entropy` is a stateless function. It computes the cross-entropy loss given logits and targets directly whenever it is called. It does not maintain or update any internal state.
- **Usage:** This function is useful in scenarios where you do not need to configure the behavior of the loss function beyond basic parameters provided at each call. It's particularly handy in scripts or simpler models where you want direct control over the computation, or when you're defining a custom training loop without using many of PyTorch's object-oriented features.

### `torch.nn.CrossEntropyLoss`
- **Class-based API:** `nn.CrossEntropyLoss` is a class that creates a loss function object. This object can be configured during instantiation with specific attributes like `weight`, `size_average`, `ignore_index`, `reduction`, etc., and then it can be used as a callable to compute the loss.
- **Stateful:** Since it's an object, it can hold state. This can include things like class weights, making it suitable for datasets with class imbalances.
- **Usage:** This class is typically used in more structured or complex models, especially where the same loss computation settings are repeatedly applied across different batches or epochs. It aligns well with object-oriented programming practices, making it ideal for integration into models built as classes.

### Choosing Between `F.cross_entropy` and `nn.CrossEntropyLoss`

1. **Custom Training Loops:**
   - If you are writing a quick, custom training loop and you don't need to repeatedly configure the loss function, `F.cross_entropy` might be more straightforward. It's a one-off call that you can make with different parameters each time without creating an object.

2. **Reusable Models and Configurable Loss:**
   - For models that will be used in multiple training scenarios, or where you want to configure and possibly share the same loss configuration across different parts of the model or different models, `nn.CrossEntropyLoss` is more suitable. You set up the loss once and use the object throughout.

3. **Handling Class Weights and Other Parameters:**
   - If your training involves dealing with imbalanced data where you might want to specify weights for different classes to adjust the loss computation, using `nn.CrossEntropyLoss` becomes advantageous. It allows you to specify weights at instantiation and maintain consistent application of these settings.

4. **Code Clarity and Maintainability:**
   - Using `nn.CrossEntropyLoss` can make code cleaner and easier to maintain, especially in large projects where multiple loss computations might lead to clutter with the functional approach. The class-based approach encapsulates the functionality within an object, making the code more modular and easier to debug.

In summary, the choice between using the functional API or the class-based approach often depends on the complexity of your project, the need for reusability and configurability, and personal or project-specific coding standards. Both achieve the same end result but cater to different development environments and preferences.

## How to Visualize the Embedding Matrix

The `nn.Embedding` layer in PyTorch, particularly when thinking about its weight matrix, is a critical component to understand for visualizing and interpreting how tokens are represented in a model. In your example, `nn.Embedding(vocab_size, n_embed)` creates an embedding matrix of shape `(vocab_size, n_embed)`, which translates to a matrix with 65 rows and 10 columns.

### How to Visualize the Embedding Matrix

1. **Matrix Visualization**:
    - You should visualize this as a matrix with 65 rows and 10 columns. Each row corresponds to a unique token in the vocabulary.
    - The `vocab_size` (65 in your case) represents the number of unique tokens that can be embedded. Each row is a unique "embedding vector" or a representation of that token in a 10-dimensional space.
    - The `n_embed` (10) represents the number of features or dimensions each token is represented with. These are the "coordinates" of each token in the embedding space.

2. **Array of Arrays Visualization**:
    - Alternatively, you can think of the embedding matrix as an array of 65 indices, where each index contains an array of 10 items. This view is akin to seeing it as a list of vectors.
    - Each vector (or array of 10 items) represents the transformed representation of a corresponding token in a 10-dimensional space.

### Correct Mental Model
- The first method (Matrix Visualization) is typically more aligned with traditional linear algebra concepts, where each row of a matrix represents a vector in higher-dimensional space. This visualization helps in understanding operations like matrix multiplication that occur during the embedding lookup.
- The second method (Array of Arrays Visualization) might be more intuitive if you are thinking in terms of programming structures, particularly if you come from a background where data structures are pivotal.

### Practical Example in PyTorch
Here's how you might practically interact with such a matrix in a coding context:

```python
import torch
import torch.nn as nn

# Parameters
vocab_size = 65
n_embed = 10

# Embedding layer
embedding = nn.Embedding(vocab_size, n_embed)

# Visualize the weight matrix
print("Shape of embedding weight matrix:", embedding.weight.shape)
# Output should be torch.Size([65, 10])

# To get the embedding vector for the first token
first_token_vector = embedding.weight[0, :]
print("Embedding vector for the first token:", first_token_vector)
```

### Conclusion
When working with embeddings in models like GPT, visualize the embedding matrix as a table where each row corresponds to a token and each column a feature of the embedding. This mental model will assist in understanding both the transformations applied to these embeddings and their role in the model's architecture, such as when these embeddings are inputted to subsequent layers (like the linear model head in the GPT model).

## Visualizing Tensor Dimensions

When working with multi-dimensional tensors, especially in the context of machine learning and deep learning, it is very helpful to have a consistent strategy for visualizing tensor shapes. Here’s some strategies to manage higher-dimensional tensors effectively:

### Visualizing Tensor Dimensions

1. **Last Dimension as Features/Columns**:
   - Typically, the last dimension in many tensor operations (especially in PyTorch and similar libraries) represents the feature or channel dimension. For instance, in a 2D tensor (matrix), if the shape is `(8, 10)`, you can think of it as having 8 rows and 10 columns, where each row represents a data point and each column a feature.

2. **Second-Last Dimension as Rows/Sequences**:
   - The second-last dimension often acts as rows in matrix terminology or as a sequence in contexts like time-series or language models. For tensors used in neural networks, thinking of the second-last dimension as "rows" or "samples" aligns well with how data is often structured (e.g., batches of data points or sequences).

### Handling Higher Dimensions

For tensors with more than two dimensions, which is common in deep learning, here’s how to visualize:

- **Shape `(1, 8, 10)`**:
  - Think of it as one batch containing 8 sequences (or data points), each with 10 features.
  - Visualize as a single block of 8 rows and 10 columns.

- **Shape `(1, 8, 4, 10)`**:
  - This might represent one batch containing 8 sequences, where each sequence is a 4x10 grid (possibly an image or a transformed representation).
  - Visualize it as 8 separate blocks (or layers), each block being a 4x10 matrix.

### Strategies for Visualization

1. **Sketching**:
   - Drawing diagrams of the tensors can help, especially when first learning or when explaining concepts to others. Sketch each dimension as a separate axis in a diagram.

2. **Nested Lists Concept**:
   - Think in terms of nested lists or arrays. For a shape like `(1, 8, 4, 10)`, think of it as a list containing one element, which is a list of 8 elements, where each element is a list of 4 lists, each containing 10 elements.

3. **Use Real-world Analogies**:
   - Relate tensor dimensions to real-world containers if possible (like boxes within boxes). For images in batches, think of a box (batch) containing several albums (images), where each page (row) of the album shows a sequence of features (columns).

4. **Software Tools**:
   - Use tensor visualization tools available in libraries like TensorBoard for TensorFlow or equivalents in other ecosystems. These tools can represent high-dimensional data visually and can be particularly enlightening.

5. **Consistent Mental Model**:
   - Develop a consistent method of breaking down dimensions as you interpret them across different projects. Consistency in how you mentally unpack dimensions will help in quickly understanding and reasoning about the shapes of tensors you encounter.

## Understanding (B, T) to (B, T, C) Transition

In a tensor of shape `(B, T, C)`:
- **B** represents the batch size.
- **T** represents the sequence length or time-steps.
- **C** represents the number of channels or features per timestep.

### Correct Visualization Approach

1. **Thinking of B as "Planes"**:
   - Your concept of thinking about "B" as planes is a useful one. In this view, each plane can be seen as a separate entity (such as a data sample or a sequence), and within each plane, you visualize the sequence unfolding.
   - This plane analogy is helpful, especially when dealing with images or sequences where each batch element is distinct.

2. **Visualizing T (Time/Sequence Dimension)**:
   - It's common to conceptualize the sequence or time dimension (T) as horizontal when plotting or imagining sequences over time (like a timeline), but in tensor shape terms, T is often more usefully thought of as the vertical axis in each "plane" of B when considering operations like convolution or recurrent processing in neural networks.
   - So, for tensor operations, thinking of T as vertical is correct if you align it with rows within each B plane.

3. **Interpreting C (Channels or Features)**:
   - The last dimension, C, can be thought of as the depth at each point in the sequence. Each element of the sequence (each timestep) has C features, which can be thought of as extending "downward" or "depth-wise" from each point in the sequence.

### Visual Model for (B, T, C)

Visualizing `(B, T, C)` effectively means imagining a stack of B matrices (or planes), where each matrix is `T` rows tall and `C` columns wide. Each row in the matrix corresponds to a timestep, and each column within a row corresponds to a feature of that timestep.

### Example for Clarity

If you imagine processing sentences where each word at each timestep is represented by a vector of features:
- **B** would be the number of sentences you're processing simultaneously (batch size).
- **T** would be the number of words in each sentence (sequence length).
- **C** would be the features representing each word (like embeddings).

In this case, each sentence (or each plane) consists of several words (rows, one per word, hence T as vertical), and each word is described by a feature vector (C features deep at each point in the T sequence).

### Tips for Practical Visualization

- **Sketching Helps**: Draw B as separate matrices or grids, with T and C defining the rows and columns of each grid.
- **Use Tensor Manipulation Tools**: Experiment with reshaping tensors in your programming environment to see how changing dimensions affects the arrangement of data.
- **Analogies Are Useful**: Relate tensor dimensions to real-world examples where spatial orientation is easier to grasp, like pages in a book or layers in a cake, where each layer/page can have a grid layout.


## Step-by-Step Visualization of Tensor Transformations

1. **Input Tensor**:
   - Shape: `(4, 8)`
   - Visualization: Think of this as 4 sequences (or batches), where each sequence consists of 8 integers (token indices). Each integer is between 0 and `vocab_size - 1`.
   - Mental Model: 4 planes, each with 8 rows (and implicitly, each row has 1 column here because each entry is a scalar).

2. **Embedding Table**:
   - Shape: `(65, 10)`
   - Visualization: Imagine a table with 65 rows, each corresponding to a token. Each row has 10 columns representing the features of the token (embedding dimensions).
   - Mental Model: 65 "word profiles" each described by a 10-feature vector.

3. **Output After Embedding Lookup**:
   - Shape: `(4, 8, 10)`
   - Visualization: Now, each of the integers in the input tensor has been transformed into a 10-dimensional vector. So, for each of the 4 sequences, you have 8 tokens represented by 10 features each.
   - Mental Model: 4 planes, each with 8 rows of tokens, and each row now extends into 10 columns of features.

4. **Processing Through Transformer Blocks**:
   - The tensor retains its shape `(4, 8, 10)` through the transformer blocks, assuming no change in dimensionality. The processing here involves complex interactions within and across the feature dimensions but the shape perspective remains the same.

5. **The `lm_head` Layer**:
   - Weight Matrix Shape: `(10, 65)`
   - Bias Shape: `(65,)`
   - Visualization: Each of the 10 rows in the weight matrix corresponds to an input feature, and each column (65 in total) corresponds to a token in the output vocabulary.
   - Final Output Shape After `lm_head`: `(4, 8, 65)`
   - Process: The tensor `(4, 8, 10)` is transformed to `(4, 8, 65)`. Here, for each of the 8 tokens in each of the 4 sequences, the 10-dimensional feature vector is projected onto a 65-dimensional output space (the vocabulary space).
   - Matrix Multiplication: Yes, the operation can be thought of as `(4, 8, 10) @ (10, 65)`. The input tensor is on the left, and the weight matrix `W` is on the right during the matrix multiplication in a linear layer. The biases are then added to each resulting 65-dimensional vector.

### General Strategy for Visualizing Tensors

- **Row and Column Mentality**:
  - For lower-dimensional data (2D), rows and columns work well (e.g., sequences or sets of features).
  - For higher-dimensional data, think in terms of "planes" or "blocks". Each higher dimension adds a new "block" of data.
- **Time as Horizontal vs. Vertical**:
  - Conventionally in matrix notation, time (sequence length) can indeed be horizontal. However, in tensor notation, especially in PyTorch, time or sequence length as the second dimension often aligns better with being visualized vertically in each "batch" or "plane".
- **Nested Visualization**:
  - Start from the highest dimension and add depth as you go to lower dimensions. For a tensor like `(1, 8, 10)`, think of it as 1 group containing 8 sequences, each sequence containing 10 features.

## the embedding table (`nn.Embedding`) and the linear model head (`nn.Linear`)

### Embedding Table (`nn.Embedding`)

1. **Purpose**:
   - The embedding table converts integer indices (token IDs) into dense vectors. This transformation maps discrete tokens into a continuous, high-dimensional space where similar tokens often have similar vector representations.
   - It serves as the initial layer in most NLP models to provide a more meaningful representation of input tokens than their raw indices.

2. **Operation**:
   - **No Matrix Multiplication During Lookup**: The operation is a lookup, not a matrix multiplication. Each index corresponds directly to a row in the embedding matrix. When an index is provided, the corresponding row (vector) from the embedding matrix is returned. This is more akin to an array indexing operation than a mathematical matrix-vector multiplication.
   - Shape transformation: For an input tensor of shape `(B, T)` (batch size by sequence length), the output after passing through an embedding layer of shape `(vocab_size, n_embed)` will be `(B, T, n_embed)`. Each token ID is replaced by its corresponding embedding vector.

### Linear Model Head (`nn.Linear`)

1. **Purpose**:
   - The linear layer, or fully connected layer, is used to transform data from one space to another, often used at the end of networks to map the learned representations to the desired output size (e.g., vocabulary size in language models).
   - In the context of transformers, the lm_head is typically used to map the output embeddings of the last transformer block back to the vocabulary space, facilitating the prediction of the next token probabilities.

2. **Operation**:
   - **Matrix Multiplication**: Unlike the embedding lookup, the linear layer involves a matrix multiplication followed by a bias addition. The input data is multiplied by the weight matrix of the layer, and then a bias is added to each resulting vector.
   - Shape transformation: For an input tensor of shape `(B, T, n_embed)`, where `n_embed` is the number of embedding dimensions, the linear layer with a weight matrix of shape `(n_embed, vocab_size)` outputs a tensor of shape `(B, T, vocab_size)`. Each `(n_embed)` vector is transformed into a `(vocab_size)` vector, predicting scores for each vocabulary token.

### Matrix Multiplications and Shape Transformations

- **Embedding Table**: Does not involve matrix multiplication during its operation. It's a straightforward mapping of indices to vectors based on the pre-trained or learned embeddings stored in a matrix-like structure.
- **Linear Layer (lm_head)**: Involves matrix multiplication, transforming learned embeddings or features into outputs (like logits for each token in the vocabulary). The operation can be visualized as `(B, T, n_embed) @ (n_embed, vocab_size) + bias`, resulting in an output of shape `(B, T, vocab_size)`.

### Visualizing the Difference

- **Embedding Table**: Think of it as a dictionary where each word (token ID) has a specific and fixed vector representation (embedding) that is retrieved directly.
- **Linear Layer (lm_head)**: Picture it as a transformation mechanism where every input vector (learned representation of data) is systematically modified (via weights and biases) to produce a new vector in a different space, typically aimed at classification or regression tasks.

## `nn.Linear` examples

The `nn.Linear` module in PyTorch is widely used across various types of neural networks. It acts as a fully connected (dense) layer that applies a linear transformation to the incoming data. Below, I will showcase some typical use cases of `nn.Linear` in neural network models and discuss the details of its weights and biases.

### Use Cases of `nn.Linear`

1. **Single Layer Perceptron**: Often used in simple linear models for regression or binary classification.
2. **Hidden Layers in Multilayer Perceptrons (MLP)**: Employed as intermediate layers between the input and output in deep neural networks.
3. **Output Layer in Classification Models**: Transforms features to logit scores, which are then passed through a softmax function to derive probabilities for each class.
4. **Transforming Feature Dimensions**: Used in applications like dimensionality reduction or feature transformation before other operations, such as in convolutional networks before classification layers.

### PyTorch Code Examples

Below are examples that demonstrate some of these use cases:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Example 1: Using nn.Linear for a simple regression model
input_features = 10
output_features = 1
model = nn.Linear(input_features, output_features)

# Example input tensor
x = torch.randn(1, input_features)  # One sample with 10 features
output = model(x)
print("Output of single-layer perceptron for regression:", output)

# Example 2: Using nn.Linear in a Multilayer Perceptron for classification
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.hidden1 = nn.Linear(10, 50)  # First hidden layer, 10 to 50 features
        self.hidden2 = nn.Linear(50, 20)  # Second hidden layer, 50 to 20 features
        self.output_layer = nn.Linear(20, 3)  # Output layer for 3 class classification

    def forward(self, x):
        x = F.relu(self.hidden1(x))
        x = F.relu(self.hidden2(x))
        x = self.output_layer(x)
        return x

# Example usage
mlp_model = MLP()
x = torch.randn(1, 10)  # One sample, 10 input features
output = mlp_model(x)
print("Output of MLP for classification:", output)

# Example 3: Dimensionality Reduction
dim_reduction = nn.Linear(100, 10)  # Reducing dimensions from 100 to 10
x = torch.randn(5, 100)  # Batch of 5 samples
reduced_x = dim_reduction(x)
print("Output after dimensionality reduction:", reduced_x.shape)
```

### Understanding Bias in `nn.Linear`

- **Shape**: The bias in `nn.Linear` is a 1D tensor with a size equal to `output_features`. In the above examples, for the first model, the bias shape would be `(1,)`, and for the output layer in the MLP, it would be `(3,)`.
- **Application**: During the linear transformation, the bias is added to each row of the output matrix from the matrix multiplication. This operation is broadcasted across all inputs in the batch.

Here’s how the bias is applied conceptually:
- **Matrix multiplication**: `output = x @ weight^T + bias`
  - Where `x` is the input matrix with shape `(batch_size, input_features)`, and `weight` is transposed to match dimensions.
  - The bias is automatically expanded to match the dimensions of `batch_size` and `output_features` and added to each output.

This addition of bias is crucial for the model as it provides an additional degree of freedom, allowing the model to fit the data better, especially when the inputs are zero or near-zero. This helps in shifting the activation function, thereby improving the model's ability to learn complex patterns.

## PyTorch's `nn.Linear` layer

In PyTorch's `nn.Linear` layer, the bias term is added to each row of the output of the matrix multiplication. The bias vector has the same number of elements as the output features (`out_features`), and each element of the bias is added to the corresponding element of every output row.

### How Bias is Added

When you use a linear layer configured as `(2, 3)` in PyTorch (`nn.Linear(3, 2)`), the matrix multiplication involves a weight matrix of shape `(2, 3)` and inputs of shape `(N, 3)` where `N` is the batch size. The output shape will be `(N, 2)`. Here, the bias will be a vector of length `2`, and each element of this vector is added to the corresponding column of the output.

Here's a simple PyTorch example that demonstrates how bias is added:

```python
import torch
import torch.nn as nn

# Define a simple model with a single linear layer
class SimpleLinearModel(nn.Module):
    def __init__(self):
        super(SimpleLinearModel, self).__init__()
        # Linear layer with 3 input features and 2 output features
        self.linear = nn.Linear(3, 2)

    def forward(self, x):
        # Pass the input through the linear layer
        return self.linear(x)

# Initialize the model
model = SimpleLinearModel()

# Print the weights and bias of the linear layer
print("Weights of the linear layer:\n", model.linear.weight)
print("Bias of the linear layer:\n", model.linear.bias)

# Example input tensor of shape (N, 3) where N is the batch size (e.g., N=4)
input_tensor = torch.tensor([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0],
                             [7.0, 8.0, 9.0],
                             [10.0, 11.0, 12.0]])

# Forward pass to get outputs
outputs = model(input_tensor)
print("Output of the linear layer:\n", outputs)
```

### Explanation of the Code

1. **Model Definition**: A simple model containing one `nn.Linear` layer is defined. This layer transforms inputs from 3-dimensional space to 2-dimensional space.

2. **Weights and Bias**: The weights (`W`) of the layer are initialized randomly and have a shape of `(2, 3)`, while the bias (`b`) is a vector of shape `(2,)`.

3. **Input Tensor**: An input tensor of shape `(4, 3)` is created, representing a batch of 4 samples, each with 3 features.

4. **Forward Pass**: When the input is passed through the linear layer, the weight matrix multiplies each input row (shape `(3,)`), and the resulting `(4, 2)` matrix has the bias vector added to each row. This addition is performed automatically by broadcasting the bias vector across all rows of the output.

### What Happens Specifically with the Bias

Each element of the bias vector is added to each corresponding element of the rows produced by the matrix multiplication of the input with the transpose of the weight matrix. This operation ensures that the bias affects the entire batch uniformly, adjusting the linear transformation appropriately for each output feature.

## Rules for Matrix Multiplication (for dummies)

Here are some general rules and guidelines for matrix multiplication in machine learning, along with how bias is handled:

### Rules for Matrix Multiplication

1. **Dimension Matching**:
   - For two matrices \(A\) and \(B\) to be multiplied, the number of columns in \(A\) must equal the number of rows in \(B\). Mathematically, if \(A\) is of size \(m \times n\) and \(B\) is of size \(n \times p\), then the matrix multiplication \(A \times B\) (or \(A@B\) in Python) is defined and results in a new matrix \(C\) of size \(m \times p\).
   - If the dimensions do not match (i.e., the columns of \(A\) do not match the rows of \(B\)), you cannot multiply the matrices directly. You might need to transpose one of the matrices or select different dimensions that do match.

2. **Resultant Matrix Shape**:
   - The resulting matrix \(C\) after the multiplication of \(A\) and \(B\) will have the number of rows of \(A\) and the number of columns of \(B\).

3. **Element Calculation**:
   - Each element \(c_{ij}\) of the matrix \(C\) is calculated as the dot product of the \(i\)-th row of \(A\) and the \(j\)-th column of \(B\). This means \(c_{ij} = \sum_{k=1}^{n} a_{ik} \cdot b_{kj}\), where \(n\) is the common dimension of \(A\) and \(B\).

### Rules for Adding Bias

1. **Bias Shape**:
   - In the context of neural networks, when a bias is added to the result of a matrix multiplication, the bias is typically a vector with a length equal to the number of columns of \(B\) (or the number of output features of the resulting matrix \(C\)).
   - The bias shape is therefore \(1 \times p\) when \(B\) is \(n \times p\).

2. **Broadcasting Bias**:
   - The bias is added to each row of the resulting matrix \(C\). This operation leverages broadcasting, where the bias vector \(b\) of shape \(1 \times p\) is added to every row of \(C\), effectively adjusting each element in the row by the corresponding element in the bias vector.
   - Specifically, if \(C\) is \(m \times p\), then each \(1 \times p\) row of \(C\) has the \(1 \times p\) bias vector added to it.

### Python Code Example

Here’s a simple Python code example to illustrate matrix multiplication and bias addition using NumPy:

```python
import numpy as np

# Define matrices A and B
A = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
B = np.array([[1, 4], [2, 5], [3, 6]])  # Shape (3, 2)

# Multiply matrices
C = np.dot(A, B)  # Alternatively, use A @ B in Python
print("Matrix C (result of A @ B):\n", C)

# Define bias
bias = np.array([1, 2])  # Shape (2,)

# Add bias to each row of matrix C
C_plus_bias = C + bias  # Broadcasting bias across each row
print("Matrix C after adding bias:\n", C_plus_bias)
```

### Summary

These rules encapsulate the core principles of matrix operations in machine learning, specifically highlighting how dimensions must align for multiplication and how biases adjust the outputs, playing a crucial role in neural network layers. Understanding and applying these rules helps in designing and debugging neural network architectures effectively.

In [4]:
import numpy as np

# Define matrices A and B
A = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
B = np.array([[1, 4], [2, 5], [3, 6]])  # Shape (3, 2)

# Multiply matrices
C = np.dot(A, B)  # Alternatively, use A @ B in Python
print("Matrix C (result of A @ B):\n", C)

# Define bias
bias = np.array([1, 2])  # Shape (2,)

# Add bias to each row of matrix C
C_plus_bias = C + bias  # Broadcasting bias across each row
print("Matrix C after adding bias:\n", C_plus_bias)


Matrix C (result of A @ B):
 [[14 32]
 [32 77]]
Matrix C after adding bias:
 [[15 34]
 [33 79]]


## the main uses of bias in machine learning

Bias terms play a crucial role in machine learning, particularly in neural networks, by providing an additional degree of freedom in model training and influencing the behavior of the activation functions applied to neurons. Let's discuss the main uses of bias in machine learning in detail, highlighting your example as one of the primary roles.

### 1. **Offsetting the Input to Activation Functions**

As you mentioned, one of the most critical roles of the bias is to adjust the input to the activation functions. This adjustment can be pivotal in determining the behavior of the activation function:

- **Threshold Shifting**: Bias allows the activation thresholds of neurons to be shifted. For example, in a neuron with a ReLU activation function (`ReLU(x) = max(0, x)`), a negative bias can make it harder for the neuron to activate (output a non-zero value), as the bias would need to be overcome by positive input values. Conversely, a positive bias can make it easier for the neuron to activate by effectively lowering the threshold at which the activation function starts producing a non-zero output.
- **Avoiding Dead Neurons in ReLU**: In the context of ReLU activation functions, without a proper bias, a significant number of neurons can end up never activating (a problem known as "dying ReLU"). By adjusting the bias, it's possible to ensure that more neurons fire, thus maintaining healthy gradients and improving the learning capabilities of the network.

### 2. **Improving Model Flexibility and Fit**

Bias increases the flexibility of the model to fit the data:

- **Non-zero Output at Zero Input**: Without a bias, a linear model or a neural network layer would be forced to pass through the origin (zero output when input is zero), which can be highly restrictive. Bias terms allow the model to output non-zero values even when the input is zero, which can lead to a better fit, especially in cases where the data does not naturally center around the origin.
- **Adjusting Decision Boundaries**: In classification tasks, bias terms help in shifting decision boundaries. For example, in a simple binary classifier like logistic regression, the bias term shifts the decision boundary away from the origin, allowing for more accurate classification when data classes are not symmetrically distributed about the origin.

### 3. **Stabilizing and Accelerating Convergence**

Bias terms can help in stabilizing and sometimes accelerating the convergence of learning algorithms:

- **Initialization**: Proper initialization of bias can lead to a more stable start in the training process. For instance, initializing biases to a small positive value can help avoid initial dead neurons in networks using ReLU activations.
- **Gradient Flow**: Biases affect the gradients during backpropagation. By properly managing bias values, it is possible to maintain healthier gradient flows across deep networks, which can prevent issues like vanishing and exploding gradients.

### 4. **Model Complexity and Overfitting**

While primarily adding flexibility, biases also contribute to the overall parameter count:

- **Regulation and Overfitting**: Just like weights, biases need to be regularized. Overfitting can occur if biases are too large, leading to overly complex models that react too strongly to particular features of the input data.

### Practical Example

Here’s a quick example using PyTorch to illustrate the effect of bias in a neural network layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simple model with one linear layer followed by a ReLU
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear = nn.Linear(2, 1)  # Two input features to one output
        # Initialize bias such that the neuron has a negative threshold
        self.linear.bias.data.fill_(-1.0)

    def forward(self, x):
        x = self.linear(x)
        return F.relu(x)

# Instantiate the model and provide an example input
model = SimpleNN()
input_tensor = torch.tensor([[0.5, 0.5]])
output = model(input_tensor)
print("Output with bias affecting activation:", output)
```

In this example, the bias is initialized to `-1.0`, which means the input sum must exceed 1.0 for the ReLU to activate, showcasing how bias can control neuron activation thresholds.

### Conclusion

Bias terms are fundamental components that enhance the functional power of machine learning models by providing additional degrees of freedom, influencing activation functions, and aiding in the model's ability to generalize from training data. Proper management and tuning of bias parameters are crucial for building effective neural network architectures.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simple model with one linear layer followed by a ReLU
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear = nn.Linear(2, 1)  # Two input features to one output
        # Initialize bias such that the neuron has a negative threshold
        self.linear.bias.data.fill_(-1.0)

    def forward(self, x):
        x = self.linear(x)
        return F.relu(x)

# Instantiate the model and provide an example input
model = SimpleNN()
input_tensor = torch.tensor([[0.5, 0.5]])
output = model(input_tensor)
print("Output with bias affecting activation:", output)


Output with bias affecting activation: tensor([[0.]], grad_fn=<ReluBackward0>)


In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simple model with one linear layer followed by a ReLU
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.linear = nn.Linear(2, 1)  # Two input features to one output
        # Initialize bias such that the neuron has a negative threshold
        self.linear.bias.data.fill_(1.0)

    def forward(self, x):
        x = self.linear(x)
        return F.relu(x)

# Instantiate the model and provide an example input
model = SimpleNN()
input_tensor = torch.tensor([[0.5, 0.5]])
output = model(input_tensor)
print("Output with bias affecting activation:", output)

Output with bias affecting activation: tensor([[1.0063]], grad_fn=<ReluBackward0>)


## Mathematical tricks with matrices

Matrix multiplication is not just a basic linear algebra operation; it's a powerful tool packed with possibilities for manipulating and transforming data, especially in computational settings like programming and data science. By carefully designing or modifying the matrices involved—particularly the second matrix in a multiplication, \(B\), as you noted—you can leverage a number of mathematical "tricks" or techniques. Here are several key methods and their practical implications:

### 1. **Summing Rows of \(A\)**

If you want to sum all the rows of matrix \(A\), you can multiply \(A\) by a column vector \(B\) where every element of \(B\) is 1.

- **Matrix Setup**: Let \(A\) be a matrix of size \(m \times n\) and \(B\) a column vector of size \(n \times 1\) with all elements being 1.
- **Operation**: \( A \times B \)
- **Result**: The resulting vector will be of size \(m \times 1\) where each element is the sum of the corresponding row in \(A\).

### 2. **Calculating Column Averages of \(A\)**

To find the average of each column in matrix \(A\), multiply \(A\) by a column vector \(B\) where each element is \(1/n\) (where \(n\) is the number of columns in \(A\)).

- **Matrix Setup**: \(A\) is \(m \times n\), \(B\) is \(n \times 1\) where each element is \(1/n\).
- **Operation**: \( A \times B \)
- **Result**: A vector of size \(m \times 1\) where each element is the average of the rows in \(A\).

### 3. **Accumulating Values Across Columns**

To accumulate (sum up) all values across the columns of \(A\) into a single sum, multiply \(A\) by a vector \(B\) where all elements are 1, and then sum up all the elements of the resulting vector.

- **Matrix Setup**: \(A\) is \(m \times n\), \(B\) is \(n \times 1\) filled with 1s.
- **Operation**: \( C = A \times B \) then sum all elements of \(C\).
- **Result**: A single scalar that is the sum of all elements in \(A\).

### 4. **Transforming Data by Weighting**

You can apply different weights to the columns of \(A\) by using a diagonal matrix \(B\) where diagonal elements represent the weights.

- **Matrix Setup**: \(A\) is \(m \times n\), \(B\) is \(n \times n\) diagonal matrix with weights as diagonal elements.
- **Operation**: \( A \times B \)
- **Result**: A matrix where each column of \(A\) has been scaled by the corresponding weight.

### 5. **Projection Operations**

Projecting data onto lower dimensions for operations like PCA or for simplifying models can be achieved by matrix multiplication. \(B\) would typically contain the projection vectors.

- **Matrix Setup**: \(A\) is \(m \times n\), \(B\) is \(n \times k\) where \(k < n\) representing lower-dimensional space.
- **Operation**: \( A \times B \)
- **Result**: A matrix of dimension \(m \times k\) representing the data in \(A\) projected onto a \(k\)-dimensional subspace.

---

### Useful calculations such as summing rows, calculating averages, and transforming data using weights.

```

### 1. Summing Rows of \(A\)

Here's how you can sum all the rows of a matrix \(A\) using NumPy:

```python
import numpy as np

# Create a matrix A
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Create a vector B with all elements being 1
B = np.ones((3, 1))

# Multiply A by B
sum_rows = np.dot(A, B)

print("Sum of each row in A:\n", sum_rows)
```

### 2. Calculating Column Averages of \(A\)

To calculate the average of each column in a matrix \(A\), you can scale the vector \(B\) by \(1/n\):

```python
# Number of columns in A
n = A.shape[1]

# Create a vector B with each element being 1/n
B = np.full((3, 1), 1/n)

# Multiply A by B
column_averages = np.dot(A, B)

print("Average of each column in A:\n", column_averages)
```

### 3. Accumulating Values Across Columns

Here's how to accumulate all values in matrix \(A\):

```python
# Create a vector B with all elements being 1
B = np.ones((3, 1))

# Multiply A by B
column_sums = np.dot(A, B)

# Sum all elements of the resulting vector
total_sum = np.sum(column_sums)

print("Total sum of all elements in A:", total_sum)
```

### 4. Transforming Data by Weighting

To apply different weights to the columns of \(A\):

```python
# Create a diagonal matrix B with weights on the diagonal
B = np.diag([0.5, 1.0, 1.5])

# Multiply A by B
weighted_A = np.dot(A, B)

print("Weighted A:\n", weighted_A)
```

### 5. Projection Operations

Projecting data onto lower dimensions can be achieved as follows:

```python
# Assume A is m x n and we want to project onto a 2-dimensional space
# Create a projection matrix B with size n x 2
B = np.random.rand(3, 2)  # Randomly chosen projection vectors

# Multiply A by B
projected_A = np.dot(A, B)

print("Projected A onto a 2-dimensional space:\n", projected_A)
```

These examples showcase how you can leverage matrix multiplication in NumPy to perform various mathematical operations efficiently, exploiting linear algebra for data transformations and summarizations in practical settings.

In [12]:
import numpy as np

# Create a matrix A
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Create a vector B with all elements being 1
B = np.ones((3, 1))

# Multiply A by B
sum_rows = np.dot(A, B)

print("B:\n", B)
print("Sum of each row in A:\n", sum_rows)


B:
 [[1.]
 [1.]
 [1.]]
Sum of each row in A:
 [[ 6.]
 [15.]
 [24.]]


In [7]:
# Number of columns in A
n = A.shape[1]

# Create a vector B with each element being 1/n
B = np.full((3, 1), 1/n)

# Multiply A by B
column_averages = np.dot(A, B)
print("B:\n", B)
print("Average of each column in A:\n", column_averages)


B:
 [[0.33333333]
 [0.33333333]
 [0.33333333]]
Average of each column in A:
 [[2.]
 [5.]
 [8.]]


In [9]:
# Create a vector B with all elements being 1
B = np.ones((3, 1))

# Multiply A by B
column_sums = np.dot(A, B)

# Sum all elements of the resulting vector
total_sum = np.sum(column_sums)

print("B:", B)
print("Total sum of all elements in A:", total_sum)


B: [[1.]
 [1.]
 [1.]]
Total sum of all elements in A: 45.0


In [15]:
# Create a diagonal matrix B with weights on the diagonal
B = np.diag([0.5, 1.0, 1.5])

# Multiply A by B
weighted_A = np.dot(A, B)

print("A:\n", A)
print("B:\n", B)
print("Weighted A:\n", weighted_A)


A:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
B:
 [[0.5 0.  0. ]
 [0.  1.  0. ]
 [0.  0.  1.5]]
Weighted A:
 [[ 0.5  2.   4.5]
 [ 2.   5.   9. ]
 [ 3.5  8.  13.5]]


In [16]:
# Assume A is m x n and we want to project onto a 2-dimensional space
# Create a projection matrix B with size n x 2
B = np.random.rand(3, 2)  # Randomly chosen projection vectors

# Multiply A by B
projected_A = np.dot(A, B)

print("B:\n", B)
print("Projected A onto a 2-dimensional space:\n", projected_A)


B:
 [[0.6329597  0.57643139]
 [0.34358579 0.07585142]
 [0.09522777 0.53817742]]
Projected A onto a 2-dimensional space:
 [[1.6058146  2.34266647]
 [4.82113438 5.91404714]
 [8.03645417 9.4854278 ]]


## Model: Embedding Layer and Output Linear Transformation (Tensor shapes)

Focusing on the data, weight matrices, bias, and tensor shapes at each step of the model. This step-by-step breakdown will make it easier to grasp the structure and operation of your model.

### Input

1. **Input Shape**:
   - **Shape**: `(4, 8)`
   - **Description**: This represents a batch of 4 sequences, each containing 8 tokens. Each token is represented by an integer index ranging from 0 to 64 (inclusive).

### Embedding Layer

2. **Embedding Table** (`nn.Embedding`):
   - **Module**: `nn.Embedding(65, 10)`
   - **Weight Matrix Shape**: `(65, 10)`
   - **Operation**: Maps each token index to a 10-dimensional embedding vector.
   - **Output Shape**: `(4, 8, 10)`
   - **Description**: The output after the embedding layer is a tensor where each token index from the input is replaced by its corresponding 10-dimensional embedding. Each sequence now is represented as a matrix of size `8 x 10`.

### Linear Model Head (lm_head)

3. **Linear Layer** (`nn.Linear`):
   - **Module**: `nn.Linear(10, 65)`
   - **Weight Shape**: `(65, 10)`
   - **Bias Shape**: `(65,)`
   - **Operation**: Transforms the embedding vector of dimension 10 back into the vocabulary space of dimension 65, effectively calculating logits for each token in the vocabulary.
   - **Output Shape**: `(4, 8, 65)`
   - **Description**: The output after the `lm_head` is a tensor where each 10-dimensional embedding vector is transformed into a 65-dimensional vector representing the raw scores (logits) for each vocabulary token. This output shape corresponds to each of the 8 positions in each of the 4 sequences now having a vector of length 65, which encodes the likelihood of each token being the next token.

### Detailed Breakdown of Operations

- **Embedding Lookup**:
  - For each token index in the input shape `(4, 8)`, the embedding layer looks up the corresponding 10-dimensional vector in its weight matrix of shape `(65, 10)`.
- **Matrix Multiplication and Bias Addition in lm_head**:
  - The embedding output `(4, 8, 10)` is transformed by the linear layer. This involves a matrix multiplication between each `(10)` vector and the weight matrix `(65, 10)`. The resulting shape is `(4, 8, 65)`.
  - The bias vector of shape `(65,)` is added to each `(65)` vector in the output, utilizing broadcasting to match dimensions. This means that the same bias values are added to the outputs of all positions in all sequences, effectively shifting the logits.

### Visualization of Tensor Transformations

- **Input to Embedding**:
  - Input: Indices -> Embedding Lookup -> 10D Vectors
- **Embedding to lm_head**:
  - 10D Embedding Vectors -> Linear Transformation (Weight Multiplication + Bias Addition) -> 65D Logits

This structured presentation clarifies how data is transformed through your model, detailing the roles and interactions of weights, biases, and tensor dimensions. This approach not only aids in understanding the model's function but also prepares you for debugging and potentially scaling or modifying the model architecture.

##  how the `dim` parameter works in PyTorch

### Understanding `dim` in PyTorch

In PyTorch, when you specify a `dim` parameter for operations like `softmax`, you're indicating which dimension of the input tensor should be used to compute the function such that the operation is applied across that dimension.

- **Matrix [n x m]**: Think of it as `n` rows and `m` columns.
- **dim=0**: The operation will be applied to each column across all rows.
- **dim=1**: The operation will be applied to each row across all columns.

### What does `dim=1` mean?

For a matrix `[n x m]` (say, `[[1, 2, 3], [4, 5, 6]]`):
- **dim=0**: Collapses each column across the rows. Think of squishing the matrix vertically.
- **dim=1**: Collapses each row across the columns. Think of squishing the matrix horizontally.

When applying `softmax` or any similar function with `dim=1`, it means the function processes each row individually, treating each row as a separate set of inputs. In many neural network operations, this is common for treating each row as a separate data point (or batch item) with multiple features.

### Simple Examples

Let's see how this works with actual PyTorch code:

```python
import torch
import torch.nn.functional as F

# Define a 2x3 matrix
A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Apply softmax with dim=1
softmax_A_dim1 = F.softmax(A, dim=1)
print("Softmax over dim=1:\n", softmax_A_dim1)

# Apply softmax with dim=0
softmax_A_dim0 = F.softmax(A, dim=0)
print("Softmax over dim=0:\n", softmax_A_dim0)
```

**Output Explanation**:
- **Softmax with `dim=1`**: Processes each row independently, converting the numbers in each row into probabilities that sum to 1. This treats each row as a separate set of class scores in a classification task.
- **Softmax with `dim=0`**: Processes each column independently, converting the numbers in each column into probabilities that sum to 1. This is less common but might be used in specific contexts where columns represent different classes or categories.


## `dim` in PyTorch

There are two distinct types of operations based on how they handle tensor dimensions when applying a function across a specified `dim` in PyTorch. Let's discuss the two function types in detail and explore how the `dim` parameter affects their behavior:

### 1. **Functions that Preserve Tensor Shape (e.g., `softmax`)**

**Softmax** is a classic example of a function where the tensor shape remains unchanged. It is primarily used to convert raw scores (logits) into probabilities which are easier to interpret and are required for tasks such as classification. Each element is transformed based on other elements along the specified dimension, but the overall shape of the tensor doesn't change.

- **`dim=1` Example**:
  - When applied with `dim=1` on a tensor, `softmax` processes each row independently. For a tensor `A` of shape `[n, m]`:
    ```python
    A = torch.tensor([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])
    softmax_A = torch.nn.functional.softmax(A, dim=1)
    ```
    Each row in `A` is treated as a separate group of logits, and `softmax` converts them into probabilities that sum to 1 across each row. The shape remains `[n, m]`.

- **`dim=0` Example**:
  - When `softmax` is applied with `dim=0`, it processes each column across rows:
    ```python
    softmax_A = torch.nn.functional.softmax(A, dim=0)
    ```
    Each column of `A` is now treated separately, converting values into probabilities that sum to 1 down each column. Again, the shape remains `[n, m]`.

### 2. **Functions that Change Tensor Shape (e.g., `sum`)**

**Sum** is an operation that aggregates elements along the specified dimension, resulting in a tensor whose shape is reduced in that dimension.

- **`dim=1` Example**:
  - Applying `sum` with `dim=1` collapses the rows (sums across them), reducing the dimensionality of the tensor:
    ```python
    sum_A = A.sum(dim=1)
    ```
    For the tensor `A` with shape `[n, m]`, the result of `sum_A` will have the shape `[n]`. Each element of the resulting tensor is the sum of elements across the corresponding row in `A`.

- **`dim=0` Example**:
  - Similarly, applying `sum` with `dim=0` collapses the columns (sums down them):
    ```python
    sum_A = A.sum(dim=0)
    ```
    The output `sum_A` will have the shape `[m]`. Each element of `sum_A` is the sum of elements down the corresponding column in `A`.

### Visual Memory Aid: "Collapsing or Expanding?"

- **Preserve Shape (Expanding)**: Think of operations like `softmax` as **expanding** the information across the dimension without changing the structure—each element affects others along that dimension but retains the "space" (shape).
- **Change Shape (Collapsing)**: Operations like `sum` **collapse** the tensor along the specified dimension, reducing its dimensionality by aggregating values along it.

This framework not only aids in understanding the effect of these operations on data structure but also helps in deciding which operation to use based on the desired outcome in data transformation and analysis tasks.

## `torch.sum()`

The `torch.sum()` function in PyTorch is a versatile tool for summing elements of a tensor across specified dimensions. The parameters `dim` and `keepdim` influence how the summation is performed and the shape of the output. Below, I'll provide examples that demonstrate how these parameters work.

### Basic Summation without Specifying `dim`

First, let's start with the simplest use of `torch.sum()` where no dimension is specified. The function sums all elements of the tensor, reducing it to a single scalar value.

```python
import torch

# Create a 2x3 tensor
A = torch.tensor([[1, 2, 3], [4, 5, 6]])
total_sum = torch.sum(A)
print("Total sum of all elements:", total_sum)  # Output: 21
```

### Summation with `dim` Parameter

The `dim` parameter specifies the dimension along which to sum, reducing the size of that dimension to 1 unless `keepdim` is set to `True`.

#### Example 1: Summing Along a Dimension

```python
# Sum along columns (dim=0)
column_sum = torch.sum(A, dim=0)
print("Sum of each column:", column_sum)  # Output: [5, 7, 9]

# Sum along rows (dim=1)
row_sum = torch.sum(A, dim=1)
print("Sum of each row:", row_sum)  # Output: [6, 15]
```

#### Example 2: Using `keepdim=True`

When `keepdim=True`, the output tensor retains the same number of dimensions as the input, though the dimension summed over is of size 1.

```python
# Sum along columns but keep the dimensions
column_sum_keepdim = torch.sum(A, dim=0, keepdim=True)
print("Sum of each column with same dimensions:", column_sum_keepdim)  # Output: [[5, 7, 9]]
print("Shape:", column_sum_keepdim.shape)  # Output: torch.Size([1, 3])

# Sum along rows but keep the dimensions
row_sum_keepdim = torch.sum(A, dim=1, keepdim=True)
print("Sum of each row with same dimensions:", row_sum_keepdim)  # Output: [[6], [15]]
print("Shape:", row_sum_keepdim.shape)  # Output: torch.Size([2, 1])
```

### Explanation and Visualization

- **`dim=0`**: Summing along the first dimension (columns) collapses the columns (vertical sum). Each element in the output corresponds to the sum of elements in that column across all rows.
- **`dim=1`**: Summing along the second dimension (rows) collapses the rows (horizontal sum). Each element in the output is the sum of elements in that row across all columns.
- **`keepdim=True`**: By keeping dimensions, the output tensor retains the number of dimensions of the input tensor, making it easier to align this output with other tensors for further computations that require matching dimensions.

These examples illustrate how `torch.sum()` can be used in various scenarios to compute sums across different dimensions while controlling the shape of the output. This flexibility is crucial in data processing pipelines, especially when dealing with multidimensional data in neural network operations.

In [88]:
import torch

# Create a 2x3 tensor
A = torch.tensor([[1, 2, 3], [4, 5, 6]])
total_sum = torch.sum(A)
print("Total sum of all elements:", total_sum)  # Output: 21

print("\n-----")
# Sum along columns (dim=0)
column_sum = torch.sum(A, dim=0)
print("Sum of each column:", column_sum)  # Output: [5, 7, 9]
print("Shape:", column_sum.shape)  # Output: torch.Size([3])

# Sum along columns but keep the dimensions
column_sum_keepdim = torch.sum(A, dim=0, keepdim=True)
print("Sum of each column with same dimensions:", column_sum_keepdim)  # Output: [[5, 7, 9]]
print("Shape:", column_sum_keepdim.shape)  # Output: torch.Size([1, 3])

print("\n-----")
# Sum along rows (dim=1)
row_sum = torch.sum(A, dim=1)
print("Sum of each row:", row_sum)  # Output: [6, 15]
print("Shape:", row_sum.shape)  # Output: torch.Size([2])

# Sum along rows but keep the dimensions
row_sum_keepdim = torch.sum(A, dim=1, keepdim=True)
print("Sum of each row with same dimensions:", row_sum_keepdim)  # Output: [[6], [15]]
print("Shape:", row_sum_keepdim.shape)  # Output: torch.Size([2, 1])

Total sum of all elements: tensor(21)

-----
Sum of each column: tensor([5, 7, 9])
Shape: torch.Size([3])
Sum of each column with same dimensions: tensor([[5, 7, 9]])
Shape: torch.Size([1, 3])

-----
Sum of each row: tensor([ 6, 15])
Shape: torch.Size([2])
Sum of each row with same dimensions: tensor([[ 6],
        [15]])
Shape: torch.Size([2, 1])


## `model.to(device)`

In PyTorch, when you use the `.to(device)` method to move a model or tensors to a specific device (like GPU or CPU), the method returns a new object that resides on the specified device. This is crucial to understand because the operation itself does not mutate the model in-place but instead returns a new model on the specified device.

Here’s the breakdown of the best practice for each option you listed:

### 1. `model.to(device)`
- This statement moves the `model` to the specified `device`, but it does not update the `model` variable itself. Therefore, if you only write `model.to(device)` without reassigning it back to `model` or another variable, the original `model` will remain on the original device (typically CPU).
- This can lead to confusion because subsequent operations on `model` might still use the CPU, potentially causing device mismatches especially when combining it with tensors on a different device.

### 2. `model = model.to(device)`
- This is generally considered the best practice. It explicitly updates `model` to be the version that is on the new `device`. This ensures that all subsequent operations on `model` are performed on the correct device, and it makes the code clearer and less error-prone.
- This method clearly communicates that the model has been moved and is now being referenced by the original variable name on the new device.

### 3. `m = model.to(device)`
- This approach is also valid and works similarly to the second option, but it assigns the device-transferred model to a new variable `m`. This can be useful if you want to maintain the original model on the CPU while also using a GPU-transferred version for certain operations.
- However, it can introduce complexity and potential bugs if you're not careful with how you manage and distinguish between `model` and `m` throughout your code.

### Recommended Approach and Example
Based on clarity and safety, the best practice is typically:

```python
model = model.to(device)
```

This practice ensures that your model is moved to the appropriate device, and all subsequent operations on the model are performed on that device, reducing the risk of device mismatch errors. Here is how you should structure your example:

```python
# Example usage
# Assume 'vocab_size', 'n_embed', 'device' are defined
model = GPT(vocab_size=65, n_embed=10)
model = model.to(device)  # Move model to specified device correctly
start_idx = torch.tensor([[0]], dtype=torch.long, device=device)  # Ensure start_idx is also on the same device
generated_indices = model.generate(start_idx, max_new_tokens=10)
print("Generated indices:", generated_indices)
```

This code snippet avoids potential errors by ensuring that both the model and the tensors it interacts with are on the same device, providing a clean and error-free setup for deploying models especially in environments where both CPU and GPU are used.

## How Temperature Works

### Temperature in Softmax

The temperature parameter in the softmax function is a crucial tool in controlling the distribution of probabilities produced by the logits. In the context of models like GPT and other transformers, adjusting the temperature allows you to manage the trade-off between randomness and confidence in the model's predictions.

### How Temperature Works

- **Scaling**: The logits are divided by the temperature value before applying the softmax function. A higher temperature results in a softer probability distribution across the output classes, enhancing diversity. Conversely, a lower temperature makes the distribution sharper, with the highest logit value becoming more dominant.
- **Formula**: For a given logit vector \( z \), the softmax function with temperature \( T \) is given by:
  \[
  \text{softmax}(z_i) = \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}}
  \]

### Practical Implementation

In many neural network frameworks, including PyTorch, this is implemented simply by dividing the logits by the temperature:
```python
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
```

### Answers to Your Questions:

1. **Production Systems with Variable Temperature Settings**:
   - Yes, in most systems that use a temperature parameter, adjusting the temperature is indeed typically implemented as division of the logits by the temperature scalar before applying softmax. This simple operation allows for dynamic control over the behavior of the model, making it versatile in applications that require varying levels of randomness or exploration.

2. **Greedy Decoding and Temperature**:
   - Greedy decoding can be considered a special case where the temperature is implicitly set to a very small value (approaching zero), making the model highly confident, effectively picking the maximum logit value every time.
   - **Temperature = 1.0**: This setting keeps the logits unchanged in the softmax calculation. It represents the "natural" or "default" behavior of the model without any scaling to influence the confidence of the predictions.

3. **Avoiding Division by Zero**:
   - When implementing temperature scaling, care must be taken to avoid division by zero. This is particularly relevant if the temperature could dynamically change to zero due to some runtime condition or error.
   - **Safety Check**: Ensure there’s a check to avoid setting the temperature to zero. If the temperature is part of a user input or adjustable parameter, validate the input to ensure it's within a safe range (e.g., greater than a very small epsilon value close to zero but not zero itself).
   - **Default Fallback**: Implement a fallback to a default temperature (like 1.0) if an invalid (zero or negative) temperature is provided.

### Example with Check to Avoid Division by Zero

Here's an implementation with a safety check for the temperature:

```python
def safe_softmax(logits, dim, temperature=1.0):
    if temperature <= 0:
        raise ValueError("Temperature must be greater than zero.")
    scaled_logits = logits / temperature
    return F.softmax(scaled_logits, dim=dim)
```

Using such a function ensures that your system behaves robustly, avoiding computational errors and providing meaningful error messages for debugging and user feedback.

In summary, the temperature is a powerful parameter for controlling the behavior of softmax in production systems, allowing for everything from exploration (high temperature) to exploitation (low temperature, approaching greedy decoding). Ensuring safety in temperature adjustments is crucial for robust system behavior.

## Understanding Matrix Multiplications in ML Contexts

In general, the orientation and the role of matrices in a multiplication depend heavily on what you're trying to achieve:

1. **nn.Linear**:
   - In neural networks, particularly in layers like `nn.Linear`, the weight matrix is conventionally placed on the right side of the input matrix during the mathematical formulation, but due to implementation details, it appears as if the weights are on the left because the input matrix is often transposed or the weights themselves are transposed during computation.
   - **Formulation**: If `X` is your input matrix of shape `(n, in_features)` and `W` is your weight matrix of shape `(out_features, in_features)`, the matrix multiplication in a linear layer is typically represented as `X @ W^T` (where `W^T` is the transpose of `W`). This gives an output matrix of shape `(n, out_features)`.

2. **Attention Mechanisms**:
   - In attention mechanisms, particularly in transformers, the weight matrices (or the projections of queries, keys, and values) might be arranged differently depending on how the attention function is structured.
   - **Attention Calculation**: The key difference in attention is the interaction between the Query (`Q`), Key (`K`), and Value (`V`) matrices. The attention score is typically computed as `Q @ K^T`, where `Q` and `K` are of shapes `(n, d_k)` and `(n, d_k)` respectively if you're using scaled dot-product attention.
   - The result of `Q @ K^T` is then used to weight the `V` matrix. This kind of multiplication aligns with how you'd multiply matrices where dimensions that face each other (inner dimensions in the multiplication) must match, and you're effectively computing a weighted sum of the values based on the attention scores.

### Example Explanation

From your Python example with matrices `a` and `b`:

- `c = a @ b` results in each row of `c` being the sum of all the elements of `b` because matrix `a` is essentially acting like an accumulator given all its elements are 1. Here, `a` is on the left, but it acts uniformly across the rows of `b`.
- `d = b @ a` also accumulates values, but in this case, since `a` is on the right, it sums up columns of `b` across rows. Each row in `d` is a repeat of these sums because all elements of `a` are 1, thus replicating the column sums across all rows of `d`.

### Key Points to Remember

- The position (left/right) of a weight matrix in a dot product depends on the specific operation and layer architecture. In `nn.Linear`, weights typically are conceptually on the left after considering transposition (`Wx + b` as in matrix notation, it becomes `xW^T + b` in practical code due to the way matrices are handled in computations).
- In attention, the interaction between `Q`, `K`, and `V` dictates the arrangement and thus might feel different as it focuses on aligning matrices to compute similarities (or attention scores) before applying them to `V`.

Understanding these nuances helps in appreciating how different layers and mechanisms manipulate data through matrix operations, adjusting your intuition about how data flows and transforms through complex networks like transformers.

## `torch.allclose`

The issue you're facing with `torch.allclose` returning `False` even though the tensors `xbow` and `xbow2` look very similar visually lies in the numerical precision and the operations you've performed to compute these tensors.

### Issue Analysis

1. **Matrix Multiplication Precision**: When you use matrix multiplication (`@`) with weight matrices computed as `wei` (normalized lower triangular matrix), the resulting computations may introduce very slight numerical inaccuracies due to floating-point precision. These differences are often very small, but they're enough to make `torch.allclose` return `False`.

2. **Computational Differences**:
   - The `xbow` computation uses a direct method that calculates the mean for slices of the tensor `x` in a loop. This computation accumulates results directly from the data.
   - The `xbow2` computation uses matrix multiplication which may involve different internal optimizations and floating-point precision handling, especially when using GPU or optimized CPU routines.

### Closer Look with `torch.allclose`

`torch.allclose` checks if two tensors are element-wise equal within a tolerance. The default settings for `atol` (absolute tolerance) and `rtol` (relative tolerance) are very strict (`1e-08` for `atol` and `1e-05` for `rtol`). These default tolerances might be too tight given the potential small discrepancies introduced by different computational paths in `xbow` and `xbow2`.

### Solution

You can adjust the tolerance parameters in `torch.allclose` to see if the tensors are close within a more reasonable tolerance, acknowledging that minor discrepancies are expected due to the reasons mentioned:

```python
# Check with a more relaxed tolerance
close = torch.allclose(xbow, xbow2, atol=1e-6, rtol=1e-4)
print("Are the tensors close within a relaxed tolerance?", close)
```

### Additional Debugging Step

To better understand the differences, you can calculate and inspect the maximum difference between the two tensors:

```python
# Calculate the maximum difference
max_diff = torch.max(torch.abs(xbow - xbow2))
print("Maximum difference between xbow and xbow2:", max_diff)
```

This will give you a sense of how significant the differences are, which can help in deciding appropriate values for `atol` and `rtol`.

### Summary

In numerical computations, especially those involving iterative or summing operations over floating-point numbers, slight differences are common. Understanding and adjusting for these in functions like `torch.allclose` is essential for accurate debugging and validation of results in scenarios where exact precision isn't always achievable due to underlying hardware or software computation methods.

## `self.tril[:T, :T] == 0`

The line `self.tril[:T, :T] == 0` is used to apply a mask to the attention weights (`wei`). This mask is crucial for enforcing causality within the self-attention mechanism, particularly in scenarios where the model must not be allowed to "see" future tokens. This type of masking is standard in models processing sequential data where each output should only be influenced by previous inputs and not by any future inputs, such as in language modeling or other types of generative tasks.

### Explanation of the Masking Step

- **Purpose of `tril`:** The `torch.tril` function generates a lower triangular matrix where all elements above the diagonal are zero. This matrix acts as a template to specify which positions in the attention matrix `wei` should be considered (past tokens) and which should not (future tokens).

- **Masking Operation**: The operation `self.tril[:T, :T] == 0` dynamically generates a boolean mask where all positions corresponding to future tokens (above the main diagonal) are `True`. The `torch.masked_fill()` function then uses this mask to set the attention scores in `wei` at these positions to `-inf`, effectively preventing the softmax function from assigning any probability mass to future tokens.

### Why `[:T, :T]` is Needed

- **Dynamic Sequence Lengths**: In practice, batches of sequences may not always be filled to the maximum length (`block_size`). The slicing `[:T, :T]` ensures that the mask is correctly sized to the actual sequence length `T` in each batch, providing flexibility for handling sequences of varying lengths without the need to regenerate the triangular mask for each input.

### Alternative Clearer Implementation

While the current implementation with `torch.tril` and slicing is quite efficient, it can be made more intuitive by explicitly creating the mask within the `forward` method or by encapsulating the masking logic into a separate method. Here’s a clearer alternative using a function to generate the mask:

```python
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)

    def generate_mask(self, size):
        """Generates a lower triangular mask to ensure causality in attention."""
        return torch.tril(torch.ones(size, size, device=self.query.weight.device))

    def forward(self, x):
        B, T, C = x.shape
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)
        wei = q @ k.transpose(-2, -1) * C**-0.5  # (B, T, T)
        
        # Apply causal mask to weights
        mask = self.generate_mask(T)
        wei = torch.masked_fill(wei, mask == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        out = wei @ v
        return out
```

### Benefits of the Alternative Approach

- **Clarity**: Encapsulating the mask generation in a separate method makes the code more organized and the functionality of each part clearer.
- **Flexibility**: Generating the mask dynamically based on the actual sequence length each time allows the module to handle variable sequence lengths more naturally without slicing operations.
- **Reusability**: By abstracting the mask generation, it becomes easier to modify or extend the masking logic if needed for different variants of attention mechanisms or models.

This approach enhances the readability and maintainability of the code without sacrificing performance, making the underlying model's behavior more transparent.

##  The slicing notation `[:T, :T]`

The slicing notation `[:T, :T]` is a powerful way to dynamically adjust the size of tensors in PyTorch based on the current requirements, such as the actual sequence length in batch processing. Let's dive into a simple example that illustrates how this type of slicing works to manipulate tensors.

### Scenario

Suppose you have a full square matrix representing some kind of relationship between elements (e.g., distances, similarities) for a set number of elements `block_size`. In each batch, however, you only process `T` elements where `T <= block_size`. You need to extract a `T x T` submatrix from the larger `block_size x block_size` matrix for computations relevant to the current batch.

### Example Setup

Let's create a square matrix of size `block_size` and demonstrate how to dynamically slice it to size `T` when needed.

```python
import torch

# Initialize a fixed block_size
block_size = 5  # Assume our block size is 5

# Create a full matrix of size block_size x block_size
full_matrix = torch.arange(block_size * block_size).reshape(block_size, block_size)
print("Full Matrix:")
print(full_matrix)

# Now, let's say in a particular operation, we only need the first T x T part of this matrix
T = 3  # For this example, let's consider T to be 3

# Extract the T x T submatrix
sub_matrix = full_matrix[:T, :T]
print("\nExtracted Submatrix for T=3:")
print(sub_matrix)
```

### Output Explanation

The code will output the following matrices:

1. **Full Matrix**: A `5x5` matrix filled with numbers from 0 to 24 arranged in a square format.
2. **Extracted Submatrix**: When `T=3`, the submatrix extracted will be the top left `3x3` portion of the full matrix.

### Sample Output

```plaintext
Full Matrix:
tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24]])

Extracted Submatrix for T=3:
tensor([[ 0,  1,  2],
        [ 5,  6,  7],
        [10, 11, 12]])
```

### Why This Is Useful

This slicing technique is particularly useful in applications where:
- The size of data to be processed can vary (e.g., variable sequence lengths in NLP, different batch sizes in image processing).
- Only a portion of a larger dataset or matrix is relevant for a specific computation or analysis.
- You need to ensure that operations are flexible and efficient without the need to recreate or resize tensors unnecessarily.

This method of dynamic slicing helps manage memory more efficiently and makes your code adaptable to varying data sizes, enhancing both performance and scalability of your applications.

## `nn.ModuleList` and `nn.Sequential`

In PyTorch, both `nn.ModuleList` and `nn.Sequential` are used to store and manage multiple submodules (like layers), but they serve different purposes and provide different functionalities based on the needs of your model architecture.

### `nn.ModuleList`

**Description**: `nn.ModuleList` is essentially a Python list that is also a PyTorch module. It holds submodules in a list but doesn't define how they should interact, merely ensuring they're registered as part of the parent module.

**Use Cases**:
- **Custom Combinations of Modules**: When you need to iterate over a list of modules and apply them in some custom way that isn't strictly sequential.
- **Variable Module Application**: Useful when the application of modules can change dynamically during runtime, such as selectively applying certain modules under specific conditions.
- **Module Management**: It helps in managing a list of modules where each module might need to be accessed individually or modified separately.
- **Example from Your Code**:
  - `MultiHeadAttention` class uses `nn.ModuleList` to hold multiple attention heads. This allows the model to process inputs through multiple heads separately and then combine their outputs. This wouldn't be easily manageable with `nn.Sequential` because each head might need to be applied independently or in parallel, and then their outputs concatenated.

```python
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        # Assume initialization details
        pass

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)
```

### `nn.Sequential`

**Description**: `nn.Sequential` is a container that processes modules in a sequential manner. When you place modules in `nn.Sequential`, the output of one module becomes the input to the next module automatically.

**Use Cases**:
- **Simplified Sequential Processing**: Perfect for cases where the model's structure is a simple feed-forward type, with a clear sequence of operations that do not require branching, skipping, or any other form of non-linear connectivity.
- **Ease of Use**: Helps in quickly setting up networks without explicitly defining the `forward` pass for connecting each layer.
- **Stacked Layers**: Commonly used for creating chains of layers in neural networks, such as a series of convolutional layers followed by activations.
- **Example from Your Code**:
  - The `FeedForward` class uses `nn.Sequential` for a straightforward linear transformation followed by an activation function, which is a typical use case for `nn.Sequential`.

```python
class FeedForward(nn.Module):
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU()
        )

    def forward(self, x):
        return self.net(x)
```

### Conclusion

The choice between `nn.ModuleList` and `nn.Sequential` depends on how you need to apply the submodules:
- Use `nn.ModuleList` when you need more control over how modules are applied, need to handle them individually, or require custom architectures.
- Use `nn.Sequential` for straightforward, linear sequences of modules where the output of one module feeds directly into the next without any modifications or conditions.

These distinctions make each tool suitable for different scenarios, allowing you to optimize the design of your neural network architectures based on specific requirements.

## Skip Connections and Layer Normalizations

Adding skip connections, also known as residual connections, to a transformer block as you've described is indeed quite straightforward and aligns well with the architecture of popular transformer models like those described in the original "Attention is All You Need" paper. Your proposed modification does effectively implement skip connections. Let's go over why this is effective and ensure you're implementing it correctly.

### What Are Skip Connections?

Skip connections are a technique used in neural network architectures to help mitigate the vanishing gradient problem and to enable deeper networks by allowing gradients to flow through the network directly. In transformers, they help the network to preserve information from earlier layers and combine it with new transformations, which can enhance learning and model performance.

### Implementation in Transformers

In the context of transformers, a skip connection typically involves adding the input directly to the output of a sub-layer (like multi-head attention or feedforward neural networks), followed by normalization. Here’s what you need to ensure when adding skip connections:

1. **Addition Before Normalization**: While your code snippet directly adds the outputs from the self-attention and feed-forward layers to their respective inputs, it's common practice in transformers to also apply layer normalization after each addition. This helps in stabilizing the training process.

2. **Scaling (Optional)**: Some implementations also scale the outputs by \(\sqrt{0.5}\) (or similar factors) to balance the variance of the sum.

### Suggested Code Modification

Here’s an improved version of your code, adding layer normalization, which is a critical component in transformer models:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embed, n_head):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embed)
        self.norm1 = nn.LayerNorm(n_embed)
        self.norm2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        # Apply self-attention and add the input (residual connection), then normalize
        x = self.norm1(x + self.sa(x))
        # Apply feed-forward network, add the result to the input of this step, then normalize
        x = self.norm2(x + self.ffwd(x))
        return x
```

### Key Points:

- **LayerNorm**: Each block now includes two `nn.LayerNorm` layers. Normalization is applied after each skip connection, which is essential for stabilizing the learning process.
- **Norm + Residual**: This order (first add, then normalize) is typical in many successful transformer models.

### Conclusion

Adding skip connections as you've proposed is indeed straightforward but adding normalization is also crucial. These connections make the model more robust by allowing it to learn identity functions where beneficial, thereby preventing the deeper layers from corrupting the information processed in earlier layers. This pattern has been fundamental in enabling the training of very deep networks in practice, significantly improving performance across a wide range of tasks.

## Skip connections are architectural features

Skip connections themselves are architectural features rather than specific parameters or modules that directly appear in printouts like the structure, state_dict, or parameter counts. They are implemented in the code as operations that add the output of a layer (or block) to its input and do not have their own learnable parameters. Therefore, they won't be explicitly listed in the outputs you're planning to generate. Here’s what you can expect from each type of printout regarding skip connections:

### 1. Model's Structure Printout
When you print the model's structure using `print(model)`, it shows the components (submodules) of the model, such as layers defined in the `__init__` method. Skip connections, being merely operational steps within the `forward` method, do not have separate submodules and thus won’t be explicitly visible in this structure printout.

### Example Output Snippet
For a transformer block similar to what you described, the printout might look like this (without skip connections visible):
```plaintext
Block(
  (sa): MultiHeadAttention(
    ... # Details of the attention mechanism
  )
  (ffwd): FeedForward(
    ... # Details of the feedforward network
  )
  (norm1): LayerNorm(...)
  (norm2): LayerNorm(...)
)
```

### 2. Model's `state_dict` Printout
The `state_dict` holds the model's parameters (weights and biases). Since skip connections do not involve learnable parameters—they merely add tensors—they won't appear in the `state_dict`. Only layers like `nn.Linear`, `nn.Conv2d`, `nn.LayerNorm`, etc., which have weights and/or biases, will be visible in the `state_dict`.

### Example Output Snippet
Here, you would see entries for each weight and bias tensor in the model’s layers, but nothing specifically for skip connections:
```plaintext
sa.query.weight    torch.Size([...])
sa.key.weight      torch.Size([...])
ffwd.net.0.weight  torch.Size([...])
norm1.weight       torch.Size([...])
norm2.weight       torch.Size([...])
```

### 3. Number of Parameters Printout
The total and trainable parameters counts will reflect the sum of all parameters in the model. Since skip connections have no parameters, they do not affect these counts. The counts will only include parameters from the defined layers (like those in `MultiHeadAttention` and `FeedForward`) and any normalization layers.

### Implications
Understanding that skip connections are not directly represented in parameter or structure listings is important for interpreting these printouts. Their effects are embodied in how they influence the training dynamics and performance, not in the parameter count or structure listings.

### Conclusion
While skip connections are integral to the functionality and performance improvements in models like transformers, they are essentially part of the model's logic and flow (in how data is processed through the network), rather than discrete components. Their presence needs to be understood and traced through the source code in the `forward` method rather than through inspection tools that list components or parameters.

## Why Add a Projection Layer?

In the context of transformers and particularly within the design of MultiHeadAttention mechanisms, a projection layer after the concatenation of heads is typically required to consolidate the multiple attention outputs into a single tensor that matches the dimensions expected by subsequent layers or operations. This ensures that the MultiHeadAttention's output can be integrated smoothly within the broader architecture of the model.

### Why Add a Projection Layer?

1. **Dimensionality Alignment**: MultiHeadAttention expands the feature dimensionality by concatenating outputs from multiple heads. If each head returns a vector of length `head_size` and there are `num_heads` heads, the concatenated result will have a dimension of `num_heads * head_size`. A projection layer (typically an `nn.Linear` layer) is needed to bring this back to the original embedding dimension `n_embed`, making the output compatible with other parts of the network that expect inputs of this dimension.

2. **Mixing Information Across Heads**: The outputs from different heads might focus on different aspects or parts of the input sequence, given that each head computes attention independently. A projection layer helps to mix or integrate these diverse representations into a unified output, potentially enhancing the representation power by combining various learned aspects.

3. **Maintaining Network Depth**: In the original transformer architecture proposed by Vaswani et al., the depth of the network (number of layers) is crucial for learning complex representations. The projection layer contributes to this depth, adding another linear transformation that the network can learn from.

### How to Implement This in Your `MultiHeadAttention` Class

Your current implementation of `MultiHeadAttention` concatenates outputs from each head but does not apply any projection afterward. Here’s how you might modify your class to include this important projection step:

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size, n_embed):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Projection layer to bring concatenated heads back to the dimension n_embed
        self.proj = nn.Linear(num_heads * head_size, n_embed)

    def forward(self, x):
        # Concatenate outputs from all attention heads
        head_outputs = torch.cat([h(x) for h in self.heads], dim=-1)
        # Project concatenated heads back to the original embedding size
        output = self.proj(head_outputs)
        return output
```

### Explanation:

- **`self.proj`**: This is an `nn.Linear` layer that projects the concatenated output of the attention heads back to the original embedding dimension (`n_embed`). It's essential to ensure that the dimensionality of the MultiHeadAttention output matches that of the input, especially in a typical encoder or decoder architecture where each layer expects inputs and outputs to be of consistent dimensions.

Adding this projection layer effectively allows the MultiHeadAttention to integrate the information gathered by different attention heads and refine it into a format usable by subsequent layers, adhering to the design principles of the transformer architecture.

## Dropout

Dropout is a regularization technique used in neural networks to prevent overfitting. The method temporarily drops units (along with their connections) from the network during training. This random omission of units during different training epochs forces the network to learn more robust features that are not reliant on any small set of neurons, promoting better generalization to unseen data.

### Main Benefits of Dropout:

1. **Prevention of Overfitting**: Dropout reduces the model's reliance on any individual neuron by randomly dropping out units during the training process. This helps in preventing the model from fitting too closely to the training data, which can lead to poor performance on new, unseen data (overfitting).

2. **Model Robustness**: By randomly removing neurons during training, dropout forces the network to develop redundant pathways for the same information, increasing its robustness and reducing the likelihood of developing fragile co-adaptations among neurons (where neurons overly depend on the specific presence of other neurons).

3. **Ensemble Effect**: Each training step with dropout can be seen as training a different model that shares weights with other models in the ensemble. At test time, using all neurons can be viewed as averaging the predictions of all these thinned models, akin to an ensemble method, which often results in better performance.

### Implementation in Transformers:

In transformer architectures, dropout is typically applied in several places:
- **After Attention and Before Residual Add**: Applying dropout to the output of the attention mechanism (or any sub-layer outputs) before adding the residual connection can help the model to not rely too heavily on specific paths or weights, promoting more robust learning.
- **On the Attention Weights**: Dropout can also be applied to the attention scores before they are used to create the weighted sum of values, helping to ensure that the model does not overfit to specific aspects of the data.
- **Feedforward Network**: Within the feedforward network of each transformer block, dropout is often applied to the output of the first linear transformation before the activation function.

### Is Applying Dropout Just Before Residual Rejoining Good?

Yes, applying dropout just before the tensors rejoin the residual pathway is indeed beneficial and a common practice in many state-of-the-art transformer models. Here's why:

- **Effective Regularization**: This placement ensures that the information passing through the network cannot rely on any single path for accurate predictions, encouraging the network to find alternative pathways and thus generalize better.
- **Maintains Information Flow**: Since the residual connection bypasses the dropout, essential information is not entirely lost even when dropout is applied. This setup helps in maintaining the flow of information through the network while still benefiting from the regularization effect of dropout.

### Example in Transformer Block:

Here’s how you might typically see dropout applied in a transformer block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    def __init__(self, n_embed, n_head, dropout_rate=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(n_head, n_embed)
        self.norm1 = nn.LayerNorm(n_embed)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.ffwd = FeedForward(n_embed)
        self.norm2 = nn.LayerNorm(n_embed)
        self.dropout2 = nn.Dropout(dropout_rate)

    def forward(self, x):
        # Multi-head attention part
        attn_output = self.attention(x)
        attn_output = self.dropout1(attn_output)
        x = x + attn_output
        x = self.norm1(x)

        # Feed-forward part
        ffwd_output = self.ffwd(x)
        ffwd_output = self.dropout2(ffwd_output)
        x = x + ffwd_output
        x = self.norm2(x)

        return x
```

In this structure, dropout is applied right after the attention outputs and the feedforward outputs but before they are added back to the main data flow (residual connection). This approach is optimal for ensuring that all parts of the model contribute to the final result without overfitting to the training data.

## Using `nn.Sequential`

Using `nn.Sequential` in MLP (Multi-Layer Perceptron) implementations is quite common in PyTorch, as it provides a convenient way to define a simple sequence of operations where the output of one module is the input to the next. Both the `MLP` class and the `FeedForward` class you've shown serve similar purposes but are structured differently—one using explicit layer calls in its forward method and the other encapsulating the layers in an `nn.Sequential` container. Here are the pros and cons of using `nn.Sequential` in such implementations:

### Using `nn.Sequential`
**Pros:**
1. **Simplicity**: The main advantage is the simplicity of setup and readability. `nn.Sequential` allows you to easily define a clear and concise pipeline of layers and operations where the output of one layer is automatically the input of the next layer. It reduces the boilerplate code required in the `forward` method.
2. **Compactness**: It compacts the model definition, making the model easier to read and maintain, especially for straightforward feed-forward architectures without branching or skip connections.
3. **Ease of Use**: It's straightforward to add or remove layers, which can be beneficial during experimentation and model tuning.

**Cons:**
1. **Lack of Flexibility**: `nn.Sequential` is limited to operations that fit a linear flow—each layer's output is directly passed as input to the next. This makes it unsuitable for more complex models where you might need to implement branches, merges, or residual connections easily.
2. **Direct Control**: You have less direct control over the intermediate outputs, which can be a drawback when you need to perform operations based on intermediate results or when conditions based on these are required.
3. **Customization**: Adding custom behavior (like dynamic layer behavior based on input properties) between layers is not straightforward without breaking the sequential chain.

### Manual Layer Definition in `forward` Method
**Pros:**
1. **Flexibility**: Defining each layer and its activation explicitly in the `forward` method, as in the `MLP` class, offers more control and flexibility. This allows for conditional operations, more complex manipulations of data between layers, and the integration of non-linear connectivity patterns (like skip connections).
2. **Debugging and Inspection**: It's easier to debug and inspect intermediate results, as you can add print statements or logging between layers.
3. **Customization**: Easier to integrate layers with custom behavior or conditional processing of data between layers.

**Cons:**
1. **Verbosity**: This method can be more verbose and complex, particularly for larger models. It requires manually handling each layer and operation, which can lead to more boilerplate code and increase the chance of errors.
2. **Maintainability**: More complex `forward` methods might be harder to maintain, especially when changes involve modifying several interconnected lines of code.

### Conclusion
Choosing between `nn.Sequential` and manual layer definition depends on the specific needs of your application and the complexity of the model:
- Use `nn.Sequential` when your model architecture is straightforward, and layers can be arranged in a simple linear sequence without conditional logic.
- Prefer manual definition in the `forward` method when your model requires more complex data flow control, branching, or when you need to perform specific operations between layers that `nn.Sequential` cannot handle directly.

For example, the `MLP` implementation you provided might be preferred in scenarios where the precise control of dropout application or integration of custom activations or processing steps is necessary. The `FeedForward` using `nn.Sequential` could be more appropriate for simpler, quick-to-deploy models where such granular control is unnecessary.

## Weight-tying

Weight-tying, also known as parameter sharing, is a technique used in machine learning to reduce the memory footprint of models and potentially improve their generalization performance. In the context of neural networks, and particularly transformers, weight-tying involves using the same weight matrix for multiple different parts of the model.

### Applications in Transformers

In transformers, weight-tying is most commonly used between the embedding layers and the final linear layer before the softmax in language modeling tasks. Here are some details on how it's applied:

1. **Embedding and Output Layer Tying**:
   - **Concept**: The same weight matrix that is used in the input embedding layer is transposed and used in the output layer, which predicts the next token based on the context provided by the transformer's decoder.
   - **Reasoning**: The embedding layer converts token indices into vectors. The output layer, on the other hand, converts the decoder's output vectors back into token logits (which are then passed through a softmax to predict probabilities). Since both layers deal with the same vocabulary space and the embedding dimension is the same as the transformer's hidden size, tying these weights can improve performance by reducing overfitting and enforcing a consistent representation of token semantics across the model.

2. **Efficiency and Regularization**:
   - **Reduced Parameters**: By tying weights, the model's total number of parameters decreases significantly, which can lead to less memory usage and often faster inference.
   - **Regularization Effect**: Sharing weights across different layers acts as a form of regularization. It can help the model learn more robust features, as the same weights must work well in both the embedding and the output contexts.

### Example Implementation

In PyTorch, implementing weight tying in a transformer model could look like this (assuming the model structure has an embedding layer named `embeddings` and an output projection layer named `output_projection`):

```python
import torch
import torch.nn as nn

class TransformerWithTiedWeights(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8)
        self.output_projection = nn.Linear(d_model, vocab_size)

        # Tie weights between embedding and output projection
        self.output_projection.weight = self.embeddings.weight

    def forward(self, input, target):
        embedded_input = self.embeddings(input)
        transformer_output = self.transformer(embedded_input)
        logits = self.output_projection(transformer_output)
        return logits

# Usage
model = TransformerWithTiedWeights(vocab_size=10000, d_model=512)
```

### Benefits and Considerations

- **Benefits**:
  - **Efficiency**: Reduces the number of trainable parameters, which can lead to faster training and lower memory consumption.
  - **Performance**: Can improve model performance, particularly in language tasks, by helping the model generalize better from its training data.
- **Considerations**:
  - **Design Constraints**: Tying weights imposes a constraint that the dimensions of the tied layers must match. This can limit some design choices regarding the model architecture.
  - **Application Specificity**: Weight tying is more beneficial in some applications (like NLP) than others. Its effectiveness can vary based on the task and the data.

Weight-tying is a powerful technique in model optimization, particularly for large-scale models like those used in NLP, where reducing the number of parameters without losing model capacity can significantly enhance computational efficiency and model robustness.

---

Weight-tying in the context of transformers, particularly between the token embedding layer and the output linear layer (often referred to as the "lm_head" or language model head), is an interesting technique used to reduce the number of parameters and to theoretically improve the learning efficiency and generalization of the model. Let's clarify how this works and address your concerns.

### How Weight-Tying Works in Transformers

In models like GPT (Generative Pre-trained Transformer), the embedding layer converts input token indices into vectors. These vectors are then processed through various transformer blocks. At the end of the transformer, an output layer (lm_head) converts the transformer's output vectors back into a vocabulary-sized vector for each token, where each element represents a score for that token being the next token in the sequence.

Here's where weight-tying comes into play:
- **Weight-Tying Between Layers**: The weight matrix in the lm_head (used for transforming the output of the last transformer layer back to the token vocabulary space) is set to be the same as the weight matrix in the embedding layer (used for mapping token indices to vectors). Mathematically, this means that `lm_head.weight` is set to `embedding_layer.weight.transpose()` or directly to `embedding_layer.weight` depending on the implementation and shape requirements.

### Implications and Effects

1. **Parameter Efficiency**: This method significantly reduces the model's total number of parameters since we effectively eliminate one large matrix of parameters from the model. For large vocabulary sizes, this can result in substantial savings in memory and computational requirements.

2. **Theoretical Justification**: By tying the weights, the model learns a shared representation for both embedding tokens and decoding them into predicted next tokens. This is thought to help the model learn more robust, generalizable features since the same weights must successfully perform both encoding input tokens and decoding output tokens.

### Addressing Your Concerns

- **Changing Token Embeddings**: You're correct in noting that the weights are shared and any updates to them during training will affect both the embeddings and the output projections. However, this isn't necessarily a downside. Since both tasks (embedding and predicting tokens) are closely related, learning a shared representation can be beneficial. The weight updates are driven by both the embedding role and the projection role, potentially leading to a more integrated understanding of the token space.

- **Different Layers, Same Weights**: It might seem odd that an `nn.Embedding` layer and an `nn.Linear` layer share weights because they are used in very different contexts within the model. However, fundamentally, both are performing linear transformations. An `nn.Embedding` can be thought of as a lookup table that is a special case of a linear transformation without multiplication operations, where indices directly select rows from the weight matrix. When you use these weights in a linear layer (`nn.Linear`), you're applying these weights in a matrix multiplication, which is a more general operation. The fact that these operations are related allows for the possibility of weight-tying without significant conceptual conflict.

### Practical Example

```python
import torch.nn as nn

class TiedTransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, d_model)
        self.transformer_blocks = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.lm_head.weight = self.embeddings.weight  # Tying the weights

    def forward(self, input_ids):
        x = self.embeddings(input_ids)
        x = self.transformer_blocks(x)
        logits = self.lm_head(x)
        return logits
```

In summary, weight-tying in transformers is a beneficial technique that can reduce the model's complexity and improve parameter efficiency without necessarily compromising performance. This strategy leverages the fundamental similarity between embedding inputs and projecting transformer outputs to make learning more efficient.

## `print(model)`

In PyTorch, when you call `print(model)`, it outputs a formatted string representation of your model, showing all its components, including layers, submodules, and other elements defined within the model. This functionality is indeed part of PyTorch and is typically used to give developers a clear, hierarchical view of the model's architecture.

### How `print(model)` Works in PyTorch

1. **Hierarchy Representation**: `print(model)` leverages the nested structure of PyTorch modules. Each module (`nn.Module`) in PyTorch maintains a list of its sub-modules that are automatically registered when you assign an `nn.Module` as an attribute of another `nn.Module`.

2. **Module Registration**: For something to appear in the `print(model)` output, it must be a subclass of `nn.Module` and must be registered as a submodule or a field of another module. This registration happens automatically when you assign an instantiated `nn.Module` to an attribute of another `nn.Module` in its `__init__` method.

3. **`__repr__` Method**: Under the hood, `print(model)` calls the `__repr__` method of the `nn.Module` class, which is designed to recursively fetch and format the string representations of all registered sub-modules.

### Qualifications for Appearing in Model Printout

- **Must Be a Subclass of `nn.Module`**: Any component that should appear in the model structure printout must be an instance of a subclass of `nn.Module`. This is because only `nn.Module` instances can be registered as submodules.
  
- **Must Be Registered as a Submodule**: When constructing a model, any layers or components that are instances of `nn.Module` must be assigned as attributes of the class. For example:
  ```python
  self.layer = nn.Linear(in_features, out_features)
  ```
  This code snippet not only creates a linear layer but also registers it as a submodule of the model, ensuring it appears in the `print(model)` output.

- **Does Not Necessarily Need to Be Part of the `nn` Library**: While most commonly used layers and functions are part of the `torch.nn` library, you can create your own custom modules by subclassing `nn.Module`. As long as these custom modules follow the two rules above, they will appear in the model printout.

### Example with Custom Module

Here’s a brief example demonstrating how custom modules are included in the model's printout:

```python
import torch
import torch.nn as nn

class CustomLayer(nn.Module):
    def __init__(self, features):
        super().__init__()
        self.linear = nn.Linear(features, features)

    def forward(self, x):
        return torch.relu(self.linear(x))

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = CustomLayer(10)
        self.layer2 = nn.Linear(10, 5)

    def forward(self, x):
        x = self.layer1(x)
        return self.layer2(x)

model = MyModel()
print(model)
```

Output:
```
MyModel(
  (layer1): CustomLayer(
    (linear): Linear(in_features=10, out_features=10, bias=True)
  )
  (layer2): Linear(in_features=10, out_features=5, bias=True)
)
```

In this example, `CustomLayer` appears in the printout alongside the standard `nn.Linear` layer because it is a subclass of `nn.Module` and is registered as a submodule of `MyModel`. This explains how custom functionalities are seamlessly integrated into PyTorch’s modular design, allowing for complex and highly customized architectures while maintaining a clear and understandable module hierarchy.

## `nn.Module`

Being a subclass of `nn.Module` in PyTorch provides a structured way to encapsulate network layers and their associated computations. This design not only makes it easier to build and manage complex models but also ensures that these models are extensible, maintainable, and easily integrable within the PyTorch ecosystem. Here are the main benefits and functionalities that inheriting from `nn.Module` provides:

### 1. **Parameter Management**

One of the key benefits of making a class a subclass of `nn.Module` is the automatic management of parameters. `nn.Module` automatically tracks all fields that are instances of `nn.Parameter` or `nn.Module`, allowing:
- **Automatic Parameter Registration**: When you define a model's components (e.g., layers) as attributes of a subclass of `nn.Module`, PyTorch automatically registers these components' parameters.
- **Easy Access to All Parameters**: You can easily access all parameters of a model for purposes like feeding them into an optimizer using `model.parameters()` or `model.named_parameters()`.

### 2. **Device Management**

PyTorch models can seamlessly move computations between different devices (CPUs, GPUs) with simple method calls:
- **`.to(device)`**: Easily move all model parameters to a specified device (CPU, GPU, or other accelerators), helping manage hardware resources efficiently.
- **`.cuda()` or `.cpu()`**: Specific methods to move parameters to GPU or CPU, respectively.

### 3. **Forward Pass Definition**

The `forward()` method:
- **Central Computing Logic**: `nn.Module` requires you to define a `forward()` method in your subclass, which is automatically called when you execute `model(input)`. This method specifies the computations performed by the module on the input data, effectively defining the model's forward pass.
- **Clarity and Intuitiveness**: By encapsulating the forward pass logic in the `forward()` method, `nn.Module` makes the model's operation clear and easy to understand and debug.

### 4. **Integration with PyTorch Ecosystem**

Subclassing from `nn.Module` ensures full compatibility with other PyTorch features:
- **Optimizers**: Direct integration with PyTorch optimizers, which require a list of parameters (`model.parameters()`) to optimize.
- **Serialization**: Support for saving and loading models using PyTorch’s serialization APIs (`torch.save`, `torch.load`).
- **Hooks**: Ability to register forward and backward hooks for debugging and inspecting the flow of tensors through the network.

### 5. **Modularity and Reusability**

- **Encapsulation**: Encapsulating layers and operations within modules makes complex models more organized and modular, promoting reusability and maintainability.
- **Composition**: Models can be composed of other smaller models or layers, all as `nn.Module` subclasses, promoting code reuse and simplifying complex architectures.

### 6. **Customization and Extensibility**

- **Custom Layers and Functions**: Users can define their own custom layers, loss functions, or even complex dynamic behaviors within the `nn.Module` framework, giving extensive flexibility to innovate beyond standard neural network layers.

### Example: Custom Module and Utilization

```python
import torch
import torch.nn as nn

class CustomLayer(nn.Module):
    def __init__(self, input_features, output_features):
        super().__init__()
        self.linear = nn.Linear(input_features, output_features)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.linear(x))

# Usage in a model
class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = CustomLayer(10, 50)
        self.layer2 = CustomLayer(50, 10)

    def forward(self, x):
        x = self.layer1(x)
        return self.layer2(x)

model = MyNetwork()
print(model)
# Output the structure, demonstrating encapsulation and ease of understanding
```

In summary, being a subclass of `nn.Module` is fundamental in PyTorch for defining neural networks, as it provides systematic management of components, parameters, and their interactions, alongside integration into the broader ecosystem for training, evaluation, and deployment.

## The autograd system

In PyTorch, the autograd system is a fundamental component that underlies the training process, but it is not specifically part of the loss function or `nn.Module`. Rather, it's a core feature of PyTorch that provides automatic differentiation capabilities across all tensor operations in the framework. This system tracks operations performed on tensors, allowing for the automatic computation of gradients when performing backpropagation.

### Autograd System Overview

The autograd system is built around the concept of the computational graph. Operations performed on tensors create a graph of function nodes that link together the operations that created new tensors. This graph is used to compute derivatives (gradients) in reverse order by tracing the graph from the output back to the inputs.

### Role of Autograd in Different Components

1. **`nn.Module`**:
   - **Definition and Structure**: The `nn.Module` class is a building block for creating neural network layers and models. It primarily manages parameters, sub-modules, and the forward computations.
   - **Gradient Tracking**: While `nn.Module` itself doesn't handle gradients, any operations performed in its `forward` method involving tensors with `requires_grad=True` are tracked by the autograd system. This tracking is what enables gradients to be automatically computed during backpropagation.

2. **Loss Functions**:
   - **Gradient Computation**: Loss functions, whether they are part of `torch.nn` or custom functions, typically return a scalar tensor representing the loss. This tensor is the point where the backward pass is usually initiated. The autograd system uses this scalar value to compute the gradients of parameters with respect to the loss by traversing the computational graph.
   - **Interface with Autograd**: Loss functions are crucial in defining how outputs of the model relate to the expected outcomes (targets), but they rely on the autograd system to handle the differentiation and gradient calculations.

### How Autograd Interfaces with Training Components

Here’s a simple illustration of how autograd interacts with `nn.Module` and loss functions during training:

```python
import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create a model instance and define a loss function and optimizer
model = SimpleModel()
loss_function = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Sample data
input_tensor = torch.tensor([[1.0]], requires_grad=True)
target_tensor = torch.tensor([[2.0]])

# Forward pass
output = model(input_tensor)
loss = loss_function(output, target_tensor)

# Backward pass
loss.backward()  # Autograd computes gradients here

# Optimizer uses the gradients to update model parameters
optimizer.step()
```

In this example:
- The `forward` pass of `model` calculates the output using `nn.Module`.
- The `loss_function` computes the loss, which is a scalar tensor.
- Calling `loss.backward()` triggers the autograd system to compute gradients for all tensors with `requires_grad=True` by tracing backwards through the computational graph from the loss tensor.
- The optimizer then updates the model's parameters using the gradients computed by autograd.

### Summary

The autograd system is a backbone of PyTorch's training process, enabling automatic differentiation across tensor operations. It is not a part of `nn.Module` or loss functions but rather operates across all computations involving tensors that require gradients. Its ability to automatically compute gradients by tracing tensor operations makes it indispensable for efficiently training neural network models.

In [47]:
import torch
import torch.nn.functional as F

# Define a 2x3 matrix
A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Apply softmax with dim=1
softmax_A_dim1 = F.softmax(A, dim=1)
print("Softmax over dim=1:\n", softmax_A_dim1)

# Apply softmax with dim=0
softmax_A_dim0 = F.softmax(A, dim=0)
print("Softmax over dim=0:\n", softmax_A_dim0)


Softmax over dim=1:
 tensor([[0.0900, 0.2447, 0.6652],
        [0.0900, 0.2447, 0.6652]])
Softmax over dim=0:
 tensor([[0.0474, 0.0474, 0.0474],
        [0.9526, 0.9526, 0.9526]])
