# Chapter 1 Exercises - Continued
Cleaning up from where we left off in part 1, we will now implement the multi-head attention layer. We will also implement the positional encoding layer, which is used to add positional information to the input embeddings.


## Input embeddings

### Standard embeddings
The standard input embeddings are defined as follows:
$$
\text{Embed}(x) = xW^E
$$
where $x \in \mathbb{R}^{n \times d_{\text{in}}}$ is the input, $W^E \in \mathbb{R}^{d_{\text{in}} \times d_{\text{model}}}$ is a learned linear projection, and $d_{\text{in}}$ is the input dimension.



### Rotationally invariant embeddings
The additional constraint that the input embeddings are rotationally invariant is that for any orthogonal matrix $R \in \mathbb{R}^{d_{\text{in}} \times d_{\text{in}}}$, we have
$$
\text{Embed}(x) = \text{Embed}(xR)
$$
for any $x$.

This constraint requires that we modify the input embeddings. We will use the following input embeddings instead:
$$
\dots
$$


## Positional encoding

### Standard positional encoding
The standard positional encoding layer is defined as follows:
$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$
where $pos$ is the position and $i$ is the dimension. The positional encoding is added to the input embeddings.

### Rotationally invariant positional encoding
The additional constraint that the positional encoding layer is rotationally invariant is that for any orthogonal matrix $R \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$, we have
$$
\text{PE}(x) = \text{PE}(Rx)
$$
for any $x$.

This constraint requires that we modify the positional encoding layer. We will use the following positional encoding layer instead:
$$
\dots
$$


### Exercise
Prove that the standard positional encoding layer is not rotationally invariant.

### Exercise
Implement the standard positional encoding layer.

### Exercise
Prove that our new positional encoding layer is rotationally invariant.

### Exercise
Implement the rotationally invariant positional encoding layer.



## Rotationally invariant multi-head attention
The multi-head attention layer is defined as follows:
$$
\begin{align}
\text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \\
\text{head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
\end{align}
$$
where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ are learned linear projections, and $h$ is the number of heads. The attention function is defined as:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$.

The condition that the attention function is rotationally invariant is that for any orthogonal matrix $R \in \mathbb{R}^{d_k \times d_k}$, we have
$$
\text{Attention}(Q, K, V) = \text{Attention}(QR, KR, VR)
$$
for any $Q$, $K$, and $V$.

### Exercise
Prove that the standard dot-product attention function is not rotationally invariant.


### Exercise
Implement the multi-head attention layer. You may use the `torch.einsum` function to compute the attention function. You may also use the `torch.nn.Linear` module to implement the linear projections.


### Exercise
Prove that our multi-head attention layer is rotationally invariant.

## Transformer
The Rotationally Invariant Transformer (RIT) is defined as follows:
$$
\begin{align}
\text{RIT}(x) &= \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \\
\text{RIT}(x) &= \text{LayerNorm}(x + \text{FFN}(x))
\end{align}
$$
where the multi-head attention layer is defined above, and the feed-forward network (FFN) is defined as follows:
$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$
where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, $b_1 \in \mathbb{R}^{d_{\text{ff}}}$, $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, and $b_2 \in \mathbb{R}^{d_{\text{model}}}$ are learned parameters.

### Exercise
Implement the RIT layer. You may use the `torch.nn.LayerNorm` module to implement the layer normalization, and the `torch.nn.Linear` module to implement the linear projections.





### Exercise: Final model
Train a completely custom language model that is built using rotationally invariant principles.


# Exercise: Final model

### Imports

In [1]:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from typing import Optional, Tuple, List, Any, Generator, Union, Dict, NamedTuple

from nanoGPT.model import GPTConfig, GPT, MLP

import torch

import platform
print(f"OS: {platform.system()} {platform.release()}")
print(f"Python: {platform.python_version()}")

print(f"PyTorch: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader


import io
import zipfile
import requests

OS: Windows 10
Python: 3.9.16
PyTorch: 2.0.1
CUDA version: 11.7


In [2]:
# Download the dataset as a zip file
url = 'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip'
response = requests.get(url)
zip_content = response.content

# Extract the dataset from the zip file
with zipfile.ZipFile(io.BytesIO(zip_content), 'r') as zip_ref:
    zip_ref.extractall('wikitext-2')

# Load data from files
with open('wikitext-2/wikitext-2/wiki.train.tokens', 'r', encoding='utf-8') as f:
    train_data_raw = f.read()
with open('wikitext-2/wikitext-2/wiki.valid.tokens', 'r', encoding='utf-8') as f:
    valid_data_raw = f.read()
with open('wikitext-2/wikitext-2/wiki.test.tokens', 'r', encoding='utf-8') as f:
    test_data_raw = f.read()


In [3]:
# Inspect the data
print(train_data_raw[:100])
print(valid_data_raw[:100])
print(test_data_raw[:100])

# Generate some statistics about the data
print(f"Number of training tokens: {len(train_data_raw.split())}")
print(f"Number of validation tokens: {len(valid_data_raw.split())}")
print(f"Number of test tokens: {len(test_data_raw.split())}")

# Generate some statistics about the vocabulary
x_train = set(train_data_raw.split())
x_val = set(valid_data_raw.split())
x_test = set(test_data_raw.split())
print(f"Number of unique tokens in training set: {len(x_train)}")
print(f"Number of unique tokens in validation set: {len(x_val)}")
print(f"Number of unique tokens in test set: {len(x_test)}")


 
 = Valkyria Chronicles III = 
 
 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 ,
 
 = Homarus gammarus = 
 
 Homarus gammarus , known as the European lobster or common lobster , is 
 
 = Robert <unk> = 
 
 Robert <unk> is an English film , television and theatre actor . He had a gu
Number of training tokens: 2051910
Number of validation tokens: 213886
Number of test tokens: 241211
Number of unique tokens in training set: 33277
Number of unique tokens in validation set: 13776
Number of unique tokens in test set: 14142


In [4]:
# Tokenize the data
tokenizer = get_tokenizer('basic_english')
train_iter = tokenizer(train_data_raw)
valid_iter = tokenizer(valid_data_raw)
test_iter = tokenizer(test_data_raw)

# Create the vocabulary
vocab = build_vocab_from_iterator(train_iter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
vocab.set_default_index(vocab['<unk>'])

# Preprocess the data
#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = torch.device("cpu")
def preprocess_data(data_iter, vocab, device):
    try:
        tensor_data = torch.tensor([vocab[token] for token in data_iter], dtype=torch.long, device=device)
    except RuntimeError as e:
        if 'device-side assert triggered' in str(e):
            # Find the problematic token
            for token in data_iter:
                if token not in vocab:
                    print(f"Adding '{token}' to vocab")
                    vocab[token] = len(vocab)
            raise e
        else:
            raise e

    # Insert <bos> and <eos> tokens to the beginning and end of the sequence
    yield torch.cat([torch.tensor([vocab['<bos>']], device=device), tensor_data, torch.tensor([vocab['<eos>']], device=device)]).long()

train_data = list(preprocess_data(train_iter, vocab, device))
valid_data = list(preprocess_data(valid_iter, vocab, device))
test_data = list(preprocess_data(test_iter, vocab, device))

def batchify(data, batch_size):
    total_seq_len = sum([len(x) for x in data])
    num_batch_elements = total_seq_len // batch_size

    # Concatenate and reshape data into batch_size columns
    batched_data = torch.cat(data).to(device=device)
    batched_data = batched_data.narrow(0, 0, num_batch_elements * batch_size)
    batched_data = batched_data.view(batch_size, -1).t().contiguous()
    return batched_data

batch_size = 64
eval_batch_size = 128

train_loader = DataLoader(batchify(train_data, batch_size), batch_size=batch_size)
valid_loader = DataLoader(batchify(valid_data, eval_batch_size), batch_size=eval_batch_size)
test_loader = DataLoader(batchify(test_data, eval_batch_size), batch_size=eval_batch_size)

In [None]:
# Define the models

# Standard GPT
gpt = GPT(
    GPTConfig(
        vocab_size=len(vocab),
        block_size=128,
        n_layer=8,
        n_head=8,
        n_embd=512
    )
).to(device)


class RotationInvariantMultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def euclidean_distance(self, x, y):
        return torch.sum((x - y) ** 2, dim=-1)

    def forward(self, x):
        batch_size = x.size(0)

        query = self.query_linear(x)
        key = self.key_linear(x)
        value = self.value_linear(x)

        query = query.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2).unsqueeze(3)
        key = key.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2).unsqueeze(2)
        value = value.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2).unsqueeze(2)

        scores = self.euclidean_distance(query, key)
        scores = - torch.sqrt(scores)

        weights = F.softmax(scores, dim=-1)

        print(f"query shape: {query.shape}")  # Add this line
        print(f"key shape: {key.shape}")  # Add this line
        print(f"value shape: {value.shape}")  # Add this line
        print(f"weights shape: {weights.shape}")  # Add this line

        attention = torch.einsum("bnqd,bnqd->bnqd", weights, value).contiguous()

        print(f"attention shape: {attention.shape}")  # Add this line

        attention = attention.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.out_linear(attention)

class RotationInvariantTransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.multi_head_attention = RotationInvariantMultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(0.1)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        # Self-Attnetion
        attn_out = self.multi_head_attention(x)
        x = self.norm1(attn_out + x)
        x = self.dropout(x)

        # Position-wise Feedforward
        ff_out = self.feed_forward(x)
        x = self.norm2(ff_out + x)
        x = self.dropout(x)

        return x

class RotationInvariantTransformer(nn.Module):
    def __init__(self, input_dim, d_model, num_heads, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, d_model)
        self.layers = nn.ModuleList([RotationInvariantTransformerLayer(d_model, num_heads) for _ in range(num_layers)])

    def forward(self, x):
        x = self.embedding(x.long())
        for layer in self.layers:
            x = layer(x)
        return x

class RotationallyInvariantLanguageModel(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, seq_len):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = RotationInvariantTransformer(input_dim=d_model, d_model=d_model, num_heads=num_heads, num_layers=num_layers)
        self.fc = nn.Linear(d_model, vocab_size)
        self.seq_len = seq_len

    def forward(self, x, y=None):
        print(f"x dimensions in LanguageModel: {x.shape}")
        print("Min index:", x.min().item(), "Max index:", x.max().item())
        embedded = self.embedding(x.long())  # Add .long() here
        transformer_out = self.transformer(embedded.view(-1, d_model))
        logits = self.fc(transformer_out)

        if y is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss


vocab_size = len(vocab)
d_model = 512
num_heads = 8
num_layers = 6
seq_len = 35

rigpt = RotationallyInvariantLanguageModel(vocab_size, d_model, num_heads, num_layers, seq_len).to(device)
def train(model, iterator, optimizer, criterion):
    model.train()
    running_loss = 0

    for batch in iterator:
        optimizer.zero_grad()
        input_data = batch[:-1, :]
        target_data = batch[1:, :]
        output, loss = model(input_data, target_data)
        #print("Output type:", type(output))
        #print("Output:", output)
        running_loss += loss.item()
        loss.backward()
        optimizer.step()

    return running_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    running_loss = 0

    with torch.no_grad():
        for batch in iterator:
            input_data = batch[:-1, :]
            target_data = batch[1:, :]
            output = model(input_data)
            loss = criterion(output.view(-1, output.shape[-1]), target_data.view(-1))
            running_loss += loss.item()

    return running_loss / len(iterator)

epochs = 10
lr = 0.001

gpt_optimizer = optim.Adam(gpt.parameters(), lr=lr)
rigpt_optimizer = optim.Adam(rigpt.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

for epoch in range(1, epochs + 1):
    gpt_train_loss = train(gpt, train_loader, gpt_optimizer, criterion)
    ri_train_loss = train(rigpt, train_loader, rigpt_optimizer, criterion)
    gpt_valid_loss = evaluate(gpt, valid_loader, criterion)
    ri_valid_loss = evaluate(rigpt, valid_loader, criterion)
    print(f'Epoch: {epoch}, Train loss: {gpt_train_loss:.3f}, Validation loss: {gpt_valid_loss:.3f}')
    print(f'Epoch: {epoch}, Train loss: {ri_train_loss:.3f}, Validation loss: {ri_valid_loss:.3f}')


def generate(model, start_text, generate_len=30, temperature=0.8):
    model.eval()
    input_data = torch.tensor([vocab[token] for token in tokenizer(start_text)], dtype=torch.long, device=device).unsqueeze(1)

    hidden = None
    generated_text = start_text

    with torch.no_grad():
        for _ in range(generate_len):
            output = model(input_data)
            output = output / temperature
            output = torch.exp(output)
            probs = output[-1, :].squeeze().cpu()

            # Sample from the output distribution
            next_token_idx = torch.multinomial(probs, 1).item()
            next_token = vocab.itos[next_token_idx]

            # Append the generated token to the existing sequence and update input data
            generated_text += " " + next_token
            input_data = torch.cat([input_data, torch.tensor([[next_token_idx]], device=device)], dim=0)

    return generated_text

print(generate(gpt, "The history of", generate_len=30))
print(generate(rigpt_optimizer, "The history of", generate_len=30))

number of parameters: 25.34M
