## Description

In this section we create a PyTorch model for training.
The model is deliberately simple: a single-head self-attention layer.


In [5]:
import torch
import torch.nn as nn
import math


class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model

        # Linear projections for query, key, and value; input and output both have dimension d_model
        self.w_query = nn.Linear(d_model, d_model)
        self.w_key = nn.Linear(d_model, d_model)
        self.w_value = nn.Linear(d_model, d_model)
        
        
    def forward(self, x):
        # x is the input tensor
        # x.shape = (seq_len, d_model) - no batch dimension
        query = self.w_query(x)
        key = self.w_key(x)
        value = self.w_value(x)
        
        # Compute attention scores: (seq_len, seq_len)
        attention_weights = torch.softmax(query @ key.T / math.sqrt(self.d_model), dim=-1)
        
        # Apply attention to values: (seq_len, d_model)
        out = attention_weights @ value
        return out
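
To make the shapes in `forward` concrete, here is a minimal sketch of the same scaled dot-product computation on plain tensors. The random `query`, `key`, and `value` tensors stand in for the outputs of the linear projections, and the small sizes are assumptions chosen for readability.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model = 3, 4  # illustrative sizes, not the ones used for training

# Stand-ins for the projected query, key, and value: (seq_len, d_model)
query = torch.randn(seq_len, d_model)
key = torch.randn(seq_len, d_model)
value = torch.randn(seq_len, d_model)

# Scaled dot-product scores: (seq_len, seq_len)
scores = query @ key.T / math.sqrt(d_model)

# Softmax over the last dimension turns each row into weights that sum to 1
weights = torch.softmax(scores, dim=-1)

# Weighted sum of the values: (seq_len, d_model)
out = weights @ value

print(weights.shape)        # torch.Size([3, 3])
print(out.shape)            # torch.Size([3, 4])
print(weights.sum(dim=-1))  # each entry is 1 up to floating-point error
```

Each row of `weights` says how much one position attends to every other position, which is why the output keeps the shape of the input.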
    


## Explanation

Assume the input is ["I", "Love", "LLM"].

Each word is tokenized and mapped to an embedding vector of size **embed_dim**. In this case seq_len = 3, so the input has shape (3, embed_dim).
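
As a rough illustration of that step, the sketch below maps the three words to integer ids and looks them up in an `nn.Embedding` table. The toy `vocab` dictionary is a hypothetical stand-in for a real tokenizer, and the embedding weights are randomly initialized rather than trained.

```python
import torch
import torch.nn as nn

words = ["I", "Love", "LLM"]

# Hypothetical toy vocabulary: one integer id per word
vocab = {word: idx for idx, word in enumerate(words)}
token_ids = torch.tensor([vocab[w] for w in words])  # shape: (3,)

embed_dim = 32
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_dim)

# Look up one vector per token: (seq_len, embed_dim)
x = embedding(token_ids)
print(x.shape)  # torch.Size([3, 32])
```

A tensor of this shape is what the `SelfAttention` module expects as input. The training code in this notebook skips the lookup and uses random tensors directly.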

The training cell at the end of this section implements the training logic.

The loss function we pick is MSE. The output is a vector of continuous values, so this is a regression problem, and MSE is a natural fit.
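
For reference, here is a small sketch of what `nn.MSELoss` computes with its default settings: the mean of the squared element-wise differences. The tensors are random placeholders, not the model's real output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(3, 4)  # placeholder for a model output
target = torch.randn(3, 4)  # placeholder target of the same shape

# Built-in loss (default reduction is "mean")
loss = nn.MSELoss()(output, target)

# The same quantity written out by hand
manual = ((output - target) ** 2).mean()

print(loss.shape)                    # torch.Size([]), i.e. a scalar
print(torch.allclose(loss, manual))  # True
```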

## General practice

To train the model, we follow a few steps:

1. Prepare a **model**, an **optimizer**, and a **loss function**. The model's parameters are passed to the optimizer.
2. Loop over a fixed number of **epochs**. This is a hyperparameter; here the value is hard-coded.
3. For each epoch:
    a. Clear the gradients with optimizer.zero_grad(), so that gradients from the previous epoch do not carry over into this one.
    b. Get the output from the model.
    c. Compute the loss from the output and the target.
    d. Run back-propagation by calling loss.backward(). This applies the chain rule to propagate gradients from the loss back to the first layer.
    e. Call optimizer.step() to update the model's parameters.
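
Step 3a matters because PyTorch adds new gradients to whatever is already stored in `.grad`. The sketch below shows this on a single scalar parameter `w`, a made-up example that is not part of the model.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
loss = w ** 2
loss.backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added, 4 + 4 = 8
loss = w ** 2
loss.backward()
second = w.grad.clone()

# Clear the gradient, then run backward again: back to 4
w.grad.zero_()
loss = w ** 2
loss.backward()
third = w.grad.clone()

print(first.item(), second.item(), third.item())  # 4.0 8.0 4.0
```

`optimizer.zero_grad()` performs the same reset for every parameter the optimizer manages.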

## Dive Deep

1. Why is backward() called on the loss instead of the model?

This follows from how PyTorch's autograd works.

The **model** holds the layers, activations, and parameters.
The **loss** is a scalar. After MSE averages over all elements, the result is a single number, similar to torch.tensor(1.0).

Putting the two together: the forward pass through the **model** builds a computation graph, and the loss is the final node of that graph. That makes it the starting point for back-propagation.
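
The sketch below illustrates this with a single `nn.Linear` layer standing in for the model (an assumption made to keep the example short). The loss is a zero-dimensional tensor attached to the graph, and calling `backward()` on the non-scalar output without arguments raises an error, while calling it on the loss fills in the parameter gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)  # stand-in for a full model
x = torch.randn(3, 4)
target = torch.randn(3, 4)

output = layer(x)                          # non-scalar: shape (3, 4)
loss = nn.MSELoss()(output, target)        # scalar: shape ()

print(loss.dim())                # 0
print(loss.grad_fn is not None)  # True: the loss is a node in the graph

# backward() with no arguments only works on scalar tensors
try:
    output.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# Starting from the scalar loss, gradients reach the layer's parameters
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([4, 4])
```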

The next section walks through the chain rule for a 3-layer chain.

## Example: Simple 3-layer chain

Suppose:

$$
\text{loss} = f(z), \quad z = g(y), \quad y = h(\theta)
$$

To find:

$$
\frac{d\, \text{loss}}{d\theta}
$$

We apply the chain rule:

$$
\frac{d\, \text{loss}}{d\theta} = \frac{d\, \text{loss}}{dz} \times \frac{dz}{dy} \times \frac{dy}{d\theta}
$$
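
To check the formula numerically, the sketch below picks concrete functions for the chain and compares the hand-computed product with the gradient from autograd. The choices h(θ) = 3θ, g(y) = y², and f(z) = 2z + 1 are arbitrary assumptions for illustration.

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)

y = 3 * theta     # y = h(theta),  dy/dtheta = 3
z = y ** 2        # z = g(y),      dz/dy = 2y
loss = 2 * z + 1  # loss = f(z),   dloss/dz = 2

loss.backward()

# Chain rule by hand: dloss/dz * dz/dy * dy/dtheta = 2 * (2 * 6) * 3
manual = 2.0 * (2 * y.item()) * 3.0

print(theta.grad.item())  # 72.0
print(manual)             # 72.0
```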


## Question

1. In what case do we not need to clear the gradients?

In [8]:
torch.manual_seed(42)
embed_dim = 32
seq_len = 8

# No batching: Create 2D input tensor (seq_len, d_model)
x = torch.randn(seq_len, embed_dim)
target = torch.randn(seq_len, embed_dim)

model = SelfAttention(embed_dim)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Epoch 0, Loss: 0.9530
Epoch 10, Loss: 0.8593
Epoch 20, Loss: 0.7912
Epoch 30, Loss: 0.7181
Epoch 40, Loss: 0.6358
Epoch 50, Loss: 0.5504
Epoch 60, Loss: 0.4689
Epoch 70, Loss: 0.3945
Epoch 80, Loss: 0.3288
Epoch 90, Loss: 0.2758
Epoch 100, Loss: 0.2360
Epoch 110, Loss: 0.2062
Epoch 120, Loss: 0.1824
Epoch 130, Loss: 0.1621
Epoch 140, Loss: 0.1441
Epoch 150, Loss: 0.1277
Epoch 160, Loss: 0.1127
Epoch 170, Loss: 0.0989
Epoch 180, Loss: 0.0863
Epoch 190, Loss: 0.0742
