# Chapter 3

__Figure 3.1 A mental model of the three main stages of coding an LLM, pretraining the LLM on a general text dataset, and finetuning it on a labeled dataset. This chapter focuses on attention mechanisms, which are an integral part of an LLM architecture.__

![LLM mental model](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image001.png)

__Figure 3.2 The figure depicts different attention mechanisms we will code in this chapter, starting with a simplified version of self-attention before adding the trainable weights. The causal attention mechanism adds a mask to self-attention that allows the LLM to generate one word at a time. Finally, multi-head attention organizes the attention mechanism into multiple heads, allowing the model to capture various aspects of the input data in parallel.__

![different attention mechanisms](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image003.png)

## 3.1 The problem with modeling long sequences

__Figure 3.3 When translating text from one language to another, such as German to English, it's not possible to merely translate word by word. Instead, the translation process requires contextual understanding and grammar alignment.__


![problem of word by word translation](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image005.png)

__Figure 3.4 Before the advent of transformer models, encoder-decoder RNNs were a popular choice for machine translation. The encoder takes a sequence of tokens from the source language as input, where a hidden state (an intermediate neural network layer) of the encoder encodes a compressed representation of the entire input sequence. Then, the decoder uses its current hidden state to begin the translation, token by token.__

![RNNs use hidden state to encode the entire sequence of tokens](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image007.png)
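To make the bottleneck described in Figure 3.4 concrete, here is a minimal sketch of an encoder loop. It assumes a simple Elman-style recurrence with random, untrained weights; the names `W_xh` and `W_hh` and all sizes are hypothetical and not taken from the chapter. The point is only that the entire input sequence ends up squeezed into one fixed-size `hidden` vector.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
embed_dim, hidden_dim, seq_len = 3, 4, 6

# Random (untrained) parameters of a minimal Elman-style RNN cell
W_xh = torch.randn(embed_dim, hidden_dim) * 0.5
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.5

# Stand-in for the embedded source-language tokens
source_tokens = torch.rand(seq_len, embed_dim)

hidden = torch.zeros(hidden_dim)
for x_t in source_tokens:
    # Each step overwrites the hidden state with a mix of the new token
    # and the previous state, so earlier tokens are only kept implicitly
    hidden = torch.tanh(x_t @ W_xh + hidden @ W_hh)

# The decoder would start from this single vector, whatever the input length
print(hidden.shape)
```

Whether the source sentence has 6 tokens or 600, the decoder in this setup receives a vector of the same size, which is the limitation attention mechanisms address.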

## 3.2 Capturing data dependencies with attention mechanisms

__Figure 3.5 Using an attention mechanism, the text-generating decoder part of the network can access all input tokens selectively. This means that some input tokens are more important than others for generating a given output token. The importance is determined by the so-called attention weights, which we will compute later. Note that this figure shows the general idea behind attention and does not depict the exact implementation of the Bahdanau mechanism, which is an RNN method outside this book's scope.__

![attention mechanism](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image009.png)

__Figure 3.6 Self-attention is a mechanism in transformers that is used to compute more efficient input representations by allowing each position in a sequence to interact with and weigh the importance of all other positions within the same sequence. In this chapter, we will code this self-attention mechanism from the ground up before we code the remaining parts of the GPT-like LLM in the following chapter.__

![self-attention mechanism](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image011.png)

## 3.3 Attending to different parts of the input with self-attention

### 3.3.1 A simple self-attention mechanism without trainable weights

__Figure 3.7 The goal of self-attention is to compute a context vector, for each input element, that combines information from all other input elements. In the example depicted in this figure, we compute the context vector z(2). The importance or contribution of each input element for computing z(2) is determined by the attention weights α21 to α2T. When computing z(2), the attention weights are calculated with respect to input element x(2) and all other inputs. The exact computation of these attention weights is discussed later in this section.__

![context vector](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image013.png)

Computing the dot product in plain Python, without PyTorch:

In [2]:
x1 = [0.4, 0.1, 0.8]
x2 = [0.5, 0.8, 0.6]

# calculate the dot product of x1 and x2
dot_product = sum(a * b for a, b in zip(x1, x2))

print("The dot product of x1 and x2 is:", dot_product)

The dot product of x1 and x2 is: 0.76


The same dot product using PyTorch:

In [9]:
import torch

torch.dot(torch.tensor(x1), torch.tensor(x2))

tensor(0.7600)

In [1]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

__Figure 3.8 The overall goal of this section is to illustrate the computation of the context vector z(2) using the second input element, x(2), as a query. This figure shows the first intermediate step, computing the attention scores ω between the query x(2) and all other input elements as a dot product. (Note that the numbers in the figure are truncated to one digit after the decimal point to reduce visual clutter.)__

![self-attention scores](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image015.png)

In [7]:
print(inputs.shape)
print(inputs.shape[0])
print(inputs.shape[1])

torch.Size([6, 3])
6
3


In [3]:
query = inputs[1]

attn_scores_2 = torch.empty(inputs.shape[0])

for i, x_i in enumerate(inputs):
  attn_scores_2[i] = torch.dot(x_i, query)

print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
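The loop over the input rows can also be written as one matrix-vector product. The sketch below rebuilds the same `inputs` tensor so that it runs on its own; the names `scores_loop` and `scores_matmul` are introduced here only for the comparison.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]  # 2nd input token is the query

# Explicit version: one dot product per input token
scores_loop = torch.stack([torch.dot(x_i, query) for x_i in inputs])

# Vectorized version: (6, 3) @ (3,) -> (6,)
scores_matmul = inputs @ query

print(torch.allclose(scores_loop, scores_matmul))
```

Both versions compute the same six attention scores; the matrix form avoids the Python-level loop.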


__Figure 3.9 After computing the attention scores ω<sub>21</sub> to ω<sub>2T</sub> with respect to the input query x<sup>(2)</sup>, the next step is to obtain the attention weights α<sub>21</sub> to α<sub>2T</sub> by normalizing the attention scores.__

![attention weights](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image017.png)

In [10]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights:", attn_weights_2_tmp)

print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)
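Dividing by the sum works here because every score is positive. As a small illustration with hypothetical, mixed-sign scores (not values from this chapter), the same normalization can produce entries that are negative or larger than 1, which are hard to read as weights:

```python
import torch

# Hypothetical scores with mixed signs, for illustration only
scores = torch.tensor([2.0, -1.0, 0.5])

# Sum normalization: still sums to 1, but entries can fall outside [0, 1]
sum_norm = scores / scores.sum()

# Softmax maps every score to a positive value before normalizing
soft = torch.softmax(scores, dim=0)

print("Sum-normalized:", sum_norm)
print("Softmax:       ", soft)
```

The sum-normalized result contains a negative entry and one above 1, while the softmax output stays strictly between 0 and 1.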


In practice, it's more common and advisable to use the __softmax__ function for normalization. This approach is better at managing extreme values and offers more favorable gradient properties during training. Below is a basic implementation of the softmax function for normalizing the attention scores:

In [11]:
def softmax_naive(x):
  return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)

print("Attention weights:", attn_weights_2_naive)

print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


The same normalization using PyTorch's built-in softmax implementation, `torch.softmax`:

In [12]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)

print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
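The naive and built-in versions agree on these small scores, but they can differ on extreme values. The sketch below uses hypothetical large scores to show the naive version overflowing. `softmax_stable` is a helper written for this example; it applies the common trick of subtracting the maximum before exponentiating and is not a function from the chapter.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the max leaves the result unchanged mathematically
    # but keeps the largest exponent at exp(0) = 1, avoiding overflow
    exp_x = torch.exp(x - x.max())
    return exp_x / exp_x.sum(dim=0)

# Hypothetical large scores, for illustration only
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print("Naive:  ", softmax_naive(large_scores))   # exp overflows to inf, so inf/inf gives nan
print("Stable: ", softmax_stable(large_scores))
print("PyTorch:", torch.softmax(large_scores, dim=0))
```

On this input the stable helper matches `torch.softmax`, which is one reason to prefer the built-in function over a hand-written one.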


__Figure 3.10 The final step, after calculating and normalizing the attention scores to obtain the attention weights for query x<sup>(2)</sup>, is to compute the context vector z<sup>(2)</sup>. This context vector is a combination of all input vectors x<sup>(1)</sup> to x<sup>(T)</sup> weighted by the attention weights.__

![query weights](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image019.png)

In [14]:
query = inputs[1] # 2nd input token is the query

print(query.shape)

torch.Size([3])


In [15]:
context_vec_2 = torch.zeros(query.shape)

for i, x_i in enumerate(inputs):
  context_vec_2 += attn_weights_2[i]*x_i

print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
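The weighted sum in the loop is equivalent to a single vector-matrix product. The following self-contained sketch recomputes the attention weights for the second token and compares both forms; `context_loop` and `context_matmul` are names used only in this example.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

# Attention weights for the 2nd token as the query
attn_weights_2 = torch.softmax(inputs @ inputs[1], dim=0)

# Loop version: accumulate each input vector scaled by its weight
context_loop = torch.zeros(inputs.shape[1])
for w_i, x_i in zip(attn_weights_2, inputs):
    context_loop += w_i * x_i

# Vectorized version: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights_2 @ inputs

print(torch.allclose(context_loop, context_matmul))
```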


### 3.3.2 Computing attention weights for all input tokens

__Figure 3.11 The highlighted row shows the attention weights for the second input element as a query, as we computed in the previous section. This section generalizes the computation to obtain all other attention weights.__

![attention weights for the 2nd token](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image021.png)


![computing all context vectors at once](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image023.png)

In [16]:
attn_scores = torch.empty(6, 6)

for i, x_i in enumerate(inputs):
  for j, x_j in enumerate(inputs):
    attn_scores[i, j] = torch.dot(x_i, x_j)

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [17]:
attn_scores = inputs @ inputs.T

print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
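Two properties of this score matrix are worth checking, since they follow directly from how it is built. Because the dot product is commutative, the matrix is symmetric, and each diagonal entry is a token's dot product with itself (its squared length). A short self-contained check, with `squared_norms` as a name introduced here:

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

attn_scores = inputs @ inputs.T  # (6, 3) @ (3, 6) -> (6, 6)

# score[i, j] == score[j, i] because dot(x_i, x_j) == dot(x_j, x_i)
print("Symmetric:", torch.allclose(attn_scores, attn_scores.T))

# The diagonal holds each token's dot product with itself
squared_norms = (inputs ** 2).sum(dim=1)
print("Diagonal equals squared norms:",
      torch.allclose(attn_scores.diagonal(), squared_norms))
```

This symmetry is specific to the simplified mechanism, where the same vectors are used on both sides of the dot product.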


In [19]:
attn_weights = torch.softmax(attn_scores, dim=1)

print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [20]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

print("Row 2 sum:", row_2_sum)

print("All row sums:", attn_weights.sum(dim=1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
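The `dim` argument decides which direction gets normalized. The sketch below contrasts normalizing across each row (`dim=-1`, equivalent to `dim=1` for a 2D tensor) with normalizing down each column (`dim=0`); `row_norm` and `col_norm` are names used only for this comparison.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
attn_scores = inputs @ inputs.T

# Normalize across the last dimension: every row (one query) sums to 1
row_norm = torch.softmax(attn_scores, dim=-1)

# Normalize across the first dimension: every column sums to 1 instead
col_norm = torch.softmax(attn_scores, dim=0)

print("Row sums with dim=-1:", row_norm.sum(dim=-1))
print("Row sums with dim=0: ", col_norm.sum(dim=-1))
```

Since each row holds the weights for one query token, the row-wise normalization is the one that matches the attention weights computed above.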


In [21]:
all_context_vecs = attn_weights @ inputs

print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [22]:
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
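The three steps of this section (scores, weights, context vectors) can be collected into one small function. This is a minimal sketch of the simplified mechanism without trainable weights; the function name `simple_self_attention` is hypothetical and introduced only for this example.

```python
import torch

def simple_self_attention(x):
    """Simplified self-attention without trainable weights.

    x: tensor of shape (num_tokens, embed_dim)
    returns: context vectors of shape (num_tokens, embed_dim)
    """
    attn_scores = x @ x.T                              # pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
    return attn_weights @ x                            # weighted sums of inputs

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

all_context_vecs = simple_self_attention(inputs)

# The 2nd row should match the context vector computed step by step earlier
print(all_context_vecs[1])
```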


## 3.4 Implementing self-attention with trainable weights

__Figure 3.13 A mental model illustrating how the self-attention mechanism we code in this section fits into the broader context of this book and chapter. In the previous section, we coded a simplified attention mechanism to understand the basic idea behind attention. In this section, we add trainable weights to this attention mechanism. In the upcoming sections, we will then extend this self-attention mechanism by adding a causal mask and multiple heads.__

![self-attention in the LLM mental model](https://drek4537l1klr.cloudfront.net/raschka/v-8/Figures/ch03__image025.png)