In [1]:
import torch
print(torch.__version__)

2.1.2+cu118


- This chapter covers attention mechanisms, the engine of LLMs:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/01.webp" width="700px">

## 3.3 Attending to different parts of the input with self-attention

### 3.3.1 A simple self-attention mechanism without trainable weights


- This section explains a highly simplified variant of self-attention that does not contain any trainable weights
- It is purely for illustration purposes and is NOT the attention mechanism used in transformers
- Section 3.4 will extend this simple attention mechanism to implement the real self-attention mechanism with trainable weights

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/07.webp" width="700px">

In [2]:
inputs = torch.tensor(
    [
        [0.43, 0.15, 0.89],  # Your     (x^1)
        [0.55, 0.87, 0.66],  # journey  (x^2)
        [0.57, 0.85, 0.64],  # starts   (x^3)
        [0.22, 0.58, 0.33],  # with     (x^4)
        [0.77, 0.25, 0.10],  # one      (x^5)
        [0.05, 0.80, 0.55]   # step     (x^6)
    ]
)

print(inputs.shape)
print(inputs)

torch.Size([6, 3])
tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])


- (In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)

- The primary objective of this section is to demonstrate how the context vector z^(2) is calculated using the second input element, x^(2), as the `query token`

- The figure depicts the initial step in this process, which involves calculating the attention scores ω between x^(2) (`query`) and all input elements (including x^(2) itself) through a dot product operation

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/08.webp" width="700px">


In [5]:
query = inputs[1]             # A
attn_scores_2 = torch.empty(inputs.shape[0])

print(f'query shape: {query.shape} and attn_scores_2 shape: {attn_scores_2.shape}')
query, attn_scores_2

#A The second input token serves as query

query shape: torch.Size([3]) and attn_scores_2 shape: torch.Size([6])


(tensor([0.5500, 0.8700, 0.6600]), tensor([0., 0., 0., 0., 0., 0.]))

In [8]:
# computing attention scores
for i, x_i in enumerate(inputs):
    #print(x_i.shape)
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
print("attn_scores_2 shape: ", attn_scores_2.shape)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
attn_scores_2 shape:  torch.Size([6])


- In the context of self-attention mechanisms, the dot product determines the extent to which elements in a
 sequence attend to each other: the higher the dot product, the higher the similarity and attention score between two elements.
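
To make the dot product concrete, the following minimal sketch recomputes one attention score by hand: it multiplies two vectors element-wise and sums the products, then compares the result with `torch.dot`. The vectors are x^(1) and x^(2) from the `inputs` tensor; the names `x_1`, `manual_score`, and `dot_score` are introduced here only for illustration.

```python
import torch

x_1 = torch.tensor([0.43, 0.15, 0.89])    # "Your"    (x^1)
query = torch.tensor([0.55, 0.87, 0.66])  # "journey" (x^2), used as the query

# A dot product is the sum of the element-wise products
manual_score = (x_1 * query).sum()
dot_score = torch.dot(x_1, query)

print(manual_score)
print(dot_score)
```

Both values should agree with the first entry of `attn_scores_2` printed above (about 0.9544).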

In the next step, as shown in Figure 3.9, we normalize each of the attention scores that we computed previously.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/09.webp" width="700px">

**Figure 3.9** - After computing the attention scores ω_21 to ω_2T with respect to the input query x^(2), the next step is to obtain the attention weights α_21 to α_2T by normalizing the attention scores.

- The main goal behind the normalization shown in Figure 3.9 is to obtain attention weights that sum up to 1. This normalization is a convention that is useful for interpretation and for maintaining training stability in an LLM. Here's a straightforward method for achieving this normalization step:

In [9]:
attn_weigths_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights: ", attn_weigths_2_tmp)
print("Sum attn weights: ", attn_weigths_2_tmp.sum())

Attention weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum attn weights:  tensor(1.0000)
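
One caveat with dividing by the sum: it only yields well-behaved weights when all scores are positive, which happens to be the case for the inputs above. The sketch below uses hypothetical scores (not taken from the example sentence) that include a negative value and compares sum-based normalization with `torch.softmax`.

```python
import torch

# Hypothetical scores that include a negative value (illustration only)
scores = torch.tensor([-1.0, 2.0, 0.5])

sum_normalized = scores / scores.sum()
softmax_normalized = torch.softmax(scores, dim=0)

print("Sum-normalized:", sum_normalized)      # contains a negative "weight"
print("Softmax:       ", softmax_normalized)  # all entries are positive
print("Softmax sum:   ", softmax_normalized.sum())
```

Both results sum to 1, but only the softmax output can be read as a set of non-negative weights.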


- In practice, it's more common and advisable to use the softmax function for normalization.
 This approach is better at managing extreme values and offers more favorable gradient
 properties during training. Below is a basic implementation of the softmax function for
 normalizing the attention scores:

In [11]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weigths_2_naive = softmax_naive(attn_scores_2)
print("Attention weights: ", attn_weigths_2_naive)
print("Sum attn weights: ", attn_weigths_2_naive.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum attn weights:  tensor(1.)


- In addition, the softmax function ensures that the attention weights are always positive.
 This makes the output interpretable as probabilities or relative importance, where higher
 weights indicate greater importance.

- Note that this naive softmax implementation (`softmax_naive`) may encounter numerical
 instability problems, such as overflow and underflow, when dealing with large or small input
 values. Therefore, in practice, it's advisable to use the PyTorch implementation of softmax,
 which is numerically more robust and has been extensively optimized for performance:

In [12]:
attn_weigths_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights: ", attn_weigths_2)
print("Sum attn weights: ", attn_weigths_2.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum attn weights:  tensor(1.)
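
The overflow problem mentioned above can be reproduced with deliberately large scores. The sketch below compares `softmax_naive` with a hypothetical helper, `softmax_stable`, that subtracts the maximum before exponentiating; this is a common trick that leaves the result mathematically unchanged. The scores are made up for illustration, and the code only checks that `torch.softmax` agrees with the stable variant; it does not show how PyTorch implements softmax internally.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    exp_shifted = torch.exp(shifted)
    return exp_shifted / exp_shifted.sum(dim=0)

# Hypothetical, deliberately large scores
large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive = softmax_naive(large_scores)      # exp(1000.) overflows to inf in float32
stable = softmax_stable(large_scores)
builtin = torch.softmax(large_scores, dim=0)

print("Naive:  ", naive)
print("Stable: ", stable)
print("PyTorch:", builtin)
```

The naive version divides infinity by infinity and returns `nan` values, whereas the other two return valid weights.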


- Now that we have computed the normalized attention weights, we are ready for the final step
 illustrated in Figure 3.10: calculating the context vector z(2) by multiplying the embedded
 input tokens, x(i), with the corresponding attention weights and then summing the resulting
 vectors.

 <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/10.webp" width="700px">

  **Figure 3.10** The final step, after calculating and normalizing the attention scores to obtain the attention
 weights for query x(2), is to compute the context vector z(2). This context vector is a combination of all input
 vectors x(1) to x(T) weighted by the attention weights.


In [14]:
query = inputs[1]  # 2nd input token is the query
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weigths_2[i] * x_i
    print(f'iteration i = {i} : {context_vec_2}')
print(context_vec_2)

iteration i = 0 : tensor([0.0596, 0.0208, 0.1233])
iteration i = 1 : tensor([0.1904, 0.2277, 0.2803])
iteration i = 2 : tensor([0.3234, 0.4260, 0.4296])
iteration i = 3 : tensor([0.3507, 0.4979, 0.4705])
iteration i = 4 : tensor([0.4340, 0.5250, 0.4813])
iteration i = 5 : tensor([0.4419, 0.6515, 0.5683])
tensor([0.4419, 0.6515, 0.5683])
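
The loop above accumulates a weighted sum of the input rows, which is the same operation as a vector-matrix product. As a minimal, self-contained sketch (the names `context_loop` and `context_matmul` are introduced here for illustration), the two formulations can be compared directly:

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

query = inputs[1]
attn_scores_2 = inputs @ query                        # dot product of every row with the query
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)  # shape (6,)

# Explicit loop: sum_i alpha_2i * x^(i)
context_loop = torch.zeros(query.shape)
for weight, x_i in zip(attn_weights_2, inputs):
    context_loop += weight * x_i

# Same weighted sum written as a (6,) @ (6, 3) product
context_matmul = attn_weights_2 @ inputs

print(context_loop)
print(context_matmul)
```

Both results should match the context vector printed above.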


### 3.3.2 Computing attention weights for all input tokens

 In the previous section, we computed `attention weights` and the `context vector for input 2`,
 as shown in the highlighted row in Figure 3.11. Now, we are extending this computation to
 calculate attention weights and context vectors for all inputs.
 
  <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/11.webp" width="700px">


 We follow the same three steps as before, as summarized in Figure 3.12, except that we
 make a few modifications in the code to compute all context vectors instead of only the
 second context vector, z(2).

   <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/12.webp" width="700px">


- **Step 1:** `compute attention scores for all pairs of inputs`

In [16]:
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


- When computing the preceding attention score tensor, we used for-loops in Python.
 However, for-loops are generally slow, and we can achieve the same results using matrix
 multiplication:

In [18]:
attn_scores = inputs @ inputs.T  # shape(6, 3) * (3, 6) = output shape(6, 6)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
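
As a sanity check, the sketch below verifies that the nested loops and the matrix product agree, and it points out two properties of this simplified score matrix: it is symmetric (the dot product is commutative), and its diagonal holds the squared length of each input vector. These properties are specific to the weight-free variant, where the same vectors act as both queries and keys. The names `attn_scores_loop` and `attn_scores_matmul` are illustrative.

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)

num_tokens = inputs.shape[0]

# Pairwise dot products with explicit Python loops
attn_scores_loop = torch.empty(num_tokens, num_tokens)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores_loop[i, j] = torch.dot(x_i, x_j)

# The same scores as a single matrix multiplication
attn_scores_matmul = inputs @ inputs.T

print(torch.allclose(attn_scores_loop, attn_scores_matmul))
print(torch.allclose(attn_scores_matmul, attn_scores_matmul.T))
print(torch.allclose(attn_scores_matmul.diagonal(), (inputs ** 2).sum(dim=-1)))
```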


- **Step 2:** we now normalize each row so that the values in each row sum to 1:

In [19]:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


- In the context of using PyTorch, the dim parameter in functions like torch.softmax specifies
 the dimension of the input tensor along which the function will be computed. By setting
 dim=-1, we are instructing the softmax function to apply the normalization along the last
 dimension of the attn_scores tensor. If attn_scores is a 2D tensor (for example, with a
 shape of [rows, columns]), dim=-1 will normalize across the columns so that the values in
 each row (summing over the column dimension) sum up to 1.
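
The effect of `dim` is easiest to see on a tiny tensor. The following sketch uses a made-up 2x3 tensor (not part of the running example) and applies softmax along each dimension in turn:

```python
import torch

# Hypothetical 2 x 3 score tensor for illustration
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

row_wise = torch.softmax(scores, dim=-1)  # normalizes within each row
col_wise = torch.softmax(scores, dim=0)   # normalizes within each column

print(row_wise)
print("Row sums:   ", row_wise.sum(dim=-1))
print(col_wise)
print("Column sums:", col_wise.sum(dim=0))
```

With `dim=-1` each row sums to 1, which is what the attention weights require; with `dim=0` each column sums to 1 instead.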

In [20]:
row_2_sum = sum( [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print(f'row_2_sum: {row_2_sum}')
print(f'All row sums: {attn_weights.sum(dim=-1)}')

row_2_sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


- **Step 3:** Using these attention weights, we compute all context vectors via matrix multiplication:

In [21]:
all_context_vectors = attn_weights @ inputs
print(all_context_vectors)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


- We can double-check that the code is correct by comparing the 2nd row with the context
 vector z(2) that we computed previously in section 3.3.1:

In [22]:
print("Previous 2nd context vector: ", context_vec_2)

Previous 2nd context vector:  tensor([0.4419, 0.6515, 0.5683])


-  Based on the result, we can see that the previously calculated `context_vec_2` matches the
 second row of `all_context_vectors` exactly.

> Note:
This concludes the code walkthrough of a simple self-attention mechanism. In the next
 section, we will add trainable weights, enabling the LLM to learn from data and improve its
 performance on specific tasks.

## 3.4 Implementing self-attention with trainable weights

- In this section, we are implementing the self-attention mechanism that is used in the
 original transformer architecture, the GPT models, and most other popular LLMs. This
 self-attention mechanism is also called scaled dot-product attention. Figure 3.13 provides a
 mental model illustrating how this self-attention mechanism fits into the broader context of
 implementing an LLM.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/13.webp" width="700px">


 - The most notable difference is the introduction of weight matrices that are updated
 during model training. These trainable weight matrices are crucial so that the model
 (specifically, the attention module inside the model) can learn to produce "good" context
 vectors. (Note that we will train the LLM in chapter 5.)
 
 - We will tackle this self-attention mechanism in two subsections. First, we will code it
 step by step as before. Second, we will organize the code into a compact Python class that
 can be imported into an LLM architecture, which we will code in chapter 4.

### 3.4.1 Computing the attention weights step by step

- We will implement the self-attention mechanism step by step by introducing the three
 trainable weight matrices Wq, Wk, and Wv. These three matrices are used to project the
 embedded input tokens, x(i), into query, key, and value vectors as illustrated in Figure 3.14.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/14.webp" width="700px">

- **Figure 3.14** In the first step of the self-attention mechanism with trainable weight matrices, we compute query
 (q), key (k), and value (v) vectors for input elements x. Similar to previous sections, we designate the second
 input, x(2), as the query input. The query vector q(2) is obtained via matrix multiplication between the input x(2)
 and the weight matrix Wq. Similarly, we obtain the key and value vectors via matrix multiplication involving the
 weight matrices Wk and Wv.

- Earlier in section 3.3.1, we defined the second input element x(2) as the query when we
 computed the simplified attention weights to compute the context vector z(2). Later, in
 section 3.3.2, we generalized this to compute all context vectors z(1) ... z(T) for the
 six-word input sentence "Your journey starts with one step."

- Similarly, we will start by computing only one context vector, z(2), for illustration
 purposes. In the next section, we will modify this code to calculate all context vectors.

In [23]:
x_2 = inputs[1]             #A
d_in = inputs.shape[1]      #B
d_out = 2                   #C

#A The second input element
#B The input embedding size, d_in = 3
#C The output embedding size, d_out = 2

> Note that in GPT-like models, the input and output dimensions are usually the same, but for
 illustration purposes, to better follow the computation, we choose different input (d_in=3)
 and output (d_out=2) dimensions here.

- Next, we initialize the three weight matrices Wq, Wk, and Wv that are shown in Figure 3.14:


In [24]:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

> Note that we are setting `requires_grad=False` to reduce clutter in the outputs for
 illustration purposes, but if we were to use the weight matrices for model training, we
 would set `requires_grad=True` to update these matrices during model training.
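
To illustrate what the flag changes, here is a minimal sketch with two hypothetical parameters, `W_frozen` and `W_trainable` (names chosen for this example only). Only the trainable one is tracked by autograd, so only it receives a gradient after calling `backward()`:

```python
import torch

torch.manual_seed(123)
W_frozen = torch.nn.Parameter(torch.rand(3, 2), requires_grad=False)
W_trainable = torch.nn.Parameter(torch.rand(3, 2), requires_grad=True)

x = torch.tensor([0.55, 0.87, 0.66])  # the second input element, x^(2)

out_frozen = x @ W_frozen
out_trainable = x @ W_trainable

print(out_frozen)     # plain tensor, no grad_fn attached
print(out_trainable)  # carries a grad_fn because autograd tracks W_trainable

# Backpropagate a simple scalar (the sum of the outputs) to obtain gradients
out_trainable.sum().backward()
print(W_trainable.grad)  # each column equals x for this particular scalar
print(W_frozen.grad)     # None: no gradient is computed for the frozen matrix
```

This is also why the outputs of the class-based implementations later in this chapter show a `grad_fn` entry, while the step-by-step outputs here do not.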

- Next, we compute the query, key, and value vectors as shown earlier in Figure 3.14:


In [25]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)

tensor([0.4306, 1.4551])


**WEIGHT PARAMETERS VS ATTENTION WEIGHTS**

- Note that in the weight matrices W, the term "weight" is short for "weight
 parameters," the values of a neural network that are optimized during training. This
 is not to be confused with the attention weights. As we already saw in the previous
 section, attention weights determine the extent to which a context vector depends on
 the different parts of the input, i.e., to what extent the network focuses on different
 parts of the input.

- In summary, weight parameters are the fundamental, learned coefficients that define
 the network's connections, while attention weights are dynamic, context-specific
 values.

- Even though our temporary goal is to only compute the one context vector, z(2), we still
 `require the key and value vectors for all input elements` `as they are involved in computing
 the attention weights with respect to the query q(2)`, as illustrated in Figure 3.14.
    
    We can obtain all keys and values via matrix multiplication:

In [26]:
keys = inputs @ W_key
values = inputs @ W_value
print(f'keys.shape: {keys.shape}')
print(f'values.shape: {values.shape}')

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


- The second step is now to compute the attention scores, as shown in Figure 3.15.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/15.webp" width="700px">

- **Figure 3.15** The attention score computation is a dot-product computation similar to what we have used in the
 simplified self-attention mechanism in section 3.3. The new aspect here is that we are not directly computing
 the dot-product between the input elements but using the query and key obtained by transforming the inputs
 via the respective weight matrices.

 First, let's compute the attention score ω22:

In [27]:
keys_2 = keys[1]
attn_scores_22 = query_2.dot(keys_2)
print(attn_scores_22)

tensor(1.8524)


- Again, we can generalize this computation to all attention scores via matrix multiplication:


In [29]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query_2 (2) @ (2, 6) => (6)
print(attn_scores_2)

tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


In [30]:
attn_scores_2.shape

torch.Size([6])

- The third step is now going from the attention scores to the attention weights, as illustrated
 in Figure 3.16.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/16.webp" width="700px">

- Normalize these attention scores using softmax

- The difference from earlier is that we now scale the attention scores by dividing them by the square root of the
 embedding dimension of the keys (note that taking the square root is mathematically the
 same as exponentiating by 0.5)

In [31]:
d_k = keys.shape[-1]  # taking embedding dimension
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


 > THE RATIONALE BEHIND SCALED DOT-PRODUCT ATTENTION
 
- The reason for scaling by the square root of the embedding dimension is to improve the
 training performance by avoiding small gradients. For instance, when scaling up the
 embedding dimension, which is typically greater than a thousand for GPT-like LLMs,
 large dot products can result in very small gradients during backpropagation due to
 the softmax function applied to them. As dot products increase, the softmax function
 behaves more like a step function, resulting in gradients nearing zero. These small
 gradients can drastically slow down learning or cause training to stagnate.
 
 - The scaling by the square root of the embedding dimension is the reason why this
 self-attention mechanism is also called scaled dot-product attention.
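
The vanishing-gradient effect can be observed numerically. The sketch below is an illustration under simple assumptions: it takes three made-up scores, multiplies them by 50 to stand in for the large unscaled dot products that arise in high dimensions, and compares the softmax Jacobians using `torch.autograd.functional.jacobian`. The helper name `softmax_last_dim` and the factor 50 are arbitrary choices for this example.

```python
import torch
from torch.autograd.functional import jacobian

def softmax_last_dim(scores):
    return torch.softmax(scores, dim=-1)

# Hypothetical scores; the factor 50 mimics large, unscaled dot products
base_scores = torch.tensor([1.0, 2.0, 3.0])
large_scores = base_scores * 50

# Jacobian entry [i, j] is d softmax_i / d score_j
jac_small = jacobian(softmax_last_dim, base_scores)
jac_large = jacobian(softmax_last_dim, large_scores)

print("Softmax (moderate scores):", softmax_last_dim(base_scores))
print("Softmax (large scores):   ", softmax_last_dim(large_scores))
print("Largest |gradient| (moderate):", jac_small.abs().max())
print("Largest |gradient| (large):   ", jac_large.abs().max())
```

For the large scores, nearly all of the weight lands on a single entry and every Jacobian entry is close to zero, which is the step-function-like behavior described above.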

- Now, the final step is to compute the context vectors, as illustrated in Figure 3.17.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/17.webp" width="700px">

- **In the final step of the self-attention computation, we compute the context vector by combining all value vectors via the attention weights**

 - Similar to section 3.3, where we computed the context vector as a weighted sum over the
 input vectors, we now compute the context vector as a weighted sum over the value
 vectors. Here, the attention weights serve as a weighting factor that weighs the respective
 importance of each value vector. Similar to section 3.3, we can use matrix multiplication to
 obtain the output in one step.

In [32]:
print(f'Shape of attn_weights_2: {attn_weights_2.shape}')
print(f'Shape of values: {values.shape}')

Shape of attn_weights_2: torch.Size([6])
Shape of values: torch.Size([6, 2])


In [33]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


- So far, we only computed a `single context vector, z(2)`. In the next section, we will generalize the code to compute `all context vectors` in the input sequence, `z(1) to z(T)`.

> WHY QUERY, KEY, AND VALUE?
 
- The terms "key," "query," and "value" in the context of attention mechanisms are
 borrowed from the domain of information retrieval and databases, where similar
 concepts are used to store, search, and retrieve information.
 
- A "query" is analogous to a search query in a database. It represents the current
 item (e.g., a word or token in a sentence) the model focuses on or tries to
 understand. The query is used to probe the other parts of the input sequence to
 determine how much attention to pay to them.

- The "key" is like a database key used for indexing and searching. In the attention
 mechanism, each item in the input sequence (e.g., each word in a sentence) has an
 associated key. These keys are used to match with the query.

- The "value" in this context is similar to the value in a key-value pair in a database. It
 represents the actual content or representation of the input items. Once the model
 determines which keys (and thus which parts of the input) are most relevant to the
 query (the current focus item), it retrieves the corresponding values.
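
One way to read the database analogy is as a "soft" lookup: instead of returning exactly one stored value, attention returns a blend of all values, weighted by how well each key matches the query. The toy sketch below is only an illustration of that reading; the keys, values, and the factor 100 used to sharpen the match are all made up and involve no trainable weights.

```python
import torch

# Hypothetical toy "database" with two entries
keys = torch.tensor([[1.0, 0.0],
                     [0.0, 1.0]])
values = torch.tensor([[10.0],
                       [20.0]])
query = torch.tensor([1.0, 0.0])  # matches the first key best

scores = keys @ query  # similarity of the query to each key

# Soft lookup: a blend of both values, dominated by the better match
soft_weights = torch.softmax(scores, dim=0)
soft_result = soft_weights @ values

# Sharpening the scores pushes the blend toward an exact (hard) lookup
sharp_weights = torch.softmax(scores * 100, dim=0)
sharp_result = sharp_weights @ values

print("Soft weights: ", soft_weights, "->", soft_result)
print("Sharp weights:", sharp_weights, "->", sharp_result)
```

The soft result lies between the two stored values, while the sharpened version returns essentially the value stored under the matching key.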

### 3.4.2 Implementing a compact self-attention Python class

- In the previous sections, we have gone through a lot of steps to compute the self-attention
 outputs. This was mainly done for illustration purposes so we could go through one step at
 a time. In practice, with the LLM implementation in the next chapter in mind, it is helpful to
 organize this code into a Python class as follows:

In [34]:
# A compact self-attention class
import torch.nn as nn
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec        

- The `__init__` method initializes trainable weight matrices (`W_query`, `W_key`, and
 `W_value`) for queries, keys, and values, each transforming the input dimension `d_in` to an
 output dimension `d_out`.

- During the forward pass, using the `forward` method, we compute the attention scores
 (`attn_scores`) by multiplying queries and keys, then scale these scores by the square root of
 the key dimension and normalize them using softmax. Finally, we create the context vectors by
 weighting the values with these normalized attention scores.

In [35]:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)  # d_in = 3, d_out = 2
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)


In [36]:
print(context_vec_2)

tensor([0.3061, 0.8210])


- Figure 3.18 summarizes the self-attention mechanism we just implemented.

    <img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch03_compressed/18.webp" width="700px">

- In self-attention, we transform the input vectors in the input matrix X with the three weight
 matrices, Wq, Wk, and Wv. Then, we compute the attention weight matrix based on the resulting queries `(Q)`
 and keys `(K)`. Using the attention weights and values `(V)`, we then compute the context vectors (Z). (For visual
 clarity, we focus on a single input text with n tokens in this figure, not a batch of multiple inputs. Consequently,
 the 3D input tensor is simplified to a 2D matrix in this context. This approach allows for a more straightforward
 visualization and understanding of the processes involved. Also, for consistency with later figures, the values in
 the attention matrix do not depict the real attention weights.)

- As shown in Figure 3.18, self-attention involves the trainable weight matrices Wq, Wk, and
 Wv. These matrices transform input data into queries, keys, and values, which are crucial
 components of the attention mechanism. As the model is exposed to more data during
 training, it adjusts these trainable weights, as we will see in upcoming chapters.

- We can improve the `SelfAttention_v1` implementation further by utilizing PyTorch's
 `nn.Linear` layers, which effectively perform matrix multiplication when the bias units are
 disabled. Additionally, a significant advantage of using `nn.Linear` instead of manually
 implementing `nn.Parameter(torch.rand(...))` is that `nn.Linear` has an optimized weight
 initialization scheme, contributing to more stable and effective model training.

In [37]:
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
    
    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores/keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec

In [39]:
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


> Note that `SelfAttention_v1` and `SelfAttention_v2` give different outputs because they start from different initial weights: `nn.Linear` uses a more sophisticated weight initialization scheme than `torch.rand`.
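
To confirm that the initial weights are the only difference, the sketch below copies the weights of a `SelfAttention_v2` instance into a `SelfAttention_v1` instance and compares the outputs. It relies on the fact that `nn.Linear` stores its weight with shape `(d_out, d_in)`, so the matrices are transposed before copying. This check is an addition for illustration and restates both classes so that it runs on its own.

```python
import torch
import torch.nn as nn

class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        return attn_weights @ values

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],
     [0.55, 0.87, 0.66],
     [0.57, 0.85, 0.64],
     [0.22, 0.58, 0.33],
     [0.77, 0.25, 0.10],
     [0.05, 0.80, 0.55]]
)
d_in, d_out = 3, 2

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
sa_v1 = SelfAttention_v1(d_in, d_out)

# nn.Linear keeps weights as (d_out, d_in); SelfAttention_v1 expects (d_in, d_out)
with torch.no_grad():
    sa_v1.W_query.copy_(sa_v2.W_query.weight.T)
    sa_v1.W_key.copy_(sa_v2.W_key.weight.T)
    sa_v1.W_value.copy_(sa_v2.W_value.weight.T)

print(torch.allclose(sa_v1(inputs), sa_v2(inputs), atol=1e-6))
```

With identical weights, the two implementations should produce the same context vectors up to floating-point rounding.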