# Self-Attention Mechanism

The self-attention mechanism lets the model weigh how relevant every other word (token) in a sequence is while it encodes a particular word. It is a core building block of the Transformer architecture.

In the previous module, we left off with the positionally encoded embeddings. We now pass those embeddings into the next component of the Transformer: the attention mechanism.

In [20]:
import torch
import torch.nn as nn
import math

# Positional Encoded Embeddings from the previous notebook
pos_encoded_embeddings = torch.tensor([[[ 0.0171,  1.0654,  0.4616,  0.9196,  0.7193, -0.7430,  0.7120,
           0.4198,  2.7427,  0.2844],
         [ 0.9161, -0.4999,  1.8302,  1.2182,  0.0651,  2.0396,  0.3780,
           0.4085,  1.3560, -1.3176],
         [-0.3461,  0.2526,  0.4120,  0.0398,  1.6146,  2.6475, -0.1887,
          -0.6907, -0.4066,  1.2899],
         [-0.8345, -0.4657, -0.5635,  1.1514,  0.1762,  1.8148, -1.4084,
           0.9153, -1.1734,  1.1989],
         [-0.3981, -0.7277, -1.2657,  1.9887, -0.2399,  0.0412,  1.9375,
           0.6083,  0.2095,  1.5739],
         [-1.5240,  1.2359,  0.5596, -0.1529, -0.4064, -0.0906, -1.6746,
           0.2480, -0.2364,  0.8417],
         [ 0.6992, -1.1193,  0.5642,  1.3861, -0.5185,  0.6701, -0.5353,
           1.8013,  0.7900, -0.9430],
         [-0.3550,  2.1032,  1.8690,  0.7174, -1.7692,  0.8200, -0.9620,
           1.8325,  0.3009,  1.0083],
         [ 1.5902, -0.8516,  3.2954, -0.5147,  0.1798,  1.8522, -1.0186,
           0.6484,  0.9932,  0.4030],
         [-0.5990, -0.4916,  0.5419, -0.0293, -0.3465,  1.1548, -1.1130,
          -0.8118,  0.3455,  0.9474]]], dtype=torch.float)

print("Positional Encoded Embeddings Shape:", pos_encoded_embeddings.shape) # torch.Size([1, 10, 10]) because we have 1 batch, 10 tokens, and 10 features
print(pos_encoded_embeddings)

Positional Encoded Embeddings Shape: torch.Size([1, 10, 10])
tensor([[[ 0.0171,  1.0654,  0.4616,  0.9196,  0.7193, -0.7430,  0.7120,
           0.4198,  2.7427,  0.2844],
         [ 0.9161, -0.4999,  1.8302,  1.2182,  0.0651,  2.0396,  0.3780,
           0.4085,  1.3560, -1.3176],
         [-0.3461,  0.2526,  0.4120,  0.0398,  1.6146,  2.6475, -0.1887,
          -0.6907, -0.4066,  1.2899],
         [-0.8345, -0.4657, -0.5635,  1.1514,  0.1762,  1.8148, -1.4084,
           0.9153, -1.1734,  1.1989],
         [-0.3981, -0.7277, -1.2657,  1.9887, -0.2399,  0.0412,  1.9375,
           0.6083,  0.2095,  1.5739],
         [-1.5240,  1.2359,  0.5596, -0.1529, -0.4064, -0.0906, -1.6746,
           0.2480, -0.2364,  0.8417],
         [ 0.6992, -1.1193,  0.5642,  1.3861, -0.5185,  0.6701, -0.5353,
           1.8013,  0.7900, -0.9430],
         [-0.3550,  2.1032,  1.8690,  0.7174, -1.7692,  0.8200, -0.9620,
           1.8325,  0.3009,  1.0083],
         [ 1.5902, -0.8516,  3.2954, -0.5147,  0.17

## Linear Transformation to Generate Queries, Keys, and Values

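The positionally encoded embeddings are passed through three different linear layers to generate the Query (Q), Key (K), and Value (V) matrices. Q and K are used to compute the attention scores, and V supplies the vectors that those scores later combine. In this notebook the three weight matrices are identity matrices, so Q, K, and V are all equal to the input embeddings.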
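The notebook cell below uses identity matrices so that Q, K, and V are easy to inspect. In a trained Transformer these projections are learned. As a minimal sketch, assuming bias-free `nn.Linear` layers and a random stand-in input `x` (both are illustrative choices, not taken from the notebook), the same step might look like this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights for the illustration

batch, seq_len, d_model = 1, 10, 10  # same shape as the embeddings used in this notebook
x = torch.randn(batch, seq_len, d_model)  # stand-in for the positional embeddings

# One learned projection per role; bias=False is an assumption of this sketch
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

queries, keys, values = w_q(x), w_k(x), w_v(x)
print(queries.shape, keys.shape, values.shape)

# With identity weights the projection is a no-op, which is the simplified case
with torch.no_grad():
    w_q.weight.copy_(torch.eye(d_model))
identity_is_noop = torch.allclose(w_q(x), x)
print("Identity projection leaves x unchanged:", identity_is_noop)
```

Each projection keeps the `(batch, tokens, features)` shape, so the rest of the attention computation is unchanged whether the weights are learned or fixed to the identity.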

In [21]:
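# Dimensions
# The feature dimension is the last axis: (batch, tokens, features).
# size(1) would return the number of tokens, which only happens to match here.
d_model = pos_encoded_embeddings.size(-1)
d_k = d_model  # Assuming d_k = d_model for simplicity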

# Define weight matrices for linear transformations (for simplicity, we use identity matrices)
W_q = torch.eye(d_model)
W_k = torch.eye(d_model)
W_v = torch.eye(d_model)

# Linear transformations to generate Q, K, V
queries = torch.matmul(pos_encoded_embeddings, W_q)
keys = torch.matmul(pos_encoded_embeddings, W_k)
values = torch.matmul(pos_encoded_embeddings, W_v)

print("Queries Shape:", queries.shape)
print("Queries Matrix:\n", queries)
print("Keys Shape:", keys.shape)
print("Keys Matrix:\n", keys)
print("Values Shape:", values.shape)
print("Values Matrix:\n", values)

Queries Shape: torch.Size([1, 10, 10])
Queries Matrix:
 tensor([[[ 0.0171,  1.0654,  0.4616,  0.9196,  0.7193, -0.7430,  0.7120,
           0.4198,  2.7427,  0.2844],
         [ 0.9161, -0.4999,  1.8302,  1.2182,  0.0651,  2.0396,  0.3780,
           0.4085,  1.3560, -1.3176],
         [-0.3461,  0.2526,  0.4120,  0.0398,  1.6146,  2.6475, -0.1887,
          -0.6907, -0.4066,  1.2899],
         [-0.8345, -0.4657, -0.5635,  1.1514,  0.1762,  1.8148, -1.4084,
           0.9153, -1.1734,  1.1989],
         [-0.3981, -0.7277, -1.2657,  1.9887, -0.2399,  0.0412,  1.9375,
           0.6083,  0.2095,  1.5739],
         [-1.5240,  1.2359,  0.5596, -0.1529, -0.4064, -0.0906, -1.6746,
           0.2480, -0.2364,  0.8417],
         [ 0.6992, -1.1193,  0.5642,  1.3861, -0.5185,  0.6701, -0.5353,
           1.8013,  0.7900, -0.9430],
         [-0.3550,  2.1032,  1.8690,  0.7174, -1.7692,  0.8200, -0.9620,
           1.8325,  0.3009,  1.0083],
         [ 1.5902, -0.8516,  3.2954, -0.5147,  0.1798,  

## Scaled Dot-Product Attention

To compute the attention scores, we take the dot product of every query with every key, i.e. $QK^\top$. The result is then divided by $\sqrt{d_k}$, the square root of the key dimension. Without this scaling the scores grow with $d_k$, which pushes the softmax toward saturated regions where gradients become very small.
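To see why the scaling matters, here is a small experiment. It is a sketch under the assumption that query and key components are independent with zero mean and unit variance, which real embeddings only approximate. Under that assumption, the spread of the raw dot products grows roughly like $\sqrt{d_k}$, while the scaled scores stay near 1:

```python
import math
import torch

torch.manual_seed(0)

n_samples = 5_000  # number of random query/key pairs per setting
results = {}
for d_k in (10, 100, 1000):
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    raw = (q * k).sum(dim=-1)        # one dot product per query/key pair
    scaled = raw / math.sqrt(d_k)    # the scaling used in attention
    results[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:4d}  raw std={results[d_k][0]:6.2f}  scaled std={results[d_k][1]:4.2f}")
```

The variable names and sample sizes are arbitrary; only the trend across `d_k` is the point.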

In [22]:
# Compute the dot product of queries and keys, then scale
print(f"square root of the dim of keys: {math.sqrt(d_k)}")

scores = torch.matmul(queries, keys.transpose(-2, -1)) / math.sqrt(d_k)

print("Dot Product of Queries and Keys.Transposed (Scaled):")
print(scores)
print("Scores Shape:", scores.shape)

square root of the dim of keys: 3.1622776601683795
Dot Product of Queries and Keys.Transposed (Scaled):
tensor([[[ 3.6524,  1.1905, -0.4707, -1.4006,  0.9222, -0.0992,  0.5557,
           0.9713,  0.4132, -0.4219],
         [ 1.1905,  4.4182,  1.0308,  0.0706, -0.2009, -1.0589,  2.5619,
           1.2458,  3.7226,  0.4597],
         [-0.4707,  1.0308,  3.8937,  2.1286,  0.1242,  0.4726, -0.6264,
           0.2723,  1.7791,  1.4722],
         [-1.4006,  0.0706,  2.1286,  3.6419,  1.0043,  1.2146,  0.8493,
           1.3136,  0.4297,  1.2584],
         [ 0.9222, -0.2009,  0.1242,  1.0043,  4.0949, -0.9581,  0.4650,
          -0.3064, -1.8694, -0.3491],
         [-0.0992, -1.0589,  0.4726,  1.2146, -0.9581,  2.5267, -0.5795,
           2.3920,  0.0560,  0.9574],
         [ 0.5557,  2.5619, -0.6264,  0.8493,  0.4650, -0.5795,  3.0812,
           1.2700,  2.0481, -0.0433],
         [ 0.9713,  1.2458,  0.2723,  1.3136, -0.3064,  2.3920,  1.2700,
           5.6132,  2.3743,  0.7503],
         

## Softmax to Get Attention Weights

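The scaled dot-product scores are then passed through a softmax along the last dimension (over the keys), which turns each row into a probability distribution that sums to 1. These probabilities are the attention weights: row $i$ says how much token $i$ attends to each token in the sequence.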
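As a small illustration of what the softmax does, the sketch below computes it by hand on a made-up score matrix and compares the result with `torch.softmax`. The numbers are illustrative only and are not taken from the notebook:

```python
import torch

# Made-up scores for two "query" rows over three "key" columns
scores = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.5, 0.5]])

# Softmax by hand: subtract the row max for numerical stability,
# exponentiate, then normalize each row
shifted = scores - scores.max(dim=-1, keepdim=True).values
exp_scores = torch.exp(shifted)
manual = exp_scores / exp_scores.sum(dim=-1, keepdim=True)

builtin = torch.softmax(scores, dim=-1)

print("Manual softmax:\n", manual)
print("Matches torch.softmax:", torch.allclose(manual, builtin))
print("Row sums:", manual.sum(dim=-1))
```

The second row has equal scores, so its weights are uniform, while the first row puts most of its weight on the largest score.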

In [23]:
# Apply softmax to the scores to get the attention weights
attention = torch.nn.functional.softmax(scores, dim=-1)

print("Attention Weights Shape:", attention.shape)
print("Attention Weights:\n", attention)

Attention Weights Shape: torch.Size([1, 10, 10])
Attention Weights:
 tensor([[[7.3184e-01, 6.2401e-02, 1.1852e-02, 4.6762e-03, 4.7721e-02,
          1.7182e-02, 3.3077e-02, 5.0123e-02, 2.8682e-02, 1.2444e-02],
         [2.1827e-02, 5.5053e-01, 1.8606e-02, 7.1228e-03, 5.4294e-03,
          2.3020e-03, 8.6020e-02, 2.3070e-02, 2.7458e-01, 1.0511e-02],
         [8.2410e-03, 3.6988e-02, 6.4774e-01, 1.1087e-01, 1.4939e-02,
          2.1167e-02, 7.0524e-03, 1.7324e-02, 7.8172e-02, 5.7510e-02],
         [3.7856e-03, 1.6485e-02, 1.2908e-01, 5.8624e-01, 4.1934e-02,
          5.1750e-02, 3.5914e-02, 5.7137e-02, 2.3607e-02, 5.4067e-02],
         [3.5522e-02, 1.1554e-02, 1.5992e-02, 3.8558e-02, 8.4793e-01,
          5.4186e-03, 2.2487e-02, 1.0397e-02, 2.1783e-03, 9.9621e-03],
         [2.6416e-02, 1.0118e-02, 4.6799e-02, 9.8277e-02, 1.1191e-02,
          3.6500e-01, 1.6342e-02, 3.1902e-01, 3.0851e-02, 7.5989e-02],
         [3.2410e-02, 2.4096e-01, 9.9376e-03, 4.3468e-02, 2.9600e-02,
          1.041

## Weighted Sum of Values to Get the Output

The final step in the attention mechanism is to compute a weighted sum of the Value matrix using the attention weights. This gives us the output of the self-attention mechanism.
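The three steps (scores, softmax, weighted sum) can also be gathered into one function. The sketch below is a hypothetical helper, `self_attention`, written for illustration with a random stand-in input; it is not part of the notebook and is not a library API:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative helper)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, tokens, tokens)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v, weights                        # weighted sum of values

torch.manual_seed(0)
x = torch.randn(1, 10, 10)   # stand-in for the positional embeddings
eye = torch.eye(10)          # identity projections, as in this notebook
out, weights = self_attention(x, eye, eye, eye)

print("Output shape:", out.shape)
print("Weights shape:", weights.shape)
print("Row sums:", weights.sum(dim=-1))
```

Returning the weights alongside the output makes it easy to inspect which tokens each position attends to.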

In [24]:
# Compute the output as a weighted sum of the values
print("attention weights:")
print(f"{attention}\n")
print("values matrix: ")
print(f"{values}\n")

output = torch.matmul(attention, values)

print("Output Shape:", output.shape)
print("Output:\n", output)

attention weights:
tensor([[[7.3184e-01, 6.2401e-02, 1.1852e-02, 4.6762e-03, 4.7721e-02,
          1.7182e-02, 3.3077e-02, 5.0123e-02, 2.8682e-02, 1.2444e-02],
         [2.1827e-02, 5.5053e-01, 1.8606e-02, 7.1228e-03, 5.4294e-03,
          2.3020e-03, 8.6020e-02, 2.3070e-02, 2.7458e-01, 1.0511e-02],
         [8.2410e-03, 3.6988e-02, 6.4774e-01, 1.1087e-01, 1.4939e-02,
          2.1167e-02, 7.0524e-03, 1.7324e-02, 7.8172e-02, 5.7510e-02],
         [3.7856e-03, 1.6485e-02, 1.2908e-01, 5.8624e-01, 4.1934e-02,
          5.1750e-02, 3.5914e-02, 5.7137e-02, 2.3607e-02, 5.4067e-02],
         [3.5522e-02, 1.1554e-02, 1.5992e-02, 3.8558e-02, 8.4793e-01,
          5.4186e-03, 2.2487e-02, 1.0397e-02, 2.1783e-03, 9.9621e-03],
         [2.6416e-02, 1.0118e-02, 4.6799e-02, 9.8277e-02, 1.1191e-02,
          3.6500e-01, 1.6342e-02, 3.1902e-01, 3.0851e-02, 7.5989e-02],
         [3.2410e-02, 2.4096e-01, 9.9376e-03, 4.3468e-02, 2.9600e-02,
          1.0415e-02, 4.0505e-01, 6.6205e-02, 1.4415e-01, 1.7805e


### Worked Example: Weighted Sum of Values

#### Explanation

After computing the attention weights, the next step is to calculate the final output by taking a weighted sum of the value vectors. This is done through matrix multiplication of the attention weights matrix with the value matrix.

Given:
- **Attention Weights** matrix $\mathbf{A}$ of shape $(\text{number of tokens}, \text{number of tokens})$.
- **Values** matrix $\mathbf{V}$ of shape $(\text{number of tokens}, d_{\text{model}})$.

The output is computed as:
$$
\mathbf{O} = \mathbf{A} \times \mathbf{V}
$$
Where:
- $\mathbf{O}$ is the output matrix of shape $(\text{number of tokens}, d_{\text{model}})$.

#### Example Code with Matrix Multiplication

```python
# Let's assume 3 tokens and a model dimension of 4 for simplicity
values = torch.tensor([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2]
], dtype=torch.float)

attention_weights = torch.tensor([
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.1],
    [0.4, 0.1, 0.5]
], dtype=torch.float)

# Perform the matrix multiplication
output = torch.matmul(attention_weights, values)

print("Attention Weights Matrix (A):\n", attention_weights)
print("Values Matrix (V):\n", values)
print("Output Matrix (O = A @ V):\n", output)
```

#### Expected Output

Given the example matrices:

- **Attention Weights (A):**
  $$
  \mathbf{A} =
  \begin{bmatrix}
  0.2 & 0.3 & 0.5 \\
  0.1 & 0.8 & 0.1 \\
  0.4 & 0.1 & 0.5
  \end{bmatrix}
  $$

- **Values (V):**
  $$
  \mathbf{V} =
  \begin{bmatrix}
  0.1 & 0.2 & 0.3 & 0.4 \\
  0.5 & 0.6 & 0.7 & 0.8 \\
  0.9 & 1.0 & 1.1 & 1.2
  \end{bmatrix}
  $$

The output matrix $\mathbf{O}$ will be:

- **Output (O):**
  $$
  \mathbf{O} =
  \begin{bmatrix}
  (0.2 \times 0.1 + 0.3 \times 0.5 + 0.5 \times 0.9) & (0.2 \times 0.2 + 0.3 \times 0.6 + 0.5 \times 1.0) & (0.2 \times 0.3 + 0.3 \times 0.7 + 0.5 \times 1.1) & (0.2 \times 0.4 + 0.3 \times 0.8 + 0.5 \times 1.2) \\
  (0.1 \times 0.1 + 0.8 \times 0.5 + 0.1 \times 0.9) & (0.1 \times 0.2 + 0.8 \times 0.6 + 0.1 \times 1.0) & (0.1 \times 0.3 + 0.8 \times 0.7 + 0.1 \times 1.1) & (0.1 \times 0.4 + 0.8 \times 0.8 + 0.1 \times 1.2) \\
  (0.4 \times 0.1 + 0.1 \times 0.5 + 0.5 \times 0.9) & (0.4 \times 0.2 + 0.1 \times 0.6 + 0.5 \times 1.0) & (0.4 \times 0.3 + 0.1 \times 0.7 + 0.5 \times 1.1) & (0.4 \times 0.4 + 0.1 \times 0.8 + 0.5 \times 1.2)
  \end{bmatrix}
  $$

Evaluating the above gives:

$$
\mathbf{O} =
\begin{bmatrix}
0.62 & 0.72 & 0.82 & 0.92 \\
0.50 & 0.60 & 0.70 & 0.80 \\
0.54 & 0.64 & 0.74 & 0.84
\end{bmatrix}
$$

So, up to floating-point rounding and print formatting, the output matrix $\mathbf{O}$ will be:

```
tensor([[0.62, 0.72, 0.82, 0.92],
        [0.50, 0.60, 0.70, 0.80],
        [0.54, 0.64, 0.74, 0.84]])
```

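This demonstrates how each row of the output matrix is a weighted sum of the rows of the value matrix, with the weights given by the corresponding row of the attention weights. Because each row of $\mathbf{A}$ sums to 1, every output row is a weighted average of the value vectors.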
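The row-by-row reading can be checked directly. The sketch below reuses the example matrices from this section and builds each output row as an explicit weighted sum, then compares the result with the matrix product. The loop is for illustration only; in practice the matrix product is what you would use:

```python
import torch

values = torch.tensor([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
])
attention_weights = torch.tensor([
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.1],
    [0.4, 0.1, 0.5],
])

# Row i of the output, written as an explicit weighted sum of the rows of V
rows = []
for i in range(attention_weights.size(0)):
    row = torch.zeros(values.size(1))
    for j in range(values.size(0)):
        row += attention_weights[i, j] * values[j]
    rows.append(row)
output_loop = torch.stack(rows)

output_matmul = attention_weights @ values

print("Loop and matmul agree:", torch.allclose(output_loop, output_matmul))
print(output_matmul)
```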


## Combining Heads and Final Linear Transformation

In a multi-head attention mechanism, several sets of Q, K, and V projections are computed, and attention is run on each set (each "head") in parallel. The per-head outputs are then concatenated and passed through a final linear transformation that projects them back to the original embedding size.
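The notebook cell below keeps a single head for simplicity. As a rough sketch of the multi-head case, assuming 2 heads, bias-free Q/K/V projections, and a random stand-in input (all illustrative choices, not taken from the notebook), the split, per-head attention, concatenation, and final projection could look like this:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, d_model, heads = 1, 10, 10, 2
head_dim = d_model // heads  # 5 features per head

x = torch.randn(batch, seq_len, d_model)  # stand-in for the positional embeddings
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)
fc_out = nn.Linear(heads * head_dim, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, heads, seq_len, head_dim)
    return t.view(batch, seq_len, heads, head_dim).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))

# Scaled dot-product attention, computed for all heads at once
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
weights = torch.softmax(scores, dim=-1)   # (batch, heads, seq_len, seq_len)
per_head = weights @ v                    # (batch, heads, seq_len, head_dim)

# Concatenate the heads, then project back to the embedding size
concat = per_head.transpose(1, 2).reshape(batch, seq_len, heads * head_dim)
output = fc_out(concat)

print(weights.shape, concat.shape, output.shape)
```

Each head works on a `head_dim`-sized slice of the projected features, so the scaling factor uses `head_dim` rather than `d_model`.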

In [25]:
# Example with one head (for simplicity)
N, heads, head_dim, embedding_size = 1, 1, d_model, d_model

# Combine heads (for demonstration, we consider one head only)
output = output.view(N, -1, heads * head_dim)

# Final linear transformation to project back to the original embedding size
fc_out = nn.Linear(heads * head_dim, embedding_size)
output = fc_out(output)

print("Final Output Shape:", output.shape)
print("Final Output:\n", output)

Final Output Shape: torch.Size([1, 10, 10])
Final Output:
 tensor([[[-8.4285e-01, -1.7612e-01, -2.7030e-01,  4.4906e-01,  7.3453e-01,
           6.4252e-01, -7.2729e-02, -6.9949e-01,  1.0482e+00,  9.9681e-01],
         [ 8.2112e-02, -4.4043e-01, -3.8180e-01, -8.5722e-01,  7.8969e-02,
           1.3080e+00,  1.9322e-01,  4.9748e-02,  8.4398e-01,  1.1881e+00],
         [-1.6093e-01,  2.5082e-01,  7.8345e-01,  4.3364e-02,  2.3881e-01,
           2.3484e-01, -3.1012e-01,  1.4730e-01,  6.2287e-01,  1.5049e-01],
         [-6.8474e-04,  4.7168e-02,  3.6157e-01, -1.2725e-01,  2.2440e-01,
           2.7673e-01, -1.4018e-01,  3.3133e-01, -7.4699e-02, -9.5472e-02],
         [-4.7937e-01,  1.9969e-01,  3.9888e-01,  8.8724e-01,  9.8833e-01,
           2.3337e-01,  4.5606e-01, -5.3211e-01, -2.1268e-01,  9.2721e-01],
         [ 6.8733e-02,  1.3531e-01, -1.7628e-01, -1.8989e-01, -9.0967e-03,
           3.9240e-01, -1.1892e-01, -2.8764e-01,  5.2636e-01, -3.7702e-01],
         [ 1.6201e-02, -4.6819e-01,