# Scaled dot-product attention: NumPy vs PyTorch implementations

- Covers the weight matrices $W_q$, $W_k$, $W_v$ and the resulting queries, keys, and values $q$, $k$, $v$
- Run `conda activate pytorch` before this

### TODO

Self-attention 

- [x] attention with trainable weights ($W_q$, $W_k$, $W_v$)
- [x] compare with pytorch class from book

## Init

In [1]:
import numpy as np

In [None]:
import torch.nn as nn
import torch

### Dimensionality

In [2]:
# how many tokens?
ntokens=4

# dimensionality of token embedding space
dim=3

# dimensionality of the query, key, and value vectors
dk=2

### Input tokens

(I am skipping the step of tokenizing a string)

In [3]:
tokens=np.random.randint(0, 100, size=ntokens, dtype=int)

In [4]:
tokens

array([79,  1, 14, 67])

## Preprocessing

For the sake of going straight into implementing attention, I will skip the preprocessing steps of converting a token into an embedding vector. I will simply initialize vectors of dimension 3 with random numbers as components, according to the number of tokens provided.

Here is a **random array representing the embedding vectors** $X$, where each row corresponds to a different vector $x^{(i)}$:

In [5]:
# Generate a matrix of size ntokens x dim with values in [0, 1)
input = np.random.rand(ntokens, dim)

In [6]:
input

array([[0.84711883, 0.29366239, 0.6859246 ],
       [0.01094959, 0.91137307, 0.48771804],
       [0.30762366, 0.60051404, 0.80047688],
       [0.46252616, 0.94153513, 0.05022171]])
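
For reference, the embedding step skipped above could be sketched as a lookup table with one row per token id. This is only an illustration under assumptions: `vocab_size`, `embedding_table`, and `rng` are hypothetical names that do not appear elsewhere in this notebook, and the table is random rather than trained.

```python
import numpy as np

# Hypothetical names for illustration: vocab_size, embedding_table, rng.
vocab_size = 100   # token ids above are drawn from [0, 100)
dim = 3            # embedding dimension, as in this notebook

rng = np.random.default_rng(0)                    # fixed seed for reproducibility
embedding_table = rng.random((vocab_size, dim))   # one row per possible token id

tokens = np.array([79, 1, 14, 67])   # the token ids shown earlier
X = embedding_table[tokens]          # integer indexing selects one row per token

print(X.shape)  # (4, 3)
```

In a real model the table would be a trainable parameter; here a random one is enough to show the shapes involved.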

## Initialize weight matrices

I initialize these separately so that the same matrices can be used both in my NumPy implementation and in the PyTorch class, for the sake of comparison.

In [17]:
W_query=np.random.rand(dim, dk)
W_key=np.random.rand(dim, dk)
W_value=np.random.rand(dim, dk)

## NumPy implementation

In [20]:
x=input

### Queries, keys and values

In [22]:
keys = x @ W_key
queries = x @ W_query
values = x @ W_value

In [25]:
keys

array([[1.04736787, 0.78204195],
       [0.4219047 , 1.03244979],
       [0.73243049, 0.91343069],
       [0.59448109, 1.12181464]])

### Attention scores

In [23]:
attn_scores = queries @ keys.T 

In [50]:
attn_scores

array([[1.61744518, 1.35153675, 1.48872608, 1.57951193],
       [1.39766512, 1.10238484, 1.25296955, 1.30298844],
       [1.67081042, 1.39618009, 1.53787067, 1.63167416],
       [1.22167049, 0.87404728, 1.04945537, 1.05431901]])

### Softmax implementation

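This softmax implementation is naive and slow: it loops over the rows in Python, and it does not subtract the row maximum before exponentiating, so it could overflow for large scores.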

In [31]:
def softmax(scores):
    result=np.empty_like(scores)
        
    for i in range(scores.shape[0]):
        exps=np.exp(scores[i,:])
        result[i,:]=exps/exps.sum()
            
    return result

In [32]:
softmax(attn_scores)

array([[0.27712289, 0.21241728, 0.24365224, 0.2668076 ],
       [0.28414938, 0.21149891, 0.24587039, 0.25848132],
       [0.27801019, 0.21124687, 0.24340288, 0.26734006],
       [0.29463193, 0.20811768, 0.24802059, 0.24922981]])
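
A vectorized alternative to the loop-based softmax could look like the sketch below. It is not the version used in the rest of this notebook; `softmax_stable` is a hypothetical name, and subtracting the row maximum is a standard trick that leaves the result unchanged while avoiding overflow in `np.exp`.

```python
import numpy as np

def softmax_stable(scores):
    """Row-wise softmax without a Python loop (hypothetical helper)."""
    # Shifting each row by its maximum does not change the softmax,
    # but keeps the arguments of exp() at or below zero.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

# Attention scores copied from the output shown earlier
attn_scores = np.array([[1.61744518, 1.35153675, 1.48872608, 1.57951193],
                        [1.39766512, 1.10238484, 1.25296955, 1.30298844],
                        [1.67081042, 1.39618009, 1.53787067, 1.63167416],
                        [1.22167049, 0.87404728, 1.04945537, 1.05431901]])

weights = softmax_stable(attn_scores)
print(weights)
print(weights.sum(axis=-1))  # each row should sum to 1 up to rounding
```

Using `keepdims=True` keeps the reduced axis so that the subtraction and division broadcast row by row.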

### Attention weights

In [33]:
attn_weights = softmax(attn_scores / keys.shape[-1]**0.5)

In [34]:
attn_weights

array([[0.26916971, 0.22303226, 0.24575225, 0.26204578],
       [0.27400623, 0.222373  , 0.24735774, 0.25626302],
       [0.26979738, 0.22217786, 0.24559126, 0.2624335 ],
       [0.28122913, 0.21994181, 0.24898565, 0.24984341]])

### Context vectors

In [35]:
context_vec = attn_weights @ values

In [36]:
context_vec

array([[1.04199908, 1.1026673 ],
       [1.04514966, 1.10237919],
       [1.04210659, 1.10270256],
       [1.04925196, 1.10208835]])
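
The steps above amount to computing $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ with $Q = XW_q$, $K = XW_k$, $V = XW_v$. As a minimal sketch, they can be collected into one function. The name `self_attention_np` is hypothetical, the softmax is written inline in vectorized form rather than with the loop used above, and the data here is freshly generated, so the numbers will differ from the outputs in this notebook.

```python
import numpy as np

def self_attention_np(x, W_query, W_key, W_value):
    """Scaled dot-product self-attention in NumPy (hypothetical helper)."""
    queries = x @ W_query
    keys = x @ W_key
    values = x @ W_value

    # Scale the scores by sqrt(d_k), the dimension of the key vectors
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])

    # Row-wise softmax, shifted by the row maximum for numerical safety
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ values

# Same shapes as in this notebook: 4 tokens, dim=3, dk=2
rng = np.random.default_rng(0)
x_demo = rng.random((4, 3))
W_q, W_k, W_v = (rng.random((3, 2)) for _ in range(3))

out = self_attention_np(x_demo, W_q, W_k, W_v)
print(out.shape)  # (4, 2)
```

Because each row of attention weights is non-negative and sums to 1, every context vector is a weighted average of the value vectors.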

## Comparison with PyTorch result

### PyTorch class from the book

- `d_in`: embedding dimension of the tokens
- `d_out`: embedding dimension of the q, k, v vectors

This is the original class from the book:

```python
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
```

### Modified class

However, I modified it so that it accepts the weight matrices as input.

In [45]:
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out, W_query, W_key, W_value):
        super().__init__()
        self.W_query = nn.Parameter(torch.from_numpy(W_query).float())
        self.W_key = nn.Parameter(torch.from_numpy(W_key).float())
        self.W_value = nn.Parameter(torch.from_numpy(W_value).float())

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec

Convert the input from NumPy to PyTorch

In [46]:
inputTorch=torch.from_numpy(input)

Use the class and get context vectors

In [47]:
#torch.manual_seed(123)
sa_v1 = SelfAttention_v1(dim, dk, W_query, W_key, W_value)
sa_v1(inputTorch.float())

tensor([[1.0420, 1.1027],
        [1.0451, 1.1024],
        [1.0421, 1.1027],
        [1.0493, 1.1021]], grad_fn=<MmBackward0>)

### Compare with mine

In [48]:
context_vec

array([[1.04199908, 1.1026673 ],
       [1.04514966, 1.10237919],
       [1.04210659, 1.10270256],
       [1.04925196, 1.10208835]])

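The difference is accounted for by rounding: PyTorch prints four decimal places, and the PyTorch computation runs in float32 while the NumPy one runs in float64.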
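
The agreement can also be checked numerically rather than by eye. The sketch below uses only the values printed above, so it tests agreement to the four decimals PyTorch displays; `torch_printed` is a hypothetical name for the tensor values copied by hand.

```python
import numpy as np

# NumPy result (float64), copied from the output above
context_vec = np.array([[1.04199908, 1.1026673 ],
                        [1.04514966, 1.10237919],
                        [1.04210659, 1.10270256],
                        [1.04925196, 1.10208835]])

# PyTorch result as printed (float32, displayed with four decimals)
torch_printed = np.array([[1.0420, 1.1027],
                          [1.0451, 1.1024],
                          [1.0421, 1.1027],
                          [1.0493, 1.1021]])

# Largest absolute difference between the two results
max_diff = np.abs(context_vec - torch_printed).max()
print(max_diff)

# Rounding to four decimals allows an error of at most half a unit
# in the last printed digit, i.e. 5e-5
print(np.allclose(context_vec, torch_printed, rtol=0, atol=5e-5))
```

Inside the notebook, the full-precision tensor could presumably be compared directly by converting it with `sa_v1(inputTorch.float()).detach().numpy()` and using a tolerance suited to float32.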

# Sandbox