# NumPy implementation of scaled dot-product attention

- Uses only NumPy.
- Covers the weight matrices $W_q, W_k, W_v$ and the queries, keys and values $Q, K, V$.

## Init

In [3]:
import numpy as np

### Dimensionality

In [4]:
# how many tokens?
ntokens=4

# dimensionality of token embedding space
dim=3

# dimensionality of the query, key and value vectors
dk=2

### Input tokens

(I am skipping the step of tokenizing a string)

In [5]:
tokens=np.random.randint(0, 100, size=ntokens, dtype=int)

In [6]:
tokens

array([12,  4, 27, 70])

## Generate mock embedding vectors

To go straight to implementing attention, I skip the preprocessing step that converts each token into an embedding vector. Instead, I initialize one vector of dimension `dim` (here 3) per token, with random numbers as components.

Here is a **random array representing the embedding vectors** $X$, where each row corresponds to a different vector $x^{(i)}$:

In [7]:
# Generate a matrix of size ntokens x dim with values in [0, 1)
input = np.random.rand(ntokens, dim)

In [8]:
input

array([[0.64035478, 0.34173638, 0.15410133],
       [0.80115281, 0.54747531, 0.7199265 ],
       [0.10293235, 0.03616533, 0.51787903],
       [0.7233325 , 0.06916225, 0.85134328]])

In [9]:
X=input
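
For reference, the skipped embedding step is usually a table lookup: each token id selects one row of a matrix. A minimal sketch, assuming a vocabulary of 100 ids (matching the `randint(0, 100)` range above) and a randomly initialized table; `vocab_size`, `embedding_table` and `X_lookup` are names introduced here for illustration and are not part of this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

vocab_size = 100  # hypothetical: matches the token id range used above
dim = 3           # embedding dimension, as in this notebook

# Hypothetical embedding table: one random row per possible token id.
embedding_table = rng.random((vocab_size, dim))

tokens = np.array([12, 4, 27, 70])  # the token ids shown above
X_lookup = embedding_table[tokens]  # integer-array indexing picks one row per token

print(X_lookup.shape)  # (4, 3)
```

In a trained model the table would be learned rather than random; here it only illustrates the indexing.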

## Initialize weight matrices

In [10]:
W_query=np.random.rand(dim, dk)
W_key=np.random.rand(dim, dk)
W_value=np.random.rand(dim, dk)

## Queries, keys and values

$$ Q = X W_q$$
$$ K = X W_k$$
$$ V = X W_v$$

In [11]:
Q = X @ W_query
K = X @ W_key
V = X @ W_value

## Attention

The attention is given by $${\rm Attention}(Q,K,V) = {\rm softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$
where $QK^T$ are the attention scores, the softmax (applied to each row) gives the attention weights, and the result ${\rm Attention}(Q,K,V)$ is the context.
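
Tracking shapes helps when reading the formula. With $n$ tokens, embedding dimension $d$ and projection dimension $d_k$ (`ntokens`, `dim` and `dk` in this notebook):

$$ X \in \mathbb{R}^{n \times d}, \qquad W_q, W_k, W_v \in \mathbb{R}^{d \times d_k}, \qquad Q, K, V \in \mathbb{R}^{n \times d_k} $$
$$ QK^T \in \mathbb{R}^{n \times n}, \qquad {\rm softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V \in \mathbb{R}^{n \times d_k} $$

For the values used here ($n=4$, $d=3$, $d_k=2$), the scores and weights are $4 \times 4$ and the context is $4 \times 2$, which matches the arrays printed further down.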

### Attention scores

Given by $Q K^T$; the scaling by $\sqrt{d_k}$ is applied later, when computing the attention weights.

In [12]:
attn_scores = Q @ K.T

### Softmax

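This softmax implementation is naive and slow: it loops over the rows in Python, and it does not subtract the row maximum before exponentiating, so very large scores can overflow.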

In [13]:
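def softmax(X):
    # float output buffer, so integer inputs are not truncated
    result = np.empty_like(X, dtype=float)

    # normalize each row independently
    for i, row in enumerate(X):
        exps = np.exp(row)
        result[i, :] = exps / exps.sum()

    return result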
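
A common alternative is to vectorize over rows and subtract each row's maximum before exponentiating. The shift cancels in the ratio, so the result is mathematically unchanged, but `np.exp` no longer overflows on large inputs. A sketch, where `softmax_stable` is a name introduced here and not part of the original notebook:

```python
import numpy as np

def softmax_stable(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)  # largest entry becomes 0
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Scores this large would overflow np.exp without the shift.
big = np.array([[1000.0, 1001.0, 1002.0]])
print(softmax_stable(big))  # approximately [[0.090 0.245 0.665]]
```

For the small scores in this notebook the shift makes no practical difference; it matters once scores reach the hundreds.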

### Attention weights

In [14]:
attn_weights = softmax(attn_scores / K.shape[-1]**0.5)
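
The division by $\sqrt{d_k}$ keeps the scores at a comparable scale as `dk` grows. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The sketch below checks this empirically; it assumes standard-normal components (unlike the uniform `np.random.rand` values used in this notebook), and the sizes in `dk_test` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
nsamples = 10_000

variances = {}
for dk_test in (2, 64, 512):  # hypothetical sizes chosen for illustration
    q = rng.standard_normal((nsamples, dk_test))
    k = rng.standard_normal((nsamples, dk_test))
    scores = np.sum(q * k, axis=1)  # one dot product per sample
    variances[dk_test] = (scores.var(), (scores / np.sqrt(dk_test)).var())

for dk_test, (raw, scaled) in variances.items():
    print(f"dk={dk_test:4d}  var(raw)={raw:8.2f}  var(scaled)={scaled:5.2f}")
```

Without the scaling, large `dk` pushes the softmax towards a near one-hot output, which is the usual motivation given for the factor.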

### Context

In [15]:
attn = attn_weights @ V

In [16]:
attn

array([[0.95090118, 0.45092381],
       [1.00706272, 0.48109718],
       [0.97225873, 0.46184897],
       [1.01025764, 0.48263529]])
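
The steps above can be collected into a single function. This is a minimal sketch for the single-head, unmasked case treated in this notebook. The function name, the fixed seed, and the max-shift inside the softmax (added for numerical stability) are assumptions of this sketch rather than part of the original cells:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (context, weights) for single-head, unmasked attention."""
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                         # scaled attention scores
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
ntokens, dim, dk = 4, 3, 2
X = rng.random((ntokens, dim))
W_query, W_key, W_value = (rng.random((dim, dk)) for _ in range(3))

context, weights = scaled_dot_product_attention(X @ W_query, X @ W_key, X @ W_value)
print(context.shape)         # (4, 2)
print(weights.sum(axis=-1))  # each row sums to 1, up to rounding
```

Returning the weights alongside the context makes it easy to inspect how much each token attends to the others.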

# Sandbox

In [33]:
attn_weights

array([[0.24152842, 0.29440261, 0.24147565, 0.22259332],
       [0.21538969, 0.37558925, 0.22729347, 0.18172759],
       [0.21413151, 0.36706921, 0.23176097, 0.18703831],
       [0.22252215, 0.33964576, 0.23726476, 0.20056732]])

In [26]:
def softmax2(x, axis=1):
    exp_x = np.exp(x)  # Exponentiate the input (no max-shift is applied)
    sum_exp_x = np.sum(exp_x, axis=axis, keepdims=True)  # Sum along the specified axis
    softmax_output = exp_x / sum_exp_x  # Normalize by the sum
    return softmax_output

In [27]:
softmax(attn_scores)

array([[0.24814573, 0.28976314, 0.20758454, 0.25450659],
       [0.23132908, 0.37280341, 0.13723455, 0.25863297],
       [0.24106925, 0.31857965, 0.17960542, 0.26074569],
       [0.22875064, 0.37683716, 0.13390889, 0.2605033 ]])

In [28]:
softmax2(attn_scores)

array([[0.24814573, 0.28976314, 0.20758454, 0.25450659],
       [0.23132908, 0.37280341, 0.13723455, 0.25863297],
       [0.24106925, 0.31857965, 0.17960542, 0.26074569],
       [0.22875064, 0.37683716, 0.13390889, 0.2605033 ]])
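
Rather than comparing the two printed arrays by eye, the agreement can be checked with `np.allclose`. The sketch below redefines both versions so that it runs on its own; `softmax_loop`, `softmax_vectorized` and the random `scores` matrix are stand-ins for `softmax`, `softmax2` and `attn_scores`:

```python
import numpy as np

def softmax_loop(X):
    """Row-by-row softmax, mirroring `softmax` above."""
    result = np.empty_like(X, dtype=float)
    for i, row in enumerate(X):
        exps = np.exp(row)
        result[i, :] = exps / exps.sum()
    return result

def softmax_vectorized(x, axis=1):
    """Vectorized softmax, mirroring `softmax2` above."""
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
scores = rng.random((4, 4))     # stand-in for attn_scores

print(np.allclose(softmax_loop(scores), softmax_vectorized(scores)))  # True
```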