# A simple attention mechanism

Uses only numpy.

```mermaid
graph TD;
    A[Input text] --> B[Preprocessing steps];
    B --> C[Self-attention module];
    C -->D[Postprocessing steps <br> not implemented here];

    style B fill:orange
    style C fill:yellow, stroke-dasharray: 2 2
    style D fill:gray;    
```

## TODO

Preprocessing

- [x] sample input
- [x] convert tokens to naive embedding vectors


Self-attention 

- [x] simple attention for a fixed query
- [x] same for all queries w/ matrix multiplication
- [ ] attention with trainable weights ($q$, $k$, $v$)

## Init

In [35]:
%matplotlib inline
import numpy as np
#import matplotlib.pyplot as plt

In [9]:
# how many tokens?
ntokens=4

# dimensionality of embedding space
dim=3

**Input tokens**:
(I am skipping the step of tokenizing a string, so the token IDs below are just random integers.)

In [6]:
tokens=np.random.randint(0, 100, size=ntokens, dtype=int)

In [7]:
tokens

array([72,  3, 64, 69])

## Preprocessing

To go straight to implementing attention, I will skip the preprocessing step of converting each token into an embedding vector. Instead, I simply initialize one vector of dimension 3 (`dim`) per token, with random numbers as components.

Here is a **random array representing the embedding vectors** $\{x^{(i)}\}$, where each row corresponds to a different vector $x^{(i)}$:

In [10]:
# Generate a matrix of size ntokens x dim with values in [0, 1)
# (note: the name `input` shadows the Python builtin of the same name)
input = np.random.rand(ntokens, dim)

In [11]:
input

array([[0.86379499, 0.93042748, 0.82149658],
       [0.11288872, 0.23407153, 0.26185969],
       [0.79451453, 0.63626262, 0.39415654],
       [0.59248056, 0.70135167, 0.51830873]])
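
For reference, the skipped step of converting tokens to embedding vectors could look roughly like the sketch below. The names `embedding_table`, `vocab_size` and `embedded` are my own assumptions and are not used elsewhere in this notebook; the table is random here, whereas in a real model it would be learned.

```python
import numpy as np

# Hypothetical setup: token IDs are drawn from [0, 100), so the table needs 100 rows
vocab_size = 100
dim = 3

rng = np.random.default_rng(0)
# One random embedding vector per possible token ID (would be trainable in practice)
embedding_table = rng.random((vocab_size, dim))

# Token IDs copied from the sample output above
tokens = np.array([72, 3, 64, 69])

# Integer-array indexing picks one row of the table per token
embedded = embedding_table[tokens]
```

The lookup gives the same vector to every occurrence of the same token, which the purely random `input` matrix above does not guarantee.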

## Self-attention for one query 

Specify which vector will be the query vector using the (zero-based) array index below.

In [14]:
qIndex=1

Query vector

In [15]:
q=input[qIndex,:]

### Attention scores

Let's do it first in the naive (computationally inefficient) way. 

Empty array that will hold the scores

In [17]:
score=np.empty(ntokens)

Compute the dot product between the query vector and each embedding vector.

In [18]:
for i in range(ntokens):
    # `x` is a vector
    x=input[i,:]

    score[i]=x @ q

In [19]:
score

array([0.53041613, 0.13610384, 0.3418364 , 0.36677499])
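
The loop above can also be written as a single matrix-vector product. Here is a minimal, self-contained sketch that uses the embedding values printed earlier; `x` and `score_vec` are names I introduce only for this illustration.

```python
import numpy as np

# Embedding vectors copied from the output above (one row per token)
x = np.array([[0.86379499, 0.93042748, 0.82149658],
              [0.11288872, 0.23407153, 0.26185969],
              [0.79451453, 0.63626262, 0.39415654],
              [0.59248056, 0.70135167, 0.51830873]])

# Same query as qIndex=1
q = x[1]

# Each entry of the result is the dot product of one row of x with q
score_vec = x @ q
```

Up to rounding of the printed values, `score_vec` should match the `score` array computed with the loop.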

### Attention weights

The attention weights are the normalized scores. We use the usual `softmax` normalization, which makes the weights positive and makes them sum to 1.

First we define a naive softmax function.

In [23]:
def softmax(y):
    # ⚠️ prone to overflow for large inputs, needs to be improved
    exps=np.exp(y)

    return exps/exps.sum()    

Then we get the actual weights

In [24]:
weight=softmax(score)

In [25]:
weight

array([0.29838948, 0.20115732, 0.24710661, 0.25334659])
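
One common way to address the overflow warning in the naive `softmax` is to subtract the maximum score before exponentiating. The sketch below shows that idea; `stable_softmax` is a name I made up for it, and it is meant as an illustration rather than a replacement for the function used in the rest of the notebook.

```python
import numpy as np

def stable_softmax(y):
    # Subtracting the maximum does not change the result mathematically,
    # but every exponent becomes <= 0, so np.exp cannot overflow
    exps = np.exp(y - np.max(y))
    return exps / exps.sum()

# Scores copied from the output above
score = np.array([0.53041613, 0.13610384, 0.3418364, 0.36677499])
weight_stable = stable_softmax(score)

# Scores this large would overflow np.exp in the naive version
big = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
```

For the small scores in this notebook both versions should agree; the difference only shows up for inputs like the `big` example.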

### Context

The context vector $z^{(i)}$ for query index $i$ is the weighted sum $$z^{(i)} = \sum_j \alpha^{ij} x^{(j)}$$ where $\alpha^{ij}$ are the attention weights for query index $i$ and $x^{(j)}$ are the embedding vectors.

In [33]:
context=0
for i in range(ntokens):
    context+=weight[i]*input[i,:]

In [34]:
context

array([0.62688845, 0.65962472, 0.52651137])
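
The weighted sum can also be computed without a loop, as a vector-matrix product. A minimal sketch using the values printed above (`x`, `alpha` and `z` are illustrative names, not variables from the cells above):

```python
import numpy as np

# Embedding vectors copied from the output above (one row per token)
x = np.array([[0.86379499, 0.93042748, 0.82149658],
              [0.11288872, 0.23407153, 0.26185969],
              [0.79451453, 0.63626262, 0.39415654],
              [0.59248056, 0.70135167, 0.51830873]])

# Attention weights for the query with index 1, copied from the output above
alpha = np.array([0.29838948, 0.20115732, 0.24710661, 0.25334659])

# z = sum_j alpha[j] * x[j], written as a single product
z = alpha @ x
```

Up to rounding, `z` should agree with the `context` vector obtained from the loop.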

## Attention for all queries

This version uses matrix multiplication and is therefore more computationally efficient.

```mermaid
graph TD;
    A[1. Compute attention scores] --> B[2. Compute attention weights];
    B --> C[3. Compute context vectors];
    
    style B fill:orange
    style C fill:yellow;
```

### Attention scores

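Each row contains the scores for the query vector corresponding to that row. The matrix is symmetric because the scores are plain dot products, and row 1 reproduces the scores computed above for `qIndex=1`.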

In [38]:
scores=input@input.T

In [40]:
scores

array([[2.28669371, 0.53041613, 1.60209215, 1.59012746],
       [0.53041613, 0.13610384, 0.3418364 , 0.36677499],
       [1.60209215, 0.3418364 , 1.19144284, 1.12127305],
       [1.59012746, 0.36677499, 1.12127305, 1.11157133]])

### Attention weights

Yes, I know, using PyTorch would be faster and easier. But what would we be learning?

In [47]:
weights=np.empty_like(scores)

In [48]:
for i in range(ntokens):
    weights[i,:]=softmax(scores[i,:])

In [49]:
weights

array([[0.45971284, 0.07938619, 0.2318291 , 0.22907187],
       [0.29838948, 0.20115732, 0.24710661, 0.25334659],
       [0.38985174, 0.11055474, 0.25855726, 0.24103626],
       [0.39375687, 0.11585984, 0.24638103, 0.24400226]])
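
The row-by-row loop can be avoided by normalizing along an axis. Below is a sketch under my own naming (`softmax_rows`, `s`, `w`); it also subtracts each row's maximum before exponentiating, which is a change relative to the naive `softmax` defined earlier and should not affect the result for scores of this size.

```python
import numpy as np

def softmax_rows(s):
    # keepdims=True keeps the reduced axis so the result broadcasts against s
    exps = np.exp(s - s.max(axis=1, keepdims=True))
    # Divide every row by its own sum
    return exps / exps.sum(axis=1, keepdims=True)

# Scores copied from the output above
s = np.array([[2.28669371, 0.53041613, 1.60209215, 1.59012746],
              [0.53041613, 0.13610384, 0.3418364, 0.36677499],
              [1.60209215, 0.3418364, 1.19144284, 1.12127305],
              [1.59012746, 0.36677499, 1.12127305, 1.11157133]])

w = softmax_rows(s)
```

Each row of `w` sums to 1, and `w` should agree with the `weights` matrix from the loop up to rounding.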

### Context

In [52]:
contexts=weights@input

In [53]:
contexts

array([[0.72597167, 0.75447563, 0.60854748],
       [0.62688845, 0.65962472, 0.52651137],
       [0.69746917, 0.72216799, 0.57605494],
       [0.69352439, 0.72137613, 0.57739014]])
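
To wrap up, the three steps (scores, weights, contexts) can be bundled into one function. This is only a sketch of the weight-free attention shown in this notebook; the function names are my own, and the softmax here shifts each row by its maximum, which differs slightly from the naive version above.

```python
import numpy as np

def softmax_rows(s):
    # Row-wise softmax, shifted by the row maximum for numerical stability
    exps = np.exp(s - s.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def simple_self_attention(x):
    # 1. Attention scores: dot product between every pair of rows
    scores = x @ x.T
    # 2. Attention weights: normalize each row of scores
    weights = softmax_rows(scores)
    # 3. Context vectors: weighted sums of the rows of x
    return weights @ x

# Embedding vectors copied from the output earlier in the notebook
x = np.array([[0.86379499, 0.93042748, 0.82149658],
              [0.11288872, 0.23407153, 0.26185969],
              [0.79451453, 0.63626262, 0.39415654],
              [0.59248056, 0.70135167, 0.51830873]])

contexts_check = simple_self_attention(x)
```

With the same embeddings, `contexts_check` should reproduce the `contexts` matrix above up to rounding.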

# Sandbox

In [22]:
np.exp(score[0])/(np.sum(np.exp(score)))

0.29838947615945444