## Attention Mechanism

1. In the past few years, the attention mechanism has become one of the most important building blocks in deep learning. It powers the language models that we hear about so often, and it is also used in other areas such as vision tasks.
1. The idea behind attention can be quite confusing. Thankfully, its implementation (in NumPy) is not. It requires only basic linear algebra and an understanding of the forward pass.
1. We will therefore focus on the mechanics, and leave the intuition behind attention for you to discover.

In [None]:
from __future__ import annotations
import numpy as np
from scipy.special import softmax

This is what the attention mechanism looks like: 

$\text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

$Q$, $K$ and $V$ are matrices. To get them, take an input $X$ and multiply it by a separate weight matrix for each:

$Q = XW_q$  
$K = XW_k$  
$V = XW_v$  

$X$ can be your original input or it could be the output from a previous layer. $W_q$, $W_k$, $W_v$ are weights.

Task: Create a function that generates $X$, $W_q$, $W_k$ and $W_v$.

1. Use NumPy's `random.randn` to generate the data and weights. Recall that it takes the dimensions you want as arguments, e.g. `randn(2, 3)` returns a 2x3 array.
1. We will stick to 2-dimensional arrays for the data and for each of the weights. For example, $X$ could be in $\mathbb{R}^{2\times32}$.
1. Ensure that $X$ and the weights are conformable. Hence, if $X \in \mathbb{R}^{2\times32}$, then $W_q \in \mathbb{R}^{32\times D}$.
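
As a quick illustration of conformability (not the task solution itself), the sketch below draws two arrays with `randn` and checks that their inner dimensions match before multiplying. The sizes and the names `X_demo` and `W_demo` are arbitrary choices made for this example.

```python
import numpy as np

# Seeded generator, used only so the example is repeatable.
rng_state = np.random.RandomState(0)

# X_demo has 2 rows and 32 features; W_demo maps 32 features to 4 outputs.
X_demo = rng_state.randn(2, 32)
W_demo = rng_state.randn(32, 4)

# Conformable: the columns of X_demo (32) equal the rows of W_demo (32).
assert X_demo.shape[1] == W_demo.shape[0]

out = X_demo @ W_demo
print(out.shape)  # (2, 4)
```

Swapping the operands (`W_demo @ X_demo`) would raise a `ValueError`, because 4 columns do not match 2 rows.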

<details>
<summary>Click here to reveal answer</summary>

```
# initialise X, q_weights, k_weights, v_weights
def generate_data_and_weights(num_data: int, num_features: int, num_out_dims: int) -> tuple[np.ndarray, ...]:
    """Generate random input data and query, key and value weights.

    Args:
        num_data (int): number of rows in X
        num_features (int): number of columns/features in X
        num_out_dims (int): output dimension of q, k and v weights

    Returns:
        tuple[np.ndarray, ...]: X, q_weights, k_weights and v_weights
    """
    X = np.random.randn(num_data, num_features)
    q_weights = np.random.randn(num_features, num_out_dims)
    k_weights = np.random.randn(num_features, num_out_dims)
    v_weights = np.random.randn(num_features, num_out_dims)
    return X, q_weights, k_weights, v_weights
```

</details>

In [None]:
# initialise X, q_weights, k_weights, v_weights
def generate_data_and_weights(num_data: int, num_features: int, num_out_dims: int) -> tuple[np.ndarray, ...]:
    """Generate random input data and query, key and value weights.

    Args:
        num_data (int): number of rows in X
        num_features (int): number of columns/features in X
        num_out_dims (int): output dimension of q, k and v weights

    Returns:
        tuple[np.ndarray, ...]: X, q_weights, k_weights and v_weights
    """

    X = None
    q_weights = None
    k_weights = None
    v_weights = None
    return X, q_weights, k_weights, v_weights

In [None]:
X, q_weights, k_weights, v_weights = generate_data_and_weights(196, 768, 8)

Task:

Create a function that takes in X, q_weights, k_weights and v_weights, and returns Q, K, V, where Q, K and V are the linear projections of X by the respective weights.

<details>
<summary>Click here to reveal answer</summary>

```
def get_QKV(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray) -> tuple[np.ndarray, ...]:
    Q = X @ q_weights
    K = X @ k_weights
    V = X @ v_weights
    return Q, K, V
```

</details>

In [None]:
def get_QKV(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray) -> tuple[np.ndarray, ...]:
    Q = None
    K = None
    V = None
    return Q, K, V

Q, K, V = get_QKV(X, q_weights, k_weights, v_weights)

Now that you have obtained Q, K and V, you want to work on the interaction between Q and K, and scale it by the square root of the dimension of K.

$A = \frac{Q K^T}{\sqrt{d_k}}$
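
One common motivation for the $\sqrt{d_k}$ factor is that the dot product of two vectors with independent standard-normal entries has a variance of roughly $d_k$, so dividing by $\sqrt{d_k}$ brings the scores back to a variance of about 1. The sketch below checks this empirically. The value `d_k = 64` and the sample size are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64           # illustrative key dimension (an assumption for this demo)
num_pairs = 20000  # number of random (q, k) pairs to sample

q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

raw = np.sum(q * k, axis=1)   # unscaled dot products, one per pair
scaled = raw / np.sqrt(d_k)   # scaled as in the formula for A

# The first variance should be close to d_k, the second close to 1.
print(raw.var(), scaled.var())
```

Without the scaling, larger values of $d_k$ produce larger scores, which pushes the softmax towards putting almost all of its weight on a single entry.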

Task:

Create a function that takes Q, K and D, and returns A, the scaled interaction between Q and K. Here D is $d_k$, the number of columns of K.

The output should be an NxN matrix, where N is the number of rows that X has.

<details>
<summary>Click here to reveal answer</summary>

```
def get_attention(Q: np.ndarray, K: np.ndarray, D: int) -> np.ndarray:
    A = Q@K.T / np.sqrt(D)
    return A
```

</details>

In [None]:

def get_attention(Q: np.ndarray, K: np.ndarray, D: int) -> np.ndarray:
    A = None
    return A

A = get_attention(Q, K, 8)

The next step is to apply softmax to the scaled scores along the last axis, so that each row of A (the weights of one query over all keys) sums to 1. Because exponentiating large scores in a hand-written softmax can overflow, we will instead use SciPy's `softmax` function.

In [None]:
A = softmax(A, axis=-1)
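
To see why a library routine is convenient here, the sketch below contrasts a naive softmax with the usual max-subtraction trick. `naive_softmax` and `stable_softmax` are hypothetical helper names for illustration, and this is not a claim about how SciPy implements `softmax` internally.

```python
import numpy as np

def naive_softmax(x, axis=-1):
    # Exponentiates the raw scores, which overflows for large inputs.
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stable_softmax(x, axis=-1):
    # Subtracting the maximum leaves the result unchanged mathematically
    # but keeps every exponent at or below zero.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 1.0, 2.0]])

# Silence the overflow warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(scores)

stable = stable_softmax(scores)

print(np.isnan(naive[0]).all())  # True: the large row overflows to nan
print(stable.sum(axis=-1))       # each row sums to 1
```

Both rows differ only by a constant shift of 1000, so the stable version returns the same weights for each of them.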

Now that you have the attention weights, we want to calculate the self-attention output. In practice, this simply means doing a matmul of A and V.

Task: Complete the function `get_self_attention` below, which takes in A and V, performs the matmul and returns SA.

<details>
<summary>Click here for the answer</summary>

```
def get_self_attention(A, V):
    SA = A@V
    return SA
```
</details>

In [None]:
def get_self_attention(A, V):
    SA = None
    return SA
SA = get_self_attention(A, V)

Task: Put together the functions defined above into a single function: `single_head_self_attention`. It should take `X, q_weights, k_weights, v_weights, D` as parameters, and return the self-attention output, `SA`.

<details>
<summary>Click here for the answer</summary>

```
def single_head_self_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int) -> np.ndarray:
    Q, K, V = get_QKV(X, q_weights, k_weights, v_weights)
    A = get_attention(Q, K, D)
    A = softmax(A, axis=-1)  # normalise over the keys so each row sums to 1
    SA = get_self_attention(A, V)
    return SA
```

</details>

In [None]:
def single_head_self_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int) -> np.ndarray:
    SA = None
    return SA

## Multi-Head Attention

Another concept introduced in the paper "Attention Is All You Need" is Multi-Head Attention.

![](./msa-paper.png)

The original diagram above raises more questions than it answers:

1. Is the data first split, run through the heads and then concatenated? Or is the data first multiplied by W, then split across the heads, passed through self-attention per head and finally concatenated?
1. A smaller question is whether each head is the size of the data itself. For example, if the data has 32 features, does each head have to accommodate all 32 features?

We will use the implementation as guided by PyTorch's own documentation - https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention

```
num_heads: Number of parallel attention heads. Note that ``embed_dim`` will be split
            across ``num_heads`` (i.e. each head will have dimension ``embed_dim // num_heads``).
```

Interpretation:
1. We decide the embedding dimension. In our context, it is 8.
1. The number of heads is 2. Hence, each head has a dimension of only 4.

How we might intuitively do this is:

1. You have X and the Weights. Let's say X is n x 32, weights are 32x8. 
1. We want to split by heads. Heads here is 2
1. Split X by the number of heads: n x 2 x 16
1. Split the weights as well: 2 x 16 x 4 
1. Do matmul to get n x 2 x 4

That might work. However, it requires you to loop over each head, which is inefficient. Instead, let's flip it around:

1. Do the matmul on X and the weights: (n x 32) . (32 x 8) -> n x 8.
1. Reshape the output into 2 heads -> n x 2 x 4, then move the head axis to the front -> 2 x n x 4.
1. Compute the scaled scores for each head separately, treating the head axis as a batch axis so that each head gets its own n x n matrix. Here, use `einsum` to control the dimensions. The scaling uses the per-head dimension, 4.
1. Use the same concept to get the softmax and the self-attention.
1. Move the head axis back and reshape to n x 8.
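
A toy sketch of reshaping into heads and computing per-head scores with `einsum` is shown below. The sizes and the names `n`, `h` and `head_dim` are assumptions for the example, and the subscripts `"hik,hjk->hij"` are one possible choice, checked here against an explicit loop over the heads.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, h = 5, 8, 2    # toy sizes: 5 rows, embedding dim 8, 2 heads
head_dim = D // h

Q = rng.standard_normal((n, D))
K = rng.standard_normal((n, D))

# (n, D) -> (n, h, head_dim) -> (h, n, head_dim): one batch entry per head.
Qh = Q.reshape(n, h, head_dim).transpose(1, 0, 2)
Kh = K.reshape(n, h, head_dim).transpose(1, 0, 2)

# Batched scores: one (n, n) matrix per head, scaled by the per-head dimension.
scores = np.einsum("hik,hjk->hij", Qh, Kh) / np.sqrt(head_dim)

# Reference: loop over the heads, slicing the columns that belong to each one.
looped = np.stack([
    Q[:, i * head_dim:(i + 1) * head_dim] @ K[:, i * head_dim:(i + 1) * head_dim].T
    for i in range(h)
]) / np.sqrt(head_dim)

print(scores.shape)                 # (2, 5, 5)
print(np.allclose(scores, looped))  # True
```

The batched version gives the same numbers as the loop, but leaves the iteration over heads to NumPy.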

Task: Create the `multi_head_attention` function, with the parameters `X, q_weights, k_weights, v_weights, D, num_heads` (the number of heads is called `h` in the signature below).

<details>
<summary>Click here to reveal the answers</summary>

```
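def multi_head_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int, h: int) -> np.ndarray:
    n = X.shape[0]
    head_dim = D // h
    Q, K, V = get_QKV(X, q_weights, k_weights, v_weights)

    # (n, D) -> (n, h, head_dim) -> (h, n, head_dim), so each head is one batch entry
    Q = Q.reshape(n, h, head_dim).transpose(1, 0, 2)
    K = K.reshape(n, h, head_dim).transpose(1, 0, 2)
    V = V.reshape(n, h, head_dim).transpose(1, 0, 2)

    def get_multi_head_attention(Q, K, head_dim):
        # equivalent to Q @ np.transpose(K, (0, 2, 1)) / np.sqrt(head_dim)
        return np.einsum('hik,hjk->hij', Q, K) / np.sqrt(head_dim)

    A = get_multi_head_attention(Q, K, head_dim)  # (h, n, n)
    A = softmax(A, axis=-1)  # normalise over the keys within each head

    SA = A @ V  # (h, n, head_dim)

    # move the head axis back and concatenate the heads: (h, n, head_dim) -> (n, D)
    SA = SA.transpose(1, 0, 2).reshape(n, D)
    return SA
```

</details>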


In [None]:
def multi_head_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int, h: int) -> np.ndarray:
    
    SA = None
    
    return SA

MSA = multi_head_attention(X, q_weights, k_weights, v_weights, 8, 2)

In [None]:
np.random.seed(42)
X, q_weights, k_weights, v_weights = generate_data_and_weights(196, 768, 8)
SA = single_head_self_attention(X, q_weights, k_weights, v_weights, 8)
MSA = multi_head_attention(X, q_weights, k_weights, v_weights, 8, 2)

print (SA.shape)
print (MSA.shape)
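
Matching shapes alone do not show that the two functions are correct. One possible sanity check, sketched below with compact stand-in implementations, relies on two properties: with a single head, multi-head attention should reduce to single-head attention, and each head's output should equal single-head attention computed with that head's slice of the weight columns. The helpers `softmax_rows`, `ref_single_head` and `ref_multi_head` are hypothetical names written for this check, and the stand-ins scale by the per-head dimension.

```python
import numpy as np

def softmax_rows(x):
    # Hypothetical helper: softmax over the last axis with max-subtraction.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ref_single_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ref_multi_head(X, Wq, Wk, Wv, h):
    n, D = X.shape[0], Wq.shape[1]

    def split(M):
        # (n, D) -> (h, n, D // h)
        return M.reshape(n, h, D // h).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    A = softmax_rows(Q @ K.transpose(0, 2, 1) / np.sqrt(D // h))
    return (A @ V).transpose(1, 0, 2).reshape(n, D)

rng = np.random.default_rng(42)
X_s = rng.standard_normal((6, 16))
Wq_s, Wk_s, Wv_s = (rng.standard_normal((16, 8)) for _ in range(3))

# Property 1: one head is the same as single-head attention.
one_head = np.allclose(
    ref_multi_head(X_s, Wq_s, Wk_s, Wv_s, 1),
    ref_single_head(X_s, Wq_s, Wk_s, Wv_s),
)

# Property 2: the first head matches single-head attention on the first 4 columns.
multi = ref_multi_head(X_s, Wq_s, Wk_s, Wv_s, 2)
first_head = ref_single_head(X_s, Wq_s[:, :4], Wk_s[:, :4], Wv_s[:, :4])
head_matches = np.allclose(multi[:, :4], first_head)

print(one_head, head_matches)  # True True
```

The same two comparisons can be run against your own `single_head_self_attention` and `multi_head_attention`, provided they use the same scaling convention.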

<details>

<summary>Click Here for the full answers</summary>

```
def generate_data_and_weights(num_data: int, num_features: int, num_out_dims: int) -> tuple[np.ndarray, ...]:
    """Generate random input data and query, key and value weights.

    Args:
        num_data (int): number of rows in X
        num_features (int): number of columns/features in X
        num_out_dims (int): output dimension of q, k and v weights

    Returns:
        tuple[np.ndarray, ...]: X, q_weights, k_weights and v_weights
    """
    X = np.random.randn(num_data, num_features)
    q_weights = np.random.randn(num_features, num_out_dims)
    k_weights = np.random.randn(num_features, num_out_dims)
    v_weights = np.random.randn(num_features, num_out_dims)
    return X, q_weights, k_weights, v_weights

def get_QKV(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray) -> tuple[np.ndarray, ...]:
    Q = X@q_weights
    K = X@k_weights
    V = X@v_weights
    return Q, K, V

def get_attention(Q: np.ndarray, K: np.ndarray, D: int) -> np.ndarray:
    A = Q@K.T / np.sqrt(D)
    return A

def get_self_attention(A, V):
    SA = A@V
    return SA

def single_head_self_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int) -> np.ndarray:
    Q, K, V = get_QKV(X, q_weights, k_weights, v_weights)
    A = get_attention(Q, K, D)
    A = softmax(A, axis=-1)  # normalise over the keys so each row sums to 1
    SA = get_self_attention(A, V)
    return SA

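def multi_head_attention(X: np.ndarray, q_weights: np.ndarray, k_weights: np.ndarray, v_weights: np.ndarray, D: int, h: int) -> np.ndarray:
    n = X.shape[0]
    head_dim = D // h
    Q, K, V = get_QKV(X, q_weights, k_weights, v_weights)

    # (n, D) -> (n, h, head_dim) -> (h, n, head_dim), so each head is one batch entry
    Q = Q.reshape(n, h, head_dim).transpose(1, 0, 2)
    K = K.reshape(n, h, head_dim).transpose(1, 0, 2)
    V = V.reshape(n, h, head_dim).transpose(1, 0, 2)

    def get_multi_head_attention(Q, K, head_dim):
        # equivalent to Q @ np.transpose(K, (0, 2, 1)) / np.sqrt(head_dim)
        return np.einsum('hik,hjk->hij', Q, K) / np.sqrt(head_dim)

    A = get_multi_head_attention(Q, K, head_dim)  # (h, n, n)
    A = softmax(A, axis=-1)  # normalise over the keys within each head

    SA = A @ V  # (h, n, head_dim)

    # move the head axis back and concatenate the heads: (h, n, head_dim) -> (n, D)
    SA = SA.transpose(1, 0, 2).reshape(n, D)
    return SA
```

</details>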