In [61]:
import torch

At some point I'll want to train my model. The data will be tokens, where each token is a vector of dimension $d_e$ (e for embedding).

Take a particular window of $d_w$ tokens, for example the sentence "Hello my name is Donny". It becomes a tensor of shape $(d_w, d_e)$, and during training we probably want to sample many such sentences to form a single batch of shape $(d_b, d_w, d_e)$.

Batches complicate things, and so does an embedding dimension $d_e > 1$, so I will ignore both for now.
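To make the shapes concrete, here is a small sketch. The sizes below are hypothetical and chosen only for illustration; random values stand in for real embeddings.

```python
import torch

# Hypothetical sizes, chosen only for illustration
d_b, d_w, d_e = 4, 5, 8  # batch size, window length, embedding size

sentence = torch.randn(d_w, d_e)    # one window of d_w tokens
batch = torch.randn(d_b, d_w, d_e)  # d_b such windows stacked together

print(sentence.shape)
print(batch.shape)
```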

My goal: given a tensor of shape $(d_w, d_e)$ with $d_e=1$, weigh each token (self-attention). So given $(d_w, d_e)$ I should get $(d_w, d_e)$ back.

In [132]:
torch.manual_seed(0)
d_w, d_e = 5,1
x = torch.randn((d_e, d_w))  # note: stored as a row, shape (d_e, d_w) = (1, 5)
x

tensor([[ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845]])

One strategy to model the dependencies between the embeddings is to simply sum them up and hope that information can be used to predict the next token.

In [133]:
x.sum(dim=-1, keepdim=True)  # sum over the token dimension (dim=0 has size 1 here, so summing it changes nothing)

tensor([[-1.4473]])

It's not a bad first strategy, so let's keep it. What I want next is to predict, all at once, the next token at every position from the sum of the tokens seen so far.

So I need a way to grab those values. I think I can do that with a lower-triangular matrix of ones.

In [135]:
mask = torch.tril(torch.ones((d_w, d_w)))
mask*x

tensor([[ 1.5410, -0.0000, -0.0000,  0.0000, -0.0000],
        [ 1.5410, -0.2934, -0.0000,  0.0000, -0.0000],
        [ 1.5410, -0.2934, -2.1788,  0.0000, -0.0000],
        [ 1.5410, -0.2934, -2.1788,  0.5684, -0.0000],
        [ 1.5410, -0.2934, -2.1788,  0.5684, -1.0845]])

Instead of multiplying and then summing across the rows, I could have just used a matmul.

In [136]:
print((mask*x).sum(-1, keepdim=True))
print(mask@x.T)
print(x@torch.triu(torch.ones((d_w, d_w)))) # no need for transpose!

tensor([[ 1.5410],
        [ 1.2476],
        [-0.9312],
        [-0.3628],
        [-1.4473]])
tensor([[ 1.5410],
        [ 1.2476],
        [-0.9312],
        [-0.3628],
        [-1.4473]])
tensor([[ 1.5410,  1.2476, -0.9312, -0.3628, -1.4473]])
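As a sanity check, the triangular matmul above should match a plain running sum. The sketch below compares it against `torch.cumsum`; it rebuilds `x` with the same seed and row layout as above so that it runs on its own.

```python
import torch

torch.manual_seed(0)
d_w = 5
x = torch.randn((1, d_w))  # one row holding d_w tokens

upper = torch.triu(torch.ones((d_w, d_w)))
via_matmul = x @ upper                # column j sums tokens 0..j
via_cumsum = torch.cumsum(x, dim=-1)  # built-in running sum

print(torch.allclose(via_matmul, via_cumsum, atol=1e-6))
```

The matmul form is the one worth keeping, because its weights can later be replaced by something other than ones.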


What if I want an average instead (equal weight on every value seen so far when predicting the next)?

This is just the sum divided by the number of elements in it. So the first position gets a single weight of 1, the second gets 0.5 and 0.5, and so on.

In [137]:
w_mask = torch.triu(torch.ones((d_w, d_w)))
w_mask /= w_mask.sum(0, keepdim=True)  # normalize each column so it sums to 1
w_mask

tensor([[1.0000, 0.5000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.5000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.0000, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.2000]])

In [138]:
print(mask@x.T / mask.sum(-1, keepdim=True))
print(x@w_mask)

tensor([[ 1.5410],
        [ 0.6238],
        [-0.3104],
        [-0.0907],
        [-0.2895]])
tensor([[ 1.5410,  0.6238, -0.3104, -0.0907, -0.2895]])
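The same check works for the average: multiplying by the column-normalized mask should equal a running sum divided by the number of tokens seen so far. A minimal sketch, again rebuilding `x` so it runs standalone (`counts` and `running_mean` are names introduced here):

```python
import torch

torch.manual_seed(0)
d_w = 5
x = torch.randn((1, d_w))

w_mask = torch.triu(torch.ones((d_w, d_w)))
w_mask = w_mask / w_mask.sum(0, keepdim=True)  # each column sums to 1

counts = torch.arange(1, d_w + 1)  # 1, 2, ..., d_w tokens seen so far
running_mean = torch.cumsum(x, dim=-1) / counts

print(torch.allclose(x @ w_mask, running_mean, atol=1e-6))
```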


Now, by changing the weights, I can do things other than averaging. I can weigh the first token higher (or whatever).

So let's compute that weight matrix on the fly. 

The "Attention Is All You Need" paper computes

$$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

So the softmax part is basically what computes the weights of the weighted sum, and $V$ is essentially the x (although projected).

In [139]:
mask = torch.triu(torch.ones((d_w, d_w)))
mask.masked_fill_(mask == 0, float("-inf"))
mask

tensor([[1., 1., 1., 1., 1.],
        [-inf, 1., 1., 1., 1.],
        [-inf, -inf, 1., 1., 1.],
        [-inf, -inf, -inf, 1., 1.],
        [-inf, -inf, -inf, -inf, 1.]])

In [140]:
torch.softmax(mask, dim=0)

tensor([[1.0000, 0.5000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.5000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.3333, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.0000, 0.2500, 0.2000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.2000]])
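Why this reproduces the averaging matrix: `exp(-inf)` is 0, so masked entries get no weight, and the remaining entries in a column share one constant score, so they get equal weight. The sketch below illustrates that the constant itself does not matter by using 0 instead of 1 (`keep`, `scores` and `expected` are names introduced here).

```python
import torch

d_w = 5
keep = torch.triu(torch.ones((d_w, d_w))).bool()  # positions allowed to contribute

# Constant score of 0 where allowed, -inf elsewhere
scores = torch.zeros((d_w, d_w)).masked_fill(~keep, float("-inf"))
weights = torch.softmax(scores, dim=0)  # normalize down each column

# The averaging matrix built directly
expected = torch.triu(torch.ones((d_w, d_w)))
expected = expected / expected.sum(0, keepdim=True)

print(torch.allclose(weights, expected))
```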

Now I need a way to generate that matrix dynamically from learned weights!

This is where Q and K come into the attention computation. They compute the attention scores.
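In this notebook Q and K are sampled at random. In a trained model they would typically come from learned linear projections of the input. A minimal sketch of that idea, assuming tokens are stored as rows with $d_e > 1$ (unlike the row-vector `x` above); `W_q`, `W_k` and the sizes are hypothetical names and values introduced for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
d_w, d_e, d_k = 5, 4, 3  # hypothetical sizes

x = torch.randn((d_w, d_e))  # tokens as rows in this sketch

# Learnable projections (hypothetical names); weights are randomly initialized
W_q = nn.Linear(d_e, d_k, bias=False)
W_k = nn.Linear(d_e, d_k, bias=False)

Q = W_q(x)  # (d_w, d_k)
K = W_k(x)  # (d_w, d_k)

scores = Q @ K.T / d_k**0.5  # (d_w, d_w), one score per pair of tokens
print(scores.shape)
```

Because the scores depend on `W_q` and `W_k`, gradients can flow back into those weights during training.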

In [141]:
d_k = 3
Q = torch.randn((d_w, d_k))
K = torch.randn((d_w, d_k))

In [142]:
# First find similarities within the input
# (QK^T)/sqrt(d_k)
inside = Q@K.T*(d_k**(-.5)) # (d_w, d_w)
inside

tensor([[ 0.4948,  0.0305,  0.8477,  0.0659, -1.1022],
        [-0.8167, -0.8018,  0.7273,  0.6297,  0.5582],
        [-0.0843,  0.7896,  0.2424, -1.1661, -1.6582],
        [-0.4332, -0.8788,  0.7489,  0.9102,  0.6452],
        [-1.4891, -1.0330,  0.1476,  0.7217,  1.6950]])

But really I only want to model similarities between the values the mask lets through, so here I apply the masking.

In [143]:
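# Mask by position, not by value: `mask == 0` would also hide a genuine zero score
future = torch.tril(torch.ones((d_w, d_w)), diagonal=-1).bool()
mask = inside.masked_fill(future, float("-inf"))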
qk = torch.softmax(mask, dim=0)
qk

tensor([[1.0000, 0.6968, 0.4111, 0.1860, 0.0345],
        [0.0000, 0.3032, 0.3645, 0.3269, 0.1816],
        [0.0000, 0.0000, 0.2244, 0.0543, 0.0198],
        [0.0000, 0.0000, 0.0000, 0.4328, 0.1981],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.5660]])

In [144]:
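V = torch.randn((d_w, d_k))
# Each column of qk holds one token's weights (softmax over dim=0), so transpose before applying them to V
qkv = qk.T@V  # (d_w, d_k)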
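Putting the pieces together, here is a sketch of the whole computation as one function. It uses the more common layout, in which row i holds the weights for token i, so the mask hides the strict upper triangle and the softmax runs over the last dimension; the cells above use the transposed layout. The function name `causal_attention` is introduced here for illustration.

```python
import torch

def causal_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V with a causal mask; row i is the query."""
    d_w, d_k = Q.shape
    scores = Q @ K.T / d_k**0.5  # (d_w, d_w)
    # True above the diagonal: positions a token must not look at
    future = torch.triu(torch.ones((d_w, d_w)), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
d_w, d_k = 5, 3
Q, K, V = (torch.randn((d_w, d_k)) for _ in range(3))

out, weights = causal_attention(Q, K, V)
print(out.shape)
print(weights.sum(-1))
```

The first token can only attend to itself, so the first row of `out` equals the first row of `V`.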