We will work out how to code an encoder block along with the embeddings. The transformer variants are built by stacking these blocks together, and part of the reason the GPT family scales so well is the simplicity of this model design.

We use the diagram found in the paper:

The first thing we look at is the word embedding. After we use a dictionary to map each word to a unique integer, we would like to learn a more useful embedding. Here, the embedding is just a look up table, with the additional feature that the table can be adjusted by gradient descent. We can dive into algorithms like word2vec in the future if we have time, but for now let's demonstrate why it's really just a look up table.

In [3]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

In [1]:
embed = nn.Embedding(3, 5) # vocab size 3, vec dim 5
embed.weight.data

tensor([[ 2.0771,  0.8933, -0.4322,  0.8430,  0.4313],
        [-0.9201,  0.5402, -1.1606,  0.2522, -0.9390],
        [-1.0256, -0.3078, -2.8334,  2.2777, -0.7678]])

In [4]:
# we see that the embedding has a vocabulary of 3 and each vector has length 5
# More specifically, we are simply indexing into this 2d array when we
# call the embedding in practice:
x = torch.tensor([0,2])
embed(x)

tensor([[ 2.0771,  0.8933, -0.4322,  0.8430,  0.4313],
        [-1.0256, -0.3078, -2.8334,  2.2777, -0.7678]],
       grad_fn=<EmbeddingBackward>)

In [3]:
# torch lets us look up the same row any number of times, with an index tensor of any length
x = torch.tensor([0,1,1,1,1,0])
embed(x)

tensor([[ 2.0771,  0.8933, -0.4322,  0.8430,  0.4313],
        [-0.9201,  0.5402, -1.1606,  0.2522, -0.9390],
        [-0.9201,  0.5402, -1.1606,  0.2522, -0.9390],
        [-0.9201,  0.5402, -1.1606,  0.2522, -0.9390],
        [-0.9201,  0.5402, -1.1606,  0.2522, -0.9390],
        [ 2.0771,  0.8933, -0.4322,  0.8430,  0.4313]],
       grad_fn=<EmbeddingBackward>)

In [5]:
# but as soon as we try to embed an integer outside the vocabulary, it breaks,
# because it is only retrieving content using the integer as an index, nothing fancy at all
x = torch.tensor([3])
embed(x)

IndexError: index out of range in self
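
The two claims above (the layer is a plain look up, and the table can be adjusted by gradient descent) can be checked directly. This is a minimal sketch; the names `demo_embed` and `idx` are made up for the illustration, and the seed is fixed only to make the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
demo_embed = nn.Embedding(3, 5)  # hypothetical name; vocab size 3, vec dim 5
idx = torch.tensor([0, 2, 0])

# calling the layer returns the same rows as indexing the weight matrix directly
same_as_indexing = torch.equal(demo_embed(idx), demo_embed.weight[idx])
print(same_as_indexing)

# gradients reach only the rows that were looked up (rows 0 and 2 here),
# which is how gradient descent adjusts the table
demo_embed(idx).sum().backward()
print(demo_embed.weight.grad)
```

Row 1 is never looked up, so its gradient stays at zero, while row 0 is looked up twice and accumulates twice the gradient of row 2.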

Now that we've seen that the embedding is simply a look up table, the next step is the positional embedding. Since the word embedding just encodes an integer representing the word in the dictionary, it contains no positional information about the incoming words. As we know, swapping the order of words in a sentence can completely change its meaning, and we would like the model to learn this. Thus we need to devise an extra mechanism to bake the positional info into the embeddings that are fed to the model.

The clever thing the authors use is sinusoidal functions.

Good explanations can be found at:
https://datascience.stackexchange.com/questions/51065/what-is-the-positional-encoding-in-the-transformer-model
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ 

In essence, let the embedding dimension be D. We evaluate D sinusoidal curves (D/2 sine and cosine pairs) using the position as the input/independent variable, and use the D outputs as the positional embedding vector. Each pair has a different frequency, decreasing geometrically along the embedding dimension, which means that even though sinusoids are periodic functions, the positional embedding vectors are unlikely to overlap (in fact this should be guaranteed within a range once we view the sinusoids as a continuous analogue of binary numbers, but that's for another time).
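
For reference, this is the formula from the paper, written with pos for the position, D for the embedding dimension (the paper calls it d_model) and i for the index of the sine/cosine pair:

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/D}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/D}}\right)
\end{aligned}
\qquad i = 0, 1, \dots, D/2 - 1
```

Even indices of the vector hold the sines and odd indices hold the cosines, which is why D has to be even.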

Another really neat feature of this encoding is that it makes it easy for the model to learn about relative positions, which is really what matters, instead of, say, the absolute position of a word from the start of the article. Part of the significance is that the input sequence will be a sliding window over a given text: the second word in the current input sequence will be the first word in the next, so there's really no point in learning any semantic information based on absolute position. Mathematically, the positional embedding vector at a position shifted by some integer step is a linearly transformed (i.e. matrix multiplied) version of the original vector, and the matrix depends only on the size of the step. Intuitively, we can think of the model as learning these transformation matrices between positions and using them to construct semantics. To wit, we see relative positions, the machine sees transformation matrices between the relative positions. The full proof can be found in:

https://timodenk.com/blog/linear-relationships-in-the-transformers-positional-encoding/ 
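
As a sketch of the key step only (the linked post has the full proof), write \omega_i = 1/10000^{2i/D} for the frequency of pair i. The angle addition formulas then give, for a single sine/cosine pair and a shift of k positions:

```latex
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
```

The 2x2 matrix depends on k and \omega_i but not on pos. Placing one such block per pair on the diagonal gives the D x D matrix for the whole vector.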



In [5]:
# we recycle parts from Pytorch's official positional encoding implementation:
# https://pytorch.org/tutorials/beginner/transformer_tutorial.html 
# which is pretty much
# straight up a translation of the formula itself into code so nothing is very new here

def positional_encoding(max_len, d_model):
    '''
    Computes positional embedding vectors deterministically with sin and cos
    max_len: number of positions, i.e. input seq. length
    d_model: embedding dimension
    
    CAVEAT/WARNING: the embedding dimension must be even, as dictated by the formula
    '''
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
#     pe = pe.unsqueeze(0).transpose(0, 1)
    return pe

In [24]:
# Here are the positional vectors for an input sequence length of 10 and embedding dim 4
# As a quick sanity check, notice the first row should be
# [sin(0), cos(0), sin(0), cos(0)] = [0,1,0,1] checks out!
positional_encoding(10, 4)

tensor([[ 0.0000,  1.0000,  0.0000,  1.0000],
        [ 0.8415,  0.5403,  0.0100,  0.9999],
        [ 0.9093, -0.4161,  0.0200,  0.9998],
        [ 0.1411, -0.9900,  0.0300,  0.9996],
        [-0.7568, -0.6536,  0.0400,  0.9992],
        [-0.9589,  0.2837,  0.0500,  0.9988],
        [-0.2794,  0.9602,  0.0600,  0.9982],
        [ 0.6570,  0.7539,  0.0699,  0.9976],
        [ 0.9894, -0.1455,  0.0799,  0.9968],
        [ 0.4121, -0.9111,  0.0899,  0.9960]])
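
The two properties claimed for this encoding, distinct vectors for distinct positions and a linear map between shifted positions, can also be checked numerically. This is a minimal sketch under stated assumptions: the helpers `sinusoidal_pe` and `shift_matrix` are made-up names for this illustration, the first repeats the same formula in float64 so the comparison is tight, and the sizes (50 positions, dim 8, shift 3) are arbitrary.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    # hypothetical helper: same sin/cos formula as above, computed in float64
    position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float64)
                      * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * freqs)
    pe[:, 1::2] = torch.cos(position * freqs)
    return pe, freqs

def shift_matrix(k, freqs):
    # hypothetical helper: block-diagonal matrix of 2x2 rotations,
    # one block per (sin, cos) pair; it depends on the shift k only
    d_model = 2 * len(freqs)
    T = torch.zeros(d_model, d_model, dtype=torch.float64)
    for i, w in enumerate(freqs.tolist()):
        c, s = math.cos(w * k), math.sin(w * k)
        T[2*i:2*i+2, 2*i:2*i+2] = torch.tensor([[c, s], [-s, c]], dtype=torch.float64)
    return T

pe, freqs = sinusoidal_pe(50, 8)
k = 3
T = shift_matrix(k, freqs)

# rows of pe are row vectors, so the column-vector map T becomes a right-multiply by T.T
rows_match = torch.allclose(pe[:-k] @ T.T, pe[k:])
print(rows_match)

# every one of the 50 positions gets its own distinct vector
n_unique = torch.unique(pe, dim=0).shape[0]
print(n_unique)
```

The same matrix `T` maps position 0 to 3, 1 to 4, and so on, which is the sense in which the model can learn relative positions.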

In [47]:
# Next, all we need to do is to add to the word embedding
embed = nn.Embedding(20, 4) # vocab 20, vector dim 4
x = torch.tensor([0,1,7,3,8,5,4,0, 17, 12])
x = embed(x)
pe = positional_encoding(10, 4)
print(x)
print(x+pe)
x = x+pe
# We can visually check it is simply the elementwise addition

tensor([[ 5.2250e-01,  7.4871e-01, -6.9554e-01, -4.6671e-01],
        [ 5.9929e-01, -1.7954e+00,  8.9031e-01,  9.3008e-01],
        [-4.3043e-01,  7.2504e-01,  6.1766e-01,  8.5517e-01],
        [ 1.1575e+00, -5.3880e-01,  8.1583e-01, -5.1187e-01],
        [-1.6941e+00, -4.1849e-01, -7.3580e-01,  9.9257e-02],
        [-4.4127e-01,  8.5713e-02,  1.3710e+00,  2.8298e-01],
        [ 1.3788e+00, -2.5741e-04, -7.7389e-01,  3.4132e+00],
        [ 5.2250e-01,  7.4871e-01, -6.9554e-01, -4.6671e-01],
        [ 2.7098e+00,  1.3977e+00,  9.7662e-02,  8.0236e-01],
        [-5.9547e-01,  6.5157e-01,  5.8158e-01,  8.8459e-01]],
       grad_fn=<EmbeddingBackward>)
tensor([[ 0.5225,  1.7487, -0.6955,  0.5333],
        [ 1.4408, -1.2551,  0.9003,  1.9300],
        [ 0.4789,  0.3089,  0.6377,  1.8550],
        [ 1.2987, -1.5288,  0.8458,  0.4877],
        [-2.4509, -1.0721, -0.6958,  1.0985],
        [-1.4002,  0.3694,  1.4210,  1.2817],
        [ 1.0994,  0.9599, -0.7139,  4.4114],
        [ 1.1795,  1.
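
The same addition carries over to batches through broadcasting. This is a minimal sketch, assuming a batch of 2 sequences; `batch_embed`, `ids` and `pe_stand_in` are made-up names, and `pe_stand_in` is a random tensor used only in place of the real positional vectors to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_embed = nn.Embedding(20, 4)      # hypothetical name; vocab 20, vector dim 4
ids = torch.randint(0, 20, (2, 10))    # batch of 2 sequences, 10 word indices each
pe_stand_in = torch.randn(10, 4)       # stand-in for positional_encoding(10, 4)

tokens = batch_embed(ids)              # shape (2, 10, 4)
batched = tokens + pe_stand_in         # (10, 4) is broadcast over the batch dimension
print(batched.shape)
```

Every sequence in the batch receives the same positional vectors, since position 0 is position 0 regardless of which sequence it belongs to.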

Now we are ready to move on to the encoder block, which as shown in the diagram, consists of self-attention, layer norm and feed forward network. 

We have already covered attention. The feed forward network is simply a linear transformation (matrix multiplication and bias) combined with a non-linear activation function (e.g. ReLU), applied to each word vector independently.
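
A minimal sketch of what "feed forward" means here, assuming an embedding dimension of 4 and a hidden width of twice that (both arbitrary choices for the illustration). It shows that pushing the whole sequence through the network gives the same result as pushing each word vector through on its own, so the words do not interact in this layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 4  # hypothetical embedding dimension
ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.ReLU(), nn.Linear(2 * d, d))
seq = torch.randn(10, d)  # 10 word vectors

whole = ffn(seq)                                     # all words at once
one_by_one = torch.stack([ffn(tok) for tok in seq])  # each word separately
print(torch.allclose(whole, one_by_one, atol=1e-6))
```

Mixing information between words is left entirely to the attention step.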

Layer norm (https://arxiv.org/pdf/1607.06450.pdf) essentially re-centers each embedding vector to mean 0 and rescales it to standard deviation 1 across its features. It also has some additional learned parameters (a scale and a shift) to massage the distribution further according to the data, but this does not change the gist.
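
To make the description concrete, here is a minimal sketch that recomputes layer norm by hand and compares it with `nn.LayerNorm`. The variable names are made up for the illustration; the hand-written version uses the layer's own `eps`, `weight` and `bias`, which at initialization are a small constant, all ones and all zeros respectively.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 4  # hypothetical embedding dimension
ln = nn.LayerNorm(d)
vecs = torch.randn(10, d)

# normalize each vector over its own features (the last dimension)
mean = vecs.mean(dim=-1, keepdim=True)
var = vecs.var(dim=-1, unbiased=False, keepdim=True)  # biased variance, as LayerNorm uses
manual = (vecs - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias

print(torch.allclose(manual, ln(vecs), atol=1e-6))
print(ln(vecs).mean(dim=-1))  # each row is centered at (approximately) 0
```

Note that the statistics are computed per vector, not across the sequence or the batch.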

Let's do a step-through for the encoder:

In [25]:
# We've previously hacked out attention as:
def _self_attention(x, emb_dim, latent_dim):
    M_K, M_Q, M_V = [torch.rand(emb_dim, latent_dim) for _ in range(3)]
    K, Q, V = x@M_K, x@M_Q, x@M_V 
    W_raw = Q@(K.transpose(1,2))
    W = F.softmax(W_raw, dim=-1) # normalize over the keys so each row of W sums to 1
    Y = W@V
    return Y

# while it illustrates the key concepts, this is not the most efficient/standard way to implement it
# In practice, we can use the PyTorch linear layer to do the matrix work for us
# And we actually want the latent dim to be the same as the embedding dim just to make things
# simpler and make it easier to construct residual connections
# We also use negative indices to avoid having to deal with the batch dimension, which is usually the first
# in this case we are not batching yet, and this code will keep working once we do, whereas positive indices would not

def self_attention(x, emb_dim):
    M_K, M_Q, M_V = [nn.Linear(emb_dim, emb_dim, bias=False) for _ in range(3)]
    K, Q, V = [M(x) for M in [M_K, M_Q, M_V ]]
    W_raw = Q@(K.transpose(-1,-2))
    W = F.softmax(W_raw, dim=-1)
    Y = W@V
    return Y

In [61]:
x_attn = self_attention(x, 4)
print(x_attn)
x = x+x_attn # residual/addition connection


tensor([[-0.0327,  0.5757,  0.1935,  0.1563],
        [-0.0577,  0.2456,  0.0520,  0.0041],
        [ 0.0533,  0.4262,  0.1914,  0.2409],
        [-0.2851,  0.3682, -0.0097, -0.2671],
        [-0.1351,  0.8174,  0.2423,  0.1249],
        [ 0.0053,  1.4199,  0.5294,  0.4709],
        [ 0.6515,  0.2615,  0.4021,  0.9457],
        [-0.0649,  0.4225,  0.1173,  0.0577],
        [ 0.0294,  0.1864,  0.0610,  0.0835],
        [ 0.0262,  0.4751,  0.2000,  0.2282]], grad_fn=<MmBackward>)
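
One detail the step-through leaves out: the attention in the original paper also divides the raw scores by the square root of the key dimension before the softmax. The sketch below is a hypothetical variant, not the function used in this walkthrough; `scaled_self_attention` and the `proj_*` names are made up, and the projections are passed in as arguments so that they are created only once.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_self_attention(x, M_K, M_Q, M_V):
    # hypothetical variant: same steps as self_attention, plus the 1/sqrt(d_k) scaling
    K, Q, V = M_K(x), M_Q(x), M_V(x)
    d_k = K.shape[-1]
    W_raw = Q @ K.transpose(-1, -2) / math.sqrt(d_k)
    W = F.softmax(W_raw, dim=-1)  # normalize over the keys
    return W @ V, W

torch.manual_seed(0)
d = 4
proj_K, proj_Q, proj_V = [nn.Linear(d, d, bias=False) for _ in range(3)]
x_demo = torch.randn(10, d)

Y, W = scaled_self_attention(x_demo, proj_K, proj_Q, proj_V)
print(Y.shape)        # torch.Size([10, 4])
print(W.sum(dim=-1))  # every row of attention weights sums to 1
```

The scaling keeps the scores from growing with the dimension, which would otherwise push the softmax towards very peaked weights.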


In [65]:
# construct LayerNorm layers
emb_dim = 4 # same vector dim as the embedding above
ln1 = nn.LayerNorm(emb_dim)
ln2 = nn.LayerNorm(emb_dim)

x_ln1 = ln1(x)
print(x_ln1)

# then we let each word/vector flow through the same feed forward network/multi-layer perceptron(MLP)
# which in pytorch is simply

mlp = nn.Sequential(
            nn.Linear(emb_dim, 2*emb_dim),
            nn.ReLU(),
            nn.Linear(2*emb_dim, emb_dim) ) # can be anything mlp, this is one simple example

x_mlp = mlp(x_ln1) # the MLP takes the normalized vectors, matching the residual connection below
print(x_mlp)

#residual connection again
x = x_ln1 + x_mlp
print(x)

#final layer norm
x = ln2(x)

print(x)


tensor([[-0.3288,  1.5076, -1.2692,  0.0904],
        [ 0.4799, -1.4269, -0.3281,  1.2751],
        [-0.7354, -0.3693, -0.6123,  1.7169],
        [ 0.9439, -1.6826,  0.3181,  0.4205],
        [-1.2941,  0.4488, -0.5171,  1.3625],
        [-1.7049,  0.7377,  0.2817,  0.6854],
        [ 0.2013, -0.4740, -1.2234,  1.4961],
        [ 0.3179,  1.2512, -1.5270, -0.0421],
        [ 1.3514, -0.4557, -1.3324,  0.4367],
        [-1.0831, -0.3047, -0.2470,  1.6349]],
       grad_fn=<NativeLayerNormBackward>)
tensor([[-0.3271, -0.1151,  0.3035, -0.1041],
        [-0.1208, -0.1754,  0.1748, -0.2144],
        [-0.0746, -0.2061,  0.1998, -0.1201],
        [-0.1316, -0.2053,  0.1774, -0.2689],
        [-0.0535, -0.2428,  0.1933, -0.0846],
        [-0.0279, -0.3096,  0.1677, -0.0493],
        [-0.1461, -0.1492,  0.1999, -0.1966],
        [-0.4272, -0.0251,  0.3488, -0.1124],
        [-0.3756, -0.0784,  0.2905, -0.1758],
        [-0.0549, -0.2382,  0.1955, -0.0863]], grad_fn=<AddmmBackward>)
tensor([[-0

In [None]:
# Putting all the steps together, we can organize the encoder block into a class like this:

class Block(nn.Module):
    def __init__(self, emb_dim):
        super().__init__() # must run before any layers are assigned
        self.ln1 = nn.LayerNorm(emb_dim)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 2*emb_dim),
            nn.ReLU(),
            nn.Linear(2*emb_dim, emb_dim) ) # can be any mlp, this is one simple example
        # create the projections once, so they are registered as learnable parameters
        self.M_K = nn.Linear(emb_dim, emb_dim, bias=False)
        self.M_Q = nn.Linear(emb_dim, emb_dim, bias=False)
        self.M_V = nn.Linear(emb_dim, emb_dim, bias=False)
        self.emb_dim = emb_dim


    def self_attention(self, x):
        K, Q, V = [M(x) for M in [self.M_K, self.M_Q, self.M_V]]
        W_raw = Q@(K.transpose(-1,-2))
        # == masking begins ==
        seq_len = x.shape[-2]
        ones = torch.ones((seq_len, seq_len), device=x.device)
        mask = torch.triu(ones, diagonal=1).bool() # True above the diagonal
        W_raw = W_raw.masked_fill(mask, float('-inf')) # broadcasts over a batch dimension
        # == masking ends ==
        W = F.softmax(W_raw, dim=-1)
        Y = W@V
        return Y



    def forward(self, x):
        x = x + self.self_attention(x) # residual connection
        x_ln1 = self.ln1(x)
        x_mlp = self.mlp(x_ln1)
        x = x_ln1 + x_mlp # residual connection again
        x = self.ln2(x)
        return x
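
As a small preview of what the masking lines inside `self_attention` do, here is a minimal sketch on a made-up 4x4 score matrix. All raw scores are set to zero so that the effect of the mask is easy to read off; `scores`, `mask` and `weights` are names chosen for the illustration.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # hypothetical raw attention scores, all equal

# True strictly above the diagonal, i.e. wherever a word would look at a later word
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# -inf becomes exactly 0 after the softmax, so masked positions get no weight
weights = F.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
print(weights)
```

The first row puts all of its weight on position 0, the second row splits it evenly over positions 0 and 1, and so on: each position attends only to itself and to earlier positions.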

    


That's it! We've run through an encoder block. In reality, we would use a bigger embedding dimension, more complex MLP, batching, etc. But the essence of the transformer has been captured here and will not change in any major way. 
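
To illustrate the stacking idea with something that runs on its own, the sketch below uses PyTorch's built-in `nn.TransformerEncoderLayer` and `nn.TransformerEncoder` rather than the class from this walkthrough. The built-in layer follows the same attention, feed forward and layer norm pattern but differs in details (multi-head attention, dropout), so treat it as a counterpart, not as the same code. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 4, 10

# one encoder block; dropout is set to 0 to keep the sketch deterministic
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=1,
                                   dim_feedforward=2 * d_model, dropout=0.0)
# three copies of the block stacked on top of each other
encoder = nn.TransformerEncoder(layer, num_layers=3)

x_in = torch.randn(seq_len, 1, d_model)  # default layout is (seq_len, batch, d_model)
out = encoder(x_in)
print(out.shape)  # torch.Size([10, 1, 4])
```

Because every block maps a sequence of vectors to a sequence of vectors of the same shape, any number of them can be chained.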

We will visit the last key idea, masking, in the next section.

