
## Simple GPT implementation in Torch

GPT uses only the decoder part of the Transformer.

The input to the decoder varies based on whether you are training or predicting. If you are training, the input to the decoder is the sentence itself. When training, a mask is needed here to prevent the model from seeing all the words it is trying to predict. This is called a look ahead mask.  

If you are testing, the input is just the previous words before the word you are trying to predict. You start with a start of sentence token (e.g. <sos>) and predict. The predicted word is then added to the previous tokens and the process is repeated. 
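The prediction loop just described can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_next_token_logits` is a hypothetical stand-in for a trained decoder (it returns random logits), and `vocab_size` and `sos_id` are made-up values.

```python
import torch

torch.manual_seed(0)

vocab_size = 10      # assumed toy vocabulary size
sos_id = 0           # assumed id of the <sos> token

def toy_next_token_logits(tokens):
    """Hypothetical stand-in for a trained decoder: returns logits over the vocabulary."""
    # A real model would compute these from `tokens`; here they are random.
    return torch.randn(vocab_size)

tokens = [sos_id]
for _ in range(5):
    logits = toy_next_token_logits(tokens)
    next_id = int(torch.argmax(logits))   # greedy choice of the most likely token
    tokens.append(next_id)                # the prediction becomes part of the next input

print(tokens)
```

The important part is the last line of the loop: each predicted token is appended to the context before the next prediction is made.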

The decoder consists of N (e.g. 6) decoder layers, followed by a linear layer. 

Each decoder layer has a decoder multi-head attention layer, followed by a fully connected layer. 

Each attention layer consists of several (e.g. 8) parallel attention heads whose outputs are later concatenated.

The numbers 6 and 8 are choices the architect makes.





Transformers are the latest step in the evolution of deep learning. As is necessary with progress, newer deep learning algorithms are also much more complicated, with deeper and more resource intensive networks. The Transformer is the best example of this. Transformers are, for me, the first algorithm I was not able to run on a laptop. They truly require a machine learning “war machine”: lots of GPU power, memory, etc. The algorithms are much more complicated and, as you will see, the networks are very deep.

Transformers are one of the major innovations in NLP since 2017. They were first made popular by the paper “Attention Is All You Need” by Vaswani et al. (2017). They are very interesting and seem to be very powerful. Many researchers suggest that they are better than RNNs for NLP because they parallelize better and because of the Attention mechanism.

So far, as of this writing, Transformers have been used to develop very impressive implementations such as BERT (Devlin et al. 2018) and GPT-2 (Radford et al. 2019), which seem to be very good at language understanding. Transformers have been applied to language translation, question answering, document summarization, automatic code generation, text generation, etc. Okay, let’s get started.




## Encoder Decoder with Multi-Head Attention

In this section, I will present the first version of the Transformer, made popular by the paper “Attention Is All You Need” by Vaswani et al. As of this writing, there are newer Transformer-based models such as BERT, GPT-2, GPT-3, etc. I will simply call this first implementation the Encoder Decoder with Multi-Head Attention Transformer (that’s a mouthful).

The Encoder Decoder with Multi-Head Attention Transformer is a very deep network. The architecture has an encoder followed by a decoder. The encoder has 6 sublayers called encoder layers. Each encoder layer has a Multi-Head Attention layer followed by a standard fully connected feed forward layer. The input to the encoder goes through all these layers and is converted into an encoder output. The input to the encoder and the output of the encoder have the same dimensions. For instance, here, the input to the encoder would be the English sentence (given a translation problem).

The decoder has 2 inputs. One input is the encoder output. The second input to the decoder varies based on whether you are training or predicting. If you are training, the input to the decoder is the sentence in the other language, for instance the Spanish sentence. In the decoder, when training the Transformer, a mask is needed to prevent the model from seeing all the words it is trying to predict. This is called a look ahead mask.

If you are testing, the input to the decoder is just the previous words before the word you are trying to predict. You start with a start of sentence token (e.g. <sos>) and predict iteratively. The predicted word is then added to the previous tokens and the process is repeated.



![alternative text](full_transformer.png)



Now that we have looked at the big picture, we can proceed to discuss the main ideas of the Transformer model.


## The Main Ideas of the Transformer

So, where does one start with Transformers? The Transformer is complex and involves several ideas that must work together correctly. In this section I will present the main ideas first, with some relevant code. Understanding these concepts or steps really well before venturing to write the code for the whole Transformer is really important. It will save you time in the long run. So now, let us proceed to discuss these topics. In the next section I will start discussing the code for the full Transformer.

## Numpy arrays, tensors, and linear algebra

Linear algebra, numpy arrays, and tensor operations are at the heart of understanding the Transformer architecture. Before you continue, I strongly recommend that you read and practice the topics in chapter 1, and in particular, the section on linear algebra, numpy arrays, and tensor operations.

## Inputs and outputs

When dealing with deep neural networks I like to think of inputs and outputs first and treat the network as a black box. So, let us start there. Let's quickly remember our classic example of MNIST supervised classification. In MNIST standard feed forward classification, you have an input image which is 28x28 and a predicted vector of size 10 for the classes. So, what do the inputs and outputs look like for Transformers? For language translation, they are lists of ids. Each id can represent a word in a sentence. This is best visualized with an example.

First, let us look at the classic use case for Transformers. As I said earlier, Transformers have been used extensively in NLP. The simplest example is language translation, where we have sentence pairs such as the following for English-Spanish translation:

"the cat is sleeping" --> which translates to --> "el gato esta durmiendo"

Therefore, first we need to understand how to encode this for the neural network, and then how exactly the network will train and learn. So, again, before you look into the network's very deep and complex layers, I believe that one needs to focus on:

* Taking text sentences and converting them into sequences of ids
* Padding these sequences of ids

Consider that after encoding and padding, your sentences will look like this:

English


[12110 203 43947 29 2 168 2 4 27 684333 836222943 1012 112111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Spanish

[12110 13 43947 29 2 5 32 36 161145 458 347905 58 25 28 354 2482 3 17 27 28 4395 9 2886 7 12111 0 0 0 0 0 0 0 0 0 0 0]
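The two steps above (converting sentences to ids, then padding) can be sketched as follows. This is a toy illustration, not the tokenizer that produced the ids shown above: the word-level vocabulary, `PAD_ID`, and `max_len` are assumptions made for the example.

```python
PAD_ID = 0   # assumed id reserved for padding
max_len = 8  # assumed maximum sequence length for this toy example

sentences = ["the cat is sleeping", "el gato esta durmiendo"]

# Build a toy word-level vocabulary; ids start at 1 because 0 is padding.
words = sorted({w for s in sentences for w in s.split()})
word_to_id = {w: i + 1 for i, w in enumerate(words)}

def encode_and_pad(sentence):
    """Map each word to its id, then pad with PAD_ID up to max_len."""
    ids = [word_to_id[w] for w in sentence.split()]
    return ids + [PAD_ID] * (max_len - len(ids))

english_ids = encode_and_pad(sentences[0])
spanish_ids = encode_and_pad(sentences[1])

print(english_ids)
print(spanish_ids)
```

Every sentence ends up with the same length, which is what allows a batch of sentences to be stacked into a single tensor.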


## Masks
Masks serve several purposes. One is to help ignore the padded values during training. The other is to block the word you want to predict (and any future words). This brings up an important aspect of training with Transformers: Transformers predict the next word in a sequence. For example, given an input in English, "the cat is sleeping", a Transformer is also given part of the output sentence, in this case "el gato esta ?". The Transformer will predict the next word in the sequence, which in this case would be "durmiendo", to complete the translation as “el gato esta durmiendo”. All of this is achieved through the masks, which ignore padded values and show only the partial sentence. The type of training used for Transformers is called “Teacher Forcing”, so definitely understand this concept.
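To make the two kinds of mask concrete, here is a minimal PyTorch sketch. The ids and `PAD_ID` are made up, and the convention used here (True means "this position may be seen") is my own choice for the illustration.

```python
import torch

PAD_ID = 0  # assumed padding id

# One padded sequence of length 6 (batch size 1); the ids are made up.
ids = torch.tensor([[5, 9, 3, 7, 0, 0]])
T = ids.shape[1]

# Padding mask: True where the position holds a real token.
padding_mask = ids != PAD_ID                              # (1, T)

# Look-ahead mask: position i may attend to positions 0..i only.
look_ahead_mask = torch.tril(torch.ones(T, T)).bool()     # (T, T)

# Combined mask: a position is visible if it is neither in the future nor padding.
combined_mask = look_ahead_mask.unsqueeze(0) & padding_mask.unsqueeze(1)   # (1, T, T)

print(combined_mask[0].int())
```

Row i of the printed matrix shows which positions word i is allowed to look at: nothing to its right, and never the padding.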
## Teacher forcing

You may have already read somewhere (online) that the Transformer network predicts one word at a time and that this word is fed back as an input in the next iteration. Also, the network predicts the last word in the sequence of words. But you may think, aren't those last words just padding? Eh? So, what is going on here? As it turns out, the mechanism of predicting one word at a time and feeding it back as an input in the next iteration is only used during the testing phase; it is not used during training. Instead, during training we use “Teacher Forcing”.
Teacher forcing is a technique in auto regressive models where you do not use the predicted outputs of the decoder to feed back as input but instead you use the real data. This helps the model to learn the correct information instead of its own erroneous predictions (especially at the beginning of training).
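In practice, teacher forcing usually amounts to shifting the true target sentence by one position. The following sketch uses made-up ids for "<sos> el gato esta durmiendo <eos>"; the names `decoder_input` and `labels` are mine.

```python
import torch

# Made-up ids standing for "<sos> el gato esta durmiendo <eos>".
target = torch.tensor([1, 14, 27, 9, 33, 2])

decoder_input = target[:-1]   # what the decoder is fed: the true tokens, not its own predictions
labels        = target[1:]    # what it must predict at each position

for t in range(len(labels)):
    print(f"sees {decoder_input[: t + 1].tolist()} -> should predict {labels[t].item()}")
```

Together with the look ahead mask, this lets the network learn all the next-word predictions of a sentence in a single pass.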


## Attention
The Attention mechanism in Transformers is the heart of the whole algorithm. The attention matrix is nothing more than a dot product matrix multiplication between all the words in a sentence (e.g. the input English sentence). The idea is that, given the input and output, the model learns to correlate the words in the sentence to determine their importance. This is done multiple times and that is why it is called a multi head attention mechanism.
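As a rough illustration of the dot product idea, the sketch below computes an attention matrix for a toy "sentence" of 4 random word vectors. It deliberately leaves out the learned query, key, and value projections used by the real model, and the sizes are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 4, 8              # assumed toy sizes: 4 words, 8-dimensional vectors
x = torch.randn(seq_len, d_model)    # stand-in for the embedded sentence

scores  = x @ x.T / d_model ** 0.5   # (4, 4): dot product of every word with every other word
weights = F.softmax(scores, dim=-1)  # each row is normalized to sum to 1
context = weights @ x                # (4, 8): every word becomes a weighted mix of all words

print(weights.sum(dim=-1))
```

Each row of `weights` tells you how much one word attends to every word in the sentence.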

## Embeddings
Embedding converts the sequence of ids into a sequence of embeddings. You will go from a 2d tensor to a 3d tensor of size:
N * seq_length_max * embedding_dimension
where N is the batch size, seq_length_max is 40, and embedding_dimension is 512.
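A quick way to check these shapes is with `nn.Embedding`. In this sketch the batch size of 2 and the vocabulary size of 1000 are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

N, seq_length_max, embedding_dimension = 2, 40, 512
vocab_size = 1000   # assumed vocabulary size for illustration

embedding = nn.Embedding(vocab_size, embedding_dimension)

ids = torch.randint(0, vocab_size, (N, seq_length_max))   # 2d tensor of ids
embedded = embedding(ids)                                  # 3d tensor of embeddings

print(ids.shape, embedded.shape)
```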



In [44]:

import torch
import numpy as np
import requests
## import tiktoken
import torch.nn as nn

from torch.nn import functional as F


In [45]:

## !pip install requests
## !pip install tiktoken    ## requires python   >    3.9


In [46]:

torch.manual_seed(1337)

block_size = 256      ## max context length for predictions
batch_size = 64 
max_iters  = 5000
eval_interval = 500
learning_rate = 3e-4             ## 0.001
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
vocab_size = 65
n_embd  = 384                  ## every id gets embedded to vector of this size
n_head  = 6
n_layer = 6
dropout = 0.2



In [47]:

input_file_path = 'input.txt'

## data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

with open(input_file_path, 'r', encoding='utf-8') as f:
    text = f.read()


In [48]:

print("length of data in characters")
len(text)


length of data in characters


1115394

In [49]:

 list(set(text))


['u',
 'v',
 'W',
 "'",
 '$',
 'I',
 'Q',
 'L',
 ',',
 'Y',
 'w',
 'D',
 'e',
 'P',
 'h',
 'z',
 'F',
 'n',
 'l',
 'T',
 '-',
 'q',
 '&',
 'p',
 '3',
 'r',
 'j',
 'X',
 '!',
 's',
 'A',
 'H',
 '\n',
 'O',
 '.',
 ':',
 'S',
 'K',
 'C',
 'N',
 'E',
 'Z',
 ' ',
 'd',
 'y',
 'x',
 'c',
 'f',
 ';',
 '?',
 'B',
 'g',
 'o',
 'G',
 'V',
 'R',
 't',
 'i',
 'm',
 'M',
 'k',
 'b',
 'a',
 'U',
 'J']

In [50]:

chars = sorted(     list(set(text))   )

vocab_size = len(chars)

print(  ''.join(chars)  )



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [51]:

len(chars)


65


## Tokenizer


In [52]:

## tokenizer

stoi = { ch:i for i, ch in enumerate(chars) }
itos = { i:ch for i, ch in enumerate(chars) }




In [53]:

stoi


{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

In [54]:

itos


{0: '\n',
 1: ' ',
 2: '!',
 3: '$',
 4: '&',
 5: "'",
 6: ',',
 7: '-',
 8: '.',
 9: '3',
 10: ':',
 11: ';',
 12: '?',
 13: 'A',
 14: 'B',
 15: 'C',
 16: 'D',
 17: 'E',
 18: 'F',
 19: 'G',
 20: 'H',
 21: 'I',
 22: 'J',
 23: 'K',
 24: 'L',
 25: 'M',
 26: 'N',
 27: 'O',
 28: 'P',
 29: 'Q',
 30: 'R',
 31: 'S',
 32: 'T',
 33: 'U',
 34: 'V',
 35: 'W',
 36: 'X',
 37: 'Y',
 38: 'Z',
 39: 'a',
 40: 'b',
 41: 'c',
 42: 'd',
 43: 'e',
 44: 'f',
 45: 'g',
 46: 'h',
 47: 'i',
 48: 'j',
 49: 'k',
 50: 'l',
 51: 'm',
 52: 'n',
 53: 'o',
 54: 'p',
 55: 'q',
 56: 'r',
 57: 's',
 58: 't',
 59: 'u',
 60: 'v',
 61: 'w',
 62: 'x',
 63: 'y',
 64: 'z'}

In [55]:

encode = lambda s: [ stoi[c]          for c in s   ]    ## encoder: string to integer


In [56]:

encode("bahh")


[40, 39, 46, 46]

In [57]:

decode = lambda l: ''.join(   itos[i] for i in l   )    ## decoder: integer to string


In [58]:

decode([40, 39, 46, 46])


'bahh'


## Encode the data


In [59]:

data = torch.tensor(   encode(text), dtype=torch.long   )



In [60]:

data


tensor([18, 47, 56,  ..., 45,  8,  0])

In [61]:

n    = int(   0.9*len(data)   )


In [62]:

train_data = data[:n]
val_data   = data[n:]



## Function to create batches

* x and y are taken from the same stretch of the data, but y is shifted by one position from x, so that y[t] is the token that follows x[t]


In [63]:

temp_batch_size = 4
temp_block_size = 16

ix = torch.randint(   len(data) - block_size, (temp_batch_size,)   )
ix


tensor([213173, 989153, 193174, 874116])

In [64]:

for index_temp in ix:
    print(  data[index_temp]  )


tensor(59)
tensor(43)
tensor(58)
tensor(17)


In [65]:

x  = torch.stack(    [  data[   i : i+   temp_block_size ]   for i in ix ]    ) 
y  = torch.stack(    [  data[ i+1 : i+1+ temp_block_size ]   for i in ix ]    )

print(x)
print(y)


tensor([[59, 58,  1, 15, 50, 39, 56, 43, 52, 41, 43, 12,  1, 39, 52, 42],
        [43, 56,  1, 24, 59, 41, 43, 52, 58, 47, 53,  8,  0,  0, 24, 33],
        [58, 46, 53, 59, 45, 46, 58, 57,  6,  1, 39,  1, 50, 43, 45, 47],
        [17, 37, 10,  0, 32, 56, 59, 50, 63,  6,  1, 57, 47, 56,  6,  1]])
tensor([[58,  1, 15, 50, 39, 56, 43, 52, 41, 43, 12,  1, 39, 52, 42,  1],
        [56,  1, 24, 59, 41, 43, 52, 58, 47, 53,  8,  0,  0, 24, 33, 15],
        [46, 53, 59, 45, 46, 58, 57,  6,  1, 39,  1, 50, 43, 45, 47, 53],
        [37, 10,  0, 32, 56, 59, 50, 63,  6,  1, 57, 47, 56,  6,  1, 47]])


In [66]:

def get_batch(split):
    if split == "train":
        data = train_data
    else:
        data = val_data
        
    ix = torch.randint(   len(data) - block_size, (batch_size,)   )
    
    x  = torch.stack(    [  data[   i : i+block_size ]     for i in ix ]    ) 
    y  = torch.stack(    [  data[ i+1 : i+1+block_size ]   for i in ix ]    )
    
    x, y = x.to(device), y.to(device)

    return x, y



## Estimate loss function


In [67]:

@torch.no_grad()    ## for efficiency
def estimate_loss():
    out = {}
    model.eval()   ## no training
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()  ## back to training
    return out



## One Head of Self attention

The input sequence is the only input to this attention layer; the look ahead mask is built inside the layer itself.

The output of this attention mechanism is passed to a fully connected layer.





In [68]:

'''

## x [N, 40, 512]
## look_ahead_mask [N, 40, 40]

def Dec_MultiHeadAttention(x, look_ahead_mask, dropout):

    Wq = tf.Variable( xavier_init( [batch_size, 512, 64] )  )
    bq = tf.Variable( tf.random_normal( [batch_size, 40, 64] )  )
    Q = tf.matmul(x, Wq) + bq    # Nx40x64
    
    Wk = tf.Variable( xavier_init( [batch_size, 512, 64] )  )
    bk = tf.Variable( tf.random_normal( [batch_size, 40, 64] )  )
    K = tf.matmul(x, Wk) + bk    # Nx40x64
    
    Wv = tf.Variable( xavier_init( [batch_size, 512, 64] )  )
    bv = tf.Variable( tf.random_normal( [batch_size, 40, 64] )  )
    V = tf.matmul(x, Wv) + bv    # Nx40x64
    

    ## calc a score of word_i importance to all other words
    scores_matrix = tf.matmul( Q, K, transpose_b=True)        ### [N, 40, 40]
    scores_matrix = scores_matrix/( tf.sqrt(64.0) )           ### [N, 40, 40]
    
    ################################# ## look_ahead_mask [N, 40, 40]
    ## [N, 40, 40] + [N, 40, 40]
    scores_matrix = scores_matrix + (look_ahead_mask * -1e9)
    ## [N, 40, 40]
    
    #################################
    # softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1.
    ## axis -1 is for last dimension in this tensor
    
    a1 = tf.nn.softmax(scores_matrix, axis=-1) # (N, seq_len_q, seq_len_k)
    a1 = tf.nn.dropout(a1, dropout)
    a2 = tf.matmul(a1, V) ## [N, 40, 40] * [N, 40, 64]
    
    return a2 ## [N, 40, 64]
    

'''



In [69]:

class Head(nn.Module):
    """ one head of self-attention """
    
    def __init__(self, head_size):
        
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        
        ## the mask tril is not part of the graph since only for masking
        ## so register buffer makes it a thing out of the graph
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        
        B, T, C = x.shape
        k = self.key(x)              ## (B, T, head_size)
        q = self.query(x)            ## (B, T, head_size)
        
        ## scale by the head size (d_k), not by the embedding size C
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5       ## (B, T, hs) @ (B, hs, T)  -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))     ## (B, T, T)
        wei = F.softmax(wei, dim= -1)           ## (B, T, T)
        wei = self.dropout(   wei   )
        
        ## perform the weighted aggregation of the values
        v   = self.value(  x  )   ## (B, T, head_size)
        out = wei @ v             ## (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        
        return out
        



## Multi-Head Attention


The masked multi-head attention layer runs several (e.g. 8) attention heads in parallel and concatenates the results.

The concatenated result is mapped through one more linear layer and then added to the layer's original input, which forms the residual connection.





![alternative text](encoder_layer.png)



In [70]:


'''


## input_dec_layer = [N, 40, 512]

## dec_look_ahead_comb_mask [N, 40, 40]


def decoder_layer(input_dec_layer,dec_look_ahead_comb_mask, dropout):

    with tf.variable_scope("Dec_MultiHead_Attention_1"):
        z1 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_2"):
        z2 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_3"):
        z3 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_4"):
        z4 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_5"):
        z5 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_6"):
        z6 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_7"):
        z7 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Dec_MultiHead_Attention_8"):
        z8 = Dec_MultiHeadAttention(input_dec_layer, dec_look_ahead_comb_mask, dropout)
        
    z_concat = tf.concat([z1, z2 ,z3, z4, z5, z6, z7, z8], -1) ## [N, 40, 512]
    
    W0 = tf.Variable( xavier_init( [batch_size, 8*64, 512] ) ) 
    b0 = tf.Variable( tf.random_normal( [batch_size, 40, 512] ) ) 
    z1 = tf.matmul(z_concat, W0) + b0
    
    residual1 = layer_norm(input_dec_layer + z1)
    

'''




Consider this part of the code in the Dec_MultiHeadAttention function:


In [71]:

'''


Wq = tf.Variable( xavier_init( [batch_size, 512, 64] ) ) 
bq = tf.Variable( tf.random_normal( [batch_size, 40, 64] ) ) 
Q = tf.matmul(x, Wq) + bq # Nx40x64

Wk = tf.Variable( xavier_init( [batch_size, 512, 64] ) ) 
bk = tf.Variable( tf.random_normal( [batch_size, 40, 64] ) ) 
K = tf.matmul(x, Wk) + bk # Nx40x64

Wv = tf.Variable( xavier_init( [batch_size, 512, 64] ) ) 
bv = tf.Variable( tf.random_normal( [batch_size, 40, 64] ) ) 
V = tf.matmul(x, Wv) + bv # Nx40x64


'''





You calculate the queries, keys, and values, which are tensors obtained by mapping the input x of size [N, 40, 512] to size [N, 40, 64]. We then calculate the scores matrix, which is the core of the Attention mechanism. This is a dot product: we matrix multiply Q with the transpose of K. This results in a matrix of size [N, 40, 40].
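The same shapes can be checked in PyTorch. This is a minimal sketch rather than a port of the function above: `nn.Linear` stands in for the weight and bias variables, the batch size of 2 is an assumption, and the names `to_q`, `to_k`, and `to_v` are mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N = 2                          # assumed batch size for illustration
x = torch.randn(N, 40, 512)    # stand-in for the embedded input

# One linear map each for queries, keys, and values: 512 -> 64.
to_q = nn.Linear(512, 64)
to_k = nn.Linear(512, 64)
to_v = nn.Linear(512, 64)

Q, K, V = to_q(x), to_k(x), to_v(x)          # each (N, 40, 64)

scores_matrix = Q @ K.transpose(-2, -1)      # (N, 40, 64) @ (N, 64, 40) -> (N, 40, 40)

print(Q.shape, scores_matrix.shape)
```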



In [72]:

'''

## calc a score of word_i importance to all other words

scores_matrix = tf.matmul( Q, K, transpose_b=True)     ### [N, 40, 40]
scores_matrix = scores_matrix/(tf.sqrt(64.0))          ### [N, 40, 40]

'''




After calculating the score matrix, we need to mask the values so that we don’t cheat by looking ahead. We apply the look ahead and padding masks. The mask for look ahead attention happens before the softmax calculation. Notice that the masking is done to the dot_product scores matrix only. The mask is multiplied with -1e9 (close to negative infinity).


In [73]:

'''

## look_ahead_mask [N, 40, 40]
## [N, 40, 40] + [N, 40, 40]

scores_matrix = scores_matrix + (look_ahead_mask * -1e9) ## [N, 40, 40]


'''




This is done because the mask is added to the scaled matrix product of Q and K immediately before the softmax. The goal is to zero out the masked cells (padding and future positions): large negative inputs to the softmax come out as values near zero.



In [74]:


'''

For example, softmax for “a”

a = tf.constant([0.6, 0.2, 0.3, 0.4, 0, 0, 0, 0, 0, 0]) 

tf.nn.softmax(a)

gives the following


<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([0.15330984, 0.10276665, 0.11357471, 0.12551947, 0.08413821,
0.08413821, 0.08413821, 0.08413821, 0.08413821, 0.08413821], dtype=float32)>

now, if some of the values are negative infinities

b = tf.constant([0.6, 0.2, 0.3, 0.4, -1e9, -1e9, -1e9, -1e9, -1e9, -1e9])

tf.nn.softmax(b)

then softmax gives us

<tf.Tensor: shape=(10,), dtype=float32, numpy=
array([ 0.3096101 , 0.20753784, 0.22936477, 0.25348732, 0. ,0. , 0. , 0. , 0. , 0. ], dtype=float32)>

'''






Notice the infinities are now zeros! At this point, just like with the encoder_multihead_attention function, the decoder_multihead_attention function takes the scores_matrix after adding the mask and applies the softmax. The softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1. The value of axis = -1 is for the last dimension in this tensor.
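Since the rest of this notebook uses PyTorch, here is the same softmax experiment as a PyTorch sketch. The input values are the ones from the TensorFlow example above; the variable names `probs_a` and `probs_b` are mine.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([0.6, 0.2, 0.3, 0.4, 0, 0, 0, 0, 0, 0])
b = torch.tensor([0.6, 0.2, 0.3, 0.4, -1e9, -1e9, -1e9, -1e9, -1e9, -1e9])

probs_a = F.softmax(a, dim=-1)   # the zeros still receive probability mass
probs_b = F.softmax(b, dim=-1)   # the masked positions receive (numerically) zero

print(probs_a)
print(probs_b)
```

In the `Head` class of this notebook the same effect is obtained with `masked_fill` and `float('-inf')` instead of adding -1e9.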



In [75]:

'''

a1 = tf.nn.softmax(scores_matrix, axis=-1) # (N, seq_len_q, seq_len_k)
a1 = tf.nn.dropout(a1, dropout)
a2 = tf.matmul(a1, V) ## [N, 40, 40] * [N, 40, 64]
return a2 ## [N, 40, 64]



'''




Finally, just like before, dropout is applied and the result is multiplied with the matrix V. The final tensor is of size [N, 40, 64].
Remember that the previous function


In [76]:

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(  [ Head(head_size) for _ in range(num_heads) ] )
        self.proj  = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        out = torch.cat(   [ h(x) for h in self.heads ], dim = -1   )
        out = self.proj(  out   )
        out = self.dropout(   out   )
        return out


In [77]:

class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """
    
    def __init__(self, n_embd):
        
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        
    def forward(self, x):
        return self.net(x)



## The N decoding blocks


In [78]:


class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa   = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward( n_embd)
        self.ln1  = nn.LayerNorm(n_embd)
        self.ln2  = nn.LayerNorm(n_embd)
        
    def forward(self, x):
        ## these normalizations (ln1, ln2) are about the only thing different from
        ## the original Vaswani paper. In the paper, they are done at the end of forward
        ## but now they are usually done at the beginning of forward
        x = x + self.sa(     self.ln1(x)      )
        x = x + self.ffwd(   self.ln2(x)      )
        return x
    



## A GPT language model

The decoder has one last layer as can be seen here

self.lm_head = nn.Linear(n_embd, vocab_size)

where h6 = [N,40,512]

dec_out_one_hot = self.lm_head(h6)

the returned dec_out_one_hot is of size [N, 40, vocabulary_size]

this final layer maps a tensor of size [N, 40, 512] to a tensor of size [N, 40, vocab_size] where vocab_size is the size of the vocabulary
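The shape change performed by this last layer can be verified with a small sketch. The batch size of 2 is an assumption, `h6` is filled with random numbers in place of the real decoder output, and the softmax at the end is only there to show how the output can be turned into probabilities over the vocabulary.

```python
import torch
import torch.nn as nn

N, vocab_size = 2, 65            # N is an assumed batch size; 65 is the character vocabulary of this notebook
h6 = torch.randn(N, 40, 512)     # stand-in for the output of the last decoder layer

lm_head = nn.Linear(512, vocab_size)
dec_out_one_hot = lm_head(h6)    # (N, 40, vocab_size)

probs = torch.softmax(dec_out_one_hot, dim=-1)   # probabilities over the vocabulary

print(dec_out_one_hot.shape)
```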


In [79]:

'''

## embed_en_pos_dec_in = [N, 40, 512]

def decoder(embed_en_pos_dec_in,  dec_look_ahead_comb_mask, dropout):

    with tf.variable_scope("Decoder_layer_1"):
        h1 = decoder_layer(embed_en_pos_dec_in,  dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Decoder_layer_2"): 
        h2 = decoder_layer(h1,  dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Decoder_layer_3"): 
        h3 = decoder_layer(h2,  dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Decoder_layer_4"): 
        h4 = decoder_layer(h3,  dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Decoder_layer_5"): 
        h5 = decoder_layer(h4,  dec_look_ahead_comb_mask, dropout)
    with tf.variable_scope("Decoder_layer_6"): 
        h6 = decoder_layer(h5,  dec_look_ahead_comb_mask, dropout)


    ## h6 = [N,40,512]

    dec_out_one_hot = dec_final_linear_layer(h6)

    return dec_out_one_hot          ## [N, 40, vocabulary_size]


'''



In [80]:


class BigramLanguageModel(nn.Module):
    
    def __init__(self):
        
        super().__init__()
        
        self.token_embedding_table    = nn.Embedding(vocab_size, n_embd)     ## [vocab_size, embed_size]
        self.position_embedding_table = nn.Embedding(block_size, n_embd)     ## 
        
        self.blocks = nn.Sequential(
                *[   Block(n_embd, n_head=n_head) for _ in range(n_layer)    ]
        )
        self.ln_f    = nn.LayerNorm(  n_embd    )        ## final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
    
    def forward(self, idx, targets=None):
        
        B, T = idx.shape
        
        ## idx and targets are both (B, T) tensors of integers
        tok_emb = self.token_embedding_table(idx)      ## (B, T, C): batch, time, embed
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))      ## (T, C)
        
        x = tok_emb + pos_emb    ## (B, T, C)

        ## This is the architecture
        
        x = self.blocks(  x  )   ## (B, T, C)        
        x = self.ln_f(    x  )         ## (B, T, C)
        
        logits = self.lm_head(x)                 ## (B, T, vocab_size)   ## logits are what is predicted
        
        if targets is None:
            loss = None
        else:
            B, T, C  = logits.shape
            logits   = logits.view(B*T, C)
            targets  = targets.view(B*T)
            loss     = F.cross_entropy(logits, targets)
        
        return logits, loss
        
    
    def generate(self, idx, max_new_tokens):
        
        ## idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            
            ## crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            ## get the predictions
            logits, loss = self(idx_cond)
            ## focus only on the last time step
            logits = logits[:, -1, :]           ## becomes (B, C)
            ## apply softmax to get probs
            probs = F.softmax(logits, dim= -1)    ## (B, C)
            ## sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)     ## (B, 1)
            ## append sample to the running sequence
            idx = torch.cat(  (idx, idx_next), dim=1  )            ## (B, T+1)
        return idx
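
The sampling loop in `generate` can be illustrated without PyTorch. The following is a minimal sketch in plain Python, not the model above: `toy_logits` is a hypothetical stand-in for the trained network, and `random.Random.choices` plays the role that `torch.multinomial` plays in the real code. It shows the same four steps: crop the context to `block_size`, turn the logits for the last position into probabilities, sample one token id, and append it to the sequence.

```python
import math
import random

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for the model: favors (last token + 1) mod vocab_size."""
    scores = [0.0] * vocab_size
    scores[(context[-1] + 1) % vocab_size] = 2.0
    return scores

def generate(idx, max_new_tokens, block_size, vocab_size, rng):
    """Autoregressive sampling: each new token is fed back in as context."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        idx_cond = idx[-block_size:]                        # crop to the last block_size tokens
        probs = softmax(toy_logits(idx_cond, vocab_size))   # probabilities for the next token
        idx_next = rng.choices(range(vocab_size), weights=probs, k=1)[0]  # sample one id
        idx.append(idx_next)                                # extend the running sequence
    return idx

rng = random.Random(0)
out = generate([0], max_new_tokens=10, block_size=4, vocab_size=5, rng=rng)
print(out)
```

Because the next token is sampled rather than chosen with an argmax, repeated calls with different seeds produce different sequences, which is also why the text generated by the real model varies from run to run.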
            
            
            



The decoder applies a final linear layer after the 6 decoder_layer calls. It is discussed in the next section.

## Decoder Final Linear Layer

The final layer in the decoder is dec_final_linear_layer. It is a linear layer with no non-linear activation, followed by a softmax over the vocabulary axis, that maps the tensor [N, 40, 512] to a tensor of size [N, 40, en_vocab_size], as can be seen in the next code segment.



In [81]:

'''


## input = [N, 40, 512]

def dec_final_linear_layer(input):
    w_h1 = tf.Variable( xavier_init( [batch_size, 512, VOCAB_SIZE_EN]    ))
    b_h1 = tf.Variable( tf.random_normal(  [batch_size, 40, VOCAB_SIZE_EN] ))
    h1_mul = tf.matmul( input , w_h1 ) 
    h1 = tf.add( h1_mul, b_h1 )
    

    softmax_h1 = tf.nn.softmax( h1 , axis=-1 ) ## [N, 40, vocabulary_size] 
    dec_out_one_hot = softmax_h1  ## [N, 40, vocabulary_size]
    
    #################
    ## if you wanted the ids, you could do this
    dec_out_ids = tf.argmax( softmax_h1 , axis=-1) 
    dec_out_ids = tf.cast(dec_out_ids, tf.int32)
    
    #################
    ## you could return
    ## dec_out_ids or dec_out_one_hot
    ## [N, 40] [N, 40, vocabulary_size]
    ## because of the loss function used (sparse_cross_entropy) 
    ## dec_out_one_hot seems to be the correct one
    
    return dec_out_one_hot ## [N, 40, vocabulary_size]

'''
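
To make the shape bookkeeping concrete, here is a small sketch of the same mapping in plain Python with toy sizes (4 instead of 512, 5 instead of the vocabulary size). It is an illustration under assumptions, not the notebook's implementation: the helper name `final_linear_layer` and the toy values are made up, and the bias is a single vector of length `vocab` shared across all positions, which is the more common convention and differs from the per-position bias used in the TensorFlow code above.

```python
import math

def softmax(scores):
    """Normalize a list of scores into probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def final_linear_layer(h, w, b):
    """h: [N][T][d_model], w: [d_model][vocab], b: [vocab] -> probs: [N][T][vocab]."""
    vocab = len(b)
    out = []
    for sentence in h:                         # loop over the batch dimension N
        rows = []
        for vec in sentence:                   # loop over the T positions
            logits = [
                sum(vec[k] * w[k][j] for k in range(len(vec))) + b[j]
                for j in range(vocab)
            ]                                  # linear map, no activation
            rows.append(softmax(logits))       # softmax over the vocabulary axis
        out.append(rows)
    return out

# Toy sizes (assumed for illustration): N=2 sentences, T=3 positions, D=4, V=5
N, T, D, V = 2, 3, 4, 5
h = [[[0.1 * (n + t + k) for k in range(D)] for t in range(T)] for n in range(N)]
w = [[0.01 * (k - j) for j in range(V)] for k in range(D)]
b = [0.0] * V

probs = final_linear_layer(h, w, b)            # [N][T][V]
# Equivalent of tf.argmax(..., axis=-1): the most likely id at each position
ids = [[max(range(V), key=row.__getitem__) for row in sentence] for sentence in probs]
print(len(probs), len(probs[0]), len(probs[0][0]))
print(ids)
```

Each row of `probs` is a probability distribution over the vocabulary for one position, and `ids` collapses it to one token id per position, mirroring the two possible return values discussed in the comments of the TensorFlow function.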




## Instantiate GPT


In [82]:

model   = BigramLanguageModel()

m = model.to(device)



In [None]:

optimizer = torch.optim.Adam(  m.parameters(), lr=learning_rate   )



## Training


In [None]:

for iter in range(max_iters):
    
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    xb, yb = get_batch('train')
    
    ## evaluate the loss
    logits, loss = m(xb, yb)
    
    optimizer.zero_grad(set_to_none=True)   ## clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
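
The loss that this loop minimizes is the cross-entropy computed in `forward`, where the `(B, T, C)` logits are flattened to `(B*T, C)` and the `(B, T)` targets to `(B*T)` before averaging. The sketch below reproduces that calculation in plain Python so the numbers can be checked by hand. The helper names `cross_entropy` and `flatten` are illustrative, not part of the notebook.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood. logits: [B*T][C], targets: [B*T] integer ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)                                             # stabilize the exponentials
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[target]                       # -log softmax(row)[target]
    return total / len(targets)

def flatten(batch):
    """Merge the batch and time dimensions: [B][T]... -> [B*T]..."""
    return [item for seq in batch for item in seq]

B, T, C = 2, 3, 4
logits = [[[0.0] * C for _ in range(T)] for _ in range(B)]       # all scores equal: uniform prediction
targets = [[1, 2, 3], [0, 1, 2]]

loss = cross_entropy(flatten(logits), flatten(targets))
print(loss, math.log(C))
```

With all logits equal, every token gets probability 1/C, so the loss equals ln(C). This gives a useful sanity check for training: a freshly initialized model should report a loss near ln(vocab_size), and the printed train and validation losses should fall below that value as training proceeds.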




## Now, regenerate after some training


In [None]:

## Kick off generation with some starting token. In this case id 0

context = torch.zeros(  (1, 1),  dtype=torch.long, device=device   )   ## (1, 1) tensor holding token id 0

gen_text = m.generate(context, max_new_tokens=500)[0].tolist()

print(  decode(gen_text)   )
