# Let's build GTP from Scratch, in code , spelled out

- ChatGPT is a system that allows us to interact with an AI and give it text-based tasks.
- it generates words sequentially, and it is a probabilistic system (for any same prompt, it can generate multiple answers)
    - mine: so just like what we did when we used the . token as a start to generate different names in the previous notebooks
- it is what we call a `Language Model` -what we have been building so far-
    - it models the sequence of tokens and knows how they follow each other in a language
        - in bigram we modeled the counts of each pair of tokens in the dataset
        - in the MLP we fed it multiple token sequences (3 or 4 up to 8 tokens) -> result token
    - from its perspective, what it is doing is that it is completing the sequence (the prompt we give it is the start of the sequence, and it completes it)

- so what is the neural network under the hood that models the sequence of these words?
    - it comes from a paper called [Attention is All You Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. (2017), it proposed the `Transformer` architecture
        - so GPT is an abbreviation for `Generatively Pre-trained Transformer`
    - if we read the paper, it reads like a random machine translation paper, that is because they didn't fully anticipate the impact of the model they proposed
    - the architecture they produced in the context of machine translation, ended up taking over the rest of AI in the 5 years after
        - this architecture with minor changes was copy-pasted into a huge amount of applications in AI (such as ChatGPT), it is trained on a good chunck of the internet, and there is a lot of pre-training and fine-tuning stages to it

- mine: in Course 5 in deep learning specialization, we studied the language model (which is one-to-many architecture), and studied its behaviour
    - during training, we feed it the actual sequence of tokens, and make it predict the next token in the sequence (we take the predicted probability of the actual next token and we maximize it -or more precisely minimize the negative log of it-)
        - this is exactly what we have been doing so far
    - during inference, we can 
        - feed it a sequence of tokens and check the probability of that sequence, or 
        - use it to generate tokens (give it none or the start token and make it generate the next token, then sample from the output distribution, and feed it back to the model to generate the next token, and so on)
        - this is what we have been doing in the previous notebooks
    - but then we studied the encoder-decoder architecture (which is many-to-many architecture), and its applications in image captioning and machine translation
        - in image captioning, we feed the image to the encoder and get a fixed-size representation -that represents the image- and feed it to the decoder along with the start token to generate the caption
            - also the behavior is different during training (we feed it the actual sequence of tokens along with the image representation and make it predict the next token in the sequence) and during inference (we feed it the image and the start token and make it generate the next token)
        - in machine translation, we feed the source sentence to the encoder and get a fixed-size representation -that represents the source sentence- and feed it to the decoder along with the start token to generate the target sentence
            - also the behavior is different during training and during inference
        - the decoder in encoder-decoder architecture is similar to the language model (in one-to-many architecture), but there are 2 differences:
            1. the decoder in encoder-decoder architecture takes an vector representation of some input (image in image captioning, source sentence in machine translation) and the start token, and generates the next token
                - so we can think of the decoder in the encoder-decoder architecture as a conditional language model (it generates the output sequence conditioned on some input)
                - unlike the language model, we gave it the start token or none and it generated the next tokens
            2. in the encoder-decoder architecture, we want to generate the most likely output sequence given the input (image in image captioning, source sentence in machine translation), we don't just sample from the output distribution like we did with the language model in one-to-many architecture
        - then we studied using the encoder-decoder architecture with the attention mechanism, (instead of using the same encoded vector representation of the input in every step of the decoder) we feed the decoder with a context vector that is a weighted sum of the encoded input 
            - like in image captioning, we feed the encoded image with different weights in every step of the decoder to tell it which part of the image to focus on when generating the next token
            - in machine translation, we feed the encoded source sentence with different weights in every step of the decoder to tell it which part of the source sentence to focus on when generating the next token (so the decoder takes the context vector and the previous activation to generate the next token)
        - then we studied the transformer architecture, which is much more complex than the previous architectures
            - it processes the entire sequence at once (unlike the RNNs that process the sequence one token at a time)
            - we know that the same word can have different meanings depending on the context it is used in -the surrounding words-, so the transformer uses something called `self-attention` to tune the representation of each word in the sequence based on the other words in the sequence
                - and we do that multiple times (called multi-head attention) to have multiple perspectives on each word in the sequence
                - read my notes for more details

        
- in this notebook, we will like to build out something like chatGPT, we will just focus on training a transformer-based language model (will be a character level language model), and we will train on a smaller dataset (Tiny Shakespeare dataset, which is a concatenation of all of Shakespeare's works)
    - and the transformer will model how these characters follow each other 
    - so it can take a chunk of characters (context of characters) the transformer will look at the characters and predict the next character
    

In [1]:
with open("shakespear.txt","r",encoding='utf-8') as file:
        text = file.read()

In [2]:
len(text)

1115394

we are working with 1M characters roughly

In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



## Vocabulary, encoding (numericalization), and decoding

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
chars

65


['\n',
 ' ',
 '!',
 '$',
 '&',
 "'",
 ',',
 '-',
 '.',
 '3',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [5]:
stoi = {char:i for i,char in enumerate(chars)}
itos = {i:char for i,char in enumerate(chars)}

In [6]:
stoi

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

In [7]:
encode = lambda s: [stoi[c] for c in s] # takes a string, output list of integers
decode = lambda l: ''.join(itos[i] for i in l) # takes the list of integers and return the string

In [8]:
encode('hii there')

[46, 47, 47, 1, 58, 46, 43, 56, 43]

In [9]:
decode([46, 47, 47, 1, 58, 46, 43, 56, 43])

'hii there'

- this is only one of many possible tokenizers
- but there are many other schemas that people came up with in practice
    - for example, `google` uses `SentencePiece`, which is a byte pair encoding tokenizer
        - it is a subword tokenizer, which is usually what adopted in practice (something in between character level and word level)
    - `openAI` uses a library called `tiktoken`, that uses a byte pair encoding tokenizer as well

In [10]:
import tiktoken
enc = tiktoken.get_encoding('gpt2') # we are getting the encoding used in gpt2

In [11]:
enc.n_vocab

50257

- they have 50K tokens in their vocabulary

In [13]:
enc.encode("hii there")

[71, 4178, 612]

- we encoded the exact same string, we got a list of 3 integers (meaning the above text is 3 tokens)
- so basically, you can trade-off the vocabulary size with the sequence length
    - so you can have a very long sequences of integers with very small vocabularies
    - or you can have a short sequences of integers with very large vocabularies
- people typically use subword tokenization in practice, but we will keep our tokenization simple and use character level tokenization (very small vocabulary and very long sequences)

- now we can tokenize the entire training set of shakespear

## Tokenize the dataset

In [14]:
import torch
data = torch.tensor(encode(text),dtype=torch.long)
data.shape, data.dtype

(torch.Size([1115394]), torch.int64)

In [15]:
data[:1000]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 

## Split the dataset into training and validation sets
- we will take the first 90% of the dataset as the training set and the last 10% as the validation set
- this will help us understand to what extent the model is overfitting (whether the improvement in the loss is actually due to the model learning the data or just memorizing the data)
- that is because we don't want a perfect memorization of this exact shakespear text, we want a neural network that creates shakespear-like text (generalizes to shakespear-like text)

In [16]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [17]:
len(train_data), len(val_data)

(1003854, 111540)

## Data loader: batches of chunks of data

### Chunking the data 
- we would like to start plugging this text sequence (more precisely integer sequence) into the transformer, so that it can train on them and learn those patterns

- it would be computationally expensive to feed the entire dataset sequence at once to the transformer
    - so when we actually train the transformer on a lot of these datasets, we only work with chunks of the dataset with maximum length called `block_size` or `context_length`
    - we basically will sample random little chunks out of the training set and train the model on them

In [18]:
block_size = 8
# let's look at the first block size
train_data[:block_size+1] # we take block_size+1 

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

- as we said in the first notebook, the above chunk has multiple examples packed into it (block_size examples)
    - 47 likely comes after 18
    - 56 likely comes after 18, 47
    - 57 is likely to come after 18, 47, 56
    - and so on

In [31]:
i = 0 # later will be random starting indices
x = train_data[i:i+block_size]     # x is 0,1,2,3,4,5,6,7
y = train_data[i+1:i+block_size+1] # y is 1,2,3,4,5,6,7,8 (basically x shifted by 1)
for t in range(block_size):
    context = x[:t+1] # print all sequences up to (including) the current timestep
    target = y[t] # print the label of the current timestep sequence
    print(f"when input is {context}, The target is {target}")

when input is tensor([18]), The target is 47
when input is tensor([18, 47]), The target is 56
when input is tensor([18, 47, 56]), The target is 57
when input is tensor([18, 47, 56, 57]), The target is 58
when input is tensor([18, 47, 56, 57, 58]), The target is 1
when input is tensor([18, 47, 56, 57, 58,  1]), The target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), The target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), The target is 58


- the above are the 8 examples in the chunk we sampled from the training set
- mine: the labels are the same as inputs but shifted by one token 
    - that is why we extracted block_size+1 tokens 
        - from 0 to block_size are the x's
        - from 1 to block_size+1 are the y's (x but shifted by 1)
- so notice that for each chunk, we train on examples of multiple contexts (with context between 1 up to block_size)
    - we do that not just for computational reasons, but to make the transformer network used to seeing different contexts (all the way from as little as 1, up to block_size)
    - this will be useful later in inference, because while we are sampling from the model, we can start the sampling generation with as little as 1 character of context, and it knows how to predict the next character, then use the 2 characters to predict the next, and so on up using block_size characters to predict the next one, `after block_size we will have to start truncating the context, because the transformer will never receive more than block_size characters of context when predicting the next character` 

- the above was the time dimension (sequence length), now the batch dimension 
### multiple chunks in a batch

- as we are sampling these chunks of text, we are going to have a batch of these chunks (we will have many batches of multiple chunks of text)
    - and that is just for efficiency reasons, since the GPUs are very good at parallel processing of data, so we make it process multiple chunks all at the same time to keep them busy (each chunk is processed independently)

In [20]:
torch.randint(3, (4,)) # sample 4 random integers from 0 to 2 (3 is exclusive)

tensor([2, 0, 1, 0])

In [32]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences we will process in parallel?
block_size = 8 # what is the maximum context length for prediction

def get_batch(split):
    data = train_data if split == 'train' else val_data
    idx = torch.randint(len(data) - block_size, (batch_size,)) # the maximum starting index we can sample is len(data) - block_size - 1 (that is the end) but since the end is exclusive, we will add 1 to include it 
    x = torch.stack([data[i:i+block_size] for i in idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return x,y

x_batch, y_batch = get_batch('train')
print('inputs: ')
print(x_batch.shape)
print(x_batch)
print('targets')
print(y_batch.shape)
print(y_batch)

print('---------------------------')

for ex in range(batch_size):
    for t in range(block_size):
        context = x_batch[ex][:t+1] # print all sequences up to (including) the current timestep
        target = y_batch[ex][t] # print the label of the current timestep sequence
        print(f"when input is {context}, The target is {target}")

inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
---------------------------
when input is tensor([24]), The target is 43
when input is tensor([24, 43]), The target is 58
when input is tensor([24, 43, 58]), The target is 5
when input is tensor([24, 43, 58,  5]), The target is 57
when input is tensor([24, 43, 58,  5, 57]), The target is 1
when input is tensor([24, 43, 58,  5, 57,  1]), The target is 46
when input is tensor([24, 43, 58,  5, 57,  1, 46]), The target is 43
when input is tensor([24, 43, 58,  5, 57,  1, 46, 43]), The target is 39
when input is tensor([44]), The target is 53
when input is tensor([44, 53]), The target is 56
when input is t

- so for each batch_size, we have batch_size * block_size examples (4*8=32) and they are completely independent as far as the transformer is concerned

## Recap: Bigram language model

- covered it in depth in the previous notebooks, so we will go faster 
- mine: connecting with what we have done so far
    - the preprocessing we done above is exactly as we did in previous notebooks
        - we processed each individual name, and that name packed multiple examples in it
            - for example from the word "`isabella` that
                - it says that the character "i" is a very likely character to come first in the sequence of a name (mine: after the start token we will create)
                - the character "s" is a very likely character to come after "i"
                - the character "a" is a very likely character to come after "is"
                - the character "b" is a very likely character to come after "isa"
                - and so on all the way to "a" following "isabell" 
                - finally, after there is isabella, the word is very likely to end (mine: we will make it predict the end token we will create)
        - now each individual chunk packs exactly the same information as each individual name in the above example
    - then we took the bigram language model, which was a weak model because it looked at context of 2 characters only 
        - for example it will learn from the word "isabella" that
            - "i" is likely to start a word
            - "s" is likely to come after "i"
            - "a" is likely to come after "s"
            - "b" is likely to come after "a"
            - and so on till "a" is likely to come after "l", and "a" is likely to finish the word
        - and we said it is a weak model because when we look at a window of 2 characters, our capacity to learn the patterns in the data is limited (because the a might come after "l" in some word, and come after "b" in another word and so on, the context is not enough to understand the patterns in the data)
    - then we formalize that in a neural network framework which allowed us to scale up the context (we used context 2 then context 3 up to context 8 with the mlps)
        - but the mlp takes fixed context size, so if it was 3 for example, then the word isabella was expressed by the following examples
            - ... -> i
            - ..i -> s
            - .is -> a
            - isa -> b
            - sab -> e
            - abe -> l
            - bel -> l
            - ell -> a
            - lla -> . (end token)
        - so it is not yet able to make use of all the examples or information packed in the word 
        - we started the mlp simple then used batch norm then used progressive fusion (processing the characters not all at once but in a hierarchical manner) and the loss decreased
    - then the RNNs were introduced, and they have the ability due to their sequential nature of processing to capture all the examples packed in the sequence theorectically
        - but in practice, they have a lot of limitations, like it fails to capture long-range dependencies because of the vanishing gradient problem (the early tokens in the sequence have very little impact on the later tokens in the sequence, therfore the gradients that are backpropagated through the network are very small and the network doesn't learn anything)
        - people then tried to solve that by using LSTMs and GRUs, as they have the ability thanks to the gates to carry on activations from the early tokens in the sequence to the later tokens in the sequence, therefore they are better at capturing long-range dependencies
            - but still they have limitations, like they are slow to train because they are sequential in nature (they process the sequence one token at a time), and the performance of the model is limited by the capacity of the memory cell (the hidden state of the LSTM or GRU)
            - so the performance decreases as the sequence length increases, for example in the encoder-decoder architectures, the GRUs and the LSTMs have to process and memorize the entire source sentence or the entire image representation in the encoder, and then generate the target sentence or the caption in the decoder
                - so the performance of the model decreases as the length of the source sentence or the image representation increases
            - then people started using them with the attention mechanism, which allows the decoder to focus on different parts of the source sentence or the image representation in every step of the decoder
                - the encoder now doesn't have to summarize the entire source sentence or the image representation in a fixed-size vector, but it can give the decoder a context vector that is a weighted sum of the source sentence or the image representation
    - then the transformers were introduced, and they are much better than RNNs, GRUs, and LSTMs for several reasons
        - the key innovation in the transformer is the self-attention mechanism
            - in the transformer, each element of the input sequence can attend to (or focus on) every other element of the sequence
            - this means that the model can directly consider the relationships between distant elements in the sequence without relying on sequential steps of processing (like the RNNs, GRUs, and LSTMs)
                - this allows the transformer to handle the entire sequence at once (we apply the self-attention mechanism to make them communicate with each other and update their representations based on the other elements in the sequence -all at once-, then we process them all at once)
                - so it is like making the sequence elements talk to each other and capture the relationships without me -the model- having to do so
            - this attention mechanism helps the transformer to also capture different contexts of data more effectively (as each token in the sequence can attend to different number of other tokens -with different importance- in the sequence)
                - for example the $7^{th}$ token in the sequence will look at 6 tokens before and itself then gather information from them -context of 7 tokens-, the $3^{rd}$ token will look at 2 tokens before and itself then gather information from them, and so on
                - so the self-attention mechanism which allows each token to look at the tokens before them already captures different contexts of data (so it can actually capture all the information packed in the sequence as we specified earlier)
            - because transformers process the entire sequence simultaneously, tey don't inherently understand the order of the sequence, which the transformers solve by adding positional encodings to the input embeddings
                - so the transformer can understand the order of the sequence by adding the positional encodings to the input embeddings


In [33]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [34]:
dummy_emb = nn.Embedding(5,5)
dummy_emb

Embedding(5, 5)

In [36]:
dummy_indices = torch.tensor([[0,2,3,4],[1,2,2,1]])
dummy_indices.shape

torch.Size([2, 4])

In [37]:
dummy_emb(dummy_indices),dummy_emb(dummy_indices).shape

(tensor([[[ 0.6258,  0.0255,  0.9545,  0.0643, -0.5024],
          [-0.4713,  0.0084, -0.6631, -0.2513,  1.0101],
          [ 0.1215,  0.1584, -0.6300, -0.2221,  0.6924],
          [-0.5075, -0.9239,  0.5467, -1.4948, -1.2057]],
 
         [[-0.2026, -1.5671, -1.0980,  0.2360, -1.8002],
          [-0.4713,  0.0084, -0.6631, -0.2513,  1.0101],
          [-0.4713,  0.0084, -0.6631, -0.2513,  1.0101],
          [-0.2026, -1.5671, -1.0980,  0.2360, -1.8002]]],
        grad_fn=<EmbeddingBackward0>),
 torch.Size([2, 4, 5]))

- every single integer in (batch_size,sequence_length) will go to the corresponding row in the embedding matrix and bluck it out
    - so instead of having integers (scalars) of shape (2,4), we got blucked out vectors of shape 5 and that is for each element in (2,4) so the shape is (2,4,5)

In [38]:
torch.manual_seed(1337)

class Bigram(nn.Module):
    def __init__(self,vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,vocab_size)

    def forward(self,idx,targets):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        logits = self.token_embedding_table(idx) # the logits of shape (batch_size,sequence_length,emb_size)
        
        return logits


- we manually had a lookup table of shape vocab_size,vocab_size and we manually filled it with the counts of each pair of tokens, then we used these counts as logits to get the probabilities and sample the next character from them
    - here, we used nn.Embedding as the table and treated its output as the logits then backprograte to update the table

In [39]:
model = Bigram(vocab_size)
out = model(x_batch,y_batch)
out.shape

torch.Size([4, 8, 65])

In [40]:
vocab_size

65

- we got the logits of next characters for every one in the 4x8 positions
- we will include the loss as well

In [41]:
torch.manual_seed(1337)

class Bigram(nn.Module):
    def __init__(self,vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,vocab_size)

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        logits = self.token_embedding_table(idx) # the logits of shape (batch_size,sequence_length,emb_size)
        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (batch_size,sequence_length), sequence_length can be anything (past context)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # inference the idx
            logits, _ = self(idx) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx

In [42]:
model = Bigram(vocab_size)
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([4, 8, 65]), tensor(4.8786, grad_fn=<NllLossBackward0>))

- we can guess what is the initial loss
    - initially, all of them should have the same probability that is $\frac{1}{vocab_size}$, so the loss should be $-log(1/vocab_size)$
    - in our case, -log(1/65) = 4.17
    - since our loss is higher than that, it means that the initial predictions are not super diffuse, and got a little bit of entropy
- let's generate some text from the model

In [43]:
decode([0])

'\n'

- we will use the newline character as the starting index (to kick off the generation)
    - reasonable thing to start with

In [44]:
start_idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(start_idx,100)[0].tolist()))


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


- it is rubbish at first

### Train the bigram
- in the makemore series we only used the SGD manually
- but now we will use `AdamW` optimizer, which is a much more advanced and popular optimizer
    - a typical good setting for the learning rate is roughly 3e-4, but for very small networks, we can get away with higher learning rates (like 1e-3)

In [45]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [50]:
@torch.no_grad()
def estimate_loss(eval_iter = 200):
    model.eval()
    out = {}
    for split in ['train','val']:
        losses = torch.zeros(eval_iter)
        # we will loop over random eval_iter batches and average the losses to get a more robust estimate
        for k in range(eval_iter):
            x,y = get_batch(split)
            logits, loss = model(x,y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [51]:
batch_size = 32
eval_interval = 2000
for i in range(10000):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

print(loss.item())

Train Loss: 4.7281, validation Loss: 4.7212
Train Loss: 3.1239, validation Loss: 3.1254
Train Loss: 2.6440, validation Loss: 2.6419
Train Loss: 2.5278, validation Loss: 2.5189
Train Loss: 2.4921, validation Loss: 2.4963
2.459792137145996


In [52]:
start_idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(start_idx,300)[0].tolist()))


Be l wloms ip, is aince oowoupis pr gedere, anenoweld bt; poup ecored be af FomeatscakevexENRI'des t nd cadeld cer:
Weale? ighalveangry d mak


JA linod t my Whansad
ENELI thund?
Whe hen ay han mandin heret, t foono kis beorcad Sicode ale
AD:
Alth gno ty iotorthind yootho-the; tonthexigacrodshe, CEL


- that is the best bigram can do
    - whatever context we give it, it only looks at the last character to make the predictions about what comes next

## The mathematical Trick in self-attention
- we want to get used to a mathematical trick that is used in the self-attention inside a transformer
    - it is really at the heart of an efficient implementation of self-attention

In [53]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch_size, timestep, channels = features of each timestep in each example
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

- in the above exmaple, we have 8 tokens (a sequence of 8), and they are currently not talking to each other (totally independent)
    - we would like them to talk to each other, we would like to couple them in a very specific way
        - the token at the $5^{th}$ location shouldn't communicate with the tokens at the $6^{th}$ , $7^{th}$, and $8^{th}$ locations, because those are future tokens in the sequence 
        - in other words, each token should look at the tokens that came before it in the sequence
            - `information only flows from previous contexts to the current time step`, we can't get any information from the future because we are trying to predict the future (at least in the context of language modeling)
    - so what is the easiest way for tokens to communicate? to be coupled together?
        1. the simplist (and weakest) way is to take the average of the preceeding elements
            - so the $5^{th}$ token will look at the average of features of the $1^{st}$, $2^{nd}$, $3^{rd}$, and $4^{th}$ tokens
                - that average would be a feature vector that summarizes the context of the previous tokens
                - the sum or average is the weakest form of coupling, as we have lost a lot of information about the placement and order of those tokens

In [54]:
xbow = torch.zeros((B,T,C)) # stores the averages -of features- of the previous timesteps -tokens- for each example and timestep (it will be of dimension C, same as features)
for ex in range(B):
    for t in range(T):
        x_prev = x[ex,:t+1] # get all the previous tokens up to -and indluding- our current one, shape (t,c)
        xbow[ex,t] = torch.mean(x_prev,dim=0)

- bag of words is a term that people use when they just averaging up things

In [55]:
x[0], xbow[0]

(tensor([[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]]),
 tensor([[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]]))

- for the first example, the first row in xbow is the average of the first row in x (which is the same)
    - the second row in xbow is the average of the first and second rows in x (the average of the first and second token features)
    - and so on

- the mathematical trick is that we can do this averaging in a very efficient way using matrix multiplication
    

In [56]:
torch.manual_seed(1337)
a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b # (3,3) X (3,2) = (3,2)
print('a =')
print(a)
print('---------------')
print("b = ")
print(b)
print('----------------')
print('c = ')
print(c)

a =
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
---------------
b = 
tensor([[5., 7.],
        [2., 0.],
        [5., 3.]])
----------------
c = 
tensor([[12., 10.],
        [12., 10.],
        [12., 10.]])


- we multiplied torch.ones with random numbers
    - we see that each element in the first column in the matrix c is the sum of the first column in matrix b (since we multiply 3 ones with the first column in b and sum it)

In [57]:
torch.manual_seed(1337)
a = torch.tril(torch.ones(3,3))
b = torch.randint(0,10,(3,2)).float()
c = a @ b # (3,3) X (3,2) = (3,2)
print('a =')
print(a)
print('---------------')
print("b = ")
print(b)
print('----------------')
print('c = ')
print(c)

a =
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
---------------
b = 
tensor([[5., 7.],
        [2., 0.],
        [5., 3.]])
----------------
c = 
tensor([[ 5.,  7.],
        [ 7.,  7.],
        [12., 10.]])


- now when we wrapped the matrix of ones with torch.tril in order to make it lower triangular
    - the first row has 1 one
    - the second row has 2 ones
    - the third row has 3 ones

- now for the result matrix $c$
    - the first row in $c$ is the first row in $a$
    - the second row in $c$ is the sum of the first and second rows in $a$
    - the third row in c is the sum of the first, second, and third rows in $a$
    - and so on if the matrices are larger
- that was possible because we used the lower triangular matrix of ones (and from the matrix multiplication operation, where we take each row in $a$ and multiply it with each row in $b$ to get the row in c)

In [58]:
torch.tril(torch.ones(3,3)) / torch.tril(torch.ones(3,3)).sum(dim=1,keepdim=True)

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])

In [59]:
torch.manual_seed(1337)
a = torch.tril(torch.ones(3,3))
a /= a.sum(dim=1,keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b # (3,3) X (3,2) = (3,2)
print('a =')
print(a)
print('---------------')
print("b = ")
print(b)
print('----------------')
print('c = ')
print(c)

a =
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
---------------
b = 
tensor([[5., 7.],
        [2., 0.],
        [5., 3.]])
----------------
c = 
tensor([[5.0000, 7.0000],
        [3.5000, 3.5000],
        [4.0000, 3.3333]])


- now we get the average
- so we can see how to use matrix multiplication and a weighting matrix a, to get the sum or average of b in an incremental fashion

In [60]:
wei = torch.tril(torch.ones(T,T))
wei /= wei.sum(dim=1,keepdim=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [61]:
x.shape, wei.shape

(torch.Size([4, 8, 2]), torch.Size([8, 8]))

In [62]:
xbow_vectorized = torch.zeros((B,T,C)) # stores the averages -of features- of the previous timesteps -tokens- for each example and timestep (it will be of dimension C, same as features)
for ex in range(B):
    xbow_vectorized[ex] = wei @ x[ex]

In [63]:
torch.allclose(xbow,xbow_vectorized)

True

- let's vectorize it even further

In [64]:
xbow_vectorized = torch.zeros((B,T,C)) # stores the averages -of features- of the previous timesteps -tokens- for each example and timestep (it will be of dimension C, same as features)
xbow_vectorized = wei @ x # (8,8) @ (4,8,2) -> (4,8,8) @ (4,8,2) 

- so we have a weighting matrix of shape (T,T) `between each token in the T tokens and all others` and our matrix x of shape (B,T,C)
    - what will happen when we multiply wei @ a?
    - (T,T) @ (B,T,C) => (B,T,T) @ (B,T,C) => (B,T,C)
        - so the weighting matrix is broadcasted to the batch dimension, and then we multiply it with the matrix x 
        - so we have (T,T) @ (T,C) for each example in the batch

- this is called `batched matrix multiplication` basically applying matrix multiplication to each example in the batch in parallel and independently
    - so in general for any matrix (a,b,c) @ (a,c,d) => (a,b,d), that is doing the matrix multiplication (b,c) @ (c,d) for each example in the batch

In [65]:
torch.allclose(xbow,xbow_vectorized)

True

- so the trick is that we were able to use batch matrix multiplication to get the average of tokens in an incremental fashion
    - basically we are doing a weighted aggregation, but the weights are designed in a way that makes it an aggregation (normalized equal weights to all past tokens for each token)

- there is another identical vectorized implementation using the softmax
    - the only difference is getting the weighting matrix
    - we set the desired elements of tril -past elements- to be 0 and the rest to be -inf (to make the softmax ignore the future tokens), then we convert to probabilities using the softmax

In [66]:
wei = torch.zeros((T,T))
tril = torch.tril(torch.ones(T,T))
wei.masked_fill(tril == 0, float('-inf'))

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

- when we exponentiate the above matrix in the softmax, we get 1 for the 0s and 0 for the -infs
    - then when we divide by the sum, we get equal probabilities for the 0s, therefore taking the average of the previous tokens, and 0 probabilities for the -infs, therefore ignoring the future tokens
        - so the softmax exponentiation will convert it to ones -for past tokens- and zeros -for future tokens- then we divide by the sum to get the probabilities -exactly the same thing we did above-

In [67]:
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=1)
xbow_vectorized_2 = wei @ x
torch.allclose(xbow_vectorized_2,xbow)

True

- we will use the latter implementation in the self-attention, because it is more interpretable and scalable
    - `torch.zeros((T,T))` means we basically begin the weights with 0s (we can think of it as the interaction strength), or like an affinity matrix -tell us how much each token from the past will contribute in the aggregation-
        - so it is initially 0s and therefore giving all previous tokens same importance or relevance -ending up taking the average-, but they will be data dependent later and not constant at 0s
        - they will start looking at each other, and some tokens will find other tokens more or less interesting to different amounts
    - the line `wei.masked_fill(tril == 0, float('-inf'))` is basically discarding the future tokens (by setting them to -inf before the softmax normalization)
        - we basically saying we are not aggregating any information from the future tokens
    - wei @ x is the actual aggregation through matrix multiplication

- `recap`
    - so we can do weighted aggregation of the past tokens by using matrix multiplication of a lower triangular fashion, and the elements in the lower triangular part will tell us how much each past token will contribute to the aggregation

## Scaling up the network
- we will not get the logits from the lookup table directly, but rather get the embeddings then add a linear layer (usually called `language model head`) that maps the embeddings to the logits
- vocab_size is already defined for the whole notebook so no need to pass it to the constructor

In [68]:
torch.manual_seed(1337)
n_embed = 32

class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.lm_head = nn.Linear(n_embed,vocab_size)

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        logits = self.lm_head(emb) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # inference the idx
            logits, _ = self(idx) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.3632, grad_fn=<NllLossBackward0>))

## Positional encoding

- the indices we feed are encodings of the identities of the tokens (each index refers to a unique token)
- we would also like to encode the position of the tokens in the sequence
    - we will do that with an embedding layer that maps the position of the token (from 0 to block_size) to a vector of the same size as the token embeddings


- we will also modify the generate method to truncate the context to block_size (we said that the transformer will never receive more than block_size tokens of context when predicting the next token)
    - but now it is even more important because we are using the positional encoding, so if we have a position that is larger than block_size, we will get an error (out of range of the embedding table)

In [69]:
torch.manual_seed(1337)
n_embed = 32

class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding = nn.Embedding(block_size,n_embed)
        self.lm_head = nn.Linear(n_embed,vocab_size)

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        # get the token embeddings
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        # get the positional embeddings
        pos_emb = self.positional_embedding(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        # add them together to get the final embeddings
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # truncate the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.4819, grad_fn=<NllLossBackward0>))

- now x not only holds the token identity, but also the positions in which these tokens occur
    - they are not useful in the bigram because we only look at one token to predict the next token after it

- mine: the architecture now is x (token embeddings + positional encodings) -> linear layer -> logits

## The crux of the video: Self-attention

- we will implement a small self-attention for a single individual head (as people call it)

In [70]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch_size, timestep, channels = features of each timestep in each example
x = torch.randn(B,T,C) 

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
# use the tril to mask out the upper triangular part of the weight matrix
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=1)
out = wei @ x
out.shape

torch.Size([4, 8, 32])

- we said that when we initialize the wei to be 0s (we initialize the affinities between the different tokens to be 0), that way we give them equal importance and then the softmax will turn them to equal probabilities and end up taking the average of all tokens
    - but we said we don't actually want all this to be uniform, because differnt tokens will find different other tokens to be more or less interesting
        - for example, if i am a vowel, then maybe i am looking for constants in my past and want to know what they are, so i want their information to flow to me
- in other words, i want to gather information from the past, but i want to do it in a data-dependent way (meaning that the importance of each token in the past will be different for each token in the sequence)
    - that is the problem that self-attention solves 

- the way self-attention solves this is the following
    - every single token (at each timestep), will emit 2 vectors
        - a query vector (q)
            - roughly speaking, what i am looking for 
        - a key vector (k)
            - roughly speaking, what do i contain
    - the way we get affinities between these tokens now in the sequence, is basically doing a dot product between the keys and the queries 
        - so for a certain token, we multiply its query with all the keys of the other tokens, and that will tell us how much each token is interesting to the current token, and that dot product now becomes wei (that we initialized equally)
        - then we discard future tokens for each token using the tril matrix
        
- let's implement a single head of self-attention
    - we have another hyperparameter `head_size` which is the dimension of the query and key vectors
    - we will have 2 weight matrices $W_q$ and $W_k$ that will map the token embeddings to the query and key vectors
        - `nn.Linear(embedding_size,head_size,bias=False)` will do that for us
        - (batch_size,sequence_length,embedding_size) @ (embedding_size,head_size) => (batch_size,sequence_length,head_size)
        
    - we will multiply the input embeddings (batch_size,sequence_length,embedding_size) with these matrices (embedding_size,head_size) to get the query and key vectors of shape (batch_size,sequence_length,head_size) using batched matrix multiplication
        - we simply forward the embeddings to the linear layers we created to perform that multiplication
        - so now all the tokens in all the positions of batch_size,sequence_length will produce a query and key vectors (in parallel and independently), so no communication has happened yet
    - then we dot product each token's query with all the keys of the other tokens to get the affinities
        - Q @ K.transpose(-1,-2), where k.transpose(-1,-2) will transpose the last 2 dimensions and keep the batch dimension as is (because we want to do batched matrix multiplication)
            - so k will be from (batch_size,sequence_length,head_size) to (batch_size,head_size,sequence_length) 
            - then Q @ k.transpose(-1,-2) will be (batch_size,sequence_length,head_size) @ (batch_size,head_size,sequence_length) => (batch_size,sequence_length,sequence_length), which is the affinities or the wei matrix (how much each token is interesting to each other token) for each example in the batch
    - then we cancel the future tokens for each token using the tril matrix 
    - then we apply the softmax to get the probabilities (more like the percentage of how much each token is interested in the previous tokens -the percentage of aggregation- and therefore aggregating its information to it)

            
        

In [73]:
k_dummy = torch.randn(4,8,16) # batch_size, sequence_length, head_size
q_dummy = torch.randn(4,8,16) # batch_size, sequence_length, head_size

k_dummy.shape, q_dummy.shape

(torch.Size([4, 8, 16]), torch.Size([4, 8, 16]))

In [74]:
k_dummy.T.shape, k_dummy.transpose(-2,-1).shape

  k_dummy.T.shape, k_dummy.transpose(-2,-1).shape


(torch.Size([16, 8, 4]), torch.Size([4, 16, 8]))

In [75]:
# mine: another way to transpose batch of matrices
k_dummy.mT.shape

torch.Size([4, 16, 8])

In [77]:
q_dummy @ k_dummy.T # (4,8,16) @ (16,8,4) won't work, it is not even a thing

RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0

In [78]:
(q_dummy @ k_dummy.mT).shape # (4,8,16) @ (4,16,8) = (4,8,8), basically (8,16) @ (16,8) for each example in the batch

torch.Size([4, 8, 8])

- notice that the batch size must match in order to apply matrix multiplication independently for each example

In [79]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch_size, timestep, channels = features of each timestep in each example
x = torch.randn(B,T,C)

# let's see a single head 
head_size = 16
key = nn.Linear(C,head_size)
query = nn.Linear(C,head_size)

# get the keys and queries for all tokens in batch_size,sequence_length
k = key(x) # (B,T,C) @ (B,C, head_size) -> (B,T,head_size)
q = query(x) # same thing (B,T,C) -> (B,T,head_size)

# get the affinities between each pair of tokens
wei = q @ k.transpose(-2,-1) # (B,T,head_size) @ (B,head_size,T) -> (B,T,T)

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
# use the tril to mask out the upper triangular part of the weight matrix
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x # (B,T,T) @ (B,T,C) -> (B,T,C) which is x but with the attention mechanism applied
out.shape

torch.Size([4, 8, 32])

- the wei are not zeros anymore, therefore not applying basic average aggregation
    - they are now coming from the dot product between the queries and the keys
- the weighted aggregation now is data dependent, meaning that the importance of each token in the past will be different for each token in the sequence

In [80]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.8368, 0.1632, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0585, 0.2252, 0.7164, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4540, 0.0818, 0.3104, 0.1538, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0442, 0.1218, 0.1840, 0.3724, 0.2776, 0.0000, 0.0000, 0.0000],
        [0.0214, 0.1360, 0.1688, 0.0460, 0.5743, 0.0536, 0.0000, 0.0000],
        [0.0592, 0.0265, 0.4947, 0.0623, 0.2254, 0.0120, 0.1199, 0.0000],
        [0.0351, 0.0286, 0.3691, 0.1940, 0.1005, 0.0118, 0.0331, 0.2277]],
       grad_fn=<SelectBackward0>)

- as you can see, before, wei was uniform giving each past token an equal importance, but now it is different for each token
- for wei[0] 'first example' we see the affinities of the past tokens for each token of the 8 tokens in the sequence
    - in the last row we see the past token affinities for the 8th token, 
        - they are not uniform, meaning the 8th token is interested in more tokens than the others, according to the keys and queries
        - so the 8th token's query may say i am a vowel and i am looking for constants at positions up to 4
        - then all the other tokens get to emit keys, maybe one of them says i am a constant and i am in a position up to 4
        - so their multiplication will be high -high affinity between the $8{th}$ token and that token in the wei matrix, and when we have high affinities, then the softmax end up aggregating alot of its information into the 8th token (we get to learn a lot about it)

- one final modification, we don't aggregate the embeddings themselves, but rather something called the value vectors
    - so in addition to the key and query vectors, each token will also emit a value vector (v) -what i am contributing with to the aggregation- or -what i am communicating to the other tokens-
    - so we will have a value vector (v) for each token, the same way we had a query and key vector
    - then we will use the wei matrix to aggregate the value vectors of the previous tokens into the current token's value vector 

In [81]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch_size, timestep, channels = features of each timestep in each example
x = torch.randn(B,T,C)

# let's see a single head 
head_size = 16
key = nn.Linear(C,head_size)
query = nn.Linear(C,head_size)
value = nn.Linear(C,head_size)

# get the keys and queries for all tokens in batch_size,sequence_length
k = key(x) # (B,T,C) -> (B,T,head_size)
q = query(x) # (B,T,C) -> (B,T,head_size)
v = value(x) # (B,T,C) -> (B,T,head_size)

# get the affinities between each pair of tokens
wei = q @ k.transpose(-2,-1) # (B,T,head_size) @ (B,head_size,T) -> (B,T,T)

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
# use the tril to mask out the upper triangular part of the weight matrix
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# aggregate the values instead of the embeddings themselves
#out = wei @ x
out = wei @ v # (B,T,T) @ (B,T, head_size) -> (B,T,head_size) same shape as v (the values) but weighted by the attention mechanism
out.shape

torch.Size([4, 8, 16])

- the output of the single head will be of shape (batch_size,sequence_length,head_size)
- why do we aggregate the values instead of the embeddings themselves? why do we have value vectors in the first place?
    - we can think of x as kind of like a private information to the token
    - so it is like each token saying, my information is kept in vector x, and now for the purpose of the single head, here is what i am interested in (vector q), here is what i have (vector k), and if you find me interesting, here is what i will communicate to you (vector v)

- that is basically the self-attention mechanism

![self attention](assets/attentionmechanism.jfif)

## notes

1. attention is a communication mechanism, we can think about it as a communication mechanism where we have a number of nodes in a directed graph

![directed graph](assets/directed_graph.png)

- what happens is every node has some vector of information, and it gets to aggregate information via a weighted sum from all of the nodes that point to it, depending on what data is stored in them
- so our tokens are the nodes, and each token points at itself and the tokens after it (in other words, each token get pointed at by the tokens before it)

![directed graph](assets/our_directed_graph.png)

2. notice that there is no notion of space or position, so attention simply acts over a set of vectors, and so these nodes have no idea where they are positioned
    - so we need to encode them positionally, and give them some information that is unique for each position in the sequence (so they know where they are)
    - mine: so the matrix x (which derives q,k,v) will be the embeddings of the tokens (the identity of the tokens), added to it the positional encoding (the position of the tokens) 

2. notice that the calculation is independent for each example in the bacth (they are always processed independently)
    - mine: the queries, keys, and values are calculated independently for each example in the batch, and therefore the wei matrix is calculated independently for each example in the batch
    - so it is like having 4 separate graphs, and each graph's nodes are communicating with each other in a different way in independent and parallel manner
3. in the case of language modeling, we don't want the tokens to communicate with future tokens, but this is not general
    - in many cases, we may want all the tokens to communicate with each other, and we may want to have a full matrix of affinities 
        - like sentiment analysis with transformers, we might have a number of tokens, and we may want to have them all talk to each other fully, because later we will predict the sentiment of the whole sentence
    - in those cases, we will use what is called an `encoder block` of self-attention, all it means that it is an encoder block, is that we will delete the line that discards future tokens -the masking line- `wei = wei.masked_fill(tril == 0, float('-inf'))` allowing all the nodes to communicate with each other
        - what we are implementing here is something called a `decoder block`, because it is sort of like decoding language and it has got this auto-regressive format where we have to mask the future tokens for each token (otherwise they will give away the answer, and that is not mimicing the real world scenario)
    - so `notice how we change the connectivity of the nodes depending on the use case, despite that attention doesn't care, attention supports arbitrary connectivity between nodes`
5. there is also something called `cross-attention` so what is the difference between them?
    - the reason this attention we implementing here is self-attention, is because the keys, queries, and values are all coming from the same source (x)
        - in other words, the same source x produces the keys, queries, and values (these nodes -tokens- are self-attending)
    - but attention can be much more general than that, for example in encoder-decoder transformers, we can have a case where the queries are produced from x, but the keys and values come from a whole separate external source (sometimes from encoder blocks that encodes some input that we would like to condition on)
        - so the encoder produces the keys and values, those are nodes on the side (mine: the source sentence for example in machine translation), and the decoder produces queries -from the target sentence in machine translation- that read-off information from the side nodes
        - in other words, cross-attention is used when there is a separate source of nodes we would like to pull information from into our nodes
            - like in machine translation, we would like to pull information from the source sentence into the target sentence, so we use the encoder to produce the keys and values, and the decoder to produce the queries
        - and it is self-attention if we have nodes that are the source of the keys, queries, and values (like in language modeling here, we want the token to communicate with itself and the tokens before it in order to predict the next token)

6. if we come to the `attention is all you need paper` we will see that the equation of attention is $Attention(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
    - notice that we divide by $\sqrt{d_k}$, where $d_k$ is the head size, why this is so important?
        - this is called scaled attention, and it is important normalization to have 
        - if we have unit gaussian inputs (inputs with mean 0 and variance 1), then the dot product of the keys and queries -the wei matrix- will have a mean of 0 and variance of $d_k$ (the head size)
            - but if we divide by the square root of $d_k$, then the dot product will have a mean of 0 and variance of 1, and the variance will be preserved
            - mine: this is similar to what we did in the weight initialization in the MLPs, we divided by the square root of the input dimension to preserve the distribution
        - why is this important? as we said, we would like before the softmax to have diffused values (meaning that the values are not too high or too low), because if we have too high values then the softmax will converge towards one-hot vectors (meaning it will sharpen the probabilities towards the maximum value), and this will basically aggregate the information from one token only
            - so, the sclaing is used just to control the variance at initialization -making the variance small therefore the values are comparable therefore the softmax will assign comparable probabilities- therefore aggregating the information from all the tokens in the sequence at the beginning
            - mine: then it will learn the importance of each token in the sequence and the variance will be data dependent
    


In [82]:
k = torch.randn(4,8,16)
q = torch.randn(4,8,16)
wei = q @ k.transpose(-2,-1)
diffused_wei = q @ k.transpose(-2,-1) / (16**0.5)

In [83]:
k.var(), q.var(), wei.var(), diffused_wei.var()

(tensor(1.0750), tensor(1.0539), tensor(18.7561), tensor(1.1723))

In [84]:
tril = torch.tril(torch.ones(8,8))

# discard the future information
wei = wei.masked_fill(tril == 0, float('-inf'))
diffused_wei = diffused_wei.masked_fill(tril == 0, float('-inf'))

wei = F.softmax(wei, dim=-1)
diffused_wei = F.softmax(diffused_wei, dim=-1)

In [85]:
torch.set_printoptions(sci_mode=False)
wei[0][5], diffused_wei[0][5]

(tensor([    0.3779,     0.5906,     0.0006,     0.0042,     0.0003,     0.0263,
             0.0000,     0.0000]),
 tensor([0.3003, 0.3358, 0.0607, 0.0976, 0.0513, 0.1543, 0.0000, 0.0000]))

- notice how the diffused weights are comparable at the beginning, leading to the aggregation of information from most if not all tokens, unlike the weights before we are mainly aggregating the information from 1 or 2 tokens only

## Adding a single head of self-attention to the network
- we will put the previous implementation in a class called `Head` and then add it to the network

In [86]:
class Head(nn.Module):
    """ One head of Self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed,head_size,bias=False)
        self.query = nn.Linear(n_embed,head_size,bias=False)
        self.value = nn.Linear(n_embed,head_size,bias=False)
        # the tril is used to mask out -discard- the upper triangular part of the weight matrix -the future tokens-
        self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size))) # trill is not a parameter, so it is called a buffer in pytorch naming conventions, so we have to assign it the module using the register_buffer method of the nn.Module class
    
    def forward(self,x):
        B,T,C = x.shape
        k = self.key(x) # B,T,head_size
        q = self.query(x) # B,T,head_size

        # compute the attention scores "Affinities"
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B,T,head_size) @ (B,head_size,T) => (B,T,T)
        # discard the future tokens for each token
        wei = wei.masked_fill(self.tril[:T,:T] == 0, float('-inf'))
        # apply softmax to get the attention weights
        wei = F.softmax(wei,dim=-1) # (B,T,T)

        # get the values
        v = self.value(x) # B,T,head_size
        out = wei @ v # (B,T,T) @ (B,T,head_size) => (B,T,head_size)
        return out

- we see something for the first time, we used the `register_buffer` method, what is that?
    - it is a way to register a tensor as a buffer -a tensor that is not a parameter- and it will be saved and loaded with the model
    - it takes the following arguments
        - the name of the buffer
        - the tensor itself
        - whether it is a persistent buffer or not (if it is not persistent, it will not be saved and loaded with the model), and it is by default True (persistent)
    

- we will plug the self-attention head to the bigram model (that is the simplist way to plug-in the self-attention component)

In [87]:
torch.manual_seed(1337)
n_embed = 32

class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding = nn.Embedding(block_size,n_embed)
        self.lm_head = nn.Linear(n_embed,vocab_size)
        self.head = Head(n_embed) # we will keep the head size the same as the embedding dimension, just for now

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the head
        x = self.head(x) # (batch_size,sequence_length,head_size = emb_size)
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.2793, grad_fn=<NllLossBackward0>))

## Train Bigram with self-attention

- mine: x (token embeddings + positional encodings) -> self-attention head -> linear layer -> logits

In [88]:
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 5000
eval_interval = 500

for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.2293, validation Loss: 4.2282
Train Loss: 2.7106, validation Loss: 2.7294
Train Loss: 2.5368, validation Loss: 2.5460
Train Loss: 2.4894, validation Loss: 2.4981
Train Loss: 2.4554, validation Loss: 2.4653
Train Loss: 2.4385, validation Loss: 2.4560
Train Loss: 2.4234, validation Loss: 2.4446
Train Loss: 2.4029, validation Loss: 2.4295
Train Loss: 2.4094, validation Loss: 2.4068
Train Loss: 2.4011, validation Loss: 2.4121


- we got a little bit of improvement

In [91]:
idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(idx,300)[0].tolist()))


Ant I nd.

t hralvo me.

SAs fr rm ockioded I odearde,
Prind at flor, otho wrtothe shaisour de yot me pleve onveme
CA: d sar mieven woupto wousaderusuerde-uwea ow veen.
Wheit thinggomt t.

AMasevee icks merof ther ipen yontee loud ese no y, st, Goround, I ter.

G ies CAy hay, sor.
MNG ISinoul,

K: r


- the text is still not amazing

## Multi-head attention

- it is just applying multiple attention heads in parallel, and concatenating their results
    - each head will have a shape of (batch_size,sequence_length,head_size), and we will concatenate them along the last dimension to get a tensor of shape (batch_size,sequence_length,head_size*num_heads)

![multi-head attention](assets/multi_head_attention.png)

In [92]:
class MultiHeadAttention(nn.Module):
    def __init__(self,num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self,x):
        # pass the x to each head, result will be a list of tensors of shape (B,T,head_size), we concatenate them on the last dimension
        return torch.cat([h(x) for h in self.heads], dim=-1)

- now add it to the Bigram model
    - we will have 4 heads each of head_size n_embed//4 (so that when we concatenate them we get the same size as the embeddings)

- now we will use the multi-head attention in the model

In [93]:
torch.manual_seed(1337)
n_embed = 32

class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding = nn.Embedding(block_size,n_embed)
        self.multi_head = MultiHeadAttention(4,n_embed//4)  # 4 heads each will produce shape (B,T,n_embed//4) -> concatenated will be (B,T,n_embed)
        self.lm_head = nn.Linear(n_embed,vocab_size)
        
    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the head
        x = self.multi_head(x) # (batch_size,sequence_length,emb_size)
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx

  
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.1828, grad_fn=<NllLossBackward0>))

In [94]:
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 8000
eval_interval = 500

for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.2103, validation Loss: 4.2105
Train Loss: 2.6623, validation Loss: 2.6749
Train Loss: 2.4980, validation Loss: 2.5083
Train Loss: 2.4299, validation Loss: 2.4370
Train Loss: 2.3762, validation Loss: 2.3869
Train Loss: 2.3449, validation Loss: 2.3578
Train Loss: 2.3183, validation Loss: 2.3386
Train Loss: 2.2944, validation Loss: 2.3214
Train Loss: 2.2916, validation Loss: 2.2893
Train Loss: 2.2772, validation Loss: 2.2895
Train Loss: 2.2684, validation Loss: 2.2897
Train Loss: 2.2393, validation Loss: 2.2870
Train Loss: 2.2539, validation Loss: 2.2672
Train Loss: 2.2406, validation Loss: 2.2508
Train Loss: 2.2338, validation Loss: 2.2531
Train Loss: 2.2085, validation Loss: 2.2611


- the loss got better, which means it helped to use multiple communication channels (multiple heads) 
    - obviously, these tokens have a lot to talk about :D (a lot of different information to aggregate)
- so, it helps to create multiple independent channels of communication, gather lots of different types of information, and decode the output

In [96]:
idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(idx,300)[0].tolist()))


LUS:
Nomis st we thy daeqou,-hir ty mar.
NUS:
Anot ghis this Yome! ha hiem tordloundig pry usttulselt the per, youve badapesthel,-ld lou and my
What a and 'is shous gras as is you kne del the cu tand to dice.
Sor elay vitiels;
Me, To gre the shout? fillaing, mer we, what hicpr.
I proy ou fooor I so 


## Attention is all you need Architecture

![transformer](assets/attention_architecture.png)

we are starting see some components of what we already implemented 
    - the token encoding (the embeddings)
    - the positional encoding
    - the `masked` multi-head attention (the self-attention with the masking of the future tokens)

- notes
    - we notice that the decoder also has a multi-head attention -not masked- which is a cross attention to the encoder -which we will not implement here-
    - then there is that `feed-forward` block, `add & norm` (residual connection followed by layer normalization)
    - all of that is grouped into a block `decoder block` that is repeated multiple times

- we want to start in a similar fashion, by adding computation to the network
    - what we did so far is getting the logits from the multi-head attention
        - so the multi-head attention did the communication, but we went too fast and calculated the logits on them
        - so, the tokens looked at each other but didn't really have a lot of time to `think on` what they found from the other tokens

    - that is what the  `feed forward` block tries to solve 
        - it will take the output of the multi-head attention of shape (batch_size,sequence_length,n_embed) and apply the linear layer on the last dimension (map it from n_embed to n_embed) then apply the relu
        - so you can see that it is per-token level (we map the n_embed for each token to another vector of n_embed), basically allowing each token to think on the data it gathered individually
        - we will implement it (just a linear layer followed by a ReLU)
            - it will be a nn.Linear(n_embed,n_embed) followed by a relu
            

In [99]:
class FeedForward(nn.Module):
    """ a simple layer followed by a non-linearity """
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed,n_embed),
            nn.ReLU(),
        )

    def forward(self,x):
        return self.net(x)
    

class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding = nn.Embedding(block_size,n_embed)
        self.lm_head = nn.Linear(n_embed,vocab_size) # language model head
        self.multi_head = MultiHeadAttention(4,n_embed//4)  # multi-head attention
        self.ffwd = FeedForward(n_embed) # feed forward layer

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the head
        x = self.multi_head(x) # (batch_size,sequence_length,emb_size)

        # feed the output of the head to the feed forward layer
        x = self.ffwd(x)
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.2134, grad_fn=<NllLossBackward0>))

In [100]:
torch.manual_seed(1337)
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 8000
eval_interval = 500

for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.1972, validation Loss: 4.1970
Train Loss: 2.6546, validation Loss: 2.6640
Train Loss: 2.5058, validation Loss: 2.5062
Train Loss: 2.4258, validation Loss: 2.4302
Train Loss: 2.3723, validation Loss: 2.3812
Train Loss: 2.3387, validation Loss: 2.3397
Train Loss: 2.3038, validation Loss: 2.3221
Train Loss: 2.2796, validation Loss: 2.2864
Train Loss: 2.2760, validation Loss: 2.2809
Train Loss: 2.2577, validation Loss: 2.2702
Train Loss: 2.2393, validation Loss: 2.2608
Train Loss: 2.2243, validation Loss: 2.2618
Train Loss: 2.2074, validation Loss: 2.2484
Train Loss: 2.1927, validation Loss: 2.2479
Train Loss: 2.1876, validation Loss: 2.2436
Train Loss: 2.1873, validation Loss: 2.2181


- so, as we can see, the transfoermer architecture intersperses the communication (multi-head attention) with the computation (feed-forward block)
    - it has blocks that communicate then compute, and it groups them and repeats them multiple times

## Transformer block (communication + computation)

- this is basically the decoder (except for the cross attention)

In [102]:
class block(nn.Module):
    """ Transformer Block: Communiaction followed by computation """

    def __init__(self, n_embed, num_heads):
        super().__init__()
        # we will make the head size so that the output of the multi-head attention has dimension n_embed
        head_size = n_embed // num_heads
        # the communication is done using multi-head attention
        self.self_attn = MultiHeadAttention(num_heads, head_size) # communication, output is of shape (B,T,head_size*num_heads = n_embed)
        # the computation is done using a feed forward layer
        self.ffwd = FeedForward(n_embed) # computation, takes the (B,T,n_embed) and outputs (B,T,n_embed), allowing each token to think on the information it has

    def forward(self,x):
        # communication
        x = self.self_attn(x)
        # computation
        x = self.ffwd(x)
        return x

- this is how the transformer structures the sizes typically

In [103]:
class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding_table = nn.Embedding(block_size,n_embed)
        self.blocks = nn.Sequential(
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
        )
        self.lm_head = nn.Linear(n_embed,vocab_size)
        

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding_table(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the blocks
        x = self.blocks(x) # (batch_size,sequence_length,emb_size)
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.1831, grad_fn=<NllLossBackward0>))

- in the above model
    - we encode the tokens and the positions
    - we apply multiple transformer blocks (each block has a multi-head attention and a feed-forward block)
        - communication then computation many times
    - then we apply a linear layer to get the logits (decode)

- the above architecture does not give us good results
    - that is because it gets deeper and deeper, and deep neural nets suffer from optimization issues
    - so we need the residual connections and the layer normalization to help with that

In [104]:
torch.manual_seed(1337)
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 8000
eval_interval = 500

for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.1717, validation Loss: 4.1746
Train Loss: 3.1150, validation Loss: 3.1260
Train Loss: 2.7538, validation Loss: 2.7414
Train Loss: 2.5552, validation Loss: 2.5523
Train Loss: 2.4857, validation Loss: 2.4854
Train Loss: 2.4515, validation Loss: 2.4482
Train Loss: 2.3938, validation Loss: 2.4082
Train Loss: 2.3658, validation Loss: 2.3590
Train Loss: 2.3557, validation Loss: 2.3475
Train Loss: 2.3246, validation Loss: 2.3210
Train Loss: 2.3016, validation Loss: 2.3108
Train Loss: 2.2821, validation Loss: 2.3042
Train Loss: 2.2665, validation Loss: 2.2889
Train Loss: 2.2412, validation Loss: 2.2793
Train Loss: 2.2311, validation Loss: 2.2641
Train Loss: 2.2368, validation Loss: 2.2441


- we got worse results as expected

## Residual connections and layer normalization (Add & Norm)

- these 2 dramatically help with the depth of these networks, and make sure they remain optimizable

- mine: why do we need residual connections and normalization? specially for deep neural networks?
    - during training, as the parameters are updated, the distribution of the activations in the network changes, and the activations can become very large or very small, and this can lead to the vanishing or exploding gradient problem
        - from another perspective, as the distribution of the activations changes as we go deeper in the network, it becomes harder for the network to learn effectively, as each layer needs to continuously adapt to new distributions
    - we studied then the importance of good initialization of the network parameters, a well-calibrated initialization (like dividing by the square root of the input dimension for linear layers, and multiplying by some gain before the activation function squashes the activations) all these things help to preserve the distribution of the activations therefore preserving the distribution of the gradients as well (avoiding the vanishing or exploding gradient problem)

    -  good initialization was extremely essential to do and the neural networks (specifically the deep neural networks) were not very forgiving if we didn't make perfect and exact calibration of the weights
        - but as the neural networks got deeper and more complex, it became very difficult to perfectly calibrate the weights to make the logits have the same distribution (it was like balancing a pencil on its tip :D)

    
    - so people have came up with modern innovations that mitigated this, like `residual connections` and also what is called `normalization layers` 
        - think about the residual connections having a residual pathway, and we forked off from the residual pathway to do some computation (mine: additional block) and then project back to the residual pathway via addition, this ensures that the activations in the residual pathway are not changed much (unlike earlier when we had no residual pathway and the only path we have is to pass it through multiple subsequent layers, this increases the chances of changing the distribution alot)
            - so when we do x = block result + x, this mitigates the chances of changing the distribution (unlike x = block result only)
            - from another perspective, when we offer the residual pathway for the activations to flow through, this will offer the gradients a direct path to flow as well, and this will help with the vanishing gradient problem
                - preserving the activations and the gradients distribution & vanishing and exploding gradients are 2 faces of the same coin
        - the normalization layers are layers that normalize the activations to have a mean of 0 and variance of 1, so the activations are always in a good range, and the network can learn to adjust the activations as needed

### Residual connections
- we talked about the residual connections in the deep learning specialization
    - we discussed how they can easily learn the identity function, and each additional block can add to the identity function
    - so, previously, each additional block will change the output of the previous block whether to the best or worst, but now it can only add to the output of the previous block (even if it didn't learn anything useful, the previous output will not be lost)

![residual connections](assets/residual_connection.png)

- another way to think about it, we have a residual pathway, and we forked off from the residual pathway to do some computation (mine: additional block) and then project back to the residual pathway via addition
    - and we studied that during backpropagation, addition routes the gradient equally to both branches, so the gradients from the loss basically hop through every addition node all the way to the input, and also fork off to each block and update them
    - so, basically we have that gradient super highway tha goes directly from the supervision all the way to the input (and also to each block)
    - and the blocks are usually initialized in the beginning so they contribute very little (if anything) to the residual pathway
        - so in the beginning, they are sort of almost like not there , and during the optimization, they will learn to contribute with something useful

- mine: so, we will fork-off the residual pathway to do the mulit-head attention then add it again to the residual pathway, then fork-off again to do the feed-forward block then add it again to the residual pathway
    - in order to be able to add back the blocks results to the residual pathway, they must have the same shape, and if they are not the same shape, we can use a linear layer to project them to the same shape as the residual pathway
    - so, we will modify the ff_block and the multi-head attention to project their outputs to the residual pathway
        - note that their shapes are already ready for that, but we will make it so that it is general in case we want to change their shapes to output something other than n_embed
        - that is indeed what we will do, we will change the number of output features of the feed-forward block to be n_embed*4 (instead of n_embed) to be consistent with the paper, then we will project it back to n_embed using a linear layer

In [107]:
class MultiHeadAttention(nn.Module):
    def __init__(self,num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed,n_embed)

    def forward(self,x):
        # pass the x to each head, result will be a list of tensors of shape (B,T,head_size), we concatenate them on the last dimension
        out =  torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)
    

class FeedForward(nn.Module):
    """ a simple layer followed by a non-linearity """
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 *n_embed),
            nn.ReLU(),
            # let's put the projection here
            nn.Linear(4 * n_embed,n_embed),
        )

    def forward(self,x):
        return self.net(x)

- now let's modify the block to include the residual connections

In [108]:
class block(nn.Module):
    """ Transformer Block: Communiaction followed by computation """

    def __init__(self, n_embed, num_heads):
        super().__init__()
        # we will make the head size so that the output of the multi-head attention has dimension n_embed
        head_size = n_embed // num_heads
        # the communication is done using multi-head attention
        self.self_attn = MultiHeadAttention(num_heads, head_size)
        # the computation is done using a feed forward layer
        self.ffwd = FeedForward(n_embed)

    def forward(self,x):
        # communication
        x = x + self.self_attn(x) # x += self.self_attn(x) -additional block-
        # computation
        x = x + self.ffwd(x) # x += self.ffwd(x) - additional block-
        return x

- notice how we fork-off to do the self-attention blocks and the feed-forward blocks then add the results again to the residual pathway
    - we fork-off to do some block, then come back to the residual pathway (mine: we project the output of the blocks if they are not of the same shape as the residual pathway)

In [109]:
class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding_table = nn.Embedding(block_size,n_embed)
        self.blocks = nn.Sequential(
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
        )
        self.ml_head = nn.Linear(n_embed,vocab_size)
        

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding_table(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the blocks
        x = self.blocks(x) # (batch_size,sequence_length,emb_size)
        
        logits = self.ml_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.6007, grad_fn=<NllLossBackward0>))

In [110]:
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 8000
eval_interval = 500

In [111]:
torch.manual_seed(1337)
for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.3802, validation Loss: 4.3655
Train Loss: 2.3987, validation Loss: 2.4098
Train Loss: 2.2801, validation Loss: 2.3041
Train Loss: 2.1948, validation Loss: 2.2187
Train Loss: 2.1440, validation Loss: 2.1861
Train Loss: 2.1058, validation Loss: 2.1539
Train Loss: 2.0781, validation Loss: 2.1402
Train Loss: 2.0579, validation Loss: 2.1106
Train Loss: 2.0382, validation Loss: 2.0987
Train Loss: 2.0196, validation Loss: 2.0932
Train Loss: 2.0038, validation Loss: 2.0832
Train Loss: 1.9874, validation Loss: 2.0901
Train Loss: 1.9722, validation Loss: 2.0728
Train Loss: 1.9486, validation Loss: 2.0539
Train Loss: 1.9317, validation Loss: 2.0627
Train Loss: 1.9346, validation Loss: 2.0377


- we got down to a lower loss thanks to the residual connections (allowed us to successfuly deepen the network with multiple blocks), and notice that the network is starting to get big enough that our training loss is getting ahead of validation loss (starting to see a little bit of overfitting)

In [112]:
idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(idx,300)[0].tolist()))


PENCE:
It not bame;
And God your ver:
All I your passon but king frumakner, aull reamence wemblew-on, I grabence cadervers light:
Go him this mishourgent sGARENCENTER:
I wook, my swen there, did. Tants, blow?

SWARWY:
You wold stare honoubleeperemed:
I man Claremaknow this efent:
There my reverbid:



- the generation is still bad, but it starts to almost look like english

### Layer normalization

- Layer Norm is a paper that came out in 2016, [layer normalization paper](https://arxiv.org/abs/1607.06450)


- mine: theoretical recap (Read notebook 4)
- we studied batch norm in previous notebooks
    - in batch norm, we normalize the across the examples for each feature, assume we have $x_{batch}$ of shape (batch_size,features)
        - we calculte the mean and std across the batch dimension for each feature, shapes (1,features) 
        - then we standardize the features by subtracting the mean and dividing by the std
            - x - mean / std, (batch_size,features) - (1,features) / (1,features) => (batch_size,features), the mean and std are broadcasted to the batch dimension
            - now each feature has a mean of 0 and std of 1
        - then we move the features to a different mean and scale them up or down to a different std using $\gamma$ and $\beta$
            - $\gamma$ and $\beta$ are learnable parameters, and they are of shape (1,features)
            - $\gamma x + \beta$, (1,features) * (batch_size,features) + (1,features) => (batch_size,features)
            - so we can learn a different mean and std for each feature
        - the initial values of $\gamma$ and $\beta$ are 1 and 0 respectively, so that the distribution is not changed (still gaussian after the standardization), at least in the beginning

    - we said there are some disadvantages to batch norm
        - because we normalize through the batch, we are mathematically coupling these examples together (the mean and std are calculated across the batch)
            - now the logits are no longer independent for each example, as they will change slightly depending on the other examples in the randomly sampled batch (because the mean and std are calculated across the batch) and the logits will sort of jitter around depending on the other examples in the batch
            - and during inference, we don't have a batch, so we have to calculate the mean and std across the whole training set, then use them during inference -or use the running mean and std-
        - so, for that reason, no one likes this layer, it causes a huge amount of bugs, and therefore people tried to avoid it and proposed other alternatives like `layer normalization` and `group normalization`


In [115]:
class BatchNorm1d:
    def __init__(self, dim, eps = 1e-5, momentum = 0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True # because we will need to know if we are in training or evaluation mode
        # initialize the learnable parameters
        self.gamma = torch.ones(1,dim)
        self.beta = torch.zeros(1,dim)
        # initialize the running mean and var (called Buffers in PyTorch), we used the variance to follow the paper
        self.running_mean = torch.zeros(1,dim)
        self.running_var = torch.ones(1,dim)

    def __call__(self, x):  # shape of x is (m, dim)
        if self.training:
            # use the mean and std of the batch
            mean = x.mean(dim=0, keepdim=True) # mean of the logits over the batch for each neuron, shape (1, dim)
            var = x.var(dim=0, keepdim=True) # std of the logits over the batch for each neuron, shape (1, dim)
        else:
            # use the running mean and std
            mean = self.running_mean 
            var = self.running_var
        
        # standardize
        x_standardized = (x - mean) / torch.sqrt(var + self.eps) # shape (m, dim)
        # rescale
        self.out = x_standardized * self.gamma + self.beta # shape (m, dim)

        # update the running mean and std if we are in training mode
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var

        return self.out # we don't have to save the ouput in .out attribute, but we do so for later visualization

    def parameters(self):
        return [self.gamma, self.beta]
    
dummy = torch.randn(100,8)
bn = BatchNorm1d(8)
out = bn(dummy)
out.shape

torch.Size([100, 8])

In [117]:
out.mean(dim=0), out.var(dim=0)

(tensor([     0.0000,      0.0000,      0.0000,     -0.0000,     -0.0000,
             -0.0000,      0.0000,     -0.0000]),
 tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]))

- the mean and variance for each column (neuron) are 0 and 1 respectively

- mine
- `layer normalization`
    - While batch normalization normalizes each feature across the batch (i.e., it computes the mean and standard deviation for each feature across the batch dimension), layer normalization normalizes the features of each individual sample
    - after standardizing the features, $\gamma$ and $\beta$ scale and shift the normalized values of the features within each sample. they allow the model to learn the optimal scale and offset for each feaure within the context of the sample itself
        - since after normalization, the mean and std of all the features for each sample are 0 and 1 respectively, so $\gamma$ and $\beta$ can learn the optimal scale and offset for the features within the example

- motivation behind layer normalization
    - now the logits are independent for each example (the mean and std calculated for each individual example), so the logits are not coupled together
        - therefore, we can easily apply layer normalization during inference (we don't need a batch to calculate the mean and std across it), so there is no need of a second stage to calculate the mean and std or even use the running mean and std 
    - Commonly used in RNNs, Transformers, and other models where batch normalization would be difficult to apply effectively due to varying sequence lengths or small batch sizes.

- Similarities between batch norm and layer norm: Both Batch Norm and Layer Norm help in preserving the distribution of activations and stabilizing gradients, making the network less sensitive to weight initialization and allowing for higher learning rates.


| Aspect                      | Batch Normalization (Batch Norm)                        | Layer Normalization (Layer Norm)                        |
|-----------------------------|---------------------------------------------------------|---------------------------------------------------------|
| Normalization Dimension     | Across the batch for each feature                       | Across the features for each individual sample          |
| Dependency on Batch Size    | Sensitive to batch size; performance may degrade with small batches | Independent of batch size; works well even with batch size of one |
| Ideal Use Cases             | CNNs, deep feedforward networks, architectures where batch size is large | RNNs, Transformers, models dealing with sequential or variable-sized data |
| Computation During Inference| Uses running mean and variance computed during training | Uses per-sample statistics; no need for running estimates |


- implementation differences
    - we will delete all the variables related to running statistics
        - the momentum
        - running mean
        - running variance
    - we will delete the training flag (we needed it to know wether to use the batch statistics during training or the running statistics during inference)
        - the behavior is now the same during training and inference
    - we will change the mean and std from dim = 0 (accross the batch or each feature) to dim = -1 (across the features for each individual sample) 

- shape analysis (on sequential data) 
    - we have x of shape (batch_size,sequence_length,features)
    - we will calculate the mean and std across the features for each individual sample, so we will use dim = -1, the mean and std will be of shape (batch_size,sequence_length,1)
    - then we will standardize the features by subtracting the mean and dividing by the std, so the mean and std will be broadcasted for all the features in each sample
    - then we will scale and shift the normalized values of the features within each sample using $\gamma$ and $\beta$, so $\gamma$ and $\beta$ will be of shape (1,1,features) and they will be broadcasted across the batch dimension and the sequence length dimension
    
TODO: do more research about layer normalization

In [119]:
class LayerNorm1d:
    def __init__(self, dim, eps = 1e-5, momentum = 0.1):
        self.eps = eps
        # initialize the learnable parameters
        self.gamma = torch.ones(1,dim)
        self.beta = torch.zeros(1,dim)
        

    def __call__(self, x):  # shape of x is (m, dim)
        # use the mean and std of the batch
        mean = x.mean(dim=1, keepdim=True) # mean of the logits over the batch for each neuron, shape (m, 1)
        var = x.var(dim=1, keepdim=True) # std of the logits over the batch for each neuron, shape (m, 1)
        print(mean.shape)
        print(var.shape)
    
        # standardize
        x_standardized = (x - mean) / torch.sqrt(var + self.eps) # shape (m, dim)
        # rescale
        self.out = x_standardized * self.gamma + self.beta # shape (m, dim)

        return self.out # we don't have to save the ouput in .out attribute, but we do so for later visualization

    def parameters(self):
        return [self.gamma, self.beta]
    
dummy = torch.randn(100,8)
bn = LayerNorm1d(8)
out = bn(dummy)
out.shape

torch.Size([100, 1])
torch.Size([100, 1])


torch.Size([100, 8])

In [120]:
# notice that the columns are not normalized
out.mean(dim=0), out.var(dim=0)

(tensor([ 0.0425, -0.1134, -0.0339,  0.0080,  0.2189, -0.0068, -0.0921, -0.0232]),
 tensor([1.0071, 0.7030, 0.8447, 1.0312, 0.7555, 1.0201, 0.9427, 0.6926]))

In [121]:
# but the rows are normalized (for every individual example, the features have mean 0 and std 1 at the beginning)
out.mean(dim=1), out.var(dim=1)

(tensor([     0.0000,      0.0000,      0.0000,      0.0000,      0.0000,
              0.0000,      0.0000,      0.0000,      0.0000,      0.0000,
              0.0000,      0.0000,     -0.0000,     -0.0000,     -0.0000,
             -0.0000,      0.0000,      0.0000,     -0.0000,     -0.0000,
              0.0000,     -0.0000,      0.0000,     -0.0000,     -0.0000,
              0.0000,     -0.0000,      0.0000,      0.0000,     -0.0000,
             -0.0000,      0.0000,     -0.0000,      0.0000,      0.0000,
              0.0000,      0.0000,      0.0000,      0.0000,     -0.0000,
             -0.0000,      0.0000,      0.0000,     -0.0000,      0.0000,
              0.0000,     -0.0000,     -0.0000,      0.0000,      0.0000,
             -0.0000,     -0.0000,     -0.0000,      0.0000,      0.0000,
             -0.0000,      0.0000,      0.0000,      0.0000,      0.0000,
              0.0000,      0.0000,      0.0000,      0.0000,     -0.0000,
              0.0000,      0.0000,    

- in PyTorch, we have `nn.LayerNorm` that does that for us

### Incorporate Add & Norm to the model

![attention architecture](assets/attention_architecture.png)

- we said that very few details about the transformer architecture have changed in the last 5 years
    - in the image above, add & norm is applied after the transformation (after the block)
    - but it became more common to apply the layer norm before the block (before the multi-head attention and the feed-forward block)
        - mine: so we do layer norm -> block -> add to the residual pathway

In [126]:
class block(nn.Module):
    """ Transformer Block: Communiaction followed by computation """

    def __init__(self, n_embed, num_heads):
        super().__init__()
        # we will make the head size so that the output of the multi-head attention has dimension n_embed
        head_size = n_embed // num_heads
        # the communication is done using multi-head attention
        self.self_attn = MultiHeadAttention(num_heads, head_size)
        # the computation is done using a feed forward layer
        self.ffwd = FeedForward(n_embed)
        # layer normalization (mine: we will use 2 different layer norms because we want to normalize the features of each token in the multi-head attention and the feed forward layer separately)
        self.ln1 = nn.LayerNorm(n_embed) # layer normalization before the multi-head attention 
        self.ln2 = nn.LayerNorm(n_embed) # layer normalization before the feed forward layer

    def forward(self,x):
        # communication (norm -> multi-head attention -> residual connection)
        x = x + self.self_attn(self.ln1(x))
        # computation (norm -> feed forward -> residual connection)
        x = x + self.ffwd(self.ln2(x))
        return x

- we added layer normalization layers right before the multi-head attention and the feed-forward block
- since our inputs are of shape (batch_size,sequence_length,n_embed), we will apply the layer normalization on the last dimension (n_embed)
    - so the layer normalization will normalize the features, and the batch dimension as well as the sequence length will be untouched
        - so think of it as a per-token transformation, that just normalizes the features and make them unit gaussian at initialization (then the network will learn the mean and variance)

- there should be a layer norm at the end of the transformer (right before the final linear layer that decodes into vocab_size logits)
    

In [127]:
class Bigram(nn.Module):
    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table (the lookup table that maps each character to the logits of all possible next characters)
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding_table = nn.Embedding(block_size,n_embed)
        self.blocks = nn.Sequential(
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
            block(n_embed,num_heads=4),
            # layer normalization before the final linear layer
            nn.LayerNorm(n_embed),
        )
        self.lm_head = nn.Linear(n_embed,vocab_size)
        

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding_table(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the blocks then layer norm right before the final linear layer
        x = self.blocks(x) # (batch_size,sequence_length,emb_size)
        
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = Bigram()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.3122, grad_fn=<NllLossBackward0>))

In [128]:
lr = 1e-3 # we reduce the learning, because the self-attention can't tolerate large learning rates
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
max_iters = 8000
eval_interval = 500

In [129]:
torch.manual_seed(1337)
for i in range(max_iters):
    # get the batch
    x_batch,y_batch = get_batch('train')

    # Forward prop & loss 
    logits, loss = model(x_batch,y_batch)

    # backward prop
    # reset the gradients from the previous step before the backprop (we used to do it manually)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    ## update parameters
    optimizer.step()

    # validation phase each eval interval
    if i % eval_interval == 0:
        losses = estimate_loss()
        print(f"Train Loss: {losses['train']:.4f}, validation Loss: {losses['val']:.4f}")

Train Loss: 4.2364, validation Loss: 4.2348
Train Loss: 2.4111, validation Loss: 2.4204
Train Loss: 2.2663, validation Loss: 2.2850
Train Loss: 2.1814, validation Loss: 2.2050
Train Loss: 2.1318, validation Loss: 2.1701
Train Loss: 2.0874, validation Loss: 2.1365
Train Loss: 2.0646, validation Loss: 2.1204
Train Loss: 2.0420, validation Loss: 2.0910
Train Loss: 2.0253, validation Loss: 2.0833
Train Loss: 2.0115, validation Loss: 2.0742
Train Loss: 1.9975, validation Loss: 2.0605
Train Loss: 1.9779, validation Loss: 2.0706
Train Loss: 1.9659, validation Loss: 2.0586
Train Loss: 1.9459, validation Loss: 2.0581
Train Loss: 1.9275, validation Loss: 2.0566
Train Loss: 1.9408, validation Loss: 2.0384


- so now we have pretty complete transformer that is similar to the decoder block in the transformer architecture
    - it is a decoder only transformer

In [131]:
idx = torch.zeros([1,1], dtype=torch.long)
print(decode(model.generate(idx,300)[0].tolist()))


HENRY: I dess pake,
The faslese astot him, good his, and as lienght your condlanle 'tis suve
A king the steapiraft! Tomere;

DUKE VINNEN:
How lord Entsel? 'featcend?

WICK:
Oglach not Nor is An, verarm.
He; whim the
For dest.
Whats,
Abry GRIO:
Abpyon his I from a leasure a preavity.

DUCUCIPRUMPRIET


## Dropout


- what does dropout do?
    - it randomly every forward & backward, it shuts off some subset of neurons -randomly drops them to zero- therefore training without them
    - what this does exactly is that because the mask of what being dropped out is changed every forward and backward pass (meaning every single forward and backward pass, different neurons are dropped out), it ends up kind of training different neural networks (kind of training an ensemble of sub-networks)
        - then at test time, everything is fully enabled, and kind of all these sub-networks are merged into a single ensemble

    - it has a regularization effect and we will add it because we will scale-up the model and we don't want to worry about overfitting
    
- drop out is something that we can add right before the block output is back into the residual pathway (mine: after the projection of the block output to the residual pathway)
    - so we will do that for the multi-head attention and the feed-forward blocks

- we can also dropout some of the affinities (in the wei matrix), right before using it to aggregate the values
    - so this will randomly prevent some of the nodes from communicating

In [132]:
dropout = 0.2

class MultiHeadAttention(nn.Module):
    def __init__(self,num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embed,n_embed)
        self.dropout = nn.Dropout(dropout)

    def forward(self,x):
        # pass the x to each head, result will be a list of tensors of shape (B,T,head_size), we concatenate them on the last dimension
        out =  torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))
    

class FeedForward(nn.Module):
    """ a simple layer followed by a non-linearity """
    def __init__(self, n_embed):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 *n_embed),
            nn.ReLU(),
            # let's put the projection here
            nn.Linear(4 * n_embed,n_embed),
            # dropout
            nn.Dropout(dropout),
        )

    def forward(self,x):
        return self.net(x)
    
class Head(nn.Module):
    """ One head of Self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed,head_size,bias=False)
        self.query = nn.Linear(n_embed,head_size,bias=False)
        self.value = nn.Linear(n_embed,head_size,bias=False)
        # the tril is used to mask out -discard- the upper triangular part of the weight matrix -the future tokens-
        self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size))) # trill is not a parameter, so it is called a buffer in pytorch naming conventions, so we have to assign it the module using the register_buffer method of the nn.Module class

        self.dropout = nn.Dropout(dropout)

    def forward(self,x):
        B,T,C = x.shape
        k = self.key(x) # B,T,head_size
        q = self.query(x) # B,T,head_size

        # compute the attention scores "Affinities"
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B,T,head_size) @ (B,head_size,T) => (B,T,T)
        # discard the future tokens for each token
        wei = wei.masked_fill(self.tril[:T,:T] == 0, float('-inf'))
        # apply softmax to get the attention weights
        wei = F.softmax(wei,dim=-1) # (B,T,T)

        # dropout some of the affinities randomly
        wei = self.dropout(wei)

        # get the values
        v = self.value(x) # B,T,head_size
        out = wei @ v # (B,T,T) @ (B,T,head_size) => (B,T,head_size)
        return out

## Scale-up the model

- we will organize the implementation a little bit 
    - first of all naming Bigram to GPTLanguageModel (it's no longer a bigram model :D )
    - make the number of heads and the number of layers as parameters instead of fixed numbers (4 heads and 3 blocks)
- then I will move everything to a script `gpt.py` 

In [133]:
n_layers = 4
n_heads = 4


class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size,n_embed)
        self.positional_embedding_table = nn.Embedding(block_size,n_embed)
        self.blocks = nn.Sequential(*[block(n_embed,num_heads=n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(n_embed) # final layer norm
        self.lm_head = nn.Linear(n_embed,vocab_size)
        

    def forward(self,idx,targets=None):
        """
        idx: the token indices, shape (batch_size,sequence_length)
        """
        B,T = idx.shape
        emb = self.token_embedding_table(idx) # embeddings of shape (batch_size,sequence_length,emb_size)
        pos_emb = self.positional_embedding_table(torch.arange(T,device=idx.device)) # (sequence_length,emb_size)
        x = emb + pos_emb # adding shapes (batch_size,sequence_length,emb_size) + (sequence_length,emb_size) -> broadcasting for the batch_size dimension
        
        # feed the input to the blocks
        x = self.blocks(x) # (batch_size,sequence_length,emb_size)

        # layer normalization before the language model head
        x = self.ln_f(x)
        # feed the output of the blocks to the language model head
        logits = self.lm_head(x) # the logits of shape (batch_size,sequence_length,vocab_size)

        # if the targets are not provided (during generation)
        if targets is None:
            loss = None
        else:
            loss = F.cross_entropy(logits.reshape(-1,vocab_size),targets.reshape(-1)) # we need to collapse the batch_size and the sequence length dimensions together (flatten out the timesteps as individual examples), that is what the loss expects
        
        return logits, loss

    def generate(self,idx,max_new_tokens):
        """
        idx: token indices of some batch (the same one used in training) (batch_size,sequence_length)
        we will basically take the indices and expand the sequence length using generation (sampling) up to max_new_tokens
        """
        for _ in range(max_new_tokens):
            # crop the sequence length to the block size
            idx_cropped = idx[:,-block_size:]
            # inference the idx
            logits, _ = self(idx_cropped) # batch_size,sequence_length,vocab_size
            # focus only on the last timestep (the bigram only needs the last character to predict the next), but later we can feed all the previous characters
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=1) # still (batch_size, vocab_size), but each example in the batch now is betwee 0 and 1 and sums to 1
            # sample from the dsitribution
            idx_next = torch.multinomial(probs,num_samples=1) # batch_size,1, sampled next indices for each example in the batch
            # concatenate the sampled indices to the current indices (along the sequence length dimension)
            idx = torch.cat((idx,idx_next),dim=1) # batch_size, sequence_length + 1 = new sequence length
        return idx
    
model = GPTLanguageModel()
logits, loss = model(x_batch,y_batch)
logits.shape, loss

(torch.Size([32, 8, 65]), tensor(4.3854, grad_fn=<NllLossBackward0>))

- after running the script `gpt.py`, we got around 1.48 loss on the validation set!
    - the generation is a lot more recognizable and always like someone speaking (just like the original text)
        - except of course the text is nonsensical when you actually read it :D
        - but this is just a transformer trained on the character level for 1 Million characters that come from shakespeare
            - so it the generation wouldn't make sense at this scale
            - but still pretty good demonstration of the power of the transformer

## Mine: Recape: Architecture evolution

- we started with the very basic and naive bigram model
    - x -> token embeddings -> logits
- then we scaled it up a little
    - x -> token embeddings + positional encodings -> linear layer -> logits
- then we studied the self-attention mechanism and added it (mine: the features now are the aggregated information from the previous tokens for each token)
    - x -> token embeddings + positional encodings -> self-attention -> linear layer -> logits
- then we added multi-head attention (multiple communication channels, allowing the tokens to communicate different types of information at each head in parallel)
    - x -> token embeddings + positional encodings -> multi-head attention -> linear layer -> logits
- then we studied the feed-forward block (computation block) 
    - so the multi-head attention did the communication, but we went too fast and calculated the logits on them
        - in other words, the tokens looked at each other but didn't really have a lot of time to `think on` what they found from the other tokens 
        - that is why we add the feed-forward block after the multi-head attention
            - our feed-forward block is a linear layer followed by a ReLU and maps the n_embed to n_embed (to allow each token to think on the data it gathered individually)
    - x -> token embeddings + positional encodings -> multi-head attention -> feed-forward block -> linear layer -> logits

- then we grouped the multi-head attention and the feed-forward into a single block -that does communication then computation- 
    - this is basically the decoder block in the transformer architecture, but without the cross-attention to the encoder
    - that is because we want to repeat this block multiple times
    - shape analysis of this block
        - it takes the embeddings added to it the positional encodings, shape (batch_size,sequence_length,n_embed)
        - applies multi-head attention with head_size = n_embed // number of heads (so that their concatenation will be of the same size as the embeddings)
            - so the output of the multi-head attention will be of shape (batch_size,sequence_length,n_embed)
        - then applies the feed-forward block that maps the n_embed to n_embed, output of shape (batch_size,sequence_length,n_embed)
            - to allow each token to think on the data it gathered individually
    - so we tried this architecture x -> token embeddings + positional encodings -> transformer block x 3 -> linear layer -> logits
    - we didn't get good results, because the network is getting deeper and deeper, and deep neural nets suffer from optimization issues

- then we studied add & norm blocks (residual connections and layer normalization)
    - residual connections: allows us to deepen the network without optimization issues, and allows each block to add to the output of the previous block
        - we added the residual connections inside the transformer block
            - inside the block we have x, then we fork-off to do the multi-head attention then add the result back to the residual pathway, then fork-off to do the feed-forward block then add the result back to the residual pathway
            - so we projected the output of the multi-head attention and the feed-forward block to the same shape as the residual pathway (n_embed) using a linear layer (they were already of the same shape, but this is general in case we want to change their shapes)
                - and indeed we changed the feed-forward block to output n_embed*4 instead of n_embed to be consistent with the paper, then we projected it back to n_embed using a linear layer
            
        - now using the same architecture above, x -> token embeddings + positional encodings -> transformer block (with residual connections) x 3 -> linear layer -> logits and got pretty good results
    - layer normalization
        - we added layer normalization inside the transformer block (a layer normalization before the multi-head attention and another one before the feed-forward block)
            - this became more common in the last 5 years, to apply layer normalization before the block
                - so we have block input -> layer norm -> block -> project it to the residual pathway (instead of add & norm after the block)
        - we also added layer normalization at the end right before the final linear layer (language model head)
        - then we trained again x -> token embeddings + positional encodings -> transformer block (with residual connections and layer normalization) x 3 -> layer norm -> linear layer -> logits and got better results

- we then added dropout
    - we added it inside the transformer block (right before the blocks output is added to the residual pathway)
    - we also added it inside the multi-head attention block (on the affinities -the wei matrix- right before using it to aggregate the values)
        - this randomly prevents some of the nodes from communicating
    - we did so to scale-up the model without worrying about overfitting

- then we made the number of heads and the number of blocks as parameters instead of fixed numbers (4 heads and 3 blocks previously)
    - and now we have our gpt :D

- so, we started with the very basic bigram model, and ended up with a pretty complete transformer that is similar to the decoder block in the transformer architecture (except for the cross-attention to the encoder)

- our implemented architecture
![transformer](assets/transformer.png)

## Encoder Vs decoder vs both?

- decoder
    - what we implemented here is a decoder only transformer
        - so there is no encoder or cross attention
    - our block only has a self-attention and the feed-forward block
    - the reason we have a decoder only, is because we are just generating text and it is unconditioned on anything (just blabbering on according to a given dataset)
        - mine: that is what i said earlier on the difference between a language model (decoder only) and an encoder-decoder architecture (a conditional language model)
    - and what makes it a decoder is that we are using a masked multi-head attention (masking the future tokens)
        - so it has this auto-regressive property where we can just go and sample from it
    - check our implementation above
    - so we have nothing to encode, there is no conditioning, we just have a text file and we want to imitate it (that is why we are using a decoder only transformer as done in GTP)


- encoder-decoder
    - the reason that the original paper had an encoder-decoder architecture, is because it is a machine translation paper
        - it is expected to encode a source sentence then decoder its translation in the target language
        - so it reads in the source sentence and conditions on it, and then start the generation with a special token "start of sentence" token
    - example
        - Encode: Les réseaux de neurones sont géniaux! 
        - Decode: `<SOS>` Neural Networks are awesome! `<EOS>`
        - so the decoder will do exactly what we did here, but this generation will be conditioned on some additional information "the french sentence"
        - so the encoder reads the source sentence, we are going to take the french sentence and we are going to create tokens from it and put a transformer block on it -the encoder block- 
            - and this time there is no masking, so all the tokens are allowed to communicate with each other 
        - once they have encoded it (mine: applying the transformer block on it several times), they basically come out of the last block and then the decoder (which does language modeling) will take that output to the cross-attention block (a normal multi-head attention, not masked, but with the queries coming from the target sentence, and the keys and values coming from the encoder output)
            - and those keys and values (from the encoder's last block) will be fed to the cross attention to every single block of the decoder -check the image below-
            - so it is conditioning the decoding on having seen the fully encoded french prompt -the source sentence-

![attention architecture spanned](assets/attention_architecture_spanned.png)

## Nano-GTP

can be found in [here](https://github.com/karpathy/nanoGPT)

- it is basically 2 files of interest
    - train.py
        - all the boiler plate code for training the model, just like what we have but a lit more complicated, as we also do
            - saving and loading the checkpoints
            - decaying the learning rate
            - using distributed training across multiple GPUs
    - model.py
        - looks very similar to what we done here (almost identical), except for few differences
            - it has a causal self attention block (parallel multi-head attention)
                - we have a single head then we create multiple heads of them in here and concatenate them
                - but there, all of it is implemented in a batched manner (so k,q,v for example is (batch_size,num_heads,sequence_length,head_size) instead of (batch_size,sequence_length,head_size))
                - `all the heads are treated as a batch dimension as well`
            - using the `gelu` non-linearity instead of the `relu`
                - just because OpenAI used it and he wanted to laod their checkpoints
            - separating our the parameters into that should be weight decayed and those that shouldn't 


## Training ChatGPT

- to train a chatGPT there are roughly 2 stages 
    - `pretraining stage` (what we did here)
        - we train on a large chunk of the internet and just trying to get a first decoder only transformer to babble text 
            - we did so but only 10M parameters on 1M tokens (if we will use OpenAI vocabulary -subword tokenizer- then it is probably around 300K tokens on the subword level)
                - so 10M parameters on 300K tokens
            - as for GPT-3, they trained several models with different sizes 
                - the biggest one has 175B parameters! on 300B tokens!
                - so it is a massive infrastructure challenge (we are talking about thousands of GPUs having to talk to each other to train a model of that size)
        - at the end of the first stage,  we don't get something that responds to the questions with answers
            - we just get a document completer, something that babbles internet (arbitrary news articles, wikipedia and so on)
            - so if we gave it a question after the first stage, it would potentially give us more questions or ignore it or follow with whatever it looks like some close document would do in the training data
            - so we have undefined behavior
    - the `fine-tuning stage`
        - we take the pre-trained model and we fine-tune it on a specific task
        - in here we actually allign it to be an assistant
        - here is how Chatgpt fine-tuned their models [here](https://openai.com/index/chatgpt/), they basically fine-tuned it on 3 steps
            1. collect training data that looks specifically like the task we want to do
                - so they have documents that have to format where the question is on top and the answer is below
                - they have a large number of these but not on the scale of the internet (mayble thousands of examples)
                - so we are trying to slowly allign it so that it expects a question on the top and complete the answer 
            2. they then let the model respond with different answers and let different readers look at them and rank them for their preference
                - they use that to train a reward model, so they can predict how much a human would like the response and let the model later choose the best response
            3. then they run a PPO (Proximal Policy Optimization) reinforcement learning algorithm to train the model to maximize the reward
        - this takes the model from a document completer to a question answerer
        - a lot of this data isn't available publicly, and it is much harder to replicate this stage
        - so basically if we wanted to make it alligned to a specific way to do some task (question answerer, or a chatbot, or detect sentiment), basically anytime we want something more than a document completer, we have to complete further stages of fine-tuning which we did not cover
        - the fine-tuning can be simple supervised learning, or something more fancy like what OpenAI did

- more on that on the next notebook
            