GPT = Generative Pre-Trained Transformer

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
# Read the data into a string
with open("tinyshakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [4]:
chars = sorted(list(set(text)))
data_size, vocab_size = len(text), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))

data has 1115394 characters, 65 unique.


In [5]:
# Build character-level lookup tables (one integer id per unique character)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Convert the text into a sequence of integers
encoder = lambda s: [stoi[c] for c in s]
decoder = lambda s: ''.join([itos[c] for c in s])
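
As a quick sanity check, the sketch below rebuilds the same kind of lookup tables on a short stand-in string and confirms that decoding an encoded string returns the original. The string `sample` and the function names `encode` / `decode` are assumptions made so the snippet runs without `tinyshakespeare.txt`; the notebook itself uses the `encoder` / `decoder` lambdas above.

```python
# Stand-in corpus (assumption); the notebook uses the full tinyshakespeare text.
sample = "hello world"

# Same construction as above: one integer id per unique character.
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map integer ids back to characters and join them."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Any character that never appeared in the corpus would raise a `KeyError` in `encode`, which is one limitation of such a simple character-level scheme.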

In [6]:
data = torch.tensor(encoder(text), dtype=torch.int64)
print(data.shape, data.dtype)
print(data[:100])


torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [9]:
# Split the data into train and test
n_train = int(0.9 * data_size)
train_data = data[:n_train]
test_data = data[n_train:]


We're not going to feed the entire sequence into the model at once. \
Instead, we'll feed it a small chunk at a time, and then let it continue predicting the next character.

In [7]:

# Maximum context length fed to the model at once (sometimes called the chunk size).
block_size = 8

# We will actually separate block_size + 1 characters at a time. 
# Look at it this way:

# For block[0],    we will predict block[1], 
# For block[0, 1], we will predict block[2], 
# and so on. 

# Which means we need to have block[8] as target for block[0, ... , 7]

# This is not just done for efficiency, but also to make sure that the 
# Transformer gets used to seeing contexts as little as 1 character long 
# and as long as block_size characters long.

In [10]:
x = train_data[:block_size]
y = train_data[1 : block_size + 1]

for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"When input is {context.tolist()}, the target is {target}.")


When input is [18], the target is 47.
When input is [18, 47], the target is 56.
When input is [18, 47, 56], the target is 57.
When input is [18, 47, 56, 57], the target is 58.
When input is [18, 47, 56, 57, 58], the target is 1.
When input is [18, 47, 56, 57, 58, 1], the target is 15.
When input is [18, 47, 56, 57, 58, 1, 15], the target is 47.
When input is [18, 47, 56, 57, 58, 1, 15, 47], the target is 58.


In [None]:
# There's one more thing to care about and that's the batch dimension.
# We want to be able to feed in a batch of sequences at a time (as a tensor) to the GPU
# and have it process all of them in parallel for a speedup.

In [12]:
# Setting a seed to make sure the same pseudo-random sequence is generated every time we run this cell
torch.manual_seed(1337)
batch_size = 4      # How many independent sequences to process in parallel
block_size = 8      # Maximum context length for predictions


def get_batch(split: str):
    # generate a small batch of data of inputs `x` and targets `y`
    data = train_data if split == "train" else test_data
    # stochastic sampling of the data, ix = indices
    ix = torch.randint(len(data) - block_size, size=(batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])

    return x, y

In [13]:
torch.manual_seed(1337)

# Print an example batch
xb, yb = get_batch("train")

print("inputs:")
print(xb.shape)
print(xb)

print("targets:")
print(yb.shape)
print(yb)

print("---")

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, : t + 1]
        target = yb[b, t]
        print(f"When input is {context.tolist()}, the target is {target}.")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
---
When input is [24], the target is 43.
When input is [24, 43], the target is 58.
When input is [24, 43, 58], the target is 5.
When input is [24, 43, 58, 5], the target is 57.
When input is [24, 43, 58, 5, 57], the target is 1.
When input is [24, 43, 58, 5, 57, 1], the target is 46.
When input is [24, 43, 58, 5, 57, 1, 46], the target is 43.
When input is [24, 43, 58, 5, 57, 1, 46, 43], the target is 39.
When input is [44], the target is 53.
When input is [44, 53], the target is 56.
When input is [44, 53, 56], the target is 1.
When input is [44, 53, 56, 1], the target is 58.
When input is [44, 53, 

The logit function, also known as the log-odds function, maps a probability $p$ in $(0, 1)$ to a real number in $(-\infty, \infty)$. It is the inverse of the sigmoid function, which maps any real number back into $(0, 1)$. In the model below, "logits" means the raw, unnormalized scores that softmax turns into probabilities.

$$
\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right)
$$
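
To make the inverse relationship concrete, here is a minimal sketch using only the standard library. The helper names `sigmoid` and `logit` are defined here for illustration and are not part of the notebook.

```python
import math


def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability in (0, 1) to its log-odds."""
    return math.log(p / (1.0 - p))


# Applying sigmoid after logit should give back the original probability.
for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(p, x, sigmoid(x))
```

A probability of 0.5 corresponds to log-odds of 0; probabilities above 0.5 give positive values and those below give negative values.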

In [18]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [3]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int) -> None:
        super().__init__()

        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        # The embedding table is a vocab_size x vocab_size matrix,
        # which means that every token in the vocabulary is represented by a vector of length vocab_size
        # More on word embeddings: https://www.tensorflow.org/text/guide/word_embeddings
    
    def forward(self, idx: torch.Tensor, targets: torch.Tensor = None) -> tuple:

        # idx and targets are both (Batch, Time) tensors of integers
        logits = self.token_embedding_table(idx) # (Batch, Time, Channel)

        # When we pass our input through the embedding table,
        # every integer in the input is used as an index that plucks out
        # the corresponding row of the table.
        # In this case, Batch = 4, Time = 8, Channel = 65 (vocab_size)

        # Logits are the scores for the next character in the sequence
        # As you see, this is a bigram model, which means that the next character is predicted
        # based only on the current character, and not on the entire sequence

        # Aka negative log likelihood loss
        # loss = F.cross_entropy(logits, targets)

        # The above code won't work, because our logits have shape (B, T, C),
        # but PyTorch's cross entropy expects the class dimension second: (N, C) logits with (N,) targets.
        # So we flatten the logits to (B * T, C) and the targets to (B * T)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx: torch.Tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)
            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)
            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

In [28]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.5262, grad_fn=<NllLossBackward0>)


In [None]:
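# Because our model should initially treat every character as roughly equally likely to come next,
# our negative log likelihood loss should be around -ln(1/65) = 4.174387269895637
# The 4.5262 above is somewhat higher because the randomly initialized logits are not exactly uniform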
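
The baseline figure can be reproduced with a few lines of standard-library Python. This is a sketch that assumes perfectly uniform predictions (all logits equal); the target index `18` is an arbitrary choice for illustration.

```python
import math

vocab_size = 65  # number of unique characters reported above

# All-equal logits mean softmax assigns probability 1 / vocab_size to every character.
logits = [0.0] * vocab_size
target = 18  # arbitrary target id (assumption); any id gives the same loss here

# Cross entropy for one example: log-sum-exp of the logits minus the target's logit.
log_sum_exp = math.log(sum(math.exp(z) for z in logits))
expected_loss = log_sum_exp - logits[target]
print(expected_loss)
```

Any loss well above this value at initialization suggests the initial predictions are confidently wrong rather than merely uninformed.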

In [29]:
# We set our initial token to be 0, which is the newline character in our vocabulary

idx = torch.zeros((1, 1), dtype=torch.long)

# The batch holds a single sequence, and the [0] picks it out
print(decoder(m.generate(idx, max_new_tokens=100)[0].tolist()))

# The model is obviously not trained so it's just spitting out random characters


'JgC.JZWqUkpdtkSpmzjM-,RqzgaN?vC:hgjnAnBZDga-APqGUH!WdCbIb;$DefOYbEvcaKGMmnO'q$KdS-'ZH
.YSqr'X!Q! d;


In [30]:
# create a PyTorch optimizer

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [34]:
batch_size = 32
max_steps = 10000

for steps in range(max_steps):
    # get a batch of data
    xb, yb = get_batch("train")

    # get the model predictions and loss
    logits, loss = m(xb, yb)

    # zero out the gradients
    optimizer.zero_grad(set_to_none=True)
    
    # compute the gradients
    loss.backward()

    # update the model parameters
    optimizer.step()

    if steps % 1000 == 0:
        print(f"Step {steps}, loss = {loss.item()}")

Step 0, loss = 3.6967320442199707
Step 1000, loss = 3.0273618698120117
Step 2000, loss = 2.821552276611328
Step 3000, loss = 2.547393798828125
Step 4000, loss = 2.519805908203125
Step 5000, loss = 2.529714584350586
Step 6000, loss = 2.542541980743408
Step 7000, loss = 2.5140950679779053
Step 8000, loss = 2.5017194747924805
Step 9000, loss = 2.5214755535125732


In [36]:
# We set our initial token to be 0, which is the newline character in our vocabulary

idx = torch.zeros((1, 1), dtype=torch.long)

# The batch holds a single sequence, and the [0] picks it out
print(decoder(m.generate(idx, max_new_tokens=300)[0].tolist()))


CKELOresm, bur stthakls,
Ther layo-then ha nleincede jahe
DZW:
Gothe s kendwepive.
FAnorereroldghmig ppu:
Co nlllinger hus;
aver his, t towis t s ng,
ANE: foratreaisplblthriat, otimust hiny ille, yomeON p, IN I ckist vemo th.
Dieathy al hi?
Fo'd ha s?
ARS:
Semnd thinghy.
IORDitwint sth! mine actwis 


In [None]:
# This is obviously not Shakespeare, but it's a lot better than random characters!

### The mathematical trick in self-attention

In [4]:
# consider the following toy example:

torch.manual_seed(1337)
B, T, C = 4, 8, 2
# randn = normal distribution with mean 0 and variance 1
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [6]:
# Alright, so now we want to bring out some information from the tokens
# that come before the current token at each step and use this information
# in predicting the next token
# The easiest way to do this is to average out all the embedding vectors
# of the tokens that come before the current token (with the current token included)


# bow = bag of words 
# bag here means that it's an average of the word vectors
xbow = torch.zeros((B, T, C))

# We want xbow[b, t] = mean_{i <= t} x[b, i]
for b in range(B):
    for t in range(T):
        #                 [0, t]
        xbow[b, t] = x[b, :t+1].mean(dim=0)

# This thing uses `for` loops, which is not good for performance
# and is used for illustration purposes only
# We can do this much more efficiently using matrix multiplication

In [9]:
# Example of doing averaging using matrix multiplication

torch.manual_seed(42)

# Torch.tril gives us a lower triangular matrix of its input
# (the upper triangular part is zeroed out)
a = torch.tril(torch.ones(3, 3))

# dim=1 sums along each row (across the columns); keepdim=True keeps the (3, 1) shape for broadcasting
# Dividing by the row sums is what does the averaging part
a = a / torch.sum(a, dim=1, keepdim=True)

b = torch.randint(low=0, high=10, size=(3, 2)).float()

c = a @ b

print(f"a = \n {a} \n")
print(f"b = \n {b} \n")
print(f"c = \n {c} \n")


a = 
 tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]]) 

b = 
 tensor([[2., 7.],
        [6., 4.],
        [6., 5.]]) 

c = 
 tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]]) 



In [13]:
# You can see that this gives us the averages in incremental fashion:
# The first row of c is the average of the first row of b
# The second row of c is the average of the first and second row of b
# The third row of c is the average of the first, second and third row of b
# and so on

# So now let's go back and do this for our toy example

# wei is short for weights
wei = torch.tril(torch.ones(T, T))
wei = wei / torch.sum(wei, dim=1, keepdim=True)

xbow2 = wei @ x # (T, T) @ (B, T, C) -> (B, T, C)
# The shapes differ, so PyTorch broadcasts: it treats the (T, T) matrix
# as (B, T, T) by reusing it for every batch element, and then does a batched matrix multiplication
# So it's a (B, T, T) @ (B, T, C) -> (B, T, C)
# The batch dimension is kept as is, and the last two dimensions are multiplied
# as ordinary matrices, independently for each batch element.
# Something like this:
# (B, (T, T)) @ (B, (T, C)) = (B, (T, T) @ (T, C)) = (B, (T, C))

In [15]:
# Version 3: Use softmax!
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
# In wei, replace the elements where the corresponding element in tril is 0 with -inf
# By corresponding, meaning the same index because they have the same shape
wei = wei.masked_fill(tril == 0, float("-inf"))
# Softmax gets the sum of e^{x} for x in each row and divides each element by that sum
# So the sum of each row is 1. (e^{0} = 1, e^{-inf} = 0)
wei = F.softmax(wei, dim=-1) 
xbow3 = wei @ x

torch.allclose(xbow3, xbow2)

True
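
To see why the masked softmax reproduces the running average, here is a small standard-library sketch of the same idea on plain Python lists. The helper `softmax` and the size `T = 4` are assumptions chosen for readability; the notebook itself uses `F.softmax` on tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


T = 4
neg_inf = float("-inf")

# Row i keeps zeros at positions 0..i (the past and present) and -inf at future positions.
masked = [[0.0 if j <= i else neg_inf for j in range(T)] for i in range(T)]

# Each row becomes uniform over the allowed positions, i.e. an averaging row.
weights = [softmax(row) for row in masked]
for row in weights:
    print(row)
```

Because the unmasked entries are all zero, every allowed position gets the same weight. Replacing those zeros with data-dependent scores is what turns this averaging into attention.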

Funny thing to mention: \
You could see Andrej's lighting get dimmer and dimmer as time went by.
He finally went to sleep at 1:01:57 in the video.

In [17]:
# Now, let's try to update the Bigram model some more

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # We have changed our channel size to be n_embed, but the logits
        # must have one entry per token in the vocabulary,
        # because they are the scores (turned into probabilities by softmax) of each token in the vocabulary being the next token
        # So we add a linear layer to transform the channel size to be the same as the vocab size
        self.lm_head = nn.Linear(n_embed, vocab_size)
        # lm = language model
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)

        # So now, the x will not only hold the token identity, but also the position of the token
        # It is currently not that useful because we're only using the bigram model,
        # so it's all translation invariant at this stage. (Doesn't matter _where_ the token is)
        # But this will be useful when we use the transformer model (self-attention)

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.Tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)
            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)
            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

In [20]:
# Now we get to the most important part of the tutorial
# The crux of 'self-attention'.

# Version 4: Self-attention!

torch.manual_seed(1337)

# Initializing the input vector
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
# The weights are the initial "affinities" between tokens,
# When we initialize them to be 0, we will get the uniform numbers we got before
# Now, we don't want this to be this way, because some tokens will find different tokens
# more or less interesting, and we want that to be data-dependent. 
# e.g. a vowel might look for consonants in its past and would want that information to flow towards it
# This is the problem that self-attention solves! (more details below)
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(wei, dim=-1)

out = wei @ x

# Now, the way that self-attention solves this is the following:
# Every single node/token at each position will "emit" two vectors:
# 1. A query vector
# 2. A key vector
# The query vector is, roughly speaking, what am I looking for?
# The key vector is, roughly speaking, what do I contain?
# And the way we get affinities between tokens is by taking the dot product of the query and key vectors
# So my query, dot-producted with the keys of all the other tokens, becomes the weights `wei`
# So if a key and a query are "aligned", the weight will be high, which means that
# the querying token will pull in more information from the token that owns that key
# We will also have a
# 3. Value vector
# Which is a representation of the token that we want to be propagated in case it's selected

# So let's try to implement this and change the above code to use self-attention

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)

# So far, keys and queries have been independently computed for each token
# Now let's do the dot product between them

# We're transposing last two dimensions of the key matrix
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)

# Now that we've found the initial affinities, we need to do the same as before
tril = torch.tril(torch.ones(T, T))
# Remember, no communication from the future to the past, because we're doing next token prediction
# That's why we mask the upper triangle
# But if we wanted to do something like sentiment analysis with a transformer,
# we wouldn't need to mask the upper triangle, and we could just use the entire matrix
# (Which would correspond to a fully connected graph where the nodes are the tokens)
# In the case where we do mask, it's called a decoder block (decodes the future)
# And in the case where we don't mask, it's called an encoder block (encodes everything)
wei = wei.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(wei, dim=-1)

v = value(x)

out = wei @ v

# Previously, every batch element had the same weights, but now
# every batch element has different weights, because every batch element has different tokens
wei 


tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
         [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
         [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
         [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1687, 0.8313, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2477, 0.0514, 0.7008, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4410, 0.0957, 0.3747, 0.0887, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0069, 0.0456, 0.0300, 0.7748, 0.1427, 0.0000, 0.0000, 0.0000],
         [0.0660, 0.089

In [21]:
# Now, why do we keep calling this method self-attention? 
# (there's also something called the cross-attention!)
# It's because all 3 of key, query and value come from the same source `x`.

# For example, in encoder-decoder transformers, you can have a case where
# the queries are produced by `x`, but the keys and values are produced by a whole different source
# And sometimes from encoder blocks which encode some context that we'd like to condition on.
# So basically, cross-attention is where we have another source of information (nodes)
# And we'd like to pull that information into our current graph and use it there.

You can see that in the "Attention is all you need" paper, they have a formula like this:

$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

But here, we haven't used the denominator of the thing inside softmax, which is the square root of the head size $d_k$. \
This scaling is what makes it "scaled dot-product attention".

The problem is that if the queries and keys are roughly unit-variance Gaussian, the variance of their dot products grows with the head size, as shown below:

In [22]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

wei_1 = q @ k.transpose(-2, -1)
wei_2 = wei_1 * head_size**(-0.5)

In [23]:
k.var()

tensor(0.9487)

In [24]:
q.var()

tensor(1.0449)

In [25]:
wei_1.var(), wei_2.var()

(tensor(14.3682), tensor(0.8980))

You will see that the unnormalized `wei`'s variance scales with the `head_size` (which is 16 here), \
but the normalized `wei`'s variance stays close to $1$.
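
The scaling can be checked with a short derivation, under the assumption (matching the `torch.randn` inputs above) that the entries $q_i$ and $k_i$ are independent with mean $0$ and variance $1$. Here $d_k$ stands for `head_size`:

$$
\operatorname{Var}(q \cdot k) = \operatorname{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k,
\qquad
\operatorname{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1
$$

With `head_size = 16`, this is consistent with the measured values above: roughly 14.4 before scaling and 0.9 after.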

Now why is this important?

You have seen that the `wei` feeds into the softmax, so it's very important (especially at initialization) that the `wei` be diffuse.
If the values in `wei` take on some very positive or very negative numbers, then the softmax will be very peaked, and the weights will converge to one-hot vectors, which would mean that we're aggregating information from a single node, and that would be quite unfortunate!

Let's see a quick example below:

In [26]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [28]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1)

# This is peaked at output[-1]

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

In [29]:
class Head(nn.Module):
    '''One head of self-attention'''

    def __init__(self, n_embed: int, head_size: int):
        super().__init__()

        self.head_size = head_size
        
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)

        # Here we're creating this tril variable, but `tril` is not a parameter of the module
        # So in PyTorch's convention, it's called a "buffer" and not a "parameter".
        # Basically this is assigning `tril` to `self` but telling PyTorch
        # that it's not a real parameter of the module, so don't optimize it or anything.
        # Note: `block_size` here is the global defined earlier in the notebook.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    
    def forward(self, x):
        B, T, C = x.shape

        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)

        # compute scores (affinities)
        wei = q @ k.transpose(-2, -1) * self.head_size**(-0.5) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)

        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)

        return out

In [None]:
# Now, let's try to update the Bigram model even more with self-attention!

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int, head_size: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # sa = self-attention
        self.sa_head = Head(n_embed, head_size)

        # The attention head outputs head_size channels, but the logits
        # must have one entry per token in the vocabulary,
        # because they are the scores (turned into probabilities by softmax) of each token in the vocabulary being the next token
        # So we add a linear layer to transform the channel size to be the same as the vocab size
        # Its input size is head_size, because it now reads the output of the attention head
        self.lm_head = nn.Linear(head_size, vocab_size)
        # lm = language model
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)
        x = self.sa_head(x) # apply one head of self-attention (B, T, head_size)

        # Before the attention head, x held both the token identity and the position of the token
        # With the plain bigram model this was not that useful,
        # because it was all translation invariant. (It didn't matter _where_ the token was)
        # Now that self-attention mixes information across positions, the position embedding matters

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.Tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # Crop idx to the last `block_size` tokens
            # Because now we're using positional embeddings, we can never have
            # more than `block_size` coming in.  
            # The positional embedding has embeddings only up to `block_size`. (# of rows)
            idx_cond = idx[:, -block_size:]

            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx_cond)

            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)

            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)

            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)

            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

Now we want to do something called the multi-head attention. \
This is basically doing a stack of parallel attention layers, and then concatenating the results.
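
The channel bookkeeping behind the concatenation can be sketched with plain Python lists. The sizes `n_embed = 32` and `n_heads = 4` are assumptions for illustration, and each inner list is a stand-in for one head's output for a single token, not real attention output.

```python
n_embed = 32  # assumed embedding size
n_heads = 4   # assumed number of heads

# Each head gets an equal slice of the channel budget.
head_size = n_embed // n_heads

# Stand-in outputs: head h produces head_size copies of the number h.
head_outputs = [[float(h)] * head_size for h in range(n_heads)]

# Concatenating along the channel dimension, which is the role
# torch.cat(..., dim=-1) plays in the PyTorch version.
concatenated = [value for out in head_outputs for value in out]
print(head_size, len(concatenated))
```

When `n_embed` is divisible by `n_heads`, the concatenated vector has `n_embed` channels again, so it can feed a layer that expects `n_embed` inputs.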

In [11]:
# a stack of parallel attention layers
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads: int, head_size: int, n_embed:int) -> None:
        super().__init__()

        self.heads = nn.ModuleList([
            Head(n_embed, head_size) for _ in range(num_heads)
        ])

    
    def forward(self, x: torch.Tensor):
        # Concatenate the results of all the heads
        return torch.cat([head(x) for head in self.heads], dim=-1)

In [8]:
# Now, let's try to update the Bigram model even more with self-attention!
# We will use the MultiHeadAttention class that we just made

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int, n_heads: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # sa = self-attention
        # We make each head smaller corresponding to the number of heads
        # So that the total number of parameters is the same as the original
        # This is like doing a group convolution instead of a large convolution
        self.sa_heads = MultiHeadAttention(n_heads, n_embed // n_heads, n_embed)

        # We have changed our channel size to be n_embed, but we want the logits
        # to have the same size as the vocabulary, because they are the
        # (unnormalized) scores of each token in the vocabulary for being the next token
        # So we add a linear layer to transform the channel size to be the same as the vocab size
        self.lm_head = nn.Linear(n_embed, vocab_size)
        # lm = language model
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)
        x = self.sa_heads(x) # apply multi-head self-attention (B, T, C)

        # x holds not only the token identity, but also the position of the token
        # In the pure bigram model this was not useful, because it was all
        # translation invariant. (Didn't matter _where_ the token was)
        # Now that we apply self-attention, the position information is actually used

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # Crop the idx to last_block_size tokens 
            # Because now we're using positional embeddings, we can never have
            # more than `block_size` coming in.  
            # The positional embedding has embeddings only up to `block_size`. (# of rows)
            idx_cond = idx[:, -block_size:]

            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx_cond)

            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)

            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)

            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)

            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

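If you look at the "Attention Is All You Need" paper, you'll see that each block also has a feed-forward layer after the attention layers (masked multi-head self-attention, followed by cross-attention in the decoder).
Attention lets the tokens talk to each other, and the feed-forward layer then gives each token a chance to think about what it has gathered.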

![](.graphics/2023-04-21-20-26-17.png)

Basically this, but ignoring the left part (he said we'll get back to it later).

In [9]:
class FeedForward(nn.Module):
    def __init__(self, n_embed: int) -> None:
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(n_embed, n_embed),
            nn.ReLU(),
        )

    def forward(self, x: torch.tensor):
        return self.net(x)
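
A small sketch of what "per token" means here. The sizes are made up and `net` just mirrors the layers inside `FeedForward`. Running the layer on the whole sequence and running it on one position alone give the same result for that position, so nothing is shared between positions in this layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embed = 16                                                   # toy size, chosen for illustration
net = nn.Sequential(nn.Linear(n_embed, n_embed), nn.ReLU())    # same layers as FeedForward.net

x = torch.randn(2, 8, n_embed)     # (B, T, n_embed)
full = net(x)                      # applied to the whole sequence at once
one_token = net(x[:, 3, :])        # applied to the token at position 3 only

# The token at position 3 gets the same result either way,
# so no information flows between positions in this layer.
same = torch.allclose(full[:, 3, :], one_token, atol=1e-5)
print(same)  # True
```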


In [None]:
# Now, let's try to update the Bigram model even more with self-attention!
# We will use the MultiHeadAttention class that we just made
# And now we add the FeedForward network as well!

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int, n_heads: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # sa = self-attention
        # We make each head smaller corresponding to the number of heads
        # So that the total number of parameters is the same as the original
        # This is like doing a group convolution instead of a large convolution
        head_size = n_embed // n_heads

        self.sa_heads = MultiHeadAttention(n_heads, head_size, n_embed)

        # This is done on a token-level basis.
        # What it means that each token has gathered all the information
        # and now each token needs to think on that data individually

        self.ffwd = FeedForward(n_embed)

        # We have changed our channel size to be n_embed, but we want the logits
        # to have the same size as the vocabulary, because they are the
        # (unnormalized) scores of each token in the vocabulary for being the next token
        # So we add a linear layer to transform the channel size to be the same as the vocab size
        self.lm_head = nn.Linear(n_embed, vocab_size)
        # lm = language model
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)
        x = self.sa_heads(x) # apply multi-head self-attention (B, T, C)

        x = self.ffwd(x) # (B, T, C)

        # x holds not only the token identity, but also the position of the token
        # In the pure bigram model this was not useful, because it was all
        # translation invariant. (Didn't matter _where_ the token was)
        # Now that we apply self-attention, the position information is actually used

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # Crop the idx to last_block_size tokens 
            # Because now we're using positional embeddings, we can never have
            # more than `block_size` coming in.  
            # The positional embedding has embeddings only up to `block_size`. (# of rows)
            idx_cond = idx[:, -block_size:]

            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx_cond)

            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)

            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)

            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)

            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

We're now going to intersperse the communication with the computation!

To do that, we implement the block, which is the gray box in the picture above (except for the cross-attention).

In [9]:
class Block(nn.Module):

    def __init__(self, n_embed: int, n_heads: int) -> None:
        super().__init__()

        head_size = n_embed // n_heads

        # Communication is the MultiHeadAttention
        self.sa = MultiHeadAttention(n_heads, head_size, n_embed)
        
        # Computation is the FeedForward
        self.ffwd = FeedForward(n_embed)
    
    def forward(self, x: torch.tensor):
        x = self.sa(x)
        x = self.ffwd(x)

        return x

Let's upgrade the Bigram model ONCE AGAIN.

(As you can see, there is a $\times N$ thing next to the graph, which means we want to use $N$ layers of the block.)
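
As a sketch of the $\times N$ idea: instead of writing the blocks out by hand, they can be built in a loop. `ToyBlock` is a hypothetical stand-in for `Block` (any module that keeps the `(B, T, n_embed)` shape would do), and `n_layer` is a made-up name for $N$.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for Block: maps (B, T, n_embed) to the same shape."""

    def __init__(self, n_embed: int) -> None:
        super().__init__()
        self.net = nn.Linear(n_embed, n_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

n_embed, n_layer = 32, 3                       # n_layer plays the role of N in the figure
blocks = nn.Sequential(*[ToyBlock(n_embed) for _ in range(n_layer)])

x = torch.randn(2, 8, n_embed)                 # (B, T, n_embed)
y = blocks(x)                                  # passes through all n_layer blocks in order

print(len(blocks))  # 3
print(y.shape)      # torch.Size([2, 8, 32])
```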

In [10]:
# Now, let's try to update the Bigram model even more with Blocks!

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int, n_heads: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # A couple of blocks in sequence (Communication and Computation, many times)

        self.blocks = nn.Sequential(
            Block(n_embed, n_heads=n_heads),
            Block(n_embed, n_heads=n_heads),
            Block(n_embed, n_heads=n_heads),
        )
        
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)
        
        x = self.blocks(x) # (B, T, n_embed)

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # Crop the idx to last_block_size tokens 
            # Because now we're using positional embeddings, we can never have
            # more than `block_size` coming in.  
            # The positional embedding has embeddings only up to `block_size`. (# of rows)
            idx_cond = idx[:, -block_size:]

            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx_cond)

            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)

            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)

            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)

            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

The above model doesn't actually give good results, because it is starting to become a _deep_ neural net, and deep NNs suffer from optimization problems.

So we need a couple more ideas that we can borrow from the Transformer paper to resolve this issue.

There are two optimizations that dramatically help with the depth of these networks and make sure that they remain optimizable.

1. Residual connections:

Look at the three arrows highlighted in the picture below:

![](.graphics/2023-04-24-18-05-16.png)

Those are called the "skip" or the "residual" connections.

They come from this paper: [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) \
(Holy shit co-pilot figured out the paper name and url on its own!)

Using the figure from this [Towards Data Science post](https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec):

![](.graphics/2023-04-24-18-11-29.png)

It's like we have a highway from our input data directly to our output, using addition.

Basically, there is a fork in the computation path that skips the "extra" computation done by a specific part of the NN. Because addition passes the gradient through unchanged to both of its branches, the skip path is much better at preserving the gradient during backpropagation.
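
A toy experiment to see the effect. This is only a sketch under assumptions: a stack of small `tanh` layers instead of our real blocks, with made-up sizes. It pushes the same input through the same layers with and without the skip path and compares how much gradient reaches the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 20, 16                      # toy sizes, chosen for illustration
layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
x0 = torch.randn(4, dim)                 # the same input is used for both runs

def input_grad_norm(use_residual: bool) -> float:
    """Run the stack and return the norm of d(sum of outputs) / d(input)."""
    x = x0.clone().requires_grad_(True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out   # the only difference is the skip path
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(use_residual=False)
residual = input_grad_norm(use_residual=True)
print(f"plain: {plain:.2e}, residual: {residual:.2e}")
```

With the default PyTorch initialization, the gradient of the plain stack should come out much smaller than that of the residual stack, which is the "highway" at work.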

With that knowledge, let's update our Block class!



In [17]:
class FeedForward(nn.Module):
    def __init__(self, n_embed: int) -> None:
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            # This is the projection layer "going back into the residual pathway"
            # (in quotes because I don't totally understand it)
            nn.Linear(4 * n_embed, n_embed),
        )

    def forward(self, x: torch.tensor):
        return self.net(x)

# Why the 4 * n_embed?
# Because in the paper, they use a dimensionality of 512 for the input
# and 2048 for the hidden layer, so 4 * 512 = 2048
# So we use the same ratio here

In [16]:
# We also need to update this to add the "accumulative" NN
# at the end of it, which in a way gets a "summary" of the parallel 
# computations we do inside of it.

# a stack of parallel attention layers
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads: int, head_size: int, n_embed:int) -> None:
        super().__init__()

        self.heads = nn.ModuleList([
            Head(n_embed, head_size) for _ in range(num_heads)
        ])

        self.proj = nn.Linear(n_embed, n_embed)

    
    def forward(self, x: torch.tensor):
        # Concatenate the results of all the heads
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.proj(out)

        return out

In [19]:
class Block(nn.Module):

    def __init__(self, n_embed: int, n_heads: int) -> None:
        super().__init__()

        head_size = n_embed // n_heads

        # Communication is the MultiHeadAttention
        self.sa = MultiHeadAttention(n_heads, head_size, n_embed)
        
        # Computation is the FeedForward
        self.ffwd = FeedForward(n_embed)
    
    def forward(self, x: torch.tensor):
        # Adding the residual connection
        x = x + self.sa(x)
        # To both communication and computation
        x = x + self.ffwd(x)

        return x

I took a detour at this point in the video (around here: https://youtu.be/kCc8FmEb1nY?t=5646).

---

He started by talking about batch normalization, so I watched Andrew Ng's videos on that.
Then I watched Yannic Kilcher's video on [Group Normalization](https://www.youtube.com/watch?v=l_3zj6HeWUE).

I have written a bunch of notes on batch normalization in the `Ng - Improving DNNs` folder, but I'll just write down the formula for **Layer Normalization** here: \
(We compute the layer normalization statistics over all the hidden units in the same layer)

$$
\mu^{[l]} = \frac{1}{H} \sum_{i=1}^{H} z^{[l]}_{i} \\
\, \\
\sigma^{[l]} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (z^{[l]}_{i} - \mu^{[l]})^2} \\
$$

Where $H$ is the number of hidden units in the layer, and $z^{[l]}_{i}$ is the $i$ th hidden unit in the $l$ th layer.

(There is also instance normalization and group normalization, but we don't need to worry about those for now.)
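
To convince myself, here is a sketch that computes these statistics by hand and compares the result with PyTorch. The final normalization step $(z - \mu) / \sqrt{\sigma^2 + \epsilon}$ and the value of `eps` are assumptions that follow PyTorch's default behaviour, and the learnable gain and bias are left out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = 32                                   # number of hidden units (features)
z = torch.randn(4, 8, H)                 # (B, T, H): every (b, t) row is normalized separately

mu = z.mean(dim=-1, keepdim=True)                            # mean over the H units
sigma = ((z - mu) ** 2).mean(dim=-1, keepdim=True).sqrt()    # std over the H units (1/H, not 1/(H-1))
eps = 1e-5                                                   # avoids division by zero
z_norm = (z - mu) / torch.sqrt(sigma ** 2 + eps)

reference = F.layer_norm(z, (H,), eps=eps)                   # PyTorch's implementation
matches = torch.allclose(z_norm, reference, atol=1e-5)
print(matches)  # True
```

Note that nothing here depends on the other examples in the batch, which is the difference from batch normalization.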

---



And now, for the second method of improving our optimization:

2. Layer Normalization:

We see in the graph above that there is an `Add & Norm` block, and we want to use Layer Normalization for it. One thing has changed in Transformer implementations since the paper, though: nowadays people apply the `LayerNorm` before the `MaskedMultiHeadAttention` and the `FeedForward` layers, and then do the `Add` after them. This is called the "pre-norm" formulation.
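
The two orderings side by side, as a sketch. `PostNorm` and `PreNorm` are hypothetical wrapper names, and a plain `nn.Linear` stands in for the attention or feed-forward sublayer. With the sublayer zeroed out, the pre-norm version returns its input untouched, which shows that the residual pathway itself is never normalized.

```python
import torch
import torch.nn as nn

class PostNorm(nn.Module):
    """Ordering in the original paper: sublayer, then Add, then Norm."""

    def __init__(self, n_embed: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(n_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x + self.sublayer(x))

class PreNorm(nn.Module):
    """Pre-norm ordering: Norm, then sublayer, then Add."""

    def __init__(self, n_embed: int, sublayer: nn.Module) -> None:
        super().__init__()
        self.sublayer = sublayer
        self.ln = nn.LayerNorm(n_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.ln(x))

torch.manual_seed(0)
n_embed = 16
sub = nn.Linear(n_embed, n_embed)        # stand-in for attention or feed-forward
nn.init.zeros_(sub.weight)               # make the sublayer output exactly zero
nn.init.zeros_(sub.bias)

x = torch.randn(2, 8, n_embed)
pre_out = PreNorm(n_embed, sub)(x)       # x + 0, so the input comes back unchanged
post_out = PostNorm(n_embed, sub)(x)     # LayerNorm(x + 0), so the input is rescaled

print(torch.allclose(pre_out, x))   # True
print(torch.allclose(post_out, x))  # False
```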

So let's update our Block once again!

In [15]:
class Block(nn.Module):

    def __init__(self, n_embed: int, n_heads: int) -> None:
        super().__init__()

        head_size = n_embed // n_heads

        # Communication is the MultiHeadAttention
        self.sa = MultiHeadAttention(n_heads, head_size, n_embed)
        
        # Computation is the FeedForward
        self.ffwd = FeedForward(n_embed)

        # LayerNorm normalizes each token's n_embed features on its own,
        # so both the batch and time dimensions act as batch dimensions
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
    
    def forward(self, x: torch.tensor):
        # Adding the residual connection
        x = x + self.sa( self.ln1(x) )
        # To both communication and computation
        x = x + self.ffwd( self.ln2(x) )

        return x

And add the LayerNorm to our Bigram model as well:

In [19]:
# Now, let's try to update the Bigram model even more with Blocks!

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size: int, block_size: int, n_embed: int, n_heads: int) -> None:
        super().__init__()
        # We make an embedding table for the __value__ of the tokens
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)

        # We also make an embedding table for the __position__ of the tokens
        # meaning that both the token and the position of the token will be paid attention to
        self.position_embedding_table = nn.Embedding(block_size, n_embed)

        # A couple of blocks in sequence (Communication and Computation, many times)

        self.blocks = nn.Sequential(
            Block(n_embed, n_heads=n_heads),
            Block(n_embed, n_heads=n_heads),
            Block(n_embed, n_heads=n_heads),
            # We should also have a LayerNorm at the end of the stack
            # Right before linear layer that decodes into the vocabulary
            nn.LayerNorm(n_embed)
        )
        
        self.lm_head = nn.Linear(n_embed, vocab_size)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B, T) tensors of integers

        tok_emb = self.token_embedding_table(idx) # (B, T, n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, n_embed)

        # The batch dimension will get automatically broadcasted
        # for the position embedding, as it should be the same for all the batches
        x = tok_emb + pos_emb # (B, T, n_embed)
        
        x = self.blocks(x) # (B, T, n_embed)

        logits = self.lm_head(x) # (B, T, vocab_size)

    ### From here on out, it's the same as the previous version

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We will reshape the logits instead of the input here.
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
        
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx: torch.tensor, max_new_tokens: int):
        # idx is a (Batch, Time) array of indices in the current context

        # The point of this is to start with a sequence and 
        # keep adding new tokens to it, and then keep feeding it back into the model
        # to get new predictions until we have the desired number of new tokens

        for _ in range(max_new_tokens):
            
            # This function, even though general, is a bit ridiculous
            # Because this is a bi-gram model, we only need to pass the last token
            # to our table, but we are passing the entire sequence to the table
            # and then, at the next step (`logits = ...`) we pick the last token
            # Out of all the tokens in the sequence

            # Later, we will use the entire sequence to predict the next token
             
            # Crop the idx to last_block_size tokens 
            # Because now we're using positional embeddings, we can never have
            # more than `block_size` coming in.  
            # The positional embedding has embeddings only up to `block_size`. (# of rows)
            idx_cond = idx[:, -block_size:]

            # get the predictions, we don't care about the loss because we're not training
            logits, loss = self(idx_cond)

            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (Batch, Channel)

            # apply softmax to convert to probabilities
            probs = F.softmax(logits, dim=-1) # becomes (Batch, Channel)

            # sample from the distribution or take the most likely
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (Batch, 1)

            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=-1) # becomes (Batch, Time + 1)
        
        return idx

At this point, he adds a bunch of dropout layers, changes the number of blocks, the number of heads in the multi-head attention, and the batch size, and runs it on his A100 GPU for 15 minutes. The generated text is pretty good structure-wise, but not semantically.
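
I didn't write the dropout version down, so this is only a sketch of how a dropout layer behaves when it is added at the end of the feed-forward network. The rate of `0.2` and the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = 0.2                            # assumed rate, for illustration only
n_embed = 16
net = nn.Sequential(
    nn.Linear(n_embed, 4 * n_embed),
    nn.ReLU(),
    nn.Linear(4 * n_embed, n_embed),
    nn.Dropout(dropout),                 # randomly zeroes activations, during training only
)

x = torch.randn(2, 8, n_embed)

net.eval()                               # evaluation mode: dropout is a no-op
same_in_eval = torch.equal(net(x), net(x))

net.train()                              # training mode: a fresh random mask on every call
differs_in_train = not torch.equal(net(x), net(x))

print(same_in_eval)      # True
print(differs_in_train)  # True
```

This is why the model has to be switched to `eval()` before generating text.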

## Encoder-Decoder Architecture

What we implemented above (the right-hand side of the graphic) is called the Decoder part of the Transformer. We didn't implement the left-hand side (the Encoder) because this is just a text-prediction model. The triangular mask that prevents the model from looking at future tokens is what makes it a decoder.

The task considered in the original paper is a translation task, and that's why they needed the Encoder part of the transformer. (Encodes French, decodes English)

What it means is that the generation will be conditioned on the French sentence. In the encoder, all the tokens are allowed to talk to each other (no mask).

So the queries are still generated from `x` (the decoder, right-hand side), but the keys and the values come from the encoder output (left-hand side). This is the cross-attention, and it conditions the generation on the entirety of the French sentence.
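
We didn't implement this part, so the following is only a sketch of what one cross-attention head could look like. `CrossAttentionHead` and `enc_out` are hypothetical names, and the sizes are made up. The only differences from our self-attention head are where the keys and values come from, and the missing mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """Hypothetical single head: queries from the decoder, keys/values from the encoder."""

    def __init__(self, n_embed: int, head_size: int) -> None:
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        q = self.query(x)             # (B, T_dec, head_size), from the decoder side
        k = self.key(enc_out)         # (B, T_enc, head_size), from the encoder side
        v = self.value(enc_out)       # (B, T_enc, head_size), from the encoder side
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T_dec, T_enc)
        wei = F.softmax(wei, dim=-1)  # no mask: every encoder token is visible
        return wei @ v                # (B, T_dec, head_size)

head = CrossAttentionHead(n_embed=32, head_size=8)
x = torch.randn(2, 5, 32)             # 5 decoder tokens (the English generated so far)
enc_out = torch.randn(2, 11, 32)      # 11 encoder tokens (the whole French sentence)
out = head(x, enc_out)

print(out.shape)  # torch.Size([2, 5, 8])
```

The output has one row per decoder token, and each row is a weighted mix of all the encoder tokens.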

## Nano-GPT

It's basically what we have done here, and it includes two files, `train.py` and `model.py`.

`model.py` is basically the same code as here. `train.py` is more complicated because it includes:

- Saving and loading checkpoints
- Pre-trained weights
- Decaying the learning rate
- Compiling the model
- Using distributed training across multiple nodes or GPUs

## Chat-GPT

ChatGPT was first pre-trained like this on a large chunk of the internet (you can see the model parameters and data size in their paper), and then it was fine-tuned to be a chatbot.

Then they got some human evaluators to evaluate the chatbot, and used that human-feedback data to train a reward model.

Once they have the reward model, they run PPO, which is a policy-gradient reinforcement learning algorithm, to fine-tune the sampling policy, so that the answers ChatGPT generates are expected to score a high reward from the reward model.