# GPT
- GPT from scratch! What is the neural network under the hood that sequences the words? 
- GPT stands for generatively pretrained transformer. The transformer is the neural net.
- chatGPT is trained on a good chunk of internet and goes thru pre and post training. 
- We will do a character level languague model. We will use a smaller dataset, tiny shakespeare, which is a concatenation of all of Shakespeare's work in one file (~1MB).

## Attention is all you need
- Check out this 2017 landmark AI paper: [Attention is all you need](https://arxiv.org/pdf/1706.03762). 
- This proposed the Transformer architecture which revolutioned natural language processing (NLP) by relying on self-attention mechanisms.
- The Transformer ues scaled dot-product attention to weigh the importance of different words in a sequence. This enables parallelization and handling of long-range dependencies.
- This self-attention mechanism is the core innovation in the Transformer model .
- Multiple attention heads allow the model to focus on different parts of a sequence simultaneously. This captures diverse linguistic patterns.
- Positional encodings are added to input embeddings in order to preserve word order information.
- Transformer block comtains feed-forward layers and layer normalization to stabilize training.

## Self-attention steps
Given an input sequence represented as word embeddings $X$.
1. Compute Query $Q$, Key $K$, and Value $V$ metrics. 
    - Query represents what this word is looking for in other words. $Q = X W_{Q}$
    - Key represent what this word contains that other words might find useful. $K = X W_{K}$
    - Value represents the actual information in the word. $V = X W_{V}$
2. Computer attention scores (scaled dot-product attention)
    - $Scores = Q K^T$
    - Each row of scores represents how much attention each word should pay to other words.
    - This is scaled to prevent extremely high gradients.
    - $Scaled Scores = \frac{Q K^T}{ \sqrt{d _{k} } }$ where $d_{k}$ is the dimension of the key vectors.
3. Apply Softmax to get attention weights.
    - This is done to convert the scores into probabilities. The sum of attention weights for each word will be 1.
    - $Attention Weights = softmax( \frac{Q K^T}{ \sqrt{d _{k} } } )$
4. Computer the Weighted Sum of values
    - This is each word's final representation.
    - $Output = Attention Weights * V$

In [71]:
# get data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print("length of dataset in chracters:", len(text), "\n-----\n")
print("first 100 chracters:\n", text[:100], "\n-----\n")

# get unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)  # this is the numebr of possible options for next characer in sequence
print("chars:")
print(''.join(chars))
print(f'{vocab_size=}')
print("\n-----\n")

# create mapping from chracters to integers and back
# aka this is a simple "tokenizer"
# other tokenizers include:
# - Google [sentencepiece](https://github.com/google/sentencepiece). This is a subword unit level, aka not chracters, not words.
# - OpenAI's [tiktoken](https://github.com/openai/tiktoken) you can import this directly in python code
# subword encodings are popular in practice
# below is a character level encoder, small code book (65), long sequences
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]  # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: takes a list of integers, output a string
print(f'{encode("hii there cutie!")=}')
print(f'{decode(encode("hii there cutie!"))=}')
print("\n-----\n")

# encode the entire txt dataset and store it into a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(f'{data.shape=}, {data.dtype=}')
print(f'{data[:100]=}')  # direct translation of first 100 characters
# this is a massive sequnce of integers representing the data characters
# aka the entire dataset of text is now represented as a series of integers


length of dataset in chracters: 1115394 
-----

first 100 chracters:
 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 
-----

chars:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab_size=65

-----

encode("hii there cutie!")=[46, 47, 47, 1, 58, 46, 43, 56, 43, 1, 41, 59, 58, 47, 43, 2]
decode(encode("hii there cutie!"))='hii there cutie!'

-----

data.shape=torch.Size([1115394]), data.dtype=torch.int64
data[:100]=tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [72]:
# TRAIN ADN VAL DATA
# split up the data into train and validation sets
# this will allow us to understand when the model is overfitting
n = int(0.9*len(data))  # first 90% will be train data, rest will be validation data
train_data = data[:n]
val_data = data[n:]

In [73]:
# BLOCK SIZE
# we will never feed whole dataset into transformer to train it
# we train the transformer on chunks of the dataset
block_size = 8
print("first 9 characters in sequence")
print(train_data[:block_size+1])
# this has multiple examples packed into it and the transformer will train at every position
x = train_data[:block_size]
y = train_data[1:block_size+1]  # targets for each posiiton
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'When input is {context}, the target is: {target}')

first 9 characters in sequence
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
When input is tensor([18]), the target is: 47
When input is tensor([18, 47]), the target is: 56
When input is tensor([18, 47, 56]), the target is: 57
When input is tensor([18, 47, 56, 57]), the target is: 58
When input is tensor([18, 47, 56, 57, 58]), the target is: 1
When input is tensor([18, 47, 56, 57, 58,  1]), the target is: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is: 58


In [74]:
# BLOCK SIZE AND BATCH SIZE
torch.manual_seed(1337)
batch_size = 4  # how many independent sequences will we process in parallel?
block_size = 8  # the maximum context length for predictions
print(f'{batch_size=}, {block_size=}')

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # 4 numbers generated betw 0 and len(data) - block_size, aka offsets in data
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('\ninputs to transformer:')
print("each row is a chunk of the data training set")
print(f'{xb.shape=}')
print(f'{xb=}')
print('\ntargets')
print(f'{yb.shape=}')
print(f'{yb=}')

print('-----')

for b in range(batch_size):  # batch_dimension
    for t in range(block_size):  # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f'When input is {context.tolist()} the target: {target}')

# the 4x8 array contains 32 independent examples in a batch of 4

batch_size=4, block_size=8

inputs to transformer:
each row is a chunk of the data training set
xb.shape=torch.Size([4, 8])
xb=tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

targets
yb.shape=torch.Size([4, 8])
yb=tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
-----
When input is [24] the target: 43
When input is [24, 43] the target: 58
When input is [24, 43, 58] the target: 5
When input is [24, 43, 58, 5] the target: 57
When input is [24, 43, 58, 5, 57] the target: 1
When input is [24, 43, 58, 5, 57, 1] the target: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
When input is [44] the target: 53
When input is [44, 53] the target: 56
When input is [44, 53, 56] the target: 1
Wh

In [75]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        # logits are scores for next character in the sequence
        logits = self.token_embedding_table(idx)  # (B, T, C) aka (Batch, time, channel) like (batch=4, time=8 aka block size, channels=65 aka vocab size)
        
        if targets is None:
            loss = None
        else:
            # the loss is defined by negative log likelihood, also called cross entrypy 
            # loss defines the quality of the logits wrt the targets, aka how well are we predicitng the next character
            # https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html 
            # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
            # for multi dimensional input to cross entropy, channels is needed as the 2nd dimension, aka wants a (B, C, T)
            # so reshape the logits
            B, T, C = logits.shape
            logits = logits.view(B*T, C)  # stretches out the array to 2D
            # targets is of shape (B, T) and we want one dimension B*T, could also just use -1 in view for targets
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    # this function is written to be general for now
    # we fed entire sequence into model, but then looked at the last piece to predict the next, in this case
    # later - we'll have the history be used to predict the next character
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictons
            logits, loss = self(idx)  # self(idx) calls forward() function
            # focus only on the last time step, aka get last position in the time dimension, because that is all that we are using to predict the next character
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution, aka get a simple prediction for each batch example
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1), dim=1 is the time dimension
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(f'{logits.shape=}')  # torch.Size([4, 8, 65]) indicates logitis for every 4,8 position
print(f'{loss=}')  # comes out to 4.8786
idx = torch.zeros((1,1), dtype=torch.long)  # create 1by1 tensor, B 1, T 1, torch.long is an integer, zero also corresponds to a new line character
print(f'idx of {0} corresponds to character {repr(itos[0])}')
# generate works on level of batches, so index into 0th row
print("generate:", decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))  # spits out garbage since model is random now

# we have a vocab size of 65, so expected loss is -ln(1/65) wich is about 4.17
# so this tells us initial predicitons are diffuse and we are guessing wrong

logits.shape=torch.Size([32, 65])
loss=tensor(4.8786, grad_fn=<NllLossBackward0>)
idx of 0 corresponds to character '\n'
generate: 
SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ


In [76]:
# train the model
# create a PyTrch optimizer object - in makemore, we only used the simple stochastic gradient descent instead
# adam is a popular optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
# the optimizer object will optimize the parameters using the gradients

In [77]:
batch_size = 32
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    # zero out all the gradients from the previous step
    optimizer.zero_grad(set_to_none=True)
    # getting the gradients for all the parameters
    loss.backward()
    # use gradients to optimize parameters
    optimizer.step()

print(loss.item())

2.382369041442871


In [78]:
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=300)[0].tolist()))  # spits out garbage since model is random now
# this is a simple bigram model since tokens are not talking to each other. we are only looking at the very last character to predict next.
# the bigram model, training, and sample generation is summarized in bypgram.py
# now, let's get these tokens to start talking to each other --> Transformer time



lso br. ave aviasurf my, yxMPZI ivee iuedrd whar ksth y h bora s be hese, woweee; the! KI 'de, ulseecherd d o blllando;LUCEO, oraingofof win!
RIfans picspeserer hee tha,
TOFonk? me ain ckntoty ded. bo'llll st ta d:
ELIS me hurf lal y, ma dus pe athouo
BEY:! Indy; by s afreanoo adicererupa anse tecor


In [79]:
# THE MATHEMATICAL TRICK IN SELF-ATTENTION
# consider the following toy example
# we want to couple tokens in a specific way, 5th token should talk to 4th token and bach, no info from future
# simplest way to do to this is use an average of all the precedding tokens
# this is extremely weak/lossy though, aka we lose spatial information about tokens
torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time (tokens), channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [80]:
# VERSION 1: double for loop
# We want x[b,t] = mean_{i<=t} x[b, i]
# bow means bag of words, aka theres a word stored in each of 8 locations and we'll average them
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]  # (t,C), row at b, columns everything up to and including the t token
        xbow[b,t] = torch.mean(xprev, 0)  # avergaing out on time, get bag C dimensional vector to store in xbow

print(f'{x[0]=}')
print(f'{xbow[0]=}')
print("notice how x[0] and xbow[0] are the same in the first row, since taking average of one token")
print("each row of xbow is an average of all precedding elements")
# this is ineffcient though, we want to use matrix multiplication instead

x[0]=tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
xbow[0]=tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
notice how x[0] and xbow[0] are the same in the first row, since taking average of one token
each row of xbow is an average of all precedding elements


In [81]:
# get lower triangular portion of tensor given
# zeros out upper triangular portion
torch.tril(torch.ones(3,3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [82]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)  # in 1th dimension, set keepdim to True, rows will now sum to 1
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b  # (3,3) @ (3,2) -> (3, 2)
print(f'{a=}\n---\n{b=}\n---\n{c=}')
# ah, genius
# so now getting averages in incremental fashion in c

a=tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
---
b=tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
---
c=tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [83]:
# VERSION 2: batch matrix multiplication
# use batch multilpy to make a weighted aggregation and the weights are specified in the TxT array
# we do weighted sums tha take on the triangular form, so that tokens only take info from preceeding tokens
torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time (tokens), channels
x = torch.randn(B, T, C)
print(f'{x.shape}')

wei = torch.tril(torch.ones(T, T))  # wei is short for weights
wei = wei / torch.sum(wei, 1, keepdim=True)
xbow2 = wei @ x  # (T, T) @ (B, T, C) --> (B, T, T) @ (B, T, C) --> (B, T, C)

print("wei shows how much of every row we want to avergae up; note rows sum to one")
print(f'{wei.shape=}, {x.shape=}')
print(f'{wei=}')
print(f'{xbow2=}')
print(f'{torch.allclose(xbow, xbow2)=}')

torch.Size([4, 8, 2])
wei shows how much of every row we want to avergae up; note rows sum to one
wei.shape=torch.Size([8, 8]), x.shape=torch.Size([4, 8, 2])
wei=tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
xbow2=tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

     

In [84]:
# VERSION 3: USE SOFTMAX
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))  # think of wei as affinities, set as zero by us here, but this will be data dependent, some tokens will find other tokens more or less interesting to different amounts, aka affinities
print(f'wei as all zeros:\n{wei=}\n')

wei = wei.masked_fill(tril == 0, float('-inf'))  # this line is saying tokens from the past cannot communicate
print(f'mask wei with inf where tril is 0:\n{wei=}\n')

wei = F.softmax(wei, dim=1)
print("softmax exponentiates every element and divides by the sum, aka normalize")
print(f'softmax wei along 1st dimension:\n{wei=}\n')

xbow3 = wei @ x  # this is the aggregation of the values through matrix multiplication
print("xbow3 = wei @ x")
print(f'{torch.allclose(xbow, xbow3)=}')

wei as all zeros:
wei=tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

mask wei with inf where tril is 0:
wei=tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

softmax exponentiates every element and divides by the sum, aka normalize
softmax wei along 1st dimension:
wei=tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.500

# TLDR: the preview to self-attention
- you can do weighted aggregations of past elements by using matrix multiplcation of a lower traingular fashion
- elements in lower traingular part tells you how much each element fuses
- we actually don't want wei to be init to all zeros though, we want it to be data dependent
- we want to gather info from the past in data dependent way --> self-attention
- **every single token will emit two vectors, a query vector and a key vector, aka what you are looking for and what do you contain**
- affinities will be obtained with a dot product of query and key; that dot product becomes wei

# Attention
- Attention is a communication mechanism between nodes. It can be seen as nodes in a directed graph. Every node has a vector of information and it gets to aggregate information via a weighted sum from all the nodes that point to it. This is done in a data dependent manner. Attention can be applied to any arbitrary directed graph.
- Our graph has 8 nodes since block size is 8. The first points to itself. The second node points to itself and the first node. etc.
- Notice there is no notion of space. That's why we encode nodes to a specific position. This is different from convolution which has a specific layout in space. Attention is just a set of vector out there in space, and you have to add positional encodings to know the space.
- In batch matrix multiplication, each example across batch dimensions is processed completely independently and never "talk" to each other. like separate pools. like 4 (batch size) separate pools of 8 (block size) nodes.
- In an "encoder" attention block, just delete the single line that does the masking with trill, allowing all tokens to communicate. For different use case, maybe all tokens/nodes will all talk to each other. Maybe sentiment analysis. 
- In a "decoder" attention block, keep the triangular masking. This is usually used in autoregressive settings, like modeling.
- "self-attention" means that the keys and values are produced from the same source x as queries. Node are looking at each other. In "cross-attention", key and value come from a separate external source, aka a separate source of nodes that we want to pool informaiton from.
- "Scaled" dot-product attention includes additional division of wei by 1/sqrt(head_size). This makes it so when input Q, K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much.

In [85]:
# VERSION 4: self-attention! for a single individual head
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention 
head_size = 16  # this is a hyperparameter
key = nn.Linear(C, head_size, bias=False)  # this will apply matrix multiply with some fixed weights, bias is False
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# forward the model on x, no communication occurs here
k = key(x)  # (B, T, 16)
q = query(x)  # (B, T, 16)
# now all the queries will dot product with all the keys, aka communication
# for every row of B, we'll have T^2 affinities
# transpose the last 2 dimensions of k
wei = q @ k.transpose(-2, -1)  # (B, T, 16) @ (B, 16, T) --> (B, T, T)

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=1)  # exponentiate and normalize
v = value(x)  # v is the vector we aggregate, instead of the raw x
# think of x as private information to the head
# for the purpose of a single head, v is the think that gets aggregated for this single head between the different nodes
out = wei @ v

print(f'{out.shape}')
print(f'{wei=}')  # these are no longer uniform

torch.Size([4, 8, 16])
wei=tensor([[[0.0248, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0052, 0.0091, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0521, 0.0135, 0.2482, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.3171, 0.0214, 0.1642, 0.1188, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0412, 0.0487, 0.1046, 0.0742, 0.2000, 0.0000, 0.0000, 0.0000],
         [0.1060, 0.5347, 0.2059, 0.1030, 0.7402, 0.0192, 0.0000, 0.0000],
         [0.4298, 0.3409, 0.1769, 0.2027, 0.0480, 0.8472, 0.2329, 0.0000],
         [0.0238, 0.0316, 0.1002, 0.5013, 0.0117, 0.1336, 0.7671, 1.0000]],

        [[0.0443, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0042, 0.0375, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0560, 0.0210, 0.2496, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.3679, 0.1441, 0.4929, 0.0438, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0088, 0.1052, 0.0604, 0.5847, 0.2046, 0.0000, 0.0000, 0.000

In [86]:
# SCALED DOT PRODUCT ATTENTION

# if k and q are unit gaussian, and do wei naiively, just q @ k.transpose(-2,-1), then wei will be on the order of headsize.
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1)
print("naiive wei computation:")
print(f'{k.var()=}')
print(f'{q.var()=}')
print(f'{wei.var()=}') 
print("notice that variance of wei is 16 or 17")
print(f'{torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)=}')

# now scale the attention by 1/sqrt of head size
# then the vaiance of wei will be 1, it will be preserved
print("\nnow let' do scaled attention!!!:")
wei = q @ k.transpose(-2, -1)  * head_size**-0.5
print(f'{k.var()=}')
print(f'{q.var()=}')
print(f'{wei.var()=}') 
print("notice that the variance of wei is about 1")
print("this is important because wei feeds into softmax, so we need wei to be fairly diffuse")
print("if wei takes on very negative or positive numbers, then softmax converges to very one hot vectors ")
print("softmax sharpens towards the max, and softmax will be way too peaky, and then you aggregate from a single node")
print(f'{torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)=}')
print(f'{torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)=}')


naiive wei computation:
k.var()=tensor(1.0449)
q.var()=tensor(1.0700)
wei.var()=tensor(17.4690)
notice that variance of wei is 16 or 17
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)=tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

now let' do scaled attention!!!:
k.var()=tensor(1.0449)
q.var()=tensor(1.0700)
wei.var()=tensor(1.0918)
notice that the variance of wei is about 1
this is important because wei feeds into softmax, so we need wei to be fairly diffuse
if wei takes on very negative or positive numbers, then softmax converges to very one hot vectors 
softmax sharpens towards the max, and softmax will be way too peaky, and then you aggregate from a single node
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)=tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)=tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


# Multi-head attention
- apply multiple attentions in parallel and concatenating the results

# Feed forward layer
- follow self attention heads by a feed forward layer, aka let nodes talk to each other than think
- feed forward layer is designed as a linear layer followed by a non-linearity, aka a simple multi-layer perceptron

# Blocks
- block contains communication (self attention heads) and computation (feed forward layer)

# Deep neural networks
- deep neural networks require some optimizations
- optimization 1: skip connections
- optimization 2: layer norm
- optimization 3: dropout

# Optimization 1: Addition of Residual Connection
- Skip connections, also called residual connections
- see paper "Deep Residual Learning for Image Recognition" which introduced the concept
- you transform the data and then have a skip connection with addition to previous features
- in other words, you have a residual pathway, you fork off from it, perfrom some computation, and then come back to the residual pathway and add to it, so you go from inputs to targets via plus and plus and plus
- this is useful because during back propagation, addition distributes gradients equally to both of its branches that fed as the input
- so gradients from loss hop throgh every addition node all the way to the input, and also fork off into the residual blocks
- its a gradient super highway, unimpeded
- residual pathways contirbute very little intially, and start to contribute more with time into training

# Optimization 2: LayerNorm
- this is implemennted in PyTorch
- see paper "Layer Normalization"
- this is similar to BatchNorm, which made sure that across batch dimensions, any indivdual neron had unit gaussian distribution, 0 mean and 1 standard deviation output
- LayerNorm normalizes columns (while BatchNorm normalizes rows)
- there is preNorm and postNorm formualations 

# Optimization 3: Dropout
- dropput is something you can add right before the residual connection back into the residual pathway
- this randomly prevents some of the nodes from communicating
- see 2014 paper: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"
- it takes a neural network, eveyr forward backward path, shuts off some neurons and trains without then, because this change every time, it trains on an ensemble of sub networks
- think of it as a regularization technique

# Overfitting
- when train loss start getting ahead of validation loss, aka train loss smaller

In [87]:
class LayerNorm:

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)  # like batch norm gain, set to ones by default
        self.beta = torch.zeros(dim)  # like batch norm bias, set to zeros by default

    def __call__(self, x):
        # forward pass
        xmean = x.mean(1, keepdim=True)  # batch mean
        xvar = x.var(1, keepdim=True, unbiased=True)  # batch variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]
    
torch.manual_seed(1337)
module = LayerNorm(100)
x = torch.randn(32, 100)  # batch size 32 of 100-dimensional vectors
x = module(x)

print(f'{x.shape=}\n')
print("mean, std of one feature across all batch inputs")
print("look at each column")
print(f'{x[:,0].mean()=}, {x[:,0].std()=}')
print("mean, std of a single input from the batch of its features")
print("look at each row")
print(f'{x[0,:].mean()=}, {x[0,:].std()=}')

# the difference between BatcHNorm and LayerNorm
# Batch Norm normalizes on the rows
# xmean = x.mean(0, keepdim=True),  xvar = x.var(0, keepdim=True, unbiased=True)
# LayerNorm normalizes on the columns
# xmean = x.mean(1, keepdim=True),  xvar = x.var(1, keepdim=True, unbiased=True)
# we also don't need to update buffers.
# there is no distinction between training and test time
# we don't need momentum


x.shape=torch.Size([32, 100])

mean, std of one feature across all batch inputs
look at each column
x[:,0].mean()=tensor(0.1469), x[:,0].std()=tensor(0.8803)
mean, std of a single input from the batch of its features
look at each row
x[0,:].mean()=tensor(2.3842e-09), x[0,:].std()=tensor(1.0000)


# Performance

## Bigram model
output from running *bigram.py* on my laptop
```
step 0: train loss 4.73047, vall loss 4.7241
step 500: train loss 4.18133, vall loss 4.1848
step 1000: train loss 3.73513, vall loss 3.7420
step 1500: train loss 3.38599, vall loss 3.3942
step 2000: train loss 3.12636, vall loss 3.1296
step 2500: train loss 2.94339, vall loss 2.9434
step 3000: train loss 2.80013, vall loss 2.8070
step 3500: train loss 2.71062, vall loss 2.7112
step 4000: train loss 2.64477, vall loss 2.6354
step 4500: train loss 2.59551, vall loss 2.5975

GofiO:
Xro sick's q-etcichors lNSKIUWLLJ$ deposicea!
SGINCAbor mealintimatede ser movis non. h;g oR a s, wind ngulffove ou!
SSe na iravak:

Buths d wans thasth bellout eshin,Ink'de ander yGo,
JUEOWenqpor t th. lo CK$Ff eawhinr tonfur prerdy higse lom;Ay fr ILESiek'XEGHory bagure med anon:
in oThis y'ld ben'd pond,N3Rocos$zFlim mereril,
CENToPOLve!
Fit CO:

YMI&qupow ced -
jouse, toe ic parl gum hangr t w fune hen seringououpeat? UveCOUDEP&y!qmmed t myeaverentcoV&ule whivf d.
D n;
'?sp
Thad!zPHe!
```
hyperparameters used:
```
batch_size = 32  # number of independent sequences to process in parallel
block_size = 8  #  the maximum context length for predictions
max_iters = 5000
learning_rate = 1e-3 
```
## Decoder Transforer model
output from **transformer.py** on my laptop
```
step 0: train loss 4.29086, vall loss 4.2774
step 500: train loss 2.43082, vall loss 2.4306
step 1000: train loss 2.30982, vall loss 2.3111
step 1500: train loss 2.21267, vall loss 2.2409
step 2000: train loss 2.18911, vall loss 2.1927
step 2500: train loss 2.14461, vall loss 2.1689
step 3000: train loss 2.11362, vall loss 2.1596
step 3500: train loss 2.08868, vall loss 2.1466
step 4000: train loss 2.06669, vall loss 2.1134
step 4500: train loss 2.05117, vall loss 2.0971


On thell ow; if hear of you whouse wiche hereight this qud, proot b ind you he ble to tim, and fard seesd to stat Yother make to furth minlies ospited:
Frourn ip are, now Pod ffordor fores?

ORUTUCE I woy so, for nish:
To do word, so mift swet hum, sawell you promiorso thel spor!
You HENR:
The is souls seabd encais hof here.
KING RA:
Whal
I we the ane,ught coning that wudan Howe eive
Ale'ny is keard of lome a swilld must lust pearne't, and fome, whould I having to me
'go go tiey thain ligese:
U
```
hyperparameters used:
```
batch_size = 32  # number of independent sequences to process in parallel
block_size = 8  #  the maximum context length for predictions
max_iters = 5000
learning_rate = 1e-3  # self-attention doesn't tolerate high learning rates
n_embd = 32
n_head = 6
n_layer = 6
dropout = 0.2
```

## Bigger Decoder Transformer Model - training on GPU
EC2 instance was spun up (see README) with the **gd4n.xlarge** instance type. 

### Trail 1 with profiler, 1000 iterations
hyperparameters used:
```
batch_size = 48  # number of independent sequences to process in parallel
block_size = 50  #  the maximum context length for predictions
max_iters = 1000
eval_interval = 200
learning_rate = 3e-4  # self-attention doesn't tolerate high learning rates
```
output:
```
step 0: train loss 4.23446, vall loss 4.2355
step 200: train loss 3.01987, vall loss 3.0424
step 400: train loss 2.70946, vall loss 2.7206
step 600: train loss 2.59565, vall loss 2.5969
step 800: train loss 2.52830, vall loss 2.5304

WRLave, thethes iudt er bdehere oto Ralstobe be ara hot aiser tureseme ths wito bte t wis balery me oparireareu
F e. he hady!
Grt malither a, theseyher be rouchor te hariy Rm m pthase fe mrarhe,,
Te withad beut wionte:
Mar wengh t cho boveathan l inchathe w thiee; aig d,dildit Ge? omovedrshe be y mud nin
Blelr. batshe t tothorede.
TI bere met nouvONAHhe IOTENedeathan,
Vid, t$ fos o it hisaffob 

E!P thareroal h whamitit th f ace thathevis m thept canethither obsde ale sto lere d tho pad,
-ererin
```
output from profiler:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          MODEL_FORWARD         0.00%       0.000us         0.00%       0.000us       0.000us     780.734ms       210.70%     780.734ms      26.024ms            30  
                                               aten::mm         5.96%     215.166ms         8.50%     306.660ms      28.238us     197.642ms        53.34%     197.648ms      18.200us         10860  
       autograd::engine::evaluate_function: MmBackward0         0.71%      25.686ms         6.79%     244.958ms      75.604us       0.000us         0.00%     166.774ms      51.474us          3240  
                                            MmBackward0         0.76%      27.467ms         6.08%     219.272ms      67.677us       0.000us         0.00%     166.774ms      51.474us          3240  
                         volta_sgemm_32x32_sliced1x4_nt         0.00%       0.000us         0.00%       0.000us       0.000us     156.101ms        42.13%     156.101ms      45.644us          3420  
                               Optimizer.step#Adam.step         0.00%       0.000us         0.00%       0.000us       0.000us      99.523ms        26.86%      99.523ms       3.317ms            30  
                                          ProfilerStep*         1.17%      42.315ms        69.42%        2.505s      80.804ms       0.000us         0.00%      84.931ms       2.740ms            31  
                                          MODEL_FORWARD         5.15%     185.763ms        22.59%     814.999ms      27.167ms       0.000us         0.00%      82.771ms       2.759ms            30  
                                              aten::bmm         3.54%     127.817ms         4.82%     173.906ms      26.837us      74.220ms        20.03%      74.234ms      11.456us          6480  
      autograd::engine::evaluate_function: BmmBackward0         0.84%      30.300ms         4.68%     168.747ms      78.124us       0.000us         0.00%      52.554ms      24.330us          2160  
                                           BmmBackward0         0.42%      14.987ms         3.84%     138.447ms      64.096us       0.000us         0.00%      52.554ms      24.330us          2160  
                                           aten::matmul         0.86%      31.076ms         7.81%     281.917ms      52.207us       0.000us         0.00%      40.990ms       7.591us          5400  
                                   volta_sgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us      30.500ms         8.23%      30.500ms      14.120us          2160  
                                   volta_sgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us      30.238ms         8.16%      30.238ms      13.999us          2160  
                                           aten::linear         0.28%       9.944ms         6.16%     222.242ms      58.331us       0.000us         0.00%      28.742ms       7.544us          3810  
    autograd::engine::evaluate_function: AddmmBackward0         0.27%       9.575ms         2.08%      75.114ms     131.779us       0.000us         0.00%      19.510ms      34.228us           570  
                         volta_sgemm_32x32_sliced1x4_tn         0.00%       0.000us         0.00%       0.000us       0.000us      19.305ms         5.21%      19.305ms       5.958us          3240  
                                  volta_sgemm_32x128_nn         0.00%       0.000us         0.00%       0.000us       0.000us      14.952ms         4.04%      14.952ms       4.119us          3630  
                                   volta_sgemm_64x64_tn         0.00%       0.000us         0.00%       0.000us       0.000us      13.482ms         3.64%      13.482ms       6.242us          2160  
autograd::engine::evaluate_function: NativeLayerNorm...         0.22%       7.864ms         0.89%      32.136ms      82.401us       0.000us         0.00%      13.171ms      33.771us           390  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.608s
Self CUDA time total: 370.551ms
```
### Trail 2 with profiler, bigger network, with profiler
hyperparameters used:
```
batch_size = 48  # number of independent sequences to process in parallel
block_size = 50  #  the maximum context length for predictions
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4  # self-attention doesn't tolerate high learning rates
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # ability to run on a gpu if present, and it'll be faster
eval_iters = 200
n_embd = 120
n_head = 6
n_layer = 6
dropout = 0.2  # 20% disabled and drop to zero
```
output:
```
step 0: train loss 4.27598, vall loss 4.2730
step 500: train loss 2.25223, vall loss 2.2637
step 1000: train loss 2.00600, vall loss 2.0572
step 1500: train loss 1.87084, vall loss 1.9646
step 2000: train loss 1.77201, vall loss 1.8899
step 2500: train loss 1.71431, vall loss 1.8538
step 3000: train loss 1.65738, vall loss 1.8094
step 3500: train loss 1.62083, vall loss 1.7834
step 4000: train loss 1.59883, vall loss 1.7647
step 4500: train loss 1.56548, vall loss 1.7389

This in ganess are after! what me wran so olded:
And sweet and fallows, What: you you turse,
As my leess-world.

RANT:

Alacheld Retmed pace to whick chan sabull and out
Thy day the bloody what gaques, and come, my hour lord?

HENRY KING RIVENCENTIO:
If that fee of hoppie insurion Compperane.

SICINGS:
wither, I dow, Herern'd he boing insale undeso hearts
That had but fright the gramen hang? Which sive
Thee make Mercous upon that I am thing
this hild Romp.

SANDINES:
Then look thus, passiful, 't
```
output from profiler:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          MODEL_FORWARD         0.00%       0.000us         0.00%       0.000us       0.000us     259.484ms       189.43%     259.484ms      25.948ms            10  
                                               aten::mm         6.02%      73.228ms         8.87%     107.852ms      29.793us      49.878ms        36.41%      49.898ms      13.784us          3620  
                                          ProfilerStep*         1.14%      13.827ms        69.38%     843.911ms      76.719ms       0.000us         0.00%      48.788ms       4.435ms            11  
                                          MODEL_FORWARD         5.14%      62.552ms        22.27%     270.815ms      27.082ms       0.000us         0.00%      43.885ms       4.388ms            10  
                               Optimizer.step#Adam.step         0.00%       0.000us         0.00%       0.000us       0.000us      35.001ms        25.55%      35.001ms       3.500ms            10  
                                              aten::bmm         3.51%      42.728ms         4.75%      57.775ms      26.748us      27.452ms        20.04%      27.452ms      12.709us          2160  
       autograd::engine::evaluate_function: MmBackward0         0.70%       8.563ms         7.31%      88.938ms      82.350us       0.000us         0.00%      23.337ms      21.608us          1080  
                                            MmBackward0         0.76%       9.268ms         6.61%      80.374ms      74.421us       0.000us         0.00%      23.337ms      21.608us          1080  
                                           aten::linear         0.26%       3.159ms         5.98%      72.744ms      57.279us       0.000us         0.00%      20.630ms      16.244us          1270  
                                           aten::matmul         0.84%      10.227ms         7.67%      93.253ms      51.807us       0.000us         0.00%      19.345ms      10.747us          1800  
      autograd::engine::evaluate_function: BmmBackward0         0.79%       9.655ms         4.61%      56.089ms      77.902us       0.000us         0.00%      19.186ms      26.647us           720  
                                           BmmBackward0         0.41%       5.032ms         3.82%      46.434ms      64.492us       0.000us         0.00%      19.186ms      26.647us           720  
    autograd::engine::evaluate_function: AddmmBackward0         0.25%       3.028ms         2.07%      25.215ms     132.710us       0.000us         0.00%      19.097ms     100.509us           190  
                                  volta_sgemm_128x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us      16.167ms        11.80%      16.167ms      12.730us          1270  
                                         AddmmBackward0         0.15%       1.863ms         1.26%      15.286ms      80.455us       0.000us         0.00%      15.483ms      81.487us           190  
                                   volta_sgemm_64x64_nt         0.00%       0.000us         0.00%       0.000us       0.000us      13.419ms         9.80%      13.419ms      17.204us           780  
                         volta_sgemm_64x32_sliced1x4_nt         0.00%       0.000us         0.00%       0.000us       0.000us      12.038ms         8.79%      12.038ms      11.146us          1080  
                         volta_sgemm_32x32_sliced1x4_tn         0.00%       0.000us         0.00%       0.000us       0.000us      11.059ms         8.07%      11.059ms      10.240us          1080  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      10.571ms         7.72%      10.571ms       8.390us          1260  
                                   volta_sgemm_64x64_nn         0.00%       0.000us         0.00%       0.000us       0.000us      10.262ms         7.49%      10.262ms      14.252us           720  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.216s
Self CUDA time total: 136.978ms
```

## Summary
- The bigram model on cpu achieved a validation loss of about 2.6
- The decoder transformer model on cpu achieved a significantly better loss of about 2.1
- The decoder transformer model's hyperparameters was updated to create an even bigger network. This resulted in a loss of about 1.7.