<a href="https://colab.research.google.com/github/rileyburns707/Shakespeare_GPT/blob/main/building_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# Download the Tiny Shakespeare dataset; we will train the model on it.
# The file is saved as input.txt.

--2024-06-06 16:34:00--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-06-06 16:34:00 (17.0 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# Read the dataset into a single string so we can inspect it.
# 'input.txt' is the dataset file downloaded above.
# 'r' opens the file in read mode.
# encoding='utf-8' tells Python to decode the file as UTF-8.
# f.read() reads the entire content of the file into a variable named text.

with open('input.txt', 'r', encoding='utf-8') as f:
  text = f.read()

# The open function returns a file object, which is assigned to the variable f. The with statement
# ensures that the file is closed automatically when the block inside it is exited
# After this code executes, the text variable will contain all the text from input.txt,
# and the file will be closed automatically, ensuring there are no resource leaks.

In [None]:
print("length of dataset in characters: ", len(text))
# prints the number of characters in the dataset: roughly 1.1 million

length of dataset in characters:  1115394


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
# here are all the unique characters that occur in this text

chars = sorted(list(set(text))) # all the unique characters that occur in the dataset, sorted
# text is a Python string, i.e. a sequence of characters. set(text) collects the unique characters
# (in no particular order), list() turns the set into a list, and sorted() puts them in order

vocab_size = len(chars) # the number of them. The possible elements of the sequences

print(''.join(chars)) # prints all the characters
print(vocab_size) # says the number of characters. In this case it is 65

# The characters printed below are all the possible characters that occur in the dataset



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
# create a mapping from characters to integers

# stoi stands for 'string to integer'. Here it is a dictionary (not a function) that maps each character to an integer
stoi = { ch:i for i, ch in enumerate(chars) } # create look up table from character to integer
# creates a dictionary (stoi) that maps each character in the chars list to a unique integer index
# The enumerate function is used to get both the index and the character
# The dictionary comprehension { ch:i for i, ch in enumerate(chars) } iterates over each character in chars,
# assigning the character (ch) as the key and its index (i) as the value.

itos = { i:ch for i, ch in enumerate(chars) } # create look up table from integer to the character
# creates a dictionary (itos) that maps each unique integer index back to the corresponding character.
# the dictionary comprehension { i:ch for i, ch in enumerate(chars) } assigns the index (i) as the key
# and the character (ch) as the value.

encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
# defines a lambda (a small one-line function) and assigns it to the name encode; it takes a string (s) as input and returns a list of integers
# [stoi[c] for c in s] iterates over each character (c) in the input string (s),
# looks up the integer index for that character in the stoi dictionary, and
# constructs a list of these integer indices.

decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
# [itos[i] for i in l] iterates over each integer index (i) in the input list (l), looks up the corresponding
# character in the itos dictionary, and constructs a list of these characters.
# ''.join(...) concatenates these characters into a single string.

print(encode("hii there"))
print(decode(encode("hii there")))
# This checks that the original string survives the round trip to a list of integers and back.
# We now effectively have a tokenizer!!

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [None]:
# Now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org

data = torch.tensor(encode(text), dtype=torch.long)
# takes all the text in tinyshakespeare, encodes it, then wraps it in a torch.tensor; this gives us the data tensor
# A tensor is a multi-dimensional array (similar to NumPy arrays) and is the fundamental data structure in PyTorch
# encode(text) produces a list of integers
# torch.tensor(encode(text), dtype=torch.long) converts this list into a 1-dimensional tensor of integers
# The argument dtype=torch.long specifies that the data type of the tensor elements should be long integers (64-bit integers).

print(data.shape, data.dtype)
# data.shape returns the dimensions of the tensor
# Since encode(text) produces a 1-dimensional list, data.shape will be a tuple with one element representing the length of the encoded text.
# data.dtype returns the data type of the tensor elements, which will be torch.int64 (another name for torch.long).

print(data[:1000]) # to the GPT, the 1000 characters we looked at earlier look like this

# this is a one-to-one translation of the first 1000 characters we printed above

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [None]:
# now split the data into train and validation sets (also known as a train/test split)
n = int(0.9*len(data)) # first 90% will be train, the rest is validation
train_data = data[:n]
val_data = data[n:] # held-out data used to check how well the model generalizes. Don't train on this data

In [None]:
block_size = 8 # maximum context length; we train on small chunks because feeding in the whole text at once is computationally expensive
train_data[:block_size+1] # first 9 characters of the training set, which contain 8 training examples
# one chunk packs several examples because each position predicts the character that follows it.
# for example, in the context of '18', '47' comes next
# in the context of '18' and '47', '56' comes next, and so on


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [None]:
# spells out in code what I just explained
# prints the 8 examples hidden in the chunk of 9 characters
x = train_data[:block_size] # inputs to the transformer. 1st block_size characters
y = train_data[1:block_size+1] # targets for each position. Next block_size characters. Offset by 1
for t in range(block_size): # iterating over the block_size of 8
  context = x[:t+1] # all the characters in x up to 't' including 't'
  target = y[t] # always the t'th character
  print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


Batch Dimension: The batch dimension allows processing multiple sequences in parallel. Instead of processing one sequence at a time, you process a set of sequences simultaneously. This is more efficient and makes better use of modern hardware, which is optimized for parallel computations. In this example, batch_size = 4 means you are working with 4 sequences at once, each of length block_size = 8.


In [None]:
# generalize what is above and introduce a batch dimension

torch.manual_seed(1337)
# sets the random seed to 1337 for PyTorch's random number generator
# ensures that the random numbers generated in the code are the same each time you run it


batch_size = 4 # determines how many sequences of data will be processed in parallel.
# A batch size of 4 means we'll be working with 4 sequences simultaneously.

block_size = 8 # defines the length of each sequence.
# Each sequence will have a context length of up to 8 characters for prediction purposes.

# Defines a function to generate batches of data
def get_batch(split):
  # generate a small batch of data of inputs x and targets y
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,)) # gets 4 numbers that are randomly generated between 0 and len(data) - block_size
  x = torch.stack([data[i:i+block_size] for i in ix]) # first block size characters starting at i
  y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # offset by 1 of x
  return x, y

# split: Indicates whether to use training data (train_data) or validation data (val_data).
# data: Selects the appropriate dataset based on the split parameter.
# ix: Generates a tensor of random starting indices.
      # generates batch_size random integers between 0 and len(data) - block_size.
      # These integers represent the starting points of the sequences.
# x: Creates a tensor of input sequences by stacking slices of the data from the indices in ix to ix + block_size.
# y: Creates a tensor of target sequences by stacking slices of the data from ix + 1 to ix + block_size + 1.
      # These are the next characters that the model should predict.

xb, yb = get_batch('train') # xb (inputs) and yb (targets) are tensors with shapes (batch_size, block_size)
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

# we get an input matrix and a target matrix. Stacking the sequences as rows of a 2D tensor makes the batch easy to read

print('----')

# spell out in code what I just explained
for b in range(batch_size): # batch dimension
  for t in range(block_size): # time dimension
    context = xb[b, :t+1]
    target = yb[b,t]
    print(f"When input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
When input is [24] the target: 43
When input is [24, 43] the target: 58
When input is [24, 43, 58] the target: 5
When input is [24, 43, 58, 5] the target: 57
When input is [24, 43, 58, 5, 57] the target: 1
When input is [24, 43, 58, 5, 57, 1] the target: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
When input is [44] the target: 53
When input is [44, 53] the target: 56
When input is [44, 53, 56] the target: 1
When input is [44, 53, 56, 1] the target: 58
When input is [44, 53, 56, 1, 58] the target: 46
When input is [44, 53

In [None]:
print(xb) # our input to the transformer

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


In [None]:
# Bigram language model

# import the PyTorch modules
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337) # for reproducibility

class BigramLanguageModel(nn.Module): # constructing a Bigram Language Model



  def __init__(self, vocab_size):
    super().__init__()
    # each token directly reads off the logits for the next token from a lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)


  # computes the logits and, if targets are given, the loss that measures the quality of the predictions
  def forward(self, idx, targets = None): # the input is named idx. targets is optional, so it defaults to None

    # idx and targets are both (B,T) tensor of integers
    logits = self.token_embedding_table(idx) # (B,T,C): batch by time by channel tensor, with batch = 4, time = 8, channel = vocab_size = 65
            # plucks out the rows of the embedding table for each index (24, 43, ...), arranges them in a (B,T,C) tensor,
            # and interprets them as the logits, which are the scores for the next character in the sequence
            # we are predicting what comes next based on the individual identity of a single token

    if targets is None: # if we have targets we provide them and get a loss. If we have no targets it will just get the logits
      loss = None
    else:
        B,T,C = logits.shape # unpacks the three dimensions
        logits = logits.view(B*T, C) # 2D tensor. Stretches out the original tensor because F.cross_entropy expects the channel dimension second
        targets = targets.view(B*T) # 1D tensor

        loss = F.cross_entropy(logits, targets) # measures the quality of the logits with respect to the targets
              # the logit at the correct target's index should be a very high number, and all the other logits should be low

    return logits, loss

    # Generates from the model. generate extends idx from (B, T) to (B, T+1), (B, T+2), ...
    # It continues every sequence in the batch along the time dimension, one token at a time,
    # and does that for max_new_tokens steps

  def generate(self, idx, max_new_tokens):
      # idx is (B,T) array of indices in the current context
      for _ in range(max_new_tokens):
          # get the predictions
          logits, loss = self(idx) # loss is ignored here

          # focus only on the last time step by taking the last element in the time dimension
          logits = logits[:, -1, :] # becomes (B, C)

          # apply softmax to get probabilities
          probs = F.softmax(logits, dim = -1) # (B, C)

          # sample from the distribution
          idx_next = torch.multinomial(probs, num_samples = 1) # (B, 1)
              # samples from probabilities and ask PyTorch to give us 1 sample
              # this will give us a single prediction for what comes next

          # append sampled index to the running sequence
          idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)
              # the sampled token is concatenated onto the previous idx along
              # dimension 1 (the time dimension), which gives a (B, T+1) tensor
      return idx
    # right now the generate function is overkill: a bigram model only needs the character right before the
    # prediction, yet we feed in all the previous characters. Later the model will use the history, so this
    # setup will make sense; for now it is just a good draft



m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb) # calling the Bigram Language Model and passing the inputs and targets
print(logits.shape)

print(loss) # loss is 4.87. A uniform guess over 65 characters would give -ln(1/65) = 4.17.
          # this tells us the initial predictions are not perfectly diffuse, so we have some entropy
          # The loss measures the quality of the predictions

idx = torch.zeros((1,1), dtype=torch.long)
      # batch=1, time=1, so this creates a 1x1 tensor holding a 0
      # dtype is the data type, here a 64-bit integer
      # 0 is how we start the generation. Index 0 is the newline character, which is a reasonable thing to feed in at the start

print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
      # starting from idx, we ask for 100 new tokens
      # tolist converts the tensor to a plain Python list that can be fed into the decode function above
      # the output is garbage because the model is still totally random. So we want to train this model

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
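
As a sanity check on the 4.17 figure in the comment above: if the model gave equal probability to each of the 65 characters, the cross-entropy would be -ln(1/65). A minimal sketch in plain Python; the helper name `uniform_baseline_loss` is made up for this illustration and is not part of the notebook.

```python
import math

def uniform_baseline_loss(vocab_size):
    """Cross-entropy (in nats) of a model that gives every token probability 1/vocab_size."""
    return -math.log(1.0 / vocab_size)

baseline = uniform_baseline_loss(65)
print(f"{baseline:.4f}")  # 4.1744
```

The untrained model's 4.88 is above this baseline, which is consistent with the random initial logits not being perfectly uniform.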


In [None]:
# training the model so it is less random

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3) # lr = learning rate

In [None]:
# will take the gradients and update the parameters using the gradients

# typical training loop
batch_size = 32
for steps in range(100): # for some number of steps

  xb, yb = get_batch('train') # we are sampling a new batch of data

  logits, loss = m(xb,yb) # evaluating the loss
  optimizer.zero_grad(set_to_none=True) # zeroing out all the gradients from the previous step
  loss.backward() # getting the gradients for all the parameters
  optimizer.step() # using those gradients to update our parameters

  print(loss.item())

4.704006195068359
4.721118927001953
4.653193473815918
4.706261157989502
4.780904293060303
4.751267910003662
4.8395490646362305
4.667973041534424
4.743716716766357
4.774043083190918
4.6908278465271
4.789142608642578
4.61777925491333
4.650947093963623
4.886447429656982
4.703796863555908
4.757591724395752
4.655108451843262
4.709283828735352
4.6745147705078125
4.760501384735107
4.7892632484436035
4.653748512268066
4.6619181632995605
4.673007488250732
4.66577672958374
4.7301106452941895
4.755304336547852
4.712186813354492
4.745501518249512
4.726755619049072
4.735108375549316
4.777461051940918
4.643350601196289
4.6651835441589355
4.79764461517334
4.717412948608398
4.683647155761719
4.81886100769043
4.613771915435791
4.573785781860352
4.560741901397705
4.81563138961792
4.6061553955078125
4.619696140289307
4.725419521331787
4.650487899780273
4.5941481590271
4.7202863693237305
4.699342250823975
4.6724138259887695
4.727972984313965
4.66152286529541
4.616766929626465
4.599857807159424
4.653340339

In [None]:
# same as above, but with more steps and printing only at the end
batch_size = 32
for steps in range(1000):

  xb, yb = get_batch('train')

  logits, loss = m(xb,yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

print(loss.item())

# Run 1 outputted 3.6
# Run 2 outputted 3.06
# Run 3 outputted 2.68
# so it is optimizing

3.6380467414855957


In [None]:
# same as above, but with 10000 steps, since rerunning the previous cell 10 times was silly
batch_size = 32
for steps in range(10000):

  xb, yb = get_batch('train')

  logits, loss = m(xb,yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

print(loss.item())

# Output was 2.5

2.4199717044830322


In [None]:
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))



weangond
OMave wap

I RO:
Banleenoalit-blt
INRon

UM: nd kngonesll;
O: pa heore 'ga llis?-sur inidi
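
The sampling step inside `generate` (softmax over the last-step logits, then drawing one index from the resulting distribution) can be sketched in plain Python. This is only an illustration: `random.choices` stands in for `torch.multinomial`, and the three logits are made-up numbers, not values from the model.

```python
import math
import random

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(1337)       # seeded so the run is repeatable
toy_logits = [2.0, 0.5, -1.0]   # hypothetical scores for a 3-token vocabulary
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_next(toy_logits, rng)] += 1

print([round(p, 3) for p in softmax(toy_logits)])
print(counts)  # index 0 is drawn most often, index 2 least often
```

Sampling, rather than always taking the most likely index, is why the generated text differs from run to run unless the seed is fixed.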


In [None]:
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=400)[0].tolist()))


ANGO:
He rthay n thavee
Sw s serer Fofow.
Houspathe t:
Mind fit.
DUKINoceamy hun.
CKIUShorst onre t ache bar, simed?
And me theluse BHENurind-g'sto f w m CK:
YCESI fatass mbre lious ave
Wer'dor' wod y:

Henkns ges wise we me y to elil'doug p in t her spalisusin t wndalu?Y!

CKINENGLOFrkeang-lumod n odas ine a! thayayor hannd t; frat.
OLArZAUSum,
s I f pin hondecharvyouke helldid t we keicetlot lll


We can see we are starting to get something reasonable-ish. It is a dramatic improvement over the gibberish above, but still not full sentences. This is a very, very simple model since the tokens are not talking to each other: it only looks at the last character to make the prediction. So now we want the tokens to talk to each other, which will kick off the transformer!! :) You are doing a great job, you have made substantial progress.
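
To make "only looks at the last character" concrete, the same idea can be sketched without a neural network: count how often each character follows each other character, then normalise the counts into probabilities. This is a count-based stand-in for illustration, not the trained `BigramLanguageModel` above; the toy string and the names `bigram_counts` and `next_char_probs` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """Map each character to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_probs(counts, prev):
    """Normalise the follow-up counts for `prev` into probabilities."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}

toy_text = "hello hello help"   # made-up text, not the Shakespeare data
counts = bigram_counts(toy_text)
print(next_char_probs(counts, "l"))  # {'l': 0.4, 'o': 0.4, 'p': 0.2}
```

Whatever came before the last character has no effect on the prediction, which is the limitation that self-attention is meant to remove.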







---

Took a detour to understand the mathematical trick in self-attention. I have created a separate Google Colab notebook for that. It is located in the GitHub repository.

Brief summary:
You can do weighted aggregations of your past elements by multiplying with a lower-triangular matrix. The entries in the lower-triangular part tell you how much each past element contributes to the aggregation.

---
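
A small plain-Python sketch of that trick, with made-up numbers: row t of the weight matrix holds 1/(t+1) for positions 0..t and 0 for later positions, so multiplying by it gives each time step the average of itself and everything before it. The helper names `tril_average_weights` and `matmul` are hypothetical.

```python
def tril_average_weights(T):
    """Row t has 1/(t+1) in columns 0..t and 0 afterwards (lower triangular)."""
    return [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

x = [[2.0, 7.0],
     [6.0, 4.0],
     [6.0, 5.0]]   # 3 time steps, 2 channels (made-up numbers)

wei = tril_average_weights(3)
out = matmul(wei, x)
for row in out:
    print(row)
# row 0 is x[0], row 1 is the mean of x[0] and x[1], row 2 is the mean of all three rows
```

In the notebook the same weights come from `masked_fill` with `-inf` followed by `softmax`, which produces this uniform lower-triangular matrix when the starting scores are all zero.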

In [None]:
# version 4: self attention
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
X = torch.randn(B,T,C)

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ X

out.shape

torch.Size([4, 8, 32])

Self-attention is explained above; these notes explain how the code will work moving forward. So far, attention has averaged all the past and current tokens to predict the next token. However, each token will have a varying level of attraction to different past tokens. For example, if the current token is a vowel, it may be more attracted to a consonant than to another vowel. How does it know which consonant from the past to choose? We want to gather information from the past, but in a data-dependent way. This is where self-attention comes into play.

Every token emits 2 vectors.

 1. A query ~ what am I looking for?
 2. A key ~ what do I contain?

The way we get affinities (levels of attraction) between 2 tokens is by taking a dot product between the keys and the queries. For example, the query from token 20 would dot product with the keys of tokens 0 through 20 (the earlier tokens and itself).

That dot product gives you the level of affinity; in the code above we store it as wei. If a key and a query align, they interact strongly and the querying token learns more about that specific token. For example, if token 20 interacts with token 10 the most, token 20 will learn more about token 10 than about any other token in that sequence.
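
A minimal sketch of the affinity computation for one short sequence, in plain Python. The queries and keys here are hand-written 2-dimensional vectors chosen for illustration; in the notebook they are produced by learned `nn.Linear` layers.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def affinities(queries, keys):
    """wei[t][s] = queries[t] . keys[s] for every pair of positions (t, s)."""
    return [[dot(q, k) for k in keys] for q in queries]

# Made-up queries and keys for a sequence of 3 tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys    = [[2.0, 0.0], [0.0, 3.0], [1.0, -1.0]]

wei = affinities(queries, keys)
for row in wei:
    print(row)
# [2.0, 0.0, 1.0]
# [0.0, 3.0, -1.0]
# [2.0, 3.0, 0.0]
```

Row t holds the raw scores of token t's query against every key. The entries above the diagonal are the ones the causal mask later removes.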

In [None]:
# implementing what we just explained
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x) # size would become (B, T, 16)
q = query(x) # (B, T, 16)
# when you forward this Linear layer on top of x, all the tokens in all the positions of the
# B by T arrangement produce a key and a query, in parallel and independently. No communication has happened yet

# Now for communication! All the queries will dot product with all the keys
wei = q @ k.transpose(-2,-1) # (B, T, 16) @ (B, 16, T) = (B, T, T)... @ is matrix multiplication
# so for every batch element, we get a T by T matrix of affinities, which is wei


tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x


out.shape

torch.Size([4, 8, 32])

In [None]:
wei
# every batch element now has a different wei (they are no longer identical) since each contains different tokens at different positions

tensor([[[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
         [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
         [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
         [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],

        [[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.1687, 0.8313, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2477, 0.0514, 0.7008, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.4410, 0.0957, 0.3747, 0.0887, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.0069, 0.0456, 0.0300, 0.7748, 0.1427, 0.0000, 0.0000, 0.0000],
         [0.0660, 0.089

In [None]:
wei[0]

# For example, in the last row, the 8th token knows what content it has and what position
# it is in. So the 8th token creates a query saying what it is looking for: "I'm a vowel
# in the 8th position looking for any consonants in positions up to 4." All the tokens
# emit keys, and maybe one of them replies in some channel, "I am a consonant and I am in a
# position up to 4." That key would have a high number in that specific channel,
# so the query and the key have a high affinity when they dot product.

# In the last row the 4th token was interesting to the 8th token. So through the softmax
# the 8th token will end up aggregating a lot of the 4th token's information into its position, so
# it learns a lot about it.

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

This is wei after masking and softmax have already happened, so to look under the hood we will comment out a few lines and see what is happening at each step.

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)

wei = q @ k.transpose(-2,-1)

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
#wei = wei.masked_fill(tril == 0, float('-inf'))
#wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [None]:
wei[0]
# these are the raw outputs of the dot products. The raw affinities between all the tokens.

# we don't want the 5th token to interact with the 6th, 7th, and 8th tokens, so we mask those.
# look below for the results

tensor([[-1.7629, -1.3011,  0.5652,  2.1616, -1.0674,  1.9632,  1.0765, -0.4530],
        [-3.3334, -1.6556,  0.1040,  3.3782, -2.1825,  1.0415, -0.0557,  0.2927],
        [-1.0226, -1.2606,  0.0762, -0.3813, -0.9843, -1.4303,  0.0749, -0.9547],
        [ 0.7836, -0.8014, -0.3368, -0.8496, -0.5602, -1.1701, -1.2927, -1.0260],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,  0.8638,  0.3719,  0.9258],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,  1.4187,  1.2196],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,  0.8048],
        [-1.8044, -0.4126, -0.8306,  0.5899, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)

wei = q @ k.transpose(-2,-1)

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
#wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [None]:
wei[0]
# Using masking we make sure the past cannot interact with the future
# to make sure we have a nice distribution we apply softmax

tensor([[-1.7629,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-3.3334, -1.6556,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-1.0226, -1.2606,  0.0762,    -inf,    -inf,    -inf,    -inf,    -inf],
        [ 0.7836, -0.8014, -0.3368, -0.8496,    -inf,    -inf,    -inf,    -inf],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,    -inf,    -inf,    -inf],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,    -inf,    -inf],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,    -inf],
        [-1.8044, -0.4126, -0.8306,  0.5899, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)

wei = q @ k.transpose(-2,-1)

tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [None]:
wei[0]
# after exponentiating and normalizing using softmax we get a nice distribution that
# sums to 1.

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

This final product tells us, in a data-dependent manner, how much information to aggregate from each of the tokens in the past.
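
The mask-then-softmax step can be checked by hand on one row. The sketch below, in plain Python, takes the second row of the raw `wei[0]` printed earlier, masks every position after index 1 with `-inf`, and applies softmax. The helper name `masked_softmax_row` is made up for the example.

```python
import math

def masked_softmax_row(scores, t):
    """Softmax over scores[0..t]; positions after t are set to -inf first."""
    masked = [s if i <= t else float("-inf") for i, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Second row of the raw wei[0] shown earlier in the notebook.
raw_row = [-3.3334, -1.6556, 0.1040, 3.3782, -2.1825, 1.0415, -0.0557, 0.2927]
weights = masked_softmax_row(raw_row, t=1)
print([round(w, 4) for w in weights])
# [0.1574, 0.8426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The two non-zero weights agree with the second row of the final `wei[0]` above. The large score 3.3782 at index 3 has no effect because it belongs to a future position.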


There is one more part to a single self-attention Head. When you do the aggregation, you don't actually aggregate the raw tokens. We produce one more vector per token, which we will call the value.

In [None]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # size would become (B, T, 16)
q = query(x) # (B, T, 16)
# when you forward this Linear on top of this x, all the tokens in all the positions in the
# B by T arrangement, in parallel and independently, produce a key and a query. No communication has happened yet

# Now for communication! All the queries will dot product with all the keys
wei = q @ k.transpose(-2,-1) # (B, T, 16) @ (B, 16, T) = (B, T, T)... @ is matrix multiplication
# so for every batch element we get a T x T matrix of affinities, which is wei


tril = torch.tril(torch.ones(T,T))
#wei = torch.zeros((T,T)) # wei is not all zeros anymore, so we comment this out
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x) # don't aggregate x directly; compute a value 'v' for each token
out = wei @ v # the output is the wei-weighted sum of the values
# v holds the vectors that we aggregate instead of the raw x
# x is like the private information of a token

# out = wei @ x  (the old version, which aggregated the raw x)

out.shape

# note that the output of a single Head now has 16 channels, not 32, because the value projection maps to head_size = 16

torch.Size([4, 8, 16])

To explain through an example:

Say I am the 5th token, with some identity, and my information is kept in vector x. For the purposes of this single Head, the query says what I am interested in, the key says what I have, and the value says what I will communicate to you if you find me interesting.

What gets communicated is stored in v. So v is what gets aggregated between the different tokens for the purposes of this single Head.

That is the self-attention mechanism, it is probably the hardest part so good job for getting through it!
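
To tie the pieces together, here is a minimal sketch of the cell above wrapped in an `nn.Module`. The class name `Head` and the constructor arguments `n_embd` and `block_size` are my own choices for illustration, not necessarily what the final code uses. Like the cell above, it does not apply any scaling to wei.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (a sketch under the assumptions stated above)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # the mask is fixed, not trainable, so it is registered as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                   # (B, T, head_size)
        q = self.query(x)                                 # (B, T, head_size)
        wei = q @ k.transpose(-2, -1)                     # (B, T, T) affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                 # (B, T, head_size)
        return wei @ v                                    # (B, T, head_size)

torch.manual_seed(1337)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)
out = head(x)
print(out.shape)

# causality check: changing the last token must not change earlier outputs
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed = head(x_changed)
print(torch.allclose(out[:, :-1], out_changed[:, :-1]))
```

Slicing the mask with `[:T, :T]` lets the same module handle sequences shorter than `block_size`.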


Here are the final notes on attention that Andrej Karpathy wrote in his code:



*   Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
*   There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
*   Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other. (With B = 4 and T = 8, picture 4 separate groups of 8 tokens that talk within themselves but not to each other.)
*   In an "encoder" attention block, just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
*   "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
*   "Scaled" attention additionally multiplies wei by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, wei has unit variance too, and softmax stays diffuse and does not saturate too much. The scaled-attention cells later in this notebook illustrate this.
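
The encoder and cross-attention notes in the list above can be sketched with one small function. Everything named here is hypothetical and only for illustration: the function `attention`, its `causal` flag, and the tensor `memory`, which stands in for the output of some encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

def attention(x, source, causal):
    """Queries come from x; keys and values come from source (hypothetical helper)."""
    q, k, v = query(x), key(source), value(source)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5    # (B, T_x, T_source), scaled
    if causal:  # decoder block: hide the future with the triangular mask
        T_x, T_s = wei.shape[-2:]
        wei = wei.masked_fill(torch.tril(torch.ones(T_x, T_s)) == 0, float('-inf'))
    wei = F.softmax(wei, dim=-1)
    return wei @ v, wei

x = torch.randn(B, T, C)
memory = torch.randn(B, 5, C)  # assumed external source with 5 positions

_, wei_dec = attention(x, x, causal=True)                   # decoder self-attention
_, wei_enc = attention(x, x, causal=False)                  # encoder self-attention
out_cross, wei_cross = attention(x, memory, causal=False)   # cross-attention

print(wei_dec[0, 0])   # first token can only see itself
print(wei_enc[0, 0])   # first token sees all 8 tokens
print(out_cross.shape, wei_cross.shape)
```

In the cross-attention call the weight matrix is no longer square: each of the 8 query positions spreads its attention over the 5 positions of the external source.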






In [None]:
# scaled attention
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2,-1) * head_size**-0.5 # **-0.5 is 1/sqrt: 0.5 is the square root and the minus sign makes it "1 over"

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(1.0918)
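
For comparison, this sketch computes the variance of wei with and without the scaling factor. Each entry of the unscaled wei is a sum of head_size products of unit-variance numbers, so its variance should land near head_size, while the scaled version should land near 1. The larger batch size is my own choice to make the estimates more stable; it is not what the notebook uses.

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 64, 8, 16  # larger batch than the notebook, for a steadier estimate
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

wei_unscaled = q @ k.transpose(-2, -1)        # no scaling
wei_scaled = wei_unscaled * head_size**-0.5   # scaled attention

var_unscaled = wei_unscaled.var().item()
var_scaled = wei_scaled.var().item()
print(var_unscaled)  # expected to be near head_size
print(var_scaled)    # expected to be near 1
```

The ratio between the two variances is exactly head_size, since scaling a tensor by a constant c scales its variance by c squared.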

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
# inputs close to 0 give a diffuse (fairly even) distribution

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)
# multiplying the inputs by 8 sharpens softmax toward the largest value (it approaches one-hot)
# scaling is used to control the variance at initialization so this does not happen

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
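
The same effect can be shown across several scales. This is a small sketch that reuses the five numbers from the cells above; the extra scale of 64 is my own addition to show how close to one-hot softmax gets.

```python
import torch

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
max_probs = []
for scale in (1, 8, 64):
    probs = torch.softmax(logits * scale, dim=-1)
    max_probs.append(probs.max().item())   # weight on the largest input
    print(scale, probs)
```

As the scale grows, almost all of the weight moves to the largest input, which in attention would mean each token only listens to a single other token.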

Most of the work after this was switched to Visual Studio Code (code in GitHub file) because it is easier to work with. We added a multiple head attention module, FeedForward module, and Block module.

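I came back here to go over LayerNorm. It is very similar to BatchNorm: the only change is normalizing over the features of each example (dim 1) instead of over the batch (dim 0).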

In [None]:
class LayerNorm1d: # same code as BatchNorm1d, but normalizing over dim 1 (features) instead of dim 0 (batch)
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
    # momentum is unused here: this version keeps no running statistics

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # mean over the features of each example
    xvar = x.var(1, keepdim=True) # variance over the features of each example
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100 dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [None]:
x[:,0].mean(), x[:,0].std() # mean, std of one feature across all batch inputs (not normalized)

(tensor(0.1469), tensor(0.8803))

In [None]:
x[0,:].mean(), x[0,:].std() # mean, std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))
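
As a cross-check, the sketch below compares the manual normalization against PyTorch's built-in `F.layer_norm`. One detail I am confident of but that the cells above do not mention: `x.var` defaults to the unbiased estimate (dividing by N-1), while `F.layer_norm` uses the biased one (dividing by N), so the manual version only matches exactly when the biased variance is used.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
x = torch.randn(32, 100)  # batch of 32 vectors with 100 features each
eps = 1e-5
xmean = x.mean(1, keepdim=True)

# manual version as in the class above (unbiased variance by default)
manual = (x - xmean) / torch.sqrt(x.var(1, keepdim=True) + eps)

# same formula with the biased variance, which is what F.layer_norm uses
manual_biased = (x - xmean) / torch.sqrt(x.var(1, unbiased=False, keepdim=True) + eps)

# built-in layer norm over the last dimension, without gamma and beta
builtin = F.layer_norm(x, normalized_shape=(100,), eps=eps)

diff_unbiased = (manual - builtin).abs().max().item()
diff_biased = (manual_biased - builtin).abs().max().item()
print(diff_unbiased)  # small but noticeable gap
print(diff_biased)    # only floating-point error
```

With 100 features the two versions differ by a factor of about sqrt(99/100), so the gap is small, but it is worth knowing about when comparing against `nn.LayerNorm`.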

To summarize what I have done:

I trained a decoder-only model following the paper "Attention Is All You Need".
I trained the model on the tiny shakespeare dataset and got sensible results (they would have been better with a GPU). All the training code is about 200 lines. Architecturally, this code is almost identical to large GPT models like GPT-3; the biggest difference is that those large models are anywhere from ten thousand to 1 million times bigger than what we have here.

The next step would be fine-tuning, which could look like getting a GPT to work in a question/answer format, perform tasks, detect sentiment, etc. That is the harder step; it could be supervised fine-tuning or something much more complex, like training a reward model the way OpenAI does.

So I finished the video!!! But I am calling it quits for now, so here is what we have to do next time.

1.   Upload what you have done to GitHub
2.   Redo it without the comments in a separate notebook in Colab and VS Code
3.   Read the article and write about it on LinkedIn
4.   Make a post about the project on LinkedIn
5.   Then you are done!


Get started on mirror after that and see if that is possible! You are killing it, first summer project is almost done :)