<a href="https://colab.research.google.com/github/rrl7012005/PyTorch-Language-Modelling-Notes/blob/main/GPT_detailed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#Can use the following to download files from the internet, if you want to use the shakespeare file
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
chars = ""
with open("wizard_of_oz.txt", "r", encoding='utf-8') as f:
  text = f.read() #read the file as a string

print(len(text))
print(text[:200])

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)

232309
﻿DOROTHY AND THE WIZARD IN OZ

  BY

  L. FRANK BAUM

  AUTHOR OF THE WIZARD OF OZ, THE LAND OF OZ, OZMA OF OZ, ETC.

  ILLUSTRATED BY JOHN R. NEILL

  BOOKS OF WONDER WILLIAM MORROW & CO., INC. NEW Y

 !"&'()*,-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz﻿
81


#Tokenizing

A tokenizer consists of an encoder and a decoder: the encoder maps each token to an integer id (not the same thing as an embedding) and the decoder maps ids back to text. The following is a character-level tokenizer; we could also use a word-level one. A word-level tokenizer has a very large vocabulary but produces fewer tokens for the same text. A character-level tokenizer has a small vocabulary but produces many more tokens for the same text. In practice, subword units are typically used as a compromise between the two.

For subword tokenization you can use OpenAI's tiktoken library (byte pair encoding) or Google's SentencePiece library.
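As a rough illustration of the trade-off, the sketch below tokenizes a made-up sentence (not taken from the dataset) at character level and at word level using a naive whitespace split. On such a tiny sample only the difference in sequence length is representative; the vocabulary sizes only separate at corpus scale, where the character vocabulary stays bounded by the alphabet while the word vocabulary keeps growing.

```python
# Toy comparison of character-level and word-level tokenization.
# `sample` is a made-up string, not an excerpt from the dataset.
sample = "the wizard of oz and the land of oz"

# Character level: every character is a token.
char_tokens = list(sample)
char_vocab = sorted(set(char_tokens))

# Word level (naive whitespace split): every word is a token.
word_tokens = sample.split()
word_vocab = sorted(set(word_tokens))

print(f"characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)}")
print(f"words:      {len(word_tokens)} tokens, vocabulary of {len(word_vocab)}")

# Encoding and decoding at word level works the same way as at character level.
word_to_int = {w: i for i, w in enumerate(word_vocab)}
int_to_word = {i: w for i, w in enumerate(word_vocab)}
ids = [word_to_int[w] for w in word_tokens]
decoded = " ".join(int_to_word[i] for i in ids)
print(ids)
print(decoded)
```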

In [None]:
string_to_int = {ch:i for i, ch in enumerate(chars)} #Create mapping
int_to_string = {i:ch for i, ch in enumerate(chars)}
encode = lambda x: [string_to_int[c] for c in x]
decode = lambda x: "".join([int_to_string[i] for i in x])

print(encode("hello"))
print(decode(encode("hello")))

[61, 58, 65, 65, 68]
hello


It's best to be efficient with our data, and Python strings and lists are not. We should also use an ML framework, so we store the encoded text as a PyTorch tensor.

In [None]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
data = torch.tensor(encode(text), dtype=torch.long)
print(data[:200])

cuda
tensor([80, 28, 39, 42, 39, 44, 32, 49,  1, 25, 38, 28,  1, 44, 32, 29,  1, 47,
        33, 50, 25, 42, 28,  1, 33, 38,  1, 39, 50,  0,  0,  1,  1, 26, 49,  0,
         0,  1,  1, 36, 11,  1, 30, 42, 25, 38, 35,  1, 26, 25, 45, 37,  0,  0,
         1,  1, 25, 45, 44, 32, 39, 42,  1, 39, 30,  1, 44, 32, 29,  1, 47, 33,
        50, 25, 42, 28,  1, 39, 30,  1, 39, 50,  9,  1, 44, 32, 29,  1, 36, 25,
        38, 28,  1, 39, 30,  1, 39, 50,  9,  1, 39, 50, 37, 25,  1, 39, 30,  1,
        39, 50,  9,  1, 29, 44, 27, 11,  0,  0,  1,  1, 33, 36, 36, 45, 43, 44,
        42, 25, 44, 29, 28,  1, 26, 49,  1, 34, 39, 32, 38,  1, 42, 11,  1, 38,
        29, 33, 36, 36,  0,  0,  1,  1, 26, 39, 39, 35, 43,  1, 39, 30,  1, 47,
        39, 38, 28, 29, 42,  1, 47, 33, 36, 36, 33, 25, 37,  1, 37, 39, 42, 42,
        39, 47,  1,  4,  1, 27, 39, 11,  9,  1, 33, 38, 27, 11,  1, 38, 29, 47,
         1, 49])


In [None]:
n = int(0.8*len(data))
train_data = data[:n]
val_data = data[n:]

block_size = 8
batch_size = 4

In [None]:
torch.manual_seed(1337)

def get_batch(split):
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,))
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  x, y = x.to(device), y.to(device) #load data on the gpu
  return x, y

x, y = get_batch('train')
print(x.shape, y.shape)
print(x)
print(y)


for b in range(batch_size):
  for t in range(block_size):
    context = x[b, :t+1]
    target = y[b, t]
    print(f"when input is {context.tolist()} the target: {target}")

torch.Size([4, 8]) torch.Size([4, 8])
tensor([[58, 66,  1, 62, 67, 73, 68,  1],
        [67, 57,  1, 73, 61, 58,  1, 44],
        [57,  1, 73, 61, 58,  1, 73, 76],
        [61, 68, 71, 72, 58,  1, 54, 67]], device='cuda:0')
tensor([[66,  1, 62, 67, 73, 68,  1, 73],
        [57,  1, 73, 61, 58,  1, 44, 62],
        [ 1, 73, 61, 58,  1, 73, 76, 68],
        [68, 71, 72, 58,  1, 54, 67, 57]], device='cuda:0')
when input is [58] the target: 66
when input is [58, 66] the target: 1
when input is [58, 66, 1] the target: 62
when input is [58, 66, 1, 62] the target: 67
when input is [58, 66, 1, 62, 67] the target: 73
when input is [58, 66, 1, 62, 67, 73] the target: 68
when input is [58, 66, 1, 62, 67, 73, 68] the target: 1
when input is [58, 66, 1, 62, 67, 73, 68, 1] the target: 73
when input is [67] the target: 57
when input is [67, 57] the target: 1
when input is [67, 57, 1] the target: 73
when input is [67, 57, 1, 73] the target: 61
when input is [67, 57, 1, 73, 61] the target: 58
when inpu

We don't train on the whole dataset at once; at each step we sample random chunks of length block_size from it. Every chunk yields block_size training examples (8 here), one per context length from 1 up to block_size, so a single sequence of text gives 8 different (context, target) pairs.

In [None]:


x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f"when input is {context} the target is {target}")

when input is tensor([80]) the target is 28
when input is tensor([80, 28]) the target is 39
when input is tensor([80, 28, 39]) the target is 42
when input is tensor([80, 28, 39, 42]) the target is 39
when input is tensor([80, 28, 39, 42, 39]) the target is 44
when input is tensor([80, 28, 39, 42, 39, 44]) the target is 32
when input is tensor([80, 28, 39, 42, 39, 44, 32]) the target is 49
when input is tensor([80, 28, 39, 42, 39, 44, 32, 49]) the target is 1


On the GPU we stack several blocks into one batch; batch_size is the number of blocks, i.e. how many sequences are processed in parallel. NumPy runs on the CPU only, while PyTorch tensors can live on either the CPU or, through CUDA, the GPU. A CPU is better at a few complex, sequential operations; a GPU is better at a very large number of small, simple operations carried out in parallel.

#PyTorch stuff

In [None]:
randint = torch.randint(-100, 100, (6,))
randint

tensor([ -5,  53, -85, -90, -66, -10])

In [None]:
tensor = torch.tensor([[0.1, 1.2], [2.2, 3.1]])
tensor

tensor([[0.1000, 1.2000],
        [2.2000, 3.1000]])

In [None]:
zeros = torch.zeros(2, 3) #shape is argument, dtype is float
ones = torch.ones(3, 4)

zeros, ones

(tensor([[0., 0., 0.],
         [0., 0., 0.]]),
 tensor([[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]))

In [None]:
empty = torch.empty(2, 3) #tensor of the given shape with uninitialized (arbitrary) values
#this can be slightly faster than zeros or ones because the memory is not initialized,
#which is useful when every element will be overwritten anyway
empty.fill_(1.0) #fill all elements with 1 in place

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [None]:
arange = torch.arange(5) #similar to range in python
linspace = torch.linspace(3, 10, steps=5) #similar to linspace in numpy
arange, linspace

(tensor([0, 1, 2, 3, 4]),
 tensor([ 3.0000,  4.7500,  6.5000,  8.2500, 10.0000]))

In [None]:
#logspace returns `steps` values evenly spaced on a log scale, from base**start to base**end (base defaults to 10)
logspace = torch.logspace(start=-10, end=10, steps=5)
logspace

tensor([1.0000e-10, 1.0000e-05, 1.0000e+00, 1.0000e+05, 1.0000e+10])

In [None]:
eye = torch.eye(5) #dimension of identity
eye

tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]])

In [None]:
#There are also *_like functions (ones_like, zeros_like, empty_like) that copy the shape and dtype of another tensor
#You can also specify the datatype while declaring a tensor
a = torch.empty((2, 3), dtype=torch.int64)
empty_like = torch.empty_like(a)
empty_like

tensor([[              0,  98036351716544,  98036310022928],
        [135021587812640,               0,               0]])

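For randint, a single leading number is the exclusive upper bound (the lower bound defaults to 0) and the next argument is the shape, so randint(1, ...) below can only return zeros. The rand function generates float32 values in [0, 1), while randint gives integers.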

In [None]:
int_64 = torch.randint(1, (3,2), dtype=torch.int64)
float_32 = torch.rand(2, 3)
int_64, float_32

#cast via the following
casted = int_64.float()
casted

tensor([[0., 0.],
        [0., 0.],
        [0., 0.]])

In [None]:
#Switch to GPU and compare the following

import time #or do %%time for the whole cell
start_time = time.time()

#operations here

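torch_rand1 = torch.rand(10000, 10000).to(device)
torch_rand2 = torch.rand(10000, 10000).to(device)
torch_rand = (torch_rand1 @ torch_rand2)

if device == 'cuda':
  torch.cuda.synchronize() #CUDA kernels run asynchronously, so wait for them before stopping the clock

end_time = time.time()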

elapsed_time = end_time - start_time
print(f"{elapsed_time:.8f}")

4.24842238


More TORCH functions

Draw from a discrete probability distribution using torch.multinomial (place the probabilities in a tensor) and the samples returned will be the indices of the probability tensor.

In [None]:
probabilities = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2])
samples = torch.multinomial(probabilities, num_samples=10, replacement=True)
print(samples)

tensor([1, 2, 2, 0, 4, 1, 3, 2, 4, 3])


The cat function concatenates tensors. Pass the tensors to concatenate as a tuple or list, followed by the dimension to concatenate along.

In [None]:
tensor = torch.tensor([1, 2, 3, 4])
out = torch.cat((tensor, torch.tensor([5])), dim=0)
out

tensor([1, 2, 3, 4, 5])

Torch.tril returns the lower triangular part of the argument and torch.triu gives the upper triangular part of the argument.

In [None]:
lower = torch.tril(torch.ones(5, 5))
upper = torch.triu(torch.ones(5, 5))

lower, upper

(tensor([[1., 0., 0., 0., 0.],
         [1., 1., 0., 0., 0.],
         [1., 1., 1., 0., 0.],
         [1., 1., 1., 1., 0.],
         [1., 1., 1., 1., 1.]]),
 tensor([[1., 1., 1., 1., 1.],
         [0., 1., 1., 1., 1.],
         [0., 0., 1., 1., 1.],
         [0., 0., 0., 1., 1.],
         [0., 0., 0., 0., 1.]]))

Every PyTorch tensor has a masked_fill method, which works like boolean-mask assignment in NumPy. Here it is used to set every position where the lower triangular mask is 0 to float('-inf').

In [None]:
x = torch.zeros(5, 5)
y = x.masked_fill(torch.tril(torch.ones(5, 5)) == 0, float('-inf'))

In [None]:
torch.exp(y) #returns e^x

tensor([[1., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1.]])

Transpose and stack

In [None]:
torch.zeros(2, 3, 4).transpose(0, 2) #the 2 arguments inside transpose are the dimensions to swap

tensor([[[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]]])

In [None]:
tensor1 = torch.tensor([1, 2, 3])
tensor2 = torch.tensor([4, 5, 6])
tensor3 = torch.tensor([7, 8, 9])

stacked_tensor = torch.stack([tensor1, tensor2, tensor3]) #Stack all 3 tensors along an extra dimension
print(stacked_tensor)

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])


Matrix multiplication

In [None]:
a = torch.tensor([[1,2], [3,4], [5,6]])
b = torch.tensor([[7,8,9], [10,11,12]])

c = a @ b
print(c)
d = torch.matmul(a, b)
print(d)

tensor([[ 27,  30,  33],
        [ 61,  68,  75],
        [ 95, 106, 117]])
tensor([[ 27,  30,  33],
        [ 61,  68,  75],
        [ 95, 106, 117]])


Now for the neural network modules and functions. The first two arguments of nn.Linear are in_features and out_features, which fix the shape of the weight matrix; bias=False leaves out the bias term. Layers like this are the building blocks for custom models in PyTorch, and there are many others (see the torch.nn documentation).

In [None]:
import torch.nn as nn
sample = torch.tensor([10., 10., 10.])
linear = nn.Linear(3, 3, bias=False) #Create a layer with a 3x3 weight matrix
print(linear(sample)) #Calling the layer on a tensor returns the output of passing it through the layer

tensor([1.7518, 2.2448, 2.0193], grad_fn=<SqueezeBackward4>)


Now you could use the nn.Sequential API and create models exactly how you did in tensorflow (see the website).

Below see the softmax function

In [None]:
import torch.nn.functional as F

tensor1 = torch.tensor([1.0, 2.0, 3.0])
softmax_output = F.softmax(tensor1, dim=0) #softmax along the 0th dimension
print(softmax_output)

tensor([0.0900, 0.2447, 0.6652])


#Embedding

You know what an embedding is by now; watch 3b1b if you somehow do not. The embedding dimension is a hyperparameter: basically how rich you want to make your semantic space. BTW a multilayer perceptron is one kind of neural network (fully connected and feed-forward), which is why the two terms are often used interchangeably for such networks.

In [None]:
embedding_dim = 100
embedding = nn.Embedding(vocab_size, embedding_dim) #Lookup table mapping each of the vocab_size ids to a learnable vector

#example
random_indices = torch.LongTensor([1, 5, 3, 2])
embedded_output = embedding(random_indices)

embedded_output

tensor([[-1.6061e-01,  2.4685e-01, -1.4746e+00,  1.7112e+00,  4.5142e-01,
          6.7146e-01,  6.3328e-01,  2.2874e+00,  8.2223e-01,  5.4649e-02,
         -6.9024e-01, -6.8275e-02, -5.9100e-02, -1.3982e-01, -1.4625e-01,
          3.2315e-01, -8.8086e-01,  4.7022e-01, -4.4766e-01, -7.6995e-01,
         -7.0201e-01,  1.9830e+00, -1.0658e+00,  5.5441e-01,  7.9698e-02,
          1.4904e+00,  2.9600e-01,  6.9858e-01, -1.2551e-01,  5.3266e-01,
          1.6305e+00, -7.7768e-02,  1.1295e+00, -1.3841e+00,  1.0300e+00,
         -1.5525e+00,  1.3757e+00, -1.2609e-01,  1.8745e-01,  1.6855e+00,
         -1.6363e-01,  9.8537e-02,  8.0232e-01, -2.2649e+00, -1.0715e-01,
          8.9931e-01, -3.3558e-02, -3.3444e-01,  7.0777e-01, -3.8958e-01,
          6.8493e-01,  3.0040e-01, -9.5306e-03,  4.2025e-01,  1.0545e+00,
         -1.9577e-01, -6.3084e-01,  1.4094e+00,  1.0459e+00, -8.5410e-01,
          2.5702e-01, -1.2885e+00,  2.3138e-01, -7.8347e-01,  3.5477e-02,
         -5.2666e-01,  5.5806e-01, -9.

#Model Training and Development (Bigram)

We use cross-entropy as the loss function (F.cross_entropy applies log-softmax followed by the negative log-likelihood, nll) and the AdamW optimizer (see the torch.optim documentation). AdamW applies decoupled weight decay, which tends to regularize better than plain Adam.

We also write the forward pass through the network ourselves, because this shows how the data gets transformed step by step and gives you the freedom to create your own layers.

We need to reshape the logits. C is the channel size (here the vocabulary size). The .view method in PyTorch reshapes a tensor without changing or copying its data. The reshape makes the shapes compatible with the cross-entropy function.

F.cross_entropy expects the class dimension C as the 2nd dimension of its input, i.e. logits of shape (N, C) with targets of shape (N,), so we flatten (B, T, C) to (B*T, C).
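A minimal sketch of the shape requirement, using made-up toy sizes for B, T and C and random logits rather than the model's output. It shows the flattening used in these notes and, as an alternative, moving the class dimension into second position; both should give the same mean loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 3, 5  # toy batch size, sequence length and number of classes (assumed values)

logits = torch.randn(B, T, C)          # one score per class at every position
targets = torch.randint(0, C, (B, T))  # one class index per position

# Flatten batch and time so the class dimension comes second: (B*T, C) and (B*T,)
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Alternative: keep the batch and move the class dimension to position 1, giving (B, C, T)
loss_perm = F.cross_entropy(logits.transpose(1, 2), targets)

print(loss_flat.item(), loss_perm.item())
```

Passing the unflattened (B, T, C) logits with (B, T) targets would instead treat T as the class dimension, which is why the reshape is needed.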

In [None]:
a = torch.rand(2, 3, 5)
a.view(30)

tensor([0.5327, 0.8851, 0.2264, 0.3411, 0.0833, 0.5025, 0.8800, 0.0619, 0.2684,
        0.1377, 0.2342, 0.7635, 0.0441, 0.2077, 0.7670, 0.4485, 0.2564, 0.8747,
        0.0882, 0.2180, 0.1531, 0.8768, 0.3880, 0.8192, 0.3223, 0.5152, 0.7320,
        0.9656, 0.6906, 0.8787])

In [None]:
@torch.no_grad() #decorator telling PyTorch not to track gradients inside this function, which saves time and memory
def estimate_loss(): #useful for evaluation and inference
  out = {}
  model.eval() #set the model to evaluation mode
  for split in ['train', 'val']:
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
      X, Y = get_batch(split)
      logits, loss = model(X, Y) #calling the model runs its forward method
      losses[k] = loss.item()
    out[split] = losses.mean() #average over eval_iters batches to smooth out the noise
  model.train() #set back to training mode
  return out

In [None]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module): #subclass nn.Module so the layers' parameters are registered and learnable
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) #Create the embedding layer

  def forward(self, index, targets=None):
    logits = self.token_embedding_table(index) #map each token id to a vector of logits, shape is BxTxC
    #each entry is a sequence, each sequence contains token ids, and each token id is mapped to a row
    #of unnormalized scores over the whole vocabulary for the next token

    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape #batch size, time, channel
      logits = logits.view(B*T, C) #flatten the batch and time dimensions
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets) #cross entropy between the logits (softmax is applied internally) and the targets

    return logits, loss

  def generate(self, index, max_new_tokens): #generate max_new_tokens new tokens
    #index is (B, T), a batch of sequences
    for _ in range(max_new_tokens):
      logits, loss = self(index) #run a forward pass, logits are BxTxC
      logits = logits[:, -1, :] #keep only the last timestep, since a bigram model conditions on one token; shape becomes BxC
      probs = F.softmax(logits, dim=-1) #softmax over the last (vocabulary) dimension
      index_next = torch.multinomial(probs, num_samples=1) #sample from the distribution
      index = torch.cat((index, index_next), dim=1) #append to the sequence, autoregressive format
    return index

model = BigramLanguageModel(vocab_size)
m = model.to(device) #move model to GPU

context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


abv]iP4YsN'BCQ'0yrNnYCsmH(iPb;
X-RDKMYP2Pb1I7-LOGgbC),sGA660M?zOfrd-?cexea﻿TBfhbQD(Vg*erT6 RLZ!4NOmSp6hQz8qrrxErT45NO﻿:ik5t[zL
MP,&"wA?_TtWRwb9!y51!CK.]RMZK(F
S,-WTtS)0FLUQWSCZFls8675N5s745dGB]2wbK[ :*5-4t*rmKXPsfaSZqZ?]vnxmE**5)WoxNqDRvwgJB-0Gfp.AswHda,IhYj*5:'q2DI89Ys2T&a'MS,AgR5:J_-4tj-hMRv[zLGBY:;BFK)3RY6EzXJ3[9Wy,sis?mn7mnA"﻿qm5HDB](VE&"6dnZ)?a,-,S)iyr8
﻿S,YLR[]n]cqT!y10'_..,AE*?!Hm-T55(3RY*TZDI5z(CMTtT5D540sD;?p0[*ycN;)Dop
sw;T*nP&l&O
GxRIaPK(Vwt5U4yxMmgRWVJqBea0wOcOEpj3Fc)isy)Q&Qv&2TQgBw:


So what's going on? Typically the embedding layer converts a token into a vector in such a way that directions encode semantic meaning. Here, because of the way we designed the architecture, the embedding layer instead acts like a co-occurrence (bigram) table: row i holds the logits for the token that follows token i. Neural nets adapt whatever weights they have to achieve the task (minimize the cost function), so after training the softmax of each row approximates the probability distribution of the next token given the current one. "Embedding" is therefore not quite the right word, although technically each token is still mapped to a continuous vector. This is why the table is vocab_size by vocab_size. For a trigram model the equivalent table would be a 3D tensor.

The outputs are called logits because they are the values before the softmax. During training, batches are sampled and passed through the forward pass, and the loss is computed between each position's predicted distribution and the target token index. This is why we don't embed the targets: they are used directly as class indices.

During generation we append the sampled index to the sequence; as it's a bigram model, only the last token is used to predict the next one.

Now define the optimizer, AdamW: pass model.parameters() as the first argument and the learning rate as the second (see the documentation).

Typical loss functions and optimizers:

1. Mean Squared Error (a loss function rather than an optimizer, mainly for regression)
2. Gradient Descent (just standard gradient descent)
3. Momentum is an extension of stochastic gradient descent that keeps a running average of past gradients; this smooths out changes and lets the update keep moving in the right direction even if the gradient briefly changes direction (a good setting is about 90% of the previous average and 10% of the current gradient)
4. RMSprop (uses a moving average of the squared gradients of each parameter, avoids oscillations and can improve convergence)
5. Adam combines momentum and RMSprop
6. AdamW adds decoupled weight decay to Adam to regularize and improve generalization.
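To make the momentum idea concrete, here is a small sketch in plain Python. The quadratic objective, the starting point and the learning rate are made up for illustration; it uses the "90% previous, 10% current" running-average form of momentum, which is one of several common formulations rather than the exact update inside torch.optim.

```python
# Gradient descent with momentum on f(w) = (w - 3)^2, whose minimum is at w = 3.
# beta = 0.9 keeps 90% of the previous running average and adds 10% of the current gradient.

def grad(w):
  return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0         # starting point (arbitrary)
velocity = 0.0  # running average of the gradients
lr = 0.1        # learning rate (arbitrary)
beta = 0.9

for step in range(500):
  velocity = beta * velocity + (1.0 - beta) * grad(w)
  w = w - lr * velocity

print(w)  # ends up very close to 3
```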

In [None]:
lr = 3e-4
max_iters = 10000
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

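Now the main training loop. We zero the gradients because PyTorch accumulates them across backward() calls by default, and we only want to optimize based on the current batch. Instead of filling them with zeros we set them to None (set_to_none=True), which uses less memory. Deliberately skipping the reset is called gradient accumulation; it is used to simulate a larger batch by summing gradients over several small batches before each optimizer step. (In RNNs the gradient is summed across time steps inside a single backward pass, but it is still reset between optimizer steps.)

Also, do not report the loss of every single step, as the per-batch loss is noisy (random fluctuations); instead, every eval_iters steps we report an estimate averaged over eval_iters batches.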

In [None]:
eval_iters = 1000
max_iters = 2000
for step in range(max_iters): #named step rather than iter to avoid shadowing the built-in

  if step % eval_iters == 0:
    losses = estimate_loss()
    print(f"step {step}: train_loss: {losses['train']:.4f}, val_loss: {losses['val']:.4f}")

  xb, yb = get_batch('train')

  logits, loss = model(xb, yb) #Do a forward pass
  optimizer.zero_grad(set_to_none=True) #Clear gradients left over from the previous step
  loss.backward() #Back propagate to compute the gradients
  optimizer.step() #Update the parameters using the gradients

print(loss.item()) #loss.item() converts a one-element tensor to a Python float

step 0: train_loss: 4.8881, val_loss: 4.8862
step 1000: train_loss: 4.6436, val_loss: 4.6398
4.547786712646484
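A useful sanity check for the numbers above is the loss of a model that guesses uniformly over the vocabulary. The sketch below computes it for the 81 characters found earlier; a loss above this value means the model is still doing worse than a uniform guess.

```python
import math

vocab_size = 81  # number of distinct characters printed earlier in this notebook

# A model that assigns probability 1/vocab_size to every character has
# cross-entropy -ln(1/vocab_size) = ln(vocab_size), regardless of the data.
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")
```

This comes out at about 4.39, so a final loss of about 4.55 suggests the bigram model would benefit from more steps or a larger learning rate.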


Generate text. We still get gibberish: a bigram model sees only one character of context, and the loss only fell from about 4.89 to 4.55, so it has barely been trained.

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


p'E55qNlDY,?9P&j[1tTL﻿qK﻿n.9IjV[ !PtGi.SyKjNM)So;?vrarebnsvA6*ceb:FM."uCkiPYKV873p'QHoflI"n95FH"﻿xAQ" jDgt[&2dGB4esMmor*[:1s51Od1q:eZ0apGXXfqo_?4SL?Dw1ntuSgxrd1?MlrR3O0!s?nT﻿DHdX*Yamgte;8*WS8-)W45mFfytiJq0SFk5Q1?﻿-nc4"X73xFqN'QDY1083-I45,WRYAISgbq6d_iZzbz7vN&5!P,*LYvrkn
h8aQg5MwQhouF Y ! Dg.ngISZKM6q(RmzbEBTZZwq90Bl7W9yuCM6vgoR9w7NPt0BH8UWRY
PraNo&L"7A](EzL?4me6j[T68?9ILu_w3A1f7_meE_t](CHgQC tfCUMviOatpetor0 hAt*3'K(yM84mQ"kshMZ5y2ZKElPYPqD_lId1btf&ltf.T?mmoi2hOX)(&??gM"KMmeHL(KMlmf.bj)iYK8UuqNM


**Note on activation functions**

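tanh is often preferred over sigmoid in multilayer perceptrons because its output is centered on 0 and its gradient near 0 is larger (at most 1, versus 0.25 for sigmoid), which reduces the vanishing gradient problem. It still saturates for large inputs, so it does not remove the problem entirely.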

#Model Architecture and Training (Transformers)

**Residual Neural Networks**

Residual connections allow jumps, i.e. the input of a layer is added to its output. This gives gradients a direct path through the network and makes sure later layers do not lose the information from earlier layers. There are 2 ways to combine this with normalization: normalize first and then add (pre-norm), or add and then normalize the sum (post-norm). The 2nd way is used here and in the original Transformer; GPT-2 style models use the first.

**Multi-head attention**

For the attention mechanism see 3b1b or the lecture notes from Cambridge. Essentially it uses keys, queries and values with a softmax to take into account how context affects each token's embedding. A query is like a question; its dimension is a hyperparameter, equal to head_size. The query-key dot products are scaled by dividing by the square root of head_size. This keeps their variance approximately constant (close to 1 for unit-variance inputs) so the softmax does not saturate (numerical stability). We do not want the softmax to be too sharp near initialization, as each token would then take information from only one other token.
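The effect of the scaling can be checked numerically. The sketch below uses random unit-variance queries and keys with an assumed head_size of 64 (not the model's actual activations): the unscaled dot products have a variance of roughly head_size, the scaled ones roughly 1, and larger scores make the softmax noticeably sharper.

```python
import torch

torch.manual_seed(0)
head_size = 64   # assumed value for illustration
n_pairs = 10000  # number of random query/key pairs

q = torch.randn(n_pairs, head_size)  # unit-variance queries
k = torch.randn(n_pairs, head_size)  # unit-variance keys

raw = (q * k).sum(dim=-1)         # unscaled dot products
scaled = raw * head_size ** -0.5  # divided by sqrt(head_size)

print(raw.var().item(), scaled.var().item())

# Large scores push the softmax towards a one-hot vector
scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
soft = torch.softmax(scores, dim=-1)
sharp = torch.softmax(scores * 8, dim=-1)
print(soft)
print(sharp)
```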

We use multiple heads because you can imagine them as many different people reading a book and learning different perspectives and takes (different queries and keys asking different questions about the input and getting different answers). Then we combine their results.

Masked attention doesn't let a position use future tokens during training. It also lets you get multiple training samples from one sequence, so the model sees many context sizes. In the transformer architecture of Google's "Attention Is All You Need" paper, masked attention is used in only one of the 3 multi-head attention sections (the decoder's self-attention).

**Encoder**

The purpose of the encoder is to read the whole input (past, present and future positions) and put it into a vector representation for the decoder, so it is fine to use future tokens here. We use masked attention in the decoder.


**nn.Linear**

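An nn.Linear layer applies a learned affine map that mixes the input features into a new set of output features. When the output is smaller than the input it acts as a form of compression or dimensionality reduction, summarizing the output of the previous stage; when it is larger it expands the representation. A non-linearity such as ReLU is typically placed between linear layers, since stacked linear layers without one collapse into a single linear map. Keep this in mind when designing future neural network architectures.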

**Embedding and Positional Encoding**

Tokens are first converted into dense vectors by looking them up in an embedding matrix (equivalent to multiplying a one-hot vector by that matrix). Positional information is then added, because attention on its own does not see token order the way an RNN does, so these encodings tell the model the order of the sequence. The original Transformer uses a trigonometric formula (sines and cosines of different frequencies across the dimensions) to give each position a unique encoding.
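The trigonometric formula from the original paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Below is a small plain-Python sketch of it; the function name and the sizes are chosen here for illustration.

```python
import math

def sinusoidal_encoding(position, d_model):
  """Return the d_model-dimensional encoding of one position (d_model assumed even)."""
  encoding = []
  for i in range(0, d_model, 2):  # i is the even dimension index
    angle = position / (10000 ** (i / d_model))
    encoding.append(math.sin(angle))  # even dimension
    encoding.append(math.cos(angle))  # odd dimension
  return encoding

d_model = 8  # small embedding size chosen for illustration
table = [sinusoidal_encoding(pos, d_model) for pos in range(16)]

print([round(v, 3) for v in table[0]])  # position 0 alternates sin(0) = 0 and cos(0) = 1
print([round(v, 3) for v in table[1]])
```

Each row of `table` would be added to the token embedding at that position.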

**The Encoder Stage**

It processes the input sequence. Each encoder layer consists of:

1. multi-head self attention layer
2. Residual connection and norm
3. position-wise feed forward network
4. Add and norm

After the multi-head self-attention, which re-encodes each token's embedding given the tokens around it, the residual connection (from the original input) is added and layer normalization is applied. This preserves gradient flow. Each position in the output then passes independently through the same feed-forward network, consisting of 2 linear transformations with a ReLU activation in between. The first linear layer typically increases the dimensionality (fewer to more neurons), which increases the capacity of the network to capture complex features and patterns. ReLU then introduces non-linearity by zeroing the negative activations. The second linear layer brings the data back down to the original dimension so it integrates with the rest of the model, now carrying information from the expanded features; this is like feature extraction.

After that, a residual connection from the input of the feed-forward layer is added and layer normalization is applied to stabilize and speed up training.


**Decoding Stage**

The input to the decoder is the target sequence (shifted by one) so the decoder predicts the next word in the sequence given the previous words.

1. Masked multi-head self attention
2. Residual connection and norm
3. multi-head cross attention
4. Add and norm
5. Feed forward network
6. Add and Norm
7. Linear Transformation and Softmax

The self-attention has to be the masked version because each position may only consider information from earlier positions; no information should flow from future tokens. This also turns one training sequence into multiple training examples.

Then comes cross-attention, where the keys and values come from the encoder output and the queries come from the decoder. This lets the decoder use information from the entire input sequence processed by the encoder. The rest is self-explanatory. The final linear layer transforms the decoder's outputs into logits; it is good to separate things like this so that a single layer is not asked to do too many jobs.
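A single cross-attention head can be sketched as follows. The tensors standing in for the encoder output and the decoder states are random, and all sizes are assumed toy values; the point is only where the queries, keys and values come from and what shapes result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T_enc, T_dec, n_embd, head_size = 2, 7, 5, 16, 8  # toy sizes (assumed)

enc_out = torch.randn(B, T_enc, n_embd)  # stand-in for the encoder output
dec_x = torch.randn(B, T_dec, n_embd)    # stand-in for the decoder hidden states

query = nn.Linear(n_embd, head_size, bias=False)
key = nn.Linear(n_embd, head_size, bias=False)
value = nn.Linear(n_embd, head_size, bias=False)

q = query(dec_x)    # (B, T_dec, head_size), queries come from the decoder
k = key(enc_out)    # (B, T_enc, head_size), keys come from the encoder
v = value(enc_out)  # (B, T_enc, head_size), values come from the encoder

wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T_dec, T_enc)
wei = F.softmax(wei, dim=-1)  # no causal mask: every decoder position may see the whole input
out = wei @ v                 # (B, T_dec, head_size)

print(wei.shape, out.shape)
```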


Now we're building a GPT, not the full transformer. GPT uses only the decoder part of the transformer architecture (without the cross-attention, as there is no encoder), which is well suited to sequence generation. The original transformer includes both an encoder and a decoder, which suits tasks that map one sequence to another (machine translation). GPT is pre-trained to predict the next token in a sequence and later fine-tuned for specific tasks. Both generate text autoregressively; the difference is that the encoder-decoder transformer first encodes the entire input and its decoder attends to that encoding, whereas GPT conditions only on the tokens so far.

Before we get into that, there's also the averaging method: for every batch element and every t-th token, we average the vectors of all previous tokens, including the current one. This doesn't account for the importance of each token at all.

In [None]:
#averaging- bag of words
B, T, C = 4, 8, 2 #toy batch, time and channel sizes (assumed for illustration)
x = torch.randn(B, T, C) #random stand-in for a batch of token vectors
xbow = torch.zeros((B,T,C))
for b in range(B):
  for t in range(T):
    xprev = x[b,:t+1] #all elements up to and including position t, shape (t+1, C)
    xbow[b,t] = torch.mean(xprev, 0) #mean over those elements

x[0]

In [None]:
#its more efficient to do it as matrix multiplication
#tril returns the lower triangular part of a matrix
#represent this sequence averaging as
a = torch.tril(torch.ones(3,3))
a = a/torch.sum(a, 1, keepdim=True) #divide each row by its sum so every row averages
b = torch.randint(0,10,(3,2)).float() #converts to float
c = a @ b
print(a)
print(b)
print(c)

In [None]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) broadcast over the batch @ (B, T, C) -> (B, T, C)
torch.allclose(xbow, xbow2, atol=1e-6) #Check if these 2 are the same up to rounding error

In [None]:
#Again but instead of standard normalization, apply softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf')) #wherever tril is 0, set to -inf (see 3b1b) so tokens cannot see the future
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3, atol=1e-6)

Now **ATTENTION**

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:

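max_iters = 3000
eval_iters = 1 #batches averaged per loss estimate (1 is fast but noisy)
eval_interval = 500 #steps between loss estimates
learning_rate = 3e-4
n_layer = 8
n_embd = 384
n_head = 6
batch_size = 128
block_size = 64
dropout = 0.25 #25% of activations are randomly zeroed during training

@torch.no_grad()
def estimate_loss():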
  out = {}
  model.eval()
  for split in ['train', 'val']:
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
      X, Y = get_batch(split)
      logits, loss = model.forward(X, Y)
      losses[k] = loss.item()
    out[split] = losses.mean()
  model.train()
  return out

We're using multiple decoder blocks in our GPT sequentially.

In [None]:
class GPTLanguageModel(nn.Module):
  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
    self.position_embedding_table = nn.Embedding(block_size, n_embd) #We make the positional encodings learnable rather than sinusoidal
    self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)]) #n_layer decoder blocks in sequence

    self.ln_f = nn.LayerNorm(n_embd) #Final layer norm at the end of all decoding blocks
    self.lm_head = nn.Linear(n_embd, vocab_size) #Projects the embeddings to vocabulary logits so we can do softmax

    self.apply(self._init_weights) #initialize our weights; self.apply (inherited from nn.Module) calls the function on every submodule

  def _init_weights(self, module):
    if isinstance(module, nn.Linear): #if it's a linear layer
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
      if module.bias is not None:
        torch.nn.init.zeros_(module.bias) #assign 0 to the biases
    elif isinstance(module, nn.Embedding): #if its embedding
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

  def forward(self, index, targets=None):
    tok_emb = self.token_embedding_table(index)
    B, T, C = tok_emb.shape
    pos_emb = self.position_embedding_table(torch.arange(T, device=index.device)) #positions 0..T-1 on the same device as the input, shape (T, C)
    x = tok_emb + pos_emb #add the positional embeddings to the token embeddings (broadcast over the batch) to encode both identity and position

    #Now we have the embeddings so we feed it into our blocks

    x = self.blocks(x)
    x = self.ln_f(x)
    logits = self.lm_head(x)

    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape
      logits = logits.view(B*T, C)
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, index, max_new_tokens):
    for _ in range(max_new_tokens):
      index_cond = index[:, -block_size:] #crop to the last block_size tokens, since the position table has only block_size rows
      logits, loss = self(index_cond)
      logits = logits[:, -1, :]
      probs = F.softmax(logits, dim=-1)
      index_next = torch.multinomial(probs, num_samples=1)
      index = torch.cat((index, index_next), dim=1)
    return index

Now we need to define the Block class (the decoder blocks).

In [None]:
class Block(nn.Module):
  def __init__(self, n_embd, n_head):
    super().__init__()
    head_size = n_embd // n_head #headsize is number of features each head captures, its essentially the query dimension
    self.sa = MultiHeadAttention(n_head, head_size) #This is another class we need to create
    self.ffwd = FeedForward(n_embd) #Another class
    self.ln1 = nn.LayerNorm(n_embd) #For the attention
    self.ln2 = nn.LayerNorm(n_embd) #For the feed forward


  def forward(self, x):
    y = self.sa(x)
    x = self.ln1(x + y) #Add and norm, our residual connection
    y = self.ffwd(x)
    x = self.ln2(x + y) #Add and norm, so we need 2 variables x and y for resnet

    return x

Now the Feed Forward and Multihead attention

In [None]:
class FeedForward(nn.Module):
  def __init__(self, n_embd):
    super().__init__()
    self.net = nn.Sequential( #define the net with sequential api
        nn.Linear(n_embd, 4 * n_embd), #dimensionality increase
        nn.ReLU(),
        nn.Linear(4 * n_embd, n_embd),
        nn.Dropout(dropout) #Add dropout to prevent overfitting
    )

  def forward(self, x):
    return self.net(x)

class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, head_size):
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) #Create a list of heads
    self.proj = nn.Linear(head_size * num_heads, n_embd) #project the concatenated heads back to n_embd, so the shapes still line up if head_size * num_heads differs from n_embd
    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    out = torch.cat([h(x) for h in self.heads], dim=-1) #concatenate the heads
    out = self.dropout(self.proj(out)) #apply the projection, then dropout
    return out

class Head(nn.Module):
  def __init__(self, head_size):
    super().__init__()
    self.key = nn.Linear(n_embd, head_size, bias=False) #no biases in the attention
    self.query = nn.Linear(n_embd, head_size, bias=False)
    self.value = nn.Linear(n_embd, head_size, bias=False)
    self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) #tril is a buffer, not a parameter: the optimizer never updates it, but it moves with the model across devices and is saved with it

    self.dropout = nn.Dropout(dropout)

  def forward(self, x):
    B, T, C = x.shape
    k = self.key(x)
    q = self.query(x)

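    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5 #scale the scores by 1/sqrt(head_size) so the softmax does not saturate
    wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) #mask out future positions
    wei = F.softmax(wei, dim=-1) #each row becomes attention weights that sum to 1
    wei = self.dropout(wei)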

    v = self.value(x)
    out = wei @ v
    return out
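
To see what the mask in Head does, here is a small sketch with hypothetical sizes (T = 4, head_size = 8) and random queries and keys in place of the learned projections. After masked_fill and softmax, every position above the diagonal has weight exactly 0, so a token can only attend to itself and to earlier tokens:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8  # tiny illustrative sizes, not the notebook's settings
q = torch.randn(1, T, head_size)
k = torch.randn(1, T, head_size)

scores = q @ k.transpose(-2, -1) * head_size ** -0.5  # (1, T, T), scaled dot products
mask = torch.tril(torch.ones(T, T))  # 1 on and below the diagonal, 0 above
scores = scores.masked_fill(mask == 0, float('-inf'))  # future positions get -inf
weights = F.softmax(scores, dim=-1)  # exp(-inf) = 0, so masked entries vanish

print(weights[0])
```

The first row has a single nonzero entry (the first token can only see itself), and each row still sums to 1.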



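Now we need to train our model. A note first: nn.ModuleList and nn.Sequential are different. nn.Sequential chains its modules, passing each output to the next module in order. nn.ModuleList is just a list that registers its modules with the parent model; it has no forward method of its own, so you decide how to call them. That is why MultiHeadAttention can run every head on the same input and concatenate the results.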
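
The two containers can be compared directly. This is a sketch with throwaway layers (the sizes are arbitrary): nn.Sequential is called like a single module, while nn.ModuleList has to be looped over by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2)]  # arbitrary example layers

seq = nn.Sequential(*layers)   # calling seq(x) runs the layers one after another
mlist = nn.ModuleList(layers)  # only stores and registers the same layers

x = torch.randn(3, 4)
out_seq = seq(x)

# With a ModuleList we write the loop (or any other wiring) ourselves
out_list = x
for layer in mlist:
  out_list = layer(out_list)

print(torch.equal(out_seq, out_list))
```

Both containers register the same parameters, so either one makes the layers visible to model.parameters() and to the optimizer.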

In [None]:
import pickle

In [None]:
model = GPTLanguageModel(vocab_size)
m = model.to(device)

print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

max_iters = 500

for iter in range(max_iters):

  if iter % eval_iters == 0:
    losses = estimate_loss()
    print(f"step {iter}: train_loss: {losses['train']:.4f}, val_loss: {losses['val']:.4f}")

  xb, yb = get_batch('train')

  logits, loss = model(xb, yb) #Do a forward pass (calling the model invokes forward)
  optimizer.zero_grad(set_to_none=True) #Clear the gradients from the previous step
  loss.backward() #Back propagate to compute the gradients
  optimizer.step() #Update the parameters using the gradients

print(loss.item()) #Loss.item converts tensor to python float

with open('model-01.pkl', 'wb') as f: #wb = write binary
  pickle.dump(model, f) #pickles the whole model object (architecture and parameters)
  print('model saved')

14.274129 M parameters
step 0: train_loss: 4.4398, val_loss: 4.4351
step 1: train_loss: 3.5471, val_loss: 3.5422
step 2: train_loss: 3.2923, val_loss: 3.3172
step 3: train_loss: 3.2025, val_loss: 3.2248
step 4: train_loss: 3.1244, val_loss: 3.1297
step 5: train_loss: 3.0778, val_loss: 3.0524
step 6: train_loss: 2.9922, val_loss: 2.9843
step 7: train_loss: 2.9284, val_loss: 2.9830
step 8: train_loss: 2.8634, val_loss: 2.8930
step 9: train_loss: 2.8611, val_loss: 2.8991
step 10: train_loss: 2.8186, val_loss: 2.8243
step 11: train_loss: 2.7899, val_loss: 2.7805
step 12: train_loss: 2.6887, val_loss: 2.7739
step 13: train_loss: 2.7487, val_loss: 2.7103
step 14: train_loss: 2.7091, val_loss: 2.7108
step 15: train_loss: 2.6697, val_loss: 2.7381
step 16: train_loss: 2.6862, val_loss: 2.6956
step 17: train_loss: 2.6652, val_loss: 2.6957
step 18: train_loss: 2.6345, val_loss: 2.7164
step 19: train_loss: 2.6781, val_loss: 2.6547
step 20: train_loss: 2.5695, val_loss: 2.6496
step 21: train_loss: 
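
A quick sanity check on the first number in this log: a model that spreads its probability evenly over the vocabulary has a cross-entropy of ln(vocab_size). The sketch below assumes vocab_size is 81, the character count printed when the text file was loaded.

```python
import math

vocab_size = 81  # assumption: the vocabulary size printed earlier for wizard_of_oz.txt
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats

print(f"{uniform_loss:.4f}")
```

This gives about 4.39, close to the step 0 losses of roughly 4.44, which is what we would expect from a model that has not learned anything yet.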

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device) #start from a single token with id 0, on the same device as the model
generated_chars = decode(m.generate(context, max_new_tokens=500)[0].tolist())
print(generated_chars)


when the so
theselr, horry as pached with them, in shas for heer, and then tine
Sorce Wizard obe darhel agren. It Is
Carchunger vegetey after his headfers fownor-jothings, "becan one with,
the ming to yethou ne. 
"We it a dimplect toodI'd sme," voitary; so the raitter a very spight.

"But Iffurery too Mabook flacim One," and speecled in Jim," the repided Dorothy.

"Your in wenterere?" a asked Zeb, and slise we to them," can8y "I
gooMy had true you make the alloon all aboroun the Wizard, of Zeb, 


#Large Language Model

For datasets to train LLMs on, you can use the OpenWebText corpus, which contains text from web pages linked in upvoted Reddit posts (~40 GB). Common Crawl is a much larger web crawl, on the order of petabytes. The paper "A Survey of Large Language Models" includes a list of commonly used datasets.

These datasets are too large to read into RAM at once, so we have to load the data some other way.

Download the dataset from its website and extract it locally.


As we read each compressed file, we can collect any new characters it contains and add them to the vocab file.

It is also more efficient to write the text out into a single train file and a single val file.

In [None]:
import os
import lzma #to handle xz files
from tqdm import tqdm #progress bar

import mmap
import random

In [None]:
def xz_files_in_dir(directory):
  """Takes directory as input and returns all xz files within the directory"""

  files = []
  for filename in os.listdir(directory):
    if filename.endswith(".xz") and os.path.isfile(os.path.join(directory, filename)):
      #check if it is a file and not a directory or a link
      files.append(filename)
  return files

folder_path = #The folder wherever the xz file is located
output_file_train = "output_train.txt" #The output file
output_file_val = "output_val.txt"
vocab_file = "vocab.txt" #vocabulary file

files = xz_files_in_dir(folder_path)
total_files = len(files)

split_index = int(total_files * 0.9)
files_train = files[:split_index]
files_val = files[split_index:]

SyntaxError: invalid syntax (<ipython-input-47-34a2174b2893>, line 12)

In [None]:
vocab = set()


with open(output_file_train, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_train, total=len(files_train)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)

with open(output_file_val, "w", encoding="utf-8") as outfile:
  for filename in tqdm(files_val, total=len(files_val)):
    file_path = os.path.join(folder_path, filename)
    with lzma.open(file_path, "rt", encoding="utf-8") as infile:
      text = infile.read()
      outfile.write(text)
      characters = set(text)
      vocab.update(characters)

with open(vocab_file, "w", encoding="utf-8") as vfile:
  for char in vocab:
    vfile.write(char)
    vfile.write("\n")
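
The extraction loop can be tried in isolation on a couple of tiny files. This is a sketch only: build_corpus is a hypothetical helper name, and the two .xz files are created on the fly in a temporary directory instead of coming from the real dataset.

```python
import lzma
import os
import tempfile

def build_corpus(xz_paths, out_path):
  """Decompress each .xz file, append its text to out_path, and collect the characters seen."""
  vocab = set()
  with open(out_path, "w", encoding="utf-8") as outfile:
    for path in xz_paths:
      with lzma.open(path, "rt", encoding="utf-8") as infile:
        text = infile.read()
      outfile.write(text)
      vocab.update(text)  # a string iterates over its characters
  return vocab

with tempfile.TemporaryDirectory() as tmp:
  paths = []
  for i, sample in enumerate(["hello ", "world"]):  # stand-ins for real dataset shards
    path = os.path.join(tmp, f"part{i}.xz")
    with lzma.open(path, "wt", encoding="utf-8") as f:
      f.write(sample)
    paths.append(path)

  out_path = os.path.join(tmp, "output_train.txt")
  vocab = build_corpus(paths, out_path)
  with open(out_path, encoding="utf-8") as f:
    combined = f.read()

print(combined)
print(sorted(vocab))
```

Only one file's text is held in memory at a time, which is the point of processing the archive shard by shard.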

Now we have to change the way we load our file and get batches

In [None]:
chars = ""

with open("vocab.txt", "r", encoding="utf-8") as f:
  text = f.read()
  chars = sorted(list(set(text)))

vocab_size = len(chars)

string_to_int = {ch:i for i, ch in enumerate(chars)} #Create mapping
int_to_string = {i:ch for i, ch in enumerate(chars)}
encode = lambda x: [string_to_int[c] for c in x]
decode = lambda x: "".join([int_to_string[i] for i in x])


FileNotFoundError: [Errno 2] No such file or directory: 'vocab.txt'

In [None]:
def get_batch(split):
  data = get_random_chunk(split) #get from our file
  ix = torch.randint(len(data) - block_size, (batch_size,))
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  x, y = x.to(device), y.to(device) #load data on the gpu
  return x, y

#Memory-map the file so we can read a small random piece of a very large file without loading the whole thing into RAM
def get_random_chunk(split):
  filename = "output_train.txt" if split == "train" else "output_val.txt"

  with open(filename, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
      file_size = len(mm)
      start_pos = random.randint(0, (file_size) - block_size * batch_size)

      mm.seek(start_pos)
      block = mm.read(block_size * batch_size - 1)

      #decode the bytes as UTF-8, ignoring any partial characters cut off at the chunk boundaries
      decoded_block = block.decode('utf-8', errors='ignore').replace('\r', '')
      data = torch.tensor(encode(decoded_block), dtype=torch.long)

  return data
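
The memory-mapping step can be tested without the real dataset. In this sketch, read_random_chunk is a hypothetical helper and the sample file is written to a temporary directory; it reads a fixed number of bytes from a random offset, the same idea as get_random_chunk but without the tokenizer:

```python
import mmap
import os
import random
import tempfile

def read_random_chunk(path, chunk_bytes):
  """Read chunk_bytes bytes from a random offset of the file at path using mmap."""
  with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
      start = random.randint(0, len(mm) - chunk_bytes)  # randint includes both ends
      mm.seek(start)
      raw = mm.read(chunk_bytes)
  # A random byte offset can land in the middle of a multi-byte character,
  # so undecodable bytes at the edges are dropped
  return raw.decode("utf-8", errors="ignore")

source_text = "abcdefghij" * 100  # small ASCII stand-in for a huge corpus

with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "sample.txt")
  with open(path, "w", encoding="utf-8") as f:
    f.write(source_text)
  chunk = read_random_chunk(path, 32)

print(chunk)
```

Only the pages that are actually read get loaded, so the cost of a batch does not depend on the size of the file.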

Then run the model as before.

**Saving the model**

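You can use torch.save and torch.load to save and reload the model. Saving the whole model object stores both the parameters and a reference to the architecture (the GPTLanguageModel class), so that class must be defined when you load it.

You can use pickle as well, as in the training cell above; this is fine when training on a single GPU. Then we can train our model for a while, take a break, and continue training later.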
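
A common alternative, which the cells here do not use, is to save only the parameters with state_dict and load them into a freshly constructed model. The sketch below uses a small nn.Linear as a stand-in for GPTLanguageModel and a temporary directory; the file name model-01.pt is made up for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for GPTLanguageModel(vocab_size)

with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "model-01.pt")  # hypothetical file name
  torch.save(model.state_dict(), path)  # save the parameters only

  restored = nn.Linear(4, 2)  # rebuild the same architecture first
  restored.load_state_dict(torch.load(path))  # then copy the saved parameters in

print(torch.equal(model.weight, restored.weight))
```

With this approach the file holds only tensors, so the model has to be constructed in code before the weights are loaded.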

In [None]:
print("Loading model")

with open('model-01.pkl', 'rb') as f:
  model = pickle.load(f)

print("Loaded successfully")

m = model.to(device)
#and then run the training loop, being careful to comment out
#model = GPTLanguageModel(vocab_size) because we don't want to create the model again