In [1]:
with open('input.txt', 'r', encoding = 'utf-8') as f:
    text = f.read()

In [2]:
print('length of dataset in characters', len(text))

length of dataset in characters 1115394


In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
chars = sorted(list(set(text))) # a set has no guaranteed order, so we sort to get a deterministic vocabulary
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Here we have 65 distinct characters that the model can see or emit as elements of our sequences: a newline and a space at the beginning, followed by special characters, then uppercase and lowercase letters.

#### 1.     TOKENIZATION

The first step is to tokenize, which means converting the raw text (a string) into a sequence of integers according to some codebook, i.e. a vocabulary of possible elements.

In our case, it simply means translating individual characters into integers.

In [5]:
# Encoder and decoder
 
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # encode a string input and output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decode a list of integers input and output a string

print(encode("hi there"))
print(decode(encode("hi there")))

[46, 47, 1, 58, 46, 43, 56, 43]
hi there


The list [46, 47, 1, 58, 46, 43, 56, 43] represents the string "hi there". Decoding applies the reverse mapping from integer back to character and returns exactly the same text.

This can therefore be seen as a translation to integers and back, for an arbitrary string, done at the character level.

What we did:

- Iterate over all the characters
- Create a lookup table from character to integer, and one from integer back to character
- Encode a string by translating each character individually into a list of integers
- Decode it back by applying the reverse mapping and concatenating the resulting characters
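The steps above can be sketched on a toy string. This is a minimal illustration, not the notebook's own cell: `toy_text` and the `toy_*` names are made up here so they do not clash with `stoi`, `itos`, `encode` and `decode`.

```python
# Toy corpus, used only for illustration (not the Shakespeare text)
toy_text = "hello world"

# 1. Collect the unique characters and sort them for a deterministic order
toy_chars = sorted(set(toy_text))

# 2. Build the two lookup tables: character -> id and id -> character
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Translate each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Translate ids back to characters and join them into a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)
print(ids)
print(toy_decode(ids))
```

Note that encoding a character that never appeared in the corpus raises a `KeyError`, which is one limitation of a vocabulary built this way.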


Other tokenization schemes exist.

- Google uses **[SentencePiece](https://github.com/google/sentencepiece)**, which also encodes text into integers but with a different scheme and vocabulary. **SentencePiece** is a subword tokenizer: it encodes neither entire words nor individual characters, but units at the subword level.

- OpenAI uses the **[tiktoken](https://github.com/openai/tiktoken)** library, which implements a byte pair encoding (BPE) tokenizer (used for GPT too).

In [6]:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
enc.n_vocab # instead of 65, we have 50k


50257

In [7]:
enc.encode("hii there")

# Unlike the previous case, we get only 3 integers, and they lie between 0 and 50,256 rather than between 0 and 64

[71, 4178, 612]

 
``` encode("hi there")  [46, 47, 1, 58, 46, 43, 56, 43]```
 
 ```enc.encode("hii there") [71, 4178, 612]```

 There is a trade-off between the codebook size and the sequence length: we can have very long sequences of integers with a very small vocabulary (character-level tokenizer), or short sequences of integers with a very large vocabulary (subword tokenizer).

 In practice, people typically use subword encodings.
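To put numbers on this trade-off, here is a small sketch that only reuses figures already printed in this notebook (65 versus 50257 vocabulary entries, and the three ids returned for "hii there"). The bits-per-token line is an added back-of-the-envelope bound, assuming every token were equally likely.

```python
import math

char_vocab_size = 65        # character-level vocabulary from this notebook
gpt2_vocab_size = 50257     # enc.n_vocab reported above

sample = "hii there"
char_token_count = len(sample)               # one token per character
gpt2_token_count = len([71, 4178, 612])      # ids printed by enc.encode above

print(char_token_count, gpt2_token_count)

# Upper bound on information per token (in bits) if all tokens were equally likely
print(round(math.log2(char_vocab_size), 2), round(math.log2(gpt2_vocab_size), 2))
```

The larger vocabulary needs fewer tokens for the same text, at the cost of a much bigger lookup table.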

In [8]:
# Applying a character level tokenizer for the entire training set of Shakespeare

import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the first 1000 characters, shown the way the GPT will see them

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

This sequence of integers is an exact translation of the first 1000 characters, where each number is >= 0 and <= 64; for example, 0 is the newline character and 1 is a space.

In other words, the text has been translated directly into integers.
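As a quick sanity check, the mapping can be rebuilt from the 65-character vocabulary printed earlier and applied to the first few integers of the tensor. This is a sketch that assumes the vocabulary is exactly the printed character set in sorted order; `vocab_chars` and `id_to_char` are names introduced here.

```python
import string

# Rebuild the vocabulary shown earlier: newline, space, the punctuation and
# digit that occur in the text, then A-Z and a-z (sorted by code point)
vocab_chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
id_to_char = {i: ch for i, ch in enumerate(vocab_chars)}

# First 15 integers of the tensor printed above
first_ids = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0]
decoded = "".join(id_to_char[i] for i in first_ids)

print(len(vocab_chars))
print(repr(decoded))
```

The decoded string is the opening speaker line of the text, including its trailing newline.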

In [9]:
# Splitting up into train and validation sets for the transformer.
n = int(0.9*len(data)) # 90% is train and rest 10% val
train_data = data[:n]
val_data = data[n:] # helps us understand to what extent our model is overfitting

We never feed the entire text into the Transformer at once, because that would be computationally prohibitive.

Instead, the Transformer works on chunks of the dataset: during training we sample small random chunks from the training data and train on just those chunks at a time. Each chunk has a maximum length, called the block size or context length.

In [10]:
block_size = 8
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

A chunk sampled like this has multiple examples packed into it, because all of these characters follow one another. When we plug it into the Transformer, we simultaneously train it to make a prediction at every one of the positions.

In a chunk of 9 characters, there are actually 8 individual examples packed in there.

In context of 18---> 47 comes up <br>
In context of 18,47 ---> 56 comes up <br>
In context of 18,47,56 ---> 57 comes up <br>

In [11]:
# TIME DIMENSION
x = train_data[:block_size] # inputs of the transformer, first block size characters 
y = train_data[1:block_size + 1] # next block size characters offset by 1. y is the targets for each position in the input
for t in range(block_size): # going over 8
    context = x[:t+1] # context is the first t+1 characters (up to and including position t)
    target = y[t] # target
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


These are the 8 examples, with context lengths from 1 up to the block size, hidden in the chunk of 9 characters that we sampled from the dataset. We train on all of them not only for efficiency (we happen to have the sequence already) but also so that the Transformer gets used to seeing contexts of every length, from as little as one token up to the block size. That is useful during inference.

During sampling, we can start generation from as little as 1 character of context, and the Transformer can predict the next character from it. It can keep predicting with a growing context up to the block size; beyond that we have to truncate, because the Transformer never receives more than block size tokens of input when predicting the next character.
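The truncation step can be sketched with plain lists. `crop_context` is a hypothetical helper introduced here for illustration; the bigram model later in this notebook does not need it, since it only looks at the last token.

```python
block_size = 8  # same value as in this notebook

def crop_context(token_ids, block_size):
    """Keep only the most recent block_size tokens (hypothetical helper)."""
    return token_ids[-block_size:]

history = list(range(12))  # pretend these are 12 generated token ids

short_context = crop_context(history[:3], block_size)  # shorter than block_size: unchanged
long_context = crop_context(history, block_size)       # longer: oldest tokens are dropped

print(short_context)
print(long_context)
```

With tensors of shape (B, T), the analogous slice would be taken along the time dimension.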

As we sample chunks of text, each time we feed the Transformer a batch of multiple chunks stacked up in a single tensor. This is done for efficiency, to keep the GPUs busy, because they are very good at processing data in parallel.

We therefore process multiple chunks at the same time, but each chunk is processed completely independently; the chunks do not talk to each other.

In [12]:
# Start sampling random locations in the data set to pull chunks from. For that we need to set a seed
torch.manual_seed(1337) # setting up the seed
batch_size = 4  # how many independent sequences will we process in parallel or every forward backward pass of the transformer?
block_size = 8  # the maximum context length for predictions

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # generate random positions to grab a chunk out of. Generate the batch size number of random offsets
    
    x = torch.stack([data[i:i+block_size] for i in ix]) # first block size characters starting at i 
    # stack 1 d tensors into row in a 4x8 tensor
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])# y's are offset by 1s
    return x, y # x and y each hold one chunk for every offset i in ix

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb) # each row is a chunk of a training set.
# this 4x8 array contains 32 examples, all completely independent as far as the transformer is concerned
print('targets:')
print(yb.shape)
print(yb) # to create the loss fn

print('======')

for t in range(block_size):  # time dimension
    for b in range(batch_size):  # batch dimension
        context = xb[b, :t+1]
        target = yb[b][t]
        print(f"When input is {context.tolist()} the target: {target}")


inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
When input is [24] the target: 43
When input is [44] the target: 53
When input is [52] the target: 58
When input is [25] the target: 17
When input is [24, 43] the target: 58
When input is [44, 53] the target: 56
When input is [52, 58] the target: 1
When input is [25, 17] the target: 27
When input is [24, 43, 58] the target: 5
When input is [44, 53, 56] the target: 1
When input is [52, 58, 1] the target: 58
When input is [25, 17, 27] the target: 10
When input is [24, 43, 58, 5] the target: 57
When input is [44, 53, 56, 1] the target: 58
When input is [52, 58, 1, 58] the target: 46
When input is [25, 1

Input is 24, target is 43<br>
Input is 24,43, target is 58<br>
All 32 examples are packed into a single batch of inputs x and desired targets y<br>

x is fed into the Transformer, which processes all the examples simultaneously and makes a prediction at every position in the tensor; y supplies the correct integer to predict at each of those positions.

In [13]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


### FEED THIS INTO THE NEURAL NETWORKS

We start with the simplest possible neural network for language modelling: the bigram model.

In [14]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageMOdel(nn.Module): # sub class of nn module
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # thin wrapper around a tensor of shape vocab_size by vocab_size
        
    def forward(self, idx, targets): # index= idx 
        
        #idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # passing the index 
        return logits
    
m = BigramLanguageMOdel(vocab_size)
out = m(xb,yb) # inp xb and targets yb
print(out.shape)

torch.Size([4, 8, 65])


Every integer in our input refers to this embedding table and plucks out the row of the table corresponding to its index.

For example, 24 in the input goes to the embedding table and plucks out the row at index 24.
PyTorch then arranges all of this into a (B = batch, T = time, C = channel) tensor.
B = 4, T = 8 and C = vocab_size = 65

The rows plucked out of the table are interpreted as logits, which are simply the scores for the next character in the sequence.

We are predicting what comes next based only on the identity of a single token.

At this point the tokens are not talking to each other and see no context other than themselves. Token number 5, for example, can still make a decent prediction about what comes next just by knowing that it is token 5, because certain characters tend to follow certain other characters.

With torch.Size([4, 8, 65]), we get the scores (predictions) for every one of the 4x8 positions.
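The lookup itself can be sketched without PyTorch. This is an illustration under assumptions: `table` is a random list-of-lists standing in for the embedding weights, and only the first two rows of the batch are used, so B = 2 here.

```python
import random

random.seed(0)
vocab_size = 65

# Stand-in for nn.Embedding(vocab_size, vocab_size): one row of scores per token id
table = [[random.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(vocab_size)]

# First two rows of the input batch shown above
xb_rows = [[24, 43, 58, 5, 57, 1, 46, 43],
           [44, 53, 56, 1, 58, 46, 39, 58]]

# Each id selects its row, so a (B, T) input becomes a (B, T, C) output
logits = [[table[token] for token in row] for row in xb_rows]

B, T, C = len(logits), len(logits[0]), len(logits[0][0])
print(B, T, C)

# Token 43 occurs at positions 1 and 7 of the first row and gets identical
# scores both times: the prediction ignores everything except the token itself
print(logits[0][1] == logits[0][7])
```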

### EVALUATING THE LOSS

A good way to measure the loss, i.e. the quality of the predictions, is the negative log likelihood loss, which is implemented as cross_entropy in PyTorch. The loss measures the quality of the logits with respect to the targets: given the identity of the true next character, how well did the logits predict it?

The logit at the index of the correct target should be a very high number, and the logits at all other indices should be very low.
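A hand-rolled version for a single position makes this concrete. It is a sketch in plain Python, not PyTorch's implementation; the function name mirrors `cross_entropy` only for readability, and the three logit vectors are made-up examples.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

target = 2
confident_right = [0.0, 0.0, 5.0, 0.0]   # high score on the correct index
uniform = [0.0, 0.0, 0.0, 0.0]           # no preference at all
confident_wrong = [5.0, 0.0, 0.0, 0.0]   # high score on a wrong index

loss_right = cross_entropy(confident_right, target)
loss_uniform = cross_entropy(uniform, target)
loss_wrong = cross_entropy(confident_wrong, target)

print(loss_right, loss_uniform, loss_wrong)
```

The loss is smallest when the correct index has the highest logit, equals ln(4) for the uniform case with four classes, and is largest when the model is confidently wrong.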

In [18]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageMOdel(nn.Module): # sub class of nn module
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # thin wrapper around a tensor of shape vocab_size by vocab_size
        
    def forward(self, idx, targets): # index= idx 
        
        #idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # passing the index 
        loss = F.cross_entropy(logits, targets) # cross entropy on predictions and targets
        
        return logits,loss
    
m = BigramLanguageMOdel(vocab_size)
logits, loss = m(xb,yb) # inp xb and targets yb
print(logits.shape)

RuntimeError: Expected target size [4, 65], got [4, 8]

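We get an error here because cross_entropy in PyTorch expects a different layout for its inputs.

For multi-dimensional inputs it expects the channel dimension to come second, i.e. B by C by T instead of B by T by C. Our (4, 8, 65) logits were therefore read as 8 classes over 65 positions, which does not match the (4, 8) targets.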
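One way around this is to merge the batch and time axes so that the logits become two-dimensional. The sketch below mimics that reshaping with nested lists to show which element ends up where; it assumes row-major flattening, and the placeholder values are made up so the ordering is visible.

```python
B, T, C = 4, 8, 65

# Nested lists standing in for (B, T, C) logits and (B, T) targets.
# Each target is set to b * T + t so the flattened order is easy to read.
logits = [[[0.0] * C for _ in range(T)] for _ in range(B)]
targets = [[b * T + t for t in range(T)] for b in range(B)]

# Merge batch and time: (B, T, C) -> (B*T, C) and (B, T) -> (B*T,)
flat_logits = [row for batch in logits for row in batch]
flat_targets = [t for batch in targets for t in batch]

print(len(flat_logits), len(flat_logits[0]))
print(len(flat_targets))
print(flat_targets[:10])
```

Position (b, t) lands at flat index b * T + t in both lists, so every row of scores stays paired with its own target.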

In [21]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageMOdel(nn.Module): # sub class of nn module
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # thin wrapper around a tensor of shape vocab_size by vocab_size
        
    def forward(self, idx, targets): # index= idx 
        
        #idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # passing the index 
        
        B, T, C = logits.shape
        logits = logits.view(B*T, C) # B and T are merged into one dimension and C becomes the second: (B, T, C) -> (B*T, C)
        targets = targets.view(-1)
        # targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets) # cross entropy on predictions and targets
        
        return logits,loss
    
m = BigramLanguageMOdel(vocab_size)
logits, loss = m(xb,yb) # inp xb and targets yb
print(logits.shape)
print(loss) # with 65 possible vocabulary elements, a uniform guess would give about -ln(1/65)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)


With a uniform prediction over 65 characters we would expect a loss of -ln(1/65), which is about 4.17, but we get about 4.88.

This tells us the initial predictions are not perfectly diffuse: the randomly initialised logits carry a bit of misplaced confidence, so we are guessing wrong.
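The baseline figure is easy to reproduce. The snippet below only uses the vocabulary size and the loss value printed above; `uniform_loss` and `observed_loss` are names introduced here.

```python
import math

vocab_size = 65
uniform_loss = -math.log(1.0 / vocab_size)  # loss if every character were equally likely
observed_loss = 4.8786                      # value printed by the cell above

print(round(uniform_loss, 4))
print(round(observed_loss - uniform_loss, 4))
```

The gap between the two numbers is the price the untrained model pays for its random, non-uniform scores.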

### GENERATE FROM THE MODEL

In [27]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageMOdel(nn.Module): # sub class of nn module
    
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # thin wrapper around a tensor of shape vocab_size by vocab_size
        
    def forward(self, idx, targets = None): # index= idx 
        
        #idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # passing the index 
        
        if targets is None:
            loss = None
        else: 
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # B and T are merged into one dimension and C becomes the second: (B, T, C) -> (B*T, C)
            targets = targets.view(-1)
            # targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross entropy on predictions and targets
        
        return logits,loss
    
    def generate(self, idx, max_new_tokens): # idx is (B, T) array of indices in the current context in some batch
        # generate takes the (B, T) context and extends it to (B, T+1), (B, T+2), ..., continuing every sequence in the batch along the time dimension for max_new_tokens steps
        for _ in range(max_new_tokens):
            # get the predictions
            logits,loss = self(idx) # loss is ignored (it is None), since there are no targets to compare against
            logits = logits[: , -1, :] # from (B, T, C) keep only the last time step, giving (B, C): those are the predictions for what comes next
            # softmax of the logits gives us probabilities
            probs = F.softmax(logits, dim = -1)
            # sample from the probabilities and we get 1 sample.
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1), from each one of the batch dimension we will have a single prediction for what comes next
            idx = torch.cat((idx, idx_next), dim=1) # append the sampled index to the running sequence, giving (B, T+1)
            # the sample is concatenated along the time dimension (dim=1, i.e. the second dimension), and the result becomes the new idx
        return idx
                    
m = BigramLanguageMOdel(vocab_size)
logits, loss = m(xb,yb) # inp xb and targets yb
print(logits.shape)
print(loss) # with 65 possible vocabulary elements, a uniform guess would give about -ln(1/65)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)


In [28]:
idx = torch.zeros((1,1), dtype= torch.long) # a 1 x 1 tensor (batch 1, time 1) holding a single 0
# zero kicks off the generation; index 0 is the newline character

generation = decode(m.generate(idx, max_new_tokens=100)[0].tolist()) # 100 tokens
# generate continues the sequence from idx.
# Since it works on batches, we index into row 0 to pull out the single sequence in the batch,
# which gives a 1-D tensor of all the indices over time; tolist() converts it to a Python list.
# That list can be fed into our decode fn to convert the integers into text.

print(generation)


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


This output is total garbage, because the model is still completely random.
The generate function is written to be general, which looks wasteful right now: we keep concatenating and feeding in the whole context, even though a bigram model does not use it.

Even when we feed in the entire sequence, the model looks only at the last token to predict the next character.
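The sampling loop can be sketched in plain Python to make the last point visible. This is an illustration, not the notebook's `generate`: the 3-token `table` is hypothetical and deliberately peaked (0 is followed by 1, 1 by 2, 2 by 0), and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    """Turn a list of scores into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(table, start_ids, max_new_tokens, rng):
    """Bigram sampling: only the last id is used to pick the next one."""
    ids = list(start_ids)
    for _ in range(max_new_tokens):
        probs = softmax(table[ids[-1]])  # scores for the last token only
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids

# Hypothetical table with strongly peaked rows: 0 -> 1 -> 2 -> 0
table = [[-20.0, 20.0, -20.0],
         [-20.0, -20.0, 20.0],
         [20.0, -20.0, -20.0]]

out = generate(table, [0], 6, random.Random(1337))
print(out)
```

Because each row puts essentially all of its probability on one successor, the sampled sequence simply cycles through the three tokens, whatever came earlier in the context.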

### TRAINING THE BIGRAM MODEL

Training makes the model less random. The simplest possible optimizer would be stochastic gradient descent (SGD), but here we use AdamW instead.

AdamW is a more advanced and popular optimizer that works extremely well. A typical learning rate is roughly 3e-4, but for very small networks like this one we can get away with higher learning rates such as 1e-3.

The optimizer takes the gradients computed during backpropagation and uses them to update the model's parameters. In other words, it adjusts the parameters (weights, biases, embeddings, etc.) in the direction that reduces the error (loss) in the model’s predictions.

AdamW builds on the idea of stochastic gradient descent, with improvements such as adaptive learning rates and momentum that make training faster and more stable.
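To show what "adaptive learning rates and momentum" mean in practice, here is a scalar sketch of an AdamW-style update. It is a simplified illustration, not PyTorch's code: `adamw_step` and the `state` dictionary are hypothetical, and the toy loss (p - 3)^2 is chosen only because its minimum is obvious.

```python
import math

def adamw_step(p, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW-style update for a single scalar parameter (illustrative only)."""
    b1, b2 = betas
    state["t"] += 1
    p = p * (1.0 - lr * weight_decay)                       # decoupled weight decay
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # momentum (first moment)
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # running squared gradient
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return p - lr * m_hat / (math.sqrt(v_hat) + eps)        # per-parameter step size

# Minimise the toy loss (p - 3)^2, whose gradient is 2 * (p - 3)
p = 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
initial_loss = (p - 3.0) ** 2
for _ in range(500):
    p = adamw_step(p, 2.0 * (p - 3.0), state, lr=0.1, weight_decay=0.0)
final_loss = (p - 3.0) ** 2

print(initial_loss, final_loss)
```

Dividing by the running root-mean-square of the gradient is what gives each parameter its own effective step size; plain SGD would instead use `p - lr * grad`.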

In [30]:
# STEP 1: Create a pytorch optimization object
optimizer = torch.optim.AdamW(m.parameters(), lr = 1e-3)
# A variant of the Adam optimizer that includes weight decay (a regularization technique) to help prevent overfitting.
# takes the gradients and update the parameters using the gradients 

In [37]:
batch_size = 32
for steps in range(10000):
    
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = m(xb,yb)
    optimizer.zero_grad(set_to_none=True) # clear the gradients left over from the previous step
    loss.backward() # backpropagate to get gradients for all parameters
    optimizer.step() # use those gradients to update the parameters
    
print(loss.item())

2.4571714401245117


In [42]:
generation = decode(m.generate(idx = torch.zeros((1,1), dtype= torch.long), max_new_tokens=500)[0].tolist()) 
print(generation)


Trerrs, nes hthemuretirs, fonoumew carerted, w l' fryeathenill ws y thon astidispl m.
We linerny.
MAur hale, whour, bud shel--avesen, p
S:
TUS:
S:


Towingorshivous icarout at I ncars Praves sorsee itesis arifr m geon h othayer,

And te tou bldo alalide m

The wf ld sall finthe! h hom sbut f rayoooest ENGO:
TAUShe hen mim, ler ety tst,
I s
BYowisttoue flle or s, bous, w


arrindfre.
Hoond w f y prin aid.

BEToom me, ongahe GBimad ry
Wherd

SThour me t INNGO my n d fFose:
TELENAu, morchorve tond 


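This is still not real English, but it is a clear improvement over the untrained model: the text now has word-like chunks, line breaks and speaker-style headings, and the loss has dropped from about 4.88 to about 2.46.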