### Transformer Language Model (based on Andrej Karpathy's nanoGPT tutorial)

In [1]:
# read the entire text file into a single string
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print(f"Total length of dataset in characters: {len(text)}")

Total length of dataset in characters: 1115394


In [16]:
# get vocabulary of characters
vocab = sorted(set(text))
print("character vocabulary: ", vocab)
vocab_size = len(vocab)

character vocabulary:  ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [11]:
# tokenize the text
ctoi = {vocab[i]:i for i in range(vocab_size)}
itoc = {i:vocab[i] for i in range(vocab_size)}
encode = lambda s: [ctoi[c] for c in s]  # converts a string to integer token sequence
decode = lambda s: [itoc[ix] for ix in s]  # converts an integer token sequence to a list of characters (use "".join(...) to get a string)

In [17]:
print(encode('Hello world!'))
print(decode(encode('Hello world!')))

[20, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2]
['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
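
Since `decode` returns a list of characters rather than a string, a round trip needs a final join. The sketch below illustrates this on a tiny stand-in corpus (the `sample_text` string is an assumption used only to keep the example self-contained, so the integer ids differ from those printed above).

```python
# Stand-in corpus; the notebook builds the vocabulary from tinyshakespeare.txt instead.
sample_text = "Hello world! How are you?"

vocab = sorted(set(sample_text))
ctoi = {c: i for i, c in enumerate(vocab)}  # character -> integer id
itoc = {i: c for i, c in enumerate(vocab)}  # integer id -> character

def encode(s):
    """Convert a string to a list of integer tokens."""
    return [ctoi[c] for c in s]

def decode(ids):
    """Convert a list of integer tokens to a list of characters."""
    return [itoc[i] for i in ids]

tokens = encode("Hello world!")
round_trip = "".join(decode(tokens))  # join the characters back into a string
print(round_trip)
```

Every character passed to `encode` must already be in the vocabulary; an unseen character raises a `KeyError`.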


In [18]:
# tokenize the dataset into integer sequence, convert to torch tensor of type int64
import torch

data = torch.tensor(encode(text), dtype=torch.long) 
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [19]:
# train-validation splits (90-10)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

We now split the data into chunks of size block_size. For each chunk, we create (input, target) pairs for next-character prediction, where the input is a context window containing all characters in the chunk that precede the target character. Context sizes range from 1 up to block_size, so each chunk yields block_size (input, target) pairs. Note that this requires block_size + 1 consecutive characters, because the last target lies one position past the end of the chunk.

In [21]:
block_size = 8

# example showing the first chunk and all possible (input,target) pairs we can get from it
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"Context: {context} --> target: {target}")

Context: tensor([18]) --> target: 47
Context: tensor([18, 47]) --> target: 56
Context: tensor([18, 47, 56]) --> target: 57
Context: tensor([18, 47, 56, 57]) --> target: 58
Context: tensor([18, 47, 56, 57, 58]) --> target: 1
Context: tensor([18, 47, 56, 57, 58,  1]) --> target: 15
Context: tensor([18, 47, 56, 57, 58,  1, 15]) --> target: 47
Context: tensor([18, 47, 56, 57, 58,  1, 15, 47]) --> target: 58


Now let's create a batch generator that returns a batch of randomly selected blocks (chunks) from the data.

In [33]:
torch.manual_seed(1223)
batch_size = 4
block_size = 8

def get_batch(split='train'):
    data = train_data if split=='train' else val_data

    # sample positions from which to grab blocks
    ix = torch.randint(len(data)-block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x,y 

xbatch, ybatch = get_batch('train')
print("input batch: ")
print(xbatch.shape)
print(xbatch)
print("target batch: ")
print(ybatch.shape)     
print(ybatch)     
print("")

# context target pairs
print(f"A batch of {batch_size} blocks:")
for b in range(batch_size): # batch dimension
    print(f"\nBlock {b}:")
    for t in range(block_size):  # time dimension
        context = xbatch[b,:t+1]
        target = ybatch[b,t]
        print(f"Context: {context.tolist()} --> target: {target}")
    print("")

input batch: 
torch.Size([4, 8])
tensor([[17, 17, 26,  1, 17, 24, 21, 38],
        [14, 17, 24, 24, 13, 10,  0, 28],
        [63,  1, 47, 52,  1, 56, 43, 55],
        [56, 59, 57, 58,  1, 59, 54, 53]])
target batch: 
torch.Size([4, 8])
tensor([[17, 26,  1, 17, 24, 21, 38, 13],
        [17, 24, 24, 13, 10,  0, 28, 50],
        [ 1, 47, 52,  1, 56, 43, 55, 59],
        [59, 57, 58,  1, 59, 54, 53, 52]])

A batch of 4 blocks:

Block 0:
Context: [17] --> target: 17
Context: [17, 17] --> target: 26
Context: [17, 17, 26] --> target: 1
Context: [17, 17, 26, 1] --> target: 17
Context: [17, 17, 26, 1, 17] --> target: 24
Context: [17, 17, 26, 1, 17, 24] --> target: 21
Context: [17, 17, 26, 1, 17, 24, 21] --> target: 38
Context: [17, 17, 26, 1, 17, 24, 21, 38] --> target: 13


Block 1:
Context: [14] --> target: 17
Context: [14, 17] --> target: 24
Context: [14, 17, 24] --> target: 24
Context: [14, 17, 24, 24] --> target: 13
Context: [14, 17, 24, 24, 13] --> target: 10
Context: [14, 17, 24, 24, 13,

### Now let's create a PyTorch-ified bigram language model (it will serve as a baseline for comparing the transformer model later on)

In [48]:
from torch import nn
from torch.nn import functional as F
torch.manual_seed(1234)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        '''
        Define model parameters
        '''
        # lookup table mapping each input token to the logits (unnormalized log-probabilities) over all possible next tokens
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # shape: (C,C)


    # forward pass takes in a batch of input token sequences of shape (B,T) and corresponding targets of shape (B,T)
    def forward(self, idx, targets=None):
        # get logits for every input token
        logits = self.token_embedding_table(idx) # shape: (B,T,C)
        loss = None
        if targets is not None:
            B,T,C = logits.shape
            # reshape the logits and targets such that batch of input sequences are flattened into a single big input sequence
            # i.e. (B,T) --> (B*T)
            logits = logits.view(B*T,C) # reshaped to (B*T,C)
            targets = targets.view(B*T) # reshaped to (B*T)
            # compute cross entropy loss (i.e. average negative log likelihood)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    # generates new sequences continuing from a given batch of context tokens
    def generate(self, idx, max_new_tokens):
        # batch of contexts, idx has shape (B,T)
        for _ in range(max_new_tokens):
            # get predictions
            logits, _ = self(idx) # shape: (B,T,C)
            # for each context sequence (in the batch), compute the probability of the next token using the logits of the last token in the context sequence
            logits = logits[:,-1,:] # shape: (B,C)
            probs = F.softmax(logits, dim=-1) 
            # sample from the probability distribution to get next token
            idx_next = torch.multinomial(probs, num_samples=1) # shape: (B,1)
            # append to the current context
            idx = torch.cat((idx, idx_next), dim=1) # shape: (B,T+1)
        return idx
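
The reshape in `forward` flattens the batch and time dimensions so that `F.cross_entropy` sees one long list of predictions. The sketch below checks, on random stand-in tensors (the shapes and seed are assumptions chosen to mirror the example batch), that this equals the average negative log-likelihood of the target tokens.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65                      # batch size, block size, vocabulary size
logits = torch.randn(B, T, C)           # stand-in for the model's output
targets = torch.randint(C, (B, T))      # stand-in for the target tokens

# flattened form, as used in BigramLanguageModel.forward
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# manual form: log-probability assigned to each target token, negated and averaged
log_probs = F.log_softmax(logits, dim=-1)                        # (B,T,C)
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (B,T)
manual = nll.mean()

print(torch.allclose(loss, manual))
```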


In [49]:
# create a bigram language model and test it on the example batch
m = BigramLanguageModel(vocab_size=vocab_size)
logits, loss = m(xbatch, ybatch)
print(logits.shape)
print(loss)

# generate a single sequence using the model with start token 0
idx = torch.zeros((1,1), dtype=torch.long)
generated_seq = m.generate(idx, max_new_tokens=100)[0].tolist()
# Decode integer tokens into characters
generated_seq = decode(generated_seq)
print("\nGenerated sequence:\n","".join(generated_seq))


torch.Size([32, 65])
tensor(4.9995, grad_fn=<NllLossBackward0>)

Generated sequence:
 
W?MwDJJwiFUs&vgOaq$KLpDBQRCMw
UVVoH3GYFUMIvMMw!rEOAYFbvdvSYw3?E!s uQFhq''Vutws$F&
jk,hH,eTCDH
A
yY.
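
As a rough sanity check on the initial loss of 4.9995 printed above: a model that spreads probability uniformly over the 65 characters would have a cross-entropy of ln(65). The untrained model scores worse than that, most likely because `nn.Embedding` initialises its weights from a standard normal distribution, so the starting logits are random rather than uniform. The snippet below is a minimal sketch of that arithmetic and is not part of the original notebook.

```python
import math

vocab_size = 65  # size of the character vocabulary printed earlier

# cross-entropy of a uniform prediction: -ln(1/vocab_size) = ln(vocab_size)
uniform_loss = math.log(vocab_size)
print(f"loss of a uniform predictor: {uniform_loss:.4f}")
```

A training loss that falls below this value indicates the model has learned something beyond guessing uniformly.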



The generated sequence looks like gibberish because the model is untrained. We now train the model using a gradient-based optimiser.

In [50]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [56]:
batch_size = 64
num_epochs = 10000

for epoch in range(num_epochs):
    # sample a batch of training data (each "epoch" here is one mini-batch step, not a full pass over the data)
    xb, yb = get_batch('train')
    # evaluate the loss
    _, loss = m(xb, yb)
    # reset parameter gradients
    optimizer.zero_grad(set_to_none=True) 
    # backward pass
    loss.backward()
    # optimizer step
    optimizer.step()

    if epoch % 10 == 0:
        print(f"epoch: {epoch}, training loss: {loss.item()}")    



epoch: 0, training loss: 2.653836488723755
epoch: 10, training loss: 2.7165653705596924
epoch: 20, training loss: 2.7700629234313965
epoch: 30, training loss: 2.7690882682800293
epoch: 40, training loss: 2.727691173553467
epoch: 50, training loss: 2.736506700515747
epoch: 60, training loss: 2.7170002460479736
epoch: 70, training loss: 2.679131031036377
epoch: 80, training loss: 2.7840893268585205
epoch: 90, training loss: 2.707040309906006
epoch: 100, training loss: 2.682605028152466
epoch: 110, training loss: 2.7026965618133545
epoch: 120, training loss: 2.726210117340088
epoch: 130, training loss: 2.7604587078094482
epoch: 140, training loss: 2.6572349071502686
epoch: 150, training loss: 2.6831789016723633
epoch: 160, training loss: 2.6536099910736084
epoch: 170, training loss: 2.701310634613037
epoch: 180, training loss: 2.738854169845581
epoch: 190, training loss: 2.7191481590270996
epoch: 200, training loss: 2.6766655445098877
epoch: 210, training loss: 2.7442188262939453
epoch: 2

Now let's try generating some text using the trained bigram model.

In [61]:
# generate a single sequence using the model with start token 0
idx = torch.zeros((1,1), dtype=torch.long)
generated_seq = m.generate(idx, max_new_tokens=300)[0].tolist()
# Decode integer tokens into characters
generated_seq = decode(generated_seq)
print("\nGenerated sequence:\n","".join(generated_seq))


Generated sequence:
 
ANougeng IISEO, k w's wet
Thea than f as at'lathe moonouroree St t, do copous ET:
fty is?
FO:
QUThey, thothe ye o s agowir pre at ge'dnd as,
OLonkerul amed
HE em ld t bus IZAy itoresisblat ind, man t: mys mary
Is; tee,
Thareel spr bes macKIst he y gbar
QUntled wh: gu; rd methobr wiu.

S:

Anseve pes


This is better! The output has a surface structure similar to the training text (short lines, speaker-style labels ending in a colon) and even contains a few real words. The quality is still very poor because the context window is too small: only the previous character is used to predict the next one.
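
To make the one-character context concrete, the sketch below feeds two different histories that end in the same token through a lookup table of the same kind as the bigram model's. The token ids and the seed are arbitrary assumptions; the point is only that the last-position logits come out identical, so everything before the final token is ignored.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size = 65                                # matches the character vocabulary above
table = nn.Embedding(vocab_size, vocab_size)   # same kind of lookup table as the bigram model

# two contexts with different histories but the same final token (43)
ctx_a = torch.tensor([[18, 47, 56, 43]])
ctx_b = torch.tensor([[1, 1, 1, 43]])

with torch.no_grad():
    last_a = table(ctx_a)[:, -1, :]  # logits for the next token, shape (1, C)
    last_b = table(ctx_b)[:, -1, :]

same = torch.equal(last_a, last_b)
print(same)
```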