### Read the dataset

In [1]:
with open('data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

### Data exploration and preprocessing of dataset

In [2]:
print("Length of the text is: {}".format(len(text)))

Length of the text is: 1115394


In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
chars = sorted(list(set(text)))
print("".join(chars))


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [5]:
vocab_size = len(chars)
print("Vocabulary size is: {}".format(vocab_size))

Vocabulary size is: 65


### Tokenization

There are different ways to tokenize text. Google's SentencePiece tokenizer [Link](https://github.com/google/sentencepiece) splits sentences into subword units. OpenAI, which created ChatGPT, uses its own tiktoken library [Link](https://github.com/openai/tiktoken), which tokenizes with BPE (byte-pair encoding).

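For this work we will use character-level encoding: each character is mapped to its index in the sorted list of unique characters (`chars`). These indices are not ASCII codes.

There is a tradeoff between the vocabulary size and the length of the encoded sequence. A character-level vocabulary is tiny (65 entries here) but produces one token per character, so sequences are long. Subword vocabularies are much larger but yield shorter sequences.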
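The tradeoff can be seen on a few lines of the dataset. The sketch below is only an illustration: whitespace splitting is a crude stand-in for a real subword tokenizer (it is not how SentencePiece or tiktoken work), and the names `sample`, `char_tokens` and `word_tokens` are made up for this example.

```python
# Compare two tokenization granularities on the same excerpt.
sample = (
    "We are accounted poor citizens, the patricians good. "
    "What authority surfeits on would relieve us: if they "
    "would yield us but the superfluity, while it were "
    "wholesome, we might guess they relieved us humanely; "
    "but they think we are too dear: the leanness that afflicts us"
)

# Character level: small vocabulary, one token per character.
char_tokens = list(sample)
char_vocab = sorted(set(char_tokens))

# Whitespace "word" level (assumption: stands in for coarser tokenizers):
# larger vocabulary, far fewer tokens.
word_tokens = sample.split()
word_vocab = sorted(set(word_tokens))

print(f"char-level: vocab size {len(char_vocab)}, sequence length {len(char_tokens)}")
print(f"word-level: vocab size {len(word_vocab)}, sequence length {len(word_tokens)}")
```

Even on this short excerpt the coarser tokenization needs more vocabulary entries but produces a much shorter sequence.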

In [6]:
stoi = {ch: i for i,ch in enumerate(chars)}
itos = {i: ch for i,ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

print("Encoding for the string \"hii there\" is: {}".format(encode("hii there")))
print("Decoding for the string \"hii there\" is: {}".format(decode(encode("hii there"))))

Encoding for the string "hii there" is: [46, 47, 47, 1, 58, 46, 43, 56, 43]
Decoding for the string "hii there" is: hii there


### Encoding the entire dataset and storing it as a tensor

In [7]:
import torch

In [8]:
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Train/validation split

In [9]:
train_data_len = int(0.9 * len(data))

In [10]:
train_data = data[:train_data_len]
val_data = data[train_data_len:]

### Batch size and batches for training

In [11]:
# block_size is the chunk of data fed to the model at once; elsewhere it is often called the context length

batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length used for a prediction?
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
max_iters = 10000
eval_interval = 1000
n_embed = 32
# if learning rate is lower, number of iterations should be higher
lr = 1e-3
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [12]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for i,c in enumerate(x):
    print("When the context is {}, the most probable next data is {}".format(x[:i+1], y[i]))

When the context is tensor([18]), the most probable next data is 47
When the context is tensor([18, 47]), the most probable next data is 56
When the context is tensor([18, 47, 56]), the most probable next data is 57
When the context is tensor([18, 47, 56, 57]), the most probable next data is 58
When the context is tensor([18, 47, 56, 57, 58]), the most probable next data is 1
When the context is tensor([18, 47, 56, 57, 58,  1]), the most probable next data is 15
When the context is tensor([18, 47, 56, 57, 58,  1, 15]), the most probable next data is 47
When the context is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the most probable next data is 58


In [13]:
torch.manual_seed(1337)

def get_batch(split):
    # generate a small batch of data of input x and outputs y
    data = train_data if split=="train" else val_data
    ix = torch.randint(len(data)-block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    
    return x, y

In [14]:
xb, yb = get_batch("train")
print(f"Inputs are {xb} with shape {xb.shape}")
print(f"Outputs are {yb} with shape {yb.shape}")

Inputs are tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54],
        [57, 43, 60, 43, 52,  1, 63, 43],
        [60, 43, 42,  8,  0, 25, 63,  1],
        [56, 42,  5, 57,  1, 57, 39, 49],
        [43, 57, 58, 63,  6,  1, 58, 46],
        [43,  1, 51, 39, 63,  1, 40, 43],
        [58, 46, 43,  1, 43, 39, 56, 57],
        [39, 58, 47, 53, 52, 12,  1, 37],
        [53, 56, 43,  1, 21,  1, 41, 39],
        [50, 39, 52, 63,  1, 47, 58, 57],
        [56, 53, 63,  1, 42, 47, 42,  1],
        [39, 51,  1, 39, 44, 56, 39, 47],
        [17, 24, 21, 38, 13, 14, 17, 32],
        [ 1, 39, 52, 42,  1, 45, 43, 50],
        [ 1, 58, 46, 39, 58,  1, 42, 53],
        [ 1, 61, 53, 59, 50, 42,  1, 21],
        [59, 57, 40, 39, 52, 42,  1, 40],
        [52, 42,  8,  0,  0, 23, 21, 26],
        [45, 53, 42, 57,  0, 23, 43, 43],
        [52,  1, 61, 39, 57,  1, 51, 53],
        [39, 49, 12,  1

In [15]:
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"When input is {context.tolist()} then most probable output is {target.tolist()}")

When input is [24] then most probable output is 43
When input is [24, 43] then most probable output is 58
When input is [24, 43, 58] then most probable output is 5
When input is [24, 43, 58, 5] then most probable output is 57
When input is [24, 43, 58, 5, 57] then most probable output is 1
When input is [24, 43, 58, 5, 57, 1] then most probable output is 46
When input is [24, 43, 58, 5, 57, 1, 46] then most probable output is 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] then most probable output is 39
When input is [44] then most probable output is 53
When input is [44, 53] then most probable output is 56
When input is [44, 53, 56] then most probable output is 1
When input is [44, 53, 56, 1] then most probable output is 58
When input is [44, 53, 56, 1, 58] then most probable output is 46
When input is [44, 53, 56, 1, 58, 46] then most probable output is 39
When input is [44, 53, 56, 1, 58, 46, 39] then most probable output is 58
When input is [44, 53, 56, 1, 58, 46, 39, 58] then mos

### Baseline model - Bigram model

In [16]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x7ffdd55661f0>

In [17]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [24]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        
    def forward(self, x):
        B,T,C = x.shape

        k = self.key(x) # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)

        # compute attention scores ("affinities" between tokens), scaled by 1/sqrt(head_size)
        # (head_size equals n_embed in this notebook, so the value matches C**-0.5)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B,T,hs) @ (B,hs,T) => (B,T,T)
        wei = wei.masked_fill(self.tril[:T,:T]==0, float('-inf')) # (B,T,T)
        wei = F.softmax(wei, dim=-1) # (B,T,T)

        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)

        out = wei @ v # (B,T,T) @ (B,T,head_size) => (B,T,head_size)

        return out

In [25]:
class BiGramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        
        # each token index is mapped to an n_embed-dimensional embedding vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        # position embedding table
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        # self attention head
        self.sa_head = Head(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
        
    def forward(self, idx, targets=None):
        B,T = idx.shape
        
        # idx and targets are both (B,T) tensors of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) => (Batch, Time, Channel=n_embed)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        
        x = tok_emb + pos_emb # (B,T,C)
        x = self.sa_head(x) # apply one head of self attention (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        
        # F.cross_entropy expects logits as (N, C) and targets as (N,), so we flatten the batch and time dimensions
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens, because the position embedding table only covers block_size positions
            idx_cropped = idx[:,-block_size:]
            
            # get the predictions
            logits, loss = self(idx_cropped)
            # focus only on the last time step
            logits = logits[:,-1,:] # becomes (B,C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples = 1) # (B,1)
            # append sampled index to running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B,T+1)
        return idx
    
model = BiGramLanguageModel()
m = model.to(device)
# logits, loss = m(xb,yb)
# print(logits.shape)
# print(loss)

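We can predict the loss of an untrained model if we know the vocab size: a model that guesses uniformly has a negative log likelihood of `-ln(1/vocab_size)`, which is about 4.17 for `vocab_size = 65`. The loss at step 0 in the training log below (about 4.25) is close to this value.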
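As a quick sanity check on the formula above, the expected starting loss can be computed directly. This is a minimal sketch that hard-codes the vocabulary size of 65 found earlier in the notebook; `uniform_loss` is a name introduced only for this example.

```python
import math

vocab_size = 65  # number of unique characters in the dataset

# A model that gives every character the same probability 1/vocab_size
# has a cross-entropy of -ln(1/vocab_size) = ln(vocab_size) per token.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"Expected loss for a uniform guess: {uniform_loss:.4f}")
```

A freshly initialized model should start close to this number. A much larger starting loss would suggest that the initial predictions are confidently wrong rather than roughly uniform.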

#### Now let us create the model and train it

In [26]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=lr)

In [27]:
for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.2497, val loss 4.2480
step 1000: train loss 2.5221, val loss 2.5427
step 2000: train loss 2.4438, val loss 2.4740
step 3000: train loss 2.4170, val loss 2.4383
step 4000: train loss 2.3873, val loss 2.4235
step 5000: train loss 2.3862, val loss 2.4076
step 6000: train loss 2.3802, val loss 2.3992
step 7000: train loss 2.3738, val loss 2.4066
step 8000: train loss 2.3606, val loss 2.3987
step 9000: train loss 2.3484, val loss 2.3955


In [28]:
# create the starting context on the same device as the model
print(decode(m.generate(idx=torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=500)[0].tolist()))


Ins iees yofrickild che om 't youtfour my sesm heur cech heat my bug ho, lay theich aw harsemanoollt, theed
An onovoliatclonone eces San hiceato,
Tonrut ourefast had heesat to ono our yo ten hendety njeie by, sto!
DULINAS:
Her loure by ciray?


Yowme Whar;
As, sofry wow, aig h wis the mseshe che indy the laschint outr atr, hour foliroshe. Winga nd! I h'treitheesour fain:
Se
thant thas yo wowe od, kint yod this.
The alke Iis, tre thowoum gbe migortald honwios beat wher histt amat se
t cut, ithan 


# Transformers

In the last model we created (Bigram model), the character-level tokens had no way to talk to the tokens that came before them. Each one was generated from the previous token alone.

To begin with, let's assume that the token at the `i`th location interacts with all the tokens before it. The simplest version of this behavior is to average the vectors of the `i`th token and all the tokens that precede it.

In [23]:
torch.manual_seed(1337)
B,T,C = 4,8,2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [24]:
# version 1
# xbow stands for "x bag of words": position t holds the average of tokens 0..t
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)

In [25]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [26]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

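The above method is inefficient because it loops over every batch element and every time step in Python. We can compute the same averages with a single matrix multiplication, which runs much faster in practice. To do so we multiply by a lower triangular matrix whose rows are normalized to sum to 1.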
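To see why a lower triangular matrix produces running averages, here is a tiny worked example in plain Python with `T = 3` time steps and 2 channels. It is only a sketch of the idea; the values in `x` are made up, and the nested lists stand in for the tensors used in the rest of the notebook.

```python
# Tiny illustration: T = 3 time steps, C = 2 channels.
T = 3
x = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# Row t of the weight matrix puts weight 1/(t+1) on positions 0..t
# and weight 0 on all later positions (lower triangular, rows sum to 1).
wei = [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

# Matrix product wei @ x: row t of the result is the mean of x[0..t].
xbow = [[sum(wei[t][s] * x[s][c] for s in range(T)) for c in range(len(x[0]))]
        for t in range(T)]

for row in wei:
    print(row)
print(xbow)
```

The first row of the result is just the first token, the second row is the mean of the first two tokens, and the last row is approximately `[3.0, 4.0]`, the mean of all three.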

In [27]:
# version 2

wei = torch.tril(torch.ones(T,T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x
# both tensors hold the same values; this is the more efficient way
torch.allclose(xbow, xbow2)

True

In [28]:
# version 3

tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

In [41]:
# version 4
# Self attention !

torch.manual_seed(1337)
B,T,C = 4,8,32
x = torch.randn(B,T,C)
x.shape

# single head perform self attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x) # (B,T,16)
q = query(x) # (B,T,16)

wei = q @ k.transpose(-2,-1) # (B,T,16) @ (B,16,T) => (B,T,T)
tril = torch.tril(torch.ones(T,T))

# wei holds the raw attention scores q @ k^T (one dot product per query/key pair)
# in versions 2 and 3 it was a fixed matrix instead:
# wei = torch.zeros((T,T))

# this mask prevents future tokens from passing information to past tokens. If it is removed, every token
# interacts with every other token, which is how the encoder side of a transformer works.
wei = wei.masked_fill(tril==0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)

out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [42]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

### Attention

1. Attention is a communication mechanism
2. Attention has no notion of position: it operates on a set of vectors that carry no information about where they sit in the sequence. That's why we add positional encoding.
3. Examples along the batch dimension are processed independently and never talk to each other.
4. "Self attention" - The keys, queries and values all come from the same source.
5. "Cross attention" - The queries come from one source while the keys and values come from a different one.
6. If the inputs are unit Gaussian, the variance of the raw attention scores `q @ k^T` is of the order of head_size, far from the unit variance we started with. Multiplying by 1/sqrt(head_size) brings the variance back to about 1. This is important so that the result after softmax is not sharpened towards the maximum value.
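
The last point can be checked numerically. The sketch below uses plain Python rather than tensors and draws random unit Gaussian queries and keys. `head_size = 16` matches the earlier example, while `n_samples`, the seed and the helper `variance` are assumptions made for this illustration.

```python
import math
import random

random.seed(0)
head_size = 16
n_samples = 20000

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

raw, scaled = [], []
for _ in range(n_samples):
    # one query and one key vector with unit Gaussian entries
    q = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    k = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    score = sum(qi * ki for qi, ki in zip(q, k))
    raw.append(score)
    # scale the score by 1/sqrt(head_size)
    scaled.append(score / math.sqrt(head_size))

raw_var = variance(raw)
scaled_var = variance(scaled)
print(f"variance of q.k: {raw_var:.2f}")
print(f"variance of q.k / sqrt(head_size): {scaled_var:.2f}")
```

The unscaled scores have a variance close to `head_size`, while the scaled scores have a variance close to 1, which keeps the softmax from saturating at initialization.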