## Building an LLM (Large Language Model) from scratch

In this notebook, we build a character-level language model from scratch using PyTorch. We start with a bigram model and then extend it to a Transformer. We train on the "Tiny Shakespeare" dataset, and the resulting model generates text one character at a time.

In [18]:
import torch

In [None]:
# Opening and reading the text from the file
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Total number of characters: {len(text)}")

In [5]:
# Let's print the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [9]:
# Let's find all the unique characters in the text
chars = sorted(set(text))
print(f"Total number of unique characters: {len(chars)}")
print(''.join(chars))

Total number of unique characters: 65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [15]:
# Let's now write an encode and decode function
itos = { i: c for i, c in enumerate(chars) }
stoi = { c: i for i, c in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode('hii there'))
print(decode(encode('hii there')))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
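
As a quick sanity check of the mapping above, here is a minimal sketch on a toy vocabulary. The names `toy_chars`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are hypothetical stand-ins, chosen so that the notebook's own `stoi` and `itos` are not overwritten. It illustrates two properties: encoding followed by decoding returns the original string, and a character outside the vocabulary raises a `KeyError`.

```python
# Toy vocabulary built the same way as `chars` above (assumed example string)
toy_chars = sorted(set("hello there"))
toy_stoi = {c: i for i, c in enumerate(toy_chars)}
toy_itos = {i: c for i, c in enumerate(toy_chars)}

def toy_encode(s):
    # Map each character to its integer id
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # Map each integer id back to its character
    return ''.join(toy_itos[i] for i in ids)

# Round trip: decode(encode(s)) should give back s
round_trip = toy_decode(toy_encode("hello"))
print(round_trip)

# A character that never appeared in the source text has no id
try:
    toy_encode("hello!")
    unknown_char_encodes = True
except KeyError:
    unknown_char_encodes = False
print(f"Unknown character can be encoded: {unknown_char_encodes}")
```

For this notebook the limitation is harmless, because the vocabulary is built from the same text the model is trained on.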


In [35]:
# Now let's encode the entire text and look at the first 1000 token ids
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Shape of data: {data.shape}")
print(f"Data type: {data.dtype}")
print(data[:1000])

Shape of data: torch.Size([1115394])
Data type: torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57, 

In [38]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"{len(train_data)}, {len(val_data)}")
print(f"train_data type: {train_data.dtype}")
print(f"val_data type: {val_data.dtype}")

1003854, 111540
train_data type: torch.int64
val_data type: torch.int64


In [30]:
block_size = 8

x = data[:block_size]
y = data[1:block_size+1]

for i in range(block_size):
    context = x[:i+1]
    target = y[i]
    print(f"Context: {context} -> Target: {target}")

Context: tensor([18]) -> Target: 47
Context: tensor([18, 47]) -> Target: 56
Context: tensor([18, 47, 56]) -> Target: 57
Context: tensor([18, 47, 56, 57]) -> Target: 58
Context: tensor([18, 47, 56, 57, 58]) -> Target: 1
Context: tensor([18, 47, 56, 57, 58,  1]) -> Target: 15
Context: tensor([18, 47, 56, 57, 58,  1, 15]) -> Target: 47
Context: tensor([18, 47, 56, 57, 58,  1, 15, 47]) -> Target: 58


In [143]:
torch.manual_seed(1337)
batch_size = 4

def get_batch(split, batch_size=4):
    # Pick the split to sample from
    data = train_data if split == "train" else val_data
    # Random start offsets, bounded by the chosen split so that
    # validation batches never index past the end of val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch("train")
print("Shape of xb: ", xb.shape)
print("Shape of yb: ", yb.shape)
print("xb: ", xb)
print("yb: ", yb)


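for b in range(batch_size):
    for i in range(block_size):
        # Index into the batch tensors xb/yb (not the earlier 1-D x/y)
        context = xb[b, :i+1]
        target = yb[b, i]
        print(f"Context: {context.tolist()} -> Target: {target}")

Shape of xb:  torch.Size([4, 8])
Shape of yb:  torch.Size([4, 8])
xb:  tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
yb:  tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
Context: [24] -> Target: 43
Context: [24, 43] -> Target: 58
Context: [24, 43, 58] -> Target: 5
Context: [24, 43, 58, 5] -> Target: 57
Context: [24, 43, 58, 5, 57] -> Target: 1
Context: [24, 43, 58, 5, 57, 1] -> Target: 46
Context: [24, 43, 58, 5, 57, 1, 46] -> Target: 43
Context: [24, 43, 58, 5, 57, 1, 46, 43] -> Target: 39
Context: [44] -> Target: 53
Context: [44, 53] -> Target: 56
Context: [44, 53, 56] -> Target: 1
Context: [44, 53, 56, 1] -> Target: 58
Context: [44, 53, 56, 1, 58] -> Target: 46
Context: [44, 53, 56, 1, 58, 46] -> Target: 39
Context: [44, 53, 56, 1, 58, 46, 39] -> Target: 58
Context: [44, 53, 56, 1, 58, 46, 39, 58] -> Target: 1
Context: [52] -> Target: 58
Context: [52, 58] -> Target: 1
Context: [52, 58, 1] -> Target: 58
Context: [52, 58, 1, 58] -> Target: 46
Context: [52, 58, 1, 58, 46] -> Target: 39
Context: [52, 58, 1, 58, 46, 39] -> Target: 58
Context: [52, 58, 1, 58, 46, 39, 58] -> Target: 1
Context: [52, 58, 1, 58, 46, 39, 58, 1] -> Target: 46
Context: [25] -> Target: 17
Context: [25, 17] -> Target: 27
Context: [25, 17, 27] -> Target: 10
Context: [25, 17, 27, 10] -> Target: 0
Context: [25, 17, 27, 10, 0] -> Target: 21
Context: [25, 17, 27, 10, 0, 21] -> Target: 1
Context: [25, 17, 27, 10, 0, 21, 1] -> Target: 54
Context: [25, 17, 27, 10, 0, 21, 1, 54] -> Target: 39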

In [153]:
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, targets=None):
        # x.shape = (B, T)
        # targets.shape = (B, T)

        # (B, T) -> (B, T, C)
        logits = self.embedding(x)

        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        else:
            loss = None
        return logits, loss

    def generate(self, x, n):
        # x.shape = (B, T); append n sampled tokens to each sequence
        for _ in range(n):
            # (B, T) -> (B, T, C)
            logits, _ = self(x)
            # Keep only the last time step: (B, T, C) -> (B, C)
            logits = logits[:, -1, :]
            # Turn logits into probabilities: (B, C) -> (B, C)
            probs = F.softmax(logits, dim=-1)
            # Sample one token per sequence: (B, C) -> (B, 1)
            x_next = torch.multinomial(probs, num_samples=1)
            # cat[(B, T), (B, 1)] -> (B, T+1)
            x = torch.cat([x, x_next], dim=-1)
        return x

In [154]:
vocab_size = len(chars)
m = BigramLanguageModel(vocab_size=vocab_size)
print(f"m.vocab_size: {m.vocab_size}")

# Let's pass a dummy batch of zeros (token id 0) through the model
b = 2
idx = torch.zeros((b, 1), dtype=torch.long)
logits, loss = m(idx)
print(f"logits.shape: {logits.shape}")

# Calculate the loss
targets = torch.zeros((b), dtype=torch.long)
loss = F.cross_entropy(logits[:, -1, :], targets)
print(f"loss: {loss}")

probs = F.softmax(logits, dim=-1)
# Shape should be [b, 1, vocab_size]
print(f"probs.shape: {probs.shape}")
# Print probs to check manually that every entry is positive
print(f"probs: {probs}")

# Each distribution over the vocabulary (last dim) should sum to 1
print(f"Sum of probs: {probs.sum(dim=-1)}")

m.vocab_size: 65
logits.shape: torch.Size([2, 1, 65])
loss: 5.7610015869140625
probs.shape: torch.Size([2, 1, 65])
probs: tensor([[[0.0031, 0.0027, 0.0027, 0.0202, 0.0086, 0.0411, 0.0016, 0.0050,
          0.0161, 0.0392, 0.0191, 0.0415, 0.0249, 0.0070, 0.0014, 0.0046,
          0.0736, 0.0260, 0.0204, 0.0149, 0.0181, 0.0019, 0.0034, 0.0248,
          0.0165, 0.0202, 0.0174, 0.0123, 0.0023, 0.0050, 0.0246, 0.0068,
          0.0045, 0.0419, 0.0252, 0.0334, 0.0107, 0.0158, 0.0264, 0.0061,
          0.0037, 0.0075, 0.0050, 0.0219, 0.0027, 0.0102, 0.0359, 0.0069,
          0.0060, 0.0021, 0.0325, 0.0049, 0.0027, 0.0100, 0.0337, 0.0139,
          0.0133, 0.0053, 0.0070, 0.0102, 0.0323, 0.0047, 0.0152, 0.0044,
          0.0166]],

        [[0.0031, 0.0027, 0.0027, 0.0202, 0.0086, 0.0411, 0.0016, 0.0050,
          0.0161, 0.0392, 0.0191, 0.0415, 0.0249, 0.0070, 0.0014, 0.0046,
          0.0736, 0.0260, 0.0204, 0.0149, 0.0181, 0.0019, 0.0034, 0.0248,
          0.0165, 0.0202, 0.0174, 0.0123, 0

In [155]:
# Sampling from the distribution
# The last element of the sequence is the one that we are interested in
last_probs = probs[:, -1, :]
print(f"last_probs.shape: {last_probs.shape}")
x_next = torch.multinomial(input=last_probs, num_samples=1)
print(f"x_next.shape: {x_next.shape}")
print(f"x_next: {x_next}")

last_probs.shape: torch.Size([2, 65])
x_next.shape: torch.Size([2, 1])
x_next: tensor([[46],
        [34]])


In [156]:
# Generating using the model's generate method
x = torch.zeros((1, 1), dtype=torch.long)

# An untrained model generates essentially random characters
m = BigramLanguageModel(vocab_size=vocab_size)
print(decode((m.generate(x, n=200)[0]).tolist()))


giWkcz3bTv?luMIKuBlwJDQhrM3qTQT:zhHHycj:VK?CEFv
x.?:?kCrAiIS:vB
YnUr&my3niX.nw&fddP$qqDBqcefCMvtAJDfCcbDMX3kwXc.MJjaoMeXlC?gV.uHznpYc!JTdJ
aF'cMW.,EAGThcnuDgDgm'zUFS$YFvFeAjbGI!$lZEJVhRE IGfTduAcsZcA$


#### Training BigramLanguageModel

Next, we train the bigram language model. It predicts the next character from the previous character alone. Generation is autoregressive: each sampled character is appended to the sequence and fed back into the model to predict the one after it.
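
To make the bigram idea concrete, here is a minimal count-based sketch in plain Python. It is not the neural model trained below, only a classical analogue: it tabulates which character follows which, then samples autoregressively from those counts. The helper names (`fit_bigram_counts`, `sample_next`, `generate_from_counts`) and the toy string are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_counts(text):
    # counts[prev][nxt] = how often character `nxt` follows character `prev`
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def sample_next(counts, prev, rng):
    # Sample the next character in proportion to how often it followed `prev`
    followers = counts[prev]
    candidates = list(followers)
    weights = [followers[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate_from_counts(counts, start, n, rng):
    # Autoregressive loop: each sampled character becomes the next context
    out = [start]
    for _ in range(n):
        if not counts.get(out[-1]):
            break  # this character was never followed by anything
        out.append(sample_next(counts, out[-1], rng))
    return ''.join(out)

toy_text = "hello hello hello"
toy_counts = fit_bigram_counts(toy_text)
toy_sample = generate_from_counts(toy_counts, "h", 10, random.Random(0))
print(toy_sample)
```

The `nn.Embedding(vocab_size, vocab_size)` table in `BigramLanguageModel` plays a similar role: row `i` holds the logits for whichever character follows character `i`, except that the entries are learned by gradient descent rather than counted.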


In [167]:
# Training the model
batch_size = 32
block_size = 8
vocab_size = len(chars)
learning_rate = 1e-3
max_iters = 100
eval_iters = 10

from torch.optim import AdamW

xb, yb = get_batch("train", batch_size=batch_size)
print(f"xb.shape: {xb.shape}")
print(f"yb.shape: {yb.shape}")

xb.shape: torch.Size([32, 8])
yb.shape: torch.Size([32, 8])


In [168]:
# Initialize the model and the optimizer
m = BigramLanguageModel(vocab_size=vocab_size)
optimizer = AdamW(m.parameters(), lr=learning_rate)

In [214]:
max_iters = 10000

for step in range(max_iters):

    # Sample a batch of data
    xb, yb = get_batch("train", batch_size=batch_size)

    # Forward pass
    logits, loss = m(xb, yb)

    # Backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(f"Loss: {loss.item()}")

Loss: 2.4350764751434326
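
The loss printed above comes from a single training batch, so it is a noisy number and says nothing about the validation split. One common way to get a steadier estimate is to average the loss over several batches from each split without tracking gradients. The sketch below shows the idea under stated assumptions: `estimate_loss`, `ToyBigram`, `toy_get_batch`, and the random toy data are hypothetical stand-ins so the cell runs on its own. In the notebook, `m`, `get_batch`, and `eval_iters` could be passed in instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): a random token stream split into train/val
toy_vocab_size = 5
toy_block_size = 4
toy_data = torch.randint(0, toy_vocab_size, (200,))
toy_splits = {"train": toy_data[:180], "val": toy_data[180:]}

def toy_get_batch(split, batch_size=8):
    data = toy_splits[split]
    ix = torch.randint(len(data) - toy_block_size, (batch_size,))
    x = torch.stack([data[i:i + toy_block_size] for i in ix])
    y = torch.stack([data[i + 1:i + toy_block_size + 1] for i in ix])
    return x, y

class ToyBigram(nn.Module):
    # Same interface as BigramLanguageModel: forward returns (logits, loss)
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, targets=None):
        logits = self.embedding(x)  # (B, T) -> (B, T, C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=10, batch_size=8):
    # Average the loss over `eval_iters` random batches per split
    model.eval()
    averages = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split, batch_size=batch_size)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        averages[split] = losses.mean().item()
    model.train()
    return averages

toy_model = ToyBigram(toy_vocab_size)
toy_losses = estimate_loss(toy_model, toy_get_batch)
print(toy_losses)
```

Switching between `eval()` and `train()` makes no difference for a plain embedding table, but it is a harmless habit that matters once layers such as dropout are added.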


#### Results
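With just the `BigramLanguageModel`, the training loss comes down to about 2.4 (2.435 on the final training batch above, which is a noisy estimate because it is measured on a single batch). The generated text shows some structure, such as short word-like chunks and speaker-style labels followed by colons, but it is not meaningful. Next, we will build a Transformer model to improve the results.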
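
To put the loss value in context, it helps to compare it with the cross-entropy of a model that guesses uniformly over the vocabulary, which is `ln(vocab_size)`. The short sketch below does that arithmetic with the vocabulary size and final loss reported earlier in this notebook. The variable names are chosen for illustration only.

```python
import math

# Values copied from the outputs above
reported_vocab_size = 65
reported_final_loss = 2.4350764751434326

# Cross-entropy (in nats) of a uniform guess over the vocabulary
uniform_loss = math.log(reported_vocab_size)

# Perplexity: the effective number of equally likely choices per character
reported_perplexity = math.exp(reported_final_loss)

print(f"Uniform-guess loss: {uniform_loss:.3f}")
print(f"Perplexity at the reported loss: {reported_perplexity:.2f}")
```

The uniform baseline is about 4.17, so the trained bigram model is clearly better than guessing. Its perplexity of roughly 11 means that, on average, it is about as uncertain as if it were choosing among 11 equally likely characters instead of 65.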

In [223]:
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), n=500)[0].tolist()))



A:
Anen.

IUENTo heles, he y onthene h mas:
Jut hyoiththe s yor at y, he, foonghefico, isinor h thed ach o tatheonkicalarck'ean, ththe n, IToo-ld t ld y,
I ey, ite st tmerule th a tapan neme tat d,
ANGowe.
aceagulimar VINGowe watheyoutofolos choerr.
Ke, hencu ld milb, wonelole,
whesp IOMESen O, co s vo bertharoy,
Torot hallanoth y'sher t hon,
Paver asto? pp f hed tin, histard, tome t t
NGr eacomaselldesh o?

QUMAnd ingend stat me; me! pe mantyser my; ivencofon 'de
Wavete be se'lfat, s D:
AREThe
