<a href="https://colab.research.google.com/github/yebyyy/Infinite-Shakepear/blob/main/lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Building GPT

First, let's download the Tiny Shakespeare text.

In [2]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-06-25 02:22:43--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-06-25 02:22:44 (5.54 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


#### Tokenization

The tokenizer for this project is very simple: it works at the character level, so every distinct character in the text gets its own integer id.

In [5]:
charInt = {Char:Int for Int, Char in enumerate(chars)}
intChar = {Int:Char for Int, Char in enumerate(chars)}

In [6]:
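def encode(txt):
  # map each character to its integer id
  # (indexing raises KeyError on a character outside the vocabulary instead of silently yielding None)
  return [charInt[ch] for ch in txt]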

In [7]:
def decode(intList):
  string = ""
  for i in intList:
    string += intChar[i]
  return string

In [8]:
encode("hello")

[46, 43, 50, 50, 53]

In [9]:
decode(encode("where are you"))

'where are you'
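
As a self-contained illustration of the same idea, here is a minimal sketch on a toy string. The names `toy_text`, `stoi`, `itos`, `toy_encode` and `toy_decode` are hypothetical and not part of the notebook. The sketch only shows that encoding followed by decoding returns the original string, and that a character missing from the vocabulary cannot be encoded.

```python
# Toy corpus (an assumption for illustration, not the Shakespeare text)
toy_text = "hello world"

toy_chars = sorted(set(toy_text))                   # unique characters in a fixed order
stoi = {ch: i for i, ch in enumerate(toy_chars)}    # char -> int
itos = {i: ch for i, ch in enumerate(toy_chars)}    # int -> char

def toy_encode(s):
    # one integer per character
    return [stoi[ch] for ch in s]

def toy_decode(ids):
    # inverse mapping, joined back into a string
    return "".join(itos[i] for i in ids)

ids = toy_encode("hello")
print(ids)
print(toy_decode(ids))

# 'z' never appears in toy_text, so it has no id
try:
    toy_encode("z")
except KeyError as err:
    print("cannot encode:", err)
```

The vocabulary is fixed by the training text, which is why the full notebook builds `chars` from the whole file before encoding anything.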

Now tokenize the entire text

In [10]:
import torch

In [11]:
data = torch.tensor(encode(text), dtype=torch.long)

In [12]:
data.shape, data.dtype

(torch.Size([1115394]), torch.int64)

In [13]:
data[:500]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 

#### Train/Validation Split

In [14]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

#### Chunk (Block) Size

In practice, we don't feed the entire training set to the transformer at once, because that would be computationally prohibitive. Instead, we train on small chunks of the data.

We sample random chunks and feed them to the transformer. Each chunk has a maximum length, called `block_size`.

In [15]:
block_size = 8
train_data[ : block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

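This is how training works: given the characters seen so far, predict the next character. A chunk of `block_size + 1` characters therefore contains `block_size` training examples, one for each context length.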

In [16]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


Training on every prefix teaches the transformer to predict from any context length between 1 and `block_size`. Once a context grows longer than `block_size`, it has to be truncated to that maximum length.
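
A minimal sketch of that truncation, assuming contexts are stored as `(B, T)` tensors and that we keep the most recent tokens. The helper name `crop_context` is hypothetical and not defined in the notebook.

```python
import torch

block_size = 8  # same value as in the notebook

def crop_context(idx: torch.Tensor, block_size: int) -> torch.Tensor:
    """Keep only the last `block_size` tokens of each sequence in a (B, T) tensor."""
    return idx[:, -block_size:]

long_context = torch.arange(12).unsqueeze(0)    # shape (1, 12): longer than block_size
short_context = torch.arange(3).unsqueeze(0)    # shape (1, 3): shorter than block_size

cropped_long = crop_context(long_context, block_size)
cropped_short = crop_context(short_context, block_size)

print(cropped_long.shape, cropped_short.shape)
```

Contexts shorter than `block_size` pass through unchanged, so the same helper can be applied at every generation step.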

#### Batch Size

Each training step feeds the transformer a batch of several chunks, stacked into a single tensor.

Batching exists for efficiency: a GPU can process the chunks in parallel.

The chunks in a batch are processed independently and do not interact with one another.

In [17]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # stacking the one-dimensional chunks gives tensors of shape (batch_size, block_size)
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

#### Simplest Language Model: Bigram

In [18]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x7ca03b3328f0>

In [19]:
class BigramLanguageModel(nn.Module):
  def __init__(self, vocab_size) -> None:
    super().__init__()
    # each token (here, each character) looks up a row of size vocab_size,
    # which is used directly as the logits for the next token
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets=None):
    # targets is None at generation time, when there is nothing to compare against

    logits = self.token_embedding_table(idx)  # (Batch, Time, Channel) = (4, 8, 65)

    if targets is None:
      loss = None
    else:
      # F.cross_entropy(logits, targets) does not work directly on (B, T, C) logits:
      # for multi-dimensional input PyTorch expects the class dimension second, i.e. (B, C, T).
      # So flatten the batch and time dimensions instead.
      B, T, C = logits.shape
      logits = logits.view(B*T, C)
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, idx, max_new_tokens):
    # idx: (B, T) the current context
    for _ in range(max_new_tokens):
      logits, loss = self(idx)
      # only care about the last time step since this is a bigram model
      logits = logits[:, -1, :]  # (B, C)
      probs = F.softmax(logits, dim=-1)  # (B, C)
      idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1): one sampled next token per sequence in the batch
      # each row of the batch is an independent sequence being extended
      idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
    return idx


m = BigramLanguageModel(vocab_size)
out, loss = m(xb, yb)
print(out.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
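
The output shape is `(32, 65)` rather than `(4, 8, 65)` because of the reshape inside `forward`. As a hedged sketch of why that reshape is valid, the snippet below compares it with the alternative of moving the class dimension to position 1. The random `logits` and `targets` are stand-ins for illustration, not values from the model above.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65                       # batch, time, classes (vocabulary size)
logits = torch.randn(B, T, C)            # stand-in for the model output
targets = torch.randint(C, (B, T))       # stand-in for the next-token ids

# Option 1 (the approach used in forward): flatten batch and time into one dimension
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Option 2: put the class dimension second, giving (B, C, T) logits and (B, T) targets
loss_bct = F.cross_entropy(logits.transpose(1, 2), targets)

print(loss_flat.item(), loss_bct.item())
```

Both calls average the same per-token losses, so the two values agree up to floating-point rounding.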


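- A uniform guess over 65 characters has a loss of -ln(1/65) ≈ 4.17, so the untrained model (4.88) starts out worse than uniform: its randomly initialized logits are not all equal.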
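
The baseline can be checked with a couple of lines of standard-library Python. This is a sketch that assumes only the vocabulary size printed earlier (65).

```python
import math

vocab_size = 65  # number of distinct characters found in the text

# cross-entropy of a model that assigns probability 1/vocab_size to every character
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")  # prints 4.1744
```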

Generate from the model

In [20]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))  # [0] picks the single sequence out of the (1, T) batch


Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


#### Train the Bigram Model

Create an AdamW optimizer

In [21]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)  # larger networks usually need a smaller rate (around 1e-4); a small network tolerates a larger one

Use a bigger Batch Size

In [22]:
batch_size = 32

In [23]:
for steps in range(10000):
  # sample a batch of data
  xb, yb = get_batch('train')

  # evaluate loss
  logits, loss = m(xb, yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

print(loss.item())

2.5727508068084717


Generate again to see how we're doing

In [24]:
idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=300)[0].tolist()))  # [0] picks the single sequence out of the (1, T) batch


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht 


#### Self Attention Version 1

Example:

In [25]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

When a token is the prediction target, it must not talk to future tokens; it can only be paired with the tokens that come before it. The simplest way to pair things up is to average the previous tokens and combine that average with the target token.

But averaging is very lossy:
- It is a weak form of interaction, since every earlier token contributes equally.
- It also discards information about token positions.
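
A minimal sketch of this averaging, written with explicit loops for clarity rather than speed. It assumes the average at position `t` includes the token at `t` itself along with everything before it. The name `xbow` (short for "bag of words") is a label chosen for this illustration.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time, channels (same toy sizes as above)
x = torch.randn(B, T, C)

# xbow[b, t] holds the mean of x[b, 0], ..., x[b, t]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, : t + 1]            # (t+1, C): current token and everything before it
        xbow[b, t] = xprev.mean(dim=0)   # average over time, keep the channels

print(xbow.shape)
```

Position 0 averages only itself, so `xbow[:, 0]` equals `x[:, 0]`; later positions blend in progressively more history, and no position ever reads from the future.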