<a href="https://colab.research.google.com/github/szhou12/gpt-from-scratch/blob/main/gpt_dev_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT From Scratch
## Resources
- [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy)
- [Andrej Karpathy《从零开始搭建GPT|Let's build GPT from scratch, in code, spelled out》中英字](https://www.bilibili.com/video/BV1v4421c7fr/?spm_id_from=333.337.search-card.all.click&vd_source=0c02ef6f6e7a2b0959d7dd28e9e49da4)

In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-02-19 23:26:23--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-02-19 23:26:23 (15.7 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# read text data in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
  text = f.read()

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [4]:
# let's look at the first 1,000 chars
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# Get all the unique characters that occur in the text
## set(text): make the set of all unique chars in text data
## list(...): to have some ordering so that we can sort it
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## Tokenize [00:09:30]
- Tokenization: converting raw text (a string) into a sequence of integers according to some codebook/rule/vocabulary of elements, i.e. the mapping from strings to integer sequences.
- In the example below, the tokenization rule is simply to encode each character as its index in the sorted vocabulary.
- Commonly used tokenization methods:
  1. Byte-Pair Encoding (BPE)
    - [Byte-Pair Encoding tokenization - Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5)
  2. SentencePiece by Google
  3. tiktoken by OpenAI
- One important observation: there is a trade-off between codebook size and sequence length.
  - The smaller the vocabulary, the longer the sequences of integers.
    - e.g. a character-level vocab is small, so the encoded sequence of ints is long (shown below).
  - The larger the vocabulary, the shorter the sequences of integers.
    - e.g. a sub-word-level vocab is large, so the encoded sequence of ints is short.
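
The trade-off can be seen on a toy string. This is only a sketch: whitespace word splitting stands in for a real sub-word tokenizer (an assumption made to keep the example dependency-free), and `sample` is an invented sentence, not part of the dataset.

```python
# Toy comparison of two tokenization granularities on the same string.
# Assumption: whitespace-separated words stand in for sub-word tokens.
sample = (
    "the quick brown fox jumps over a lazy dog while seven bright kites "
    "drift above quiet green hills and every small river bends toward "
    "distant blue mountains under soft morning light"
)

# Character-level: small vocabulary, one integer per character.
char_vocab = sorted(set(sample))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_stoi[ch] for ch in sample]

# Word-level: larger vocabulary, one integer per word.
words = sample.split()
word_vocab = sorted(set(words))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_stoi[w] for w in words]

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_ids)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_ids)}")
```

On this string the word-level vocabulary is larger than the character-level one, while the word-level sequence is several times shorter. On very short strings the vocabulary comparison can flip, so the trade-off is best read as a statement about realistic corpora.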

In [6]:
# create a mapping from characters to integers
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long) # 1-D tensor
print(data.shape, data.dtype)
print(data[:1000]) # show first 1000 characters we looked at earlier

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

## Train & Test Split [00:13:42]

In [8]:
# let's now split up the data into train and validation sets
n = int(0.9 * len(data)) # first 90% chars will be train, rest val
train_data = data[:n]
val_data = data[n:] # use val set to get a sense of overfitting

## Train On Chunks [00:14:28]
- **Important thing to notice**:
  1. We never feed the entire text into the transformer at once because that would be computationally prohibitive.
  2. We only feed chunks of the text.
  3. We randomly sample chunks from the text dataset and train on a chunk at a time.
  4. These chunks have a preset maximum length (normally called `block_size` or `context_length`).

In [9]:
# set block_size = 8
block_size = 8
train_data[:block_size+1] # first 9 chars. Do you understand why 9 instead of 8?

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

- How many training examples are there in this sequence of integers of length 9?
  - 8 examples!
    - Because to predict the char at position `i` (called the "target"), we use the chars at all positions `0` through `i-1` (called the "context"), i.e. one example = positions `[0, ..., i-1]` predict position `i`.
  - [18] predicts 47, [18, 47] predicts 56, [18, 47, 56] predicts 57, ...
  - The code below illustrates this concept:

In [10]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"Round {t}: when input (aka. context) is {context}, the target: {target}")

Round 0: when input (aka. context) is tensor([18]), the target: 47
Round 1: when input (aka. context) is tensor([18, 47]), the target: 56
Round 2: when input (aka. context) is tensor([18, 47, 56]), the target: 57
Round 3: when input (aka. context) is tensor([18, 47, 56, 57]), the target: 58
Round 4: when input (aka. context) is tensor([18, 47, 56, 57, 58]), the target: 1
Round 5: when input (aka. context) is tensor([18, 47, 56, 57, 58,  1]), the target: 15
Round 6: when input (aka. context) is tensor([18, 47, 56, 57, 58,  1, 15]), the target: 47
Round 7: when input (aka. context) is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target: 58


- **One More Important Thing**
  - Why, for a given block size (=8 in this example), do we train on every context length from 1 all the way up to block_size (=8)?
  - Not only for computational efficiency;
  - but also so that the transformer gets used to seeing contexts from as little as length 1 all the way up to the block size. That is, we want the transformer to be used to all of these context lengths (1, 2, 3, ..., 8).
    - Why? Because this is useful later at inference time.
    - How so? When we start sampling, we can begin generation with a context of length 1, and the transformer will know how to predict in that situation. Similarly, the transformer will know how to predict with context lengths `2, 3, ..., block_size`.
    - In short: get the model used to predicting from inputs of any length (`1, 2, 3, ..., block_size`).

## Batch dimension [00:18:07]
- Notice that, so far, the tensor is 1-D (call it the "time dimension").
- Now onto the batch dimension! Each time we sample, we retrieve a batch of chunks instead of one chunk, and feed this batch (multiple chunks) into the transformer all at once.
- Why? Mainly for computational efficiency, because GPUs are good at processing data in parallel.
- But notice! Each chunk in the batch is processed independently! They don't talk to each other!
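
The independence claim can be checked with a minimal sketch. It assumes a random lookup table (`table`, a hypothetical stand-in for a model) and random token ids instead of the real dataset; the point is only that processing the batch at once and processing each row alone give identical results.

```python
import torch

torch.manual_seed(0)
vocab_size, batch_size, block_size = 65, 4, 8

# Hypothetical stand-in for a model: a random lookup table indexed by token id.
table = torch.randn(vocab_size, vocab_size)
xb = torch.randint(vocab_size, (batch_size, block_size))  # (B, T) batch of token ids

batched = table[xb]                                    # (B, T, C): whole batch at once
one_by_one = torch.stack([table[row] for row in xb])   # each row processed on its own

print(batched.shape)
print(torch.equal(batched, one_by_one))
```

Batching changes how much work is done per call, not what is computed for any individual chunk.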

In [11]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel in every forward-backward pass?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs X and targets y
    data = train_data if split == "train" else val_data
    # draw batch_size random start offsets from [0, len(data) - block_size); ix is a 1-D tensor of shape (batch_size,)
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix]) # inputs X: (batch_size, block_size)
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets (true y): X shifted by one position, used to compute the loss
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

## Feeding Data to Neural Network - Bigram Language Model [00:22:14]
- Start with a simple neural network: Bi-gram language model (details in his previous courses)

### PyTorch Recap
1. `import torch.nn as nn`
    - The `torch.nn` module contains classes and functions that are used for building neural networks in PyTorch. This includes the foundational building blocks for neural networks, such as layers (e.g., `Linear`, `Conv2d`), and modules (`Module`), which are the base class for all neural network modules.
    - `nn` (Object-oriented API): Provides classes that allow you to encapsulate parameters and helpers in objects. This is useful for defining complex models and is typically preferred when designing architectures.
2. `from torch.nn import functional as F`
    - The `torch.nn.functional` module contains function versions of many of the operations and layers available in `torch.nn`.
    - These functions include activation functions (`F.relu`, `F.sigmoid`), operations used in convolutional neural networks (`F.conv2d`, `F.max_pool2d`), and loss functions (`F.cross_entropy`), among others.
    - Unlike the classes in `torch.nn`, which require creating an instance of the class (e.g., `layer = nn.Conv2d(...)`), functions in `torch.nn.functional` can be used directly by passing inputs and any necessary parameters (e.g., `output = F.conv2d(input, weight)`).
    - `F` (Functional API): Provides stateless, functional alternatives to the classes in `nn`. This is useful for operations that don't require storing state (parameters), such as applying activations, performing a convolution operation with dynamically created filters, or applying a loss function directly within the model's forward method.
3. `nn.Embedding`
    - A PyTorch layer that's typically used to convert token indices in a vocabulary into dense vector representations (embeddings). The layer takes two main arguments: **the number of embeddings** and **the dimensionality of each embedding vector**. In most natural language processing (NLP) tasks, the dimensionality of embeddings is much smaller than the vocabulary size, facilitating efficient representation of words or tokens. The bigram model below is a special case: both arguments are `vocab_size`.
    - Assume `vocab_size=65`. This layer (`self.token_embedding_table`) then acts as a 65x65 table where **each row corresponds to a current token** and **each column to a possible next token**, **with the values being the logits** (raw predictions prior to normalization).
    - When you pass an index (or indices) to this layer, it returns the corresponding row(s) of the table (the entries of a row are its channels). For instance, if the input index is 39 (for `a` in the vocabulary above), the layer returns row 39 (0-indexed) of the table, which is a vector of size 65. Each element of this vector is the model's logit for the corresponding vocabulary token being the next token after `a`.
4. Why Use Cross-Entropy For Loss?
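
The points above can be checked with a small sketch. It assumes `vocab_size = 65` as in this notebook; the names `table`, `idx`, and `targets` and the specific indices are illustrative only. It shows that the embedding layer is a plain row lookup, and it offers one common answer to item 4: cross-entropy is the average negative log-probability the model assigns to the true next token, so minimizing it amounts to maximum-likelihood training. It also shows the functional and module forms of the loss giving the same value.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 65  # assumption: same size as the character vocabulary in this notebook

# nn.Embedding(num_embeddings, embedding_dim) stores a (65, 65) weight matrix.
table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([39, 46])   # two illustrative token indices
logits = table(idx)            # rows 39 and 46 of the weight matrix -> shape (2, 65)
same_rows = torch.equal(logits, table.weight[idx])

# Cross-entropy = mean negative log-probability of the true next token.
targets = torch.tensor([46, 47])                  # illustrative "true next tokens"
log_probs = F.log_softmax(logits, dim=-1)         # (2, 65)
manual = -log_probs[torch.arange(2), targets].mean()

functional = F.cross_entropy(logits, targets)     # stateless functional API
module = nn.CrossEntropyLoss()(logits, targets)   # object-oriented API, same computation

print(same_rows)
print(manual.item(), functional.item(), module.item())
```

The three loss values agree up to floating-point error, which is why the model's `forward` can call `F.cross_entropy` directly without storing a loss object.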

In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        '''
        idx: (B,T) tensor of integer. data. xb from previous cell
        targets (Optional): (B,T) tensor of integer. true y. yb from previous cell. Make it optional as generate() won't use loss.
        '''

        # idx and targets are both (B, T) tensor of integers
        # in effect: for every input token in idx, look up the scores (logits) of all possible next tokens; these form the channel dimension, of length vocab_size
        logits = self.token_embedding_table(idx) # (B,T,C) = (batch, time, channel/class=vocab_size)

        if targets is None:
            loss = None
        else:
            # reshape first: F.cross_entropy expects the class dimension C to be the second dimension, i.e. (N, C), so passing (B,T,C) as-is raises an error
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # squash first 2 dims into 1 dim, keep the 3rd dim
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        '''
        idx: (B, T) array of indices in the current context.
            Note! indices meant here are integer representations of chars, so essentially idx are data (context) we feed
        max_new_tokens: # of new tokens to be predicted and appended to idx
        '''
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions (logits)
            logits, loss = self(idx) # self(idx) calls forward()

            # focus only on the last time step
            # NOTE: in this bigram model, logits[:, i, :] is computed from the i-th token ALONE,
            # not from tokens 0..i together (an easy misreading of the (B, T, C) shape).
            # e.g. the 1st logit comes only from the 1st token, the 2nd logit only from the 2nd token, and so on.
            # So -1 selects the logits at the last position, which use only the last token and no earlier tokens.
            logits = logits[:, -1, :] # becomes (B, C)

            # apply softmax to get the probabilities
            # dim=-1: compute softmax along the last dimension of logits (i.e. C)
            # for each row, softmax runs across all classes, so every row in the batch is normalized independently
            probs = F.softmax(logits, dim=-1) # (B, C)

            # sample from the distribution
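            # for each row, randomly sample one class according to the class probabilities
            # e.g. if a row's probabilities are [0.1, 0.2, 0.5, 0.2], multinomial draws one class (num_samples=1) with those probabilities
            # so the element at i=2 is the most likely to be drawn, since its probability is the highest (0.5)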
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            # append sampled index to the running sequence
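            # dim=1: concatenate along the second dimension (the sequence-length dimension), appending the newly sampled tokens as a new column
            # [[24, 43], -> [[24, 43, 58],
            #  [52, 58]] ->  [52, 58,  1]]
            # Note: the sampled value is a token index, which here corresponds directly to one char; with other tokenizers an index may correspond to a multi-character token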
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

        return idx



# 1. instantiate BigramLanguageModel class;
# 2. init token_embedding_table as an embedding layer, with logits randomly generated
m = BigramLanguageModel(vocab_size)

logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

# Given [[0]] (1 batch, 1 time step), sequentially generate 100 tokens after first token: [[0]] -> [[0, 1]] -> [[0, 1, 3]] -> ...
# Remember, in this case, each newly generated token depends ONLY on the single token immediately before it. NO HISTORY USED HERE! Can you recall why?
# [0].tolist(): take out the content of the first data from the batch and convert it to list
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
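
As a rough sanity check on the loss printed above, it can be compared with the loss of a model that knows nothing and spreads probability uniformly over all 65 characters. The sketch below assumes only `vocab_size = 65`; the all-zero `logits` tensor is an illustrative device for producing a uniform softmax.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65  # assumption: same vocabulary size as in this notebook

# A uniform predictor assigns probability 1/65 to every next character,
# so its cross-entropy is -ln(1/65).
uniform_loss = -math.log(1.0 / vocab_size)

# Same value via PyTorch: all-zero logits give a uniform softmax, whatever the target is.
logits = torch.zeros(1, vocab_size)
target = torch.tensor([0])
torch_loss = F.cross_entropy(logits, target).item()

print(f"{uniform_loss:.4f} {torch_loss:.4f}")
```

Both values come out at about 4.17. The untrained model's 4.8786 is higher because `nn.Embedding` weights are initialized from a standard normal distribution, so the initial logits are not uniform and some probability mass is confidently placed on wrong characters.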


## Train The Model [00:34:50]

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32

iters = 100 # increase # of steps for better results
for steps in range(iters):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())