## makemore

Karpathy's [makemore](https://github.com/karpathy/makemore) is an end-to-end Python application that takes in a plain text file and generates new text similar to what it was given. `makemore` is part of his [neural networks: zero to hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ) series, which I recommend to all.

The example data in the repo is a large collection of baby names (about 30k names). The application trains a state-of-the-art **transformer** model to learn the mechanism of naming things, then samples new names from the learned model.

The purpose of this post is threefold:
- gain familiarity with the **transformer**
- summarise essential `torch` functionality and the workflow for building neural net applications
- use minimal tools, e.g. `argparse`, to produce a command line interface for the app

## torch stuff

In [None]:
import math  # used by NewGELU and CausalSelfAttention below

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

Simplistically, a neural net application is built on two major components: a dataset and a model. `torch` makes both fairly easy and straightforward to build.

### Dataset

The `Dataset` class is a container that behaves like a sequence. To define a custom Dataset class, one must implement `__init__`, `__len__` and `__getitem__`, together with any transformations relevant to the particular use case of the application. Example:

In [None]:
class CharDataset(Dataset):

    def __init__(self, words:list[str], chars:str, max_word_length:int):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        self.stoi = {ch:i+1 for i,ch in enumerate(chars)}
        self.itos = {i:s for s,i in self.stoi.items()} # inverse mapping

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1 # all the possible characters and special 0 token

    def get_output_length(self):
        return self.max_word_length + 1 # <START> token followed by words

    def encode(self, word):
        ix = torch.tensor([self.stoi[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        word = ''.join(self.itos[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix
        y[:len(ix)] = ix
        y[len(ix)+1:] = -1 # index -1 will mask the loss at the inactive locations
        return x, y

The custom transformation in this example maps each character to an integer, aka tokenisation. There is one special token `0` representing both the start and the end of a name. We can loop over a `CharDataset` or index into it because we have implemented the dunder methods needed for this.
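To make the tokenisation concrete, here is a minimal round-trip sketch. The standalone `encode` and `decode` functions are illustrative copies of the methods above (using lowercase letters only, which is an assumption of this sketch), not part of makemore's API. As far as I can tell, `decode` needs plain Python ints: a 0-d tensor is hashed by identity, so looking it up in `itos` should fail.

```python
import string
import torch

# Illustrative stand-ins for CharDataset.encode / CharDataset.decode
chars = string.ascii_lowercase
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 is reserved for the special token
itos = {i: ch for ch, i in stoi.items()}          # inverse mapping

def encode(word):
    return torch.tensor([stoi[ch] for ch in word], dtype=torch.long)

def decode(ix):
    return "".join(itos[i] for i in ix)

ix = encode("emma")
assert ix.tolist() == [5, 13, 13, 1]
assert decode(ix.tolist()) == "emma"   # convert to a list of ints before decoding

try:
    decode(ix)                         # iterating a tensor yields 0-d tensors, not ints
except KeyError:
    print("decode expects plain ints; call .tolist() first")
```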

In [None]:
import string

examples = CharDataset(['emma', 'richard'], string.ascii_letters, 10)
for e in examples:
    print(e)

(tensor([ 0,  5, 13, 13,  1,  0,  0,  0,  0,  0,  0]), tensor([ 5, 13, 13,  1,  0, -1, -1, -1, -1, -1, -1]))
(tensor([ 0, 18,  9,  3,  8,  1, 18,  4,  0,  0,  0]), tensor([18,  9,  3,  8,  1, 18,  4,  0, -1, -1, -1]))


In [None]:
print(examples[0])

(tensor([ 0,  5, 13, 13,  1,  0,  0,  0,  0,  0,  0]), tensor([ 5, 13, 13,  1,  0, -1, -1, -1, -1, -1, -1]))


Let's break down the output tuple `x,y`. 

### `x` tensor
It starts with the `0` token, then each character of the name is mapped to its integer id (1 to 26 for lowercase letters), followed by trailing zeros if the name is shorter than `max_word_length` (set to be the max length of names in the dataset).

### `y` tensor
By definition, it is `x` shifted one token to the left, with some extra care taken over the trailing `-1`s. What is going on here? In language modelling the learning task is next-token prediction: at every index `idx` from `0` to `len(word)`, the model is given `x[:idx+1]` and must predict `y[idx]`. For `idx < len(word)` this target is simply `x[idx+1]`, the next character; at `idx == len(word)` it is the `0` token, which marks the end of the name. At later indices the name has already finished and there is nothing to predict, so by convention we set `y[idx] = -1`, which masks the loss at those positions.
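The relationship between `x` and `y` can be checked directly. The sketch below uses a hypothetical helper `make_xy` that mirrors `CharDataset.__getitem__` for a single lowercase word, then lists the (context, target) pairs the model would be trained on.

```python
import string
import torch

# Hypothetical helper mirroring CharDataset.__getitem__ for one word
def make_xy(word: str, max_word_length: int):
    stoi = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
    ix = torch.tensor([stoi[ch] for ch in word], dtype=torch.long)
    x = torch.zeros(max_word_length + 1, dtype=torch.long)
    y = torch.zeros(max_word_length + 1, dtype=torch.long)
    x[1:1 + len(ix)] = ix
    y[:len(ix)] = ix
    y[len(ix) + 1:] = -1  # inactive positions are masked out of the loss
    return x, y

word = "emma"
x, y = make_xy(word, 10)
n = len(word)

assert torch.equal(y[:n], x[1:n + 1])  # active part: y is x shifted left by one
assert y[n].item() == 0                # then the end-of-name token
assert torch.all(y[n + 1:] == -1)      # then -1 wherever there is nothing to predict

# (context, target) pairs
for idx in range(n + 1):
    print(x[:idx + 1].tolist(), "->", y[idx].item())
# [0] -> 5
# [0, 5] -> 13
# [0, 5, 13] -> 13
# [0, 5, 13, 13] -> 1
# [0, 5, 13, 13, 1] -> 0
```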

Our ultimate goal is to build and train a neural net which can learn from the 30k `x,y` tuples a good way of sampling the next token given some context. Concretely, feed a `0` token to the model and let it sample the next token `t1`; append it to get `[0, t1]`, sample the next token given that context, and continue until a `0` token is sampled, which means we have arrived at the end of a name. Repeat this to produce as many names as we want. We'll get back to inference later.
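A minimal sketch of that sampling loop follows. It assumes only that the model returns a `(logits, loss)` pair with logits of shape `(batch, time, vocab_size)`, as the `Transformer.forward` later in this post does. `sample_name` and `UniformModel` are hypothetical names for illustration; `UniformModel` is an untrained stand-in with uniform logits, so its output is random letters rather than plausible names.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_name(model, max_len: int, generator=None) -> list[int]:
    """Sample token ids until the 0 token is drawn or max_len tokens are produced."""
    idx = torch.zeros(1, 1, dtype=torch.long)        # context starts as the single 0 token
    out = []
    for _ in range(max_len):
        logits, _ = model(idx)                       # (1, t, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        nxt = torch.multinomial(probs, num_samples=1, generator=generator)  # (1, 1)
        if nxt.item() == 0:                          # 0 also marks the end of a name
            break
        out.append(nxt.item())
        idx = torch.cat([idx, nxt], dim=1)           # grow the context by one token
    return out

# Hypothetical stand-in model: uniform logits over 27 tokens, NOT a trained network
class UniformModel(torch.nn.Module):
    def forward(self, idx, targets=None):
        b, t = idx.shape
        return torch.zeros(b, t, 27), None

g = torch.Generator().manual_seed(0)
tokens = sample_name(UniformModel(), max_len=10, generator=g)
itos = {i + 1: ch for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
print("".join(itos[t] for t in tokens))
```

Capping the loop at `max_len` keeps the context within the model's block size even if the end token is never sampled.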

### DataLoader

For efficiency's sake, it is beneficial to stack multiple examples together into a mini-batch and process them all at once. The `DataLoader` class is meant to help us with this. Example:

In [None]:
class InfiniteDataLoader:
    """
    this is really hacky and I'm not proud of it, but there doesn't seem to be
    a better way in PyTorch to just create an infinite dataloader?
    """

    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration: # this will technically only happen after 1e10 samples... (i.e. basically never)
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch

In [None]:
dataset = CharDataset(['emma', 'richard', 'ben', 'steve'],string.ascii_letters, 10)
batch_loader = InfiniteDataLoader(dataset, batch_size=2)

In [None]:
for _ in range(2):
    X,Y = batch_loader.next()
    print(f'{X=}') # B,T = batch_size, max_word_length+1
    print('-'*60)
    print(f'{Y=}') # (B,T)
    print('*'*60)

X=tensor([[ 0, 18,  9,  3,  8,  1, 18,  4,  0,  0,  0],
        [ 0, 19, 20,  5, 22,  5,  0,  0,  0,  0,  0]])
------------------------------------------------------------
Y=tensor([[18,  9,  3,  8,  1, 18,  4,  0, -1, -1, -1],
        [19, 20,  5, 22,  5,  0, -1, -1, -1, -1, -1]])
************************************************************
X=tensor([[ 0,  2,  5, 14,  0,  0,  0,  0,  0,  0,  0],
        [ 0, 19, 20,  5, 22,  5,  0,  0,  0,  0,  0]])
------------------------------------------------------------
Y=tensor([[ 2,  5, 14,  0, -1, -1, -1, -1, -1, -1, -1],
        [19, 20,  5, 22,  5,  0, -1, -1, -1, -1, -1]])
************************************************************
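For comparison with the infinite loader above, here is a small sketch of a plain `DataLoader` over a finite dataset. The toy `TensorDataset` is made up for illustration. One pass over the loader visits every example once, and the last batch is smaller unless `drop_last=True` is set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

xs = torch.arange(10).view(5, 2)  # 5 toy examples, each of length 2
ys = torch.arange(5)              # one toy target per example
toy = TensorDataset(xs, ys)

loader = DataLoader(toy, batch_size=2, shuffle=False)
shapes = [tuple(X.shape) for X, Y in loader]
print(shapes)  # [(2, 2), (2, 2), (1, 2)]

# drop_last=True discards the incomplete final batch
loader = DataLoader(toy, batch_size=2, shuffle=False, drop_last=True)
assert len(loader) == 2
```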


### Tensor

The tensor is the most fundamental data structure for neural nets.

A tensor is a collection of numbers indexed by tuples of non-negative integers. In the above, we've seen that a batch `X` is a (B,T) tensor: we can index `X[b,t]` for `b in range(B)` and `t in range(T)`.

`torch` provides optimised tensor operations and an automatic differentiation engine. Rather than understanding the low-level optimisations (parallel programming as in e.g. CUDA kernels), we just take them for granted and see what we can build with `Tensor`.



In [None]:
torch.arange(24).view(3,4,2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5],
         [ 6,  7]],

        [[ 8,  9],
         [10, 11],
         [12, 13],
         [14, 15]],

        [[16, 17],
         [18, 19],
         [20, 21],
         [22, 23]]])

In [None]:
assert torch.equal(torch.arange(60).view(3,4,5),torch.arange(60).view(3,4,-1))

In [None]:
torch.tril(torch.ones(4,4))

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])

In [None]:
torch.randn(3,4).split(2,dim=1)  # a tuple of 4/2 tensors of shape (3,2) 

(tensor([[-0.1093,  0.0426],
         [ 2.5843,  0.2994],
         [-0.2158, -1.7210]]),
 tensor([[-1.2973,  0.2321],
         [ 1.1551,  0.2394],
         [ 0.4124,  0.1518]]))

In [None]:
att = torch.arange(16).view(4,4)
att.masked_fill(torch.tril(torch.ones(4,4))==0, -99)

tensor([[  0, -99, -99, -99],
        [  4,   5, -99, -99],
        [  8,   9,  10, -99],
        [ 12,  13,  14,  15]])

In [None]:
assert torch.randn(24).view(2,3,4).transpose(1,2).shape == (2,4,3)
assert torch.randn(5).unsqueeze(1).shape == (5,1)
assert torch.cat([torch.ones(3,3), torch.arange(9).view(3,3)], dim=1).shape == (3,3+3)
assert torch.stack([torch.ones(3,3), torch.arange(9).view(3,3)], dim=1).shape == (3,2,3)

### Module

`nn.Module` is the base class for all neural nets, which, mathematically, are just functions that take a Tensor as input and compute another Tensor as output. Just as we can compose functions, we can compose Modules to build complicated architectures.

A custom Module must implement the `forward` method. Example:

In [None]:
class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
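As a quick sanity check, the sketch below compares `NewGELU` with PyTorch's built-in `F.gelu`. It assumes a PyTorch version that supports the `approximate="tanh"` argument (1.12 or later, to my knowledge). It also composes the module with two linear layers through `nn.Sequential`; the layer sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NewGELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-3, 3, 7)
tanh_gelu = NewGELU()(x)

# Same formula as the built-in tanh approximation
assert torch.allclose(tanh_gelu, F.gelu(x, approximate="tanh"), atol=1e-5)

# Close to, but not identical with, the exact (erf-based) GELU
max_gap = (tanh_gelu - F.gelu(x)).abs().max().item()
print(f"max gap to exact GELU: {max_gap:.2e}")

# Modules compose like functions: Linear -> NewGELU -> Linear
mlp = nn.Sequential(nn.Linear(4, 8), NewGELU(), nn.Linear(8, 2))
out = mlp(torch.randn(3, 4))
assert out.shape == (3, 2)
```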

## build Transformer 

The transformer is a custom module built on attention blocks, which themselves are built on multi-head masked self-attention layers.

Let's build from the layer level all the way up to the transformer. Here is our model config.

In [None]:
from dataclasses import dataclass

@dataclass
class ModelConfig:
    block_size: int = None # length of the input sequences of integers
    vocab_size: int = None # the input integers are in range [0 .. vocab_size -1]
    # parameters below control the sizes of each model slightly differently
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4

### attention layer

In [None]:
class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.c_proj(y)
        return y
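To see what the causal mask buys us, here is a stripped-down sketch: a single head, no batch dimension, and no learned projections. `causal_attention` is a hypothetical helper, not the layer above. The checks confirm that each row of attention weights is a probability distribution over current and earlier positions only, and that changing the last token's key and value leaves the outputs at earlier positions unchanged.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Single-head masked self-attention on (T, hs) tensors; illustration only."""
    T = q.size(0)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))  # (T, T) scaled scores
    mask = torch.tril(torch.ones(T, T))
    att = att.masked_fill(mask == 0, float("-inf"))          # hide future positions
    att = F.softmax(att, dim=-1)                             # exp(-inf) = 0 weight
    return att @ v, att

torch.manual_seed(0)
T, hs = 5, 8
q, k, v = torch.randn(T, hs), torch.randn(T, hs), torch.randn(T, hs)
y, att = causal_attention(q, k, v)

assert torch.allclose(att.sum(dim=-1), torch.ones(T))       # rows sum to 1
assert torch.equal(att.triu(diagonal=1), torch.zeros(T, T)) # no weight on the future

# Perturb the key and value of the LAST position only
k2, v2 = k.clone(), v.clone()
k2[-1] += 1.0
v2[-1] += 1.0
y2, _ = causal_attention(q, k2, v2)

assert torch.allclose(y[:-1], y2[:-1])  # earlier positions cannot see the change
print("change at last position:", (y[-1] - y2[-1]).abs().max().item())
```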

### block 

In [None]:
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x

### the entire thing

In [None]:
class Transformer(nn.Module):
    """ Transformer Language Model, exactly as seen in GPT-2 """

    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def get_block_size(self):
        return self.block_size

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss
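The `ignore_index=-1` argument is what makes the `-1` entries of `y` drop out of the loss. The sketch below uses random logits and made-up targets to check that the masked loss equals the plain cross-entropy averaged over the active positions only. As a further sanity check, uniform logits over 27 tokens give a loss of `log(27)`, roughly 3.30, which is about what an untrained model should start from.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 4, 27                      # batch size, sequence length, vocab size
logits = torch.randn(B, T, V)           # random stand-in for model output
targets = torch.tensor([[5, 13, 0, -1],
                        [2,  0, -1, -1]])  # made-up targets, -1 marks inactive positions

# As in Transformer.forward: flatten, then ignore the -1 positions
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1), ignore_index=-1)

# The same value by hand: keep only the active positions, then average
active = targets.view(-1) != -1
manual = F.cross_entropy(logits.view(-1, V)[active], targets.view(-1)[active])
assert torch.allclose(loss, manual)

# Uniform predictions over V tokens give loss log(V)
uniform = F.cross_entropy(torch.zeros(5, V), torch.zeros(5, dtype=torch.long))
assert abs(uniform.item() - math.log(V)) < 1e-5
```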