# Evaluating Transformer implementations for small-scale applications

While learning about Transformers for text, I came across several implementations, all purportedly built
on the publicly available details of OpenAI's GPT-2. So I decided to
set up a testbed to compare them. I also added a simple
non-Transformer baseline: a 2-layer feed-forward MLP. To be clear, we're not comparing pre-trained models;
we're comparing implementations of the model (i.e. how these implementations perform, in terms of
accuracy and speed, when configured similarly and trained on the same data).

## Methods being compared
1. HuggingFaceGPT: HuggingFace's [GPT implementation](https://huggingface.co/transformers/v2.0.0/_modules/transformers/modeling_openai.html) (to be clear, we are not using the pre-trained weights or architecture)
2. NanoGPT: Andrej Karpathy's [implementation](https://github.com/karpathy/nanoGPT)
3. TLTransformer: [GPT implementation based on Neel Nanda's lectures](https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/clean-transformer-demo/Clean_Transformer_Demo_Template.ipynb), referred to as "TLTransformer" after his package TransformerLens. No cleanly reusable public implementation of this exists, so I reimplemented it in this notebook with some copy-pasting. The code is fairly small, so this did not pose a problem. 
4. 2-layer MLP: A simple MLP; see the implementation in the "MLP" section below.

## Test beds
1. n-digit addition: Specifically, I used 3-digit addition problems of the form '272+926=1090', where the digits of both operands and of the result are written in reverse (this example encodes 272+629=0901). I have also run tests on subtraction and with more than 3 digits; I have omitted those results since the general patterns observed here held there as well. 
2. Character-level modeling of Shakespeare's poems. 

## Results
1. The most surprising result here is that for n-digit addition (and other math problems as well), the MLP significantly outperforms all Transformer implementations. It trains much faster, reaches perfect accuracy, and is also much faster at inference. 
2. For n-digit addition, among the Transformer implementations, TLTransformer does best: it both reaches near-perfect accuracy and trains faster than HuggingFaceGPT and NanoGPT. 
3. For Shakespeare data, however, the MLP is unable to do well (phew!). No matter how many parameters it is configured with, its test loss is clearly much worse than what either Transformer implementation achieves. 
4. For Shakespeare data, there's roughly similar performance between TLTransformer and NanoGPT. 

## Takeaways
If you care about having something simple which you understand in its entirety, I would say TLTransformer is simpler than NanoGPT. The differences in performance between the implementations probably don't matter in the big picture. If you have a really simple application, consider just throwing in an MLP as a baseline; you may be surprised by its performance. 

# Table of Contents
1. [Setup](#Setup)
2. [N-digit math problems](#N-digit-math-problems)
    1. [Common code for training models and storing the results](#Common-code-for-training-models-and-storing-the-results)
    2. [Huggingface's GPT implementation](#Huggingfaces-gpt-implementation)
    3. [MLP](#MLP)
    4. [NanoGPT](#NanoGPT)
    5. [TLTransformer](#TLTransformer)
    6. [Results for n-digit addition](#Results-for-n-digit-addition)
3. [Shakespeare Data](#Shakespeare-data)
    1. [Using TL Transformer](#Using-tl-transformer)
    2. [Using MLP](#Using-mlp)
    3. [Using NanoGPT](#Using-nanogpt)
    4. [Results](#Results)

# Setup

In [None]:
!nvidia-smi

Mon Mar 13 06:16:43 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    53W / 400W |   8105MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
try:
  import google.colab
  IN_COLAB = True
  print("Running as a Colab notebook")
  %pip install einops
  %pip install transformers
  %pip install fancy_einsum
  %pip install git+https://github.com/neelnanda-io/TransformerLens.git
except ImportError:
  IN_COLAB = False

import torch
import torch.nn as nn
import time
import torch.nn.functional as F
from typing import Any
import random
from transformers import OpenAIGPTConfig, OpenAIGPTModel
from transformer_lens.utils import gelu_new
from dataclasses import dataclass
import dataclasses
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
import sys
if IN_COLAB:
    !git clone https://github.com/karpathy/nanoGPT.git
    sys.path.insert(0, 'nanoGPT')
else:
    sys.path.insert(0, '../../nanoGPT')
from model import GPTConfig, GPT


# N-digit math problems

In [None]:
@dataclass
class DataConfig:
    reverse: bool = False
    upper_bound: int = 999
    each_digit_of_result_separate: bool = False
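    # NOTE: these two defaults are evaluated once, at class-definition time, from the
    # default upper_bound (999). Pass the widths explicitly when changing upper_bound.
    lhs_number_width: int = len(str(upper_bound))
    rhs_number_width: int = len(str(upper_bound))+1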
    num_plus_examples: int = int(3e5)
    num_minus_examples: int = int(3e5)
    test_size: int = 4096

@dataclass
class GeneratedData:
    all_strs: set[str]
    train_x: torch.Tensor
    train_y: torch.Tensor
    test_x: torch.Tensor
    test_y: torch.Tensor
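
One thing to keep in mind with `DataConfig`: the `len(str(upper_bound))` defaults are computed when the class body runs, so they reflect the default `upper_bound` of 999 even if a different value is passed to the constructor. Below is a minimal sketch of one way to derive the widths per instance. `DerivedWidthConfig` is a hypothetical stand-in that only shows the width fields; it is not part of this notebook.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DerivedWidthConfig:
    """Hypothetical variant of DataConfig; only the width fields are shown."""
    upper_bound: int = 999
    # None means "derive from upper_bound" instead of freezing a default.
    lhs_number_width: Optional[int] = None
    rhs_number_width: Optional[int] = None

    def __post_init__(self):
        digits = len(str(self.upper_bound))
        if self.lhs_number_width is None:
            self.lhs_number_width = digits
        if self.rhs_number_width is None:
            # One extra character for the carry (or the unary-minus marker).
            self.rhs_number_width = digits + 1


cfg = DerivedWidthConfig(upper_bound=99999)
print(cfg.lhs_number_width, cfg.rhs_number_width)
```

Explicitly passed widths are left alone, so existing call sites that set them would keep working.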
    

In [None]:
# m stands for unary negative, | indicates end of sequence. 
# "." is reserved for potential future use, but currently unused. 
chars = '0123456789.=+-m|'
end_of_sequence = "|"
encoder = dict((c, i) for i, c in enumerate(chars))
decoder = dict((i, c) for i, c in enumerate(chars))

def reverse_str(s):
    return s[::-1]

def stringify_problem(x, y, math_symbol, reverse=False, lhs_number_width=3, rhs_number_width=4):
    if math_symbol == "+":
        rhs = f"{x+y:0{rhs_number_width}}"
    elif math_symbol == "-":
        if x < y:
            # since we're adding a unary negative, we need to subtract 1 from the width
            if len(str(y-x)) > rhs_number_width - 1:
                raise ValueError(f"{x} minus {y} doesn't fit in {rhs_number_width} digits")
            rhs = "m" + f"{y-x:0{rhs_number_width-1}}" 
        else:
            rhs = f"{x-y:0{rhs_number_width}}"
    else:
        raise ValueError(f"Unsupported math symbol {math_symbol}")
    if reverse:
        rhs = reverse_str(rhs)
    # pad rhs with |s on the right, up to rhs_number_width + 1 characters
    padded_rhs = rhs + end_of_sequence + end_of_sequence * (rhs_number_width - len(rhs))
    str_x = f"{x:0{lhs_number_width}}"
    str_y = f"{y:0{lhs_number_width}}"
    if reverse:
        lhs = reverse_str(str_x) + math_symbol + reverse_str(str_y) + "="
    else:
        lhs = str_x + math_symbol + str_y + "="
    return lhs + padded_rhs

def generate_math_example(cfg: DataConfig, math_symbol = "+"):
    x = random.randint(0, cfg.upper_bound)
    y = random.randint(0, cfg.upper_bound)
    return stringify_problem(x, y, math_symbol, cfg.reverse, cfg.lhs_number_width, cfg.rhs_number_width)

def tensorify_example(example):
    return torch.tensor([encoder[c] for c in example])

## Let's create a function for taking X and Y, and including the digits of Y in X
## This function needs to ensure all the rows of X have the same length, hence
## it's going to be padding with the appropriate number of zeros on the left. 
def include_digits_of_y_in_x(X, Y):
    n = Y.shape[1]
    # We need to create X0, X1, ..., Xn-1
    Xs = []
    Ys = []
    for i in range(n):
        Xs.append(torch.cat([torch.zeros(X.shape[0], n - 1 - i, dtype=torch.long), X, Y[:, :i]], dim=1))
        Ys.append(Y[:, i])
    X = torch.cat(Xs, dim=0)
    Y = torch.cat(Ys, dim=0)
    return X, Y

def generate_data(cfg: DataConfig) -> GeneratedData:
  plus_strs = set([generate_math_example(cfg, math_symbol = "+") for _ in range(cfg.num_plus_examples)])
  minus_strs = set([generate_math_example(cfg, math_symbol = "-") for _ in range(cfg.num_minus_examples)])
  all_strs = plus_strs.union(minus_strs)

  all_examples = torch.stack([tensorify_example(i) for i in all_strs])
  assert cfg.test_size < all_examples.shape[0] * 0.9, "Test size requested is more than 90% of all data"
  train_size = all_examples.shape[0] - cfg.test_size
  rhs_len = cfg.rhs_number_width + 1
  train_x = all_examples[:train_size, :-rhs_len]
  train_y = all_examples[:train_size, -rhs_len:]
  test_x = all_examples[train_size:, :-rhs_len]
  test_y = all_examples[train_size:, -rhs_len:]

  if cfg.each_digit_of_result_separate:
    train_x, train_y = include_digits_of_y_in_x(train_x, train_y)
    test_x, test_y = include_digits_of_y_in_x(test_x, test_y)

  return GeneratedData(all_strs=all_strs, train_x=train_x, train_y=train_y, test_x=test_x, test_y=test_y)
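
`generate_data` takes the test set from the tail of whatever order the `set` happens to iterate in, so the split can differ between interpreter runs (string hashing is randomized by default). If a reproducible split is wanted, one option is to sort first and then shuffle with a seeded generator. The sketch below is only an assumption about how that could look; `split_examples` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random


def split_examples(all_strs, test_size, seed=0):
    """Return (train, test) lists with a split that is stable across runs."""
    ordered = sorted(all_strs)            # remove dependence on set iteration order
    if not 0 < test_size < len(ordered):
        raise ValueError("test_size must be between 1 and len(all_strs) - 1")
    random.Random(seed).shuffle(ordered)  # seeded, so the shuffle is repeatable
    return ordered[:-test_size], ordered[-test_size:]


# Toy data in the same string format as the notebook's addition problems.
examples = {f"{a:03}+{b:03}={a + b:04}|" for a in range(20) for b in range(20)}
train, test = split_examples(examples, test_size=40)
print(len(train), len(test))
```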

In [None]:
## Let's make sure the include_digits_of_y_in_x function works as expected
def test_for_include_digits_of_y_in_x():
    X = tensorify_example("123+456=").unsqueeze(dim=0)
    Y = tensorify_example("579" + end_of_sequence).unsqueeze(dim=0)
    X, Y = include_digits_of_y_in_x(X, Y)
    assert torch.equal(X[0], tensorify_example("000123+456="))
    assert torch.equal(Y[0], tensorify_example("5").squeeze())
    assert torch.equal(X[1] , tensorify_example("00123+456=5"))
    assert torch.equal(Y[1] , tensorify_example("7").squeeze())
    assert torch.equal(X[2] , tensorify_example("0123+456=57"))
    assert torch.equal(Y[2] , tensorify_example("9").squeeze())
    assert torch.equal(X[3] , tensorify_example("123+456=579"))
    assert torch.equal(Y[3] , tensorify_example(end_of_sequence).squeeze())

test_for_include_digits_of_y_in_x()

In [None]:
stringify_problem(123, 456, "+", lhs_number_width=3, rhs_number_width=4, reverse=True)

'321+654=9750|'

In [None]:
def test_for_stringify_problem():
    assert stringify_problem(123, 456, "+", lhs_number_width=3, rhs_number_width=4, reverse=False) == "123+456=0579|"
    assert stringify_problem(123, 999, "+", lhs_number_width=5, rhs_number_width=4, reverse=False) == "00123+00999=1122|"
    assert stringify_problem(123, 999, "-", lhs_number_width=3, rhs_number_width=4, reverse=False) == "123-999=m876|"
    assert stringify_problem(999, 123, "-", lhs_number_width=3, rhs_number_width=4, reverse=False) == "999-123=0876|"
    assert stringify_problem(123, 456, "+", lhs_number_width=3, rhs_number_width=4, reverse=True) == "321+654=9750|"
    assert stringify_problem(23, 45, "+", lhs_number_width=3, rhs_number_width=4, reverse=True) == "320+540=8600|"
    assert stringify_problem(123, 999, "-", lhs_number_width=3, rhs_number_width=4, reverse=True) == "321-999=678m|"

test_for_stringify_problem()
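
To make the reversed encoding easier to read, a small decoder can turn a problem string back into integers. This is a sketch that assumes exactly the format produced by `stringify_problem` above (zero-padded unsigned operands, `m` as the unary minus, `|` as end-of-sequence padding). `parse_problem` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def parse_problem(s, reverse=False):
    """Decode e.g. '321+654=9750|' into (x, symbol, y, result)."""
    lhs, rhs = s.split("=")
    rhs = rhs.rstrip("|")                      # drop end-of-sequence padding
    # Operands are unsigned, so any '+' or '-' in the lhs is the operator.
    symbol = "+" if "+" in lhs else "-"
    str_x, str_y = lhs.split(symbol)
    if reverse:
        str_x, str_y, rhs = str_x[::-1], str_y[::-1], rhs[::-1]
    # 'm' marks a negative result, e.g. 'm876' means -876.
    result = -int(rhs[1:]) if rhs.startswith("m") else int(rhs)
    return int(str_x), symbol, int(str_y), result


print(parse_problem("321+654=9750|", reverse=True))
print(parse_problem("123-999=m876|"))
print(parse_problem("272+926=1090|", reverse=True))
```

Decoding a reversed string this way gives a quick check that the left-hand side and the result agree.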

In [None]:
reversed_add_3digit_separated_cfg = DataConfig(
    each_digit_of_result_separate=True, upper_bound=999, 
    num_plus_examples=int(1e5), num_minus_examples=0, test_size=4096, reverse=True,
)
reversed_add_3digit_separated = generate_data(reversed_add_3digit_separated_cfg)

In [None]:
list(reversed_add_3digit_separated.all_strs)[0]

'272+926=1090|'

## Common code for training models and storing the results

In [None]:
results_dict = {}

In [None]:
@dataclass
class TrainConfig:
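    epochs: int = 1000  # number of optimizer steps (one random mini-batch each), not passes over the data
    train_batch_size: int = 128
    lr: float = 1e-3
    weight_decay: float = 1e-4
    epoch_interval: int = 100  # log metrics and evaluate on the test set every this many steps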
    time_budget_seconds: int = 120
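
Since each "epoch" in this notebook is a single optimizer step on a randomly sampled mini-batch, it can help to convert a step count into an approximate number of passes over the training set. A rough sketch follows; `passes_over_data` is a hypothetical helper, and the row count in the example is made up for illustration.

```python
def passes_over_data(steps, batch_size, num_train_rows):
    """Approximate number of full passes when each step samples one batch."""
    return steps * batch_size / num_train_rows


# Illustrative numbers only: 1000 steps of 2048 examples over 400,000 rows.
print(passes_over_data(steps=1000, batch_size=2048, num_train_rows=400_000))
```

Because batches are sampled with replacement, this is an average: some rows will be seen more often than others.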


In [None]:
@dataclass
class ResultRow:
    model_config: Any
    train_config: TrainConfig
    num_parameters: int
    epochs: int
    train_loss: float
    train_accuracy: float
    test_loss: float
    test_accuracy: float
    train_time_in_seconds: float
    time_per_example_in_micros: float
    train_losses: dict[int, float]
    train_accuracies: dict[int, float]
    test_losses: dict[int, float]
    test_accuracies: dict[int, float]

In [None]:
def train_and_eval(model_config: Any, m: nn.Module, data: GeneratedData, train_config: TrainConfig):
    m = m.to(device)
    num_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"Number of parameters: {num_params}")

    optimizer = torch.optim.AdamW(m.parameters(), lr=train_config.lr, weight_decay=train_config.weight_decay)
    train_x = data.train_x.to(device)
    train_y = data.train_y.to(device)
    test_x = data.test_x.to(device)
    test_y = data.test_y.to(device)

    training_losses = {}
    test_losses = {}
    training_accuracies = {}
    test_accuracies = {}

    outer_start = time.time()
    ep = 0
    while ep < train_config.epochs:
        start = time.time()
        optimizer.zero_grad()
        rind = torch.randint(0, train_x.shape[0], (train_config.train_batch_size,))
        X = train_x[rind]
        Y = train_y[rind]
        output = m(X)
        if isinstance(output, tuple):
            # some models output other stuff besides the logits, let's just use the logits
            logits = output[0][:, -1, :]  # get logits for last token
        else:
            logits = output[:, -1, :]  # get logits for last token
        loss = F.cross_entropy(logits, Y)
        if ep % train_config.epoch_interval == 0:
            preds = torch.argmax(logits, dim=-1)
            training_losses[ep] = loss.item()
            training_accuracies[ep] = torch.sum(preds == Y).item() / preds.shape[0]
        
        loss.backward()
        optimizer.step()
        elapsed = time.time() - start

        if ep % train_config.epoch_interval == 0:
            m.eval()  # disable dropout (used by some of the models) while evaluating
            with torch.no_grad():
                # calculate test loss
                output = m(test_x)
                if isinstance(output, tuple):
                    # some models output other stuff besides the logits, let's just use the logits
                    test_logits = output[0][:, -1, :]
                else:
                    test_logits = output[:, -1, :]
                test_loss = F.cross_entropy(test_logits, test_y)
                test_preds = torch.argmax(test_logits, dim=-1)

                test_losses[ep] = test_loss.item()
                test_accuracies[ep] = torch.sum(test_preds == test_y).item() / test_preds.shape[0]
                print(f"Epoch {ep}, train loss {training_losses[ep]: .3E}, test loss {test_losses[ep]: .3f}, " +
                    f"training accuracy {training_accuracies[ep]: .2f}, test accuracy {test_accuracies[ep]: .2f}, " +
                    f"time per example {elapsed * 1e6 / train_config.train_batch_size: .2f} µs")
                if time.time() - outer_start > train_config.time_budget_seconds:
                    print("Time budget exceeded, hence stopping training")
                    break
                if test_accuracies[ep] > 0.995:
                    print("Test accuracy > 99.5%, hence stopping training")
                    break
            m.train()  # back to training mode for the next step
        ep += 1

    # An empty dict means the corresponding metric was never recorded.
    if not training_losses or not training_accuracies:
        raise RuntimeError("Training did not run at all")
    if not test_losses or not test_accuracies:
        raise RuntimeError("Tests did not run at all")
    
    total_elapsed = time.time() - outer_start
    print(f"Total training time {total_elapsed: .2f} s")
    result_row = ResultRow(
        model_config=model_config,
        train_config=train_config,
        num_parameters=num_params, 
        epochs=ep+1, 
        train_loss=training_losses[max(training_losses.keys())], 
        train_accuracy=training_accuracies[max(training_accuracies.keys())],
        test_loss=test_losses[max(test_losses.keys())],
        test_accuracy=test_accuracies[max(test_accuracies.keys())],
        train_time_in_seconds=total_elapsed,
        time_per_example_in_micros=total_elapsed * 1e6 / ((ep + 1) * train_config.train_batch_size),
        train_losses=training_losses,
        train_accuracies=training_accuracies,
        test_losses=test_losses,
        test_accuracies=test_accuracies,
    )
    return result_row
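
The loss in `train_and_eval` is the cross-entropy of the logits at the last position against the next character. A pure-Python sketch of that quantity for a single example is shown here; `last_token_cross_entropy` is a hypothetical name, and the real code uses `F.cross_entropy` on batched tensors, which averages over the batch. It also gives a handy sanity check: a model that predicts uniformly over the 16-character vocabulary has a loss of ln 16 ≈ 2.77, which is close to most of the epoch-0 losses in the training logs.

```python
import math


def last_token_cross_entropy(logits_per_position, target):
    """Cross-entropy of the final position's logits against one target id."""
    logits = logits_per_position[-1]              # keep only the last position
    m = max(logits)                               # subtract the max for stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[target]              # -log softmax(logits)[target]


vocab_size = 16
uniform = [[0.0] * vocab_size for _ in range(12)]  # 12 positions, all-equal logits
print(round(last_token_cross_entropy(uniform, target=3), 4))
```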

## HuggingFace's GPT implementation

In [None]:
## create wrapper around OpenAIGPTModel, so we get logits and not the last state. 
class OpenAIGPTLMHeadModel(OpenAIGPTModel):
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        #self.apply(self.init_weights)

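    def forward(self, input_ids, position_ids=None, token_type_ids=None, head_mask=None):
        # Optional arguments are passed by keyword so the call does not depend on
        # the positional order of OpenAIGPTModel.forward.
        transformer_outputs = super().forward(
            input_ids, position_ids=position_ids, token_type_ids=token_type_ids, head_mask=head_mask
        )
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states)
        return lm_logits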

In [None]:
train_config = TrainConfig(
    epochs=10000,
    train_batch_size=2048,
    lr=1e-3,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=60,
)
# dataclasses.replace returns a modified copy; train_config itself is unchanged.
dataclasses.replace(train_config, lr=1e-2)

TrainConfig(epochs=10000, train_batch_size=2048, lr=0.01, weight_decay=0.0001, epoch_interval=50, time_budget_seconds=60)
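
`dataclasses.replace` returns a new instance and leaves the original untouched, which is why the learning-rate sweeps in this notebook can reuse one base `train_config`. A minimal illustration with a stand-in dataclass (`SweepConfig` is hypothetical, used only to keep the example self-contained):

```python
import dataclasses
from dataclasses import dataclass


@dataclass
class SweepConfig:
    lr: float = 1e-3
    weight_decay: float = 1e-4


base = SweepConfig()
# One modified copy per learning rate; `base` itself is never mutated.
sweep = [dataclasses.replace(base, lr=lr) for lr in (1e-3, 1e-4, 1e-2)]
print([c.lr for c in sweep], base.lr)
```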

In [None]:
model_config = OpenAIGPTConfig(
        vocab_size=len(chars), 
        n_positions=reversed_add_3digit_separated.train_x.shape[1], 
        n_embd=16, 
        n_layer=4, 
        n_head=4,
    )
train_config = TrainConfig(
    epochs=10000,
    train_batch_size=2048,
    lr=1e-3,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=60,
)
results_dict[
    ("HuggingFaceGPT", f"run_{int(time.time())}t")
] = train_and_eval(
    model_config, OpenAIGPTLMHeadModel(model_config), reversed_add_3digit_separated, train_config
)

results_dict[
    ("HuggingFaceGPT", f"run_{int(time.time())}t")
] = train_and_eval(
    model_config, OpenAIGPTLMHeadModel(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-4)
)

results_dict[
    ("HuggingFaceGPT", f"run_{int(time.time())}t")
] = train_and_eval(
    model_config, OpenAIGPTLMHeadModel(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-2)
)


Number of parameters: 13824
Epoch 0, train loss  3.035E+00, test loss  2.953, training accuracy  0.03, test accuracy  0.05, time per example  23.54 µs
Epoch 50, train loss  2.088E+00, test loss  2.081, training accuracy  0.36, test accuracy  0.37, time per example  13.81 µs
Epoch 100, train loss  1.771E+00, test loss  1.738, training accuracy  0.38, test accuracy  0.40, time per example  13.90 µs
Epoch 150, train loss  1.578E+00, test loss  1.550, training accuracy  0.42, test accuracy  0.43, time per example  15.00 µs
Epoch 200, train loss  1.461E+00, test loss  1.449, training accuracy  0.44, test accuracy  0.46, time per example  14.44 µs
Epoch 250, train loss  1.412E+00, test loss  1.424, training accuracy  0.47, test accuracy  0.46, time per example  15.62 µs
Epoch 300, train loss  1.402E+00, test loss  1.412, training accuracy  0.46, test accuracy  0.46, time per example  13.55 µs
Epoch 350, train loss  1.366E+00, test loss  1.405, training accuracy  0.47, test accuracy  0.46, ti

## MLP

In [None]:
@dataclass
class MLPForSeq2SeqConfig:
    vocab_size: int
    input_len: int
    n_embed: int
    n_hidden: int
    output_len: int

## MLP for sequence to sequence problems.
## It operates in 3 steps: Embed each input token into a vector, concatenate all the vectors, and then pass through an MLP. 
## The output of the MLP is the logits. 
class MLPForSeq2Seq(nn.Module):
    def __init__(self, cfg: MLPForSeq2SeqConfig):
        super().__init__()
        self.cfg = cfg
        self.embed = nn.Embedding(cfg.vocab_size, cfg.n_embed)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embed * cfg.input_len, cfg.n_hidden),
            nn.ReLU(),
            nn.Linear(cfg.n_hidden, cfg.vocab_size * cfg.output_len)
        )
        
    def forward(self, x):
        # x is of shape (batch_size, input_len)
        x = self.embed(x)
        # now x is of shape (batch_size, input_len, n_embed)
        # reshape x to have shape (batch_size, input_len * n_embed)
        x = x.view(-1, self.cfg.n_embed * self.cfg.input_len)
        x = self.mlp(x)
        # now x is of shape (batch_size, vocab_size * output_len)
        # reshape x to have shape (batch_size, output_len, vocab_size)
        x = x.view(-1, self.cfg.output_len, self.cfg.vocab_size)
        return x
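
As a sanity check on model size, the parameter count of `MLPForSeq2Seq` can be worked out by hand. The sketch below assumes `input_len = 12`, which is what the 3-digit setup should produce: 8 prompt characters such as `321+654=` plus 4 slots that hold either left padding or already-emitted result characters. `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(vocab_size, input_len, n_embed, n_hidden, output_len):
    """Count trainable parameters of the embedding plus the 2-layer MLP."""
    embed = vocab_size * n_embed
    hidden = (n_embed * input_len) * n_hidden + n_hidden        # weights + bias
    n_out = vocab_size * output_len
    out = n_hidden * n_out + n_out                              # weights + bias
    return embed + hidden + out


print(mlp_param_count(vocab_size=16, input_len=12, n_embed=8, n_hidden=128, output_len=1))
```

With these settings the result agrees with the `Number of parameters: 14608` line printed by the MLP training run.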

### debugging, ignore

In [None]:
model_config = MLPForSeq2SeqConfig(
    vocab_size=len(chars),
    n_embed=8,
    n_hidden=128,
    input_len=reversed_add_3digit_separated.train_x.shape[1],
    output_len=1,
)
m = MLPForSeq2Seq(model_config)
m(torch.randint(0, len(chars), (10, reversed_add_3digit_separated.train_x.shape[1])))[:, -1, :].shape

torch.Size([10, 16])

In [None]:
type(m(torch.randint(0, len(chars), (10, reversed_add_3digit_separated.train_x.shape[1]))))

torch.Tensor

### run for real

In [None]:
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-3,
    weight_decay=1e-4,
    epoch_interval=500,
    time_budget_seconds=60,
)
model_config = MLPForSeq2SeqConfig(
    vocab_size=len(chars),
    n_embed=8,
    n_hidden=128,
    input_len=reversed_add_3digit_separated.train_x.shape[1],
    output_len=1,
)
results_dict[("MLP", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, MLPForSeq2Seq(model_config), reversed_add_3digit_separated, train_config
)
results_dict[("MLP", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, MLPForSeq2Seq(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-4)
)
results_dict[("MLP", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, MLPForSeq2Seq(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-2)
)


Number of parameters: 14608
Epoch 0, train loss  2.796E+00, test loss  2.749, training accuracy  0.10, test accuracy  0.13, time per example  1.47 µs
Epoch 500, train loss  3.596E-01, test loss  0.355, training accuracy  0.89, test accuracy  0.90, time per example  1.04 µs
Epoch 1000, train loss  3.558E-02, test loss  0.035, training accuracy  1.00, test accuracy  1.00, time per example  1.15 µs
Test accuracy > 99.5%, hence stopping training
Total training time  2.17 s
Number of parameters: 14608
Epoch 0, train loss  2.755E+00, test loss  2.745, training accuracy  0.06, test accuracy  0.08, time per example  1.52 µs
Epoch 500, train loss  1.565E+00, test loss  1.552, training accuracy  0.45, test accuracy  0.45, time per example  1.03 µs
Epoch 1000, train loss  1.434E+00, test loss  1.428, training accuracy  0.47, test accuracy  0.48, time per example  1.41 µs
Epoch 1500, train loss  1.346E+00, test loss  1.345, training accuracy  0.51, test accuracy  0.53, time per example  1.28 µs
Ep

## NanoGPT

In [None]:
model_config = GPTConfig(
    block_size=reversed_add_3digit_separated.train_x.shape[1], 
    vocab_size=len(chars), 
    n_layer=6, n_head=4, n_embd=16, dropout=0.1
)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-3,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=60,
)
results_dict[("NanoGPT", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, GPT(model_config), reversed_add_3digit_separated, train_config
)
results_dict[("NanoGPT", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, GPT(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-4)
)
results_dict[("NanoGPT", f"run_{int(time.time())}t")] = train_and_eval(
    model_config, GPT(model_config), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-2)
)


number of parameters: 0.02M
Number of parameters: 20160
Epoch 0, train loss  2.780E+00, test loss  2.728, training accuracy  0.06, test accuracy  0.20, time per example  33.63 µs
Epoch 50, train loss  2.224E+00, test loss  2.219, training accuracy  0.28, test accuracy  0.29, time per example  20.93 µs
Epoch 100, train loss  1.890E+00, test loss  1.867, training accuracy  0.37, test accuracy  0.39, time per example  22.87 µs
Epoch 150, train loss  1.747E+00, test loss  1.686, training accuracy  0.38, test accuracy  0.40, time per example  20.56 µs
Epoch 200, train loss  1.587E+00, test loss  1.604, training accuracy  0.41, test accuracy  0.40, time per example  21.32 µs
Epoch 250, train loss  1.526E+00, test loss  1.540, training accuracy  0.43, test accuracy  0.43, time per example  21.29 µs
Epoch 300, train loss  1.429E+00, test loss  1.469, training accuracy  0.47, test accuracy  0.45, time per example  2

## TLTransformer

TLTransformer code (it's a bit long, hence hidden)

In [None]:
import einops
from fancy_einsum import einsum
import math

@dataclass
class TLConfig:
    d_model: int = 768
    debug: bool = True
    layer_norm_eps: float = 1e-5
    d_vocab: int = 50257
    init_range: float = 0.02
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12

class LayerNorm(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.w = nn.Parameter(torch.ones(cfg.d_model))
        self.b = nn.Parameter(torch.zeros(cfg.d_model))
    
    def forward(self, residual):
        # residual: [batch, position, d_model]
        if self.cfg.debug: print("Residual:", residual.shape)
        residual = residual - einops.reduce(residual, "batch position d_model -> batch position 1", "mean")
        # Calculate the variance, square root it. Add in an epsilon to prevent divide by zero.
        scale = (einops.reduce(residual.pow(2), "batch position d_model -> batch position 1", "mean") + self.cfg.layer_norm_eps).sqrt()
        normalized = residual / scale
        normalized = normalized * self.w + self.b
        if self.cfg.debug: print("Normalized:", normalized.shape)
        return normalized

class Embed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_E = nn.Parameter(torch.empty((cfg.d_vocab, cfg.d_model)))
        nn.init.normal_(self.W_E, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        embed = self.W_E[tokens, :] # [batch, position, d_model]
        if self.cfg.debug: print("Embeddings:", embed.shape)
        return embed

class PosEmbed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_pos = nn.Parameter(torch.empty((cfg.n_ctx, cfg.d_model)))
        nn.init.normal_(self.W_pos, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        pos_embed = self.W_pos[:tokens.size(1), :] # [position, d_model]
        pos_embed = einops.repeat(pos_embed, "position d_model -> batch position d_model", batch=tokens.size(0))
        if self.cfg.debug: print("pos_embed:", pos_embed.shape)
        return pos_embed

class Attention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_Q = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_Q, std=self.cfg.init_range)
        self.b_Q = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_K = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_K, std=self.cfg.init_range)
        self.b_K = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_V = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_V, std=self.cfg.init_range)
        self.b_V = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        
        self.W_O = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_head, cfg.d_model)))
        nn.init.normal_(self.W_O, std=self.cfg.init_range)
        self.b_O = nn.Parameter(torch.zeros((cfg.d_model)))
        
        self.register_buffer("IGNORE", torch.tensor(-1e5, dtype=torch.float32))
    
    def forward(self, normalized_resid_pre):
        # normalized_resid_pre: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_pre:", normalized_resid_pre.shape)
        
        q = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", normalized_resid_pre, self.W_Q) + self.b_Q
        k = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_K) + self.b_K
        
        attn_scores = einsum("batch query_pos n_heads d_head, batch key_pos n_heads d_head -> batch n_heads query_pos key_pos", q, k)
        attn_scores = attn_scores / math.sqrt(self.cfg.d_head)
        attn_scores = self.apply_causal_mask(attn_scores)

        pattern = attn_scores.softmax(dim=-1) # [batch, n_head, query_pos, key_pos]

        v = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_V) + self.b_V

        z = einsum("batch n_heads query_pos key_pos, batch key_pos n_heads d_head -> batch query_pos n_heads d_head", pattern, v)

        attn_out = einsum("batch query_pos n_heads d_head, n_heads d_head d_model -> batch query_pos d_model", z, self.W_O) + self.b_O
        return attn_out

    def apply_causal_mask(self, attn_scores):
        # attn_scores: [batch, n_heads, query_pos, key_pos]
        mask = torch.triu(torch.ones(attn_scores.size(-2), attn_scores.size(-1), device=attn_scores.device), diagonal=1).bool()
        attn_scores.masked_fill_(mask, self.IGNORE)
        return attn_scores

class MLP(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_in = nn.Parameter(torch.empty((cfg.d_model, cfg.d_mlp)))
        nn.init.normal_(self.W_in, std=self.cfg.init_range)
        self.b_in = nn.Parameter(torch.zeros((cfg.d_mlp)))
        self.W_out = nn.Parameter(torch.empty((cfg.d_mlp, cfg.d_model)))
        nn.init.normal_(self.W_out, std=self.cfg.init_range)
        self.b_out = nn.Parameter(torch.zeros((cfg.d_model)))
    
    def forward(self, normalized_resid_mid):
        # normalized_resid_mid: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_mid:", normalized_resid_mid.shape)
        pre = einsum("batch position d_model, d_model d_mlp -> batch position d_mlp", normalized_resid_mid, self.W_in) + self.b_in
        post = gelu_new(pre)
        mlp_out = einsum("batch position d_mlp, d_mlp d_model -> batch position d_model", post, self.W_out) + self.b_out
        return mlp_out

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg

        self.ln1 = LayerNorm(cfg)
        self.attn = Attention(cfg)
        self.ln2 = LayerNorm(cfg)
        self.mlp = MLP(cfg)
    
    def forward(self, resid_pre):
        # resid_pre [batch, position, d_model]
        normalized_resid_pre = self.ln1(resid_pre)
        attn_out = self.attn(normalized_resid_pre)
        resid_mid = resid_pre + attn_out
        
        normalized_resid_mid = self.ln2(resid_mid)
        mlp_out = self.mlp(normalized_resid_mid)
        resid_post = resid_mid + mlp_out
        return resid_post

class Unembed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_U = nn.Parameter(torch.empty((cfg.d_model, cfg.d_vocab)))
        nn.init.normal_(self.W_U, std=self.cfg.init_range)
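        # Note: requires_grad=False is applied to the zeros tensor, not to the Parameter;
        # nn.Parameter defaults to requires_grad=True, so b_U is still trained.
        self.b_U = nn.Parameter(torch.zeros((cfg.d_vocab), requires_grad=False))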
    
    def forward(self, normalized_resid_final):
        # normalized_resid_final [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_final:", normalized_resid_final.shape)
        logits = einsum("batch position d_model, d_model d_vocab -> batch position d_vocab", normalized_resid_final, self.W_U) + self.b_U
        return logits
    
class TLTransformer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.embed = Embed(cfg)
        self.pos_embed = PosEmbed(cfg)
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_final = LayerNorm(cfg)
        self.unembed = Unembed(cfg)
    
    def forward(self, tokens):
        # tokens [batch, position]
        embed = self.embed(tokens)
        pos_embed = self.pos_embed(tokens)
        residual = embed + pos_embed
        for block in self.blocks:
            residual = block(residual)
        normalized_resid_final = self.ln_final(residual)
        logits = self.unembed(normalized_resid_final)
        # logits have shape [batch, position, logits]
        return logits
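
To make the effect of `apply_causal_mask` concrete, here is a minimal dependency-free sketch. It is written in plain Python rather than PyTorch, so it illustrates the idea and is not the code used in this notebook. It masks the strict upper triangle of a small score matrix, which is what `torch.triu(..., diagonal=1)` selects, and then applies a row-wise softmax. The fill value `IGNORE = -1e5` is an assumption; the real `self.IGNORE` is defined elsewhere in the notebook.

```python
import math

IGNORE = -1e5  # assumed fill value for masked (future) positions


def causal_mask(scores):
    """Return a copy of a [query_pos][key_pos] score matrix with future keys masked."""
    return [
        [IGNORE if k > q else s for k, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
pattern = [softmax(row) for row in causal_mask(scores)]
for row in pattern:
    print([round(p, 3) for p in row])
```

Each row still sums to 1, but every position to the right of the diagonal receives zero attention weight, so a query position can only attend to itself and to earlier positions.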

### Using TLTransformer

In [None]:
tlconfig = TLConfig(
    d_model = 16, 
    debug = False, 
    d_vocab = len(chars), 
    n_ctx = reversed_add_3digit_separated.train_x.shape[1], 
    d_head = 4, n_heads = 4, n_layers = 4, d_mlp = 64)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-3,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=60,
)
results_dict[("TLTransformer", f"run_{int(time.time())}t")] = train_and_eval(
    tlconfig, TLTransformer(tlconfig), reversed_add_3digit_separated, train_config
)
results_dict[("TLTransformer", f"run_{int(time.time())}t")] = train_and_eval(
    tlconfig, TLTransformer(tlconfig), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-4)
)
results_dict[("TLTransformer", f"run_{int(time.time())}t")] = train_and_eval(
    tlconfig, TLTransformer(tlconfig), reversed_add_3digit_separated, 
    dataclasses.replace(train_config, lr=1e-2)
)

Number of parameters: 13872
Epoch 0, train loss  2.778E+00, test loss  2.730, training accuracy  0.03, test accuracy  0.16, time per example  20.18 µs
Epoch 50, train loss  2.171E+00, test loss  2.161, training accuracy  0.40, test accuracy  0.39, time per example  13.98 µs
Epoch 100, train loss  1.790E+00, test loss  1.791, training accuracy  0.43, test accuracy  0.43, time per example  14.48 µs
Epoch 150, train loss  1.585E+00, test loss  1.598, training accuracy  0.47, test accuracy  0.46, time per example  15.03 µs
Epoch 200, train loss  1.496E+00, test loss  1.484, training accuracy  0.46, test accuracy  0.46, time per example  14.75 µs
Epoch 250, train loss  1.455E+00, test loss  1.440, training accuracy  0.45, test accuracy  0.46, time per example  19.40 µs
Epoch 300, train loss  1.421E+00, test loss  1.421, training accuracy  0.46, test accuracy  0.46, time per example  24.74 µs
Epoch 350, train loss  1.474E+00, test loss  1.465, training accuracy  0.43, test accuracy  0.44, ti

## Results for n-digit addition

In [None]:
simple_results = []
for key, value in results_dict.items():
    simple_results.append((key[0], f"{value.train_config.lr: .1e}", value.train_loss, value.test_loss, value.test_accuracy, value.num_parameters, value.time_per_example_in_micros, value.train_time_in_seconds))
import pandas as pd
pd.DataFrame.from_records(simple_results, 
    columns=["model", "Learning Rate", "train_loss", "test_loss", "test_accuracy", 
             "num_parameters", "time_per_example_in_micros", "train_time_in_seconds"]
).round(3)

| | model | Learning Rate | train_loss | test_loss | test_accuracy | num_parameters | time_per_example_in_micros | train_time_in_seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | HuggingFaceGPT | 0.001 | 0.344 | 0.367 | 0.852 | 13824 | 15.079 | 60.249 |
| 1 | HuggingFaceGPT | 0.0001 | 1.42 | 1.431 | 0.461 | 13824 | 15.241 | 60.898 |
| 2 | HuggingFaceGPT | 0.01 | 1.352 | 1.385 | 0.46 | 13824 | 15.501 | 60.349 |
| 3 | MLP | 0.001 | 0.036 | 0.035 | 1.0 | 14608 | 1.058 | 2.169 |
| 4 | MLP | 0.0001 | 0.101 | 0.108 | 0.999 | 14608 | 1.109 | 17.04 |
| 5 | MLP | 0.01 | 0.001 | 0.001 | 1.0 | 14608 | 1.278 | 1.311 |
| 6 | NanoGPT | 0.001 | 1.365 | 1.387 | 0.46 | 20160 | 23.255 | 61.963 |
| 7 | NanoGPT | 0.0001 | 1.728 | 1.72 | 0.412 | 20160 | 23.687 | 60.688 |
| 8 | NanoGPT | 0.01 | 0.117 | 0.122 | 0.956 | 20160 | 23.442 | 60.061 |
| 9 | TLTransformer | 0.001 | 1.382 | 1.384 | 0.463 | 13872 | 17.296 | 60.254 |


# Shakespeare Data

In [None]:
@dataclass
class BardData:
    train: torch.Tensor
    test: torch.Tensor
    vocab_size: int
    stoi: dict[str, int]
    itos: dict[int, str]
    

In [None]:
def make_shakespeare_data():
    import os
    import requests
    
    filename = 'shakespeare.txt'
    if not os.path.exists(filename):
        url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
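        response = requests.get(url)
        response.raise_for_status()  # fail loudly instead of caching an error page as the dataset
        with open(filename, 'w') as f:
            f.write(response.text)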
    with open(filename, 'r') as f:
        text = f.read()
    print("length of dataset in characters: ", len(text))
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    print("all the unique characters:", ''.join(chars))
    print("vocab size:", vocab_size)

    # create a mapping from characters to integers
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }

    def encode(s):
        return [stoi[c] for c in s] 

    n = len(text)
    train_data = text[:int(n*0.9)]
    val_data = text[int(n*0.9):]
    train_ids = encode(train_data)
    val_ids = encode(val_data)
    print(f"train has {len(train_ids)} tokens")
    print(f"val has {len(val_ids)} tokens")

    # Keep dtype=torch.long; don't switch to uint8 in an effort to save memory.
    # F.cross_entropy requires int64 class-index targets, and nn.Embedding does not accept uint8 indices.
    train_ids = torch.tensor(train_ids, dtype=torch.long)
    val_ids = torch.tensor(val_ids, dtype=torch.long)

    return BardData(train_ids, val_ids, vocab_size, stoi, itos)



In [None]:
bd = make_shakespeare_data()

length of dataset in characters:  1115394
all the unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1003854 tokens
val has 111540 tokens
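
The character-level tokenization and 90/10 split in `make_shakespeare_data` can be illustrated on a toy string. This is a minimal sketch: the sample text and the `decode` helper are illustrative additions (the notebook itself only defines `encode`).

```python
text = "to be or not to be"  # toy stand-in for the Shakespeare corpus

# Vocabulary is the sorted set of characters, exactly as in make_shakespeare_data
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of token ids back to a string (hypothetical helper)."""
    return "".join(itos[i] for i in ids)


ids = encode(text)
n = len(ids)
train_ids, val_ids = ids[:int(n * 0.9)], ids[int(n * 0.9):]
print(len(chars), len(train_ids), len(val_ids))
print(decode(ids))
```

Because the split is a plain prefix/suffix cut, the validation set is the final 10% of the text rather than a random sample of it.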


In [None]:
def get_batch(data: BardData, is_train: bool, batch_size: int, block_size: int, is_y_single_token: bool = False):
    d = data.train if is_train else data.test
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    if is_y_single_token:
        y = torch.stack([d[i+block_size] for i in ix])
    else:
        y = torch.stack([d[i+1:i+1+block_size] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

def train_and_eval_bard(model_config: Any, m: nn.Module, data: BardData,
                        train_config: TrainConfig, block_size: int, 
                        is_y_single_token: bool = True, is_nano_gpt: bool = False):
    m = m.to(device)
    num_params = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"Number of parameters: {num_params}")

    optimizer = torch.optim.AdamW(m.parameters(), lr=train_config.lr, weight_decay=train_config.weight_decay)

    training_losses = {}
    test_losses = {}
    training_accuracies = {}
    test_accuracies = {}

    outer_start = time.time()
    ep = 0
    while ep < train_config.epochs:
        start = time.time()
        optimizer.zero_grad()
        x, y = get_batch(data, is_train=True, batch_size=train_config.train_batch_size, 
                         block_size=block_size, is_y_single_token=is_y_single_token)
        if is_nano_gpt:
          output, loss = m(x, y)
          logits = einops.rearrange(output, "b s v -> b v s")
        else:
          output = m(x)
          if type(output) is tuple:
              output = output[0]
          if is_y_single_token:
              logits = output.squeeze()
          else:
              logits = einops.rearrange(output, "b s v -> b v s")
          loss = F.cross_entropy(logits, y)
        if ep % train_config.epoch_interval == 0:
            training_losses[ep] = loss.item()
            if is_y_single_token:
                preds = torch.argmax(logits, dim=-1)
                next_tokens = y
            else:
                # logits are [batch, vocab, seq] here: take the last position, then argmax over vocab
                preds = torch.argmax(logits[:, :, -1], dim=-1)
                next_tokens = y[:, -1]
            training_accuracies[ep] = torch.sum(preds == next_tokens).item() / preds.shape[0]
        
        loss.backward()
        optimizer.step()
        elapsed = time.time() - start

        if ep % train_config.epoch_interval == 0:
            with torch.no_grad():
                #calculate test loss
                test_x, test_y = get_batch(data, is_train=False, batch_size=train_config.train_batch_size, 
                         block_size=block_size, is_y_single_token=is_y_single_token)
                if is_nano_gpt:
                  output, test_loss = m(test_x, test_y)
                  test_logits = einops.rearrange(output, "b s v -> b v s")
                else:
                  output = m(test_x)
                  if type(output) is tuple:
                      output = output[0]
                  if is_y_single_token:
                      test_logits = output.squeeze()
                  else:
                      test_logits = einops.rearrange(output, "b s v -> b v s")
                  test_loss = F.cross_entropy(test_logits, test_y)
                test_losses[ep] = test_loss.item()

                if is_y_single_token:
                    test_preds = torch.argmax(test_logits, dim=-1)
                    next_tokens = test_y
                else:
                    # test_logits are [batch, vocab, seq] here: take the last position, then argmax over vocab
                    test_preds = torch.argmax(test_logits[:, :, -1], dim=-1)
                    next_tokens = test_y[:, -1]
                
                test_accuracies[ep] = torch.sum(test_preds == next_tokens).item() / test_preds.shape[0]
                print(f"Epoch {ep}, train loss {training_losses[ep]: .3E}, test loss {test_losses[ep]: .3f}, " +
                    f"training accuracy {training_accuracies[ep]: .2f}, test accuracy {test_accuracies[ep]: .2f}, " +
                    f"time per example {elapsed * 1e6 / train_config.train_batch_size: .2f} µs")
                if time.time() - outer_start > train_config.time_budget_seconds:
                    print("Time budget exceeded, hence stopping training")
                    break
        ep += 1

    if len(training_losses) == 0 or len(training_accuracies) == 0:
        raise RuntimeError("Training did not run at all")
    if len(test_losses) == 0 or len(test_accuracies) == 0:
        raise RuntimeError("Tests did not run at all")
    
    total_elapsed = time.time() - outer_start
    print(f"Total training time {total_elapsed: .2f} s")
    result_row = ResultRow(
        model_config=model_config,
        train_config=train_config,
        num_parameters=num_params, 
        epochs=ep+1, 
        train_loss=training_losses[max(training_losses.keys())], 
        train_accuracy=training_accuracies[max(training_accuracies.keys())],
        test_loss=test_losses[max(test_losses.keys())],
        test_accuracy=test_accuracies[max(test_accuracies.keys())],
        train_time_in_seconds=total_elapsed,
        time_per_example_in_micros=total_elapsed * 1e6 / ((ep + 1) * train_config.train_batch_size),
        train_losses=training_losses,
        train_accuracies=training_accuracies,
        test_losses=test_losses,
        test_accuracies=test_accuracies,
    )
    return result_row
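
The two target modes of `get_batch` differ only in what is paired with each input window. A minimal sketch on a plain Python list makes this concrete; the notebook operates on tensors and samples offsets at random, so the fixed offset `i` and the toy data here are for illustration only.

```python
data = list(range(10, 20))  # stand-in for a 1-D tensor of token ids
block_size = 4
i = 3                       # fixed start offset; get_batch draws this at random

x = data[i:i + block_size]               # input window
y_single = data[i + block_size]          # is_y_single_token=True: the one next token
y_seq = data[i + 1:i + 1 + block_size]   # is_y_single_token=False: the window shifted by one

print("x        =", x)
print("y_seq    =", y_seq)
print("y_single =", y_single)

# Offsets are drawn from range(len(data) - block_size), so the largest one
# still leaves room for the token that follows the window.
i_max = len(data) - block_size - 1
print("last valid target index:", i_max + block_size, "of", len(data) - 1)
```

The last element of the shifted target equals the single-token target, which is why the sequence models are scored on the final position only: it is the one prediction that is directly comparable with the MLP's output.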

In [None]:
bard_results = {}

## Using TLTransformer

In [None]:
tlconfig = TLConfig(
    d_model = 64, 
    debug = False, 
    d_vocab = bd.vocab_size, 
    n_ctx = 128, 
    d_head = 16, 
    n_heads = 4, 
    n_layers = 4, 
    d_mlp = 256)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=1024,
    lr=1e-2,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=300,
)
bard_results[("TLTransformer", f"run_{int(time.time())}t")] = train_and_eval_bard(
    tlconfig, TLTransformer(tlconfig), bd, train_config, block_size=tlconfig.n_ctx, is_y_single_token=False
)
tlconfig = TLConfig(
    d_model = 96, 
    debug = False, 
    d_vocab = bd.vocab_size, 
    n_ctx = 128, 
    d_head = 24, 
    n_heads = 4, 
    n_layers = 4, 
    d_mlp = 384
)
bard_results[("TLTransformer", f"run_{int(time.time())}t")] = train_and_eval_bard(
    tlconfig, TLTransformer(tlconfig), bd, train_config, block_size=tlconfig.n_ctx, is_y_single_token=False
)


Number of parameters: 216641
Epoch 0, train loss  4.221E+00, test loss  4.210, training accuracy  0.02, test accuracy  0.00, time per example  86.75 µs
Epoch 50, train loss  3.300E+00, test loss  3.335, training accuracy  0.00, test accuracy  0.00, time per example  103.44 µs
Epoch 100, train loss  2.947E+00, test loss  2.962, training accuracy  0.00, test accuracy  0.01, time per example  103.55 µs
Epoch 150, train loss  2.789E+00, test loss  2.784, training accuracy  0.01, test accuracy  0.01, time per example  103.28 µs
Epoch 200, train loss  2.644E+00, test loss  2.639, training accuracy  0.01, test accuracy  0.01, time per example  104.05 µs
Epoch 250, train loss  2.569E+00, test loss  2.556, training accuracy  0.01, test accuracy  0.01, time per example  103.91 µs
Epoch 300, train loss  2.486E+00, test loss  2.509, training accuracy  0.01, test accuracy  0.01, time per example  103.95 µs
Epoch 350, train loss  2.435E+00, test loss  2.458, training accuracy  0.01, test accuracy  0
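
As a sanity check, the parameter count printed above can be reproduced by hand from the configuration. The sketch below assumes the layer layouts of the Clean Transformer template that are not shown in this section: per-head `W_Q`, `W_K`, `W_V` and `W_O` matrices with biases in `Attention`, a scale and a shift vector in `LayerNorm`, and bias-free `Embed` and `PosEmbed` tables. The function name is hypothetical.

```python
def tl_transformer_param_count(d_model, d_vocab, n_ctx, d_head, n_heads, n_layers, d_mlp):
    """Count trainable parameters of TLTransformer under the assumed layer layouts."""
    embed = d_vocab * d_model
    pos_embed = n_ctx * d_model
    layer_norm = 2 * d_model                      # scale and shift
    attn = (4 * n_heads * d_model * d_head        # W_Q, W_K, W_V, W_O
            + 3 * n_heads * d_head                # b_Q, b_K, b_V
            + d_model)                            # b_O
    mlp = 2 * d_model * d_mlp + d_mlp + d_model   # W_in, W_out, b_in, b_out
    block = 2 * layer_norm + attn + mlp           # ln1, attn, ln2, mlp
    unembed = d_model * d_vocab + d_vocab         # W_U and b_U
    return embed + pos_embed + n_layers * block + layer_norm + unembed


# The two Shakespeare configurations used in this section
print(tl_transformer_param_count(64, 65, 128, 16, 4, 4, 256))
print(tl_transformer_param_count(96, 65, 128, 24, 4, 4, 384))
```

Almost all of the parameters sit in the transformer blocks, and within each block the MLP holds roughly twice as many as the attention layer.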

## Using MLP

In [None]:
cfg1 = MLPForSeq2SeqConfig(
    vocab_size=bd.vocab_size,
    input_len=128,
    n_embed=16,
    n_hidden=256,
    output_len=1)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-2,
    weight_decay=1e-4,
    epoch_interval=200,
    time_budget_seconds=300,
)
bard_results[("MLP", f"run_{int(time.time())}t")] = train_and_eval_bard(
    cfg1, MLPForSeq2Seq(cfg1), bd, train_config, block_size=cfg1.input_len, is_y_single_token=True
)
cfg2 = MLPForSeq2SeqConfig(
    vocab_size=bd.vocab_size,
    input_len=128,
    n_embed=8,
    n_hidden=128,
    output_len=1)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-2,
    weight_decay=1e-4,
    epoch_interval=200,
    time_budget_seconds=300,
)
bard_results[("MLP", f"run_{int(time.time())}t")] = train_and_eval_bard(
    cfg2, MLPForSeq2Seq(cfg2), bd, train_config, block_size=cfg2.input_len, is_y_single_token=True
)

Number of parameters: 542289
Epoch 0, train loss  4.216E+00, test loss  5.999, training accuracy  0.01, test accuracy  0.16, time per example  19.51 µs
Epoch 200, train loss  2.293E+00, test loss  2.493, training accuracy  0.33, test accuracy  0.29, time per example  18.44 µs
Epoch 400, train loss  2.301E+00, test loss  2.315, training accuracy  0.33, test accuracy  0.32, time per example  17.59 µs
Epoch 600, train loss  2.148E+00, test loss  2.259, training accuracy  0.37, test accuracy  0.35, time per example  27.56 µs
Epoch 800, train loss  2.176E+00, test loss  2.232, training accuracy  0.37, test accuracy  0.36, time per example  18.03 µs
Epoch 1000, train loss  2.117E+00, test loss  2.216, training accuracy  0.37, test accuracy  0.38, time per example  17.59 µs
Epoch 1200, train loss  2.098E+00, test loss  2.210, training accuracy  0.40, test accuracy  0.37, time per example  18.21 µs
Epoch 1400, train loss  2.049E+00, test loss  2.166, training accuracy  0.41, test accuracy  0.3

In [None]:
cfg = MLPForSeq2SeqConfig(
    vocab_size=bd.vocab_size,
    input_len=128,
    n_embed=16,
    n_hidden=64,
    output_len=1)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-2,
    weight_decay=1e-4,
    epoch_interval=25,
    time_budget_seconds=600,
)
bard_results[("MLP", f"run_{int(time.time())}t")] = train_and_eval_bard(
    cfg, MLPForSeq2Seq(cfg), bd, train_config, block_size=cfg.input_len, is_y_single_token=True
)

Number of parameters: 136401
Epoch 0, train loss  4.149E+00, test loss  4.607, training accuracy  0.02, test accuracy  0.17, time per example  51.37 µs
Epoch 25, train loss  3.201E+00, test loss  3.214, training accuracy  0.18, test accuracy  0.18, time per example  85.03 µs
Epoch 50, train loss  2.968E+00, test loss  2.920, training accuracy  0.21, test accuracy  0.22, time per example  88.08 µs
Epoch 75, train loss  2.727E+00, test loss  2.719, training accuracy  0.24, test accuracy  0.24, time per example  45.28 µs
Epoch 100, train loss  2.569E+00, test loss  2.584, training accuracy  0.28, test accuracy  0.27, time per example  50.96 µs
Epoch 125, train loss  2.512E+00, test loss  2.555, training accuracy  0.29, test accuracy  0.28, time per example  62.23 µs
Epoch 150, train loss  2.437E+00, test loss  2.460, training accuracy  0.29, test accuracy  0.30, time per example  82.49 µs
Epoch 175, train loss  2.460E+00, test loss  2.432, training accuracy  0.30, test accuracy  0.29, tim

## Using NanoGPT

In [None]:
model_config = GPTConfig(
    block_size=128, 
    vocab_size=bd.vocab_size, 
    n_layer=4, n_head=4, n_embd=64, dropout=0.1
)
train_config = TrainConfig(
    epochs=20000,
    train_batch_size=2048,
    lr=1e-2,
    weight_decay=1e-4,
    epoch_interval=50,
    time_budget_seconds=300,
)
bard_results[("NanoGPT", f"run_{int(time.time())}t")] = train_and_eval_bard(
    model_config, GPT(model_config), bd, train_config, block_size=128, is_y_single_token=False, is_nano_gpt=True
)
cfg2 = dataclasses.replace(model_config, n_embd=96)
bard_results[("NanoGPT", f"run_{int(time.time())}t")] = train_and_eval_bard(
    cfg2, GPT(cfg2), bd, train_config, block_size=128, is_y_single_token=False, is_nano_gpt=True
)

number of parameters: 0.20M
Number of parameters: 212416
Epoch 0, train loss  4.193E+00, test loss  3.763, training accuracy  0.01, test accuracy  0.01, time per example  136.27 µs
Epoch 50, train loss  3.315E+00, test loss  3.315, training accuracy  0.01, test accuracy  0.01, time per example  126.46 µs
Epoch 100, train loss  3.312E+00, test loss  3.312, training accuracy  0.01, test accuracy  0.01, time per example  125.70 µs
Epoch 150, train loss  3.308E+00, test loss  3.307, training accuracy  0.01, test accuracy  0.00, time per example  126.57 µs
Epoch 200, train loss  3.117E+00, test loss  3.116, training accuracy  0.01, test accuracy  0.01, time per example  125.18 µs
Epoch 250, train loss  2.947E+00, test loss  2.941, training accuracy  0.01, test accuracy  0.01, time per example  125.28 µs
Epoch 300, train loss  2.744E+00, test loss  2.741, training accuracy  0.01, test accuracy  0.01, time per example  125.31 µs
Epoch 350, train loss  2.626E+00, test loss  2.625, training acc

## Results for Shakespeare data

In [None]:
sb_results = []
for key, value in bard_results.items():
    sb_results.append((key[0], f"{value.train_config.lr: .1e}", value.train_loss, value.test_loss, value.num_parameters, value.time_per_example_in_micros, value.train_time_in_seconds))
import pandas as pd
pd.DataFrame.from_records(sb_results, 
    columns=["model", "Learning Rate", "train_loss", "test_loss",  
             "num_parameters", "time_per_example_in_micros", "train_time_in_seconds"]
).round(3)

| | model | Learning Rate | train_loss | test_loss | num_parameters | time_per_example_in_micros | train_time_in_seconds |
|---|---|---|---|---|---|---|---|
| 0 | MLP | 0.01 | 1.848 | 2.182 | 542289 | 19.452 | 302.81 |
| 1 | MLP | 0.01 | 1.831 | 2.135 | 140105 | 19.259 | 307.696 |
| 2 | TLTransformer | 0.01 | 1.236 | 1.613 | 216641 | 101.261 | 300.808 |
| 3 | TLTransformer | 0.01 | 1.203 | 1.648 | 472385 | 131.621 | 303.39 |
| 4 | NanoGPT | 0.01 | 1.815 | 1.812 | 212416 | 128.491 | 302.885 |
| 5 | NanoGPT | 0.01 | 1.474 | 1.47 | 466080 | 150.595 | 308.727 |
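
One way to read these losses: `F.cross_entropy` reports the mean negative log-likelihood in nats, so `exp(loss)` is the per-character perplexity, and a model that guesses uniformly over the 65-character vocabulary would score `ln(65)`. The sketch below is only a reading aid, using the lowest `test_loss` per model from the table.

```python
import math

vocab_size = 65
uniform_loss = math.log(vocab_size)  # loss of a model that predicts every character equally

# Lowest test_loss per model, copied from the results table
test_losses = {
    "MLP": 2.135,
    "TLTransformer": 1.613,
    "NanoGPT": 1.47,
}

for name, loss in test_losses.items():
    print(f"{name}: loss {loss:.3f} -> perplexity {math.exp(loss):.2f}")
print(f"uniform baseline: loss {uniform_loss:.3f} -> perplexity {vocab_size}")
```

The uniform baseline of about 4.17 is close to the epoch-0 losses in the training logs, which is what one would expect from freshly initialized models.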
