### Part 3: Applications | Addition

**Tutorial on Transformers for Mathematics**

*Simons Institute and SLMath Joint Workshop: AI for Mathematics and Theoretical Computer Science, April 8 2025*

Author: Sean Welleck

------

This notebook trains a transformer language model on a synthetic addition dataset, using the [makemore]() library as a black box.

Then we generate outputs with the language model and evaluate the outputs for correctness.

The task is **4-digit addition**.


------


#### Generate a dataset

In [3]:
!python generate_addition_data.py

!head -n 5 data/addition_train.txt

Train: 4990000 Test: 10000
7272+3991=11263
4576+3741=8317
4180+3775=7955
2503+3478=5981
5642+1208=6850
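
The script `generate_addition_data.py` is not shown in this notebook. The following is only a rough sketch of what such a generator might look like. It assumes both operands are drawn uniformly from the 4-digit integers and that train and test problems are kept disjoint; both are assumptions based on the sample lines above, and `make_example` and `make_splits` are hypothetical names, not functions from the script.

```python
import random

def make_example(rng, n_digits=4):
    """Return one problem string such as '7272+3991=11263'."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a}+{b}={a + b}"

def make_splits(n_train, n_test, seed=0):
    """Sample distinct problems, then split them into disjoint train/test lists."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n_train + n_test:
        seen.add(make_example(rng))
    examples = sorted(seen)  # fixed order before shuffling, for reproducibility
    rng.shuffle(examples)
    return examples[:n_train], examples[n_train:]

train, test = make_splits(n_train=1000, n_test=100)
print("Train:", len(train), "Test:", len(test))
print("\n".join(train[:5]))
```

Deduplicating before splitting is what guarantees that no test problem also appears in the training file, which the evaluation below relies on.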


#### Train a transformer language model on the dataset

In [None]:
!python makemore.py -i data/addition_train.txt --n-layer 8 --n-head 4 --n-embd 128 --n-embd2 128

#### Generate outputs

In [None]:
!python makemore.py -i data/addition_train.txt --sample-only --n-layer 8 --n-head 4 --n-embd 128 --n-embd2 128

#### Evaluate correctness

Now we want to evaluate the correctness of the outputs. We'll give the model problems from the test set (which were not seen during training) and have the model generate a solution for each problem. Then we'll parse the output and check it.

This will require writing some code.
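
Before involving the real model, the protocol described above can be sketched without it. In the sketch below, `oracle_model` and `broken_model` are hypothetical stand-ins for the trained transformer (each maps a prompt string to a completed string), and `make_prompt`, `is_correct`, and `accuracy` are illustrative helpers rather than part of makemore.

```python
import re

def make_prompt(example):
    """Keep everything up to and including '=' so the answer is hidden."""
    return example.split('=')[0] + '='

def is_correct(completion):
    """Parse 'a+b=c' and check the arithmetic; malformed text is incorrect."""
    m = re.fullmatch(r'(\d+)\+(\d+)=(\d+)', completion)
    return m is not None and int(m.group(1)) + int(m.group(2)) == int(m.group(3))

def oracle_model(prompt):
    # Hypothetical stand-in that always answers correctly.
    a, b = prompt.rstrip('=').split('+')
    return prompt + str(int(a) + int(b))

def broken_model(prompt):
    # Hypothetical stand-in that always answers 0.
    return prompt + '0'

def accuracy(model_fn, examples):
    """Fraction of examples whose completion is a correct sum."""
    results = [is_correct(model_fn(make_prompt(e))) for e in examples]
    return sum(results) / len(results)

test_examples = ['7272+3991=11263', '4576+3741=8317']
print(accuracy(oracle_model, test_examples))
print(accuracy(broken_model, test_examples))
```

The cells that follow do the same thing at the token level: the prompt is encoded, the transformer generates the continuation, and the decoded string is parsed and checked.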

In [23]:
import makemore
import torch

def load(filename='out/model.pt', n_layer=8, n_head=4, n_embd=128, n_embd2=128):
    # Build the datasets from the held-out file; makemore splits this file
    # into its own train/test portions internally.
    train_dataset, test_dataset = makemore.create_datasets("data/addition_test.txt")
    vocab_size = train_dataset.get_vocab_size()
    block_size = train_dataset.get_output_length()

    # The architecture must match the flags used when training the checkpoint.
    config = makemore.ModelConfig(
        vocab_size=vocab_size, block_size=block_size,
        n_layer=n_layer, n_head=n_head,
        n_embd=n_embd, n_embd2=n_embd2
    )
    model = makemore.Transformer(config)
    # Load the saved weights onto the CPU.
    model.load_state_dict(torch.load(filename, map_location=torch.device('cpu')))
    return train_dataset, test_dataset, model

In [24]:
import re

def trim_padding(x):
    # Token 0 marks the start of a sequence and also its end/padding.
    start = 0
    end = len(x)
    # Skip everything up to and including the first 0 (the start token).
    for j in range(len(x)):
        if x[j] == 0:
            start = j+1
            break
    # Cut at the first 0 after the start token (the stop token).
    for j in range(start+1, len(x)):
        if x[j] == 0:
            end = j
            break
    return x[start:end]

def check(train_dataset_decode, out):
    # Decode the first (and only) sequence in the batch back to a string.
    out = out[0].tolist()
    out = trim_padding(out)
    out = train_dataset_decode(out)

    # Parse with a regex and verify the sum (e.g. 1468+1657=3125).
    m = re.match(r'(\d+)\+(\d+)=(\d+)', out)
    if m is None:
        # Malformed output counts as incorrect.
        return -1, -1, -1, False
    a = int(m.group(1))
    b = int(m.group(2))
    c = int(m.group(3))
    return a, b, c, (a + b) == c
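
To see what `trim_padding` operates on, here is a small sketch on plain Python lists. It assumes the token layout suggested by the vocabulary printed further down: index 0 is the special start/stop/padding token and the characters `+0123456789=` are numbered from 1. The helpers `encode`, `decode`, and `trim` are simplified stand-ins written for this illustration, not the makemore implementations.

```python
# Assumed vocabulary: token 0 is special, characters are numbered from 1.
chars = "+0123456789="
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

def trim(x):
    """Simplified restatement of trim_padding for list inputs."""
    start = x.index(0) + 1 if 0 in x else 0
    end = x.index(0, start) if 0 in x[start:] else len(x)
    return x[start:end]

# A generated sequence: start token, the characters, then stop/padding tokens.
tokens = [0] + encode("1468+1657=3125") + [0, 0]
print(tokens)
print(decode(trim(tokens)))
```

Only the tokens between the leading 0 and the next 0 are kept, so anything the model emits after its stop token is ignored when the answer is checked.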


#### Evaluate

You can evaluate either the model we trained for a while or your own model:

In [28]:
# --- To use our provided model, use:
model_filename = 'out/model_provided.pt'

# --- To use your model, use:
# model_filename = 'out/model.pt'

In [29]:
from tqdm import trange

train_dataset, test_dataset, model = load(model_filename)
train_dataset_decode = train_dataset.decode
model.eval()  # inference mode; set once rather than on every iteration

# Evaluate accuracy on the test set
correct = 0
total = 0
dataset = test_dataset
incorrect = []
for i in trange(len(dataset)):
    # Recover the problem text and keep only the part up to and including '='.
    x = dataset[i][0].tolist()
    x = trim_padding(x)
    prompt = dataset.decode(x).split('=')[0]+'='
    # Prepend the start token (0) and add a batch dimension.
    prompt_tokens = dataset.encode(prompt)
    prompt_tokens = torch.cat([torch.tensor([0]), prompt_tokens]).unsqueeze(0)
    # Greedy decoding (do_sample=False) of up to 10 new tokens.
    out = makemore.generate(
        model, prompt_tokens, 10, top_k=None, do_sample=False
    ).to('cpu')
    a, b, c, correct_ = check(train_dataset_decode, out)
    correct += correct_
    if not correct_:
        # Keep failures for later inspection.
        incorrect.append((prompt, out, a, b, c))
    total += 1
print('Accuracy:', correct/total)

number of examples in the dataset: 10000
max word length: 15
number of unique characters in the vocabulary: 12
vocabulary:
+0123456789=
split up the dataset into 9000 training examples and 1000 test examples
number of parameters: 1.59M


100%|██████████| 1000/1000 [00:29<00:00, 33.39it/s]

Accuracy: 1.0





#### Manually try it out

In [39]:
prompt = "2727+7272="

prompt_tokens = dataset.encode(prompt)
prompt_tokens = torch.cat([torch.tensor([0]), prompt_tokens]).unsqueeze(0)
out = makemore.generate(
    model, prompt_tokens, 10, top_k=None, do_sample=False
).to('cpu')
out = out[0].tolist()
out = trim_padding(out)
out = train_dataset_decode(out)
out

'2727+7272=9999'