# Training GPT to do Multiplication
The goal of this exercise is for you to get more familiar with transformers and GPT. We will focus on a task much simpler than language modelling, one that can be trained in a few minutes. Specifically, we will train a small GPT model from scratch to perform multiplications, using individual characters as tokens. Doing multiplications directly, for example in the form: \\
"12\*34=" -> "408" \\
can be challenging. Even large language models are often not able to do this accurately for moderately large numbers, say 5 digits (although it is possible using [special tricks](https://arxiv.org/abs/2311.14737)).

For large numbers, predicting the first digit directly without any "intermediate" steps becomes hard. We will therefore consider training the model to perform these intermediate steps explicitly, for example: \\
"12\*34=" -> "68\[68\]+340\[408\]=408" \\
where we multiply each digit of the first number (scaled by its place value) with the second number and keep the running sum in the brackets, i.e. 68=2\*34, 340=10\*34, 0+68=68 and 68+340=408.
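
The decomposition above can be checked numerically. The following is a minimal sketch, not the exercise solution: the helper name `partial_products` is hypothetical, and it returns plain integers rather than the string format used for training.

```python
def partial_products(a, b):
    """Return (partials, running_sums) for a * b, one entry per digit of a."""
    partials, running_sums = [], []
    total = 0
    # Walk over the digits of `a` from least to most significant.
    for power, digit in enumerate(reversed(str(a))):
        partial = int(digit) * b * 10**power  # digit scaled by its place value
        total += partial
        partials.append(partial)
        running_sums.append(total)
    return partials, running_sums


partials, running_sums = partial_products(12, 34)
print(partials)      # [68, 340]
print(running_sums)  # [68, 408]
```

The last running sum is always the full product, which is why the sequence can end with the final answer after the last bracket.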

This only requires a much simpler transformation at each step (corresponding to the generation of one output token) making it easier to learn and perform accurately. A similar idea is used in ["chain-of-thought" prompting](https://arxiv.org/abs/2201.11903) in large language models. They often perform much better if you ask them to "think step by step".

One final thing we can do to make the task even easier is to write the numbers backwards during the intermediate steps. This is because computing the digits from left to right is significantly harder than computing them from right to left, particularly for addition but also for multiplication, since carries propagate from the least significant digit upwards.
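
To see why the reversed order helps, consider schoolbook addition on reversed digit strings. This is an illustrative sketch only (the function `add_reversed` is hypothetical and not part of the exercise code): each output digit depends only on digits already seen plus a single carry, which matches the left-to-right order in which the model generates tokens.

```python
def add_reversed(x_rev, y_rev):
    """Add two numbers given as reversed digit strings; return the reversed sum."""
    out = []
    carry = 0
    for i in range(max(len(x_rev), len(y_rev))):
        dx = int(x_rev[i]) if i < len(x_rev) else 0
        dy = int(y_rev[i]) if i < len(y_rev) else 0
        # Each step only needs the current digits and the carry so far.
        carry, digit = divmod(dx + dy + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(out)


# 68 + 340 = 408, written backwards: "86" + "043" -> "804"
print(add_reversed("86", "043"))  # 804
```

In the non-reversed order, the first output digit can depend on a carry that originates at the very last input digit, so the model would have to resolve the whole chain before emitting anything.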

# Part 1: Data Creation
This part contains the functions involved in the dataset creation. The pipeline is as follows:
* We generate random number pairs and split them into a train and validation set (generate_dataset).
* For each pair of numbers we create a string showing the multiplication of the two numbers, potentially involving intermediate steps. For example (12,34) could get mapped to "12\*34=408" or "12\*34=68\[68\]+340\[408\]=408".
* The strings are padded with spaces so that they all have the same length. Note that this is typically not done in standard next-token prediction on language, but it simplifies the training procedure in this case.
* The strings are tokenized, mapping them to arrays of indices (integers). In this case we use one token per character, so tokenization simply maps each character to an index between zero and the vocabulary size minus one.
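
As a small illustration of the padding step, the sketch below right-pads two example strings with spaces to a common length. It uses `str.ljust` for brevity and is only meant to show the idea, assuming the two example strings from the list above.

```python
sequences = ["12*34=408", "12*34=68[68]+340[408]=408"]

# Pad every sequence on the right with spaces up to the longest length.
max_len = max(len(seq) for seq in sequences)
padded = [seq.ljust(max_len) for seq in sequences]

for seq in padded:
    print(repr(seq), len(seq))
```

After padding, every sequence has the same length, so a batch can be stacked into a single rectangular array.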

You need to fill in the missing details in two functions:
* generate_mul_seq
* tokenize


In [None]:
import random
import itertools
import numpy as np


def generate_mul_seq(a, b, max_digits=3, sum_cot=False, reverse_cot=False):
    """
    This function takes in two integers and returns a string representing their
    multiplication and result, optionally with intermediate steps.

    >>> generate_mul_seq(867, 821, max_digits=3, sum_cot=False, reverse_cot=False)
    '867*821=711807'
    >>> generate_mul_seq(386, 273, max_digits=3, sum_cot=False, reverse_cot=True)
    '386*273=873501=105378'
    >>> generate_mul_seq(507, 779, max_digits=3, sum_cot=True, reverse_cot=False)
    '507*779=5453[5453]+0[5453]+389500[394953]=394953'
    >>> generate_mul_seq(807, 214, max_digits=3, sum_cot=True, reverse_cot=True)
    '807*214=8941[8941]+0[8941]+002171[896271]=172698'
    """
    prompt = f"{a}*{b}="
    prompt = " " * (2 * max_digits + 2 - len(prompt)) + prompt
    if not sum_cot:
        if not reverse_cot:
            # No COT e.g. "12*34=408"
            return prompt + f"{a*b}"
        else:
            # Reversed intermediate result e.g. "12*34=804=408"
            return prompt + f"{str(a*b)[::-1]}=" + f"{a*b}"

    # ***************************************************
    # INSERT YOUR CODE HERE
    # TODO: Return a string of the type "12*34=68[68]+340[408]=408" if
    # reverse_cot==False or "12*34=86[86]+043[804]=408" otherwise.
    # You should use the prompt created above (it has a fixed length,
    # which we rely on later).
    # ***************************************************
    raise NotImplementedError


token_table = {
    **{f"{d}": d for d in range(10)},
    "*": 10,
    "=": 11,
    "+": 12,
    "[": 13,
    "]": 14,
    " ": 15,  # Hacky padding
}


def tokenize(dataset):
    """
    This function takes in a list of strings and converts each one to a uint8
    numpy array of tokens corresponding to the characters in the string (see
    the token_table above for the mapping).

    >>> print(tokenize(["867*821=711807"]))
    [array([8, 6, 7, 10, 8, 2, 1, 11, 7, 1, 1, 8, 0, 7], dtype=uint8)]
    >>> print(tokenize(["0", "123456789*=+[]"]))
    [array([0], dtype=uint8), array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=uint8)]
    """
    # ***************************************************
    # INSERT YOUR CODE HERE
    # TODO: Map each string in the dataset [list] to a numpy array of
    # token indices (integers) according to the token_table defined above.
    # Return a list of numpy arrays corresponding to the input strings.
    # The datatype used for the numpy array should be np.uint8.
    # For example: "2*3=6" should be mapped to np.array([2, 10, 3, 11, 6], dtype=np.uint8)
    # ***************************************************
    raise NotImplementedError
    return out


# Other helper functions
def generate_dataset(
    train_samples=100000,
    val_samples=1000,
    max_digits=3,
    sum_cot=False,
    reverse_cot=False,
):
    assert train_samples + val_samples < 0.75 * 10 ** (
        max_digits * 2
    ), "Too many requested data samples"

    def generate_pairs(n_samples, existing_pairs=None):
        if existing_pairs is None:
            existing_pairs = set()
        max_num = 10**max_digits - 1
        pairs = set()
        while len(pairs) < n_samples:
            num1 = random.randint(0, max_num)
            num2 = random.randint(0, max_num)
            pair = (num1, num2)
            if pair not in existing_pairs:
                pairs.add(pair)
        return pairs

    train_set = generate_pairs(train_samples)
    val_set = generate_pairs(val_samples, train_set)

    train_set = [
        generate_mul_seq(a, b, max_digits, sum_cot, reverse_cot) for (a, b) in train_set
    ]
    val_set = [
        generate_mul_seq(a, b, max_digits, sum_cot, reverse_cot) for (a, b) in val_set
    ]

    return list(train_set), list(val_set)


def pad_datasets(train, val):
    max_len = max(len(seq) for seq in train + val)
    return [seq + " " * (max_len - len(seq)) for seq in train], [
        seq + " " * (max_len - len(seq)) for seq in val
    ]


inverse_table = {val: key for (key, val) in token_table.items()}


def detokenize(data):
    out = ["".join([inverse_table[idx] for idx in seq]) for seq in data]
    return out


# Test function we use to test your implementations
import doctest
import io
import sys

np.set_printoptions(
    threshold=np.inf, linewidth=np.inf, formatter={"int": lambda x: f"{x:d}"}
)


def test(f):
    # Collect the doctests defined in the docstring of `f`.
    tests = doctest.DocTestFinder().find(f)
    assert len(tests) <= 1
    for test in tests:
        # We redirect stdout to a string, so we can tell if the tests worked out or not
        orig_stdout = sys.stdout
        sys.stdout = io.StringIO()

        try:
            results: doctest.TestResults = doctest.DocTestRunner().run(test)
            output = sys.stdout.getvalue()
        finally:
            sys.stdout = orig_stdout

        if results.failed > 0:
            print(f"❌ There are some issues with your implementation of `{f.__name__}`:")
            print(output, end="")
            print(
                "**********************************************************************"
            )
        elif results.attempted > 0:
            print(f"✅ Your `{f.__name__}` passed {results.attempted} tests.")
        else:
            print(f"Could not find any tests for {f.__name__}")

In [None]:
# Example data
random.seed(0)
print("No COT:")
for seq in generate_dataset(3, 0, sum_cot=False, reverse_cot=False)[0]:
    print(repr(seq))

print("\nReverse intermediate COT:")
for seq in generate_dataset(3, 0, sum_cot=False, reverse_cot=True)[0]:
    print(repr(seq))

print("\nSum COT:")
for seq in generate_dataset(3, 0, sum_cot=True, reverse_cot=False)[0]:
    print(repr(seq))

print("\nReverse sum COT:")
for seq in generate_dataset(3, 0, sum_cot=True, reverse_cot=True)[0]:
    print(repr(seq))

print()
test(generate_mul_seq)
test(tokenize)

# Part 2: Defining the model
Below you will find a slightly simplified model definition from [NanoGPT](https://github.com/karpathy/nanoGPT), a lean codebase for training real GPT models. We won't ask you to implement anything but encourage you to read through the code. We do not expect you to understand everything or to be able to implement code like this. However, see if you can answer the following questions:
* Where is Layer Normalization applied relative to the attention and MLP subblocks?
* What activation function do we use and where?
* How do we ensure that the model doesn't cheat during training by looking at future tokens?
* Where do we convert the initial input tokens (integers) to vector embeddings?
* How do we ensure that attention is aware of the sequence order?

In [None]:
# NanoGPT model.py
# Slightly simplified to remove things unrelated to this exercise

"""
Full definition of a GPT Language Model, all of it in this single file.
References:
1) the official GPT-2 TensorFlow implementation released by OpenAI:
https://github.com/openai/gpt-2/blob/master/src/model.py
2) huggingface/transformers PyTorch implementation:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py
"""

import math
import inspect
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F


class LayerNorm(nn.Module):
    """LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False"""

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, "scaled_dot_product_attention")
        if not self.flash:
            print(
                "WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0"
            )
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer(
                "bias",
                torch.tril(torch.ones(config.block_size, config.block_size)).view(
                    1, 1, config.block_size, config.block_size
                ),
            )

    def forward(self, x):
        (
            B,
            T,
            C,
        ) = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(
            1, 2
        )  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(
            1, 2
        )  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(
            1, 2
        )  # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(
                q,
                k,
                v,
                attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True,
            )
        else:
            # manual implementation of attention
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = (
            y.transpose(1, 2).contiguous().view(B, T, C)
        )  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.vocab_size, config.n_embd),
                wpe=nn.Embedding(config.block_size, config.n_embd),
                drop=nn.Dropout(config.dropout),
                h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
                ln_f=LayerNorm(config.n_embd, bias=config.bias),
            )
        )
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = (
            self.lm_head.weight
        )  # https://paperswithcode.com/method/weight-tying

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith("c_proj.weight"):
                torch.nn.init.normal_(
                    p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layer)
                )

        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params() / 1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert (
            t <= self.config.block_size
        ), f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device)  # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1
            )

            mask = targets != -1
            correct = (torch.argmax(logits, dim=-1) == targets) & mask
            acc = torch.sum(1.0 * (correct)) / torch.sum(mask)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(
                x[:, [-1], :]
            )  # note: using list [-1] to preserve the time dim
            loss = None
            acc = None

        return logits, loss, acc

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameter that is at least 2D will be weight decayed, otherwise not.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {"params": decay_params, "weight_decay": weight_decay},
            {"params": nodecay_params, "weight_decay": 0.0},
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(
            f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters"
        )
        print(
            f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters"
        )
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == "cuda"
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(
            optim_groups, lr=learning_rate, betas=betas, **extra_args
        )
        print(f"using fused AdamW: {use_fused}")

        return optimizer

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = (
                idx
                if idx.size(1) <= self.config.block_size
                else idx[:, -self.config.block_size :]
            )
            # forward the model to get the logits for the index in the sequence
            logits, _, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float("Inf")
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

# Part 3: Training Script
Below you will find a simplified version of the training code used in [NanoGPT](https://github.com/karpathy/nanoGPT). We will also not ask you to implement anything here but encourage you to read through the code. This code uses some concepts you are probably not familiar with such as low precision training in float16 (for faster GPU execution). Do not worry about understanding everything but see if you can answer the following question:
* Since the numbers in the prompt are random, they cannot be accurately predicted. Here we opt to ignore the prompt in our loss and accuracy computation. How do we do this?
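
For some intuition about the low precision training mentioned above, the following NumPy sketch (not part of the training script) shows two limitations of float16 and why gradients are typically scaled before the backward pass. The gradient value and the scale factor are arbitrary choices for illustration.

```python
import numpy as np

# float16 stores only 10 mantissa bits, so not every integer above 2048
# is representable: 2049 rounds to the nearest representable value.
print(np.float16(2049))  # 2048.0

# Very small values underflow to zero. A gradient of this size would be lost.
tiny_grad = 1e-8
print(np.float16(tiny_grad))  # 0.0

# Loss scaling multiplies the loss (and hence the gradients) by a large factor
# before the backward pass, then divides again in higher precision.
scale = 65536.0  # illustrative scale factor, an assumption for this sketch
scaled = np.float16(tiny_grad * scale)
recovered = np.float32(scaled) / np.float32(scale)
print(scaled > 0, recovered)
```

The scaled value survives the conversion to float16, and dividing by the scale in float32 recovers approximately the original magnitude. This is the role the `GradScaler` plays in the training code when `dtype` is `"float16"`.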



In [None]:
# NanoGPT train.py (modified)
# Slightly simplified to remove features not needed here

import time
import math
from contextlib import nullcontext

import numpy as np
import torch

# -----------------------------------------------------------------------------
# Some configuration hyperparameters that we keep constant in this notebook

# I/O
eval_interval = 1000
log_interval = 500
eval_iters = 8

# data
batch_size = 128

# model
n_layer = 6
n_head = 4
n_embd = 128
dropout = 0.0  # for pretraining 0 is good, for finetuning try 0.1+
bias = False  # do we use bias inside LayerNorm and Linear layers?

# adamw optimizer
learning_rate = 1e-3  # max learning rate
max_iters = 5000  # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0

# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
lr_decay_iters = max_iters  # should be ~= max_iters per Chinchilla
min_lr = 0  # minimum learning rate, should be ~= learning_rate/10 per Chinchilla

# system
device = (
    "cuda"  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
)
dtype = (
    "bfloat16"
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else "float16"
)  # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = False  # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------


# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)


def train_model(train_data, val_data, prompt_length, block_size):
    # Usually we would take a lot more of the hyperparameters as some sort of
    # arguments but here we only change the train_data and val_data

    torch.manual_seed(1337)
    torch.backends.cuda.matmul.allow_tf32 = True  # allow tf32 on matmul
    torch.backends.cudnn.allow_tf32 = True  # allow tf32 on cudnn
    device_type = (
        "cuda" if "cuda" in device else "cpu"
    )  # for later use in torch.autocast
    # note: float16 data type will automatically use a GradScaler
    ptdtype = {
        "float32": torch.float32,
        "bfloat16": torch.bfloat16,
        "float16": torch.float16,
    }[dtype]
    ctx = (
        nullcontext()
        if device_type == "cpu"
        else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
    )

    # poor man's data loader
    def get_batch(split, mask_first=prompt_length - 1):
        data = train_data if split == "train" else val_data
        ix = torch.randint(len(data), (batch_size,))
        x = torch.stack([torch.from_numpy((data[i][:-1]).astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy((data[i][1:]).astype(np.int64)) for i in ix])
        if mask_first:
            y[:, :mask_first] = -1
        if device_type == "cuda":
            # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
            x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(
                device, non_blocking=True
            )
        else:
            x, y = x.to(device), y.to(device)
        return x, y

    print(f"{block_size=}")
    tokens_per_iter = batch_size * block_size
    print(f"tokens per iteration will be: {tokens_per_iter:,}")

    # model init
    model_args = dict(
        n_layer=n_layer,
        n_head=n_head,
        n_embd=n_embd,
        block_size=block_size,
        bias=bias,
        vocab_size=None,
        dropout=dropout,
    )  # start with model_args from command line
    print("Initializing a new model from scratch")
    model_args["vocab_size"] = 16
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    model.to(device)

    # initialize a GradScaler. If enabled=False scaler is a no-op
    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == "float16"))

    # optimizer
    optimizer = model.configure_optimizers(
        weight_decay, learning_rate, (beta1, beta2), device_type
    )

    # compile the model
    if compile:
        print("compiling the model... (takes a ~minute)")
        unoptimized_model = model
        model = torch.compile(model)  # requires PyTorch 2.0

    # helps estimate an arbitrarily accurate loss over either split using many batches
    @torch.no_grad()
    def estimate_loss():
        out_losses = {}
        out_accs = {}
        model.eval()
        for split in ["train", "val"]:
            losses = torch.zeros(eval_iters)
            accs = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                with ctx:
                    logits, loss, acc = model(X, Y)
                losses[k] = loss.item()
                accs[k] = acc.item()
            out_losses[split] = losses.mean()
            out_accs[split] = accs.mean()
        model.train()
        return out_losses, out_accs

    # training loop
    iter_num = 0
    best_val_loss = 1e9
    X, Y = get_batch("train")  # fetch the very first batch
    t0 = time.time()

    while True:
        # determine and set the learning rate for this iteration
        lr = get_lr(iter_num) if decay_lr else learning_rate
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr

        # evaluate the loss on train/val sets and write checkpoints
        if iter_num % eval_interval == 0:
            losses, accs = estimate_loss()
            print(
                f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}, train acc {accs['train']:0.4f}, val acc {accs['val']:0.4f}"
            )

        # forward backward update, with optional gradient accumulation to simulate larger batch size
        # and using the GradScaler if data type is float16
        with ctx:
            logits, loss, accuracy = model(X, Y)
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch("train")
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
        # clip the gradient
        if grad_clip != 0.0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        # step the optimizer and scaler if training in fp16
        scaler.step(optimizer)
        scaler.update()
        # flush the gradients as soon as we can, no need for this memory anymore
        optimizer.zero_grad(set_to_none=True)

        # timing and logging
        t1 = time.time()
        dt = t1 - t0
        t0 = t1
        if iter_num % log_interval == 0:
            # get loss as float. note: this is a CPU-GPU sync point
            lossf = loss.item()
            print(
                f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, {accuracy=:0.3f}"
            )
        iter_num += 1

        # termination conditions
        if iter_num > max_iters:
            break

    return model

# Part 4: Evaluating models
In this part you will train models on different types of data and evaluate how well they perform during inference.

Fill out the missing details to train and compare the different models below. How do the different types of data change the difficulty of the task? Note that we do not tune the training much here; it is probably possible to achieve somewhat better performance on the direct task with longer training and better tuning (but we want to keep things short and manageable here).

In [None]:
# Helper functions for evaluation (no action needed)

from IPython.display import HTML, display
import difflib


def compare_strings_html(pred_str, true_str):
    # Initialize an empty string for the HTML output
    diff_str = ""

    # Iterate over the characters based on the length of the shorter string
    for i in range(min(len(pred_str), len(true_str))):
        if pred_str[i] != true_str[i]:
            # If characters are different, color them red
            diff_str += '<span style="color: #FF0000; text-decoration: line-through;">{}</span>'.format(
                pred_str[i]
            )
        else:
            # If characters are the same, keep them as they are
            diff_str += pred_str[i]

    # Add the remaining characters of pred_str in red if they are extra
    if len(pred_str) > len(true_str):
        for extra_char in pred_str[len(true_str) :]:
            diff_str += '<span style="color: #FF0000;">{}</span>'.format(extra_char)

    return diff_str


def calculate_accuracies(Y_true, Y_pred, prompt_length):
    total_strings = len(Y_true)
    exact_match_count = 0
    total_chars = 0
    char_match_count = 0

    for true_str, pred_str in zip(Y_true, Y_pred):
        # Count exact matches
        if true_str == pred_str:
            exact_match_count += 1

        # Count character matches, ignoring prompt_length
        for t_char, p_char in zip(true_str[prompt_length:], pred_str[prompt_length:]):
            if t_char != " ":
                total_chars += 1
                if t_char == p_char:
                    char_match_count += 1

    # Calculate accuracies
    string_accuracy = exact_match_count / total_strings
    char_accuracy = char_match_count / total_chars

    return string_accuracy, char_accuracy


def evaluate_model_generation(model, val_data, num_sequences=1024, batch_size=128):
    # Generate sequences based on validation prompts and compare / compute accuracy
    num_sequences = min(num_sequences, len(val_data))
    model.eval()

    true_seq = []
    pred_seq = []

    for batch_idx in range((num_sequences + batch_size - 1) // batch_size):
        start_idx = batch_idx * batch_size
        end_idx = min((batch_idx + 1) * batch_size, num_sequences)
        sequences = [(val_data[i]) for i in range(start_idx, end_idx)]

        X_val = torch.stack(
            [torch.from_numpy(seq.astype(np.int64)) for seq in sequences]
        ).to(device)
        X_prompt = X_val[:, :prompt_length]
        Y_pred = model.generate(X_prompt, block_size - prompt_length + 1, top_k=1)

        # Convert to strings
        true_seq.extend(detokenize(X_val.cpu().numpy()))
        pred_seq.extend(detokenize(Y_pred.cpu().numpy()))

    # Compute accuracy
    string_accuracy, char_accuracy = calculate_accuracies(
        true_seq, pred_seq, prompt_length=prompt_length
    )
    print("Auto-regressive Generation")
    print(f"Sequence-Level Accuracy: {string_accuracy:0.4f}")
    print(f"Character-Level Accuracy: {char_accuracy:0.4f}")

    print("\nExamples of correct/incorrect sequences:")
    correct = 0
    incorrect = 0
    for tseq, pseq in zip(true_seq, pred_seq):
        if tseq == pseq and correct < 5:
            correct += 1
            print(f"Correct:   {tseq}")
        if tseq != pseq and incorrect < 5:
            incorrect += 1
            print(f"Incorrect: {tseq}")
            display(
                HTML(f'<div style="margin-left: 20px;"><pre>Target: {tseq}</pre></div>')
            )
            display(
                HTML(
                    f'<div style="margin-left: 20px;"><pre>Output: {compare_strings_html(pseq, tseq)}</pre></div>'
                )
            )

In [None]:
for max_digits in [3, 6]:
    for sum_cot in [False, True]:
        for reverse_cot in [False, True]:
            print("=" * 80)
            print(f"{max_digits=}, {sum_cot=}, {reverse_cot=}")
            print("=" * 80)

            random.seed(42)
            prompt_length = 2 * max_digits + 2

            # ***************************************************
            # INSERT YOUR CODE HERE
            # TODO: Generate the training and validation datasets
            # TODO: Pad the sequences in the resulting datasets
            # TODO: Tokenize the results to obtain the final train/val data
            # ***************************************************
            raise NotImplementedError

            print("===== Training =====")
            block_size = len(train_data[0]) - 1
            # ***************************************************
            # INSERT YOUR CODE HERE
            # TODO: Train a model, saving the resulting model
            # ***************************************************
            raise NotImplementedError

            print("\n\n===== Evaluation =====")
            # ***************************************************
            # INSERT YOUR CODE HERE
            # TODO: Evaluate the model generation
            # ***************************************************
            raise NotImplementedError

            print("=" * 80)
            print("\n\n")