# HW5: Transformer


Designed by Ruizhao Zhu with the help of Brian Kulis and Ashok Cutkosky.

This assignment will introduce you to:

1. Understanding the structure of the Transformer.

2. Building a GPT model step by step.

You can run this assignment on Colab.



## Q1 Sequence to Sequence Modelling with nn.Transformer and Torch Text (20 points)

You will implement part of a Transformer. This question aims to familiarize you with the Transformer architecture proposed in the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf). It is adapted from the original PyTorch tutorial [here](https://pytorch.org/tutorials/beginner/transformer_tutorial.html?highlight=transformer); you can refer to it when you fill out the code. The general architecture of the Transformer is shown in the figure below:

<img src="https://pytorch.org/tutorials/_images/transformer_architecture.jpg" width="360em">

This question requires you to implement a sequence-to-sequence model using only the encoder, which is the left part of the figure. You will use PyTorch's built-in layers.

The Transformer model has been shown to be superior in quality for many sequence-to-sequence
problems while being more parallelizable. The ``nn.Transformer`` module
relies entirely on an attention mechanism (another module,
implemented as `nn.MultiheadAttention`) to draw global dependencies
between input and output. The ``nn.Transformer`` module is highly
modularized, so that a single component (like [`nn.TransformerEncoder`](<https://pytorch.org/docs/master/nn.html?highlight=nn%20transformerencoder#torch.nn.TransformerEncoder>)
in this tutorial) can be easily adapted or composed.

### Q1.1 Define the model 
In this question, we train an ``nn.TransformerEncoder`` model on a
language modeling task. The language modeling task is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words. A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words (see the next paragraph for more details). The
``nn.TransformerEncoder`` consists of multiple layers of
``nn.TransformerEncoderLayer``. Along with the input sequence, a square
attention mask is required because the self-attention layers in
``nn.TransformerEncoder`` are only allowed to attend to the earlier positions in
the sequence. For the language modeling task, any tokens at future
positions should be masked. To obtain the predicted words, the output
of the ``nn.TransformerEncoder`` model is sent to the final Linear
layer, which is followed by a log-Softmax function. We will see how to implement the ``PositionalEncoding`` in a later question.
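
To make the effect of the square attention mask concrete, here is a minimal pure-Python sketch (no PyTorch). The helper names `causal_mask` and `softmax` are made up for this illustration and are not part of the assignment code. It assumes an additive mask, with `0` for allowed positions and `-inf` for future positions, which is added to the attention scores before the softmax.

```python
import math

def causal_mask(size):
    """Additive mask: 0.0 on and below the diagonal, -inf above it."""
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores for 3 positions (all equal, so only the mask matters).
scores = [[1.0, 1.0, 1.0] for _ in range(3)]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Row `t` of `weights` puts zero probability on every position after `t`, which is the behavior the mask is meant to enforce.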

<img src="https://raw.githubusercontent.com/ruizhaoz/ruizhaoz.github.io/master/encoder.png">



In the following model, we train only an encoder, which is the left part of the figure. We then append a linear layer, `self.decoder`, to replace the right part of the model.

In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerModel(nn.Module):
    '''
    A Transformer encoder model. The constructor arguments are:
    ntoken:  size of the vocabulary (number of distinct tokens)
    ninp:    dimension of the input embeddings
    nhead:   number of heads in the multi-head attention layers
    nhid:    dimension of the feedforward network inside each TransformerEncoderLayer
    nlayers: number of TransformerEncoderLayer layers
    dropout: dropout probability
    '''
    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout) # PositionalEncoding will be implemented in next section.
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def generate_square_subsequent_mask(self, sz):
        """YOUR CODE HERE"""
        '''
        You can use torch.triu and masked_fill to get an upper-triangular mask.
        The entries above the diagonal are -inf; the lower-left entries, including the diagonal, are 0.
        '''
        mask = None
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        """YOUR CODE HERE"""
        '''
        Fill in the forward function according to the diagram above.
        After the embedding layer, multiply the embeddings by the square root of
        self.ninp.
        '''
        output = None
        return output

### Q1.2 Positional Encoding
#### Q1.2.1 Fill the code block
The ``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.
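
As a reference point, the sinusoidal encoding defined in "Attention is all you need" is $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The following is a small pure-Python sketch of that formula. The function name `sinusoidal_encoding` is an assumption made for this illustration; your `PositionalEncoding` module should build the same table as a tensor.

```python
import math

def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        # i walks over the even feature indices 0, 2, 4, ...
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # feature index i (even)
            row.append(math.cos(angle))  # feature index i + 1 (odd)
        table.append(row[:d_model])      # trim in case d_model is odd
    return table

pe = sinusoidal_encoding(max_len=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always maps to alternating zeros and ones, and later feature pairs oscillate more slowly with position than earlier ones.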


In [None]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        """YOUR CODE HERE"""


    def forward(self, x):
        """YOUR CODE HERE"""

        return None

#### Q1.2.2 Why do we need positional encoding in the Transformer architecture?

### Q1.3 Running the model

#### Q1.3.1 Run the code to get desired performance.
The training process uses the WikiText-2 dataset from ``torchtext``. The
vocab object is built from the training set and is used to numericalize
tokens into tensors. Starting from sequential data, the ``batchify()``
function arranges the dataset into columns, trimming off any tokens remaining
after the data has been divided into batches of size ``batch_size``.
For instance, with the alphabet as the sequence (total length of 26)
and a batch size of 4, we would divide the alphabet into 4 sequences of
length 6, trimming off the leftover letters Y and Z:

\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}

These columns are treated as independent by the model, which means that
the dependence of ``G`` on ``F`` cannot be learned, but it allows more
efficient batch processing.
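
The alphabet example can be reproduced with a few lines of plain Python. This is only a sketch of the column-splitting idea; `batchify_letters` is a hypothetical helper that works on a string, whereas the real `batchify()` works on a tensor of token ids.

```python
import string

def batchify_letters(sequence, batch_size):
    """Split a sequence into batch_size equal columns, dropping the remainder."""
    column_len = len(sequence) // batch_size
    trimmed = sequence[:column_len * batch_size]
    return [trimmed[k * column_len:(k + 1) * column_len]
            for k in range(batch_size)]

columns = batchify_letters(string.ascii_uppercase, 4)
print(columns)  # ['ABCDEF', 'GHIJKL', 'MNOPQR', 'STUVWX']
```

Each string corresponds to one column of the matrix above; `Y` and `Z` do not fit and are dropped.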


In [None]:
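# Note: `torchtext.data.Field` and `WikiText2.splits` belong to the legacy torchtext API
# (moved to `torchtext.legacy` in torchtext 0.9 and removed in 0.12), so this cell needs
# an older torchtext release. The "spacy" tokenizer also requires spaCy and its English model.
import torchtext
from torchtext.data.utils import get_tokenizer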
TEXT = torchtext.data.Field(tokenize=get_tokenizer("spacy"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train_txt)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batchify(data, bsz):
    data = TEXT.numericalize([data.examples[0].text])
    # Divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_txt, batch_size)
val_data = batchify(val_txt, eval_batch_size)
test_data = batchify(test_txt, eval_batch_size)


The ``get_batch()`` function generates the input and target sequences for
the transformer model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the model needs the
following words as ``Target``. For example, with a ``bptt`` value of 2,
we’d get the following two tensors for ``i`` = 0:

![](../_static/img/transformer_input_target.png)


It should be noted that the chunks are along dimension 0, consistent
with the ``S`` dimension in the Transformer model. The batch dimension
``N`` is along dimension 1.
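
The input/target relationship can be checked on a single column of letters. The sketch below mirrors the slicing logic of `get_batch()` on a plain Python list; `get_batch_toy` is a name invented for this illustration, and it handles one column only rather than a full `[S, N]` tensor.

```python
def get_batch_toy(column, i, bptt):
    """Return (inputs, targets) for one column; targets are inputs shifted by one."""
    seq_len = min(bptt, len(column) - 1 - i)
    data = column[i:i + seq_len]
    target = column[i + 1:i + 1 + seq_len]
    return data, target

column = list("ABCDEF")
print(get_batch_toy(column, 0, bptt=2))  # (['A', 'B'], ['B', 'C'])
print(get_batch_toy(column, 4, bptt=2))  # (['E'], ['F'])
```

The second call shows why the last chunk can be shorter than `bptt`: the final token has no successor, so it can only appear as a target.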


In [None]:
bptt = 35
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target
ntokens = len(TEXT.vocab.stoi) # the size of vocabulary
emsize = 200 # embedding dimension
nhid = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2 # the number of heads in the multiheadattention models
dropout = 0.2 # the dropout value
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
lr = 5.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

import time
def train():
    model.train() # Turn on the train mode
    total_loss = 0.
    start_time = time.time()
    ntokens = len(TEXT.vocab.stoi)
    src_mask = model.generate_square_subsequent_mask(bptt).to(device)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        optimizer.zero_grad()
        if data.size(0) != bptt:
            src_mask = model.generate_square_subsequent_mask(data.size(0)).to(device)
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        log_interval = 200
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | '
                  'lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                    epoch, batch, len(train_data) // bptt, scheduler.get_last_lr()[0],
                    elapsed * 1000 / log_interval,
                    cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

def evaluate(eval_model, data_source):
    eval_model.eval() # Turn on the evaluation mode
    total_loss = 0.
    ntokens = len(TEXT.vocab.stoi)
    src_mask = eval_model.generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            if data.size(0) != bptt:
                src_mask = eval_model.generate_square_subsequent_mask(data.size(0)).to(device)
            output = eval_model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1)

Run the code block below. You should get a training perplexity (ppl) of around 220 at the end of epoch 1.

In [None]:
best_val_loss = float("inf")
epochs = 1 # The number of epochs
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train()
    val_loss = evaluate(model, val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

    scheduler.step()

#### Q1.3.2 Why do we need to use `torch.nn.utils.clip_grad_norm_` in training?

## Q2 Transformer Block for GPT (40 points)

### Q2.1 Multi-head self-attention
#### Q2.1.1 The first part is multi-head self-attention. In this layer, you will need to:
- Apply linear projections to convert the feature vector at each token into separate vectors for the query, key, and value. The input and output sizes of each linear projection are both `n_embd`.
- Apply attention, scaling the logits by $\frac{1}{\sqrt{d_{qkv}}}$.
- Ensure proper masking, such that a position never attends to later positions in the sequence (the causal mask described in the notes below).
- Perform attention `n_head` times in parallel, where the results are concatenated and then projected using a linear layer.

<img src="https://www.researchgate.net/publication/332139525/figure/fig3/AS:743081083158528@1554175744311/a-The-Transformer-model-architecture-b-left-Scaled-Dot-Product-Attention.ppm" width="360em">

You should include two types of dropout in your code (with probability set by the  `dropout` argument):
- Dropout should be applied to the output of the attention layer (just prior to the residual connection, denoted by "Add & Norm" in the first figure)
- Dropout should *also* be applied after the final projection.

Notes:
- Query, key, and value vectors should have shape `[batch_size, n_heads, sequence_len, d_qkv]`
- Apply a causal mask to the scaled dot product of Q and K before the softmax. You can use `torch.tril` or `torch.triu` to create the mask; it is usually defined as a lower-triangular matrix whose lower-left entries (including the diagonal) are ones and whose remaining entries are zeros. Then apply `tensor.masked_fill()` to the scaled dot product of Q and K (which is also the input to the softmax): wherever the mask is zero, set the entry to a negative number with very large magnitude.
- Attention logits and probabilities should have shape `[batch_size, n_heads, sequence_len, sequence_len]`
- Vaswani et al. define the output of the attention layer as concatenating the various heads and then multiplying by a matrix $W^O$. It's also possible to implement this as a sum without ever calling `torch.cat`: note that $\text{Concat}(head_1, \ldots, head_h)W^O = head_1 W^O_1 + \ldots + head_h W^O_h$ where $W^O = \begin{bmatrix} W^O_1\\ \vdots\\ W^O_h\end{bmatrix}$. You may define `self.proj` this way.
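
The concatenate-versus-sum identity in the last note can be verified numerically on a tiny example. The sketch below uses plain Python lists instead of tensors; the matrix values and the helpers `matmul` and `add` are made up for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Two heads, each producing d_qkv = 2 features for a sequence of 2 tokens.
head_1 = [[1, 2], [3, 4]]
head_2 = [[5, 6], [7, 8]]

# W^O has shape (n_head * d_qkv) x n_embd = 4 x 3.
# W^O_1 and W^O_2 are its top and bottom row blocks.
w_o = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]
w_o_1, w_o_2 = w_o[:2], w_o[2:]

concat = [r1 + r2 for r1, r2 in zip(head_1, head_2)]
via_concat = matmul(concat, w_o)
via_sum = add(matmul(head_1, w_o_1), matmul(head_2, w_o_2))
print(via_concat == via_sum)  # True
```

Both routes give the same output, so the choice between them is one of implementation convenience.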


In [None]:
import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F
class MultiHeadSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    You can also use torch.nn.MultiheadAttention to validate your implementation

    """

    def __init__(self, n_embd, n_head, block_size, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        #Define key, query, value projections for all heads
        """YOUR CODE HERE"""
        self.key = None
        self.query = None
        self.value = None
        # Dropout layers
        self.attn_drop = None
        self.resid_drop = None
        # output projection
        self.proj = None
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.mask = None

        

    def forward(self, x, layer_past=None):
        B, T, C = x.size() # B = Batch
        """YOUR CODE HERE"""
        output = None
        return output

#### Q2.1.2 Why do we need to scale the dot product of Q and K?

### Q2.2 Transformer
We will implement the transformer block, which is the blue box in the figure. You can use the `nn.LayerNorm` layer to apply layer normalization. The feed-forward layer is already defined as `self.mlp`.

Note that where to apply layer norm is a design choice; you can change it to see how it affects the final results in the application of Question 3.
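
For reference, the two placements most often seen in practice can be written as follows, with $\text{LN}_1$, $\text{LN}_2$, $\text{Attn}$ and $\text{MLP}$ standing for `self.ln1`, `self.ln2`, `self.attn` and `self.mlp`. This is a schematic summary of common variants, not a statement about which one the figure depicts; dropout is omitted for brevity.

\begin{align}
\text{Post-LN:}\quad & h = \text{LN}_1\big(x + \text{Attn}(x)\big), \quad y = \text{LN}_2\big(h + \text{MLP}(h)\big) \\
\text{Pre-LN:}\quad & h = x + \text{Attn}\big(\text{LN}_1(x)\big), \quad y = h + \text{MLP}\big(\text{LN}_2(h)\big)
\end{align}

In the post-LN form the normalization sits after each residual addition, while in the pre-LN form the residual path carries $x$ through unnormalized.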

<img src="https://raw.githubusercontent.com/ruizhaoz/ruizhaoz.github.io/master/Screen%20Shot%202020-11-05%20at%2011.37.09%20AM.png" width="240em">






In [None]:
class TransformerBlock(nn.Module):
    """ a Transformer block """

    def __init__(self, n_embd, n_head, block_size, attn_pdrop=0.1, resid_pdrop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadSelfAttention(n_embd, n_head, block_size, attn_pdrop, resid_pdrop)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(resid_pdrop),
        )

    def forward(self, x):
        """YOUR CODE HERE"""
        
        return None


## Q3 GPT on Addition (40 points)
In this question, we will train a GPT model to do addition. We first need to build the dataset and encode each addition problem as a sequence of integers, since we want GPT to process sequences of integers and complete them according to patterns in the data.

  The sum of two n-digit numbers is a number with up to (n+1) digits. So our
  encoding will simply be the n-digit first number, the n-digit second number, and the (n+1)-digit result, all concatenated together. Because each addition problem is so structured, there is no need to bother the model with encoding +, =, or other tokens. Each possible sequence has the same length and simply contains the raw digits of the addition problem. As a few examples, for 2-digit problems:
- 85 + 50 = 135 becomes the sequence `[8, 5, 5, 0, 1, 3, 5]`
- 6 + 39 = 45 becomes the sequence `[0, 6, 3, 9, 0, 4, 5]`

We will also train GPT only on the final (n+1) digits, because the first two n-digit numbers are always assumed to be given. So when we give GPT an exam later, we will, for example, feed it the sequence `[0, 6, 3, 9]`, which encodes that we'd like to add 6 + 39, and hope that the model completes the integer sequence with `[0, 4, 5]` in 3 sequential steps.
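
The encoding and the "train only on the result digits" rule can be sketched in plain Python. The helper names `encode_addition` and `to_training_pair` are invented for this illustration. The masking value `-100` follows the dataset class in this notebook, and it is also the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def encode_addition(a, b, ndigit):
    """Render a + b as zero-padded digits: ndigit, ndigit, then ndigit + 1 for the sum."""
    text = f"{a:0{ndigit}d}{b:0{ndigit}d}{a + b:0{ndigit + 1}d}"
    return [int(ch) for ch in text]

def to_training_pair(digits, ndigit, ignore_index=-100):
    """Shift by one for next-token prediction and mask targets inside the given operands."""
    x = digits[:-1]
    y = digits[1:]
    y[:2 * ndigit - 1] = [ignore_index] * (2 * ndigit - 1)
    return x, y

print(encode_addition(85, 50, 2))  # [8, 5, 5, 0, 1, 3, 5]
print(encode_addition(6, 39, 2))   # [0, 6, 3, 9, 0, 4, 5]
print(to_training_pair(encode_addition(47, 17, 2), 2))
# ([4, 7, 1, 7, 0, 6], [-100, -100, -100, 0, 6, 4])
```

Only the last `ndigit + 1` targets carry a real label, so the loss is computed solely on the digits of the sum.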

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F

In [None]:
from torch.utils.data import Dataset

class AdditionDataset(Dataset):
    """
    Define the addition dataset
    """

    def __init__(self, ndigit, split):
        self.split = split # train/test
        self.ndigit = ndigit
        self.vocab_size = 10 # 10 possible digits 0..9
        # +1 due to potential carry overflow, but then -1 because very last digit doesn't plug back
        self.block_size = ndigit + ndigit + ndigit + 1 - 1
        
        # split up all addition problems into either training data or test data
        num = (10**self.ndigit)**2 # total number of possible combinations
        r = np.random.RandomState(1337) # make deterministic
        perm = r.permutation(num)
        num_test = min(int(num*0.2), 1000) # 20% of the whole dataset, or only up to 1000
        self.ixes = perm[:num_test] if split == 'test' else perm[num_test:]

    def __len__(self):
        return self.ixes.size

    def __getitem__(self, idx):
        # given a problem index idx, first recover the associated a + b
        idx = self.ixes[idx]
        nd = 10**self.ndigit
        a = idx // nd
        b = idx %  nd
        c = a + b
        render = f'%0{self.ndigit}d%0{self.ndigit}d%0{self.ndigit+1}d' % (a,b,c) # e.g. 03+25=28 becomes "0325028" 
        dix = [int(s) for s in render] # convert each character to its token index
        # x will be input to GPT and y will be the associated expected outputs
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long) # predict the next token in the sequence
        # train only on the digits of the sum: -100 is the default ignore_index of the cross-entropy loss
        y[:self.ndigit*2-1] = -100
        return x, y

In [None]:
# create a dataset for e.g. 2-digit addition
ndigit = 2
train_dataset = AdditionDataset(ndigit=ndigit, split='train')
test_dataset = AdditionDataset(ndigit=ndigit, split='test')
train_dataset[0] # sample a training instance just to see what one raw example looks like

(tensor([4, 7, 1, 7, 0, 6]), tensor([-100, -100, -100,    0,    6,    4]))

### Q3.1 Define the GPT model 
Now we construct the GPT model. As shown in the figure, there are 12 transformer blocks stacked together. In our model, we use `n_layer` to represent the number of blocks. In this question, you need to do the following:

- Define the `n_layer` transformer blocks `self.blocks`
- Fill out the forward function. Note that the positional embedding is not hard-coded as in the original Transformer; it is learned during training.
- You can add the dropout layer `self.drop` right after the token and position embeddings, before feeding them into the transformer blocks, and apply the layer norm `self.ln_f` right after the output of the transformer blocks.

<img src="https://raw.githubusercontent.com/ruizhaoz/ruizhaoz.github.io/master/GPT1.png" width="240em">

In [None]:
class GPT(nn.Module):
    """  the full GPT language model, with a sequence size of block_size """

    def __init__(self, vocab_size, n_embd, n_head, block_size, n_layer, embd_pdrop=0.1, attn_pdrop=0.1,resid_pdrop=0.1):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
        self.drop = nn.Dropout(embd_pdrop)
        # transformer
        """YOUR CODE HERE"""
        self.blocks = None
        # decoder head
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

        self.block_size = block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))


    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def configure_optimizers(self, train_config):
        """
        You don't need to change this function. This is setting specific parameters for optimization.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name

                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # special case the position embedding parameter in the root GPT module as not decayed
        no_decay.add('pos_emb')

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                    % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, x, targets=None):
        b, t = x.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."
        """YOUR CODE HERE"""

        # forward the GPT model
        logits = None
        loss = None

        return logits, loss


### Q3.2 Training the model

##### You will train the GPT model. Fill out the code of the training process.

In [None]:
import math
import logging
from tqdm import tqdm
import numpy as np
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data.dataloader import DataLoader

Setting some parameters for training. Initialize the GPT model.

In [None]:
logger = logging.getLogger(__name__)
class TrainerConfig:
    # optimization parameters
    max_epochs = 10
    batch_size = 64
    learning_rate = 3e-4
    betas = (0.9, 0.95)
    grad_norm_clip = 1.0
    weight_decay = 0.1 # only applied on matmul weights
    # learning rate decay params: linear warmup followed by cosine decay to 10% of original
    lr_decay = False
    warmup_tokens = 375e6 # these two numbers come from the GPT-3 paper, but may not be good defaults elsewhere
    final_tokens = 260e9 # (at what point we reach 10% of original LR)
    # checkpoint settings
    ckpt_path = None
    num_workers = 0 # for DataLoader

    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            setattr(self, k, v)
# initialize a baby GPT model
model = GPT(vocab_size = train_dataset.vocab_size, n_embd=128, n_head=4, block_size =  train_dataset.block_size, n_layer=2)

You need to fill out the training process:
- Forward the model on the current batch `x`, `y`;
- Zero the gradients before the update;
- Backpropagate the loss and update the model parameters;
- You might want to use `torch.nn.utils.clip_grad_norm_`, with `max_norm` set to `config.grad_norm_clip`;
- After training, you should get a loss of around 0.1 and an accuracy of around 99% on both the training and test sets.
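
The provided loop also adjusts the learning rate when `lr_decay` is enabled: linear warmup followed by cosine decay down to 10% of the base rate. The following stand-alone sketch reproduces that multiplier so you can see its shape. The function name `lr_multiplier` and the token budgets below are illustrative assumptions, not the values used in the assignment.

```python
import math

def lr_multiplier(tokens, warmup_tokens, final_tokens):
    """Linear warmup, then cosine decay floored at 10% of the base learning rate."""
    if tokens < warmup_tokens:
        return tokens / max(1, warmup_tokens)
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 6e-4
warmup, final = 1024, 100_000  # illustrative token budgets only
for tokens in (0, 512, 1024, 50_512, 100_000):
    print(tokens, base_lr * lr_multiplier(tokens, warmup, final))
```

The multiplier rises from 0 to 1 during warmup, passes through 0.5 halfway through the decay, and never drops below 0.1.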

In [None]:
config = TrainerConfig(max_epochs=50, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=1024, final_tokens=50*len(train_dataset)*(ndigit+1),
                      num_workers=4)

device = 'cpu'
if torch.cuda.is_available():
  device = torch.cuda.current_device()
  model = torch.nn.DataParallel(model).to(device)


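# DataParallel hides custom methods behind .module, so unwrap the model before calling them
raw_model = model.module if hasattr(model, "module") else model
optimizer = raw_model.configure_optimizers(config)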
tokens = 0
for epoch in range(config.max_epochs):
    model.train()
    data = train_dataset 
    loader = DataLoader(data, shuffle=True, pin_memory=True,
                        batch_size=config.batch_size,
                        num_workers=config.num_workers)
    losses = []
    pbar = tqdm(enumerate(loader), total=len(loader)) 
    for iter, (x, y) in pbar:
        # place data on the correct device
        x = x.to(device)
        y = y.to(device)
        # forward the model
        """ CODE HERE """
        loss = None
        """ CODE HERE END """
        losses.append(loss.item())
        # decay the learning rate based on our progress
        if config.lr_decay:
            tokens += (y >= 0).sum() # number of tokens processed this step (i.e. label is not -100)
            if tokens < config.warmup_tokens:
                # linear warmup
                lr_mult = float(tokens) / float(max(1, config.warmup_tokens))
            else:
                # cosine learning rate decay
                progress = float(tokens - config.warmup_tokens) / float(max(1, config.final_tokens - config.warmup_tokens))
                lr_mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
            lr = config.learning_rate * lr_mult
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
        else:
            lr = config.learning_rate
        # report progress
        pbar.set_description(f"epoch {epoch+1} iter {iter}: train loss {loss.item():.5f}. lr {lr:e}")


Now you can run the following code to evaluate the model on the training and test data. You should reach more than 95% correctness on both.

In [None]:
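def top_k_logits(logits, k):
    # helper added here because sample() calls it when top_k is set, but the notebook never defines it:
    # keep the k largest logits in each row and set the rest to -inf
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('inf')
    return out

@torch.no_grad()  # no gradients are needed while sampling
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):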
    """
    take a conditioning sequence of indices in x (of shape (b,t)) and predict the next token in
    the sequence, feeding the predictions back into the model each time. 
    """
    block_size = train_dataset.block_size
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits, _ = model(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop probabilities to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)
    return x
def Addition_GPT(dataset, batch_size=32, max_batches=-1):
    
    results = []
    loader = DataLoader(dataset, batch_size=batch_size)
    for b, (x, y) in enumerate(loader):
        x = x.to(device)
        d1d2 = x[:, :ndigit*2]
        d1d2d3 = sample(model, d1d2, ndigit+1)
        d3 = d1d2d3[:, -(ndigit+1):]
        factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(device)
        # decode the integers from individual digits
        d1i = (d1d2[:,:ndigit] * factors[:,1:]).sum(1)
        d2i = (d1d2[:,ndigit:ndigit*2] * factors[:,1:]).sum(1)
        d3i_pred = (d3 * factors).sum(1)
        d3i_gt = d1i + d2i
        correct = (d3i_pred == d3i_gt).cpu() # Software 1.0 vs. Software 2.0 fight RIGHT on this line, lol
        for i in range(x.size(0)):
            results.append(int(correct[i]))
            judge = 'CORRECT' if correct[i] else 'WRONG'
            if not correct[i]:
                print("GPT claims that %03d + %03d = %03d (gt is %03d; %s)" 
                      % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i], judge))
        
        if max_batches >= 0 and b+1 >= max_batches:
            break

    print("final score: %d/%d = %.2f%% correct" % (np.sum(results), len(results), 100*np.mean(results)))

In [None]:
Addition_GPT(train_dataset, batch_size=1024, max_batches=10)

In [None]:
Addition_GPT(test_dataset, batch_size=1024, max_batches=10)