# Demo: copying a sequence

In this tutorial we train a decoder-only transformer to repeat back an integer sequence. The transformer should be able to learn this task: its masked multihead attention layers let each position attend to itself and to all previous positions, so each position can copy its own input token.
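
To make the masking argument concrete, here is a minimal sketch of a causal attention mask. It is not taken from the `transformer` package used in this demo; the random scores stand in for query-key products, and the variable names are illustrative only.

```python
import torch

seq_len = 5

# Lower-triangular boolean mask: entry (i, j) is True when position i
# is allowed to attend to position j, i.e. when j <= i.
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Random attention scores standing in for scaled query-key products.
torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)

# Disallowed (future) positions get -inf so softmax gives them zero weight.
weights = torch.softmax(scores.masked_fill(~allowed, float('-inf')), dim=-1)

print(allowed.int())
print(weights)
```

The diagonal of the mask is `True`, so every position can attend to itself, while everything above the diagonal (the future) receives zero weight.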

In [1]:
import torch
from transformer import layers

# set device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f'Device is {device}')

Device is cpu


Our model is a standard decoder-only transformer, consisting of the following layers:
- embedding and positional encoding
- decoder stack
- linear (de-embedding) layer
- softmax
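
The following sketch traces tensor shapes through the non-decoder layers in this list. It is an illustration only: the sinusoidal positional encoding is an assumption (the demo does not say which encoding `layers.Transformer` uses), and `sinusoidal_pe`, `embed`, and `deembed` are hypothetical names. The sizes match the configuration used in this demo.

```python
import math
import torch
from torch import nn

vocab, d_model, n_pe = 100, 32, 1000   # sizes matching this demo's config

def sinusoidal_pe(n_positions, d_model):
    """Sinusoidal positional encoding table of shape (n_positions, d_model)."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even feature indices
    pe[:, 1::2] = torch.cos(pos * div)   # odd feature indices
    return pe

embed = nn.Embedding(vocab, d_model)     # token id -> vector
pe = sinusoidal_pe(n_pe, d_model)        # assumed encoding, see lead-in
deembed = nn.Linear(d_model, vocab)      # vector -> one score per token id

tokens = torch.randint(0, vocab, (4, 10))        # (batch, seq_len)
x = embed(tokens) + pe[:tokens.size(1)]          # (batch, seq_len, d_model)
# ... the decoder stack would transform x here, keeping its shape ...
probs = torch.softmax(deembed(x), dim=-1)        # (batch, seq_len, vocab)
print(x.shape, probs.shape)
```

The final softmax turns each position's scores into a probability distribution over the 100 possible tokens.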

Each decoder in the decoder stack consists of a masked multihead attention layer followed by a position-wise, two-layer feed-forward network. We restrict the inputs to sequences of length 10 containing integers from 0 to 99.
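
A single decoder of this kind might look roughly like the sketch below, built on `torch.nn.MultiheadAttention`. This is not the implementation in the `transformer` package: `DecoderBlock` is a hypothetical name, and the residual connections, layer normalization, and ReLU activation are assumptions that the description above does not mention.

```python
import torch
from torch import nn

class DecoderBlock(nn.Module):
    """Hypothetical decoder block: masked self-attention, then a two-layer FFN."""

    def __init__(self, d_model=32, num_heads=4, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # Residual connections and layer norm are assumptions, not from the text.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # True marks pairs that may NOT attend: everything above the diagonal.
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=future, need_weights=False)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

torch.manual_seed(0)
block = DecoderBlock().eval()
x = torch.randn(2, 10, 32)
with torch.no_grad():
    y = block(x)
    x_changed = x.clone()
    x_changed[:, -1] += 1.0          # perturb only the last position
    y_changed = block(x_changed)

# Compare the outputs at all positions before the perturbed one.
earlier_unchanged = torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-5)
print(y.shape, earlier_unchanged)
```

The final check perturbs only the last position and compares the outputs at all earlier positions, which the mask should leave unaffected.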

In [5]:
seq_len = 10
n_min = 0
n_max = 99

model = layers.Transformer(
    vocab=100,
    n_pe=1000,
    d_model=32,
    num_heads=4,
    num_stacks=2,
    d_ff=64,
    dropout=0.1
).to(device)

Let's see how the randomly initialized model performs on the sequence copying task.

In [6]:
test_data = torch.randint(
    low=n_min,
    high=n_max + 1,  # `high` is exclusive, so add 1 to include n_max
    size=(100, seq_len)  # 100 sequences
)

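def test(data, model):
    """Return the fraction of sequences in `data` that the model copies exactly."""
    model.eval()
    with torch.no_grad():
        inputs = data.to(device)  # keep the data on the same device as the model
        outputs = model.greedy_output(inputs)

        n_total = len(inputs)

        # number of output sequences that match their input sequence
        n_correct = (inputs == outputs).all(-1).sum().item()

    model.train()

    return n_correct / n_total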

accuracy = test(test_data, model)
print(f'Initial accuracy: {accuracy:.1%}')

Initial accuracy: 0.0%


As expected, the untrained transformer cannot copy any of the test sequences.
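
The accuracy reported by `test` is an all-or-nothing, sequence-level metric. The toy example below illustrates it on hand-made logits, assuming that greedy decoding amounts to taking the argmax over the vocabulary dimension (how `greedy_output` works internally is not shown in this demo).

```python
import torch

# Toy logits of shape (batch, seq_len, vocab) standing in for model outputs.
targets = torch.tensor([[1, 2, 0], [2, 2, 1]])
logits = torch.zeros(2, 3, 3)
# Make the first sequence fully correct ...
logits[0, torch.arange(3), targets[0]] = 5.0
# ... and the second sequence wrong at its last position only.
logits[1, torch.arange(3), torch.tensor([2, 2, 0])] = 5.0

preds = logits.argmax(dim=-1)               # greedy choice per position
per_sequence = (preds == targets).all(-1)   # True only if every token matches
accuracy = per_sequence.float().mean().item()
print(preds.tolist(), per_sequence.tolist(), accuracy)
# prints [[1, 2, 0], [2, 2, 0]] [True, False] 0.5
```

Five of the six tokens are correct here, yet the accuracy is 50% because a single wrong token fails the whole sequence.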

Now let's train the model. We will use cross entropy loss and an Adam optimizer.

In [7]:
cost_fn = layers.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.001)

def train(model, steps, batch_size, seq_len, grad_norm):
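    # steps between tests (at least 1, so the modulo below is always valid)
    test_interval = max(1, steps // 10)
    print('Test set accuracies:')

    for step in range(steps):
        # randomly generate input sequences (`high` is exclusive, hence the + 1)
        input = torch.randint(n_min, n_max + 1, (batch_size, seq_len), device=device)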

        # compute output
        output = model(input)

        # compute loss. transpose puts class dim in correct position for CE loss
        loss = cost_fn(output.transpose(1,2), input)

        # backward pass
        optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm)

        # optimizer step
        optim.step()

        # test
        if step % test_interval == 0 or step == steps - 1:
            accuracy = test(test_data, model)
            print(f'  Step {step:3d}: {accuracy:.1%}')

train(model, 200, 64, 10, 1)

Test set accuracies:
  Step   0: 0.0%
  Step  20: 0.0%
  Step  40: 0.0%
  Step  60: 24.0%
  Step  80: 74.0%
  Step 100: 100.0%
  Step 120: 100.0%
  Step 140: 100.0%
  Step 160: 100.0%
  Step 180: 100.0%
  Step 199: 100.0%
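
The transpose in the loss computation of the training loop reflects a shape convention. The sketch below demonstrates it with `torch.nn.CrossEntropyLoss`, on the assumption that `layers.CrossEntropyLoss` follows the same convention. PyTorch's loss expects unnormalized scores with the class dimension in position 1; whether the demo's loss takes scores or probabilities is not stated here.

```python
import torch
from torch import nn

batch, seq_len, vocab = 4, 10, 100
torch.manual_seed(0)
logits = torch.randn(batch, seq_len, vocab)          # model-style output
targets = torch.randint(0, vocab, (batch, seq_len))

loss_fn = nn.CrossEntropyLoss()

# Option 1: move the class dimension to position 1 -> (batch, vocab, seq_len).
loss_transposed = loss_fn(logits.transpose(1, 2), targets)

# Option 2: flatten the batch and sequence dimensions into one.
loss_flat = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))

print(loss_transposed.item(), loss_flat.item())
```

Both options average the per-token loss over all `batch * seq_len` positions, so they should give the same value up to floating-point error.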


After training, the model copies every test sequence perfectly. For example:

In [8]:
input = torch.randint(n_min, n_max + 1, (seq_len,), device=device)  # `high` is exclusive
output = model.greedy_output(input)

print(f'Input sequence:  {input.tolist()}')
print(f'Output sequence: {output.tolist()}')

Input sequence:  [56, 4, 43, 81, 65, 9, 71, 11, 21, 79]
Output sequence: [56, 4, 43, 81, 65, 9, 71, 11, 21, 79]
