# Demo: reversing a sequence

In this tutorial we train a decoder-only transformer to reverse an integer sequence.

Our transformer is built from masked multihead attention layers, which let each position in the input sequence attend to itself and to all previous positions, but not to later positions.
We therefore expect the trained transformer to output the second half of the reversed sequence correctly, but not the first half.
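
To make the masking concrete, here is a minimal single-head sketch in plain PyTorch. It is not the `transformer.layers` implementation, which this tutorial does not show; the function name `masked_attention_weights` and the random inputs are assumptions made for illustration.

```python
import torch

def masked_attention_weights(q, k):
    """Causal attention weights for one head; q and k have shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                      # (seq_len, seq_len)
    # True above the diagonal marks "future" positions, which get masked out
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))
    return torch.softmax(scores, dim=-1)             # each row sums to 1

torch.manual_seed(0)
q, k = torch.randn(4, 8), torch.randn(4, 8)          # made-up queries and keys
weights = masked_attention_weights(q, k)
print(weights)
```

Row `i` holds the attention weights of position `i`. Every entry to the right of the diagonal is exactly zero, so row 0 puts all of its weight on position 0.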

In [1]:
import torch
from transformer import layers

# set device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f'Device is {device}')

Device is cpu


Our model is a standard decoder-only transformer, consisting of the following layers:
- embedding and positional encoding
- decoder stack
- linear (de-embedding) layer
- softmax

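Each decoder in the decoder stack consists of a masked multihead attention layer followed by a position-wise, fully connected, two-layer feed-forward network. We restrict to input sequences of length 10 with integer entries between 0 and 99. (Note that `torch.randint` excludes its upper bound, so the cells below, which pass `n_max = 99` as `high`, actually sample integers from 0 to 98.)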
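
For orientation, the layer list above can be sketched with PyTorch's built-in modules. This is a hypothetical stand-in, not the `layers.Transformer` class used in this tutorial: the class name `TinyDecoderOnly`, the learned positional embedding, and the use of `nn.TransformerEncoderLayer` with a causal mask are all assumptions.

```python
import torch
from torch import nn

class TinyDecoderOnly(nn.Module):
    """Hypothetical stand-in, not the `layers.Transformer` used in this tutorial."""

    def __init__(self, vocab=100, n_pe=1000, d_model=32, num_heads=4,
                 num_stacks=2, d_ff=64, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        # assumption: learned positional embeddings; the original may use sinusoids
        self.pos = nn.Embedding(n_pe, d_model)
        block = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True)
        # self-attention + feed-forward blocks; the causal mask passed in
        # forward() is what makes them behave like decoder blocks
        self.stack = nn.TransformerEncoder(block, num_layers=num_stacks)
        self.unembed = nn.Linear(d_model, vocab)     # linear (de-embedding) layer

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        seq_len = tokens.shape[1]
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # True marks pairs (i, j) with j > i, which position i may not attend to
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=tokens.device), diagonal=1)
        x = self.stack(x, mask=mask)
        return torch.softmax(self.unembed(x), dim=-1)  # (batch, seq_len, vocab)

sketch = TinyDecoderOnly().eval()
tokens = torch.randint(0, 100, (2, 10))
with torch.no_grad():
    probs = sketch(tokens)
print(probs.shape)  # torch.Size([2, 10, 100])
```

Each position gets a probability distribution over the 100 vocabulary entries. Because of the mask, the distribution at position `i` depends only on tokens `0..i`.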

In [8]:
seq_len = 10
n_min = 0
n_max = 99

model = layers.Transformer(
    vocab=100,
    n_pe=1000,
    d_model=32,
    num_heads=4,
    num_stacks=2,
    d_ff=64,
    dropout=0.1
).to(device)

Let's see how the randomly initialized model performs on the sequence reversal task. Here is one example:

In [3]:
input = torch.randint(n_min, n_max, (seq_len,), device=device)  # same device as the model
output = model.greedy_output(input)

print(f'Input sequence:  {input.tolist()}')
print(f'Output sequence: {output.tolist()}')

Input sequence:  [20, 7, 68, 34, 55, 93, 85, 40, 74, 7]
Output sequence: [32, 99, 6, 23, 15, 22, 98, 3, 34, 22]


The model has no idea what it's doing. 
Let's define a test set containing 100 examples, each of length 10, and test the model's performance on this set.
In particular, we will separately track the model's ability to generate the first 5 and the last 5 target integers.

In [4]:
test_data = torch.randint(
    low=n_min,
    high=n_max,
    size=(100, seq_len),  # 100 sequences
    device=device         # same device as the model
)

def test(data, model):
    model.eval()
    with torch.no_grad():
        input = data
        output = model.greedy_output(input)
        reversed_input = torch.flip(input, dims=[1])

        # split into halves
        rev_in_half_1, rev_in_half_2 = reversed_input.chunk(2, dim=1)
        out_half_1, out_half_2 = output.chunk(2, dim=1)

        # count correct predictions
        n_total = len(input)
        n_correct_1 = (out_half_1 == rev_in_half_1).all(-1).sum().item() 
        n_correct_2 = (out_half_2 == rev_in_half_2).all(-1).sum().item() 

    model.train()

    return n_correct_1 / n_total, n_correct_2 / n_total

accuracy_1, accuracy_2 = test(test_data, model)
print(f'Initial accuracy:\n'
      f'  1st half: {accuracy_1:.1%}\n'
      f'  2nd half: {accuracy_2:.1%}'
)

Initial accuracy:
  1st half: 0.0%
  2nd half: 0.0%


As expected, the untrained transformer can't reverse either half of the sequence. 
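
Note that the metric is strict: a sequence counts as correct for a given half only if every position in that half matches. A toy check with hand-written tensors (the values are made up for illustration) shows the same `chunk` / `all` pattern on sequences of length 4:

```python
import torch

target = torch.tensor([[4, 3, 2, 1],
                       [8, 7, 6, 5]])
output = torch.tensor([[0, 3, 2, 1],    # first half wrong, second half right
                       [8, 7, 6, 5]])   # both halves right

tgt_1, tgt_2 = target.chunk(2, dim=1)
out_1, out_2 = output.chunk(2, dim=1)

# a row counts only if every position in that half matches
acc_1 = (out_1 == tgt_1).all(-1).float().mean().item()
acc_2 = (out_2 == tgt_2).all(-1).float().mean().item()
print(acc_1, acc_2)  # 0.5 1.0
```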

Now let's train the model. We will use cross-entropy loss and the Adam optimizer.
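
The training cell below transposes the model output before passing it to the loss. The sketch here shows why, assuming `layers.CrossEntropyLoss` follows the same shape convention as PyTorch's built-in `torch.nn.CrossEntropyLoss` (an assumption, since its source is not shown in this tutorial). The random `logits` and `targets` are stand-ins.

```python
import torch
from torch import nn

batch, seq_len, vocab = 3, 10, 100
logits = torch.randn(batch, seq_len, vocab)          # stand-in for unnormalized model scores
targets = torch.randint(0, vocab, (batch, seq_len))  # one class index per position

loss_fn = nn.CrossEntropyLoss()
# the built-in loss wants the class dimension second: (batch, vocab, seq_len)
loss = loss_fn(logits.transpose(1, 2), targets)

# equivalent: flatten batch and position into a single axis
flat = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss.item(), flat.item())
```

Both calls average the per-position loss over all `batch * seq_len` positions, so they print the same value up to floating-point error.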

In [9]:
cost_fn = layers.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters(), lr=0.001)

def train(model, steps, batch_size, seq_len, grad_norm):
    # steps between tests (at least 1, so the modulo below is safe for steps < 10)
    test_interval = max(1, steps // 10)
    print('Test set accuracies (1st half, 2nd half):')

    for step in range(steps):
        # randomly generate input sequences
        input = torch.randint(n_min, n_max, (batch_size, seq_len), device=device)

        # compute output and reversed input
        output = model(input)
        reversed_input = torch.flip(input, dims=[1])

        # compute loss. transpose puts class dim in correct position for CE loss
        loss = cost_fn(output.transpose(1,2), reversed_input)

        # backward pass
        optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm)

        # optimizer step
        optim.step()

        # test
        if step % test_interval == 0 or step == steps - 1:
            accuracies = test(test_data, model)
            print(f'  Step {step:4d}: '
                  f'({accuracies[0]:.1%}, {accuracies[1]:6.1%})')

train(model, 2000, 64, 10, 1)

Test set accuracies (1st half, 2nd half):
  Step    0: (0.0%,   0.0%)
  Step  200: (0.0%,   0.0%)
  Step  400: (0.0%,   0.0%)
  Step  600: (0.0%,   0.0%)
  Step  800: (0.0%,   3.0%)
  Step 1000: (0.0%,  37.0%)
  Step 1200: (0.0%,  91.0%)
  Step 1400: (0.0%,  99.0%)
  Step 1600: (0.0%, 100.0%)
  Step 1800: (0.0%, 100.0%)
  Step 1999: (0.0%, 100.0%)
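
The training loop above clips the gradient norm to `grad_norm` (1 in this run) before each optimizer step. Here is a small standalone illustration of what `torch.nn.utils.clip_grad_norm_` does, using a made-up parameter and gradient rather than the model's:

```python
import torch

w = torch.nn.Parameter(torch.zeros(3))
w.grad = torch.tensor([3.0, 4.0, 0.0])       # made-up gradient with L2 norm 5

# returns the norm measured before clipping, then rescales .grad in place
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
print(norm_before.item(), w.grad.norm().item())
```

The gradient keeps its direction but is scaled down so that its norm is approximately `max_norm`. Gradients whose norm is already below `max_norm` are left unchanged.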


After training, the model produces the correct second half of the reversed sequence, but not the first half.
This is as expected: positions in the second half can attend to those in the first half, so they have the information needed to produce the correct output.
But because of the masking in the multihead attention layer, positions in the first half cannot attend to those in the second half, and so cannot reproduce them as required.
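
The argument can be checked by simple index bookkeeping. The sketch below assumes, as in the training loop, that all outputs are produced in a single masked pass, so output position `i` sees only inputs `0..i`.

```python
seq_len = 10

for i in range(seq_len):
    needed = seq_len - 1 - i          # input index that output position i must copy
    visible = needed <= i             # causal mask: position i sees inputs 0..i only
    print(f'output {i}: needs input {needed}, visible={visible}')

solvable = [i for i in range(seq_len) if seq_len - 1 - i <= i]
print(solvable)  # [5, 6, 7, 8, 9]
```

Only output positions 5 to 9 can see the input they need to copy, which matches the accuracies reported above.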

Here is an example:

In [10]:
input = torch.randint(n_min, n_max, (seq_len,), device=device)  # same device as the model
output = model.greedy_output(input)

print(f'Input sequence:  {input.tolist()}')
print(f'Output sequence: {output.tolist()}')

Input sequence:  [11, 0, 78, 93, 19, 53, 28, 18, 39, 19]
Output sequence: [11, 11, 11, 11, 19, 19, 93, 78, 0, 11]


Notice that the second half of the output sequence is precisely the reverse of the first half of the input sequence.
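
This can be verified directly on the two lists printed above, using plain Python:

```python
inp = [11, 0, 78, 93, 19, 53, 28, 18, 39, 19]
out = [11, 11, 11, 11, 19, 19, 93, 78, 0, 11]

half = len(inp) // 2
target = inp[::-1]                    # the fully reversed input

print(out[half:] == target[half:])   # True
print(out[:half] == target[:half])   # False
```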