# TL;DR

1. In this lab scenario you will compare the performance of a classic RNN and an LSTM on a toy example.
2. The toy example shows that maintaining memory over even 20 steps is non-trivial.
3. Finally, you will see how curriculum learning can make it possible to train a model on longer sequences.

# Problem definition

Here we consider a toy example in which the goal is to discriminate between two types of binary sequences:
* [Type 0] a sequence with exactly one zero (the remaining entries are equal to one),
* [Type 1] a sequence full of ones.

We are especially interested in how well the trained models discriminate between a sequence full of ones and a sequence with a leading zero followed by ones. In this case the model effectively has to output the first element of the sequence, because the label (sequence type) is fully determined by that element, so the information has to be carried through every remaining step.
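To make the two sequence types concrete, here is a minimal sketch in plain Python. The helper name `make_sequence` and its `zero_position` parameter are illustrative assumptions, not part of the lab code; the label convention (0 for Type 0, 1 for Type 1) follows the definitions above.

```python
def make_sequence(sequence_len, zero_position=None):
    """Return a (sequence, label) pair.

    With zero_position=None the sequence is all ones (Type 1, label 1).
    Otherwise a single zero is placed at zero_position (Type 0, label 0).
    """
    sequence = [1.0] * sequence_len
    if zero_position is None:
        return sequence, 1
    sequence[zero_position] = 0.0
    return sequence, 0


# The pair of examples the text singles out: a leading zero versus all ones.
leading_zero = make_sequence(5, zero_position=0)
all_ones = make_sequence(5)
print(leading_zero)
print(all_ones)
```

The two sequences differ only in their first element, so a model that reads them left to right has to keep that one bit until the last step.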

# Implementation

## Importing torch

Install `torch` and `torchvision`.

In [None]:
!pip3 install torch torchvision



In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(1)

<torch._C.Generator at 0x7f13db1468b0>

## Understand dimensionality

Check the input and output specifications of [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) and [RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). The following snippet shows how to process
a sequence with an LSTM and output a vector of size `hidden_dim` after reading
each token of the sequence.

In [3]:
hidden_dim = 5
lstm = nn.LSTM(1, hidden_dim)  # Input sequence contains elements - vectors of size 1

# create a random sequence
sequence = [torch.randn(1) for _ in range(10)]

# initialize the hidden state (including cell state)
hidden = (torch.zeros(1, 1, 5),
          torch.zeros(1, 1, 5))

for i, elem in enumerate(sequence):
  # The input has shape (seq_len, batch, input_size) = (1, 1, 1):
  # we feed a single element of the sequence at a time, there is only
  # one sample (sequence) in the batch, and each element of the sequence
  # is treated as a vector of size 1.
  out, hidden = lstm(elem.view(1, 1, 1), hidden)
  print(f'i={i} out={out.detach()}')
print(f'Final hidden state={hidden[0].detach()} cell state={hidden[1].detach()}')

i=0 out=tensor([[[-0.0675,  0.1179,  0.1081,  0.0414, -0.0341]]])
i=1 out=tensor([[[-0.1067,  0.1726,  0.1400,  0.0902, -0.0596]]])
i=2 out=tensor([[[-0.1148,  0.1885,  0.1956,  0.0974, -0.0840]]])
i=3 out=tensor([[[-0.1270,  0.2031,  0.1495,  0.1249, -0.0860]]])
i=4 out=tensor([[[-0.1281,  0.2019,  0.1810,  0.1475, -0.1027]]])
i=5 out=tensor([[[-0.1274,  0.2060,  0.0798,  0.1330, -0.0860]]])
i=6 out=tensor([[[-0.1318,  0.2039,  0.0997,  0.1772, -0.1011]]])
i=7 out=tensor([[[-0.1145,  0.2008, -0.0431,  0.1051, -0.0717]]])
i=8 out=tensor([[[-0.1289,  0.1989,  0.0515,  0.1944, -0.1030]]])
i=9 out=tensor([[[-0.1329,  0.1920,  0.0686,  0.1772, -0.0988]]])
Final hidden state=tensor([[[-0.1329,  0.1920,  0.0686,  0.1772, -0.0988]]]) cell state=tensor([[[-0.2590,  0.4080,  0.1307,  0.4329, -0.2895]]])


## To implement

Process the whole sequence all at once by calling `lstm` only once and check that the output is exactly the same as above (remember to initialize the hidden state the same way).

In [4]:
# To implement

input = torch.cat(sequence).view(len(sequence), 1, -1)

hidden = (torch.zeros(1, 1, 5),
          torch.zeros(1, 1, 5))

out, hidden = lstm(input, hidden)
print(out)
print(hidden)

tensor([[[-0.0675,  0.1179,  0.1081,  0.0414, -0.0341]],

        [[-0.1067,  0.1726,  0.1400,  0.0902, -0.0596]],

        [[-0.1148,  0.1885,  0.1956,  0.0974, -0.0840]],

        [[-0.1270,  0.2031,  0.1495,  0.1249, -0.0860]],

        [[-0.1281,  0.2019,  0.1810,  0.1475, -0.1027]],

        [[-0.1274,  0.2060,  0.0798,  0.1330, -0.0860]],

        [[-0.1318,  0.2039,  0.0997,  0.1772, -0.1011]],

        [[-0.1145,  0.2008, -0.0431,  0.1051, -0.0717]],

        [[-0.1289,  0.1989,  0.0515,  0.1944, -0.1030]],

        [[-0.1329,  0.1920,  0.0686,  0.1772, -0.0988]]],
       grad_fn=<StackBackward>)
(tensor([[[-0.1329,  0.1920,  0.0686,  0.1772, -0.0988]]],
       grad_fn=<StackBackward>), tensor([[[-0.2590,  0.4080,  0.1307,  0.4329, -0.2895]]],
       grad_fn=<StackBackward>))
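Instead of comparing the printed tensors by eye, the equivalence can also be checked programmatically. The following is a self-contained sketch that rebuilds the setup from the cells above; the helper `zero_state` and the tolerance `atol=1e-6` are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

hidden_dim = 5
lstm = nn.LSTM(1, hidden_dim)
sequence = [torch.randn(1) for _ in range(10)]


def zero_state():
    # Fresh (hidden state, cell state) pair, each of shape (num_layers, batch, hidden_dim)
    return (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))


with torch.no_grad():
    # Variant 1: feed one element at a time, carrying the state forward
    hidden = zero_state()
    step_outputs = []
    for elem in sequence:
        out, hidden = lstm(elem.view(1, 1, 1), hidden)
        step_outputs.append(out)
    stepwise_out = torch.cat(step_outputs)  # shape (10, 1, hidden_dim)

    # Variant 2: feed the whole sequence in a single call
    full_input = torch.cat(sequence).view(len(sequence), 1, -1)
    full_out, full_hidden = lstm(full_input, zero_state())

print(torch.allclose(stepwise_out, full_out, atol=1e-6))
print(torch.allclose(hidden[0], full_hidden[0], atol=1e-6))
print(torch.allclose(hidden[1], full_hidden[1], atol=1e-6))
```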


## Training a model

Below we define a very simple model: a single LSTM layer whose output after the last time step is passed through a ReLU and then through a single fully connected layer that produces one number. That number,
generated after reading the last element of the sequence,
serves as the logit for our classification problem.

In [5]:
class Model(nn.Module):

    def __init__(self, hidden_dim):
        super(Model, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(1, self.hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

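    def forward(self, x):
        # x has shape (seq_len, 1, 1); out has shape (seq_len, 1, hidden_dim)
        out, _ = self.lstm(x)
        # Use only the output of the last time step (batch size 1 is assumed)
        logits = self.hidden2label(F.relu(out[-1].view(-1)))
        return logits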

Below is a training loop, where we only train on the two hardest examples.

In [99]:
SEQUENCE_LEN = 10

# Pairs of (sequence, label)
HARD_EXAMPLES = [([0.]+(SEQUENCE_LEN-1)*[1.], 0),
                 (SEQUENCE_LEN*[1.], 1)]


def eval_on_hard_examples(model):
    with torch.no_grad():
        logits = []
        for sequence in HARD_EXAMPLES:
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')


def train_model(hidden_dim, lr, num_steps=10000):
    model = Model(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.99)

    for step in range(num_steps):  
        if step % (num_steps // 5) == 0:
            eval_on_hard_examples(model)

        for sequence, label in HARD_EXAMPLES:
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step() 

    return model

In [100]:
model = train_model(hidden_dim=20, lr=0.01, num_steps=10000)

Logits for hard examples=[tensor([-0.1291]), tensor([-0.1295])]
Logits for hard examples=[tensor([-12.7435]), tensor([15.6889])]
Logits for hard examples=[tensor([-12.7764]), tensor([15.6845])]
Logits for hard examples=[tensor([-12.8088]), tensor([15.6804])]
Logits for hard examples=[tensor([-12.8401]), tensor([15.6765])]
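`nn.BCEWithLogitsLoss` applies a sigmoid to the logit internally, so the logits printed above can be read as probabilities of `Type 1`. A small sketch in plain Python, using the last pair of logits from the output above (the helper `sigmoid` and the dictionary keys are introduced only for this illustration):

```python
import math


def sigmoid(logit):
    """Map a logit to the probability of the positive class (Type 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Final logits reported for the two hard examples
final_logits = {
    "leading zero (Type 0)": -12.8401,
    "all ones (Type 1)": 15.6765,
}

for name, logit in final_logits.items():
    print(f"{name}: P(Type 1) = {sigmoid(logit):.8f}")
```

A logit of 0 corresponds to a probability of 0.5, which is why the nearly identical logits at step 0 mean the untrained model cannot tell the two sequences apart.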


## To implement

1. Check for which values of `SEQUENCE_LEN` the model is able to discriminate between the two hard examples (after training).
2. Instead of training on `HARD_EXAMPLES` only, modify the training loop to train on sequences where the zero may be in any position (that is, any valid `Type 0` sequence, not just the hardest one). After modifying the training loop, check for which values of `SEQUENCE_LEN` you can train the model successfully.
3. Replace the LSTM with a classic RNN and check for which values of `SEQUENCE_LEN` you can train the model successfully.
4. Write a proper curriculum learning loop that considers longer and longer sequences, moving on to the next sequence length only after the model has been trained successfully on the current one.

Note that for steps 2-4 you may need to change the value of `num_steps`.
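The control flow asked for in step 4 can be sketched independently of any model. In the sketch below, `run_curriculum`, `train_one_round`, `is_solved` and `max_rounds` are hypothetical names, and the "model" is a toy counter that stands in for real training; only the rule "advance to the next length after success, stop at the first failure" is meant to carry over.

```python
def run_curriculum(sequence_lens, train_one_round, is_solved, max_rounds):
    """Return the list of lengths that were solved, in curriculum order."""
    completed = []
    for length in sequence_lens:
        for _ in range(max_rounds):
            if is_solved(length):
                break
            train_one_round(length)
        if not is_solved(length):
            # The current length was not learned within the budget: stop here
            return completed
        completed.append(length)
    return completed


# Toy stand-in for a model: each training round adds 5 units of "skill",
# and a length counts as solved once the skill reaches it.
state = {"skill": 0}


def toy_train(length):
    state["skill"] += 5


def toy_solved(length):
    return state["skill"] >= length


result = run_curriculum([10, 15, 20, 40], toy_train, toy_solved, max_rounds=3)
print(result)  # [10, 15, 20]
```

Because the skill carries over between lengths, the shorter lengths make the longer ones reachable within the per-length budget, which is the intuition behind reusing the same model across the curriculum.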


In [8]:
def generate_hard_examples(sequence_len):
  return [([0.]+(sequence_len-1)*[1.], 0), (sequence_len*[1.], 1)]

In [27]:
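# 1. Check for what values of SEQUENCE_LEN the model is able to discriminate between the two hard examples (after training).
# Note: this cell evaluates the already trained `model` on prefixes of different lengths; it does not retrain it.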
hard_examples = generate_hard_examples(1000)

with torch.no_grad():
  for i in list(range(1, 10)) + list(range(100, 1001, 300)):
    logits = [model(torch.tensor(seq[:i]).view(-1, 1, 1)) for seq, _ in hard_examples]
    print("Sequence size ", i, ":", *logits)      

Sequence size  1 : tensor([-6.0568]) tensor([3.0099])
Sequence size  2 : tensor([-8.1653]) tensor([8.7463])
Sequence size  3 : tensor([-9.7948]) tensor([11.0172])
Sequence size  4 : tensor([-11.1367]) tensor([11.9786])
Sequence size  5 : tensor([-11.8641]) tensor([12.2417])
Sequence size  6 : tensor([-12.2062]) tensor([12.3171])
Sequence size  7 : tensor([-12.4140]) tensor([12.3553])
Sequence size  8 : tensor([-12.5057]) tensor([12.3768])
Sequence size  9 : tensor([-12.5301]) tensor([12.3896])
Sequence size  100 : tensor([-12.5552]) tensor([12.4122])
Sequence size  400 : tensor([-12.5552]) tensor([12.4122])
Sequence size  700 : tensor([-12.5552]) tensor([12.4122])
Sequence size  1000 : tensor([-12.5552]) tensor([12.4122])


In [201]:
# 2. Train on examples other than HARD_EXAMPLES
import random

def get_train_examples(batch_size, sequence_len):
  examples = [[[1.] * sequence_len, 1] for _ in range(batch_size)]
  for i in range(0, batch_size, 2):
    examples[i][0][random.randint(0, sequence_len - 1)] = 0.
    examples[i][1] = 0
  
  random.shuffle(examples)
  return examples

def eval_on_hard_examples(model, sequence_len):
    with torch.no_grad():
        logits = []
        for sequence in generate_hard_examples(sequence_len):
            input = torch.tensor(sequence[0]).view(-1, 1, 1)
            logit = model(input)
            logits.append(logit.detach())
        print(f'Logits for hard examples={logits}')

def train_model_on_all(model_cl, hidden_dim, lr=0.001, num_steps=64, batch_size=128, sequence_len=SEQUENCE_LEN):
    model = model_cl(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for step in range(num_steps):  
        if step % (num_steps // 5) == 0:
            eval_on_hard_examples(model, sequence_len)

        for sequence, label in get_train_examples(batch_size, sequence_len):
            model.zero_grad()
            logit = model(torch.tensor(sequence).view(-1, 1, 1))  
            
            loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
            loss.backward()

            optimizer.step() 

    return model
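The loop above feeds one sequence at a time, and `Model.forward` flattens its output under the assumption of batch size 1. As a rough sketch of an alternative, several sequences of equal length can be stacked into one tensor of shape `(seq_len, batch, 1)` and processed in a single call. `BatchedModel` is a hypothetical variant written for this illustration and is not used elsewhere in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)


class BatchedModel(nn.Module):
    """Hypothetical batched variant of Model: returns one logit per sequence."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        out, _ = self.lstm(x)  # (seq_len, batch, hidden_dim)
        # Last time step only: (batch, hidden_dim) -> (batch, 1) -> (batch,)
        return self.hidden2label(F.relu(out[-1])).squeeze(-1)


sequences = [[0., 1., 1., 1.],
             [1., 1., 1., 1.],
             [1., 1., 0., 1.]]
labels = torch.tensor([0., 1., 0.])

# (batch, seq_len) -> (seq_len, batch) -> (seq_len, batch, 1)
batch = torch.tensor(sequences).t().contiguous().unsqueeze(-1)

model = BatchedModel(hidden_dim=20)
logits = model(batch)
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()

print(batch.shape, logits.shape, loss.item())
```

With batching, one optimizer step uses the average loss over the batch, so the learning rate and `num_steps` that work for the per-example loop may need to be retuned.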

In [165]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0001, sequence_len=5)

Logits for hard examples=[tensor([0.1277]), tensor([0.1261])]
Logits for hard examples=[tensor([0.1080]), tensor([0.1081])]
Logits for hard examples=[tensor([0.0956]), tensor([0.0999])]
Logits for hard examples=[tensor([0.0867]), tensor([0.1218])]
Logits for hard examples=[tensor([-0.0936]), tensor([0.2174])]
Logits for hard examples=[tensor([-0.5864]), tensor([0.4229])]


In [166]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0001, sequence_len=10)

Logits for hard examples=[tensor([-0.0681]), tensor([-0.0681])]
Logits for hard examples=[tensor([-0.0198]), tensor([-0.0188])]
Logits for hard examples=[tensor([0.0039]), tensor([0.0251])]
Logits for hard examples=[tensor([-1.3582]), tensor([0.1065])]
Logits for hard examples=[tensor([-1.9673]), tensor([0.8719])]
Logits for hard examples=[tensor([-2.4964]), tensor([0.9570])]


In [167]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0001, sequence_len=11)

Logits for hard examples=[tensor([-0.2064]), tensor([-0.2064])]
Logits for hard examples=[tensor([-0.1264]), tensor([-0.1251])]
Logits for hard examples=[tensor([-0.4975]), tensor([-0.0659])]
Logits for hard examples=[tensor([-1.0397]), tensor([0.6177])]
Logits for hard examples=[tensor([-1.4367]), tensor([0.4972])]
Logits for hard examples=[tensor([-1.7889]), tensor([0.9149])]


In [230]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0005, sequence_len=15)

Logits for hard examples=[tensor([-0.1940]), tensor([-0.1940])]
Logits for hard examples=[tensor([0.0106]), tensor([0.0120])]
Logits for hard examples=[tensor([-1.0678]), tensor([0.3629])]
Logits for hard examples=[tensor([-2.2032]), tensor([0.6266])]
Logits for hard examples=[tensor([-3.5476]), tensor([1.3773])]
Logits for hard examples=[tensor([-5.3074]), tensor([1.2402])]


In [231]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0005, sequence_len=20)

Logits for hard examples=[tensor([-0.1625]), tensor([-0.1625])]
Logits for hard examples=[tensor([-0.0161]), tensor([-0.0161])]
Logits for hard examples=[tensor([0.0245]), tensor([0.0360])]
Logits for hard examples=[tensor([-0.4180]), tensor([0.7219])]
Logits for hard examples=[tensor([-2.7927]), tensor([0.8207])]
Logits for hard examples=[tensor([-4.2558]), tensor([1.8743])]


In [232]:
model = train_model_on_all(model_cl=Model, hidden_dim=20, lr=0.0005, sequence_len=100)

Logits for hard examples=[tensor([-0.1570]), tensor([-0.1570])]
Logits for hard examples=[tensor([-0.0348]), tensor([-0.0348])]
Logits for hard examples=[tensor([-0.0176]), tensor([-0.0176])]
Logits for hard examples=[tensor([-0.0024]), tensor([-0.0024])]
Logits for hard examples=[tensor([-0.0045]), tensor([-0.0045])]
Logits for hard examples=[tensor([-0.0022]), tensor([-0.0022])]


In [233]:
# 3. Replace LSTM by a classic RNN and check for what values of SEQUENCE_LEN you can train the model successfully.

class ModelRNN(nn.Module):
    def __init__(self, hidden_dim):
        super(ModelRNN, self).__init__()
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(1, self.hidden_dim)
        self.hidden2label = nn.Linear(hidden_dim, 1)

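    def forward(self, x):
        # x has shape (seq_len, 1, 1); out has shape (seq_len, 1, hidden_dim)
        out, _ = self.rnn(x)
        # Use only the output of the last time step (batch size 1 is assumed)
        logits = self.hidden2label(F.relu(out[-1].view(-1)))
        return logits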

In [175]:
model = train_model_on_all(model_cl=ModelRNN, hidden_dim=20, lr=0.0005, sequence_len=10)

Logits for hard examples=[tensor([-0.3080]), tensor([-0.3078])]
Logits for hard examples=[tensor([-1.3091]), tensor([0.5107])]
Logits for hard examples=[tensor([-3.0044]), tensor([1.3341])]
Logits for hard examples=[tensor([-4.1045]), tensor([2.1033])]
Logits for hard examples=[tensor([-4.7564]), tensor([2.6488])]
Logits for hard examples=[tensor([-5.2818]), tensor([3.2748])]


In [176]:
model = train_model_on_all(model_cl=ModelRNN, hidden_dim=20, lr=0.0005, sequence_len=15)

Logits for hard examples=[tensor([0.1063]), tensor([0.1063])]
Logits for hard examples=[tensor([0.0147]), tensor([0.0734])]
Logits for hard examples=[tensor([-2.4030]), tensor([0.5212])]
Logits for hard examples=[tensor([-3.1977]), tensor([1.0458])]
Logits for hard examples=[tensor([-4.4182]), tensor([1.4480])]
Logits for hard examples=[tensor([-5.1073]), tensor([1.9153])]


In [177]:
model = train_model_on_all(model_cl=ModelRNN, hidden_dim=20, lr=0.0005, sequence_len=20)

Logits for hard examples=[tensor([0.0564]), tensor([0.0564])]
Logits for hard examples=[tensor([0.0204]), tensor([0.0204])]
Logits for hard examples=[tensor([-0.4085]), tensor([0.6426])]
Logits for hard examples=[tensor([-2.3408]), tensor([1.5616])]
Logits for hard examples=[tensor([-3.4412]), tensor([2.1318])]
Logits for hard examples=[tensor([-4.0318]), tensor([2.2593])]


In [178]:
model = train_model_on_all(model_cl=ModelRNN, hidden_dim=20, lr=0.0005, sequence_len=100)

Logits for hard examples=[tensor([0.0399]), tensor([0.0399])]
Logits for hard examples=[tensor([-0.0109]), tensor([-0.0109])]
Logits for hard examples=[tensor([0.0209]), tensor([0.0209])]
Logits for hard examples=[tensor([-0.0067]), tensor([-0.0067])]
Logits for hard examples=[tensor([0.0215]), tensor([0.0314])]
Logits for hard examples=[tensor([-0.1254]), tensor([0.0667])]
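One possible way to probe why long sequences are hard is to measure how strongly the output of the last step depends on the first input element in a freshly initialized layer. The sketch below is a diagnostic under stated assumptions, not part of the lab: the helper `first_input_grad` is hypothetical, it uses an all-ones input, and it looks at untrained layers only, so it says nothing definitive about the trained models above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def first_input_grad(recurrent_cls, sequence_len, hidden_dim=20):
    """Absolute gradient of the summed last-step output w.r.t. the first input element."""
    layer = recurrent_cls(1, hidden_dim)
    x = torch.ones(sequence_len, 1, 1, requires_grad=True)
    out, _ = layer(x)          # out has shape (seq_len, 1, hidden_dim)
    out[-1].sum().backward()   # populates x.grad with shape (seq_len, 1, 1)
    return x.grad[0].abs().item()


grads = {}
for recurrent_cls in (nn.RNN, nn.LSTM):
    for sequence_len in (5, 20, 100):
        grads[(recurrent_cls.__name__, sequence_len)] = first_input_grad(recurrent_cls, sequence_len)

for (name, sequence_len), value in grads.items():
    print(f"{name:4s} len={sequence_len:3d} |d out_last / d x_0| = {value:.3e}")
```

If the printed values shrink quickly as the length grows, the first element has little influence on the initial gradient signal, which would be consistent with the stalled logits seen for `sequence_len=100`.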


In [234]:
# 4. Write a proper curriculum learning loop, where in a loop you consider longer and longer sequences, where expansion of the sequence length happens only after the model is trained successfully on the current length.
import numpy as np

def score_on_hard_examples(model, sequence_len):
    with torch.no_grad():
        (z, _), (o, _) = generate_hard_examples(sequence_len)
        logits = [model(torch.tensor(seq).view(-1, 1, 1)) for seq in [z, o]]
        return (int(logits[0].item() < 0) + int(logits[1].item() >= 0)) / 2

def train_model_curriculum(model_cl, hidden_dim, lr=0.001, num_steps=64, batch_size=128, sequence_lens=(SEQUENCE_LEN,)):
    model = model_cl(hidden_dim=hidden_dim)
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    for sequence_len in sequence_lens:
      print("Evaluating for sequence_len: ", sequence_len)
      scores = [0]
      step = 0
      while step < num_steps and np.mean(scores[-100:]) < 0.9:
        if step % (num_steps // 3) == 0:
            eval_on_hard_examples(model, sequence_len)

        for sequence, label in get_train_examples(batch_size, sequence_len):
          model.zero_grad()
          logit = model(torch.tensor(sequence).view(-1, 1, 1))  
          
          loss = loss_function(logit.view(-1), torch.tensor([label], dtype=torch.float32))
          loss.backward()

          optimizer.step() 
          scores.append(score_on_hard_examples(model, sequence_len))
          
        step += 1
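      # Stop the curriculum if the current length was not learned within num_steps
      # (checking the score avoids treating a success on the very last step as a failure)
      if np.mean(scores[-100:]) < 0.9:
        break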

      print("Successfully trained for sequence len ", sequence_len)

    return model

In [237]:
model = train_model_curriculum(model_cl=Model, hidden_dim=20, lr=0.0001, sequence_lens=[10, 15, 20, 40, 100, 200, 1000, 2000])

Evaluating for sequence_len:  10
Logits for hard examples=[tensor([0.1304]), tensor([0.1304])]
Logits for hard examples=[tensor([0.0579]), tensor([0.0611])]
Successfully trained for sequence len  10
Evaluating for sequence_len:  15
Logits for hard examples=[tensor([0.0653]), tensor([0.1337])]
Successfully trained for sequence len  15
Evaluating for sequence_len:  20
Logits for hard examples=[tensor([-0.7871]), tensor([0.2694])]
Successfully trained for sequence len  20
Evaluating for sequence_len:  40
Logits for hard examples=[tensor([-0.6842]), tensor([0.2893])]
Successfully trained for sequence len  40
Evaluating for sequence_len:  100
Logits for hard examples=[tensor([-0.7157]), tensor([0.2920])]
Successfully trained for sequence len  100
Evaluating for sequence_len:  200
Logits for hard examples=[tensor([-0.7349]), tensor([0.2837])]
Successfully trained for sequence len  200
Evaluating for sequence_len:  1000
Logits for hard examples=[tensor([-0.7447]), tensor([0.2766])]
Successful

In [238]:
model = train_model_curriculum(model_cl=ModelRNN, hidden_dim=20, lr=0.0001, sequence_lens=[10, 15, 20, 40, 100, 200, 1000, 2000])

Evaluating for sequence_len:  10
Logits for hard examples=[tensor([0.0689]), tensor([0.0689])]
Successfully trained for sequence len  10
Evaluating for sequence_len:  15
Logits for hard examples=[tensor([-0.2744]), tensor([-0.0728])]
Successfully trained for sequence len  15
Evaluating for sequence_len:  20
Logits for hard examples=[tensor([-0.8226]), tensor([-0.1406])]
Successfully trained for sequence len  20
Evaluating for sequence_len:  40
Logits for hard examples=[tensor([-0.9403]), tensor([-0.9395])]
Successfully trained for sequence len  40
Evaluating for sequence_len:  100
Logits for hard examples=[tensor([-0.1743]), tensor([-0.1743])]
Logits for hard examples=[tensor([0.1553]), tensor([0.1553])]
Logits for hard examples=[tensor([0.0336]), tensor([0.0336])]
Logits for hard examples=[tensor([0.0048]), tensor([0.0048])]
