# RNN

![rnn image](https://image.slidesharecdn.com/dlcvd2l6recurrentneuralnetworks-160802094750/95/deep-learning-for-computer-vision-recurrent-neural-networks-upc-2016-16-638.jpg?cb=1470131837)

$\boldsymbol{h}_t = \tanh(\boldsymbol{W}\boldsymbol{x}_t + \boldsymbol{U}\boldsymbol{h}_{t-1} + \boldsymbol{b})$
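
As a minimal sketch of the recurrence above, the following applies a single update step by hand with plain tensors. The sizes (batch of 2, 3 input features, 4 hidden units) and the variable names are arbitrary choices for illustration, not values taken from the lab.

```python
import torch

torch.manual_seed(0)
batch_size, num_feats, rnn_size = 2, 3, 4  # illustrative sizes only

W = torch.randn(rnn_size, num_feats)  # input-to-hidden weights
U = torch.randn(rnn_size, rnn_size)   # hidden-to-hidden (recurrent) weights
b = torch.zeros(rnn_size)             # bias

x_t = torch.randn(batch_size, num_feats)
h_prev = torch.zeros(batch_size, rnn_size)  # h_{t-1}, all zeros at the first step

# h_t = tanh(W x_t + U h_{t-1} + b), written for batches of row vectors
h_t = torch.tanh(x_t @ W.T + h_prev @ U.T + b)
print(h_t.shape)  # torch.Size([2, 4])
```

Because of the `tanh`, every entry of `h_t` stays within [-1, 1], which keeps the state bounded as the step is applied repeatedly over time.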

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt

torch.manual_seed(1)
device = 'cpu'
if torch.cuda.is_available():
  device = 'cuda'
  

### Exercise 1: Building a recurrent neural layer

In the next cell we define our own unidirectional RNN layer. The class `MyUnidirectionalRNN` must use `nn.Linear` layers for the feed-forward (input) and recurrent (time) projections, and the `nn.Parameter` class for the bias. Complete the layer by adding the recurrent connections.


In [None]:
class MyUnidirectionalRNN(nn.Module):

  def __init__(self, num_inputs, rnn_size=128):
    super().__init__()

    # Linear layers
    # Define the input activation matrix W
    self.W = nn.Linear(num_inputs, rnn_size, bias=False)
    # TODO: Define the hidden activation matrix U
    self.U = nn.Linear(rnn_size, rnn_size, bias=False)
    
    self.rnn_size = rnn_size
    # Define the bias
    self.b = nn.Parameter(torch.zeros(1, rnn_size))

  def forward(self, x, state=None):
    # Assuming x is of shape [batch_size, seq_len, num_feats]
    seq_length = x.shape[1]
    xs = torch.chunk(x, seq_length, dim=1)
    hts = []
    if state is None:
      state = self.init_state(x.shape[0])
    for xt in xs:
      # turn x[t] into shape [batch_size, num_feats] to be projected
      xt = xt.squeeze(1)
      ct = self.W(xt)
      ct = ct + self.U(state)
      # apply the tanh non-linearity, as in the equation for h[t] above
      state = torch.tanh(ct + self.b)
      # give the temporal dimension back to h[t] so the steps can be concatenated
      hts.append(state.unsqueeze(1))
    hts = torch.cat(hts, dim=1)
    return hts

  def init_state(self, batch_size):
    # create the initial state on the same device as the layer parameters
    return torch.zeros(batch_size, self.rnn_size, device=self.b.device)

# To correctly assess the answer, we build an example RNN with 10 inputs and 32 neurons
rnn = MyUnidirectionalRNN(10, 32)
# Then we will forward some random sequences, each of length 15
xt = torch.randn(5, 15, 10)
# The returned tensor will be h[t]
ht = rnn(xt)
assert ht.shape[0] == 5 and ht.shape[1] == 15 and ht.shape[2] == 32, \
'Something went wrong within the RNN :('
print('Success! Output shape: {} sequences, each of length {}, each '\
      'token with {} dims'.format(ht.shape[0], ht.shape[1], ht.shape[2]))

Success! Output shape: 5 sequences, each of length 15, each token with 32 dims


In [None]:
x = torch.tensor([1, 2, 3, 4])
torch.chunk(x, 4, dim=0)

(tensor([1]), tensor([2]), tensor([3]), tensor([4]))
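
The same call on a 3-D tensor shows why the `forward` method needs `squeeze(1)`: each chunk keeps a time dimension of size 1. The sketch below uses a small made-up tensor and compares `torch.chunk` with `torch.unbind`, which drops the split dimension directly; it is only an illustration, not part of the exercise.

```python
import torch

# toy tensor shaped [batch_size=2, seq_len=3, num_feats=4]
x = torch.arange(24).reshape(2, 3, 4)

# chunk keeps the split dimension, with size 1 in every piece
chunks = torch.chunk(x, x.shape[1], dim=1)
print(len(chunks), chunks[0].shape)  # 3 torch.Size([2, 1, 4])

# unbind removes the dimension it splits along
steps = torch.unbind(x, dim=1)
print(len(steps), steps[0].shape)  # 3 torch.Size([2, 4])
```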

# PyTorch RNN

In [None]:
num_inputs = 10
seq_length = 25
batch_size = 5
hidden_size = 128

rnn1 = nn.RNN(num_inputs, hidden_size)
rnn1

RNN(10, 128)

In [None]:
xt = torch.randn(seq_length, batch_size, num_inputs)

In [None]:
ht, state = rnn1(xt)

In [None]:
print(f"ht: {ht.shape}")
print(f"state: {state.shape}")

ht: torch.Size([25, 5, 128])
state: torch.Size([1, 5, 128])


#### OK STOP IT HERE, We've got to talk

Think about how many things happen in the previous cells. First, we define some hyper-parameters that set the input tensor shape and the RNN size. Then we build one RNN layer and some random data. Finally, we forward the random data through the layer. What is returned? Why does the input tensor `xt` have that shape? Why does the RNN return 2 output values?

**First answer:** The input data to an RNN can be shaped in 2 formats: `batch_first=True` and `batch_first=False`. As its name indicates, when it is `False`, the `batch_size` dimension is not the first but the second one. Then which is the first one? The `sequence_length`. If we do not specify anything, by default `batch_first=False`, so the tensor $\boldsymbol{x}_t$ must have the dimensions: [`seq_len`, `batch_size`, `num_feats`]. We normally use `batch_first=True` to couple the RNN easily with other layers like the `nn.Linear` one.
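
To make the two layouts concrete, here is a small sketch that builds the same kind of layer with `batch_first=True`. The names `rnn_bf`, `x_bf`, `out_bf` and `state_bf` are made up for this example; the sizes repeat the hyper-parameters used above.

```python
import torch
import torch.nn as nn

num_inputs, hidden_size = 10, 128
seq_length, batch_size = 25, 5

# batch_first=True expects [batch_size, seq_len, num_feats]
rnn_bf = nn.RNN(num_inputs, hidden_size, batch_first=True)
x_bf = torch.randn(batch_size, seq_length, num_inputs)

out_bf, state_bf = rnn_bf(x_bf)
print(out_bf.shape)    # torch.Size([5, 25, 128])
print(state_bf.shape)  # torch.Size([1, 5, 128])
```

Note that only the input and the per-step output change layout; the returned state keeps the batch in its second dimension in both cases.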

### Exercise 2

Find the second answer, to "**Why is the RNN returning 2 output values?**". Work out what the `state` output is and answer: "**what does it contain?**". Your source of knowledge is the following URL, which describes the outputs of the `RNN` module: https://pytorch.org/docs/stable/nn.html#torch.nn.RNN


In [None]:
torch.all(ht[-1] == state)


tensor(True)

In [None]:
rnn2 = nn.RNN(num_inputs, hidden_size, num_layers=2)
rnn2

RNN(10, 128, num_layers=2)

In [None]:
output, hn = rnn2(xt)
print(f"output: {output.shape}")
print(f"hn: {hn.shape}")

output: torch.Size([25, 5, 128])
hn: torch.Size([2, 5, 128])


### Exercise 3.1
Build a **single bidirectional RNN layer** by completing the TODO in the code.



In [None]:
# TODO: build the bidirectional RNN layer
bi_rnn = nn.RNN(num_inputs, hidden_size, num_layers=1, bidirectional=True) 

# forward xt (sequence-first layout: [seq_len, batch_size, num_feats])
bi_ht, bi_state = bi_rnn(xt)
print('Bidirectional RNN layer >> bi_ht.shape: ', bi_ht.shape)
print('Bidirectional RNN layer >> bi_state.shape: ', bi_state.shape)

Bidirectional RNN layer >> bi_ht.shape:  torch.Size([25, 5, 256])
Bidirectional RNN layer >> bi_state.shape:  torch.Size([2, 5, 128])


### Exercise 3.2
What is the output $\boldsymbol{h}_t$ shape and why?

### Exercise 3.3
What is the output `state` shape and why?
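
One way to check your answers to 3.2 and 3.3 empirically is to slice the output along its last dimension and compare the pieces with the returned state. The sketch below rebuilds the layer so that it runs on its own; the names `fwd` and `bwd` are introduced here for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
num_inputs, hidden_size = 10, 128
seq_length, batch_size = 25, 5

bi_rnn = nn.RNN(num_inputs, hidden_size, num_layers=1, bidirectional=True)
xt = torch.randn(seq_length, batch_size, num_inputs)
bi_ht, bi_state = bi_rnn(xt)

# the last dimension holds the forward direction first, then the backward one
fwd = bi_ht[..., :hidden_size]
bwd = bi_ht[..., hidden_size:]

# the forward direction finishes at the last time step,
# the backward direction finishes at the first one
print(torch.allclose(fwd[-1], bi_state[0]))
print(torch.allclose(bwd[0], bi_state[1]))
```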

![lstm](http://dprogrammer.org/wp-content/uploads/2019/04/RNN-vs-LSTM-vs-GRU-1200x361.png)

### Exercise 4: An LSTM Character-based Language Model

In this final exercise we will train a language model that works at the character level, that is, a neural network based on an RNN architecture that completes text sequences.


In [None]:
!wget https://raw.githubusercontent.com/telecombcn-dl/dlai-2019/master/labs/episode1_english.txt

--2020-06-02 19:31:24--  https://raw.githubusercontent.com/telecombcn-dl/dlai-2019/master/labs/episode1_english.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23466 (23K) [text/plain]
Saving to: ‘episode1_english.txt’


2020-06-02 19:31:24 (6.07 MB/s) - ‘episode1_english.txt’ saved [23466/23466]



In [None]:
!cat episode1_english.txt

Monica: There's nothing to tell! He's just some guy I work with!

Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!

Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?

Phoebe: Wait, does he eat chalk?

(They all stare, bemused.)

Phoebe: Just, 'cause, I don't want her to go through what I went through with Carl- oh!

Monica: Okay, everybody relax. This is not even a date. It's just two people going out to dinner and- not having sex.

Chandler: Sounds like a date to me.

[Time Lapse]

Chandler: Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.

All: Oh, yeah. Had that dream.

Chandler: Then I look down, and I realize there's a phone... there.

Joey: Instead of...?

Chandler: That's right.

Joey: Never had that dream.

Phoebe: No.

Chandler: All of a sudden, the phone starts to ring. Now I don't know what to do, everybody starts looking at me.

Monica: 

In [None]:
def prepare_sequence(seq, char2idx, onehot=True):
    # convert the sequence of characters to vocabulary indices
    idxs = [char2idx[c] for c in seq]
    idxs = torch.tensor(idxs, dtype=torch.long)
    if onehot:
      # convert to one-hot vectors (used as input to the network)
      ohs = F.one_hot(idxs, len(char2idx)).float()
      return ohs
    else:
      return idxs
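
As a quick illustration of the two return formats, the sketch below repeats the same conversion steps on a toy three-character vocabulary. `toy_char2idx` is a made-up mapping for this example, not the vocabulary built later in the lab.

```python
import torch
import torch.nn.functional as F

toy_char2idx = {'a': 0, 'b': 1, 'c': 2}  # hypothetical toy vocabulary

# index form: what the loss function needs as targets
idxs = torch.tensor([toy_char2idx[c] for c in "cab"], dtype=torch.long)
print(idxs)  # tensor([2, 0, 1])

# one-hot form: what the network receives as input, one row per character
ohs = F.one_hot(idxs, len(toy_char2idx)).float()
print(ohs.shape)  # torch.Size([3, 3])
```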

In [None]:
with open('episode1_english.txt', 'r') as txt_f:
  training_data = [l.rstrip() for l in txt_f if l.rstrip() != '']

# merge the training data into one big text line
training_data = '$'.join(training_data)

In [None]:
training_data

'Monica: There\'s nothing to tell! He\'s just some guy I work with!$Joey: C\'mon, you\'re going out with the guy! There\'s gotta be something wrong with him!$Chandler: All right Joey, be nice.  So does he have a hump? A hump and a hairpiece?$Phoebe: Wait, does he eat chalk?$(They all stare, bemused.)$Phoebe: Just, \'cause, I don\'t want her to go through what I went through with Carl- oh!$Monica: Okay, everybody relax. This is not even a date. It\'s just two people going out to dinner and- not having sex.$Chandler: Sounds like a date to me.$[Time Lapse]$Chandler: Alright, so I\'m back in high school, I\'m standing in the middle of the cafeteria, and I realize I am totally naked.$All: Oh, yeah. Had that dream.$Chandler: Then I look down, and I realize there\'s a phone... there.$Joey: Instead of...?$Chandler: That\'s right.$Joey: Never had that dream.$Phoebe: No.$Chandler: All of a sudden, the phone starts to ring. Now I don\'t know what to do, everybody starts looking at me.$Monica: And

In [None]:
char2idx = {}
for c in training_data:
    if c not in char2idx:
        char2idx[c] = len(char2idx)
        

In [None]:
char2idx

{' ': 7,
 '!': 17,
 '"': 57,
 '$': 26,
 "'": 12,
 '(': 40,
 ')': 41,
 ',': 29,
 '-': 42,
 '.': 33,
 '0': 56,
 '2': 64,
 '3': 55,
 '6': 61,
 ':': 6,
 ';': 63,
 '?': 37,
 'A': 32,
 'B': 53,
 'C': 28,
 'D': 54,
 'E': 66,
 'F': 51,
 'G': 59,
 'H': 18,
 'I': 23,
 'J': 27,
 'L': 46,
 'M': 0,
 'N': 50,
 'O': 43,
 'P': 38,
 'R': 52,
 'S': 34,
 'T': 8,
 'U': 65,
 'V': 60,
 'W': 39,
 'Y': 58,
 '[': 45,
 ']': 47,
 'a': 5,
 'b': 30,
 'c': 4,
 'd': 31,
 'e': 10,
 'f': 48,
 'g': 15,
 'h': 9,
 'i': 3,
 'j': 19,
 'k': 25,
 'l': 16,
 'm': 21,
 'n': 2,
 'o': 1,
 'p': 36,
 'q': 62,
 'r': 11,
 's': 13,
 't': 14,
 'u': 20,
 'v': 35,
 'w': 24,
 'x': 44,
 'y': 22,
 'z': 49}

In [None]:
idx2char = {v: k for k, v in char2idx.items()}

In [None]:
VOCAB_SIZE = len(char2idx)
RNN_SIZE = 1024
MLP_SIZE = 2048
SEQ_LEN = 50


##### Exercise 4.1
* How many outputs does the character prediction model need?

##### Exercise 4.2
* What is the proper activation to plug on top of the MLP (if any)? (Note that we use `NLLLoss` later on).
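
As a hint on how `NLLLoss` relates to the activation that precedes it, the following sketch compares it with `CrossEntropyLoss` on random scores. The tensor sizes are arbitrary and the names are chosen only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 67                       # illustrative number of classes
logits = torch.randn(6, num_classes)   # raw scores for 6 positions
targets = torch.randint(0, num_classes, (6,))

# NLLLoss expects log-probabilities as input
log_probs = nn.LogSoftmax(dim=-1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss takes the raw scores and applies log-softmax internally
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))  # True
```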

##### Exercise 4.3
* Finish the definition of the `CharLSTM` model to include an `nn.LSTM` layer, with `batch_first=True`, `vocab_size` inputs and `rnn_size` cells, and an MLP that projects from `rnn_size` to `mlp_size` with one `ReLU` hidden layer and then to the appropriate number of outputs. Put a `Dropout(0.4)` after the `ReLU`.


In [None]:
class CharLSTM(nn.Module):

    def __init__(self, vocab_size, rnn_size, mlp_size):
        super().__init__()
        self.rnn_size = rnn_size

        # TODO:
        # use the constructor arguments rather than the global constants
        self.lstm = nn.LSTM(vocab_size, rnn_size, batch_first=True)

        self.dout = nn.Dropout(0.4)

        # TODOs:
        # An MLP with a hidden layer of mlp_size neurons that maps from the RNN 
        # hidden state space to the output space of vocab_size
        self.mlp = nn.Sequential(
          nn.Linear(rnn_size, mlp_size), # Linear layer
          nn.ReLU(), # Activation function
          nn.Dropout(0.4),
          nn.Linear(mlp_size, vocab_size), # Linear layer
          nn.LogSoftmax(dim=-1) # Output layer: log-probabilities over the vocabulary
        )

    def forward(self, sentence, state=None):
        bsz, slen, vocab = sentence.shape
        ht, state = self.lstm(sentence, state)
        ht = self.dout(ht)
        h = ht.contiguous().view(-1, self.rnn_size)
        logprob = self.mlp(h)
        return logprob, state

##### Exercise 4.4

What is the length of the sliding window that will run over each of the training sub-sequences? NOTE: it is defined as a hyper-parameter above. How is this related to the backpropagation through time (BPTT)?

In [None]:
BATCH_SIZE = 64
T = len(training_data)
CHUNK_SIZE = T // BATCH_SIZE
# let's first chunk the huge train sequence into BATCH_SIZE sub-sequences
trainset = [training_data[beg_i:end_i] \
            for beg_i, end_i in zip(range(0, T - CHUNK_SIZE, CHUNK_SIZE),
                                    range(CHUNK_SIZE, T, CHUNK_SIZE))]
print('Original training string len: ', T)
print('Sub-sequences len: ', CHUNK_SIZE)

Original training string len:  23149
Sub-sequences len:  361
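
To see how the sliding window will walk over each sub-sequence, the sketch below reproduces the window indices used by the training loop further down, using the sub-sequence length printed above. The short `text` string and the 6-character chunk are only there to show the shift of one character between inputs and targets.

```python
CHUNK_SIZE = 361  # sub-sequence length printed above
SEQ_LEN = 50

# same ranges as in the training loop: windows of SEQ_LEN + 1 characters
windows = list(zip(range(0, CHUNK_SIZE - SEQ_LEN - 1, SEQ_LEN + 1),
                   range(SEQ_LEN + 1, CHUNK_SIZE, SEQ_LEN + 1)))
print(len(windows))             # 7
print(windows[0], windows[-1])  # (0, 51) (306, 357)

# inputs and targets are the same chunk shifted by one character
text = "Monica: There's nothing to tell!"
chunk = text[0:6]
X, Y = chunk[:-1], chunk[1:]
print(X, Y)  # Monic onica
```

Each window holds `SEQ_LEN + 1` characters so that both `X` and `Y` contain exactly `SEQ_LEN` of them.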


In [None]:
# Let's build an example model and see what the scores are before training
model = CharLSTM(VOCAB_SIZE, RNN_SIZE, MLP_SIZE)
# Untrained, the model should output gibberish: essentially one arbitrary character repeated

def gen_text(model, seed, char2idx, num_chars=150):
  model.eval()
  # Here we don't need to train, so the code is wrapped in torch.no_grad()
  with torch.no_grad():
      inputs = prepare_sequence(seed, char2idx)
      # fill the RNN memory with the seed sentence
      seed_pred, state = model(inputs.unsqueeze(0))
      # now begin looping with feedback char by char from the last prediction
      preds = seed
      curr_pred = torch.topk(seed_pred[-1, :], k=1, dim=0)[1]
      curr_pred = idx2char[curr_pred.item()]
      preds += curr_pred
      for t in range(num_chars):
        curr_pred, state = model(prepare_sequence(curr_pred, char2idx).unsqueeze(0), state)
        curr_pred = torch.topk(curr_pred[-1, :], k=1, dim=0)[1]
        curr_pred = idx2char[curr_pred.item()]
        if curr_pred == '$':
          # special token to add newline char
          preds += '\n'
        else:
          preds += curr_pred
      return preds

In [None]:
gen_text(model, "Monica", char2idx)


'MonicamJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ'

In [None]:
from timeit import default_timer as timer


# Let's now build a model to train with its optimizer and loss
model = CharLSTM(VOCAB_SIZE, RNN_SIZE, MLP_SIZE)
model.to(device)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
NUM_EPOCHS = 5000
tr_loss = []
state = None
timer_beg = timer()
for epoch in range(NUM_EPOCHS):
  model.train()
  # let's slide over our dataset
  for beg_t, end_t in zip(range(0, CHUNK_SIZE - SEQ_LEN - 1, SEQ_LEN + 1),
                          range(SEQ_LEN + 1, CHUNK_SIZE, SEQ_LEN + 1)):
    # Step 1. Remember that Pytorch accumulates gradients.
    # We need to clear them out before each instance
    optimizer.zero_grad()

    dataX = []
    dataY = []
    # Step 2. Get our inputs ready for the network, that is, turn them into
    # Tensors of one-hot sequences. 
    for sent in trainset:
      # chunk the sentence
      chunk = sent[beg_t:end_t]
      # get X and Y with a shift of 1
      X = chunk[:-1]
      Y = chunk[1:]
      # convert each sequence to one-hots and labels respectively
      X = prepare_sequence(X, char2idx)
      Y = prepare_sequence(Y, char2idx, onehot=False)
      dataX.append(X.unsqueeze(0)) # create batch-dim
      dataY.append(Y.unsqueeze(0)) # create batch-dim
    dataX = torch.cat(dataX, dim=0).to(device)
    dataY = torch.cat(dataY, dim=0).to(device)

    # Step 3. Run our forward pass.
    # Forward through model and carry the previous state forward in time (statefulness)
    y_, state = model(dataX, state)
    # detach the previous state graph to not backprop gradients further than the BPTT span
    # nn.LSTM returns its state as the tuple (h, c)
    state = (state[0].detach(), # detach h[t]
             state[1].detach()) # detach c[t]

    # Step 4. Compute the loss, gradients, and update the parameters by
    #  calling optimizer.step()
    loss = loss_function(y_, dataY.view(-1))
    loss.backward()
    optimizer.step()
    tr_loss.append(loss.item())
  timer_end = timer()  
  if (epoch + 1) % 50 == 0:
    # Generate a seed sentence to play around
    model.to('cpu')
    print('-' * 30) 
    print(gen_text(model, 'They ', char2idx))
    print('-' * 30)
    model.to(device)
    print('Finished epoch {} in {:.1f} s: loss: {:.6f}'.format(epoch + 1, 
                                                               timer_end - timer_beg,
                                                               np.mean(tr_loss[-10:])))
  timer_beg = timer()

plt.plot(tr_loss)
# tr_loss holds one value per training step (window), not one per epoch
plt.xlabel('Training step')
plt.ylabel('NLLLoss')


In [None]:
gen_text(model, "Oriol ", char2idx)