# GRUs explained with matrices: Part 2 Training and Loss Function

In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import numpy as np
import itertools
import pickle

%autosave 180

Autosaving every 180 seconds


### Recap:
In part one of this tutorial series, we demonstrated the matrix operations used to estimate the hidden states and outputs for the forward pass of a GRU. Based on our poor results, we obvioulsy need to optimize our algorithm and test it on a test set to ensure generalizability. This is typically done using several steps/techniques. In this tutorial we will walkthrough what happens under the hood during optimization, specifically calculating the loss function and performing backpropagation through time to update the weights over several epochs. 

### Input text

In [3]:
# This will be our input ---> x
text = 'MathMathMathMathMath'

### Integer representation of inputs

In [4]:
character_list = list(set(text))   # get all of the unique letters in our text variable
vocabulary_size = len(character_list)   # count the number of unique elements
character_dictionary = {'h': 0, 'a': 1, 't': 2, 'M': 3}
# {char:e for e, char in enumerate(character_list)}  # create a dictionary mapping each unique char to a number
encoded_chars = [character_dictionary[char] for char in text] #integer representation of our vocabulary 

### One hot encode 

In [5]:
def one_hot_encode(encoded,vocab_size):
    result = torch.zeros((len(encoded), vocab_size))
    for i, idx in enumerate(encoded):
        result[i, idx] = 1.0
    return result

### Training data

In [6]:
# One hot encode our encoded charactes
batch_size = 2
seq_length = 3
num_samples = (len(encoded_chars) - 1) // seq_length # time lag of 1 for creating the labels
vocab_size = 4

data = one_hot_encode(encoded_chars[:seq_length*num_samples], vocab_size).reshape((num_samples, seq_length, vocab_size))
num_batches = len(data) // batch_size
X = data[:num_batches*batch_size].reshape((num_batches, batch_size, seq_length, vocab_size))
# swap batch_size and seq_length axis to make later access easier
X = X.transpose(1, 2)

### Label encoding

In [7]:
# +1 shift the labels by one so that given the previous letter the char we should predict would be or next char
labels = one_hot_encode(encoded_chars[1:seq_length*num_samples+1],vocab_size=vocab_size) 
y = labels.reshape((num_batches, batch_size, seq_length, vocab_size))
y = y.transpose(1, 2) # transpose the first and second index

### Intitialize weight matrices and bias vectors

In [8]:
torch.manual_seed(1) # reproducibility

####  Define the network parameters:
hiddenSize = 2 # network size, this can be any number (depending on your task)
numClass = 4 # this is the same as our vocab_size

#### Weight matrices for our inputs 
Wz = Variable(torch.randn(vocab_size, hiddenSize), requires_grad=True)
Wr = Variable(torch.randn(vocab_size, hiddenSize), requires_grad=True)
Wh = Variable(torch.randn(vocab_size, hiddenSize), requires_grad=True) 

#### Weight matrices for our hidden layer
Uz = Variable(torch.randn(hiddenSize, hiddenSize), requires_grad=True)
Ur = Variable(torch.randn(hiddenSize, hiddenSize), requires_grad=True)
Uh = Variable(torch.randn(hiddenSize, hiddenSize), requires_grad=True)

#### bias vectors for our hidden layer
bz = Variable(torch.zeros(hiddenSize), requires_grad=True)
br = Variable(torch.zeros(hiddenSize), requires_grad=True)
bh = Variable(torch.zeros(hiddenSize), requires_grad=True)

#### Output weights
Wy = Variable(torch.randn(hiddenSize, numClass), requires_grad=True)
by = Variable(torch.zeros(numClass), requires_grad=True)

### Define network

In [9]:
def gru(x, h):
    outputs = []
    for sequence in x: # iterates over the sequences in each batch
        z = torch.sigmoid(torch.matmul(sequence, Wz) + torch.matmul(h, Uz) + bz)
        r = torch.sigmoid(torch.matmul(sequence, Wr) + torch.matmul(h, Ur) + br)
        h_tilde = torch.tanh(torch.matmul(sequence, Wh) + torch.matmul(r * h, Uh) + bh)
        h = z * h + (1 - z) * h_tilde

        #Linear layer
        y_linear = torch.matmul(h, Wy) + by
        
        # Softmax activation function
        y_t = F.softmax(y_linear, dim=1)
        outputs.append(y_t)
    return torch.stack(outputs), h
    

### Sample to generate text

In [10]:
def sample(primer, length_chars_predict):
    
    word = primer

    primer_dictionary = [character_dictionary[char] for char in word]
    test_input = one_hot_encode(primer_dictionary, vocab_size)
    

    h = torch.zeros(1, hiddenSize)

    for i in range(length_chars_predict):
        outputs, h = gru(test_input, h)
        choice = np.random.choice(vocab_size, p=outputs[-1][0].detach().numpy())
        word += character_list[choice]
        input_sequence = one_hot_encode([choice],vocab_size)
    return word

### Training loop

In [11]:
max_epochs = 5  # passes through the data
for e in range(max_epochs):
    h = torch.zeros(batch_size, hiddenSize)
    for i in range(num_batches):
        x_in = X[i]
        y_in = y[i]

        out, h = gru(x_in, h)
        print(sample('Ma',20))

MataMttataatthtthaMhMa
MattatttttattatatataMt
MattMtMttttthtMtatMtth
MaMtttttttttatttttatht
MattattMattttMtttthtMt
MattataahMtataatttttMt
MahMtMMtttMttMMaMMthaM
MatttMttttttMttttattah
MahMtMMhhatMaMattttaMt
MaMMtMMhtttththhMtthth
MaMthttMthhttaathtMMta
MaMMaaaahtMhhttatthtMt
MaataMtthMtMththtMMatt
MahhhtthMMthtMtatMtttt
MahtMhttttaattMttttMaa


### What's happening here?
As we pointed out in the first tutorial the first couple of strings generated are a bit erratic, but after a few passes it seems to get at least the next two characters correct. However, in order to measure how inconsistent our predictions are versus the true labels, we need a metric. This metric is call the loss function, that measures how well the model is performing. It is a positive value that decreases as the network becomes more confident with it's predictions. This loss function for multiclass classification problems is defined as:

<h1><center>Cross Entropy = $-\frac{1}{N}\sum_{j} {y_j * log(\hat{y_j}})$<br>


Recall, our calculated hidden states and predicted outputs for the first batch? This picture seems a bit busy, however the goal here is to visualize what you outputs and hidden states actually look like under the hood. The predictions are probabilities which were calculated using the Softmax activation function.

<img src="img/hidden_image.png" style="height:500px">

Let's re-run the training loop storing the outputs (  $\hat{y}$) and hidden states ($h_{(t-1)}, h_t, and,  h_{(t+1)} )$ for each sequence in batch 1.

### Illustration in code:
To understand what is happening you will notice that we work from the inside out, before moving to functions. Here, we are grabbing the outputs and hidden states calculated with just two loops.

In [12]:
ht_2 = [] # stores the calculated h for each input x
outputs = []
h = torch.zeros(batch_size, hiddenSize) # intitalizes the hidden state
for i in range(num_batches):  # this loops over the batches 
    x = X[i]
    for i,sequence in enumerate(x): # iterates over the sequences in each batch
        z = torch.sigmoid(torch.matmul(sequence, Wz) + torch.matmul(h, Uz) + bz)
        r = torch.sigmoid(torch.matmul(sequence, Wr) + torch.matmul(h, Ur) + br)
        h_tilde = torch.tanh(torch.matmul(sequence, Wh) + torch.matmul(r * h, Uh) + bh)
        h = z * h + (1 - z) * h_tilde
        
        # Linear layer
        y_linear = torch.matmul(h, Wy) + by
        
        # Softmax activation function
        y_t = F.softmax(y_linear, dim=1)
        
        ht_2.append(h)
        outputs.append(y_t)
        
ht_2 = torch.stack(ht_2)
outputs = torch.stack(outputs)

The cross entropy loss is first calculated for each sequence in the batch then averaged over all sequences. So, in this example we will calculate the cross entropy loss for each sequence from scratch. But first, let's grab the predictions made on the first batch. To do this we will grab the for element ( index 0) from our ht_2 and ouputs variables.

In [13]:
hidden_batch_1 = ht_2[:3]
outputs_batch_1 = outputs[:3]
print(f' Predictions for the first batch: \n\n{outputs_batch_1}, \
      \n \n Hidden states for the first bactch: \n{hidden_batch_1}')

 Predictions for the first batch: 

tensor([[[0.4342, 0.1669, 0.1735, 0.2254],
         [0.2207, 0.2352, 0.3322, 0.2119]],

        [[0.2045, 0.1916, 0.4443, 0.1596],
         [0.4384, 0.1563, 0.1995, 0.2058]],

        [[0.4261, 0.1340, 0.2763, 0.1636],
         [0.1819, 0.1798, 0.4972, 0.1411]]], grad_fn=<SliceBackward>),       
 
 Hidden states for the first bactch: 
tensor([[[ 0.7565, -0.3472],
         [-0.1355, -0.2040]],

        [[-0.1535, -0.5712],
         [ 0.7664, -0.5062]],

        [[ 0.7495, -0.8616],
         [-0.2399, -0.6680]]], grad_fn=<SliceBackward>)


### How well did we perform?
Let's look at our lables for batch 1

By looking at the output probabilities we can tell that we did not do so well. However, let's quantify it using the cross entropy equation! Here we will work our way from the inner term out on the first sequence in the batch. Note, the code will included all 3 sequences in batch 1.

<img src="img/cross_en.png" style="height:300px">

**First term:** Element-wise multiplication of the true labels with the log of the predicted labels

<img src="img/cross_t1.png" style="height:400px">

In [15]:
y[0] * torch.log(outputs_batch_1)

tensor([[[-0.0000, -1.7905, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, -1.5516]],

        [[-0.0000, -0.0000, -0.8114, -0.0000],
         [-0.0000, -1.8560, -0.0000, -0.0000]],

        [[-0.8532, -0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.6987, -0.0000]]], grad_fn=<ThMulBackward>)

**Second term:** Summation of remaining values within each sequence. In this step, it is key to note that the axis will be reduced row-wise, only containing the non-zero terms. This will be done in a loop programatically.

<img src="img/cross_t2.png" style="height:300px">

### Implementation in code

In [16]:
ce_sums = []
for prediction, label in zip(outputs_batch_1, y[0]):
    ce_sum = torch.sum(label * torch.log(prediction),dim=1)
    ce_sums.append(ce_sum)
ce_sums = torch.stack(ce_sums)
ce_sums

tensor([[-1.7905, -1.5516],
        [-0.8114, -1.8560],
        [-0.8532, -0.6987]], grad_fn=<StackBackward>)

**Third term:** Mean of the reduced samples for first sequence within the batch tow-wise. This example calculation was done on the first sequence within batch 1. However, the code implementation covers all 3 sequences in batch 1.

<img src="img/cross_t3.png" style="height:300px">

In [17]:
ce_scores = []
for ce in ce_sums:
    ce = -torch.mean(ce_sums, dim=1)
    ce_scores.append(ce)
ce

tensor([1.6710, 1.3337, 0.7760], grad_fn=<NegBackward>)

### Averaging the cross entropy losses of each sequence  within batch 1:
Note, in practice this step will be done over each mini-batch by keeping a running average of the losses for each batch. It essentially sums up what we calculated for the cross entropy (loss for each sequence in batch 1) and divides it by the number of sequences in the batch.

In [32]:
torch.mean(ce)

tensor(1.2602, grad_fn=<MeanBackward1>)

### How did we do?
A batch loss of 1.2602 is high and means that we have plenty room for improvement

In [19]:
def cross_entropy(yhat, y):
    return -torch.mean(torch.sum(y * torch.log(yhat), dim=1))

In [20]:
def total_loss(predictions, y_true):
    total_loss = 0.0
    for prediction, label in zip(predictions, y_true):
        cross = cross_entropy(prediction, label)
        total_loss += cross
    return total_loss/ len(predictions)   

In [21]:
params = [Wz, Wr, Wh, Uh, Uz, Ur, bz, br, bh, Wy, by] # iterable of parameters that require gradient computation

In [31]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(params, lr = 0.01)
max_epochs = 100  # passes through the data
for e in range(max_epochs):
    h = torch.zeros(batch_size, hiddenSize)
    for i in range(num_batches):
        x_in = X[i]
        y_in = y[i]
        
        optimizer.zero_grad() # zero out gradients 
        
        out, h = gru(x_in, h)
        loss = total_loss(out, y_in)
        loss.backward(retain_graph=True) # backpropagate through time to adjust the weights and find the gradients of the loss function
        optimizer.step()
    if e % 10 == 0:
        print(f'Epoch: {e+1}/{max_epochs}')
        print(f'Loss: {loss}')
        print(sample('Ma',10))

Epoch: 1/100
Loss: 0.06194562092423439
Matttttttttt
Epoch: 11/100
Loss: 0.06142784655094147
Matttttttttt
Epoch: 21/100
Loss: 0.06091505289077759
MatttttttttM
Epoch: 31/100
Loss: 0.060407400131225586
Matttttttttt
Epoch: 41/100
Loss: 0.05990474298596382
Matttttttttt
Epoch: 51/100
Loss: 0.05940718576312065
Matttttttttt
Epoch: 61/100
Loss: 0.058914750814437866
Matttttttttt
Epoch: 71/100
Loss: 0.05842741206288338
Matttttttttt
Epoch: 81/100
Loss: 0.05794508382678032
Matttttttttt
Epoch: 91/100
Loss: 0.057467926293611526
Matttttttttt


### Explanation:
So we optimized reduced our loss and we are not predicting well...why? Well, as mentioned in the first tutorial this is an extremely small dataset, when training on a neural net made from scratch. It is recommended that you do so with lots of data. However, the purpose of this tutorial is not to create the high performance neural net, but to demonstrate what goes on under the hood.

### Backpropagation:
The final step involves a backward pass through the algorithm. This step is called backpropagation, and it involves understanding the impact of adjusting the weights on the cost function. This is done by calculating the error vectors $\delta$ starting from the final layer backward by repeatedly applying the chain rule through each layer. For more detailed proof of back-prop through time: https://github.com/tianyic/LSTM-GRU/blob/master/MTwrtieup.pdf

### References:
1. The Unreasonable Effectiveness of Recurrent Neural Networks
2. Udacity Deep Learning with Pytorch
3. Fastai Deep Learning for Coders
4. Deep Learning - The Straight Dope (RNNs)
5. Deep Learning Book