In [None]:
# Install TensorFlow
# !pip install -q tensorflow-gpu==2.0.0-beta1

try:
  %tensorflow_version 2.x  # Colab only.
except Exception:
  pass

import tensorflow as tf
print(tf.__version__)

2.0.0-beta1


## Goal
In this lecture we are going to emphasize the importance of shapes in an RNN. Whenever you hear something like `NxTxD`, you should automatically visualize a box and its dimensions. This lecture is all about tracking the shapes in an RNN, and we are also going to go through the RNN calculation manually to reinforce our understanding of how an RNN works.

In [None]:
from tensorflow.keras.layers import Input, SimpleRNN, Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD, Adam

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

I have listed all the important size variables we have to pay attention to. These should be permanently stored in your memory. You should never have to ask what `M` means again.

Just to recap, `N` is the number of samples in your dataset. `T` is the sequence length. Remember that in TensorFlow, we assume constant-length sequences.

`D` is the input feature dimensionality. We've gone through many examples where `D` > 1. `M` is the number of hidden units. This is the same as in a regular feedforward ANN, so it's a hyperparameter you can choose.

Finally, `K` is the number of output nodes. As a side note, `K` > 1 does not automatically imply you are doing classification with a softmax. You can do multi-dimensional regression too, e.g., when you are trying to predict lat-long coordinates. In that scenario, `K` = 2, but it is still a regression problem.

In [None]:
# Things you should automatically know and have memorized
# N = number of samples
# T = sequence length
# D = number of input features
# M = number of hidden units
# K = number of output units

In [None]:
# Make some dummy data
N = 1 # 1 sample
T = 10 # sequence length = 10
D = 3 # feature dimensionality = 3
K = 2 # 2 output nodes
X = np.random.randn(N, T, D) # input X has the shape N x T x D

We create our model. Here we also set `M`, the number of hidden units, to 5. As usual, we start with an input layer whose shape is `TxD`, then we create a `SimpleRNN` layer with `M` hidden units. We keep the default activation, which is tanh. Finally, we create a dense layer with `K` output units. I'll assume we're doing regression, so there is no activation function on the output.

In [None]:
# Make an RNN
M = 5 # number of hidden units
i = Input(shape=(T, D))
x = SimpleRNN(M)(i)
x = Dense(K)(x)

model = Model(i, x)

Next we use our model to make a prediction. Obviously both our data and weights are random, so this prediction is not meaningful. These numbers are just for sanity checking. As you can see, the output shape is 1x2 as expected, i.e., we have 1 sample and 2 output nodes. Take note of these numbers, as this is what we want to compare with later on.

In [None]:
# Get the output
Yhat = model.predict(X)
print(Yhat)

[[-0.7062384   0.45167243]]


Next we print a model summary so that we can see all the layers of our RNN. As expected, we have three layers: the input layer, the `SimpleRNN` layer, and the dense layer. We don't yet know exactly which parameters are stored in a `SimpleRNN` layer, although we do know the mathematical equation that produces the output.

In [None]:
# See if we can replicate this output
# Get the weights first
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 10, 3)]           0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 5)                 45        
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 12        
=================================================================
Total params: 57
Trainable params: 57
Non-trainable params: 0
_________________________________________________________________


Let's check the weights and see what they are, so we have some idea of what's stored in the `SimpleRNN` layer. It looks like 3 big arrays, and it's actually more helpful to print out the shapes of these arrays. From there we can deduce which array corresponds to which weight in the `SimpleRNN`.

We get a 3x5 array, a 5x5 array, and a vector of length 5. Recall that `D`=3 and `M`=5, so that makes sense. The first weight is `DxM`, which means it's the input-to-hidden weight. The second weight is `MxM`, which means it's the hidden-to-hidden weight. And the third weight is a vector of length `M`, which means it's the bias term.

In [None]:
# See what's returned
model.layers[1].get_weights()

[array([[ 0.06160122,  0.16070706,  0.83621055,  0.04993761, -0.36932853],
        [ 0.4978891 , -0.474034  ,  0.55890614,  0.06967717,  0.21268493],
        [-0.44685632, -0.28297323, -0.17539108,  0.42829865,  0.22275227]],
       dtype=float32),
 array([[ 0.00272548,  0.04928541,  0.32022277,  0.3270029 ,  0.88774437],
        [ 0.6996881 ,  0.64928424, -0.08133215, -0.27187836,  0.09128988],
        [-0.22173485,  0.50949985,  0.6649476 ,  0.31805265, -0.38461757],
        [ 0.5346833 , -0.24025255, -0.13355102,  0.7674074 , -0.22280595],
        [ 0.41877976, -0.50861543,  0.65639263, -0.35927662, -0.07747886]],
       dtype=float32),
 array([0., 0., 0., 0., 0.], dtype=float32)]

In [None]:
# Check their shapes
# Should make sense
# First output is input > hidden
# Second output is hidden > hidden
# Third output is bias term (vector of length M)
a, b, c = model.layers[1].get_weights()
print(a.shape, b.shape, c.shape)

(3, 5) (5, 5) (5,)
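
As a quick sanity check, these shapes should account for the parameter counts in the model summary above. Here is a minimal sketch of that arithmetic, assuming the same `D`, `M`, and `K` as in this notebook and a dense layer made of an `MxK` weight plus a length-`K` bias (the names `rnn_params` and `dense_params` are just for illustration):

```python
# Shape bookkeeping only; no TensorFlow needed.
D, M, K = 3, 5, 2  # same values as in this notebook

# SimpleRNN: input-to-hidden (D x M) + hidden-to-hidden (M x M) + bias (M)
rnn_params = D * M + M * M + M

# Dense: hidden-to-output (M x K) + bias (K)
dense_params = M * K + K

print(rnn_params, dense_params, rnn_params + dense_params)  # 45 12 57
```

The three numbers match the `Param #` column and the total reported by `model.summary()`.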


Now we can assign our weight variables with confidence. For the layer at index 1, we call these weights `Wx`, `Wh`, and `bh`. Notice I'm using shorthand here: I write `Wx` and `Wh` rather than `Wxh` and `Whh`, since the longer names don't add much.

The layer at index 2 is the output layer, so we call its weights `Wo` and `bo`.

In [None]:
Wx, Wh, bh = model.layers[1].get_weights()
Wo, bo = model.layers[2].get_weights()

The last step is to do our manual RNN calculation. To start, we initialize the hidden state to a vector of 0s. By the way, if we got this wrong, the output would be different, so matching the output of `model.predict()` also confirms that the initial state really is 0.

Next we take `X` at index 0, which is our one and only sample, and call it `x`. Then we initialize an empty list for all of our ŷ's. In this example, we only care about the final ŷ, but we are going to calculate them all for completeness.

Next we enter our loop, where `t` counts from 0 up to `T` - 1. Inside the loop, we first calculate `h`, the value of the hidden layer at time `t`. It's equal to `tanh(x[t].dot(Wx) + h_last.dot(Wh) + bh)`. You should recognize this formula from the lecture slides, so cross-reference it with those. Once we have `h`, we can calculate ŷ, which is just the usual neuron equation. Finally, we assign `h` to `h_last` so that `h_last` has the correct value for the next iteration of the loop.

Once we're outside the loop, we print the final value in the `Yhats` list, which should equal what we got earlier from `model.predict()`. The two outputs agree to about six decimal places; the tiny difference is consistent with floating-point rounding. So that's pretty awesome: we've confirmed that these are indeed the calculations done in the `SimpleRNN`.
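
For reference, the update described above can be written compactly as follows. This is a sketch in the row-vector convention used by the code; shifting the time index to run from 1 to $T$, so that $h_0$ denotes the initial state, is a notational choice made here and may differ from the slides:

$$
\begin{aligned}
h_0 &= 0 \in \mathbb{R}^{M} \\
h_t &= \tanh\left(x_t W_x + h_{t-1} W_h + b_h\right), \quad t = 1, \dots, T \\
\hat{y}_t &= h_t W_o + b_o
\end{aligned}
$$

Here $x_t \in \mathbb{R}^{D}$, $W_x \in \mathbb{R}^{D \times M}$, $W_h \in \mathbb{R}^{M \times M}$, $b_h \in \mathbb{R}^{M}$, $W_o \in \mathbb{R}^{M \times K}$, and $b_o \in \mathbb{R}^{K}$, so each $h_t$ has length $M$ and each $\hat{y}_t$ has length $K$. The model's prediction is $\hat{y}_T$.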

In [None]:
h_last = np.zeros(M) # initial hidden state
x = X[0] # the one and only sample
Yhats = [] # where we store the outputs

for t in range(T):
  h = np.tanh(x[t].dot(Wx) + h_last.dot(Wh) + bh)
  y = h.dot(Wo) + bo # we only care about this value on the last iteration
  Yhats.append(y)
  
  # important: assign h to h_last
  h_last = h

# print the final output
print(Yhats[-1])

[-0.70623848  0.45167215]


One thing that made this exercise simpler was that we had only one sample. As a bonus exercise, try `N` > 1: modify this code so that it still produces the same result as `model.predict()` when you have multiple samples.

In [None]:
# Bonus exercise: calculate the output for multiple samples at once (N > 1)
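
One possible way to approach the bonus exercise is sketched below. It uses NumPy only, with random stand-in data and weights; the `_demo` names and both helper functions are hypothetical and not part of the notebook. It checks a batched version of the recurrence against the per-sample loop from above. In the notebook itself you would instead pass in the arrays from `get_weights()` and compare against `model.predict()`.

```python
import numpy as np

def rnn_forward_single(x, Wx, Wh, bh, Wo, bo):
    """Per-sample loop, as in the cell above: x has shape (T, D)."""
    h_last = np.zeros(Wh.shape[0])  # zero initial hidden state, shape (M,)
    for t in range(x.shape[0]):
        h_last = np.tanh(x[t].dot(Wx) + h_last.dot(Wh) + bh)
    return h_last.dot(Wo) + bo  # final output, shape (K,)

def rnn_forward_batch(X, Wx, Wh, bh, Wo, bo):
    """Same recurrence for all samples at once: X has shape (N, T, D)."""
    h_last = np.zeros((X.shape[0], Wh.shape[0]))  # (N, M) zero initial states
    for t in range(X.shape[1]):
        # X[:, t] is (N, D); bh of shape (M,) broadcasts over the N rows
        h_last = np.tanh(X[:, t].dot(Wx) + h_last.dot(Wh) + bh)
    return h_last.dot(Wo) + bo  # final outputs, shape (N, K)

# Stand-in data and weights (assumptions for illustration only)
rng = np.random.default_rng(0)
N_demo, T_demo, D_demo, M_demo, K_demo = 4, 10, 3, 5, 2
X_demo = rng.standard_normal((N_demo, T_demo, D_demo))
weights_demo = (
    rng.standard_normal((D_demo, M_demo)),  # Wx
    rng.standard_normal((M_demo, M_demo)),  # Wh
    rng.standard_normal(M_demo),            # bh
    rng.standard_normal((M_demo, K_demo)),  # Wo
    rng.standard_normal(K_demo),            # bo
)

Yhat_batch = rnn_forward_batch(X_demo, *weights_demo)
Yhat_loop = np.stack([rnn_forward_single(x, *weights_demo) for x in X_demo])

print(Yhat_batch.shape)                    # (4, 2)
print(np.allclose(Yhat_batch, Yhat_loop))  # True
```

The only real change from the single-sample loop is that the hidden state becomes an `NxM` array, so every row carries its own sample's state through the same weights.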