This notebook walks you through the basic workings of an RNN. You can have a look at __[Chris Olah's blog on RNN](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)__ to understand the RNN architecture.

We will be using a simple RNN, for which the cell equation (not a standard name) is:

$h_t = f(X \times W + h_{t-1} \times U + b)$
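To make the cell equation concrete, here is a minimal NumPy sketch of the recurrence applied to one sequence. The function name `simple_rnn_forward`, the zero initial state, and the example weights are illustrative assumptions; they are not taken from the notebook or from Keras.

```python
import numpy as np

def simple_rnn_forward(x_seq, W, U, b, f=lambda z: z):
    """Apply h_t = f(x_t W + h_{t-1} U + b) over one sequence.

    x_seq: (time_steps, input_dim), W: (input_dim, units),
    U: (units, units), b: (units,). Returns the last hidden state.
    """
    h = np.zeros(U.shape[0])      # assumed initial state: all zeros
    for x_t in x_seq:             # one time step at a time
        h = f(x_t @ W + h @ U + b)
    return h

# Hypothetical weights, chosen only to exercise the function.
W = np.array([[0.5]])
U = np.array([[2.0]])
b = np.array([0.1])
x_seq = np.array([[1.0], [2.0], [3.0]])

h_last = simple_rnn_forward(x_seq, W, U, b)
print(h_last)
```

The default `f` is the identity, which corresponds to a linear activation; passing `f=np.tanh` would give the more common non-linear cell.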


**Setup**:<br>
Our aim in the first problem is to predict the sum of 3 numbers with an RNN.
Thus, for each input sequence $[x_0, x_1, x_2]$, the output should be $y = x_0 + x_1 + x_2$.

**Note**: The same can be achieved with a simple neural net, but we set the problem up this way to keep it easy to follow.


In [1]:
# Import modules 
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, SimpleRNN

In [2]:
# Data and model parameters
seq_len = 3         # Length of each sequence
rnn_size = 1        # Output size of the RNN
input_size = 10000  # Number of instances

Creating Data:

In [3]:
# Random digits 0-9, shaped (instances, time steps, features per step)
all_feat = np.random.randint(low=0, high=10, size=(input_size, seq_len, 1))
all_feat[:5, :]

array([[[0],
        [9],
        [3]],

       [[2],
        [1],
        [8]],

       [[4],
        [4],
        [6]],

       [[8],
        [6],
        [1]],

       [[8],
        [7],
        [8]]])

In [4]:
# Label = sum of the numbers in each sequence; shape (input_size, 1)
all_label = all_feat.sum(axis=1)
all_feat = all_feat.astype('float64')
all_label = all_label.astype('float64')

all_label[:5]

array([[12.],
       [11.],
       [14.],
       [15.],
       [23.]])

### Define model

Our model will have only a SimpleRNN layer.<br>
Our expectation is that the RNN will learn to pass each input through unchanged and add it to the state carried over from the previous step.<br>
One more thing to note: to keep things simple to understand, we'll use a linear activation ($y = f(x) = x$).

In [5]:
x = Input(shape=(3,1,), name='Input_Layer', dtype=tf.float64)
y = SimpleRNN(rnn_size, activation='linear', name='RNN_Layer')(x)

model = Model(inputs=x, outputs=y)

model.summary()



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input_Layer (InputLayer)     [(None, 3, 1)]            0         
_________________________________________________________________
RNN_Layer (SimpleRNN)        (None, 1)                 3         
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________


In [6]:
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['acc'])

Time to train the model

In [7]:
history = model.fit(x=all_feat, y=all_label, batch_size=4, epochs=5, validation_split=0.2, verbose=1)

Train on 8000 samples, validate on 2000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The model looks fine. Let's check a few predictions.

In [8]:
print('\nInput features: \n', all_feat[-5:,:])
print('\nLabels: \n', all_label[-5:,:])
print('\nPredictions: \n', model.predict(all_feat[-5:,:]))


Input features: 
 [[[1.]
  [4.]
  [4.]]

 [[3.]
  [6.]
  [3.]]

 [[6.]
  [6.]
  [7.]]

 [[1.]
  [8.]
  [0.]]

 [[4.]
  [0.]
  [8.]]]

Labels: 
 [[ 9.]
 [12.]
 [19.]
 [ 9.]
 [12.]]

Predictions: 
 [[ 9.]
 [12.]
 [19.]
 [ 9.]
 [12.]]


Let's look at what the RNN learnt. A little info on the RNN weight matrices:<br>
There are three weights, which `get_weights()` returns in this order:
1. W: Input-to-RNN weight matrix
2. U: RNN-to-RNN (or hidden-layer-to-RNN) weight matrix
3. b: Bias vector

In [9]:
wgt_layer = model.get_layer('RNN_Layer')

In [10]:
wgt_layer.get_weights()

[array([[1.]], dtype=float32),
 array([[1.]], dtype=float32),
 array([-5.6379376e-12], dtype=float32)]

The weights match our expectations. The RNN equation is:

$h_t = f(X \times W +  h_{t-1} \times U + b)$

As we have set $f$ to linear, the equation becomes

$h_t = X \times W +  h_{t-1} \times U + b$

We were expecting $W= 1, U = 1$ and $b = 0$, and the weights we got are quite close.
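To see why these values give the sum, the recurrence can be unrolled by hand. This is a sketch that assumes the initial state is zero ($h_{-1} = 0$) and indexes the hidden states to match the inputs $x_0, x_1, x_2$. With $W = 1$, $U = 1$ and $b = 0$:

$h_0 = x_0 \times 1 + 0 \times 1 + 0 = x_0$

$h_1 = x_1 \times 1 + h_0 \times 1 + 0 = x_0 + x_1$

$h_2 = x_2 \times 1 + h_1 \times 1 + 0 = x_0 + x_1 + x_2$

The layer returns only the last hidden state, so under these weights its output is exactly $y$.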

## Moving to higher dimension

This time we will use one-hot encodings as the input to make the problem a bit more interesting.

In [11]:
#Using keras preprocessing function
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam

In [12]:
all_cat_feat = np.apply_along_axis(func1d=lambda x: to_categorical(x,10), arr=all_feat, axis=1)
all_cat_feat = all_cat_feat.reshape(all_feat.shape[0], 3, 10)

In [13]:
all_feat[:5]

array([[[0.],
        [9.],
        [3.]],

       [[2.],
        [1.],
        [8.]],

       [[4.],
        [4.],
        [6.]],

       [[8.],
        [6.],
        [1.]],

       [[8.],
        [7.],
        [8.]]])

In [14]:
all_cat_feat[:5]

array([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]],

       [[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]],

       [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]],

       [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]]], dtype=float32)

Before creating a new model, we should delete the previous one.

In [15]:
del model

In [16]:
x = Input(shape=(3,10,), name='Input_Layer', dtype=tf.float64)
y = SimpleRNN(rnn_size, activation='linear', name='RNN_Layer')(x)

model = Model(inputs=x, outputs=y)

model.summary()



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input_Layer (InputLayer)     [(None, 3, 10)]           0         
_________________________________________________________________
RNN_Layer (SimpleRNN)        (None, 1)                 12        
Total params: 12
Trainable params: 12
Non-trainable params: 0
_________________________________________________________________


In [17]:
model.compile(optimizer=Adam(0.005), loss='mean_squared_error', metrics=['acc'])
history = model.fit(x=all_cat_feat, y=all_label, batch_size=8, epochs=8, validation_split=0.2, verbose=1)

Train on 8000 samples, validate on 2000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


You may have noticed that I changed training parameters such as the learning rate and batch size. This was done to help the model reach a higher accuracy.

Let's check predictions

In [18]:
print('\nInput features: \n', all_feat[-5:,:])
print('\nLabels: \n', all_label[-5:,:])
print('\nPredictions: \n', model.predict(all_cat_feat[-5:,:]))


Input features: 
 [[[1.]
  [4.]
  [4.]]

 [[3.]
  [6.]
  [3.]]

 [[6.]
  [6.]
  [7.]]

 [[1.]
  [8.]
  [0.]]

 [[4.]
  [0.]
  [8.]]]

Labels: 
 [[ 9.]
 [12.]
 [19.]
 [ 9.]
 [12.]]

Predictions: 
 [[11.216051]
 [11.630124]
 [14.002243]
 [10.156763]
 [14.664776]]


This time the input dimension is 10 and the output dimension is still 1. Looking back at the RNN equation:

$h_t = f(X \times W +  h_{t-1} \times U + b)$

$W$ should have size $10 \times 1$, while $U$ should still have size $1 \times 1$
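These shapes also account for the parameter counts in the two model summaries. The sketch below derives them from the equation; the helper names `simple_rnn_shapes` and `count_params` are hypothetical and exist only for this illustration.

```python
def simple_rnn_shapes(input_dim, units):
    """Shapes of W, U and b implied by h_t = f(X W + h_{t-1} U + b)."""
    return {"W": (input_dim, units), "U": (units, units), "b": (units,)}

def count_params(shapes):
    """Total number of scalar weights across all the given shapes."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total

scalar_model = simple_rnn_shapes(input_dim=1, units=1)    # first model
one_hot_model = simple_rnn_shapes(input_dim=10, units=1)  # one-hot model

print(scalar_model, count_params(scalar_model))
print(one_hot_model, count_params(one_hot_model))
```

The totals come out to 3 and 12, the same values the summaries report for `RNN_Layer`.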


In [19]:
wgt_layer = model.get_layer('RNN_Layer')
wgts_mats = wgt_layer.get_weights()

In [20]:
print('W shape: ', wgts_mats[0].shape)
print('U shape: ', wgts_mats[1].shape)
print('b shape: ', wgts_mats[2].shape)

W shape:  (10, 1)
U shape:  (1, 1)
b shape:  (1,)


We expect that $W$ learns to transform the one-hot encoding into the actual numbers.

In [21]:
wgts_mats

[array([[0.3786769 ],
        [0.4866051 ],
        [0.52083737],
        [0.57245713],
        [0.72293156],
        [0.7136882 ],
        [0.7967396 ],
        [0.8854931 ],
        [0.9589181 ],
        [1.0973808 ]], dtype=float32),
 array([[-3.0299761]], dtype=float32),
 array([1.1489911], dtype=float32)]

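$U$ has the expected $1 \times 1$ shape, although its value is far from the 1 we saw in the first model, and $W$ is not the simple $0, 1, \dots, 9$ mapping we hoped for. Because $b$ is added at every time step together with the selected entry of $W$, let me add $b$ to $W$.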

In [22]:
print('\nW+b: \n', wgts_mats[0]+wgts_mats[2])
print('\nU: \n', wgts_mats[1])


W+b: 
 [[1.527668 ]
 [1.6355963]
 [1.6698284]
 [1.7214482]
 [1.8719227]
 [1.8626792]
 [1.9457307]
 [2.0344841]
 [2.1079092]
 [2.2463717]]

U: 
 [[-3.0299761]]
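One way to check that $W + b$ and $U$ are the right quantities to look at is to unroll the recurrence by hand and compare the result with the model's predictions. The sketch below does that for the five sequences printed earlier. It assumes a zero initial state and that the layer outputs the last hidden state; the variable names are illustrative.

```python
import numpy as np

# W + b and U, copied from the printout above.
w_plus_b = np.array([1.527668, 1.6355963, 1.6698284, 1.7214482, 1.8719227,
                     1.8626792, 1.9457307, 2.0344841, 2.1079092, 2.2463717])
u = -3.0299761

# The last five input sequences shown earlier.
digits = np.array([[1, 4, 4], [3, 6, 3], [6, 6, 7], [1, 8, 0], [4, 0, 8]])

h = np.zeros(len(digits))  # assumed zero initial state
for t in range(digits.shape[1]):
    # A one-hot input selects one entry of W, so each step
    # contributes (W + b)[digit] plus U times the previous state.
    h = w_plus_b[digits[:, t]] + u * h

print(np.round(h, 3))
```

The values agree with the `Predictions` printed earlier to about three decimal places, which suggests the model really is computing $h_t = (W + b)[x_t] + U \, h_{t-1}$, even though these weights are not the ones that would give an exact sum.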


For a much clearer picture, round the numbers.

In [23]:
print('\nW+b: \n', np.round(wgts_mats[0]+wgts_mats[2]))
print('\nU: \n', np.round(wgts_mats[1]))


W+b: 
 [[2.]
 [2.]
 [2.]
 [2.]
 [2.]
 [2.]
 [2.]
 [2.]
 [2.]
 [2.]]

U: 
 [[-3.]]


When our input vector $X$, which has a single 1 at the position given by the input number, is multiplied by $W$, it simply picks out the value at that same position in the weight matrix $W$.

$\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix} \times \begin{bmatrix} W_1 \\ W_2 \\ W_3 \\ W_4 \end{bmatrix} = 0 \times W_1 + 0 \times W_2 + 1 \times W_3 + 0 \times W_4 = W_3$


$\begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$ is the one-hot encoding that selects the third position (with the 0-based encoding used in this notebook, that is the digit 2).
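The row-selection effect is easy to confirm numerically. The following sketch uses a random stand-in for the $10 \times 1$ weight matrix (not the trained weights) and checks that the product with a one-hot vector equals plain indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 1))   # stand-in for a learned (10, 1) weight matrix

digit = 3
one_hot = np.eye(10)[digit]    # 1 at index 3, zeros elsewhere

picked = one_hot @ W           # matrix product, shape (1,)
print(picked, W[digit])        # both show the same entry of W
```

Because the product only ever reads one row, a one-hot input followed by a weight matrix behaves like a lookup table indexed by the digit.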

## Using embeddings

In many RNN models, you'll see embeddings being used. Embeddings are similar to one-hot encodings: an n-dimensional representation of your input (generally text), except that the representation is learned along with the rest of the model.

Here, we'll try to replace the one-hot encodings with embeddings.

The input will be the numbers themselves, reshaped to two dimensions, and there will be an embedding layer before the RNN layer.

In [24]:
from tensorflow.keras.layers import Embedding

In [25]:
all_feat_reshaped = all_feat.reshape(all_feat.shape[0], 3)

In [26]:
del model

In [27]:
input_1 = Input(shape=(3,), name='Input_Layer')
x = Embedding(input_dim=10, output_dim=10, name='Embedding_Layer')(input_1)
y = SimpleRNN(rnn_size, activation='linear', name='RNN_Layer')(x)

model = Model(inputs=input_1, outputs=y)

model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Input_Layer (InputLayer)     [(None, 3)]               0         
_________________________________________________________________
Embedding_Layer (Embedding)  (None, 3, 10)             100       
_________________________________________________________________
RNN_Layer (SimpleRNN)        (None, 1)                 12        
Total params: 112
Trainable params: 112
Non-trainable params: 0
_________________________________________________________________


In [28]:
model.compile(optimizer=Adam(0.01), loss='mean_squared_error', metrics=['acc'])
history = model.fit(x=all_feat_reshaped, y=all_label, batch_size=8, epochs=4, validation_split=0.2, verbose=1)

Train on 8000 samples, validate on 2000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


Time to check predictions

In [29]:
print('\nInput features: \n', all_feat_reshaped[-5:,:])
print('\nLabels: \n', all_label[-5:,:])
print('\nPredictions: \n', model.predict(all_feat_reshaped[-5:,:]))


Input features: 
 [[1. 4. 4.]
 [3. 6. 3.]
 [6. 6. 7.]
 [1. 8. 0.]
 [4. 0. 8.]]

Labels: 
 [[ 9.]
 [12.]
 [19.]
 [ 9.]
 [12.]]

Predictions: 
 [[10.375401]
 [12.513451]
 [12.754671]
 [ 8.693607]
 [12.877496]]


This time we need to check the embedding weights too.

In [30]:
embd_layer = model.get_layer('Embedding_Layer')
embd_mats = embd_layer.get_weights()

wgt_layer = model.get_layer('RNN_Layer')
wgts_mats = wgt_layer.get_weights()

The embedding layer's weight matrix should have size $10 \times 10$, as we're mapping 10 numbers (integers, to be precise) to 10-dimensional vectors (one vector for each number). In the weight matrix, the row index is the integer that the row represents.

The RNN weight shapes will be the same as in the previous exercise.

In [31]:
print('Embedding W shape: ', embd_mats[0].shape)
print('W shape: ', wgts_mats[0].shape)
print('U shape: ', wgts_mats[1].shape)
print('b shape: ', wgts_mats[2].shape)

Embedding W shape:  (10, 10)
W shape:  (10, 1)
U shape:  (1, 1)
b shape:  (1,)


Let's check the weight matrices

In [32]:
embd_mats

[array([[-0.36289757,  0.40314817, -0.37529683, -0.40290484,  0.39402398,
          0.3687898 ,  0.4199001 ,  0.40679476,  0.3741078 ,  0.359894  ],
        [-0.35347167,  0.3330296 , -0.39868364, -0.33981174,  0.3365241 ,
          0.3281662 ,  0.3660595 ,  0.4048839 ,  0.34692815,  0.38161314],
        [-0.45016566,  0.42293632, -0.42147616, -0.43580624,  0.4297624 ,
          0.45729828,  0.44733754,  0.4054712 ,  0.46323246,  0.38260564],
        [-0.5018592 ,  0.50228935, -0.4505884 , -0.5009186 ,  0.49097726,
          0.47848928,  0.47305825,  0.46186924,  0.47454175,  0.44277638],
        [-0.40599436,  0.43200484, -0.46300104, -0.4639299 ,  0.43344742,
          0.46506122,  0.40431377,  0.42332262,  0.45043844,  0.42279685],
        [-0.43375313,  0.43648008, -0.45081672, -0.44851768,  0.47521338,
          0.47410107,  0.50128907,  0.49491104,  0.4410805 ,  0.4213932 ],
        [-0.4954618 ,  0.46851364, -0.5116735 , -0.45529932,  0.47856602,
          0.45150945,  0.4934331

In [33]:
wgts_mats

[array([[-0.38934448],
        [ 0.57726926],
        [-0.16260014],
        [-0.28410706],
        [ 0.3671195 ],
        [ 0.34153858],
        [ 0.3436649 ],
        [ 0.2549568 ],
        [ 0.5274176 ],
        [ 0.09343947]], dtype=float32),
 array([[-2.2878616]], dtype=float32),
 array([1.5465231], dtype=float32)]

On their own, these numbers are hard to interpret. Remember the RNN equation:

$h_t = f(X \times W +  h_{t-1} \times U + b)$

Here, $X$ is the embedding output. Let's do one more transformation: $W_{embd} \times W + b$. This gives us a vector of 10 numbers, one for each input digit.

Let's do it one by one

In [34]:
np.matmul(embd_mats[0], wgts_mats[0])

array([[1.2990777],
       [1.1745284],
       [1.4629037],
       [1.6165873],
       [1.4564604],
       [1.5284245],
       [1.6033008],
       [1.927354 ],
       [2.1227646],
       [2.1521482]], dtype=float32)

In [35]:
np.matmul(embd_mats[0], wgts_mats[0]) + wgts_mats[2]

array([[2.8456008],
       [2.7210515],
       [3.0094268],
       [3.1631103],
       [3.0029836],
       [3.0749476],
       [3.149824 ],
       [3.473877 ],
       [3.6692877],
       [3.6986713]], dtype=float32)

In [36]:
print('\n W_embd * W + b: \n', np.matmul(embd_mats[0], wgts_mats[0]) + wgts_mats[2])
print('\nU: \n', wgts_mats[1])


 W_embd * W + b: 
 [[2.8456008]
 [2.7210515]
 [3.0094268]
 [3.1631103]
 [3.0029836]
 [3.0749476]
 [3.149824 ]
 [3.473877 ]
 [3.6692877]
 [3.6986713]]

U: 
 [[-2.2878616]]


Makes some sense, right!

Let's round it.

In [37]:
print('\n W_embd * W + b: \n', np.round(np.matmul(embd_mats[0], wgts_mats[0]) + wgts_mats[2]))
print('\nU: \n', np.round(wgts_mats[1]))


 W_embd * W + b: 
 [[3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [3.]
 [4.]
 [4.]]

U: 
 [[-2.]]


Here's an explanation of what happened:

When you feed an integer to the embedding layer, it returns the vector stored at the corresponding index.

In [38]:
embd_mats[0]

array([[-0.36289757,  0.40314817, -0.37529683, -0.40290484,  0.39402398,
         0.3687898 ,  0.4199001 ,  0.40679476,  0.3741078 ,  0.359894  ],
       [-0.35347167,  0.3330296 , -0.39868364, -0.33981174,  0.3365241 ,
         0.3281662 ,  0.3660595 ,  0.4048839 ,  0.34692815,  0.38161314],
       [-0.45016566,  0.42293632, -0.42147616, -0.43580624,  0.4297624 ,
         0.45729828,  0.44733754,  0.4054712 ,  0.46323246,  0.38260564],
       [-0.5018592 ,  0.50228935, -0.4505884 , -0.5009186 ,  0.49097726,
         0.47848928,  0.47305825,  0.46186924,  0.47454175,  0.44277638],
       [-0.40599436,  0.43200484, -0.46300104, -0.4639299 ,  0.43344742,
         0.46506122,  0.40431377,  0.42332262,  0.45043844,  0.42279685],
       [-0.43375313,  0.43648008, -0.45081672, -0.44851768,  0.47521338,
         0.47410107,  0.50128907,  0.49491104,  0.4410805 ,  0.4213932 ],
       [-0.4954618 ,  0.46851364, -0.5116735 , -0.45529932,  0.47856602,
         0.45150945,  0.4934331 ,  0.4998772 

In [39]:
# If the input was 5, the output will be
embd_mats[0][5]

array([-0.43375313,  0.43648008, -0.45081672, -0.44851768,  0.47521338,
        0.47410107,  0.50128907,  0.49491104,  0.4410805 ,  0.4213932 ],
      dtype=float32)

This vector plays the same role that the one-hot encoding played earlier: it is the representation of the input digit that the RNN sees.

In the next step (the RNN), this vector gets multiplied by $W$ to produce a vector of size `rnn_size`, which is 1 here, so it gives out a single number.
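The two steps described above can be sketched with random stand-in matrices (not the trained weights). The names `E`, `W` and `b` are illustrative. The sketch checks that an embedding lookup is the same as a one-hot product, and that multiplying the looked-up vector by `W` is the same as reading one row of the combined table `E @ W + b`.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 10))   # stand-in embedding matrix: one row per digit
W = rng.normal(size=(10, 1))    # stand-in RNN input weights
b = rng.normal(size=(1,))       # stand-in RNN bias

digit = 5
looked_up = E[digit]                   # what an embedding lookup returns
via_one_hot = np.eye(10)[digit] @ E    # the same row via a one-hot product

step_input = looked_up @ W + b         # X W + b for this digit
table = E @ W + b                      # all ten digits at once, shape (10, 1)

print(np.allclose(looked_up, via_one_hot))
print(step_input, table[digit])
```

This is why the combined product is the useful thing to inspect: the embedding and $W$ only ever act together, so many different pairs of matrices produce the same table.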


As you can see, embeddings learn their representation in combination with the other matrices, and thus might be difficult to explain directly.