# Basic RNNs in TensorFlow

First, let's implement a very simple RNN model, without using any of TensorFlow's RNN operations. We will create an RNN composed of a layer of five recurrent neurons using the tanh activation function. We will assume that the RNN runs over only two time steps, taking an input vector of size 3 at each time step.

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32,[None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape = [n_inputs, n_neurons],dtype = tf.float32))
Wy = tf.Variable(tf.random_normal(shape = [n_neurons, n_neurons], dtype = tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype = tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

This network looks much like a two-layer feedforward neural network, with a few twists: first, the same weights and bias terms are shared by both layers, and second, we feed input at each layer, and we get outputs from each layer. 

To run the model, we need to feed it the inputs at both time steps like so:

In [3]:
#Mini-batch:    instance 0, instance 1, instance 2, instance 3
X0_batch = np.array([[0,1,2], [3,4,5], [6,7,8], [9,0,1]]) # t = 0
X1_batch = np.array([[9,8,7], [0,0,0], [6,5,4],[3,2,1]]) # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0,Y1], feed_dict = {X0:X0_batch, X1: X1_batch})

The mini-batch contains four instances, each with an input sequence composed of exactly two inputs. At the end, Y0_val and Y1_val contain the outputs of the network at both time steps for all neurons and all instances in the mini-batch:

In [4]:
#output: 4 instances (rows) x 5 neurons (columns), for each of the 2 time steps (t = 0, 1)
print(Y0_val) #output at t = 0
print(Y1_val) #output at t = 1

[[-0.9978484  -0.8388984  -0.9963468   0.99921966  0.9462805 ]
 [-1.         -0.42275447 -0.82361877  1.          0.9999997 ]
 [-1.          0.30530632  0.67278045  1.          1.        ]
 [-0.9472425   1.          1.         -0.89053035  1.        ]]
[[-1.          0.9989584   1.          1.          1.        ]
 [-0.9976455  -0.00828366  0.9999609  -0.23960122 -0.9914078 ]
 [-1.          0.36997396  1.          1.          1.        ]
 [-0.99510556 -0.9900733   0.9999999   0.9999997   0.9999914 ]]
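
For readers who want to check the arithmetic outside TensorFlow, the same two-step computation can be sketched in NumPy. This is only an illustration: the weights are drawn with an arbitrary seed, so the numbers will not match the output above. Only the shapes and the structure of the computation carry over.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the one TensorFlow used
n_inputs, n_neurons = 3, 5

# Same roles as Wx, Wy and b in the TensorFlow graph
Wx = rng.standard_normal((n_inputs, n_neurons))
Wy = rng.standard_normal((n_neurons, n_neurons))
b = np.zeros((1, n_neurons))

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]], dtype=float)  # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]], dtype=float)  # t = 1

# t = 0: there is no previous output yet, so only the input term appears
Y0 = np.tanh(X0_batch @ Wx + b)
# t = 1: the previous output Y0 is fed back through Wy; Wx and b are reused
Y1 = np.tanh(Y0 @ Wy + X1_batch @ Wx + b)

print(Y0.shape, Y1.shape)  # (4, 5) (4, 5)
```

Each row corresponds to one instance and each column to one neuron, exactly as in the TensorFlow output.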


## Static Unrolling Through Time

The static_rnn() function (named tf.nn.rnn() in TensorFlow releases before 1.0, which is the name used in the code below) creates an unrolled RNN network by chaining cells. The following code creates the exact same model as above:

In [5]:
X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons)
output_seqs , states = tf.nn.rnn(basic_cell, [X0,X1], 
                                 dtype = tf.float32)

Y0, Y1 = output_seqs

First we create input placeholders, as before. Then we create a BasicRNNCell, which you can think of as a factory that creates copies of the cell to build the unrolled RNN (one for each time step). 

Then we call nn.rnn(), giving the cell factory and the input tensors, and telling it the data type of inputs.

The nn.rnn() function calls the cell factory's __call__() function once per input, creating two copies of the cell (each containing a layer of five recurrent neurons), which share weights and bias terms, and it chains them just like we did earlier.

The static_rnn() function returns two objects. The first is a Python list containing the output tensors for each step. The second is a tensor containing the final states of the network.

When you are using basic cells, the final state is simply equal to the last output.
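
The chaining described above can be sketched in plain NumPy. The helper names `unroll` and `basic_step` are hypothetical and are not part of TensorFlow; the sketch only illustrates that one step function (and therefore one set of weights) is applied at every time step, and that for a basic cell the final state is the last output.

```python
import numpy as np

def unroll(step, inputs, initial_state):
    """Chain `step` over a list of per-time-step inputs."""
    state = initial_state
    outputs = []
    for x_t in inputs:
        state = step(x_t, state)  # same function, hence same weights, at every step
        outputs.append(state)     # for a basic cell, the output is the new state
    return outputs, state

rng = np.random.default_rng(42)  # arbitrary seed
Wx = rng.standard_normal((3, 5))
Wy = rng.standard_normal((5, 5))
b = np.zeros(5)

def basic_step(x_t, h_prev):
    return np.tanh(x_t @ Wx + h_prev @ Wy + b)

X0 = np.ones((4, 3))
X1 = np.zeros((4, 3))
output_seqs, final_state = unroll(basic_step, [X0, X1], np.zeros((4, 5)))
print(len(output_seqs), np.array_equal(final_state, output_seqs[-1]))  # 2 True
```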

------------------------
If there were 50 time steps, it would be inconvenient to define 50 input placeholders and 50 output tensors. To simplify this, the following code builds the same RNN again, but this time it takes a single input placeholder of shape [None, n_steps, n_inputs], where the first dimension is the mini-batch size.

Then it extracts the list of input sequences for each time step. X_seqs is a Python list of n_steps tensors of shape [None, n_inputs], where once again the first dimension is the mini-batch size.

To do this, we first swap the first two dimensions using the transpose() function, so that the time steps are now the first dimension. Then we extract a Python list of tensors along the first dimension (i.e., one tensor per time step) using the unstack() function.

The next two lines are the same as before.

Finally, we merge all the output tensors into a single tensor using the stack() function, and we swap the first two dimensions to get the final outputs tensor of shape [None, n_steps, n_neurons] (again, the first dimension is the mini-batch size).

In [16]:
tf.reset_default_graph() #stop getting reuse error

n_steps = 2
n_inputs = 3
n_neurons = 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
X_seqs = tf.unstack(tf.transpose(X, perm = [1,0,2]))
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons)

output_seqs, states = tf.nn.rnn(basic_cell, X_seqs, dtype = tf.float32)

outputs = tf.transpose(tf.stack(output_seqs), perm = [1,0,2])
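
The shape bookkeeping of the transpose, unstack, and stack steps in the code above can be checked with their NumPy counterparts. In this sketch the RNN itself is replaced by a trivial stand-in (doubling each value), which is an assumption made purely so that the round trip is easy to verify.

```python
import numpy as np

batch_size, n_steps, n_inputs = 4, 2, 3
X = np.arange(batch_size * n_steps * n_inputs).reshape(batch_size, n_steps, n_inputs)

# Swap the batch and time axes, then split along the (new) first axis
X_seqs = list(np.transpose(X, (1, 0, 2)))  # n_steps arrays of shape (4, 3)

# Stand-in for the RNN: any per-step function that keeps the batch axis
output_seqs = [x_t * 2 for x_t in X_seqs]

# Stack along a new first axis, then swap back to batch-major order
outputs = np.transpose(np.stack(output_seqs), (1, 0, 2))

print(len(X_seqs), X_seqs[0].shape, outputs.shape)  # 2 (4, 3) (4, 2, 3)
```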


Now we can run the network by feeding it a single tensor that contains all the mini-batch sequences:

In [17]:
X_batch = np.array([
    [[0,1,2],[9,6,7]], #instance 0
    [[3,4,5], [0,0,0]], #instance 1
    [[6,7,8], [6,5,4]], #instance 2
    [[9,0,1], [3,2,1]], #instance 3
])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict = {X: X_batch})

In [18]:
print(outputs_val)

[[[ 0.85013986  0.22620182  0.09570671  0.01165123  0.41765642]
  [ 0.99999917  0.92645323  0.5786223   0.8990805   0.9251097 ]]

 [[ 0.99981105  0.8789149   0.2906526   0.49308833  0.89505   ]
  [-0.6515364  -0.7587294   0.8698612  -0.672047   -0.8924125 ]]

 [[ 0.99999964  0.9869238   0.46412593  0.78893834  0.9851777 ]
  [ 0.99985224  0.74896526  0.9554794   0.1372736  -0.00246042]]

 [[ 0.9916968  -0.25550136 -0.6158796   0.99162024  0.6474958 ]
  [ 0.9407799   0.8512076   0.384363   -0.47657797 -0.6362606 ]]]


## Dynamic Unrolling Through Time

The dynamic_rnn() function uses a while_loop() operation to run over the cell the appropriate number of times, and you can set swap_memory = True if you want it to swap the GPU's memory to CPU during backprop to avoid OOM errors.

Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, n_steps, n_inputs]) and it outputs a single tensor for all outputs at every time step (shape [None, n_steps, n_neurons]).

There is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamic_rnn() function, which is much nicer.

In [20]:
tf.reset_default_graph()

n_inputs = 3
n_neurons = 5
n_steps = 2

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype = tf.float32)

## Handling Variable Length Input Sequences

So far we have used only fixed-size input sequences (all exactly two steps long). What if the input sequences have variable lengths (e.g., like sentences)? In this case you should set the sequence_length parameter when calling the dynamic_rnn() (or static_rnn()) function; it must be a 1D tensor indicating the length of the input sequence for each instance. For example:

In [22]:
tf.reset_default_graph()

seq_length = tf.placeholder(tf.int32, [None])

#----same as before----
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons)
#----------------------
outputs, states = tf.nn.dynamic_rnn(basic_cell, X,
                                    dtype = tf.float32,
                                    sequence_length = seq_length)

For example, suppose the second input sequence contains only one input instead of two. It must be padded with a zero vector in order to fit in the input tensor X (because the input tensor's second dimension is the size of the longest sequence, i.e., 2).

In [23]:
X_batch = np.array([
    #step 0   step 1
    [[0,1,2],[9,8,7]], #instance 0
    [[3,4,5],[0,0,0]], #instance 1 (padded with zero vector)
    [[6,7,8],[6,5,4]], #instance 2
    [[9,0,1], [3,2,1]], #instance 3
])

seq_length_batch = np.array([2,1,2,2])

Of course, you now need to feed values for both placeholders X and seq_length:

In [24]:
init = tf.global_variables_initializer()

with tf.Session() as sess:
    init.run()
    output_val, states_val = sess.run(
        [outputs, states], feed_dict = {X: X_batch, seq_length: seq_length_batch})

Now the RNN outputs zero vectors for every time step past the end of the input sequence (see the second time step of instance 1 below):

In [25]:
print(output_val)

[[[ 0.7481924  -0.41833863 -0.865391    0.67262447  0.51136667]
  [ 0.42523554 -0.93336535 -0.936078   -0.6782177   0.9999345 ]]

 [[ 0.8977136  -0.76589394 -0.9768165   0.6474266   0.9889001 ]
  [ 0.          0.          0.          0.          0.        ]]

 [[ 0.96045864 -0.91781133 -0.99619526  0.62072885  0.99980736]
  [-0.11894488 -0.94140685 -0.5675743  -0.49118972  0.9960541 ]]

 [[-0.999943   -0.96134156  0.9994977  -0.99759614  0.9962898 ]
  [-0.6357697  -0.85447615  0.8104649  -0.26680082  0.95177114]]]


Moreover, the states tensor contains the final state of each cell (excluding the zero vectors):

In [26]:
print(states_val)

[[ 0.42523554 -0.93336535 -0.936078   -0.6782177   0.9999345 ]
 [ 0.8977136  -0.76589394 -0.9768165   0.6474266   0.9889001 ]
 [-0.11894488 -0.94140685 -0.5675743  -0.49118972  0.9960541 ]
 [-0.6357697  -0.85447615  0.8104649  -0.26680082  0.95177114]]
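
The effect of sequence_length on the two results can be mimicked in NumPy. The function `apply_sequence_length` below is a hypothetical helper, not a TensorFlow API, and it reproduces only the observable behaviour (zeroed outputs past the end, final state taken from the last valid step), not how TensorFlow computes it internally. It assumes every length is at least 1.

```python
import numpy as np

def apply_sequence_length(raw_outputs, seq_length):
    """raw_outputs: [batch, n_steps, n_neurons]; seq_length: [batch], each >= 1."""
    batch_size, n_steps, _ = raw_outputs.shape
    steps = np.arange(n_steps)
    # mask[i, t] is True while step t lies inside instance i's sequence
    mask = steps[None, :] < seq_length[:, None]
    outputs = raw_outputs * mask[:, :, None]
    # Final state: the output at the last valid step of each instance
    states = raw_outputs[np.arange(batch_size), seq_length - 1]
    return outputs, states

raw = np.arange(1, 17, dtype=float).reshape(4, 2, 2)
outputs, states = apply_sequence_length(raw, np.array([2, 1, 2, 2]))
print(outputs[1])  # rows [5. 6.] and [0. 0.]: the second step of instance 1 is zeroed
print(states[1])   # [5. 6.]: the step-0 output of instance 1
```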


# Training RNNs

To train an RNN, the trick is to unroll it through time (like we did above) and then simply use regular backpropagation. This strategy is called backpropagation through time (BPTT).

Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum the gradients over all time steps.
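
The summation over time steps can be made concrete with a scalar RNN. The sketch below is an illustration under simplifying assumptions (one neuron, one input feature, the final state used directly as the loss); it accumulates the contribution of each time step to the gradient of the shared weight and compares the total with a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def grad_w_bptt(w, u, xs):
    """Gradient of the final state with respect to the shared weight w."""
    hs = forward(w, u, xs)
    grad_w = 0.0
    delta = 1.0                             # dL/dh_T for the loss L = h_T
    for t in range(len(xs), 0, -1):
        d_pre = delta * (1.0 - hs[t] ** 2)  # back through tanh at step t
        grad_w += d_pre * hs[t - 1]         # contribution of step t to dL/dw
        delta = d_pre * w                   # gradient passed on to h_{t-1}
    return grad_w

w, u, xs = 0.5, 0.8, [1.0, -0.5, 0.25]
eps = 1e-6
numeric = (forward(w + eps, u, xs)[-1] - forward(w - eps, u, xs)[-1]) / (2 * eps)
print(abs(grad_w_bptt(w, u, xs) - numeric) < 1e-6)  # True
```

Because w appears at every step of the unrolled network, its gradient is the sum of the per-step terms accumulated in `grad_w`.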

## Training a Sequence Classifier

Let's train an RNN to classify MNIST images. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 x 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer.

In [28]:
from tensorflow.contrib.layers import fully_connected
from tensorflow.examples.tutorials.mnist import input_data

In [30]:
tf.reset_default_graph() #reset graphs

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32,[None])

basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype = tf.float32)

logits = fully_connected(states, n_outputs, activation_fn = None)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = y,
                                                         logits = logits)

loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
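
To make the loss and accuracy nodes less opaque, here is a NumPy sketch of what they compute for a tiny hand-made batch. The function name and the example logits are assumptions chosen for illustration; the TensorFlow ops themselves are not reimplemented here.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-instance cross entropy from raw logits and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 1])

loss = sparse_softmax_cross_entropy(logits, labels).mean()
accuracy = (logits.argmax(axis=1) == labels).mean()  # top-1 accuracy
print(accuracy)  # 0.5
```

The first instance is classified correctly (the largest logit is at index 0), the second is not, hence an accuracy of 0.5.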

Now let's load the MNIST data and reshape the test data to [batch_size, n_steps, n_inputs], as is expected by the network.

In [32]:
mnist = input_data.read_data_sets("tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting tmp/data/t10k-labels-idx1-ubyte.gz
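
The reshape turns each flattened 784-pixel image into a sequence of 28 rows. A small NumPy sketch, using an arbitrary stand-in array instead of real MNIST data, shows that time step t of the sequence is simply the t-th block of 28 pixels.

```python
import numpy as np

n_steps, n_inputs = 28, 28
# Stand-in for a batch of two flattened images (the values are arbitrary)
flat_images = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)

sequences = flat_images.reshape((-1, n_steps, n_inputs))
print(sequences.shape)  # (2, 28, 28)

# Time step t of an image is its t-th row of 28 pixels
t = 5
print(np.array_equal(sequences[0, t], flat_images[0, t * 28:(t + 1) * 28]))  # True
```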


Now we are ready to train the RNN. We reshape each training batch before feeding it to the network.

In [34]:
n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict = {X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict = {X:X_batch, y:y_batch})
        acc_test = accuracy.eval(feed_dict = {X:X_test, y:y_test})
        print(epoch, "Train accuracy", acc_train, "Test accuracy:", acc_test)
            

0 Train accuracy 0.98 Test accuracy: 0.884
1 Train accuracy 0.94666666 Test accuracy: 0.9503
2 Train accuracy 0.97333336 Test accuracy: 0.9556
3 Train accuracy 0.94666666 Test accuracy: 0.9573
4 Train accuracy 0.97333336 Test accuracy: 0.9596
5 Train accuracy 0.9866667 Test accuracy: 0.9672
6 Train accuracy 0.96666664 Test accuracy: 0.9696
7 Train accuracy 0.97333336 Test accuracy: 0.964
8 Train accuracy 0.97333336 Test accuracy: 0.9689
9 Train accuracy 0.96666664 Test accuracy: 0.9757
10 Train accuracy 0.98 Test accuracy: 0.9727
11 Train accuracy 0.98 Test accuracy: 0.9739
12 Train accuracy 0.9866667 Test accuracy: 0.9736
13 Train accuracy 0.96666664 Test accuracy: 0.9711
14 Train accuracy 1.0 Test accuracy: 0.9763
15 Train accuracy 1.0 Test accuracy: 0.9805
16 Train accuracy 0.9866667 Test accuracy: 0.9734
17 Train accuracy 1.0 Test accuracy: 0.9767
18 Train accuracy 0.98 Test accuracy: 0.9716
19 Train accuracy 0.96666664 Test accuracy: 0.9657
20 Train accuracy 0.99333334 Test accura

## Training to Predict Time Series

In this section we will train an RNN to predict the next value in a generated time series. Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future.

First, let's create the RNN. It will contain 100 recurrent neurons, and we will unroll it over 20 time steps since each training instance will be 20 inputs long. Each input will contain only one feature (the value at that time).

The targets are also sequences of 20 steps, each containing a single value.
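
One possible way to produce such input/target pairs is sketched below. Both `time_series` and `next_batch` are hypothetical: the notebook does not say which series or sampling scheme it uses, so the formula, the time range, and the resolution are assumptions made only to show the one-step shift between inputs and targets.

```python
import numpy as np

def time_series(t):
    # Hypothetical series, chosen only for illustration
    return t * np.sin(t) / 3 + 2 * np.sin(5 * t)

def next_batch(batch_size, n_steps, rng, t_min=0.0, t_max=30.0, resolution=0.1):
    """Sample random windows of n_steps + 1 consecutive values."""
    t0 = rng.uniform(t_min, t_max - (n_steps + 1) * resolution, size=(batch_size, 1))
    ts = t0 + np.arange(n_steps + 1) * resolution  # shape [batch_size, n_steps + 1]
    ys = time_series(ts)
    X = ys[:, :-1].reshape(-1, n_steps, 1)         # the first n_steps values
    y = ys[:, 1:].reshape(-1, n_steps, 1)          # the same window, one step later
    return X, y

rng = np.random.default_rng(0)  # arbitrary seed
X_batch, y_batch = next_batch(50, 20, rng)
print(X_batch.shape, y_batch.shape)  # (50, 20, 1) (50, 20, 1)
```

By construction, `y_batch[:, t]` equals `X_batch[:, t + 1]` for every step except the last, which is the value the network has never seen.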

### Construction Phase

In [36]:
tf.reset_default_graph() #reset graphs

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons, activation = tf.nn.relu)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32)

At each time step we now have an output vector of size 100. But what we want is a single output value at each time step. The simplest solution is to wrap the cell in an OutputProjectionWrapper. A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality.

The OutputProjectionWrapper adds a fully connected layer of linear neurons (i.e., without any activation function) on top of each output (but it does not affect the cell state). All these fully connected layers share the same (trainable) weights and bias terms.

We will begin by wrapping the BasicRNNCell into an OutputProjectionWrapper:

In [39]:
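cell = tf.nn.rnn_cell.OutputProjectionWrapper(
    tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons, activation = tf.nn.relu),
    output_size = n_outputs
)
# Unroll again with the wrapped cell so that outputs has shape [None, n_steps, n_outputs];
# the separate scope name (arbitrary) avoids clashing with the earlier unroll
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32, scope = "projected_rnn")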

Now we need to define the cost function. We will use the MSE, as we did in previous regression tasks. Next we will create an Adam optimizer, the training op, and the variable initialization op, as usual:

In [41]:
learning_rate = 0.001

loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

### Execution Phase

In [None]:
n_iterations = 10000
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = [...] #fetch next training batch
        sess.run(training_op, feed_dict = {X:X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict = {X: X_batch, y: y_batch})
            print(iteration, "\tmse", mse)

Once the model is trained, you can make predictions:

In [None]:
X_new = [...] #New sequences
y_pred = sess.run(outputs, feed_dict = {X: X_new})

Although using an OutputProjectionWrapper is the simplest solution to reduce the dimensionality of the RNN's output sequences down to just one value per time step, it is not the most efficient.

You can reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons], then apply a single fully connected layer with the appropriate output size, which will result in an output tensor of shape [batch_size * n_steps, n_outputs], and then reshape this tensor to [batch_size, n_steps, n_outputs].

To implement this solution, we first revert to a basic cell, without the OutputProjectionWrapper:

In [None]:
cell = tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons, activation = tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype = tf.float32)

Then we stack all the outputs using the reshape() operation, apply the fully connected layer (without using any activation function; this is just a projection), and finally unstack all the outputs using reshape():

In [None]:
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = fully_connected(stacked_rnn_outputs, n_outputs, activation_fn = None)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
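
That the reshape trick gives the same result as projecting each time step separately can be checked in NumPy. The random arrays below are stand-ins for the RNN outputs and for the projection weights; they are assumptions used only to compare the two computations.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
batch_size, n_steps, n_neurons, n_outputs = 4, 20, 100, 1

rnn_outputs = rng.standard_normal((batch_size, n_steps, n_neurons))
W = rng.standard_normal((n_neurons, n_outputs))  # shared projection weights
b = np.zeros(n_outputs)

# One projection per time step, all sharing W and b
per_step = np.stack([rnn_outputs[:, t] @ W + b for t in range(n_steps)], axis=1)

# Reshape trick: a single matrix product over all instances and time steps
stacked = rnn_outputs.reshape(-1, n_neurons)  # [batch_size * n_steps, n_neurons]
projected = (stacked @ W + b).reshape(-1, n_steps, n_outputs)

print(projected.shape, np.allclose(per_step, projected))  # (4, 20, 1) True
```

The single large matrix product replaces n_steps small ones, which is where the efficiency gain comes from.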

## Creative RNN

Now that we have a model that can predict the future, we can use it to generate some creative sequences. All we need is to provide it a seed sequence containing n_steps values (e.g., all zeros), use the model to predict the next value, append this predicted value to the sequence, feed the last n_steps values to the model to predict the next value, and so on.

This process generates a new sequence that has some resemblance to the original time series.


In [None]:
sequence = [0.]*n_steps #seed sequence that is n_steps long
#print(sequence)
for iteration in range(300):
    X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
    y_pred = sess.run(outputs, feed_dict = {X: X_batch})
    sequence.append(y_pred[0,-1,0])
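
The sliding-window logic of this loop can be tried without a trained model. In the sketch below, `predict_next` is a hypothetical stand-in for the RNN (a fixed linear rule on the last two values), so the generated numbers are meaningless; only the feed-back mechanism is illustrated.

```python
def predict_next(window):
    # Hypothetical stand-in for the trained RNN: a fixed linear rule
    # applied to the last two values of the window
    return 0.9 * window[-1] - 0.5 * window[-2] + 0.1

n_steps = 20
sequence = [0.0] * n_steps        # seed sequence
for _ in range(300):
    window = sequence[-n_steps:]  # only the last n_steps values are fed back
    sequence.append(predict_next(window))

print(len(sequence))  # 320
```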

## Deep RNNs

To implement a deep RNN in TensorFlow, you can create several cells and stack them into a MultiRNNCell. In the following code we stack three identical cells (but you could use various kinds of cells with different numbers of neurons):


In [53]:
tf.reset_default_graph() #reset graphs

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

n_neurons = 100
n_layers = 3

# Build one cell object per layer rather than repeating the same object
cells = [tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons) for _ in range(n_layers)]
multi_layer_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype = tf.float32)

The states variable is a tuple containing one tensor per layer, each representing the final state of that layer's cell (with shape [batch_size, n_neurons]). If you set state_is_tuple = False when creating the MultiRNNCell, then states becomes a single tensor containing the states from every layer, concatenated along the column axis (i.e., its shape is [batch_size, n_layers * n_neurons]).
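
The difference between the two layouts is only a concatenation, as this NumPy sketch shows. The per-layer arrays are stand-ins filled with the layer index, an assumption made so that the layers remain easy to tell apart after concatenation.

```python
import numpy as np

batch_size, n_neurons, n_layers = 4, 100, 3
# Stand-ins for the per-layer final states (one array per layer)
states = tuple(np.full((batch_size, n_neurons), layer, dtype=np.float32)
               for layer in range(n_layers))

flat_states = np.concatenate(states, axis=1)  # concatenate along the column axis
print(len(states), states[0].shape, flat_states.shape)  # 3 (4, 100) (4, 300)
```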


## Applying Dropout

You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper.

The following code applies dropout to the inputs of each layer in the RNN, dropping each input with a 50% probability:

In [59]:
tf.reset_default_graph() #reset graphs
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

keep_prob = 0.5

# One dropout-wrapped cell per layer rather than the same object repeated
cells_drop = [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.BasicRNNCell(num_units = n_neurons),
                                            input_keep_prob = keep_prob)
              for _ in range(n_layers)]
multi_layer_cell = tf.nn.rnn_cell.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype = tf.float32)

The main problem with this code is that it will apply dropout not only during training but also during testing, which is not what you want. Unfortunately, the DropoutWrapper does not support an is_training placeholder, so you must either write your own dropout wrapper class, or have two different graphs: one for training, and the other for testing.
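
The behaviour one would want from a training-aware dropout can be sketched in NumPy. This is not a DropoutWrapper replacement: the function `dropout` and its `is_training` flag are hypothetical, and the sketch only shows the intended logic (random masking with rescaling during training, identity during testing).

```python
import numpy as np

def dropout(x, keep_prob, is_training, rng):
    """Inverted dropout that is active only when is_training is True."""
    if not is_training:
        return x                            # testing: inputs pass through unchanged
    mask = rng.random(x.shape) < keep_prob  # keep each unit with probability keep_prob
    return x * mask / keep_prob             # rescale to preserve the expected value

rng = np.random.default_rng(0)  # arbitrary seed
x = np.ones((2, 5))
print(np.array_equal(dropout(x, 0.5, False, rng), x))  # True
train_out = dropout(x, 0.5, True, rng)
print(set(np.unique(train_out)) <= {0.0, 2.0})          # True
```

With keep_prob = 0.5, every surviving input is doubled, so the expected value of each input is the same during training and testing.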

## LSTM Cell

One issue suffered by RNNs is that the memory of the first inputs gradually fades away. Indeed, due to the transformations that the data goes through when traversing an RNN, some information is lost at each time step. After a while, the RNN's state contains virtually no trace of the first inputs.

To solve this problem, various types of cells with long-term memory have been introduced, which have greatly outperformed basic cells. The most popular among them is the LSTM cell.

Compared with a basic cell, the LSTM cell generally performs better: training converges faster, and it detects long-term dependencies in the data. To use an LSTM cell instead of a basic RNN cell, use the following command:

In [60]:
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=n_neurons)

LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. You can change this default behavior by setting state_is_tuple = False when creating the BasicLSTMCell.
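
To see where the two state vectors come from, here is a minimal NumPy sketch of one step of a standard LSTM cell. The weight layout (a single matrix split into four gate blocks in the order input, forget, output, candidate) is an assumption made for compactness and does not necessarily match TensorFlow's internal layout; details such as a forget-gate bias offset are left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: [n_inputs + n_neurons, 4 * n_neurons], b: [4 * n_neurons]."""
    z = np.concatenate([x_t, h_prev], axis=1) @ W + b
    i, f, o, g = np.split(z, 4, axis=1)                  # gate pre-activations
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # long-term state
    h_t = sigmoid(o) * np.tanh(c_t)                      # short-term state (also the output)
    return h_t, c_t

rng = np.random.default_rng(0)  # arbitrary seed
batch_size, n_inputs, n_neurons = 4, 3, 5
W = rng.standard_normal((n_inputs + n_neurons, 4 * n_neurons)) * 0.1
b = np.zeros(4 * n_neurons)
h = np.zeros((batch_size, n_neurons))
c = np.zeros((batch_size, n_neurons))
for x_t in rng.standard_normal((2, batch_size, n_inputs)):  # two time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4, 5) (4, 5)
```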

### Peephole Connections

Peephole connections are extra connections found in one LSTM variant: in addition to the inputs $x_t$ and $h_{(t-1)}$, the previous long-term state $c_{(t-1)}$ is added as an input to the controllers of the forget gate and the input gate, and the current long-term state $c_t$ is added as an input to the controller of the output gate.

To implement peephole connections in TensorFlow, you must use the LSTMCell instead of the BasicLSTMCell and set use_peepholes = True:

In [61]:
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units = n_neurons, use_peepholes = True)

## GRU Cell (Gated Recurrent Unit)

The GRU cell was introduced in the same paper that proposed the encoder-decoder architecture. It is a simplified variant of the LSTM cell that seems to work just as well while keeping only one state vector.

In [62]:
gru_cell = tf.nn.rnn_cell.GRUCell(num_units = n_neurons)
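
For comparison with the LSTM, a single GRU step can be sketched in NumPy as follows. The parameter names and the use of three separate weight matrices are assumptions made for readability; the sketch follows the common formulation in which the update gate z interpolates between the previous state and the candidate state.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, Wz, Wr, Wg, bz, br, bg):
    """One GRU step; every weight matrix has shape [n_inputs + n_neurons, n_neurons]."""
    xh = np.concatenate([x_t, h_prev], axis=1)
    z = sigmoid(xh @ Wz + bz)  # update gate
    r = sigmoid(xh @ Wr + br)  # reset gate
    g = np.tanh(np.concatenate([x_t, r * h_prev], axis=1) @ Wg + bg)  # candidate state
    return z * h_prev + (1.0 - z) * g  # the single state vector, also the output

rng = np.random.default_rng(0)  # arbitrary seed
batch_size, n_inputs, n_neurons = 4, 3, 5
shape = (n_inputs + n_neurons, n_neurons)
Wz, Wr, Wg = (rng.standard_normal(shape) * 0.1 for _ in range(3))
bz = br = bg = np.zeros(n_neurons)

h = np.zeros((batch_size, n_neurons))
for x_t in rng.standard_normal((2, batch_size, n_inputs)):  # two time steps
    h = gru_step(x_t, h, Wz, Wr, Wg, bz, br, bg)
print(h.shape)  # (4, 5)
```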