# Recurrent Neural Network

## 0. Important hyperparameters for RNN

* `batch_size`: The number of training examples to feed the network in one training pass. Typically this should be set as high as you can go without running out of memory.
* `num_steps`: (or time steps) Number of characters (or words) in each input sequence (different batches may have different `num_steps`). Larger is typically better, since the network can learn longer-range dependencies, but training takes longer.
* `lstm_units`: Number of units in the hidden layers in the LSTM cells. Larger values usually give better performance, at the cost of more computation. Common values are 128, 256, 512, etc.
* `lstm_layers`: Number of LSTM layers in the network. I'd start with 1, then add more if I'm underfitting.
* `learning_rate`: Learning rate (step size) used by the optimizer.
* `keep_prob`: The dropout keep probability when training. If your network is overfitting, try decreasing this.

**Note:** The above hyperparameters apply to any kind of RNN, not just LSTMs.
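As a rough reference, the hyperparameters above could be collected in one place as sketched below. The specific values are illustrative assumptions only, not recommendations taken from these notes or from any benchmark.

```python
# Illustrative starting values only; tune them for your data and memory budget.
hparams = {
    "batch_size": 64,        # examples per training pass
    "num_steps": 100,        # time steps (characters or words) per sequence
    "lstm_units": 256,       # hidden units per LSTM cell
    "lstm_layers": 1,        # start with one layer, add more if underfitting
    "learning_rate": 0.001,  # optimizer step size
    "keep_prob": 0.5,        # dropout keep probability during training
}

# Each batch of token ids fed to the network then has this shape.
batch_shape = (hparams["batch_size"], hparams["num_steps"])
print(batch_shape)
```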

## 1. Word embedding

* After preprocessing our data, we typically have training examples with shape `(batch_size, num_steps)` for each batch.
    * For word-based applications, we have a vocabulary with V words. Typically, each element in the training examples is an integer in the range [0, V-1] representing a word in the vocabulary. (In applications like text generation, the range might be [1, V], with 0 reserved for a padding word.)
    * For character-based applications, we have a vocabulary with C characters. Typically, each element in the training examples is an integer in the range [0, C-1] representing a character in the vocabulary. (In applications like text generation, the range might be [1, C], with 0 reserved for a padding character.)
* We need to add an embedding layer to encode each word in the training examples rather than one-hot encoding it. (One-hot encoding can still work for character-based RNN problems, where the vocabulary is small.)
    * It is massively inefficient to one-hot encode a word since we may have a very big vocabulary base and one-hot would make the word vector extremely large. 
    * We can use an existing pre-trained word2vec representation. 
    
* Create the embedding lookup matrix as a `tf.Variable`. Use that embedding matrix to get the embedded vectors for each word in the original training examples to pass to the LSTM cell with [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup). 

* [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) takes the embedding matrix and an input tensor, typically the training examples, and returns another tensor with the embedded vectors. So, if the embedding layer has 200 units and the input has shape `[batch_size, num_steps]`, the function returns a tensor of shape `[batch_size, num_steps, 200]`.

```
embedding_lookup(
    params,
    ids,
    partition_strategy='mod',
    name=None,
    validate_indices=True,
    max_norm=None
)
```

* `params`: Basically, this is the <b>embedding lookup matrix</b>. More specifically, a single tensor representing the complete embedding tensor, or a list of P tensors all of the same shape except for the first dimension, representing sharded embedding tensors. Alternatively, a PartitionedVariable, created by partitioning along dimension 0. Each element must be appropriately sized for the given partition_strategy.
* `ids`: A Tensor with type int32 or int64 containing the ids (or indexes) to be looked up in params.
* `partition_strategy`: A string specifying the partitioning strategy, relevant if len(params) > 1. Currently "div" and "mod" are supported. Default is "mod".

returns a dense tensor:
* with shape: `shape(ids) + shape(params)[1:]`.
* with type: the same as the tensors in params.

```python
embed_dim = 300
with graph.as_default():
    # create an embedding lookup matrix
    embedding = tf.Variable(tf.random_uniform((n_words, embed_dim), -1, 1), name="embedding")
    # get the embedded vector for each word in the inputs
    embed = tf.nn.embedding_lookup(embedding, inputs)
```

The code above creates an embedding lookup matrix with shape `(n_words, embed_dim)`, where `n_words` is the size of the word vocabulary and `embed_dim` is the <b>embedding dimension</b>, that is, the dimension of the vector representing a word after embedding.

After embedding lookup, the training examples would have shape `(batch_size, num_steps, embed_dim)` 
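To make the shape rule concrete, here is a minimal plain-Python sketch of what an embedding lookup does. It is only a stand-in for `tf.nn.embedding_lookup` (no TensorFlow involved); the toy matrix values and the helper `shape` are made up for illustration.

```python
# Plain-Python stand-in for an embedding lookup (illustration only).
n_words, embed_dim = 5, 3

# Toy embedding matrix: row i is the vector for word id i.
embedding = [[float(i * 10 + j) for j in range(embed_dim)] for i in range(n_words)]

# One batch of word ids with shape (batch_size=2, num_steps=4).
inputs = [[0, 2, 2, 4],
          [1, 0, 3, 3]]

def embedding_lookup(params, ids):
    """Replace every id with the corresponding row of params."""
    return [[params[i] for i in row] for row in ids]

def shape(nested):
    """Shape of a rectangular nested list (hypothetical helper)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

embed = embedding_lookup(embedding, inputs)
print(shape(inputs))  # (2, 4)
print(shape(embed))   # (2, 4, 3)
print(embed[0][1])    # [20.0, 21.0, 22.0]
```

The result shape is the shape of the ids followed by the embedding dimension, which matches `shape(ids) + shape(params)[1:]`.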

## 2. Define type of cell for RNN

Next, we'll create our cells (e.g., LSTM, GRU) to use in the [recurrent network](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn). Here we are just defining what the cells look like. This isn't actually building the graph, just defining the type of cells we want in our graph. There are many [types of cells](https://www.tensorflow.org/api_guides/python/contrib.rnn#Core_RNN_Cells_for_use_with_TensorFlow_s_core_RNN_methods):

* tf.contrib.rnn.BasicRNNCell
* tf.contrib.rnn.BasicLSTMCell
* tf.contrib.rnn.GRUCell
* tf.contrib.rnn.LSTMCell
* tf.contrib.rnn.LayerNormBasicLSTMCell

If we want to create a basic LSTM cell for the graph, we will want to use [tf.contrib.rnn.BasicLSTMCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell). Looking at the function documentation:

```python
tf.contrib.rnn.BasicLSTMCell(num_units, forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None)
```

* `num_units`: int, The number of units in the LSTM cell.
* `forget_bias`: float, The bias added to forget gates. Must be set to 0.0 manually when restoring from CudnnLSTM-trained checkpoints.
* `state_is_tuple`: If True, accepted and returned states are 2-tuples of the c_state and m_state. If False, they are concatenated along the column axis. The latter behavior will soon be deprecated.
* `activation`: Activation function of the inner states. Default: tanh.
* `reuse`: (optional) Python boolean describing whether to reuse variables in an existing scope. If not True, and the existing scope already has the given variables, an error is raised.

You can see that it takes a parameter called `num_units`, the number of units in the cell. So you can write something like

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

to create an LSTM cell with `num_units` hidden units.
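For intuition about what such a cell computes, the following is a rough plain-Python sketch of one LSTM time step for a single example, following the standard LSTM equations (gates `i`, `j`, `f`, `o`, with `forget_bias` added to the forget gate). The function `lstm_step` and the layout of `weights` and `biases` are assumptions made for this illustration; they are not TensorFlow's internal representation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, weights, biases, forget_bias=1.0):
    """One LSTM time step for a single example.

    weights: dict with keys 'i', 'j', 'f', 'o'; each value is a list of
             num_units rows, every row as long as x + h_prev.
    biases:  dict with the same keys; each value has num_units entries.
    """
    xh = x + h_prev  # concatenate input and previous hidden state

    def linear(gate):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(weights[gate], biases[gate])]

    i, j, f, o = linear("i"), linear("j"), linear("f"), linear("o")

    # New cell state: keep part of the old state, add gated new input.
    c_new = [c * sigmoid(fk + forget_bias) + sigmoid(ik) * math.tanh(jk)
             for c, ik, jk, fk in zip(c_prev, i, j, f)]
    # New hidden state (also the cell's output for this step).
    h_new = [math.tanh(c) * sigmoid(ok) for c, ok in zip(c_new, o)]
    return h_new, c_new

# Toy setup: 2 units, 3 input features, all-zero parameters.
num_units, input_dim = 2, 3
zeros = [[0.0] * (input_dim + num_units) for _ in range(num_units)]
weights = {g: zeros for g in "ijfo"}
biases = {g: [0.0] * num_units for g in "ijfo"}

h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], weights, biases)
print(len(h), len(c))  # 2 2
```

Both the hidden state and the cell state have `num_units` entries per example, which is why the cell's state is a pair of tensors.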

Next, you can add dropout to the cell with [`tf.contrib.rnn.DropoutWrapper`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper). This just wraps the cell in another cell, but with dropout added to the inputs and/or outputs. It's a convenient way to regularize your network with very little effort. So you'd do something like

```python
drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```

* `cell`: an RNNCell, a projection to output_size is added to it.
* `input_keep_prob`: unit Tensor or float between 0 and 1, input keep probability; if it is constant and 1, no input  dropout will be added.
* `output_keep_prob`: unit Tensor or float between 0 and 1, output keep probability; if it is constant and 1, no output dropout will be added.
* and many other arguments
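The effect of a keep probability can be sketched in plain Python as "inverted" dropout: each value is kept with probability `keep_prob` and the survivors are scaled by `1 / keep_prob`. This is a simplified illustration of the idea, not the wrapper's actual implementation; the function name `dropout` and the seeded `rng` are assumptions.

```python
import random

def dropout(values, keep_prob, rng):
    """Inverted dropout: keep each value with probability keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    if keep_prob >= 1.0:
        return list(values)  # keep_prob of 1 means no dropout at all
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
cell_output = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(cell_output, keep_prob=0.5, rng=rng)  # some entries zeroed
eval_out = dropout(cell_output, keep_prob=1.0, rng=rng)   # unchanged
print(train_out)
print(eval_out)
```

This is why `keep_prob` is fed as a value below 1 during training and as 1 during validation and testing.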

**Multiple layers of RNN network**

Adding more layers often improves performance, because deeper networks can learn more complex relationships (at the cost of more parameters and a higher risk of overfitting). Again, there is a simple way to create multiple layers of LSTM cells with [`tf.contrib.rnn.MultiRNNCell`](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/MultiRNNCell):

```python
def build_cell(num_units, keep_prob):
    # One LSTM layer with dropout applied to its outputs
    lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

cell = tf.contrib.rnn.MultiRNNCell([build_cell(num_units, keep_prob) for _ in range(num_layers)])
```

The `build_cell()` function builds the cell described above. The `MultiRNNCell` wrapper stacks these into multiple layers of RNN cells, one layer for each cell in the list.

The final `MultiRNNCell` you use in the network is actually one or more cells, typically with dropout. But it all works the same from an architectural viewpoint; there is just a more complicated graph inside the cell.

> NOTE: Each RNN cell here represents a layer. It is only when we unroll an RNN cell over time steps that it looks like we have multiple RNN cells per layer.
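The stacking idea can be sketched with toy cells in plain Python: at one time step, the output of layer k becomes the input of layer k+1, and every layer keeps its own state. The names `make_toy_cell` and `multi_cell` are hypothetical, and the arithmetic is only a stand-in for a real recurrence.

```python
def make_toy_cell(scale):
    """Hypothetical cell: (input, state) -> (output, new_state)."""
    def cell(x, state):
        new_state = state + scale * x  # stand-in for the real recurrence
        return new_state, new_state    # output equals the new state here
    return cell

def multi_cell(cells):
    """Stack cells so that layer k's output is layer k+1's input."""
    def stacked(x, states):
        new_states = []
        for cell, state in zip(cells, states):
            x, new_state = cell(x, state)
            new_states.append(new_state)
        return x, tuple(new_states)  # top-layer output, one state per layer
    return stacked

num_layers = 3
cell = multi_cell([make_toy_cell(0.5) for _ in range(num_layers)])
output, states = cell(8.0, (0.0,) * num_layers)
print(output)  # 1.0
print(states)  # (4.0, 2.0, 1.0)
```

Note that the stacked cell returns a tuple with one state per layer, while its output is only the output of the top layer.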


## 3. RNN forward pass

We need to actually run the data through the RNN nodes. You can use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) to do this. 

You'd pass in the RNN `cell` you created (our multiple layered LSTM cell for instance), the `inputs` to the network and the `initial_state`:

```python
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
```

### Arguments for tf.nn.dynamic_rnn(...) 

The `initial_state` holds the initial values of the cell state that is passed along between successive time steps. `tf.nn.dynamic_rnn` takes care of most of the work for us: we pass in our cell and the input to the cell, and it does the unrolling and everything else.

* `cell`: An instance of RNNCell.
* `inputs`: The RNN inputs. Typically these are the word embeddings from the embedding layer.
    * If time_major == False (default), this must be a Tensor of shape: [batch_size, max_time, ...], or a nested tuple of such elements. 
    * If time_major == True, this must be a Tensor of shape: [max_time, batch_size, ...], or a nested tuple of such elements. 
    * For a nested tuple, <b>the first two dimensions must match across all the inputs</b>, but otherwise the ranks and other shape components may differ. In this case, the input to the cell at each time step replicates the structure of these tuples, except for the time dimension (from which the time is taken).
    * <b>The input to the cell</b> <b style='color:red'>at each time step</b> <b>will be a Tensor or (possibly nested) tuple of Tensors each with dimensions [batch_size, ...]</b>.

* `initial_state`: (optional) An initial state for the RNN. If cell.state_size is an integer, this must be a Tensor of appropriate type and shape [batch_size, cell.state_size]. If cell.state_size is a tuple, this should be a tuple of tensors having shapes [batch_size, s] for s in cell.state_size.

Normally, the tensors for the initial state are initialized with zeros:

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```
* <b>cell.zero_state(...)</b>
    * All RNN cells have this `zero_state` method
    * It takes `batch_size` and `dtype` as input
    * It returns zero-filled state tensor(s).
        * If state_size is an int or TensorShape, then the return value is an N-D tensor of shape `[batch_size, state_size]` filled with zeros.
        * If state_size is a nested list or tuple, then the return value is a nested list or tuple (of the same structure) of 2-D tensors with the shapes `[batch_size, s]` for each s in state_size.

### Outputs from tf.nn.dynamic_rnn(...) 

`tf.nn.dynamic_rnn` returns `outputs` for each time step and the <b style='color:red'>final</b> `state` of the hidden layer.

> Note that, unlike `outputs`, the final `state` doesn't contain information about every time step; it is only the state after the last time step.

A pair `(outputs, state)` where:

* `outputs`: The RNN output Tensor.
    * If time_major == False (default), this will be a Tensor shaped: `[batch_size, max_time, cell.output_size]`.
    * If time_major == True, this will be a Tensor shaped: `[max_time, batch_size, cell.output_size]`.

* `state`: The final state. If cell.state_size is an int, this will be shaped [batch_size, cell.state_size]. If it is a TensorShape, this will be shaped [batch_size] + cell.state_size. If it is a (possibly nested) tuple of ints or TensorShape, this will be a tuple having the corresponding shapes. If cells are LSTMCells state will be a tuple containing a LSTMStateTuple for each cell.

**Inputs and outputs for a one-layer RNN**
* The input to `tf.nn.dynamic_rnn` has shape `[batch_size, max_time, embed_dim]`
* The input to the cell <b style='color:red'>at each time step</b> has shape `[batch_size, embed_dim]`
* The outputs from `tf.nn.dynamic_rnn` have shape of `[batch_size, max_time, cell.output_size]`
    * Generally, the `cell.output_size` is the `lstm_units` of the cell in the last layer of LSTM network if the network has multiple LSTM layers. 
    * For RNN with only one layer, `cell.output_size` is defined by the `lstm_units` of the RNN cell.
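The relationship between per-step outputs and the final state can be sketched in plain Python by unrolling a toy cell over time. This only mimics the bookkeeping of an unrolled RNN; `toy_cell` and `unroll` are hypothetical names, and scalar inputs replace the embedding vectors for brevity.

```python
def toy_cell(x_t, state):
    """Hypothetical cell for one example: running sum of the inputs."""
    new_state = state + x_t
    return new_state, new_state  # (output, new_state)

def unroll(cell, inputs, initial_state):
    """Apply the same cell at every time step of one sequence."""
    state, outputs = initial_state, []
    for x_t in inputs:          # iterate over max_time
        out, state = cell(x_t, state)
        outputs.append(out)     # one output per time step
    return outputs, state       # the state is only the final one

# batch_size = 2, max_time = 4
batch = [[1.0, 2.0, 3.0, 4.0],
         [0.5, 0.5, 0.5, 0.5]]
results = [unroll(toy_cell, seq, initial_state=0.0) for seq in batch]
outputs = [r[0] for r in results]       # shape (batch_size, max_time)
final_state = [r[1] for r in results]   # shape (batch_size,)
print(outputs)      # [[1.0, 3.0, 6.0, 10.0], [0.5, 1.0, 1.5, 2.0]]
print(final_state)  # [10.0, 2.0]
```

The outputs keep one entry per time step, while the final state holds a single entry per example.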

## 4. Output Layer of RNN

We only care about the final output, since we'll use it as our sentiment prediction. So we grab the last output with `outputs[:, -1]`, then calculate the cost from that and `labels_`.

```python
def build_output(last_layer_input, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        last_layer_input: Output tensor(s) from the last RNN layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape the input to a 2D tensor with in_size columns
    x = tf.reshape(last_layer_input, [-1, in_size])

    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.add(tf.matmul(x, softmax_w), softmax_b)
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name="predictions")
    
    return out, logits
    

predictions, logits = build_output(last_layer_input, in_size, out_size) 
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
```
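As a small numeric illustration of the steps above (taking the last time step, applying softmax, and averaging the cross-entropy over the batch), here is a plain-Python sketch. The toy numbers and the helper names `softmax` and `cross_entropy` are assumptions; the TensorFlow ops in the block above do the equivalent work on tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

# Toy RNN outputs with shape (batch_size=2, max_time=2, units=2).
rnn_outputs = [[[0.1, 0.2], [0.3, 0.4]],
               [[0.5, 0.6], [0.7, 0.8]]]
last_output = [seq[-1] for seq in rnn_outputs]  # same idea as outputs[:, -1]

# Toy logits for a batch of 2 examples and 2 classes.
logits = [[0.0, 0.0],
          [2.0, 0.0]]
labels = [[1.0, 0.0],
          [1.0, 0.0]]

losses = [cross_entropy(z, t) for z, t in zip(logits, labels)]
cost = sum(losses) / len(losses)  # mean over the batch
print(last_output)                # [[0.3, 0.4], [0.7, 0.8]]
print(losses[0])                  # ln(2), about 0.693
```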

## 4. Optimizer for RNN

Plain RNNs suffer from exploding and vanishing gradients. LSTMs mitigate the vanishing-gradient problem, but the gradients can still grow without bound. To fix this, we clip the gradients: `tf.clip_by_global_norm` computes the combined norm of all gradients and, if it exceeds the threshold `grad_clip`, scales every gradient down by the same factor so that the combined norm equals the threshold. This ensures the gradients never grow overly large. Then we use an `AdamOptimizer` for the learning step.

```python
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments
        ---------
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Threshold for the global norm of the gradients
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    # grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    grads = tf.gradients(loss, tvars)
    clip_grads, _ = tf.clip_by_global_norm(grads , grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(clip_grads, tvars))
    
    return optimizer
```

`tf.trainable_variables()` returns a list of all the trainable variables defined in the graph.
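Clipping by global norm can be sketched in plain Python as follows. This mirrors the documented formula `grad * clip_norm / max(global_norm, clip_norm)` on nested lists rather than tensors; the toy gradients are made up for illustration.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)  # 1.0 when already small enough
    return [[g * scale for g in grad] for grad in grads], global_norm

# Two toy "variables" with gradients; the global norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
print(norm)     # 13.0
print(clipped)  # [[1.5, 2.0], [6.0]]
```

Because every gradient is scaled by the same factor, the direction of the overall update is preserved; only its length is capped.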

## 5. Validation accuracy

```python
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_) 
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
```
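The same computation in plain Python, assuming (as the rounding implies) that each prediction is a probability of the positive class and each label is 0 or 1. The sample values are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels after rounding to 0 or 1."""
    correct = [int(round(p)) == y for p, y in zip(predictions, labels)]
    return sum(correct) / len(correct)

predictions = [0.9, 0.2, 0.6, 0.4]  # hypothetical probabilities
labels = [1, 0, 0, 0]
print(accuracy(predictions, labels))  # 0.75
```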

## 6. Batching

This is a simple function for returning batches from data. 
* First it removes data such that we only have full batches. 
* Then it iterates through the `x` and `y` arrays and returns slices out of those arrays with size `[batch_size]`.

```python
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    
    # Only get full batches
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    
    # returns slices with size [batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
```
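A quick usage sketch on toy data shows how the leftover examples are dropped. The generator is repeated here only so that the snippet runs on its own; the toy `x` and `y` are made up.

```python
def get_batches(x, y, batch_size=100):
    """Same generator as above, repeated so this snippet is self-contained."""
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

x = list(range(10))      # 10 toy examples
y = [v % 2 for v in x]   # toy labels

batches = list(get_batches(x, y, batch_size=3))
print(len(batches))      # 3
print(batches[-1])       # ([6, 7, 8], [0, 1, 0])
```

With 10 examples and a batch size of 3, the last example is discarded so that every batch is full.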

## Training


**Below is the typical training pseudocode:**

```
create a session for the specified graph
{
    initialize all global variables

    iterate epochs
    {
        iterate batches
        {
            create graph input map for current run

            evaluate the graph for current run with input map, and
            return results for specified tensors or operations

            (every certain # of iterations) print training progress information, such as loss and accuracy

            (every certain # of iterations) validate the graph while training
        }
    }
}
```

**Python code:**

```python
epochs = 10

with graph.as_default():
    saver = tf.train.Saver()

# create a session for the specified graph
with tf.Session(graph=graph) as sess:
    
    # initialize all global variables
    sess.run(tf.global_variables_initializer())
    iteration = 1
    
    # iterate epochs
    for e in range(epochs):
        
        # evaluates the initial_state tensor and returns the results.
        state = sess.run(initial_state)
        
        # iterate batches
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            
            # create graph input map
            feed = {inputs_: x,
                    labels_: y[:, None],   # change the shape of y to [batch_size, 1]
                    keep_prob: 0.5,
                    initial_state: state}
            
            # evaluate the graph for one run with input map, and
            # return results for specified tensors or operations
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            # print training process information
            if iteration%5==0:
                print("Epoch: {}/{}".format(e + 1, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            # print validation information while training
            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
                        
    saver.save(sess, "checkpoints/sentiment.ckpt")
```

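About `initial_state`: it is evaluated once at the start of every epoch (giving all zeros when it is built with `cell.zero_state`). Within an epoch, the `final_state` returned for one batch is fed back in as the `initial_state` of the next batch, so the state carries over between consecutive batches.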

## Testing

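The testing process is similar to training, except that:
* There is no need to iterate over epochs or run validation; a single pass over the test set is enough.
* Only the tensors needed for evaluation (here `accuracy` and `final_state`) are fetched in each run; the optimizer is not run, and `keep_prob` is set to 1 to disable dropout.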


**Python code:**

```python
test_acc = []
with tf.Session(graph=graph) as sess:
    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Same values as sess.run(initial_state) when initial_state was built
    # with cell.zero_state(...): both give an all-zero state
    test_state = sess.run(cell.zero_state(batch_size, tf.float32))
    
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        
        test_acc.append(batch_acc)
        
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))
```

Here is [a tutorial on building RNNs](https://www.tensorflow.org/tutorials/recurrent) that will help you out.

The cell below is a scratch experiment with a queue-based input pipeline that produces variable-length slices:

```python
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

# value = np.array([[1], [0, 1], [0, 1, 2],[0, 1, 2],[0, 1, 2],[0, 1, 2],[0, 1, 2],[0, 1, 2],[0, 1, 2]])
# print(value)
# y = tf.placeholder(tf.int32, shape=[None, None], name='y')

# A tensor holding [1, 2, ..., 9]
x = tf.range(1, 10, name="x")

# A queue that outputs 0, 1, 2, 3, 4
range_q = tf.train.range_input_producer(limit=5, shuffle=False)
slice_end = range_q.dequeue()

# Slice x to variable length, i.e. [], [1], [1, 2], ...
y = tf.slice(x, [0], [slice_end], name="y")

# Batch the variable length tensor with dynamic padding
# batch_size = 3
# batched_data = tf.train.batch(
#     tensors=[y],
#     batch_size=batch_size,
#     dynamic_pad=True,
#     name="y_batch"
# )

# min_after_dequeue = 10000
# capacity = min_after_dequeue + 3 * batch_size
# batched_data = tf.train.shuffle_batch(
#       [y], batch_size=batch_size, capacity=capacity,
#       min_after_dequeue=min_after_dequeue)

# print(batched_data)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Queue-based inputs block until the queue runners are started
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    batched_data_ = sess.run(y, feed_dict=None)
    print("here: ", batched_data_)

    coord.request_stop()
    coord.join(threads)

# Run the graph
# tf.contrib.learn takes care of starting the queues for us
# res = tf.contrib.learn.run_n({"y": batched_data}, n=1, feed_dict=None)

# Print the result
# print("Batch shape: {}".format(res[0]["y"].shape))
# print(res[0]["y"])
```

The next cell tries to batch sequences of different lengths directly from a nested Python list:

```python
import tensorflow as tf
import numpy as np

N_EXAMPLES = 100
BATCH_SIZE = 4
N_EPOCHS = 3
tf.reset_default_graph()

steps_per_epoch = N_EXAMPLES / BATCH_SIZE

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # The rows have different lengths, so this nested list is ragged and
    # cannot be converted to a dense tensor; this call raises the error below.
    a = tf.convert_to_tensor([[1],
                              [0, 1],
                              [0, 1, 2],
                              [0, 1, 2],
                              [0, 1, 2],
                              [0, 1, 2],
                              [0, 1, 2],
                              [0, 1, 2],
                              [0, 1, 2]])

    aa = tf.train.slice_input_producer([a], shuffle=True, seed=1, num_epochs=N_EPOCHS)
    batch = tf.train.batch([aa], batch_size=BATCH_SIZE, dynamic_pad=True)

    aa_, batch_ = sess.run([aa, batch])

    print(aa_)
    print(batch_)
```

Output:

```
ValueError: Argument must be a dense tensor: [[1], [0, 1], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2]] - got shape [9], but wanted [9, 1].
```