<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/jvcqp2iy2jlx2b32rmzdt0tx8lvxgzkp.png" width = 300, align = "center"></a>

# <center>RECURRENT NETWORKS IN DEEP LEARNING</center>

## The Long Short-Term Memory Model
Hello and welcome to this notebook. In this notebook, we will go over concepts of the Long Short-Term Memory (LSTM) model, a refinement of the original Recurrent Neural Network model. By the end of this notebook, you should be able to understand the Long Short-Term Memory model, the benefits and problems it solves, and its inner workings and calculations.

### The Problem to be Solved
**Long Short-Term Memory**, or **LSTM** for short, is one of the proposed solutions or upgrades to the **Recurrent Neural Network model**. The Recurrent Neural Network is a specialized type of Neural Network that solves the issue of <span style="background-color:yellow;">**maintaining context for Sequential data**</span> -- such as Weather data, Stocks, Genes, etc. At each iterative step, the processing unit takes in an input and the current state of the network, and produces an output and a new state that is **re-fed into the network**.

<img src=https://ibm.box.com/shared/static/v7p90neiaqghmpwawpiecmz9n7080m59.png width="720"/>
<center>*Representation of a Recurrent Neural Network*</center>
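
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the `tanh` update, the names `rnn_step`, `W_x`, `W_s`, and `b`, and the sizes are assumptions made for the example, not something defined in this notebook.

```python
import numpy as np

def rnn_step(x_t, state, W_x, W_s, b):
    """Vanilla recurrent step: in this sketch the new state doubles as the output."""
    new_state = np.tanh(W_x @ x_t + W_s @ state + b)
    return new_state, new_state

rng = np.random.default_rng(0)               # fixed seed, made-up values
features, hidden = 3, 5                      # assumed sizes
W_x = rng.normal(scale=0.5, size=(hidden, features))
W_s = rng.normal(scale=0.5, size=(hidden, hidden))
b = np.zeros(hidden)

sequence = rng.normal(size=(4, features))    # 4 time steps of made-up data
state = np.zeros(hidden)
for x_t in sequence:
    # The state produced at one step is re-fed into the next step.
    output, state = rnn_step(x_t, state, W_x, W_s, b)

print(output)
```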

However, this model has <span style="color:red;">**some problems**</span>.
+ It's very <span style="background-color:yellow;">computationally expensive</span> to maintain the state for a large number of units, even more so over a long period of time.
+ Additionally, Recurrent Networks are <span style="background-color:yellow;">very sensitive to changes in their parameters</span>. As such, they are prone to two problems when trained with Gradient Descent -- the gradients
  - either grow exponentially (<span style="background-color:yellow;">Exploding Gradient</span>)
  - or drop down to near zero and stabilize (<span style="background-color:yellow;">Vanishing Gradient</span>)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;both problems that greatly harm a model's learning capability.
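
Both failure modes can be illustrated with a toy calculation. This is a minimal sketch, assuming that backpropagating through each time step scales the gradient by the same scalar factor `w` (a hypothetical simplification; real networks multiply by Jacobian matrices, but the effect is analogous).

```python
def gradient_after(steps, w):
    """Gradient magnitude after backpropagating through `steps` time steps,
    assuming every step scales the gradient by the same factor `w`."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad

vanishing = gradient_after(50, 0.5)   # factor below 1: shrinks toward zero
exploding = gradient_after(50, 1.5)   # factor above 1: grows very quickly

print(f"w = 0.5 -> {vanishing:.3e}")
print(f"w = 1.5 -> {exploding:.3e}")
```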

### Long Short-Term Memory: What is it?

To solve these problems, <span style="background-color:yellow;">Hochreiter and Schmidhuber published a paper in 1997</span> describing a way to keep information over long periods of time and additionally solve the oversensitivity to parameter changes, i.e., make backpropagating through the Recurrent Networks more viable.

The Long Short-Term Memory, as it was called, was an abstraction of how computer memory works. It is "bundled" with whatever processing unit is implemented in the Recurrent Network, although outside of its flow, and is responsible for keeping, reading, and outputting information for the model. The way it works is simple: <span style="background-color:yellow;">you have a linear unit, which is the information cell itself, surrounded by three logistic gates responsible for maintaining the data. One gate is for inputting data into the information cell, one is for outputting data from the information cell, and the last one is to keep or forget data depending on the needs of the network.</span>

Thanks to that, it not only addresses the problem of keeping states, because the network can choose to forget data whenever information is not needed, but also mitigates the gradient problems, since the logistic gates have a simple, well-behaved derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.

### Long Short-Term Memory Architecture

As seen before, the Long Short-Term Memory is composed of a linear unit surrounded by three logistic gates. The names for these gates vary from place to place, but the most usual names for them are the <span style="background-color:yellow;">"Input" or "Write" Gate</span>, which handles the writing of data into the information cell, the <span style="background-color:yellow;">"Output" or "Read" Gate</span>, which handles the sending of data back onto the Recurrent Network, and the <span style="background-color:yellow;">"Keep" or "Forget" Gate</span>, which handles the maintaining and modification of the data stored in the information cell.

<img src=https://ibm.box.com/shared/static/zx10duv5egw0baw6gh2hzsgr8ex45gsg.png width="720"/>
<center>*Diagram of the Long Short-Term Memory Unit*</center>

The three gates are the centerpiece of the LSTM unit. The gates, when activated by the network, perform their respective functions. For example,
+ the Input Gate will write whatever data it is passed onto the information cell,
+ the Output Gate will return whatever data is in the information cell, and
+ the Keep Gate will maintain the data in the information cell.

These gates are analog and multiplicative, and as such, can modify the data based on the signal they are sent.
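
To make "analog and multiplicative" concrete, the sketch below multiplies some made-up cell contents element-wise by gate activations between 0 and 1. The arrays `data` and `gate` are hypothetical values chosen for illustration.

```python
import numpy as np

data = np.array([2.0, -1.0, 0.5, 4.0])   # hypothetical cell contents
gate = np.array([1.0, 0.0, 0.5, 0.25])   # gate activations, each in [0, 1]

# Element-wise product: 1 passes a value through, 0 blocks it,
# and anything in between lets only a fraction through.
gated = gate * data
print(gated)
```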

---

### Flow of Operations for LSTM Unit

For example, <span style="background-color:yellow;">a typical flow of operations for the LSTM unit</span> is as follows:

<br>

**(1) Keep Gate**

First off, the Keep Gate has to decide whether to keep or forget the data currently stored in memory. It receives both the input and the state of the Recurrent Network, and passes them through its <span style="background-color:yellow;">Sigmoid activation</span>. The result is a value between 0 and 1:
  + 1: the LSTM unit should keep the stored data perfectly, and
  + 0: it should forget it entirely.

Consider
  + $S_{t-1} = \text{previous state at time } t-1$
  + $x_t = \text{incoming input at time } t$
  + $W_k = \text{weight}$
  + $B_k = \text{bias}$
  
for the Keep Gate. Additionally, consider
  + $Old_{t-1} = \text{data previously in memory}$

What happens can be summarized by this equation:

$$
\begin{align}
  K_t &= \sigma \Big( W_k \times [S_{t-1}, \ x_t] + B_k \Big) \\
  Old_t &= K_t \times Old_{t-1}
\end{align}
$$

As you can see, $Old_{t-1}$ is multiplied by the value returned by the Keep Gate, and the result, $Old_t$, is what remains in the memory cell. Then, the input and state are passed on to the Input Gate, where another Sigmoid activation is applied.
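
The Keep Gate equations can be tried out numerically. The sketch below is illustrative only: the sizes (4 hidden units, 6 input features) and the random weights are assumptions, and the variable names simply mirror the symbols above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)           # fixed seed, illustrative values only
hidden, features = 4, 6                  # assumed sizes

S_prev = np.zeros(hidden)                # S_{t-1}: previous state
x_t = rng.normal(size=features)          # x_t: incoming input
Old_prev = rng.normal(size=hidden)       # Old_{t-1}: data previously in memory

W_k = rng.normal(size=(hidden, hidden + features))   # Keep Gate weight
B_k = np.zeros(hidden)                                # Keep Gate bias

# K_t = sigmoid(W_k x [S_{t-1}, x_t] + B_k); every entry lies between 0 and 1.
K_t = sigmoid(W_k @ np.concatenate([S_prev, x_t]) + B_k)
# Old_t = K_t x Old_{t-1}, applied element-wise.
Old_t = K_t * Old_prev

print("K_t   =", K_t)
print("Old_t =", Old_t)
```

Because each entry of `K_t` is at most 1, no entry of `Old_t` can be larger in magnitude than the corresponding entry of `Old_prev`.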

<br>

**(2) Input Gate**

Concurrently, the input is processed as normal by whatever processing unit is implemented in the network, and then multiplied by the Sigmoid activation's result, much like the Keep Gate. Consider

+ $W_i = \text{weight}$
+ $B_i = \text{bias}$
+ $C_t = \text{the result of the processing of the inputs}$
+ $New_t = \text{new data to be input into the memory cell}$ 

$$
\begin{align}
  I_t &= \sigma \Big( W_i\times[S_{t-1}, \ x_t] + B_i \Big) \\
  New_t &= I_t \times C_t
\end{align}
$$

This is then **added** to whatever value is still stored in memory.

$$Cell_t = Old_t + New_t$$

We now have the *candidate data* which is to be kept in the memory cell. The combination of the Keep and Input gates works in an analog manner, making it possible to keep part of the old data and add only part of the new data. <span style="background-color:yellow;">Consider, however, what would happen if the Keep Gate was set to 0 and the Input Gate was set to 1:</span>

$$
\begin{align}
  Old_t &= 0 \times Old_{t-1}\\
  New_t &= 1 \times C_t\\
  Cell_t &= C_t
\end{align}
$$

<span style="background-color:yellow;">The old data would be totally forgotten and the new data would overwrite it completely.</span>
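
The memory update and its extreme cases can be checked with a short sketch. The helper `update_cell` and the example vectors are hypothetical; only the formula $Cell_t = K_t \times Old_{t-1} + I_t \times C_t$ comes from the text.

```python
import numpy as np

def update_cell(K_t, I_t, Old_prev, C_t):
    """Cell_t = K_t * Old_{t-1} + I_t * C_t, all element-wise."""
    Old_t = K_t * Old_prev
    New_t = I_t * C_t
    return Old_t + New_t

Old_prev = np.array([0.8, -0.3, 0.5, 1.2])   # hypothetical memory contents
C_t = np.array([-0.1, 0.9, 0.4, -0.7])       # hypothetical processed input

# Keep Gate = 0, Input Gate = 1: the memory is overwritten by C_t.
overwrite = update_cell(np.zeros(4), np.ones(4), Old_prev, C_t)

# Keep Gate = 1, Input Gate = 0: the memory is carried over unchanged.
carry = update_cell(np.ones(4), np.zeros(4), Old_prev, C_t)

# Fractional gates: keep part of the old data and add part of the new.
blend = update_cell(np.full(4, 0.5), np.full(4, 0.5), Old_prev, C_t)

print(overwrite, carry, blend, sep="\n")
```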

<br>

**(3) Output Gate**

The Output Gate functions in a similar manner. To decide what we should output, we take the <span style="background-color:yellow;">input data</span> and <span style="background-color:yellow;">state</span> and pass them through a Sigmoid function as usual. The contents of our memory cell, however, are passed through a **`Tanh`** function to bound them between -1 and 1. Consider $W_o$ and $B_o$ as the weight and bias for the Output Gate.

$$
\begin{align}
  O_t &= \sigma \Big( W_o \times [S_{t-1}, \ x_t] + B_o \Big)\\
  Output_t &= O_t \times \tanh(Cell_t)
\end{align}
$$

And that $Output_t$ is what is output into the Recurrent Network.
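
Putting the three gates together, one full time step might look like the sketch below. It follows the equations above, with two assumptions that are not stated in the text: the "processing unit" that produces $C_t$ is taken to be a `tanh` layer with its own hypothetical parameters `W_c` and `B_c`, and all sizes and weights are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, S_prev, Cell_prev, params):
    """One LSTM time step following the Keep / Input / Output equations."""
    concat = np.concatenate([S_prev, x_t])                  # [S_{t-1}, x_t]
    K_t = sigmoid(params["W_k"] @ concat + params["B_k"])   # Keep Gate
    I_t = sigmoid(params["W_i"] @ concat + params["B_i"])   # Input Gate
    O_t = sigmoid(params["W_o"] @ concat + params["B_o"])   # Output Gate
    # Assumption: the processing unit is a tanh layer (W_c, B_c are hypothetical).
    C_t = np.tanh(params["W_c"] @ concat + params["B_c"])
    Cell_t = K_t * Cell_prev + I_t * C_t                    # Old_t + New_t
    Output_t = O_t * np.tanh(Cell_t)
    return Output_t, Cell_t

hidden, features = 4, 6                                     # assumed sizes
rng = np.random.default_rng(0)
params = {}
for gate in "kioc":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + features))
    params[f"B_{gate}"] = np.zeros(hidden)

x_t = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0])
Output_t, Cell_t = lstm_step(x_t, np.zeros(hidden), np.zeros(hidden), params)
print(Output_t.shape, Cell_t.shape)
```

To process a whole sequence, `Output_t` and `Cell_t` would be passed back in as `S_prev` and `Cell_prev` on the next step.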

<br/>
<img width="384" src="https://ibm.box.com/shared/static/rkr60528r3mz2fmtlpah8lqpg7mcsy0g.png">
<center>*The Logistic Function plotted*</Center>

As mentioned many times, <span style="background-color:yellow;">all three gates are logistic</span>. The reason is that it is <span style="background-color:yellow;">very easy to backpropagate through them</span>, and as such, it is possible for the model to learn exactly _how_ it is supposed to use this structure. This is one of the reasons why the LSTM is a very strong structure. Additionally, this helps with the gradient problems: because the inputs and outputs pass through the gates, the function modifying them is easily differentiable.
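
The reason backpropagation through a logistic gate is cheap is that its derivative has the closed form $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The sketch below compares that closed form with a numerical estimate; the helper names are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # closed form of the derivative

def numeric_grad(f, z, eps=1e-6):
    # Central finite difference, used only to cross-check the closed form.
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The derivative peaks at 0.25 for `z = 0` and shrinks toward zero as the gate saturates in either direction.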

<span style="background-color:yellow;">In regards to the problem of storing many states over a long period of time, LSTM handles this well by only keeping whatever information is necessary and forgetting it whenever it is not needed anymore.</span> Therefore, LSTMs are a very elegant solution to both problems.



### LSTM basics
Let's first create a tiny LSTM network sample to understand the architecture of LSTM networks.

We need to import the necessary modules for our code. For this example, **`numpy` and `tensorflow`** are all we need. The cells below use the TensorFlow 1.x graph API (`tf.Session`, `tf.contrib.rnn`).

If you want to learn more, take a look at https://www.tensorflow.org/versions/r0.11/api_docs/python/rnn_cell/

In [1]:
import numpy as np
import tensorflow as tf
sess = tf.Session()

We want to create a network that has only one LSTM cell. We have to pass 2 elements to the LSTM, the __`prv_output`__ and __`prv_state`__, also called __`h`__ and __`c`__. Therefore, we initialize a state vector, __`state`__. Here, __`state`__ is a tuple with 2 elements, each of size `[2 x 4]` (batch size x cell size): one for passing __`prv_output`__ to the next time step, and another for passing __`prv_state`__ to the next time step.

In [2]:
LSTM_CELL_SIZE = 4  # output size (dimension), which is same as hidden size in the cell

lstm_cell = tf.contrib.rnn.BasicLSTMCell(LSTM_CELL_SIZE, state_is_tuple = True)
state = (tf.zeros([2, LSTM_CELL_SIZE]), ) * 2
state

(<tf.Tensor 'zeros:0' shape=(2, 4) dtype=float32>,
 <tf.Tensor 'zeros:0' shape=(2, 4) dtype=float32>)

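Let's define a sample input. In this example, __`batch_size`__ = 2, and each row holds 6 input features. A single call to the cell processes one time step, so these 6 values are the input dimension rather than a sequence length: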

In [3]:
sample_input = tf.constant([[1, 2, 3, 4, 3, 2], [3, 2, 2, 2, 2, 2]], dtype = tf.float32)
print(sess.run(sample_input))

[[ 1.  2.  3.  4.  3.  2.]
 [ 3.  2.  2.  2.  2.  2.]]


Now, we can pass the input to __`lstm_cell`__, and check the new state:

In [4]:
with tf.variable_scope("LSTM_sample1"):
    output, state_new = lstm_cell(sample_input, state)
sess.run(tf.global_variables_initializer())
print(sess.run(state_new))

LSTMStateTuple(c=array([[ 0.14302756, -0.21374086,  0.80374324,  0.14653303],
       [-0.21586901,  0.26612467,  0.32489812,  0.30833322]], dtype=float32), h=array([[ 0.10993633, -0.12350125,  0.09232204,  0.06002492],
       [-0.16038321,  0.20041755,  0.04375251,  0.19928519]], dtype=float32))


As we can see, the state has 2 parts: the new cell state, `c`, and the output, `h`. Let's check the output again:

In [5]:
print(sess.run(output))

[[ 0.10993633 -0.12350125  0.09232204  0.06002492]
 [-0.16038321  0.20041755  0.04375251  0.19928519]]
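
For a sense of scale, the number of trainable weights in the single cell above can be estimated by hand. This sketch assumes the cell stores one fused kernel acting on the concatenated input and previous output, with four blocks of `num_units` columns (three gates plus the candidate), and one fused bias; `lstm_param_count` is a hypothetical helper, not a TensorFlow function.

```python
def lstm_param_count(input_dim, num_units):
    """Trainable parameters of one LSTM cell with a fused kernel and bias."""
    kernel = (input_dim + num_units) * 4 * num_units   # weights on [x_t, h_{t-1}]
    bias = 4 * num_units                               # one bias per gate unit
    return kernel + bias

# 6 input features and LSTM_CELL_SIZE = 4, as in the example above.
print(lstm_param_count(input_dim=6, num_units=4))      # 176
```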


### Stacked LSTM basics
What if we want an RNN with stacked LSTMs, for example a 2-layer LSTM? In this case, the output of the first layer becomes the input of the second.

Let's start with a new session:

In [6]:
sess = tf.Session()

In [7]:
LSTM_CELL_SIZE = 4  # 4 hidden nodes = state_dim = output_dim
input_dim = 6
num_layers = 2

Let's create the stacked LSTM cell:

In [8]:
cells = []
for _ in range(num_layers):
    cell = tf.contrib.rnn.LSTMCell(LSTM_CELL_SIZE)
    cells.append(cell)
stacked_lstm = tf.contrib.rnn.MultiRNNCell(cells)

Now we can create the RNN:

In [9]:
# Batch size x time steps x features.
data = tf.placeholder(tf.float32, [None, None, input_dim])
# Pass the stacked cell; `cell` alone would be only the last single layer.
output, state = tf.nn.dynamic_rnn(stacked_lstm, data, dtype=tf.float32)

Let's say the input sequence length is 3 and the dimensionality of the inputs is 6. The input should be a Tensor of shape [batch_size, max_time, dimension]; in our case, that is (2, 3, 6).

In [10]:
#Batch size x time steps x features.
sample_input = [[[1,2,3,4,3,2], [1,2,1,1,1,2],[1,2,2,2,2,2]],[[1,2,3,4,3,2],[3,2,2,1,1,2],[0,0,0,0,3,2]]]
sample_input

[[[1, 2, 3, 4, 3, 2], [1, 2, 1, 1, 1, 2], [1, 2, 2, 2, 2, 2]],
 [[1, 2, 3, 4, 3, 2], [3, 2, 2, 1, 1, 2], [0, 0, 0, 0, 3, 2]]]

We can now send our input to the network:

In [11]:
sess.run(tf.global_variables_initializer())
sess.run(output, feed_dict={data: sample_input})

array([[[  6.41652243e-03,   3.51893269e-02,   5.13511360e-01,
          -6.32265396e-03],
        [  5.39605448e-04,   6.89522624e-02,   1.09963939e-01,
          -2.38909069e-02],
        [ -2.38273684e-02,   9.56620127e-02,   1.45685837e-01,
          -1.46623598e-02]],

       [[  6.41652243e-03,   3.51893269e-02,   5.13511360e-01,
          -6.32265396e-03],
        [ -2.77208108e-02,  -4.86584008e-03,   5.58008373e-01,
          -5.13376966e-02],
        [ -5.06761037e-02,   1.03659391e-01,   2.06940784e-03,
          -6.65227370e-03]]], dtype=float32)
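
The shapes flowing through a stacked LSTM can also be traced without TensorFlow. The NumPy sketch below is not the TensorFlow implementation: the helpers `init_layer` and `lstm_layer`, the fused weight layout, and the random initialization are assumptions made for illustration, so the numbers will differ from the output above while the shapes match.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(input_dim, hidden, rng):
    """One fused weight matrix and bias per layer (hypothetical layout)."""
    W = rng.normal(scale=0.1, size=(input_dim + hidden, 4 * hidden))
    b = np.zeros(4 * hidden)
    return W, b

def lstm_layer(inputs, W, b, hidden):
    """Run one LSTM layer over inputs of shape (batch, time, features)."""
    batch, steps, _ = inputs.shape
    h = np.zeros((batch, hidden))
    c = np.zeros((batch, hidden))
    outputs = []
    for t in range(steps):
        z = np.concatenate([h, inputs[:, t, :]], axis=1) @ W + b
        k, i, o, g = np.split(z, 4, axis=1)      # keep, input, output, candidate
        c = sigmoid(k) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs, axis=1)             # (batch, time, hidden)

rng = np.random.default_rng(0)
hidden, num_layers = 4, 2
sample_input = np.array(
    [[[1, 2, 3, 4, 3, 2], [1, 2, 1, 1, 1, 2], [1, 2, 2, 2, 2, 2]],
     [[1, 2, 3, 4, 3, 2], [3, 2, 2, 1, 1, 2], [0, 0, 0, 0, 3, 2]]],
    dtype=np.float32,
)

layer_input = sample_input
for layer in range(num_layers):
    W, b = init_layer(layer_input.shape[-1], hidden, rng)
    layer_input = lstm_layer(layer_input, W, b, hidden)   # feeds the next layer
    print(f"layer {layer + 1} output shape: {layer_input.shape}")
```

Both sequences in the batch start with the same vector and the same zero state, so the first time step of each output row is identical, just as in the TensorFlow output above.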

---

# Function Reference

| Function | Description |
|:---------|:------------|
|[**`tf.argmax`**](https://www.tensorflow.org/api_docs/python/tf/argmax)|Returns the index with the largest value across axes of a tensor.<br> Note that in case of ties the identity of the return value is not guaranteed.|
|[**`tf.cast`**](https://www.tensorflow.org/api_docs/python/tf/cast)|Casts a tensor to a new type.|
|[**`tf.contrib.rnn.BasicLSTMCell`**](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell)|Basic LSTM recurrent network cell.<br>The implementation is based on: http://arxiv.org/abs/1409.2329.|
|[**`tf.equal`**](https://www.tensorflow.org/api_docs/python/tf/equal)|Returns the truth value of (x == y) element-wise.|
|[**`tf.nn.conv2d`**](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d)|Computes a 2-D convolution given 4-D input and filter tensors.|
|[**`tf.nn.max_pool`**](https://www.tensorflow.org/api_docs/python/tf/nn/max_pool)|Performs the max pooling on the input.|
|[**`tf.nn.dropout`**](https://www.tensorflow.org/api_docs/python/tf/nn/dropout)|Computes dropout.|
|[**`tf.nn.softmax`**](https://www.tensorflow.org/api_docs/python/tf/nn/softmax)|Computes softmax activations.|
|[**`tf.random_normal`**](https://www.tensorflow.org/api_docs/python/tf/random_normal)|Outputs random values from a normal distribution.|
|[**`tf.reshape`**](https://www.tensorflow.org/api_docs/python/tf/reshape)|Given `tensor`, this operation returns a `tensor` that has the same values as tensor with shape `shape`.|
|[**`tf.train.AdamOptimizer`**](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)|Optimizer that implements the Adam algorithm.|
|[**`tf.truncated_normal`**](https://www.tensorflow.org/api_docs/python/tf/truncated_normal)|Outputs random values from a truncated normal distribution.|
|[**`scipy.signal.convolve2d`**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html)|Convolve two 2-dimensional arrays.|
|[**`numpy.absolute`**](https://docs.scipy.org/doc/numpy/reference/generated/numpy.absolute.html)|Calculate the absolute value _element-wise_.|
|[**`numpy.expand_dims`**](https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.expand_dims.html)|Expand the shape of an array. Insert a new axis, corresponding to a given position in the array shape.|
|[**`matplotlib.cm`**](https://matplotlib.org/api/cm_api.html#module-matplotlib.cm)|This module provides a large set of colormaps, functions for registering new colormaps and for getting<br>a colormap by name, and a mixin class for adding color mapping functionality.|
|[**`matplotlib.pyplot.subplots`**](https://matplotlib.org/devdocs/api/_as_gen/matplotlib.pyplot.subplots.html)|Create a figure and a set of subplots.|

---

## Want to learn more?

Running deep learning programs usually needs a high performance platform. PowerAI speeds up deep learning and AI. Built on IBM's Power Systems, PowerAI is a scalable software platform that accelerates deep learning and AI with blazing performance for individual users or enterprises. The PowerAI platform supports popular machine learning libraries and dependencies including Tensorflow, Caffe, Torch, and Theano. You can download a [free version of PowerAI](https://cocl.us/ML0120EN_PAI).

Also, you can use Data Science Experience to run these notebooks faster with bigger datasets. Data Science Experience is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, DSX enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of DSX users today with a free account at [Data Science Experience](https://cocl.us/ML0120EN_DSX).

This is the end of this lesson. Hopefully, you now have a deeper and more intuitive understanding of the LSTM model. Thank you for reading this notebook, and good luck with your studies.

### Thanks for completing this lesson!

Notebook created by: <a href="https://br.linkedin.com/in/walter-gomes-de-amorim-junior-624726121">Walter Gomes de Amorim Junior</a>, <a href="https://linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>

<hr>
Copyright &copy; 2017 [IBM Cognitive Class](https://cognitiveclass.ai/?utm_source=ML0151&utm_medium=lab&utm_campaign=cclab). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).