# LSTMs for sequential data analysis

## Introduction
This tutorial gives an introduction to Long Short-Term Memory networks (LSTMs), a powerful and widely used model for recognizing patterns in sequences of data, such as text, handwriting, speech, or numerical time series data from sensors, stock markets, etc.
LSTMs, a variant of recurrent neural networks (RNNs), were introduced by [Hochreiter and Schmidhuber (1997)](http://www.bioinf.jku.at/publications/older/2604.pdf). In case some students haven't learned about artificial neural networks, I'll start with a brief introduction to conventional feedforward neural networks. Then recurrent neural networks and their limitations are introduced, which provides the motivation for LSTMs. After that we will look into the structure of LSTMs. Finally, the application of LSTMs is illustrated by a text generator built with the Python deep learning package Keras.

### Table of contents
- [Background knowledge](#Background-knowledge)
    * [Feedforward neural networks](#Feedforward-neural-networks)
    * [Recurrent neural networks](#Recurrent-neural-networks)
- [Motivation for LSTMs](#Motivation-for-LSTMs)
- [Architecture of LSTMs](#Architecture-of-LSTMs)
- [Application - text generator with LSTMs](#Text-generator-with-LSTMs)


## Background knowledge
- ### Feedforward neural networks
<br/> For those who know little about artificial neural networks, here's a brief introduction to conventional feedforward neural networks. A feedforward neural network is a computational model that approximates a possibly complicated mapping $y=f(x;\theta)$ from some input $x$ to the output $y$, where $\theta$ denotes the parameters to learn. It typically comprises many computational units, often called neurons or nodes, connected in a chain. Usually a node computes the weighted sum of its inputs $x$ with the weights $w$ and passes the result through an activation function $a$ to introduce non-linearity into the output. The architecture of a simple feedforward network is shown in the diagram below.
<img src="https://cdn-images-1.medium.com/max/479/1*QVIyc5HnGDWTNX3m-nIm9w.png" style="width:400px;"/>  
The model is called __feedforward__ because its structure is a directed acyclic graph in which information flows from the input $x$ through the intermediate hidden layers to the output $y$, with no feedback connections. The outputs of the last layer are ideally an approximation to the corresponding label for each training example, but the outputs of the intermediate layers are not specified by the training data, so they are called __hidden layers__.  
For a more detailed description of feedforward neural networks, including the training algorithm, activation functions and so on, [chapter 6 of Goodfellow, Bengio, Courville (2016)](http://www.deeplearningbook.org/contents/mlp.html) is a great resource.
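
The forward pass described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the random weights and the helper names `sigmoid` and `feedforward` are assumptions made for the example, and no training is performed.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, squashing each value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, layers):
    """Propagate x through a list of (W, b) pairs, one pair per layer."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)  # weighted sum, then non-linearity
    return a

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]
y = feedforward(np.array([0.5, -1.0, 2.0]), layers)
print(y.shape)  # (2,)
```

Information only moves from one layer to the next inside the loop, which is what makes the network feedforward.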

- ### Recurrent neural networks  
<br/> Feedforward neural networks have no notion of order in time, and all training examples are considered independent. This becomes a shortcoming when handling sequential data. For instance, if you want to predict a stock price, it depends not only on the factors that influence the price today, such as negative news about the company, but also on the price of the previous day. Recurrent neural networks (RNNs) handle the influence of previous time steps by introducing 'memory' into the network, that is, by adding a feedback loop connected to their previous predictions. So in RNNs, the present input and the information passed on from the past together determine the output. A chunk of an RNN looks like this:  
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/12/05231650/rnn-neuron.png" style="width: 150px"/>  
It can be unfolded into copies of the same neural network connected along the time sequence. The animation below illustrates the mechanism of RNNs, where each column represents a feedforward network at that point in time. The size of the unfolded graph depends on the sequence length. 
<img src="https://i.imgur.com/kpZBDfV.gif" style="width: 500px"/>

With the unfolded recurrence, the value of the hidden units after $t$ steps can be represented as:
$$h^{(t)}=g^{(t)}(x^{(t)}, x^{(t-1)},...x^{(1)})=f(h^{(t-1)}, x^{(t)}, \theta) \tag {1}$$
Here we take advantage of sharing parameters across different parts of the network, so it's possible to learn a single transition function $f$ that is applied at every time step (rather than a separate function $g^{(t)}$ for each time step). Thus the model generalizes to examples of different lengths and can be trained with far fewer training examples.  
It should be noted that besides the architecture depicted in the animation above (recurrent connections from the output of the previous time step to the hidden units of the current time step), RNNs can also have recurrent connections only between hidden units at different time steps. It's also possible to have only a single output at the end of the sequence.
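
To make the parameter sharing concrete, here is a minimal NumPy sketch of the unfolded recurrence with hidden-to-hidden connections. The tanh activation, the sizes and the function name `rnn_forward` are assumptions chosen for illustration; the point is only that the same `W`, `U` and `b` are reused at every time step, so sequences of different lengths can be processed by one set of parameters.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    """Apply h_t = tanh(W h_{t-1} + U x_t + b) for every x_t in xs."""
    h = h0
    states = []
    for x in xs:  # the same W, U, b are reused at every step
        h = np.tanh(W @ h + U @ x + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3  # hypothetical sizes
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
U = rng.normal(scale=0.5, size=(n_hidden, n_input))
b = np.zeros(n_hidden)
h0 = np.zeros(n_hidden)

# Two sequences of different length share one set of parameters
short = rnn_forward(rng.normal(size=(5, n_input)), W, U, b, h0)
long = rnn_forward(rng.normal(size=(12, n_input)), W, U, b, h0)
print(short.shape, long.shape)  # (5, 4) (12, 4)
```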

## Motivation for LSTMs

When neural networks become very deep, a problem arises during optimization: vanishing or exploding gradients. This is particularly true for RNNs, since they apply the __same__ function repeatedly. If we ignore the activation function, the recurrence of an RNN is approximately repeated matrix multiplication. When the recurrence spans $t$ time steps, the state of the network can be simplified as $$h^{(t)}=(W^t)^Th^{(0)}$$
Suppose $W$ has the eigendecomposition $W=V\,\mathrm{diag}(\lambda)V^{-1}$; then $W^t=V\,\mathrm{diag}(\lambda)^tV^{-1}$. The eigenvalues $\lambda$ are raised to the power of $t$, so they quickly explode if their magnitude is greater than 1 or vanish if it is less than 1. In practice, exploding gradients occur less often and are relatively easy to handle, for example by truncating or squashing the gradient, so vanishing gradients are the knottier problem.  
For RNNs, long-term dependencies make gradient-based optimization extremely difficult even for sequences of only length 10 or 20 ([Bengio et al. (1994)](http://ai.dinfo.unifi.it/paolo/ps/tnn-94-gradient.pdf)). For instance, if we want to predict the next word in the sentence "*The moon moves round the \_\_*", the problem can be solved effectively by RNNs. But consider this example: "*He moved to France 20 years ago with his parents. Now he can speak fluent \_\_*". The information we need to predict the word is far away, so plain RNNs struggle to learn the connection.
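
The effect of raising the eigenvalues to the power of $t$ can be checked numerically. The sketch below uses a hypothetical $2\times 2$ matrix built from hand-picked eigenvalues 1.1 and 0.9; this particular $W$ happens to be symmetric, so the transpose in the formula above makes no difference here.

```python
import numpy as np

# Hypothetical example: choose the eigenvalues, then build W = V diag(lam) V^{-1}
lam = np.array([1.1, 0.9])
V = np.array([[1.0, 1.0],
              [1.0, -1.0]])  # columns are the eigenvectors
W = V @ np.diag(lam) @ np.linalg.inv(V)

t = 50
W_t = np.linalg.matrix_power(W, t)

# A state along an eigenvector is scaled by the eigenvalue to the power t
grow = np.linalg.norm(W_t @ V[:, 0]) / np.linalg.norm(V[:, 0])
shrink = np.linalg.norm(W_t @ V[:, 1]) / np.linalg.norm(V[:, 1])
print(grow, shrink)  # roughly 117 and 0.005
```

After only 50 steps the two directions differ by more than four orders of magnitude, even though both eigenvalues are close to 1.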

LSTMs are one of the most successful approaches to the challenge of long-term dependencies. They have proven effective in many applications such as speech recognition, machine translation and parsing. The core idea of LSTMs is to introduce self-loops, called gated cells, that hold information outside the normal flow of the RNN, thus keeping the error signal more constant as it is propagated. The gates in the cell filter the signals they receive and decide which information to read in, to remove or to output from the cell. Similar to hidden units in ordinary neural networks, the gated cells learn their own set of weights through gradient descent to control the flow of information.

## Architecture of LSTMs
A typical LSTM structure is depicted in the diagram below.  
<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" style="width: 500px"/>  

Let's take one repeating module of an LSTM and look at its structure. The module contains a forget gate, an input gate and an output gate, which improve on the information control of plain RNNs.  
Recall the transition function of plain RNNs in equation $(1)$. Assuming the activation function is a sigmoid, we can write the equation as:
$$h^{(t)}=\sigma(Wh^{(t-1)}+Ux^{(t)}+b)$$
where $W$, $U$ and $b$ are the recurrent weights, input weights and bias respectively.  
Let's return to the text prediction problem mentioned above. For instance, if the subject changes in a new sentence, the information about the previous subject might need to be removed. This is made possible by the forget gate. 
<img src="https://raw.githubusercontent.com/wanniz/markdown/master/LSTM3-forget.png" style="width: 300px"/>

The forget gate looks at the previous state and the current input, and the __sigmoid__ unit in it keeps each output value between 0 and 1, where 0 means forget completely and 1 means keep completely. The output of the forget gate is $f^{(t)}=\sigma(W^fh^{(t-1)}+U^fx^{(t)}+b^f)$.
Next, the input gate decides which information to add to the cell state. For example, some words in the sentence are not important for the text prediction problem, so we don't need to store them.
<img src="https://raw.githubusercontent.com/wanniz/markdown/master/LSTM3-input.png" style="width: 300px"/>

Similarly, the __sigmoid__ unit in the input gate regulates which values need to be added, $i^{(t)}=\sigma(W^ih^{(t-1)}+U^ix^{(t)}+b^i)$. The __tanh__ unit creates a candidate vector $\tilde{C}^{(t)}=\tanh(W^Ch^{(t-1)}+U^Cx^{(t)}+b^C)$, containing all possible values that could be added to the state. The element-wise product of the two is then added to the cell state. The new cell state $C^{(t)}$ thus combines the outputs of the forget gate and the input gate: $C^{(t)}=f^{(t)}*C^{(t-1)}+i^{(t)}*\tilde{C}^{(t)}$.
Finally, the output gate is responsible for deciding what information to export. In our text prediction problem, this means deciding what type of word is apt to fill in the blank.
<img src="https://raw.githubusercontent.com/wanniz/markdown/master/LSTM3-output.png" style="width: 300px"/>

The output gate works similarly to the input gate: a __tanh__ unit squashes the cell state into a vector, and the __sigmoid__ unit filters what to pass through: $o^{(t)}=\sigma(W^oh^{(t-1)}+U^ox^{(t)}+b^o)$; $h^{(t)}=o^{(t)}*\tanh(C^{(t)})$.
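
The three gates can be put together in a NumPy sketch of a single LSTM time step. This is an illustrative sketch, not how Keras implements the layer: the function name `lstm_step`, the parameter dictionary `p`, the sizes and the random weights are all assumptions. Note that the cell state update mixes the previous cell state (scaled by the forget gate) with the candidate (scaled by the input gate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM time step; p maps names such as 'Wf' to weight arrays."""
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])        # forget gate
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])        # input gate
    C_tilde = np.tanh(p["WC"] @ h_prev + p["UC"] @ x + p["bC"])  # candidate values
    C = f * C_prev + i * C_tilde                                 # new cell state
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])        # output gate
    h = o * np.tanh(C)                                           # new hidden state
    return h, C

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3  # hypothetical sizes
p = {}
for g in "fiCo":  # one set of weights per gate, plus the candidate
    p["W" + g] = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    p["U" + g] = rng.normal(scale=0.5, size=(n_hidden, n_input))
    p["b" + g] = np.zeros(n_hidden)

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(6, n_input)):  # run six time steps
    h, C = lstm_step(x, h, C, p)
print(h.shape, C.shape)  # (4,) (4,)
```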

## Text generator with LSTMs

After understanding the mechanism of LSTMs, let's use them to build a text generator (like the one in homework 3). For this application, we need the deep learning package Keras. You can install it with `pip`:

    $ pip install keras

For the text to analyze, we can download books whose copyright is no longer protected from [Project Gutenberg](https://www.gutenberg.org/ebooks/search/%3Fsort_order%3Ddownloads).
Here we chose Alice’s Adventures in Wonderland.

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint

Using TensorFlow backend.


Now we read the text file and do some simple processing.  
In this example, we will generate the text character by character, so we create a vocabulary list that stores all the unique characters appearing in the text.

In [9]:
with open("Alice.txt", encoding="utf8") as f:
    raw_text = f.read()
raw_text = raw_text.replace("\n","")
Voc = list(set(raw_text))
print(Voc)

[' ', '3', 'm', ']', 'l', 'R', 'F', ':', 'G', 'C', 'x', 'S', '\ufeff', 'w', '“', '.', '(', 'Y', 'y', ';', 'D', 'j', 'I', 't', 'h', 'd', '_', 'L', '!', 'A', 'N', 'n', '?', 'i', '[', 's', 'K', 'H', 'q', 'g', 'u', 'z', '’', 'V', 'b', 'O', 'Z', 'J', 'a', '‘', '”', 'W', '-', 'f', 'E', 'p', '*', ')', 'e', 'r', 'c', 'P', 'X', 'M', 'v', 'k', '0', 'T', 'B', 'Q', ',', 'o', 'U']


Remove the character '\ufeff' that appears in the text: it is a byte order mark left over from the file encoding, not part of the book.

In [10]:
raw_text = raw_text.replace("\ufeff","")
Voc = list(set(raw_text))
print(Voc)

[' ', '3', 'm', ']', 'l', 'R', 'F', ':', 'G', 'C', 'x', 'S', '“', 'w', '.', '(', 'Y', 'y', ';', 'D', 'j', 'I', 't', 'h', 'd', '_', 'L', '!', 'A', 'N', 'n', '?', 'i', '[', 's', 'K', 'H', 'q', 'g', 'u', 'z', '’', 'V', 'b', 'O', 'Z', 'J', 'a', '‘', '”', 'W', '-', 'f', 'E', 'p', '*', ')', 'e', 'r', 'c', 'P', 'X', 'M', 'v', 'k', '0', 'T', 'B', 'Q', ',', 'o', 'U']


Let's see the size of our vocabulary and the text.

In [11]:
V_length = len(Voc)
T_length = len(raw_text)
print(V_length, T_length)

72 141097


We create two dictionaries to look up the index of a character in the vocabulary list and vice versa. This is helpful when we encode the characters to create features.

In [12]:
ch_idx = {c:i for i, c in enumerate(Voc)}
idx_ch = {i:c for i, c in enumerate(Voc)}
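
As a quick check of how the two dictionaries work together, the sketch below encodes and decodes a word with a toy vocabulary. The `toy_` names and the example string are made up for illustration. Unlike the cell above, it sorts the vocabulary, which is an optional variation: the iteration order of a `set` of strings can differ between Python sessions, and a sorted list keeps the character-to-index mapping reproducible.

```python
# Toy vocabulary; sorted() makes the index assignment reproducible between runs
toy_voc = sorted(set("hello world"))
toy_ch_idx = {c: i for i, c in enumerate(toy_voc)}
toy_idx_ch = {i: c for i, c in enumerate(toy_voc)}

encoded = [toy_ch_idx[c] for c in "hello"]          # characters -> indices
decoded = "".join(toy_idx_ch[i] for i in encoded)   # indices -> characters
print(encoded)  # [3, 2, 4, 4, 5]
print(decoded)  # hello
```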

We set the sequence length to 100, which means we want to predict the 101st character given the previous 100 characters. So we segment the text into sequences of length 100 with a step of 1. The output sequences are just the corresponding input sequences shifted by one character.

In [13]:
S_length = 100
seq_in = []
seq_out = []
for i in range(0, T_length-S_length, 1):
    seq_in.append(raw_text[i:i+S_length])
    seq_out.append(raw_text[i+1:i+S_length+1])
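
The same segmentation is easier to inspect on a short string. In this sketch `toy_text` and `toy_len` are hypothetical stand-ins for `raw_text` and `S_length`; each output window is the input window shifted by one character, and the number of windows equals the text length minus the window length.

```python
toy_text = "abcdefgh"
toy_len = 3  # hypothetical short window, standing in for S_length
toy_in, toy_out = [], []
for i in range(len(toy_text) - toy_len):
    toy_in.append(toy_text[i:i + toy_len])
    toy_out.append(toy_text[i + 1:i + toy_len + 1])

print(toy_in)   # ['abc', 'bcd', 'cde', 'def', 'efg']
print(toy_out)  # ['bcd', 'cde', 'def', 'efg', 'fgh']
```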

In [14]:
n_sample = len(seq_in)
n_sample

140997

We get 140997 training examples in total. Next we need to one-hot encode the sequences. (The rest of the code was run in the cloud because of the large amount of memory required.)

In [None]:
S_in = np.zeros((n_sample, S_length, V_length))
S_out = np.zeros((n_sample, S_length, V_length))
for i in range(n_sample):
    M_in = np.zeros((S_length, V_length))
    M_out = np.zeros((S_length, V_length))
    for j in range(S_length):
        M_in[j][ch_idx[seq_in[i][j]]] = 1
        M_out[j][ch_idx[seq_out[i][j]]] = 1
    S_in[i] = M_in
    S_out[i] = M_out
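
The memory requirement mentioned above can be estimated with simple arithmetic. `np.zeros` creates `float64` arrays by default, so each value takes 8 bytes; the sketch below compares that with two smaller element types. Passing a `dtype` argument to `np.zeros` is one possible way to shrink the arrays, offered here as a suggestion rather than something the original run did.

```python
n_sample, S_length, V_length = 140997, 100, 72  # values reported above
n_elements = n_sample * S_length * V_length

bytes_per_value = {"float64": 8, "float32": 4, "bool": 1}  # NumPy item sizes
for name, size in bytes_per_value.items():
    print(f"{name}: {n_elements * size / 1e9:.2f} GB per array")
# float64: 8.12 GB per array
# float32: 4.06 GB per array
# bool: 1.02 GB per array
```

Both `S_in` and `S_out` have this shape, so the totals are twice these figures.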

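Define the LSTM model: one LSTM layer with dropout, followed by a dense softmax layer that outputs a distribution over the vocabulary at every time step. Create checkpoints to save the weights whenever the training loss improves.

In [None]:
model = Sequential()
model.add(LSTM(256, input_shape=(None, V_length), return_sequences=True))
model.add(Dropout(0.2))
# Map each 256-dimensional LSTM output to a distribution over the vocabulary,
# so the output shape matches the one-hot targets in S_out
model.add(Dense(V_length, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
# Save the weights whenever the training loss reaches a new minimum
filepath = "weights-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]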

model.fit(S_in, S_out, epochs=20, batch_size=64, callbacks=callbacks_list)

Load the weights with the minimum loss (the file name depends on your own training run).

In [None]:
filename = "weights-19-1.9227.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

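In [None]:
# pick a random training sequence as the seed text
start = np.random.randint(0, n_sample-1)
pattern = seq_in[start]
print(pattern, end="")
# generate characters one at a time
for i in range(1000):
    # one-hot encode the current window: shape (1, S_length, V_length)
    x = np.zeros((1, len(pattern), V_length))
    for j, ch in enumerate(pattern):
        x[0, j, ch_idx[ch]] = 1
    prediction = model.predict(x, verbose=0)
    # the model predicts a distribution at every time step; use the last one
    index = int(np.argmax(prediction[0, -1]))
    result = idx_ch[index]
    print(result, end="", flush=True)
    # slide the window forward by one character
    pattern = pattern[1:] + result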

The text generated:  

*Have no mistake about it: it was neither more nor less than a pig, and she felt that it would be quit e aelin that she was a little want oe toiet ano a grtpersent to the tas a little war th tee the tase oa teettee the had been tinhgtt a  at the cadl in a long tuiee aedun that sheer was a little tare gereen to be a gentle of the tabdit  soenee the gad  ouw ie the tay a tirt of toiet at the was a little anonersen, and thiu had been woite io a lott of tueh a tiie  and taede
bot her aeain  she cere thth the bene tith the tere bane to tee toaete to tee the harter was a little tire the same oare cade an anl ano the garee and the was so seat the was a little gareen and the sabdit, and the white rabbit wese tilel an the caoe and the sabbit se teeteer,and the white rabbit wese tilel an the cade in a lonk tfne the sabdi
ano aroing to tea the was sf teet whitg the was a little tane oo thete the sabeit  she was a little tartig to the tar tf tee the tame of the cagd, and the white rabbit was a little toiee to be anle tite thete ofsand the tabdit was the wiite rabbit, and*
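
The generated text falls into repeated phrases, which is a common side effect of always picking the most probable character. One widely used alternative, not part of the original experiment, is to sample the next character from the predicted distribution, optionally rescaled by a temperature. The sketch below is a hypothetical helper (`sample_index` is not a Keras function) that could stand in for the `argmax` call in the generation loop; the `probs` array is a stand-in for one softmax output of the model.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from probs after rescaling it by a temperature."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.log(np.asarray(probs, dtype=np.float64) + 1e-12) / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

probs = np.array([0.6, 0.3, 0.1])  # stand-in for one softmax output
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=0.8, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts / 1000)  # empirical frequencies of the three indices
```

A temperature below 1 sharpens the distribution towards the most likely character, while a temperature above 1 flattens it and produces more varied output.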