## Recurrent Neural Networks (RNNs)
### and Long Short-Term Memory (LSTM)

In [None]:
import keras
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

%matplotlib inline

The neural networks we've talked about so far operate all at once. You have an input data point (or points) and the network, and the values propagate over the connections from the input layer to the output.

In this lesson we'll discuss **recurrent neural networks** (RNNs), which take sequences such as time-series data as input. The state of the network at a given point in time is based not just on the input layer at that time, but also on the state of the network at the previous point in time.

If a simple neural network looks like

<img src="img/simple-mlp.png" width=300>

A recurrent network looks like


<img src="img/rnn.png" width=300>

Where there's a connection from the hidden layer back to itself.

We can unroll this in time, showing each row as a separate time step.

<img src="img/rnn-unrolled-labeled.png" width=350>

Note that there are only three sets of weights: the vertical arrows, the horizontal arrows on the left, and the horizontal arrows on the right. The same weights are reused at every time step. To train this model you fit those three sets of weights using the inputs $x_0$, $x_1$, $x_2$, and $x_3$ so as to minimize the loss between the output $\hat{x}_4$ and the label $x_4$.
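
As a rough illustration of the unrolled picture, here is a minimal NumPy sketch of the forward pass of a single-layer RNN. The names `W_xh`, `W_hh`, and `W_hy` (input-to-hidden, hidden-to-hidden, hidden-to-output) are labels chosen for this sketch, the `tanh` activation is an assumption, and the weights are random rather than trained.

```python
import numpy as np

def simple_rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run one sequence of shape (n_steps, n_features) through a single-layer RNN."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    for x_t in x_seq:
        # the new state depends on the current input and the previous state;
        # the same weights are reused at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    # only the last hidden state is used to predict the next value
    return W_hy @ h + b_y

rng = np.random.default_rng(0)
n_hidden, n_features = 8, 1
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_features))  # input -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))    # hidden -> hidden
W_hy = rng.normal(scale=0.5, size=(1, n_hidden))           # hidden -> output
b_h = np.zeros(n_hidden)
b_y = np.zeros(1)

x_seq = np.sin(np.linspace(0, 1, 4))[:, None]  # x_0 .. x_3, one feature each
x4_hat = simple_rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y)
print(x4_hat.shape)
```

Training would adjust the three weight matrices (and the biases) so that `x4_hat` gets close to the true $x_4$; that part is left to `keras`.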

Let's try this in `keras`!

Suppose we have some time-series data and we want to build a model so, given a bunch of points, we can predict the next point. For example:

In [None]:
n_pts = 500
t = np.linspace(0, 15 * 6, n_pts)
sin_t = np.sin(t)

In [None]:
fig, ax = plt.subplots(figsize=(14, 3))
ax.plot(t, sin_t, '.')

Except that's way too easy. Let's add some noise.

In [None]:
sin_t_noisy = np.sin(t) + stats.norm(0, 0.5).rvs(n_pts)

In [None]:
fig, ax = plt.subplots(figsize=(14, 3))
ax.plot(t, sin_t_noisy, '.')

First we need to get the data into a format we can put into an RNN. Each training data point should consist of a sequence of consecutive values for our data (for input) and the next value of our data (for output).

We'll start with a function that takes every possible group of `n_prev` consecutive values (50 here), each followed by one value (for the output), along our time-series data.

In [None]:
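def windowize_data(data, n_prev):
    # number of windows that still have a following value to predict
    n_predictions = len(data) - n_prev
    # target for each window: the value right after it
    y = data[n_prev:]
    # broadcasting trick: row i of indices is [i, i+1, ..., i+n_prev-1]
    indices = np.arange(n_prev) + np.arange(n_predictions)[:, None]
    # the trailing None adds a feature axis: shape (n_predictions, n_prev, 1)
    x = data[indices, None]
    return x, y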
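
To see what the index arithmetic does, it can help to run the same steps on a tiny series. This sketch repeats the logic of `windowize_data` inline on the integers 0 to 9 with a window of 3; the toy data and window size are chosen only for illustration.

```python
import numpy as np

data = np.arange(10)  # stand-in series: 0, 1, ..., 9
n_prev = 3

# row i of indices is [i, i+1, i+2]: the positions of the i-th input window
indices = np.arange(n_prev) + np.arange(len(data) - n_prev)[:, None]
x = data[indices, None]  # windows, time steps, features
y = data[n_prev:]        # the value right after each window

print(indices[:2])
print(x.shape, y.shape)
print(x[0, :, 0], y[0])
```

The first window holds the values 0, 1, 2 and its target is 3; each later window is shifted by one position.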

Then we'll write a function to split the data into training and testing sets. Because it's time-series data we have to do that sequentially rather than shuffling it. The two sets should be completely separate and not overlap, so that the training data isn't used for testing.

In [None]:
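def split_and_windowize(data, n_prev, fraction_test=0.3):
    # each of the two segments loses n_prev points to its first window
    n_predictions = len(data) - 2*n_prev

    n_test  = int(fraction_test * n_predictions)
    n_train = n_predictions - n_test

    # the training segment needs n_prev extra points to yield n_train windows
    n_split = n_train + n_prev
    x_train, y_train = windowize_data(data[:n_split], n_prev)
    x_test, y_test = windowize_data(data[n_split:], n_prev)
    return x_train, x_test, y_train, y_test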

In [None]:
n_prev = 50

x_train, x_test, y_train, y_test = split_and_windowize(sin_t_noisy, n_prev)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

Now we build our model. We'll stack two RNN layers.

<img src="img/rnn-2-layer-unrolled.png" width=450>

Note that for the last layer we aren't going to fit all the outputs, but just the last one, so we set `return_sequences=False`. For the previous layer that feeds into that we need the output of each step, so `return_sequences=True`.

In [None]:
model = keras.Sequential()
model.add(keras.layers.SimpleRNN(32, input_shape=(n_prev, 1), return_sequences=True))
model.add(keras.layers.SimpleRNN(32, return_sequences=False))
model.add(keras.layers.Dense(1, activation='linear'))
model.compile(optimizer='rmsprop',
              loss='mse')

The input shape is `(n_prev, 1)` because we're training with `n_prev` prior time points and only have a single feature.

In [None]:
model.summary()
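
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layout of a simple recurrent layer (one input weight matrix, one recurrent weight matrix, and one bias per unit); the helper names are made up for this example.

```python
def simple_rnn_params(n_inputs, n_units):
    # input weights + recurrent weights + one bias per unit
    return n_units * (n_inputs + n_units + 1)

def dense_params(n_inputs, n_units):
    # one weight per input plus a bias, for every unit
    return n_units * (n_inputs + 1)

layer_counts = [
    simple_rnn_params(1, 32),   # first SimpleRNN: one input feature
    simple_rnn_params(32, 32),  # second SimpleRNN: fed by the 32 units below it
    dense_params(32, 1),        # final Dense layer
]
print(layer_counts, sum(layer_counts))
```

Note that the counts do not depend on `n_prev`: the weights are shared across all time steps.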

In [None]:
model.fit(x_train, y_train, batch_size=32, epochs=20)

Let's see what predictions we get.

In [None]:
y_pred = model.predict(x_test)

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(t[-len(y_test):], y_pred, 'b.-', label='predictions', lw=0.5)
ax.plot(t[-len(y_test):], y_test, 'r.', label='actual')
ax.plot(t[-len(y_test):], np.sin(t[-len(y_test):]), 'g-', label='ideal')
ax.legend()

That's not so great. Note, however, that the blue predictions do vary less than the red data points, so at least it averaged out some of the noise. 
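
When judging the loss, keep in mind that the labels themselves are noisy. Even a model that predicted the noise-free sine exactly would have a mean squared error of about $0.5^2 = 0.25$ against them. A quick check, using a freshly generated series with a fixed seed (not the one above):

```python
import numpy as np

rng = np.random.default_rng(0)
t_demo = np.linspace(0, 15 * 6, 500)
ideal = np.sin(t_demo)
noisy = ideal + rng.normal(0, 0.5, size=t_demo.size)

# MSE of a "perfect" predictor measured against the noisy labels
mse_floor = np.mean((noisy - ideal) ** 2)
print(mse_floor)  # expected to land near 0.25, the noise variance
```

A training loss well below this value would suggest the network is fitting the noise rather than the signal.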

One of the difficulties with traditional RNNs is what's called the "vanishing gradient problem." For deep neural networks (unrolled in time, this one is 50 levels deep!) the effect of the input at the beginning shrinks exponentially with the depth of the network. The signal from each successively earlier point is typically smaller (or sometimes larger, in which case the gradients explode) than the signal from the point after it. So while these networks can "remember" what happened recently, it is very hard for them to remember details from the distant past.
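
A toy calculation makes the exponential effect concrete. The sketch below uses a one-unit `tanh` RNN with a hand-picked recurrent weight of 0.9 (an assumption for illustration, not a trained value) and tracks how sensitive the final state is to the initial one.

```python
import numpy as np

n_steps = 50  # same depth as the 50-step windows used above

# Scalar picture: the same factor applied at every step.
shrink = 0.9 ** n_steps  # factor below 1: vanishes
grow = 1.1 ** n_steps    # factor above 1: explodes
print(shrink, grow)

# One-unit tanh RNN: h_t = tanh(w * h_{t-1} + x_t)
w = 0.9
h = 0.0
grad = 1.0  # d h_t / d h_0, accumulated with the chain rule
for x_t in np.sin(np.linspace(0, 5, n_steps)):
    h = np.tanh(w * h + x_t)
    grad *= w * (1 - h ** 2)  # derivative of tanh(w*h + x) with respect to h
print(grad)
```

Because the derivative of `tanh` is at most 1, each step multiplies the gradient by at most `w`, so after 50 steps it is no larger than `0.9 ** 50`.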

There are other RNN architectures that do a better job. One is the Long Short-Term Memory (LSTM) network; a good post detailing them is at <http://colah.github.io/posts/2015-08-Understanding-LSTMs/>.
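
For a feel of what an LSTM layer computes, here is a minimal NumPy sketch of a single LSTM time step in its common formulation. The function name, the random weights, and the order in which the four gates are stacked are choices made for this sketch; `keras` handles all of this internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*n_units, n_features), U: (4*n_units, n_units), b: (4*n_units,)."""
    z = W @ x_t + U @ h_prev + b
    i, f, g, o = np.split(z, 4)  # input gate, forget gate, candidate, output gate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g  # cell state: keep part of the old, add part of the new
    h = o * np.tanh(c)      # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
n_units, n_features = 4, 1
W = rng.normal(scale=0.3, size=(4 * n_units, n_features))
U = rng.normal(scale=0.3, size=(4 * n_units, n_units))
b = np.zeros(4 * n_units)

h = np.zeros(n_units)
c = np.zeros(n_units)
for x_t in np.sin(np.linspace(0, 9, 50))[:, None]:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The additive update of the cell state `c` is what lets information (and gradients) survive over many steps more easily than in the simple RNN.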

In [None]:
model = keras.Sequential()
model.add(keras.layers.LSTM(32, input_shape=(n_prev, 1), return_sequences=True))
model.add(keras.layers.LSTM(32, return_sequences=False))
model.add(keras.layers.Dense(1, activation='linear'))
model.compile(optimizer='rmsprop',
              loss='mse')

In [None]:
model.fit(x_train, y_train, batch_size=32, epochs=10)

In [None]:
y_pred = model.predict(x_test)

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(t[-len(y_test):], y_pred, 'b.-', label='predictions', lw=0.5)
ax.plot(t[-len(y_test):], y_test, 'r.', label='actual')
ax.plot(t[-len(y_test):], np.sin(t[-len(y_test):]), 'g-', label='ideal')
ax.legend()

So this is better.

# Classification

RNNs can also be used for classification. Rather than predicting the next step after a sequence as the output, we predict a class (or rather, a probability). Let's try two sequences, sine waves of slightly different frequencies.

In [None]:
n_pts = 500
t = np.linspace(0, 15 * 6, n_pts)
sin_11t_noisy = np.sin(1.1*t) + stats.norm(0, 0.5).rvs(n_pts)
sin_t_noisy = np.sin(t) + stats.norm(0, 0.5).rvs(n_pts)

We don't care about the next value any more.

In [None]:
x_train1, x_test1, _, _ = split_and_windowize(sin_t_noisy, n_prev)
x_train2, x_test2, _, _ = split_and_windowize(sin_11t_noisy, n_prev)

Instead, the `y`s are the labels of the class.

In [None]:
x_train = np.concatenate([x_train1, x_train2])
x_test = np.concatenate([x_test1, x_test2])
y_train = np.concatenate([np.zeros(x_train1.shape[0]), np.ones(x_train2.shape[0])])
y_test = np.concatenate([np.zeros(x_test1.shape[0]), np.ones(x_test2.shape[0])])

We'll use a sigmoid activation at the end.

In [None]:
model = keras.Sequential()
model.add(keras.layers.SimpleRNN(32, input_shape=(n_prev, 1), return_sequences=True))
model.add(keras.layers.LSTM(32, return_sequences=True))
model.add(keras.layers.LSTM(32, return_sequences=False))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy')

In [None]:
y_train.shape, y_test.shape

In [None]:
model.fit(x_train, y_train, batch_size=32, epochs=5)

In [None]:
y_pred = model.predict(x_test)[:,0]

In [None]:
fig, ax = plt.subplots()
ax.hist(y_pred[y_test == 0], alpha=0.3, bins=20, label="0")
ax.hist(y_pred[y_test == 1], alpha=0.3, bins=20, label="1")
ax.legend()

# Regression

Maybe we could use an RNN to figure out the frequency of a signal. Here we'll just create a lot of sequences (with noise), each with a different frequency and starting point. 

In [None]:
n_pts = 50
n_sequences = 1000
length = 10  # length of each sequence
xpts = np.linspace(0, length, n_pts)
offsets = stats.uniform(0, 2*np.pi).rvs(n_sequences)[:, None]
freqs = stats.uniform(1,4).rvs(n_sequences)[:, None]  # loc=1, scale=4: uniform on [1, 5]
signals = np.sin(xpts*freqs + offsets) + stats.norm(0, 0.3).rvs((n_sequences, n_pts))
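
The `[:, None]` on the offsets and frequencies is what makes this produce one row per sequence: a row of sample positions broadcast against a column of frequencies gives a 2-D grid. A tiny version with made-up values:

```python
import numpy as np

xpts_demo = np.linspace(0, 10, 5)             # shape (5,): shared sample positions
freqs_demo = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): one frequency per sequence
offsets_demo = np.zeros((3, 1))               # shape (3, 1): one offset per sequence

phase = xpts_demo * freqs_demo + offsets_demo  # broadcasts to shape (3, 5)
print(phase.shape)
print(phase[1])  # the row for frequency 2.0
```

In the cell above the same rule yields an array of shape `(n_sequences, n_pts)`.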

Let's look at some of the sequences.

In [None]:
fig, axs = plt.subplots(7, 1, figsize=(12,10))
for i, ax in zip(range(7), axs):
    ax.plot(xpts, signals[i], '.-', lw=0.5)


In [None]:
model = keras.Sequential()
#model.add(keras.layers.LSTM(32, input_shape=(n_pts, 1), return_sequences=True))
#model.add(keras.layers.LSTM(32, return_sequences=False))
model.add(keras.layers.LSTM(32, input_shape=(n_pts, 1), return_sequences=False))
model.add(keras.layers.Dense(1, activation='linear'))
model.compile(optimizer='rmsprop',
              loss='mse')

In [None]:
x_train, x_test, y_train, y_test = train_test_split(signals[:, :, None], freqs[:, 0])

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
model.fit(x_train, y_train,  batch_size=32, epochs=20)

In [None]:
predict_test = model.predict(x_test)

How did we do?

In [None]:
fig, ax = plt.subplots()
ax.scatter(y_test, predict_test)
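
To put numbers next to the scatter plot, one option is the root mean squared error together with $R^2$. The helper below, `regression_report`, is a hypothetical name for this sketch, and the demo arrays are made up; note the `ravel`, since `model.predict` returns an array of shape `(n, 1)`.

```python
import numpy as np

def regression_report(y_true, y_pred):
    y_pred = np.ravel(y_pred)  # flatten (n, 1) predictions to (n,)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return rmse, 1 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([[1.5], [2.0], [2.5], [4.0]])
rmse, r2 = regression_report(y_true_demo, y_pred_demo)
print(rmse, r2)
```

With the arrays from this notebook the call would be `regression_report(y_test, predict_test)`.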