### The Simple Recurrent Unit  

Recurrent neural networks (RNNs) are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data emanating from sensors, stock markets, and government agencies. The simple recurrent unit, also known as the Elman unit, is the most basic form of RNN. An RNN introduces a feedback loop from the output of the hidden layer back into itself, so that the previous output of the hidden layer becomes one of its inputs:

![](extras/rnn.PNG)

Here $h(t)$ is a vector of size $M$, and the hidden-to-hidden weight matrix $W_h$ is $M \times M$. There are $M$ hidden units, and each one connects back to all of the hidden units at the previous time step, so in total there are $M^2$ recurrent weights.  

With recurrent units, our data gains a new dimension: the sequence length, $T$. Sequences are often of unequal length, so we store them in a list rather than in a single array.  

Here is how we represent the hidden layer and the output layer of the RNN:

\begin{align}
\ h(t) = f(W_h^Th(t-1)+W_x^Tx(t)+b_h)
\end{align}

\begin{align}
\ y(t) = \mathrm{softmax}(W_o^Th(t)+b_o)
\end{align}
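
The two equations above can be sketched in NumPy as follows. This is only an illustration, not the Theano implementation shown later: the function name `rnn_forward`, the sizes `D`, `M`, `K`, and the random initialization are assumptions made for the example. Weights are stored as `D x M`, `M x M`, and `M x K` so that `x.dot(W)` on row vectors plays the role of $W^Tx$ in the equations.

```python
import numpy as np

def softmax(a):
    # subtract the max for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(X, Wx, Wh, bh, Wo, bo, h0, f=np.tanh):
    """Run a simple recurrent unit over one sequence X of shape (T, D)."""
    h = h0
    hs, ys = [], []
    for x_t in X:                               # one time step per row
        h = f(x_t.dot(Wx) + h.dot(Wh) + bh)     # h(t)
        ys.append(softmax(h.dot(Wo) + bo))      # y(t)
        hs.append(h)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                               # illustrative sizes
Wx = rng.standard_normal((D, M)) / np.sqrt(D + M)
Wh = rng.standard_normal((M, M)) / np.sqrt(M + M)
Wo = rng.standard_normal((M, K)) / np.sqrt(M + K)
bh, bo, h0 = np.zeros(M), np.zeros(K), np.zeros(M)

# sequences of different lengths are kept in a plain Python list
sequences = [rng.standard_normal((T, D)) for T in (5, 2, 7)]
outputs = [rnn_forward(X, Wx, Wh, bh, Wo, bo, h0) for X in sequences]
```

Each entry of `outputs` holds the hidden states (shape `T x M`) and the per-step class probabilities (shape `T x K`) for one sequence.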

where $f$ is an activation function such as sigmoid, tanh, or ReLU. Note that the hidden layer equation can be unpacked further: 

\begin{align}
\ h(t) = f(W_h^Th(t-1)+W_x^Tx(t)+b_h)
\end{align}

\begin{align}
\ h(t) = f(W_h^Tf(W_h^Th(t-2)+W_x^Tx(t-1)+b_h)+W_x^Tx(t)+b_h)
\end{align}

and so on, where $h(0)$ is the initial hidden state, which can be fixed at zero or made learnable.

Additionally, we can add more hidden layers, each with its own recurrence. 

![](extras/rnn_layered.PNG)

There are three types of predictions we can make using RNNs:
1. Predict one label over an entire sequence (e.g. differentiating between male and female voices).
2. Predict a label for every step of the input sequence (e.g. controlling a device using a brain-computer interface, BCI).
3. Predict the next value in a sequence (e.g. the next word in a sentence).
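
The three cases differ mainly in which outputs are read and what the targets are. The sketch below illustrates this with made-up numbers; the probability table `py` and the word indices are hypothetical, and reading the final step for a whole-sequence label is one common choice rather than the only one.

```python
import numpy as np

# hypothetical per-step class probabilities for a sequence of length T=4, K=3 classes
py = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

# 1. one label for the whole sequence: read only the final step
sequence_label = int(np.argmax(py[-1]))

# 2. one label per time step: take the argmax of every row
step_labels = np.argmax(py, axis=1)

# 3. next-value prediction: the target at step t is the input at step t+1
words = np.array([4, 9, 2, 7, 5])        # hypothetical word indices
inputs, targets = words[:-1], words[1:]
```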

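Another key concept here is _shared weights_: the same $W_x$, $W_h$, and $W_o$ are applied at every time step.  
RNNs are trained with 'backpropagation through time,' which is ordinary backpropagation applied to the network unrolled across the sequence. Because the weights are shared, the gradient for each weight is the sum of its contributions from every time step, and those gradients are then used by gradient descent.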
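
To make the role of shared weights concrete, here is a sketch of backpropagation through time for a scalar recurrent unit, $h(t) = \tanh(w\,h(t-1) + u\,x(t))$, taking the final state as the quantity to differentiate. The scalar setup and the names `forward` and `bptt` are assumptions chosen to keep the example small; the result is checked against a finite-difference estimate.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Scalar recurrent unit: h(t) = tanh(w*h(t-1) + u*x(t)). Returns all states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs):
    """Gradient of the final state h(T) with respect to the shared weights w and u."""
    hs = forward(w, u, xs)
    dw = du = 0.0
    dh = 1.0                              # d h(T) / d h(T)
    for t in range(len(xs), 0, -1):       # walk backwards through time
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        dw += da * hs[t - 1]              # same w at every step: contributions add up
        du += da * xs[t - 1]
        dh = da * w                       # pass the gradient on to h(t-1)
    return dw, du

xs = [0.5, -1.0, 0.3, 0.8]
w, u = 0.7, -0.4
dw, du = bptt(w, u, xs)

# finite-difference check of both gradients
eps = 1e-6
dw_num = (forward(w + eps, u, xs)[-1] - forward(w - eps, u, xs)[-1]) / (2 * eps)
du_num = (forward(w, u + eps, xs)[-1] - forward(w, u - eps, xs)[-1]) / (2 * eps)
```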

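Vanishing and exploding gradients can be issues with RNNs, because the backpropagated gradient is multiplied by the recurrent weights (and the activation's derivative) once for every time step.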
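
A toy illustration of the effect, assuming the simplest possible case: a scalar linear recurrence $h(t) = w\,h(t-1)$ with no input or activation. The function name `grad_through_time` is made up for the example.

```python
def grad_through_time(w, T):
    """d h(T) / d h(0) for the linear recurrence h(t) = w * h(t-1)."""
    g = 1.0
    for _ in range(T):
        g *= w          # one factor of w per time step
    return g

small = grad_through_time(0.5, 50)   # |w| < 1: the gradient shrinks toward zero
large = grad_through_time(1.5, 50)   # |w| > 1: the gradient keeps growing
```

With a matrix $W_h$ the same thing happens along the directions where repeated multiplication shrinks or stretches the gradient.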

### Rated Recurrent Unit   

The rated recurrent unit is a simple modification to the basic RNN. 

We want to weight two things:  
1. $f(x(t), h(t-1))$ -> output we would have gotten from a regular recurrent unit.
2. $h(t-1)$ -> previous value of hidden state.

We use a matrix state called the rate matrix, $z$, which is the same size as the hidden layer. We then do an element-by-element multiplication on each dimension:  

$h(t) = (1-z) \circ h(t-1) + z \circ f(x(t), h(t-1))$

The gate we add acts much like a low-pass filter. $z$ can be calculated in multiple ways:
- It can be a full-size matrix ($M \times M$).
- It can also be computed from the input and the previous state: $z(t) = f(x(t), h(t-1)) = f(W_{xz}x(t)+W_{hz}h(t-1)+b_z)$


![](extras/rated_recurrence.PNG)
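
A minimal NumPy sketch of one rated-unit step, assuming the input-dependent form of $z(t)$ and a sigmoid for the rate so that $z$ stays between 0 and 1 (the text does not fix this choice). The function name `rated_step` and the candidate weight names `Wx`, `Wh`, `bh` are made up for the illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rated_step(x_t, h_prev, Wx, Wh, bh, Wxz, Whz, bz, f=np.tanh):
    """One step of a rated recurrent unit with an input-dependent rate z(t)."""
    candidate = f(x_t.dot(Wx) + h_prev.dot(Wh) + bh)     # regular recurrent output
    z = sigmoid(x_t.dot(Wxz) + h_prev.dot(Whz) + bz)     # rate, element-wise in (0, 1)
    return (1.0 - z) * h_prev + z * candidate, z

rng = np.random.default_rng(0)
D, M = 3, 4
Wx, Wxz = rng.standard_normal((D, M)), rng.standard_normal((D, M))
Wh, Whz = rng.standard_normal((M, M)), rng.standard_normal((M, M))
bh = np.zeros(M)
x_t, h_prev = rng.standard_normal(D), rng.standard_normal(M)

# a strongly negative bias pushes z toward 0, so the old state is carried over
h_keep, z_keep = rated_step(x_t, h_prev, Wx, Wh, bh, Wxz, Whz, np.full(M, -50.0))
# a strongly positive bias pushes z toward 1, so the state is overwritten
h_new, z_new = rated_step(x_t, h_prev, Wx, Wh, bh, Wxz, Whz, np.full(M, 50.0))
```

The two extreme settings show the filter-like behavior: a small $z$ keeps the state nearly unchanged, while a large $z$ lets the new value through.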

### Gated Recurrent Unit  

Introduced in 2014, the gated recurrent unit (GRU) has performance comparable to the LSTM, though it uses fewer parameters. 

![](extras/gru.PNG)

\begin{array}{lcl}
r_t = \sigma(x_tW_{xr}+h_{t-1}W_{hr}+b_r) \\
z_t = \sigma(x_tW_{xz}+h_{t-1}W_{hz}+b_z) \\
\hat{h}_t=g(x_tW_{xh}+(r_t \odot h_{t-1})W_{hh} + b_h) \\
h_t = (1-z_t) \odot h_{t-1} + z_t \odot \hat{h}_t \\
\end{array}

where $g()$ is the activation function and the _circle-dot_ symbol signifies element-wise multiplication.  

If $r_t = 0$, we get $\hat{h}_t = g(x_tW_{xh}+b_h)$, which is like the beginning of a new sequence.  

Note that this is not the full picture, since it doesn't consider $h_0$, and $\hat{h}_t$ is only a candidate for the new $h_t$. Thus, $h_t$ will be a combination of $h_{t-1}$ and $\hat{h}_t$.
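
The four GRU equations can be traced step by step in plain NumPy. This is a sketch under assumptions, separate from the Theano class later in the document: the function name `gru_step`, the dictionary `p` of weights, the sizes, and the choice of tanh for $g$ are all illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p, g=np.tanh):
    """One GRU step; p is a dict holding the weights named as in the equations."""
    r = sigmoid(x_t.dot(p["Wxr"]) + h_prev.dot(p["Whr"]) + p["br"])        # reset gate
    z = sigmoid(x_t.dot(p["Wxz"]) + h_prev.dot(p["Whz"]) + p["bz"])        # update gate
    h_hat = g(x_t.dot(p["Wxh"]) + (r * h_prev).dot(p["Whh"]) + p["bh"])    # candidate
    return (1.0 - z) * h_prev + z * h_hat                                  # blend old and new

rng = np.random.default_rng(0)
Mi, Mo = 3, 4                                  # illustrative input and hidden sizes
p = {}
for gate in "rzh":
    p["Wx" + gate] = rng.standard_normal((Mi, Mo)) / np.sqrt(Mi + Mo)
    p["Wh" + gate] = rng.standard_normal((Mo, Mo)) / np.sqrt(Mo + Mo)
    p["b" + gate] = np.zeros(Mo)

h = np.zeros(Mo)                               # h_0
for x_t in rng.standard_normal((6, Mi)):       # a sequence of 6 steps
    h = gru_step(x_t, h, p)
```

Because each new state is an element-wise blend of the previous state and a tanh output, a state that starts at zero stays within $[-1, 1]$.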

### Long Short-Term Memory (LSTM)   

LSTMs are composed of:
- 3 gates: input, output, and forget gate
- a memory cell $c_t$ (no more $\hat{h}$)

![](extras/lstm.PNG)

\begin{array}{lcl}
i_t = \sigma(x_tW_{xi}+h_{t-1}W_{hi}+c_{t-1}W_{ci}+b_i) \\
f_t = \sigma(x_tW_{xf}+h_{t-1}W_{hf}+c_{t-1}W_{cf}+b_f) \\
c_t=f_t \odot c_{t-1}+i_t \odot \tanh(x_tW_{xc}+h_{t-1}W_{hc} + b_c) \\
o_t = \sigma(x_tW_{xo}+h_{t-1}W_{ho}+c_tW_{co} + b_o) \\
h_t = o_t \odot \tanh(c_t)
\end{array}
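
As with the GRU, the LSTM equations can be followed one line at a time in NumPy. The sketch below is an illustration under assumptions (the function name `lstm_step`, the weight dictionary `p`, and the sizes are made up); it keeps the cell-to-gate terms $c_{t-1}W_{ci}$, $c_{t-1}W_{cf}$, and $c_tW_{co}$ as full matrix products, as written in the equations above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above; p maps weight names to arrays."""
    i_t = sigmoid(x_t.dot(p["Wxi"]) + h_prev.dot(p["Whi"]) + c_prev.dot(p["Wci"]) + p["bi"])
    f_t = sigmoid(x_t.dot(p["Wxf"]) + h_prev.dot(p["Whf"]) + c_prev.dot(p["Wcf"]) + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(x_t.dot(p["Wxc"]) + h_prev.dot(p["Whc"]) + p["bc"])
    o_t = sigmoid(x_t.dot(p["Wxo"]) + h_prev.dot(p["Who"]) + c_t.dot(p["Wco"]) + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
Mi, Mo = 3, 4                                  # illustrative input and hidden sizes
p = {}
for name in "ifco":                            # input, forget, candidate cell, output
    p["Wx" + name] = rng.standard_normal((Mi, Mo)) / np.sqrt(Mi + Mo)
    p["Wh" + name] = rng.standard_normal((Mo, Mo)) / np.sqrt(Mo + Mo)
    p["b" + name] = np.zeros(Mo)
for name in "ifo":                             # the cell state feeds the three gates only
    p["Wc" + name] = rng.standard_normal((Mo, Mo)) / np.sqrt(Mo + Mo)

h, c = np.zeros(Mo), np.zeros(Mo)
for x_t in rng.standard_normal((6, Mi)):       # a sequence of 6 steps
    h, c = lstm_step(x_t, h, c, p)
```

Note that the output gate uses the current cell value `c_t`, while the input and forget gates use the previous one.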

LSTM Parameters:   
Input gate $i_t$ determines how much of the new value goes into the cell.
- Parameters: $W_{xi}, W_{hi}, W_{ci}, b_i$
- Depends on: $x_t, h_{t-1}, c_{t-1}$  

Forget gate $f_t$ determines how much of the previous cell value goes into the current cell value.  
- Parameters: $W_{xf}, W_{hf}, W_{cf}, b_f$
- Depends on: $x_t, h_{t-1}, c_{t-1}$  

The candidate cell value, the tanh term in $c_t$, looks like the simple recurrent unit; it is multiplied by the input gate before being added to the cell.  
- Parameters: $W_{xc}, W_{hc}, b_c$
- Depends on: $x_t, h_{t-1}$  

Output gate $o_t$ takes into account everything (input at time t, the previous hidden state, and the current cell value). 
- Parameters: $W_{xo}, W_{ho}, W_{co}, b_o$
- Depends on: $x_t, h_{t-1}, c_t$

New hidden state $h_t$ is the tanh of the cell value times the output gate. 
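
The parameter lists above make it possible to check the earlier remark that a GRU uses fewer parameters than an LSTM. The sketch below counts weights and biases for both units as they are written in this document, with input size `Mi` and hidden size `Mo`; the function names are made up, and initial states such as $h_0$ and $c_0$ are left out of the count.

```python
def gru_param_count(Mi, Mo):
    # three blocks (r, z, candidate), each with Wx (Mi x Mo), Wh (Mo x Mo), bias (Mo)
    return 3 * (Mi * Mo + Mo * Mo + Mo)

def lstm_param_count(Mi, Mo):
    # four blocks (i, f, candidate cell, o), each with Wx, Wh, and a bias,
    # plus three cell-to-gate matrices W_ci, W_cf, W_co (Mo x Mo each)
    return 4 * (Mi * Mo + Mo * Mo + Mo) + 3 * Mo * Mo

Mi, Mo = 10, 20                     # illustrative sizes
n_gru = gru_param_count(Mi, Mo)     # 3 * (200 + 400 + 20) = 1860
n_lstm = lstm_param_count(Mi, Mo)   # 4 * 620 + 3 * 400 = 3680
```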

### Coded Example

In [None]:
import theano
import theano.tensor as T
import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from util import init_weight, all_parity_pairs_with_sequence_labels


class SimpleRNN:
    def __init__(self, M):
        self.M = M # hidden layer size

    def fit(self, X, Y, learning_rate=10e-1, mu=0.99, reg=1.0, activation=T.tanh, epochs=100, show_fig=False):
        D = X[0].shape[1] # X is of size N x T(n) x D
        K = len(set(Y.flatten()))
        N = len(Y)
        M = self.M
        self.f = activation

        # initial weights
        Wx = init_weight(D, M)
        Wh = init_weight(M, M)
        bh = np.zeros(M)
        h0 = np.zeros(M)
        Wo = init_weight(M, K)
        bo = np.zeros(K)

        # make them theano shared
        self.Wx = theano.shared(Wx)
        self.Wh = theano.shared(Wh)
        self.bh = theano.shared(bh)
        self.h0 = theano.shared(h0)
        self.Wo = theano.shared(Wo)
        self.bo = theano.shared(bo)
        self.params = [self.Wx, self.Wh, self.bh, self.h0, self.Wo, self.bo]

        thX = T.fmatrix('X')
        thY = T.ivector('Y')

        def recurrence(x_t, h_t1):
            # returns h(t), y(t)
            h_t = self.f(x_t.dot(self.Wx) + h_t1.dot(self.Wh) + self.bh)
            y_t = T.nnet.softmax(h_t.dot(self.Wo) + self.bo)
            return h_t, y_t

        [h, y], _ = theano.scan(
            fn=recurrence,
            outputs_info=[self.h0, None],
            sequences=thX,
            n_steps=thX.shape[0],
        )

        py_x = y[:, 0, :]
        prediction = T.argmax(py_x, axis=1)

        cost = -T.mean(T.log(py_x[T.arange(thY.shape[0]), thY]))
        grads = T.grad(cost, self.params)
        dparams = [theano.shared(p.get_value()*0) for p in self.params]

        updates = [
            (p, p + mu*dp - learning_rate*g) for p, dp, g in zip(self.params, dparams, grads)
        ] + [
            (dp, mu*dp - learning_rate*g) for dp, g in zip(dparams, grads)
        ]

        self.predict_op = theano.function(inputs=[thX], outputs=prediction)
        self.train_op = theano.function(
            inputs=[thX, thY],
            outputs=[cost, prediction, y],
            updates=updates
        )

        costs = []
        for i in range(epochs):
            X, Y = shuffle(X, Y)
            n_correct = 0
            cost = 0
            for j in range(N):
                c, p, rout = self.train_op(X[j], Y[j])
                cost += c
                # score only the prediction at the final time step
                if p[-1] == Y[j,-1]:
                    n_correct += 1
            print("shape y:", rout.shape)
            print("i:", i, "cost:", cost, "classification rate:", (float(n_correct)/N))
            costs.append(cost)
            if n_correct == N:
                # stop early once every sequence is classified correctly
                break

        if show_fig:
            plt.plot(costs)
            plt.show()



def parity(B=12, learning_rate=10e-5, epochs=200):
    X, Y = all_parity_pairs_with_sequence_labels(B)

    rnn = SimpleRNN(4)
    rnn.fit(X, Y, learning_rate=learning_rate, epochs=epochs, activation=T.nnet.sigmoid, show_fig=False)


if __name__ == '__main__':
    parity()
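
The script above imports `init_weight` and `all_parity_pairs_with_sequence_labels` from a `util` module that is not shown here. The sketch below is only a guess at what those helpers could look like, inferred from how they are called: `X` is indexed as `N x T x D` and fed to a float matrix, and `Y` holds one integer label per time step. The weight scaling and the running-parity labels are assumptions, not the original implementation.

```python
import numpy as np

def init_weight(Mi, Mo):
    # assumed: small Gaussian weights scaled by the layer sizes
    return np.random.randn(Mi, Mo) / np.sqrt(Mi + Mo)

def all_parity_pairs_with_sequence_labels(nbit):
    # assumed: every nbit-long bit string as an input sequence of shape (nbit, 1),
    # labeled at each step with the running parity (1 if an odd number of ones so far)
    N = 2 ** nbit
    X = np.zeros((N, nbit, 1), dtype=np.float32)   # float32 to match T.fmatrix
    Y = np.zeros((N, nbit), dtype=np.int32)        # int32 to match T.ivector
    for n in range(N):
        ones = 0
        for t in range(nbit):
            bit = (n >> t) & 1
            X[n, t, 0] = bit
            ones += bit
            Y[n, t] = ones % 2
    return X, Y

X, Y = all_parity_pairs_with_sequence_labels(3)    # 8 sequences of length 3
```

With labels built this way, the label at the last time step is the parity of the whole sequence, which is what the training loop compares against.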


In [None]:
# https://deeplearningcourses.com/c/deep-learning-recurrent-neural-networks-in-python
# https://udemy.com/deep-learning-recurrent-neural-networks-in-python
import numpy as np
import theano
import theano.tensor as T

from util import init_weight


class GRU:
    def __init__(self, Mi, Mo, activation):
        self.Mi = Mi
        self.Mo = Mo
        self.f  = activation

        # numpy init
        Wxr = init_weight(Mi, Mo)
        Whr = init_weight(Mo, Mo)
        br  = np.zeros(Mo)
        Wxz = init_weight(Mi, Mo)
        Whz = init_weight(Mo, Mo)
        bz  = np.zeros(Mo)
        Wxh = init_weight(Mi, Mo)
        Whh = init_weight(Mo, Mo)
        bh  = np.zeros(Mo)
        h0  = np.zeros(Mo)

        # theano vars
        self.Wxr = theano.shared(Wxr)
        self.Whr = theano.shared(Whr)
        self.br  = theano.shared(br)
        self.Wxz = theano.shared(Wxz)
        self.Whz = theano.shared(Whz)
        self.bz  = theano.shared(bz)
        self.Wxh = theano.shared(Wxh)
        self.Whh = theano.shared(Whh)
        self.bh  = theano.shared(bh)
        self.h0  = theano.shared(h0)
        self.params = [self.Wxr, self.Whr, self.br, self.Wxz, self.Whz, self.bz, self.Wxh, self.Whh, self.bh, self.h0]

    def recurrence(self, x_t, h_t1):
        r = T.nnet.sigmoid(x_t.dot(self.Wxr) + h_t1.dot(self.Whr) + self.br)
        z = T.nnet.sigmoid(x_t.dot(self.Wxz) + h_t1.dot(self.Whz) + self.bz)
        hhat = self.f(x_t.dot(self.Wxh) + (r * h_t1).dot(self.Whh) + self.bh)
        h = (1 - z) * h_t1 + z * hhat
        return h

    def output(self, x):
        # input X should be a matrix (2-D)
        # rows index time
        h, _ = theano.scan(
            fn=self.recurrence,
            sequences=x,
            outputs_info=[self.h0],
            n_steps=x.shape[0],
        )
        return h


In [None]:
# https://deeplearningcourses.com/c/deep-learning-recurrent-neural-networks-in-python
# https://udemy.com/deep-learning-recurrent-neural-networks-in-python
import numpy as np
import theano
import theano.tensor as T

from util import init_weight


class LSTM:
    def __init__(self, Mi, Mo, activation):
        self.Mi = Mi
        self.Mo = Mo
        self.f  = activation

        # numpy init
        Wxi = init_weight(Mi, Mo)
        Whi = init_weight(Mo, Mo)
        Wci = init_weight(Mo, Mo)
        bi  = np.zeros(Mo)
        Wxf = init_weight(Mi, Mo)
        Whf = init_weight(Mo, Mo)
        Wcf = init_weight(Mo, Mo)
        bf  = np.zeros(Mo)
        Wxc = init_weight(Mi, Mo)
        Whc = init_weight(Mo, Mo)
        bc  = np.zeros(Mo)
        Wxo = init_weight(Mi, Mo)
        Who = init_weight(Mo, Mo)
        Wco = init_weight(Mo, Mo)
        bo  = np.zeros(Mo)
        c0  = np.zeros(Mo)
        h0  = np.zeros(Mo)

        # theano vars
        self.Wxi = theano.shared(Wxi)
        self.Whi = theano.shared(Whi)
        self.Wci = theano.shared(Wci)
        self.bi  = theano.shared(bi)
        self.Wxf = theano.shared(Wxf)
        self.Whf = theano.shared(Whf)
        self.Wcf = theano.shared(Wcf)
        self.bf  = theano.shared(bf)
        self.Wxc = theano.shared(Wxc)
        self.Whc = theano.shared(Whc)
        self.bc  = theano.shared(bc)
        self.Wxo = theano.shared(Wxo)
        self.Who = theano.shared(Who)
        self.Wco = theano.shared(Wco)
        self.bo  = theano.shared(bo)
        self.c0  = theano.shared(c0)
        self.h0  = theano.shared(h0)
        self.params = [
            self.Wxi,
            self.Whi,
            self.Wci,
            self.bi,
            self.Wxf,
            self.Whf,
            self.Wcf,
            self.bf,
            self.Wxc,
            self.Whc,
            self.bc,
            self.Wxo,
            self.Who,
            self.Wco,
            self.bo,
            self.c0,
            self.h0,
        ]

    def recurrence(self, x_t, h_t1, c_t1):
        i_t = T.nnet.sigmoid(x_t.dot(self.Wxi) + h_t1.dot(self.Whi) + c_t1.dot(self.Wci) + self.bi)
        f_t = T.nnet.sigmoid(x_t.dot(self.Wxf) + h_t1.dot(self.Whf) + c_t1.dot(self.Wcf) + self.bf)
        c_t = f_t * c_t1 + i_t * T.tanh(x_t.dot(self.Wxc) + h_t1.dot(self.Whc) + self.bc)
        o_t = T.nnet.sigmoid(x_t.dot(self.Wxo) + h_t1.dot(self.Who) + c_t.dot(self.Wco) + self.bo)
        h_t = o_t * T.tanh(c_t)
        return h_t, c_t

    def output(self, x):
        # input X should be a matrix (2-D)
        # rows index time
        [h, c], _ = theano.scan(
            fn=self.recurrence,
            sequences=x,
            outputs_info=[self.h0, self.c0],
            n_steps=x.shape[0],
        )
        return h


### References  

- https://deeplearning4j.org/lstm
- https://www.udemy.com/deep-learning-recurrent-neural-networks-in-python