# Recurrent Neural Networks

**RNNs** have a key flaw: capturing relationships that span more than 8 or 10 steps back is practically impossible. This flaw stems from the **vanishing gradient** problem, in which the contribution of information decays geometrically over time. In the opposite case, the **exploding gradient** problem, the contribution grows geometrically instead.
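To make the geometric decay concrete, here is a minimal sketch. It assumes a scalar, linear recurrence $s_t = w_s s_{t-1} + x_t$, which is a simplification of the full network. In that setting the derivative of the current state with respect to a state $k$ steps back is simply $w_s^k$. The function name `gradient_factor` and the weight values are hypothetical, chosen only for illustration.

```python
def gradient_factor(w_s: float, steps: int) -> float:
    """Return d s_t / d s_{t-steps} for the scalar linear
    recurrence s_t = w_s * s_{t-1} + x_t (an assumed toy model)."""
    return w_s ** steps


# Compare a recurrent weight below 1 (vanishing) with one above 1 (exploding).
for k in (1, 5, 10, 20):
    print(f"k={k:2d}  w_s=0.5: {gradient_factor(0.5, k):.6f}  "
          f"w_s=1.5: {gradient_factor(1.5, k):.2f}")
```

With $w_s = 0.5$ the factor after 10 steps is already below 0.001, while with $w_s = 1.5$ it exceeds 57. The same number of steps can therefore either erase or blow up the gradient signal.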

**LSTM** (long short-term memory) is one option to overcome the vanishing gradient problem in RNNs.

### RNN vs. FFNN

There are two main differences between FFNNs and RNNs. The Recurrent Neural Network uses:

1. **sequences** as inputs in the training phase, and
2. **memory** elements

Memory is defined as the output of the hidden layer neurons, which serves as additional input to the network during the next training step.

### Simple RNN (aka Elman Network)

In RNNs, the output at time $t$ depends not only on the current input and the weights, but also on previous inputs. In this case the output at time $t$ is defined as:

$$\bar{y}_t = F(\bar{x}_t, \bar{x}_{t-1}, \ldots, \bar{x}_{t-t_0}, W)$$

![](images/rnn.svg)

with the **input vector** $\bar{x}$, the **output vector** $\bar{y}$ and the **state vector** $\bar{s}$.

$W_x$ is the weight matrix connecting the inputs to the **state layer**, $W_y$ is the weight matrix connecting the state layer to the output layer and $W_s$ represents the weight matrix connecting the state from the previous timestep to the state in the current timestep.

In **FFNNs** the hidden layer depends only on the current inputs and weights, as well as on an **activation function** $\Phi$:

$$\bar{h} = \Phi(\bar{x} W)$$

In **RNNs** the state layer depends on the current inputs, their corresponding weights, the activation function and **also on the previous state**:

$$\bar{h} = \bar{s}_t = \Phi(\bar{x}_t W_x + \bar{s}_{t-1} W_s)$$
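The state update above can be sketched in a few lines of plain Python. This is a minimal illustration under several assumptions: vectors are row vectors stored as lists (matching the $\bar{x} W$ convention), $\Phi$ is taken to be `tanh`, and the helper names `vec_mat` and `rnn_step` as well as all weight values are hypothetical. A real implementation would normally use an array library instead of nested lists.

```python
import math


def vec_mat(v, M):
    """Row vector times matrix: returns the list v @ M."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]


def rnn_step(x_t, s_prev, W_x, W_s, phi=math.tanh):
    """One state update: s_t = phi(x_t W_x + s_{t-1} W_s)."""
    from_input = vec_mat(x_t, W_x)      # contribution of the current input
    from_memory = vec_mat(s_prev, W_s)  # contribution of the previous state
    return [phi(a + b) for a, b in zip(from_input, from_memory)]


# Hypothetical sizes: 2 input features, 3 state units.
W_x = [[0.5, -0.3, 0.1],
       [0.2, 0.4, -0.6]]
W_s = [[0.1, 0.0, 0.2],
       [0.0, 0.3, -0.1],
       [-0.2, 0.1, 0.05]]

s = [0.0, 0.0, 0.0]  # initial state: no memory yet
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    s = rnn_step(x_t, s, W_x, W_s)
    print(s)
```

Because `s` is fed back into every call, the same input vector can produce a different state depending on what came before it, which is exactly the role of the memory term $\bar{s}_{t-1} W_s$.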

The output vector is calculated exactly as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix $W_y$, or a **softmax function** $\sigma$ of the same linear combination:

$$\bar{y}_t = \bar{s}_t W_y \qquad \text{or} \qquad \bar{y}_t = \sigma(\bar{s}_t W_y)$$
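Both output variants can be sketched as follows. The example assumes row vectors stored as plain lists, and the names `vec_mat`, `softmax` and `rnn_output` together with the numbers in `s_t` and `W_y` are hypothetical values picked for illustration.

```python
import math


def vec_mat(v, M):
    """Row vector times matrix: returns the list v @ M."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]


def softmax(z):
    """Softmax of a list; subtracting max(z) avoids overflow in exp."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_output(s_t, W_y, use_softmax=False):
    """y_t = s_t W_y, optionally passed through softmax."""
    z = vec_mat(s_t, W_y)
    return softmax(z) if use_softmax else z


# Hypothetical sizes: 3 state units, 2 output nodes.
s_t = [0.2, -0.5, 0.7]
W_y = [[1.0, -1.0],
       [0.5, 0.5],
       [-0.3, 0.8]]

print(rnn_output(s_t, W_y))                    # linear output
print(rnn_output(s_t, W_y, use_softmax=True))  # probabilities summing to 1
```

The linear form suits regression-style targets, while the softmax form turns the same scores into a probability distribution over output classes.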

### Backpropagation Through Time (BPTT)