# LSTMs and Advanced Temporal Neurons

(partially excerpted from cs231 lecture notes)

The underlying problems with simple RNNs are serious, but the power of these systems and the opportunities they present have created a rich field of research attempting to overcome their limitations.

### Addressing the Unstable Gradient Problem

The unstable gradient problem mentioned previously can of course be fixed... if we hold the forward-propagated (recurrent) weights at 1. Then the network can propagate forever without trouble. The problem with this is that the neurons then cannot learn from their past, unless we provide a way for the neuron to selectively remember and store memories. This is the fundamental principle behind the LSTM, or "long short-term memory", neuron.
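To make the instability concrete, here is a minimal sketch. It assumes a single scalar weight `w` (a hypothetical stand-in for the recurrent weight matrix) and ignores the activation function, so a signal carried across `steps` timesteps is simply multiplied by `w` each time.

```python
# Toy illustration: a signal passed through many timesteps is multiplied by
# the recurrent weight w at every step (activation function ignored).
def propagate(w, steps, signal=1.0):
    for _ in range(steps):
        signal *= w
    return signal

for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: signal after 100 steps = {propagate(w, 100):.3e}")
```

With a weight slightly below 1 the signal all but vanishes, with a weight slightly above 1 it explodes, and only a weight of exactly 1 carries it through unchanged.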

LSTM memory cell
-----

![chain](./images/chain.png)


The LSTM is an example of a 'gated' neuron: one that effectively contains its own decision layers (layers within layers, so to speak). Each gate is a decision layer with its own activation function, its own weights and its own consideration of the input data vector. The LSTM has three gates:

1. An input gate
2. An output gate
3. A keep/forget ("memory") gate

We can formulate the system equations as follows:

$$f_t = \sigma_{g}(W_{f}x_t + U_{f}h_{t-1} + b_f)$$
$$i_t = \sigma_{g}(W_{i} x_t + U_{i} h_{t-1} + b_i)$$
$$o_t = \sigma_{g}(W_{o} x_t + U_{o} h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_{c} x_t + U_{c} h_{t-1} + b_c)$$
$$h_t = o_t \circ \sigma_h(c_t)$$


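Variables
* $x_t$: input vector
* $h_t$: output vector
* $c_t$: cell state vector
* $W$, $U$ and $b$: present (input) parameter matrix, temporal (recurrent) parameter matrix and bias vector, with one set for each gate and one for the cell candidate
* $f_t$, $i_t$ and $o_t$: gate vectors
  * $f_t$: Forget gate vector. Weight of remembering old information.
  * $i_t$: Input gate vector. Weight of acquiring new information.
  * $o_t$: Output gate vector. Weight of exposing the cell state as output.
* $\sigma_g$, $\sigma_c$ and $\sigma_h$: activation functions, commonly a sigmoid for the gates ($\sigma_g$) and $\tanh$ for $\sigma_c$ and $\sigma_h$
* $\circ$: element-wise (Hadamard) product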
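The system equations above translate almost line for line into NumPy. The following is a sketch under stated assumptions, not a reference implementation: it assumes a sigmoid for $\sigma_g$ and $\tanh$ for $\sigma_c$ and $\sigma_h$, and the names `lstm_step` and `params`, the layer sizes and the random weights are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep; params maps "f", "i", "o", "c" to a (W, U, b) tuple."""
    def pre(name):
        # Shared pre-activation form: W x_t + U h_{t-1} + b
        W, U, b = params[name]
        return W @ x_t + U @ h_prev + b

    f_t = sigmoid(pre("f"))                        # forget gate
    i_t = sigmoid(pre("i"))                        # input gate
    o_t = sigmoid(pre("o"))                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(pre("c"))   # new cell state
    h_t = o_t * np.tanh(c_t)                       # new output
    return h_t, c_t

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {
    name: (rng.normal(size=(n_hidden, n_in)),
           rng.normal(size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in "fioc"
}

# Run the cell over a short random sequence, carrying h and c forward.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Note that `*` on NumPy arrays is the element-wise product written as $\circ$ in the equations, while `@` is the matrix product.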

## How LSTMs Work

The LSTM neuron (from this point on called a "cell") stores its state in the persistent master cell state, or **memory**, vector:

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_{c} x_t + U_{c} h_{t-1} + b_c)$$


### Step 1

The first step the LSTM takes during prediction is to decide **whether or not to flush its memory (for that timestep)**:

![step1](./images/1step.png)

Why? Consider the sentence subjects from our previous lecture ("Doug saw Doug saw Doug"). For example:

> Chris is my aunt. Cameron is my ...

When we see a new subject, we want to forget the old subject (temporarily, perhaps).
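A small numeric sketch of this flush, assuming a sigmoid gate activation. The stored memory `c_prev` and the two sets of gate pre-activations are made-up values chosen for illustration, not outputs of a trained network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c_prev = np.array([2.0, -1.5, 0.5])        # hypothetical stored memory

# Hypothetical values of W_f x_t + U_f h_{t-1} + b_f in two situations.
keep_logits = np.array([6.0, 6.0, 6.0])    # nothing new: every gate is near 1
flush_logits = np.array([-6.0, 6.0, 6.0])  # new subject: flush component 0 only

kept = sigmoid(keep_logits) * c_prev       # f_t * c_{t-1}
flushed = sigmoid(flush_logits) * c_prev
print(kept)
print(flushed)
```

Because the gate acts element by element, the cell can drop one piece of memory (here the first component) while leaving the rest almost untouched.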

### Step 2

Now we start building the new cell state from what we kept of the old cell state, $f_t \circ c_{t-1}$.

![step2](./images/2step.png)


### Step 3

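In this step we update the state vector using a context weighting: the input gate $i_t$ scales the candidate values computed from $x_t$ and $h_{t-1}$ before they are added to the state.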

![step3](./images/3step.png)


### Step 4

At last we produce a prediction. The cell state is passed through a typical squashing function $\sigma_h$, and the output gate $o_t$ filters how much of the result is output. Because only the output is filtered, the cell can literally keep a memory in its state vector without exposing it.

![step4](./images/4step.png)
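The output filtering can be sketched numerically as follows, assuming $\sigma_h = \tanh$. The cell state and gate activations are made-up values for illustration.

```python
import numpy as np

c_t = np.array([3.0, -0.2])    # hypothetical cell state (the memory)
o_t = np.array([0.01, 0.99])   # hypothetical output-gate activations

h_t = o_t * np.tanh(c_t)       # h_t = o_t * sigma_h(c_t), element-wise
print(h_t)
```

The first component holds a large value in memory but is almost entirely hidden from the output, while the second passes through nearly unfiltered. The cell state `c_t` itself is unchanged by this step.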



<center><img src="images/variants.png" height="500"/></center>


## GRUs (gated recurrent units)

LSTMs offer tremendous performance, addressing many of the weaknesses of simple RNNs and MLPs on sequence data. However, they are miserable to optimize, as you will see in the laboratory exercises. One way of addressing this is to simplify the LSTM: remove the output gate, provide an update gate (in the role of the input gate) and a reset gate (in the role of the forget gate), and report the system state itself as the output instead of a separate activation.
This gives the system equations:

$$z_t = \sigma_{g}(W_{z}x_t + U_{z}h_{t-1} + b_z)$$
$$r_t = \sigma_{g}(W_{r} x_t + U_{r} h_{t-1} + b_r)$$
$$h_t = z_t \circ h_{t-1} + (1-z_t) \circ \sigma_{h}(W_{h} x_t + U_{h}(r_{t} \circ h_{t-1}) + b_{h})$$

The update and reset gates act as scaling factors that control how much, and which aspects, of the cell's previous state are included in the current response. The rest of the computation is performed with a single layer of weights, and the response $h_t$ doubles as the state vector, as the equations show. GRUs need fewer parameters than LSTM cells (three sets of weights instead of four) and offer [similar performance](https://arxiv.org/pdf/1412.3555.pdf) to LSTM cells, and as you can imagine this remains a hot topic of research.
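The GRU equations can be sketched in NumPy in the same way. As before this is an illustration under assumptions: a sigmoid for $\sigma_g$, $\tanh$ for $\sigma_h$, and hypothetical names (`gru_step`, `gru_params`), sizes and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU timestep; params maps "z", "r", "h" to a (W, U, b) tuple."""
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)   # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)   # reset gate
    # Candidate response: the reset gate scales the previous state first.
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
    # Blend previous state and candidate; the result is both output and state.
    return z_t * h_prev + (1.0 - z_t) * h_cand

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
gru_params = {
    name: (rng.normal(size=(n_hidden, n_in)),
           rng.normal(size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in "zrh"
}

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h, gru_params)

# Count every weight and bias across the three parameter sets.
n_gru = sum(a.size for group in gru_params.values() for a in group)
print(h.shape, n_gru)
```

At these sizes the GRU holds 96 parameters, where an LSTM with its four parameter sets would hold 128, and there is no separate cell state to carry between steps.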