# L13a: Long Short Term Memory (LSTM) Recurrent Neural Networks
In this lecture, we'll discuss the Long Short Term Memory (LSTM) architecture, which is a type of Recurrent Neural Network (RNN) designed to learn long-term dependencies in sequential data. LSTMs are particularly effective for tasks such as language modeling, machine translation, and time series prediction. The key ideas discussed in this lecture are:

* __LSTM Architecture__: LSTMs are recurrent neural networks with additional logic that modifies the hidden state. The LSTM architecture has three gates: the input gate, the forget gate, and the output gate. These gates control the flow of information into and out of the _cell state_, which in turn influences the hidden state of the LSTM. The cell state is a memory that can store information over long periods of time, allowing the LSTM to learn long-term dependencies in the data.
* __LSTM Gates__: The _input gate_ determines how much of the new information should be added to the cell state, the _forget gate_ decides how much of the previous cell state should be discarded, and the _output gate_ controls how much of the cell state should be exposed to the next layer. The gates use [sigmoid activation functions](https://en.wikipedia.org/wiki/Sigmoid_function) to produce values between 0 and 1, which are then multiplied with the corresponding inputs or states.
* __LSTM Training__: LSTMs are trained [using backpropagation through time (BPTT)](https://d2l.ai/chapter_recurrent-neural-networks/bptt.html), which is an extension of the standard backpropagation algorithm for training feedforward neural networks. BPTT involves _unrolling_ the LSTM over time (treating each time step as one slice of a deep feedforward network) and computing gradients at each time step, allowing the model to learn from both short-term and long-term dependencies in the data.

The material in this lecture is based on the following references:
Fill me in

___

## Motivation: Why LSTMs?
LSTMs were developed to address the limitations of traditional RNNs, which struggle to learn long-term dependencies due to the vanishing/exploding gradient problem. This problem arises when gradients become too small (vanishing) or too large (exploding) during backpropagation, making it difficult for the model to learn from long sequences of data. 
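
To make the vanishing/exploding behavior concrete, here is a minimal sketch that assumes a scalar linear recurrence $h_t = w\,h_{t-1}$. This is an illustrative simplification and not a full RNN. Under that assumption, the gradient of $h_T$ with respect to $h_0$ is $w^T$, so it shrinks or grows geometrically with the sequence length. The function name `gradient_through_time` is hypothetical.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Return d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: each time step contributes one factor of w
    return grad


if __name__ == "__main__":
    # Weights slightly below, at, and slightly above 1 behave very differently.
    for w in (0.9, 1.0, 1.1):
        grad = gradient_through_time(w, 100)
        print(f"w = {w}: gradient after 100 steps = {grad:.3e}")
```

With $w = 0.9$ the gradient after 100 steps is on the order of $10^{-5}$, while with $w = 1.1$ it is on the order of $10^{4}$. Only $w = 1$ keeps it constant.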

LSTMs mitigate this issue by introducing a cell state that can retain information over long periods, along with gating mechanisms that control the flow of information. However, do LSTMs always outperform traditional RNNs? The answer is no. In some cases, traditional RNNs can be more efficient and effective, especially for tasks with shorter sequences or less complex dependencies.

### Advantages of LSTMs:

* __Better at Capturing Long-Term Dependencies__: LSTMs are designed to retain information over longer time steps due to their gating mechanisms (input, forget, and output gates), which allow them to selectively store or discard information. This makes them superior to traditional RNNs, which struggle with long-term dependencies.
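* __Mitigation of the Vanishing/Exploding Gradient Problem__: The LSTM architecture helps mitigate the vanishing gradient issue that plagues traditional RNNs, because the cell state is updated additively rather than being repeatedly squashed through a nonlinearity. This makes training over long sequences more stable. Exploding gradients can still occur and are commonly handled with gradient clipping.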
* __Selective Memory Retention__: LSTMs can decide what information to keep or forget using their gating mechanisms, making them more efficient at handling complex sequential data compared to RNNs, which (traditionally) lack such mechanisms.

### Disadvantages of LSTMs:
* __Higher Computational Complexity__: The additional gates and memory cells in LSTMs make them computationally more expensive and slower to train compared to simpler RNNs such as [an Elman network](https://en.wikipedia.org/wiki/Recurrent_neural_network#Elman_networks_and_Jordan_networks).
* __Overfitting__: Due to their large number of parameters, LSTMs are prone to overfitting, especially when the training dataset is small or lacks diversity. 
* __Longer Training Time__: LSTMs often require significantly more time to train than traditional RNNs, which can be a drawback for tasks where speed is critical.
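
The cost difference can be estimated by counting parameters. The sketch below assumes a single recurrent layer with input dimension `d` and hidden dimension `h`, counts one weight matrix on the input, one on the hidden state, and one bias per block, and ignores any output layer. The function names are hypothetical, and some libraries use a slightly different bias layout, so treat the numbers as an estimate.

```python
def elman_param_count(d: int, h: int) -> int:
    """Recurrent-layer parameters of an Elman RNN: W (h x d), U (h x h), b (h)."""
    return h * d + h * h + h


def lstm_param_count(d: int, h: int) -> int:
    """LSTM parameters: one (W, U, b) block each for the forget gate,
    input gate, output gate, and candidate cell state."""
    return 4 * (h * d + h * h + h)


if __name__ == "__main__":
    d, h = 10, 20
    print("Elman:", elman_param_count(d, h))
    print("LSTM: ", lstm_param_count(d, h))
```

Under these assumptions an LSTM has four times as many recurrent-layer parameters as an Elman network of the same size, which is one source of the extra compute and the overfitting risk listed above.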

## LSTM Architecture
LSTMs are designed to maintain a _cell state_ (another type of internal memory) that can carry information across many time steps, allowing them to learn long-term dependencies effectively. The cell state is updated using _gates_, which are mechanisms that control the flow of information into and out of the cell state. The LSTM architecture consists of three main components: the input gate, the forget gate, and the output gate.

### LSTM Gates
The big difference between a vanilla RNN and an LSTM is the introduction of a cell state $c_{t}$, which is a memory that can carry information across many time steps. The cell state is updated using the input gate and the forget gate, which control how much of the previous cell state should be retained and how much new information should be added. The output gate determines how much of the cell state should be exposed to the next layer.

* __Forget gate__: The forget gate $f_t$ determines how much of the previous cell state $c_{t-1}$ should be retained. It is computed using a sigmoid activation function, which outputs values between 0 and 1. A value of 0 means "forget everything," while a value of 1 means "keep everything."
* __Input gate__: In an LSTM, the input gate $i_t$ determines how much of the new information from the current input and the previous hidden state should be added to the cell state. It is also computed using a sigmoid activation function, which outputs values between 0 and 1. A value of 0 means "don't add anything," while a value of 1 means "add everything."
* __Output gate__: The output gate $o_t$ determines how much of the cell state should be exposed through the hidden state $h_t$, which is passed to the next time step and to any subsequent layer. It is computed using a sigmoid activation function, which outputs values between 0 and 1. A value of 0 means "don't expose anything," while a value of 1 means "expose everything."
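
The gating idea described above can be sketched in a few lines: a sigmoid turns a pre-activation into a value in $(0, 1)$, and that value scales the corresponding entry of a vector element-wise. The helper names `sigmoid` and `gate` and the example numbers are assumptions made for illustration.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(pre_activations: List[float], values: List[float]) -> List[float]:
    """Scale each value by the sigmoid of its pre-activation (element-wise product)."""
    return [sigmoid(z) * v for z, v in zip(pre_activations, values)]


if __name__ == "__main__":
    previous_cell = [2.0, -1.0, 0.5]
    # Large negative, zero, and large positive pre-activations give
    # gate values near 0, exactly 0.5, and near 1, respectively.
    forget_pre_activations = [-10.0, 0.0, 10.0]
    print(gate(forget_pre_activations, previous_cell))
```

Here the first entry is almost entirely forgotten, the second is halved, and the third is kept nearly unchanged.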

### Model
The compact form of the LSTM equations (at time $t$) with a forget gate is given by:
$$
\begin{align*}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \texttt{tanh}(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \texttt{tanh}(c_t)
\end{align*}
$$
where $\odot$ denotes the element-wise product ([Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices))) and $\sigma$ is the $\texttt{sigmoid}$ activation function. Let the superscripts $d$ and $h$ denote the dimensions of the input and hidden state, respectively. The following _variables_ are used in the LSTM architecture:
* $x_t \in \mathbb{R}^{d}$: Input vector at time step $t$.
* $h_t \in (-1,1)^h$: Hidden state vector at time step $t$.
* $f_t \in (0,1)^{h}$: Forget gate vector at time step $t$.
* $i_t \in (0,1)^{h}$: Input gate vector at time step $t$.
* $o_t \in (0,1)^{h}$: Output gate vector at time step $t$.
* $c_t \in \mathbb{R}^{h}$: Cell state vector at time step $t$.
* $\tilde{c}_t \in (-1,1)^{h}$: Candidate cell state vector at time step $t$.
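
The cell-state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ also suggests why LSTMs ease the vanishing gradient problem. As a simplified sketch, assume the gates and the candidate are held fixed with respect to $c_{t-1}$. This ignores their indirect dependence on $c_{t-1}$ through $h_{t-1}$, so it captures only the direct path of the full gradient:
$$
\begin{align*}
\frac{\partial c_t}{\partial c_{t-1}} &\approx \texttt{diag}(f_t) \\
\frac{\partial c_T}{\partial c_t} &\approx \prod_{k=t+1}^{T} \texttt{diag}(f_k)
\end{align*}
$$
When the forget gate stays close to 1, this product stays close to the identity, so the gradient signal along the cell state decays slowly. A vanilla RNN instead multiplies by the recurrent weight matrix and an activation derivative at every step.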

An LSTM has the following parameters:
* $W_f \in \mathbb{R}^{h \times d}$: Weights for the forget gate with respect to the input.
* $W_i \in \mathbb{R}^{h \times d}$: Weights for the input gate with respect to the input.
* $W_o \in \mathbb{R}^{h \times d}$: Weights for the output gate with respect to the input.
* $W_c \in \mathbb{R}^{h \times d}$: Weights for the candidate cell state with respect to the input.
* $U_f \in \mathbb{R}^{h \times h}$: Weights for the forget gate with respect to the hidden state.
* $U_i \in \mathbb{R}^{h \times h}$: Weights for the input gate with respect to the hidden state.
* $U_o \in \mathbb{R}^{h \times h}$: Weights for the output gate with respect to the hidden state.
* $U_c \in \mathbb{R}^{h \times h}$: Weights for the candidate cell state with respect to the hidden state.
* $b_f \in \mathbb{R}^{h}$: Bias for the forget gate.
* $b_i \in \mathbb{R}^{h}$: Bias for the input gate.
* $b_o \in \mathbb{R}^{h}$: Bias for the output gate.
* $b_c \in \mathbb{R}^{h}$: Bias for the candidate cell state.
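
The equations and parameters above can be sketched as a single forward step plus a loop that unrolls the step over a sequence. This is a minimal illustration in plain Python with lists, chosen to be self-contained; a practical implementation would use an array library or the `Flux.jl` package used in the lab. The names `affine`, `init_params`, `lstm_step`, and `lstm_forward`, the small uniform initialization, and the zero initial states $h_0 = c_0 = 0$ are all assumptions, not part of the lecture.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Block = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate or the candidate


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(block: Block, x: Vector, h_prev: Vector) -> Vector:
    """Compute W x + U h_prev + b for one (W, U, b) block."""
    W, U, b = block
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h_prev))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]


def init_params(d: int, h: int, seed: int = 0) -> Dict[str, Block]:
    """Small random weights and zero biases for the f, i, o, c blocks (illustrative)."""
    rng = random.Random(seed)

    def block() -> Block:
        W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(h)]
        U = [[rng.uniform(-0.1, 0.1) for _ in range(h)] for _ in range(h)]
        return W, U, [0.0] * h

    return {name: block() for name in ("f", "i", "o", "c")}


def lstm_step(params: Dict[str, Block], x_t: Vector, h_prev: Vector,
              c_prev: Vector) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns (h_t, c_t)."""
    f_t = [sigmoid(z) for z in affine(params["f"], x_t, h_prev)]      # forget gate
    i_t = [sigmoid(z) for z in affine(params["i"], x_t, h_prev)]      # input gate
    o_t = [sigmoid(z) for z in affine(params["o"], x_t, h_prev)]      # output gate
    c_tilde = [math.tanh(z) for z in affine(params["c"], x_t, h_prev)]  # candidate
    # c_t = f_t * c_{t-1} + i_t * c_tilde, element-wise
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t * tanh(c_t), element-wise
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def lstm_forward(params: Dict[str, Block], xs: List[Vector],
                 hidden_size: int) -> Tuple[List[Vector], Vector]:
    """Unroll the LSTM over a sequence, starting from zero states (an assumption)."""
    h_t = [0.0] * hidden_size
    c_t = [0.0] * hidden_size
    hidden_states = []
    for x_t in xs:
        h_t, c_t = lstm_step(params, x_t, h_t, c_t)
        hidden_states.append(h_t)
    return hidden_states, c_t


if __name__ == "__main__":
    params = init_params(d=3, h=4)
    sequence = [[0.5, -1.0, 2.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
    hidden_states, final_cell = lstm_forward(params, sequence, hidden_size=4)
    print(hidden_states[-1], final_cell)
```

Every entry of each returned hidden state lies in $(-1, 1)$, matching the range listed for $h_t$, while the cell state is unbounded in principle.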

## Lab
In Lab `L13b`, we will implement (and _hopefully_ train) a Long Short-Term Memory (LSTM) network constructed using [the `Flux.jl` package](https://github.com/FluxML/Flux.jl).

# Today?
That's a wrap! What are some of the interesting things we discussed today?