# LSTM Architecture

A common architecture of an LSTM unit consists of the _memory_ (also called the cell) and three _gates_: the _input_ gate, the _forget_ gate and the _output_ gate. The memory keeps track of the sequential information in the input sequence that is presented to the LSTM. If we think of the input sequence as being fed to the LSTM unit in time steps, then at each time step the input gate controls what new information is added to the memory, the forget gate controls which information is deleted from the memory, and the output gate controls how the memory affects the output activation of the LSTM unit.

Let us assume that each element of the input sequence is a vector in $\mathbf{R}^d$ and that the outputs of the LSTM are vectors in $\mathbf{R}^h$. The matrices $W_q \in \mathbf{R}^{h \times d}$ and $U_q \in \mathbf{R}^{h \times h}$ contain the weights of the input and recurrent connections, respectively; the vector $b_q \in \mathbf{R}^h$ is the bias. The subscript $q$ is one of $i$ (input gate), $f$ (forget gate), $o$ (output gate) or $c$ (memory cell).
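To make the shapes concrete, here is a minimal NumPy sketch. The helper name `init_params`, the sizes $d = 3$ and $h = 2$, and the small random initialisation are illustrative assumptions, not part of the architecture itself.

```python
import numpy as np

def init_params(d, h, seed=0):
    """Return a dict mapping each q in {i, f, o, c} to the triple (W_q, U_q, b_q)."""
    rng = np.random.default_rng(seed)
    params = {}
    for q in ("i", "f", "o", "c"):
        W = rng.normal(scale=0.1, size=(h, d))  # input weights, shape h x d
        U = rng.normal(scale=0.1, size=(h, h))  # recurrent weights, shape h x h
        b = np.zeros(h)                         # bias, length h
        params[q] = (W, U, b)
    return params

params = init_params(d=3, h=2)
W_f, U_f, b_f = params["f"]
print(W_f.shape, U_f.shape, b_f.shape)  # (2, 3) (2, 2) (2,)
```

Each of the four parameter sets has the same shapes; only the learned values differ.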

At a given time step $t$, we let $x_t$, $s_t$ and $c_t$ denote, respectively, the input, the state and the contents of the memory. The input $x_t$ is a vector in $\mathbf{R}^d$, while the state $s_t$ and the memory $c_t$ are vectors in $\mathbf{R}^h$. Here $s_t$ and $c_t$ are the values carried into step $t$; the step produces $s_{t+1}$ and $c_{t+1}$. The computation in each gate follows the same pattern: $\text{activation}(W \cdot x_t + U \cdot s_t + b)$. The activation function used depends on the gate. For instance, the forget gate uses the sigmoid function. The sigmoid maps each component to a value between $0$ and $1$, and the interpretation is that memory components whose gate values are closer to $0$ will be forgotten and those closer to $1$ will be passed on to the next time step.
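The gate pattern can be sketched in a few lines of NumPy. The function name `gate` and the toy values below are assumptions chosen so that the effect of the forget gate is easy to see; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied component-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, U, b, x_t, s_t, activation=sigmoid):
    """Shared gate pattern: activation(W . x_t + U . s_t + b)."""
    return activation(W @ x_t + U @ s_t + b)

# Toy example with d = h = 2 (hypothetical values).
W_f = np.eye(2)            # pass the input straight through
U_f = np.zeros((2, 2))     # ignore the previous state
b_f = np.zeros(2)
x_t = np.array([4.0, -4.0])
s_t = np.zeros(2)

f_t = gate(W_f, U_f, b_f, x_t, s_t)   # roughly [0.98, 0.02]
c_t = np.array([1.0, 1.0])
print(f_t * c_t)  # first component is mostly kept, second mostly forgotten
```

Passing `activation=np.tanh` to the same function gives the pattern used for the candidate memory.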


In what follows, $*$ represents component-wise multiplication. 

$$
\begin{align}
    \textbf{Forget Gate} & \\
        f_t & = \sigma(W_f \cdot x_t + U_f \cdot s_t + b_f) \\
    \textbf{Input Gate} & \\
        i_t & = \sigma(W_i \cdot x_t + U_i \cdot s_t + b_i) \\
        \tilde{c}_t & = \tanh(W_c \cdot x_t + U_c \cdot s_t + b_c) \\
    \textbf{Memory Update} & \\
        c_{t + 1} & = f_t * c_t + i_t * \tilde{c}_t \\
    \textbf{Output Gate} & \\
        o_t & = \sigma(W_o \cdot x_t + U_o \cdot s_t + b_o) \\
        s_{t + 1} & = o_t * \tanh(c_{t + 1})
\end{align}
$$
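
The equations above translate almost line by line into code. The following is a minimal NumPy sketch of a single time step, not a reference implementation: the function name `lstm_step`, the parameter dictionary layout, the sizes and the random initialisation are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied component-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(params, x_t, s_t, c_t):
    """One LSTM time step; returns (s_{t+1}, c_{t+1})."""
    W_f, U_f, b_f = params["f"]
    W_i, U_i, b_i = params["i"]
    W_c, U_c, b_c = params["c"]
    W_o, U_o, b_o = params["o"]

    f_t = sigmoid(W_f @ x_t + U_f @ s_t + b_f)      # forget gate
    i_t = sigmoid(W_i @ x_t + U_i @ s_t + b_i)      # input gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ s_t + b_c)  # candidate memory
    c_next = f_t * c_t + i_t * c_tilde              # memory update (component-wise)
    o_t = sigmoid(W_o @ x_t + U_o @ s_t + b_o)      # output gate
    s_next = o_t * np.tanh(c_next)                  # new state
    return s_next, c_next

# Hypothetical sizes and randomly initialised parameters.
d, h = 3, 2
rng = np.random.default_rng(0)
params = {
    q: (rng.normal(scale=0.1, size=(h, d)),
        rng.normal(scale=0.1, size=(h, h)),
        np.zeros(h))
    for q in ("i", "f", "o", "c")
}

# Feed a sequence of five random inputs, starting from zero state and memory.
s, c = np.zeros(h), np.zeros(h)
for x in rng.normal(size=(5, d)):
    s, c = lstm_step(params, x, s, c)
print(s.shape, c.shape)  # (2,) (2,)
```

Because $o_t$ lies in $(0, 1)$ and $\tanh$ lies in $(-1, 1)$, every component of the state stays strictly between $-1$ and $1$, whereas the memory $c_t$ is not bounded in this way.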