### Long Short-Term Memory (LSTM)

> - Core idea: carry the cell state straight through time, modified only by elementwise gating rather than a repeated weight-matrix transformation
>   - Addresses the long-term dependency problem

> #### Long short-term memory
> - $i$: input gate, whether to write to the cell
> - $f$: forget gate, whether to erase the cell
> - $o$: output gate, how much of the cell to reveal
> - $g$: gate gate (candidate values), how much to write to the cell
> - $\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$
> - $c_t = f \odot c_{t-1} + i \odot g$
> - $h_t = o \odot \tanh(c_t)$
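
The stacked formulation above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `lstm_step`, the row ordering of `W` (i, f, o, g), the omission of biases (matching the formula), and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step in stacked form. W has shape (4H, H + D); no bias, as in the formula."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t])  # all four pre-activations, shape (4H,)
    i = sigmoid(a[0:H])          # input gate
    f = sigmoid(a[H:2 * H])      # forget gate
    o = sigmoid(a[2 * H:3 * H])  # output gate
    g = np.tanh(a[3 * H:4 * H])  # candidate values ("gate gate")
    c_t = f * c_prev + i * g     # elementwise products
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy sizes (assumed): hidden size H = 3, input size D = 2.
rng = np.random.default_rng(0)
H, D = 3, 2
W = rng.normal(scale=0.5, size=(4 * H, H + D))
h_t, c_t = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W)
print(h_t.shape, c_t.shape)
```

Computing all four pre-activations with a single matrix product is only a packing choice; slicing the result into four blocks of size `H` recovers the individual gates.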

> #### Forget gate
> - Controls how much of the previous cell state $C_{t-1}$ is kept: values near 1 retain it, values near 0 erase it
> - $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

> #### Input gate, Gate gate
> - Generate candidate information to add, and scale it with the input gate
> - $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
> - $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
> - Form the new cell state by adding the gated candidate to the gated previous cell state
> - $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
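
A small worked example may help show how the two gates trade off in the cell-state update. The helper name `cell_update` and the numeric values are hypothetical, chosen only to make the arithmetic easy to follow; the products are elementwise.

```python
import numpy as np

def cell_update(f_t, i_t, C_prev, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t, with elementwise products."""
    return f_t * C_prev + i_t * C_tilde

C_prev = np.array([2.0, -1.0])   # previous cell state (made-up values)
C_tilde = np.array([0.5, 0.5])   # candidate values (made-up values)

# f = 1, i = 0: the old cell state is carried over unchanged.
keep = cell_update(np.ones(2), np.zeros(2), C_prev, C_tilde)

# f = 0, i = 1: the old state is erased and replaced by the candidate.
overwrite = cell_update(np.zeros(2), np.ones(2), C_prev, C_tilde)

# f = i = 0.5: an equal blend of old state and candidate.
blend = cell_update(np.full(2, 0.5), np.full(2, 0.5), C_prev, C_tilde)

print(keep, overwrite, blend)
```

In a trained network the gates are produced by sigmoids, so they only approach 0 and 1; the saturated cases here are limiting behaviour.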

> #### Output gate
> - Compute the hidden state by applying tanh to the cell state and scaling the result with the output gate
> - Pass this hidden state to the next time step, and also to the output or the next layer if needed
> - $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
> - $h_t = o_t \odot \tanh(C_t)$
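
Putting the per-gate equations together, the following sketch unrolls an LSTM over a short sequence and collects every hidden state, which is what would be handed to an output or a next layer. The function name `lstm_forward`, the parameter dictionary layout, the zero initial state, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(xs, p, H):
    """Unroll an LSTM over xs (shape (T, D)); returns all hidden states, shape (T, H)."""
    h, C = np.zeros(H), np.zeros(H)  # assumed zero initial state
    hs = []
    for x_t in xs:
        hx = np.concatenate([h, x_t])                # [h_{t-1}, x_t]
        f_t = sigmoid(p["W_f"] @ hx + p["b_f"])      # forget gate
        i_t = sigmoid(p["W_i"] @ hx + p["b_i"])      # input gate
        C_tilde = np.tanh(p["W_C"] @ hx + p["b_C"])  # candidate values
        C = f_t * C + i_t * C_tilde                  # new cell state
        o_t = sigmoid(p["W_o"] @ hx + p["b_o"])      # output gate
        h = o_t * np.tanh(C)                         # new hidden state
        hs.append(h)                                 # kept for an output or next layer
    return np.stack(hs)

# Toy sizes (assumed): T = 6 steps, input size D = 2, hidden size H = 4.
rng = np.random.default_rng(0)
T, D, H = 6, 2, 4
p = {}
for name in ["f", "i", "C", "o"]:
    p["W_" + name] = rng.normal(scale=0.5, size=(H, H + D))
    p["b_" + name] = np.zeros(H)
hs = lstm_forward(rng.normal(size=(T, D)), p, H)
print(hs.shape)
```

Because $h_t$ is a sigmoid output times a tanh output, every entry of `hs` lies strictly between -1 and 1, while the cell state itself is not bounded in this way.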

### Gated Recurrent Unit (GRU)

> #### What is GRU?
> - Update gate: $z_t = \sigma (W_z \cdot [h_{t-1}, x_t])$
> - Reset gate: $r_t = \sigma (W_r \cdot [h_{t-1}, x_t])$
> - Candidate state: $\tilde{h}_t = \tanh (W \cdot [r_t \odot h_{t-1}, x_t])$
> - New hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
> - cf. $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ in LSTM: $1 - z_t$ plays the role of $f_t$ and $z_t$ the role of $i_t$
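
The GRU equations above can be sketched in the same style. The function name `gru_step`, the weight shapes, and the toy sizes are assumptions; biases are omitted because the equations above omit them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step; each weight matrix has shape (H, H + D), no biases."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)  # update gate
    r_t = sigmoid(W_r @ hx)  # reset gate
    # The reset gate scales h_{t-1} before it enters the candidate.
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    # Interpolate between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy sizes (assumed): hidden size H = 3, input size D = 2.
rng = np.random.default_rng(0)
H, D = 3, 2
W_z, W_r, W = (rng.normal(scale=0.5, size=(H, H + D)) for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)
```

Unlike the LSTM, there is no separate cell state: a single gate $z_t$ decides both how much of the old state to keep and how much of the candidate to write. Some references and libraries swap the roles of $z_t$ and $1 - z_t$; the two conventions are equivalent up to relabeling the gate.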

### Summary of RNN/LSTM/GRU

> - RNNs allow a lot of **flexibility** in architecture design
> - Vanilla RNNs are **simple** but struggle to learn long-range dependencies
> - Backward flow of gradients in RNN can **explode or vanish**
> - Common to use LSTM or GRU: their additive interactions **improve gradient flow**
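
The gradient-flow point in the summary can be illustrated numerically. The sketch below is a simplification, not a proof: it treats the vanilla RNN as linear (the tanh derivative is ignored), uses a scaled orthogonal recurrent matrix so the growth rate is exact, and follows only the direct cell-state path of the LSTM with a constant forget gate of 0.99. All names and numbers are assumptions for the demonstration.

```python
import numpy as np

def backprop_norm_linear_rnn(W, T):
    """Norm of a unit gradient after T backward steps through h_t = W h_{t-1}."""
    g = np.zeros(W.shape[0])
    g[0] = 1.0
    for _ in range(T):
        g = W.T @ g  # each backward step multiplies by W^T
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # orthogonal: all singular values are 1
T = 50

vanish = backprop_norm_linear_rnn(0.9 * Q, T)   # shrinks like 0.9 ** T
explode = backprop_norm_linear_rnn(1.1 * Q, T)  # grows like 1.1 ** T

# LSTM cell-state path: d c_t / d c_{t-1} = f_t (elementwise),
# so the gradient along this path is a product of forget-gate values.
f = np.full((T, 4), 0.99)        # assumed: forget gate stays near 1
lstm_path = np.prod(f, axis=0)   # 0.99 ** T for each unit

print(vanish, explode, lstm_path[0])
```

With these settings the vanilla path shrinks or grows geometrically with the singular values of the recurrent matrix, whereas the cell-state path decays only as fast as the forget gate allows. If the forget gate is driven toward 0, the LSTM gradient along this path vanishes as well.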