# 6. Long Short-Term Memory (LSTM)

In [1]:
import torch 
import torch.nn as nn
from torch.utils import data
from torch.nn import functional as F

import re
import collections

import math
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
from d2l import torch as d2l

Recall the structure of a **gated recurrent unit**:

$$\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1}  + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t$$

![](http://d2l.ai/_images/gru-3.svg)
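The GRU update above is an elementwise interpolation between the old hidden state and the candidate. A minimal sketch, assuming hand-picked tensors and a hypothetical helper `gru_update` (neither is part of the original notebook), shows the two extreme cases:

```python
import torch

# Hand-picked illustrative values (assumptions, not from the notebook)
H_prev = torch.tensor([[1.0, -1.0, 0.5]])   # H_{t-1}
H_tilde = torch.tensor([[0.0, 0.2, -0.5]])  # candidate hidden state

def gru_update(Z_t, H_prev, H_tilde):
    """Hypothetical helper: H_t = Z_t * H_{t-1} + (1 - Z_t) * H~_t, elementwise."""
    return Z_t * H_prev + (1 - Z_t) * H_tilde

H_keep = gru_update(torch.ones(1, 3), H_prev, H_tilde)   # Z_t = 1: keep old state
H_new = gru_update(torch.zeros(1, 3), H_prev, H_tilde)   # Z_t = 0: take candidate

print(torch.equal(H_keep, H_prev))  # True
print(torch.equal(H_new, H_tilde))  # True
```

The LSTM below replaces this single update gate with separate forget and input gates, so keeping old content and writing new content are controlled independently.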

## Gated Memory Cell

There are **3 gates** in a gated memory cell:

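1. **input gate**: decides how much of the new candidate value is written into the memory cell
2. **output gate**: decides how much of the memory cell is read out into the hidden state
3. **forget gate**: decides how much of the previous memory cell is kept (or reset)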

![](http://d2l.ai/_images/lstm-0.svg)

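Given $h$ **hidden units**, a **mini-batch** input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (with batch size $n$ and number of inputs $d$) and the previous hidden state $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$, we have the following: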

$$
\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i)\\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f)\\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o)
\end{aligned}
$$

where $\mathbf{I}_t \in \mathbb{R}^{n \times h}$ is the **input gate**, $\mathbf{F}_t \in \mathbb{R}^{n \times h}$ is the **forget gate** and $\mathbf{O}_t \in \mathbb{R}^{n \times h}$ is the **output gate**.

Because $\sigma$ is the sigmoid function, the values of these gates lie in the **range of (0,1)**.
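The three gate equations can be sketched directly in PyTorch. The sizes, the random initialization, and the helper `gate_params` below are illustrative assumptions, not part of the original notebook:

```python
import torch

torch.manual_seed(0)
n, d, h = 4, 5, 3  # batch size, number of inputs, number of hidden units (assumed)

X_t = torch.randn(n, d)     # mini-batch input at time step t
H_prev = torch.randn(n, h)  # hidden state from time step t-1

def gate_params():
    """Hypothetical helper: one (W_x, W_h, b) triple with small random weights."""
    return torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(h)

W_xi, W_hi, b_i = gate_params()  # input gate
W_xf, W_hf, b_f = gate_params()  # forget gate
W_xo, W_ho, b_o = gate_params()  # output gate

I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)
F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)
O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)

# Each gate has shape (n, h)
print(I_t.shape, F_t.shape, O_t.shape)
```

The bias vectors of shape `(h,)` are broadcast across the batch dimension, which matches the row-wise addition of $\mathbf{b}$ in the equations.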

## Candidate Memory Cell

The **candidate memory cell** $\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}$ is given as:

$$\tilde{\mathbf{C}}_t = \text{tanh}(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c)$$

![](http://d2l.ai/_images/lstm-1.svg)

The **activation** used is the $\tanh$ function; therefore, the values lie within the **range of (-1,1)**.
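The candidate memory cell has the same form as a gate but uses $\tanh$ instead of the sigmoid. A minimal sketch, assuming illustrative sizes and randomly initialized parameters:

```python
import torch

torch.manual_seed(0)
n, d, h = 4, 5, 3  # batch size, number of inputs, number of hidden units (assumed)

X_t = torch.randn(n, d)     # mini-batch input at time step t
H_prev = torch.randn(n, h)  # hidden state from time step t-1

# Randomly initialized parameters of the candidate memory cell (assumed values)
W_xc = torch.randn(d, h) * 0.1
W_hc = torch.randn(h, h) * 0.1
b_c = torch.zeros(h)

C_tilde = torch.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)

print(C_tilde.shape)  # torch.Size([4, 3])
```

Unlike the gates, the entries of `C_tilde` can be negative, so the candidate can both increase and decrease the values stored in the memory cell.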

## Memory Cell

The **input gate** $\mathbf{I}_t$ governs how much we take **new data** into account via $\tilde{\mathbf{C}}_t$.

The **forget gate** $\mathbf{F}_t$ addresses how much of the **old cell internal state** we retain via $\mathbf{C}_{t-1}$.

Thus, the **memory cell** at the current time step is given as:

$$\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t$$

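When $\mathbf{F}_t=1$ and $\mathbf{I}_t=0$, the information stored in $\mathbf{C}_{t-1}$ is passed to $\mathbf{C}_{t}$ unchanged. Conversely, when $\mathbf{F}_t=0$ and $\mathbf{I}_t=1$, the old state is discarded and $\mathbf{C}_{t}$ is set to the candidate $\tilde{\mathbf{C}}_t$.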

This design alleviates the **vanishing gradient** problem and helps capture **long-term dependencies** in the sequence.

![](http://d2l.ai/_images/lstm-2.svg)
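The two extreme gate settings, and the effect of the forget gate on gradients, can be checked numerically. This is a simplified sketch: the helper names are hypothetical, the gates are fixed constants rather than learned, and the gradient path through the hidden state is ignored.

```python
import torch

torch.manual_seed(0)
n, h = 2, 3  # illustrative sizes (assumed)
C_prev = torch.randn(n, h)               # C_{t-1}
C_tilde = torch.tanh(torch.randn(n, h))  # candidate memory cell

def memory_cell(F_t, I_t, C_prev, C_tilde):
    """Hypothetical helper: C_t = F_t * C_{t-1} + I_t * C~_t, elementwise."""
    return F_t * C_prev + I_t * C_tilde

ones, zeros = torch.ones(n, h), torch.zeros(n, h)
C_keep = memory_cell(ones, zeros, C_prev, C_tilde)       # F=1, I=0
C_overwrite = memory_cell(zeros, ones, C_prev, C_tilde)  # F=0, I=1
print(torch.equal(C_keep, C_prev))        # True
print(torch.equal(C_overwrite, C_tilde))  # True

def grad_wrt_initial_cell(f, num_steps=50):
    """Gradient of C_T w.r.t. C_0 with constant F_t = f and I_t = 0."""
    C_0 = torch.ones(1, requires_grad=True)
    C_t = C_0
    for _ in range(num_steps):
        C_t = f * C_t  # memory cell update with nothing written
    C_t.sum().backward()
    return C_0.grad.item()

# The gradient equals f ** num_steps: it survives when f is close to 1
print(grad_wrt_initial_cell(0.99))
print(grad_wrt_initial_cell(0.5))
```

Under these assumptions the gradient along the cell path is a product of forget-gate values, so a forget gate near 1 lets it flow across many time steps, whereas smaller values shrink it quickly.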

## Hidden State

Finally, the **output gate** is used to obtain the **hidden state** $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ at the current time step: 

$$\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t)$$

The values of the hidden state lie within the **range (-1,1)** because $\tanh(\mathbf{C}_t) \in (-1,1)$ is scaled elementwise by $\mathbf{O}_t \in (0,1)$.

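When $\mathbf{O}_t$ is close to 1, we pass all the information in the **memory cell** to the **prediction** at time step $t$. On the other hand, when $\mathbf{O}_t$ is close to 0, the hidden state $\mathbf{H}_t$ is close to 0: the information is retained only within the memory cell and is not exposed to the prediction at this time step.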

![](http://d2l.ai/_images/lstm-3.svg)
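Putting the four equations together gives one full LSTM time step. The sketch below borrows the parameters of PyTorch's `nn.LSTMCell` so that the result can be compared with the library's output. The sizes and the helper `lstm_step` are assumptions for illustration. PyTorch stores the parameters transposed and stacked in the order input, forget, candidate, output, and keeps two bias vectors that are simply added.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, h = 4, 5, 3  # batch size, number of inputs, number of hidden units (assumed)
cell = nn.LSTMCell(d, h)

def lstm_step(X_t, H_prev, C_prev, cell):
    """Hypothetical helper: one LSTM step written out from the equations."""
    W_x = cell.weight_ih.T           # (d, 4h): [W_xi | W_xf | W_xc | W_xo]
    W_h = cell.weight_hh.T           # (h, 4h): [W_hi | W_hf | W_hc | W_ho]
    b = cell.bias_ih + cell.bias_hh  # (4h,)
    pre = X_t @ W_x + H_prev @ W_h + b
    i, f, c, o = pre.chunk(4, dim=1)
    I_t, F_t, O_t = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    C_tilde = torch.tanh(c)
    C_t = F_t * C_prev + I_t * C_tilde  # memory cell
    H_t = O_t * torch.tanh(C_t)         # hidden state
    return H_t, C_t

X_t = torch.randn(n, d)
H_prev, C_prev = torch.randn(n, h), torch.randn(n, h)

with torch.no_grad():
    H_t, C_t = lstm_step(X_t, H_prev, C_prev, cell)
    H_ref, C_ref = cell(X_t, (H_prev, C_prev))

print(torch.allclose(H_t, H_ref, atol=1e-5))  # True
print(torch.allclose(C_t, C_ref, atol=1e-5))  # True
```

Note that $\mathbf{C}_t$ itself is not bounded by 1 in magnitude; only the hidden state is, because of the $\tanh$ and the output gate.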

## Implementing LSTM

In [None]:
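# Assumed setup: these names are not defined earlier in this notebook,
# so the dataset and hyperparameter values below are illustrative choices
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1

num_inputs = vocab_size  # tokens are one-hot encoded, so input size = vocabulary size
lstm_layer = nn.LSTM(num_inputs, num_hiddens)

model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)

d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)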
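To see what `nn.LSTM` returns without running the full training loop, the sketch below feeds a random sequence through a single layer and inspects the shapes. All sizes are small illustrative assumptions, and the default layout `(num_steps, batch_size, num_inputs)` is used, i.e. `batch_first=False`.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Small illustrative sizes (assumed, not the notebook's training settings)
num_steps, batch_size, num_inputs, num_hiddens = 7, 2, 28, 16

lstm_layer = nn.LSTM(num_inputs, num_hiddens)

X = torch.rand(num_steps, batch_size, num_inputs)
# Initial state: a (hidden state, memory cell) pair, one slice per layer
H_0 = torch.zeros(1, batch_size, num_hiddens)
C_0 = torch.zeros(1, batch_size, num_hiddens)

Y, (H_T, C_T) = lstm_layer(X, (H_0, C_0))

print(Y.shape)    # torch.Size([7, 2, 16]): hidden state at every time step
print(H_T.shape)  # torch.Size([1, 2, 16]): hidden state at the last time step
print(C_T.shape)  # torch.Size([1, 2, 16]): memory cell at the last time step
```

For a single-layer, unidirectional LSTM, the last slice of `Y` coincides with `H_T`. The memory cell is returned only as part of the state and is never passed to the output layer directly.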