### Simple LSTM model using PyTorch

Before we jump into a project with a full dataset, let's take a look at how the PyTorch LSTM layer works in practice by inspecting its outputs. We don't need to instantiate a model to see how the layer works. (You don't need a GPU for this portion.)

In [1]:
import torch
import torch.nn as nn

Just like the other kinds of layers, we can instantiate an LSTM layer and provide it with the necessary arguments. The full list of accepted arguments can be found in the PyTorch `nn.LSTM` documentation. In this example, we will only define the input dimension, the hidden dimension and the number of layers.

 - Input dimension - represents the size of the input at each time step
    * e.g. input of dimension 5 will look like this [1, 3, 8, 2, 3]
 - Hidden dimension - represents the size of the hidden state and cell state at each time step
    * e.g. with a hidden dimension of 3, the hidden state and cell state at each time step are each a vector of 3 values, such as [3, 5, 4]
 - Number of layers - the number of LSTM layers stacked on top of each other

In [2]:
input_dim = 5
hidden_dim = 10
n_layers = 1

lstm_layer = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)
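
To make the three arguments more concrete, it can help to look at the parameters the layer creates. The snippet below is a small sketch that rebuilds the same layer and prints the shape of each parameter tensor. The variable name `param_shapes` is only illustrative. The leading dimension of 40 comes from PyTorch stacking the weights of the four gates (4 × hidden dimension) into a single tensor.

```python
import torch.nn as nn

input_dim, hidden_dim, n_layers = 5, 10, 1
lstm_layer = nn.LSTM(input_dim, hidden_dim, n_layers, batch_first=True)

# Collect the shape of every parameter in the layer.
# The weights of the four gates are stacked along dim 0,
# so the leading dimension is 4 * hidden_dim.
param_shapes = {name: tuple(p.shape) for name, p in lstm_layer.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)
```

With `n_layers` greater than 1, an additional set of parameters with the suffix `_l1`, `_l2`, and so on is created for each stacked layer.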

Let's create some dummy data to see how the layer takes in the input. As our input dimension is 5, we have to create a tensor of the shape (1, 1, 5) which represents (batch size, sequence length, input dimension).

Additionally, we'll have to initialize a hidden state and cell state for the LSTM, as this is the first time step. The hidden state and cell state are stored in a tuple with the format (hidden_state, cell_state), and each has the shape (number of layers, batch size, hidden dimension).

In [3]:
batch_size = 1
seq_len = 1

inp = torch.randn(batch_size, seq_len, input_dim)
hidden_state = torch.randn(n_layers, batch_size, hidden_dim)
cell_state = torch.randn(n_layers, batch_size, hidden_dim)
hidden = (hidden_state, cell_state)
print("Input shape: {}".format(inp.shape))
print("Hidden shape: ({}, {})".format(hidden[0].shape, hidden[1].shape))

Input shape: torch.Size([1, 1, 5])
Hidden shape: (torch.Size([1, 1, 10]), torch.Size([1, 1, 10]))
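
The initial states above are random purely for illustration. If no tuple is passed, `nn.LSTM` starts from hidden and cell states filled with zeros. The following sketch checks this under a fixed seed; the names `out_default`, `out_zeros` and `same` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_layer = nn.LSTM(5, 10, 1, batch_first=True)
inp = torch.randn(1, 1, 5)  # (batch size, sequence length, input dimension)

# Explicit all-zero states with shape (n_layers, batch size, hidden dimension).
h0 = torch.zeros(1, 1, 10)
c0 = torch.zeros(1, 1, 10)

out_default, _ = lstm_layer(inp)          # no initial states supplied
out_zeros, _ = lstm_layer(inp, (h0, c0))  # zero states supplied explicitly

same = torch.allclose(out_default, out_zeros)
print(same)
```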


Next, we’ll feed the input and hidden states and see what we’ll get back from it.

In [4]:
out, hidden = lstm_layer(inp, hidden)
print("Output shape: ", out.shape)
print("Hidden: ", hidden)

Output shape:  torch.Size([1, 1, 10])
Hidden:  (tensor([[[ 0.1268,  0.3301,  0.0041,  0.2489, -0.2372, -0.0246, -0.5380,
          -0.0507, -0.1683,  0.1378]]], grad_fn=<StackBackward0>), tensor([[[ 0.2830,  0.8670,  0.0084,  0.4287, -0.3780, -0.0677, -0.8299,
          -0.0996, -0.4272,  0.4779]]], grad_fn=<StackBackward0>))


In the process above, we saw how the LSTM cell processes the input and hidden states at a single time step. In most cases, however, we'll be processing the input data as longer sequences. The LSTM can take in sequences of variable length and produces an output at each time step. Let's try changing the sequence length this time.

In [5]:
seq_len = 3
inp = torch.randn(batch_size, seq_len, input_dim)
out, hidden = lstm_layer(inp, hidden)
print(out.shape)

torch.Size([1, 3, 10])
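
To see that a whole sequence is processed one time step at a time, the sketch below compares two ways of running the same input: once as a full sequence, and once step by step while carrying the returned states forward. It also checks that, for a single-direction LSTM, the last time step of the output matches the final hidden state of the last layer. The variable names are illustrative, and small floating-point differences are absorbed by a tolerance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_layer = nn.LSTM(5, 10, 1, batch_first=True)
inp = torch.randn(1, 3, 5)  # one sequence of length 3

with torch.no_grad():
    # Process all three time steps in one call.
    out_full, (h_full, c_full) = lstm_layer(inp)

    # Process one time step per call, feeding the states back in.
    hidden = None  # None means "start from zero states"
    step_outputs = []
    for t in range(inp.size(1)):
        out_t, hidden = lstm_layer(inp[:, t:t + 1, :], hidden)
        step_outputs.append(out_t)
    out_steps = torch.cat(step_outputs, dim=1)

matches_stepwise = torch.allclose(out_full, out_steps, atol=1e-6)
last_is_hidden = torch.allclose(out_full[:, -1, :], h_full[-1], atol=1e-6)
print(matches_stepwise, last_is_hidden)
```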


This time, the output's 2nd dimension is 3, indicating that there were 3 outputs given by the LSTM. This corresponds to the length of our input sequence. For the use cases where we'll need an output at every time step, such as Text Generation, the output of each time step can be extracted directly from the 2nd dimension and fed into a fully connected layer. For text classification tasks, such as Sentiment Analysis, the last output can be taken to be fed into a classifier.

In [6]:
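# Obtain the last output in the sequence.
# squeeze() drops the batch dimension, which only works because batch_size is 1;
# for larger batches, out[:, -1, :] selects the last time step of every sequence.
out = out.squeeze()[-1, :]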
print(out.shape)

torch.Size([10])
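
As a rough illustration of the classification case, the sketch below feeds the last output into a fully connected layer. The class name `LastStepClassifier`, the number of classes and the batch size of 4 are assumptions made for this example, not part of the walkthrough above. Indexing with `out[:, -1, :]` keeps the batch dimension, so it also works when the batch size is larger than 1.

```python
import torch
import torch.nn as nn

class LastStepClassifier(nn.Module):
    """Hypothetical example: LSTM followed by a linear layer on the last output."""

    def __init__(self, input_dim=5, hidden_dim=10, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)   # (batch, seq_len, hidden_dim)
        last = out[:, -1, :]    # last time step of each sequence: (batch, hidden_dim)
        return self.fc(last)    # (batch, n_classes)

model = LastStepClassifier()
batch = torch.randn(4, 3, 5)  # 4 sequences, 3 time steps, 5 features each
logits = model(batch)
print(logits.shape)
```

For a task that needs an output at every time step, the same linear layer could be applied to `out` directly instead of only to the last step.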


## LSTM Cell Equations
### Input Gate
***i<sub>1</sub> = σ(W<sub>i1</sub> ⋅ (H<sub>t−1</sub>, x<sub>t</sub>) + bias<sub>i1</sub>)***

***i<sub>2</sub> = tanh(W<sub>i2</sub> ⋅ (H<sub>t−1</sub>, x<sub>t</sub>) + bias<sub>i2</sub>)***

***i<sub>input</sub> = i<sub>1</sub> ∗ i<sub>2</sub>***

### Forget Gate
***f = σ(W<sub>forget</sub> ⋅ (H<sub>t−1</sub>, x<sub>t</sub>) + bias<sub>forget</sub>)***

***C<sub>t</sub> = C<sub>t−1</sub> ∗ f + i<sub>input</sub>***

### Output Gate
***O<sub>1</sub> = σ(W<sub>output1</sub> ⋅ (H<sub>t−1</sub>, x<sub>t</sub>) + bias<sub>output1</sub>)***

***O<sub>2</sub> = tanh(W<sub>output2</sub> ⋅ C<sub>t</sub> + bias<sub>output2</sub>)***
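
The gate computations can be checked against PyTorch with a single manual step. The sketch below uses `nn.LSTMCell` and reuses its weights, so it follows PyTorch's formulation rather than the equations exactly as written above: PyTorch keeps separate weight matrices for x<sub>t</sub> and H<sub>t−1</sub> (equivalent to one matrix applied to their concatenation), applies tanh directly to C<sub>t</sub> without an extra weight or bias, and takes the new hidden state to be the product of the sigmoid output gate and tanh(C<sub>t</sub>). The variable names mirror the symbols above and are otherwise illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim = 5, 10
cell = nn.LSTMCell(input_dim, hidden_dim)

x_t = torch.randn(1, input_dim)
h_prev = torch.randn(1, hidden_dim)
c_prev = torch.randn(1, hidden_dim)

with torch.no_grad():
    # Pre-activations of all four gates, stacked as (input, forget, candidate, output).
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i1, f, i2, o1 = gates.chunk(4, dim=1)

    i1 = torch.sigmoid(i1)        # input gate (sigmoid part)
    i2 = torch.tanh(i2)           # input gate (tanh candidate part)
    f = torch.sigmoid(f)          # forget gate
    o1 = torch.sigmoid(o1)        # output gate

    c_t = c_prev * f + i1 * i2    # new cell state
    h_t = o1 * torch.tanh(c_t)    # new hidden state (PyTorch's formulation)

    # Reference result from the built-in cell.
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))

cell_matches = torch.allclose(c_t, c_ref, atol=1e-5)
hidden_matches = torch.allclose(h_t, h_ref, atol=1e-5)
print(cell_matches, hidden_matches)
```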
