In [1]:
%matplotlib inline

In [2]:
# Author: sanketvmehta
# Base code from: http://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html

In [3]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Introduction to Multi-layer LSTM in PyTorch
======================================

![LSTM Architecture](lstm_architecture.jpg)

### Define the network

In [4]:
lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2, bias=True, 
               batch_first=False, dropout=0, bidirectional=False)

#### Args:
- **input_size:** The number of expected features in the input $x_t$ (_6 in our case_)    
- **hidden_size:** The number of features in the hidden state $h_t$ (_4 in our case_)    
- **num_layers:** Number of recurrent layers (_2 in our case_)    
- **bias:** If _False_, then the layer does not use bias weights $b_{ih}$ and $b_{hh}$. Default: _True_    
- **batch_first:** If _True_, then the input and output tensors are provided as (_batch, seq, feature_). Default: _False_    
- **dropout:** If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer    
- **bidirectional:** If _True_, becomes a bidirectional RNN. Default: _False_

### Initialize the hidden state 

In [5]:
h0 = autograd.Variable(torch.randn(2, 1, 4))

**h0 (num_layers * num_directions, batch, hidden_size):** tensor containing the initial hidden state for each element in the batch. 

### Initialize the cell state

In [6]:
c0 = autograd.Variable(torch.randn(2, 1, 4))

**c0 (num_layers * num_directions, batch, hidden_size):** tensor containing the initial cell state for each element in the batch.
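
As a side note on initial states, here is a minimal sketch, assuming a recent PyTorch version where plain tensors can be passed directly. When the `(h0, c0)` tuple is omitted, `nn.LSTM` starts from all-zero hidden and cell states, so passing explicit zero tensors of the shape described above should give the same result. The variable names below are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random weights and inputs repeatable

lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2)
x = torch.randn(5, 1, 6)  # (seq_len, batch, input_size)

# Explicit zero states: (num_layers * num_directions, batch, hidden_size)
h0 = torch.zeros(2, 1, 4)
c0 = torch.zeros(2, 1, 4)

out_explicit, _ = lstm(x, (h0, c0))
out_default, _ = lstm(x)  # states omitted, so they default to zeros

same = torch.allclose(out_explicit, out_default)
print("Explicit zeros match the default:", same)
```

Random initial states, as used in this notebook, are handy for illustration because they make it obvious that each layer has its own state.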

### Define the input for the network

In [7]:
input = autograd.Variable(torch.randn(5, 1, 6)) 

**input (seq_len, batch, input_size):** tensor containing the features of the input sequence, where the first dimension is the length of the sequence (_5 in our case_), the second dimension is the size of the batch (_1 in our case_) and the third dimension is the number of features in the input (_6 in our case_).

**NOTE:** The dimensions are in this order because "batch_first" is set to _False_ in our network definition.
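
For comparison, the following sketch shows the same network with "batch_first" set to _True_. It assumes a recent PyTorch version and uses the illustrative name `lstm_bf`. Only the input and output tensors change their layout; the hidden and cell states keep the (num_layers * num_directions, batch, hidden_size) layout.

```python
import torch
import torch.nn as nn

lstm_bf = nn.LSTM(input_size=6, hidden_size=4, num_layers=2, batch_first=True)

x = torch.randn(1, 5, 6)   # (batch, seq_len, input_size)
h0 = torch.randn(2, 1, 4)  # states are NOT batch-first
c0 = torch.randn(2, 1, 4)

output, (hn, cn) = lstm_bf(x, (h0, c0))

print("Output size: ", output.size())  # (batch, seq_len, hidden_size)
print("hn size: ", hn.size())          # (num_layers, batch, hidden_size)
```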

### Understanding the output from Multi-Layer LSTM in PyTorch

In [8]:
output, (hn, cn) = lstm(input, (h0, c0))

print("Output size: ", output.size())
print("hn size: ", hn.size())
print("cn size: ", cn.size())

Output size:  torch.Size([5, 1, 4])
hn size:  torch.Size([2, 1, 4])
cn size:  torch.Size([2, 1, 4])


- **output (seq_len, batch, hidden_size * num_directions):** tensor containing the output features ($h_t$) 
from the LAST LAYER of the RNN, for each t (so seq_len slices in total). In other words, it holds the hidden 
state of the last layer at every time step of the sequence.
- **hn (num_layers * num_directions, batch, hidden_size):** tensor containing the hidden state for t=seq_len.
- **cn (num_layers * num_directions, batch, hidden_size):** tensor containing the cell state for t=seq_len.
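
The num_directions factor in the shapes above is 1 for the network in this notebook. As a rough illustration of where it matters, the sketch below builds a bidirectional variant (named `bi_lstm` here purely for illustration) and prints the resulting shapes, assuming a recent PyTorch version.

```python
import torch
import torch.nn as nn

bi_lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2, bidirectional=True)
x = torch.randn(5, 1, 6)  # (seq_len, batch, input_size)

output, (hn, cn) = bi_lstm(x)

# hidden_size * num_directions = 4 * 2 features per time step
print("Output size: ", output.size())
# num_layers * num_directions = 2 * 2 state slices
print("hn size: ", hn.size())
print("cn size: ", cn.size())
```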

### Comparing the last slice of "output" with "hidden state" from the last layer

In [9]:
print("The last slice of the output: ", output.data[4])
print("Hidden state from last layer: ", hn.data[1])

The last slice of the output:  
-0.1165 -0.1405  0.0563  0.1008
[torch.FloatTensor of size 1x4]

Hidden state from last layer:  
-0.1165 -0.1405  0.0563  0.1008
[torch.FloatTensor of size 1x4]



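Comparing the last slice of "output" with the hidden state from the last layer, i.e., hn[num_layers-1], above, they are the same. The reason is that "output" holds the hidden state of the last layer at every time step, so its final slice is that layer's hidden state at t=seq_len, which is exactly what "hn" stores for that layer. The two serve different purposes: "output" gives access to all hidden states of the last layer across the sequence, while "hn" (together with "cn") allows one to continue the sequence and backpropagate, by passing it as an argument to the lstm at a later time.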
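
The idea of continuing a sequence with the returned state can be sketched as follows, assuming a recent PyTorch version and no dropout. The sequence is processed once in full and once in two chunks, with the state returned by the first chunk passed in as the initial state of the second. The split point (3) and the variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2)
x = torch.randn(5, 1, 6)
h0 = torch.randn(2, 1, 4)
c0 = torch.randn(2, 1, 4)

# Process the whole sequence in one call
full_out, (full_hn, full_cn) = lstm(x, (h0, c0))

# Process the first 3 steps, then continue with the remaining 2 steps
out_a, state = lstm(x[:3], (h0, c0))
out_b, (hn_b, cn_b) = lstm(x[3:], state)
chunked_out = torch.cat([out_a, out_b], dim=0)

outputs_match = torch.allclose(full_out, chunked_out, atol=1e-6)
last_slice_matches = torch.allclose(full_out[-1], full_hn[-1])

print("Chunked output matches full output:", outputs_match)
print("Last output slice matches hn[-1]:", last_slice_matches)
```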

### Comparing the last slice of "output" with "hidden state" from the second last layer

In [10]:
print("The last slice of the output: ", output.data[4])
print("Hidden state from second last layer: ", hn.data[0])

The last slice of the output:  
-0.1165 -0.1405  0.0563  0.1008
[torch.FloatTensor of size 1x4]

Hidden state from second last layer:  
 0.1357  0.4084  0.3309  0.0066
[torch.FloatTensor of size 1x4]



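Comparing the last slice of "output" with the hidden state from the second last layer, i.e., hn[num_layers-2], above, they are different. This is expected: "output" only contains hidden states of the last layer, whereas hn[num_layers-2] is the final hidden state of the layer below it.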
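
One way to check what hn[0] represents is to rebuild the first layer on its own. The sketch below is an illustration under stated assumptions, not part of the original notebook: it copies the first layer's parameters into a separate single-layer LSTM (called `layer0` here, a name chosen for this example) and compares its final hidden state with hn[0]. It assumes a recent PyTorch version, where the parameters are exposed as attributes such as `weight_ih_l0`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2)
x = torch.randn(5, 1, 6)
h0 = torch.randn(2, 1, 4)
c0 = torch.randn(2, 1, 4)
output, (hn, cn) = lstm(x, (h0, c0))

# Stand-alone copy of the first layer only
layer0 = nn.LSTM(input_size=6, hidden_size=4, num_layers=1)
with torch.no_grad():
    layer0.weight_ih_l0.copy_(lstm.weight_ih_l0)
    layer0.weight_hh_l0.copy_(lstm.weight_hh_l0)
    layer0.bias_ih_l0.copy_(lstm.bias_ih_l0)
    layer0.bias_hh_l0.copy_(lstm.bias_hh_l0)

# Feed it the same input and the first layer's initial states
out0, (hn0, cn0) = layer0(x, (h0[0:1], c0[0:1]))

first_layer_matches = torch.allclose(hn0[0], hn[0], atol=1e-6)
print("hn[0] is the final hidden state of layer 1:", first_layer_matches)
```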

### Understanding model parameters

In [11]:
for param in lstm.parameters():
    print(param.size())

torch.Size([16, 6])
torch.Size([16, 4])
torch.Size([16])
torch.Size([16])
torch.Size([16, 4])
torch.Size([16, 4])
torch.Size([16])
torch.Size([16])


### Hidden Layer 1

- **weight_ih_l0**, torch.Size([16, 6]): the learnable input-hidden weights of the 1st layer (W_ii|W_if|W_ig|W_io), of shape (4*hidden_size x input_size) = (16 x 6)
- **weight_hh_l0**, torch.Size([16, 4]): the learnable hidden-hidden weights of the 1st layer (W_hi|W_hf|W_hg|W_ho), of shape (4*hidden_size x hidden_size) = (16 x 4)
- **bias_ih_l0**, torch.Size([16]): the learnable input-hidden bias of the 1st layer (b_ii|b_if|b_ig|b_io), of shape (4*hidden_size) = (16)
- **bias_hh_l0**, torch.Size([16]): the learnable hidden-hidden bias of the 1st layer (b_hi|b_hf|b_hg|b_ho), of shape (4*hidden_size) = (16)

The factor of 4 comes from stacking the parameters of the four gates (i, f, g, o) along the first dimension.
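
Because the four gate blocks are stacked along the first dimension, they can be separated again by splitting that dimension into four equal parts. The following is a small sketch, assuming a recent PyTorch version in which the first layer's parameters are available as `weight_ih_l0` and `bias_ih_l0`; the names `W_ii`, `W_if`, `W_ig`, `W_io` follow the notation used above.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2)

# Split the stacked (4*hidden_size, input_size) matrix into the i, f, g, o blocks
W_ii, W_if, W_ig, W_io = lstm.weight_ih_l0.chunk(4, dim=0)
b_ii, b_if, b_ig, b_io = lstm.bias_ih_l0.chunk(4, dim=0)

print(W_ii.size())  # one gate: (hidden_size, input_size)
print(b_ii.size())  # one gate: (hidden_size,)
```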

### Hidden Layer 2
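
- **weight_ih_l1**, torch.Size([16, 4]): the learnable input-hidden weights of the 2nd layer. Its input is the hidden state of the 1st layer, so the shape is (4*hidden_size x hidden_size) = (16 x 4)
- **weight_hh_l1**, torch.Size([16, 4]): the learnable hidden-hidden weights of the 2nd layer, of shape (4*hidden_size x hidden_size) = (16 x 4)
- **bias_ih_l1**, torch.Size([16]): the learnable input-hidden bias of the 2nd layer, of shape (4*hidden_size) = (16)
- **bias_hh_l1**, torch.Size([16]): the learnable hidden-hidden bias of the 2nd layer, of shape (4*hidden_size) = (16)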
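
To tie the sizes above to parameter names, the sketch below lists the parameters with `named_parameters()` and checks the total count against the shapes discussed in this section. The helper `expected_lstm_params` is a hypothetical function written for this example; it assumes a unidirectional LSTM with bias enabled, as in this notebook.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=6, hidden_size=4, num_layers=2)

for name, param in lstm.named_parameters():
    print(name, tuple(param.size()))


def expected_lstm_params(input_size, hidden_size, num_layers):
    """Count parameters of a unidirectional LSTM with bias (illustrative helper)."""
    total = 0
    for layer in range(num_layers):
        # Layer 1 reads the input features; deeper layers read the previous hidden state
        in_features = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, hidden weights and two bias vectors
        total += 4 * hidden_size * (in_features + hidden_size + 2)
    return total


total_params = sum(p.numel() for p in lstm.parameters())
print("Total parameters:", total_params)
print("Expected from the formula:", expected_lstm_params(6, 4, 2))
```

For this network the first layer contributes 16 * (6 + 4 + 2) = 192 parameters and the second layer 16 * (4 + 4 + 2) = 160.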