# Long Short-Term Memory Networks (LSTM) with PyTorch (on MNIST)
By [Zahra Taheri](https://github.com/zata213), September 15, 2020

## LSTM: Special RNN
- Capable of learning long-term dependencies
- LSTM = RNN on super juice

## RNN transition to LSTM

![alt text](rnn-2.png)

![alt text](lstm.png)

### About the above LSTM
- Input passes through a linear function four times (4 times AX+B)
- Hidden state output passes through a linear function four times (4 times A'X'+B')
- We do an addition between the output of the above operations (4 times (AX+B)+(A'X'+B'))
- first output of the above operations passes through a non-linearity (sigmoid). It determines how much we want to forget from our previous time step (Foreget gate output)
- Second output of the above operations passes through a non-linearity (sigmoid). It determines how much information we want to have comming in (Input gate output)
- Third output of the above operations passes through a non-linearity (tanh). It determines what we can add to the new input (New candidate)
- Forth output of the above operations passes through a non-linearity (sigmoid). It determines what we can add to the new input (Output gate)

![alt text](lstm-par.png)

## Building LSTMs with PyTorch

### Model A: 1 hidden layer
- Unroll 28 time steps
    - In each step input size is 28x1 and output size is 10
    - Total per unroll: 28x28
        - FNN input size is 28x28
- 1 hidden layer

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- Step 3: Create model class
- Step 4: Instantiate model class
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- Step 7: Train the model


In [1]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [2]:
'''
Step 1: Load dataset
'''
train_dataset = dsets.MNIST(root='.\data', 
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='.\data', 
                           train=False, 
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''

class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        
        # Initialize hidden state with zeros
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Initialize cell state
        c0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # 28 time steps
        out, (hn, cn) = self.lstm(x, (h0,c0))
        
        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states! 
        out = self.fc(out[:, -1, :]) 
        # out.size() --> 100, 10
        return out

- 28 time steps
    - In each time step: input dimension=28. It means that in each time step, we only fit 28 pixels and after 28 time steps, all 28x28 pixels are fitted. Therefore, we only want the prediction of the last time step.
- 1 hidden layer
- MNIST 1-9 digits $\rightarrow$ output dimension=10
- Cross Entropy Loss is used for LSTM.

In [3]:
'''
Step 4: Instantiate model class
'''
input_dim = 28
hidden_dim = 100
layer_dim = 1
output_dim = 10

model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)
    
'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()

'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  

#### Parameters In-Depth

In [4]:
print(model) 
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

LSTMModel(
  (lstm): LSTM(28, 100, batch_first=True)
  (fc): Linear(in_features=100, out_features=10, bias=True)
)
6
torch.Size([400, 28])
torch.Size([400, 100])
torch.Size([400])
torch.Size([400])
torch.Size([10, 100])
torch.Size([10])


![alt text](params.png)

- We have four groups of parameters, $w_1,w_3,w_5,w_7$, each has size $[100, 28]$. So, we have $[400, 28]$.

#### Step 7: Train the model

1. Convert inputs\labels to variables
    - LSTM input size: (1, 28)
    - RNN input size: (1, 28)
    - CNN input size: (1, 28, 28)
    - FNN input size: (1, 28*28)
2. Clear gradient buffets
3. Get output given input
4. Get loss
5. Get gradients w.r.t. parameters
6. Update parameters using gradients
    - parameters = parameters - learning_rate * parameters_gradients
7. Repeat 

In [None]:
'''
Step 7: Train the model
'''

# Number of steps to unroll
seq_dim = 28  

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        
        # Load images as variables
        # resize images to (batch_size, seq_dim, input_dim) because when we create our model class, we had the argument 
        # batch_first=True
        images = Variable(images.view(-1, seq_dim, input_dim)) # -1 is to allow program to find the first dimension based on the input dimension
        labels = Variable(labels)
            
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size() --> 100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 2.280489206314087. Accuracy: 22
Iteration: 1000. Loss: 1.1225415468215942. Accuracy: 67
Iteration: 1500. Loss: 0.5317773222923279. Accuracy: 87
Iteration: 2000. Loss: 0.5812236070632935. Accuracy: 92
Iteration: 2500. Loss: 0.14171314239501953. Accuracy: 95


### Model B: 2 hidden layer
- Unroll 28 time steps
    - In each step input size is 28x1 and output size is 10
    - Total per unroll: 28x28
        - FNN input size is 28x28
- 2 hidden layer

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- **Step 3: Create model class**
- Step 4: Instantiate model class
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- Step 7: Train the model


In [1]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [2]:
'''
Step 1: Load dataset
'''
train_dataset = dsets.MNIST(root='.\data', 
                            train=True, 
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='.\data', 
                           train=False, 
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''

class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(LSTMModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        
        # Initialize hidden state with zeros
        h0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # Initialize cell state
        c0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
        
        # 28 time steps
        out, (hn, cn) = self.lstm(x, (h0,c0))
        
        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states! 
        out = self.fc(out[:, -1, :]) 
        # out.size() --> 100, 10
        return out

- 28 time steps
    - In each time step: input dimension=28. It means that in each time step, we only fit 28 pixels and after 28 time steps, all 28x28 pixels are fitted. Therefore, we only want the prediction of the last time step.
- 1 hidden layer
- MNIST 1-9 digits $\rightarrow$ output dimension=10
- Cross Entropy Loss is used for LSTM.

In [3]:
'''
Step 4: Instantiate model class
'''
input_dim = 28
hidden_dim = 100
layer_dim = 1
output_dim = 10

model = LSTMModel(input_dim, hidden_dim, layer_dim, output_dim)
    
'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()

'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  

#### Parameters In-Depth

In [4]:
print(model) 
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

LSTMModel(
  (lstm): LSTM(28, 100, batch_first=True)
  (fc): Linear(in_features=100, out_features=10, bias=True)
)
6
torch.Size([400, 28])
torch.Size([400, 100])
torch.Size([400])
torch.Size([400])
torch.Size([10, 100])
torch.Size([10])


![alt text](params.png)

- We have four groups of parameters, $w_1,w_3,w_5,w_7$, each has size $[100, 28]$. So, we have $[400, 28]$.

#### Step 7: Train the model

1. Convert inputs\labels to variables
    - LSTM input size: (1, 28)
    - RNN input size: (1, 28)
    - CNN input size: (1, 28, 28)
    - FNN input size: (1, 28*28)
2. Clear gradient buffets
3. Get output given input
4. Get loss
5. Get gradients w.r.t. parameters
6. Update parameters using gradients
    - parameters = parameters - learning_rate * parameters_gradients
7. Repeat 

In [None]:
'''
Step 7: Train the model
'''

# Number of steps to unroll
seq_dim = 28  

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        
        # Load images as variables
        # resize images to (batch_size, seq_dim, input_dim) because when we create our model class, we had the argument 
        # batch_first=True
        images = Variable(images.view(-1, seq_dim, input_dim)) # -1 is to allow program to find the first dimension based on the input dimension
        labels = Variable(labels)
            
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size() --> 100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 2.280489206314087. Accuracy: 22
Iteration: 1000. Loss: 1.1225415468215942. Accuracy: 67
