# Recurrent Neural Network (RNN) with PyTorch (on MNIST)
By [Zahra Taheri](https://github.com/zata213), September 11, 2020

## Feedforward Neural Networks Transition to Recurrent Neural Networks

### RNN is essentially an FNN

![alt text](fnn.png)

![alt text](rnn.png)

![alt text](fnn-2.png)

![alt text](rnn-2.png)

## Increasing the number of hidden layers in RNN

- In an RNN, every non-linear output is followed by a linear function so that information can pass from one time step to the next.
- An RNN is essentially a repeated FNN: the previous time step's non-linear output passes through a linear function and feeds into the current hidden state.
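
The recurrence described above can be checked by unrolling a small `nn.RNN` by hand. This is a minimal sketch with made-up sizes (4 inputs, 3 hidden units, 5 time steps) chosen only for illustration; it uses the weights that `nn.RNN` exposes as `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0` and `bias_hh_l0`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes, chosen only for illustration
input_dim, hidden_dim, seq_len, batch = 4, 3, 5, 2
rnn = nn.RNN(input_dim, hidden_dim, num_layers=1,
             batch_first=True, nonlinearity='relu')

x = torch.randn(batch, seq_len, input_dim)

with torch.no_grad():
    # Start from a zero hidden state, as the models in this notebook do
    h = torch.zeros(batch, hidden_dim)

    # Unroll by hand: the previous hidden state goes through a linear map
    # (weight_hh, bias_hh) and is added to the linear map of the new input
    for t in range(seq_len):
        h = torch.relu(
            x[:, t, :] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )

    # Let nn.RNN do the same unrolling (h_0 defaults to zeros)
    out, h_n = rnn(x)

manual_matches = torch.allclose(h, h_n[0], atol=1e-5)
print(manual_matches)
```

If the two results agree, the hand-written loop and `nn.RNN` implement the same repeated "linear, then non-linear" step.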

![alt text](rnn-3.png)

![alt text](rnn-4.png)

## Building Recurrent Neural Networks with PyTorch

### Model A: 1 hidden layer (ReLU)
- Unroll 28 time steps
    - At each step the input size is 28x1; the output size is 10
    - Total per unroll: 28x28
        - FNN input size is 28x28
- 1 hidden layer
- ReLU activation function
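
To make the "28 time steps of 28 pixels" idea concrete, here is a small sketch of the reshape. It uses a random tensor as a stand-in for one MNIST batch (an assumption, so nothing has to be downloaded) and shows that time step `t` of a sequence is row `t` of the image.

```python
import torch

# Stand-in for one MNIST batch: (batch, channels, height, width)
images = torch.rand(100, 1, 28, 28)

seq_dim, input_dim = 28, 28

# Each image row becomes one time step with 28 features
sequences = images.view(-1, seq_dim, input_dim)
print(sequences.shape)

# Time step 5 of image 0 is exactly row 5 of that image
rows_match = torch.equal(sequences[0, 5], images[0, 0, 5])
print(rows_match)
```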

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- Step 3: Create model class
- Step 4: Instantiate model class
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- Step 7: Train the model


In [1]:
# import libraries
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [2]:
'''
Step 1: Load dataset
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)  # 3000 / (60000 / 100) = 5
num_epochs = int(num_epochs) # the number of times we go through the whole dataset

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        
        # Hidden dimension
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # RNN
        # ** batch_first=True causes input/output tensors to be of shape (batch_dim, seq_dim, input_dim) 
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='relu')
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim) 
    
    def forward(self, x):
        
        # Initialize hidden state with zeros (layer_dim, batch_size, hidden_dim)
        h_0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)) # x.size(0)=batch_size because of part **
        
        out, h_n = self.rnn(x, h_0)
        
        # Index hidden state of last time step
        # out.size()->100,28,100
        # out[:,-1,:]-> 100,100 -> just want last time step hidden states
        out = self.fc(out[:,-1,:])
        # out.size()-> 100,10 (batch_size=100; 10 predictions for each of the 100 images)
        return out


A single forward pass returns the hidden state at all 28 time steps, but we only need the last one.

- 28 time steps
    - In each time step: input dimension=28. Each time step feeds in only 28 pixels (one image row), and after 28 time steps the network has seen all 28x28 pixels. Therefore, we only use the prediction from the last time step.
- 1 hidden layer
- MNIST digits 0-9 $\rightarrow$ output dimension=10
- Cross-entropy loss is used for classification with the RNN.
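
The shapes involved in "keep only the last time step" can be traced with a short sketch. It builds the same two layers as `RNNModel` (Model A sizes) and feeds a random batch as a stand-in for MNIST; for a unidirectional RNN the last slice of `out` should coincide with the final hidden state `h_n` of the top layer.

```python
import torch
import torch.nn as nn

# Same layer sizes as Model A
rnn = nn.RNN(28, 100, 1, batch_first=True, nonlinearity='relu')
fc = nn.Linear(100, 10)

# Random stand-in for a batch of 100 images, already shaped as sequences
x = torch.rand(100, 28, 28)

with torch.no_grad():
    out, h_n = rnn(x)        # out holds the hidden state at every time step
    last = out[:, -1, :]     # keep only the last time step
    logits = fc(last)        # one score per digit class

print(out.shape, last.shape, logits.shape)

# The last time step of `out` is the final hidden state of the top layer
same_as_h_n = torch.allclose(last, h_n[-1])
print(same_as_h_n)
```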

In [3]:
'''
Step 4: Instantiate model class
'''

input_dim = 28 # not 28x28 as FNN
hidden_dim = 100
layer_dim = 1
output_dim = 10


model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()


'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

#### Parameters In-Depth

In [4]:
len(list(model.parameters()))

6

![alt text](params.png)

In [5]:
# Input to hidden (A1)
list(model.parameters())[0].size()

torch.Size([100, 28])

In [6]:
# Input to hidden bias (B1)
list(model.parameters())[2].size()

torch.Size([100])

In [7]:
# Hidden to hidden (A3). Linear function between time steps
list(model.parameters())[1].size()

torch.Size([100, 100])

In [8]:
# Hidden to hidden bias (B3)
list(model.parameters())[3].size()

torch.Size([100])

In [9]:
# Hidden to readout (A2)
list(model.parameters())[4].size()

torch.Size([10, 100])

In [10]:
# Hidden to readout bias (B2)
list(model.parameters())[5].size()

torch.Size([10])
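
Indexing `list(model.parameters())` by position is easy to get wrong, so a sketch using `named_parameters()` may help. It rebuilds the two layers of Model A outside the class (the `fc.` prefix is added by hand here only to mimic how the names appear inside `RNNModel`) and counts the trainable values.

```python
import torch.nn as nn

# Same layers as Model A
rnn = nn.RNN(28, 100, 1, batch_first=True, nonlinearity='relu')
fc = nn.Linear(100, 10)

# Map each parameter name to its shape, in registration order
shapes = {name: tuple(p.shape) for name, p in rnn.named_parameters()}
shapes.update({'fc.' + name: tuple(p.shape) for name, p in fc.named_parameters()})

for name, shape in shapes.items():
    print(name, shape)

# Total number of trainable values across both layers
total_params = sum(p.numel() for m in (rnn, fc) for p in m.parameters())
print(total_params)
```

The order `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, `bias_hh_l0` explains why the cells above read indices 0, 2, 1, 3 to pair each weight with its bias.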

#### Input sizes:

- RNN input size: (1, 28) per time step, with 28 time steps per image
- CNN input size: (1, 28, 28)
- FNN input size: (1, 28*28)

In [11]:
'''
Step 7: Train the model
'''

# number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        
        # Load images as variables
        # Reshape images to (batch_size, seq_dim, input_dim), since the model was created with batch_first=True
        images = Variable(images.view(-1, seq_dim, input_dim))  
        labels = Variable(labels)
        
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size()->100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                images = Variable(images.view(-1, seq_dim, input_dim)) 
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print loss and accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 1.6357725858688354. Accuracy: 53
Iteration: 1000. Loss: 0.5974587798118591. Accuracy: 66
Iteration: 1500. Loss: 0.9172778129577637. Accuracy: 72
Iteration: 2000. Loss: 0.5771211981773376. Accuracy: 84
Iteration: 2500. Loss: 0.3714955151081085. Accuracy: 90
Iteration: 3000. Loss: 0.2064978927373886. Accuracy: 89
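
The accuracy loop above runs the test set with gradient tracking enabled. One possible variation, sketched here, wraps evaluation in `torch.no_grad()` and moves it into a helper. The names `evaluate`, `TinyRNN` and `fake_loader` are hypothetical and exist only so the sketch runs without downloading MNIST.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, seq_dim=28, input_dim=28):
    """Return accuracy in percent over all batches in `loader`."""
    correct, total = 0, 0
    with torch.no_grad():  # no gradient tracking is needed for evaluation
        for images, labels in loader:
            outputs = model(images.view(-1, seq_dim, input_dim))
            predicted = outputs.argmax(dim=1)  # index of the largest logit
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100.0 * correct / total

class TinyRNN(nn.Module):
    """Hypothetical stand-in with the same structure as RNNModel."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(28, 16, 1, batch_first=True, nonlinearity='relu')
        self.fc = nn.Linear(16, 10)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])

# Three random batches standing in for the test loader
fake_loader = [(torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,)))
               for _ in range(3)]

acc = evaluate(TinyRNN(), fake_loader)
print(acc)
```

With the real notebook objects, the call would be `evaluate(model, test_loader)`.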


### Model B: 2 hidden layers (ReLU)
- Unroll 28 time steps
    - At each step the input size is 28x1; the output size is 10
    - Total per unroll: 28x28
        - FNN input size is 28x28
- 2 hidden layers
- ReLU activation function

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- Step 3: Create model class
- **Step 4: Instantiate model class**
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- Step 7: Train the model


In [12]:
# import libraries
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [13]:
'''
Step 1: Load dataset
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs) # the number of times we go through the whole dataset

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        
        # Hidden dimension
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # RNN
        # ** batch_first=True causes input/output tensors to be of shape (batch_dim, seq_dim, input_dim) 
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='relu')
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim) 
    
    def forward(self, x):
        
        # Initialize hidden state with zeros (layer_dim, batch_size, hidden_dim)
        h_0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)) # x.size(0)=batch_size because of part **
        
        out, h_n = self.rnn(x, h_0)
        
        # Index hidden state of last time step
        # out.size()->100,28,100
        # out[:,-1,:]-> 100,100 -> just want last time step hidden states
        out = self.fc(out[:,-1,:])
        # out.size()-> 100,10 (batch_size=100; 10 predictions for each of the 100 images)
        return out

'''
Step 4: Instantiate model class
'''

input_dim = 28 # not 28x28 as FNN
hidden_dim = 100
layer_dim = 2 # the only change is here
output_dim = 10


model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()


'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

#### Parameters In-Depth

In [14]:
print(model) 
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

RNNModel(
  (rnn): RNN(28, 100, num_layers=2, batch_first=True)
  (fc): Linear(in_features=100, out_features=10, bias=True)
)
10
torch.Size([100, 28])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([100, 100])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([10, 100])
torch.Size([10])


![alt text](params-b.png)
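
The ten unlabeled sizes printed above are easier to read with names attached. The following sketch lists the parameters of a two-layer `nn.RNN` with the same sizes as Model B; the suffixes `_l0` and `_l1` are the layer indices PyTorch assigns.

```python
import torch.nn as nn

# Same recurrent layer as Model B
rnn = nn.RNN(28, 100, 2, batch_first=True, nonlinearity='relu')

names = [(name, tuple(p.shape)) for name, p in rnn.named_parameters()]
for name, shape in names:
    print(name, shape)
```

The second layer's input weights (`weight_ih_l1`) are 100x100 rather than 100x28 because that layer receives the first layer's 100-dimensional hidden state instead of the 28 pixels. The remaining two of the ten parameters belong to the readout layer `fc`.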

In [15]:
'''
Step 7: Train the model
'''

# number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        
        # Load images as variables
        # Reshape images to (batch_size, seq_dim, input_dim), since the model was created with batch_first=True
        images = Variable(images.view(-1, seq_dim, input_dim))  
        labels = Variable(labels)
        
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size()->100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                images = Variable(images.view(-1, seq_dim, input_dim)) 
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print loss and accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 1.1144779920578003. Accuracy: 58
Iteration: 1000. Loss: 0.6000416278839111. Accuracy: 81
Iteration: 1500. Loss: 0.35051703453063965. Accuracy: 91
Iteration: 2000. Loss: 0.43590569496154785. Accuracy: 90
Iteration: 2500. Loss: 0.19808748364448547. Accuracy: 95
Iteration: 3000. Loss: 0.031059807166457176. Accuracy: 95


### Model C: 2 hidden layers (Tanh)
- Unroll 28 time steps
    - At each step the input size is 28x1; the output size is 10
    - Total per unroll: 28x28
        - FNN input size is 28x28
- 2 hidden layers
- Tanh activation function
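
Switching the activation changes the range of the hidden states: ReLU outputs are non-negative and unbounded above, while tanh outputs stay within [-1, 1]. A small sketch on a random batch (a stand-in for MNIST, with untrained weights) illustrates this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-in batch, already shaped as (batch, seq_dim, input_dim)
x = torch.randn(4, 28, 28)

rnn_relu = nn.RNN(28, 100, 2, batch_first=True, nonlinearity='relu')
rnn_tanh = nn.RNN(28, 100, 2, batch_first=True, nonlinearity='tanh')

with torch.no_grad():
    out_relu, _ = rnn_relu(x)
    out_tanh, _ = rnn_tanh(x)

relu_min = out_relu.min().item()            # smallest ReLU hidden value
tanh_abs_max = out_tanh.abs().max().item()  # largest tanh magnitude
print(relu_min >= 0.0, tanh_abs_max <= 1.0)
```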

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- **Step 3: Create model class**
- Step 4: Instantiate model class
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- Step 7: Train the model


In [16]:
# import libraries
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

In [17]:
'''
Step 1: Load dataset
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs) # the number of times we go through the whole dataset

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        
        # Hidden dimension
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # RNN
        # ** batch_first=True causes input/output tensors to be of shape (batch_dim, seq_dim, input_dim) 
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim) 
    
    def forward(self, x):
        
        # Initialize hidden state with zeros (layer_dim, batch_size, hidden_dim)
        h_0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim)) # x.size(0)=batch_size because of part **
        
        out, h_n = self.rnn(x, h_0)
        
        # Index hidden state of last time step
        # out.size()->100,28,100
        # out[:,-1,:]-> 100,100 -> just want last time step hidden states
        out = self.fc(out[:,-1,:])
        # out.size()-> 100,10 (batch_size=100; 10 predictions for each of the 100 images)
        return out

'''
Step 4: Instantiate model class
'''

input_dim = 28 # not 28x28 as FNN
hidden_dim = 100
layer_dim = 2 # same as Model B; Model C only changes the nonlinearity in Step 3
output_dim = 10


model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()


'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

#### Parameters In-Depth

In [18]:
print(model) 
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

RNNModel(
  (rnn): RNN(28, 100, num_layers=2, batch_first=True)
  (fc): Linear(in_features=100, out_features=10, bias=True)
)
10
torch.Size([100, 28])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([100, 100])
torch.Size([100, 100])
torch.Size([100])
torch.Size([100])
torch.Size([10, 100])
torch.Size([10])


![alt text](params-b.png)

In [19]:
'''
Step 7: Train the model
'''

# number of steps to unroll
seq_dim = 28

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        
        # Load images as variables
        # Reshape images to (batch_size, seq_dim, input_dim), since the model was created with batch_first=True
        images = Variable(images.view(-1, seq_dim, input_dim))  
        labels = Variable(labels)
        
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size()->100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                images = Variable(images.view(-1, seq_dim, input_dim)) 
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print loss and accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 0.6841119527816772. Accuracy: 77
Iteration: 1000. Loss: 0.2851320505142212. Accuracy: 92
Iteration: 1500. Loss: 2.4408063888549805. Accuracy: 54
Iteration: 2000. Loss: 0.2851850986480713. Accuracy: 92
Iteration: 2500. Loss: 0.2369268536567688. Accuracy: 95
Iteration: 3000. Loss: 0.25919097661972046. Accuracy: 94


# RNN From CPU to GPU in PyTorch

## Model C: 2 hidden layers (Tanh)

#### Steps
- Step 1: Load dataset
- Step 2: Make dataset iterable
- **Step 3: Create model class**
- **Step 4: Instantiate model class**
- Step 5: Instantiate loss class
- Step 6: Instantiate optimizer class
- **Step 7: Train the model**

In [20]:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from torch.autograd import Variable

'''
Step 1: Load dataset
'''

train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

test_dataset = dsets.MNIST(root='./data',
                           train=False,
                           transform=transforms.ToTensor())

'''
Step 2: Make dataset iterable
'''

batch_size = 100
n_iters = 3000
num_epochs = n_iters / (len(train_dataset) / batch_size)
num_epochs = int(num_epochs) # the number of times we go through the whole dataset

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

'''
Step 3: Create model class
'''
class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        
        # Hidden dimension
        self.hidden_dim = hidden_dim
        
        # Number of hidden layers
        self.layer_dim = layer_dim
        
        # RNN
        # ** batch_first=True causes input/output tensors to be of shape (batch_dim, seq_dim, input_dim) 
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')
        
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim) 
    
    def forward(self, x):
        
        # Initialize hidden state with zeros
        #######################
        #  USE GPU FOR MODEL  #
        #######################
        if torch.cuda.is_available():
            h_0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).cuda())
        else:
            h_0 = Variable(torch.zeros(self.layer_dim, x.size(0), self.hidden_dim))
            
        
        out, h_n = self.rnn(x, h_0)
        
        # Index hidden state of last time step
        # out.size()->100,28,100
        # out[:,-1,:]-> 100,100 -> just want last time step hidden states
        out = self.fc(out[:,-1,:])
        # out.size()-> 100,10 (batch_size=100; 10 predictions for each of the 100 images)
        return out

'''
Step 4: Instantiate model class
'''

input_dim = 28 # not 28x28 as FNN
hidden_dim = 100
layer_dim = 2 # same as Model C on CPU
output_dim = 10


model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

#######################
#  USE GPU FOR MODEL  #
#######################

if torch.cuda.is_available():
    model.cuda()
    
'''
Step 5: Instantiate loss class
'''
criterion = nn.CrossEntropyLoss()


'''
Step 6: Instantiate optimizer class
'''
learning_rate = 0.1

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

'''
Step 7: Train the model
'''

# Number of steps to unroll
seq_dim = 28  

iter = 0
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Load images as Variable
        #######################
        #  USE GPU FOR MODEL  #
        #######################
        if torch.cuda.is_available():
            images = Variable(images.view(-1, seq_dim, input_dim).cuda())
            labels = Variable(labels.cuda())
        else:
            images = Variable(images.view(-1, seq_dim, input_dim))
            labels = Variable(labels)
            
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()
        
        # Forward pass to get output/logits
        # outputs.size() --> 100, 10
        outputs = model(images)
        
        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)
        
        # Getting gradients w.r.t. parameters
        loss.backward()
        
        # Updating parameters
        optimizer.step()
        
        iter += 1
        
        if iter % 500 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                #######################
                #  USE GPU FOR MODEL  #
                #######################
                if torch.cuda.is_available():
                    images = Variable(images.view(-1, seq_dim, input_dim).cuda())
                else:
                    images = Variable(images.view(-1, seq_dim, input_dim))
                
                # Forward pass only to get logits/output
                outputs = model(images)
                
                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)
                
                # Total number of labels
                total += labels.size(0)
                
                # Total correct predictions
                #######################
                #  USE GPU FOR MODEL  #
                #######################
                if torch.cuda.is_available():
                    correct += (predicted.cpu() == labels.cpu()).sum()
                else:
                    correct += (predicted == labels).sum()
            
            accuracy = 100 * correct // total
            
            # Print loss and accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.data, accuracy))

Iteration: 500. Loss: 0.49409088492393494. Accuracy: 83
Iteration: 1000. Loss: 0.25837576389312744. Accuracy: 91
Iteration: 1500. Loss: 0.16369344294071198. Accuracy: 94
Iteration: 2000. Loss: 0.8283212184906006. Accuracy: 89
Iteration: 2500. Loss: 0.07316825538873672. Accuracy: 94
Iteration: 3000. Loss: 0.15746740996837616. Accuracy: 96
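
The repeated `if torch.cuda.is_available()` branches above can also be expressed with a single `torch.device` object. The following is a sketch of that alternative, not the notebook's own code: it creates `h_0` on the same device as the input and moves the model and data with `.to(device)`. A random batch stands in for MNIST so the sketch runs on its own.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim,
                          batch_first=True, nonlinearity='tanh')
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Create the initial hidden state on the same device as the input
        h_0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim,
                          device=x.device)
        out, _ = self.rnn(x, h_0)
        return self.fc(out[:, -1, :])

model = RNNModel(28, 100, 2, 10).to(device)
criterion = nn.CrossEntropyLoss()

# Random stand-in for one batch from train_loader
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))

# One line per tensor replaces each if/else branch
images = images.view(-1, 28, 28).to(device)
labels = labels.to(device)

outputs = model(images)
loss = criterion(outputs, labels)
print(outputs.shape, loss.item())
```

Because `predicted` and `labels` then live on the same device, the comparison `(predicted == labels).sum()` needs no `.cpu()` calls.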
