# Fashion MNIST Classification with RNNs (25 points)

## <a id="intro"/> Introduction
Recurrent Neural Networks are Deep Learning models with simple structures and a built-in feedback mechanism: the output of a layer is combined with the next input and fed back into the same layer.

The Recurrent Neural Network is a specialized type of Neural Network that solves the issue of **maintaining context for Sequential data** -- such as Weather data, Stocks, Genes, etc. At each iterative step, the processing unit takes in an input and the current state of the network, and produces an output and a new state that is **re-fed into the network**.

However, **this model has some problems**. It is computationally expensive to maintain the state for a large number of units, even more so over a long period of time. Additionally, Recurrent Networks are very sensitive to changes in their parameters. As such, they are prone to problems with gradient-based training: the gradients either grow exponentially (exploding gradients) or shrink to near zero (vanishing gradients), both of which greatly harm a model's learning capability.
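The two gradient failure modes described above can be illustrated with plain arithmetic. This is only a sketch: the scalar factors 0.9 and 1.1 and the step count are assumptions chosen for illustration, whereas a real network multiplies by Jacobian matrices at every time step.

```python
# Backpropagating through many time steps multiplies many factors together.
# Illustrative scalar factors only; real networks multiply by Jacobians.
steps = 100
shrinking = 0.9 ** steps   # factor below 1: the product shrinks toward zero
growing = 1.1 ** steps     # factor above 1: the product grows exponentially

print(f"0.9^{steps} = {shrinking:.2e}")
print(f"1.1^{steps} = {growing:.2e}")
```

Even factors this close to 1 produce values that differ by many orders of magnitude after 100 steps.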

To solve these problems, the Long Short-Term Memory (LSTM) model was proposed as a way to keep information over long periods of time and to reduce the oversensitivity to parameter changes, i.e., to make backpropagating through Recurrent Networks more viable.



## <a id="arch"/>Architectures

- **The Long Short-Term Memory Model (LSTM)**

<img src="https://ibm.box.com/shared/static/v7p90neiaqghmpwawpiecmz9n7080m59.png" alt="Representation of a Recurrent Neural Network" width=80%>

##  <a id="lstm"/>LSTM
LSTM is one of the proposed solutions or upgrades to the **Recurrent Neural Network model**. 

It is an abstraction of how computer memory works. It is "bundled" with whatever processing unit is implemented in the Recurrent Network, although outside of its flow, and is responsible for keeping, reading, and outputting information for the model. The way it works is simple: you have a linear unit, which is the information cell itself, surrounded by three logistic gates responsible for maintaining the data. One gate is for inputting data into the information cell, one is for outputting data from the information cell, and the last one is to keep or forget data depending on the needs of the network.

Thanks to that, it not only solves the problem of keeping states, because the network can choose to forget data whenever the information is no longer needed, but also eases the gradient problems, since the logistic gates have a simple, well-behaved derivative.

### Long Short-Term Memory Architecture

As seen before, the Long Short-Term Memory is composed of a linear unit surrounded by three logistic gates. The names of these gates vary from place to place, but the most common ones are:
- the "Input" or "Write" Gate, which handles the writing of data into the information cell, 
- the "Output" or "Read" Gate, which handles the sending of data back onto the Recurrent Network, and 
- the "Keep" or "Forget" Gate, which handles the maintaining and modification of the data stored in the information cell.

<img src="https://ibm.box.com/shared/static/zx10duv5egw0baw6gh2hzsgr8ex45gsg.png" width="720"/>
<center>*Diagram of the Long Short-Term Memory Unit*</center>

The three gates are the centerpiece of the LSTM unit. The gates, when activated by the network, perform their respective functions. For example, the Input Gate will write whatever data it is passed onto the information cell, the Output Gate will return whatever data is in the information cell, and the Keep Gate will maintain the data in the information cell. These gates are analog and multiplicative, and as such, can modify the data based on the signal they are sent.


<img src="https://i.stack.imgur.com/RHNrZ.jpg" width="720"/>

---

# **LSTM RNN Cell**:
The mechanics of an LSTM are as follows: it uses a combination of gates to control the contents of its cell. By having gates, it is able to regulate the flow of the gradient, avoiding too many multiplications during backpropagation.
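A minimal sketch of one LSTM time step, written with plain tensor operations, may make the gate mechanics more concrete. The names `lstm_step`, `W`, `U`, `b`, the gate ordering, and the sizes are assumptions made for this illustration; `nn.LSTM` stores and orders its parameters differently.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step for a batch (illustrative, not nn.LSTM's layout).

    x_t: (batch, n_in), h_prev and c_prev: (batch, n_hidden)
    W: (n_in, 4 * n_hidden), U: (n_hidden, 4 * n_hidden), b: (4 * n_hidden,)
    """
    pre = x_t @ W + h_prev @ U + b        # pre-activations of all four blocks
    i, f, g, o = pre.chunk(4, dim=1)
    i = torch.sigmoid(i)                  # input ("write") gate
    f = torch.sigmoid(f)                  # forget ("keep") gate
    o = torch.sigmoid(o)                  # output ("read") gate
    g = torch.tanh(g)                     # candidate values for the cell
    c_t = f * c_prev + i * g              # additive update of the information cell
    h_t = o * torch.tanh(c_t)             # what is sent back to the network
    return h_t, c_t

torch.manual_seed(0)
n_batch, n_in, n_hidden = 4, 28, 8        # hypothetical sizes
W = 0.1 * torch.randn(n_in, 4 * n_hidden)
U = 0.1 * torch.randn(n_hidden, 4 * n_hidden)
b = torch.zeros(4 * n_hidden)
h_t = torch.zeros(n_batch, n_hidden)
c_t = torch.zeros(n_batch, n_hidden)
row = torch.rand(n_batch, n_in)           # stand-in for one image row per sample

h_t, c_t = lstm_step(row, h_t, c_t, W, U, b)
print(h_t.shape, c_t.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

The cell update `f * c_prev + i * g` is a sum rather than a chain of products, which is one way to read the claim that the gates help the gradient survive many time steps.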

# **import dataset**: 
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning.
The MNIST database contains 60,000 training images and 10,000 testing images.



---
To classify images using a recurrent neural network, we treat every image as a sequence of rows. Because the MNIST image shape is 28*28px, every sample becomes a sequence of 28 time steps, where each step is one row of 28 pixel values.
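The row-as-time-step view can be checked on a random tensor. This is a small sketch; `fake_batch` is a hypothetical stand-in for one mini-batch of MNIST images, with the batch size of 100 assumed.

```python
import torch

# Stand-in for one mini-batch of images: (batch, channels, height, width)
fake_batch = torch.rand(100, 1, 28, 28)

# Drop the channel dimension: each image becomes 28 time steps of 28 features
sequences = fake_batch.reshape(-1, 28, 28)
print(sequences.shape)  # torch.Size([100, 28, 28])

# Time step t of sample n is exactly row t of image n
assert torch.equal(sequences[0, 5], fake_batch[0, 0, 5])
```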



In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import time

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# MNIST dataset 
train_dataset = torchvision.datasets.MNIST(root='./data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),  
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data', 
                                          train=False, 
                                          transform=transforms.ToTensor())





In [None]:
# Hyper-parameters 
num_classes = 10 # MNIST total classes (0-9 digits)
num_epochs = 10
batch_size = 100
learning_rate = 0.001


# Preparing data for training with DataLoaders:
The Dataset retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s multiprocessing to speed up data retrieval.
DataLoader is an iterable that abstracts this complexity for us in an easy API.
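The batching behaviour can be seen without downloading anything. This is a sketch on synthetic data: `fake_images`, `fake_labels`, and the dataset size of 250 are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 250 samples with labels in 0..9
fake_images = torch.rand(250, 1, 28, 28)
fake_labels = torch.randint(0, 10, (250,))
fake_dataset = TensorDataset(fake_images, fake_labels)

# shuffle=True draws a new sample order every time the loader is iterated
demo_loader = DataLoader(fake_dataset, batch_size=100, shuffle=True)
batch_shapes = [tuple(x.shape) for x, _ in demo_loader]

print(len(demo_loader))   # 3
print(batch_shapes[-1])   # (50, 1, 28, 28)
```

The last mini-batch is smaller because 250 is not a multiple of 100 and `drop_last` defaults to `False`.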

In [None]:
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

# Set parameters:


1.   input_size: The number of expected features in the input x at each time step
2.   hidden_size: The number of features in the hidden state h
3.   num_layers: Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1
4.   nonlinearity: The non-linearity to use in nn.RNN. Can be either ‘tanh’ or ‘relu’. Default: ‘tanh’. (nn.LSTM does not take this argument.)
5.   num_classes: The number of output classes. We require an output layer with 10 nodes in order to predict the probability distribution of an image belonging to each of the 10 classes.
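The parameters listed above determine the tensor shapes that flow through the network. The sketch below uses `nn.RNN` with the values from this notebook on a random input; the variable names are hypothetical and the weights are untrained.

```python
import torch
import torch.nn as nn

demo_rnn = nn.RNN(input_size=28, hidden_size=128, num_layers=2,
                  nonlinearity='tanh', batch_first=True)
demo_fc = nn.Linear(128, 10)          # output layer with one node per class

x = torch.rand(100, 28, 28)           # (batch, sequence_length, input_size)
out, h_n = demo_rnn(x)                # initial hidden state defaults to zeros
scores = demo_fc(out[:, -1, :])       # classify from the last time step

print(out.shape)     # torch.Size([100, 28, 128])
print(h_n.shape)     # torch.Size([2, 100, 128])
print(scores.shape)  # torch.Size([100, 10])
```

Note that `h_n` is always shaped (num_layers, batch, hidden_size), even when `batch_first=True`.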







In [None]:

input_size = 28 # features per time step: one image row of 28 pixels (img shape: 28*28)
sequence_length = 28 # time steps per sample: the 28 rows of the image
hidden_size = 128 # number of features in the hidden state
num_layers = 2 # number of stacked recurrent layers

# RNN Implementation using PyTorch (25 points) #

# Define the network: 
We are passing the input dimension, hidden dimension, number of layers and num of classes as input parameters.



---




*   We have to define forward(), which specifies the order in which the layers are applied: the inputs and the zero-initialized hidden state are passed through the recurrent layer first, and the recurrent outputs are then passed to the fully-connected layer.
*   **Set initial hidden state and cell state**: We have to initialize a hidden state and a cell state for the LSTM, because there is no previous state at the first time step. The hidden state and cell state are stored in a tuple with the format (hidden_state, cell_state).
* Pass the input and the (hidden_state, cell_state) tuple into the model




In [None]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNN, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        #self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
        
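    def forward(self, x):
        # x: (batch, sequence_length, input_size)
        # Set initial hidden states (and cell states for LSTM) to zeros,
        # created on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        # out: (batch, sequence_length, hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        # Keep only the output of the last time step: (batch, hidden_size)
        out = out[:, -1, :]
        # Map to class scores: (batch, num_classes)
        out = self.fc(out)
        return out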

model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)

# Define loss function and Optimization Function:

lr (learning rate): The step size that controls how much the model's weights are updated each time the optimizer takes a step after back-propagation.
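The role of the learning rate is easiest to see on a single weight. The sketch below uses plain SGD instead of the Adam optimizer used in this notebook, because its update rule (weight minus lr times gradient) can be checked by hand; `demo_w`, `demo_opt`, and the toy loss are assumptions made for illustration.

```python
import torch

demo_w = torch.tensor([1.0], requires_grad=True)
demo_opt = torch.optim.SGD([demo_w], lr=0.1)   # plain SGD, not Adam

demo_loss = (demo_w ** 2).sum()   # gradient of w^2 at w = 1.0 is 2.0
demo_loss.backward()
demo_opt.step()                   # w <- w - lr * grad = 1.0 - 0.1 * 2.0

print(demo_w.item())              # approximately 0.8
```

A larger `lr` would move the weight further in one step, and a smaller one would move it less.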

In [None]:

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  

# Train the model


*   num_epochs: Number of times our model will go through the entire training dataset
*   Iterate over the training data loader inside each epoch
* Reshape the images to (batch_size, sequence_length, input_size)
* Pass the images to the model
* Pass the outputs and labels to the loss function
* Set the gradients to zero before backpropagation, because PyTorch accumulates (i.e. sums) the gradients on every loss.backward() call
* Print the epoch number, accuracy, loss, and time taken for each epoch
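The gradient accumulation mentioned in the list above can be reproduced on a single tensor. This is a small sketch; `acc_w` and the toy expression are hypothetical, and the gradient is reset directly on the tensor, whereas `optimizer.zero_grad()` does the reset for every parameter of the model.

```python
import torch

acc_w = torch.tensor([2.0], requires_grad=True)

(3 * acc_w).sum().backward()
first = acc_w.grad.clone()
print(first)    # tensor([3.])

# A second backward pass without resetting adds to the stored gradient
(3 * acc_w).sum().backward()
second = acc_w.grad.clone()
print(second)   # tensor([6.])

# After resetting, the gradient reflects only the latest backward pass
acc_w.grad.zero_()
(3 * acc_w).sum().backward()
third = acc_w.grad.clone()
print(third)    # tensor([3.])
```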


In [None]:
# Train the model
n_total_steps = len(train_loader)
for epoch in range(num_epochs):
    start_time = time.time()
    n_correct = 0
    n_samples = 0
    for i, (images, labels) in enumerate(train_loader):  
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        # max returns (value ,index)
        _, predicted = torch.max(outputs.data, 1)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    acc = 100.0 * n_correct / n_samples
    end_time = time.time() - start_time
    # Note: the loss printed below is the loss of the last mini-batch in the epoch, not an epoch average
    print("Epoch no.",epoch+1 ,"|accuracy: ", round(acc, 3),"%", "|Loss: ", round(loss.item(), 3), "| epoch_duration: ", round(end_time,2),"sec")



Epoch no. 1 |accuracy:  83.697 % |Loss:  0.096 | epoch_duration:  61.72 sec
Epoch no. 2 |accuracy:  96.212 % |Loss:  0.073 | epoch_duration:  58.63 sec
Epoch no. 3 |accuracy:  97.543 % |Loss:  0.087 | epoch_duration:  59.51 sec
Epoch no. 4 |accuracy:  98.18 % |Loss:  0.051 | epoch_duration:  58.69 sec
Epoch no. 5 |accuracy:  98.575 % |Loss:  0.03 | epoch_duration:  58.99 sec
Epoch no. 6 |accuracy:  98.692 % |Loss:  0.031 | epoch_duration:  58.85 sec
Epoch no. 7 |accuracy:  98.955 % |Loss:  0.058 | epoch_duration:  58.3 sec
Epoch no. 8 |accuracy:  98.982 % |Loss:  0.017 | epoch_duration:  58.39 sec
Epoch no. 9 |accuracy:  99.1 % |Loss:  0.005 | epoch_duration:  58.62 sec
Epoch no. 10 |accuracy:  99.25 % |Loss:  0.01 | epoch_duration:  58.58 sec


# Evaluate the model on test data



In [None]:
with torch.no_grad():
    n_correct = 0
    n_samples = 0
    for images, labels in test_loader:
        images = images.reshape(-1, sequence_length, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        # max returns (value ,index)
        _, predicted = torch.max(outputs.data, 1)
        n_samples += labels.size(0)
        n_correct += (predicted == labels).sum().item()

    acc = 100.0 *( n_correct / n_samples)
    print(f'Accuracy of the network on the 10000 test images: {acc} %')

Accuracy of the network on the 10000 test images: 98.86 %


# Performance Comparison (25 points)


*   **Computing Time**
 A CNN would typically be faster than an RNN on this task. An RNN processes a sequence step by step, with each step depending on the previous hidden state, and it can take inputs and outputs of arbitrary length, both of which affect the total computation time and efficiency. A CNN, on the other hand, takes a fixed-size input and produces a fixed-size output, which allows it to compute its results at a faster pace.
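One way to check the computing-time claim on a given machine is to time a forward pass of both kinds of model on the same batch. The sketch below is only an assumption-laden illustration: `demo_cnn` is a hypothetical small CNN that is not part of this notebook, the weights are untrained, and the measured times depend on hardware and model sizes.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
timing_batch = torch.rand(100, 1, 28, 28)

# Hypothetical small CNN, chosen only for illustration
demo_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
)
# LSTM with the same sizes as the model in this notebook
demo_lstm = nn.LSTM(28, 128, 2, batch_first=True)
demo_head = nn.Linear(128, 10)

def lstm_forward():
    out, _ = demo_lstm(timing_batch.reshape(-1, 28, 28))
    return demo_head(out[:, -1, :])

def time_forward(fn, repeats=5):
    """Average wall-clock seconds per call, after one warm-up call."""
    with torch.no_grad():
        fn()
        start = time.perf_counter()
        for _ in range(repeats):
            fn()
    return (time.perf_counter() - start) / repeats

cnn_time = time_forward(lambda: demo_cnn(timing_batch))
lstm_time = time_forward(lstm_forward)
print(f"CNN:  {cnn_time * 1e3:.2f} ms per batch")
print(f"LSTM: {lstm_time * 1e3:.2f} ms per batch")
```

A fair comparison would also match the parameter counts and the accuracy of the two models, which this sketch does not attempt.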




# CNN + RNN (25 points)
* Every real-world image can be annotated with multiple labels, because an image normally abounds with rich semantic information, such as objects, parts, scenes, actions, and their interactions or attributes.
* A common way to use CNNs for multi-label classification is to transform the task into multiple single-label classification problems, which can be trained with the ranking loss or the cross-entropy loss. However, because they treat labels independently, these methods fail to model the dependency between multiple labels. For instance, sky and cloud usually appear together, while water and cars almost never co-occur.
* To capture these higher-order label correlations, one can combine a CNN with a recurrent neural network (RNN).
* In the CNN-RNN structure, the RNN framework adapts the image features based on the previous prediction results.
* The result is a unified CNN-RNN framework for multi-label image classification, which effectively learns both the semantic redundancy and the co-occurrence dependency in an end-to-end way.
* **CNNs** are good with hierarchical or spatial data and at extracting unlabeled features. Those could be images or written characters. CNNs take fixed-size inputs and generate fixed-size outputs.

* **RNNs** are good at temporal or otherwise sequential data. This could be letters or words in a body of text, stock market data, or speech recognition. RNNs can input and output arbitrary lengths of data. LSTMs are a variant of RNNs that allow for controlling how much of the earlier context in a sequence should be remembered or, more appropriately, forgotten.

* The **CNN-RNN** framework contains two parts: the CNN part extracts semantic representations from images; the RNN part models the image/label relationship and the label dependency.
* For example, a prediction containing the labels “zebra” and “elephant” can be decomposed into an ordered prediction path, either (“zebra”, “elephant”) or (“elephant”, “zebra”). The probability of a prediction can be computed by the RNN network. The image, label, and recurrent representations are projected to the same low-dimensional space to model the image-text relationship as well as the label redundancy. The RNN model is employed as a compact yet powerful representation of the label co-occurrence dependency in this space. It takes the embedding of the predicted label at each time step and maintains a hidden state to model the label co-occurrence information. The a priori probability of a label given the previously predicted labels can be computed according to their dot products with the sum of the image and recurrent embeddings. The probability of a prediction path can be obtained as the product of the a priori probability of each label given the previous labels in the prediction.
* An RNN can learn temporal and context features, especially long-term dependencies between two entities, while a CNN is capable of capturing more potential features.
* The proposed framework combines the advantages of the joint image/label embedding and label co-occurrence models by employing a CNN and an RNN to model the label co-occurrence dependency in a joint image/label embedding space.
* According to the paper, experiments on several datasets demonstrate that the CNN-RNN approach achieves superior performance compared with other methods.
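The prediction-path probability described above can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions, not the paper's implementation: the sizes, the names `label_embed`, `path_cell`, and `image_embed`, the random stand-in for the CNN features, and the use of an untrained `nn.LSTMCell` are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels, embed_dim = 5, 16                       # hypothetical sizes

label_embed = nn.Embedding(num_labels, embed_dim)   # one embedding per label
path_cell = nn.LSTMCell(embed_dim, embed_dim)       # recurrent part (untrained)
image_embed = torch.randn(1, embed_dim)             # stand-in for projected CNN features

def path_log_prob(path):
    """Log-probability of an ordered label path, e.g. [0, 3]."""
    h = torch.zeros(1, embed_dim)
    c = torch.zeros(1, embed_dim)
    log_prob = torch.zeros(())
    for label in path:
        # Score each label by its dot product with (image + recurrent) embedding
        scores = (image_embed + h) @ label_embed.weight.T      # (1, num_labels)
        log_prob = log_prob + torch.log_softmax(scores, dim=1)[0, label]
        # Feed the predicted label's embedding back to update the hidden state
        h, c = path_cell(label_embed(torch.tensor([label])), (h, c))
    return log_prob

with torch.no_grad():
    lp_a = path_log_prob([0, 3])   # one ordering of two labels
    lp_b = path_log_prob([3, 0])   # the other ordering
print(lp_a.item(), lp_b.item())
```

Summing log-probabilities is equivalent to multiplying the per-step probabilities, and the two orderings generally receive different scores because the hidden state depends on which label was predicted first.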