In [None]:
import time
initial_time = time.time()

Concretely, this project is split into two parts:

- **Part I: play with a feed-forward neural network**
    - Build your feed-forward network with different layers and activation functions
    - Define the gradient descent function to update the parameters
    - Adjust the learning rate to achieve better performance 
    - Run the evaluation function


- **Part II: implement your own Convolutional Neural Network**
    - Train the CNN and compare it with the feed-forward neural network


Let's get started!

## 1. Package

Let's first import all the packages that you will need.

- **torch, torch.nn, torch.nn.functional** are the core modules of the PyTorch library, which provides the building blocks for deep learning projects in Python.
- **torchvision** is a computer vision library that goes hand in hand with PyTorch.
- **numpy** is the fundamental package for scientific computing with Python.
- **matplotlib** is a library to plot graphs and images in Python.
- **math, random** are standard Python modules.

In [None]:
# pip install torch
# pip install torchvision

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import random
import math
import numpy as np
import matplotlib.pyplot as plt
from project1_utils import *

print("Import packages successfully!")

In [None]:
seed = 1
set_seed(seed)

## 2. Dataset

In [None]:
# the number of images in a batch
batch_size = 32

# load dataset

trainset = dataset(path='dataset/trainset.h5')
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = dataset(path='dataset/testset.h5')
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

# name of classes
classes = ('cat', 'dog')

print ("Number of training examples: " + str(trainset.length))
print ("Number of testing examples: " + str(testset.length))

In [None]:
# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

num_toshow = 10

# show images
imshow(torchvision.utils.make_grid(images[:num_toshow]))

# print labels
_, indices = torch.max(labels, 1)
print(' '.join('%5s' % classes[indices[j]] for j in range(num_toshow)))

## 3. Build your feedforward neural network.

In this section, you will build a **three-layer multilayer perceptron (MLP)** to classify images into different categories.

<!-- As we know from the class, **each layer** of a MLP can be denoted as the following mathematical operation:

$$z = W^T x + b$$ $$a = \sigma(z)$$

Here, $W, b$ denote the weights and biases, and $a, \sigma$ denote activation output and activation function, respectively.
**The function is parameterized by $W, b$ as well as the choice of $\sigma(\cdot)$**.

Note that it is valid for $\sigma(\cdot)$ to be the identity function, or $z = \sigma(z)$.

----

**Question 1 (6 points):** Now, let's implement functions at the layer level to do the following:

Hint: To implement $W^Tx+b$ in PyTorch, one way is to write it as `x.mm(W) + b`. -->

The input is a batch of images stored as a tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size. Vectorizing the image pixels means reshaping this tensor into $X_{vector} \in \mathbb{R}^{B \times CHW}$, so that each image becomes a single row vector.
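
As a quick shape check, the sketch below flattens a dummy batch with `torch.flatten`. The batch size of 4 and the zero-filled tensor are illustrative assumptions, not values taken from the dataset.

```python
import torch

# A dummy batch: 4 RGB images of 32x32 pixels (the values are irrelevant here).
batch = torch.zeros(4, 3, 32, 32)

# Keep the batch dimension and merge C, H and W into a single axis.
flat = torch.flatten(batch, start_dim=1)

print(flat.shape)  # 4 rows, each with 3 * 32 * 32 = 3072 entries
```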

In [None]:
def image_vectorization(image_batch):
    """
    Input: 
        image_batch: a batch of images with shape [b, c, h, w]
    Output: 
        vectorized_image_batch: a batch of neurons
    """
    
    # vectorize the image pixels
    flat = nn.Flatten()
    vectorized_image_batch = flat(image_batch)

    return vectorized_image_batch
    

As covered in class, **each layer** of an MLP can be written as the following mathematical operation:

$$z = W^T x + b$$ 

Here, $W, b$ denote the weights and biases. The function is **parameterized by $W, b$**.
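
To make the shapes concrete, here is a minimal sketch of one linear layer applied to a toy batch. It assumes the weight is stored with shape `[output_dim, input_dim]`, so the batched form of the operation is a matrix product with the transposed weight; all numbers are made up for illustration.

```python
import torch

# Toy layer: 2 input neurons -> 3 output neurons.
w = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])       # shape [output_dim, input_dim]
b = torch.tensor([0.5, 0.5, 0.5])   # shape [output_dim]

# A batch holding a single example with 2 features.
x = torch.tensor([[1.0, 1.0]])      # shape [batch, input_dim]

# Batched linear map: each row of x is multiplied by the transposed weight,
# then the bias is broadcast over the batch dimension.
z = torch.mm(x, w.t()) + b

print(z.shape)  # one row per example, one column per output neuron
```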

In [None]:
def get_layer_params(input_dim: int, output_dim: int):
    """
    Input: 
        input_dim: number of neurons in the input
        output_dim: number of neurons produced by the layer
    Output: 
        a dictionary of generated parameters
            - w: weights
            - b: biases
    """
    
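    # allocate the parameters; torch.empty leaves the values uninitialized,
    # so they must be initialized (e.g. by init_params) before training
    w = nn.Parameter(torch.empty(output_dim, input_dim), requires_grad=True)
    b = nn.Parameter(torch.empty(output_dim), requires_grad=True)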
    
    return {'w': w,
            'b': b}
    

Following the linear layer, an activation layer adds non-linearity to the network:

 $$a = \sigma(z)$$

 $a, \sigma$ denote activation output and activation function, respectively.
The entire layer function is also **parameterized by choice of $\sigma(\cdot)$**.
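
For intuition, the three activations used in this project (ReLU, sigmoid and tanh) can be evaluated on a few scalars with only the standard library. This is a scalar sketch for illustration; the helper names `relu` and `sigmoid` are hypothetical, and inside the network you would apply the equivalent element-wise tensor operations.

```python
import math

def relu(z):
    # Passes positive values through and clamps negative values to zero.
    return max(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# math.tanh squashes any real number into the open interval (-1, 1).
for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), sigmoid(z), math.tanh(z))
```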

In [None]:
def activation_wrapper(z, activation='relu'):
    """
    Input: 
        z: the pre-activation neuron values
        activation: name of activation, one of ['relu', 'sigmoid', 'tanh']
    Output: 
        a: the corresponding activated output
    """
    # Use differentiable, vectorized tensor ops so that autograd can
    # backpropagate through the activation. Element-wise in-place edits under
    # torch.no_grad() would hide the activation from the gradient computation.
    if activation == 'relu':
        a = torch.relu(z)
    elif activation == 'sigmoid':
        a = torch.sigmoid(z)
    elif activation == 'tanh':
        a = torch.tanh(z)
    else:
        raise ValueError('Unknown activation: ' + str(activation))

    return a
    

In [None]:
def layer_forward_computation(x, params, activation):
    """
    Input: 
        x: the input to the layer
        params: parameters of each layer
        activation: activation type
    Output: 
        a: the output after the activation
    """
    
    # compute the output for layer
    z = torch.mm(x, torch.t(params['w'])) + params['b']

    a = activation_wrapper(z, activation)

    return a

---

Back to building our three-layer MLP for classification. If you have implemented the functions above correctly,
putting everything together will now be very easy.

Just like other parts of your programming experience,
knowing how to efficiently abstract and modularize components of your program will be critical in deep learning.

**Architecture Requirement**:

We now describe in detail how our three-layer MLP should be built in PyTorch.

1. In the dataset, each input batch is a tensor $X \in \mathbb{R}^{B \times 3 \times 32 \times 32}$, where $B$ denotes the batch size.
2. Vectorize the image pixels to a vector $X_{vector} \in \mathbb{R}^{B \times 3072}$.
3. The layers below describe one specific architecture. It is not the only valid design, so feel free to change the hidden dimensions.
4. Layer1: set your parameters so the input is projected from $\mathbb{R}^{B \times 3072}$ to $\mathbb{R}^{B \times 256}$; use ReLU as the activation function
5. Layer2: set your parameters so the input is projected from $\mathbb{R}^{B \times 256}$ to $\mathbb{R}^{B \times 128}$; use ReLU as the activation function
6. Layer3: set your parameters so the input is projected from $\mathbb{R}^{B \times 128}$ to $\mathbb{R}^{B \times 2}$; use the sigmoid function as the activation function
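
To get a feel for the size of this architecture, the following sketch counts its trainable parameters. It assumes the default hidden dimensions listed above (256 and 128) and one bias per output neuron.

```python
# (input_dim, output_dim) for the three layers listed above.
layer_dims = [(3072, 256), (256, 128), (128, 2)]

# Each layer has input_dim * output_dim weights plus output_dim biases.
counts = [n_in * n_out + n_out for n_in, n_out in layer_dims]
total = sum(counts)

print(counts)  # [786688, 32896, 258]
print(total)   # 819842
```

Almost all of the parameters sit in the first layer, because every one of the 3072 input pixels is connected to every hidden neuron.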

In [None]:
layer1_params: dict = dict()
layer2_params: dict = dict()
layer3_params: dict = dict()



def net(X, params, activations):
    """
    Input: 
        X: the input images to the network
        params: a dictionary of parameters(W and b) for the three different layers
        activations: a dictionary of activation function names for the three different layers
    Output: 
        output: the final output from the third layer
    """
    # build your network forward

    vectorized_image_batch = image_vectorization(X)

    layer1_output = layer_forward_computation(vectorized_image_batch, params['layer1'], activations['layer1'])

    layer2_output = layer_forward_computation(layer1_output, params['layer2'], activations['layer2'])

    output = layer_forward_computation(layer2_output, params['layer3'], activations['layer3'])
    
    return output

In [None]:
""" We prepare several dictionaries to store the parameters and activations for the different layers """
layer1_params: dict = dict()
layer2_params: dict = dict()
layer3_params: dict = dict()
params: dict = dict()
activations: dict = dict()

layer1_params = get_layer_params(3072, 256)
layer2_params = get_layer_params(256, 128)
layer3_params = get_layer_params(128, 2)

params['layer1'] = layer1_params
params['layer2'] = layer2_params
params['layer3'] = layer3_params

# Three activation function options: ['relu', 'sigmoid', 'tanh']

activations['layer1'] = 'relu'
activations['layer2'] = 'relu'
activations['layer3'] = 'sigmoid'

In [None]:
output = net(images, params, activations)

## 4. Backpropagation and optimization

After finishing the forward pass, you now need to compute gradients for all Tensors with `requires_grad=True`, e.g., parameters of layer1. These gradients will be used to update parameters via gradient descent. 



Gradient descent minimizes the objective function (loss) $J(\theta)$, parameterized by the model's parameters $\theta$, by updating the parameters in the opposite direction of the gradient $\nabla_\theta J(\theta)$. The learning rate $\lambda$ determines the size of the steps you take to reach a (local) minimum.

However, vanilla gradient descent needs to run through all the samples in your training set for a single update, which is time-consuming on large-scale datasets. We use Stochastic Gradient Descent instead, which only requires a subset of the training samples per update. In popular deep learning frameworks, this subset is usually the minibatch drawn during training.

Now, let's look at the equation to update parameters for each layer in your network.

$$\large \theta = \theta - \lambda\cdot\nabla_\theta J(\theta)$$
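
As a minimal illustration of this update rule, the sketch below applies it to a toy one-dimensional objective $J(\theta) = (\theta - 3)^2$, whose minimum is at $\theta = 3$. The objective, the starting point, the step count, and the names `grad_J` and `lr` are assumptions chosen for illustration only.

```python
def grad_J(theta):
    # Gradient of the toy objective J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter value
lr = 0.1      # the learning rate (lambda in the update rule)

for step in range(50):
    # Move against the gradient, scaled by the learning rate.
    theta = theta - lr * grad_J(theta)

print(theta)  # very close to the minimizer 3.0
```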

---

In [None]:
def update_params(params, learning_rate):
    """
    Input: 
        params: the dictionary that stores all the layer parameters
        learning_rate: the step length to update the parameters
    Output: 
        params: the updated parameters
    """
        
    # update the parameters of each layer
    with torch.no_grad():
       for k in params:
         params[k]['w'] -= learning_rate * params[k]['w'].grad
         params[k]['b'] -= learning_rate * params[k]['b'].grad
    
    return params

Since you are updating the parameters for each batch of data iteratively, you will need to clear the gradients after each update. 
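
The reason is that PyTorch accumulates gradients into `.grad` across calls to `backward()`. The sketch below shows this on a single scalar parameter; the tensor and the toy function are illustrative assumptions.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without clearing: the gradients add up.
(2.0 * w).backward()
(2.0 * w).backward()
accumulated = w.grad.item()   # 2 + 2, not 2

# Clearing resets the stored gradient before the next batch.
w.grad.zero_()
cleared = w.grad.item()

print(accumulated, cleared)
```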


In [None]:
def zero_grad(params):
    """
    Input: 
        params: the dictionary that stores all the layer parameters
    Output: 
        params: the updated parameters with gradients cleared
    """
    for k in params:
      params[k]['w'].grad.zero_()
      params[k]['b'].grad.zero_()

    return params
    


With the functions **update_params( )** and **zero_grad( )** defined, you can move on to the backpropagation process. It consists of computing the gradients, updating the parameters, and resetting the gradients. Think about how to combine them.


In [None]:


def backprop(loss, params, learning_rate):
    """
    Input: 
        loss: the loss tensor from the objective function that can be used to compute gradients
        params: parameters of the three layers
        learning_rate: the size of steps when updating parameters
    Output:
        params: parameters after one backpropagation step
    """    
    loss.backward()
    update_params(params, learning_rate)
    zero_grad(params)

    return params
    


## 5. Training loop

For this binary classification task, the standard **Binary Cross-Entropy Loss** is used as the objective function:

$$\large L = -\frac{1}{N}\sum_{i=1}^{N}\left( y_i \cdot \log(p(y_i))+(1-y_i)\cdot\log(1-p(y_i))\right)$$

where $y_i$ is the label of the $i$-th sample (1 for dog and 0 for cat in our case), $p(y_i)$ is its predicted probability, and $N$ is the batch size.
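
The formula can be checked by hand on a tiny batch. The sketch below uses only the standard library; the function name `binary_cross_entropy` and the two predicted probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    # Mean of -(y*log(p) + (1-y)*log(1-p)) over the batch.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0]      # one dog, one cat
p_pred = [0.9, 0.2]  # predicted probability of "dog" (made-up values)

loss = binary_cross_entropy(y_true, p_pred)
print(round(loss, 4))  # 0.1643
```

Confident, correct predictions push the loss toward 0, while a confident wrong prediction makes the corresponding log term very large.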


Before moving into the training loop, it's usually good practice to define a learning rate decay function. The longer the model trains, the closer it gets to a minimum. A lower learning rate then lets it take finer steps and settle near that minimum rather than overshoot it.
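
One common form of decay is a step schedule, sketched below in closed form. The function name `step_decay` and its parameters `step` and `gamma` are hypothetical, and the interval of 15 epochs and factor of 0.1 are assumptions for illustration.

```python
def step_decay(base_lr, epoch, step=15, gamma=0.1):
    # Multiply the base rate by gamma once for every completed block of `step` epochs.
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 14, 15, 29):
    print(epoch, step_decay(1e-2, epoch))
```

With these settings, the rate stays at its base value for epochs 0 to 14 and is ten times smaller for epochs 15 to 29.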

In [None]:
def adjust_lr(learning_rate, epoch):
    """
    Input: 
        learning_rate: the input learning rate
        epoch: which epoch you are in
    Output:
        learning_rate: the updated learning rate
    """    

    # decay the learning rate by a factor of 10 every 15 epochs
    if (epoch + 1) % 15 == 0:
        learning_rate = learning_rate * 0.1

    return learning_rate
    

In [None]:
import time
start_time = time.time()

# define the initial learning rate here
learning_rate = 1e-2
n_epochs = 30 # how many epochs to run

# define loss function
criterion = nn.BCELoss()

# initialize network parameters
init_params(params)

for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        labels = labels.float()

        # Forward
        output = net(inputs, params, activations)
        
        # Compute the loss using the final output
        loss = criterion(output, labels)
        
        # Backpropagation
        params = backprop(loss, params, learning_rate)
        
        # print statistics
        running_loss += loss.item()

        if i % 200 == 199:  # print every 200 mini-batches
            print('[Epoch %d, Step %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0
            
    # adjust learning rate
    learning_rate = adjust_lr(learning_rate, epoch)
print('Finished Training')

print('Time taken : ' + str((time.time() - start_time)/60) + ' mins')

## 6. Testing

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
_, labels = torch.max(labels, 1)
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(10)))

output = net(images, params, activations)
_, predicted = torch.max(output, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(10)))

**Evaluation**: Now test your trained model!

In [None]:
correct = 0
total = 0

# since you're not training, you don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        
        # calculate outputs by running images through the network
        output = net(images, params, activations)

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 2000 test images: %d %%' % (
        100 * correct / total))

In [None]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        output = net(images, params, activations)
        _, predictions = torch.max(output, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))

# Part II

---

## 7. Build your convolutional neural network.

So far, you have tried a feed-forward network on image classification, but the performance is not satisfying. Why?
The reason is that vectorizing the images discards critical spatial patterns such as edges, corners, and local structures. Convolutional neural networks, in contrast, are very good at capturing these patterns.

To explore this, let's build a CNN to see how good it is in image classification!

---
Let's build a two-layer CNN with a max pooling layer in between.
Note that one Fully-Connected (FC) layer follows the CNN layers to map image features into class features. Overall, this network is similar to the one you built above, with the first two feed-forward layers replaced by convolutional layers:

            image -> [CNN layer 1] -> [CNN layer 2] -> vectorization -> [FC layer] -> prediction

**Architecture Requirement**:

1. CNN Layer1: a convolution kernel size of **3 or 5** is suggested; choose the number of output channels from **[16, 32, 64]**; use ReLU as the activation function
2. CNN Layer2: a convolution kernel size of **3 or 5** is suggested; choose the number of output channels from **[128, 256]**; use ReLU as the activation function
3. FC layer: set your parameters so the input is projected from $\mathbb{R}^{B \times N}$ to $\mathbb{R}^{B \times 2}$, where $N$ is determined by your CNN layers' parameters; use the sigmoid function as the activation function
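
The value of $N$ follows from the spatial size left after the convolution and pooling layers. The sketch below works it out for one possible configuration. It assumes 32x32 inputs, no padding, stride-1 convolutions with kernel sizes 5 and 3, 256 output channels in the second layer, and a 2x2 max pooling after each convolution; the helper `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output length along one spatial axis for a convolution or pooling window.
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # input height/width
size = conv_out(size, kernel=5)             # CNN layer 1 -> 28
size = conv_out(size, kernel=2, stride=2)   # 2x2 max pooling -> 14
size = conv_out(size, kernel=3)             # CNN layer 2 -> 12
size = conv_out(size, kernel=2, stride=2)   # 2x2 max pooling -> 6

channels = 256
N = channels * size * size
print(size, N)  # 6 9216
```

A different kernel size, channel count, or pooling layout changes $N$, so recompute it whenever you change the convolutional layers.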

In [None]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Define your layers here!! 
        self.cnn_layer1 = torch.nn.Sequential(
            nn.Conv2d(3, 32, kernel_size = 5), 
            nn.ReLU(), 
            nn.MaxPool2d(kernel_size=2, stride=2)
            )
        
        self.cnn_layer2 = torch.nn.Sequential(
            nn.Conv2d(32, 256, kernel_size = 3), 
            nn.ReLU(), 
            nn.MaxPool2d(kernel_size=2, stride=2)
            )

        self.FC_layer = torch.nn.Sequential(
            nn.Linear(6*6*256, 2),
            nn.Sigmoid()
            )
        
    def forward(self, x):
        # Your forward pass with the defined layers
        cnn_l1_out = self.cnn_layer1(x)
        cnn_l2_out = self.cnn_layer2(cnn_l1_out)
        vectorized_image_batch = image_vectorization(cnn_l2_out)
        output = self.FC_layer(vectorized_image_batch)
        return output

In [None]:
import time
starting = time.time()

# define the initial learning rate here
learning_rate = 1e-2
n_epochs = 30 # how many epochs to run

# define loss function
criterion = nn.BCELoss()
cnn_net = Net()
optimizer = torch.optim.SGD(cnn_net.parameters(), lr=learning_rate)

for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        labels = labels.float()

        # Forward 
        output = cnn_net(inputs)
        
        # Compute the loss using the final output
        loss = criterion(output, labels)

        # Backpropagation
        loss.backward()

        optimizer.step()

        optimizer.zero_grad()

        # print statistics
        running_loss += loss.item()
        if i % 200 == 199:  # print every 200 mini-batches
            print('[Epoch %d, Step %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 200))
            running_loss = 0.0

print('Finished Training')
print('Time taken : ' + str((time.time() - starting)/60) + ' mins')

In [None]:
correct = 0
total = 0

# since you're not training, you don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        
        # calculate outputs by running images through the network
        output = cnn_net(images)

        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(output.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 2000 test images: %d %%' % (
        100 * correct / total))

In [None]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        _, labels = torch.max(labels, 1)
        output = cnn_net(images)
        _, predictions = torch.max(output, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1

# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                                                         accuracy))

In [None]:
print("Total time taken : " + str((time.time() - initial_time)/60) + " mins")