# Quiz 2 Reference Solutions

## Q1
**In principal component analysis, a smaller eigenvalue indicates that**

**Answer**: A given **principal component**, say V_j, is **less** important.

1. Each eigenvalue is associated with a principal component (eigenvector), not with a variable in the original space.
2. A smaller eigenvalue means that less variation (information) is captured along that component, so the component is less important.


## Q2
**We have a set of 20x20 binary images. We want to use PCA to reduce the dimensionality to 10. How many parameters do we need to estimate in total?**

**Answer**: 4000

1. 20x20 images mean the original dimensionality is 20x20=400.
2. To reduce the dimensionality to 10, we need to estimate 10 eigenvectors.
3. Each eigenvector has dimension 400, so the total number of parameters to estimate is 400 x 10 = 4000.
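
The count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and not taken from the lecture.

```python
# Sketch: count the entries of the PCA projection matrix for Q2.
height, width = 20, 20          # image size given in the question
n_components = 10               # target dimensionality

original_dim = height * width   # each image flattens to a 400-vector

# One eigenvector per retained component, each with `original_dim` entries.
n_parameters = original_dim * n_components

print(original_dim, n_parameters)  # 400 4000
```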



## Q3
**In slide 10 of Lecture 8, if all the variables are assumed to be Gaussian and there are 8 variables in total, how many parameters do we need to learn/compute?**

**Answer**: 16

1. Slide 10 of L8 assumes variables are independent.
2. Each Gaussian variable needs two parameters.
3. 8x2=16
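
As a quick check of the arithmetic, the sketch below counts the parameters under the independence assumption. For contrast only (this comparison is not part of the quiz or the slide), it also counts the parameters of a full joint Gaussian over the same 8 variables, which needs a mean vector plus a symmetric covariance matrix.

```python
n_variables = 8

# Independent Gaussians: one mean and one variance per variable.
independent_params = 2 * n_variables

# For contrast (not part of the quiz): a full joint Gaussian needs a mean
# vector (d entries) plus a symmetric covariance matrix with d(d+1)/2
# free entries.
full_joint_params = n_variables + n_variables * (n_variables + 1) // 2

print(independent_params, full_joint_params)  # 16 44
```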


## Q4
**Consider the fully connected neural network (multilayer perceptron) in Slide 47 of Lecture 6. Suppose we insert a new hidden layer with 8 neurons between the old hidden layer (4 neurons) and the output layer (2 neurons), with full connections between the old hidden layer and the new hidden layer, and full connections between the new hidden layer and the output layer (no direct connections between the old hidden layer and the output layer). The same activation function sigma is used in the new hidden layer. How many learnable parameters are there in total for this two-hidden-layer neural network?**

**Answer**: 74

Following the example in the slide (the 3*4 term counts the input-to-hidden weights):
1. Weights: 3*4+4*8+8*2=60
2. Biases: 4+8+2=14
3. Total: 60+14=74
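
The same count can be written as a small helper. This is a minimal sketch: `count_mlp_parameters` is a hypothetical name, and the layer sizes `[3, 4, 8, 2]` are read off the calculation above (3 inputs, hidden layers of 4 and 8 neurons, 2 outputs).

```python
def count_mlp_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected network.

    Each pair of consecutive layers contributes a weight matrix of
    size fan_in * fan_out, and every non-input neuron has one bias.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases


# Assumed layer sizes: input 3, old hidden 4, new hidden 8, output 2.
layer_sizes = [3, 4, 8, 2]
n_weights, n_biases = count_mlp_parameters(layer_sizes)
n_total = n_weights + n_biases

print(n_weights, n_biases, n_total)  # 60 14 74
```

In PyTorch, the analogous count for a built model is usually obtained with `sum(p.numel() for p in model.parameters())`.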



## Q5
**You are given a training dataset with a 2x2 covariance matrix C = [1, 2; 3, 4], i.e., the first row is [1 2] and the second row is [3, 4]. This covariance matrix C has two eigenvectors u1=[-0.55 0.83]^T and u2=[0.83 0.55]^T and two corresponding eigenvalues 5 and 51. How much of the variance in the data is explained by the FIRST principal component? Please input your answer rounding to the nearest integer, e.g., input 12 if your answer is 12.34%.**

**Answer**: 91

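The first principal component is the eigenvector with the largest eigenvalue (51), so the fraction of variance it explains is 51/(51+5)=0.91071428571, i.e., about 91%.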
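
A minimal sketch of this calculation is given below. It takes the two eigenvalues quoted in the question at face value and does not recompute them from C; the variable names are illustrative.

```python
# Eigenvalues as quoted in the question (taken as given).
eigenvalues = [5, 51]

# Principal components are ordered by decreasing eigenvalue.
ordered = sorted(eigenvalues, reverse=True)
total_variance = sum(ordered)

# Fraction of the total variance explained by each component.
explained_ratio = [value / total_variance for value in ordered]
first_pc_percent = round(100 * explained_ratio[0])

print(first_pc_percent)  # 91
```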


## Q6
**Please modify Lab 7B2 strictly following the three requirements from a client:**

1) Specify a random seed of 2020 using the manual_seed() method of PyTorch right before the line myAE=Autoencoder(). Do not set any other seed anywhere else, i.e., remove the previous seed torch.manual_seed(509) above criterion = nn.MSELoss();

2) Change the optimizer to torch.optim.SGD with learning_rate=0.05 and other parameters as default;

3) Train the autoencoder for at least three epochs, record the loss at the end of epoch 1 as loss1, and the loss at the end of epoch 3 as loss3. 

**Please report the improvement (loss1-loss3) using only three decimals with rounding, e.g., x.xxx.**

**Answer**: See below. The improvement is 0.030788384, which rounds to 0.031. I accept a tolerance of $\pm$0.001 (to allow for rounding in the wrong direction).


#### Libraries

Get ready by importing the standard APIs.

In [183]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

#### Data
Let us work with the popular [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) with handwritten digits. We will work on a subset for efficiency.

In [184]:
mnist_data = datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor())
print(len(mnist_data))
mnist_data = list(mnist_data)[:4096]
print(len(mnist_data))

60000
4096


#### Define the NN architecture
In Lab 1, we did not define a class for our linear regression NN. Here we do so and define an autoencoder class consisting of an **encoder** followed by a **decoder** as defined below.
<img src="https://miro.medium.com/max/3524/1*oUbsOnYKX5DEpMOK3pH_lg.png" style="width:360px;"/>

In [185]:
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            # 1 input image channel, 16 output channel, 3x3 square convolution
            nn.Conv2d(1, 16, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 7)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 7),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid()  #to range [0, 1]
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

`__init__()` defines the layers. `forward()` defines the *forward pass* that transforms the input to the output. `backward()` is automatically defined using `autograd`.

Here, we have both convolution layers `Conv2d()` and transpose convolution layers `ConvTranspose2d()`, with nice illustrations at [Convolution arithmetic](https://github.com/vdumoulin/conv_arithmetic). The basic ones are reproduced below where blue maps indicate inputs, and cyan maps indicate outputs.

<table>
    <tr>
    <td  style="text-align: left"> Convolution with no padding, no strides.      <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif" alt="Drawing" style="width: 250px;"/> </td>
    <td  style="text-align: left"> Transpose convolution with No padding, no strides.<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/no_padding_no_strides_transposed.gif" alt="Drawing" style="width: 250px;"/> </td>
</tr>
</table>
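
To see how the layer settings in `Autoencoder` shrink and then restore the image, the sketch below traces the spatial size through the encoder and decoder using the standard output-size formulas for convolution and transpose convolution (dilation 1). The helper names are hypothetical, and the 28x28 input size is the MNIST image size.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output size of a transpose convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# (kernel, stride, padding) of the three encoder convolutions.
size = 28  # MNIST images are 28x28
encoder_sizes = [size]
for kernel, stride, padding in [(3, 2, 1), (3, 2, 1), (7, 1, 0)]:
    size = conv2d_out(size, kernel, stride, padding)
    encoder_sizes.append(size)

# (kernel, stride, padding, output_padding) of the three decoder layers.
decoder_sizes = [size]
for kernel, stride, padding, out_pad in [(7, 1, 0, 0), (3, 2, 1, 1), (3, 2, 1, 1)]:
    size = conv_transpose2d_out(size, kernel, stride, padding, out_pad)
    decoder_sizes.append(size)

print(encoder_sizes)  # [28, 14, 7, 1]
print(decoder_sizes)  # [1, 7, 14, 28]
```

The encoder therefore compresses each image to a 64-channel 1x1 code, and the decoder mirrors the encoder to return to 28x28.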

`ReLU()` and `Sigmoid()` are the [rectified linear unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) and the [Sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), two popular **activation functions** that perform a *nonlinear* transformation/mapping of an input variable (element-wise operation).
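
As a plain-Python illustration of what "element-wise" means here, the sketch below applies both functions to a small list of numbers. It is not how PyTorch implements them; `relu` and `sigmoid` are written out by hand for clarity.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become 0, others pass through."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


inputs = [-2.0, 0.0, 3.0]

# Each output element depends only on the matching input element.
relu_out = [relu(x) for x in inputs]
sigmoid_out = [sigmoid(x) for x in inputs]

print(relu_out)  # [0.0, 0.0, 3.0]
print(sigmoid_out)
```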


#### Inspect the NN architecture

Now let's take a look at the autoencoder built.

In [186]:
#Set the random seed for reproducibility 
torch.manual_seed(2020) 

myAE=Autoencoder()
print(myAE)

Autoencoder(
  (encoder): Sequential(
    (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU()
    (2): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU()
    (4): Conv2d(32, 64, kernel_size=(7, 7), stride=(1, 1))
  )
  (decoder): Sequential(
    (0): ConvTranspose2d(64, 32, kernel_size=(7, 7), stride=(1, 1))
    (1): ReLU()
    (2): ConvTranspose2d(32, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (3): ReLU()
    (4): ConvTranspose2d(16, 1, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
    (5): Sigmoid()
  )
)


Let us check the (randomly initialised) parameters of this NN. Below, we check the weight and bias of the first 2D convolution (the ReLU activation function has no learnable parameters).

In [187]:
params = list(myAE.parameters())
print(len(params))
print(params[0].size())  # First Conv2d's .weight
print(params[1].size())  # First Conv2d's .bias
print(params[1])

12
torch.Size([16, 1, 3, 3])
torch.Size([16])
Parameter containing:
tensor([-0.1837, -0.2009, -0.1551, -0.1346,  0.0843,  0.0840, -0.2357, -0.0940,
         0.2356, -0.1050,  0.2993, -0.1217,  0.2885,  0.1571, -0.0261,  0.2319],
       requires_grad=True)
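
The `12` printed above is the number of parameter tensors: a weight and a bias for each of the six convolution and transpose convolution layers. The sketch below recomputes this, together with the total number of individual learnable values, from the layer settings alone. It is a hand calculation in plain Python, not a call into PyTorch, and the variable names are illustrative.

```python
# (in_channels, out_channels, kernel_size) for the six layers of Autoencoder,
# in order: three Conv2d layers, then three ConvTranspose2d layers.
layers = [
    (1, 16, 3), (16, 32, 3), (32, 64, 7),   # encoder
    (64, 32, 7), (32, 16, 3), (16, 1, 3),   # decoder
]

# Each layer contributes one weight tensor and one bias tensor.
n_tensors = 2 * len(layers)

# Weight entries: in_channels * out_channels * k * k; biases: out_channels.
n_values = sum(c_in * c_out * k * k + c_out for c_in, c_out, k in layers)

print(n_tensors, n_values)  # 12 210369
```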


To learn more about these functions, refer to the [`torch.nn` documentation](https://pytorch.org/docs/stable/nn.html) (search for the function, e.g., search for `torch.nn.ReLU(` and you will find its documentation [here](https://pytorch.org/docs/stable/nn.html?highlight=relu#torch.nn.ReLU)).

#### Train the NN
Next, we will feed data into this autoencoder to train it, i.e., learn its parameters so that the reconstruction error (the `loss`) is minimised, using the mean square error (MSE) loss and the `SGD` optimiser (the client's requirement 2; the original lab used `Adam`). The dataset is loaded in batches to train the model. One `epoch` means one cycle through the full training dataset. The loss at the end of each epoch is saved in `allloss` for later inspection. The steps are:
* Define the optimisation criterion and optimisation method.
* Iterate through the whole dataset in batches, for a number of `epochs` until a specified maximum is reached or a convergence criterion is met (e.g., successive change of loss < 0.000001).
* For each batch, we
    * do a forward pass
    * compute the loss
    * backpropagate the loss via `autograd`
    * update the parameters
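
The training cell in this solution runs for a fixed number of epochs. For completeness, a convergence criterion of the kind mentioned in the steps could be sketched as follows. This is an illustration only: `train_until_converged` and `make_toy_epoch` are hypothetical helpers, and the toy loss (halving every epoch) merely stands in for one real pass over the data.

```python
def train_until_converged(train_one_epoch, max_epochs=100, tol=1e-6):
    """Run epochs until the loss changes by less than `tol` or
    `max_epochs` is reached. Returns the list of per-epoch losses."""
    losses = []
    for _ in range(max_epochs):
        loss = train_one_epoch()
        converged = bool(losses) and abs(losses[-1] - loss) < tol
        losses.append(loss)
        if converged:
            break
    return losses


def make_toy_epoch(start=1.0):
    """Hypothetical stand-in for a real epoch: the loss halves each call."""
    state = {"loss": start}

    def step():
        state["loss"] *= 0.5
        return state["loss"]

    return step


history = train_until_converged(make_toy_epoch(), max_epochs=100, tol=1e-6)
print(len(history))  # 20
```

In a real run, `train_one_epoch` would wrap the batch loop (forward pass, loss, backpropagation, parameter update) and return the epoch's loss as a Python float.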

In [188]:
#Hyperparameters for training
batch_size=64
learning_rate=0.05
max_epochs = 3


#Set the random seed for reproducibility 
#torch.manual_seed(509) 
#Choose mean square error loss
criterion = nn.MSELoss() 

#Choose the SGD optimiser
#optimizer = torch.optim.Adam(myAE.parameters(), lr=learning_rate, weight_decay=1e-5)
optimizer = torch.optim.SGD(myAE.parameters(), lr=learning_rate)

#Specify how the data will be loaded in batches (with random shuffling)
train_loader = torch.utils.data.DataLoader(mnist_data, batch_size=batch_size, shuffle=True)

#Record loss
allloss = []
#Start training
for epoch in range(max_epochs):
    for data in train_loader:
        img, label = data
        optimizer.zero_grad()
        recon = myAE(img)
        loss = criterion(recon, img)
        loss.backward()
        optimizer.step()                
    #Note: `loss` here is the loss of the last batch in the epoch
    print('Epoch:{}, Loss:{:.4f}'.format(epoch+1, float(loss)))
    allloss.append((epoch, loss.data.numpy()))

Epoch:1, Loss:0.1385
Epoch:2, Loss:0.1203
Epoch:3, Loss:0.1077


In [189]:
allloss[0][1]-allloss[max_epochs-1][1]

0.030788384
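
The reported answer and the accepted tolerance can be checked with the short sketch below. `is_accepted` is a hypothetical helper written for illustration, not part of the lab; it rounds the difference to six decimals so that floating-point noise at the boundary does not reject a borderline answer.

```python
improvement = 0.030788384          # loss1 - loss3 from the cell above
reported = round(improvement, 3)   # three decimals, as the question asks


def is_accepted(submitted, reference=0.031, tol=0.001):
    """Return True if `submitted` is within `tol` of `reference`."""
    return round(abs(submitted - reference), 6) <= tol


print(reported)            # 0.031
print(is_accepted(0.030))  # True
print(is_accepted(0.033))  # False
```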