<a href="https://colab.research.google.com/github/yexf308/MachineLearning/blob/main/homework/HW5/592Fa22HW5Q3_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from tqdm import tqdm


$\def\m#1{\mathbf{#1}}$
$\def\mm#1{\boldsymbol{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$
$\def\mr#1{\mathrm{#1}}$
$\newenvironment{rmat}{\left[\begin{array}{rrrrrrrrrrrrr}}{\end{array}\right]}$
$\newcommand\brm{\begin{rmat}}$
$\newcommand\erm{\end{rmat}}$
$\newenvironment{cmat}{\left[\begin{array}{ccccccccc}}{\end{array}\right]}$
$\newcommand\bcm{\begin{cmat}}$
$\newcommand\ecm{\end{cmat}}$



---






# Q3: Image classification on CIFAR-10 (60pt)

### Preliminary information:
In this problem we will explore different deep learning architectures for image classification on the CIFAR-10
dataset. If you are not comfortable with PyTorch from the previous lecture and discussion materials, use the tutorials at http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html and make sure you
are familiar with tensors, two-dimensional convolutions (`nn.Conv2d`) and fully-connected layers (`nn.Linear`),
ReLU non-linearities (`F.relu`), pooling (`nn.MaxPool2d`), and tensor reshaping (`view`).

For this problem, it is highly recommended that you copy and modify the existing network code produced in
the tutorial *Training a classifier*. You should not be coding this network from scratch!



- Each network $f$ maps an image $x^{\rm in} \in \mb{R}^{32 \times 32 \times 3}$ (3 channels for RGB) to an output $f(x^{\rm in}) = x^{\rm out} \in \mb{R}^{10}$. The class label is predicted as $\arg\max_{i=0,1,\dots,9} x_{i}^{\rm out}$.

- The network is trained via the multiclass cross-entropy loss (the negative log of the softmax function). Specifically, for an input image and label pair $(x^{\rm in} , c)$ where $c\in \{0,\dots, 9\}$, if the network’s
output layer is $x^{\rm out} \in \mb{R}^{10}$, the loss is $-\log\left(\frac{\exp(x_c^{\rm out})}{\sum_{c'} \exp(x_{c'}^{\rm out})}\right)$.

- For computational efficiency, the network processes mini-batches of images at each training
step, meaning it actually maps $B=4$ images per forward pass, so that $\tilde{x}^{\rm in}\in\mb{R}^{B\times 32 \times 32 \times 3}$ and $\tilde{x}^{\rm out}\in\mb{R}^{B\times 10}$. This is ignored in the network descriptions below, but it is something to be aware of.
 
- Create a validation dataset by appropriately partitioning the train dataset. **Hint**: look at the documentation for `torch.utils.data.random_split`. Make sure to tune hyperparameters like network architecture and step size on the validation dataset. Do **NOT** validate your hyperparameters on the test dataset.

- Modify the training code such that at the end of each epoch (one pass over the training data) it computes and prints the training and test classification accuracy.

- The cross-entropy loss for a neural network is, in general, non-convex. This means that the optimization
method may converge to different local minima based on different hyperparameters of the optimization
procedure (e.g., stepsize). Usually one can find a good setting for these hyperparameters by just observing
the relative progress of training over the first epoch or two (how fast is it decreasing) but you are warned
that early progress is not necessarily indicative of the final convergence value (you may converge quickly to a poor local minimum whereas a different step size could have poor early performance but converge to
a better final value).

- While one would usually train a network for hundreds of epochs to reach convergence and maximize accuracy, this can be prohibitively time-consuming, so feel free to train for just a dozen or so epochs.
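
The prediction rule and the loss described above can be checked numerically. The following is a minimal sketch, assuming a random mini-batch of $B=4$ output vectors and made-up labels (both are placeholders, not CIFAR-10 data). It compares the formula for the loss with PyTorch's built-in `F.cross_entropy`, which expects raw outputs rather than softmax probabilities.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical mini-batch: B=4 output vectors in R^10 and their labels.
x_out = torch.randn(4, 10)
labels = torch.tensor([3, 0, 7, 1])

# Predicted class: index of the largest output, one per image.
pred = x_out.argmax(dim=1)

# Loss from the formula -log(exp(x_c) / sum_c' exp(x_c')), averaged over the batch.
log_softmax = x_out - torch.logsumexp(x_out, dim=1, keepdim=True)
manual_loss = -log_softmax[torch.arange(4), labels].mean()

# Built-in equivalent; it applies log-softmax internally.
builtin_loss = F.cross_entropy(x_out, labels)

# Classification accuracy on this mini-batch.
accuracy = (pred == labels).float().mean().item()
```

Accumulating the number of correct predictions over all mini-batches and dividing by the dataset size gives the per-epoch accuracy.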


**Your Task:** 
For all of the following, 
- Apply a **hyperparameter tuning method** (manually by
hand, grid search, random search, etc.) using the
validation set

- Report the hyperparameter configurations you evaluated and the best set of hyperparameters
from this set.  

- Plot the training and validation classification accuracy as a function of iteration. Produce
a separate line or plot for each hyperparameter configuration evaluated (please try to use multiple lines in a single plot to keep the number of figures minimal). 

- Finally, evaluate your best set of
hyperparameters on the test data and report the accuracy.
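
One way to organize a grid search is to enumerate the configurations up front and pick the winner by validation accuracy. This is only a sketch: the grid values and the helper name `select_best` are hypothetical placeholders, not recommended settings, and the training runs that produce the accuracies are left to you.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
grid = {
    "lr": [1e-3, 1e-2],
    "momentum": [0.0, 0.9],
    "hidden_size": [128, 256],
}

# One dict per configuration: the Cartesian product of all listed values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]


def select_best(results):
    """Return the config with the highest validation accuracy.

    `results` is a list of (config, val_accuracy) pairs produced by your own
    training runs.
    """
    return max(results, key=lambda pair: pair[1])[0]
```

Only the configuration returned by the selection step should be evaluated on the test data.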



The number of hyperparameters to tune, combined with the slow training times, will hopefully give
you a taste of how difficult it is to construct networks with good generalization performance. It should be emphasized that the
networks we constructed are **tiny**. 
State-of-the-art networks can have dozens of layers, each with its own hyperparameters to tune. Additional
hyperparameters you are welcome to play with, if you are so inclined, include: changing the activation
function, replacing max-pool with average-pool, adding more convolutional or fully connected layers, and
experimenting with batch normalization or dropout.


Here are the network architectures you will construct and compare.
Before you jump into tuning, it is better to write separate training and evaluation functions.

---







In [None]:

sns.set()
torch.manual_seed(592)
np.random.seed(592)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


In [None]:
# Downloading the data may take a while; if it fails, try running this cell again.
def prepare_dataset(batch_size=64, train_val_split_ratio=0.9):

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # download=True fetches the data on first use and reuses ./data afterwards.
    cifar10_set = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

    train_size = int(len(cifar10_set) * train_val_split_ratio)
    val_size   = len(cifar10_set) - train_size

    cifar10_trainset, cifar10_valset = torch.utils.data.random_split(cifar10_set, [train_size, val_size])
    cifar10_testset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

    train_loader = torch.utils.data.DataLoader(cifar10_trainset, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(cifar10_valset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(cifar10_testset, batch_size=batch_size, shuffle=True)

    return train_loader, val_loader, test_loader


In [None]:
def train(model, train_loader, val_loader, criterion, optimizer, epochs, batch_size):
   """ Trains a model for n epochs using given optimizer, and then records 
    validation and training accuracies, validation and training losses.  
   """
   # Your code starts here.

In [None]:
def evaluation(model, test_loader, criterion):
  """Calculate and print test accuracy and test losses.
  """
  # Your code starts here.

## Q3.1:  Fully-connected output, no hidden layers (logistic regression) (10pt)
We begin with the simplest network
possible that has no hidden layers and simply linearly maps the input layer to the output layer. That is,
conceptually it could be written as
\begin{align*}
    x^{\rm out} &= W \text{vec}(x^{\rm in}) +b
\end{align*} 
where $x^{\rm out} \in \mb{R}^{10}$, $x^{\rm in} \in \mb{R}^{32 \times 32 \times 3}$, $W \in \mb{R}^{10 \times 3072}$, $b \in \mb{R}^{10}$ since $3072 = 32 \cdot 32 \cdot 3$. For a tensor $x \in \mb{R}^{a \times b \times c}$, we let $\text{vec}(x) \in \mb{R}^{a b c}$ be the reshaped form of the tensor into a vector (in an arbitrary but consistent pattern).   There is no required benchmark testing accuracy for this part.
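
As a shape check for the map above, here is a minimal sketch on a random batch (the batch is a placeholder, not real data). Note that PyTorch stores image batches channel-first, as $B \times 3 \times 32 \times 32$; since $\text{vec}$ may use any consistent ordering, this does not change the model.

```python
import torch
import torch.nn as nn

B = 4
# Random stand-in for a mini-batch of images, channel-first as in PyTorch.
x_in = torch.randn(B, 3, 32, 32)

vec_x = x_in.view(B, -1)             # vec(x): shape (B, 3072)
linear = nn.Linear(3 * 32 * 32, 10)  # holds W (10 x 3072) and b (10)
x_out = linear(vec_x)                # shape (B, 10)
```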

In [None]:
#  Q3.1 your code starts here

# Your Solution:



---


## Q3.2: Fully-connected output, 1 fully-connected hidden layer (10pt)

We will have one hidden layer denoted as $x^{\rm hidden} \in \mb{R}^{M}$ where $M$ will be a hyperparameter you choose ($M$ could be in the hundreds). The non-linearity applied to the hidden layer will be the **relu** ($\mathrm{relu}(x) = \max\{0,x\}$, elementwise). This network can be written as

\begin{align*}
    x^{\rm out} &= W_2 \mathrm{relu}(W_1 \text{vec}(x^{\rm in}) +b_1) + b_2
\end{align*}

where $W_1 \in \mb{R}^{M \times 3072}$, $b_1 \in \mb{R}^M$, $W_2 \in \mb{R}^{10 \times M}$, $b_2 \in \mb{R}^{10}$.  Tune the different hyperparameters and train for
a sufficient number of epochs to achieve a testing accuracy of at least 50%. Provide the hyperparameter
configuration used to achieve this performance.



In [None]:
#  Q3.2 your code starts here

# Your Solution:



---


## Q3.3: Convolutional layer with max-pool and fully-connected output (15pt)

For a convolutional layer $W_1$ with filters of size $k \times k \times 3$, and $M$ filters (reasonable choices are $M=100$, $k=5$), we have that $\mathrm{Conv2d}(x^{\rm in}, W_1) \in \mb{R}^{(33-k) \times (33-k) \times M}$. 

- Each convolution will have its own offset applied to each of the output pixels of the convolution; we denote this as $\mathrm{Conv2d}(x^{\rm in}, W_1) + b_1$ where $b_1$ is parameterized in $\mb{R}^M$. Apply a **relu** activation to the result of the convolutional layer. 

-  Next, applying a max-pool of size $N \times N$ (a reasonable choice is $N=14$ to pool to $2 \times 2$ with $k=5$), we have that $\textrm{MaxPool}( \mathrm{relu}( \mathrm{Conv2d}(x^{\rm in}, W_1)+b_1)) \in \mb{R}^{\lfloor\frac{33-k}{N}\rfloor \times \lfloor\frac{33-k}{N}\rfloor \times M}$.

- We will then apply a fully-connected layer to the output to get a final network given as
\begin{align*}
          x^{\rm out} = W_2 \text{vec}(\textrm{MaxPool}( \mathrm{relu}( \mathrm{Conv2d}(x^{\rm in}, W_1)+b_1))) + b_2
\end{align*}
where $W_2 \in \mb{R}^{10 \times M (\lfloor\frac{33-k}{N}\rfloor)^2}$, $b_2 \in \mb{R}^{10}$.
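
The size formulas above can be verified by pushing a random batch through the layers. This is a minimal sketch using the quoted choices $M=100$, $k=5$, $N=14$ and assuming the default stride and no padding; it only checks shapes and is not a tuned model.

```python
import torch
import torch.nn as nn

M, k, N = 100, 5, 14  # the "reasonable choices" quoted above

conv = nn.Conv2d(in_channels=3, out_channels=M, kernel_size=k)  # includes the bias b_1
pool = nn.MaxPool2d(kernel_size=N)  # stride defaults to the kernel size

x_in = torch.randn(4, 3, 32, 32)    # random stand-in for a mini-batch
h = pool(torch.relu(conv(x_in)))    # (4, M, side, side)

side = (33 - k) // N                # floor((33 - k) / N)
fc = nn.Linear(M * side * side, 10) # W_2 and b_2
x_out = fc(h.view(h.size(0), -1))   # (4, 10)
```

With these values the convolution produces $28 \times 28$ maps, the pooling reduces them to $2 \times 2$, and the fully-connected layer takes $100 \cdot 2 \cdot 2 = 400$ inputs.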


The parameters $M, k, N$ (in addition to the step size and momentum) are all hyperparameters, but you
can choose a reasonable value. Tune the different hyperparameters (number of convolutional filters, filter
sizes, dimensionality of the fully-connected layers, stepsize, etc.) and train for a sufficient number of
epochs to achieve a validation accuracy of **at least 70%**. Provide the hyperparameter configuration used
to achieve this performance. Make sure to save this model so that you can do the next part.


In [None]:
#  Q3.3 your code starts here

# Your Solution:



---


## Q3.4: More tuning (10pt)

Return to the original network you were left with at the end of the tutorial Training
a classifier. (Note that this is not the network from Q3.3 above.) Tune the different hyperparameters
(number of convolutional filters, filter sizes, dimensionality of the fully-connected layers, stepsize, etc.) and
train for a sufficient number of iterations to achieve a *train accuracy* of **at least 87%**. You may not modify
the core structure of the model (i.e., adding additional layers). Provide the hyperparameter configuration
used to achieve this performance. Make sure to save this model so that you can do the next part (see
the Training a classifier tutorial for details on how to do this).

In [None]:
#  Q3.4 your code starts here

# Your Solution:



---

## Q3.5: Transfer Learning:  Use AlexNet as a fixed feature extractor (5pt)
So far we have trained very small neural networks from scratch. As mentioned in the previous problem,
modern neural networks are much larger and more difficult to train and validate. In practice, it is rare to train
such large networks from scratch. This is because it is difficult to obtain both the massive datasets and the
computational resources required to train such networks. 

Instead of training a network from scratch, in this problem, we will use a network that has already been trained
on a very large dataset (ImageNet) and adjust it for the task at hand. This process of adapting weights in a
model trained for another task is known as **transfer learning**.

Begin with the pretrained **AlexNet** model from `torchvision.models` for the tasks below. AlexNet
achieved an early breakthrough performance on ImageNet and was instrumental in sparking the deep
learning revolution in 2012.

Do not modify any module within AlexNet that is not the final classifier layer.

- The output of AlexNet comes from the layer at index 6 of the classifier. Specifically, `model.classifier[6] = nn.Linear(4096, 1000)`. To use AlexNet with CIFAR-10, we will reinitialize (replace) this layer with
`nn.Linear(4096, 10)`. This re-initializes the weights, and changes the output shape to reflect the desired
number of target classes in CIFAR-10. 

- We only adjust the weights of this new layer (keeping the weights of all other layers
fixed). When using AlexNet as a fixed feature extractor, make sure to freeze all of the parameters in the network
before adding your new linear layer:
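```
model = torchvision.models.alexnet(pretrained=True)
# Freeze every pretrained parameter so that no gradients are computed for them.
for param in model.parameters():
    param.requires_grad = False
# The new layer is created after freezing, so its parameters remain trainable.
model.classifier[6] = nn.Linear(4096, 10)
```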
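
To confirm that only the new layer will be updated, one can collect the parameters that still require gradients and hand just those to the optimizer. The sketch below uses a tiny stand-in network named `TinyNet` (hypothetical, NOT AlexNet) that only mimics the `classifier[...]` indexing, so that it runs without downloading weights. The optimizer settings are placeholders.

```python
import torch
import torch.nn as nn
import torch.optim as optim


# Tiny stand-in for a pretrained network (NOT AlexNet); it only mimics the
# `classifier[...]` indexing so the sketch runs without downloading weights.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1000)
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


model = TinyNet()
for param in model.parameters():
    param.requires_grad = False          # freeze everything first
model.classifier[2] = nn.Linear(16, 10)  # new layers require grad by default

# Only the new layer's weight and bias remain trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.01, momentum=0.9)  # placeholder values

out = model(torch.randn(2, 3, 32, 32))   # forward pass still works: (2, 10)
```

With the real AlexNet the same pattern applies to `model.classifier[6]`. Note that the torchvision AlexNet is built for larger inputs than $32 \times 32$, so the CIFAR-10 images will most likely need to be resized (for example to $224 \times 224$) in the transform.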








In [None]:
#  Q3.5 your code starts here

# Your Solution:



---


## Q3.6: Transfer Learning: Use AlexNet as initialization (10pt)
The second approach to transfer learning is to fine-tune the weights of the pretrained network, in addition to training the new classification layer. In this approach, all network weights
are updated at every training iteration; we simply use the existing AlexNet weights as the “initialization”
for our network (except for the weights in the new classification layer, which will be initialized using
whichever method is specified in the constructor) prior to training on CIFAR-10. 

**Note**: Fine-tuning AlexNet on
a CPU takes an insane amount of time, so we recommend using Google Colab, which has free GPU
access. To enable the GPU for the notebook, navigate to Edit→Notebook Settings and select GPU from the
Hardware Accelerator drop-down. For information about training on GPU, check the tutorial.

In [None]:
#  Q3.6 your code starts here

# Your Solution:



---


## Q3.7: (Bonus Question) Adversarial Attacks
**If you enjoy deep learning, you should try this question.**

Modern deep neural networks are brittle and susceptible to small
perturbations to their inputs. This gives rise to adversarial examples, which are nearly indistinguishable
from the original inputs to the human eye but somehow “fool” neural networks into making drastically wrong predictions.

One algorithm to generate such examples is the untargeted fast gradient sign method (FGSM) attack,
which can be described as follows: 
Let $x$ be an input image with label $y$, $\c{F}$ be a neural network, and $\epsilon$ be a small value (intuitively, an attack rate).

\begin{align}
&\hat{y} = \c{F}(x) \\
&\c{L}= \text{CrossEntropy}(\hat{y},y) \\
&x' = x+\epsilon \cdot \text{sign}(\nabla_x \c{L})
\end{align}

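where $\text{sign}(t) = \frac{t}{|t|}$ is applied elementwise.  We then use $x'$ as an input to the network. Note that the calculation for $x'$
loosely resembles gradient descent, except that the step is taken on the input rather than the weights, and in the direction that increases the loss. Intuitively, we are slightly adjusting the input image so that the model is less
likely to predict its true class.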
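
The three equations above translate almost line by line into PyTorch. The following is a minimal sketch, not a reference solution: the function name `fgsm_attack`, the linear stand-in classifier, the random inputs, and the value of $\epsilon$ are all assumptions made so that the code runs on its own. In the assignment, the model would be your trained classifier from Q3.4 and the inputs would be real images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fgsm_attack(model, x, y, epsilon):
    """One untargeted FGSM step: x' = x + epsilon * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    # Gradient of the loss with respect to the input only (not the weights).
    grad, = torch.autograd.grad(loss, x)
    # torch.sign returns 0 where the gradient is exactly 0.
    return (x + epsilon * grad.sign()).detach()


# Hypothetical stand-in classifier and inputs, used only to exercise the function.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.tensor([0, 1, 2, 3])
x_adv = fgsm_attack(model, x, y, epsilon=0.03)
```

Each pixel moves by at most $\epsilon$, which is why the attacked images should look like the originals. If the inputs were normalized, the perturbed images have to be un-normalized in the same way before they are visualized.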

For this part, use your classifier from Q3.4 to do the following steps. As always, please provide all code
and plots.

- Select four images from the train set that have been correctly classified. Visualize them and provide
their labels.

- Implement the untargeted FGSM algorithm. Run one iteration on these images and visualize them:
they should look like the originals.

- Provide the predicted labels for your attacked images. You should have at least one image that is
incorrectly classified. Remark: **FGSM** is a simple attack, but it’s not always effective. In order to
generate successful adversarial examples, you may need to try different values of $\epsilon$ or even different
images, depending on where your classifier excels.

- Explain the significance of the existence of such adversarial examples.



In [None]:
#  Q3.7 your code starts here

# Your Solution:

---




# Q4: (optional) Correction to your previous homework question
You may pick any one question from homework 1-4 on which you did not perform well; this is your chance to correct your mistakes. If you correct them successfully, your previous grade will be replaced by the new score. For example, if you correct HW4Q2: Logistic regression with Softmax, your previous score was 10/30, and your score after a successful attempt is 25/30, you will be awarded 15 bonus points.

**State the question that you want to correct:**

In [None]:
# Your new code starts here

# Your New Solution: