<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Homework/chw_4_YOUR_NAME.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 4: A PyTorch pipeline
Name:

Wustlkey:

Partner Name (if applicable):

Partner Wustlkey (if applicable):

### Submission instructions

- Submit the modified Python notebook as your homework submission.
- Group submission is enabled: you can submit this coding assignment with up to 1 teammate from our class. For instructions on how to do a group submission, please refer to the useful links on Canvas.
- You can google answers on StackOverflow; please attach the corresponding StackOverflow answer as comments. However, if the answer is converted to `torch` format, no credit will be awarded.
- Do not change the number of cells! Please work in the cell provided. If we need extra cells for debugging and testing purposes, we can work at the end of this notebook, save everything as a backup for review, and delete the extra cells in the submitted version.

 

### Instructions
Do **not** use `for` loops to iterate along the feature dimension for computational purposes in any of our solutions! We are allowed to use `for` loops to display figures, iterate across the train loader, etc.
Efficiency will be graded as well. For example, if a problem asks us to generate an array from 0 to 9, then
```python
x = []
for i in range(10):
    x.append(i)
```
will only earn partial credit, while
```python
x = np.arange(10)
```
or
```python
x = torch.arange(10)
```
will yield a full score.
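
The same idea applies to the feature dimension. The sketch below is only an illustration (the tensor `X` and the squared-norm task are made up for this example, not taken from any problem): it computes a per-sample squared norm once with a loop over features and once with a single vectorized reduction.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 3)  # hypothetical data: 5 samples, 3 features

# loop over the feature dimension: correct, but this is the pattern to avoid
sq_norm_loop = torch.zeros(X.size(0))
for j in range(X.size(1)):
    sq_norm_loop += X[:, j] ** 2

# vectorized: a single reduction along the feature dimension
sq_norm_vec = (X ** 2).sum(dim=1)

# both approaches agree up to floating point error
print(torch.allclose(sq_norm_loop, sq_norm_vec))
```

The vectorized version does the same arithmetic but hands the loop to the library's compiled kernels.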

### Problems
Below are 4 problems that help us understand what a full pipeline in PyTorch looks like. Complete the coding tasks for credit. 

### Grading
This homework has 4 problems, 5 points for each problem. The homework will be graded and the grade counts towards your course grade. 

## Coding environments and submission
If we do not have `torch` installed on our computer, we have three ways to open this notebook in [Google Colab](https://colab.research.google.com/):

1. Open up Google Colab, choose `Upload` to upload this template and work there. After we are done working, we can select `File->Download .ipynb`.
2. Open up Google Colab, choose either `GitHub` or `Google Drive` to select the uploaded notebook on the corresponding website. After we are done working, we can sync the file to the corresponding GitHub or Google Drive copy.
3. Use the "Open in Colab" button at the top.

In [None]:
# import torch and numpy
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Optimizer
# import torchvision 
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torchvision.utils import make_grid

# progress bar
from tqdm.auto import tqdm

# import packages that help us plot
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")

## Dataset


> MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In the following cells, we will learn how to load and view this dataset for our toy models. 

Read more: [https://www.kaggle.com/c/digit-recognizer](https://www.kaggle.com/c/digit-recognizer)


<a title="By Josef Steppan [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:MnistExamples.png"><img width="512" alt="MnistExamples" src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png"/></a>


---- 
This code is adapted from the PyTorch examples repository. 
It is licensed under BSD 3-Clause "New" or "Revised" License.
Source: https://github.com/pytorch/examples/
LICENSE: https://github.com/pytorch/examples/blob/master/LICENSE

In [None]:
# from six.moves import urllib
# opener = urllib.request.build_opener()
# opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# urllib.request.install_opener(opener)
# ### somehow torch.dataset malfunctioned after an update in March 2021
# ### Facebook team issued a hotfix but apparently not loaded in the docker image
# ### of Colab yet as of Mar 5, 2021

## update as of Mar 11 (as of Mar 28, 2021, this still has to be used)
!wget https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz
!mv MNIST.tar_.gz MNIST.tar.gz
!tar -zxvf MNIST.tar.gz


## Below is a full pipeline of training that can be found in coding Lecture 8

Below is a working baseline for training a neural net.
- The first cell is to define everything needed (except the validation part, which we will learn to code in the next coding HW).
- The second cell below is to initialize.
- The third cell below is to train the initialized model.

In [None]:
# set up the data
train = datasets.MNIST(root='./', 
                       train=True, 
                       download=True, 
                       transform = transforms.ToTensor())

# set up the loader
train_loader = DataLoader(train, batch_size=64)

# model
class MLP(nn.Module): # subclass of nn.Module 
    def __init__(self, 
                 input_size: int,
                 output_size: int,
                 hidden_size: int = 256):
        super(MLP, self).__init__() 
        self.linear0 = nn.Linear(input_size, hidden_size)
        self.activation = nn.ReLU()
        self.linear1 = nn.Linear(hidden_size, output_size)
        
    def forward(self, x): 
        # getting rid of color channel
        x = x.view(x.size(0), -1) 
        x1 = self.linear0(x)
        a1 = self.activation(x1)
        output = self.linear1(a1)

        return output

# optimizer
class SGD(Optimizer): # subclass of Optimizer
    """
    Implements the vanilla SGD simplified 
    from the torch official one for Math 450 WashU
    
    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float): learning rate
        
    Example:
        >>> optimizer = SGD(model.parameters(), lr=1e-2)
        >>> optimizer.zero_grad()
        >>> loss_fn(model(input), target).backward()
        >>> optimizer.step()
    """

    def __init__(self, params, # params: model.parameters()
                       lr: float = 1e-3, # input: type = value
                       name_input: str = 'SGD'
                 ): 
        # constructor
        defaults = dict(lr=lr, name=name_input) 
        super(SGD, self).__init__(params, defaults)

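    def step(self, closure=None): 
      # `loss` must be defined inside step; it is only non-None
      # when a closure that re-evaluates the model is supplied
      loss = None
      if closure is not None:
          with torch.enable_grad():
              loss = closure()

      for group in self.param_groups:

          for param in group['params']:
              if param.grad is None:
                  continue
              grad_param = param.grad.data

              # vanilla gradient descent update
              param.data -= group['lr']*grad_param

      return loss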


def accuracy_score(y_pred, y_true):
    '''
    A modified acc score from HW 3

    Input:
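        - y_pred: output from our NN, one score per class
          (the argmax over the last dimension is the predicted label)
        - y_true: true labels, integers from 0 to num_class - 1

    Output:
        - the fraction of predicted labels that match the true labels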
    '''
    y_pred = y_pred.argmax(dim=-1) # convert this to a label
    acc = (y_pred == y_true).float().mean()

    return acc

In [None]:
# get a sample from the train loader
sample = next(iter(train_loader))
 
x = sample[0] # data fed into the model
y = sample[1] # true label

x = x.view(x.size(0), -1) # getting rid of color channel

# get input_size
input_size = x.size(-1)

# get output_size (10 classes)
output_size = 10 

# hidden layer size
hidden_size = 256

# initialize our model
model = MLP(input_size=input_size,
            hidden_size=hidden_size,
            output_size=output_size)

# learning rate (step size)
learning_rate = 1e-3

# equip this model with an optimizer
optimizer = SGD(model.parameters(), lr=learning_rate)

# choose the loss function
# CrossEntropyLoss() in nn avoids taking softmax explicitly
loss_func = nn.CrossEntropyLoss()

In [None]:
# train our model for 5 epochs
epochs = 5

for epoch in range(epochs):
    
    model.train()
    
    loss_vals_epoch = []
    acc_epoch = []

    with tqdm(total=len(train_loader)) as pbar:
      # use the tqdm with block as a progress bar
      for data, targets in train_loader:
          
        # forward pass
        outputs = model(data)

        # outputs = torch.log(outputs) if using nn.NLLLoss()
        
        # loss function
        loss = loss_func(outputs, targets)
        
        # record loss function values
        loss_vals_epoch.append(loss.item())
        
        # clean the gradient from last iteration
        optimizer.zero_grad()
        
        # backprop
        loss.backward()
        
        # gradient descent
        optimizer.step()
        
        # check accuracy 
        # need to detach the variable first (no grad is tracked)
        # and move the variable from GPU to CPU memory pool 
        # if GPU is used
        outputs = outputs.detach().cpu()
        acc = accuracy_score(outputs, targets)
        acc_epoch.append(acc.item()) # store a Python float so np.mean works on plain numbers

        # tqdm template
        description = f"epoch:{epoch+1} "
        description += f"| loss: {np.mean(loss_vals_epoch):.3e}"
        description += f"| acc: {np.mean(acc_epoch)*100:.2f} %"
        pbar.set_description(description)
        pbar.update()


## Problem 1
In the pipeline code above, modify the MLP model to have $k$ ($k\geq 1$) hidden layers. Notice that you have to modify your code in an automatic way: first change the default input `hidden_size` to a tuple.
- If the user gives input `hidden_size = (256, )`, then the code is almost unchanged.
- If the user gives input `hidden_size = (256, 128)`, then the MLP model should have 2 hidden layers, which have 256 neurons and 128 neurons respectively.
- Remember to add a nonlinear activation function in between each hidden layer.

There are various ways to implement this. We can implement this well within what is taught in class and Math 449. 

If you are interested in learning things outside of our class's scope, please read the manual below:
https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html
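
As a hint of why that manual page may be worth reading, here is a minimal sketch (the class names `PlainList` and `Registered` and the layer sizes are hypothetical, chosen only for illustration). It contrasts layers stored in a plain Python list with layers stored in `nn.ModuleList`: only the latter are registered, so only they show up in `model.parameters()` and get updated by the optimizer.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # a plain Python list: these layers are NOT registered as submodules
        self.layers = [nn.Linear(4, 3), nn.Linear(3, 2)]

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer it holds
        self.layers = nn.ModuleList([nn.Linear(4, 3), nn.Linear(3, 2)])

# count the parameter tensors each model exposes to an optimizer
n_plain = len(list(PlainList().parameters()))
n_registered = len(list(Registered().parameters()))
print(n_plain, n_registered)  # 0 4
```

This is not a solution to the problem, only a demonstration of one pitfall to keep in mind when layers are created in a loop.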

In [None]:
class MLP(nn.Module): # subclass of nn.Module 
    def __init__(self, 
                 input_size: int,
                 output_size: int,
                 hidden_size: tuple = (256, )):
        super(MLP, self).__init__() 
        # modify the code here
        self.linear0 = nn.Linear(input_size, hidden_size[0])
        self.activation = nn.ReLU()
        self.linear1 = nn.Linear(hidden_size[0], output_size)
        
    def forward(self, x): 
        x = x.view(x.size(0), -1) 
        x1 = self.linear0(x)
        a1 = self.activation(x1)
        output = self.linear1(a1)

        # inserting softmax computation here
        return output


hidden_size = (256, 128, 64)
model = MLP(input_size=28*28, output_size=10, hidden_size=hidden_size)
for param in model.parameters():
  print(param.size())
  
# expected output:
# torch.Size([256, 784])
# torch.Size([256])
# torch.Size([128, 256])
# torch.Size([128])
# torch.Size([64, 128])
# torch.Size([64])
# torch.Size([10, 64])
# torch.Size([10])

## Problem 2

The loss $-\sum y\ln \hat{y}$ is implemented in torch as `nn.NLLLoss` (negative log likelihood loss), which accepts the log of a probability input (i.e., log output of a softmax function). If we replace `nn.CrossEntropyLoss()` with `nn.NLLLoss()`, we have to 
- Add an explicit softmax computation. 
- Take the log of the output.

Assume we can use the built-in softmax from `nn.functional` or `nn`. Modify the code below to have an explicit softmax function.

#### Remark on why we want to do this
If we have an imbalanced dataset (like the ones in real life, unlike MNIST), in which the number of samples differs drastically from class to class, using this pipeline is much preferred over applying `nn.CrossEntropyLoss()` directly, because this pipeline leaves room for adjusting the weight of the $-y\ln \hat{y}$ computation for each class.
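
The relationship between the two losses can be checked on random data. The sketch below is an illustration under stated assumptions, not part of the pipeline: `logits`, `targets`, and `class_weight` are made-up tensors, and upweighting class 3 is an arbitrary choice. It compares `nn.CrossEntropyLoss` on raw scores with `nn.NLLLoss` on the log of an explicit softmax, then passes a per-class weight through the `weight` argument of `nn.NLLLoss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)            # hypothetical raw scores: 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))   # hypothetical integer labels

# cross entropy applied directly to the raw scores
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# explicit softmax, then log, then negative log likelihood
probs = F.softmax(logits, dim=-1)
loss_nll = nn.NLLLoss()(torch.log(probs), targets)
print(torch.allclose(loss_ce, loss_nll, atol=1e-5))

# per-class weights: here class 3 counts five times as much (arbitrary choice)
class_weight = torch.ones(10)
class_weight[3] = 5.0
loss_weighted = nn.NLLLoss(weight=class_weight)(torch.log(probs), targets)
```

Taking `torch.log` of a softmax output can lose precision when a probability is very close to zero; `F.log_softmax` computes the same quantity in a numerically more stable way.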

In [None]:
class MLP(nn.Module): # subclass of nn.Module 
    def __init__(self, 
                 input_size: int,
                 output_size: int,
                 hidden_size: tuple = (256, )):
        super(MLP, self).__init__() 
        self.linear0 = nn.Linear(input_size, hidden_size[0])
        self.activation = nn.ReLU()
        self.linear1 = nn.Linear(hidden_size[0], output_size)
        
    def forward(self, x): 
        x = x.view(x.size(0), -1) 
        x1 = self.linear0(x)
        a1 = self.activation(x1)
        output = self.linear1(a1)

        # inserting softmax computation here
        return output

model = MLP(input_size=28*28, output_size=10)
sample = next(iter(train_loader))
x = sample[0]
with torch.no_grad():
  yhat = model(x)
print(torch.allclose(yhat.sum(axis=1), torch.ones(yhat.size(0)))) 
# expected output is True

## Problem 3

Suppose the loss function is modified as follows: for a small $\epsilon>0$ ($\epsilon\ll 1$),
$$
L = -\sum_{i=1}^{N_{\text{batch}}}
y_i \ln \hat{y}_i + \epsilon \|W\|_2^2,
$$
where $\|W\|_2^2 = \sum_{l} \|w_l\|^2$ and $w_l$ stands for the parameters of the $l$-th layer, i.e., the squared $\ell^2$-norm sums up the squares of all the parameters. When gradient descent with learning rate $\alpha$ is performed against this loss function, we do not need to re-implement the loss function itself: after absorbing the constant factor 2 from differentiating the square into $\epsilon$, it is equivalent to letting each weight decay by an extra factor of $(1-\alpha\epsilon)$ in each iteration of gradient descent. Simply put, we add an extra $\epsilon W$ to the gradient in each SGD iteration, which subtracts an extra $\alpha \epsilon W$ from the weights. This is called a "weight decay" regularizer.
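
The equivalence can be checked numerically with autograd. The sketch below is only an illustration: `data_loss` is a made-up stand-in for the cross-entropy term, and `alpha` and `eps` are hypothetical values. It keeps the factor 2 from differentiating $\epsilon\|W\|_2^2$ explicit, so the decay factor appears as $(1-2\alpha\epsilon)$; absorbing the 2 into $\epsilon$ gives the form $(1-\alpha\epsilon)$ used in the text.

```python
import torch

torch.manual_seed(0)
alpha, eps = 1e-1, 1e-2  # hypothetical learning rate and penalty strength
w = torch.randn(4, dtype=torch.float64, requires_grad=True)
x = torch.randn(4, dtype=torch.float64)

def data_loss(w):
    # made-up stand-in for the cross-entropy term
    return (w * x).sum() ** 2

# gradient of the penalized loss, computed by autograd
(data_loss(w) + eps * w.pow(2).sum()).backward()
grad_penalized = w.grad.clone()

# gradient of the data term alone, plus the extra 2*eps*w added by hand
w.grad = None
data_loss(w).backward()
grad_manual = w.grad + 2 * eps * w.detach()
print(torch.allclose(grad_penalized, grad_manual))

# one gradient descent step, written in two equivalent ways
step_penalized = w.detach() - alpha * grad_penalized
step_decay = (1 - 2 * alpha * eps) * w.detach() - alpha * w.grad
print(torch.allclose(step_penalized, step_decay))
```

Both checks print `True`, which is why the penalty can live in the optimizer rather than in the loss function.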

Implement this weight regularizer in the template SGD class below. 

In [None]:
# optimizer
class SGD(Optimizer): # subclass of Optimizer
    """
    Implements the vanilla SGD simplified 
    from the torch official one for Math 450 WashU
    
    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float): learning rate
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        
    Example:
        >>> optimizer = SGD(model.parameters(), lr=1e-2)
        >>> optimizer.zero_grad()
        >>> loss_fn(model(input), target).backward()
        >>> optimizer.step()
    """

    def __init__(self, params, # params: model.parameters()
                       lr: float = 1e-3, # input: type = value
                       weight_decay: float = 0.0,
                       name_input: str = 'SGD'
                 ): 
        # constructor
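        # store weight_decay so that group['weight_decay'] is available in step()
        defaults = dict(lr=lr, weight_decay=weight_decay, name=name_input) 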
        super(SGD, self).__init__(params, defaults)

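    def step(self, closure=None): 
      # `loss` must be defined inside step; it is only non-None
      # when a closure that re-evaluates the model is supplied
      loss = None
      if closure is not None:
          with torch.enable_grad():
              loss = closure()

      for group in self.param_groups:
          # add weight_decay somewhere here
          # notice we can access the weight_decay initialized earlier
          # as weight_decay = group['weight_decay']
          
          for param in group['params']:
              if param.grad is None:
                  continue
              grad_param = param.grad.data

              param.data -= group['lr']*grad_param

      return loss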

## Problem 4
Add the results from Problems 2 and 3 into the pipeline provided at the beginning. Compare the result with the baseline. 


Observing a result that is expected and similar to the baseline is a good indication that we can go on to further improve and change the model; otherwise there must be something wrong with our modification. Changing things bit by bit and then comparing with a working baseline is a good habit in machine learning.


Expected accuracy: after 5 epochs we should observe similar accuracy.
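
When comparing with the baseline, it may help to start both runs from the same initial weights, so that any difference comes from our modification rather than from random initialization. The sketch below shows one possible way to do this; the helper `make_model` and the seed value are hypothetical, and `nn.Sequential` is used only to keep the example short.

```python
import torch
import torch.nn as nn

def make_model(seed: int) -> nn.Module:
    # hypothetical helper: fix the random seed right before building the model
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

model_a = make_model(seed=450)
model_b = make_model(seed=450)

# with the same seed, both models start from identical parameters
same_init = all(torch.equal(p, q)
                for p, q in zip(model_a.parameters(), model_b.parameters()))
print(same_init)  # True
```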

In [None]:
# copy and paste the pipeline here and modify it.