<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Lectures/Math_450_Notebook_7_(SGD).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding lecture 7 of Math 450

## Last two weeks

- Explore MNIST dataset.
- Generator, iterator, `iter()`, `next()`, `enumerate()`, `try: except:` flow control.
- Matrix-vector multiplications and "broadcastability".
- `loss.backward()` vs hand computation.
- Why `with torch.no_grad():` is necessary.
- Build simple neural network using `torch.nn.Sequential()`
- Gradient descent for a binary classification problem.


## Today
- Class and object-oriented programming primer. `constructor`, inheritance, `super`.
- Torch `DataLoader` interface for (mini-batch) SGD.

In [None]:
next(iter(['df', 'fg', 'gh']))

# Stochastic Gradient Descent

Suppose our loss function is still:

$$f := f(\mathbf{w}; X,\mathbf{y}) =  \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{w}; \mathbf{x}^{(i)},y^{(i)}),$$

where $X = (\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)})^{\top}$ are the training samples and $\mathbf{y} = (y^{(1)}, \dots, y^{(N)})^{\top}$ are the labels/target values for the training samples.

> Choose initial guess $\mathbf{w}_0$, step size (learning rate) $\alpha$, number of inner iterations $N$ (one per training sample), number of epochs $n_E$ <br><br>
>    Set $\mathbf{w}_{N} = \mathbf{w}_0$ for epoch $e=0$<br>
>    For epoch $e=1,2, \cdots, n_E$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{w}_{0}$ for the current epoch is $\mathbf{w}_{N}$ for the previous epoch.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; Randomly shuffle the samples so that $\{\mathbf{x}^{(m)},y^{(m)}\}_{m=1}^N$ is a permutation of the original dataset.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; For $m=1,2, \cdots, N$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{m} = \mathbf{w}_{m-1} - \alpha \nabla f_m(\mathbf{w}_{m-1}; \mathbf{x}^{(m)},y^{(m)})$

One outer iteration is called a completed *epoch* (sweeping all samples once).

### Remark
This is the vanilla SGD: Single gradient evaluation at each iteration.
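
As a minimal sketch of the loop above, the following NumPy code runs vanilla SGD on a small synthetic least-squares problem. The data, the per-sample loss $f_i(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^{\top}\mathbf{x}^{(i)} - y^{(i)})^2$, and the function name `sgd` are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def sgd(X, y, w0, alpha=0.01, n_epochs=20, seed=0):
    """Vanilla SGD for f_i(w) = 0.5 * (w @ x_i - y_i)**2 (an assumed toy loss)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = X.shape[0]
    for epoch in range(n_epochs):
        # reshuffle the samples at the start of every epoch
        perm = rng.permutation(N)
        for m in perm:
            # gradient of the single-sample loss at the current w
            grad = (w @ X[m] - y[m]) * X[m]
            w = w - alpha * grad
    return w

# synthetic, noise-free data: y = X @ w_true, so SGD can recover w_true
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = sgd(X, y, w0=np.zeros(3))
print(np.round(w, 3))
```

Each inner step touches exactly one sample, so one epoch costs $N$ single-sample gradient evaluations.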

# (Practice) Mini-batch SGD

In vanilla SGD, each update of the parameter $\mathbf{w}$ uses the gradient with respect to a single, randomly selected training sample. In mini-batch SGD, the update instead averages the gradients over a mini-batch (a small number of training samples). There are two reasons for this:
* This reduces the variance in the parameter update and can lead to more stable convergence.
* This allows the computation to be more efficient (less overhead), since the training code is already written in a vectorized way. 
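
The variance claim in the first bullet can be checked numerically. Below is a rough sketch on an assumed toy least-squares problem (the data and the helpers `minibatch_grad` and `mean_sq_error` are illustrative): the mini-batch gradient is an average of $b$ per-sample gradients, so its spread around the full gradient shrinks as $b$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = np.zeros(d)

def minibatch_grad(idx):
    """Average gradient of 0.5 * (w @ x_i - y_i)**2 over the samples in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

# gradient of the full loss (all N samples)
full_grad = minibatch_grad(np.arange(N))

def mean_sq_error(b, n_trials=2000):
    """Mean squared distance between a size-b mini-batch gradient and the full gradient."""
    errs = []
    for _ in range(n_trials):
        idx = rng.choice(N, size=b, replace=False)
        errs.append(np.sum((minibatch_grad(idx) - full_grad) ** 2))
    return np.mean(errs)

err_1 = mean_sq_error(1)    # vanilla SGD: one sample per update
err_32 = mean_sq_error(32)  # mini-batch of 32 samples
print(err_1, err_32)
```

On this toy problem the error for $b=32$ should come out far smaller than for $b=1$, roughly in line with a $1/b$ scaling.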

A typical mini-batch size is a power of two, $2^k$ (32, 256, etc.), although the optimal size varies with the application and the size of the dataset (e.g., AlphaGo training uses a mini-batch size of 2048 board images).

> Choose initial guess $\mathbf{w}_0$, learning rate $\alpha$, <br>
batch size $b$, number of inner iterations $M= \lfloor N/b \rfloor$, number of epochs $n_E$ <br><br>
>    Set $\mathbf{w}_{M} = \mathbf{w}_0$ for epoch $e=0$<br>
>    For epoch $e=1,2, \cdots, n_E$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{w}_{0}$ for the current epoch is $\mathbf{w}_{M}$ for the previous epoch.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; Randomly shuffle the training samples.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; For $m=0,1, \cdots, M-1$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{m+1} = \mathbf{w}_m -  \frac{\alpha}{b}\sum_{i=1}^{b} \nabla f_{bm+i}(\mathbf{w}_m; \mathbf{x}^{(bm+i)},y^{(bm+i)})$
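
The mini-batch loop above can be sketched in NumPy as follows. This is only an illustration under assumptions: a synthetic noise-free least-squares problem, the per-sample loss $\frac{1}{2}(\mathbf{w}^{\top}\mathbf{x}^{(i)} - y^{(i)})^2$, and the hypothetical function name `minibatch_sgd`.

```python
import numpy as np

def minibatch_sgd(X, y, w0, alpha=0.05, b=32, n_epochs=50, seed=0):
    """Mini-batch SGD for the mean of 0.5 * (w @ x_i - y_i)**2 (an assumed toy loss)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    N = X.shape[0]
    M = N // b  # number of full mini-batches per epoch; leftover samples are skipped
    for epoch in range(n_epochs):
        perm = rng.permutation(N)  # reshuffle every epoch
        for m in range(M):
            idx = perm[b * m : b * (m + 1)]
            # averaged gradient over the mini-batch, computed in one vectorized step
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / b
            w = w - alpha * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w = minibatch_sgd(X, y, w0=np.zeros(3))
print(np.round(w, 3))
```

The whole mini-batch gradient is one matrix product, which is where the efficiency gain over $b$ separate single-sample updates comes from.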

In [25]:
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")

import torch
import torch.nn as nn
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [26]:
!wget https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz
!mv MNIST.tar_.gz MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

--2021-03-12 21:16:06--  https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz
Resolving sites.wustl.edu (sites.wustl.edu)... 34.216.237.15, 34.215.37.29
Connecting to sites.wustl.edu (sites.wustl.edu)|34.216.237.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/1/2774/files/2021/03/MNIST.tar_.gz [following]
--2021-03-12 21:16:06--  https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/1/2774/files/2021/03/MNIST.tar_.gz
Resolving cpb-us-w2.wpmucdn.com (cpb-us-w2.wpmucdn.com)... 151.139.244.23
Connecting to cpb-us-w2.wpmucdn.com (cpb-us-w2.wpmucdn.com)|151.139.244.23|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23197619 (22M) [application/gzip]
Saving to: ‘MNIST.tar_.gz’


2021-03-12 21:16:07 (201 MB/s) - ‘MNIST.tar_.gz’ saved [23197619/23197619]

./MNIST/
./MNIST/processed/
./MNIST/raw/
./MNIST/raw/train-images-idx3-ubyte
./MNIST/raw/train-labels-idx1-ubyte
./MNIST/raw/train-

In [27]:
train = datasets.MNIST(root='./', 
                       train=True, 
                       download=True, 
                       transform = transforms.ToTensor())

In [30]:
len(train)

60000

In [32]:
print(train.data.size(), train.targets.size())

torch.Size([60000, 28, 28]) torch.Size([60000])


In [33]:
train.targets[:20]

tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9])

In [81]:
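# shuffle=False by default; pass shuffle=True to reshuffle every epoch as in the algorithm above
# (the batches of 8 printed below appear to come from an earlier run with batch_size=8)
train_loader = DataLoader(train,
                          batch_size=32)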

In [40]:
# how to convert this into a generator? A DataLoader is only an iterable: iter() makes an iterator, next() pulls one batch
sample = next(iter(train_loader))

In [41]:
type(sample)

list

In [42]:
for i in range(len(sample)):
  print(sample[i].size())

torch.Size([8, 1, 28, 28])
torch.Size([8])


In [43]:
sample[1]

tensor([5, 0, 4, 1, 9, 2, 1, 3])
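
To see how `DataLoader` splits a dataset into mini-batches without downloading MNIST, here is a small sketch on a stand-in `TensorDataset` of random tensors (the data and variable names are assumptions for illustration). With `drop_last=True` the number of batches matches $M = \lfloor N/b \rfloor$ from the algorithm above; otherwise the last batch is simply smaller.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# a stand-in dataset of 100 fake "images" with integer labels (not MNIST)
images = torch.randn(100, 1, 28, 28)
labels = torch.arange(100) % 10
toy = TensorDataset(images, labels)

# shuffle=True draws a new permutation every time the loader is iterated
loader = DataLoader(toy, batch_size=32, shuffle=True)
print(len(loader))  # number of mini-batches per epoch: ceil(100/32) = 4

batch_sizes = [x.size(0) for x, t in loader]
print(batch_sizes)  # [32, 32, 32, 4]: the last batch holds the leftover samples

# drop_last=True discards the incomplete batch, leaving floor(100/32) = 3 batches
loader_full = DataLoader(toy, batch_size=32, shuffle=True, drop_last=True)
print(len(loader_full))
```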

## Network
How to implement
```python
model = nn.Sequential(
            nn.Linear(784, 128), # 784 = 28*28
            nn.ReLU(), # activation
            nn.Linear(128, 32), # 2nd hidden layer
            nn.ReLU(),
            nn.Linear(32, 10) # output layer, 10 classes
        )
```
as a class built on the `torch.nn.Module` template.

In [44]:
__name__

'__main__'

In [45]:
[2,0, 10].__len__()

3

In [46]:
dir(sample)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [47]:
np.array([1]) + np.array([10]) # + is for two arrays, __add__

array([11])

In [48]:
torch.tensor([12, 10]) + torch.tensor([10, 1]) # + is for two tensors __add__

tensor([22, 11])

In [49]:
[1, 10] + [50, 1] # __add__ now is different for list, concat two lists

[1, 10, 50, 1]

In [51]:
'math 450' + ' sucks' 

'math 450 sucks'
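
The cells above show that `+` and `len()` dispatch to the special methods `__add__` and `__len__` of each type. A minimal sketch of defining them ourselves, using a hypothetical `Vec2` class that exists only for this illustration:

```python
class Vec2:
    """A hypothetical 2D vector, used only to illustrate operator overloading."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # called for `self + other`: add componentwise, like arrays and tensors
        return Vec2(self.x + other.x, self.y + other.y)

    def __len__(self):
        # called for `len(self)`
        return 2

    def __repr__(self):
        # called when the object is displayed
        return f"Vec2({self.x}, {self.y})"

v = Vec2(1, 10) + Vec2(50, 1)
print(v)       # Vec2(51, 11)
print(len(v))  # 2
```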

In [62]:
# class implementation
class MLP(nn.Module): 
    '''
    MLP: name of the class
    nn.Module: MLP is a subclass of nn.Module
    '''
    # self refers to the instance being created, not the class itself
    def __init__(self, 
                 input_size: int,
                 slope: float = 0.02):  # :int is a type hint documenting the expected type; Python does not enforce it
        # __init__ is the constructor of MLP.
        # super() gives access to the parent class nn.Module;
        # calling its __init__ sets up the nn.Module machinery
        # (e.g. parameter registration) before we add layers
        super(MLP, self).__init__()
        ## after __init__(): initialization
        # self.layers = nn.Sequential(
        #     nn.Linear(784, 128),
        #     nn.ReLU(),
        #     nn.Linear(128, 32),
        #     nn.ReLU(),
        #     nn.Linear(32, 10)
        # )
        self.linear0 = nn.Linear(input_size, 256)
        self.acti0 = nn.LeakyReLU(negative_slope= slope)
        self.linear1 = nn.Linear(256, 10)
        
    def forward(self, x):
        # forward() defines the computation; nn.Module's __call__ runs it when we call model(x)
        # train data (-1, 1, 28, 28) --> (-1, 28*28)
        # in the implementation above
        # x = x.view(x.size(0), -1)
        # x = self.layers(x)

        # resize each batch into (n_batch, 28*28)
        x = x.view(x.size(0), -1)
        # x = x.view(-1, 28*28) # less general, more bug-prone
        x1 = self.linear0(x)
        a1 = self.acti0(x1)
        output = self.linear1(a1)

        return output

In [None]:
# go through an example of a simple class
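
One possible simple class example, sketched with hypothetical `Shape` and `Rectangle` classes (not from the lecture). It mirrors the pattern used for `MLP`: the subclass calls the parent constructor through `super()` and overrides one method, while another method is inherited unchanged.

```python
class Shape:
    """Base class: stores a name and provides a describe() method."""

    def __init__(self, name: str):
        # __init__ is the constructor; self is the instance being created
        self.name = name

    def area(self) -> float:
        # subclasses are expected to override this
        raise NotImplementedError

    def describe(self) -> str:
        return f"{self.name} with area {self.area():.2f}"


class Rectangle(Shape):
    """Rectangle is a subclass of Shape."""

    def __init__(self, width: float, height: float):
        # run the parent constructor first so that self.name is set
        super().__init__(name="rectangle")
        self.width = width
        self.height = height

    def area(self) -> float:
        # overrides Shape.area
        return self.width * self.height


rect = Rectangle(3, 4)
print(rect.describe())          # rectangle with area 12.00
print(isinstance(rect, Shape))  # True
```

Here `describe()` plays the role of `nn.Module.__call__` (inherited as is), and `area()` plays the role of `forward()` (supplied by the subclass).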

In [69]:
model = MLP(input_size=28*28, slope=0.1)

In [70]:
model

MLP(
  (linear0): Linear(in_features=784, out_features=256, bias=True)
  (acti0): LeakyReLU(negative_slope=0.1)
  (linear1): Linear(in_features=256, out_features=10, bias=True)
)
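
A quick sanity check of such a model is to push a fake mini-batch through it and count its parameters. The sketch below repeats a condensed copy of the `MLP` class so that it runs on its own; the random input tensor is an assumption standing in for a batch of MNIST images.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    # condensed copy of the two-layer architecture defined above
    def __init__(self, input_size: int, slope: float = 0.02):
        super().__init__()
        self.linear0 = nn.Linear(input_size, 256)
        self.acti0 = nn.LeakyReLU(negative_slope=slope)
        self.linear1 = nn.Linear(256, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # (n_batch, 1, 28, 28) -> (n_batch, 784)
        return self.linear1(self.acti0(self.linear0(x)))

model = MLP(input_size=28*28, slope=0.1)

# a fake mini-batch with the same shape as one batch of MNIST images
x = torch.randn(8, 1, 28, 28)
output = model(x)      # model(x) runs forward(x) through nn.Module.__call__
print(output.shape)    # torch.Size([8, 10])

# weights and biases registered by the two nn.Linear layers
n_params = sum(p.numel() for p in model.parameters())
print(n_params)        # 784*256 + 256 + 256*10 + 10 = 203530
```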

# How to train using this interface?

- Train loader iterator.
- Model.
- Loss function!
- Optimizer.
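
For the optimizer item, a hand-written step of the form `param -= learning_rate * param.grad` and `torch.optim.SGD` (without momentum or weight decay) should give the same update. A small sketch comparing the two on a tiny stand-in `nn.Linear` model with random data (both are assumptions for illustration, not the MNIST setup):

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model_manual = nn.Linear(4, 3)             # a tiny stand-in model, not the MLP
model_optim = copy.deepcopy(model_manual)  # identical initial weights

loss_func = nn.CrossEntropyLoss()
learning_rate = 0.1
optimizer = torch.optim.SGD(model_optim.parameters(), lr=learning_rate)

# one fake mini-batch: 8 samples, 4 features, 3 classes
data = torch.randn(8, 4)
target = torch.randint(0, 3, (8,))

# (a) manual update, written out by hand
model_manual.zero_grad()
loss_func(model_manual(data), target).backward()
with torch.no_grad():
    for param in model_manual.parameters():
        param -= learning_rate * param.grad

# (b) the same step through the optimizer interface
optimizer.zero_grad()
loss_func(model_optim(data), target).backward()
optimizer.step()

same = all(torch.allclose(p, q)
           for p, q in zip(model_manual.parameters(), model_optim.parameters()))
print(same)
```

The optimizer interface mainly pays off when switching to other update rules (momentum, Adam, etc.), since only the line constructing `optimizer` changes.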

In [83]:
loss_func = nn.CrossEntropyLoss()
numEpochs = 5
learning_rate = 1e-4
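
`nn.CrossEntropyLoss` expects raw scores (logits) and integer class labels, which is why the network has no softmax layer at the end. A short sketch, on made-up logits and labels, checking that it agrees with log-softmax followed by the averaged negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # fake model outputs: 4 samples, 10 classes
target = torch.tensor([3, 0, 9, 1])  # fake class labels

loss_builtin = nn.CrossEntropyLoss()(logits, target)

# by hand: log-softmax over the classes, pick the entry of the true class, average
log_prob = F.log_softmax(logits, dim=1)
loss_manual = -log_prob[torch.arange(4), target].mean()

print(torch.allclose(loss_builtin, loss_manual))
```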

In [73]:
from tqdm.auto import tqdm

In [84]:
for epoch in range(numEpochs):
     
  # lists for checking accuracy
  target_all = []
  output_all = []

  for data, target in tqdm(train_loader):
    # data, target = data.to(device), target.to(device)
    
    # model prediction (logits), shape (n_batch, 10)
    output = model(data)

    # loss function (mean cross-entropy over the mini-batch)
    loss = loss_func(output, target)

    # for later accuracy checking use
    target_all.append(target.detach().numpy())
    output_all.append(output.detach().numpy())

    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # autograd to do backprop
    loss.backward()

    # mini-batch SGD step
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    
  # training accuracy accumulated over the epoch
  # (predictions were made while the weights were still being updated)
  # np.concatenate also works when the last batch is smaller than the rest
  target_all = np.concatenate(target_all)

  # after concatenation the outputs have shape (n_samples, 10);
  # the predicted class is the index of the largest entry along the last axis
  output_all = np.concatenate(output_all).argmax(axis=-1)
  acc = (target_all == output_all).mean()
  print(f"\n Epoch {epoch+1} accuracy: {100*acc:.2f} \n")

 Epoch 1 accuracy: 65.44 

 Epoch 2 accuracy: 68.02 

 Epoch 3 accuracy: 69.94 

 Epoch 4 accuracy: 71.23 

 Epoch 5 accuracy: 72.23 
