<a href="https://colab.research.google.com/github/scaomath/wustl-math450/blob/main/Lectures/Math_450_Notebook_7_(SGD).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding lecture 7 of Math 450

## Last two weeks

- Explore MNIST dataset.
- Generator, iterator, `iter()`, `next()`, `enumerate()`, `try: except:` flow control.
- Matrix-vector multiplications and "broadcastability".
- `loss.backward()` vs hand computation.
- Why `with torch.no_grad():` is necessary.
- Build simple neural network using `torch.nn.Sequential()`
- Gradient descent for a binary classification problem.


## Today
- Class and object-oriented programming primer. `constructor`, inheritance, `super`.
- Torch `DataLoader` interface for (mini-batch) SGD.

# Stochastic Gradient Descent

Suppose our loss function is still:

$$f := f(\mathbf{w}; X,\mathbf{y}) =  \frac{1}{N}\sum_{i=1}^N f(\mathbf{w}; \mathbf{x}^{(i)},y^{(i)}),$$

where $X = (\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)})^{\top}$ are the training samples, and $\mathbf{y} = (y^{(1)}, \dots, y^{(N)})^{\top}$ are the labels/target values for the training samples.

> Choose initial guess $\mathbf{w}_0$, step size (learning rate) $\alpha$, number of inner iterations $N$ (one per training sample), number of epochs $n_E$ <br><br>
>    Set $\mathbf{w}_{N+1} = \mathbf{w}_0$ for epoch $e=0$<br>
>    For epoch $e=1,2, \cdots, n_E$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{w}_{1}$ for the current epoch is $\mathbf{w}_{N+1}$ for the previous epoch.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; Randomly shuffle the samples so that $\{\mathbf{x}^{(m)},y^{(m)}\}_{m=1}^N$ is a permutation of the original dataset.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; For $m=1,2, \cdots, N$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{m+1} = \mathbf{w}_m - \alpha \nabla f(\mathbf{w}_m; \mathbf{x}^{(m)},y^{(m)})$

One outer iteration is called a completed *epoch* (sweeping all samples once).

### Remark
This is the vanilla SGD: each iteration evaluates the gradient at a single training sample.
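
As a minimal sketch of the loop above, the following applies vanilla SGD to the least-squares loss $f(\mathbf{w}; \mathbf{x}, y) = \frac{1}{2}(\mathbf{w}^{\top}\mathbf{x} - y)^2$ on synthetic data. The loss, the data, and the names `sgd_epoch` and `w_true` are assumptions made for illustration; they are not part of the lecture code.

```python
import numpy as np

def sgd_epoch(w, X, y, alpha, rng):
    """One epoch of vanilla SGD for f(w; x, y) = 0.5 * (w @ x - y)**2."""
    # visit the samples in a fresh random order (the "shuffle" step)
    for m in rng.permutation(len(y)):
        grad = (X[m] @ w - y[m]) * X[m]  # gradient at a single sample
        w = w - alpha * grad
    return w

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground truth
y = X @ w_true                       # noiseless targets

w = np.zeros(3)                      # initial guess w_0
for epoch in range(20):
    w = sgd_epoch(w, X, y, alpha=0.05, rng=rng)

print(np.round(w, 3))
```

Each call to `sgd_epoch` starts from the weights returned by the previous epoch, which mirrors the hand-off between epochs in the pseudocode.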

# (Practice) Mini-batch SGD

In vanilla SGD, each update of the parameter $\mathbf{w}$ uses the gradient at a single randomly selected training sample. In mini-batch SGD, the update instead averages the gradients over a mini-batch (a small number of training samples). The reason for this is twofold: 
* This reduces the variance in the parameter update and can lead to more stable convergence.
* This allows the computation to be more efficient (less overhead), since the training code is already written in a vectorized way. 
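
The variance claim can be checked numerically. The sketch below uses a synthetic least-squares problem (an assumption for illustration) and compares the spread of the gradient estimate at a fixed $\mathbf{w}$ for batch sizes 1 and 32. The helpers `batch_grad` and `estimate_variance` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
w = np.zeros(3)  # point at which the gradient estimates are compared

def batch_grad(idx):
    """Average gradient of 0.5 * (x @ w - y)**2 over the samples in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

def estimate_variance(batch_size, trials=2000):
    """Variance of the mini-batch gradient, summed over coordinates."""
    grads = np.array([
        batch_grad(rng.choice(N, size=batch_size, replace=False))
        for _ in range(trials)
    ])
    return grads.var(axis=0).sum()

var_1 = estimate_variance(1)
var_32 = estimate_variance(32)
print(f"batch size 1: {var_1:.3f}, batch size 32: {var_32:.3f}")
```

For this toy problem the variance should shrink roughly in proportion to $1/n_B$, where $n_B$ is the batch size.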

A typical mini-batch size is a power of two, $2^k$ (32, 256, etc.), although the optimal size varies with the application and the size of the dataset (e.g., AlphaGo training uses a mini-batch size of 2048 board images).

> Choose initial guess $\mathbf{w}_0$, learning rate $\alpha$, <br>
> batch size $n_B$, number of inner iterations $M= \lfloor N/n_B \rfloor$, number of epochs $n_E$ <br><br>
>    Set $\mathbf{w}_{M} = \mathbf{w}_0$ for epoch $e=0$<br>
>    For epoch $e=1,2, \cdots, n_E$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{w}_{0}$ for the current epoch is $\mathbf{w}_{M}$ for the previous epoch.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; Randomly shuffle the training samples.<br>
>    &nbsp;&nbsp;&nbsp;&nbsp; For $m=0,1, \cdots, M-1$<br>
>    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle\mathbf{w}_{m+1} = \mathbf{w}_m -  \frac{\alpha}{n_B}\sum_{i=1}^{n_B} \nabla f(\mathbf{w}_m; \mathbf{x}^{(n_B m+i)},y^{(n_B m+i)})$
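
The mini-batch loop can be sketched in NumPy before turning to the `DataLoader` interface. This is an illustration under assumptions: the loss is least squares, the data are synthetic, and `minibatch_sgd_epoch` and `w_true` are hypothetical names.

```python
import numpy as np

def minibatch_sgd_epoch(w, X, y, alpha, n_B, rng):
    """One epoch of mini-batch SGD for the loss 0.5 * (x @ w - y)**2."""
    N = len(y)
    M = N // n_B                     # number of inner iterations
    perm = rng.permutation(N)        # shuffle once per epoch
    for m in range(M):
        idx = perm[m * n_B:(m + 1) * n_B]
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / n_B  # average over the mini-batch
        w = w - alpha * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground truth
y = X @ w_true                       # noiseless targets

w = np.zeros(3)                      # initial guess w_0
for epoch in range(30):
    w = minibatch_sgd_epoch(w, X, y, alpha=0.1, n_B=32, rng=rng)

print(np.round(w, 3))
```

Because $M = \lfloor N/n_B \rfloor$, the samples left over after the last full batch are skipped in each epoch. Reshuffling means a different set is skipped each time.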

In [None]:
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")

import torch
import torch.nn as nn
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [None]:
!wget https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz
!mv MNIST.tar_.gz MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

In [None]:
train = datasets.MNIST(root='./', 
                       train=True, 
                       download=True, 
                       transform = transforms.ToTensor())

In [None]:
# batch_size=1 gives vanilla SGD; shuffle=True reshuffles the samples every epoch
train_loader = DataLoader(train, batch_size=1, shuffle=True)

In [None]:
# how to convert this into a generator?
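
One way to answer the question above: a `DataLoader` is an iterable, so `iter()` turns it into an iterator and `next()` pulls one mini-batch at a time. The sketch below uses a small synthetic `TensorDataset` as a stand-in for MNIST so that it runs without the download; the names `fake_train`, `loader`, and `batch_iter` are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# synthetic stand-in for MNIST: 10 "images" of shape (1, 28, 28) with labels
fake_images = torch.rand(10, 1, 28, 28)
fake_labels = torch.randint(0, 10, (10,))
fake_train = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_train, batch_size=4, shuffle=True)

# iter() returns an iterator over mini-batches; next() yields one batch
batch_iter = iter(loader)
data, target = next(batch_iter)
print(data.shape, target.shape)

# a for loop calls iter() and next() implicitly;
# the last batch is smaller when the batch size does not divide the dataset
batch_sizes = [len(target) for _, target in loader]
print(batch_sizes)
```

The same pattern should apply to `train_loader`: `data, target = next(iter(train_loader))` returns the first mini-batch of the current shuffle.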

## Network
How to implement
```python
model = nn.Sequential(
            nn.Linear(784, 128), # 784 = 28*28
            nn.ReLU(), # activation
            nn.Linear(128, 32), # 2nd hidden layer
            nn.ReLU(),
            nn.Linear(32, 10) # output layer
        )
```
using the `nn.Module` class template from `torch.nn`?

In [None]:
# class implementation which we will cover in next class
class MLP(nn.Module):
    def __init__(self):
        # super() gives access to the parent class nn.Module;
        # calling its constructor sets up parameter registration,
        # so the layers defined below are tracked by the model
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 32),
            nn.ReLU(),
            nn.Linear(32, 10)
        )
        
    def forward(self, x):
        # flatten each image: (batch, 1, 28, 28) --> (batch, 28*28)
        # so that it matches the input of nn.Linear(784, 128)
        x = x.view(x.size(0), -1)
        x = self.layers(x)
        return x

In [None]:
# go through an example of a simple class
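
A minimal sketch of such a class, in plain Python with no PyTorch. The `Shape` and `Rectangle` classes are hypothetical examples chosen to show a constructor, inheritance, and `super()`; they are not part of the lecture code.

```python
class Shape:
    def __init__(self, name):
        # the constructor runs when an object is created
        self.name = name

    def area(self):
        raise NotImplementedError("subclasses define area()")

    def describe(self):
        # works for any subclass that implements area()
        return f"{self.name} with area {self.area():.2f}"


class Rectangle(Shape):  # Rectangle inherits from Shape
    def __init__(self, width, height):
        super().__init__("rectangle")  # run Shape's constructor first
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


rect = Rectangle(2, 3)
print(rect.describe())  # rectangle with area 6.00
```

`MLP` follows the same pattern: `nn.Module` plays the role of `Shape`, and `forward` plays the role of `area`, the method each subclass has to supply.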

In [None]:
model = MLP()

# How to train using this interface?

- Train loader iterator.
- Model.
- Loss function!
- Optimizer.
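
The four ingredients can be seen together in a small sketch. It makes several assumptions: the data are a synthetic two-class problem rather than MNIST, the model is a single linear layer called `toy_model`, and the built-in `torch.optim.SGD` optimizer is used where the loop in these notes updates the parameters by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# 1. train loader (synthetic two-class data as a stand-in for MNIST)
X = torch.randn(256, 2)
y = (X[:, 0] + X[:, 1] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# 2. model, 3. loss function, 4. optimizer
toy_model = nn.Linear(2, 2)
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.5)

for epoch in range(20):
    for data, target in loader:
        optimizer.zero_grad()                      # clear old gradients
        loss = loss_func(toy_model(data), target)  # forward pass and loss
        loss.backward()                            # backprop
        optimizer.step()                           # w <- w - lr * grad

with torch.no_grad():
    pred = toy_model(X).argmax(dim=1)
    acc = (pred == y).float().mean().item()
print(f"training accuracy: {100 * acc:.1f}%")
```

With its default settings (no momentum or weight decay), `optimizer.step()` performs the same update as the explicit `param -= learning_rate * param.grad` loop.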

In [None]:
loss_func = nn.CrossEntropyLoss()
numEpochs = 10
learning_rate = 1e-4

In [None]:
for epoch in range(numEpochs):

  # lists for checking accuracy
  target_all = []
  pred_all = []

  for data, target in train_loader:
    # data, target = data.to(device), target.to(device)

    # model prediction (raw class scores, shape (batch, 10))
    output = model(data)

    # loss function
    loss = loss_func(output, target)

    # for later accuracy checking use:
    # the predicted label is the class with the largest score
    target_all.append(target.detach().numpy())
    pred_all.append(output.argmax(dim=1).detach().numpy())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # autograd to do backprop
    loss.backward()

    # SGD update
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

  # accuracy over the epoch (predictions made while the weights were updating)
  target_all = np.concatenate(target_all)
  pred_all = np.concatenate(pred_all)
  acc = (target_all == pred_all).mean()
  print(f"\n Epoch {epoch+1} accuracy: {100*acc:.2f}% \n")