
# Coding lecture 8 of Math 450

## Last three weeks

- MNIST
- Generator, iterator, `iter()`, `next()`, `enumerate()`, `try: except:` flow control.
- Matrix-vector multiplications and "broadcastability".
- `loss.backward()` vs hand computation.
- Why `with torch.no_grad():` is necessary in manual gradient descent computation.
- Build a simple neural network using `torch.nn.Sequential()`.
- Gradient descent for a binary classification problem.
- Torch `DataLoader` interface for (mini-batch) SGD.

# Today
- More on classes and object-oriented programming: constructors (`__init__`), inheritance, and the use of `super()`.
- A complete PyTorch SGD training pipeline template.
- A new type: dictionary `dict`.
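
As a warm-up for the first item, here is a minimal sketch of a constructor, inheritance, and `super()` in plain Python. The class names `Shape` and `Rectangle` are hypothetical and only serve as an illustration; the same pattern appears later when a model inherits from `nn.Module`.

```python
class Shape:
    def __init__(self, name: str):
        # the constructor runs when an object is created
        self.name = name

    def area(self) -> float:
        # subclasses are expected to override this method
        raise NotImplementedError

    def describe(self) -> str:
        return f"{self.name} with area {self.area():.2f}"


class Rectangle(Shape):  # Rectangle inherits from Shape
    def __init__(self, width: float, height: float):
        # call the parent constructor first so that self.name is set
        super().__init__(name="rectangle")
        self.width = width
        self.height = height

    def area(self) -> float:
        return self.width * self.height


rect = Rectangle(2.0, 3.0)
print(rect.describe())  # describe() is inherited, area() is overridden
```

If the call to `super().__init__(...)` is left out, `self.name` is never set and `describe()` raises an `AttributeError`.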

# A complete pipeline

- Data preparation
- Train-Validation split (will be covered later)
- Model
- Choose an optimizer or write our own.
- Choose a scheduler or write our own (optional, will be covered later).
- Choose the proper loss function.
- Train!
- Inference (for our final project, will be covered later).

In [None]:
import torch
import torch.nn as nn
import numpy as np

from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")

import warnings
warnings.filterwarnings("ignore")

In [None]:
!wget https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz 
!mv MNIST.tar_.gz MNIST.tar.gz 
!tar -zxvf MNIST.tar.gz 

In [None]:
train = datasets.MNIST(root='./',
                       train=True,
                       download=True,
                       transform=transforms.ToTensor())

In [None]:
# shuffle=True reshuffles the samples every epoch, which is the usual choice for SGD
train_loader = DataLoader(train, batch_size=64, shuffle=True)

In [None]:
class MLP(nn.Module): 
    def __init__(self, 
                 input_size: int):
        super(MLP, self).__init__()
        self.linear0 = nn.Linear(input_size, 256)
        self.activation = nn.ReLU()
        self.linear1 = nn.Linear(256, 10)
        
    def forward(self, x):
        x = x.view(x.size(0), -1) 
        x1 = self.linear0(x)
        a1 = self.activation(x1)
        output = self.linear1(a1)

        return output

In [None]:
model = MLP(input_size=28*28)
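
The line `x = x.view(x.size(0), -1)` in `forward` is what lets the model accept image-shaped input. A small sketch of what it does, using a random tensor as a stand-in for one MNIST mini-batch (the name `batch` and the random data are assumptions for illustration):

```python
import torch

# stand-in for one mini-batch: 64 grayscale images of size 28x28
batch = torch.randn(64, 1, 28, 28)

# keep the batch dimension, flatten everything else into one axis
flat = batch.view(batch.size(0), -1)

print(flat.shape)
```

The flattened width is 1*28*28 = 784, which is why the model is built with `input_size=28*28`.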

# Optimizer

In this class we will learn how to write an optimizer.
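
For reference, the update that a vanilla mini-batch SGD optimizer performs at each step can be written as below. The notation is an assumption introduced here and is not used elsewhere in the notebook: $\theta_k$ denotes the parameters after $k$ steps, $\eta$ the learning rate (`lr`), $B_k$ the current mini-batch, $f$ the model, and $\ell$ the per-sample loss.

$$
\theta_{k+1} = \theta_k - \eta \, \nabla_\theta \left( \frac{1}{|B_k|} \sum_{i \in B_k} \ell\big(f(x_i;\theta_k),\, y_i\big) \right)
$$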

#### Reference: 
Final project starter code: https://www.kaggle.com/scaomath/washu-math-450-sp21-final-project-starter#Final-project:-write-our-own-optimizer

In [None]:
from torch.optim import Optimizer

In [None]:
class SGD(Optimizer):
    """
    Implements vanilla SGD, simplified from the official torch implementation,
    for Math 450 at WashU.
    
    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float): learning rate
        
    Example:
        >>> optimizer = SGD(model.parameters(), lr=1e-2)
        >>> optimizer.zero_grad()
        >>> loss_fn(model(input), target).backward()
        >>> optimizer.step()
    """

    def __init__(self, params, 
                       lr: float = 1e-3):
        # default hyperparameters; the base class copies them into each parameter group
        defaults = dict(lr=lr)
        super(SGD, self).__init__(params, defaults)

    def step(self, closure=None):
        # closure re-evaluates the loss; we can ignore it for now (useful in quasi-Newton)
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups: # fixed in template
            for param in group['params']:
                if param.grad is None:
                    continue
                grad_param = param.grad.data

                # vanilla SGD update: param <- param - lr * grad
                param.data -= group['lr']*grad_param

        return loss
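
The lookups `group['params']` and `group['lr']` in `step` work because `param_groups` is a list of dictionaries. The sketch below inspects that structure and checks the update rule numerically. It uses the built-in `torch.optim.SGD` (with its default zero momentum and zero weight decay) rather than the class above, and the tiny `nn.Linear` layer and random input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)  # tiny stand-in model with a weight and a bias
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

# param_groups is a list of dicts: tensors plus their hyperparameters
group = opt.param_groups[0]
print(type(group), group['lr'], len(group['params']))

# compute a gradient on random data
x = torch.randn(4, 3)
loss = layer(x).pow(2).mean()
loss.backward()

# save the weight and its gradient, then take one optimizer step
old_weight = layer.weight.detach().clone()
grad_weight = layer.weight.grad.detach().clone()
opt.step()

# the new weight should equal old - lr * grad
expected_weight = old_weight - group['lr'] * grad_weight
matches = torch.allclose(layer.weight.detach(), expected_weight)
print(matches)
```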

# What is a dictionary?

- Keys, values, and items (key-value pairs).
- Two ways to initialize a dictionary: the literal `{...}` and the `dict()` constructor.
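
A short sketch of both points. The name `config` and its entries are hypothetical and only mirror the hyperparameters used in this notebook.

```python
# Way 1: literal syntax with braces
config = {"lr": 1e-3, "epochs": 5}

# Way 2: dict() with keyword arguments, the style used for `defaults` in the optimizer
config_alt = dict(lr=1e-3, epochs=5)

same = config == config_alt  # both ways build the same dictionary
print(same)

# look up a value by its key
print(config["lr"])

# keys, values, and items
print(list(config.keys()))
print(list(config.values()))
for key, value in config.items():
    print(key, value)

# assigning to a new key adds an entry
config["batch_size"] = 64
print(config)
```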

In [None]:
loss_func = nn.CrossEntropyLoss()
epochs = 5
learning_rate = 1e-3

In [None]:
optimizer = SGD(model.parameters(), lr=learning_rate)

In [None]:

for epoch in range(epochs):
    
    model.train()
    
    loss_vals = []
    
    with tqdm(total=len(train_loader)) as pbar:
        for x, targets in train_loader:

            # forward pass
            outputs = model(x)

            # loss function
            loss = loss_func(outputs, targets)

            # record loss function values
            loss_vals.append(loss.item())

            # clear the gradients from the last iteration
            optimizer.zero_grad()

            # backprop
            loss.backward()

            # gradient descent
            optimizer.step()

            # check accuracy

            # tqdm template
            desc = f"epoch: [{epoch+1}/{epochs}] loss: {np.mean(loss_vals):.4f}"
            pbar.set_description(desc)
            pbar.update()
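
The training loop leaves `# check accuracy` as a placeholder. One possible way to fill it in is sketched below on hand-made logits, so the arithmetic can be checked by eye. The tensors here are hypothetical; in the loop, `outputs` and `targets` would come from the current mini-batch.

```python
import torch

# hypothetical logits for 4 samples and 3 classes
outputs = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.9, 0.8, 3.0],
                        [1.2, 0.4, 0.3]])
targets = torch.tensor([0, 1, 2, 1])

with torch.no_grad():  # no gradients are needed for evaluation
    preds = outputs.argmax(dim=1)  # predicted class = index of the largest logit
    num_correct = (preds == targets).sum().item()
    accuracy = num_correct / targets.size(0)

print(preds.tolist(), accuracy)
```

Here three of the four predictions match their targets, so the accuracy is 0.75.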