
# Coding lecture 8 of Math 450

## Last three weeks

- MNIST
- Generator, iterator, `iter()`, `next()`, `enumerate()`, `try: except:` flow control.
- Matrix-vector multiplications and "broadcastability".
- `loss.backward()` vs hand computation.
- Why `with torch.no_grad():` is necessary in manual gradient descent computation.
- Building a simple neural network using `torch.nn.Sequential()`.
- Gradient descent for a binary classification problem.
- Torch `DataLoader` interface for (mini-batch) SGD.

# Today
- More on classes and object-oriented programming: constructors, inheritance, and the usage of `super`.
- A complete PyTorch SGD training pipeline template.
- A new type: dictionary `dict`.
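
As a warm-up for constructors, inheritance, and `super`, here is a minimal sketch in plain Python. The class names `Base` and `Child` are hypothetical and chosen only for illustration; they are not part of PyTorch.

```python
class Base:
    def __init__(self, name: str):
        # the parent constructor sets up the shared state
        self.name = name
        self.registry = []

    def describe(self) -> str:
        return f"{self.name} with {len(self.registry)} registered items"


class Child(Base):  # Child is a subclass of Base
    def __init__(self, name: str, size: int):
        # run Base.__init__ first so that self.name and self.registry exist
        super().__init__(name)
        self.size = size
        self.registry.append(size)


child = Child("child", size=3)
print(child.describe())         # describe() is inherited from Base
print(isinstance(child, Base))  # a Child is also a Base
```

If the call to `super().__init__(name)` is left out, `self.registry` is never created and the `append` line raises an `AttributeError`. In Python 3, `super().__init__()` is shorthand for the longer form `super(Child, self).__init__()`.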

# A complete pipeline

- Data preparation
- Train-Validation (Train-Test) split (will be covered later)
- Model
- Choose an optimizer or write one on our own.
- Choose a scheduler or write one on our own (optional, will be covered later).
- Choose the proper loss function.
- Train! (and validate at the same time)
- Inference (for our final project, will be covered later).

In [27]:
import torch
import torch.nn as nn
import numpy as np

from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")

import warnings
warnings.filterwarnings("ignore")

In [28]:
!wget https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz 
!mv MNIST.tar_.gz MNIST.tar.gz 
!tar -zxvf MNIST.tar.gz 

--2021-03-26 20:08:53--  https://sites.wustl.edu/scao/files/2021/03/MNIST.tar_.gz
Resolving sites.wustl.edu (sites.wustl.edu)... 34.216.237.15, 34.215.37.29
Connecting to sites.wustl.edu (sites.wustl.edu)|34.216.237.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/1/2774/files/2021/03/MNIST.tar_.gz [following]
--2021-03-26 20:08:54--  https://cpb-us-w2.wpmucdn.com/sites.wustl.edu/dist/1/2774/files/2021/03/MNIST.tar_.gz
Resolving cpb-us-w2.wpmucdn.com (cpb-us-w2.wpmucdn.com)... 151.139.244.23
Connecting to cpb-us-w2.wpmucdn.com (cpb-us-w2.wpmucdn.com)|151.139.244.23|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23197619 (22M) [application/gzip]
Saving to: ‘MNIST.tar_.gz’


2021-03-26 20:08:54 (215 MB/s) - ‘MNIST.tar_.gz’ saved [23197619/23197619]

./MNIST/
./MNIST/processed/
./MNIST/raw/
./MNIST/raw/train-images-idx3-ubyte
./MNIST/raw/train-labels-idx1-ubyte
./MNIST/raw/train-

In [29]:
train = datasets.MNIST(root='./', 
                       train=True, 
                       download=True, 
                       transform = transforms.ToTensor());

In [None]:
train_loader = DataLoader(train, batch_size=64)

In [30]:
sample = next(iter(train_loader))

In [34]:
print(sample[0].size()) # (n_batch, n_channel, height, width)
# written as (N, C, H, W) in the official PyTorch docs
# C is the number of color channels: 1 for grayscale MNIST, 3 for RGB

torch.Size([64, 1, 28, 28])


In [37]:
print(sample[1].size())
print(sample[1][:10])

torch.Size([64])
tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])


In [54]:
class MLP(nn.Module): # subclass of nn.Module 
    def __init__(self, 
                 input_size: int):
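        '''
        __init__ is the constructor: it runs once when MLP(...) is called
        and sets up the layers used later in forward
        '''
        super(MLP, self).__init__() 
        # run nn.Module's constructor first so that the layers assigned
        # below are registered as parameters of the model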
        self.linear0 = nn.Linear(input_size, 256)
        self.activation = nn.ReLU()
        self.linear1 = nn.Linear(256, 10)
        
    def forward(self, x): 
        # forward is the method that nn.Module expects every subclass to define;
        # it is an ordinary instance method (it takes self), not a @staticmethod,
        # and it determines the behavior of model(x)

        x = x.view(x.size(0), -1) # flatten (N, 1, 28, 28) to (N, 784)
        x1 = self.linear0(x)
        a1 = self.activation(x1)
        output = self.linear1(a1)

        return output
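
To see what the two `nn.Linear` layers hold, the following sketch counts the trainable parameters of a network with the same layer sizes. It is written with `nn.Sequential` and `nn.Flatten` rather than the `MLP` class only so that the block runs on its own, and the input is a random stand-in for an MNIST batch.

```python
import torch
import torch.nn as nn

# same layer sizes as the MLP class, written with nn.Sequential
# so that this block is self-contained
net = nn.Sequential(
    nn.Flatten(),             # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(28 * 28, 256),
    nn.ReLU(),                # no parameters
    nn.Linear(256, 10),
)

# each nn.Linear contributes a weight matrix and a bias vector
for name, param in net.named_parameters():
    print(name, tuple(param.shape))

n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 784*256 + 256 + 256*10 + 10 = 203530

x = torch.randn(64, 1, 28, 28)  # random stand-in for an MNIST batch
out = net(x)
print(out.size())
```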

In [48]:
# explicit forward
x = sample[0] # data fed into the model
print("input data: ", x.size())

x = x.view(x.size(0), -1) # flatten (N, 1, 28, 28) to (N, 784)
print("reshape to remove channels: ", x.size())

input_size = x.size(-1)
print("x's last dim size: ", input_size)
linear0 = nn.Linear(input_size, 256)
x1 = linear0(x)
print("after layer 1: ", x1.size())
# batch_size: dim 0 should not change in forward pass

activation = nn.ReLU()
a1 = activation(x1) # activation does not change the shape

linear1 = nn.Linear(256, 10)
output = linear1(a1)
print("output size: ", output.size())

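# nn.CrossEntropyLoss applies log-softmax internally, so the model
# should output raw scores (logits); softmax is applied here only
# to inspect the predicted probabilities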
softmax = nn.Softmax(dim=-1)
output_prob = softmax(output)

input data:  torch.Size([64, 1, 28, 28])
reshape to remove channels:  torch.Size([64, 784])
x's last dim size:  784
after layer 1:  torch.Size([64, 256])
output size:  torch.Size([64, 10])


In [50]:
output[:2].detach()

tensor([[ 0.0183,  0.0725, -0.0405, -0.0193,  0.0671, -0.1242,  0.0634, -0.0373,
          0.0385, -0.0133],
        [-0.0908,  0.0086, -0.0722, -0.0163, -0.0316,  0.0034,  0.0679, -0.1210,
          0.0180,  0.1239]])

In [51]:
output_prob[:2].detach()

tensor([[0.1014, 0.1071, 0.0956, 0.0977, 0.1065, 0.0879, 0.1061, 0.0959, 0.1035,
         0.0983],
        [0.0921, 0.1017, 0.0938, 0.0992, 0.0977, 0.1012, 0.1079, 0.0894, 0.1027,
         0.1142]])
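
The comment about `nn.CrossEntropyLoss` can be checked numerically. The sketch below uses made-up logits and labels (not the MNIST batch) and compares the built-in loss with a hand computation: log-softmax followed by the negative log-probability of the true class, averaged over the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # made-up raw outputs for 4 samples
targets = torch.tensor([5, 0, 4, 1])  # made-up class labels

# nn.CrossEntropyLoss takes the raw logits directly
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# the same value by hand: log-softmax, then pick out the true class
log_prob = F.log_softmax(logits, dim=-1)
loss_manual = -log_prob[torch.arange(4), targets].mean()

print(loss_ce.item(), loss_manual.item())

# each row of softmax probabilities sums to 1
prob = F.softmax(logits, dim=-1)
print(prob.sum(dim=-1))
```

Feeding softmax probabilities into `nn.CrossEntropyLoss` would therefore apply softmax twice, which is why the model's `forward` returns the output of the last linear layer unchanged.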

In [55]:
model = MLP(input_size=28*28)

In [56]:
y = model(sample[0])
print(y.size())

torch.Size([64, 10])


# Optimizer

In this class we will learn how to write an optimizer.

#### Reference: 
Final project starter code: https://www.kaggle.com/scaomath/washu-math-450-sp21-final-project-starter#Final-project:-write-our-own-optimizer

In [57]:
from torch.optim import Optimizer

In [59]:
dir(torch.optim)

['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_functional',
 '_multi_tensor',
 'lr_scheduler',
 'swa_utils']

In [60]:
dir(torch.optim.lr_scheduler)

['CosineAnnealingLR',
 'CosineAnnealingWarmRestarts',
 'Counter',
 'CyclicLR',
 'ExponentialLR',
 'LambdaLR',
 'MultiStepLR',
 'MultiplicativeLR',
 'OneCycleLR',
 'Optimizer',
 'ReduceLROnPlateau',
 'StepLR',
 '_LRScheduler',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'bisect_right',
 'inf',
 'math',
 'types',
 'weakref',
 'wraps']

In [79]:
class SGD(Optimizer): # subclass of Optimizer
    """
    Implements the vanilla SGD simplified 
    from the torch official one
    for Math 450 WashU
    
    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float): learning rate
        
    Example:
        >>> optimizer = SGD(model.parameters(), lr=1e-2)
        >>> optimizer.zero_grad()
        >>> loss_fn(model(input), target).backward()
        >>> optimizer.step()
    """

    def __init__(self, params, # params: model.parameters()
                       lr: float = 1e-3, # input: type = value
                       name_input: str = 'SGD'
                 ): 
        # constructor
        defaults = dict(lr=lr, name=name_input) 
        # defaults is stored as optimizer.defaults and
        # its entries are copied into every parameter group
        super(SGD, self).__init__(params, defaults)

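    def step(self, closure=None): 
        # closure re-evaluates the loss; we can ignore it for now
        # (it is useful in quasi-Newton methods such as LBFGS)
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups: # fixed in template
            for param in group['params']:
                if param.grad is None:
                    continue
                grad_param = param.grad.data

                # vanilla SGD update: param <- param - lr * grad
                param.data -= group['lr']*grad_param

        return loss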
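
The update inside `step` is the plain rule "parameter minus learning rate times gradient". As a sanity check, the sketch below performs one such step by hand and compares it with one step of the built-in `torch.optim.SGD` (with its default settings of no momentum and no weight decay). The small linear model and the random data are made up for illustration.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 1e-2
model_a = nn.Linear(3, 2)
model_b = copy.deepcopy(model_a)  # identical starting weights

x = torch.randn(8, 3)             # made-up inputs
y = torch.randn(8, 2)             # made-up targets
loss_fn = nn.MSELoss()

# (a) one step with the built-in optimizer
opt = torch.optim.SGD(model_a.parameters(), lr=lr)
opt.zero_grad()
loss_fn(model_a(x), y).backward()
opt.step()

# (b) the same step written by hand
loss_fn(model_b(x), y).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)

# hyperparameters live in a list of dicts, one dict per parameter group
print(opt.param_groups[0]['lr'])
```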

In [80]:
optimizer = SGD(model.parameters(), lr=1e-3) # learning_rate is only defined in a later cell

In [81]:
optimizer.defaults

{'lr': 0.001, 'name': 'SGD'}

# What is a dictionary?

- A `dict` stores key-value pairs; each pair is called an item.
- Two ways of initialization: the literal `{key: value}` and the `dict()` function.
- Many packages use `dict` to store hyperparameters.

```python
{key1: value1, key2: value2}
```

In [72]:
dict1 = {'lr': 0.001,  # key: value
         'name': 'SGD', 
         10: 120}

In [66]:
type(dict1)

dict

In [73]:
for key in dict1.keys():
  print(key)

lr
name
10


In [70]:
dict1['lr']

0.001

In [74]:
dict1['name']

'SGD'

In [75]:
dict1[10]

120

In [76]:
for item in dict1.items():
  print(item)

('lr', 0.001)
('name', 'SGD')
(10, 120)


In [77]:
# using dict function
dict2 = dict(shuhao='instructor', 
             # left hand side becomes a string key
             name='math 450',
             time=300)

In [78]:
dict2

{'name': 'math 450', 'shuhao': 'instructor', 'time': 300}
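
A few more `dict` operations come up often when handling hyperparameters. The sketch below is plain Python; the dictionary `hyper` and the helper `describe` are hypothetical names used only for this example.

```python
hyper = dict(lr=1e-3, name='SGD')  # hypothetical hyperparameter dict

# .get returns a fallback instead of raising KeyError for a missing key
momentum = hyper.get('momentum', 0.0)

# overwrite an existing entry and add a new one
hyper['lr'] = 1e-2
hyper.update(epochs=5)

# items() yields (key, value) tuples that can be unpacked in the loop
for key, value in hyper.items():
    print(f"{key}: {value}")


def describe(lr, name, epochs):
    # hypothetical helper, only here to show ** unpacking
    return f"{name} with lr={lr} for {epochs} epochs"


summary = describe(**hyper)  # each key becomes a keyword argument
print(summary)
```

The `**` unpacking in the last step is the reverse of `dict(lr=..., name=...)`: one turns keyword arguments into a dictionary, the other turns a dictionary back into keyword arguments.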

In [None]:
loss_func = nn.CrossEntropyLoss()
epochs = 5
learning_rate = 1e-3

In [None]:
optimizer = SGD(model.parameters(), lr=learning_rate)

In [None]:

for epoch in range(epochs):
    
    model.train()
    
    loss_vals = []
    
    with tqdm(total=len(train_loader)) as pbar:
        for x, targets in train_loader:

            # forward pass
            outputs = model(x)

            # loss function
            loss = loss_func(outputs, targets)

            # record loss function values
            loss_vals.append(loss.item())

            # clean the gradient from last iteration
            optimizer.zero_grad()

            # backprop
            loss.backward()

            # gradient descent
            optimizer.step()

            # check accuracy

            # tqdm template
            desc = f"epoch: [{epoch+1}/{epochs}] loss: {np.mean(loss_vals):.4f}"
            pbar.set_description(desc)
            pbar.update()
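
The "check accuracy" step is left empty in the loop. One possible way to fill it in is sketched below on made-up logits and labels: the predicted class is the index of the largest logit in each row, and accuracy is the fraction of predictions that match the targets. The tensor values are invented for illustration.

```python
import torch

# made-up logits for 4 samples and 3 classes
outputs = torch.tensor([[2.0, 0.1, -1.0],
                        [0.3, 0.2, 1.5],
                        [0.0, 3.0, 0.5],
                        [1.0, 0.9, 0.8]])
targets = torch.tensor([0, 2, 1, 2])

with torch.no_grad():                # no gradients are needed for a metric
    preds = outputs.argmax(dim=-1)   # index of the largest logit per row
    n_correct = (preds == targets).sum().item()
    accuracy = n_correct / targets.size(0)

print(preds.tolist(), accuracy)      # [0, 2, 1, 0] 0.75
```

Inside the training loop, the same three lines could be applied to `outputs` and `targets` of each mini-batch, with the counts accumulated over the epoch.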