# Instructions

For this assignment you will use PyTorch instead of EDF to implement and train neural networks. The experiments in this assignment will take a long time to run without a GPU, but you can run the notebook remotely on Google Colab and have access to GPUs for free -- in this case you don't have to worry about installing PyTorch as it is available by default in Google Colab's environment.

To use Google Colab, you should access https://colab.research.google.com/ and upload this notebook to your workspace. To use a GPU, go to Edit -> Notebook settings and select GPU as the accelerator.

If you run the experiments on your own machine, you should install PyTorch -- there are multiple tutorials online and it is especially easy if you're using Anaconda. However, running on Colab ensures that you are running in the same environment (e.g., same Python version) in which the homework was developed.
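
Whichever environment you use, it can help to confirm that PyTorch is importable and whether a GPU is visible before starting long experiments. The following is a minimal sketch of such a check; the variable names `device` and `x` are illustrative only and are not used elsewhere in this notebook.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocating a small tensor on the chosen device confirms the device is usable.
x = torch.zeros(2, 3, device=device)

print("PyTorch version:", torch.__version__)
print("Running on:", device)
```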

You can check out the PyTorch tutorials at https://pytorch.org/tutorials/.

In [None]:
import torch, math, copy
import numpy as np
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F

Loading the MNIST data:

In [None]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST("data", train=False, download=True, transform=transform)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:07<00:00, 1280120.14it/s]


Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 495689.85it/s]


Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 4568512.18it/s]


Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)>

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 5604745.15it/s]

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw






PyTorch works with modules, which are instances of the class [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html).

A module class holds a parameter shape and a forward method. An instance of a module holds trainable parameters.  Creating an instance of a module allocates fresh parameters.  A module also includes a call method (automatically created) that allows an instance of the module to be applied like a function to an input.

This is different from EDF.  In EDF the forward method accesses the input as an instance variable of the object rather than as an argument to an application of a module instance. A parameter package of EDF is somewhat analogous to a module instance in PyTorch.
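
To make the distinction between a module class and a module instance concrete, here is a small sketch. The class `Affine` is a hypothetical example written for illustration; it is not part of the assignment code. Each instance allocates its own parameters, and an instance is applied to an input like a function.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, n_in, n_out):
        super().__init__()
        # Parameters registered here are created fresh for every instance.
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * 0.1)
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, x):
        # The input arrives as an argument, not as an instance variable.
        return x @ self.weight.t() + self.bias

a = Affine(3, 2)
b = Affine(3, 2)       # a second instance with its own parameters
x = torch.ones(5, 3)
y = a(x)               # calling the instance invokes forward
print(y.shape, len(list(a.parameters())))
```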

For the module class "compose" given below, each instance has instance variables f and g, which are themselves modules.


In [None]:
class compose(nn.Module):
  def __init__(self,f,g):
    super().__init__()
    self.f = f
    self.g = g
  def forward(self,input):
    return self.f(self.g(input))

We now use compose to build networks.  This is somewhat non-standard but seems elegant.
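
For comparison with the more common idiom, the sketch below checks that compose(f,g) computes the same function as nn.Sequential(g,f) when both share the same submodules. Note the reversed argument order: compose applies g first, as in mathematical function composition. The class is repeated here only so that the block runs on its own; the layer sizes are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

class compose(nn.Module):
    def __init__(self, f, g):
        super().__init__()
        self.f = f
        self.g = g

    def forward(self, input):
        return self.f(self.g(input))

conv = nn.Conv2d(1, 4, 3, stride=2, padding=1)
act = nn.ReLU()

composed = compose(act, conv)           # applies conv first, then act
sequential = nn.Sequential(conv, act)   # same order, written left to right

x = torch.randn(2, 1, 28, 28)
out_composed = composed(x)
out_sequential = sequential(x)
print(out_composed.shape)
```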





In [None]:
def simple_stack(depth,nchannels_in,nchannels_out,kernel_dim,activation):
  #depth-1 stride-1 convolutions that keep nchannels_in channels, followed by
  #one stride-2 convolution that maps nchannels_in to nchannels_out channels
  if depth == 1:
    return compose(activation,
                   nn.Conv2d(nchannels_in,
                             nchannels_out,
                             kernel_dim,
                             stride = 2,
                             padding = kernel_dim//2))
  return compose(simple_stack(depth-1,nchannels_in,nchannels_out,kernel_dim,activation),
                 compose(activation,
                         nn.Conv2d(nchannels_in,
                                   nchannels_in,
                                   kernel_dim,
                                   stride = 1,
                                   padding = kernel_dim//2)))

I also found an on-line discussion of initialization in PyTorch stating that "The docs usually don’t mention the initialization method, but if you look at PyTorch’s source code, you can see the weights are initialized with Kaiming uniform initialization."

This is called He initialization in the course slides (his name is Kaiming He). Xavier initialization is similar.
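
As an illustration of what Kaiming (He) uniform initialization does, the sketch below explicitly re-initializes a convolution weight with init.kaiming_uniform_ using the ReLU gain and checks the resulting range. With that gain, the weights are drawn uniformly from [-b, b] with b = sqrt(6 / fan_in). This is an explicit call made for demonstration; it is not a claim about the exact arguments PyTorch uses for its default layer initialization. The layer sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn
import torch.nn.init as init

torch.manual_seed(0)

conv = nn.Conv2d(4, 8, 3)  # weight shape [8, 4, 3, 3]

# Explicit Kaiming uniform initialization with the gain for ReLU (sqrt(2)).
init.kaiming_uniform_(conv.weight, nonlinearity="relu")

fan_in = 4 * 3 * 3                   # input channels times kernel area
bound = math.sqrt(6.0 / fan_in)      # gain * sqrt(3 / fan_in) with gain = sqrt(2)
max_abs = conv.weight.abs().max().item()

print("bound:", bound, "largest |weight|:", max_abs)
```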



In [None]:
def PS2_CNN(stackdepth,kernel_dim,activation):
  #PyTorch convolutions use the layout [nbatch,channels,height,width]
  #the shapes below are for kernel_dim = 3
  #input.shape is [nbatch,1,28,28]
  u = compose(activation,nn.Conv2d(1, 4, kernel_dim, 2, 1))
  #u(input).shape = [nbatch,4,14,14]
  u = compose(simple_stack(stackdepth,4,8,kernel_dim,activation),u)
  #u(input).shape = [nbatch,8,7,7]
  u = compose(simple_stack(stackdepth,8,16,kernel_dim,activation),u)
  #u(input).shape = [nbatch,16,4,4]
  u = compose(simple_stack(stackdepth,16,32,kernel_dim,activation),u)
  #u(input).shape = [nbatch,32,2,2]
  u = compose(nn.Flatten(1),u)
  #u(input).shape = [nbatch,128]
  u = compose(nn.Linear(128,10),u)
  #u(input).shape = [nbatch,10]
  return u
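
The spatial sizes 28, 14, 7, 4, 2 in the comments above follow from the standard output-size formula for a convolution without dilation, floor((n + 2*padding - kernel_dim) / stride) + 1. The sketch below works through that arithmetic for kernel_dim = 3, stride 2, and padding 1. The helper name `conv_out_size` is introduced here for illustration and is not a PyTorch function.

```python
def conv_out_size(n, kernel_dim, stride, padding):
    # Output length along one spatial dimension of a convolution (dilation 1).
    return (n + 2 * padding - kernel_dim) // stride + 1

# Four stride-2 convolutions: the first layer plus the last layer of each stack.
sizes = [28]
for _ in range(4):
    sizes.append(conv_out_size(sizes[-1], kernel_dim=3, stride=2, padding=1))

# The stride-1 convolutions inside a stack keep the spatial size unchanged.
same_size = conv_out_size(14, kernel_dim=3, stride=1, padding=1)

# 32 channels at the final spatial size give the input width of the linear layer.
flat_features = 32 * sizes[-1] ** 2

print(sizes, same_size, flat_features)
```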

In [None]:
model = PS2_CNN(1,3,nn.ReLU())

for p in model.parameters():
  print(p.shape)

torch.Size([10, 128])
torch.Size([10])
torch.Size([32, 16, 3, 3])
torch.Size([32])
torch.Size([16, 8, 3, 3])
torch.Size([16])
torch.Size([8, 4, 3, 3])
torch.Size([8])
torch.Size([4, 1, 3, 3])
torch.Size([4])


In [None]:
model = PS2_CNN(2,3,nn.ReLU())

for p in model.parameters():
  print(p.shape)

torch.Size([10, 128])
torch.Size([10])
torch.Size([32, 16, 3, 3])
torch.Size([32])
torch.Size([16, 16, 3, 3])
torch.Size([16])
torch.Size([16, 8, 3, 3])
torch.Size([16])
torch.Size([8, 8, 3, 3])
torch.Size([8])
torch.Size([8, 4, 3, 3])
torch.Size([8])
torch.Size([4, 4, 3, 3])
torch.Size([4])
torch.Size([4, 1, 3, 3])
torch.Size([4])


Note that the module PS2_CNN does not have any activation after the fully-connected layer. The PyTorch loss module that is used for cross entropy loss takes logits (scores) as input rather than class probabilities.
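
The following sketch illustrates why no softmax is applied after the last layer: the cross entropy loss in PyTorch applies log-softmax internally, so passing logits to F.cross_entropy gives the same value as applying F.log_softmax and then the negative log-likelihood loss F.nll_loss. The random logits and targets are stand-ins for a model output and a batch of labels.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(5, 10)            # stand-in for model(data): [nbatch, 10]
target = torch.randint(0, 10, (5,))    # stand-in for a batch of class labels

# Cross entropy computed directly from the logits.
loss_from_logits = F.cross_entropy(logits, target)

# The same quantity computed in two explicit steps.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(loss_from_logits.item(), loss_manual.item())
```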

We now provide a generic training algorithm for training a multiclass classifier. Training minimizes cross entropy loss but we report classification error rate.

Hyperparameters should be tuned on validation data and tested on test data not used in training or tuning.  Here we will cheat and use the test data as though it were the validation data.
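
For reference, a proper validation split could be made by holding out part of the training data. The sketch below uses torch.utils.data.random_split on a small synthetic dataset so that it runs without downloading anything. The helper `split_train_val`, the 10% fraction, and the seed are assumptions chosen for illustration; this notebook does not use such a split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def split_train_val(dataset, val_fraction=0.1, seed=0):
    # Hold out a fixed fraction of the dataset for validation.
    n_val = int(len(dataset) * val_fraction)
    n_train = len(dataset) - n_val
    generator = torch.Generator().manual_seed(seed)  # makes the split reproducible
    return random_split(dataset, [n_train, n_val], generator=generator)

# Synthetic stand-in with the same tensor shapes as MNIST samples.
toy = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))
train_part, val_part = split_train_val(toy)

print(len(train_part), len(val_part))
```

The same helper could be applied to train_dataset, with the resulting validation subset passed as val_data to the training function.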

In [None]:
def vanilla_train(model, nepochs, learning_rate, momentum, nbatch, train_data, val_data):

  #the model is moved to the GPU when one is available;
  #batches are moved to the GPU one at a time in train_epoch and test

  if torch.cuda.is_available():
    model = model.cuda() #we move the model parameters onto a GPU

  print('training nbatch = {:03d},lr = {:.2f}, momentum = {:.2f}'.format(nbatch,learning_rate,momentum))

  train_loader = torch.utils.data.DataLoader(train_data, batch_size=nbatch, shuffle=True)
  test_loader = torch.utils.data.DataLoader(val_data, batch_size=nbatch, shuffle=False)
  optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

  for epoch in range(nepochs):
    train_err = train_epoch(model,optimizer, train_loader)
    test_err = test(model, test_loader)
    print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch+1, nepochs, train_err*100, test_err*100))



def train_epoch(model, optimizer,loader):
    model.train() #test() leaves the model in eval mode, so switch back to training mode
    total_correct = 0.
    total_samples = 0.

    for batch_idx, (data, target) in enumerate(loader):
      #the loader organizes the data and target into batches
      #the GPU holds one batch at a time.
      if torch.cuda.is_available():
        data, target = data.cuda(), target.cuda()
      output = model(data)
      # The error rate is determined by the logits -- we do not yet need the loss.
      total_correct += (output.argmax(1) == target).type(torch.float).sum().item()
      total_samples += len(data)

      loss = nn.CrossEntropyLoss()(output, target)
      loss.backward()

      #print('Batch {:04}, Train Error {:03}%'.format(batch_idx, 100*(1-total_correct/total_samples)))
      optimizer.step()
      optimizer.zero_grad()

    return 1 - total_correct/total_samples

def test(model, loader):
    total_correct = 0.
    total_samples = 0.
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
          if torch.cuda.is_available():
              data, target = data.cuda(), target.cuda()
          output = model(data)
          total_correct += (output.argmax(1) == target).type(torch.float).sum().item()
          total_samples += len(data)
    return 1 - total_correct/total_samples

After some exploration we can find a simple learning rate schedule that seems to work well for batch size 16 and momentum zero.  The following two cells show the variability in the stochastic runs.
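
The schedule used in the cells below is expressed as two successive calls with different learning rates. As an alternative way to write a step schedule of this kind, the sketch below uses torch.optim.lr_scheduler.MultiStepLR to drop the rate from .07 to .01 after two epochs. The helper `two_stage_lrs` and the stand-in linear model are hypothetical and only record the learning rate per epoch; no training is performed.

```python
import torch
import torch.nn as nn

def two_stage_lrs(nepochs=5, first_lr=0.07, drop_epoch=2, gamma=1.0 / 7.0):
    model = nn.Linear(4, 2)  # stand-in model; only its parameters are needed
    optimizer = torch.optim.SGD(model.parameters(), lr=first_lr, momentum=0)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[drop_epoch], gamma=gamma)
    lrs = []
    for epoch in range(nepochs):
        lrs.append(optimizer.param_groups[0]["lr"])
        # A real loop would run one training epoch here.
        optimizer.step()   # no gradients are set, so the parameters are unchanged
        scheduler.step()   # advance the schedule once per epoch
    return lrs

lrs = two_stage_lrs()
print(["{:.3f}".format(lr) for lr in lrs])
```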

In [None]:
model = PS2_CNN(1,3,nn.ReLU()) #initializes model parameters
vanilla_train(model, 2, .07, 0, 16, train_dataset, test_dataset)
vanilla_train(model, 3, .01, 0, 16, train_dataset, test_dataset)

training nbatch = 016,lr = 0.07, momentum = 0.00
Epoch 001/002, Train Error 12.82% || Test Error 4.71%
Epoch 002/002, Train Error 3.18% || Test Error 2.49%
training nbatch = 016,lr = 0.01, momentum = 0.00
Epoch 001/003, Train Error 1.69% || Test Error 1.94%
Epoch 002/003, Train Error 1.45% || Test Error 1.67%
Epoch 003/003, Train Error 1.31% || Test Error 1.74%


In [None]:
model = PS2_CNN(1,3,nn.ReLU()) #initializes model parameters
vanilla_train(model, 2, .07, 0, 16, train_dataset, test_dataset)
vanilla_train(model, 3, .01, 0, 16, train_dataset, test_dataset)

training nbatch = 016,lr = 0.07, momentum = 0.00
Epoch 001/002, Train Error 10.60% || Test Error 2.98%
Epoch 002/002, Train Error 3.20% || Test Error 2.79%
training nbatch = 016,lr = 0.01, momentum = 0.00
Epoch 001/003, Train Error 1.75% || Test Error 1.83%
Epoch 002/003, Train Error 1.53% || Test Error 1.72%
Epoch 003/003, Train Error 1.40% || Test Error 1.63%


See if you can find comparable performance over five epochs for batch size 64 and momentum zero. Also find a schedule that works well for batch size 16 and momentum .75.  You should be able to make a good first guess in each case.

**** your solution goes below ****

**** Your solution goes above ****

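We now consider increasing the depth of the network.  The following network has nine convolution layers in its three stacks, in addition to the initial convolution.  It does not seem possible to find a learning rate that works well. (The test error 88.65% comes up a lot.  I'm guessing that is the random guessing error rate on the test data where the test data is not exactly uniformly distributed over the labels.  It matches always predicting the most frequent test label: the digit 1 accounts for 1135 of the 10000 test images, or 11.35%.)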

In [None]:
model = PS2_CNN(3,3,nn.ReLU())
vanilla_train(model, 2, .3, 0, 64, train_dataset, test_dataset)
vanilla_train(model, 3, .07, 0, 64, train_dataset, test_dataset)

training nbatch = 064,lr = 0.30, momentum = 0.00
Epoch 001/002, Train Error 89.03% || Test Error 88.65%
Epoch 002/002, Train Error 89.02% || Test Error 88.65%
training nbatch = 064,lr = 0.07, momentum = 0.00
Epoch 001/003, Train Error 88.78% || Test Error 88.65%
Epoch 002/003, Train Error 88.77% || Test Error 88.65%
Epoch 003/003, Train Error 88.77% || Test Error 88.65%


You should now modify the nine layer network in the last cell to have a residual connection around each convolution layer. For the residual connections that require shape adjustment you can use the PyTorch functions F.avg_pool2d and F.pad. The package F contains functions with no trainable parameters.  See if you can find hyperparameters for which the network trains.
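
As a hint on the shape adjustment only, the sketch below shows how F.avg_pool2d and F.pad change tensor shapes. It is one possible way to make a skip path match the output of a stride-2 convolution that increases the channel count; it is not a full residual network. The helper `match_shape` and the choice of ceil_mode=True (so that a 7x7 input maps to 4x4, as the stride-2 convolution does) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def match_shape(x, out_channels, stride):
    # Reduce the spatial size the same way a stride-`stride` convolution would.
    if stride > 1:
        x = F.avg_pool2d(x, kernel_size=stride, stride=stride, ceil_mode=True)
    # Append zero-valued channels until the channel count matches.
    extra = out_channels - x.shape[1]
    if extra > 0:
        # Pad order is (W_left, W_right, H_top, H_bottom, C_front, C_back).
        x = F.pad(x, (0, 0, 0, 0, 0, extra))
    return x

x = torch.randn(2, 8, 7, 7)
skip = match_shape(x, out_channels=16, stride=2)
print(skip.shape)
```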

*** your solution goes below ***

*** your solution goes above ***

Finally, replace each addition in the residual connection with a convex combination with a trainable combination weight.  This is an experiment -- I'm not sure what to expect.

*** your solution goes below ***