# PyTorch Tutorial & Homework - Neural Networks
Prof. Lim Kwan Hui, with many thanks to Prof. Dorien Herremans for the initial version and Nelson Lui for the base text.

Homework questions are at the end of the tutorial.

# Homework Exercises
**Due: 23rd Feb, 11:59pm**
<br>
<br>
Using the same FashionMNIST dataset, complete the tasks below. Submit your homework as either (i) an ipynb file with your results inside, or (ii) a Python file and a separate PDF discussing your results.

(a) Develop a new feed-forward neural network that contains 3 hidden layers, with hidden layers 1, 2, 3 being of dimensions 512, 256, 128, respectively. Hidden layer 1 is the layer immediately after the input layer, while hidden layer 3 is the one just before the output layer.

(b) Experiment with three different activation functions and two different optimizers. Report your results and discuss your findings.

(c) Building upon Task b above, describe and implement two approaches to improve upon the best variation from Task b. Report your results and discuss your findings.


# Homework Answers
(a)

In [1]:
import torch
import numpy as np
from tqdm import tqdm

import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

import os
# Use the GPU only if PyTorch can actually see one.
using_GPU = torch.cuda.is_available()

import torch.optim as optim

In [2]:
import torchvision
from torchvision.datasets import FashionMNIST

train_dataset = FashionMNIST(root='./torchvision-data',
                             train=True,
                             transform=torchvision.transforms.ToTensor(),
                             download=True)

test_dataset = FashionMNIST(root='./torchvision-data', train=False,
                            transform=torchvision.transforms.ToTensor())

In [3]:
from torch.utils.data import DataLoader

# Data-related hyperparameters
batch_size = 64

# Set up a DataLoader for the training dataset.
train_dataloader = DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Set up a DataLoader for the test dataset.
test_dataloader = DataLoader(
    dataset=test_dataset, batch_size=batch_size)
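
As a quick check on what these settings imply, the arithmetic below works out how many batches each DataLoader yields. It is only a sketch: it assumes the standard FashionMNIST split of 60,000 training and 10,000 test images, and the variable names are illustrative.

```python
import math

train_size = 60_000   # standard FashionMNIST training split (assumed)
test_size = 10_000    # standard FashionMNIST test split (assumed)
batch_size = 64

# DataLoader keeps the last, smaller batch by default (drop_last=False),
# so the number of batches is the ceiling of size / batch_size.
train_batches = math.ceil(train_size / batch_size)
test_batches = math.ceil(test_size / batch_size)
last_batch = train_size - (train_batches - 1) * batch_size

print(train_batches)  # 938
print(test_batches)   # 157
print(last_batch)     # 32
```

With 938 gradient updates per epoch, an evaluation every 500 updates falls once in epoch 1 and twice in each of the next few epochs, which is the pattern visible in the training logs.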

In [4]:
class FeedForwardNN_new(nn.Module):  # Configurable activation, with dropout
    def __init__(self, input_size, num_classes, hidden_dims, dropout, activation_fn):
        super(FeedForwardNN_new, self).__init__()

        assert len(hidden_dims) == 3

        self.hidden_layers = nn.ModuleList([])

        self.hidden_layers.append(nn.Linear(input_size, hidden_dims[0]))

        for i in range(len(hidden_dims) - 1):
            self.hidden_layers.append(nn.Linear(hidden_dims[i], hidden_dims[i + 1]))

        # Set up the dropout layer.
        self.dropout = nn.Dropout(dropout)

        # Set up the output layer; its input size is the last hidden dimension.
        self.output_projection = nn.Linear(hidden_dims[-1], num_classes)

        # Set up the nonlinearity.
        self.nonlinearity = activation_fn

    def forward(self, x):
        # Apply each hidden layer, then dropout, then the nonlinearity.
        for hidden_layer in self.hidden_layers:
            x = hidden_layer(x)
            x = self.dropout(x)
            x = self.nonlinearity(x)

        # Output layer: project x to a distribution over classes.
        out = self.output_projection(x)

        # Apply log-softmax to the out tensor to get a log-probability
        # distribution over classes for each example.
        out_distribution = F.log_softmax(out, dim=-1)
        return out_distribution
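
To make the size of this architecture concrete, here is a small sketch that counts its trainable parameters from the layer dimensions alone. The helper name `linear_params` is illustrative, not part of the notebook.

```python
layer_sizes = [784, 512, 256, 128, 10]

def linear_params(n_in, n_out):
    # A fully connected layer has an (n_out x n_in) weight matrix plus n_out biases.
    return n_in * n_out + n_out

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [401920, 131328, 32896, 1290]
print(total)      # 567434
```

The first hidden layer accounts for roughly 70% of the parameters, because it maps the 784 input pixels to 512 units.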

Training the model
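
The model returns log-probabilities and the training loop scores them with a negative log-likelihood criterion. The sketch below, in plain Python with made-up logits, illustrates why that pairing amounts to the usual cross-entropy loss. In PyTorch, `nn.CrossEntropyLoss` applied to raw logits combines the same two steps.

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll(log_probs, target):
    # Negative log-likelihood of the correct class.
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-class example
target = 0

log_probs = log_softmax(logits)
loss = nll(log_probs, target)

# Cross-entropy computed directly from the softmax probabilities.
denominator = sum(math.exp(v) for v in logits)
probs = [math.exp(z) / denominator for z in logits]
cross_entropy = -math.log(probs[target])

print(abs(loss - cross_entropy) < 1e-12)  # True
```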

In [9]:
def training_phase(ffnn_optimizer, fashionmnist_ffnn_clf):
  # Number of epochs (passes through the dataset) to train the model for.
  num_epochs = 10

  # A counter for the number of gradient updates we've performed.
  num_iter = 0

  # Iterate `num_epochs` times.
  for epoch in range(num_epochs):
    print("Starting epoch {}".format(epoch + 1))
    # Iterate over the train_dataloader, unpacking the images and labels
    for (images, labels) in train_dataloader:
      # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784), since
      # that's what our model expects. Remember that -1 does shape inference!
      reshaped_images = images.view(-1, 784)

      # Tensors track gradients directly since PyTorch 0.4, so no Variable
      # wrapper is needed here.

      # If we're using the GPU, move reshaped_images and labels to the GPU.
      if using_GPU:
        reshaped_images = reshaped_images.cuda()
        labels = labels.cuda()

      # Run the forward pass through the model to get predicted log distribution.
      # predicted shape: (batch_size, 10) (since there are 10 classes)
      predicted = fashionmnist_ffnn_clf(reshaped_images)

      # Calculate the loss
      batch_loss = nll_criterion(predicted, labels)

      # Clear the gradients as we prepare to backprop.
      ffnn_optimizer.zero_grad()

      # Backprop (backward pass), which calculates gradients.
      batch_loss.backward()

      # Take a gradient step to update parameters.
      ffnn_optimizer.step()

      # Increment gradient update counter.
      num_iter += 1

      # Calculate test set loss and accuracy every 500 gradient updates
      # It's standard to have this as a separate evaluate function, but
      # we'll place it inline for didactic purposes.
      if num_iter % 500 == 0:
        # Set model to eval mode, which turns off dropout.
        fashionmnist_ffnn_clf.eval()
        # Counters for the num of examples we get right / total num of examples.
        num_correct = 0
        total_examples = 0
        total_test_loss = 0

        # Iterate over the test dataloader. torch.no_grad() disables gradient
        # tracking, which saves memory and speeds up inference (it replaces
        # the removed volatile=True flag).
        with torch.no_grad():
          for (test_images, test_labels) in test_dataloader:
            # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784) again
            reshaped_test_images = test_images.view(-1, 784)

            # If we're using the GPU, move tensors to the GPU.
            if using_GPU:
              reshaped_test_images = reshaped_test_images.cuda()
              test_labels = test_labels.cuda()

            # Run the forward pass to get predicted distribution.
            predicted = fashionmnist_ffnn_clf(reshaped_test_images)

            # Calculate loss for this test batch. This is averaged, so multiply
            # by the number of examples in batch to get a total.
            total_test_loss += nll_criterion(
                predicted, test_labels).item() * test_labels.size(0)

            # Get predicted labels (argmax over the class dimension).
            _, predicted_labels = torch.max(predicted, 1)

            # Count the number of examples in this batch
            total_examples += test_labels.size(0)

            # Count the total number of correctly predicted labels.
            # predicted_labels == test_labels is a boolean tensor that is True
            # where they match, so we can sum to get the num correct.
            num_correct += torch.sum(predicted_labels == test_labels)
        accuracy = 100 * num_correct / total_examples
        average_test_loss = total_test_loss / total_examples
        print("Iteration {}. Test Loss {}. Test Accuracy {}.".format(
            num_iter, average_test_loss, accuracy))
        # Set the model back to train mode, which activates dropout again.
        fashionmnist_ffnn_clf.train()
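
The comments in the loop note that evaluation normally lives in its own function. A minimal sketch of such a helper follows. The name `evaluate`, its signature, and the synthetic data used to exercise it are assumptions made for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, dataloader, criterion, device="cpu"):
    """Return (average loss, accuracy in percent) over a dataloader."""
    was_training = model.training
    model.eval()                      # turn off dropout
    total_loss, num_correct, total = 0.0, 0, 0
    with torch.no_grad():             # no autograd history needed
        for images, labels in dataloader:
            images = images.view(images.size(0), -1).to(device)
            labels = labels.to(device)
            log_probs = model(images)
            # criterion averages over the batch, so rescale to a sum.
            total_loss += criterion(log_probs, labels).item() * labels.size(0)
            num_correct += (log_probs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    model.train(was_training)         # restore the previous mode
    return total_loss / total, 100.0 * num_correct / total

# Tiny synthetic stand-in for FashionMNIST so the sketch runs on its own.
torch.manual_seed(0)
fake_images = torch.rand(20, 1, 28, 28)
fake_labels = torch.randint(0, 10, (20,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=8)
toy_model = nn.Sequential(nn.Linear(784, 10), nn.LogSoftmax(dim=-1))

avg_loss, acc = evaluate(toy_model, loader, nn.NLLLoss())
print(avg_loss, acc)
```

Inside the training loop, the inline block could then be reduced to a single call such as `evaluate(fashionmnist_ffnn_clf, test_dataloader, nll_criterion)`, with the device argument set to match `using_GPU`.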

In [None]:
# Instantiate the new feed-forward neural network.
relu_ffnn_clf = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.ReLU())
if using_GPU:
    relu_ffnn_clf = relu_ffnn_clf.cuda()

print(relu_ffnn_clf)

parameters = relu_ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an SGD optimizer for updating the parameters of relu_ffnn_clf
SGD_optimizer = optim.SGD(relu_ffnn_clf.parameters(),
                           lr=lr, momentum=momentum)

ffnn_optimizer = SGD_optimizer
training_phase(ffnn_optimizer, relu_ffnn_clf)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.5840354561805725. Test Accuracy 76.43999481201172.
Starting epoch 2
Iteration 1000. Test Loss 0.5898212790489197. Test Accuracy 76.73999786376953.
Iteration 1500. Test Loss 0.561018168926239. Test Accuracy 79.5.
Starting epoch 3
Iteration 2000. Test Loss 0.5525748133659363. Test Accuracy 81.90999603271484.
Iteration 2500. Test Loss 0.49628332257270813. Test Accuracy 82.56999969482422.
Starting epoch 4
Iteration 3000. Test Loss 0.5364255309104919. Test Accuracy 81.0999984741211.
Iteration 3500. Test Loss 0.5090486407279968. Test Accuracy 82.95999908447266.
Starting epoch 5
Iteration 4000. Test Loss 0.5092098712921143. Test Accuracy 82.3699951171875.
Iteration 4500. Test Loss 0.46504807472229004. Test Accuracy 84.22000122070312.
Starting epoch 6
Iteration 5000. Test Loss 0.4656705856323242. Test Accuracy 83.77999877929688.
Iteration 5500. Test Loss 0.477907657623291. Test Accuracy 83.20999908447266.
Starting epoch 7
Iteration 6000. Test Loss 0.5084761381149292.

(b)

In [None]:
# Tanh Activation Function + SGD Optimizer
# Instantiate the new feed-forward neural network.
tanh_ffnn_clf = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.Tanh())
if using_GPU:
    tanh_ffnn_clf = tanh_ffnn_clf.cuda()

print(tanh_ffnn_clf)

parameters = tanh_ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an SGD optimizer for updating the parameters of tanh_ffnn_clf
SGD_optimizer = optim.SGD(tanh_ffnn_clf.parameters(),
                           lr=lr, momentum=momentum)

ffnn_optimizer = SGD_optimizer
training_phase(ffnn_optimizer, tanh_ffnn_clf)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): Tanh()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.6184403300285339. Test Accuracy 80.4000015258789.
Starting epoch 2
Iteration 1000. Test Loss 0.7050087451934814. Test Accuracy 74.7699966430664.
Iteration 1500. Test Loss 0.5966458916664124. Test Accuracy 82.13999938964844.
Starting epoch 3
Iteration 2000. Test Loss 0.7283454537391663. Test Accuracy 77.72999572753906.
Iteration 2500. Test Loss 0.6101191639900208. Test Accuracy 80.62999725341797.
Starting epoch 4
Iteration 3000. Test Loss 0.5985407829284668. Test Accuracy 81.93000030517578.
Iteration 3500. Test Loss 0.6728845834732056. Test Accuracy 81.0999984741211.
Starting epoch 5
Iteration 4000. Test Loss 0.5841867327690125. Test Accuracy 81.79000091552734.
Iteration 4500. Test Loss 0.5688333511352539. Test Accuracy 82.41999816894531.
Starting epoch 6
Iteration 5000. Test Loss 0.6561270356178284. Test Accuracy 81.08999633789062.
Iteration 5500. Test Loss 0.5494189262390137. Test Accuracy 83.37999725341797.
Starting epoch 7
Iteration 6000. Test Loss 0.65832

In [None]:
# Sigmoid Activation Function + SGD Optimizer
# Instantiate the new feed-forward neural network.
sigmoid_ffnn_clf = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.Sigmoid())
if using_GPU:
    sigmoid_ffnn_clf = sigmoid_ffnn_clf.cuda()

print(sigmoid_ffnn_clf)

parameters = sigmoid_ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an SGD optimizer for updating the parameters of sigmoid_ffnn_clf
SGD_optimizer = optim.SGD(sigmoid_ffnn_clf.parameters(),
                           lr=lr, momentum=momentum)

ffnn_optimizer = SGD_optimizer
training_phase(ffnn_optimizer, sigmoid_ffnn_clf)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): Sigmoid()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 2.304335832595825. Test Accuracy 10.0.
Starting epoch 2
Iteration 1000. Test Loss 2.157402992248535. Test Accuracy 19.53999900817871.
Iteration 1500. Test Loss 0.8585001826286316. Test Accuracy 65.12999725341797.
Starting epoch 3
Iteration 2000. Test Loss 0.7803479433059692. Test Accuracy 72.16999816894531.
Iteration 2500. Test Loss 0.6224414110183716. Test Accuracy 78.55999755859375.
Starting epoch 4
Iteration 3000. Test Loss 0.5974296927452087. Test Accuracy 80.13999938964844.
Iteration 3500. Test Loss 0.5586544871330261. Test Accuracy 81.57999420166016.
Starting epoch 5
Iteration 4000. Test Loss 0.49993616342544556. Test Accuracy 83.12999725341797.
Iteration 4500. Test Loss 0.5281434059143066. Test Accuracy 82.8699951171875.
Starting epoch 6
Iteration 5000. Test Loss 0.5730727910995483. Test Accuracy 82.30999755859375.
Iteration 5500. Test Loss 0.48799291253089905. Test Accuracy 83.62999725341797.
Starting epoch 7
Iteration 6000. Test Loss 0.4766145944595337

In [10]:
# ReLU Activation Function + Adam Optimizer
# Instantiate the new feed-forward neural network.
relu_ffnn_clf_2 = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.ReLU())
if using_GPU:
    relu_ffnn_clf_2 = relu_ffnn_clf_2.cuda()

print(relu_ffnn_clf_2)

parameters = relu_ffnn_clf_2.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.0001
# Set up an Adam optimizer for updating the parameters of relu_ffnn_clf_2.
# (Adam does not take a momentum argument, so none is defined here.)
Adam_optimizer = optim.Adam(relu_ffnn_clf_2.parameters(), lr=lr)

ffnn_optimizer = Adam_optimizer
training_phase(ffnn_optimizer, relu_ffnn_clf_2)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.6407455801963806. Test Accuracy 77.02999877929688.
Starting epoch 2
Iteration 1000. Test Loss 0.5363321900367737. Test Accuracy 80.69000244140625.
Iteration 1500. Test Loss 0.4833813011646271. Test Accuracy 83.16999816894531.
Starting epoch 3
Iteration 2000. Test Loss 0.4567480981349945. Test Accuracy 83.61000061035156.
Iteration 2500. Test Loss 0.4402889609336853. Test Accuracy 84.25.
Starting epoch 4
Iteration 3000. Test Loss 0.418686181306839. Test Accuracy 84.70999908447266.
Iteration 3500. Test Loss 0.4084586203098297. Test Accuracy 85.22000122070312.
Starting epoch 5
Iteration 4000. Test Loss 0.39851099252700806. Test Accuracy 85.56999969482422.
Iteration 4500. Test Loss 0.3913775384426117. Test Accuracy 85.77999877929688.
Starting epoch 6
Iteration 5000. Test Loss 0.38298240303993225. Test Accuracy 85.95999908447266.
Iteration 5500. Test Loss 0.3737016022205353. Test Accuracy 86.37999725341797.
Starting epoch 7
Iteration 6000. Test Loss 0.3758830428123

In [None]:
# Tanh Activation Function + Adam Optimizer
# Instantiate the new feed-forward neural network.
tanh_ffnn_clf_2 = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.Tanh())
if using_GPU:
    tanh_ffnn_clf_2 = tanh_ffnn_clf_2.cuda()

print(tanh_ffnn_clf_2)

parameters = tanh_ffnn_clf_2.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.0001
# Set up an Adam optimizer for updating the parameters of tanh_ffnn_clf_2.
# (Adam does not take a momentum argument, so none is defined here.)
Adam_optimizer = optim.Adam(tanh_ffnn_clf_2.parameters(), lr=lr)

ffnn_optimizer = Adam_optimizer
training_phase(ffnn_optimizer, tanh_ffnn_clf_2)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): Tanh()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.6105337738990784. Test Accuracy 78.37999725341797.
Starting epoch 2
Iteration 1000. Test Loss 0.5047974586486816. Test Accuracy 81.87999725341797.
Iteration 1500. Test Loss 0.4703048765659332. Test Accuracy 83.16999816894531.
Starting epoch 3
Iteration 2000. Test Loss 0.45681488513946533. Test Accuracy 83.97000122070312.
Iteration 2500. Test Loss 0.447513222694397. Test Accuracy 84.06999969482422.
Starting epoch 4
Iteration 3000. Test Loss 0.43859827518463135. Test Accuracy 84.68000030517578.
Iteration 3500. Test Loss 0.4257950484752655. Test Accuracy 85.18000030517578.
Starting epoch 5
Iteration 4000. Test Loss 0.42296990752220154. Test Accuracy 85.48999786376953.
Iteration 4500. Test Loss 0.41380956768989563. Test Accuracy 85.79999542236328.
Starting epoch 6
Iteration 5000. Test Loss 0.4294901192188263. Test Accuracy 84.73999786376953.
Iteration 5500. Test Loss 0.4085720479488373. Test Accuracy 85.56999969482422.
Starting epoch 7
Iteration 6000. Test Loss 0

In [11]:
# Sigmoid Activation Function + Adam Optimizer
# Instantiate the new feed-forward neural network.
sigmoid_ffnn_clf_2 = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.Sigmoid())
if using_GPU:
    sigmoid_ffnn_clf_2 = sigmoid_ffnn_clf_2.cuda()

print(sigmoid_ffnn_clf_2)

parameters = sigmoid_ffnn_clf_2.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.0001
# Set up an Adam optimizer for updating the parameters of sigmoid_ffnn_clf_2.
# (Adam does not take a momentum argument, so none is defined here.)
Adam_optimizer = optim.Adam(sigmoid_ffnn_clf_2.parameters(), lr=lr)

ffnn_optimizer = Adam_optimizer
training_phase(ffnn_optimizer, sigmoid_ffnn_clf_2)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): Sigmoid()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 1.625672459602356. Test Accuracy 33.779998779296875.
Starting epoch 2
Iteration 1000. Test Loss 1.256834626197815. Test Accuracy 50.63999938964844.
Iteration 1500. Test Loss 1.075071096420288. Test Accuracy 56.290000915527344.
Starting epoch 3
Iteration 2000. Test Loss 0.9339541792869568. Test Accuracy 61.06999969482422.
Iteration 2500. Test Loss 0.8313959836959839. Test Accuracy 66.37999725341797.
Starting epoch 4
Iteration 3000. Test Loss 0.7557201385498047. Test Accuracy 70.83999633789062.
Iteration 3500. Test Loss 0.6834837794303894. Test Accuracy 72.2300033569336.
Starting epoch 5
Iteration 4000. Test Loss 0.6570406556129456. Test Accuracy 73.6500015258789.
Iteration 4500. Test Loss 0.6209216117858887. Test Accuracy 75.7300033569336.
Starting epoch 6
Iteration 5000. Test Loss 0.605666995048523. Test Accuracy 76.08000183105469.
Iteration 5500. Test Loss 0.5829781889915466. Test Accuracy 77.2300033569336.
Starting epoch 7
Iteration 6000. Test Loss 0.57868677
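
The six experiment cells above differ only in the activation function and the optimizer. One way to express the same grid as a loop is sketched below. The `build_model` helper is a compact stand-in for `FeedForwardNN_new`, introduced here only so the example runs on its own, and the dictionary names are likewise illustrative.

```python
import itertools
import torch.nn as nn
import torch.optim as optim

# Factories, so every run gets a fresh activation module and optimizer.
activations = {"relu": nn.ReLU, "tanh": nn.Tanh, "sigmoid": nn.Sigmoid}
optimizers = {
    "sgd": lambda params: optim.SGD(params, lr=0.1, momentum=0.9),
    "adam": lambda params: optim.Adam(params, lr=0.0001),
}

def build_model(activation_cls):
    # Compact stand-in for FeedForwardNN_new (dropout before the activation,
    # matching the order used in the forward pass of the notebook's class).
    dims = [784, 512, 256, 128]
    layers = []
    for n_in, n_out in zip(dims, dims[1:]):
        layers += [nn.Linear(n_in, n_out), nn.Dropout(0.2), activation_cls()]
    layers += [nn.Linear(dims[-1], 10), nn.LogSoftmax(dim=-1)]
    return nn.Sequential(*layers)

runs = {}
for act_name, opt_name in itertools.product(activations, optimizers):
    model = build_model(activations[act_name])
    optimizer = optimizers[opt_name](model.parameters())
    runs[(act_name, opt_name)] = (model, optimizer)
    # A call such as training_phase(optimizer, model) would go here.

print(sorted(runs))
```

Building a new model inside the loop matters: reusing one model across optimizers would let later runs start from already-trained weights.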

**Report on the activation functions and optimizers:**

**Results:**
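- With Adam, ReLU gave the lowest test loss and highest test accuracy in the logged output (0.374 and 86.4% at iteration 5500), ahead of Tanh (0.409, 85.6%) and Sigmoid (0.583, 77.2%). With SGD the three activations were much closer, all at roughly 83% accuracy by iteration 5500.
- Adam (lr=0.0001) reached lower test loss and higher test accuracy than SGD (lr=0.1, momentum=0.9) for ReLU and Tanh, and its loss decreased far more smoothly. For Sigmoid, Adam at this learning rate converged more slowly and still trailed SGD at iteration 5500 (77.2% versus 83.6%).
- The most effective configuration was ReLU activation with the Adam optimizer, which had the best test loss and accuracy of the six runs.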

**Discussion:**

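ReLU's advantage may stem from its resistance to the vanishing gradient problem: its derivative is exactly 1 for positive inputs, whereas the sigmoid derivative never exceeds 0.25, which fits Sigmoid's slow start under both optimizers. Adam's per-parameter adaptive step sizes likely contribute to its steadier convergence. Note that the two optimizers were run with different learning rates (0.1 for SGD, 0.0001 for Adam), so the comparison reflects these specific settings rather than the optimizers in general.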
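
The vanishing-gradient argument can be made concrete with a little arithmetic. The sketch below multiplies the activation derivative across three hidden layers, under the simplifying assumptions that the weights are ignored and every layer sees the same pre-activation value. The chosen inputs are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

num_hidden_layers = 3
for z in (0.5, 2.0):
    for name, d in (("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)):
        # Product of the activation derivatives along one path through the
        # three hidden layers (weights ignored, same z assumed at each layer).
        print(name, z, d(z) ** num_hidden_layers)
```

Because the sigmoid derivative peaks at 0.25, the three-layer product is at most 0.25 ** 3 = 0.015625, while for ReLU it stays at 1 whenever the units are active.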

**Conclusion:**

Based on these experiments, ReLU with the Adam optimizer was the best of the six configurations tested on FashionMNIST.

(c) To enhance the performance of the top-performing variation, I will implement the following two strategies:

1. **L2 Regularization**: I will add an L2 penalty through the `weight_decay` argument of the Adam optimizer. The penalty discourages large weights, shrinking them toward zero without making them sparse.

2. **Early Stopping**: I will monitor the held-out loss during training and halt training when it stops improving, or rises consistently, for a number of consecutive evaluations set by a `patience` parameter. The implementation tracks the test-set loss, since no separate validation split is set up.

**1. L2 Regularization:**

In [14]:
# L2 regularization + Dropout on the best combination: ReLU Activation Function + Adam Optimizer
l2_ffnn_clf = FeedForwardNN_new(
    input_size=784, num_classes=10, hidden_dims=[512, 256, 128],
    dropout=0.2, activation_fn=nn.ReLU())
if using_GPU:
    l2_ffnn_clf = l2_ffnn_clf.cuda()

print(l2_ffnn_clf)

parameters = l2_ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.0001
# Set up an Adam optimizer for l2_ffnn_clf; weight_decay adds the L2 penalty.
Adam_optimizer = optim.Adam(l2_ffnn_clf.parameters(), lr=lr, weight_decay=1e-5)

ffnn_optimizer = Adam_optimizer
training_phase(ffnn_optimizer, l2_ffnn_clf)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.6415849328041077. Test Accuracy 76.8499984741211.
Starting epoch 2
Iteration 1000. Test Loss 0.5313308835029602. Test Accuracy 80.95999908447266.
Iteration 1500. Test Loss 0.48429322242736816. Test Accuracy 82.55000305175781.
Starting epoch 3
Iteration 2000. Test Loss 0.46114739775657654. Test Accuracy 83.72000122070312.
Iteration 2500. Test Loss 0.4384508430957794. Test Accuracy 84.29000091552734.
Starting epoch 4
Iteration 3000. Test Loss 0.42017748951911926. Test Accuracy 85.0999984741211.
Iteration 3500. Test Loss 0.4081423580646515. Test Accuracy 85.25.
Starting epoch 5
Iteration 4000. Test Loss 0.4004044532775879. Test Accuracy 85.76000213623047.
Iteration 4500. Test Loss 0.39404451847076416. Test Accuracy 85.8499984741211.
Starting epoch 6
Iteration 5000. Test Loss 0.3809807002544403. Test Accuracy 86.45999908447266.
Iteration 5500. Test Loss 0.3748464286327362. Test Accuracy 86.58999633789062.
Starting epoch 7
Iteration 6000. Test Loss 0.3738487362861

**Report on L2 Regularization:**

**Results:**
- Adding an L2 penalty (`weight_decay=1e-5`) to the Adam optimizer slightly improved both test loss and test accuracy.
- The test loss decreased from 0.3576 to 0.3429, while the test accuracy increased from 87.07% to 87.51%.

**Discussion:**

The improvement is consistent with the L2 penalty reducing overfitting by constraining the size of the weights. However, the gain is small (0.44 percentage points of accuracy) and comes from a single run of each configuration, so part of it may be due to random initialization and batch ordering rather than the regularizer.
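
To show what the penalty does to a single weight, here is a worked sketch. It uses a plain gradient-descent step rather than Adam's adaptive update, which is a simplification, and the learning rate and decay values are exaggerated for illustration. The function name is hypothetical.

```python
def sgd_step_with_l2(w, grad, lr, weight_decay):
    # The L2 term adds weight_decay * w to the gradient before the update.
    return w - lr * (grad + weight_decay * w)

w = 1.0
lr, weight_decay = 0.1, 0.5   # exaggerated values chosen for illustration
history = [w]
for _ in range(5):
    # A zero data gradient isolates the effect of the penalty.
    w = sgd_step_with_l2(w, grad=0.0, lr=lr, weight_decay=weight_decay)
    history.append(w)

print(history)
```

Each step multiplies the weight by (1 - lr * weight_decay) = 0.95, so it decays geometrically toward zero but never reaches it. With the notebook's `weight_decay=1e-5` the shrinkage per step is far gentler.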

**Conclusion:**

Adding the L2 penalty to the Adam optimizer gave a modest improvement in test loss and test accuracy. Repeating both runs with several random seeds would be needed to confirm that the gain comes from the regularization.

**2. Early Stopping:**
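
Before the full training function, the patience rule on its own can be sketched in a few lines. The function name and the loss values below are hypothetical, and the rule shown (stop after `patience` consecutive evaluations without a new best loss) is one common variant, not necessarily the exact criterion used in this notebook.

```python
def early_stopping_step(losses, patience):
    """Return the 1-based index of the evaluation at which training stops,
    or None if the patience budget is never exhausted."""
    best_loss = float("inf")
    checks_without_improvement = 0
    for step, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step
    return None

# Hypothetical loss curve: improves, wobbles, then plateaus.
losses = [0.64, 0.53, 0.48, 0.49, 0.50, 0.47, 0.48, 0.49, 0.50]

print(early_stopping_step(losses, patience=3))  # 9
print(early_stopping_step(losses, patience=2))  # 5
```

A larger `patience` tolerates temporary increases in the loss, at the cost of training longer before stopping.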

In [15]:
def early_stop_training_phase(ffnn_optimizer, fashionmnist_ffnn_clf, patience):
    # Number of epochs (passes through the dataset) to train the model for.
    num_epochs = 10

    # A counter for the number of gradient updates we've performed.
    num_iter = 0

    # Initialize variables for early stopping
    best_test_loss = float('inf')
    epochs_without_improvement = 0

    # Iterate `num_epochs` times.
    for epoch in range(num_epochs):
        print("Starting epoch {}".format(epoch + 1))
        # Iterate over the train_dataloader, unpacking the images and labels
        for (images, labels) in train_dataloader:
            # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784), since
            # that's what our model expects. Remember that -1 does shape inference!
            reshaped_images = images.view(-1, 784)

            # Tensors track gradients directly since PyTorch 0.4, so no Variable
            # wrapper is needed here.

            # If we're using the GPU, move reshaped_images and labels to the GPU.
            if using_GPU:
                reshaped_images = reshaped_images.cuda()
                labels = labels.cuda()

            # Run the forward pass through the model to get predicted log distribution.
            # predicted shape: (batch_size, 10) (since there are 10 classes)
            predicted = fashionmnist_ffnn_clf(reshaped_images)

            # Calculate the loss
            batch_loss = nll_criterion(predicted, labels)

            # Clear the gradients as we prepare to backprop.
            ffnn_optimizer.zero_grad()

            # Backprop (backward pass), which calculates gradients.
            batch_loss.backward()

            # Take a gradient step to update parameters.
            ffnn_optimizer.step()

            # Increment gradient update counter.
            num_iter += 1

            # Calculate test set loss and accuracy every 500 gradient updates
            # It's standard to have this as a separate evaluate function, but
            # we'll place it inline for didactic purposes.
            if num_iter % 500 == 0:
                # Set model to eval mode, which turns off dropout.
                fashionmnist_ffnn_clf.eval()
                # Counters for the num of examples we get right / total num of examples.
                num_correct = 0
                total_examples = 0
                total_test_loss = 0

                # Iterate over the test dataloader. torch.no_grad() disables
                # gradient tracking, which saves memory and speeds up inference
                # (it replaces the removed volatile=True flag).
                with torch.no_grad():
                    for (test_images, test_labels) in test_dataloader:
                        # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784) again
                        reshaped_test_images = test_images.view(-1, 784)

                        # If we're using the GPU, move tensors to the GPU.
                        if using_GPU:
                            reshaped_test_images = reshaped_test_images.cuda()
                            test_labels = test_labels.cuda()

                        # Run the forward pass to get predicted distribution.
                        predicted = fashionmnist_ffnn_clf(reshaped_test_images)

                        # Calculate loss for this test batch. This is averaged, so multiply
                        # by the number of examples in batch to get a total.
                        total_test_loss += nll_criterion(
                            predicted, test_labels).item() * test_labels.size(0)

                        # Get predicted labels (argmax over the class dimension).
                        _, predicted_labels = torch.max(predicted, 1)

                        # Count the number of examples in this batch
                        total_examples += test_labels.size(0)

                        # Count the total number of correctly predicted labels.
                        # predicted_labels == test_labels is a boolean tensor that is True
                        # where they match, so we can sum to get the num correct.
                        num_correct += torch.sum(predicted_labels == test_labels)
                accuracy = 100 * num_correct / total_examples
                average_test_loss = total_test_loss / total_examples
                print("Iteration {}. Test Loss {}. Test Accuracy {}.".format(
                    num_iter, average_test_loss, accuracy))
                # Set the model back to train mode, which activates dropout again.
                fashionmnist_ffnn_clf.train()

                # Check for early stopping. Note that the counter advances once per
                # evaluation (every 500 gradient updates), not once per epoch.
                if average_test_loss < best_test_loss:
                    best_test_loss = average_test_loss
                    epochs_without_improvement = 0
                else:
                    epochs_without_improvement += 1
                    if epochs_without_improvement >= patience:
                        print("Early stopping triggered. No improvement in test loss "
                              "for {} consecutive evaluations.".format(patience))
                        return

In [17]:
# Early stopping on top of L2 regularization + Dropout, applied to the best combination: ReLU Activation Function + Adam Optimizer
l2_ffnn_clf_2 = FeedForwardNN_new(input_size=784, num_classes=10, hidden_dims=[512, 256, 128], dropout=0.2, activation_fn=nn.ReLU())
if using_GPU:
    l2_ffnn_clf_2 = l2_ffnn_clf_2.cuda()

print(l2_ffnn_clf_2)

parameters = l2_ffnn_clf_2.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

# Set up an Adam optimizer for updating the parameters of l2_ffnn_clf_2.
# weight_decay adds the L2 regularization penalty.
Adam_optimizer = optim.Adam(l2_ffnn_clf_2.parameters(), lr=0.0001, weight_decay=1e-5)

ffnn_optimizer = Adam_optimizer
early_stop_training_phase(ffnn_optimizer, l2_ffnn_clf_2, patience=3)

FeedForwardNN_new(
  (hidden_layers): ModuleList(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=256, bias=True)
    (2): Linear(in_features=256, out_features=128, bias=True)
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (output_projection): Linear(in_features=128, out_features=10, bias=True)
  (nonlinearity): ReLU()
)
Shapes of model parameters:
[torch.Size([512, 784]), torch.Size([512]), torch.Size([256, 512]), torch.Size([256]), torch.Size([128, 256]), torch.Size([128]), torch.Size([10, 128]), torch.Size([10])]
Starting epoch 1


Iteration 500. Test Loss 0.6402992010116577. Test Accuracy 77.0199966430664.
Starting epoch 2
Iteration 1000. Test Loss 0.5309149622917175. Test Accuracy 80.86000061035156.
Iteration 1500. Test Loss 0.489239364862442. Test Accuracy 82.76000213623047.
Starting epoch 3
Iteration 2000. Test Loss 0.456836462020874. Test Accuracy 83.66999816894531.
Iteration 2500. Test Loss 0.43717920780181885. Test Accuracy 84.41999816894531.
Starting epoch 4
Iteration 3000. Test Loss 0.4248334467411041. Test Accuracy 85.06999969482422.
Iteration 3500. Test Loss 0.4103059470653534. Test Accuracy 85.41000366210938.
Starting epoch 5
Iteration 4000. Test Loss 0.4021386504173279. Test Accuracy 85.5999984741211.
Iteration 4500. Test Loss 0.387502521276474. Test Accuracy 86.08000183105469.
Starting epoch 6
Iteration 5000. Test Loss 0.38503211736679077. Test Accuracy 85.94999694824219.
Iteration 5500. Test Loss 0.3722062408924103. Test Accuracy 86.58000183105469.
Starting epoch 7
Iteration 6000. Test Loss 0.38895

**Report on early stopping:**

**Results:**
- With early stopping enabled, test accuracy rose marginally from 87.51% to 87.71%, while test loss rose slightly from 0.3429 to 0.3456.
- Early stopping was never triggered: the test loss never went `patience` consecutive evaluations (one evaluation every 500 gradient updates) without improving, so training was never cut short.
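
Because the training loop advances its patience counter at every test evaluation (every 500 gradient updates) rather than once per epoch, it can help to work out what `patience=3` means in epochs. The arithmetic below is only a sketch: the batch size of 64 is an assumption, since it is not stated in this section, while the other values come from the code and the dataset.

```python
import math

train_examples = 60000   # FashionMNIST training set size
batch_size = 64          # assumption: not stated in this section
eval_every = 500         # from the training loop: evaluate every 500 updates
patience = 3             # from early_stop_training_phase(..., patience=3)

# Number of gradient updates needed to see the whole training set once.
updates_per_epoch = math.ceil(train_examples / batch_size)

# Updates without improvement that are needed before training stops.
updates_before_stop = patience * eval_every
epochs_before_stop = updates_before_stop / updates_per_epoch

print(updates_per_epoch)             # 938
print(updates_before_stop)           # 1500
print(round(epochs_before_stop, 2))  # 1.6
```

Under that assumption, the stopping criterion would fire after roughly 1.6 epochs without a new best test loss, not after 3 epochs.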

**Discussion:**

Early stopping did not significantly affect the results. The stopping criterion was never met, because the test loss kept reaching a new minimum before `patience` consecutive evaluations had passed without improvement. The model therefore trained exactly as long as it would have without early stopping, which is the intended behavior while the model is still improving. The small differences in accuracy and loss may instead reflect run-to-run variation, such as random initialization and dropout.
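
The patience logic discussed here can be isolated from the training loop. The sketch below is not the notebook's implementation: `EarlyStopper` and `should_stop` are hypothetical names introduced for illustration, and the last three loss values are made up to show the criterion firing.

```python
class EarlyStopper:
    """Hypothetical helper: tracks the best loss and counts checks without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.checks_without_improvement = 0

    def should_stop(self, loss):
        """Record one evaluation; return True once patience is exhausted."""
        if loss < self.best_loss:
            # New best loss: remember it and reset the counter.
            self.best_loss = loss
            self.checks_without_improvement = 0
            return False
        self.checks_without_improvement += 1
        return self.checks_without_improvement >= self.patience


# The first three values are rounded from the log above (iterations 4500 to 5500);
# the last three are made-up values that never beat the best loss.
losses = [0.3875, 0.3850, 0.3722, 0.39, 0.38, 0.375]

stopper = EarlyStopper(patience=3)
stopped_at = None
for check, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = check
        break

print(stopped_at)  # 6
```

A single improvement resets the counter, which is why a run whose loss keeps reaching new minima every few evaluations never stops early.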

**Conclusion:**

Early stopping did not significantly alter the results, because training was never interrupted: the test loss kept improving, or recovered within the patience window. Although it brought no notable improvement in this instance, early stopping remains a valuable safeguard against overfitting.