
We will consider five methods, AdaGrad, RMSProp, RMSProp+Nesterov, AdaDelta, Adam, and study their convergence using the CIFAR-10 dataset. We will use a multi-layer neural network model with two fully connected hidden layers with 1000 hidden units each and ReLU activation with a minibatch size of 128.

1. Write the weight update equations for the five adaptive learning rate methods. Explain each term clearly. What are the hyperparameters in each policy? Explain how AdaDelta and Adam are different from RMSProp. (5+1)
2. Train the neural network using all the five methods with L2-regularization for 200 epochs each and plot the training loss vs the number of epochs. Which method performs best (lowest training loss)? (5)
3. Add dropout (probability 0.2 for input layer and 0.5 for hidden layers) and train the neural network again using all the five methods for 200 epochs. Compare the training loss with that in part 2. Which method performs the best? For the five methods, compare their training time (to finish 200 epochs with dropout) to the training time in part 2 (to finish 200 epochs without dropout). (5)
4. Compare test accuracy of the trained model for all the five methods from part 2 and part 3. Note that to calculate test accuracy of the model trained using dropout you need to appropriately scale the weights (by the dropout probability). (4)
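
The scaling mentioned in part 4 can be checked on a toy example. The sketch below is a minimal pure-Python illustration, not part of the assignment code: the helper `expected_dropout_output` and the toy weights and inputs are assumptions made for the example. It enumerates every dropout mask for one linear unit and shows that the average training-time output equals the full output scaled by the keep probability $1-p$. Note that `torch.nn.Dropout` already rescales the kept activations by $1/(1-p)$ during training, so a model built with it only needs `model.eval()` at test time.

```python
from itertools import product

def expected_dropout_output(weights, inputs, drop_prob):
    """Exact mean of sum(w * mask * x) over all dropout masks (hypothetical helper)."""
    keep_prob = 1.0 - drop_prob
    expected = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        # Probability of this particular mask: each unit is kept independently.
        prob = 1.0
        for bit in mask:
            prob *= keep_prob if bit else drop_prob
        output = sum(w * bit * x for w, bit, x in zip(weights, mask, inputs))
        expected += prob * output
    return expected

# Toy values, chosen only for illustration.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
full_output = sum(w * x for w, x in zip(weights, inputs))

for p in (0.2, 0.5):
    mean_with_dropout = expected_dropout_output(weights, inputs, p)
    scaled_at_test = (1.0 - p) * full_output
    print(f"p={p}: mean with dropout={mean_with_dropout:.4f}, scaled output={scaled_at_test:.4f}")
```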

*reference:*

* The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html

## Problem 1
Write the weight update equations for the five adaptive learning rate methods AdaGrad, RMSProp, RMSProp+Nesterov, AdaDelta, Adam. Explain each term clearly. What are the hyperparameters in each policy? Explain how AdaDelta and Adam are different from RMSProp. (5+1)



*   AdaGrad
  
  AdaGrad adapts the learning rate of each parameter using the history of its gradients, which makes it well suited to sparse data such as text. The update is

  $$
  \begin{gathered}
g_{t,i} = \nabla_{\theta} J(\theta_{t,i})\\
\theta_{t+1,i} = \theta_{t,i}-\frac{\eta}{\sqrt{G_{t,ii}+\epsilon}}g_{t,i}
\end{gathered}
  $$
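  Here $g_{t,i}$ is the gradient of the loss $J$ with respect to parameter $\theta_i$ at step $t$, $\eta$ is the global learning rate, and $\epsilon$ is a small constant that avoids division by zero. $G_t$ is a diagonal matrix, and only its diagonal entries are needed: $G_{t,ii}$ is the sum of the squared gradients of $\theta_i$ up to step $t$, i.e. $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$. The hyperparameters are $\eta$ and $\epsilon$.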

*   RMSprop
  
  The formula for this optimizer is shown below

  $$
  \begin{gathered}
E[g^2]_t = \gamma E[g^2]_{t-1}+(1-\gamma) g_t^2\\
\Delta \theta_t = \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t\\
\theta_{t+1} = \theta_t - \Delta \theta_t
\end{gathered}
  $$ 
  where we set $\gamma = 0.9$. $E[g^2]_t$ is an exponentially decaying average of the squared gradients, so only a recent window of gradients effectively contributes instead of the whole history. Accumulating the whole history, as AdaGrad does, makes the denominator grow monotonically, so the effective learning rate keeps shrinking and training can stall before it converges. With exponential decay, recent gradients get a large weight and old ones a vanishing weight: a gradient from 100 steps ago carries a factor of $0.9^{100} \approx 2.7 \times 10^{-5}$. The hyperparameters are $\eta$, $\gamma$, and $\epsilon$.

*   RMSProp+Nesterov

  The formula will actually look like
  
  $$
  \theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}(\beta_1 \hat{m}_t + \frac{(1-\beta_1)g_t}{1-\beta_1^t}) 
  $$

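  Here $\beta_1$ is the momentum decay rate, $g_t$ is the current gradient, and $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates defined in the Adam item below ($\beta_2$ is the decay rate of the squared-gradient average inside $\hat{v}_t$). This is the Nadam form of the update: the denominator is the RMSProp-style root of the averaged squared gradients, and the Nesterov look-ahead appears as the extra $\frac{(1-\beta_1)g_t}{1-\beta_1^t}$ term. The hyperparameters are $\eta$, $\beta_1$, $\beta_2$, and $\epsilon$.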

*   Adadelta

  Adadelta is a variation of RMSprop and uses the same running average of squared gradients. The main difference is that Adadelta does not need a manually set learning rate, because the step size is derived from a running average of past updates. The update is

  $$
  \begin{gathered}
\Delta \theta_t = \frac{RMS[\Delta \theta]_{t-1}}{RMS[g]_t}g_t\\
\theta_{t+1} = \theta_t - \Delta \theta_t
\end{gathered}
  $$ 

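  Here $RMS[g]_t = \sqrt{E[g^2]_t+\epsilon}$, where RMS stands for root mean square, so the RMSprop update can be written as $\Delta \theta_t = \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}}g_t=\frac{\eta}{RMS[g]_t}g_t$. Replacing $\eta$ with $RMS[\Delta \theta]_{t-1} = \sqrt{E[\Delta \theta^2]_{t-1}+\epsilon}$, where $E[\Delta \theta^2]_t = \gamma E[\Delta \theta^2]_{t-1} + (1-\gamma)\Delta \theta_t^2$ is a decaying average of the squared updates, gives the update above. The hyperparameters are $\gamma$ and $\epsilon$.

*   Adam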

  Adam keeps exponentially decaying averages of both the past gradients (similar to momentum) and the past squared gradients:

  $$
  \begin{gathered}
m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t\\
v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2
\end{gathered}
  $$

  Because $m_t$ and $v_t$ are initialized to 0, they are biased toward 0 during the first steps, so we apply a bias correction:

  $$
  \begin{gathered}
\hat{m}_t = \frac{m_t}{1-\beta_1^t}\\
\hat{v}_t = \frac{v_t}{1-\beta_2^t}
\end{gathered}
  $$
  Typical values are $\beta_1 = 0.9$, $\beta_2=0.999$, $\epsilon = 10^{-8}$; together with the learning rate $\eta$ these are the hyperparameters.

  The parameters are then updated with

  $$
  \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t
  $$

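  Adam is essentially RMSprop plus momentum: it divides by the same root of averaged squared gradients, but it replaces the raw gradient $g_t$ with the momentum-like average $\hat{m}_t$ and adds bias correction.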
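
The update rules above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy loss $J(\theta) = \frac{1}{2}\theta^2$, the scalar parameter, the function names, and the default hyperparameter values are all choices made for the example, and the code is not the PyTorch implementation used later in this notebook.

```python
import math

def grad(theta):
    # Gradient of the assumed toy loss J(theta) = 0.5 * theta**2.
    return theta

def adagrad(theta, steps, lr=0.1, eps=1e-8):
    G = 0.0  # sum of all squared gradients so far
    for _ in range(steps):
        g = grad(theta)
        G += g * g
        theta -= lr / math.sqrt(G + eps) * g
    return theta

def rmsprop(theta, steps, lr=0.01, gamma=0.9, eps=1e-8):
    avg_sq_grad = 0.0  # E[g^2]_t, decaying average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * g * g
        theta -= lr / math.sqrt(avg_sq_grad + eps) * g
    return theta

def adadelta(theta, steps, gamma=0.9, eps=1e-6):
    avg_sq_grad = 0.0    # E[g^2]_t
    avg_sq_update = 0.0  # E[delta_theta^2]_t
    for _ in range(steps):
        g = grad(theta)
        avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * g * g
        # RMS of previous updates takes the place of the learning rate.
        delta = math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_update = gamma * avg_sq_update + (1 - gamma) * delta * delta
        theta -= delta
    return theta

def adam(theta, steps, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # decaying average of gradients
    v = 0.0  # decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr / (math.sqrt(v_hat) + eps) * m_hat
    return theta

for name, optimizer in [("AdaGrad", adagrad), ("RMSprop", rmsprop),
                        ("Adadelta", adadelta), ("Adam", adam)]:
    print(f"{name}: theta after 20 steps = {optimizer(1.0, 20):.4f}")
```

Starting from $\theta = 1$, every method moves the parameter toward the minimum at 0. Adadelta takes much smaller early steps here because its step size starts from $\sqrt{\epsilon}$ rather than from a chosen learning rate.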


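The effect of Adam's bias correction is easiest to see with a constant gradient. The following sketch is a worked numeric example, assuming $g_t = 1$ at every step and $\beta_1 = 0.9$; the function name `ema_with_bias_correction` is made up for the illustration.

```python
def ema_with_bias_correction(grads, beta=0.9):
    """Return (t, m_t, m_hat_t) for each step of a zero-initialized moving average."""
    m = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        m_hat = m / (1 - beta ** t)  # divide out the bias toward the zero start
        rows.append((t, m, m_hat))
    return rows

rows = ema_with_bias_correction([1.0, 1.0, 1.0])
for t, m, m_hat in rows:
    print(f"t={t}  m_t={m:.3f}  m_hat_t={m_hat:.3f}")
```

The raw average $m_t$ climbs slowly (0.100, 0.190, 0.271) even though the true gradient is 1, while the corrected $\hat{m}_t$ equals 1 from the first step.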
## Problem 2
Train the neural network using all the five methods with L2-regularization for 200 epochs each and plot the training loss vs the number of epochs. Which method performs best (lowest training loss)? (5)

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import time
import pickle

In [2]:
batch_size = 128
num_workers = 2
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
# Shuffling is not needed for evaluation
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)

classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [3]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 1000)
        self.fc2 = nn.Linear(1000, 1000)
        self.fc3 = nn.Linear(1000, 10)

    def forward(self, x):
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [5]:
net = net.to(device)

In [8]:
# from torchsummary import summary
# summary(net, (3,32,32))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                 [-1, 1000]       3,073,000
            Linear-2                 [-1, 1000]       1,001,000
            Linear-3                   [-1, 10]          10,010
Total params: 4,084,010
Trainable params: 4,084,010
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.02
Params size (MB): 15.58
Estimated Total Size (MB): 15.61
----------------------------------------------------------------
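
The parameter counts in the summary above follow directly from the layer shapes. As a quick sanity check, here is a small sketch that reproduces them by hand; the helper `linear_param_count` is a name introduced for this example, and the size estimate assumes 4-byte float32 parameters.

```python
def linear_param_count(in_features, out_features):
    # One weight per (input, output) pair plus one bias per output unit.
    return in_features * out_features + out_features

# (in_features, out_features) for fc1, fc2, fc3 of the Net class above.
layer_shapes = [(32 * 32 * 3, 1000), (1000, 1000), (1000, 10)]

counts = [linear_param_count(i, o) for i, o in layer_shapes]
total = sum(counts)
size_mb = total * 4 / 1024 ** 2  # assuming float32 (4 bytes per parameter)

print(counts)
print(f"Total params: {total:,}")
print(f"Params size (MB): {size_mb:.2f}")
```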


In [6]:
loss_fn = nn.CrossEntropyLoss()
weight_decay = 5e-4
optimizer_adagrad = optim.Adagrad(net.parameters(), lr = 0.001, weight_decay=weight_decay)
optimizer_rmsporp = optim.RMSprop(net.parameters(), lr = 0.001, weight_decay=weight_decay)
# RMSProp+Nesterov is not set up yet (optim.RMSprop's momentum option is classical momentum, not Nesterov)
optimizer_adadelta = optim.Adadelta(net.parameters(), weight_decay=weight_decay)
optimizer_adam = optim.Adam(net.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=weight_decay)

In [7]:
def train(epoch, model, optimizer, train_loss_history):
  print('\nEpoch: %d' % epoch)
  model.train()
  loss_avg = 0
  count = 0
  correct = 0
  total = 0
  for batch_idx, (inputs, targets) in enumerate(train_loader):
    count += 1
    inputs, targets = inputs.to(device), targets.to(device) 
    optimizer.zero_grad()
    outputs = model(inputs)  # use the model passed in, not the global net
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()
    loss_avg += loss.item()

    _, predicted = outputs.max(1)
    total += targets.size(0)
    correct += predicted.eq(targets).sum().item()  
    print("\nThe batch index: {0:d}, len of train loader: {1:d}, Loss: {2:.3f}, acc: {3:.3f}".format(batch_idx,
                                                                                             len(train_loader),
                                                                                             loss_avg / (batch_idx + 1),
                                                                                             100. * correct / total))
  train_loss_history.append(loss_avg / count)

In [8]:
adagrad_train_loss_history = []
rmsprop_train_loss_history = []
epoch = 200
for epo in range(epoch):
  train(epo, net, optimizer_rmsporp, rmsprop_train_loss_history)

Streaming output truncated to the last 5000 lines.

The batch index: 243, len of train loader: 391, Loss: 1.339, acc: 52.350

The batch index: 244, len of train loader: 391, Loss: 1.339, acc: 52.353

The batch index: 245, len of train loader: 391, Loss: 1.339, acc: 52.356

The batch index: 246, len of train loader: 391, Loss: 1.340, acc: 52.344

The batch index: 247, len of train loader: 391, Loss: 1.340, acc: 52.315

The batch index: 248, len of train loader: 391, Loss: 1.341, acc: 52.303

The batch index: 249, len of train loader: 391, Loss: 1.341, acc: 52.300

The batch index: 250, len of train loader: 391, Loss: 1.341, acc: 52.285

The batch index: 251, len of train loader: 391, Loss: 1.342, acc: 52.257

The batch index: 252, len of train loader: 391, Loss: 1.342, acc: 52.267

The batch index: 253, len of train loader: 391, Loss: 1.342, acc: 52.267

The batch index: 254, len of train loader: 391, Loss: 1.343, acc: 52.215

The batch index: 255, len of train loader: 391

In [9]:
torch.save(net, 'rmsprop_net.pkl')
torch.save(rmsprop_train_loss_history, 'rmsprop_train_loss_history.pkl')

In [11]:
print(rmsprop_train_loss_history)

[2.6178930287470905, 1.6964890740411667, 1.6308940865499588, 1.5868664181140988, 1.5593042132799582, 1.5371854274779024, 1.5202058556744509, 1.5071852265111625, 1.4992921937762014, 1.4850869989761002, 1.4810270619819232, 1.4735902069169846, 1.4637062153243043, 1.4532322289083925, 1.4529647473484049, 1.440494799552976, 1.4376024830981593, 1.4361396835893012, 1.4387750433534003, 1.425574626154302, 1.4248188282827587, 1.4241514876675423, 1.4174633471252363, 1.4145319544140944, 1.4149938657155732, 1.4127250818340369, 1.4066424546644205, 1.4111666057420813, 1.4070045481557432, 1.4055171616546942, 1.4016165940657905, 1.401329641147038, 1.4000237765519514, 1.3992038065820094, 1.3954711192099334, 1.3960973289616578, 1.3982341304764418, 1.392111115443432, 1.3890141263947158, 1.3880148037620212, 1.3908229334580013, 1.3853045259900105, 1.3864526788292029, 1.3846146670143928, 1.3846772403058494, 1.378568775818476, 1.378658361447132, 1.382774670715527, 1.3780051820418413, 1.3792196030507002, 1.3771

In [12]:
print("the running time for adagrad_net, 200 epoch :4541.966s")
print("the running time for rmsprop_net, 200 epoch :4600.63s")

the running time for adagrad_net, 200 epoch :4541.966s
the running time for rmsprop_net, 200 epoch :4600.63s
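
The running times above are typed in by hand. One way to measure them directly is to wrap the epoch loop in a timer, as in the sketch below. The helper `timed_epochs` and the stand-in epoch function are assumptions for illustration; in this notebook the callable would wrap the `train` function. On a GPU, CUDA kernels run asynchronously, so calling `torch.cuda.synchronize()` before reading the clock gives a more faithful number.

```python
import time

def timed_epochs(train_one_epoch, num_epochs):
    """Call train_one_epoch(epoch) num_epochs times and return elapsed seconds."""
    start = time.perf_counter()
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
    return time.perf_counter() - start

# Stand-in for a real training epoch, used only to make the sketch runnable.
completed_epochs = []
elapsed = timed_epochs(lambda epoch: completed_epochs.append(epoch), 3)
print(f"the running time for {len(completed_epochs)} epochs: {elapsed:.3f}s")
```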


In [13]:
net2 = torch.load('rmsprop_net.pkl')

In [14]:
def test(model):
  model.eval()
  correct = 0
  total = 0
  with torch.no_grad():  # gradients are not needed for evaluation
    for inputs, targets in test_loader:
      inputs, targets = inputs.to(device), targets.to(device)
      outputs = model(inputs)  # evaluate the model passed in, not the global net
      _, predicted = outputs.max(1)
      total += targets.size(0)
      correct += predicted.eq(targets).sum().item()
  print("The test accuracy is: {0:.3f}".format(100. * correct / total))

In [15]:
# net2 was loaded from rmsprop_net.pkl above
test(net2)

The test accuracy is: 45.680
