# Homework 4

## Objective: Get the lowest loss on a CIFAR10 Classifier

Having completed the Deep Learning course, I have learned a variety of techniques for solving different problems with classical deep learning. For this homework, I decided that the two tools most likely to lower my loss on the CIFAR10 image classification task are Batch/Layer Normalization and ResNets. I therefore first focused on improving the best model from Homework 2 by introducing normalization and trying different model architectures. Finally, I showcase the ResNet I built to see whether it can beat the optimized model from Homework 2.

### Route 1: Optimization of HW2 Model

#### Initial Model Architecture

To start, I took a look at the model from the last homework that worked best for this task. Before any changes, this model had 4 CNN layers with channels (30 -> 64 -> 128 -> 256), kernel sizes (5x5 -> 3x3 -> 3x3 -> 3x3), and a constant stride of 1, followed by a Max Pooling layer with kernel 2x2 and stride 2, and finally a classification head with linear layers (1000 -> 500 -> 250 -> 10). Dropout was placed between layers wherever possible. The model was not using any form of normalization, so I added batch normalization between the CNN layers and layer normalization between the linear layers. I did this not just because normalization was a new tool, but also because it proved effective for the models in Homework 3.
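
To make the difference between the two normalization layers concrete, here is a minimal pure-Python sketch. It is not the PyTorch implementation: it leaves out the learnable scale and shift and the running statistics, and the helper names (`normalize`, `batch_norm`, `layer_norm`) and the toy batch are made up for illustration. Batch normalization standardizes each feature across the samples in a batch, while layer normalization standardizes each sample across its own features.

```python
from statistics import fmean, pvariance


def normalize(values, eps=1e-5):
    """Standardize a sequence of numbers to roughly zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(batch):
    """Normalize every feature (column) across the samples of the batch."""
    columns = [normalize(column) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize every sample (row) across its own features."""
    return [normalize(row) for row in batch]


# Toy batch (hypothetical values): 3 samples (rows) with 3 features (columns) each.
batch = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

bn = batch_norm(batch)
ln = layer_norm(batch)

for row in bn:
    print("batch norm:", [round(v, 3) for v in row])
for row in ln:
    print("layer norm:", [round(v, 3) for v in row])
```

For image tensors, `nn.BatchNorm2d` applies the same idea per channel, pooling the statistics over the batch and both spatial dimensions.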

#### Model Training

For the first training loop, I used the hyperparameters kept for the original model in Homework 2, since the new architecture is very similar to the old one. For reference, the hyperparameters used to train the new model are...

---

batch_size = 64

learning_rate = 5e-4

decay_rate = 4e-4

c_dropout = 0.30

f_dropout = 0.30

---

After training the model, I saw a large improvement: training and testing losses of 0.6250 and 0.6327 at the 20th epoch, compared with ~0.80 for both with the architecture from Homework 2. From here, I tried lowering the learning rate to 1e-4, but the model not only took longer to reach its best training and testing loss (around the 35th epoch), it also only reached a loss of ~0.64. Adjusting the dropout was not successful either: the model took longer to train and never got below ~0.70.

#### Second Model Architecture(s)

After exhausting my options for the first model, I decided to try making the model more complex instead. I noticed that the CNN layers expanded the number of channels in a reverse funnel, but I had never tried simply keeping the channel count constant. Originally, I used a reverse funnel because I wanted the model to learn a few base features in the first layer and then use those to create more and more complex features in subsequent layers. However, if I kept the channel count the same, I could achieve a similar effect: instead of repeatedly expanding into more and more features, the model would build up complex features in a more controlled way. As such, I set the number of channels to 64 for all four CNN layers.

So the architecture for this model is now...

4x CNN layers (64 -> 64 -> 64 -> 64) with kernels 3x3 and stride 1 into a Max Pool of kernel 2x2 and stride 2 finally into a Linear Classifier (1000 -> 500 -> 250 -> 10).

You might also notice that this section's title ends in "Architecture(s)." This is because I experimented with several constant widths for the CNN layers, as explained in the next section.

#### Second Model Training

Once again, I kept the hyperparameters listed previously since they seemed to work best for the current model. By the 62nd epoch, the model reached a new training and testing loss of ~0.60. Since the hyperparameters were best left alone, the only way forward was to adjust the width of the CNN layers. Going from 64 to 128 channels per layer lowered the training and testing loss to ~0.57. Going further to a constant 256 channels gave a training and testing loss of ~0.50. So increasing the channels in this way lets the model reach lower losses up to a point, at the cost of longer training times. For one final push, I tried 512 channels, but not only did it take a while to train, the lowest loss it reached was only ~0.54, at the 17th epoch.
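
To give a rough sense of why wider layers train more slowly, the sketch below counts the convolution parameters for each constant width tried above. It assumes four 3x3 convolutions with biases on a 3-channel input and ignores the batch-norm and classifier parameters; the helper names `conv_params` and `stack_params` are illustrative only.

```python
def conv_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of one 2D convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def stack_params(width, num_layers=4, image_channels=3):
    """Parameters in `num_layers` 3x3 convolutions that all output `width` channels."""
    total = conv_params(image_channels, width)              # first layer reads the RGB image
    total += (num_layers - 1) * conv_params(width, width)   # remaining layers keep the width
    return total


for width in (64, 128, 256, 512):
    print(f"{width:>3} channels -> {stack_params(width):>9,} conv parameters")
```

Each doubling of the width roughly quadruples the convolution parameters, which is consistent with the longer training times noted above.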

#### Results

Refer to Figure 1 for the loss graph.

---

Final Accuracy:

Final Training Loss:

Final Testing Loss:

Best Model Architecture: 4x CNN Layers (256 -> 256 -> 256 -> 256) of kernels 3x3 and stride 1 into MaxPool Layer of kernel 2x2 and stride 2 into Linear Classifier with layers (1000 -> 500 -> 250 -> 100 -> 10)

---

batch_size = 64

learning_rate = 1e-4

decay_rate = 4e-4

c_dropout = 0.30

f_dropout = 0.30

### Route 2: ResNet Classifier

#### Initial Model Architecture

To start, I decided to build the ResNet blocks purely out of CNN layers. The original ResNet paper, which applied ResNets to image classification, used a similar architecture, so I wanted to see whether a network like it would be beneficial here.


In [51]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils import data
from tqdm import tqdm

In [2]:
# Grab the MNIST dataset (not used by the CIFAR10 experiments below)
training_set = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
testing_set = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())

# Grab the CIFAR10 dataset; ToTensor scales pixel values to [0, 1]
tfm = transforms.Compose([
    transforms.ToTensor(),
])

trainset_full_CIFAR10 = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=tfm)
testset_full_CIFAR10  = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=tfm)

In [3]:
# Verify that GPU is connected and available

print(torch.__version__)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

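# Only query the GPU name when CUDA is actually available
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU found, using CPU')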

2.9.1+cu128
NVIDIA GeForce RTX 4080


In [65]:
class CIFAR10_Classifier(nn.Module):
    def __init__(self, C_dropout, F_dropout):
        super().__init__()

        conv2d_dropout = C_dropout

        conv_layer_1 = 512
        conv_layer_2 = 512

        conv_layer_3 = 512
        conv_layer_4 = 512

        self.forward_funnel_1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=conv_layer_1, kernel_size=3),   # Extract useful features from the beginning
            nn.BatchNorm2d(num_features=conv_layer_1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(conv2d_dropout),

            nn.Conv2d(in_channels=conv_layer_1, out_channels=conv_layer_2, kernel_size=3),  # Extract useful features from the learned features
            nn.BatchNorm2d(num_features=conv_layer_2),
            nn.ReLU(inplace=True),
            nn.Dropout2d(conv2d_dropout),
            nn.MaxPool2d(kernel_size=2, stride=2),                       # Reduce dimensionality
        )

        self.forward_funnel_2 = nn.Sequential(
            nn.Conv2d(in_channels=conv_layer_2, out_channels=conv_layer_3, kernel_size=3),   # Extract features from the pooled output of the first funnel
            nn.BatchNorm2d(num_features=conv_layer_3),
            nn.ReLU(inplace=True),
            nn.Dropout2d(conv2d_dropout),

            nn.Conv2d(in_channels=conv_layer_3, out_channels=conv_layer_4, kernel_size=3),  # Extract useful features from the learned features
            nn.BatchNorm2d(num_features=conv_layer_4),
            nn.ReLU(inplace=True),
            nn.Dropout2d(conv2d_dropout),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Compute the number of features after the input has passed the funnel
        with torch.no_grad():
            test_input = torch.zeros(1, 3, 32, 32)

            # The layers are still on the CPU at this point, so the dummy input stays on the CPU too

            features = self.forward_funnel_1(test_input)
            features = self.forward_funnel_2(features)

            total_count = features.view(1, -1).size(1)

        full_node_dropout = F_dropout


        lin_layer_1_size = 1000
        lin_layer_2_size = 500
        lin_layer_3_size = 250

        self.output_nodes = 100

        self.classifer = nn.Sequential(
            nn.Flatten(),                                           # Flatten the image from the funnel
            nn.Linear(in_features=total_count, out_features=lin_layer_1_size),
            nn.LayerNorm(lin_layer_1_size),
            nn.ReLU(inplace=True),
            nn.Dropout(full_node_dropout),

            nn.Linear(in_features=lin_layer_1_size, out_features=lin_layer_2_size),
            nn.LayerNorm(lin_layer_2_size),
            nn.ReLU(inplace=True),
            nn.Dropout(full_node_dropout),

            nn.Linear(in_features=lin_layer_2_size, out_features=lin_layer_3_size),
            nn.LayerNorm(lin_layer_3_size),
            nn.ReLU(inplace=True),
            nn.Dropout(full_node_dropout),

            nn.Linear(in_features=lin_layer_3_size, out_features=self.output_nodes),
            nn.LayerNorm(self.output_nodes),
            nn.ReLU(inplace=True),
            nn.Dropout(full_node_dropout),
        )

        self.output_layer = nn.Linear(in_features=self.output_nodes, out_features=10)

    def partial_forward(self, x):
        x = self.forward_funnel_1(x)
        x = self.forward_funnel_2(x)
        x = self.classifer(x)

        return x

    def forward(self, x):
        x = self.partial_forward(x)
        logits = self.output_layer(x)

        return logits
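
The dummy forward pass in the constructor measures the flattened feature count at run time. As a cross-check, the same number can be worked out by hand. The sketch below assumes the `nn.Conv2d` defaults used in the cell above (stride 1, no padding); `conv_out` is a hypothetical helper, not part of the notebook.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                            # CIFAR10 images are 32x32
for funnel in range(2):                              # forward_funnel_1, then forward_funnel_2
    size = conv_out(size, kernel_size=3)             # first 3x3 convolution, no padding
    size = conv_out(size, kernel_size=3)             # second 3x3 convolution, no padding
    size = conv_out(size, kernel_size=2, stride=2)   # 2x2 max pool with stride 2
    print(f"after funnel {funnel + 1}: {size}x{size}")

channels = 512                                       # conv_layer_4 in the cell above
flattened = channels * size * size
print("features entering the first linear layer:", flattened)
```

With 512 channels this gives a 5x5 feature map, so the first linear layer receives 12800 inputs.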

In [66]:
epoch_over_training_loss_CIFAR10 = []
epoch_over_testing_loss_CIFAR10 = []

In [67]:
# Hyperparameter setup
epochs = 100
batch_size = 64
learning_rate = 1e-4
decay_rate = 4e-4

c_dropout = 0.30
f_dropout = 0.30

print('######## Begining training for CIFAR10 classifier ##########')

# Setup data loaders
trainset_loader_CIFAR10 = data.DataLoader(trainset_full_CIFAR10,
                                   batch_size=batch_size,
                                   shuffle=True,
                                   num_workers=8,
                                   pin_memory=True)

testset_loader_CIFAR10 = data.DataLoader(testset_full_CIFAR10,
                                   batch_size=batch_size,
                                   num_workers=8,
                                   shuffle=False,
                                   pin_memory=True)

model = CIFAR10_Classifier(c_dropout, f_dropout)
model.to(device)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(),
                       lr=learning_rate,
                       weight_decay=decay_rate
                       )

# Have references to variables outside of the epoch loop
avg_training_loss = 0
avg_testing_loss = 0

# Epoch Loop
for epoch in range(epochs):
    print(f'----- Epoch: {epoch + 1}/{epochs} -----')

    avg_training_loss = 0
    avg_testing_loss = 0

    model.train()

    for x, Y in tqdm(trainset_loader_CIFAR10, desc='Training', unit=' batch'):
        # Transfer images to GPU
        x = x.to(device)
        Y = Y.to(device)

        # Zero out gradients
        optimizer.zero_grad()

        # Send images to model
        x_pred = model(x)

        # Calc loss
        loss = loss_function(x_pred, Y)

        # Calc gradient and update weights
        loss.backward()
        optimizer.step()

        # loss.item() returns a plain Python float, so no torch.no_grad() block is needed
        avg_training_loss += loss.item()

    # Switch to eval mode
    model.eval()

    with torch.no_grad():
        for x, Y in tqdm(testset_loader_CIFAR10, desc='Testing', unit=' batches'):
            # Move the images to the GPU
            x = x.to(device)
            Y = Y.to(device)

            # Get logits and sum up total loss
            x_pred = model(x)
            avg_testing_loss += loss_function(x_pred, Y).item()

    # Get training loss
    avg_training_loss /= len(trainset_loader_CIFAR10)

    # Get testing loss
    avg_testing_loss /= len(testset_loader_CIFAR10)

    # Switch model back to training mode
    model.train()

    epoch_over_training_loss_CIFAR10.append({
        "epoch": epoch,
        "training_loss": avg_training_loss
        })

    epoch_over_testing_loss_CIFAR10.append({
        "epoch": epoch,
        "testing_loss": avg_testing_loss
        })


    print("")

    print(f'   -> Training Loss: {avg_training_loss: .4f}\n')
    print(f'   -> Testing Loss: {avg_testing_loss: .4f}\n')


######## Begining training for CIFAR10 classifier ##########
----- Epoch: 1/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.52 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 84.68 batches/s]



   -> Training Loss:  1.7892

   -> Testing Loss:  1.3688

----- Epoch: 2/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.53 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.57 batches/s]



   -> Training Loss:  1.4115

   -> Testing Loss:  1.1577

----- Epoch: 3/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.63 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.21 batches/s]



   -> Training Loss:  1.2225

   -> Testing Loss:  0.9990

----- Epoch: 4/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 84.88 batches/s]



   -> Training Loss:  1.0942

   -> Testing Loss:  0.9022

----- Epoch: 5/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 84.84 batches/s]



   -> Training Loss:  0.9909

   -> Testing Loss:  0.7971

----- Epoch: 6/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.62 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.25 batches/s]



   -> Training Loss:  0.9201

   -> Testing Loss:  0.7562

----- Epoch: 7/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.55 batches/s]



   -> Training Loss:  0.8580

   -> Testing Loss:  0.7360

----- Epoch: 8/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.63 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.18 batches/s]



   -> Training Loss:  0.8178

   -> Testing Loss:  0.6992

----- Epoch: 9/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.63 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.18 batches/s]



   -> Training Loss:  0.7738

   -> Testing Loss:  0.6415

----- Epoch: 10/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.62 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.10 batches/s]



   -> Training Loss:  0.7321

   -> Testing Loss:  0.6350

----- Epoch: 11/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.62 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.25 batches/s]



   -> Training Loss:  0.6978

   -> Testing Loss:  0.6261

----- Epoch: 12/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.46 batches/s]



   -> Training Loss:  0.6731

   -> Testing Loss:  0.6068

----- Epoch: 13/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.09 batches/s]



   -> Training Loss:  0.6434

   -> Testing Loss:  0.5839

----- Epoch: 14/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.64 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.31 batches/s]



   -> Training Loss:  0.6163

   -> Testing Loss:  0.5604

----- Epoch: 15/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.63 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.42 batches/s]



   -> Training Loss:  0.5885

   -> Testing Loss:  0.5866

----- Epoch: 16/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.66 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.39 batches/s]



   -> Training Loss:  0.5666

   -> Testing Loss:  0.5630

----- Epoch: 17/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.59 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 85.16 batches/s]



   -> Training Loss:  0.5440

   -> Testing Loss:  0.5432

----- Epoch: 18/100 -----


Training: 100%|██████████| 782/782 [00:30<00:00, 25.42 batch/s]
Testing: 100%|██████████| 157/157 [00:01<00:00, 83.05 batches/s]



   -> Training Loss:  0.5216

   -> Testing Loss:  0.5575

----- Epoch: 19/100 -----


Training:  13%|█▎        | 99/782 [00:04<00:31, 21.97 batch/s]


KeyboardInterrupt: 
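
The progress bars above report 782 training and 157 testing batches per epoch. The sketch below reproduces those counts from the CIFAR10 split sizes (50,000 and 10,000 images) and a batch size of 64, assuming the `DataLoader` default of keeping the last, smaller batch. It also shows why dividing the summed per-batch losses by the number of batches, as the loop does, is only approximately the per-sample mean: the final batch of 16 images counts as much as a full batch of 64. The helper names and the three example losses are made up for illustration.

```python
import math


def num_batches(num_samples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


def weighted_mean_loss(batch_losses, batch_sizes):
    """Per-sample mean loss computed from per-batch mean losses."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


train_batches = num_batches(50_000, 64)   # CIFAR10 training split
test_batches = num_batches(10_000, 64)    # CIFAR10 test split
print(train_batches, test_batches)

# Made-up losses for two full batches and one short final batch.
losses = [0.50, 0.60, 1.00]
sizes = [64, 64, 16]
print("mean of batch means:", sum(losses) / len(losses))
print("per-sample mean:    ", weighted_mean_loss(losses, sizes))
```

With hundreds of batches per epoch the difference is small, so the reported averages remain a reasonable summary.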

In [31]:
class CIFAR_ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super(CIFAR_ResNet, self).__init__()

        self.in_planes = 16  # Reduced initial channel size (Standard ResNet uses 64)

        # Initial Conv Layer (Key modification for CIFAR 32x32 input)
        # Use 3x3 Conv with stride=1 to avoid downsampling too quickly
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)

        # ResNet Stages (Downsampling happens via stride=2 in the *first* block of each stage)
        self.layer1 = self._make_layer(block, 16, num_blocks[0], stride=1) # No downsampling here (32x32 -> 32x32)
        self.layer2 = self._make_layer(block, 32, num_blocks[1], stride=2) # Downsample to 16x16
        self.layer3 = self._make_layer(block, 64, num_blocks[2], stride=2) # Downsample to 8x8

        # Final Classification Layer
        self.linear = nn.Linear(64 * block.expansion, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        """Creates one stage (group of residual blocks)"""
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []

        for stride_val in strides:
            layers.append(block(self.in_planes, planes, stride_val))
            self.in_planes = planes * block.expansion # Update in_planes for the next block

        return nn.Sequential(*layers)

    def forward(self, x):
        # Initial Block
        out = F.relu(self.bn1(self.conv1(x)))

        # Residual Stages
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)

        # Global Average Pooling (8x8 -> 1x1)
        out = F.avg_pool2d(out, out.size(3))

        # Final Classification
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

In [84]:
class CNN_Block(nn.Module):

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()

        # self.cnn_dropout = cnn_dropout

        self.skip = nn.Sequential()

        self.conv_block = nn.Sequential(
            nn.Conv2d(in_channels=in_planes, out_channels=planes, kernel_size=3, stride=stride, padding=1, bias=False),
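            nn.BatchNorm2d(planes),
            # Note: there is no ReLU between the two convolutions here,
            # unlike the basic block in the original ResNet paper
            nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False),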
            nn.BatchNorm2d(planes),
        )

        if stride != 1 or in_planes != planes:
            # Use a 1x1 convolution to match the dimensions (channels and spatial size)
            self.skip = nn.Sequential(
                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes)
            )

    def forward(self, x):

        # Pass the input through the block
        logits = self.conv_block(x)

        # Add the skip connection (identity, or the 1x1 projection when shapes differ)
        logits += self.skip(x)

        # Activation Function
        logits = F.relu(logits)

        return logits


class CIFAR10_ResNet(nn.Module):
    def __init__(self, num_blocks:list, num_classes=10, linear_dropout=0.25):
        super().__init__()

        # Initial size of the CNN layer that accepts the image
        # Also used when creating stages of blocks self.stage_layer
        self.in_planes = 16

        self.image_input_layer = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=self.in_planes, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(self.in_planes),
            nn.ReLU(inplace=True)
        )

        range_beginning = 16
        range_end = 48
        range_step = range_beginning

        self.cnn_plane_range = range(range_beginning, range_end + 1, range_step)

        self.stages = nn.Sequential(
            *[self.make_stage(planes, num_block, 1 if idx == 0 else 2)
                for idx, (planes, num_block) in enumerate(zip(self.cnn_plane_range, num_blocks))]
        )

        self.lin_layer_1_size = 2000
        self.lin_layer_2_size = 1500
        self.lin_layer_3_size = 1000
        self.lin_layer_4_size = 500


        self.classifier = nn.Sequential(
            nn.Linear(range_end, self.lin_layer_1_size),
            # nn.LayerNorm(self.lin_layer_1_size),
            nn.ReLU(inplace=True),
            nn.Dropout(linear_dropout),

            nn.Linear(self.lin_layer_1_size, self.lin_layer_2_size),
            # nn.LayerNorm(self.lin_layer_2_size),
            nn.ReLU(inplace=True),
            nn.Dropout(linear_dropout),

            nn.Linear(self.lin_layer_2_size, self.lin_layer_3_size),
            # nn.LayerNorm(self.lin_layer_3_size),
            nn.ReLU(inplace=True),
            nn.Dropout(linear_dropout),

            nn.Linear(self.lin_layer_3_size, self.lin_layer_4_size),
            nn.LayerNorm(self.lin_layer_4_size),
            nn.ReLU(inplace=True),
            nn.Dropout(linear_dropout),

            nn.Linear(self.lin_layer_4_size, num_classes)
        )


    def make_stage(self, planes, num_blocks, stride):

        strides = [stride] + [1] * (num_blocks - 1)

        layers = []

        for stride in strides:
            # Add ResBlock to list
            layers.append(CNN_Block(self.in_planes, planes, stride))
            # Update in_planes so the next block's in_channels match this block's output
            self.in_planes = planes

        return nn.Sequential(*layers)

    def forward(self, x):

        logits = self.image_input_layer(x)

        logits = self.stages(logits)

        logits = F.avg_pool2d(logits, 8)

        logits = logits.view(logits.size(0), -1)

        logits = self.classifier(logits)

        return logits
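
To see how `make_stage` lays out the network for `num_blocks=[5, 5, 5]`, the sketch below walks through the same bookkeeping without building any layers. It assumes the settings in the class above (3x3 convolutions with padding 1, plane widths 16, 32, 48, and stride 2 in the first block of every stage after the first); `plan_stages` is a hypothetical helper written only for this illustration.

```python
def plan_stages(num_blocks, planes_per_stage, in_planes=16, size=32):
    """List (in_planes, planes, stride, output_size, needs_projection) for each block."""
    plan = []
    for stage_idx, (planes, blocks) in enumerate(zip(planes_per_stage, num_blocks)):
        first_stride = 1 if stage_idx == 0 else 2      # only later stages downsample
        for stride in [first_stride] + [1] * (blocks - 1):
            size = (size + 2 * 1 - 3) // stride + 1    # 3x3 convolution with padding 1
            needs_projection = stride != 1 or in_planes != planes
            plan.append((in_planes, planes, stride, size, needs_projection))
            in_planes = planes                         # next block consumes this block's output
    return plan


plan = plan_stages(num_blocks=[5, 5, 5], planes_per_stage=range(16, 49, 16))
for in_planes, planes, stride, size, needs_projection in plan:
    skip = "1x1 projection" if needs_projection else "identity"
    print(f"{in_planes:>2} -> {planes:>2} channels, stride {stride}, {size}x{size} output, {skip} skip")
```

The last block outputs 48 channels at 8x8, which is why the forward pass can average-pool with a window of 8 and feed 48 features into the classifier.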



In [86]:
# Hyperparameter setup
epochs = 100
batch_size = 64
learning_rate = 1e-3
decay_rate = 1e-4

f_dropout = 0.5

print('######## Begining training for CIFAR10 ResNet classifier ##########')

# Setup data loaders
trainset_loader_CIFAR10 = data.DataLoader(trainset_full_CIFAR10,
                                   batch_size=batch_size,
                                   shuffle=True,
                                   num_workers=8,
                                   pin_memory=True)

testset_loader_CIFAR10 = data.DataLoader(testset_full_CIFAR10,
                                   batch_size=batch_size,
                                   num_workers=8,
                                   shuffle=False,
                                   pin_memory=True)

model = CIFAR10_ResNet(num_blocks=[5, 5, 5], num_classes=10, linear_dropout=f_dropout)
model.to(device)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(),
                       lr=learning_rate,
                       weight_decay=decay_rate
                       )

# Have references to variables outside of the epoch loop
avg_training_loss = 0
avg_testing_loss = 0

# Epoch Loop
for epoch in range(epochs):
    print(f'----- Epoch: {epoch + 1}/{epochs} -----')

    avg_training_loss = 0
    avg_testing_loss = 0

    model.train()

    for x, Y in tqdm(trainset_loader_CIFAR10, desc='Training', unit=' batch'):
        # Transfer images to GPU
        x = x.to(device)
        Y = Y.to(device)

        # Zero out gradients
        optimizer.zero_grad()

        # Send images to model
        x_pred = model(x)

        # Calc loss
        loss = loss_function(x_pred, Y)

        # Calc gradient and update weights
        loss.backward()
        optimizer.step()

        # loss.item() returns a plain Python float, so no torch.no_grad() block is needed
        avg_training_loss += loss.item()

    # Switch to eval mode
    model.eval()

    with torch.no_grad():
        for x, Y in tqdm(testset_loader_CIFAR10, desc='Testing', unit=' batches'):
            # Move the images to the GPU
            x = x.to(device)
            Y = Y.to(device)

            # Get logits and sum up total loss
            x_pred = model(x)
            avg_testing_loss += loss_function(x_pred, Y).item()

    # Get training loss
    avg_training_loss /= len(trainset_loader_CIFAR10)

    # Get testing loss
    avg_testing_loss /= len(testset_loader_CIFAR10)

    # Switch model back to training mode
    model.train()

    epoch_over_training_loss_CIFAR10.append({
        "epoch": epoch,
        "training_loss": avg_training_loss
        })

    epoch_over_testing_loss_CIFAR10.append({
        "epoch": epoch,
        "testing_loss": avg_testing_loss
        })


    print("")

    print(f'   -> Training Loss: {avg_training_loss: .4f}\n')
    print(f'   -> Testing Loss: {avg_testing_loss: .4f}\n')


######## Begining training for CIFAR10 ResNet classifier ##########
----- Epoch: 1/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.21 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 231.79 batches/s]



   -> Training Loss:  1.9183

   -> Testing Loss:  1.8173

----- Epoch: 2/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.62 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 240.04 batches/s]



   -> Training Loss:  1.6320

   -> Testing Loss:  1.5756

----- Epoch: 3/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 104.68 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 242.36 batches/s]



   -> Training Loss:  1.4272

   -> Testing Loss:  1.3211

----- Epoch: 4/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 102.10 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 237.67 batches/s]



   -> Training Loss:  1.2804

   -> Testing Loss:  1.2256

----- Epoch: 5/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 107.34 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 231.58 batches/s]



   -> Training Loss:  1.1400

   -> Testing Loss:  1.0170

----- Epoch: 6/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.64 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 232.62 batches/s]



   -> Training Loss:  1.0144

   -> Testing Loss:  0.9493

----- Epoch: 7/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 104.84 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 236.51 batches/s]



   -> Training Loss:  0.9408

   -> Testing Loss:  0.9307

----- Epoch: 8/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 107.20 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 237.59 batches/s]



   -> Training Loss:  0.8728

   -> Testing Loss:  0.8532

----- Epoch: 9/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.70 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 224.45 batches/s]



   -> Training Loss:  0.8196

   -> Testing Loss:  0.8461

----- Epoch: 10/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.73 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 229.68 batches/s]



   -> Training Loss:  0.7807

   -> Testing Loss:  0.9656

----- Epoch: 11/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 106.05 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 233.38 batches/s]



   -> Training Loss:  0.7423

   -> Testing Loss:  0.7645

----- Epoch: 12/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 103.58 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 241.85 batches/s]



   -> Training Loss:  0.7098

   -> Testing Loss:  0.7758

----- Epoch: 13/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 107.33 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 236.47 batches/s]



   -> Training Loss:  0.6807

   -> Testing Loss:  0.7530

----- Epoch: 14/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 107.03 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 229.81 batches/s]



   -> Training Loss:  0.6518

   -> Testing Loss:  0.6991

----- Epoch: 15/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 110.91 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 236.89 batches/s]



   -> Training Loss:  0.6288

   -> Testing Loss:  0.7431

----- Epoch: 16/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 103.29 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 236.95 batches/s]



   -> Training Loss:  0.6078

   -> Testing Loss:  0.6668

----- Epoch: 17/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 104.60 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 233.51 batches/s]



   -> Training Loss:  0.5779

   -> Testing Loss:  0.7086

----- Epoch: 18/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 102.92 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 239.18 batches/s]



   -> Training Loss:  0.5612

   -> Testing Loss:  0.6906

----- Epoch: 19/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 108.24 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 237.31 batches/s]



   -> Training Loss:  0.5474

   -> Testing Loss:  0.6796

----- Epoch: 20/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 105.31 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 234.10 batches/s]



   -> Training Loss:  0.5278

   -> Testing Loss:  0.6564

----- Epoch: 21/100 -----


Training: 100%|██████████| 782/782 [00:07<00:00, 105.22 batch/s]
Testing: 100%|██████████| 157/157 [00:00<00:00, 236.73 batches/s]



   -> Training Loss:  0.5046

   -> Testing Loss:  0.6604

----- Epoch: 22/100 -----


Training:   7%|▋         | 58/782 [00:00<00:08, 84.51 batch/s]


KeyboardInterrupt: 
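
As a rough way to compare the two logged runs, the sketch below picks the epoch with the lowest testing loss from each. The loss values are copied from the training output above; `best_epoch` is a hypothetical helper. Both runs were interrupted early, so this only compares the epochs that were actually logged.

```python
# Testing losses copied from the logs above (512-channel CNN, then ResNet).
wide_cnn_test = [1.3688, 1.1577, 0.9990, 0.9022, 0.7971, 0.7562, 0.7360, 0.6992, 0.6415,
                 0.6350, 0.6261, 0.6068, 0.5839, 0.5604, 0.5866, 0.5630, 0.5432, 0.5575]
resnet_test = [1.8173, 1.5756, 1.3211, 1.2256, 1.0170, 0.9493, 0.9307, 0.8532, 0.8461,
               0.9656, 0.7645, 0.7758, 0.7530, 0.6991, 0.7431, 0.6668, 0.7086, 0.6906,
               0.6796, 0.6564, 0.6604]


def best_epoch(losses):
    """Return (1-based epoch, loss) for the lowest loss in the list."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index + 1, losses[index]


for name, losses in (("512-channel CNN", wide_cnn_test), ("ResNet", resnet_test)):
    epoch, loss = best_epoch(losses)
    print(f"{name}: best testing loss {loss:.4f} at epoch {epoch} of {len(losses)} logged")
```

Because the epoch is chosen using the test set itself, these minima are slightly optimistic; a separate validation split would give a cleaner comparison.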