<a href="https://colab.research.google.com/github/succSeeded/dl-2025/blob/main/hws/week03_convnets/homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve CIFAR10 image classification.

This notebook is intended as a sequel to seminar 3, please give it a try if you haven't done so yet.

* The ultimate quest is to create a network that has as high __accuracy__ as you can push it.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it while you iterate.

## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 50% (50% points)
    * 60% (60% points)
    * 65% (70% points)
    * 70% (80% points)
    * 75% (90% points)
    * 80% (full points)
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 80%.
* In other words, base milestones must be beaten without pre-trained nets (and such net must be present in the e-mail). After that, you can use whatever you want.
* you __can__ use validation data for training, but you __can't__ do anything with test data apart from running the evaluation procedure.

## Approach

### Network size and architecture
For this task I decided to try out multiple architectures:
* a network proposed in out class as a baseline;
* `AlexNet`;
* `Resnet-18` (since bigger options might be an overkill).

### Optimizations
The training process included the following optimizations:
* `Adam` optimizer was used since it is stronger, faster and better than `SGD`, althouhg its parameters values were default for each test case;
* a stopping criterion that terminates the training process after a certain amount of epochs without improvements on global optimum (10 each case);
* a dropout layer with `p = 0.1` for AlexNet and baseline architectures;

   
### Data augmemntation
For each test case the data was augmente using the procedure described in out class: 
```
transform_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomRotation([-30, 30]),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])
```

In [1]:
import time
import pathlib

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from torchvision import transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import Subset
from sklearn.model_selection import train_test_split


def train_val_split(dataset, val_size=0.2):
    """
    Split torch datasets into train and validation parts

    Args:
        dataset: incoming torch.Dataset object
        val_size: portion of the dataset that will be used for validation (default: 0.2)
    """
    train_idx, val_idx = train_test_split(list(range(len(dataset))), test_size=val_size)
    return (Subset(dataset, train_idx), Subset(dataset, val_idx))


device = "cuda" if torch.cuda.is_available() else "cpu"

means = np.array((0.4914, 0.4822, 0.4465))  # statistics from dataset documentation
stds = np.array((0.2023, 0.1994, 0.2010))

transform_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomRotation([-30, 30]),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])


test_dataset = CIFAR10("./data/", train=False, download=True, transform=transform_test)
train_val_dataset = CIFAR10("./data/", train=True, download=True, transform=transform_augment)

train_dataset, val_dataset = train_val_split(train_val_dataset)

In [2]:
def train(model, opt, train_dataset, val_dataset, num_epochs:int = 100, batch_size:int = 64, stop:int = 7, device:str = None):
    """
    A function for training torch models that I kindly took from the professor (sorry) but now with torch dataloaders

    Args:
        model: torch model to train
        opt: optimizer for that model
        train_dataset: torch.Dataset used for training
        val_dataset: torch.Dataset used for validation
        num_epochs: total amount of full passes over training data (default: 100)
        batch_size: number of samples processed in one SGD iteration (default: 64)
        stop: number of iterations that loss can decrease for before the training proces stops (default: 7)
    """
    train_loss = []
    val_accuracy = []
    best_val_acc = 0.0
    best_epoch = 0

    train_dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=1)
    val_dataloader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=True, num_workers=1)

    for epoch in range(num_epochs):
        # In each epoch, we do a full pass over the training data:
        start_time = time.time()
        model.train(True) # enable dropout / batch_norm training behavior
        for (X_batch, y_batch) in train_dataloader:
            X_batch = X_batch.to(torch.float32).to(device)
            y_batch = y_batch.to(torch.int64).to(device)
            # train on batch
            logits = model(X_batch)
            loss = F.cross_entropy(logits, y_batch).mean()
            loss.backward()
            opt.step()
            opt.zero_grad()
            train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

        # And a full pass over the validation data:
        model.train(False)     # disable dropout / use averages for batch_norm
        with torch.no_grad():  # do not store intermediate activations
            for (X_batch, y_batch) in val_dataloader:
                X_batch = X_batch.to(torch.float32).to(device)
                y_batch = y_batch.detach().cpu().numpy()
                logits = model(X_batch)
                y_pred = logits.argmax(-1).detach().cpu().numpy()
                val_accuracy.append(np.mean(y_batch == y_pred))

        mean_val_acc = np.mean(val_accuracy[-len(val_dataset) // batch_size :])

        if best_val_acc < mean_val_acc:
            best_val_acc = mean_val_acc
            best_epoch = epoch
            pathlib.Path("./models/").mkdir(exist_ok=True)
            torch.save(model.state_dict(), f"models/best_model.pt2")


        if epoch - best_epoch > stop:
            print("Model did not see any loss improvements for %i epochs, aborting..." % (stop))
            model.load_state_dict(torch.load(f"models/best_model.pt2", weights_only=True))
            model.eval()
            break

        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_loss[-len(train_dataset) // batch_size :])))
        print("  validation accuracy: \t\t\t{:.2f} %".format(
            mean_val_acc * 100))

    print(f"Finished training. Best validation accuracy: {best_val_acc*100:.2f} %")

In [3]:
def evaluate(model, test_dataset):
    """
    A function for evaluating torch models but now with torch dataloaders

    Args:
        model: torch model to evaluate
        test_dataset: torch.Dataset that contains test data
    """
    model.train(False) # disable dropout / use averages for batch_norm
    test_dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=500, shuffle=True, num_workers=1)
    test_batch_acc = []

    for X_batch, y_batch in test_dataloader:
        X_batch = X_batch.to(torch.float32).to(device)
        y_batch = y_batch.detach().cpu().numpy()
        logits = model(X_batch)
        y_pred = logits.max(1)[1].data.cpu().numpy()
        test_batch_acc.append(np.mean(y_batch == y_pred))

    test_accuracy = np.mean(test_batch_acc)

    print("Final results:")
    print("  test accuracy:\t\t{:.2f} %".format(
        test_accuracy * 100))

In [33]:
class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super(ResNetBlock, self).__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=(3,3), stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.gelu = nn.GELU()
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=(3,3), stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.shortcut = nn.Sequential()
        if in_channels != out_channels or stride != 1:
            self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=(1, 1), stride=stride, bias=False)

    def forward(self, x):
        inputs = self.shortcut(x)

        out = self.gelu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += inputs
        out = self.gelu(out)
        return out

In [34]:
# =========== Baseline neural network ==================
# model = nn.Sequential(
#     nn.Conv2d(3, 10, kernel_size=(3,3)),  # [10, 30, 30]
#     nn.GELU(),
#     nn.MaxPool2d(2), # [10, 15, 15]
#     nn.Flatten(),
#     nn.Linear(10 * 15 * 15, 100),
#     nn.GELU(),
#     nn.Dropout(p=0.1),
#     nn.Linear(100, 10)
# ).to(device) # this produces:
#                     58.42% accuracy on test set, ~7.1s per epoch with GELU
#                     55.48% accuracy on test set, ~6.9s per epoch with tanh
#                     54.55% accuracy on test set, ~6.8s per epoch with ReLU

# ================== AlexNet ===========================
# model = nn.Sequential(
#     nn.Conv2d(3, 16, kernel_size=(5,5)),  # [16, 28, 28]
#     nn.ReLU(),
#     nn.Conv2d(16, 64, kernel_size=(3,3)), # [64, 26, 26]
#     nn.MaxPool2d(2), # [64, 13, 13]
#     nn.Conv2d(64, 96, kernel_size=(3,3), padding=1), # [96, 13, 13]
#     nn.Conv2d(96, 96, kernel_size=(3,3), padding=1), # [96, 13, 13]
#     nn.Conv2d(96, 64, kernel_size=(3,3), padding=1), # [96, 13, 13]
#     nn.MaxPool2d(2), # [64, 6, 6]
#     nn.Flatten(),
#     nn.Linear(64 * 6 * 6, 4096),
#     nn.ReLU(),
#     nn.Linear(4096, 4096),
#     nn.ReLU(),
#     nn.Dropout(p=0.1),
#     nn.Linear(4096, 10)
# ).to(device) # this produces:
# #                   71.36% accuracy on test set, ~12.5s per epoch with ReLU
# #                   exploding everything if ReLUs are substituted with GELUs


# ================== ResNet-18 =========================
# 18 + 1 conv layers and 1 dense layer
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=(3,3), stride=1, padding=1),  # [16, 32, 32]
    ResNetBlock(16, 16), # [16, 32, 32]
    ResNetBlock(16, 16), # [16, 32, 32]
    ResNetBlock(16, 16), # [16, 32, 32]
    ResNetBlock(16, 32, stride=2), # [32, 16, 16]
    ResNetBlock(32, 32), # [32, 16, 16]
    ResNetBlock(32, 32), # [32, 16, 16]
    ResNetBlock(32, 64, stride=2), # [64, 8, 8]
    ResNetBlock(64, 64), # [64, 8, 8]
    ResNetBlock(64, 64), # [64, 8, 8]
    nn.AvgPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 10),
    nn.Softmax(dim=1)
).to(device)# this produces:
#                  78.05% accuracy on test set, ~9.7s per epoch with ReLU
#                  80.41% accuracy on test set, ~9.9s per epoch with ReLU

opt = torch.optim.Adam(model.parameters())

train(model, opt, train_dataset, val_dataset, device=device, stop=10)
evaluate(model, test_dataset)

Epoch 1 of 100 took 10.047s
  training loss (in-iteration): 	2.205096
  validation accuracy: 			25.89 %
Epoch 2 of 100 took 9.812s
  training loss (in-iteration): 	2.164611
  validation accuracy: 			31.33 %
Epoch 3 of 100 took 9.807s
  training loss (in-iteration): 	2.150024
  validation accuracy: 			30.58 %
Epoch 4 of 100 took 9.845s
  training loss (in-iteration): 	2.126246
  validation accuracy: 			36.62 %
Epoch 5 of 100 took 9.859s
  training loss (in-iteration): 	2.083313
  validation accuracy: 			36.65 %
Epoch 6 of 100 took 9.893s
  training loss (in-iteration): 	2.062417
  validation accuracy: 			41.25 %
Epoch 7 of 100 took 9.933s
  training loss (in-iteration): 	2.028416
  validation accuracy: 			42.88 %
Epoch 8 of 100 took 9.967s
  training loss (in-iteration): 	1.996588
  validation accuracy: 			46.53 %
Epoch 9 of 100 took 9.962s
  training loss (in-iteration): 	1.969027
  validation accuracy: 			48.88 %
Epoch 10 of 100 took 9.940s
  training loss (in-iteration): 	1.946865
  