
## Task 1: small convolutional net
### First step

Let's create a mini convolutional network with roughly the following architecture:
* Input layer
* 3x3 convolution with 10 filters and _ReLU_ activation
* 2x2 pooling (or set previous convolution stride to 3)
* Flatten
* Dense layer with 100 neurons and _ReLU_ activation
* 10% dropout
* Output dense layer.


__Convolutional layers__ in torch are just like all other layers, but with a specific set of parameters:

__`...`__

__`model.add_module('conv1', nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)) # convolution`__

__`model.add_module('pool1', nn.MaxPool2d(2)) # max pooling 2x2`__

__`...`__
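
To size the first dense layer, it helps to track how each layer changes the spatial resolution. The sketch below assumes 32x32 input images and layers without padding or dilation; `conv_out_size` is a hypothetical helper written for this illustration, not a torch function.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


side = 32                                       # assumed input resolution: 32x32
side = conv_out_size(side, kernel=3)            # 3x3 convolution -> 30
side = conv_out_size(side, kernel=2, stride=2)  # 2x2 max pooling -> 15
flat_features = 10 * side * side                # 10 filters -> 10 * 15 * 15

print(side, flat_features)  # 15 2250
```

The resulting 2250 is the number of input features the first dense layer has to accept after flattening.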


Once you're done (and compute_loss no longer raises errors), train it with __Adam__ optimizer with default params (feel free to modify the code above).

If everything is right, you should get at least __50%__ validation accuracy.

In [None]:
# ==============
# some prep work
# ==============

import time
import pathlib

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from cifar import load_cifar10


def compute_loss(X_batch: np.typing.ArrayLike, y_batch: np.typing.ArrayLike) -> torch.Tensor:
    # Relies on the global `model` and `device` defined elsewhere in the notebook
    X_batch = torch.as_tensor(X_batch, dtype=torch.float32, device=device)
    y_batch = torch.as_tensor(y_batch, dtype=torch.int64, device=device)
    logits = model(X_batch)
    # cross_entropy already averages over the batch (reduction="mean" by default)
    return F.cross_entropy(logits, y_batch)


# An auxiliary function that returns mini-batches for neural network training
def iterate_minibatches(X, y, batchsize):
    indices = np.random.permutation(np.arange(len(X)))
    for start in range(0, len(indices), batchsize):
        ix = indices[start: start + batchsize]
        yield X[ix], y[ix]


# **IMPORTANT** when running in colab, un-comment this
!wget https://raw.githubusercontent.com/yandexdataschool/Practical_DL/refs/heads/fall25/week03_convnets/cifar.py

X_train, y_train, X_val, y_val, X_test, y_test = load_cifar10("cifar_data")

class_names = np.array(['airplane', 'automobile', 'bird', 'cat', 'deer',
                        'dog', 'frog', 'horse', 'ship', 'truck'])

device = "cuda" if torch.cuda.is_available() else "cpu"


In [None]:
def train(model, opt, X_train, y_train, X_val, y_val, num_epochs:int = 100, batch_size:int = 64, stop:int = 7):
    """
    A function for training torch models that I kindly took from the professor (sorry)

    Args:
        model: torch model to train
        opt: optimizer for that model
        X_train: training data [n, m], where n is the number of entries and m is the number of features
        y_train: training targets [n,]
        X_val: validation data
        y_val: validation targets
        num_epochs: total number of full passes over the training data (default: 100)
        batch_size: number of samples processed in one SGD iteration (default: 64)
        stop: number of epochs without validation accuracy improvement that is tolerated before training stops early (default: 7)
    """
    train_loss = []
    val_accuracy = []
    best_val_acc = 0.0
    best_epoch = 0

    for epoch in range(num_epochs):
        # In each epoch, we do a full pass over the training data:
        start_time = time.time()
        model.train(True) # enable dropout / batch_norm training behavior
        for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
            # train on batch
            loss = compute_loss(X_batch, y_batch)
            loss.backward()
            opt.step()
            opt.zero_grad()
            train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

        # And a full pass over the validation data:
        model.train(False)     # disable dropout / use averages for batch_norm
        with torch.no_grad():  # do not store intermediate activations
            for X_batch, y_batch in iterate_minibatches(X_val, y_val, batch_size):
                logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
                y_pred = logits.argmax(-1).detach().cpu().numpy()
                val_accuracy.append(np.mean(y_batch == y_pred))

        mean_val_acc = np.mean(val_accuracy[-len(X_val) // batch_size :])

        if best_val_acc < mean_val_acc:
            best_val_acc = mean_val_acc
            best_epoch = epoch
            pathlib.Path("./models/").mkdir(exist_ok=True)
            torch.save(model.state_dict(), f"models/best_model.pt2")


        if epoch - best_epoch > stop:
            print("Validation accuracy has not improved for more than %i epochs, stopping early..." % (stop))
            model.load_state_dict(torch.load(f"models/best_model.pt2", weights_only=True))
            model.eval()
            break

        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_loss[-len(X_train) // batch_size :])))
        print("  validation accuracy: \t\t\t{:.2f} %".format(
            mean_val_acc * 100))

    print(f"Finished training. Best validation accuracy: {best_val_acc*100:.2f} %")

In [None]:
def evaluate(model, X_test, y_test):
    model.train(False) # disable dropout / use averages for batch_norm
    test_batch_acc = []
    with torch.no_grad():  # inference only: do not store intermediate activations
        for X_batch, y_batch in iterate_minibatches(X_test, y_test, 500):
            logits = model(torch.as_tensor(X_batch, dtype=torch.float32, device=device))
            y_pred = logits.argmax(-1).cpu().numpy()
            test_batch_acc.append(np.mean(y_batch == y_pred))

    test_accuracy = np.mean(test_batch_acc)

    print("Final results:")
    print("  test accuracy:\t\t{:.2f} %".format(
        test_accuracy * 100))

In [None]:
# Let us create the model:
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=(3,3)),  # [10, 30, 30]
    nn.ReLU(),
    nn.MaxPool2d(2), # [10, 15, 15]
    nn.Flatten(),
    nn.Linear(10 * 15 * 15, 100),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(100, 10)
).to(device)

In [None]:
opt = torch.optim.Adam(model.parameters())

train(model, opt, X_train, y_train, X_val, y_val)

Epoch 1 of 100 took 1.418s
  training loss (in-iteration): 	1.750762
  validation accuracy: 			46.84 %
Epoch 2 of 100 took 1.640s
  training loss (in-iteration): 	1.443771
  validation accuracy: 			52.82 %
Epoch 3 of 100 took 1.501s
  training loss (in-iteration): 	1.346818
  validation accuracy: 			52.65 %
Epoch 4 of 100 took 1.395s
  training loss (in-iteration): 	1.281974
  validation accuracy: 			55.04 %
Epoch 5 of 100 took 1.386s
  training loss (in-iteration): 	1.219855
  validation accuracy: 			57.49 %
Epoch 6 of 100 took 1.402s
  training loss (in-iteration): 	1.178553
  validation accuracy: 			58.09 %
Epoch 7 of 100 took 1.387s
  training loss (in-iteration): 	1.134880
  validation accuracy: 			58.23 %
Epoch 8 of 100 took 1.387s
  training loss (in-iteration): 	1.100379
  validation accuracy: 			60.02 %
Epoch 9 of 100 took 1.385s
  training loss (in-iteration): 	1.063912
  validation accuracy: 			58.82 %
Epoch 10 of 100 took 1.544s
  training loss (in-iteration): 	1.039553
  v

In [None]:
evaluate(model, X_test, y_test)

Final results:
  test accuracy:		61.18 %


__Hint:__ If you don't want to compute shapes by hand, just plug in any input size for the dense layer (e.g. 1 unit) and run compute_loss. You will see something like this:

__`RuntimeError: size mismatch, m1: [5 x 1960], m2: [1 x 64] at /some/long/path/to/torch/operation`__

See the __1960__ there? That's your actual input shape.
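
An alternative to reading the number off an error message is to push a dummy batch through the layers that precede the dense layer and inspect the output shape. This is a minimal sketch that assumes 32x32 RGB inputs and the conv/pool stack from Task 1; `features` and `n_features` are names introduced only for this example.

```python
import torch
import torch.nn as nn

# Feature extractor only: everything up to and including Flatten.
features = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
)

with torch.no_grad():
    dummy = torch.zeros(1, 3, 32, 32)      # one fake 32x32 RGB image
    n_features = features(dummy).shape[1]  # size of the flattened vector

print(n_features)  # 2250
```

The printed value can then be used as `in_features` of the first `nn.Linear`.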

## Task 2: adding normalization

* Add batch norm (with default params) between convolution and ReLU
  * nn.BatchNorm*d (1d for dense, 2d for conv)
  * usually better to put them after linear/conv but before nonlinearity
* Re-train the network with the same optimizer; it should reach at least 60% validation accuracy at peak.
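
To make the effect of the layer concrete, here is a small sketch of what `nn.BatchNorm2d` does in training mode: each channel is normalized with statistics taken over the batch and spatial dimensions. The input tensor is synthetic (mean 5, standard deviation 3), chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(10)                        # one set of statistics per channel
x = 5.0 + 3.0 * torch.randn(64, 10, 30, 30)    # synthetic activations: mean 5, std 3

bn.train()                                     # training mode: use batch statistics
with torch.no_grad():
    y = bn(x)

# Per-channel statistics over the batch and spatial dimensions.
channel_mean = y.mean(dim=(0, 2, 3))
centered = y - channel_mean.view(1, -1, 1, 1)
channel_std = (centered ** 2).mean(dim=(0, 2, 3)).sqrt()

print(channel_mean.abs().max().item())         # close to 0
print((channel_std - 1.0).abs().max().item())  # close to 0, i.e. std close to 1
```

In evaluation mode (`model.train(False)`) the layer switches to the running averages collected during training, which is why the training loops toggle the mode before validation.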



In [None]:
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=(3,3)),  # [10, 30, 30]
    nn.BatchNorm2d(10),
    nn.ReLU(),
    nn.MaxPool2d(2), # [10, 15, 15]
    nn.Flatten(),
    nn.Linear(10 * 15 * 15, 100),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(100, 10)
).to(device)

In [None]:
opt = torch.optim.Adam(model.parameters())

train(model, opt, X_train, y_train, X_val, y_val)

Epoch 1 of 100 took 1.536s
  training loss (in-iteration): 	1.555549
  validation accuracy: 			55.33 %
Epoch 2 of 100 took 1.692s
  training loss (in-iteration): 	1.264537
  validation accuracy: 			54.60 %
Epoch 3 of 100 took 1.734s
  training loss (in-iteration): 	1.150061
  validation accuracy: 			58.88 %
Epoch 4 of 100 took 1.530s
  training loss (in-iteration): 	1.077964
  validation accuracy: 			60.75 %
Epoch 5 of 100 took 1.526s
  training loss (in-iteration): 	1.028120
  validation accuracy: 			56.75 %
Epoch 6 of 100 took 1.523s
  training loss (in-iteration): 	0.981943
  validation accuracy: 			59.98 %
Epoch 7 of 100 took 1.571s
  training loss (in-iteration): 	0.937545
  validation accuracy: 			62.06 %
Epoch 8 of 100 took 1.523s
  training loss (in-iteration): 	0.907180
  validation accuracy: 			60.76 %
Epoch 9 of 100 took 1.508s
  training loss (in-iteration): 	0.874411
  validation accuracy: 			60.70 %
Epoch 10 of 100 took 1.844s
  training loss (in-iteration): 	0.841969
  v

In [None]:
evaluate(model, X_test, y_test)

Final results:
  test accuracy:		60.91 %


## Task 3: Data Augmentation

torchvision provides a powerful tool, `torchvision.transforms`, for image preprocessing and data augmentation.

Here's how it works: we define a pipeline that
* makes random crops of data (augmentation)
* randomly rotates the image by up to 30 degrees (augmentation)
* randomly flips image horizontally (augmentation)
* then normalizes it (preprocessing)

In [None]:
from torchvision import transforms
means = np.array((0.4914, 0.4822, 0.4465))  # statistics from dataset documentation
stds = np.array((0.2023, 0.1994, 0.2010))

transform_augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomRotation([-30, 30]),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])

In [None]:
from torchvision.datasets import CIFAR10
from torch.utils.data import Subset
from sklearn.model_selection import train_test_split

def train_val_dataset(dataset, val_split=0.2):
    train_idx, val_idx = train_test_split(list(range(len(dataset))), test_size=val_split)
    datasets = {}
    datasets['train'] = Subset(dataset, train_idx)
    datasets['val'] = Subset(dataset, val_idx)
    return datasets


train_dataset = CIFAR10("./cifar_data/", train=True, transform=transform_augment)

train_sets = train_val_dataset(train_dataset)

In [None]:
def train_upgraded(model, opt, train_dataset, val_dataset, num_epochs:int = 100, batch_size:int = 64, stop:int = 7, device:str = None):
    """
    A function for training torch models that I kindly took from the professor (sorry) but now with torch dataloaders

    Args:
        model: torch model to train
        opt: optimizer for that model
        train_dataset: torch dataset with training samples
        val_dataset: torch dataset with validation samples
        num_epochs: total number of full passes over the training data (default: 100)
        batch_size: number of samples processed in one SGD iteration (default: 64)
        stop: number of epochs without validation accuracy improvement that is tolerated before training stops early (default: 7)
        device: device that the batches are moved to (default: None)
    """
    train_loss = []
    val_accuracy = []
    best_val_acc = 0.0
    best_epoch = 0

    train_dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=1)
    val_dataloader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=True, num_workers=1)

    for epoch in range(num_epochs):
        # In each epoch, we do a full pass over the training data:
        start_time = time.time()
        model.train(True) # enable dropout / batch_norm training behavior
        for (X_batch, y_batch) in train_dataloader:
            X_batch = X_batch.to(torch.float32).to(device)
            y_batch = y_batch.to(torch.int64).to(device)
            # train on batch
            logits = model(X_batch)
            loss = F.cross_entropy(logits, y_batch).mean()
            loss.backward()
            opt.step()
            opt.zero_grad()
            train_loss.append(loss.item())  # .item() = convert 1-value Tensor to float

        # And a full pass over the validation data:
        model.train(False)     # disable dropout / use averages for batch_norm
        with torch.no_grad():  # do not store intermediate activations
            for (X_batch, y_batch) in val_dataloader:
                X_batch = X_batch.to(torch.float32).to(device)
                y_batch = y_batch.detach().cpu().numpy()
                logits = model(X_batch)
                y_pred = logits.argmax(-1).detach().cpu().numpy()
                val_accuracy.append(np.mean(y_batch == y_pred))

        mean_val_acc = np.mean(val_accuracy[-len(val_dataset) // batch_size :])

        if best_val_acc < mean_val_acc:
            best_val_acc = mean_val_acc
            best_epoch = epoch
            pathlib.Path("./models/").mkdir(exist_ok=True)
            torch.save(model.state_dict(), f"models/best_model.pt2")


        if epoch - best_epoch > stop:
            print("Validation accuracy has not improved for more than %i epochs, stopping early..." % (stop))
            model.load_state_dict(torch.load(f"models/best_model.pt2", weights_only=True))
            model.eval()
            break

        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, num_epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_loss[-len(train_dataset) // batch_size :])))
        print("  validation accuracy: \t\t\t{:.2f} %".format(
            mean_val_acc * 100))

    print(f"Finished training. Best validation accuracy: {best_val_acc*100:.2f} %")

In [None]:
def evaluate_updated(model, test_dataset):
    model.train(False) # disable dropout / use averages for batch_norm
    test_dataloader = torch.utils.data.DataLoader(
        test_dataset, batch_size=500, shuffle=True, num_workers=1)
    test_batch_acc = []
    for X_batch, y_batch in test_dataloader:
        X_batch = X_batch.to(torch.float32).to(device)
        y_batch = y_batch.detach().cpu().numpy()
        logits = model(X_batch)
        y_pred = logits.max(1)[1].data.cpu().numpy()
        test_batch_acc.append(np.mean(y_batch == y_pred))

    test_accuracy = np.mean(test_batch_acc)

    print("Final results:")
    print("  test accuracy:\t\t{:.2f} %".format(
        test_accuracy * 100))

In [None]:
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=(3,3)),  # [10, 30, 30]
    nn.ReLU(),
    nn.MaxPool2d(2), # [10, 15, 15]
    nn.Flatten(),
    nn.Linear(10 * 15 * 15, 100),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(100, 10)
).to(device)

opt = torch.optim.Adam(model.parameters())

train_upgraded(model, opt, train_sets["train"], train_sets["val"], device=device)

Epoch 1 of 100 took 24.999s
  training loss (in-iteration): 	1.788986
  validation accuracy: 			40.60 %
Epoch 2 of 100 took 25.135s
  training loss (in-iteration): 	1.604469
  validation accuracy: 			44.21 %
Epoch 3 of 100 took 25.115s
  training loss (in-iteration): 	1.546970
  validation accuracy: 			44.31 %
Epoch 4 of 100 took 24.859s
  training loss (in-iteration): 	1.523269
  validation accuracy: 			46.31 %
Epoch 5 of 100 took 25.222s
  training loss (in-iteration): 	1.498357
  validation accuracy: 			47.24 %
Epoch 6 of 100 took 24.890s
  training loss (in-iteration): 	1.489475
  validation accuracy: 			47.65 %
Epoch 7 of 100 took 25.788s
  training loss (in-iteration): 	1.472225
  validation accuracy: 			47.55 %
Epoch 8 of 100 took 24.766s
  training loss (in-iteration): 	1.465172
  validation accuracy: 			48.51 %
Epoch 9 of 100 took 25.318s
  training loss (in-iteration): 	1.449785
  validation accuracy: 			47.58 %
Epoch 10 of 100 took 25.291s
  training loss (in-iteration): 	1.

When testing, we don't need random crops; we just normalize with the same statistics.
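
The same reasoning applies to a validation split: if it is carved out of a dataset that was created with the augmenting transform, its images are augmented as well. One possible way around this, sketched below, is to keep the base dataset untransformed and attach a transform per split. `TransformedSubset` is a hypothetical helper written for this illustration (it is not part of torch or torchvision), and the toy list stands in for a real dataset created with `transform=None`.

```python
import torch
from torch.utils.data import Dataset


class TransformedSubset(Dataset):
    """Hypothetical helper: a view on base[indices] with its own transform."""

    def __init__(self, base, indices, transform=None):
        self.base = base              # dataset returning (raw_image, label) pairs
        self.indices = list(indices)
        self.transform = transform

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        image, label = self.base[self.indices[i]]
        if self.transform is not None:
            image = self.transform(image)
        return image, label


# Toy stand-in for a dataset created with transform=None.
base = [(torch.full((3, 2, 2), float(k)), k % 10) for k in range(6)]

# Placeholder transforms: in practice these would be the augmenting and
# the test-time pipelines, respectively.
train_part = TransformedSubset(base, [0, 1, 2, 3], transform=lambda x: x + 100.0)
val_part = TransformedSubset(base, [4, 5], transform=lambda x: x / 10.0)

print(len(train_part), len(val_part))
print(train_part[0][0][0, 0, 0].item(), val_part[0][0][0, 0, 0].item())
```

Both objects can be passed to `torch.utils.data.DataLoader` in the same way as a `Subset`.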

In [None]:
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(means, stds),
])


test_dataset = CIFAR10("./cifar_data/", train=False, transform=transform_test)
evaluate_updated(model, test_dataset)

Final results:
  test accuracy:		51.23 %


So, we have a baseline of about 50%; let us improve it! First, let us change all the non-linearities to GELU:
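
For reference, GELU multiplies its input by the standard normal CDF, so unlike ReLU it is smooth and lets small negative values through. The sketch below compares the two on a handful of points and checks the closed form against `F.gelu` with its default (exact) setting.

```python
import math

import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
gelu_manual = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

print(torch.allclose(F.gelu(x), gelu_manual, atol=1e-6))  # True
print(F.relu(x))  # zero for all negative inputs
print(F.gelu(x))  # slightly negative for negative inputs
```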

In [None]:
model = nn.Sequential(
    nn.Conv2d(3, 10, kernel_size=(3,3)),  # [10, 30, 30]
    nn.GELU(),
    nn.MaxPool2d(2), # [10, 15, 15]
    nn.Flatten(),
    nn.Linear(10 * 15 * 15, 100),
    nn.GELU(),
    nn.Dropout(p=0.1),
    nn.Linear(100, 10)
).to(device)

opt = torch.optim.Adam(model.parameters())

train_upgraded(model, opt, train_sets["train"], train_sets["val"], device=device, stop=10)
evaluate_updated(model, test_dataset)

Epoch 1 of 100 took 25.028s
  training loss (in-iteration): 	1.738346
  validation accuracy: 			43.20 %
Epoch 2 of 100 took 24.981s
  training loss (in-iteration): 	1.557186
  validation accuracy: 			45.34 %
Epoch 3 of 100 took 25.807s
  training loss (in-iteration): 	1.489758
  validation accuracy: 			47.05 %
Epoch 4 of 100 took 24.886s
  training loss (in-iteration): 	1.458499
  validation accuracy: 			48.41 %
Epoch 5 of 100 took 24.734s
  training loss (in-iteration): 	1.435072
  validation accuracy: 			49.45 %
Epoch 6 of 100 took 24.807s
  training loss (in-iteration): 	1.417875
  validation accuracy: 			49.97 %
Epoch 7 of 100 took 25.296s
  training loss (in-iteration): 	1.400148
  validation accuracy: 			50.14 %
Epoch 8 of 100 took 25.366s
  training loss (in-iteration): 	1.385419
  validation accuracy: 			51.54 %
Epoch 9 of 100 took 25.656s
  training loss (in-iteration): 	1.373466
  validation accuracy: 			52.10 %
Epoch 10 of 100 took 25.107s
  training loss (in-iteration): 	1.