# Deep Learning Applications: Laboratory #1

In this first laboratory we will work with relatively simple architectures to get a feel for working with Deep Models. This notebook is designed to work with PyTorch, but as I said in the introductory lecture: please feel free to use and experiment with whatever tools you like.

**Important Notes**:
1. Be sure to **document** all of your decisions, as well as your intermediate and final results. Make sure your conclusions and analyses are clearly presented. Don't make us dig into your code or walls of printed results to try to draw conclusions from your code.
2. If you use code from someone else (e.g. GitHub, Stack Overflow, ChatGPT, etc) you **must be transparent about it**. Document your sources and explain how you adapted any partial solutions to create **your** solution.



## Exercise 1: Warming Up
In this series of exercises I want you to try to duplicate (on a small scale) the results of the ResNet paper:

> [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385), Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, CVPR 2016.

We will do this in steps using a Multilayer Perceptron on MNIST.

Recall that the main message of the ResNet paper is that **deeper** networks do not **guarantee** a lower training loss (or a higher validation accuracy). Below you will incrementally build a sequence of experiments to verify this for an MLP. A few guidelines:

+ I have provided some **starter** code at the beginning. **NONE** of this code should survive in your solutions. Not only is it **very** badly written, it is also written in my functional style that also obfuscates what it's doing (in part to **discourage** your reuse!). It's just to get you *started*.
+ These exercises ask you to compare **multiple** training runs, so it is **really** important that you factor this into your **pipeline**. Using [Tensorboard](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) is a **very** good idea -- or, even better [Weights and Biases](https://wandb.ai/site).
+ You may work and submit your solutions in **groups of at most two**. Share your ideas with everyone, but the solutions you submit *must be your own*.

First some boilerplate to get you started, then on to the actual exercises!

### Preface: Some code to get you started

What follows is some **very simple** code for training an MLP on MNIST. The point of this code is to get you up and running (and to verify that your Python environment has all needed dependencies).

**Note**: As you read through my code and execute it, this would be a good time to think about *abstracting* **your** model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

In [None]:
# Start with some standard imports.
import numpy as np
import matplotlib.pyplot as plt
from functools import reduce
import torch
from torchvision.datasets import MNIST
from torch.utils.data import Subset
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms

#### Data preparation

Here is some basic dataset loading, validation splitting code to get you started working with MNIST.

In [None]:
# Standard MNIST transform.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST train and test.
ds_train = MNIST(root='./data', train=True, download=True, transform=transform)
ds_test = MNIST(root='./data', train=False, download=True, transform=transform)

# Split train into train and validation.
val_size = 5000
I = np.random.permutation(len(ds_train))
ds_val = Subset(ds_train, I[:val_size])
ds_train = Subset(ds_train, I[val_size:])
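
The split above draws an unseeded permutation, so the validation set changes every time the cell runs, which makes runs harder to compare. A minimal sketch of a repeatable split follows. The helper name `split_train_val`, the seed value, and the random stand-in dataset are assumptions made for illustration; with MNIST you would pass the real training set instead.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset (assumption): 100 random 28x28 "images" with integer labels.
fake = TensorDataset(torch.randn(100, 1, 28, 28), torch.arange(100) % 10)

def split_train_val(dataset, val_size, seed=0):
    """Split `dataset` into (train, val) using a seeded generator so the split is repeatable."""
    gen = torch.Generator().manual_seed(seed)
    train_size = len(dataset) - val_size
    return random_split(dataset, [train_size, val_size], generator=gen)

# The same seed yields the same validation indices on every call.
train_a, val_a = split_train_val(fake, val_size=20, seed=42)
train_b, val_b = split_train_val(fake, val_size=20, seed=42)
print(sorted(val_a.indices) == sorted(val_b.indices))
```

Keeping the split fixed across all runs means that differences in validation accuracy come from the models rather than from the data they were evaluated on.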

#### Boilerplate training and evaluation code

This is some **very** rough training, evaluation, and plotting code. Again, just to get you started. I will be *very* disappointed if any of this code makes it into your final submission.

In [None]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report

# Function to train a model for a single epoch over the data loader.
def train_epoch(model, dl, opt, epoch='Unknown', device='cpu'):
    model.train()
    losses = []
    for (xs, ys) in tqdm(dl, desc=f'Training epoch {epoch}', leave=True):
        xs = xs.to(device)
        ys = ys.to(device)
        opt.zero_grad()
        logits = model(xs)
        loss = F.cross_entropy(logits, ys)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return np.mean(losses)

# Function to evaluate model over all samples in the data loader.
def evaluate_model(model, dl, device='cpu'):
    model.eval()
    predictions = []
    gts = []
    # Gradients are not needed for evaluation, so skip building the autograd graph.
    with torch.no_grad():
        for (xs, ys) in tqdm(dl, desc='Evaluating', leave=False):
            xs = xs.to(device)
            preds = torch.argmax(model(xs), dim=1)
            gts.append(ys.cpu().numpy())
            predictions.append(preds.cpu().numpy())

    # Return accuracy score and classification report.
    return (accuracy_score(np.hstack(gts), np.hstack(predictions)),
            classification_report(np.hstack(gts), np.hstack(predictions), zero_division=0, digits=3))

# Simple function to plot the loss curve and validation accuracy.
def plot_validation_curves(losses_and_accs):
    losses = [x for (x, _) in losses_and_accs]
    accs = [x for (_, x) in losses_and_accs]
    plt.figure(figsize=(16, 8))
    plt.subplot(1, 2, 1)
    plt.plot(losses)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Average Training Loss per Epoch')
    plt.subplot(1, 2, 2)
    plt.plot(accs)
    plt.xlabel('Epoch')
    plt.ylabel('Validation Accuracy')
    plt.title(f'Best Accuracy = {np.max(accs)} @ epoch {np.argmax(accs)}')

#### A basic, parameterized MLP

This is a very basic implementation of a Multilayer Perceptron. Don't waste too much time trying to figure out how it works -- the important detail is that it allows you to pass in a list of input, hidden layer, and output *widths*. **Your** implementation should also support this for the exercises to come.

In [None]:
class MLP(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(nin, nout) for (nin, nout) in zip(layer_sizes[:-1], layer_sizes[1:])])
    
    def forward(self, x):
        return reduce(lambda f, g: lambda x: g(F.relu(f(x))), self.layers, lambda x: x.flatten(1))(x)
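
For comparison, here is a more readable sketch of an MLP that takes the same list of widths. The class name `ReadableMLP` is hypothetical. It is not an exact unrolling of the `reduce` above: as far as I can tell, that version applies a ReLU before every linear layer, including to the flattened input, whereas this sketch uses the more conventional arrangement of a ReLU between linear layers only.

```python
import torch
import torch.nn as nn

class ReadableMLP(nn.Module):
    """Hypothetical example: an MLP built from a list of layer widths."""
    def __init__(self, layer_sizes):
        super().__init__()
        pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
        layers = [nn.Flatten()]
        for i, (n_in, n_out) in enumerate(pairs):
            layers.append(nn.Linear(n_in, n_out))
            if i < len(pairs) - 1:
                # ReLU after every layer except the output layer, which returns raw logits.
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Same call pattern as the starter code: [input] + [width] * depth + [classes].
model = ReadableMLP([28 * 28] + [16] * 2 + [10])
logits = model(torch.randn(4, 1, 28, 28))
print(logits.shape)
```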

#### A *very* minimal training pipeline.

Here is some basic training and evaluation code to get you started.

**Important**: I cannot stress enough that this is a **terrible** example of how to implement a training pipeline. You can do better!

In [None]:
# Training hyperparameters.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
epochs = 100
lr = 0.0001
batch_size = 128

# Architecture hyperparameters.
input_size = 28*28
width = 16
depth = 2

# Dataloaders.
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=True, num_workers=4)
dl_val   = torch.utils.data.DataLoader(ds_val, batch_size, num_workers=4)
dl_test  = torch.utils.data.DataLoader(ds_test, batch_size, shuffle=True, num_workers=4)

# Instantiate model and optimizer.
model_mlp = MLP([input_size] + [width]*depth + [10]).to(device)
opt = torch.optim.Adam(params=model_mlp.parameters(), lr=lr)

# Training loop.
losses_and_accs = []
for epoch in range(epochs):
    loss = train_epoch(model_mlp, dl_train, opt, epoch, device=device)
    (val_acc, _) = evaluate_model(model_mlp, dl_val, device=device)
    losses_and_accs.append((loss, val_acc))

# And finally plot the curves.
plot_validation_curves(losses_and_accs)
print(f'Accuracy report on TEST:\n {evaluate_model(model_mlp, dl_test, device=device)[1]}')

### Exercise 1.1: A baseline MLP

Implement a *simple* Multilayer Perceptron to classify the 10 digits of MNIST (e.g. two *narrow* layers). Use my code above as inspiration, but implement your own training pipeline -- you will need it later. Train this model to convergence, monitoring (at least) the loss and accuracy on the training and validation sets for every epoch. Below I include a basic implementation to get you started -- remember that you should write your *own* pipeline!

**Note**: This would be a good time to think about *abstracting* your model definition, and training and evaluation pipelines in order to make it easier to compare performance of different models.

**Important**: Given the *many* runs you will need to do, and the need to *compare* performance between them, this would **also** be a great point to study how **Tensorboard** or **Weights and Biases** can be used for performance monitoring.
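
One way to keep many runs comparable is to describe each run with a small configuration object and derive a unique run name from it. The sketch below uses only the standard library. `RunConfig`, `make_grid`, and the naming scheme are hypothetical choices for illustration, not part of any logging tool.

```python
from dataclasses import dataclass, asdict
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the settings of a single training run."""
    model: str          # e.g. "mlp" or "resmlp"
    depth: int
    width: int = 128
    lr: float = 1e-3
    epochs: int = 10

    @property
    def run_name(self):
        # A readable name that is unique as long as the listed settings differ.
        return f"{self.model}_d{self.depth}_w{self.width}_lr{self.lr:g}"

def make_grid(models, depths):
    """Build one RunConfig per (model, depth) combination."""
    return [RunConfig(model=m, depth=d) for m, d in product(models, depths)]

configs = make_grid(["mlp", "resmlp"], [1, 3, 5, 10])
for cfg in configs:
    print(cfg.run_name, asdict(cfg))
```

Each `run_name` could then serve as a separate log directory, for example `SummaryWriter(log_dir=f"runs/{cfg.run_name}")`, so that TensorBoard overlays the curves of all runs under the same scalar tags.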

In [None]:
import numpy as np
import os
import threading
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, Subset
import torchvision.transforms as transforms
from torch.utils.data import random_split
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from torch.utils.tensorboard import SummaryWriter  # For TensorBoard

# Configure the device
if torch.backends.mps.is_available():
    device = torch.device("mps")  # Use the GPU on Apple Silicon Macs
elif torch.cuda.is_available():
    device = torch.device("cuda")  # Use an NVIDIA GPU
else:
    device = torch.device("cpu")  # Fall back to the CPU

# MLP model definition
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # First hidden layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)  # Second hidden layer
        self.fc3 = nn.Linear(hidden_size, output_size)  # Output layer

    def forward(self, x):
        x = x.flatten(1)  # Flatten the input
        x = F.relu(self.fc1(x))  # ReLU activation after the first layer
        x = F.relu(self.fc2(x))  # ReLU activation after the second layer
        x = self.fc3(x)  # Raw logits, no activation (CrossEntropyLoss applies log-softmax)
        return x

# Training function
def train(model, train_loader, val_loader, optimizer, criterion, epochs, device, writer):
    for epoch in range(epochs):
        # evaluate() leaves the model in eval mode, so switch back to training mode every epoch
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        # Training loop
        for inputs, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}"):
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            # Backward pass and optimization step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Accumulate metrics
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        # Average loss and accuracy over the epoch
        train_loss = running_loss / len(train_loader)
        train_acc = 100 * correct / total

        # Evaluate on the validation set
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)

        # Log the metrics to TensorBoard
        writer.add_scalar("Loss/Train", train_loss, epoch)
        writer.add_scalar("Accuracy/Train", train_acc, epoch)
        writer.add_scalar("Loss/Validation", val_loss, epoch)
        writer.add_scalar("Accuracy/Validation", val_acc, epoch)

        # Print the metrics
        print(f"Epoch {epoch+1}/{epochs}: "
              f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")

# Evaluation function
def evaluate(model, dataloader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    val_loss = running_loss / len(dataloader)
    val_acc = 100 * correct / total
    return val_loss, val_acc

def launch_tensorboard(logdir="runs"):
    os.system(f"tensorboard --logdir={logdir} --port=6006")

# Main function
def main():
    # Hyperparameters
    input_size = 28 * 28  # Size of a flattened MNIST image
    hidden_size = 128  # Width of the hidden layers
    output_size = 10  # Number of classes (digits 0-9)
    batch_size = 128
    epochs = 10
    lr = 0.001

    # Load the MNIST dataset
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    train_dataset = MNIST(root='./data', train=True, download=True, transform=transform)
    test_dataset = MNIST(root='./data', train=False, download=True, transform=transform)

    val_size = 5000
    train_size = len(train_dataset) - val_size  # The rest of the data goes into the training set

    # Split the training dataset
    train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

    # Create the DataLoaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    # Initialize the model, optimizer, and loss function
    model = SimpleMLP(input_size, hidden_size, output_size).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Initialize the TensorBoard writer
    writer = SummaryWriter()

    # Train the model
    train(model, train_loader, val_loader, optimizer, criterion, epochs, device, writer)

    # Evaluate on the test set
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    print(f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.2f}%")

    # Close the TensorBoard writer
    writer.close()

if __name__ == "__main__":

    main()

## High performance
On the **training set** the accuracy reaches **99.36%**, with the loss decreasing steadily. On the **validation set** it settles at **98.00%**. The result on the **test set** (**97.89%**) is in line with the others and indicates good generalization.

## Limited overfitting
Despite the high training accuracy, behavior on validation and test remains solid. The gap between **Train Acc** and **Test Acc** is about **1.5 percentage points**, a sign that overfitting is under control.

## MLP on MNIST: demonstrated effectiveness
The numbers confirm that on **MNIST** (a relatively simple dataset) even a basic **MLP** can achieve excellent **performance** without resorting to **convolutional** networks.

## Possible improvements
- The slight increase in **Val Loss** over the last epochs may indicate the onset of overfitting: **dropout** or **early stopping** are worth evaluating.
- Adopting a **CNN** could better exploit the spatial structure of the images and push performance further.

**In summary**, a **well-tuned MLP** is already competitive on MNIST; more advanced architectures can still improve **efficiency** and **generalization**.
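
Early stopping, one of the improvements mentioned above, can be sketched in a few lines of plain Python. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the list of validation losses are all hypothetical. The losses are made-up numbers for illustration, not results from this notebook.

```python
class EarlyStopping:
    """Hypothetical helper: stop once the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stops after the third epoch.
fake_val_losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.18, 0.19]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(f"stopped at epoch {stopped_at}, best val loss {stopper.best}")
```

In a real pipeline the call to `step` would sit right after the per-epoch validation pass, and one would typically also save a checkpoint whenever `best` improves.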


### Exercise 1.2: Adding Residual Connections

Implement a variant of your parameterized MLP network to support **residual** connections. Your network should be defined as a composition of **residual MLP** blocks that have one or more linear layers and add a skip connection from the block input to the output of the final linear layer.

**Compare** the performance (in training/validation loss and test accuracy) of your MLP and ResidualMLP for a range of depths. Verify that deeper networks **with** residual connections are easier to train than a network of the same depth **without** residual connections.

**For extra style points**: See if you can explain by analyzing the gradient magnitudes on a single training batch *why* this is the case. 
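
A short derivation may help frame the gradient analysis. As a simplifying assumption, take a residual block with two linear layers and no activation after the sum, where $\sigma$ denotes the ReLU and $W_1, W_2, b_1, b_2$ are the block's (illustratively named) parameters:

$$
y = x + F(x), \qquad F(x) = W_2\,\sigma(W_1 x + b_1) + b_2, \qquad \frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}.
$$

For a stack of $L$ such blocks with $x_{l+1} = x_l + F_l(x_l)$, the chain rule gives

$$
\frac{\partial \mathcal{L}}{\partial x_0}
= \frac{\partial \mathcal{L}}{\partial x_L}
\left(I + \frac{\partial F_{L-1}}{\partial x_{L-1}}\right)
\left(I + \frac{\partial F_{L-2}}{\partial x_{L-2}}\right)
\cdots
\left(I + \frac{\partial F_0}{\partial x_0}\right).
$$

Expanding the product yields the term $\partial \mathcal{L} / \partial x_L$ on its own, plus terms that pass through one or more $F_l$. In a plain network each factor is only $\partial F_l / \partial x_l$, so the gradient at the input is a product of $L$ Jacobians and can shrink geometrically when their norms are below one. With an activation after the sum, as in the original ResNet block, the identity path is additionally gated by the ReLU mask, so the argument holds only approximately.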

In [None]:
import numpy as np
import os
import threading
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, Subset
import torchvision.transforms as transforms
from torch.utils.data import random_split
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from torch.utils.tensorboard import SummaryWriter  # For TensorBoard

# Configure the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard MLP block without a skip connection
class MLPBlock(nn.Module):
    def __init__(self, input_size, output_size):
        super(MLPBlock, self).__init__()
        self.fc = nn.Linear(input_size, output_size)
        self.activation = nn.ReLU()

    def forward(self, x):
        return self.activation(self.fc(x))

# Residual block with a skip connection
class ResidualMLPBlock(nn.Module):
    def __init__(self, size):
        super(ResidualMLPBlock, self).__init__()
        self.fc1 = nn.Linear(size, size)
        self.fc2 = nn.Linear(size, size)
        self.activation = nn.ReLU()

    def forward(self, x):
        identity = x  # Keep the block input for the skip connection
        out = self.activation(self.fc1(x))
        out = self.fc2(out)
        return self.activation(out + identity)  # Add the skip, then apply the activation

# Plain MLP without residual connections
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, depth):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.hidden_layers = nn.Sequential(*[MLPBlock(hidden_size, hidden_size) for _ in range(depth)])
        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = x.flatten(1)  # Flatten the input
        x = F.relu(self.fc1(x))
        x = self.hidden_layers(x)
        return self.fc_out(x)

# MLP with residual connections
class ResidualMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, depth):
        super(ResidualMLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.residual_blocks = nn.Sequential(*[ResidualMLPBlock(hidden_size) for _ in range(depth)])
        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        x = self.residual_blocks(x)
        return self.fc_out(x)

# Training function
def train(model, train_loader, val_loader, optimizer, criterion, epochs, device, writer, model_name):
    for epoch in range(epochs):
        # evaluate() leaves the model in eval mode, so switch back to training mode every epoch
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs, labels in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} ({model_name})"):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        train_loss = running_loss / len(train_loader)
        train_acc = 100 * correct / total
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)

        writer.add_scalar(f"Loss/Train_{model_name}", train_loss, epoch)
        writer.add_scalar(f"Accuracy/Train_{model_name}", train_acc, epoch)
        writer.add_scalar(f"Loss/Validation_{model_name}", val_loss, epoch)
        writer.add_scalar(f"Accuracy/Validation_{model_name}", val_acc, epoch)

        print(f"{model_name} - Epoch {epoch+1}/{epochs}: "
              f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")


def analyze_gradients(model, inputs, labels, criterion, device, writer=None, tag_prefix=""):
    model.zero_grad()
    model.train()
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()

    grad_norms = {}

    for name, param in model.named_parameters():
        if param.grad is not None:
            norm = param.grad.norm().item()
            grad_norms[name] = norm
            if writer:
                writer.add_scalar(f"Gradients/{tag_prefix}/{name}", norm)
    
    print(f"\nGradient norms for {tag_prefix}:")
    for name, norm in grad_norms.items():
        print(f"{name}: {norm:.6f}")


# Evaluation function
def evaluate(model, dataloader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return running_loss / len(dataloader), 100 * correct / total

# Main function
def main():
    input_size = 28 * 28
    hidden_size = 128
    output_size = 10
    batch_size = 128
    epochs = 10
    lr = 0.001
    depths = [1, 3, 5, 10]  # Depths to compare

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    
    train_dataset = MNIST(root='./data', train=True, download=True, transform=transform)
    test_dataset = MNIST(root='./data', train=False, download=True, transform=transform)

    val_size = 5000
    train_size = len(train_dataset) - val_size  # The rest of the data goes into the training set

    # Split the training dataset
    train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

    # Create the DataLoaders
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)


    criterion = nn.CrossEntropyLoss()
    writer = SummaryWriter()

    for depth in depths:
        model_mlp = SimpleMLP(input_size, hidden_size, output_size, depth).to(device)
        optimizer_mlp = optim.Adam(model_mlp.parameters(), lr=lr)
        train(model_mlp, train_loader, val_loader, optimizer_mlp, criterion, epochs, device, writer, f"MLP_Depth{depth}")

        model_residual = ResidualMLP(input_size, hidden_size, output_size, depth).to(device)
        optimizer_residual = optim.Adam(model_residual.parameters(), lr=lr)
        train(model_residual, train_loader, val_loader, optimizer_residual, criterion, epochs, device, writer, f"ResidualMLP_Depth{depth}")

        # Take one batch from the training set (inside the loop, so that every
        # depth is analyzed rather than only the last pair of models)
        sample_inputs, sample_labels = next(iter(train_loader))

        # Analyze the gradients of the MLP at this depth
        analyze_gradients(model_mlp, sample_inputs, sample_labels, criterion, device, writer, tag_prefix=f"MLP_Depth{depth}")

        # Analyze the gradients of the ResidualMLP at this depth
        analyze_gradients(model_residual, sample_inputs, sample_labels, criterion, device, writer, tag_prefix=f"ResidualMLP_Depth{depth}")

    writer.close()

if __name__ == "__main__":
    main()


## Results: with vs. without residual connections

| Depth | Val Acc MLP (%) | Grad MLP (first layer) | Grad MLP (last layer) | Val Acc Residual MLP (%) | Grad Residual (first layer) | Grad Residual (last layer) |
|------:|----------------:|-----------------------:|----------------------:|-------------------------:|----------------------------:|---------------------------:|
| 1     | 97.74           | 0.6466                 | 0.2342                | 97.28                    | 0.1573                      | 0.0714                     |
| 3     | 97.58           | 0.6466                 | 0.2342                | 97.00                    | 0.1573                      | 0.0714                     |
| 5     | 97.42           | 0.6466                 | 0.2342                | 97.76                    | 0.1573                      | 0.0714                     |
| 10    | 96.78           | 0.6466                 | 0.2342                | 97.34                    | 0.1573                      | 0.0714                     |

**Note**: the gradient-norm columns are identical in every row, which suggests they come from a single measurement rather than one per depth. They therefore cannot be used to compare how gradients change with depth.

## Conclusions

- **As depth increases, skip connections help the gradient propagate through the whole network**, preserving useful signal even in layers far from the output and making optimization more stable.

- **In “plain” MLPs, added depth tends to accentuate the vanishing gradient** in the deeper layers, with training that is generally less effective than in the residual counterparts.
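
The vanishing-gradient argument can be probed directly at initialization. The sketch below is illustrative only: `Block` and `first_layer_grad_norm` are hypothetical names, the batch is random noise standing in for flattened MNIST images, and the networks use PyTorch's default initialization and are never trained. It compares the gradient norm of the first layer's weights for plain and residual stacks that share the same initial weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One linear layer; adds the block input before the ReLU when residual=True."""
    def __init__(self, width, residual):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.residual = residual

    def forward(self, x):
        out = self.fc(x)
        if self.residual:
            out = out + x
        return F.relu(out)

def first_layer_grad_norm(depth, residual, width=64, seed=0):
    """Gradient norm of the first layer's weights on one random batch, at initialization."""
    torch.manual_seed(seed)  # same seed -> same initial weights and data for both variants
    net = nn.Sequential(
        nn.Linear(784, width), nn.ReLU(),
        *[Block(width, residual) for _ in range(depth)],
        nn.Linear(width, 10),
    )
    x = torch.randn(32, 784)            # random stand-in for a flattened MNIST batch
    y = torch.randint(0, 10, (32,))
    F.cross_entropy(net(x), y).backward()
    return net[0].weight.grad.norm().item()

for depth in (2, 10, 30):
    plain = first_layer_grad_norm(depth, residual=False)
    res = first_layer_grad_norm(depth, residual=True)
    print(f"depth={depth:2d}  plain={plain:.2e}  residual={res:.2e}")
```

The exact numbers depend on width, seed, and initialization, so treat them as a qualitative check. For a per-depth comparison on real data, the same measurement would be repeated on an MNIST batch for each trained model.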


### Exercise 1.3: Rinse and Repeat (but with a CNN)

Repeat the verification you did above, but with **Convolutional** Neural Networks. If you were careful about abstracting your model and training code, this should be a simple exercise. Show that **deeper** CNNs *without* residual connections do not always work better, while **even deeper** ones *with* residual connections do.

**Hint**: You probably should do this exercise using CIFAR-10, since MNIST is *very* easy (at least up to about 99% accuracy).

**Tip**: Feel free to reuse the ResNet building blocks defined in `torchvision.models.resnet` (e.g. [BasicBlock](https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py#L59) which handles the cascade of 3x3 convolutions, skip connections, and optional downsampling). This is an excellent exercise in code diving. 

**Spoiler**: Depending on the optional exercises you plan to do below, you should think *very* carefully about the architectures of your CNNs here (so you can reuse them!).
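
One way to keep the plain and residual CNNs comparable, and reusable later, is to build both from the same block and toggle the skip connection with a flag. The sketch below is a minimal example under assumptions: `ConvBlock`, `SmallCNN`, the channel count, and the absence of downsampling are illustrative choices, not the torchvision `BasicBlock`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with batch norm; adds the input back when residual=True."""
    def __init__(self, channels, residual):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.residual = residual

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.residual:
            out = out + x  # skip connection from block input to block output
        return F.relu(out)

class SmallCNN(nn.Module):
    """Hypothetical CIFAR-sized CNN: stem -> `depth` blocks -> global pool -> linear classifier."""
    def __init__(self, depth, residual, channels=32, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ConvBlock(channels, residual) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def features(self, x):
        # Activations fed into the classifier, exposed for later feature extraction.
        return self.pool(self.blocks(self.stem(x))).flatten(1)

    def forward(self, x):
        return self.fc(self.features(x))

def count_params(model):
    return sum(p.numel() for p in model.parameters())

x = torch.randn(2, 3, 32, 32)  # random stand-in for a CIFAR-10 batch
plain = SmallCNN(depth=4, residual=False)
res = SmallCNN(depth=4, residual=True)
print(plain(x).shape, res.features(x).shape, count_params(plain) == count_params(res))
```

Because the skip connection adds no parameters, the two variants have identical parameter counts at every depth, so any difference in training behavior can be attributed to the residual path. Exposing `features` separately is one option for the later exercises that need the activations entering the classifier.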

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
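import ssl
# Workaround: disables HTTPS certificate verification for downloads.
# This is insecure; remove it if downloads work without it.
ssl._create_default_https_context = ssl._create_unverified_context
# Configure the device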
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Simple (shallow) CNN
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(128 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 128 * 8 * 8)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# Deep CNN without residual connections (DeepCNN)
class DeepCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(DeepCNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

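# ResNet with residual connections (ImageNet-pretrained ResNet-18, partially fine-tuned)
class ResNetDeepCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(ResNetDeepCNN, self).__init__()
        # `pretrained=True` is deprecated in recent torchvision releases; the `weights`
        # argument (torchvision >= 0.13) loads the same ImageNet weights.
        self.resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)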
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, num_classes)
        self.freeze_layers()

    def freeze_layers(self):
        for param in self.resnet.parameters():
            param.requires_grad = False
        for param in self.resnet.layer3.parameters():
            param.requires_grad = True
        for param in self.resnet.layer4.parameters():
            param.requires_grad = True
        for param in self.resnet.fc.parameters():
            param.requires_grad = True

    def forward(self, x):
        return self.resnet(x)

# CIFAR-10 dataset with augmentation
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

trainset = CIFAR10(root='./data', train=True, download=True, transform=transform)
testset = CIFAR10(root='./data', train=False, download=True, transform=transforms.ToTensor())

trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

# TensorBoard
writer = SummaryWriter(log_dir='./logs/cnn_comparison')

# Training loop
def train_model(model, criterion, optimizer, num_epochs=10, model_name="Model"):
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        correct = 0
        total = 0

        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        train_loss = running_loss / len(trainloader)
        train_acc = 100 * correct / total

        writer.add_scalar(f"Loss/Train_{model_name}", train_loss, epoch)
        writer.add_scalar(f"Accuracy/Train_{model_name}", train_acc, epoch)

        print(f"{model_name} - Epoch [{epoch+1}/{num_epochs}], Loss: {train_loss:.4f}, Accuracy: {train_acc:.2f}%")

# Evaluation on the test set
def test_model(model, model_name="Model"):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in testloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    acc = 100 * correct / total
    writer.add_scalar(f"Accuracy/Test_{model_name}", acc)
    print(f"{model_name} Test Accuracy: {acc:.2f}%")

# === RUN THE MODELS ===

criterion = nn.CrossEntropyLoss()

# 1. Simple CNN
simple_cnn = CNN().to(device)
optimizer_simple = torch.optim.Adam(simple_cnn.parameters(), lr=0.001)
train_model(simple_cnn, criterion, optimizer_simple, model_name="CNN")
test_model(simple_cnn, model_name="CNN")

# 2. Deep CNN without residual connections
deep_cnn = DeepCNN().to(device)
optimizer_deep = torch.optim.Adam(deep_cnn.parameters(), lr=0.001)
train_model(deep_cnn, criterion, optimizer_deep, model_name="DeepCNN")
test_model(deep_cnn, model_name="DeepCNN")

# 3. CNN with residual connections (ResNet)
resnet_cnn = ResNetDeepCNN().to(device)
optimizer_resnet = torch.optim.Adam(resnet_cnn.parameters(), lr=0.001)
train_model(resnet_cnn, criterion, optimizer_resnet, model_name="ResNet")
test_model(resnet_cnn, model_name="ResNet")

writer.close()


| Model     | Train Accuracy (Epoch 1) | Train Accuracy (Epoch 10) | Test Accuracy |
|-----------|--------------------------|---------------------------|---------------|
| CNN       | 47.98%                   | 81.30%                    | 75.20%        |
| DeepCNN   | 44.35%                   | 82.35%                    | 78.49%        |
| ResNet18  | 62.56%                   | 82.48%                    | 80.45%        |


- **ResNet**, thanks to its residual connections, shows the benefit of skip connections in deep networks: optimization gets off to a better start (62.56% training accuracy after epoch 1, versus 47.98% and 44.35%), training is more stable, and test **accuracy** is the highest of the three (80.45%).

- Deep **CNNs** **without** skip connections remain viable, but they take longer to converge and are less robust to overfitting.
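
To make the role of the skip connection concrete, here is a minimal sketch of a residual block. It assumes a body of two 3x3 convolutions with batch normalization and an identity shortcut; `BasicResidualBlock` is a hypothetical name and is not necessarily how `ResNetDeepCNN` above is built.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Hypothetical residual block computing relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut: the block only has to learn a correction to x.
        return F.relu(out + x)


block = BasicResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
y = block(x)  # same shape as x, so blocks can be stacked freely
```

Because the shortcut adds `x` back unchanged, a block whose convolutional branch outputs zero reduces to `relu(x)`, which is one intuition for why stacking such blocks should not make optimization harder.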


-----
## Exercise 2: Choose at Least One

Below are **three** exercises that ask you to deepen your understanding of Deep Networks for visual recognition. You must choose **at least one** of the below for your final submission -- feel free to do **more**, but at least **ONE** you must submit. Each exercise is designed to require you to dig your hands **deep** into the guts of your models in order to do new and interesting things.

**Note**: These exercises are designed to use your small, custom CNNs and small datasets. This is to keep training times reasonable. If you have a decent GPU, feel free to use pretrained ResNets and larger datasets (e.g. the [Imagenette](https://pytorch.org/vision/0.20/generated/torchvision.datasets.Imagenette.html#torchvision.datasets.Imagenette) dataset at 160px).

### Exercise 2.1: *Fine-tune* a pre-trained model
Train one of your residual CNN models from Exercise 1.3 on CIFAR-10. Then:
1. Use the pre-trained model as a **feature extractor** (i.e. to extract the feature activations of the layer input into the classifier) on CIFAR-100. Use a **classical** approach (e.g. Linear SVM, K-Nearest Neighbor, or Bayesian Generative Classifier) from scikit-learn to establish a **stable baseline** performance on CIFAR-100 using the features extracted using your CNN.
2. Fine-tune your CNN on the CIFAR-100 training set and compare with your stable baseline. Experiment with different strategies:
    - Unfreeze some of the earlier layers for fine-tuning.
    - Test different optimizers (Adam, SGD, etc.).

Each of these steps will require you to modify your model definition in some way. For 1, you will need to return the activations of the last fully-connected layer (or the global average pooling layer). For 2, you will need to replace the original, 10-class classifier with a new, randomly-initialized 100-class classifier.
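
The two model modifications described above could look roughly like the sketch below. `TinyResidualCNN`, its `features` method, and the random input batch are hypothetical stand-ins for your own Exercise 1.3 model and for CIFAR-100 data; only the pattern (expose the pooled features, then swap and selectively unfreeze the classifier) is the point.

```python
import torch
import torch.nn as nn


class TinyResidualCNN(nn.Module):
    """Hypothetical stand-in for a residual CNN from Exercise 1.3."""

    def __init__(self, num_classes=10, width=16):
        super().__init__()
        self.stem = nn.Conv2d(3, width, kernel_size=3, padding=1)
        self.body = nn.Conv2d(width, width, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(width, num_classes)

    def features(self, x):
        x = torch.relu(self.stem(x))
        x = torch.relu(self.body(x) + x)   # one residual connection
        return self.pool(x).flatten(1)     # (batch, width): the input to the classifier

    def forward(self, x):
        return self.classifier(self.features(x))


model = TinyResidualCNN(num_classes=10)    # assume this was already trained on CIFAR-10
images = torch.randn(8, 3, 32, 32)         # placeholder for a batch of CIFAR-100 images

# Step 1: use the network as a fixed feature extractor.
model.eval()
with torch.no_grad():
    feats = model.features(images)         # convert with feats.numpy() before fitting a scikit-learn classifier

# Step 2: replace the 10-class head with a randomly initialized 100-class head
# and freeze everything else (unfreeze more layers to try other strategies).
model.classifier = nn.Linear(model.classifier.in_features, 100)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9
)
logits = model(images)
```

Passing only the trainable parameters to the optimizer keeps the frozen layers untouched; the learning rate and momentum shown are arbitrary placeholder values.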

In [None]:
# Your code here.

### Exercise 2.2: *Distill* the knowledge from a large model into a smaller one
In this exercise you will see if you can derive a *small* model that performs comparably to a larger one on CIFAR-10. To do this, you will use [Knowledge Distillation](https://arxiv.org/abs/1503.02531):

> Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, NIPS Deep Learning Workshop 2014 (arXiv:1503.02531, 2015).

To do this:
1. Train one of your best-performing CNNs on CIFAR-10 from Exercise 1.3 above. This will be your **teacher** model.
2. Define a *smaller* variant with about half the number of parameters (change the width and/or depth of the network). Train it on CIFAR-10 and verify that it performs *worse* than your **teacher**. This small network will be your **student** model.
3. Train the **student** using a combination of **hard labels** from the CIFAR-10 training set (cross entropy loss) and **soft labels** from predictions of the **teacher** (Kullback-Leibler loss between teacher and student).

Try to optimize training parameters in order to maximize the performance of the student. It should at least outperform the student trained only on hard labels in Step 2.
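
One common way to write the combined objective from Step 3 is given below. The notation is an assumption introduced here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $y$ is the hard label, $\sigma$ is the softmax, $T$ is the temperature, and $\alpha \in [0, 1]$ weights the distillation term.

```latex
\mathcal{L}_{\mathrm{KD}}
  = (1 - \alpha)\,\mathrm{CE}\big(y,\ \sigma(z_s)\big)
  + \alpha\, T^{2}\, \mathrm{KL}\big(\sigma(z_t / T)\ \|\ \sigma(z_s / T)\big)
```

The factor $T^2$ compensates for the gradients of the soft-target term shrinking roughly as $1/T^2$, so the relative weight of the two terms stays comparable when the temperature changes. With $T = 1$ the soft labels are just the teacher's ordinary softmax outputs.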

**Tip**: You can save the predictions of the trained teacher network on the training set and adapt your dataloader to provide them together with hard labels. This will **greatly** speed up training compared to performing a forward pass through the teacher for each batch of training.
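
A minimal sketch of this caching idea follows, using synthetic tensors and a placeholder linear `teacher` in place of CIFAR-10 and a trained network (both are assumptions made only so the example is self-contained). The key detail is that the teacher logits are computed in a fixed, unshuffled order and then stored next to each sample, so later shuffling cannot misalign them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-ins for the training images, hard labels and teacher network.
images = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# One ordered pass over the training set: row i holds the logits for sample i.
teacher.eval()
ordered_loader = DataLoader(TensorDataset(images), batch_size=32, shuffle=False)
with torch.no_grad():
    teacher_logits = torch.cat([teacher(x) for (x,) in ordered_loader])

# Bundle the cached logits with the data; shuffling now keeps each triple together.
kd_dataset = TensorDataset(images, labels, teacher_logits)
kd_loader = DataLoader(kd_dataset, batch_size=32, shuffle=True)

x, y, t = next(iter(kd_loader))  # images, hard labels, matching teacher logits
```

For a torchvision dataset that applies transforms on the fly, the same effect can be had with a small `Dataset` wrapper whose `__getitem__` returns the cached logits for the requested index. Note that cached logits only match the inputs exactly when the training transforms are deterministic.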

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

# === Settings ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
writer = SummaryWriter("runs/knowledge_distillation")

# === Teacher Model (ResNet18) ===
class ResNetTeacher(nn.Module):
    def __init__(self, num_classes=10):
        super(ResNetTeacher, self).__init__()
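        # `pretrained=True` is deprecated in torchvision >= 0.13; the `weights` enum loads the same ImageNet weights.
        self.resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)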
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, num_classes)

    def forward(self, x):
        return self.resnet(x)

# === Student Model (smaller CNN) ===
class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SmallCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(64 * 8 * 8, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 64 * 8 * 8)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# === CIFAR-10 Dataset ===
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = datasets.CIFAR10(root="./data", train=True, transform=transform, download=True)
test_dataset = datasets.CIFAR10(root="./data", train=False, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# === Training the Teacher ===
def train_teacher(model, loader, epochs=5, lr=0.001):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0.0, 0, 0

        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)

        avg_loss = total_loss / len(loader)
        acc = 100. * correct / total
        writer.add_scalar("Loss/Teacher", avg_loss, epoch)
        writer.add_scalar("Accuracy/Teacher", acc, epoch)
        print(f"[Teacher] Epoch {epoch+1}: Loss={avg_loss:.4f}, Accuracy={acc:.2f}%")

def evaluate(model, loader, model_name="Model"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)
    acc = 100. * correct / total
    writer.add_scalar(f"Test_Accuracy/{model_name}", acc)
    print(f"Test Accuracy ({model_name}): {acc:.2f}%")
    return acc

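# === Caching the teacher's predictions ===
def save_teacher_predictions(model, loader):
    # Iterate in dataset order (no shuffling) so that row i of the result
    # holds the teacher logits for sample i of loader.dataset.
    ordered_loader = DataLoader(loader.dataset, batch_size=loader.batch_size, shuffle=False)
    model.eval()
    preds = []
    with torch.no_grad():
        for images, _ in ordered_loader:
            images = images.to(device)
            outputs = model(images)
            preds.append(outputs.cpu())
    return torch.cat(preds)

# Dataset wrapper that returns each sample together with its cached teacher logits
class DistillationDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, teacher_preds):
        assert len(dataset) == len(teacher_preds)
        self.dataset = dataset
        self.teacher_preds = teacher_preds

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        image, label = self.dataset[idx]
        return image, label, self.teacher_preds[idx]

# === Distillation loss ===
def distillation_loss(student_logits, teacher_logits, labels, T=1, alpha=0.3):
    # T: temperature; values above 1 soften the teacher's soft labels (more uncertainty)
    # alpha: weight of the distillation (KL) term relative to the cross entropy term
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_predictions = F.log_softmax(student_logits / T, dim=1)
    kl = F.kl_div(soft_predictions, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# === Training the Student with distillation ===
def train_student_with_distillation(model, teacher_preds, loader, epochs=10, lr=0.001, T=4, alpha=0.5):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Pair every sample with its own teacher logits, so that shuffling and a
    # smaller final batch cannot misalign images and soft labels.
    kd_loader = DataLoader(DistillationDataset(loader.dataset, teacher_preds),
                           batch_size=loader.batch_size, shuffle=True)

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0.0, 0, 0

        for images, labels, teacher_logits in kd_loader:
            images, labels = images.to(device), labels.to(device)
            teacher_logits = teacher_logits.to(device)
            student_logits = model(images)

            loss = distillation_loss(student_logits, teacher_logits, labels, T, alpha)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = student_logits.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)

        avg_loss = total_loss / len(kd_loader)
        acc = 100. * correct / total
        writer.add_scalar("Loss/Student", avg_loss, epoch)
        writer.add_scalar("Accuracy/Student", acc, epoch)
        print(f"[Student KD] Epoch {epoch+1}: Loss={avg_loss:.4f}, Accuracy={acc:.2f}%")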

# === Training the baseline Student (hard labels only) ===
def train_student_baseline(model, loader, epochs=10, lr=0.001):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0.0, 0, 0

        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)

        avg_loss = total_loss / len(loader)
        acc = 100. * correct / total
        writer.add_scalar("Loss/Student_HardOnly", avg_loss, epoch)
        writer.add_scalar("Accuracy/Student_HardOnly", acc, epoch)
        print(f"[Student Hard] Epoch {epoch+1}: Loss={avg_loss:.4f}, Accuracy={acc:.2f}%")

# === Run ===
teacher = ResNetTeacher()
train_teacher(teacher, train_loader, epochs=5, lr=0.001)
teacher_preds = save_teacher_predictions(teacher, train_loader)
evaluate(teacher, test_loader, model_name="Teacher")

deep_student = SmallCNN()
train_student_baseline(deep_student, train_loader, epochs=10)
evaluate(deep_student, test_loader, model_name="Student_HardOnly")

student = SmallCNN()
train_student_with_distillation(student, teacher_preds, train_loader, epochs=10, T=1, alpha=0.3)
evaluate(student, test_loader, model_name="Student_KD")

writer.close()


## Knowledge Distillation Results

| Model               | Train Accuracy (Epoch 1) | Train Accuracy (Epoch 10) | Test Accuracy |
|---------------------|--------------------------|---------------------------|---------------|
| Teacher             | 67.99%                   | 88.97%                    | 79.87%        |
| Student HardOnly    | 54.00%                   | 97.75%                    | 71.87%        |
| Student KD (α=0.3)  | 53.11%                   | 98.15%                    | 72.42%        |

- The **Teacher (ResNet18)** is a solid reference point: **79.87%** test accuracy, with training accuracy rising steadily through epoch 10.

- The **Student trained only on hard labels** reaches a **very high training accuracy (97.75%)** but only **71.87%** on the test set, a sign of **overfitting**.

- The **Student trained with Knowledge Distillation (KD)** achieves a **slightly higher test accuracy (72.42%)** than the hard-only student (+0.55 points), despite a **higher loss** (expected, since the loss includes the distillation term).

  In summary, **KD transfers useful knowledge from the teacher** and improves the student's generalization, although the gain in this run is small.


### Exercise 2.3: *Explain* the predictions of a CNN

Use the CNN model you trained in Exercise 1.3 and implement [*Class Activation Maps*](http://cnnlocalization.csail.mit.edu/):

> B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. CVPR'16 (arXiv:1512.04150, 2015).

Use your CNN implementation to demonstrate how your trained CNN *attends* to specific image features to recognize *specific* classes. Try your implementation out using a pre-trained ResNet-18 model and some images from the [Imagenette](https://pytorch.org/vision/0.20/generated/torchvision.datasets.Imagenette.html#torchvision.datasets.Imagenette) dataset -- I suggest you start with the low resolution version of images at 160px.
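
As a starting point, the sketch below computes a Class Activation Map for a network that ends in global average pooling followed by a single linear layer, which is the setting the CAM paper assumes. `TinyGapCNN` and `class_activation_map` are hypothetical names, and the model here is untrained and fed a random image, so the map itself is meaningless; the sketch only illustrates the computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGapCNN(nn.Module):
    """Hypothetical CNN: conv features -> global average pooling -> linear classifier."""

    def __init__(self, num_classes=10, channels=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        fmap = self.features(x)                        # (B, C, H, W)
        return self.fc(fmap.mean(dim=(2, 3))), fmap    # logits and last conv feature maps


def class_activation_map(model, image, target_class=None):
    """Return a CAM in [0, 1] at the input resolution, plus the class it explains."""
    model.eval()
    with torch.no_grad():
        logits, fmap = model(image.unsqueeze(0))
        if target_class is None:
            target_class = logits.argmax(dim=1).item()
        weights = model.fc.weight[target_class]                # (C,) classifier weights for that class
        cam = torch.einsum("c,chw->hw", weights, fmap[0])      # weighted sum of feature maps
        cam = F.interpolate(cam[None, None], size=tuple(image.shape[-2:]),
                            mode="bilinear", align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # rescale for display
    return cam, target_class


model = TinyGapCNN()
image = torch.randn(3, 32, 32)
cam, cls = class_activation_map(model, image)
```

For a torchvision ResNet-18 the same recipe should apply with the output of `layer4` as the feature maps and `model.fc.weight` as the class weights; a forward hook on `layer4` is one way to capture those activations without changing the model definition.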

In [None]:
# Your code here.